Discussion
The objective of this research was to establish a systematic
workflow and generate a high-quality dataset of MS1 isotope
distributions. To avoid the uncertainty
associated with working on unknown proteomes, we used the UPS2
standard kit. Because the UPS2 standard kit contains only known proteins, we
know which proteins to search for, which increases confidence in the
identifications made by the database search algorithm. Additionally, the
varying concentrations within the kit allow researchers to test the
sensitivity of newly developed tools.
The initial step in our research involved performing a database search
on both UPS2 samples. To ensure the production of high-quality PSMs, we
employed MSFragger with a reverse target-decoy approach, maintaining an
FDR of 1%. A comparable number of PSMs was identified in both samples, and
there was a high level of agreement between the peptide and protein
identifications. Upon further investigation, it was found that the
protein concentration is one of the most influential factors for protein
identification. Specifically, proteins with lower concentrations in the
UPS2 standard kit exhibited reduced coverage and overall detection
probability. Although this finding is expected, we want to
emphasize its importance. When an unknown proteome is used to evaluate
different algorithms, the PSMs will be influenced by the concentrations
of the peptides and proteins present in the sample. Many
other factors also influence the probability of identifying proteins and
peptides, such as sample preprocessing or the dynamic range of
the LC-MS/MS instrument itself, but concentration remains an important
consideration that is well described in the literature.
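
As a minimal sketch of how a 1% FDR threshold can be derived from a target-decoy search, the following Python snippet estimates the FDR as the ratio of decoy to target PSMs at or above a score cutoff. The function names, the (score, is_decoy) tuple layout and the example scores are illustrative assumptions. This is not the filtering code used by MSFragger or by the workflow presented here.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) tuples (hypothetical layout).
    The estimate is (#decoy PSMs) / (#target PSMs).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No targets accepted: FDR is undefined; treat as 0 only if no decoys either.
        return 0.0 if decoys == 0 else float("inf")
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr.

    Returns None if no threshold satisfies the requested FDR.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Hypothetical example: 50 target PSMs with a high score and one lower-scoring decoy.
example_psms = [(10.0, False)] * 50 + [(5.0, True)]
cutoff = score_cutoff(example_psms, max_fdr=0.01)
```

In this toy example, accepting the decoy would raise the estimated FDR to 1/50 = 2%, so the cutoff stays at the score of the target PSMs.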
Next, we used a workflow developed in-house to extract
MS1 isotope distributions for the PSMs acquired by the
MSFragger database search. Across both samples, a total of 138,111 peptide
isotope distributions were extracted, of which
127,646 had two or more peaks. More
MS1 isotope distributions were extracted from sample
A11-12042 than from sample A11-12043, consistent with sample
A11-12042 having more PSMs than sample A11-12043. The
spectral angle was used to check the similarity between the experimental
isotope distributions and their expected theoretical isotope
distributions computed by BRAIN. The spectral angle takes values
between 0 and 1.57 (π/2) radians, with values closer to 0 indicating a higher
similarity between the experimental and theoretical isotope
distributions. In both samples, the bell-shaped distributions of the
spectral angle scores lay close to 0, indicating a high
similarity between theoretical and experimental isotope distributions
(Figure 3). While the dataset still includes isotope distributions with
a high spectral angle score, indicating a high dissimilarity between the
theoretical and experimental isotope distributions, we opted to leave
them in the dataset, as they may still serve as valuable input for
training machine learning and deep learning models. There were 10,465
isotope distributions consisting of only the monoisotopic peak. To our
knowledge, there is currently no way of validating these monoisotopic
peaks in MS1 spectra. Their only
support is that they were extracted at approximately the same
time as confidently identified PSMs and within the specified mass
window. Lastly, the complete MS1 isotope distribution
dataset consists of 965 unique peptides, defined by their sequence,
modifications and charge state. While the complete dataset is quite
large, it is also limited to a set of unique UPS peptides. However, we
believe that the workflow presented may be used in the future to extract
more MS1 isotope distributions from proteome standards
such as the large-scale ProteomeTools dataset.
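
The spectral angle comparison described above can be illustrated with a short Python sketch. It assumes the spectral angle is the arccosine of the normalized dot product between two intensity vectors of equal length, which is consistent with the stated range of 0 to 1.57 (π/2) for non-negative intensities. The function name and the example intensities are hypothetical, and the theoretical values shown are not BRAIN output.

```python
import math


def spectral_angle(experimental, theoretical):
    """Return the angle (in radians) between two isotope intensity vectors.

    Both inputs are sequences of non-negative peak intensities aligned by
    isotope index. 0 means identical shape; pi/2 means no overlap.
    """
    if len(experimental) != len(theoretical):
        raise ValueError("Intensity vectors must have the same length.")
    dot = sum(e * t for e, t in zip(experimental, theoretical))
    norm_e = math.sqrt(sum(e * e for e in experimental))
    norm_t = math.sqrt(sum(t * t for t in theoretical))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("Intensity vectors must not be all zeros.")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_e * norm_t)))
    return math.acos(cosine)


# Hypothetical experimental and theoretical isotope intensities (4 peaks each).
experimental = [1000.0, 620.0, 230.0, 60.0]
theoretical = [0.52, 0.31, 0.12, 0.05]
angle = spectral_angle(experimental, theoretical)
```

Because both vectors are normalized inside the function, the result does not depend on the absolute intensity scale, so raw experimental intensities can be compared directly with theoretical relative abundances.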
In this manuscript, we presented a data-driven approach to extract
high-quality MS1 isotope distributions and to
present them in a standardized manner. The proposed workflow can be
used in the future to further extend the benchmark dataset. The
benchmark dataset itself provides an ideal foundation for the
development of new bioinformatics tools, such as new
machine learning and deep learning models. These novel algorithms may
further advance our understanding of the molecular underpinnings of
disease pathology. All code and algorithms have been made available at
https://github.com/VilenneFrederique/MS1IsotopeDistributionsDatasetWorkflow
Acknowledgements
This research was funded by Research Foundation – Flanders (FWO) under
the “Beyond the Genome: Ethical Aspects of Large Cohort Studies”
project (Case number G070722N).
Conflicts of interest
The authors have declared no conflicts of interest.