Statement of Significance

LC-MS/MS-based proteomics is continuously advancing, allowing disease to be redefined at the molecular scale and shifting medicine from curative toward preventive and personalized care. While numerous large, annotated spectral databases are available for developing new bioinformatic tools focused on MS2 data, the same cannot be said for research focused on MS1 data. In the MS1 setting, each spectrum contains signals from multiple peptides, and the primary interest often lies in their isotope distributions. Extracting this information, however, is not straightforward. We therefore propose a method that extracts these isotope distributions, together with other important MS1 features, in a PSM data-driven manner and summarizes them in a standardized format, creating an MS1 isotope distribution benchmark dataset. We applied this workflow to a proteomics standard and found a high similarity between the extracted and theoretical isotope distributions. The workflow can be applied in the future to further extend the benchmark dataset, and the dataset itself can serve as a foundation for developing new bioinformatic tools. The availability of an extensive MS1 isotope distribution benchmark dataset will foster the development of innovative bioinformatic tools, enabling researchers to unlock new insights and further advance our understanding of the molecular underpinnings of disease pathology.