2.3 | Mass Spectrometry
Mass spectrometry-based proteomics can detect the translational products of sORFs directly in biological samples using either bottom-up (from peptide fragments) or top-down (intact precursor) modalities. However, specialized sample preparation and computational methods must be applied for high-sensitivity detection of small, unannotated microproteins. For example, a standard bottom-up proteomics experiment begins with isolation of the proteome, during which small molecules and proteolytic fragments are typically removed by SDS-PAGE or filter-aided sample preparation. Furthermore, most peptide and protein identification from proteomics data is accomplished via spectral matching against the annotated proteome database. For these reasons, sORF-encoded polypeptides are both depleted from proteomic samples and absent from databases, and therefore cannot be detected with standard proteomic workflows and searches.
Multiple recent reviews and protocols describing microprotein identification via proteomics are available, so we provide only a brief overview of the key concerns here. Microprotein discovery methods are built on the same technologies used for standard shotgun proteomics, with several modifications (Figure 3). First, because sORF-encoded microproteins are small, most are identified by only a single proteotypic or fingerprint tryptic fragment in a typical proteomics experiment. A major factor complicating detection of microproteins is coelution and/or cofragmentation of the one or two detectable tryptic peptides derived from a given microprotein with abundant tryptic and/or proteolytic fragments of larger proteins. The resulting ion suppression and/or complex spectra preclude detection and/or identification of the microprotein fragment, regardless of its abundance; this problem is less severe for larger, canonical proteins, which generate many tryptic peptides, so that detection of any individual fragment is not required. Therefore, the first critical step of any sORF proteomic experiment is to extract the proteome without proteolysis of canonical proteins (e.g., via boiling in acidic solution or application of protease inhibitors) to minimize sample complexity, followed by or concomitant with enrichment of the small proteome and exclusion of large proteins. Small protein enrichment can be achieved via multiple chemical and biophysical methods, such as solid phase extraction, peptide gels, GELFrEE resolution, and organic solvent or surfactant extraction. When compared head-to-head, these methods have typically detected comparable numbers, but non-overlapping sets, of microproteins.
Depending on the experimental goals, the size selection approach for microprotein proteomics can therefore be optimally chosen: for the deepest coverage, multiple methods should be employed on replicate samples and the results combined; for a rapid, robust and economical approach, organic solvent extraction may prove attractive.
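The point that a microprotein typically yields only one or two detectable tryptic peptides can be illustrated with a minimal in silico digest. This is a sketch under stated assumptions, not a description of any published pipeline: the function name, the 7 to 30 residue detectability window, the zero-missed-cleavage rule and the example sequence are all hypothetical choices made for illustration.

```python
import re


def tryptic_peptides(sequence, min_len=7, max_len=30):
    """Return fully tryptic peptides (no missed cleavages) within a length window.

    Assumptions for illustration only: trypsin cleaves C-terminal to K or R
    unless the next residue is P, and only peptides of min_len..max_len
    residues are treated as detectable.
    """
    # Zero-width split after every K/R that is not followed by P (Python 3.7+).
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


# Hypothetical 19-residue microprotein sequence (not a real protein).
toy_microprotein = "MKTLLAGIVSAFEWRDPSK"
detectable = tryptic_peptides(toy_microprotein)  # ["TLLAGIVSAFEWR"]
```

In this toy case the digest produces three fragments, but only one falls inside the assumed detectability window, so identification of the whole microprotein rests on a single peptide. A larger protein digested the same way would usually contribute many peptides in the window, which is why loss of any one of them matters less.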
After small proteome isolation, most microprotein studies to date have employed bottom-up proteomic analysis, in which microproteins are enzymatically digested into peptide fragments (typically with trypsin, though multienzyme digests have been shown to improve small proteome coverage), followed by liquid chromatography-tandem mass spectrometry, often with multi-dimensional separation. This experiment provides thousands of raw peptide fragmentation spectra corresponding to both known canonical small proteins and microproteins, which must then be identified and distinguished. This is typically accomplished via peptide-spectral matching against expanded databases comprising the canonical proteome as well as candidate sORF sequences. For eukaryotes, databases can be derived from three-frame transcriptome translations, ribosome profiling-derived translatomes, or publicly available noncanonical ORF databases such as OpenProt and sORFs.org; six-frame genomic translation can be employed for prokaryotes. Peptide-spectral matching against any of these databases affords identifications of both canonical small proteins and unannotated microproteins. Discriminating the false-positive identifications that arise from searching expanded databases is critical. One important consideration is the use of a contaminants database to prevent aberrant matching of artefactual peptides (e.g., fragments of trypsin or keratin in dust) to sORF sequences. Another commonly applied safeguard is a stringent false discovery rate threshold of less than or equal to 1%, estimated from hits to a decoy database constructed from reversed amino acid sequences of the search database entries. However, the expansion of the decoy database also decreases sensitivity for true positive matches, as documented in work from Fournier and colleagues.
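The target-decoy estimate described above can be sketched in a few lines. This is a simplified illustration, not the procedure of any particular search engine: the function names, the `DECOY_` prefix, the representation of each peptide-spectrum match as a `(score, is_decoy)` pair, and the use of decoys/targets as the FDR estimate are assumptions, and tied scores are not treated specially.

```python
def reversed_decoys(database):
    """Build decoy entries by reversing each target sequence.

    `database` maps entry names to amino acid sequences; the DECOY_ prefix
    is a hypothetical naming convention used here for illustration.
    """
    return {f"DECOY_{name}": seq[::-1] for name, seq in database.items()}


def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score at which the estimated FDR is still <= max_fdr.

    `psms` is an iterable of (score, is_decoy) pairs with higher scores
    meaning better matches. Returns None if no threshold qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        decoys += bool(is_decoy)
        targets += not is_decoy
        # Estimated FDR among matches scoring at or above this one.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

Because every sORF candidate added to the search database adds a matching decoy entry, a larger database produces more high-scoring decoy hits, which pushes the cutoff upward and can exclude genuine low-scoring microprotein matches.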
An alternative approach is to employ permissive false discovery rates, followed by either manual inspection of fragmentation spectra or a secondary algorithm such as PepQuery to exclude false-positive spectra that are better explained by peptides arising from canonical, mutant or post-translationally modified proteins. After exclusion of peptides matching (or near-matching) annotated proteins, the resulting list of identifications represents candidate unannotated microproteins, which can be computationally mapped to the sORFs that encode them and experimentally validated.
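The final filtering and mapping step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary inputs are hypothetical, and the only "near-match" handled is the leucine/isoleucine ambiguity (the two residues have identical mass and are not distinguished by standard fragmentation). Single-residue variants and modified peptides, which a tool such as PepQuery is designed to consider, are not handled here.

```python
def normalize(seq):
    """Collapse I to L, since the two residues are isobaric."""
    return seq.replace("I", "L")


def candidate_microprotein_peptides(peptides, annotated, sorfs):
    """Drop peptides found in annotated proteins; map the rest to sORFs.

    `annotated` and `sorfs` map names to amino acid sequences (hypothetical
    inputs). Returns {peptide: [names of sORFs containing it]}.
    """
    annotated_norm = [normalize(seq) for seq in annotated.values()]
    sorfs_norm = {name: normalize(seq) for name, seq in sorfs.items()}
    hits = {}
    for pep in peptides:
        key = normalize(pep)
        # Exclude anything explained by the annotated proteome.
        if any(key in prot for prot in annotated_norm):
            continue
        sources = [name for name, seq in sorfs_norm.items() if key in seq]
        if sources:
            hits[pep] = sources
    return hits
```

A peptide that maps to more than one sORF is reported with all of its possible sources, which flags it as ambiguous evidence for any single locus.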
Mass spectrometry typically detects one to two orders of magnitude fewer microproteins in a given experiment than ribosome profiling. This may be due to the above-mentioned challenge in detecting single microprotein-derived fingerprint peptides; the relative insensitivity of mass spectrometry to some classes of microproteins, including membrane-localized, positively charged, and low-abundance species; the instability of some sORF translation products; reduced sensitivity for true-positive detections as a result of expanded decoy databases applied for stringent false discovery rate estimation; or a combination of these factors. Nonetheless, mass spectrometry offers several advantages. First, enrichment strategies, such as membrane fractionation and chemical labeling, can enable identification of microproteins that are refractory to shotgun analysis of whole-cell tryptic digests, thus beginning to address one of the major limitations of microprotein proteomics while at the same time affording functional information about microproteins (e.g., chemical reactivity, subcellular localization) that is inaccessible to sequencing methods. Second, without specialized analysis pipelines, ribosome profiling with elongation inhibitors cannot confidently detect sORFs that overlap canonical protein-coding sequences in alternative reading frames, because ORF calling relies on three-nucleotide periodicity. In contrast, mass spectrometry can readily detect and identify microproteins derived from overlapping ORFs, which can represent as much as 30% of microproteins identified in a proteomic experiment. Given the complementary nature of genomics, ribosome profiling and mass spectrometry, it is likely that the combination of these methods offers the greatest power for large-scale, high-confidence microprotein identification.
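One reason database searching can recover overlapping ORFs is that a three-frame translation of a transcript enumerates candidate ORFs in every forward frame, regardless of whether they overlap the canonical coding sequence. The sketch below illustrates this with the standard genetic code. It is not the method of any specific database: the function name, the restriction to AUG start codons, the requirement for an in-frame stop codon, the minimum length, and the toy transcript are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def three_frame_orfs(transcript, min_aa=2):
    """List (frame, start_nt, protein) for ATG-initiated, stop-terminated ORFs.

    `transcript` is a cDNA-alphabet string (A, C, G, T); start_nt is 0-based.
    Only the first ATG in each stop-delimited segment is used (an assumption).
    """
    orfs = []
    for frame in range(3):
        codons = [transcript[i:i + 3]
                  for i in range(frame, len(transcript) - 2, 3)]
        protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
        offset = 0  # codon index of the current segment's first residue
        # Drop the last segment: it has no terminating stop codon.
        for segment in protein.split("*")[:-1]:
            m = segment.find("M")
            if m != -1 and len(segment) - m >= min_aa:
                orfs.append((frame, frame + 3 * (offset + m), segment[m:]))
            offset += len(segment) + 1
    return orfs


# Toy transcript (hypothetical) with overlapping ORFs in frames 0 and 1.
toy_orfs = three_frame_orfs("ATGCATGGAAACTGACTAAG")
# [(0, 0, "MHGN"), (1, 4, "METD")]
```

In the toy transcript the frame 1 ORF begins inside the frame 0 ORF, and both are emitted as separate database entries, so a peptide from either one can be matched independently of the other.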