2.1 | Computation
Accurate computational annotation of sORFs is challenging for two reasons: their short lengths impede statistical analysis, and they exhibit intermediate conservation relative to longer genes, which has been interpreted as evidence for the de novo evolution of some microproteins. Despite these challenges, algorithms and machine learning strategies are being developed to better identify sORFs within genomes. Some computational approaches rely on phylogeny, nucleotide and amino acid homology, and secondary structure to identify unannotated sORFs with sequence or structural similarity to canonical proteins; examples include PhyloCSF and miPFinder. Additional predictive features have also been applied to sORF prediction, including the presence of a ribosome binding site upstream of a bacterial sORF start codon or a Kozak consensus sequence surrounding a eukaryotic sORF start codon. More ambitiously, OpenProt predicts all AUG-initiated sORFs and alternative ORFs (alt-ORFs) within all known mRNAs for several organisms, and curates experimental evidence (or the lack thereof) for their expression. Finally, deep forest and deep learning models have been applied to sORF prediction in individual microbial genomes as well as in microbiome and metagenomic data. These methods have highlighted new sORFs in intergenic regions, in noncoding RNAs, and in multicistronic or dual-coding mRNAs.
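To make the sequence-based features described above concrete, the following Python sketch scans the three forward reading frames of a transcript for AUG-initiated ORFs and flags whether each start codon sits in a simplified Kozak-like context. It is an illustration only and does not reproduce the method of OpenProt, PhyloCSF, miPFinder, or any other published tool. The function names, the length bounds (`min_codons`, `max_codons`), and the Kozak rule used here (a purine at position -3 and a G at position +4, counting the A of AUG as +1) are assumptions chosen for the example.

```python
from typing import List, NamedTuple

# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


class SORF(NamedTuple):
    start: int   # 0-based index of the A in the ATG start codon
    end: int     # index one past the last base of the stop codon
    frame: int   # forward reading frame (0, 1, or 2)
    codons: int  # sense codons, including the start, excluding the stop
    kozak: bool  # True if the simplified Kozak-like context is present


def has_kozak_context(seq: str, start: int) -> bool:
    """Simplified check (an assumption): purine at -3 and G at +4."""
    if start < 3 or start + 3 >= len(seq):
        return False  # not enough flanking sequence to evaluate
    return seq[start - 3] in "AG" and seq[start + 3] == "G"


def find_sorfs(seq: str, min_codons: int = 10, max_codons: int = 100) -> List[SORF]:
    """Return ATG-initiated ORFs within the (hypothetical) length bounds.

    For each in-frame stop codon, only the most upstream ATG is reported,
    so nested downstream start codons are not listed separately.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits: List[SORF] = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n = (i - start) // 3
                if min_codons <= n <= max_codons:
                    hits.append(
                        SORF(start, i + 3, frame, n, has_kozak_context(seq, start))
                    )
                start = None  # close the ORF and keep scanning this frame
    return hits


if __name__ == "__main__":
    # Toy transcript, invented for illustration (not real data).
    toy = "GCCACC" + "ATG" + "GCT" * 11 + "TAA" + "CC"
    for orf in find_sorfs(toy):
        print(orf)
```

A scan of this kind over-predicts heavily, since most short ORFs found by chance are not translated. This is why the approaches above layer conservation, homology, or learned models on top of simple sequence features, and why curated experimental evidence remains necessary.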