1 Introduction
Similar to other high-throughput OMIC approaches, mass spectrometry-based proteomics (MSP) has high potential for quickly answering biological questions because it provides abundant new quantitative information on many proteins in a relatively short time [1, 2]. However, it is necessary to be aware of the hidden pitfalls of any high-throughput methodology. In proteomics, it is critical to prevent false-positive identifications caused by the use of inappropriate databases. Selection of an optimal database for the evaluation of tandem mass spectrometry (MS/MS) data strongly affects the reliability and range of the reported results. Ideally, one would have an appropriate set of protein sequences relevant to the proteins of the one or more organisms present in a complex sample [3–6]. However, depending on database selection, different numbers of sequences are used in data analysis. Although public databases such as NCBI and UniProt use the same taxonomic criteria, the number of retrieved sequences can differ. The difference can be meaningful for the “verified” sequences (e.g., RefSeq and UniProtKB) and “all” available sequences in the repositories [5, 6]. Increasing the size of the database can considerably affect identification success and the subsequent reporting of low-abundance peptides/proteins. This database inflation effect is particularly obvious in proteogenomic searches. Despite their drawbacks, larger databases have the advantage of allowing the identification of novel peptides and proteins in proteogenomic searches [7]. Overall, it is debatable whether it is better to obtain optimal or maximal results from proteomic analysis, and in this regard, designing an optimal database may be difficult [5, 6]. Furthermore, different methods of data evaluation provide different results and interpretations, and even when an “optimal” and/or adequately broad database is used, important markers may be missed or overlooked.
This indicates the need for secondary data evaluation using different, improved, or updated databases.
Proteomic analysis of Paenibacillus larvae, the causative agent of American foulbrood, has been facilitated by the availability of protein sequences from annotated genomes of different genotypes, which are traditionally denoted based on the Enterobacterial Repetitive Intergenic Consensus (ERIC) PCR technique [8–10]. The existence of different genomes has enabled the mining of virulence factors of P. larvae directly from the genomic sequences [9, 10]. Beims et al. [9], based on the differences in genetic makeup, suggested that during evolution, the fast-killing genotypes of P. larvae lost certain virulence factors, leading to the development of slow-killing genotypes. Incidentally, more genotypes of P. larvae may exist than those distinguished by ERIC typing [9]. The various genomes of the different genotypes/strains that have been annotated at the protein level are useful in proteomic data evaluation because they can provide different arrays of proteins or certain isoforms of proteins (e.g., those with amino acid substitutions). Therefore, despite the large decoy database, the selection of different sets of protein sequences related to different genotypes within the P. larvae taxon is appropriate and rational for bottom-up proteomic analyses [11, 12]. However, despite comprehensive analysis and detailed data evaluation, some virulence factors could be overlooked, i.e., not selected as important. In addition, some proteins or virulence factors cannot be recognized because their sequences are absent from currently available databases. Some sequences may also remain absent from databases as these expand because they have not yet been annotated. This shortcoming can be compensated for through a proteogenomic approach, which enables the identification of novel peptides and protein-coding regions [6, 7, 13–15].
Incidentally, the proteogenomic approach has been used successfully in the honey bee and its ectoparasite Varroa destructor to improve annotation of the honey bee genome [15] and to study developmental transitions of the parasite, which lacked an annotated genome [16], respectively.
In this study, we evaluated MS/MS data from P. larvae virulence investigations using the proteogenomic approach. We compared protein hit identification between a proteogenomic search against a database constructed from 15 complete genomic assemblies from GenBank and a search against publicly available annotated sequences. In addition, we sought to identify potentially overlooked/missed virulence factors among the database components.
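To illustrate how a search database can be derived from genomic sequence rather than from annotated proteins, the following is a minimal sketch of six-frame translation, one common way of building proteogenomic databases. It is an assumption made for illustration only; this section does not state how the database in this study was generated. The function names and the `min_len` parameter are hypothetical. The codon table is the standard genetic code, which has the same codon assignments as the bacterial table (NCBI table 11) and differs from it only in the permitted start codons.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_len=7):
    """List (frame, peptide) pairs for stop-to-stop stretches in all six frames.

    min_len is a hypothetical cutoff that discards very short stretches.
    """
    hits = []
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(strand[offset:])
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    hits.append((f"{sign}{offset + 1}", peptide))
    return hits


if __name__ == "__main__":
    # Toy sequence, not taken from any P. larvae assembly.
    for frame, peptide in six_frame_orfs("ATGGCCAAATAA", min_len=3):
        print(frame, peptide)
```

Because every frame of both strands is translated, such a database is several times larger than one built from annotated proteins of the same genome, which is the database inflation trade-off described above.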