1 Introduction
Similar to other high-throughput OMIC approaches, mass
spectrometry-based proteomics (MSP) has high potential for quickly
answering biological questions since it provides abundant new
quantitative information on many proteins in a relatively short time
[1, 2]. However, it is necessary to be aware of the hidden pitfalls
of any high-throughput methodology. In proteomics, it is critical to
prevent false-positive identification caused by the use of inappropriate
databases. Selection of an optimal database for evaluation of tandem
mass spectrometry (MS/MS) data is a critical factor that strongly
affects the reliability and breadth of the reported results. Ideally, one
would have an appropriate set of protein sequences relevant to the
proteins of one or more organisms present in a complex sample
[3–6]. However, depending on database selection, different numbers
of sequences are used in data analysis. Although public databases such
as NCBI and UniProt use the same taxonomic criteria, the number of
retrieved sequences can differ. The difference can be meaningful for the
“verified” sequences (e.g., RefSeq and UniProtKB) and “all”
available sequences in the repositories [5, 6]. Increasing the size
of the databases can considerably affect identification success and,
consequently, the reporting of low-abundance peptides/proteins. This
database inflation effect is particularly obvious in proteogenomic
searches. Despite the drawbacks of larger databases, they have an
advantage in that they allow the identification of novel peptides and
proteins in proteogenomic searches [7]. Overall, it is debatable
whether it is better to obtain optimal or maximal results from proteomic
analysis, and in this regard, designing an optimal database may be
difficult [5, 6]. On the other hand, different methods of data
evaluation provide different results and interpretations, and even when
an “optimal” and/or adequately broad database is used, important markers
may be missed or overlooked. This indicates the need for secondary data
evaluation using different, improved, or updated databases.
Proteomic analysis of Paenibacillus larvae, the causative
agent of American foulbrood, has been facilitated by the availability of
protein sequences from annotated genomes of different genotypes that are
traditionally denoted based on the Enterobacterial Repetitive Intergenic
Consensus (ERIC) PCR technique [8–10]. The existence of different
genomes has enabled the mining of virulence factors of P. larvae
directly from the genomic sequences [9, 10]. Beims et al. [9],
based on the differences in genetic makeup, suggested that during
evolution, the fast-killing genotypes of P. larvae lost certain
virulence factors, leading to the development of slow-killing genotypes.
Incidentally, more genotypes of P. larvae than those denoted as
different based on ERIC typing may exist [9]. The various genomes
of the different genotypes/strains that have been annotated at the
protein level are useful in proteomic data evaluation because they can
provide different arrays of proteins or certain isoforms of proteins
(e.g., those with amino acid substitutions). Therefore, despite the
large decoy database, the selection of the different sets of protein
sequences related to different genotypes within a P. larvae taxon
is appropriate and rational for bottom-up proteomic analyses [11,
12]. However, despite comprehensive analysis and detailed data
evaluation, some virulence factors could be overlooked, i.e., not
selected as important. In addition, some proteins or virulence factors
cannot be recognized because their sequences are absent from currently
available databases. Some sequences may remain absent even as the
databases expand further because they have not yet been annotated.
This shortcoming can be compensated for through a proteogenomic
approach, which enables the identification of novel peptides and
protein-coding regions [6, 7, 13–15]. Incidentally, the
proteogenomic approach was successfully used in the honey bee and its
ectoparasite Varroa destructor to improve annotation of the honey
bee genome [15] and to study developmental transitions of the
parasite lacking an annotated genome [16], respectively.
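
To illustrate how a proteogenomic search space can be derived directly from genomic sequence, the following minimal Python sketch performs a six-frame translation and keeps the stop-free stretches as candidate protein sequences. It is only an illustration of the general idea, not necessarily the procedure used in this study. The function names and the `min_len` cutoff are hypothetical, and the codon assignments of the standard genetic code are assumed (the bacterial translation table 11 uses the same amino acid assignments and differs in its start codons).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest), as in the NCBI translation tables.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g., with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_peptides(seq, min_len=7):
    """Yield (frame, stop-free stretch) pairs for all six reading frames.

    min_len is an illustrative cutoff (hypothetical value) that discards
    stretches too short to be useful as database entries.
    """
    seq = seq.upper()
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(template[offset:])
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    yield f"{strand}{offset + 1}", segment


if __name__ == "__main__":
    # Toy sequence for demonstration only, not real P. larvae data.
    for frame, segment in six_frame_peptides("ATGGCCAAATAA", min_len=3):
        print(frame, segment)
```

Each additional reading frame and each additional genome enlarges the resulting database, which is the source of the database inflation effect discussed above.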
In this study, we evaluated MS/MS data from P. larvae virulence
investigations using the proteogenomic approach. We compared protein hit
identification in a proteogenomic search against a database constructed
of 15 complete genomic assemblies from GenBank and publicly available
annotated sequences. In addition, we sought to identify
potentially overlooked/missed virulence factors among the database
components.