A proteogenomic analysis and use of an array of genome assemblies to
scout out new virulence factors: a case of Paenibacillus larvae
Abstract
High-throughput proteomics is an effective methodology for identifying a
variety of virulence factors of pathogens. Proteomic data are commonly
evaluated against annotated sequences present in publicly available
database repositories. A proteogenomic approach can be used when annotated
sequences are unavailable or when the aim is to identify novel proteins or peptides.
However, a single genome is commonly utilized in proteomic and
proteogenomic analyses. We pose the question of whether utilizing a
number of different genome assemblies of a bacterial pathogen would be
beneficial. Here, we used previously obtained shotgun label-free
nano-LC‒MS/MS data of the exoprotein fraction of four reference ERIC
I–IV genotypes of Paenibacillus larvae and evaluated them against
publicly available annotated sequences (from NCBI-protein, RefSeq,
UniProt) together with an array of protein sequences generated using a
six-frame direct translation of 15 genomic assemblies available in
GenBank. The wide search through 18 database components reliably
identified 453 protein hits. UpSet analysis categorized the hits into 50
groups according to which databases successfully identified each protein. The
relatively high variability in successful identification among the
genome assemblies facilitated the mining of markers based on uniqueness
and contrasting results prior to considering proteome differences. Data
evaluation provided novel and interesting markers that can be studied
further.
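
As an illustration of the six-frame direct translation mentioned above, the following is a minimal Python sketch and not the pipeline used in this study. The function names (`reverse_complement`, `translate`, `six_frame_translate`) are hypothetical, the toy input sequence is invented for the example, and the codon assignments are those of the standard genetic code, which the bacterial translation table shares apart from its alternative start codons.

```python
# Minimal sketch of six-frame translation of a nucleotide sequence.
# Assumption: standard codon assignments; codons with ambiguous bases become "X".

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order, matching the amino acid string above.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq: str) -> dict:
    """Translate a sequence in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, invented for illustration only.
    for frame, protein in six_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

In a proteogenomic search, each translated frame would typically be split at stop codons and the resulting fragments written to a FASTA file that serves as one database component; that step is omitted here and the details would depend on the search engine used.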