4 Discussion
The work described in this paper demonstrates that a proteogenomic
approach can be beneficial, especially for the assessment of differently
virulent bacterial strains for which more genome assemblies and
consequently more representative proteomes have been obtained. Although
the proteogenomic approach is becoming more commonly used, to our
knowledge, the analysis method presented here has not yet been applied.
In principle, here, we used a database constructed of the 15 raw genome
assemblies of different isolates of P. larvae enriched for
reference annotated sequences to identify the overall array of proteins
in exoprotein fractions expressed in vitro for four (ERIC I–IV)
type pathogen strains. For data inspection, we primarily considered the
identification success of the database components, while the
differences in exoproteome results among the four type genotypes/strains
of P. larvae were inspected afterward. We suggest that linking
the differences in success of identifications for different genome
sources with an array of proteomes of representative genotypes/strains
can be beneficial for mining of markers of virulence.
A wide database consisting of 18 components (3 reference databases and 15
proteogenomes derived from P. larvae genome assemblies) was used to
evaluate a proteomic dataset of P. larvae exoprotein fractions.
Identification of 453 reliable hits from a total of 28 analyses of
exoprotein fractions of four standard/model ERIC I–IV genotypes of
P. larvae showed that evaluation using the wide database was
successful and provided a useful dataset, despite the expansion of
search databases often leading to loss of identification of
lower-abundance peptides and proteins [7]. Incidentally, trace
identifications are not of interest in this approach if a specific protein
fraction is analyzed, e.g., the bacterial exoprotein fraction. It is
instead favorable to remove the traces from a dataset prior to detailed
data analysis [11]. On the other hand, a similar analysis using a
large search database is not recommended or feasible for datasets
aimed at identifying pathogen proteins in the host (e.g.,
in [12]), since the pathogen proteins are of relatively low abundance
compared to the host proteome, and database inflation will in this case
adversely affect data evaluation [7]. Overall, the array of
453 considered protein hits was useful for the purpose of this study
given the type of matrix analyzed.
The 28 raw MS/MS datasets that were evaluated constituted a complex source
for assessing the identification success of the proteogenomic database
components. In fact, these data are representative proteomes of four
type genotypes of P. larvae obtained from microbial collections, each
analyzed in 7 biological replicates [11]. However, the
division of the evaluated proteomic dataset into the ERIC I–IV
exoproteome profiles was initially disregarded, and instead, we first
determined the identification success in the entire dataset for the 18
components of the database search. The 50 distinct groupings of hits
based on database components were readily visualized as an UpSet plot
[20, 21], which is in this case more useful than a Venn diagram.
Overall, the proteogenomic search showed relatively high variability in
identification success, since 45% of the protein hits (204/453 hits)
were divided into 49 groups.
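The grouping behind an UpSet plot can be illustrated with a minimal sketch. It assumes that each protein hit is recorded together with the set of database components that identified it. The hit names, the three component labels and the helper functions below are hypothetical and are not taken from the study. The last line only reproduces the reported ratio 204/453.

```python
from collections import defaultdict

# Hypothetical toy data: each protein hit mapped to the database
# components that identified it (names are illustrative only).
ALL_COMPONENTS = frozenset({"RefSeq", "UniProt", "D14"})
hits = {
    "hit_A": {"RefSeq", "UniProt", "D14"},
    "hit_B": {"RefSeq", "UniProt", "D14"},
    "hit_C": {"D14"},
    "hit_D": {"RefSeq", "UniProt"},
}


def group_by_membership(hit_to_components):
    """Group hits by the exact set of components that identified them.

    Each distinct set corresponds to one intersection (one bar) in an
    UpSet plot.
    """
    groups = defaultdict(list)
    for hit, components in hit_to_components.items():
        groups[frozenset(components)].append(hit)
    return dict(groups)


def share_not_in_all(groups, all_components):
    """Fraction of hits that were NOT identified by every component."""
    total = sum(len(members) for members in groups.values())
    shared = len(groups.get(frozenset(all_components), []))
    return (total - shared) / total


groups = group_by_membership(hits)
variable_share = share_not_in_all(groups, ALL_COMPONENTS)

# Arithmetic check of the ratio reported in the text.
reported_share = 204 / 453
```

In this toy example, three groups arise from four hits. With real data, the same membership table could be passed to an UpSet plotting tool; the choice of tool is left open here.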
The differences identified in the reference databases highlight the
difference between the verified sequences, such as RefSeq, and wider
sequence resources related to the same taxon from the NCBI database.
However, there was also a difference between NCBI and UniProt, since not
all sequences are shared between database repositories, and the
identified individual differences confirmed the results of our analysis.
In this case, the choice of the optimal database depends on the benefits
it provides and can differ from case to case [5, 6]. If the common
array of proteins is searched for, then the narrower database can be
beneficial; however, if we also search for uncommon markers and
isoforms, a wider database appears to be beneficial. For analysis of a
pathogen with multiple variants/genotypes, a wider database that also
contains sequences of more strains is appropriate. Our findings are
consistent with the suggestion from a previous P. larvae virulence
factor analysis, namely, to use a wider database that contains
sequences of more strains/genotypes for proteomic analysis [11].
This approach will be beneficial, especially if the pathogen strain is
unknown, because some strains may express different isoforms, and there
can be differences in gene arsenal between strains for various reasons.
There can also be differences between the same P. larvae isolates
shared among different collections of microorganisms and those further
subcultured in laboratories [9] (i.e., D10 and D11 in our study).
Furthermore, the results of the proteogenomic search showed that from
the large database consisting of 15 genome assemblies from the pathogen
and reference databases, different arrays of proteins could be
identified. Our results showed that among these hits that differed among
the array of database components, markers that could be of interest in
relation to virulence could be identified. Overall, according to this
criterion, the results were primarily selected based on the difference
in successful decoy database and not by proteomic identification
(confirmed expression at the protein level) related to differences in
the exoproteomes of the P. larvae ERIC I–IV type genotypes. The
difference in virulence in relation to protein expression was examined
further by considering the uniqueness and contrasting results in the
UpSet analysis. Markers that we selected according to the consequent
application of the two criteria are discussed below.
Similar proteins that were identified by contrasting components of the
proteogenomic database facilitated identification of several isoforms
that also differed in protein expression in/among the genotypes. One of
the major markers of interest that we report here is GHL10‒FN3 (based on
result of domain analysis), which is expressed and secreted by the more
virulent P. larvae strains. We confirmed that one isoform was
expressed by ERICs II–IV and the second with amino acid substitutions
by ERIC II, which thus expressed two isoforms. To our knowledge,
GHL10‒FN3 (in databases denoted as Family 10
glycosylhydrolase/Fibronectin 3 or alpha-galactosidase) has not been
reported to be an important virulence factor of P. larvae.
Incidentally, this virulence marker was also not selected for emphasis
in our previous proteomic studies [11, 12], although it was detected
as confirmed by our reinspection of the earlier reported results for the
published data searches. The importance of GHL10‒FN3 as a virulence
marker is highlighted by the fact that we found it to be highly abundant
in the prepupa at different stages of infection, including the late
phase of infection, in which some other virulence factors may disappear/
be absent (e.g., collagenase/colA). The function of GHL10‒FN3 is debatable.
According to InterPro, GHL10 belongs to glycosyl-hydrolase-like proteins
that may have a number of activities similar to xylanases and
cellulases. Incidentally, the protein is in our results annotated as a
fibronectin type-III domain-containing protein, which means that it is a
multifunctional bacterial carbohydrate-splitting enzyme that plays key
roles in adhesion and migration [27, 28]. We do not know why the
protein has been annotated (in NCBI/UniProt) as alpha-galactosidase.
Alpha-galactosidases are not commonly studied as virulence factors,
although a few reports have indicated their high importance in
this regard. In Streptococcus pneumoniae, alpha-galactosidase is
a key player in the raffinose-family oligosaccharide utilization system
[29]. Alpha-galactosidase may participate in modifications of
glycoproteins and glycolipids via cleavage of oligosaccharides [30].
A different enzyme found as two isoforms distinguished by amino acid
substitutions is the collagenase colA. The expression pattern of colA
in the exoprotein fraction was similar to that of GHL10‒FN3, since one
isoform was expressed by ERICs II–IV and the second by ERIC II only.
Interestingly, the second isoform could be identified only from the D14
assembly and the reference databases. However, the proteogenomic
identification from the genome assemblies of P. larvae showed a
different pattern than that of GHL10‒FN3. Previously, we stressed the
importance of the collagenase colA in the virulence of P. larvae,
but we did not report the possible occurrence of the isoforms in the
exoprotein fractions [11]. Furthermore, in an in vivo study,
colA was not highlighted as being important because it was not detected
in the samples that showed the strongest infection (lysed) [12].
Notably, our re-evaluation here showed that the two
collagenase isoforms were not distinguished in the previous study.
The proteogenomics analysis allowed unique identification by the D14
database that was not possible with the reference databases. The D14
component of the database provided more unique and contrasting results;
it may be of interest that the P. larvae isolate SAG 10367
originated from Chile and was provided for genomic sequencing by A.
Alippi [31]. A new ORF was identified from the assembly, and our
analysis revealed that the protein combines glucose dehydrogenase and
the dihydroxyacetone kinase subunit DhaK, but in fact, two DhaK/DhaL
domains were predicted. Future study of this protein and examination of
whether it functions in the undivided/unsplit state may be of interest.
Several bacteria convert glycerol to dihydroxyacetone (Dha), which is
further converted to Dha phosphate by DhaK [32]. Interestingly,
glycerol metabolism, which is a major link between sugar and fatty acid
metabolism, may be linked to virulence [33, 34]. It may play a role
in adaptation to utilize glycerol-containing compounds such as
lipids/phospholipids, which can serve as a source of energy for
host-adapted bacteria [34, 35].
We verified the presence of the often discussed immune inhibitor A
(InhA) [10]. Interestingly, all 18 database components could
identify it, but consistent with a previous study, it was not identified
as being expressed in the ERIC I exoprotein fraction [11]. The
protein sequence showed the deletion of the peptide “EAGGGDLGE” that
is typical in various InhA proteins, including those of P.
larvae. Thus, it appears that the lower-virulence strain not expressing
InhA (“ERIC I”) emerged from a more virulent strain. This is
consistent with the suggestion of Beims et al. [9]. Whether the
ancestor of ERIC I was ERIC II or a different, more virulent type strain
(ERIC III–IV/V) is debatable. However, it is likely that intermediate
stages exist or existed among the virulent strains, which supports the
variability in different identifications from the genome assemblies in
this study. From our results, it is apparent that the type strain ERIC
I, which lacks InhA expression, also lacks other virulence factors, such
as colA and GHL10‒FN3. Our results also highlight the importance of
further investigating proteins with domains of unknown function, such as
DUF3221 (may be restricted only to Bacillus) and DUF3862. Such
proteins can have important functions, and they can be of high interest
since differences in their expression may indicate the presence of
intermediate stages among strains. Furthermore, the expression/presence
of CRISPR [36, 37] in strains is very important, as this is a poorly
understood virulence factor in P. larvae [11]. In addition,
we stress the importance of ABC transporters related to iron–siderophore
uptake [38–41], because they differed in
exoprotein fractions between ERICs I–II and ERICs III–IV, although the
proteins could be identified by all 18 database components. Until
now, P. larvae studies have focused on the system (proteins)
participating in siderophore production [9, 26].
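The deletion of the InhA peptide “EAGGGDLGE” noted earlier in this paragraph could be screened for across sequences with a simple substring check. The following is a minimal sketch only. The two sequences are invented placeholders (not real InhA sequences), and the function name is hypothetical. A real analysis would likely rely on a sequence alignment rather than exact matching.

```python
MOTIF = "EAGGGDLGE"  # peptide reported in the text

# Invented placeholder sequences, for illustration only.
sequences = {
    "with_motif": "MKTLEAGGGDLGEVNAQ",
    "without_motif": "MKTLVNAQ",
}


def motif_report(named_sequences, motif=MOTIF):
    """Return, for each sequence name, whether the motif occurs in it.

    Whitespace and line breaks are removed and the case is normalized
    so that wrapped FASTA-style sequence text is handled.
    """
    report = {}
    for name, seq in named_sequences.items():
        cleaned = "".join(seq.split()).upper()
        report[name] = motif in cleaned
    return report


report = motif_report(sequences)
```

An exact match will miss motifs that carry substitutions, so a negative result from this check should be confirmed by alignment.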
Last but not least, in this study, we reused previously evaluated and
published proteomic (MS/MS) data; therefore, no newly performed
laboratory analyses were required to obtain new useful results. This
underlines the importance of sharing high-quality and clearly described
raw proteomics data [42, 43], similar to genome assemblies,
transcriptomes and other HTS projects, making them available for
reanalyses. Moreover, although the analysis here was performed only on
P. larvae, the foundation provided by this study could also be
applied in the future to other bacterial pathogens of different hosts,
and the approach is also applicable to other microorganisms for studying
differences in genotypes. Based on this exemplary study, we propose that
in the future, proteomic experiments can be designed for the existing
array of different genomes and type genotypes available for the studied
organism.