4 Discussion
The work described in this paper demonstrates that a proteogenomic approach can be beneficial, especially for the assessment of bacterial strains that differ in virulence and for which multiple genome assemblies, and consequently more representative proteomes, have been obtained. Although the proteogenomic approach is becoming more commonly used, to our knowledge, the analysis method presented here has not yet been applied. In essence, we used a database constructed from the 15 raw genome assemblies of different isolates of P. larvae, enriched with annotated reference sequences, to identify the overall array of proteins in exoprotein fractions expressed in vitro by the four type strains of the pathogen (ERIC I–IV). For data inspection, we primarily utilized the identification success of the database components, while the differences in exoproteome results among the four type genotypes/strains of P. larvae were inspected afterward. We suggest that linking the differences in identification success among genome sources with an array of proteomes of representative genotypes/strains can be beneficial for mining markers of virulence.
A wide database consisting of 18 components (3 references and 15 proteogenomes derived from P. larvae genome assemblies) was used to evaluate a proteomic dataset of P. larvae exoprotein fractions. Identification of 453 reliable hits from a total of 28 analyses of exoprotein fractions of the four standard/model ERIC I–IV genotypes of P. larvae showed that evaluation using the wide database was successful and provided a useful dataset, even though expansion of search databases often leads to loss of identifications of lower-abundance peptides and proteins [7]. Incidentally, trace identifications are not of interest in this approach if a specific protein fraction is analyzed, e.g., the bacterial exoprotein fraction. It is instead favorable to remove the traces from the dataset prior to detailed data analysis [11]. On the other hand, a similar analysis using a large search database is not recommended or feasible for datasets aimed at identifying pathogen proteins in the host (e.g., in [12]), since the pathogen proteins are of relatively low abundance compared to the host proteome, and database inflation will in this case strongly impair data evaluation [7]. Overall, the array of 453 considered protein hits was useful for the purpose of this study given the type of matrix analyzed.
The 28 raw MS/MS datasets that were evaluated constituted a complex source for determining the identification success of the proteogenomic database components. In fact, these data are representative proteomes of the four type genotypes of P. larvae obtained from microbial collections, each analyzed in 7 biological replicates [11]. However, the division of the evaluated proteomic dataset into the ERIC I–IV exoproteome profiles was initially disregarded; instead, we first determined the identification success in the entire dataset for the 18 components of the database search. Incidentally, the differences among the 50 groupings based on database components were clearly visualized as an UpSet plot [20, 21], which is in this case more useful than a Venn diagram. Overall, the proteogenomic search showed relatively high variability in identification success, since 45% of the protein hits (204/453 hits) were divided into 49 groups.
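The grouping that underlies an UpSet plot can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the hit names and the three component labels are hypothetical toy data, and it assumes that each hit is assigned to exactly one group, defined by the exact set of database components that identified it.

```python
from collections import Counter


def group_hits_by_components(hit_to_components):
    """Count hits per exact (exclusive) combination of database components."""
    return Counter(frozenset(c) for c in hit_to_components.values())


def hits_outside_full_set(groups, all_components):
    """Return (hits not identified by every component, total hits, fraction)."""
    full = frozenset(all_components)
    total = sum(groups.values())
    outside = total - groups.get(full, 0)
    return outside, total, outside / total


# Hypothetical toy data: three components instead of the 18 used in the study.
components = ["RefSeq", "D10", "D14"]
hits = {
    "hit_a": {"RefSeq", "D10", "D14"},
    "hit_b": {"RefSeq", "D10", "D14"},
    "hit_c": {"D14"},
    "hit_d": {"RefSeq", "D10"},
}

groups = group_hits_by_components(hits)
outside, total, fraction = hits_outside_full_set(groups, components)

for members, count in groups.most_common():
    print(sorted(members), count)
print(f"{outside}/{total} hits fall outside the all-components group ({fraction:.0%})")

# Arithmetic reported in the text: 204 of 453 hits is about 45%.
print(f"{204 / 453:.1%}")
```

Each key of `groups` corresponds to one bar of an UpSet plot. With 18 components the number of possible combinations is far too large for a Venn diagram, whereas the UpSet layout shows only the combinations that actually occur.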
The differences identified in the reference databases highlight the difference between the verified sequences, such as RefSeq, and wider sequence resources related to the same taxon from the NCBI database. However, there was also a difference between NCBI and UniProt, since not all sequences are shared between database repositories, and the identified individual differences confirmed the results of our analysis. In this case, the choice of the optimal database depends on the benefits it provides and can differ from case to case [5, 6]. If the common array of proteins is searched for, then the narrower database can be beneficial; however, if we also search for uncommon markers and isoforms, a wider database appears to be beneficial. For analysis of a pathogen with multiple variants/genotypes, a wider database that also contains sequences of more strains is appropriate. Our findings are consistent with the previous suggestion of P. larvae virulence factor analysis, that is, the use of a wider database that contains sequences of more strains/genotypes for proteomic analysis [11]. This approach will be beneficial, especially if the pathogen strain is unknown, because some strains may express different isoforms, and there can be differences in gene arsenal between strains for various reasons. There can also be differences between the same P. larvae isolates shared among different collections of microorganisms and those further subcultured in laboratories [9] (i.e., D10 and D11 in our study).
Furthermore, the results of the proteogenomic search showed that from the large database consisting of 15 genome assemblies from the pathogen and reference databases, different arrays of proteins could be identified. Our results showed that among these hits that differed among the array of database components, markers that could be of interest in relation to virulence could be identified. Overall, according to this criterion, the results were primarily selected based on the difference in successful decoy database and not by proteomic identification (confirmed expression at the protein level) related to differences in the exoproteomes of the P. larvae ERIC I–IV type genotypes. The difference in virulence in relation to protein expression was examined further by considering the uniqueness and contrasting results in the UpSet analysis. Markers that we selected according to the consequent application of the two criteria are discussed below.
Similar proteins that were identified by contrasting components of the proteogenomic database facilitated the identification of several isoforms that also differed in protein expression among the genotypes. One of the major markers of interest that we report here is GHL10‒FN3 (named based on the result of domain analysis), which is expressed and secreted by the more virulent P. larvae strains. We confirmed that one isoform was expressed by ERICs II–IV and the second, with amino acid substitutions, by ERIC II, which thus expressed two isoforms. To our knowledge, GHL10‒FN3 (in databases denoted as Family 10 glycosylhydrolase/Fibronectin 3 or alpha-galactosidase) has not been reported to be an important virulence factor of P. larvae. Incidentally, this virulence marker was also not selected for emphasis in our previous proteomic studies [11, 12], although it was detected, as confirmed by our reinspection of the previously published data searches. The importance of GHL10‒FN3 as a virulence marker is highlighted by the fact that we found it to be highly abundant in the prepupa at different stages of infection, including the late phase of infection, in which some other virulence factors may be absent (e.g., collagenase/colA). The function of GHL10‒FN3 is disputable. According to InterPro, GHL10 belongs to the glycosyl-hydrolase-like proteins, which may have a number of activities similar to those of xylanases and cellulases. Incidentally, the protein is annotated in our results as a fibronectin type-III domain-containing protein, which means that it is a multifunctional bacterial carbohydrate-splitting enzyme that plays key roles in adhesion and migration [27, 28]. We do not know why the protein has been assigned (in NCBI/UniProt) as an alpha-galactosidase. Alpha-galactosidases are not commonly studied as virulence factors, although a few reports have indicated their high importance in this regard.
In Streptococcus pneumoniae, alpha-galactosidase is a key player in the raffinose-family oligosaccharide utilization system [29]. Alpha-galactosidase may participate in modifications of glycoproteins and glycolipids via cleavage of oligosaccharides [30].
A different enzyme found as two isoforms distinguished by amino acid substitutions is the collagenase colA. The expression pattern of colA in the exoprotein fraction was similar to that of GHL10‒FN3, since one isoform was expressed by ERICs I–IV and the second by ERIC II only. Interestingly, the second isoform could be identified only from the D14 assembly and the reference databases. However, the proteogenomic identification from the genome assemblies of P. larvae showed a different pattern than that of GHL10‒FN3. Previously, we stressed the importance of the collagenase colA in the virulence of P. larvae, but we did not report the possible occurrence of isoforms in the exoprotein fractions [11]. Furthermore, in an in vivo study, colA was not highlighted as being important because it was not detected in the samples that showed the strongest infection (lysed) [12]. Notably, our re-evaluation here also showed that the two collagenase isoforms were not distinguished in the previous study.
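Isoforms distinguished by amino acid substitutions, as described above for GHL10‒FN3 and colA, can be compared position by position. The following Python sketch is only illustrative: the two short sequences are hypothetical and are not the actual isoform sequences, and the function assumes that the sequences are already aligned and of equal length.

```python
def substitutions(seq_a, seq_b):
    """List (1-based position, residue in seq_a, residue in seq_b) where two
    pre-aligned, equal-length protein sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Hypothetical toy sequences, not taken from any P. larvae protein.
isoform_1 = "MKTAYIAKQR"
isoform_2 = "MKTAYLAKQK"

diffs = substitutions(isoform_1, isoform_2)
for pos, a, b in diffs:
    print(f"{a}{pos}{b}")  # conventional substitution notation, e.g. I6L
```

In an MS/MS search, two such isoforms can be told apart only if at least one identified peptide covers a substituted position. This is why a database containing both sequence variants is needed to distinguish them.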
The proteogenomics analysis allowed unique identification by the D14 database that was not possible with the reference databases. The D14 component of the database provided more unique and contrasting results; it may be of interest that the P. larvae isolate SAG 10367 originated from Chile and was provided for genomic sequencing by A. Alippi [31]. A new ORF was identified from the assembly, and our analysis revealed that the protein combines glucose dehydrogenase and the dihydroxyacetone kinase subunit DhaK, but in fact, two DhaK/DhaL domains were predicted. Future study of this protein and examination of whether it functions in the undivided/unsplit state may be of interest. Several bacteria convert glycerol to dihydroxyacetone (Dha), which is further converted to Dha phosphate by DhaK [32]. Interestingly, glycerol metabolism, which is a major link between sugar and fatty acid metabolism, may be linked to virulence [33, 34]. It may play a role in adaptation to utilize glycerol-containing compounds such as lipids/phospholipids, which can serve as a source of energy for host-adapted bacteria [34, 35].
We verified the presence of the often discussed immune inhibitor A (InhA) [10]. Interestingly, all 18 database components could identify it, but consistent with a previous study, it was not identified as being expressed in the ERIC I exoprotein fraction [11]. The protein sequence showed the deletion of the peptide “EAGGGDLGE” that is typical in various InhA proteins, including those of P. larvae. Thus, it appears that the lower-virulence strain not expressing InhA (“ERIC I”) emerged from a more virulent strain. This is consistent with the suggestion of Beims et al. [9]. Whether the ancestor of ERIC I was ERIC II or a different, more virulent type strain (ERIC III–IV/V) is debatable. However, it is likely that intermediate stages exist or existed among the virulent strains, which supports the variability in different identifications from the genome assemblies in this study. From our results, it is apparent that the type strain ERIC I, which lacks InhA expression, also lacks other virulence factors, such as colA and GHL10‒FN3. Our results also highlight the importance of further investigating proteins with domains of unknown function, such as DUF3221 (which may be restricted to Bacillus) and DUF3862. Such proteins can have important functions, and they can be of high interest since differences in their expression may indicate the presence of intermediate stages among strains. Furthermore, the expression/presence of CRISPR [36, 37] in strains is very important, as this is a poorly understood virulence factor in P. larvae [11]. In addition, we stress the importance of ABC transporters related to iron–siderophore uptake [38–41], because we identified differences in these proteins in exoprotein fractions between ERICs I–II and ERICs III–IV, although the proteins could be identified by all 18 database components. Until now, studies of P. larvae have focused on the system (proteins) participating in siderophore production [9, 26].
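The InhA observation above concerns the presence or absence of the peptide “EAGGGDLGE” in a protein sequence, which can be checked with a simple substring test. The Python sketch below assumes the sequences are available as plain one-letter strings. The entry names and flanking residues are hypothetical, and only the motif itself is taken from the text.

```python
INHA_MOTIF = "EAGGGDLGE"  # peptide named in the text


def motif_status(sequences, motif=INHA_MOTIF):
    """Map each sequence name to True if the motif occurs in it, else False."""
    return {name: motif in seq.upper() for name, seq in sequences.items()}


# Hypothetical toy entries; flanking residues are invented for illustration.
toy_sequences = {
    "with_motif": "MKKLV" + INHA_MOTIF + "TTPQ",
    "without_motif": "MKKLVTTPQ",
}

status = motif_status(toy_sequences)
for name, present in status.items():
    print(name, "motif present" if present else "motif absent")
```

Applied to the InhA entries translated from each genome assembly, such a check would only indicate which sequences carry the motif. It says nothing about expression, which must still be confirmed at the protein level.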
Last but not least, in this study, we reused previously evaluated and published proteomic (MS/MS) data; therefore, no new laboratory analyses were required to obtain new useful results. This underlines the importance of sharing high-quality and clearly described raw proteomics data [42, 43], similar to genome assemblies, transcriptomes and other HTS projects, making them available for reanalyses. Moreover, although the analysis here was performed only on P. larvae, the foundation provided by this study could also be applied in the future to other bacterial pathogens of different hosts, and the approach is also applicable to other microorganisms for studying differences in genotypes. Based on this exemplary study, we propose that in the future, proteomic experiments can be designed for the existing array of different genomes and type genotypes available for the studied organism.