3.2 Groups from UpSet analysis
3.2.1 Trends in grouping
Overall, 50 different variations in database grouping were found. Comparing the identification success of the individual proteogenomic databases with the current level of genome annotation (complete or contig/scaffold) showed that the annotation level was not the key factor influencing data evaluation. For instance, D12 (level: contig) ranked first, with the highest set size, and D2 (level: contig) ranked second; D1 (level: scaffold) and D8 (level: contig) were approximately in the middle; and only D9 had a low set size (the lowest of all components).
No variation in identification success was found for 249 protein hits, which were identified by all 18 database components. The second largest group consisted of 46 protein hits that were identified by 17 of the 18 database components; only D9 failed to identify this large array of protein hits. The third largest set, consisting of 23 hits, was identified by the D1, D3, D5, D13, D15, and D16–D18 database components. Interestingly, several other groups of hits showed a similar identification pattern with relatively small modifications: compared with the 23-hit group, a group of 6 hits was identified by one additional component, D14, and another group of 5 hits was identified by two additional components, D14 and D4. In contrast, identification success differed markedly between some groups of protein hits, such as between a group of 8 protein hits and its neighboring group of 6 hits, shown on the right in Figure 1. These results may indicate trends of similarity in identification success among the proteogenomic databases.
Further inspection of the groups determined by the UpSet analysis in relation to identifications among ERIC genotypes did not indicate that this grouping corresponds exactly to the categorization based on proteomic differences in the exoprotein fractions of the analyzed ERICs. However, individual inspection of some of the contrasting groupings enabled the selection of some markers of interest and importance, which are reported below.
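The grouping logic behind an UpSet-style analysis can be illustrated with a minimal Python sketch. It assumes each hit is stored with the set of database components that identified it; the hit ids, memberships, and function names (`group_by_pattern`, `set_sizes`) are hypothetical and do not reproduce the study's data or the software actually used.

```python
from collections import Counter

# Hypothetical toy data: hit id -> set of database components that identified it.
# These memberships are illustrative only and are not the study's results.
hits = {
    "h1": {"D1", "D2", "D3"},
    "h2": {"D1", "D2", "D3"},
    "h3": {"D1", "D2"},
    "h4": {"D1", "D3"},
    "h5": {"D1", "D2"},
    "h6": {"D3"},
}


def group_by_pattern(membership):
    """Count hits per exact (exclusive) combination of components."""
    return Counter(frozenset(dbs) for dbs in membership.values())


def set_sizes(membership):
    """Count how many hits each component identified in total."""
    return Counter(db for dbs in membership.values() for db in dbs)


groups = group_by_pattern(hits)
sizes = set_sizes(hits)

# List the groups from largest to smallest, as in the bar chart of an UpSet plot.
for pattern, n_hits in groups.most_common():
    print(sorted(pattern), n_hits)

# List the components by set size, from highest to lowest.
print(sizes.most_common())
```

In this representation, each distinct membership pattern is one group (one column of the UpSet plot), and the set size of a component is the number of hits it identified regardless of the other components.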
3.2.2 Hits identified only by reference databases
The lowest variation in identification success was observed among the reference databases, D16–D18. Together, the three reference databases uniquely identified 1 protein hit (id: 25) that none of the proteogenomic databases (D1–D15) identified. This hit corresponds to pyruvate dehydrogenase E1 component subunit beta, or alpha-ketoacid dehydrogenase subunit beta, and was identified in all samples/genotypes. Furthermore, D16 and D17 were the sole database components that identified a group of 2 hits: a fructose-6-phosphate aldolase (id: 481), identified in all samples, and a hypothetical protein (id: 482), identified only in ERIC III and IV.
3.2.3 Identifications missed by reference databases
D16 (NCBIall) failed to identify only 1 protein hit, which was identified solely by D14. D17 (RefSeq) did not identify 6 hits, 5 of which were identified by D16 (NCBIall). Interestingly, a group of 3 hits was missed by D17 alone, despite being identified by the remaining 17 components of the database search.
D18 (UniProtref) did not identify 4 protein hits, 3 of which were identified by both D16 (NCBIall) and D17 (RefSeq). Verification using BLASTp in UniProt confirmed that these protein sequences could not be identified because they were absent from the UniProt repository, although similar protein sequences were present. One of the hits not identified by D18 was a DUF3221 domain-containing protein (id: 438; WP_096761230.1, ADZY03000171.1:False:24816); it was, however, identified by D16, D17 and D8. Proteins similar to DUF3221 were identified among other hits in our results (see section 3.3.3). A hypothetical protein (id: 482; WP_268570747.1) was identified only in ERIC III and IV; a BLAST search in UniProt found only proteins with low similarity, from different organisms. Finally, a fructose-6-phosphate aldolase (id: 481; WP_268569226.1) was identified in all 4 genotypes (ERIC I–IV) but had the highest intensity in ERIC I. Notably, BLAST identified a highly similar P. larvae (DSM 25430) protein in UniProt, V9W7D8 (identity 99.5%); although this sequence was included in the search, it was absent from our reported results. Thus, the program quantitatively evaluated the sequence that was present in the NCBI database and absent from UniProt, but not the highly similar sequence present in UniProt.
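The bookkeeping of missed hits described in this section can be expressed as simple set queries. The following is a minimal sketch, assuming a mapping from hit id to the components that identified it; the toy hits, the reduced component list `ALL`, and the helper names are hypothetical and serve only to illustrate the three kinds of statements made above (missed by a component, missed by that component alone, and missed by one component but identified by another).

```python
# Hypothetical toy data (not the study's results): hit id -> identifying components.
# Only four components are used here to keep the example small.
ALL = {"D14", "D16", "D17", "D18"}
hits = {
    "a": {"D14"},
    "b": {"D14", "D16", "D18"},
    "c": {"D16", "D17"},
    "d": {"D14", "D16", "D17", "D18"},
}


def missed_by(membership, db):
    """Hit ids that the given component did not identify."""
    return {hit for hit, dbs in membership.items() if db not in dbs}


def missed_only_by(membership, db, components):
    """Hit ids identified by every component except the given one."""
    return {hit for hit, dbs in membership.items() if dbs == components - {db}}


def missed_but_found_by(membership, missing_db, other_db):
    """Hit ids missed by `missing_db` but identified by `other_db`."""
    found = {hit for hit, dbs in membership.items() if other_db in dbs}
    return missed_by(membership, missing_db) & found


print(sorted(missed_by(hits, "D17")))
print(sorted(missed_only_by(hits, "D17", ALL)))
print(sorted(missed_but_found_by(hits, "D17", "D16")))
```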
3.2.4 Differences in DSM 25430 assemblies: D10 and D11
The database components D10 and D11 were constructed from different genome assemblies of the same P. larvae isolate/collection of microorganisms, DSM 25430. The results of the UpSet analysis (Figure 1) showed a difference in identification success between the two components, although they were ordered based on set size. Comparison of the total annotated proteins of the related BioProjects (Table 1; Protein seq.) showed that D11 had 29 fewer proteins than D10. In line with this, protein identification based on the proteogenomic databases was more successful for D10. Overall, 5 hits were identified by D10 but not by D11: a) 4 hits (id: 468, 469, 470 and 473) in a group identified by the combination of D4, D10, D14 and D16–D18, and b) 1 hit (id: 471) in a group identified by D10 and D16–D18. These 5 hits were almost specific to the ERIC II exoprotein fraction (the exception was id: 468, identified in 2/7 analyses in ERIC I), which was also obtained by analysis of the same isolate/collection of microorganisms, DSM 25430.
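The comparison of D10 and D11 amounts to a set difference followed by grouping on the full membership pattern. The sketch below encodes the memberships of the 5 hits as stated in the text above; the entry labeled `placeholder` is a hypothetical hit added only so that the filter has something to exclude, and the function name `only_in_first` is likewise an assumption made for illustration.

```python
from collections import defaultdict

REF = {"D16", "D17", "D18"}  # reference database components

# Memberships of hits 468-473 follow the text; "placeholder" is hypothetical.
hits = {
    "468": {"D4", "D10", "D14"} | REF,
    "469": {"D4", "D10", "D14"} | REF,
    "470": {"D4", "D10", "D14"} | REF,
    "473": {"D4", "D10", "D14"} | REF,
    "471": {"D10"} | REF,
    "placeholder": {"D10", "D11"} | REF,  # hypothetical hit found by both
}


def only_in_first(membership, first, second):
    """Group hits identified by `first` but not by `second` by full pattern."""
    groups = defaultdict(list)
    for hit, dbs in membership.items():
        if first in dbs and second not in dbs:
            groups[frozenset(dbs)].append(hit)
    return dict(groups)


diff = only_in_first(hits, "D10", "D11")
for pattern, ids in diff.items():
    print(sorted(pattern), sorted(ids))
```

Swapping the arguments, `only_in_first(hits, "D11", "D10")`, gives the reverse comparison; with the entries listed here it returns an empty mapping.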