3.2 Groups from UpSet analysis
3.2.1 Trends in grouping
Overall, 50 different variations in database grouping were found.
Comparing the identification success of the individual proteogenomic
databases with the current level of genome annotation (complete or
contig/scaffold) showed that the annotation level was not the key factor
influencing data evaluation. For instance, D12 (level: contig) ranked
first with the highest set size, D2 (level: contig) ranked second, D1
(level: scaffold) and D8 (level: contig) were approximately in the
middle, and only D9 had a low set size, the lowest overall.
No variation in identification success was found for the 249 protein
hits that were identified by all 18 database components. The second
group with no variation consisted of 46 protein hits that were
identified by 17 of the 18 database components; only D9 failed to
identify this large group of hits. The third largest set, which consisted of
23 hits, was identified by the D1, D3, D5, D13, D15, and D16–18
database components. Interestingly, this identification pattern recurred
with relatively small modifications in some other groups of hits; for
instance, compared to the 23-hit group, a group of 6 hits was identified
by one additional component, D14, and another group of 5 hits was
identified by two additional components, D14 and D4. Conversely, the
identification pattern differed markedly between some groups of protein
hits, such as between a group of 8 protein hits and its neighboring
group of 6 hits, shown on the right in Figure 1. These results may
indicate a trend of similarity in identification success among the
proteogenomic databases.
Further inspection of the groups determined by the UpSet analysis in
relation to identifications among ERIC genotypes did not indicate that
this grouping specifically matches the categorization based on proteomic
differences in the exoprotein fractions among the analyzed ERICs.
However, individual inspection of some of the contrasting groups enabled
the selection of several markers of interest, which are reported below.
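As a minimal sketch of how such groups can be derived, assume that each protein hit is stored together with the set of database components that identified it. The Python below groups hits by their exact combination of components (one UpSet intersection per combination) and counts the per-component set size. The hit names and memberships are hypothetical toy data, and the function names upset_groups and set_sizes are illustrative assumptions, not part of the software used in this study.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: hit name -> set of database components that identified it.
# These names and memberships are illustrative only, not the study's data.
ALL_DBS = [f"D{i}" for i in range(1, 19)]

hits = {
    "hit_a": set(ALL_DBS),
    "hit_b": set(ALL_DBS),
    "hit_c": set(ALL_DBS) - {"D9"},
    "hit_d": {"D1", "D3", "D5", "D13", "D15", "D16", "D17", "D18"},
    "hit_e": {"D1", "D3", "D5", "D13", "D14", "D15", "D16", "D17", "D18"},
}


def upset_groups(hits):
    """Group hits by the exact combination of components that identified them."""
    groups = defaultdict(list)
    for hit_id, dbs in hits.items():
        groups[frozenset(dbs)].append(hit_id)
    return dict(groups)


def set_sizes(hits):
    """Count, per component, how many hits it identified (the 'set size')."""
    sizes = Counter()
    for dbs in hits.values():
        sizes.update(dbs)
    return sizes


groups = upset_groups(hits)
sizes = set_sizes(hits)

# Print the largest groups first, with components in numeric order.
for combo, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    label = ",".join(sorted(combo, key=lambda d: int(d[1:])))
    print(len(members), label)
```

With real data, the number of keys in groups would correspond to the number of distinct component combinations, and sizes would give the ordering of the components by set size.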
3.2.2 Hits identified only by reference databases
The lowest variation in identification success was observed among the
reference databases, i.e., D16–D18. Together, the three reference
databases uniquely identified 1 protein hit (id: 25) that was not
identified by any of the proteogenomic databases (D1–D15). This hit,
pyruvate dehydrogenase E1 component subunit beta or alpha-ketoacid
dehydrogenase subunit beta, was identified in all samples/genotypes.
Furthermore, D16 and D17 were the sole database components that
identified a group of 2 hits: a fructose-6-phosphate aldolase (id: 481)
identified in all samples and a hypothetical protein (id: 482)
identified only in ERIC III and IV.
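The selection of reference-only hits can be expressed as a subset test: a hit qualifies when every component that identified it is one of D16–D18. The sketch below uses the memberships stated in this section for ids 25, 481, 482 and 471; the dictionary layout and the function name reference_only are illustrative assumptions.

```python
# Memberships for these four hits are taken from the text of this section;
# the data layout and function name are illustrative assumptions.
REFERENCE = frozenset({"D16", "D17", "D18"})

hits = {
    25: {"D16", "D17", "D18"},
    481: {"D16", "D17"},
    482: {"D16", "D17"},
    471: {"D10", "D16", "D17", "D18"},  # also found by a proteogenomic database
}


def reference_only(hits, reference=REFERENCE):
    """Keep hits identified by at least one reference database and nothing else."""
    return {hit_id: dbs for hit_id, dbs in hits.items() if dbs and dbs <= reference}


ref_only = reference_only(hits)
print(sorted(ref_only))
```

Hit 471 is excluded because D10 also identified it, which is the distinction drawn between reference-only and shared identifications.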
3.2.3 Identifications missed by reference databases
D16 (NCBIall) failed to identify only 1 protein hit, which was
identified uniquely by D14. D17 (RefSeq) did not identify 6 hits, 5 of
which were identified by D16 (NCBIall). Interestingly, a group of 3 hits
was missed by D17 alone, despite identification by the remaining 17
components of the database search. D18 (UniProtref) did not identify 4
protein hits, 3 of which were identified by both D16 (NCBIall) and D17
(RefSeq); verification using BLASTp in UniProt confirmed that these
protein sequences could not be identified because they were absent from
the UniProt repository, although similar protein sequences were present.
A DUF3221 domain-containing
protein (id: 438; WP_096761230.1, ADZY03000171.1:False:24816) was one
of the hits that was not identified by D18; however, it was identified
by D16, D17 and D8. Proteins similar to DUF3221 were identified in our
results in other hits (see section 3.3.3). A hypothetical protein (id:
482; WP_268570747.1) was identified only in ERIC III and IV; a BLAST
search in UniProt returned only proteins with low similarity from
different organisms. Finally, a fructose-6-phosphate aldolase (id: 481;
WP_268569226.1) was identified in all 4 genotypes (ERIC I–IV) but had
the highest intensity in ERIC I. Notably, BLAST identified a highly
similar protein of P. larvae (DSM 25430) in UniProt, V9W7D8 (identity
99.5%), which was absent from our reported results although it was
included in the search. Thus, the program quantitatively evaluated the
sequence that was present in the NCBI database and absent from UniProt
rather than the highly similar sequence present in UniProt.
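The counts for D18 can be traced with two small set operations: first collect the hits that a given component missed, then ask which of them were identified by a chosen set of other components. The sketch below is restricted to the hits named in this section, so it reproduces only the D18 figures; the hit identified solely by D14 has no id in the text and is given a placeholder key. The function names missed_by and rescued_by are illustrative assumptions.

```python
# Memberships follow the text of this section. The hit found only by D14 has
# no id in the text, so a placeholder key is used for it.
hits = {
    "25": {"D16", "D17", "D18"},
    "438": {"D8", "D16", "D17"},
    "481": {"D16", "D17"},
    "482": {"D16", "D17"},
    "D14-only (id not stated)": {"D14"},
}


def missed_by(hits, db):
    """Return the hits that the given component did not identify."""
    return {hit_id for hit_id, dbs in hits.items() if db not in dbs}


def rescued_by(hits, missed, rescuers):
    """Of the missed hits, return those identified by every component in rescuers."""
    return {hit_id for hit_id in missed if rescuers <= hits[hit_id]}


missed_d18 = missed_by(hits, "D18")
rescued = rescued_by(hits, missed_d18, {"D16", "D17"})
print(len(missed_d18), len(rescued))
```

Because this subset omits the other hits missed by D16 or D17, calling missed_by for those components would not reproduce the totals reported for them.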
3.2.4 Differences in DSM 25430 assemblies: D10 and D11
The database components D10 and D11 were constructed from different
genome assemblies of the same P. larvae isolate/collection of
microorganisms, DSM 25430. The results of the UpSet analysis (Figure 1)
showed a difference in identification success between the two
components, although the components were ordered based on set size.
Comparison of the total annotated proteins of the related BioProjects
(Table 1; Protein seq.) showed that D11 had 29 fewer proteins than D10.
Additionally, protein identification with the proteogenomic databases
was more successful for D10. Overall, 5 hits were identified by D10 but
not by D11: a) 4 hits (id: 468, 469, 470 and 473) in a group of hits
from the combination of D4, D10, D14 and D16–18, and b) 1 hit (id: 471)
in a group of hits from D10 and D16–18. These 5 hits were almost
specific to the ERIC II exoprotein fraction (the exception was id: 468,
identified in 2/7 analyses in ERIC I), which was also obtained by
analysis of the same isolate/collection of microorganisms, DSM 25430.
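The comparison of the two assemblies amounts to a set difference that is then split by component combination. The sketch below uses the memberships given above for ids 468, 469, 470, 471 and 473, plus one placeholder hit identified by all 18 components to represent the shared identifications; the placeholder and the function name only_in_first are illustrative assumptions.

```python
from collections import defaultdict

REF = {"D16", "D17", "D18"}

# Memberships for the five numbered hits follow the text; the last entry is a
# placeholder standing in for hits identified by all 18 components.
hits = {
    "468": {"D4", "D10", "D14"} | REF,
    "469": {"D4", "D10", "D14"} | REF,
    "470": {"D4", "D10", "D14"} | REF,
    "473": {"D4", "D10", "D14"} | REF,
    "471": {"D10"} | REF,
    "shared (placeholder)": {f"D{i}" for i in range(1, 19)},
}


def only_in_first(hits, first, second):
    """Group hits identified by `first` but not by `second` by their combination."""
    groups = defaultdict(list)
    for hit_id, dbs in hits.items():
        if first in dbs and second not in dbs:
            groups[frozenset(dbs)].append(hit_id)
    return dict(groups)


d10_not_d11 = only_in_first(hits, "D10", "D11")
for combo, members in d10_not_d11.items():
    print(sorted(combo, key=lambda d: int(d[1:])), sorted(members))
```

Swapping the arguments, only_in_first(hits, "D11", "D10"), returns an empty result for this subset, since none of the listed hits was identified by D11 without D10.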