Improving Protein Structure Prediction with Extended Sequence Similarity
Searches and Deep-Learning-Based Refinement in CASP15
Abstract
The human predictor team PEZYFoldings placed third by GDT-TS (first
by the assessor's formulae) in the single-domain category and
tenth in the multimer category in CASP15. In this paper, I
describe the exact method used by PEZYFoldings in the competition. As
AlphaFold2 and AlphaFold-Multimer, developed by DeepMind, are
state-of-the-art structure prediction tools, I assumed that
enhancing the inputs and outputs of these tools would be an effective strategy to
obtain the highest accuracy for structure prediction. Therefore, I used
additional tools and databases to collect evolutionarily related
sequences and introduced a deep-learning-based model in the refinement
step. In addition to these modifications, manual interventions were
performed to address various tasks. Detailed analyses were performed
after the competition to identify the main contributors to performance.
Comparing the number of evolutionarily related sequences I used with
those of the other teams that provided AlphaFold2’s baseline predictions
revealed that an extensive sequence similarity search was one of the
main contributors. The impact of the refinement model was small but
statistically significant (p < 0.05 for the TM score). In addition, I
obtained large Z-scores for the subunits of H1137, for which I performed
manual domain parsing that took the interfaces between the subunits into
account.
This finding implies that the manual intervention contributed to my
performance. The prediction performance was low when I could not
identify evolutionarily related sequences. T1130 is one such example,
although other teams modeled better structures for it. Based on the
discussions from the CASP15 conference, the two teams that ranked higher
than PEZYFoldings had some hits for T1130. This may be because T1130 is
a eukaryotic protein, whereas the additional databases I used were mainly
derived from metagenomic sequences, which primarily consist of prokaryotic
proteins. These results highlight the opportunities for improvement in
1) multimer prediction, 2) building larger and more diverse databases,
and 3) developing tools to predict structures from primary sequences
alone. In addition, automating the manual intervention process remains
a task for future work.