Improving Protein Structure Prediction with Extended Sequence Similarity Searches and Deep-Learning-Based Refinement in CASP15

Toshiyuki Oda

doi:10.22541/au.168170992.27078535/v1

Abstract

The human predictor team PEZYFoldings placed third by GDT-TS (first by the assessor's formulae) in the single-domain category and tenth in the multimer category in CASP15. In this paper, I describe the exact method used by PEZYFoldings in the competition. Because AlphaFold2 and AlphaFold-Multimer, developed by DeepMind, are state-of-the-art structure prediction tools, I assumed that enhancing the input and output of these tools would be an effective strategy for obtaining the highest prediction accuracy. Therefore, I used additional tools and databases to collect evolutionarily related sequences and introduced a deep-learning-based model in the refinement step. In addition to these modifications, manual interventions were performed to address various tasks. Detailed analyses were performed after the competition to identify the main contributors to performance. Comparing the number of evolutionarily related sequences I used with those of the other teams that provided AlphaFold2 baseline predictions revealed that the extensive sequence similarity search was one of the main contributors. The impact of the refinement model was minimal (p < 0.05 for the TM-score). In addition, I gained large Z-scores for the subunits of H1137, for which I performed manual domain parsing that considered the interfaces between the subunits. This finding implies that manual intervention contributed to my performance. Prediction performance was low when I could not identify evolutionarily related sequences. T1130 is an example; however, other teams modeled better structures for this target. Based on discussions at the CASP15 conference, the two teams that ranked higher than PEZYFoldings had some hits for T1130. This may be because T1130 is a eukaryotic protein, whereas the additional databases I used were mainly built from metagenomic sequences, which primarily consist of prokaryotic proteins.
These results highlight opportunities for improvement in 1) multimer prediction, 2) building larger and more diverse databases, and 3) developing tools to predict structures from primary sequences alone. In addition, automating the manual intervention process remains a task for future work.

16 Apr 2023 Submitted to PROTEINS: Structure, Function, and Bioinformatics

17 Apr 2023 Submission Checks Completed

17 Apr 2023 Assigned to Editor

17 Apr 2023 Review(s) Completed, Editorial Evaluation Pending

19 Apr 2023 Reviewer(s) Assigned

25 May 2023 Editorial Decision: Revise Minor

26 Jun 2023 1st Revision Received

26 Jun 2023 Submission Checks Completed

26 Jun 2023 Assigned to Editor

26 Jun 2023 Review(s) Completed, Editorial Evaluation Pending

26 Jun 2023 Reviewer(s) Assigned

28 Jun 2023 Editorial Decision: Accept

Peer review status: ACCEPTED