Convolutional ProteinUnetLM competitive with LSTM-based protein
secondary structure predictors
Abstract
Protein secondary structure (SS) prediction plays an important role
in characterizing protein structure and function. In recent years, a
new generation of SS prediction algorithms based on embeddings from
protein language models (pLMs) has emerged. These algorithms reach
state-of-the-art accuracy without the need for time-consuming
multiple sequence alignment (MSA) calculations.
LSTM-based SPOT-1D-LM and NetSurfP-3.0 are the latest examples of such
predictors. We present the ProteinUnetLM model using a convolutional
Attention U-Net architecture that provides prediction quality and
inference times at least as good as the best LSTM-based models for
8-class SS prediction (SS8). Additionally, we address the issue of the
heavily imbalanced nature of the SS8 problem by extending the loss
function with the Matthews correlation coefficient (MCC) and by
assessing predictions with the previously introduced adjusted geometric
mean (AGM) metric. ProteinUnetLM achieved better AGM and sequence overlap score
(SOV) than LSTM-based predictors, especially for the rare structures
3₁₀-helix (G), beta-bridge (B), and high-curvature loop (S). It is also
competitive on challenging datasets without homologs, free-modeling
targets, and chameleon sequences. Moreover, ProteinUnetLM outperformed
its MSA-based predecessor, ProteinUnet2, and provided better AGM
than AlphaFold2 for one third of the proteins in the CASP14 dataset,
demonstrating its potential to make a significant step forward in the field. To
facilitate the usage of our solution by protein scientists, we provide
an easy-to-use web interface at
[https://biolib.com/SUT/ProteinUnetLM/](https://biolib.com/SUT/ProteinUnetLM/).
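
To illustrate why MCC is a useful signal on imbalanced SS8 labels, the following minimal sketch computes a one-vs-rest MCC per class from hard label strings and averages it. It is not the ProteinUnetLM loss itself: a trainable loss would need a differentiable variant built from predicted probabilities, and the exact formulation used by the model is not given in this abstract. The function names, the `GHIEBTSC` label alphabet, and the toy sequences are assumptions made for illustration only.

```python
import math

def binary_mcc(y_true, y_pred, positive):
    """One-vs-rest Matthews correlation coefficient for a single class label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mean_mcc(y_true, y_pred, classes="GHIEBTSC"):
    """Unweighted mean of the per-class MCC over an assumed 8-letter alphabet."""
    return sum(binary_mcc(y_true, y_pred, c) for c in classes) / len(classes)

# Toy example (hypothetical data): the rare G class is never predicted.
true_ss = "HHHHCCGGG"
pred_ss = "HHHHCCCCC"
accuracy = sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)
mcc_h = binary_mcc(true_ss, pred_ss, "H")
mcc_g = binary_mcc(true_ss, pred_ss, "G")
```

In this toy case the per-residue accuracy is 6/9 even though every G residue is missed, whereas the MCC for G is 0, which is the kind of failure on rare classes that plain accuracy tends to hide.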