TooT-PLM-P2S: Incorporating Secondary Structure Information into Protein
Language Models
Abstract
In bioinformatics, Protein Language Models (PLMs) have benefited the modeling of the protein space for better prediction of function and structure. PLMs are trained on proteins' amino acid sequences using self-supervised learning.
Ankh is a prime example of such a PLM. While there has been some recent work on integrating three-dimensional structure into PLMs to enhance predictive performance, to date no work has integrated secondary structure. Here we present TooT-PLM-P2S, which begins with the Ankh model pre-trained on 45 million proteins using self-supervised learning. TooT-PLM-P2S starts from Ankh's pre-trained encoder and decoder and further trains them on approximately 10,000 proteins paired with their corresponding secondary structures. This additional training modifies the encoder and decoder, yielding
TooT-PLM-P2S. We then assess the impact of integrating secondary
structure information into the Ankh model by comparing Ankh and
TooT-PLM-P2S on eight downstream tasks including fluorescence and
solubility prediction, sub-cellular localization, and membrane protein
classification. Each downstream task required task-specific training for both Ankh and TooT-PLM-P2S. Few of the comparisons showed statistically significant differences: Ankh outperformed on three of the eight tasks, while TooT-PLM-P2S did not outperform on any task on the primary metric. TooT-PLM-P2S did, however, outperform on precision for the task of discriminating membrane proteins from non-membrane proteins. These findings motivate future work with expanded datasets and refined integration methods.