Introduction
Three major types of protein secondary structure are alpha-helix (H), beta-strand (E), and coil (C) [1], each of which represents the local structural state of an amino acid in a folded polypeptide chain. Predicted secondary structure is useful for many applications in computational biology, such as protein residue-residue contact prediction [2-4], protein folding [5-7], ab initio protein structure modeling [8-10], and protein model quality assessment [11-12].
For instance, secondary structure prediction has been widely utilized in template-based structure modeling, through threading or comparative modeling, for proteins that have structurally determined homologs [10, 13-14], and in ab initio modeling for proteins whose sequences share little similarity with known solved structures [15-16].
Progress in protein secondary structure prediction over the past few decades can be summarized from two aspects: the discovery of novel features useful for prediction and the development of effective machine learning algorithms [17-18].
Early attempts utilized statistical propensities of single amino acids observed in known structures to identify secondary structures in proteins [19].
Subsequent improvements came from the inclusion of sequence evolutionary profile features inferred from multiple sequence alignments (MSAs), such as position-specific scoring matrices (PSSMs) [20-25].
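As background, the core idea of a PSSM can be illustrated with a minimal sketch: each alignment column is converted into log-odds scores of observed residue frequencies against background frequencies. This is a simplified toy version (uniform background, simple pseudocounts); production tools such as PSI-BLAST use database-derived backgrounds and sequence weighting.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Uniform background frequencies for illustration only; real PSSMs use
# database-derived background frequencies and weighted sequence counts.
BACKGROUND = {aa: 1.0 / 20 for aa in AMINO_ACIDS}

def simple_pssm(msa, pseudocount=1.0):
    """Compute a toy log-odds PSSM: one row of 20 scores per MSA column."""
    length = len(msa[0])
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in msa if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / BACKGROUND[aa])
        pssm.append(row)
    return pssm

msa = ["MKVL", "MKIL", "MRVL"]
pssm = simple_pssm(msa)
# 'M' is fully conserved in column 0, so its log-odds score is positive
assert pssm[0]["M"] > 0
```

A positive score indicates a residue is over-represented at that position relative to background, which is exactly the evolutionary-conservation signal the profile features exploit.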
In addition to the PSSM, hidden Markov model (HMM) profiles derived from HHblits [26] were proposed for predicting protein structural properties [27]. Atchley's factors were also included in some studies to capture the similarity between amino acid types [28-29].
Meanwhile, machine learning algorithms for protein secondary structure prediction have also continued to improve. Several early approaches applied shallow neural networks [30-31], information theory, and Bayesian analysis [32-34] to secondary structure prediction.
The PSIPRED method [21] proposed a two-stage neural network to predict secondary structure from PSI-BLAST sequence profiles. SSpro [24] used bidirectional recurrent neural networks to capture long-range interactions between amino acids.
Deep learning techniques have recently achieved significant success in secondary structure prediction [25, 29, 35-38].
DNSS [29] applied an ensemble of deep belief networks to predict 3-state secondary structure. SPIDER2 [39] employed stacked sparse auto-encoder neural networks to predict several structural properties iteratively; this method was further advanced by bidirectional long short-term memory (LSTM) neural networks to capture long-range interactions [37]. DeepCNF [36] integrated convolutional neural networks with conditional random fields to learn the complex sequence-structure relationship and the interdependence between sequence and secondary structure. Porter 5.0 [40] ensembled seven bidirectional recurrent neural networks to improve secondary structure prediction.
Assisted by the power of deep learning, the accuracy of 3-state secondary structure prediction has been improved to above 84% on some benchmark datasets [36-38].
In this work, we developed an improved version of our ab initio secondary structure prediction method (DNSS2) using multiple advanced deep learning architectures.
Three major improvements have been made over the original DNSS method. First, in addition to the PSSM profile features and Atchley's factors used in DNSS, we incorporated several novel features, such as the emission and transition probabilities derived from the hidden Markov model (HMM) profile [26] and profile probabilities inferred from the multiple sequence alignment (MSA) [22]. All three new features represent the evolutionary conservation of amino acids in the sequence.
Second, we designed and integrated six types of advanced one-dimensional deep networks for protein secondary structure prediction: the traditional convolutional neural network (CNN) [41], recurrent convolutional neural network (RCNN) [42], residual neural network (ResNet) [43], convolutional residual memory network (CRMN) [44], fractal network [45], and Inception network [46]. The ensemble of the six networks in DNSS2 significantly improved secondary structure prediction. Finally, DNSS2 was trained on a large dataset of 4,872 non-redundant protein structures with less than 25% pairwise sequence identity and 2.5 \(\mathring{\mathrm{A}}\) resolution.
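The ensembling step can be illustrated with a minimal sketch: each network outputs a per-residue probability distribution over the three states, the ensemble averages these distributions, and the highest-probability state is assigned to each residue. The names, shapes, and toy numbers below are illustrative assumptions, not DNSS2's actual code.

```python
import numpy as np

STATES = ["H", "E", "C"]  # helix, strand, coil

def ensemble_predict(per_model_probs):
    """Average per-residue 3-state probabilities from several networks
    and return one secondary-structure label per residue."""
    # per_model_probs: list of (L, 3) arrays, one per network
    avg = np.mean(per_model_probs, axis=0)  # shape (L, 3)
    return [STATES[i] for i in avg.argmax(axis=1)]

# Toy outputs from two hypothetical networks for a 2-residue sequence
net_a = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])
net_b = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
print(ensemble_predict([net_a, net_b]))  # ['H', 'C']
```

Averaging probabilities rather than taking a majority vote over hard labels lets a confident network outvote several uncertain ones, which is one common rationale for this style of ensembling.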
Our method was extensively tested against other state-of-the-art methods on an independent dataset and the latest CASP13 dataset, and it achieved state-of-the-art performance.
Materials and Methods