2.5 Training and evaluation procedure
Deeper networks with complex architectures are generally difficult to train
effectively because of their high-dimensional hyper-parameter space. To obtain
good performance on specific feature sets within a reasonable amount of time
for each deep network, we developed an efficient heuristic random sampling
approach for hyper-parameter optimization. Based on several preliminary
training trials, we first determined a reasonable range for each type of
network hyper-parameter: the number of filters from 20 to 50, the number of
convolution blocks from 3 to 7, and the filter size from 3 to 7. In each
subsequent trial, the hyper-parameter values were randomly sampled from their
specified ranges, and the Q3 accuracy of the network on the validation dataset
under that parameter combination was assessed. For each deep network, the best
parameter set was selected after 100 trials. We found that this random
sampling technique generated better models in most cases and was also more
efficient than traditional grid search or greedy search.
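The following is a minimal sketch of this random sampling procedure, not the
authors' released code; `train_and_score` is a hypothetical stand-in for
building, training, and validating one network:

```python
import random

# Hyper-parameter ranges reported in the text.
SEARCH_SPACE = {
    "num_filters": range(20, 51),    # number of filters: 20-50
    "num_conv_blocks": range(3, 8),  # number of convolution blocks: 3-7
    "filter_size": range(3, 8),      # filter size: 3-7
}

def train_and_score(params):
    # Placeholder so the sketch runs end-to-end; replace with real training
    # that returns the Q3 accuracy (%) on the validation set.
    return random.uniform(70.0, 85.0)

def random_search(n_trials=100):
    """Sample hyper-parameters at random; keep the set with the best Q3."""
    best_params, best_q3 = None, float("-inf")
    for _ in range(n_trials):
        params = {name: random.choice(space)
                  for name, space in SEARCH_SPACE.items()}
        q3 = train_and_score(params)
        if q3 > best_q3:
            best_params, best_q3 = params, q3
    return best_params, best_q3

best_params, best_q3 = random_search()
print(best_params, best_q3)
```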
The performance of different deep architectures and different feature profiles
on secondary structure prediction was rigorously examined using the training
and validation sets from the original DNSS method. After the parameters and
input features were determined, we trained each deep network on the latest
curated dataset (DNSS2_TRAIN) and selected the best models by Q3 accuracy on
the independent validation dataset (DNSS2_VAL). We used the Keras library
(http://keras.io/) with TensorFlow as the backend to train all networks.
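For illustration, a minimal Keras sketch of a per-residue three-state
classifier is given below. This is an assumption-laden toy, not the authors'
published model: the feature dimension `NUM_FEATURES`, layer sizes, and
optimizer are illustrative; only the variable-length input and the
three-class (helix/sheet/coil) softmax output follow the paper.

```python
from tensorflow.keras import layers, models

NUM_FEATURES = 45  # assumed per-position feature dimension K (not from the paper)

def build_cnn(num_filters=40, num_blocks=5, filter_size=5):
    # Variable-length input: one row of K features per residue.
    inp = layers.Input(shape=(None, NUM_FEATURES))
    x = inp
    for _ in range(num_blocks):
        x = layers.Conv1D(num_filters, filter_size,
                          padding="same", activation="relu")(x)
    # Per-residue softmax over the three states (helix, sheet, coil).
    out = layers.Conv1D(3, 1, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```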
The performance of DNSS2 was evaluated on the two independent datasets and
compared with a variety of state-of-the-art secondary structure prediction
tools, including SSpro5.2 [22], PSSpred [54], MUFOLD-SS [38], DeepCNF [36],
PSIPRED [55], SPIDER3 [37], Porter 5 [40], and our previous method DNSS1 [29].
All the methods were assessed according to their Q3 and SOV scores on each
dataset.
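Q3 is simply the percentage of residues whose three-state label (H, E, or C)
is predicted correctly; SOV, a segment-overlap measure, is more involved and
not sketched here. A minimal Q3 implementation:

```python
def q3_score(predicted, observed):
    """Percentage of residues whose 3-state label (H, E, or C) is correct."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q3_score("HHHCCEEC", "HHHCCEEE"))  # 87.5 (7 of 8 residues correct)
```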
Results
Benchmarking different deep architectures of DNSS2 against DNSS1
The first evaluation investigated whether the new deep architectures (DNSS2)
outperform the deep belief network (DNSS1) for secondary structure prediction.
To compare them fairly, we trained and validated the six deep networks on the
original input features of the same 1,230 training and 195 validation proteins
used to train and test DNSS1. Table 1 compares the Q3 and SOV scores of the
DNSS1 and DNSS2 architectures on the validation set. The results show that
five of the six new deep networks (RCNN, ResNet, CRMN, FractalNet, and
InceptionNet), all except the standard CNN, obtain higher Q3 scores than the
deep belief network used in DNSS1. InceptionNet worked best among the
individual deep architectures. The ensemble of the six deep architectures
(DNSS2) achieved the highest Q3 score of 83.04%, better than all six
individual architectures and the 79.1% Q3 score of DNSS1.
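The paper does not spell out the combination rule of the ensemble; one
plausible reading (an assumption here) is that the six networks' per-residue
class probabilities are averaged and the argmax over the three states is
taken, as sketched below:

```python
import numpy as np

def ensemble_predict(prob_list):
    """prob_list: per-architecture arrays of shape (L, 3) of class probabilities."""
    avg = np.mean(prob_list, axis=0)  # (L, 3) averaged probabilities
    return "".join("HEC"[i] for i in avg.argmax(axis=1))
```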
Impact of different input features
After the best deep learning architecture (i.e., InceptionNet) was determined,
it was used to examine the impact of the different input features, including
PSSM, Atchley factors (FAC), emission probabilities (Em), transition
probabilities (Tr), and amino acid probabilities from HHblits alignments
(HHblitsMSA). In this analysis, the protein sequence databases required for
alignment generation were updated to their latest versions, and all the input
features for the DNSS1 datasets were regenerated. Specifically, the Uniref90
database released in October 2018 was used to generate PSSM profiles with
PSI-BLAST, and the latest version of the Uniclust30 database (October 2017)
was used to generate HMM profiles with HHblits. The Inception network was then
trained on the 1,230 proteins using combinations of the five kinds of
features.
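The different feature sources are naturally combined by stacking them along
the per-residue feature axis. The sketch below is illustrative only: the
widths of each block (e.g., 20 PSSM columns, 5 Atchley factors, 7 transition
probabilities) are assumptions, not values taken from the paper.

```python
import numpy as np

L = 120                          # example sequence length
pssm = np.zeros((L, 20))         # PSI-BLAST PSSM
fac = np.zeros((L, 5))           # Atchley factors
em = np.zeros((L, 20))           # HMM emission probabilities
tr = np.zeros((L, 7))            # HMM transition probabilities (assumed width)
hhblits_msa = np.zeros((L, 20))  # amino acid probabilities from the HHblits MSA

features = np.concatenate([pssm, fac, em, tr, hhblits_msa], axis=1)
print(features.shape)  # (120, 72): one feature row per residue
```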
We tested the six feature combinations shown in Table 2. Hyper-parameter
optimization was applied to obtain the best model for each feature
combination. Table 2 shows the performance of the different input feature
combinations with the Inception network on the validation dataset of 195
proteins. Adding the emission profile inferred from the HMM model on top of
the PSSM and Atchley factor features increased the Q3 score from 79.81% to
82.31%. Integrating all five kinds of features yielded the highest Q3 score
(82.72%) and SOV score (75.89%).
The performance of the six deep architectures and their ensemble on the latest
features (the combination of all five kinds of features) of the DNSS1
validation dataset is also reported in Table 3. All six architectures were
re-trained on the 1,230 proteins and evaluated on the validation dataset.
Compared to the results in Table 1, the prediction accuracy of all the
networks on the validation set improved. The Q3 and SOV scores of the ensemble
(DNSS2) increased to 83.84% and 75.5%, respectively. The results indicate that
updating the protein sequence databases helps improve prediction accuracy.
Comparison of DNSS2 with eight state-of-the-art tools on two
independent test datasets
DNSS2 was compared with eight state-of-the-art methods, including SSpro5.2,
DNSS1, PSSpred, MUFOLD-SS, DeepCNF, PSIPRED, SPIDER3, and Porter 5, on the
DNSS2_TEST dataset, which contains non-redundant proteins released after Jan
1st, 2018. All the tools were downloaded and configured according to their
instructions, and the sequence databases they require were updated to the
latest versions. The Q3 score of each tool on the test dataset is reported in
Table 4. In general, DNSS2 is comparable to the two predictors Porter 5 and
SPIDER3 on this dataset and outperforms the other six methods. Specifically,
DNSS2 achieved a Q3 accuracy of 85.02% and an SOV accuracy of 76.01% on the
DNSS2_TEST dataset, which was significantly better than DNSS1 on the same
dataset (p-value = 2.2E-16).
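The paper does not name the statistical test behind this p-value; one
plausible choice (an assumption here) is a paired test on the per-target Q3
scores of the two methods, sketched with SciPy:

```python
from scipy import stats

def paired_significance(q3_dnss2, q3_dnss1):
    """Paired t-test on per-target Q3 scores of two methods over the same proteins."""
    t_stat, p_value = stats.ttest_rel(q3_dnss2, q3_dnss1)
    return p_value
```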
In addition to the DNSS2_TEST dataset, we also compared these methods on the
82 protein targets of the 2018 CASP13 experiment, which share less than 25%
sequence identity with the training proteins of DNSS2. Both template-based
modeling (TBM) and free-modeling (FM) protein targets were used to evaluate
the methods, and the results are summarized in Table 5. Consistent with the
performance on the DNSS2_TEST dataset shown in Table 4, DNSS2, SPIDER3, and
Porter 5 performed best, with DNSS2 achieving slightly better performance than
SPIDER3 and Porter 5. Figure 3 plots the distribution of Q3 scores over all
CASP13 targets for DNSS2 and the other eight methods. In general, the
distribution of DNSS2 consistently shifts toward higher Q3 scores compared
with the other methods, even though it largely overlaps with those of SPIDER3
and Porter 5.
Table 6 summarizes the confusion matrix of DNSS2's predictions of the three
kinds of secondary structures (helix, sheet, coil) on the CASP13 dataset.
DNSS2 yields the highest accuracy for helix prediction (87.91%), followed by
coil prediction (80.21%) and sheet prediction (76.45%). The misclassification
rates between helix, sheet, and coil were also computed: the error rate of
misclassifying helix as sheet is the lowest (0.57%), and that of
misclassifying sheet as coil is the highest (22.46%).
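A minimal sketch of such a per-residue confusion matrix follows; the
row-normalization to percentages (rows observed, columns predicted) is an
assumption about how Table 6 is laid out:

```python
import numpy as np

STATES = "HEC"  # helix, sheet, coil

def confusion_matrix(observed, predicted):
    """Row-normalized (%) confusion matrix: rows observed, columns predicted."""
    counts = np.zeros((3, 3))
    for o, p in zip(observed, predicted):
        counts[STATES.index(o), STATES.index(p)] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

print(confusion_matrix("HHHEECCC", "HHEEECCC"))
```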
Conclusion
In this work, we developed several advanced deep learning architectures and
their ensemble to improve secondary structure prediction. We investigated six
advanced deep learning architectures and five kinds of input features for
secondary structure prediction. Several of these architectures, such as the
Inception network, fractal network, and convolutional residual memory network,
are novel for protein secondary structure prediction and performed better than
the deep belief network. The performance of our deep learning method is
comparable to or better than that of seven external state-of-the-art methods
on the two independent test datasets. Our experiments also demonstrated that
the emission/transition probabilities extracted from hidden Markov model
profiles are useful for secondary structure prediction.
Acknowledgements
This work has been supported by an NIH grant (R01GM093123) and two NSF
grants (DBI1759934, IIS1763246) to JC.
Conflict of Interest
The authors have no conflicts of interest to declare.
References
1. Pauling, L.; Corey, R. B.; Branson, H. R., The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proceedings of the National Academy of Sciences 1951, 37 (4), 205-211.
2. Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J., Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Computational Biology 2017, 13 (1), e1005324.
3. Adhikari, B.; Hou, J.; Cheng, J., DNCON2: Improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 2017, 34 (9), 1466-1472.
4. Michel, M.; Hurtado, D. M.; Elofsson, A., PconsC4: fast, accurate, and hassle-free contact predictions. Bioinformatics 2018, bty1036.
5. Hou, J.; Adhikari, B.; Cheng, J., DeepSF: deep convolutional neural network for mapping protein sequences to folds. arXiv preprint arXiv:1706.01010 2017.
6. Jones, D. T.; Tress, M.; Bryson, K.; Hadley, C., Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins: Structure, Function, and Bioinformatics 1999, 37 (S3), 104-111.
7. Myers, J. K.; Oas, T. G., Preorganized secondary structure as an important determinant of fast protein folding. Nature Structural & Molecular Biology 2001, 8 (6), 552-558.
8. Adhikari, B.; Cheng, J., CONFOLD2: improved contact-driven ab initio protein structure modeling. BMC bioinformatics 2018, 19 (1), 22.
9. Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D., Protein structure prediction using Rosetta. Methods in enzymology 2004, 383, 66-93.
10. Roy, A.; Kucukural, A.; Zhang, Y., I-TASSER: a unified platform for automated protein structure and function prediction. Nature protocols 2010, 5 (4), 725-738.
11. Uziela, K.; Shu, N.; Wallner, B.; Elofsson, A., ProQ3: Improved model quality assessments using Rosetta energy terms. Scientific reports 2016, 6, 33509.
12. Cao, R.; Cheng, J., Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks. Methods 2016, 93, 84-91.
13. Webb, B.; Sali, A., Protein structure modeling with MODELLER. Protein Structure Prediction 2014, 1-15.
14. Wang, Z.; Eickholt, J.; Cheng, J., MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics 2010, 26 (7), 882-888.
15. Kryshtafovych, A.; Monastyrskyy, B.; Fidelis, K.; Moult, J.; Schwede, T.; Tramontano, A., Evaluation of the template-based modeling in CASP12. Proteins: Structure, Function, and Bioinformatics 2017.
16. Ovchinnikov, S.; Park, H.; Kim, D. E.; DiMaio, F.; Baker, D., Protein structure prediction using Rosetta in CASP12. Proteins: Structure, Function, and Bioinformatics 2017.
17. Yang, Y.; Gao, J.; Wang, J.; Heffernan, R.; Hanson, J.; Paliwal, K.; Zhou, Y., Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in bioinformatics 2016, bbw129.
18. Rost, B., Protein secondary structure prediction continues to rise. Journal of structural biology 2001, 134 (2-3), 204-218.
19. Chou, P. Y.; Fasman, G. D., Prediction of protein conformation. Biochemistry 1974, 13 (2), 222-245.
20. Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 1997, 25 (17), 3389-3402.
21. Jones, D. T., Protein secondary structure prediction based on position-specific scoring matrices. Journal of molecular biology 1999, 292 (2), 195-202.
22. Magnan, C. N.; Baldi, P., SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 2014, 30 (18), 2592-2597.
23. Pollastri, G.; McLysaght, A., Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics 2004, 21 (8), 1719-1720.
24. Pollastri, G.; Przybylski, D.; Rost, B.; Baldi, P., Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics 2002, 47 (2), 228-235.
25. Dor, O.; Zhou, Y., Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins: Structure, Function, and Bioinformatics 2007, 66 (4), 838-845.
26. Remmert, M.; Biegert, A.; Hauser, A.; Söding, J., HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods 2012, 9 (2), 173.
27. Meng, Q.; Peng, Z.; Yang, J.; Valencia, A., CoABind: a novel algorithm for Coenzyme A (CoA)- and CoA derivatives-binding residues prediction. Bioinformatics 2018, 1, 7.
28. Atchley, W. R.; Zhao, J.; Fernandes, A. D.; Drüke, T., Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences 2005, 102 (18), 6395-6400.
29. Spencer, M.; Eickholt, J.; Cheng, J., A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015, 12 (1), 103-112.
30. Qian, N.; Sejnowski, T. J., Predicting the secondary structure of globular proteins using neural network models. Journal of molecular biology 1988, 202 (4), 865-884.
31. Holley, L. H.; Karplus, M., Protein secondary structure prediction with a neural network. Proceedings of the National Academy of Sciences 1989, 86 (1), 152-156.
32. Gibrat, J.-F.; Garnier, J.; Robson, B., Further developments of protein secondary structure prediction using information theory: new parameters and consideration of residue pairs. Journal of molecular biology 1987, 198 (3), 425-443.
33. Stolorz, P.; Lapedes, A.; Xia, Y., Predicting protein secondary structure using neural net and statistical methods. Journal of Molecular Biology 1992, 225 (2), 363-377.
34. Schmidler, S. C.; Liu, J. S.; Brutlag, D. L., Bayesian segmentation of protein secondary structure. Journal of computational biology 2000, 7 (1-2), 233-248.
35. Faraggi, E.; Zhang, T.; Yang, Y.; Kurgan, L.; Zhou, Y., SPINE X: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. Journal of computational chemistry 2012, 33 (3), 259-267.
36. Wang, S.; Peng, J.; Ma, J.; Xu, J., Protein secondary structure prediction using deep convolutional neural fields. Scientific reports 2016, 6.
37. Heffernan, R.; Yang, Y.; Paliwal, K.; Zhou, Y., Capturing Non-Local Interactions by Long Short Term Memory Bidirectional Recurrent Neural Networks for Improving Prediction of Protein Secondary Structure, Backbone Angles, Contact Numbers, and Solvent Accessibility. Bioinformatics 2017, btx218.
38. Fang, C.; Shang, Y.; Xu, D., MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics 2018, 86 (5), 592-598.
39. Heffernan, R.; Paliwal, K.; Lyons, J.; Dehzangi, A.; Sharma, A.; Wang, J.; Sattar, A.; Yang, Y.; Zhou, Y., Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Scientific reports 2015, 5, 11476.
40. Torrisi, M.; Kaleel, M.; Pollastri, G., Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes. bioRxiv 2018, 289033.
41. Krizhevsky, A.; Sutskever, I.; Hinton, G. E. In ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems, 2012; pp 1097-1105.
42. Liang, M.; Hu, X. In Recurrent convolutional neural network for object recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015; pp 3367-3375.
43. He, K.; Zhang, X.; Ren, S.; Sun, J. In Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016; pp 770-778.
44. Moniz, J.; Pal, C., Convolutional residual memory networks. arXiv preprint arXiv:1606.05262 2016.
45. Larsson, G.; Maire, M.; Shakhnarovich, G., FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648 2016.
46. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. In Going deeper with convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015; pp 1-9.
47. Wang, G.; Dunbrack Jr, R. L., PISCES: a protein sequence culling server. Bioinformatics 2003, 19 (12), 1589-1591.
48. Li, W.; Godzik, A., Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22 (13), 1658-1659.
49. Zemla, A.; Venclovas, Č.; Fidelis, K.; Rost, B., A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins: Structure, Function, and Bioinformatics 1999, 34 (2), 220-223.
50. Kabsch, W.; Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules 1983, 22 (12), 2577-2637.
51. Consortium, U., UniProt: a hub for protein information. Nucleic acids research 2014, 43 (D1), D204-D212.
52. Mirdita, M.; von den Driesch, L.; Galiez, C.; Martin, M. J.; Söding, J.; Steinegger, M., Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic acids research 2016, 45 (D1), D170-D176.
53. Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R., Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014, 15 (1), 1929-1958.
54. Yan, R.; Xu, D.; Yang, J.; Walker, S.; Zhang, Y., A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Scientific reports 2013, 3, 2619.
55. McGuffin, L. J.; Bryson, K.; Jones, D. T., The PSIPRED protein structure prediction server. Bioinformatics 2000, 16 (4), 404-405.
Figure Legends
Figure 1. Overview of the experimental workflow for improving
secondary structure prediction. (A) Six principal steps are conducted to
construct and train deep networks. The solid box represents an analysis
step. The dashed box represents the output from the previous step. The
scroll represents the dataset used in each step. (B) Dataset generation
and filtering process.
Figure 2. Six deep learning architectures ((A) CNN, (B) ResNet,
(C) InceptionNet, (D) RCNN, (E) CRMN, (F) FractalNet) for secondary
structure prediction. L: sequence length; K: number of features per
position.
Figure 3. Comparison of the distribution of Q3 scores of eight existing
methods and that of DNSS2 on all CASP13 targets.
Tables