2.5 Training and evaluation procedure
Deeper networks with complex architectures are generally difficult to train effectively because of their high-dimensional hyper-parameter space. To obtain good performance on specific feature sets within a reasonable amount of time for each deep network, we developed an efficient heuristic random-sampling approach for hyper-parameter optimization. Based on several preliminary training trials, we first heuristically determined a reasonable range for each network hyper-parameter: the number of filters (20 to 50), the number of convolution blocks (3 to 7), and the filter size (3 to 7). In each subsequent trial, the value of each hyper-parameter was randomly sampled from its specified range, and the Q3 accuracy of the network trained with that parameter combination was assessed on the validation dataset. For each deep network, the best parameter set was selected after 100 trials. We found that random sampling generated better models in most cases and was also more efficient than traditional grid search or greedy search.
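As a minimal sketch of this random-sampling strategy (in Python, with `build_and_train` and `evaluate_q3` as hypothetical placeholders for the actual training and validation routines), the search might look like:

```python
import random

# Hyper-parameter ranges determined heuristically from preliminary trials
# (the ranges below are the ones stated in the text).
SEARCH_SPACE = {
    "num_filters": list(range(20, 51)),    # number of filters: 20-50
    "num_conv_blocks": list(range(3, 8)),  # convolution blocks: 3-7
    "filter_size": list(range(3, 8)),      # filter size: 3-7
}

def random_search(num_trials=100, seed=0):
    """Randomly sample hyper-parameter combinations and keep the one with
    the highest validation Q3 accuracy after the given number of trials."""
    rng = random.Random(seed)
    best_q3, best_params = -1.0, None
    for _ in range(num_trials):
        # Sample one value per hyper-parameter from its specified range.
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        model = build_and_train(params)  # hypothetical training routine
        q3 = evaluate_q3(model)          # hypothetical Q3 on the validation set
        if q3 > best_q3:
            best_q3, best_params = q3, params
    return best_params, best_q3
```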
The performance of the different deep architectures and feature profiles on secondary structure prediction was rigorously examined using the training and validation sets from the original DNSS method. After the parameters and input features were determined, we trained each deep network on the latest curated dataset (DNSS2_TRAIN) and selected the best models by Q3 accuracy on the independent validation dataset (DNSS2_VAL). We used the Keras library (http://keras.io/) with TensorFlow as the backend to train all networks.
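For illustration, a minimal Keras/TensorFlow baseline for per-residue three-state prediction might be defined as below. This is a sketch of the general setup only, not one of the actual DNSS2 architectures, and the feature dimension K is an assumption that depends on the input profile used.

```python
from tensorflow import keras
from tensorflow.keras import layers

K = 20  # assumed number of input features per residue

def build_baseline_cnn(num_filters=40, filter_size=5, num_blocks=5):
    """A plain 1D CNN mapping an (L, K) feature matrix to per-residue
    probabilities over the three secondary structure states (H, E, C)."""
    inputs = keras.Input(shape=(None, K))  # L (sequence length) is variable
    x = inputs
    for _ in range(num_blocks):
        x = layers.Conv1D(num_filters, filter_size,
                          padding="same", activation="relu")(x)
    # A 1x1 convolution with softmax yields a 3-state distribution per residue.
    outputs = layers.Conv1D(3, 1, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```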
The performance of DNSS2 was evaluated on the two independent datasets and compared with a variety of state-of-the-art secondary structure prediction tools, including SSpro5.2 22, PSSpred 54, MUFOLD-SS 38, DeepCNF 36, PSIPRED 55, SPIDER3 37, Porter 5 40, and our previous method DNSS1 29. All the methods were assessed according to the Q3 and SOV scores on each dataset.
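Q3 is simply the percentage of residues whose three-state label is predicted correctly; a minimal sketch follows. SOV is the segment-overlap measure of Zemla et al. 49, whose full definition is longer and is omitted here.

```python
import numpy as np

def q3_score(pred, true):
    """Q3: percentage of residues whose 3-state label (H/E/C) matches the
    DSSP-derived label. `pred` and `true` are equal-length label strings
    or sequences, e.g. "HHHEECC"."""
    pred = np.asarray(list(pred))
    true = np.asarray(list(true))
    return 100.0 * np.mean(pred == true)
```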
Results
Benchmarking different deep architectures of DNSS2 with DNSS1
The first evaluation investigated whether the new deep architectures in DNSS2 outperform the deep belief network of DNSS1 for secondary structure prediction. For a fair comparison, we trained and validated the six deep networks on the original input features of the same 1,230 training and 195 validation proteins used to train and test DNSS1. Table 1 compares the Q3 and SOV scores of the DNSS1 and DNSS2 architectures on the validation set. Five of the six new deep networks (RCNN, ResNet, CRMN, FractalNet, and InceptionNet), all except the standard CNN, obtained higher Q3 scores than the deep belief network used in DNSS1, with InceptionNet performing best among the individual architectures. The ensemble of the six deep architectures (DNSS2) achieved the highest Q3 score of 83.04%, better than each individual architecture and the 79.1% Q3 score of DNSS1.
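The text does not spell out the exact rule used to combine the six networks; a common choice, sketched below under that assumption, is to average the per-residue class probabilities across the networks before taking the argmax.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Combine per-residue class probabilities from several networks.
    `prob_list` holds one (L, 3) softmax-output array per architecture;
    simple averaging is an assumed combination rule."""
    avg = np.mean(np.stack(prob_list), axis=0)  # (L, 3) averaged probabilities
    return avg.argmax(axis=1)                   # predicted state index per residue
```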
Impact of different input features
After the best deep learning architecture (i.e., InceptionNet) was determined, it was used to examine the impact of the different input features: PSSM, Atchley factors (FAC), emission probabilities (Em), transition probabilities (Tr), and amino acid probabilities from HHblits alignments (HHblitsMSA). In this analysis, the protein sequence databases required for alignment generation were updated to the latest versions, and all input features for the DNSS1 datasets were regenerated. Specifically, the Uniref90 database released in October 2018 was used to generate PSSM profiles with PSI-BLAST, and the latest version of the Uniclust30 database (October 2017) was used to generate HMM profiles with HHblits. The Inception network was then trained on the 1,230 proteins using combinations of the five kinds of features. We tested the six feature combinations shown in Table 2, applying hyper-parameter optimization to obtain the best model for each combination. Table 2 reports the performance of the different input feature combinations with the Inception network on the validation dataset of 195 proteins. Adding the emission profile inferred from the HMM model on top of the PSSM and Atchley factor features increased the Q3 score from 79.81% to 82.31%. Integrating all five kinds of features yielded the highest Q3 score (82.72%) and SOV score (75.89%).
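Conceptually, each feature combination corresponds to concatenating the chosen per-residue profiles into one input matrix. The sketch below assumes illustrative feature widths (20 for PSSM, 5 Atchley factors, 20 HMM emission, 7 HMM transition, 20 HHblits MSA frequencies); the exact widths used by DNSS2 may differ.

```python
import numpy as np

def build_feature_matrix(pssm, fac, em, tr, hhblits_msa):
    """Concatenate the five per-residue profiles along the feature axis.
    Each argument is an (L, d_i) array for a protein of length L; the
    result is the (L, K) matrix fed to the network."""
    return np.concatenate([pssm, fac, em, tr, hhblits_msa], axis=1)

# Example with random placeholders for a protein of length L = 100.
L = 100
features = build_feature_matrix(
    np.random.rand(L, 20),  # PSSM
    np.random.rand(L, 5),   # Atchley factors
    np.random.rand(L, 20),  # HMM emission probabilities
    np.random.rand(L, 7),   # HMM transition probabilities
    np.random.rand(L, 20),  # HHblits MSA amino acid probabilities
)
```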
The performance of the six deep architectures and their ensemble using the latest features (the combination of all five kinds of features) on the DNSS1 validation dataset is reported in Table 3. All six architectures were re-trained on the 1,230 proteins and evaluated on the validation dataset. Compared with the results in Table 1, the prediction accuracy of all the networks on the validation set improved; the Q3 and SOV scores of the ensemble (DNSS2) increased to 83.84% and 75.5%, respectively. These results indicate that updating the protein sequence databases helps improve prediction accuracy.
Comparison of DNSS2 with eight state-of-the-art tools on two independent test datasets
DNSS2 was compared with eight state-of-the-art methods (SSpro5.2, DNSS1, PSSpred, MUFOLD-SS, DeepCNF, PSIPRED, SPIDER3, and Porter 5) on the DNSS2_TEST dataset, which contains non-redundant proteins released after January 1st, 2018. All the tools were downloaded and configured according to their instructions, and the sequence databases they require were updated to the latest versions.
The Q3 score of each tool on the test dataset is reported in Table 4. In general, DNSS2 is comparable to two predictors (Porter 5 and SPIDER3) on this dataset and outperforms the other six methods. Specifically, DNSS2 achieved a Q3 accuracy of 85.02% and a SOV accuracy of 76.01% on the DNSS2_TEST dataset, significantly better than DNSS1 on the same dataset (p-value = 2.2E-16).
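The text does not state which significance test produced the reported p-value; one plausible choice, assuming per-protein Q3 scores are available for both methods on the same test proteins, is a paired test such as the following sketch.

```python
from scipy import stats

def compare_methods(q3_method_a, q3_method_b):
    """Paired t-test on per-protein Q3 scores of two methods evaluated on
    the same test proteins. Returns the two-sided p-value."""
    t_stat, p_value = stats.ttest_rel(q3_method_a, q3_method_b)
    return p_value
```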
In addition to the DNSS2_TEST dataset, we also compared these methods on the 82 protein targets of the 2018 CASP13 experiment, which share less than 25% sequence identity with the training proteins of DNSS2. Both template-based modeling (TBM) and free modeling (FM) targets were used to evaluate the methods, and the results are summarized in Table 5. Consistent with the performance on the DNSS2_TEST dataset shown in Table 4, DNSS2, SPIDER3, and Porter 5 performed best, with DNSS2 achieving slightly better performance than SPIDER3 and Porter 5. Figure 3 plots the distribution of Q3 scores over all CASP13 targets for DNSS2 and the other eight methods. The distribution of DNSS2 consistently shifts toward higher Q3 scores compared with the other methods, even though it largely overlaps with those of SPIDER3 and Porter 5.
Table 6 summarizes the confusion matrix of DNSS2's predictions of the three kinds of secondary structure (helix, sheet, coil) on the CASP13 dataset. DNSS2 yields the highest accuracy for helix prediction (87.91%), followed by coil prediction (80.21%) and sheet prediction (76.45%). The misclassification rates between helix, sheet, and coil were also reported: the error rate of misclassifying helix as sheet is the lowest (0.57%), and that of misclassifying sheet as coil is the highest (22.46%).
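A row-normalized confusion matrix like the one in Table 6 can be computed from paired true/predicted state sequences; a minimal sketch follows.

```python
import numpy as np

STATES = "HEC"  # helix, sheet, coil

def confusion_matrix(true, pred):
    """3x3 confusion matrix in percent: rows are true states, columns are
    predicted states, and each row sums to 100."""
    index = {s: i for i, s in enumerate(STATES)}
    counts = np.zeros((3, 3))
    for t, p in zip(true, pred):
        counts[index[t], index[p]] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```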
Conclusion
In this work, we developed several advanced deep learning architectures and their ensemble to improve secondary structure prediction. We investigated six advanced deep learning architectures and five kinds of input features for secondary structure prediction. Several of the architectures, such as the inception network, fractal network, and convolutional residual memory network, are novel for protein secondary structure prediction and performed better than the deep belief network. The performance of our deep learning method is comparable to or better than that of seven external state-of-the-art methods on the two independent test datasets. Our experiments also demonstrated that the emission/transition probabilities extracted from hidden Markov model profiles are useful for secondary structure prediction.
Acknowledgements
This work has been supported by an NIH grant (R01GM093123) and two NSF grants (DBI1759934, IIS1763246) to JC.
Conflict of Interest
The authors have no conflict of interests to declare.
References
1. Pauling, L.; Corey, R. B.; Branson, H. R., The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proceedings of the National Academy of Sciences 1951, 37 (4), 205-211.
2. Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J., Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Computational Biology 2017, 13 (1), e1005324.
3. Adhikari, B.; Hou, J.; Cheng, J., DNCON2: Improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 2017, 34 (9), 1466-1472.
4. Michel, M.; Hurtado, D. M.; Elofsson, A., PconsC4: fast, accurate, and hassle-free contact predictions. Bioinformatics 2018, bty1036.
5. Hou, J.; Adhikari, B.; Cheng, J., DeepSF: deep convolutional neural network for mapping protein sequences to folds. arXiv preprint arXiv:1706.01010 2017.
6. Jones, D. T.; Tress, M.; Bryson, K.; Hadley, C., Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins: Structure, Function, and Bioinformatics 1999, 37 (S3), 104-111.
7. Myers, J. K.; Oas, T. G., Preorganized secondary structure as an important determinant of fast protein folding. Nature Structural & Molecular Biology 2001, 8 (6), 552-558.
8. Adhikari, B.; Cheng, J., CONFOLD2: improved contact-driven ab initio protein structure modeling. BMC Bioinformatics 2018, 19 (1), 22.
9. Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D., Protein structure prediction using Rosetta. Methods in Enzymology 2004, 383, 66-93.
10. Roy, A.; Kucukural, A.; Zhang, Y., I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols 2010, 5 (4), 725-738.
11. Uziela, K.; Shu, N.; Wallner, B.; Elofsson, A., ProQ3: Improved model quality assessments using Rosetta energy terms. Scientific Reports 2016, 6, 33509.
12. Cao, R.; Cheng, J., Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks. Methods 2016, 93, 84-91.
13. Webb, B.; Sali, A., Protein structure modeling with MODELLER. Protein Structure Prediction 2014, 1-15.
14. Wang, Z.; Eickholt, J.; Cheng, J., MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics 2010, 26 (7), 882-888.
15. Kryshtafovych, A.; Monastyrskyy, B.; Fidelis, K.; Moult, J.; Schwede, T.; Tramontano, A., Evaluation of the template-based modeling in CASP12. Proteins: Structure, Function, and Bioinformatics 2017.
16. Ovchinnikov, S.; Park, H.; Kim, D. E.; DiMaio, F.; Baker, D., Protein structure prediction using Rosetta in CASP12. Proteins: Structure, Function, and Bioinformatics 2017.
17. Yang, Y.; Gao, J.; Wang, J.; Heffernan, R.; Hanson, J.; Paliwal, K.; Zhou, Y., Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics 2016, bbw129.
18. Rost, B., Protein secondary structure prediction continues to rise. Journal of Structural Biology 2001, 134 (2-3), 204-218.
19. Chou, P. Y.; Fasman, G. D., Prediction of protein conformation. Biochemistry 1974, 13 (2), 222-245.
20. Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25 (17), 3389-3402.
21. Jones, D. T., Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 1999, 292 (2), 195-202.
22. Magnan, C. N.; Baldi, P., SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 2014, 30 (18), 2592-2597.
23. Pollastri, G.; McLysaght, A., Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics 2004, 21 (8), 1719-1720.
24. Pollastri, G.; Przybylski, D.; Rost, B.; Baldi, P., Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics 2002, 47 (2), 228-235.
25. Dor, O.; Zhou, Y., Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins: Structure, Function, and Bioinformatics 2007, 66 (4), 838-845.
26. Remmert, M.; Biegert, A.; Hauser, A.; Söding, J., HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods 2012, 9 (2), 173.
27. Meng, Q.; Peng, Z.; Yang, J.; Valencia, A., CoABind: a novel algorithm for Coenzyme A (CoA)- and CoA derivatives-binding residues prediction. Bioinformatics 2018, 1, 7.
28. Atchley, W. R.; Zhao, J.; Fernandes, A. D.; Drüke, T., Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences 2005, 102 (18), 6395-6400.
29. Spencer, M.; Eickholt, J.; Cheng, J., A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015, 12 (1), 103-112.
30. Qian, N.; Sejnowski, T. J., Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 1988, 202 (4), 865-884.
31. Holley, L. H.; Karplus, M., Protein secondary structure prediction with a neural network. Proceedings of the National Academy of Sciences 1989, 86 (1), 152-156.
32. Gibrat, J.-F.; Garnier, J.; Robson, B., Further developments of protein secondary structure prediction using information theory: new parameters and consideration of residue pairs. Journal of Molecular Biology 1987, 198 (3), 425-443.
33. Stolorz, P.; Lapedes, A.; Xia, Y., Predicting protein secondary structure using neural net and statistical methods. Journal of Molecular Biology 1992, 225 (2), 363-377.
34. Schmidler, S. C.; Liu, J. S.; Brutlag, D. L., Bayesian segmentation of protein secondary structure. Journal of Computational Biology 2000, 7 (1-2), 233-248.
35. Faraggi, E.; Zhang, T.; Yang, Y.; Kurgan, L.; Zhou, Y., SPINE X: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. Journal of Computational Chemistry 2012, 33 (3), 259-267.
36. Wang, S.; Peng, J.; Ma, J.; Xu, J., Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports 2016, 6.
37. Heffernan, R.; Yang, Y.; Paliwal, K.; Zhou, Y., Capturing Non-Local Interactions by Long Short Term Memory Bidirectional Recurrent Neural Networks for Improving Prediction of Protein Secondary Structure, Backbone Angles, Contact Numbers, and Solvent Accessibility. Bioinformatics 2017, btx218.
38. Fang, C.; Shang, Y.; Xu, D., MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics 2018, 86 (5), 592-598.
39. Heffernan, R.; Paliwal, K.; Lyons, J.; Dehzangi, A.; Sharma, A.; Wang, J.; Sattar, A.; Yang, Y.; Zhou, Y., Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Scientific Reports 2015, 5, 11476.
40. Torrisi, M.; Kaleel, M.; Pollastri, G., Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes. bioRxiv 2018, 289033.
41. Krizhevsky, A.; Sutskever, I.; Hinton, G. E., ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012; pp 1097-1105.
42. Liang, M.; Hu, X., Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015; pp 3367-3375.
43. He, K.; Zhang, X.; Ren, S.; Sun, J., Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016; pp 770-778.
44. Moniz, J.; Pal, C., Convolutional residual memory networks. arXiv preprint arXiv:1606.05262 2016.
45. Larsson, G.; Maire, M.; Shakhnarovich, G., FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648 2016.
46. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A., Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015; pp 1-9.
47. Wang, G.; Dunbrack Jr, R. L., PISCES: a protein sequence culling server. Bioinformatics 2003, 19 (12), 1589-1591.
48. Li, W.; Godzik, A., Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22 (13), 1658-1659.
49. Zemla, A.; Venclovas, Č.; Fidelis, K.; Rost, B., A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins: Structure, Function, and Bioinformatics 1999, 34 (2), 220-223.
50. Kabsch, W.; Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22 (12), 2577-2637.
51. The UniProt Consortium, UniProt: a hub for protein information. Nucleic Acids Research 2014, 43 (D1), D204-D212.
52. Mirdita, M.; von den Driesch, L.; Galiez, C.; Martin, M. J.; Söding, J.; Steinegger, M., Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research 2016, 45 (D1), D170-D176.
53. Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R., Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014, 15 (1), 1929-1958.
54. Yan, R.; Xu, D.; Yang, J.; Walker, S.; Zhang, Y., A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Scientific Reports 2013, 3, 2619.
55. McGuffin, L. J.; Bryson, K.; Jones, D. T., The PSIPRED protein structure prediction server. Bioinformatics 2000, 16 (4), 404-405.
Figure Legends
Figure 1. Overview of the experimental workflow for improving secondary structure prediction. (A) Six principal steps are conducted to construct and train deep networks. The solid box represents an analysis step. The dashed box represents the output from the previous step. The scroll represents the dataset used in each step. (B) Dataset generation and filtering process.
Figure 2. Six deep learning architectures for secondary structure prediction: (A) CNN, (B) ResNet, (C) InceptionNet, (D) RCNN, (E) CRMN, (F) FractalNet. L: sequence length; K: number of features per position.
Figure 3. Comparison of the distributions of Q3 scores of the eight existing methods and DNSS2 on all CASP13 targets.
Tables