2.1 Experimental design
In this work, the main objective was to improve secondary structure prediction by developing more advanced deep learning architectures and introducing more informative features. In the process, we developed a systematic framework to effectively build deep learning architectures and select features for secondary structure prediction. Figure 1 provides an overview of our experimental design. Figure 1(A) lists the six major steps of designing, training, and testing the deep learning architectures. Figure 1(B) illustrates the process of creating the training and validation datasets. The key analysis is to design appropriate architectures and investigate whether they can improve prediction accuracy. Six deep neural network architectures were evaluated in this study: the convolutional neural network (CNN) [41], recurrent convolutional neural network (RCNN) [42], ResNet [43], convolutional recurrent memory network (CRMN) [44], FractalNet [45], and Inception network [46]. Most of these architectures were applied to secondary structure prediction for the first time. A detailed description of each network is given in Section 2.4. To ensure a fair comparison, each network was optimized using the original feature profiles of the training proteins and evaluated on the same validation set as DNSS1. The network that achieved the best Q3 accuracy was then used to explore the feature space, covering profiles derived from multiple sequence alignments (MSAs) generated by PSI-BLAST [20] and HHblits [26], Atchley factors, and emission/transition probabilities inferred from hidden Markov model (HMM) profiles. The optimal feature set was determined by the highest Q3 accuracy on the validation datasets. The networks were then re-trained on the optimal input profiles to obtain the best models.
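To make the architecture-evaluation step concrete, the following is a minimal sketch in Python/Keras of a 1D CNN that maps per-residue feature profiles to the three secondary structure states. It is illustrative only, not the architecture used in this work: the feature width (here assumed to be 45, combining PSSM, HHblits, and Atchley-factor features), the number of layers, and the kernel sizes are all placeholder choices.

```python
# Minimal sketch (NOT the authors' exact network): a 1D CNN producing
# per-residue three-state (H/E/C) probabilities from feature profiles.
import numpy as np
from tensorflow.keras import layers, models

N_FEATURES = 45  # assumed per-residue feature width (PSSM + HHM + Atchley)
N_STATES = 3     # helix (H), sheet (E), coil (C)

def build_cnn():
    # Variable-length input: (sequence_length, N_FEATURES)
    inp = layers.Input(shape=(None, N_FEATURES))
    x = layers.Conv1D(64, kernel_size=11, padding="same", activation="relu")(inp)
    x = layers.Conv1D(64, kernel_size=11, padding="same", activation="relu")(x)
    # 1x1 convolution yields a per-residue softmax over the three states
    out = layers.Conv1D(N_STATES, kernel_size=1, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])  # per-residue accuracy, akin to Q3
    return model

# Toy usage: one protein of length 120 with random features and labels
model = build_cnn()
X = np.random.rand(1, 120, N_FEATURES).astype("float32")
y = np.eye(N_STATES)[np.random.randint(0, N_STATES, size=(1, 120))]
model.fit(X, y, epochs=1, verbose=0)
```

The same training and evaluation loop can be reused for each of the six candidate architectures, swapping only the network-construction function, which is what makes the per-architecture Q3 comparison fair.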
Since combining predictors generally improves prediction accuracy, different combinations of the networks were also evaluated. Finally, after the optimal set of deep learning architectures and feature profiles was determined, all networks were re-trained on a larger, manually curated dataset of non-redundant proteins whose structures were publicly released before 2018. The final networks were used to predict the secondary structure of the test proteins. For each residue, the probabilities of the three states (i.e., helix, sheet, and coil) predicted by the six networks were averaged to make the final secondary structure prediction, as sketched below. Our method was then benchmarked against other state-of-the-art methods on two independent test datasets.
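The ensemble step amounts to a simple mean over the six networks' per-residue probability distributions followed by an argmax. A minimal sketch, assuming each network's output is an array of shape (sequence_length, 3) (variable names here are illustrative, not from the paper's code):

```python
# Minimal sketch of the ensemble step: average the per-residue three-state
# probabilities from the six networks, then take the argmax as the label.
import numpy as np

STATES = np.array(["H", "E", "C"])  # helix, sheet, coil

def ensemble_predict(predictions):
    """predictions: list of arrays, each (sequence_length, 3), one per
    trained network; rows are per-residue state probabilities."""
    avg = np.mean(np.stack(predictions, axis=0), axis=0)  # (L, 3)
    return STATES[np.argmax(avg, axis=1)]                 # final SS labels

# Toy usage: six networks' probabilities for a 5-residue protein
rng = np.random.default_rng(0)
preds = [rng.dirichlet(np.ones(3), size=5) for _ in range(6)]
print("".join(ensemble_predict(preds)))
```

Averaging probabilities rather than majority-voting over hard labels preserves each network's confidence, so a residue on which one network is very certain can outweigh several weakly confident disagreements.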