2.1 Experimental design
In this work, the main objective was to improve secondary structure prediction by developing more advanced deep learning architectures and introducing more informative features. In the process, we developed a systematic framework for effectively building deep learning architectures and deriving features to improve secondary structure prediction. Figure 1 provides an overview of our experimental design. Figure 1(A) lists the six major steps of designing, training, and testing the deep learning architectures, and Figure 1(B) illustrates the process of creating the training and validation datasets.
The key analysis was to design appropriate architectures and investigate whether they can improve prediction accuracy. Six deep neural network architectures were evaluated in this study: the convolutional neural network (CNN) [41], the recurrent convolutional neural network (RCNN) [42], ResNet [43], the convolutional recurrent memory network (CRMN) [44], FractalNet [45], and the Inception network [46]. Most of these architectures were applied to secondary structure prediction for the first time. A detailed description of each network is given in Section 2.4.
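As a concrete illustration of the general setup shared by these architectures, the following is a minimal sketch (not the authors' implementation) of a 1D convolutional network that maps per-residue feature vectors to three-state probabilities. The layer sizes, filter widths, feature dimension, and the use of tensorflow.keras are assumptions for illustration only.

```python
# Minimal sketch of a 1D CNN for three-state secondary structure
# prediction; hyperparameters and feature dimension are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 21  # assumed per-residue input features (e.g. profile columns)
NUM_STATES = 3     # helix, sheet, coil

def build_cnn(num_features=NUM_FEATURES, num_states=NUM_STATES):
    # Variable-length sequence of per-residue feature vectors.
    inputs = tf.keras.Input(shape=(None, num_features))
    x = layers.Conv1D(64, kernel_size=11, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=11, padding="same", activation="relu")(x)
    # Per-residue softmax over the three secondary structure states.
    outputs = layers.Conv1D(num_states, kernel_size=1, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```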
To ensure a fair comparison, each network was optimized using the original feature profiles of the training proteins and evaluated on the same validation set as DNSS1, with Q3 accuracy (the percentage of residues whose three-state secondary structure is predicted correctly) as the selection criterion.
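For reference, Q3 is simply the fraction of residues assigned the correct state; a minimal sketch, with illustrative labels and function names:

```python
# Q3 accuracy: fraction of residues whose three-state label
# (e.g. 0 = helix, 1 = sheet, 2 = coil) is predicted correctly.
import numpy as np

def q3_accuracy(pred_states, true_states):
    """pred_states, true_states: integer per-residue label arrays."""
    pred = np.asarray(pred_states)
    true = np.asarray(true_states)
    return np.mean(pred == true)

# Example: 4 of 5 residues correct -> Q3 = 0.8
print(q3_accuracy([0, 0, 1, 2, 2], [0, 1, 1, 2, 2]))
```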
The network that achieved the best Q3 accuracy was then selected to explore the feature space, covering profiles derived from multiple sequence alignments (MSAs) generated by PSI-BLAST [20] and HHblits [26], Atchley factors, and the emission/transition probabilities inferred from the hidden Markov model (HMM) profile. The optimal feature set was determined according to the highest Q3 accuracy on the validation datasets. The networks were then re-trained on the optimal input profiles to obtain the best models.
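The sketch below shows one plausible way to assemble such a per-residue input profile by concatenating the candidate feature blocks named above. The array names and dimensions are illustrative assumptions, not the exact encoding used in this work; in practice each block would be parsed from the corresponding PSI-BLAST, HHblits, or HMM output file.

```python
# Sketch of assembling a per-residue feature matrix for one protein
# by concatenating candidate feature blocks (dimensions are assumed).
import numpy as np

def build_profile(pssm, hhm_profile, atchley, hmm_probs):
    """Each argument is an (L, d_i) array for a protein of L residues:
    pssm        -- PSI-BLAST position-specific scores, e.g. (L, 20)
    hhm_profile -- HHblits-derived profile, e.g. (L, 20)
    atchley     -- Atchley factors of each residue, (L, 5)
    hmm_probs   -- HMM emission/transition probabilities, e.g. (L, 30)
    Returns the concatenated (L, sum(d_i)) feature matrix."""
    blocks = [pssm, hhm_profile, atchley, hmm_probs]
    assert len({b.shape[0] for b in blocks}) == 1, "residue count mismatch"
    return np.concatenate(blocks, axis=1)
```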
Since combining predictors generally improves prediction accuracy, different combinations of the networks were also evaluated. Finally, after the optimal set of deep learning architectures and feature profiles was determined, all networks were re-trained on a larger, manually curated dataset of non-redundant proteins whose structures were publicly released before 2018. The final networks were used to predict the secondary structure of the test proteins. For each residue, the probabilities of the three states (i.e., helix, sheet, and coil) predicted by the six networks were averaged to make the final secondary structure prediction. Our method was then benchmarked against other state-of-the-art methods on two independent test datasets.
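A minimal sketch of this ensemble-averaging step, assuming each network outputs a per-residue softmax over the three states (function and variable names are illustrative):

```python
# Average the three-state probabilities from the six networks and
# take the most probable state per residue.
import numpy as np

STATES = ["H", "E", "C"]  # helix, sheet, coil

def ensemble_predict(per_network_probs):
    """per_network_probs: array of shape (n_networks, L, 3) holding each
    network's softmax output for a protein of L residues."""
    mean_probs = np.mean(per_network_probs, axis=0)  # (L, 3)
    state_idx = np.argmax(mean_probs, axis=1)        # (L,)
    return "".join(STATES[i] for i in state_idx), mean_probs
```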