2.2 Datasets and evaluation metric
As described in section 2.1, two training datasets were used in our experiments. In the first stage, the original DNSS dataset 29, which included 1,230 training proteins and 195 validation proteins, was used to investigate whether the deep learning architectures and novel features could improve prediction accuracy.
To take advantage of the data that became available after DNSS1 was published, a new, larger training set for DNSS2 was constructed from the CullPDB 47 list curated on 18 October 2018 (Figure 1(B)). The dataset consists of 12,566 proteins that share less than 25% sequence identity, selected with a 2.5 Å resolution cutoff and an R-factor cutoff of 1. The structures of all the proteins were determined by X-ray crystallography. The dataset was then filtered by removing proteins with non-standard amino acids, chain breaks (i.e. a distance between adjacent Cα-Cα atoms larger than 4 Å), or a sequence length shorter than 30 or longer than 700 amino acids. Because all external methods benchmarked in this work were developed before 2018, the proteins released after Jan 1st, 2018 were extracted as an independent test set (DNSS2_TEST). The remaining proteins were further filtered against the DNSS2_TEST set using the CD-HIT suite 48 with a 25% sequence identity cutoff and an e-value threshold of 0.1. Finally, 5,413 proteins released prior to Jan 1st, 2018 were obtained as our training set, of which 4,872 proteins were used for network training (DNSS2_TRAIN) and 547 proteins were used for model selection (DNSS2_VAL). In addition, the proteins of the CASP13 (2018) experiment were collected, and those with at least 25% sequence identity to the training proteins were removed, which resulted in a set of 82 test proteins. These proteins were also classified into template-based (TBM) and free-modeling (FM) targets based on the official CASP definition (CASP 13, 2018, http://www.predictioncenter.org/casp13/index.cgi). In summary, the final test set contains 429 proteins from DNSS2_TEST and 82 proteins from CASP13.
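The per-protein filters described above (standard amino acids only, no chain breaks, length between 30 and 700) can be illustrated with a minimal Python sketch. This is not the pipeline used in this work: the function names, the input representation (a one-letter sequence plus one Cα coordinate per residue, in Å), and the choice to test only consecutive residues are assumptions made for illustration.

```python
import math

# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def has_chain_break(ca_coords, max_dist=4.0):
    """Return True if any two adjacent C-alpha atoms are more than max_dist angstroms apart."""
    return any(
        math.dist(a, b) > max_dist
        for a, b in zip(ca_coords, ca_coords[1:])
    )


def passes_filters(sequence, ca_coords, min_len=30, max_len=700):
    """Hypothetical helper: apply the three filters stated in the text to one protein chain."""
    if not (min_len <= len(sequence) <= max_len):
        return False  # too short or too long
    if any(res not in STANDARD_AA for res in sequence):
        return False  # contains a non-standard amino acid
    return not has_chain_break(ca_coords)
```

For example, a 30-residue chain whose Cα atoms are spaced 3.8 Å apart passes, while the same chain with one 5 Å gap is rejected as containing a chain break.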
We evaluated our secondary structure predictions using two primary metrics: Q3 accuracy and the Segment Overlap measure (SOV). The Q3 score is the percentage of residues in a protein whose secondary structure state is correctly predicted. The SOV score measures the similarity between the predicted segments of continuous structure states and those in the experimental structure 29, 49. The Q3 and SOV scores are complementary for secondary structure evaluation. The structure files of all training and testing proteins were parsed by the DSSP program 50 to obtain the true secondary structure class of each amino acid for training and evaluation.
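As an illustration of the Q3 definition, the following Python sketch computes the per-protein score from a predicted and an observed three-state string. The reduction of the eight DSSP states to three classes (H, G, I to helix; E, B to strand; everything else to coil) is a commonly used convention and is an assumption here, since the mapping is not specified in this section; the function names are likewise hypothetical.

```python
# Assumed 8-state to 3-state reduction; any state not listed is treated as coil (C).
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_3state(dssp_states):
    """Map a string of DSSP states to the three classes H, E and C."""
    return "".join(DSSP_TO_3STATE.get(s, "C") for s in dssp_states)


def q3_score(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


observed = reduce_to_3state("HGIEBTS-")   # "HHHEECCC"
print(q3_score("HHHEECCH", observed))     # 87.5 (7 of 8 residues correct)
```

Q3 treats every residue independently, so it does not penalize fragmented predictions; that is the gap the segment-based SOV score is meant to cover.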