2.2 Datasets and evaluation metric
As described in Section 2.1, two
training datasets were used in our experiments. In the first stage, the
original DNSS dataset 29, which included 1,230 training
proteins and 195 validation proteins, was used to investigate whether
the deep learning architectures and novel features could improve
prediction accuracy.
To take advantage of the data that became available after DNSS1 was published, a new, larger
training set for DNSS2 was constructed from CullPDB 47 curated on 18 October 2018 (Figure 1(B)). The dataset consists
of 12,566 proteins that share less than 25% sequence identity, with a resolution cutoff of 2.5 \(\mathring{\mathrm{A}}\) and an R-factor cutoff of 1. The
structures of all the proteins were determined by X-ray crystallography.
The dataset was then filtered by removing proteins with non-standard
amino acids, chain breaks (i.e., a distance between adjacent Cα atoms
larger than 4 \(\mathring{\mathrm{A}}\)), or a sequence length shorter
than 30 or longer than 700 amino acids. Because all external methods
benchmarked in this work were developed before 2018, the proteins
that were released after Jan 1st, 2018 were set aside
as an independent test set (DNSS2_TEST). The remaining proteins were
further filtered against the DNSS2_TEST set using the CD-HIT suite 48 with a 25% sequence identity cutoff and
an e-value threshold of 0.1. Finally, 5,413 proteins released prior to Jan
1st, 2018 were obtained as our training set, of which
4,872 proteins were used for network training (DNSS2_TRAIN) and 547
proteins were used for model selection (DNSS2_VAL). In addition, the
proteins of the CASP13 (2018) experiment were collected, and those
with at least 25% sequence identity to the training proteins were
removed, which resulted in a set of 82 test proteins. These proteins were
also classified into template-based modeling (TBM) and free modeling (FM) targets
based on the official CASP definition (CASP 13, 2018,
http://www.predictioncenter.org/casp13/index.cgi). In summary, the
final test set contains 429 proteins from DNSS2_TEST and 82 proteins
from CASP13.
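
The per-protein filters and the release-date split described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names, the record layout (a dictionary with a `released` date), and the handling of proteins released exactly on Jan 1st, 2018 are assumptions made for illustration, and the CD-HIT redundancy step is not reproduced here.

```python
import math
from datetime import date

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MIN_LEN, MAX_LEN = 30, 700                 # sequence length limits from the text
MAX_CA_DIST = 4.0                          # chain-break threshold in angstroms
CUTOFF = date(2018, 1, 1)                  # release-date boundary from the text


def has_chain_break(ca_coords, max_dist=MAX_CA_DIST):
    """Return True if any two adjacent C-alpha atoms are farther apart than max_dist."""
    return any(
        math.dist(a, b) > max_dist
        for a, b in zip(ca_coords, ca_coords[1:])
    )


def passes_filters(sequence, ca_coords):
    """Apply the length, standard-residue, and chain-break filters to one chain."""
    if not MIN_LEN <= len(sequence) <= MAX_LEN:
        return False
    if not set(sequence) <= STANDARD_AA:
        return False
    return not has_chain_break(ca_coords)


def split_by_release(records, cutoff=CUTOFF):
    """Split records into (released on or before cutoff, released after cutoff).

    Assumption: each record is a dict with a 'released' datetime.date, and a
    protein released exactly on the cutoff date goes to the earlier group.
    """
    earlier, later = [], []
    for rec in records:
        (later if rec["released"] > cutoff else earlier).append(rec)
    return earlier, later
```

In this sketch a chain is kept only if all three filters pass; the later group returned by `split_by_release` plays the role of the independent test set, and the earlier group would still need the redundancy filtering against the test set before being used for training.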
We evaluated our secondary structure predictions using two primary
metrics: Q3 accuracy and the Segment Overlap measure (SOV). The Q3 score
is the percentage of residues in a protein whose secondary structure
state is correctly predicted. The SOV score measures the similarity between the predicted
segments of continuous structure states and those in the experimental
structure 29, 49. The Q3 and SOV scores are
complementary: Q3 assesses per-residue accuracy, whereas SOV assesses
agreement at the segment level. The structure files of all
training and test proteins were parsed by the DSSP
program 50 to obtain the true secondary structure
classification of each amino acid for training and evaluation.
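
As an illustration of the Q3 definition above, the following Python sketch computes the score for one protein from a predicted and an observed three-state string. The 8-to-3 state reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is a commonly used convention and is an assumption here, since the mapping applied to the DSSP output is not stated in this section. The function names are hypothetical, and SOV is not implemented.

```python
# Assumed 8-state to 3-state reduction (a common convention, not confirmed
# by the text): helix-like states -> H, strand-like states -> E, rest -> C.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "-": "C", " ": "C",
}


def reduce_to_three_states(dssp_string):
    """Map a DSSP state string to H/E/C; unknown symbols are treated as coil."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp_string)


def q3_score(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `q3_score("HHHH", "HHCC")` gives 50.0, because two of the four residues match.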