2.4 Deep learning architectures
A widely used deep learning architecture in bioinformatics is the deep convolutional neural network (CNN). Convolutional neural networks have several distinctive advantages over traditional neural networks for bioinformatics problems: (1) they can learn informative representations directly from sequence features without requiring segmentation (e.g. sliding window) or dimension reduction (e.g. principal component analysis) techniques; (2) they can learn both local and global features to discover complex patterns; and (3) the architecture is independent of the input size (i.e. length or volume). In this work, we design a standard CNN and five advanced deep learning architectures based on convolutional and other useful operations, as shown in Figure 2.
Figure 2(A) illustrates our standard convolutional neural network (CNN) for secondary structure prediction, consisting of a sequence of convolutional blocks, each of which contains a convolutional layer, a batch-normalization layer, and an activation layer. The original input is an L × K matrix (X), where L is the sequence length and K is the number of features per residue position in the sequence. In each convolutional block, the feature maps are obtained by applying the convolution operation, which multiplies the weight matrices (called filters, W) with a window of local features from the previous layer and adds bias vectors (b) according to the formula \(X^{l+1}=W^{l+1}*X^{l}+b^{l+1}\), where l is the layer number. A batch-normalization layer is added to normalize the convolved features coming out of each convolutional layer. Then an activation function such as the rectified linear unit (ReLU) is applied to extract non-linear patterns from the normalized hidden features. To avoid overfitting, regularization approaches such as dropout 53 can be applied in the hidden layers. The final output node (also a filter) uses the softmax function to classify each residue position, based on the input from its previous layer, into one of three secondary structure states. The output is an L × 3 matrix holding the predicted probabilities of the three secondary structure states for each of the L positions in a sequence. The final optimal CNN architecture includes 6 convolutional blocks, in which the filter size (window size) of each convolutional layer is 6 and the number of filters (feature maps) in each convolutional layer is 40.
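To make the block structure concrete, the following is a minimal sketch of such a network written with the Keras API, assuming the hyper-parameters stated above (6 convolutional blocks, 40 filters of window size 6, K input features per residue, and a 3-state softmax output); it illustrates the architecture rather than the exact DNSS2 implementation, and the dropout rate and optimizer are placeholder choices.

```python
# Minimal sketch (not the exact DNSS2 code) of the standard CNN: stacked blocks
# of Conv1D -> BatchNormalization -> ReLU, with a per-residue 3-state softmax.
from tensorflow.keras import layers, models

def build_cnn(K, num_blocks=6, filters=40, window=6, dropout_rate=0.2):
    # Variable-length input of shape (L, K), where L is the sequence length.
    inputs = layers.Input(shape=(None, K))
    x = inputs
    for _ in range(num_blocks):
        # X^{l+1} = W^{l+1} * X^l + b^{l+1}: convolution, then batch norm and ReLU.
        x = layers.Conv1D(filters, window, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Dropout(dropout_rate)(x)  # regularization against overfitting
    # Per-position softmax over the three secondary structure states,
    # producing an L x 3 output matrix.
    outputs = layers.Conv1D(3, 1, activation='softmax')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
```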
The residual network (ResNet) was designed to make traditional convolutional neural networks deeper without suffering from vanishing gradients. The architecture constructs many residual blocks and stacks them up to form a deeper network, as shown in Figure 2(B). In each residual block, the input \(X^{l}\) is fed into a few convolutional layers to obtain a non-linear transformation \(G(X^{l})\). To make the network deeper, an extra skip connection (i.e. shortcut) is added to copy the input \(X^{l}\) to the output of the non-linear transformation layers, so that the block output \(X^{l+1}\) can be represented as \(X^{l+1}=X^{l}+G(X^{l})\) before another ReLU non-linearity is applied. These shortcuts facilitate gradient back-propagation during training, making it possible to train deeper networks and achieve better performance. Residual blocks with different configurations can be stacked to achieve higher accuracy. For instance, the final best architecture in DNSS2 is made up of 13 residual blocks, each of which includes 3 convolutional layers with filter sizes of 1, 3, and 1, respectively. The first three residual blocks use 37 filters per convolutional layer, the middle four blocks use 74, and the last six blocks use 148. In total, 39 convolutional layers are included in the final residual network. Dropout and batch normalization were also added to prevent the network from overfitting.
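A single residual block of the kind described above can be sketched as follows, assuming a Keras implementation; the 1-3-1 bottleneck layout and filter counts follow the text, while the exact ordering of batch normalization, dropout, and activations, as well as the 1 × 1 shortcut projection used when channel counts differ, are assumptions for illustration.

```python
# Sketch of one residual block with a 1-3-1 bottleneck and an identity shortcut:
# X^{l+1} = ReLU(X^l + G(X^l)), where G is the stack of convolutional layers.
from tensorflow.keras import layers

def residual_block(x, filters, dropout_rate=0.2):
    shortcut = x
    y = x
    for window in (1, 3, 1):  # three convolutional layers with filter sizes 1, 3, 1
        y = layers.Conv1D(filters, window, padding='same')(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation('relu')(y)
    y = layers.Dropout(dropout_rate)(y)
    # Project the shortcut with a 1x1 convolution if the channel counts differ.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding='same')(shortcut)
    # Skip connection: add the input to the transformed features, then apply ReLU.
    return layers.Activation('relu')(layers.Add()([shortcut, y]))
```

Stacking 13 such blocks, with 37 filters for the first three, 74 for the middle four, and 148 for the last six, would reproduce the layer counts described above.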
The inception network is an advanced architecture that builds deeper networks by repeating a set of inception modules, as shown in Figure 2(C). Instead of trying to determine the best values of certain hyper-parameters (e.g. filter size, number of layers, inclusion of a pooling layer), the inception network concatenates the outputs of hidden layers with different configurations within an inception module and trains the network to learn patterns from the combination of diverse hyper-parameters. Despite its high computational cost, the inception network has performed remarkably well in many applications 38, 46. For secondary structure prediction, a combination of three filter sizes, \(1\times K\), \(3\times K\), and \(5\times K\), was applied to convolve the input features, where K is the number of original input features for each residue position. The concatenation of the convolution outputs is fed into an activation layer for non-linear activation. This kind of inception module is repeated to make a deeper network. After parameter tuning, the optimal inception network comprises three inception blocks containing 24 convolutional layers in total.
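As an illustration, an inception module with the three parallel filter sizes described above could be sketched as follows, assuming a Keras implementation; the number of filters per branch and the placement of batch normalization are assumptions rather than the tuned DNSS2 settings.

```python
# Sketch of an inception module: parallel 1D convolutions with window sizes
# 1, 3, and 5 applied to the same input, concatenated along the feature axis,
# then passed through a shared non-linear activation.
from tensorflow.keras import layers

def inception_module(x, filters=24):
    branches = []
    for window in (1, 3, 5):
        branch = layers.Conv1D(filters, window, padding='same')(x)
        branch = layers.BatchNormalization()(branch)
        branches.append(branch)
    # Concatenate the parallel branch outputs, then apply the ReLU activation.
    merged = layers.Concatenate(axis=-1)(branches)
    return layers.Activation('relu')(merged)
```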
In addition, we designed three more deep learning architectures for secondary structure prediction: the recurrent convolutional neural network (RCNN) 42, the convolutional residual memory network (CRMN) 44, and the fractal network. The recurrent convolutional neural network (RCNN) was designed to model the sequential dependencies hidden inside the sequential features (Figure 2(D)). It first extracts higher-level feature maps with a convolutional block and then uses a recurrent neural network (i.e. a bi-directional Long Short-Term Memory (LSTM) network) to model the inter-dependence among the convolved features. Such a recurrent convolutional block, containing 4 convolutional layers, is repeated 5 times to build the deep recurrent convolutional neural network used for secondary structure prediction in this work. The CRMN network augments the architecture by integrating convolutional residual networks with LSTMs (Figure 2(E)) (e.g., 2 residual blocks and 2 LSTM layers in the network). Both methods advance the convolutional neural network by introducing the memory mechanisms of recurrent neural networks (RNN). Moreover, inspired by ResNet and the inception network, we built a fractal network that stacks different numbers of convolutional blocks in both parallel and hierarchical fashion, adding several shortcut paths to connect lower-level layers and higher-level layers, as shown in Figure 2(F). After tuning, the fractal network was assembled with 16 convolutional layers per fractal block.
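To illustrate the recurrent convolutional design, the sketch below pairs a convolutional block with a bi-directional LSTM over the convolved features, assuming a Keras implementation; the filter counts, window size, and LSTM units are placeholders rather than the tuned DNSS2 values.

```python
# Sketch of one recurrent convolutional block: convolutional layers extract
# local feature maps, and a bi-directional LSTM models dependencies among
# the convolved features along the sequence.
from tensorflow.keras import layers

def recurrent_conv_block(x, filters=40, window=3, num_convs=4, lstm_units=50):
    for _ in range(num_convs):  # convolutional block with 4 convolutional layers
        x = layers.Conv1D(filters, window, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    # Bi-directional LSTM over the residue dimension; return_sequences keeps
    # one output vector per residue position for the next block.
    return layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
```

Repeating this block 5 times and ending with the per-residue softmax layer gives a network of the kind described for the RCNN.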