2.3 Input features
The profile of each amino acid is represented by 21 numbers from the PSI-BLAST-based position-specific scoring matrix (PSSM), 20 emission probabilities and 7 transition probabilities extracted from the hidden Markov model (HMM) profile, 20 probabilities of the standard amino acids calculated from the multiple sequence alignment (MSA), and 5 numbers derived from Atchley’s factors. Together, these 73 numbers (21 + 20 + 7 + 20 + 5) represent the evolutionary conservation and physicochemical properties of each residue in a protein sequence.
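As a minimal sketch of how the five feature groups could be combined into one 73-number vector per residue, the Python snippet below concatenates them and checks each group's length. The function name, the dictionary keys, and the concatenation order are illustrative assumptions; the text does not specify how the features are ordered.

```python
# Group sizes taken from the text; the key names and their order are assumptions.
EXPECTED_SIZES = {
    "pssm": 21,            # PSI-BLAST PSSM-derived numbers
    "hmm_emission": 20,    # HMM emission probabilities
    "hmm_transition": 7,   # HMM transition probabilities
    "msa": 20,             # amino acid probabilities from the MSA
    "atchley": 5,          # numbers derived from Atchley's factors
}


def assemble_residue_features(groups):
    """Concatenate the per-residue feature groups into a single flat list."""
    features = []
    for name, size in EXPECTED_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(
                f"group '{name}' has {len(values)} values, expected {size}"
            )
        features.extend(float(v) for v in values)
    return features


# Usage with placeholder values (all zeros), only to show the resulting length.
example = {name: [0.0] * size for name, size in EXPECTED_SIZES.items()}
print(len(assemble_residue_features(example)))  # 73
```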
PSI-BLAST was run to generate the multiple sequence alignment and the PSSM profile by searching each sequence against the UniProt sequence database filtered at 90% sequence identity (UniRef90) [51], using three iterations and an e-value cutoff of 0.001 (‘-evalue .001 -inclusion_ethresh .002’). A less stringent threshold (‘-evalue 10 -inclusion_ethresh 10’) was used when no homologous sequences were returned for a protein. In a PSSM profile, each position is represented by 20 numbers related to the probabilities of the 20 standard amino acids appearing at that position in the multiple sequence alignment. In addition, the sequence information reported in the second-to-last column of the PSI-BLAST profile is included for each residue, giving 21 numbers per residue.
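The search settings above might be assembled into a command line as in the following Python sketch. The `-evalue` and `-inclusion_ethresh` values come from the text; the remaining flags (`-query`, `-db`, `-num_iterations`, `-out_ascii_pssm`) are standard BLAST+ `psiblast` options chosen here for illustration. The file names, the database name, and the use of three iterations for the relaxed run are assumptions.

```python
def build_psiblast_cmd(query_fasta, db_path, pssm_out, relaxed=False):
    """Build a psiblast argument list with strict or relaxed thresholds."""
    # Strict thresholds by default; relaxed ones for proteins with no hits.
    evalue, inclusion = ("10", "10") if relaxed else ("0.001", "0.002")
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db_path,
        "-num_iterations", "3",        # assumed to apply to both runs
        "-evalue", evalue,
        "-inclusion_ethresh", inclusion,
        "-out_ascii_pssm", pssm_out,   # writes the PSSM in text form
    ]


# Hypothetical paths; pass the list to subprocess.run(cmd, check=True) to execute.
cmd = build_psiblast_cmd("query.fasta", "uniref90", "query.pssm")
print(" ".join(cmd))
```

A caller could first run the strict command and rebuild it with `relaxed=True` only if the first search returns no homologous sequences.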
The HMM profile was generated by running three iterations of ‘HHblits’ against the uniclust30 database (version: October 2017) [52]. Two types of probabilities are associated with each residue in an HMM profile: emission probabilities and transition probabilities. An emission probability is the probability of a given amino acid occurring at that position in the multiple sequence alignment. A transition probability is the probability of moving from one alignment state (i.e. match, insertion, or deletion) to another. Similar to the PSSM, the HMM profile reports emission frequencies of the 20 standard amino acids for each residue, and these were converted to probabilities according to the formula:
\(p_{ik} = 2^{-\text{Freq}_{ik}/1000}\) (1)
where i indexes the i-th residue in the sequence and k indexes the k-th standard amino acid. The probability is set to 0 if the frequency is denoted as ‘*’. The transition probabilities for each residue were derived in the same fashion. In total, 20 emission probabilities and 7 transition probabilities were collected for each residue to represent the residue conservation inferred from the HMM.
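The conversion in formula (1), including the ‘*’ rule, can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be the whitespace-separated frequency fields already extracted from one line of the HMM profile.

```python
def hhm_field_to_prob(field):
    """Convert one frequency field from an HMM profile into a probability.

    Implements p = 2 ** (-Freq / 1000); a '*' field maps to probability 0.
    """
    if field == "*":
        return 0.0
    return 2.0 ** (-int(field) / 1000.0)


def hhm_fields_to_probs(fields):
    """Convert a list of frequency fields (emission or transition values)."""
    return [hhm_field_to_prob(f) for f in fields]


# A frequency of 0 gives probability 1, and 1000 gives probability 0.5.
print(hhm_fields_to_probs(["0", "1000", "*"]))  # [1.0, 0.5, 0.0]
```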
Since HHblits is more sensitive than PSI-BLAST in identifying distantly homologous sequences, a probability matrix of amino acids was also calculated from the multiple sequence alignment (MSA) generated by HHblits. The conversion from the MSA to a probability matrix follows the same calculation as SSpro [22].
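To illustrate the general idea of turning an MSA into a per-position probability matrix, the Python sketch below uses plain unweighted column frequencies over the 20 standard amino acids. This is a simplification for illustration only and may differ from the SSpro calculation, for example in sequence weighting or gap handling. The function name, the amino acid ordering, and the assumption that all aligned sequences have equal length are also illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the matrix


def msa_to_probabilities(msa):
    """Return one row of 20 amino acid frequencies per alignment column.

    Gaps and non-standard characters are ignored; a column with no
    standard amino acids yields a row of zeros.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    matrix = []
    for col in range(length):
        counts = {aa: 0 for aa in AMINO_ACIDS}
        total = 0
        for seq in msa:
            aa = seq[col].upper()
            if aa in counts:
                counts[aa] += 1
                total += 1
        matrix.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return matrix


# Toy alignment: column 0 is all 'A'; column 1 has one 'C', one 'D', one gap.
probs = msa_to_probabilities(["AC", "AD", "A-"])
print(probs[0][AMINO_ACIDS.index("A")])  # 1.0
print(probs[1][AMINO_ACIDS.index("C")])  # 0.5
```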