2.3 Input features
The profile of each amino acid is represented by 21 numbers from the
PSI-BLAST-based position-specific scoring matrix (PSSM), 20 emission
probabilities and 7 transition probabilities extracted from the hidden
Markov model (HMM) profile, 20 probabilities of the standard amino acids
calculated from the multiple sequence alignment (MSA), and 5 numbers
derived from Atchley’s factors. Together, these features (73 numbers in total)
represent the evolutionary conservation and physicochemical properties
of the residues in a protein sequence.
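As a minimal sketch of how the five feature groups add up to 73 numbers per residue, the snippet below concatenates them in the order listed above. The group names and the function `build_residue_vector` are hypothetical; only the group sizes (21, 20, 7, 20, 5) come from the text, and the ordering is an assumption.

```python
# Hypothetical group names; only the sizes are taken from the text.
FEATURE_GROUPS = [
    ("pssm", 21),             # 20 PSSM values + 1 information value
    ("hmm_emission", 20),
    ("hmm_transition", 7),
    ("msa_probability", 20),
    ("atchley", 5),
]


def build_residue_vector(groups):
    """Concatenate the per-residue feature groups into one flat list."""
    vector = []
    for name, size in FEATURE_GROUPS:
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(float(v) for v in values)
    return vector


# Placeholder values, used only to check the dimensionality.
example = {name: [0.0] * size for name, size in FEATURE_GROUPS}
vector = build_residue_vector(example)
print(len(vector))  # 73
```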
PSI-BLAST was run to generate the multiple sequence alignment and the PSSM
profile by searching each sequence against the UniProt sequence
database filtered at 90% sequence identity (UniRef90) 51 with
three iterations and an e-value cutoff of 0.001 (‘-evalue .001
-inclusion_ethresh .002’). A less
stringent threshold (‘-evalue 10 -inclusion_ethresh 10’) was used
when no homologous sequences were returned for a protein. In a PSSM
profile, each position is represented by 20 numbers related to the
probabilities of the 20 standard amino acids appearing at that position in
the multiple sequence alignment. In addition, the information value given
in the second-to-last column of the PSI-BLAST profile is included for each
residue, giving 21 numbers per residue.
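The search settings and the 21 PSSM numbers can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the file names are hypothetical, `-query`, `-db`, `-num_iterations` and `-out_ascii_pssm` are standard `psiblast` options that the text does not quote, and the parser assumes the usual ASCII PSSM row layout (position, residue, 20 scores, 20 percentages, information, relative weight). Taking the 20 numbers from the score columns is also an assumption, since the text does not say which block was used.

```python
def psiblast_command(query, db, pssm_out, relaxed=False):
    """Build a psiblast argument list with the thresholds quoted in the text."""
    # Relaxed thresholds are the fallback for proteins with no hits.
    evalue, inclusion = ("10", "10") if relaxed else ("0.001", "0.002")
    return [
        "psiblast", "-query", query, "-db", db,
        "-num_iterations", "3",
        "-evalue", evalue,
        "-inclusion_ethresh", inclusion,
        "-out_ascii_pssm", pssm_out,
    ]


def parse_pssm_row(line):
    """Return (residue, 21 features) from one residue row of an ASCII PSSM."""
    tokens = line.split()
    if len(tokens) != 44:  # 2 + 20 scores + 20 percentages + 2 trailing columns
        raise ValueError(f"expected 44 columns, got {len(tokens)}")
    residue = tokens[1]
    scores = [float(t) for t in tokens[2:22]]   # assumed: the score block
    information = float(tokens[-2])             # second-to-last column
    return residue, scores + [information]


# Synthetic row, for illustration only.
row = "1 M " + " ".join(["-1"] * 20) + " " + " ".join(["5"] * 20) + " 0.55 0.10"
residue, features = parse_pssm_row(row)
print(residue, len(features))
```

The command list can be passed to `subprocess.run` on a machine where BLAST+ and the database are installed; the sketch itself only builds the arguments.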
The HMM profile was generated by running three iterations of ‘HHblits’
against the uniclust30 database (version: October 2017)52. Two types of probabilities are associated with
each residue in an HMM profile: emission probabilities and transition
probabilities. An emission probability represents the probability of a given
amino acid occurring at the position in the multiple sequence alignment.
A transition probability represents the probability of moving from one
alignment state (i.e. match, insertion, or deletion) to another.
Similar to the PSSM, the emission frequencies of the 20 standard amino acids
for each residue are reported in the HMM profile, and the probabilities
were calculated according to the formula:
\(p_{ik} = 2^{-\mathrm{Freq}_{ik}/1000}\) (1)
where i denotes the i-th residue in the sequence, k denotes
the k-th standard amino acid, and \(\mathrm{Freq}_{ik}\) is the corresponding
value reported in the HMM profile. The probability is set to 0 if
the frequency is denoted as ‘*’. The transition probabilities for each
residue were derived in the same fashion. In total, 20 emission
probabilities and 7 transition probabilities for each residue were
collected to represent the residue conservation inferred from the HMM.
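Equation (1) and the ‘*’ rule can be illustrated with a short sketch. The function names are hypothetical, and the input is assumed to be the whitespace-separated integer fields of one emission or transition line of the HMM profile.

```python
def hhm_to_probability(field):
    """Apply Eq. (1): p = 2^(-Freq/1000), with '*' mapped to probability 0."""
    if field == "*":
        return 0.0
    return 2.0 ** (-int(field) / 1000.0)


def convert_hhm_fields(fields):
    """Convert a list of raw profile fields to probabilities."""
    return [hhm_to_probability(f) for f in fields]


# Worked values: 0 -> 1.0, 1000 -> 0.5, 2000 -> 0.25, '*' -> 0.0
print(convert_hhm_fields(["0", "1000", "2000", "*"]))  # [1.0, 0.5, 0.25, 0.0]
```

Applied to the 20 emission fields and the 7 transition fields of a residue, this gives the 27 HMM-derived numbers described above.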
Since HHblits is more sensitive than PSI-BLAST in identifying distant homologous
sequences, the probability matrix of amino acids was also
calculated from the multiple sequence alignment (MSA) generated by
HHblits. The conversion from the MSA to a probability matrix follows the
same calculation as SSpro 22.
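The general idea of turning an MSA into a per-position probability matrix can be sketched with plain column frequencies. This is only an assumption for illustration: the text defers to the SSpro calculation, whose exact weighting and gap handling are not given here, so the unweighted counts below should not be read as that method. The amino acid ordering and the function name are likewise hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order


def msa_to_probabilities(msa):
    """Return one 20-value frequency row per alignment column.

    Gaps and non-standard symbols are ignored (an assumption); a column
    with no standard residues yields all zeros.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for col in range(length):
        counts = dict.fromkeys(AMINO_ACIDS, 0)
        for seq in msa:
            aa = seq[col].upper()
            if aa in counts:
                counts[aa] += 1
        total = sum(counts.values())
        matrix.append([counts[a] / total if total else 0.0 for a in AMINO_ACIDS])
    return matrix


# Toy alignment, for illustration only.
matrix = msa_to_probabilities(["ACD", "AC-", "GCD"])
print(len(matrix), len(matrix[0]))  # 3 20
```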