Hist-i-fy: Multiple histidine function prediction based on protein
sequences using deep neural network
Abstract
Histidine (His) is the most reactive amino acid at enzyme active sites.
Multiple post-translational modifications (functions) are reported for
His side chains. High-throughput sequencing techniques produce a
large number of protein sequences without functional annotations at the
amino acid level. Experimental characterization of His functions in
proteins is laborious and time-consuming. Computational characterization
based on protein sequences can help meet this need. Only a handful of
histidine function prediction tools are available, and each annotates
only a single function. Here we curated a dataset of active histidine
residues with known functions from protein sequences in the UniProt
database (sample size n=1584) and used it to train four machine
learning methods. The convolutional neural network (CNN) model
("Hist-i-fy") performed best, with 75% overall accuracy. The
external validation of Hist-i-fy on phosphorylated histidine data
(sample size n=34) showed 94.1% prediction accuracy. For the first time,
we report the prediction of multiple His functions from protein
sequences using deep neural networks. The inputs to the model are i) a
protein sequence containing His, and ii) the His residue number. The model
predicts one out of the eight histidine functions, namely, acetylation,
ribosylation, glycosylation, hydroxylation, methylation, oxidation,
phosphorylation, and protein splicing. The novelty of this work is that
it predicts the largest number of histidine functions at one time while
retaining good performance. The model could be improved further as
larger datasets become available. The model is available as a web
application
([https://histify.streamlit.app/](https://histify.streamlit.app/))
and as stand-alone code
([https://github.com/dibyansu24-maker/Histify](https://github.com/dibyansu24-maker/Histify)).
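
The two model inputs described above (a protein sequence and a His residue number) could be turned into a fixed-size array for a CNN roughly as in the sketch below. This is an illustration only, not the Hist-i-fy implementation: the function name `encode_his_window`, the window half-width `flank`, the one-hot encoding, the zero padding at the sequence ends, and the assumption that residue numbers are 1-based are all choices made here and are not stated in the abstract.

```python
import numpy as np

# The 20 standard amino acids, in an arbitrary but fixed column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# The eight output classes named in the abstract.
HIS_FUNCTIONS = [
    "acetylation", "ribosylation", "glycosylation", "hydroxylation",
    "methylation", "oxidation", "phosphorylation", "protein splicing",
]


def encode_his_window(sequence, his_position, flank=10):
    """One-hot encode a window of residues centred on a His.

    `his_position` is assumed to be 1-based. `flank` (residues kept on
    each side) is a hypothetical parameter, not a value from the paper.
    Positions outside the sequence, and non-standard residue letters,
    are left as all-zero rows.
    """
    idx = his_position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("residue number is outside the sequence")
    if sequence[idx] != "H":
        raise ValueError("residue at the given position is not His")

    window = np.zeros((2 * flank + 1, len(AMINO_ACIDS)), dtype=np.float32)
    for row, pos in enumerate(range(idx - flank, idx + flank + 1)):
        if 0 <= pos < len(sequence):
            col = AA_INDEX.get(sequence[pos])
            if col is not None:
                window[row, col] = 1.0
    return window
```

An array of this shape could be passed to a one-dimensional convolutional classifier whose output layer has one unit per entry in `HIS_FUNCTIONS`. The actual preprocessing and architecture used by Hist-i-fy should be taken from the linked repository.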