Machine Learning-based Prediction of Enzyme Substrate Scope: Application
to Bacterial Nitrilases
Abstract
Predicting the range of substrates accepted by an enzyme from its amino
acid sequence is challenging. Although sequence- and structure-based
annotation approaches are often accurate for predicting broad categories
of substrate specificity, they generally cannot predict which specific
molecules will be accepted as substrates for a given enzyme,
particularly within a class of closely related molecules. Machine
learning models trained on targeted experimental activity data together
with features derived from structural modeling, ligand docking, and
physicochemical properties of proteins and ligands draw on complementary
information and can yield accurate predictions of substrate scope for
related enzymes.
Here we describe such an approach that can predict the substrate scope
of bacterial nitrilases, which catalyze the hydrolysis of nitrile
compounds to the corresponding carboxylic acids and ammonia. Each of the
four machine learning models (logistic regression, random forest,
gradient-boosted decision trees, and support vector machines) performed
similarly (average ROC AUC = 0.9, average accuracy ≈ 82%)
for predicting substrate scope for this dataset. The approach is
intended to be highly modular with respect to physicochemical property
calculations and software used for docking and modeling.
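
To make the two reported metrics concrete, the following minimal sketch
computes the area under the ROC curve and the accuracy for a set of binary
active/inactive predictions. The labels, scores, function names, and the
0.5 decision threshold are illustrative assumptions and are not taken from
the study; the area under the curve is computed here with the pairwise
rank formulation, which may differ from the implementation used by the
authors.

```python
# Illustrative sketch only: labels and scores are made-up values,
# not data or results from the study.

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical enzyme-substrate pairs: 1 = active, 0 = inactive.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
# Hypothetical predicted probabilities of activity from some classifier.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.1]

auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
print(f"ROC AUC = {auc:.3f}, accuracy = {acc:.3f}")
```

The area under the ROC curve does not depend on the decision threshold,
whereas accuracy does, which is one reason the two values are reported
together.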