Machine Learning Models for Accurate Prioritization of Variants of
Uncertain Significance
Abstract
The growing use of next-generation sequencing technologies in genetic
diagnosis has produced an exponential increase in the number of Variants
of Uncertain Significance (VUS). In this manuscript, we compare three
machine learning methods for classifying VUS as pathogenic or
non-pathogenic: a Random Forest (RF), a Support Vector Machine (SVM), and a
Multilayer Perceptron (MLP). To train the models, we extracted 82,463
high-quality variants from ClinVar and used 9 conservation scores, the
loss of function tool and allele frequencies as features. For the RF and
SVM models, hyperparameters were tuned using a grid search with
cross-validation.
The three models were tested on a set of 5,537 variants that had been
classified as VUS at some point during the previous three years but had
been reclassified in August 2020. The three models yielded superior accuracy
on this set compared to the benchmarked tools. The RF-based model
yielded the best performance across different variant types and was used
to create VusPrize, an open-source software tool for the prioritization of
variants of uncertain significance. We believe that our model can
improve the process of genetic diagnosis in research and clinical
settings.
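
As an illustration of the hyperparameter tuning step mentioned above, the following is a minimal sketch of a grid search scored by k-fold cross-validation. It is not the VusPrize implementation. The data are synthetic rather than ClinVar variants, the two features (`conservation`, `allele_freq`) are simplified placeholders, and a small k-nearest-neighbour classifier stands in for the RF and SVM models so that the example needs only the Python standard library. All function names and grid values are hypothetical.

```python
import random
from itertools import product


def make_toy_variants(n=120, seed=0):
    """Build synthetic (features, label) pairs. These are NOT ClinVar data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = pathogenic, 0 = non-pathogenic
        # Assumption: pathogenic variants tend to be more conserved and rarer.
        conservation = rng.gauss(0.7 if label else 0.3, 0.15)
        allele_freq = rng.uniform(0, 0.01) if label else rng.uniform(0, 0.05)
        data.append(((conservation, allele_freq), label))
    return data


def knn_predict(train, x, k, freq_weight):
    """Majority vote among the k nearest training variants (stand-in model)."""
    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (freq_weight * (a[1] - b[1])) ** 2

    neighbours = sorted(train, key=lambda item: dist(item[0], x))[:k]
    votes = sum(label for _, label in neighbours)
    return int(2 * votes > k)


def cv_accuracy(data, n_folds, **params):
    """Mean accuracy over n_folds held-out folds for one hyperparameter set."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]  # every n_folds-th item is held out
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, **params) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / n_folds


def grid_search(data, grid, n_folds=5):
    """Score every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(data, n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


data = make_toy_variants()
grid = {"k": [1, 3, 5, 7], "freq_weight": [1.0, 10.0]}  # hypothetical grid
best_params, best_score = grid_search(data, grid)
print(best_params, round(best_score, 3))
```

The selected hyperparameters are those with the highest mean held-out accuracy. With a real model, the same loop structure would apply to RF settings such as the number of trees, or to SVM settings such as the kernel and regularization strength.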