Simankov Nikolay

and 3 more

The sensitivity of state-of-the-art supervised classification models is compromised by contamination-prone biomedical datasets, which are vulnerable to missing or erroneous labels (i.e., inliers). With data ranging from codon frequencies, electrocardiogram signals, and biomarkers to morphological features and patient questionnaires, we attempted to cover a wide range of typical biomedical databases exposed to the risk of inlier contamination. In some specific fields, such as image recognition, missing labels have received a lot of attention, but in Life Sciences, where outliers are almost systematically filtered, inliers have remained orphans. Our study introduced a pragmatic and innovative methodology that consists of upcycling one-class semi-supervised anomaly detection models to filter potential inliers in training datasets. To validate this methodology, five one-class semi-supervised models and two ensemble methods were benchmarked against various traditional classifiers on 6 databases with 10 different contamination levels and 10 random samplings; they achieved an average Matthews correlation coefficient of 78±17% in validation, whereas 22 supervised classifiers achieved an average score of 81±9% when trained on the complete, uncontaminated training set. Moreover, filtering the training set with an isolation forest increased the average resilience to inliers of 22 tested Machine Learning models, including neural networks and gradient-boosting methods, from 69±11% to 95±1%. Taken together, our study showcased the efficacy of our versatile approach in enhancing the resilience of Machine Learning models and highlighted the importance of accurately addressing the inlier challenge in the Life Sciences.
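
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Matthews correlation coefficient can be computed for a binary task and expressed as a percentage. It is not the evaluation code of the study: the function name `matthews_corrcoef_binary` and the toy labels are assumptions made for illustration, and only the two-class case is covered, whereas the benchmarked databases may require a multiclass generalization.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Binary Matthews correlation coefficient for 0/1 labels.

    Hypothetical helper for illustration; returns a value in [-1, 1],
    or 0.0 when the score is undefined (a zero denominator).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when one row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels (made up for this example, not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

mcc = matthews_corrcoef_binary(y_true, y_pred)
print(f"MCC = {100 * mcc:.0f}%")
```

Unlike plain accuracy, this score uses all four cells of the confusion matrix, so a classifier that predicts a single class for every sample scores 0 rather than the majority-class frequency.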