Jayme Lewthwaite

The rapid increase in the availability of occurrence data has led to a corresponding increase in analyses that make use of these data. Because these data aggregate many surveying efforts, including community science observations and historic museum records, many errors end up in aggregate databases such as GBIF. Data-cleaning pipelines have mainly focused on the most common types of errors, such as erroneous coordinates, dates, and taxonomic names. However, other errors are harder to identify, such as data points that fall outside a species' known range or outside its plausible distribution. Identifying these errors requires close collaboration between data analysts and taxon experts, as such points often must be painstakingly evaluated one by one. To keep the data reliable, this process must be repeated whenever new data become available. Given the continuous (and often exponential) growth in data each year, finding ways to expedite the identification of these outliers is imperative. Rapidly advancing machine learning tools may be invaluable for this task. Using a manually cleaned set of 147 butterfly species with labeled true outliers, we compared the ability of multiple methods, ranging from simple regional checklists and ecoregion filters to neural networks, to identify these true outliers. Since we used real species data, we also evaluated which distributional properties affect classification performance. Because these outliers tend to be relatively rare, classifiers and neural networks tended to perform worse than simple filtering based on ecoregions. We identified ways to improve classification accuracy, such as subsampling, yet even after these improvements simple filters still outperformed more complex models. While previous studies have used simulated data to test outlier-detection methods, we found that real observational data are far more imbalanced than previously simulated datasets.
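
The abstract compares simple ecoregion-based filters against more complex models. The sketch below shows one way such a filter could look in practice, flagging records that fall in ecoregions where a species is essentially unrecorded; the file names, column names, and the 1% share threshold are illustrative assumptions, not details of the study's pipeline.

```python
# Minimal sketch of an ecoregion-based outlier filter (assumed inputs:
# "occurrences.csv" with species/decimalLongitude/decimalLatitude columns
# and an "ecoregions.shp" polygon layer with an ECO_ID field).
import geopandas as gpd
import pandas as pd

occ = pd.read_csv("occurrences.csv")
occ_gdf = gpd.GeoDataFrame(
    occ,
    geometry=gpd.points_from_xy(occ["decimalLongitude"], occ["decimalLatitude"]),
    crs="EPSG:4326",
)
ecoregions = gpd.read_file("ecoregions.shp").to_crs("EPSG:4326")

# Assign each record to the ecoregion polygon it falls within.
joined = gpd.sjoin(
    occ_gdf, ecoregions[["ECO_ID", "geometry"]], how="left", predicate="within"
)

# For each species, compute the share of its records per ecoregion and flag
# records landing in ecoregions that hold almost none of that species' points
# (the 1% cutoff is an arbitrary illustrative threshold).
counts = joined.groupby(["species", "ECO_ID"]).size().rename("n").reset_index()
counts["share"] = counts["n"] / counts.groupby("species")["n"].transform("sum")
rare = counts.loc[counts["share"] < 0.01, ["species", "ECO_ID"]].assign(flagged=True)

joined = joined.merge(rare, on=["species", "ECO_ID"], how="left")
joined["flagged"] = joined["flagged"].eq(True)  # unmatched (NaN) -> False
print(joined["flagged"].value_counts())
```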
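
The abstract also notes that true outliers are rare and that subsampling improved classifier performance. The sketch below uses synthetic stand-in data purely to show the mechanics: undersampling the majority class before training and reporting precision/recall on the rare class, where accuracy alone would be misleading. The random-forest choice, feature set, and outlier rate are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 4))              # stand-in environmental/geographic features
y = (rng.random(n) < 0.005).astype(int)  # ~0.5% of records labeled as true outliers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the majority (non-outlier) class in the training split only,
# keeping five inliers per labeled outlier.
out_idx = np.flatnonzero(y_tr == 1)
in_idx = rng.choice(np.flatnonzero(y_tr == 0), size=len(out_idx) * 5, replace=False)
keep = np.concatenate([out_idx, in_idx])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr[keep], y_tr[keep])

# Evaluate on the untouched test split; with ~0.5% outliers, a model that
# never flags anything would still score >99% accuracy, so precision and
# recall on the outlier class are the informative metrics.
p, r, f, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), labels=[1], zero_division=0
)
print(f"outlier precision={p[0]:.2f} recall={r[0]:.2f} f1={f[0]:.2f}")
```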