A comparison of outlier detection methods for species distribution data,
from geographical filters to machine learning
Abstract
The rapid increase in the availability of occurrence data has led to a
corresponding growth in the analyses that make use of these data.
Because these data aggregate many survey efforts, including community
science observations and historical museum records, numerous errors end up in
aggregate databases such as GBIF. Data cleaning pipelines have mainly
focused on the most common types of errors, such as erroneous
coordinates, dates, and taxonomic names. However, other errors are
harder to identify, such as records that fall outside known species
ranges or otherwise implausible distributions. Identifying these errors
requires close collaboration between data analysts and taxon experts,
as such records often must be painstakingly vetted one by one. To
ensure the reliability of the data, this vetting often must be repeated
whenever new data become
available. Given the continuous (and often exponential) increase in data
each year, finding ways to expedite the identification of these outliers
is imperative. Rapidly advancing machine learning tools may prove
invaluable for this task. Using a manually cleaned set of 147 butterfly
species with labeled true outliers, we compared the ability of multiple
methods, ranging from simple regional checklists and ecoregion filters
to neural networks, to identify these outliers. Since we used real
species data, we were also able to evaluate
distribution properties that may affect classification. Because these
outliers tend to be relatively rare, classifiers and neural networks
tended to perform worse than simple filtering based on ecoregion. We
identified ways to improve classification accuracy, such as subsampling,
yet even after these improvements, simple filters still outperformed
more complex models. While previous studies have used simulated data to
test outlier detection methods, we found that real observational data
are more imbalanced than previously simulated datasets.