Optimal sequence similarity thresholds for clustering of molecular
operational taxonomic units in DNA metabarcoding studies
Abstract
Clustering approaches are pivotal for handling the many sequence variants
obtained in DNA metabarcoding datasets and have therefore become a key
step of metabarcoding analysis pipelines. Clustering often relies on a
sequence similarity threshold to gather sequences into Molecular
Operational Taxonomic Units (MOTUs) that ideally each represent a
homogeneous taxonomic entity, e.g. a species or a genus. However, the
choice of the clustering threshold is rarely justified, and its impact
on MOTU over-splitting or over-merging is tested even less often. Here, we
evaluated clustering threshold values for several metabarcoding markers
under different criteria: limitation of MOTU over-merging, limitation of
MOTU over-splitting, and trade-off between over-merging and
over-splitting. We extracted sequences from a public database for eight
markers, ranging from generalist markers targeting Bacteria or
Eukaryota, to more specific markers targeting a class or a subclass
(e.g. Insecta, Oligochaeta). Based on the distributions of pairwise
sequence similarities within species and within genera and on the rates
of over-splitting and over-merging across different clustering
thresholds, we propose threshold values that minimize the risk of
over-splitting, minimize the risk of over-merging, or offer a trade-off
between the two risks. For generalist markers, high similarity thresholds
(0.96-0.99) are generally appropriate, while more specific markers
require lower values (0.85-0.96). These results do not support the use
of a fixed clustering threshold (e.g. 0.97). Instead, we advocate a
careful examination of the most appropriate threshold based on the
research objectives, the potential costs of over-splitting and
over-merging, and the features of the studied markers.
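
The evaluation summarized above can be illustrated with a minimal Python
sketch. It is not the pipeline used in the study. The similarity measure
(identity over pre-aligned, equal-length sequences), the single-linkage
clustering, the two rate definitions, and the toy sequences and species
labels are all assumptions made for illustration. Here, over-splitting is
taken as the fraction of species spread over more than one MOTU, and
over-merging as the fraction of MOTUs containing more than one species.

```python
from itertools import combinations


def similarity(a, b):
    """Fraction of identical positions (assumes pre-aligned, equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Single-linkage clustering: link every pair with similarity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def error_rates(seqs, species, threshold):
    """Return (over_splitting_rate, over_merging_rate) under the assumed definitions."""
    motus = cluster(seqs, threshold)
    motus_per_species = {}
    species_per_motu = {}
    for motu, sp in zip(motus, species):
        motus_per_species.setdefault(sp, set()).add(motu)
        species_per_motu.setdefault(motu, set()).add(sp)
    over_split = sum(len(m) > 1 for m in motus_per_species.values()) / len(motus_per_species)
    over_merged = sum(len(s) > 1 for s in species_per_motu.values()) / len(species_per_motu)
    return over_split, over_merged


# Hypothetical toy data: four aligned sequences from three species.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAACC", "GGGGGGGGGG"]
species = ["sp1", "sp1", "sp2", "sp3"]

for threshold in (1.0, 0.85, 0.75):
    split, merged = error_rates(seqs, species, threshold)
    print(f"threshold={threshold:.2f}  over-splitting={split:.2f}  over-merging={merged:.2f}")
```

On this toy input, the strictest threshold splits sp1 into two MOTUs, the
lowest merges sp1 and sp2, and the intermediate value avoids both errors,
which mirrors the trade-off discussed above. Real markers would require
proper pairwise alignment and a reference database with reliable taxonomic
labels.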