Illumina read processing and filtering
Detailed information on the read processing and filtering pipeline is
summarized in Table S2. Briefly, we demultiplexed raw reads allowing no
mismatches in the dual-index pair. Then, we used fastqc v.0.11.7
(Andrews, 2010) to quality check raw reads and cutadapt v.2.10
(Martin, 2011) to trim primers and filter out raw reads exhibiting any
variation from expected primer length and composition. Subsequently, we
used pear v.0.9.11 (Zhang, Kobert, Flouri, & Stamatakis, 2014)
to merge forward and reverse reads. Each metabarcoding sample was then
processed separately: reads were quality filtered, dereplicated
(discarding singletons), length filtered (retaining only reads of
416-420 bp), filtered for chimeras de novo using UCHIME3, and denoised
using UNOISE3, as implemented in vsearch v.2.9.1 (Rognes, Flouri,
Nichols, Quince, & Mahé, 2016). After denoising, reads from all metabarcoding
samples were pooled and again dereplicated (discarding no sequences) to
generate a catalogue of unique putative haplotypes (ASVs). Subsequently,
we ran blast to compare all ASVs against a combined database
composed of the NCBI nt collection (accessed November 2020) and a
curated reference catalogue including the 344 Sanger sequences of the
‘voucher’ specimens plus 561 previously available sequences
corresponding to soil lineages of Acari, Collembola and Coleoptera
(Arribas et al., 2016, 2021b). Based on the blast output, we
assigned the ASVs to high-rank taxonomic levels by applying the
weighted lowest common ancestor algorithm in megan6 (Huson et
al., 2016; see also Hleap, Littlefair, Steinke, Hebert, & Cristescu,
2021). Only ASVs assigned to Acari, Collembola or Coleoptera were
retained and used for downstream analyses. We further filtered the ASVs
using metamate v.0.1b18 (Andújar et al., 2021), a novel
approach designed to remove putative nuclear copies of mitochondrial DNA
(NUMTs; Lopez, Yuhki, Masuda, Modi, & O’Brien, 1994) and other types of
low-frequency erroneous sequences from denoised metabarcoding datasets.
This software applies multiple read-abundance filtering strategies and
then evaluates their effects on the prevalence of known authentic
mitochondrial haplotypes and of presumed non-mitochondrial copies
(e.g., those that disrupt the reading frame or deviate from the
expected length, as is typical of NUMTs and erroneous sequences) in
the final filtered dataset (Andújar et al., 2021). We selected the most
stringent filtering solution to ensure the removal of most erroneous
sequences (see Supplemental Information for details on the
metamate filtering). Subsequently, we used vsearch to
generate a read-count community table of the metamate-filtered
ASVs by matching them at 100% identity against the raw read
dataset as it stood before dereplication, length filtering and denoising. We further
filtered these community tables by removing ASVs represented by two
or fewer reads, as well as those contributing less than 1% of the total
number of reads of their taxonomic group within a library. Finally,
filtered read-count community tables were converted to presence/absence
tables (see Jurburg, Keil, Singh, & Chase, 2021). Negative controls
were processed alongside actual samples throughout the filtering
workflow.
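
The final abundance filtering and presence/absence conversion described above can be illustrated with a minimal Python sketch. This is not the code used in the study; it is an illustration under stated assumptions. In particular, we assume that the filter is applied to each ASV-by-library record, and that the per-group totals are computed before any record is removed; the text does not specify either point. The function names, the dictionary layout and the example counts are hypothetical.

```python
from collections import defaultdict


def filter_community_table(counts, taxon_of, min_reads=3, min_fraction=0.01):
    """Apply the two read-abundance filters to a community table.

    counts:   {library: {asv: read count}}  (hypothetical layout)
    taxon_of: {asv: taxonomic group, e.g. "Acari"}
    Assumption: filtering is done per ASV-by-library record.
    """
    filtered = {}
    for library, asv_counts in counts.items():
        # Total reads per taxonomic group in this library, before filtering.
        group_totals = defaultdict(int)
        for asv, reads in asv_counts.items():
            group_totals[taxon_of[asv]] += reads

        kept = {}
        for asv, reads in asv_counts.items():
            # Filter 1: drop records with 2 or fewer reads.
            if reads < min_reads:
                continue
            # Filter 2: drop records below 1% of the group's reads in the library.
            if reads < min_fraction * group_totals[taxon_of[asv]]:
                continue
            kept[asv] = reads
        filtered[library] = kept
    return filtered


def to_presence_absence(filtered):
    """Convert filtered read counts to a 0/1 table over all retained ASVs."""
    asvs = sorted({asv for kept in filtered.values() for asv in kept})
    return {
        library: {asv: int(asv in kept) for asv in asvs}
        for library, kept in filtered.items()
    }


# Invented example data, for illustration only.
example_counts = {
    "lib1": {"asv1": 500, "asv2": 2, "asv3": 4, "asv4": 40},
    "lib2": {"asv3": 30, "asv4": 3},
}
example_taxa = {
    "asv1": "Acari",
    "asv2": "Acari",
    "asv3": "Acari",
    "asv4": "Collembola",
}

example_filtered = filter_community_table(example_counts, example_taxa)
example_pa = to_presence_absence(example_filtered)
```

In this invented example, asv2 is removed from lib1 by the read-count threshold, and asv3 is removed from lib1 because its 4 reads fall below 1% of the 506 Acari reads in that library, although it is retained in lib2.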