Data analysis
The two airborne eDNA samples collected from each site were combined and treated as one sample to maximise the recovery of cellular material from each site. The raw reads were firstly demultiplexed using QIIME2 (version 2022.8.0) (Estaki et al. 2020) based on the Illumina barcodes, creating separate samples for each. FastQC (version 0.11.9) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used on the demultiplexed reads to assess the overall data quality before and after adapter removal. Adapters were removed using Cutadapt (Martin 2011) (version 2.8-2). Bowtie2 (Langmead and Salzberg 2012) (version 2.2.5) and Samtools (Danecek et al. 2021) (version 1.10) were used to align the reads to a reference human mitochondrial genome sequence (accession number KX456569.1) and identify and remove human/primate-associated sequences from the dataset. In some instances, samples contained 99% human DNA and hence were removed from all further analyses (Supplementary Table 1).
To minimise the introduction of potentially erroneous ASVs, we used our positive control sample (known DNA source; Tursiops aduncus ) to optimise the filtering and denoising parameters used in the DADA2 package (Callahan et al. 2016) in R (version 1.26.0). We did this by varying the maximum expected errors from 1 to 2 and the truncation quality score parameters from 2 to 30. Filtering for the maximum expected error of 1 and truncation quality score of 30 significantly decreased the abundance of all reads but did not remove erroneous ASVs. Filtering and denoising parameters were therefore selected to maximise data retention by using default settings: Trimming left of 19 bp for forward reads and 21 bp for reverse reads, truncation length of 130 bp, maximum number of ambiguous bases (N’s) of 0, maximum expected errors of 2, truncation quality score of 2, removal of PhiX contamination set to TRUE. DADA2 was then used to remove chimeras from the sequence data using the removeBimeraDenovo function with the pooled method. We then produced two datasets for downstream analyses, one including all forward filtered and denoised reads (to maximise data retention) and the merged forward and reverse filtered and denoised reads. We found that the use of merged reads dramatically reduced the presence of detected contaminant DNA (Figure S1).
The merged, filtered and denoised reads were then used to produce amplicon sequence variants (ASVs) (Table S1), which clusters unique sequence variants based on a 100% sequence similarity. We then used BLAST to match ASVs to mammalian mitochondrial DNA sequences filtered from the NCBI’s GenBank non-redundant (nr) database and made species level taxonomic assignments to BLAST hits using the R package taxonomizr (version 0.9.3). Taxonomic assignment was performed using a combination of consensus between the 10 BLAST matches for each ASV and the percentage of identity for each match to produce a trust score. If the percentage identity of one or more matches was higher than the mean of the 10 blast hits, only those higher matches were used to establish a consensus for the assignment of the final taxonomy. The trust score was transformed into a 0-1 score using the formula: trust score = (percentage_identify - 80) / (100 - 80). Only ASVs that returned a mammalian mitochondrial DNA sequence BLAST match, had at least three reads across all samples and did not match to primate mitochondrial sequences were used for subsequent analyses.
A phylogenetic tree was constructed using the ASVs obtained after taxonomic assignment and included relative abundance calculations, normalized by library size. Reference sequences of relevant native and introduced species identified by BLAST matches were also included in the tree construction. These sequences and their accession numbers can be found in Supplementary Table 2. The alignment of ASVs was carried out with the MAAFT (Katoh and Standley 2013) algorithm, and a tree was produced with FastTree (Price et al. 2010) (version 2.1.11). The resulting set of ASVs were manually clustered based on three types of evidence, 1) presence at the same site, 2) the best match to the taxonomic reference, and 3) the normalised abundance. Where multiple, highly similar (≥ 97%) ASVs were present at the same site, the most abundant ASV also had the highest similarity to the taxonomic reference and was therefore chosen as the most accurate representative. This process of manual curation removed the ambiguity of erroneous ASVs from the dataset and aimed to ensure accurate representation of the species present at the sampling sites. A dendrogram was then constructed using this set of curated ASVs and relative abundance was calculated by summing the read counts from the cluster, normalized by library size. Reference sequences of relevant native and introduced animals were included in the tree construction. The tree was then visualized and refined in the Interactive Tree Of Life (iTOL) (Letunic and Bork 2007) tool (version 4.4.0). Additional graphics were applied using Adobe Illustrator.