Bioinformatic analyses
The bioinformatics analyses were carried out using the metabarcoding
analysis package DADA2 (Callahan et al., 2016) and the Phyloseq package
(McMurdie and Holmes, 2013). A pipeline in R v4.4.0 (R Core Team, 2021)
was used for read quality control, adaptor removal (Cutadapt; Martin,
2011), removal of sequencing errors and chimeric reads, read merging,
inference of amplicon sequence variants (ASVs), visualization of their
distribution, and taxonomic assignment. The taxonomic
assignment was conducted in two rounds. First, ASVs were classified with
the RDP classifier implemented in DADA2 against a custom 12S database
based on the database developed by Milan et al. (2020) for both 12S
markers, containing 252 DNA sequences, 181 of them specifically from the
São Francisco Basin. Second, ASVs were searched with local BLASTn
(Camacho et al., 2008) against the NCBI nucleotide database (Sayers,
2022; NCBInt). Percent identity thresholds of 98% and 99% were applied
for species-level identifications for COI and 12S, respectively. The
relative read abundance (RRA) was determined
by dividing the absolute counts of each ASV by the sum of the absolute
counts of all ASVs in a sample.
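The identity thresholds and the RRA definition above can be illustrated with a minimal Python sketch. It is not the R pipeline used in this study: the function names, the ASV labels and the read counts are hypothetical, and treating the thresholds as inclusive (greater than or equal to) is an assumption, since the text does not state how ties are handled.

```python
# Species-level percent identity thresholds stated in the text.
SPECIES_THRESHOLDS = {"COI": 98.0, "12S": 99.0}


def meets_species_threshold(marker, percent_identity):
    """Return True if a hit qualifies for a species-level identification.

    Assumption: the threshold is inclusive (>=); the text does not say.
    """
    return percent_identity >= SPECIES_THRESHOLDS[marker]


def relative_read_abundance(counts):
    """Return the RRA of each ASV in one sample.

    counts: dict mapping ASV identifier -> absolute read count.
    RRA = count of the ASV / sum of the counts of all ASVs in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        # A sample without reads has no defined proportions; return zeros.
        return {asv: 0.0 for asv in counts}
    return {asv: n / total for asv, n in counts.items()}


# Hypothetical sample: ASV names and counts are invented for illustration.
sample = {"ASV_1": 750, "ASV_2": 200, "ASV_3": 50}
print(relative_read_abundance(sample))

# The same 98.4% hit passes the COI threshold but not the 12S threshold.
print(meets_species_threshold("COI", 98.4), meets_species_threshold("12S", 98.4))
```

By construction, the RRA values of a sample with at least one read sum to 1, so they can be compared across samples with different sequencing depths.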
To compare species identifications between markers, Venn diagrams were
built using the web application Lucidchart (https://lucid.app/). To
examine the potential effect of marker choice on sample composition, a
permutational multivariate analysis of variance (PERMANOVA) with 1000
permutations and a principal coordinate analysis (PCoA) were performed,
applying the Jaccard and Bray-Curtis dissimilarity indices. The
PERMANOVA was run using the function ‘adonis’ (vegan 2.5–2 R package).
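For reference, the two dissimilarity measures named above can be written out for a single pair of samples. The Python sketch below is illustrative only and does not reproduce the vegan implementation. It assumes the presence/absence form of the Jaccard index, which the text does not specify, and the taxon labels and counts are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """Presence/absence Jaccard dissimilarity: 1 - |shared taxa| / |all taxa|."""
    taxa_a = {t for t, n in a.items() if n > 0}
    taxa_b = {t for t, n in b.items() if n > 0}
    union = taxa_a | taxa_b
    if not union:
        # Two empty samples are treated as identical.
        return 0.0
    return 1.0 - len(taxa_a & taxa_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical read counts for the same sample characterized with two markers.
coi = {"sp_A": 10, "sp_B": 5, "sp_C": 0}
m12s = {"sp_A": 6, "sp_B": 0, "sp_D": 4}

print(round(jaccard_dissimilarity(coi, m12s), 3))      # 0.667
print(round(bray_curtis_dissimilarity(coi, m12s), 3))  # 0.52
```

The Jaccard form used here responds only to which taxa are detected, whereas Bray-Curtis also weighs differences in read counts, which is why the two indices can rank the same pair of samples differently.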
Because the available sequencing technology is limited to a maximum
length of 600 bp, the forward (R1) and reverse (R2) COI reads could not
be merged by overlap to reconstruct the barcoding amplicon, as each
strand covers a different region of the COI gene with potentially
distinct variation for each taxon. Therefore, R1 and R2 reads were
analyzed separately, and their taxonomic assignments were combined for
each sample.
The ASVs found in the negative controls were removed from all other
samples. Additionally, because high-throughput sequencing can amplify
contamination not detected by the negative controls, and to reduce the
risk of false positives without excluding underrepresented taxa, only
ASVs with a relative read abundance (RRA) above 0.01% in each sample
were retained.
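The two filtering steps can be sketched in Python as follows. This is a minimal illustration, not the code used in this study. It assumes that RRA is computed after the negative-control ASVs have been removed (the text does not state the order), and the function name, ASV labels and counts are hypothetical.

```python
RRA_THRESHOLD = 0.0001  # 0.01% expressed as a proportion


def filter_sample(counts, control_asvs, threshold=RRA_THRESHOLD):
    """Apply the negative-control and RRA filters to one sample.

    counts: dict mapping ASV identifier -> absolute read count.
    control_asvs: set of ASV identifiers detected in any negative control.
    """
    # Step 1: drop every ASV that was also detected in a negative control.
    kept = {asv: n for asv, n in counts.items() if asv not in control_asvs}
    total = sum(kept.values())
    if total == 0:
        return {}
    # Step 2: retain only ASVs whose RRA is strictly above the threshold.
    # Assumption: RRA is computed on the reads remaining after step 1.
    return {asv: n for asv, n in kept.items() if n / total > threshold}


# Hypothetical sample; ASV_4 is assumed to occur in a negative control.
sample = {"ASV_1": 60000, "ASV_2": 39995, "ASV_3": 5, "ASV_4": 1200}
print(filter_sample(sample, control_asvs={"ASV_4"}))
```

In this example ASV_4 is removed as a control contaminant and ASV_3 is discarded because 5 of the remaining 100,000 reads correspond to an RRA of 0.005%, below the 0.01% cutoff.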