Bioinformatic analyses
The bioinformatic analyses were carried out using the metabarcoding analysis package DADA2 (Callahan et al., 2016) and the phyloseq package (McMurdie and Holmes, 2013). A pipeline in R v4.4.0 (R Core Team, 2021) was used for read quality control, adapter removal (Cutadapt; Martin, 2011), removal of sequencing errors and chimeric reads, read merging, inference of Amplicon Sequence Variants (ASVs), visualization of their distribution, and taxonomic assignment. The taxonomic assignment was conducted in two rounds. First, ASVs were classified with the DADA2 RDP classifier against a custom 12S database, based on the database developed by Milan et al. (2020) for both 12S markers, containing 252 DNA sequences, 181 of which are specifically from the São Francisco Basin. Second, ASVs were searched with local BLASTn (Camacho et al., 2008) against the NCBI nucleotide database (NCBI nt; Sayers, 2022). Percent identity thresholds of 98% and 99% were applied for species-level identifications for COI and 12S, respectively. The relative read abundance (RRA) was determined by dividing the absolute count of each ASV by the sum of the absolute counts of all ASVs in a sample.
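As an illustration only, the RRA calculation and the species-level identity thresholds described above can be sketched in Python. This is not the pipeline used in the study, which was run in R. The function names, the example counts, and the use of an inclusive comparison (identity greater than or equal to the threshold) are assumptions made for the sketch.

```python
# Species-level percent identity thresholds stated in the text.
SPECIES_THRESHOLD = {"COI": 98.0, "12S": 99.0}


def relative_read_abundance(counts):
    """Return the RRA of each ASV: its count divided by the sample total."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no defined proportions; report zeros.
        return {asv: 0.0 for asv in counts}
    return {asv: n / total for asv, n in counts.items()}


def is_species_level(marker, percent_identity):
    """True when a hit meets the marker's threshold.

    Assumption: the comparison is inclusive (>=); the text does not say.
    """
    return percent_identity >= SPECIES_THRESHOLD[marker]


# Hypothetical read counts for one sample.
sample = {"ASV1": 750, "ASV2": 200, "ASV3": 50}
rra = relative_read_abundance(sample)
```

With these hypothetical counts, ASV1 accounts for 750 of 1000 reads, so its RRA is 0.75, and the RRA values of a sample sum to 1.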
To compare species identifications between markers, Venn diagrams were built using the web application Lucidchart (https://lucid.app/). To examine the potential effect of marker choice on sample composition, a Permutational Multivariate Analysis of Variance (PERMANOVA) with 1000 permutations and a principal coordinate analysis (PCoA) were performed, applying the Jaccard and Bray-Curtis dissimilarity indices. The PERMANOVA was run using the function ‘adonis’ (vegan 2.5–2 R package).
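For readers unfamiliar with the two dissimilarity indices, a minimal Python sketch of their definitions follows. It is illustrative and does not reproduce the vegan implementation. The Jaccard index is computed here on presence/absence data, which is an assumption about how it was applied, and the taxon names and counts are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """Presence/absence Jaccard dissimilarity between two samples.

    a, b: dicts mapping taxon -> read count. A taxon is present if count > 0.
    """
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for the same sample seen through two markers.
coi = {"sp_A": 10, "sp_B": 5, "sp_C": 0}
s12 = {"sp_A": 6, "sp_B": 0, "sp_C": 4}

jac = jaccard_dissimilarity(coi, s12)
bc = bray_curtis_dissimilarity(coi, s12)
```

In this example the two markers share one of three detected taxa, giving a Jaccard dissimilarity of 2/3, while the Bray-Curtis dissimilarity is 13/25 = 0.52 because it also weighs the difference in read counts.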
Because of the maximum 600 bp length limitation of the available sequencing technology, the forward (R1) and reverse (R2) COI reads could not be merged by overlap to reconstruct the barcoding amplicon, as each read covers a different region of the COI gene, with possibly distinct variation for each taxon. Therefore, R1 and R2 reads were analyzed separately, and the taxonomic assignment results were then combined for each sample.
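The text does not specify how the R1 and R2 results were combined. One plausible reading, sketched below in Python, is a per-sample union of the taxa assigned from either read. The union rule, the function name, and the sample and taxon labels are all assumptions made for illustration.

```python
def combine_assignments(r1, r2):
    """Combine per-sample taxon sets from R1 and R2 by set union.

    r1, r2: dicts mapping sample ID -> set of assigned taxa.
    Assumption: a taxon counts as detected if either read supports it.
    """
    samples = sorted(set(r1) | set(r2))
    return {s: r1.get(s, set()) | r2.get(s, set()) for s in samples}


# Hypothetical assignments for two samples.
r1_taxa = {"S1": {"sp_A", "sp_B"}, "S2": {"sp_C"}}
r2_taxa = {"S1": {"sp_B", "sp_D"}}

combined = combine_assignments(r1_taxa, r2_taxa)
```

Under this rule, a sample with assignments from only one read direction keeps those assignments unchanged, and taxa supported by both reads appear once.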
The ASVs found in the negative controls were removed from all other samples. Additionally, because high-throughput sequencing can amplify contaminants not detected by the negative controls and carries a risk of false positives, but with the aim of not excluding underrepresented taxa, only ASVs with a relative read abundance (RRA) greater than 0.01% in each sample were considered.
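The two filtering steps can be sketched in Python as follows. This is a minimal illustration, not the code used in the study. It assumes that the RRA is recalculated after the control ASVs are removed (the text does not state the order), and the function name and counts are hypothetical.

```python
RRA_THRESHOLD = 0.0001  # 0.01% expressed as a fraction


def filter_sample(counts, control_asvs, threshold=RRA_THRESHOLD):
    """Drop ASVs seen in negative controls, then apply the RRA cutoff.

    counts: dict mapping ASV -> read count in one sample.
    control_asvs: set of ASVs detected in any negative control.
    Assumption: RRA is computed on the reads left after control removal.
    """
    kept = {asv: n for asv, n in counts.items() if asv not in control_asvs}
    total = sum(kept.values())
    if total == 0:
        return {}
    # Keep only ASVs strictly above the threshold, as stated in the text.
    return {asv: n for asv, n in kept.items() if n / total > threshold}


# Hypothetical sample: ASV4 also appears in a negative control.
sample_counts = {"ASV1": 90000, "ASV2": 9995, "ASV3": 5, "ASV4": 500}
filtered = filter_sample(sample_counts, control_asvs={"ASV4"})
```

Here ASV4 is removed as a control contaminant, and ASV3 is dropped because 5 of the remaining 100,000 reads is an RRA of 0.005%, below the 0.01% cutoff.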