Step 2: Selecting appropriate gene region(s) and creating a reference database
The standard DNA barcode for plants is the combination of rbcL and matK (CBOL Plant Working Group et al., 2009), but other gene regions such as trnL, ITS1 and ITS2 are often used, especially for DNA metabarcoding, as they have shown finer resolution for several taxonomic groups (Wilson et al., 2021). Increasing the number and length of gene regions is expected to improve taxonomic resolution, and some studies on pollen mixtures have used whole plastid genomes (Lang et al., 2019) or whole genomes (Bell, Petit, et al., 2021). The barcode gene region(s) must be suitable for differentiating the taxa in the study system to the taxonomic level required and must have primers universal enough to amplify all taxa. Recent research has shown that a combination of rbcL and ITS2 works better than other combinations for detecting the highest number of species at the lowest level of taxonomic discrimination (Jones, Twyford, et al., 2021). Several studies have worked towards developing more universal primers for ITS2 (e.g., Kolter & Gemeinholzer, 2021; Moorhouse-Gann et al., 2018) and primers that amplify shorter regions to account for pollen degradation in historical samples (Simanonok et al., 2021). However, although more gene regions (and partial or whole genomes) improve taxonomic resolution, they also require more work in assembling the reference library. Using multiple gene regions poses the additional challenge of determining how best to combine results from markers with differing taxonomic resolution for downstream analysis.
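To make the marker-combination challenge concrete, the following minimal Python sketch shows one possible rule, not a recommended or published method: keep the deepest taxonomic assignment on which all markers agree, letting a better-resolved marker extend a less-resolved one, and stop at the first conflict. The function name `combine_assignments`, the lineage format (tuples ordered from family to species) and the example taxa are assumptions made purely for illustration.

```python
from itertools import zip_longest


def combine_assignments(lineages):
    """Merge per-marker lineages (tuples ordered family -> species).

    Hypothetical rule for illustration: walk down the ranks and keep a
    name while every marker that reaches that rank reports the same
    name; stop at the first rank where markers disagree.
    """
    consensus = []
    # zip_longest pads shorter (less resolved) lineages with None.
    for names_at_rank in zip_longest(*lineages):
        observed = {name for name in names_at_rank if name is not None}
        if len(observed) != 1:
            break  # conflict between markers: do not go any deeper
        consensus.append(observed.pop())
    return tuple(consensus)


# Illustrative input: rbcL resolves to genus, ITS2 to species.
rbcl_hit = ("Asteraceae", "Taraxacum")
its2_hit = ("Asteraceae", "Taraxacum", "Taraxacum officinale")
print(combine_assignments([rbcl_hit, its2_hit]))
```

Under this rule a conflict at genus level would truncate the combined assignment to family. Other schemes, such as weighting markers by their known resolution in a given plant group, are equally plausible, and the choice should be reported with the results.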
It is essential that the chosen gene regions have a comprehensive reference library covering the species in the study system. Using a custom, relevant reference library reduces misidentifications and increases the accuracy of taxonomic assignments (Arstingstall et al., 2021; A. Keller et al., 2020). Software such as the BCdatabaser (A. Keller et al., 2020), which creates custom databases from species lists, can be helpful where there is no national database, different gene regions are being used, or a more local database is desired. In addition, the recent software MetaCurator of Richardson, Sponsler, McMinn‐Sauder, and Johnson (2020) has two features that are advantageous for curating existing or newly generated reference datasets: 1) identifying the exact amplicon of interest and trimming away extraneous sequence to avoid non-overlapping amplicons of the same gene, and 2) dereplicating sequences by taxonomy so that barcodes are retained for multiple species even when there is no barcode gap (i.e., a higher and non-overlapping range of sequence divergence between species than within species). One consideration with this and many other methods is that they create a static database, which needs to be updated frequently as new sequences become available.
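The idea of dereplicating by taxonomy can be illustrated with a minimal Python sketch. It is not MetaCurator's implementation; the function name `dereplicate_by_taxonomy`, the record format (identifier, taxon, sequence) and the example records are assumptions for illustration only. Duplicates are removed only within a taxon, so an identical sequence shared by two species is retained once for each species rather than collapsed into a single entry.

```python
def dereplicate_by_taxonomy(records):
    """Keep the first record for each (taxon, sequence) pair.

    records: iterable of (seq_id, taxon, sequence) tuples.
    Identical sequences are dropped only when they belong to the same
    taxon, so species that share a barcode each keep a reference.
    """
    seen = set()
    kept = []
    for seq_id, taxon, sequence in records:
        key = (taxon, sequence.upper())  # compare sequences case-insensitively
        if key not in seen:
            seen.add(key)
            kept.append((seq_id, taxon, sequence))
    return kept


# Made-up records: species_1 and species_2 share one identical sequence.
reference = [
    ("ref1", "Genus species_1", "ACGTACGT"),
    ("ref2", "Genus species_1", "ACGTACGT"),  # within-species duplicate
    ("ref3", "Genus species_2", "ACGTACGT"),  # same sequence, other species
    ("ref4", "Genus species_2", "ACGTACGA"),
]
for record in dereplicate_by_taxonomy(reference):
    print(record)
```

A sequence-only dereplication would instead keep a single copy of the shared sequence, leaving one of the two species without that reference entry.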