A simple statistical model of the DNA barcode gap is outlined in both frequentist and Bayesian contexts. Here, recently introduced nonparametric metrics, inspired by coalescent theory, are used to characterize the extent of proportional
overlap/separation in maximum and minimum pairwise genetic distances within and among species, respectively, and their accuracy is examined. Using a straightforward binomial count of overlapping specimen records, probabilities of taxon distance distribution overlap/separation are directly estimated. The mean and variance of the estimators are derived for the edge cases of no overlap and full overlap, and the estimators are shown to have good asymptotic properties. Further, a new way to visualize distances and the DNA barcode gap estimators via the empirical cumulative distribution function (ECDF) is presented and proves informative.
Using R and the probabilistic programming language Stan, the proposed maximum likelihood estimators (MLEs) and Bayesian model are demonstrated on cytochrome \(b\) (CYTB) gene sequences from two Agabus diving beetle species, A. bipustulatus and
A. nevadensis (\(N\) = 701 and \(N\) = 2 individuals, respectively). The analyses reveal considerable uncertainty in parameter estimates, particularly under the frequentist paradigm and when specimen sample sizes for target species are small. These findings highlight the promise of a Bayesian approach with a conjugate beta prior for obtaining reliable posterior estimates, in preference to classical inference, when available data are sparse. The results can help address foundational and applied research questions concerning DNA-based specimen identification and species delineation in evolutionary biology and ecology, as well as in biodiversity conservation,
forensics, and the management and restoration of wide-ranging taxa.
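
To make the binomial-count idea concrete, the following is a minimal sketch, not the implementation used in the study (which relies on R and Stan). It is written in Python, and the function names, the uniform Beta(1, 1) prior, and the example counts are all illustrative assumptions. It contrasts the maximum likelihood estimate of an overlap proportion with the conjugate beta posterior when only two specimen records are available.

```python
from math import sqrt


def binomial_mle(y, n):
    """MLE of an overlap proportion p from y overlapping records out of n,
    with the usual asymptotic (Wald) standard error."""
    p_hat = y / n
    se = sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se


def beta_posterior(y, n, a=1.0, b=1.0):
    """Conjugate update: a Beta(a, b) prior combined with a binomial
    likelihood gives a Beta(a + y, b + n - y) posterior.
    The default Beta(1, 1) (uniform) prior is an assumption."""
    a_post = a + y
    b_post = b + (n - y)
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1.0))
    return a_post, b_post, mean, var


# Hypothetical counts: none of n = 2 specimen records overlap.
p_hat, se = binomial_mle(0, 2)
a_post, b_post, post_mean, post_var = beta_posterior(0, 2)

print(p_hat, se)                            # 0.0 0.0
print(a_post, b_post, post_mean, post_var)  # 1.0 3.0 0.25 0.0375
```

In the no-overlap edge case the MLE sits on the boundary with a standard error of zero, which understates the uncertainty for so small a sample. The beta posterior stays proper, and its variance is nonzero.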