Low-coverage WGS for population assignment
In addition to elucidating fine-scale migratory connectivity patterns in
the American Redstart, our results provide important considerations for
other population assignment studies using lcWGS. We found that balancing
effective sample sizes of the source populations to within one
effective individual of each other was essential for accurate
assignment. Even when the actual number of individuals used per
population was the same, variation in mean depth (1.3X – 1.9X) between
populations skewed the effective sample sizes, resulting in decreased
assignment accuracy. Other studies with known genotypes from RADseq have
demonstrated the influence of actual sample size on overall assignment
accuracy but not how it affects assignment bias. The effective sample
sizes needed per population for accurate assignment and the degree of
standardizing these values will depend on the population structure of
the study system. For example, study systems with higher genetic
differentiation between populations may not need to finely standardize
effective sample size to achieve high assignment accuracy. We suggest
that other population assignment studies similarly evaluate the
influence of source population effective sample size on known source
individuals before assigning individuals of unknown origin. Reducing the
effective sample size of a sampled population can be achieved by either
removing individuals or downsampling reads. In this study, we
chose to remove individuals, using the individuals’ effective sample
sizes as a guide for how many individuals to remove from each population
(resulting in 21 – 27 samples per population). For studies with smaller
sample sizes, it may be worthwhile to investigate whether retaining all
individuals but downsampling reads is a better way to standardize
effective sample sizes while retaining more variation among
individuals.
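
The removal strategy described above can be illustrated with a minimal sketch. It assumes that a per-individual effective sample size has already been computed by the assignment software, that each individual contributes at most one effective individual, and that the lowest-contributing individuals are removed first. The removal order, the function name `balance_effective_sizes`, and the input values are assumptions made for illustration, not a description of the procedure used in this study.

```python
def balance_effective_sizes(ind_eff, tolerance=1.0):
    """Trim populations so that total effective sample sizes are balanced.

    ind_eff: dict mapping population name -> list of per-individual
             effective sample sizes (hypothetical values, each <= 1).
    tolerance: maximum allowed excess over the smallest population total.
    Returns a dict mapping population name -> retained values.
    """
    # Sort each population so the smallest contributors sit at the end.
    kept = {pop: sorted(vals, reverse=True) for pop, vals in ind_eff.items()}
    # The population with the smallest total sets the target.
    target = min(sum(vals) for vals in kept.values())
    for vals in kept.values():
        # Assumption: drop the lowest-contributing individual first.
        while vals and sum(vals) - target > tolerance:
            vals.pop()
    return kept


# Hypothetical per-individual effective sample sizes for three populations.
example = {
    "pop_A": [0.9, 0.8, 0.7, 0.6],
    "pop_B": [0.5, 0.5, 0.5],
    "pop_C": [0.9, 0.9, 0.9, 0.9, 0.9],
}
balanced = balance_effective_sizes(example)
for pop, vals in balanced.items():
    print(pop, len(vals), round(sum(vals), 2))
```

Removing the highest-contributing individuals instead would reach the target with fewer removals, so the removal order is a design choice worth testing against known source individuals.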
Importantly, here we demonstrate that individuals with very low whole
genome coverage (0.01X – 0.1X) can still be accurately assigned to
source populations with sufficient effective sample sizes. These results
suggest that increasing the number of samples and decreasing individual
sequencing depth is an effective study design strategy for population
assignment. For migratory connectivity studies, increased sampling (both
number of individuals at each location and the number of locations
sampled) across nonbreeding stages of the annual cycle can drastically
improve our understanding of population-level connectivity at low cost.
Combined with cost-effective approaches for library preparation (e.g. ),
lcWGS is increasingly becoming economically feasible for a wide range of
studies. However, a trade-off with lcWGS is that processing the sequence
data incurs additional costs in time spent on bioinformatic
analysis. For studies interested in population assignment
with a large number of samples, increasing the number of samples per
lane, thereby decreasing the mean sequencing depth, may make
lcWGS economically feasible compared to other sequencing methods. For a
comprehensive review of coverage guidelines for different types of
analyses with low-coverage WGS data, see Lou et al. (2021).
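
The relationship between samples per lane and mean depth can be made concrete with a rough calculation. The sketch below assumes that lane output is split evenly among samples and that every sequenced base maps to the genome, which overestimates realized depth. The lane yield and genome size are hypothetical round numbers chosen for illustration, not values from this study.

```python
import math


def mean_depth(lane_yield_bp, n_samples, genome_size_bp):
    """Expected mean depth per sample when a lane is shared evenly."""
    return lane_yield_bp / (n_samples * genome_size_bp)


def samples_per_lane(lane_yield_bp, target_depth, genome_size_bp):
    """Largest number of samples that still reaches the target mean depth."""
    return math.floor(lane_yield_bp / (target_depth * genome_size_bp))


# Hypothetical values: 100 Gb of sequence per lane and a 1 Gb genome.
LANE_YIELD_BP = 100_000_000_000
GENOME_SIZE_BP = 1_000_000_000

print(mean_depth(LANE_YIELD_BP, 100, GENOME_SIZE_BP))
print(samples_per_lane(LANE_YIELD_BP, 0.5, GENOME_SIZE_BP))
```

In practice, duplicate reads, unmapped reads, and uneven pooling reduce realized depth, so a planned design would leave some margin below these idealized numbers.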
An interesting aspect of our results was that all posterior
probabilities of assignment were > 0.8, even for
potentially admixed individuals. A standard method to determine
assignment confidence in population assignment studies is to use a
cutoff value for posterior probabilities of assignment. Individuals
with low posterior probabilities of assignment (e.g., < 0.8)
can be highly admixed, so classifying them as originating from a
specific population would be inaccurate. However, we suspect that with
lcWGS data, the high prevalence of loci covered by a single read results
in the likelihood being highest for a homozygous genotype at those loci.
Thus, admixed individuals may
“switch” their population of maximum likelihood depending on the loci
used for assignment. Our use of an assignment consistency threshold
addressed this concern by creating subsets of genomic data for
population assignment to determine if individuals could reliably be
assigned to a single population when different loci were used. Testing
the assignment consistency threshold with known source individuals
revealed three individuals with inconsistent assignment (consistency <
0.8, i.e., fewer than 8 out of 10 genomic datasets) that were likely
admixed between pure Northern Temperate and Southern Temperate
populations. These
results highlight that the consistency of assignment may be more
reliable than posterior probabilities for confidently assigning
individuals of unknown origin. Further development of spatially explicit
assignment methods for genotype likelihood data would be helpful for
determining the likely origin of admixed individuals at the periphery of
source populations.
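
The assignment consistency check described above can be sketched as follows. The sketch assumes that each individual has already been assigned once per genomic subset and that consistency is the fraction of subsets agreeing with the most frequent assignment. The function names and the example labels are hypothetical, and the 0.8 threshold follows the value reported in the text.

```python
from collections import Counter


def assignment_consistency(assignments):
    """Return the most frequent population and the fraction of subsets
    that assigned the individual to it.

    assignments: list of population labels, one per genomic subset.
    """
    counts = Counter(assignments)
    population, n_agree = counts.most_common(1)[0]
    return population, n_agree / len(assignments)


def is_consistent(assignments, threshold=0.8):
    """True when the individual meets the consistency threshold."""
    _, consistency = assignment_consistency(assignments)
    return consistency >= threshold


# Hypothetical assignments of two individuals across 10 genomic subsets.
stable = ["NorthernTemperate"] * 8 + ["SouthernTemperate"] * 2
switching = ["NorthernTemperate"] * 6 + ["SouthernTemperate"] * 4

print(assignment_consistency(stable), is_consistent(stable))
print(assignment_consistency(switching), is_consistent(switching))
```

An individual that switches populations across subsets, as in the second example, would be flagged as inconsistently assigned even if each single-subset posterior probability were high.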