Estimates of heterozygosity from single nucleotide polymorphism markers
are context dependent and often wrong
Abstract
Heterozygosity is frequently used to describe variation in genetic
diversity amongst populations and is often estimated using single
nucleotide polymorphisms (SNPs). However, methods of calculating
heterozygosity from SNPs have been shown to be affected by study design and
filtering parameters, reducing their utility and comparability across
studies. Although solutions have been proposed for the identified
problems, we continued to see inconsistent results in our own data.
Here, we aimed to further improve methods of reducing inconsistency in
these results, specifically by investigating how sample size and missing
data thresholds influenced autosomal estimates of heterozygosity
(heterozygosity calculated from across the genome, i.e., both fixed and
variable sites). We also investigated how the exclusion of tri- and
tetra-allelic sites, which is generally standard practice in such
studies, could affect eventual estimates of heterozygosity. Across three
distinct taxa (a frog, Litoria rubella; a tree, Eucalyptus
microcarpa; and a grasshopper, Keyacris scurra) we found
autosomal heterozygosity estimates to be affected by sample size when
missing data are not allowed, and show that this is partly due to the
exclusion of tri- and tetra-allelic loci. We also show that the biases
introduced by these factors are not consistent between species, or even
populations, with higher levels of actual heterozygosity tending to
result in larger adverse effects. We propose a modified framework for
calculating heterozygosity to reduce these inherent issues and highlight
the need for further development in methods such that tri- and
tetra-allelic sites can be included in the calculation of population
genomics statistics.
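
As a rough illustration of why excluding tri- and tetra-allelic sites can shift an estimate, the following minimal Python sketch computes a simple observed heterozygosity across both fixed and variable sites. It is not the framework proposed in this paper: the function name, the genotype encoding (a pair of allele letters, with None for a missing call), and the toy data are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: a diploid call is a pair of alleles; None is a missing call.
Genotype = Optional[Tuple[str, str]]


def autosomal_heterozygosity(
    sites: List[List[Genotype]], drop_multiallelic: bool = False
) -> float:
    """Fraction of non-missing genotype calls that are heterozygous,
    counted over every retained site (fixed and variable alike)."""
    het_calls = 0
    total_calls = 0
    for site in sites:
        called = [g for g in site if g is not None]
        alleles = {allele for g in called for allele in g}
        # Optionally mimic the common practice of discarding sites
        # with more than two observed alleles.
        if drop_multiallelic and len(alleles) > 2:
            continue
        het_calls += sum(1 for a, b in called if a != b)
        total_calls += len(called)
    return het_calls / total_calls if total_calls else float("nan")


# Toy data (invented): four sites genotyped in three individuals.
sites = [
    [("A", "A"), ("A", "A"), ("A", "A")],  # fixed site
    [("A", "G"), ("A", "A"), ("G", "G")],  # bi-allelic site
    [("A", "C"), ("C", "T"), ("A", "T")],  # tri-allelic site
    [("G", "G"), None, ("G", "G")],        # fixed site with one missing call
]

with_multi = autosomal_heterozygosity(sites)
without_multi = autosomal_heterozygosity(sites, drop_multiallelic=True)
print(with_multi, without_multi)
```

In this toy example the estimate falls from 4/11 to 1/8 once the tri-allelic site is dropped, because every call at that site is heterozygous. Real data sets would differ in scale, but the direction of the effect is the one described above.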