Interrogation of the Ag1000g dataset and identification of
tagging markers
Whole-genome sequence data in the Ag1000g dataset have previously
revealed a strong selective sweep in Ugandan populations around the
Cyp6aa/Cyp6p cluster [19] (Figure 1). An isoleucine to
methionine substitution in codon 236 of CYP6P4 (see extended data Figure
10b in [19]) was identified provisionally as a swept haplotype
tagging SNP. Previous work [20] had shown that a duplication of the
Cyp6aa1 gene (previously termed Cyp6aap-Dup1) was also present in these
samples. To objectively determine how these
mutations segregated with the observed selective sweep [19], we
grouped the 206 Ugandan haplotypes (n=103 diploid individuals) by
similarity using the 1000 SNPs located immediately upstream and
downstream of the start of the Cyp6aa/Cyp6p gene cluster (500
non-singleton SNPs in each direction from position 2R:28,480,576).
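As an illustration of this window definition, the following minimal Python sketch selects the nearest non-singleton SNPs on either side of a focal position. It is not the code used in the study: the function name flanking_nonsingletons, its arguments, and the reading of "non-singleton" as a minor allele count of at least two are assumptions made for the example.

```python
from bisect import bisect_left


def flanking_nonsingletons(positions, alt_counts, n_haplotypes, focal_pos,
                           n_each_side=500):
    """Indices of the nearest non-singleton SNPs on each side of focal_pos.

    positions    : sorted variant positions on one chromosome arm
    alt_counts   : alternate-allele count of each variant across haplotypes
    n_haplotypes : total number of haplotypes (206 in the text)
    """
    def is_non_singleton(ac):
        # Assumption: keep variants whose minor allele is carried by at
        # least two haplotypes (this also drops invariant sites).
        return min(ac, n_haplotypes - ac) >= 2

    # First index whose position is >= focal_pos.
    split = bisect_left(positions, focal_pos)

    # Walk outwards from the focal position in each direction.
    upstream = [i for i in range(split - 1, -1, -1)
                if is_non_singleton(alt_counts[i])][:n_each_side]
    downstream = [i for i in range(split, len(positions))
                  if is_non_singleton(alt_counts[i])][:n_each_side]
    return sorted(upstream) + downstream
```

With n_each_side=500 and focal_pos=28480576 this would return up to 1000 variant indices, which can then be used to subset the haplotype array before computing distances.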
Distances were calculated with the pairwise_distance function in
scikit-allel [27] and converted to a nucleotide divergence
matrix (Dxy statistic) by correcting the distance by the number
of sequencing-accessible bases in that region. We defined clusters of
highly similar haplotypes by hierarchical clustering with a cutoff
distance of 0.001. This resulted in the identification of a cluster of
122 highly similar haplotypes. To determine whether the haplotype
cluster showed signs of a selective sweep, we estimated the extended
haplotype homozygosity (EHH) decay within each of the haplotype
groupings, around two different focal loci: (i) the putative sweep SNP
marker Cyp6p4-236M (2R:28,497,967 +/- 200 kbp; total 14,243
phased variants), and (ii) the 5′ and 3′ breakpoints of the
Cyp6aap-Dup1 duplication (2R:28,480,189 - 200 kbp and
2R:28,483,475 + 200 kbp; total 14,398 phased variants). We used the
ehh_decay function of scikit-allel [27].
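The grouping and EHH steps described above can be sketched in plain Python as follows. This is an illustrative re-implementation of the underlying calculations, not the scikit-allel code used in the study. Several details are assumptions: the function names are hypothetical, haplotypes are stored one per row (scikit-allel stores variants in rows), the "correction" is taken to be division by the number of accessible bases, and single linkage is assumed because the text does not state the linkage method.

```python
from collections import Counter
from itertools import combinations
from math import comb


def dxy_matrix(haps, n_accessible):
    """Pairwise allelic differences divided by the number of accessible bases.

    haps is a list of haplotypes, each a sequence of 0/1 alleles over the
    same variants. Division by n_accessible is an assumed reading of the
    correction described in the text.
    """
    n = len(haps)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        n_diff = sum(a != b for a, b in zip(haps[i], haps[j]))
        d[i][j] = d[j][i] = n_diff / n_accessible
    return d


def cluster_by_cutoff(d, cutoff):
    """Cut a single-linkage hierarchy at `cutoff` (linkage method assumed).

    This equals the connected components of the graph that links every
    pair of haplotypes whose distance is <= cutoff. Largest cluster first.
    """
    parent = list(range(len(d)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(d)), 2):
        if d[i][j] <= cutoff:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(d)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


def ehh_decay_sketch(haps):
    """EHH at each variant, moving away from the first variant in `haps`.

    EHH at variant k is the fraction of haplotype pairs that are identical
    at every variant from the first up to and including k.
    """
    if len(haps) < 2:
        raise ValueError("EHH needs at least two haplotypes")
    n_pairs = comb(len(haps), 2)
    ehh = []
    for k in range(1, len(haps[0]) + 1):
        prefix_counts = Counter(tuple(h[:k]) for h in haps)
        ehh.append(sum(comb(c, 2) for c in prefix_counts.values()) / n_pairs)
    return ehh
```

In this sketch, cluster_by_cutoff(dxy_matrix(haps, n_accessible), 0.001) returns lists of haplotype indices, and ehh_decay_sketch is then applied to the haplotypes of one group at a time. Because the decay is measured away from the first variant, the variants on the 5′ side of a focal locus would need to be passed in reverse order.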
We also used the haplotype groupings identified above to calculate the
Garud H statistics [28] and the haplotypic diversity at the Cyp6aa/Cyp6p
cluster locus (coordinates 2R:28,480,576 to 2R:28,505,816).
Specifically, we used the moving_garud_h and moving_haplotype_diversity
functions in scikit-allel to obtain a series of estimates for
each statistic in blocks of 100 variants located within the cluster, and
used a block-jackknife procedure to calculate averages and standard
errors of each estimate (jackknife function in the scikit-allel misc module).
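To make the block-wise statistics concrete, the sketch below computes the Garud H statistics and haplotype diversity in non-overlapping blocks of variants and summarises them with a delete-one-block jackknife. It is a plain-Python illustration under stated assumptions, not the scikit-allel implementation: all function names are hypothetical, haplotypes are stored one per row, haplotype diversity uses the n/(n-1) sample-size correction, and the jackknifed statistic is assumed to be the mean across blocks.

```python
from collections import Counter
from math import sqrt


def hap_frequencies(haps):
    """Frequencies of distinct haplotypes, most common first."""
    counts = Counter(tuple(h) for h in haps)
    return sorted((c / len(haps) for c in counts.values()), reverse=True)


def garud_h(haps):
    """Return (H1, H12, H123, H2/H1) for one block of variants."""
    f = hap_frequencies(haps) + [0.0, 0.0]  # pad so f[1] and f[2] exist
    h1 = sum(p * p for p in f)
    h12 = (f[0] + f[1]) ** 2 + sum(p * p for p in f[2:])
    h123 = (f[0] + f[1] + f[2]) ** 2 + sum(p * p for p in f[3:])
    h2_h1 = (h1 - f[0] ** 2) / h1
    return h1, h12, h123, h2_h1


def haplotype_diversity(haps):
    """(1 - sum of squared haplotype frequencies) * n / (n - 1)."""
    n = len(haps)
    return (1 - sum(p * p for p in hap_frequencies(haps))) * n / (n - 1)


def moving_statistic(haps, size, statistic):
    """Apply `statistic` to consecutive non-overlapping blocks of variants."""
    n_variants = len(haps[0])
    return [statistic([h[s:s + size] for h in haps])
            for s in range(0, n_variants - size + 1, size)]


def block_jackknife_mean(values):
    """Delete-one-block jackknife estimate and standard error of the mean."""
    n = len(values)
    total = sum(values)
    # Mean of the remaining blocks after leaving each block out in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    m = sum(leave_one_out) / n
    se = sqrt((n - 1) / n * sum((x - m) ** 2 for x in leave_one_out))
    return m, se
```

For the analysis described in the text, size would be 100 and haps would hold the haplotypes of a single group restricted to variants within the cluster coordinates; for example, block_jackknife_mean([g[1] for g in moving_statistic(haps, 100, garud_h)]) would summarise H12 across blocks.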