Interrogation of the Ag1000g dataset and identification of tagging markers
Whole genome sequence data in the Ag1000g dataset have previously revealed a strong selective sweep in Ugandan populations around the Cyp6aa/Cyp6p cluster [19] (Figure 1). An isoleucine to methionine substitution in codon 236 of CYP6P4 (see extended data Figure 10b in [19]) was provisionally identified as a SNP tagging the swept haplotype. Previous work [20] had shown that a duplication of the Cyp6aa1 gene (previously termed Cyp6aap-Dup1) was also present in these samples. To objectively determine how these mutations segregated with the observed selective sweep [19], we grouped the 206 Ugandan haplotypes (n=103 diploid individuals) by similarity using the 1000 SNPs located immediately upstream and downstream of the start of the Cyp6aa/Cyp6p gene cluster (500 non-singleton SNPs in each direction from position 2R:28,480,576). Distances were calculated with the pairwise_distance function in scikit-allel [27] and converted to a nucleotide divergence matrix (Dxy statistic) by correcting the distance by the number of sequencing-accessible bases in that region. We defined clusters of highly similar haplotypes by hierarchical clustering with a cutoff distance of 0.001. This identified a cluster of 122 highly similar haplotypes. To determine whether this haplotype cluster showed signs of a selective sweep, we estimated the decay of extended haplotype homozygosity (EHH) within each of the haplotype groupings around two different focal loci: (i) the putative sweep SNP marker Cyp6p4-236M (2R:28,497,967 ± 200 kbp; total 14,243 phased variants), and (ii) the 5′ and 3′ breakpoints of the Cyp6aap-Dup1 duplication (2R:28,480,189 - 200 kbp and 2R:28,483,475 + 200 kbp; total 14,398 phased variants). EHH decay was computed with the ehh_decay function of scikit-allel [27].
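The clustering and EHH steps described above can be illustrated with a minimal sketch on toy data. This is a plain-Python stand-in for illustration only, not the scikit-allel calls used in the analysis. Several details are assumptions: the linkage method is not stated in the text, so single linkage is assumed (equivalent to joining any two haplotypes whose divergence is within the cutoff); the Dxy correction is assumed to be the number of differing sites divided by the number of accessible bases; and the toy haplotypes, `n_accessible`, and all function names are hypothetical.

```python
from itertools import combinations


def dxy_matrix(haplotypes, n_accessible):
    """Pairwise divergence: differing sites / accessible bases (assumed form)."""
    n = len(haplotypes)
    dxy = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        n_diff = sum(a != b for a, b in zip(haplotypes[i], haplotypes[j]))
        dxy[i][j] = dxy[j][i] = n_diff / n_accessible
    return dxy


def cluster_by_cutoff(dxy, cutoff):
    """Single-linkage clusters (assumed): merge any pair with distance <= cutoff."""
    parent = list(range(len(dxy)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(dxy)), 2):
        if dxy[i][j] <= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(dxy)):
        clusters.setdefault(find(i), []).append(i)
    # Largest cluster first.
    return sorted(clusters.values(), key=len, reverse=True)


def ehh_decay(haplotypes):
    """Fraction of haplotype pairs identical from the first variant up to each variant."""
    n = len(haplotypes)
    n_pairs = n * (n - 1) // 2
    ehh = []
    for k in range(1, len(haplotypes[0]) + 1):
        shared = sum(
            haplotypes[i][:k] == haplotypes[j][:k]
            for i, j in combinations(range(n), 2)
        )
        ehh.append(shared / n_pairs)
    return ehh


# Toy data: 5 haplotypes x 6 variants (0 = reference, 1 = alternate allele).
toy_haplotypes = [
    (0, 1, 1, 0, 0, 1),
    (0, 1, 1, 0, 0, 1),
    (0, 1, 1, 0, 1, 1),
    (1, 0, 0, 1, 1, 0),
    (1, 0, 1, 1, 0, 1),
]
n_accessible = 2000  # hypothetical number of accessible bases in the window

dxy = dxy_matrix(toy_haplotypes, n_accessible)
clusters = cluster_by_cutoff(dxy, cutoff=0.001)

# EHH decay within the largest cluster, moving away from the first variant.
largest_cluster = [toy_haplotypes[i] for i in clusters[0]]
ehh = ehh_decay(largest_cluster)
```

In this toy example the first three haplotypes differ at no more than one site and fall into a single cluster, while the remaining two stay separate; EHH within that cluster stays at 1 until the first site at which its members differ.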
We also used the haplotype groupings identified above to calculate the Garud H statistics [28] and the haplotypic diversity at the Cyp6aa/Cyp6p cluster locus (coordinates 2R:28,480,576 to 2R:28,505,816). Specifically, we used the moving_garud_h and moving_haplotype_diversity functions in scikit-allel to obtain a series of estimates for each statistic in blocks of 100 variants located within the cluster, and used a block-jackknife procedure to calculate the average and standard error of each statistic (jackknife function in the scikit-allel misc module).
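As a rough illustration of the block-wise statistics and the jackknife summary, the sketch below computes Garud's H statistics and haplotype diversity in non-overlapping blocks of variants and then takes a leave-one-block-out jackknife of the block means. It is a plain-Python approximation, not the scikit-allel implementation: the toy haplotypes, the block size of 2 (the analysis uses 100), and all function names are hypothetical, and the jackknife standard error is assumed to take the usual delete-one form.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def garud_h(haplotypes):
    """Return (H1, H12, H123, H2/H1) from distinct-haplotype frequencies."""
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(f * f for f in freqs)
    h12 = sum(freqs[:2]) ** 2 + sum(f * f for f in freqs[2:])
    h123 = sum(freqs[:3]) ** 2 + sum(f * f for f in freqs[3:])
    h2_h1 = (h1 - freqs[0] ** 2) / h1
    return h1, h12, h123, h2_h1


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity: (1 - sum f_i^2) * n / (n - 1)."""
    n = len(haplotypes)
    freqs = [c / n for c in Counter(haplotypes).values()]
    return (1 - sum(f * f for f in freqs)) * n / (n - 1)


def moving_blocks(haplotypes, size):
    """Yield the haplotypes restricted to successive non-overlapping blocks."""
    n_variants = len(haplotypes[0])
    for start in range(0, n_variants - size + 1, size):
        yield [h[start:start + size] for h in haplotypes]


def jackknife(values, statistic):
    """Delete-one jackknife: mean and standard error of the leave-one-out estimates."""
    n = len(values)
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    m = mean(leave_one_out)
    se = sqrt((n - 1) / n * sum((v - m) ** 2 for v in leave_one_out))
    return m, se


# Toy data: 4 haplotypes x 6 variants, analysed in blocks of 2 variants.
toy_haplotypes = [
    (0, 1, 1, 0, 0, 1),
    (0, 1, 1, 0, 0, 1),
    (0, 1, 1, 0, 1, 1),
    (1, 0, 1, 0, 1, 0),
]

blocks = list(moving_blocks(toy_haplotypes, size=2))
h12_values = [garud_h(block)[1] for block in blocks]
hapdiv_values = [haplotype_diversity(block) for block in blocks]

# Block-jackknife average and standard error of each statistic.
h12_mean, h12_se = jackknife(h12_values, mean)
hapdiv_mean, hapdiv_se = jackknife(hapdiv_values, mean)
```

A haplotype group dominated by one or two haplotypes gives H12 close to 1 and haplotype diversity close to 0 in every block, which is the pattern expected for a swept cluster.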