Box 3: Maximising the advantage of whole genome sequencing with haplotype data
All sequencing technologies allow allele frequencies to be measured. One of the key advantages of whole-genome resequencing over other technologies is the opportunity to exploit additional information, such as the haplotypes on which physically linked alleles are coinherited. Haplotype data enable the use of several powerful analytical methods (reviewed by Leitwein, Duranton, Rougemont, Gagnaire, & Bernatchez, 2020) that are relevant to invasion genomics.
Because recombination and mutation reconfigure haplotypes over time, the size and frequency of haplotypes convey evolutionary information – a phenomenon that Moorjani et al. (2016) refer to as the ‘recombination clock’. For example, a haplotype on which a beneficial allele arises is swept to fixation faster than recombination can break it down to its expected size under neutrality. Therefore, a signature of selection is left by unusually large stretches of haplotype homozygosity (i.e., linkage extends further from the selected locus than expected), and by the unexpectedly high frequency of a core haplotype (Sabeti et al., 2002). This is the basis for tests of extended haplotype homozygosity, used to scan the genome for signatures of selection (see Parts 1 and 4). Haplotype data are also useful for reconstructing population size change through time (Part 3). By analysing long haplotypes identical by descent (that have not yet been broken down by recombination), Browning and Browning (2015) were able to accurately reconstruct changes in human population size in the recent past (4 to 50 generations before present). This approach holds great potential for invasion genetics, where it is often difficult to reconstruct recent demography (see Part 3.1).
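To make the idea of extended haplotype homozygosity concrete, the sketch below computes EHH in the sense of Sabeti et al. (2002): the probability that two chromosomes drawn at random from those carrying the core allele are identical across the whole interval from the core site to a more distant site. It is a minimal illustration only. It assumes phased haplotypes encoded as equal-length strings of '0' and '1', and the function name `ehh` and the toy data are hypothetical rather than taken from any published tool.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """Extended haplotype homozygosity between two site indices.

    haplotypes: phased haplotypes that all carry the same core allele,
                given as equal-length sequences (assumed encoding: '0'/'1').
    core, end:  indices of the core site and of the distant site.
    Returns the fraction of haplotype pairs that are identical at every
    site from `core` to `end`, inclusive.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = min(core, end), max(core, end)
    # Group haplotypes by their sequence over the interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    # Pairs that are identical within a group, over all possible pairs.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy example: four haplotypes carrying the core allele at index 0.
carriers = ["1101", "1101", "1100", "1011"]
decay = [ehh(carriers, 0, end) for end in range(4)]
print(decay)
```

In this toy data EHH starts at 1.0 at the core site and falls as the interval widens, because recombination and mutation have made the carrier haplotypes diverge away from the core. Under a recent sweep, the decay would be expected to be slower than for a neutral haplotype of the same frequency.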
Haplotype data show most promise in recently admixed populations (see Part 5). Any analysis of hybridization using haplotype data will require the ancestry of an introgressed haplotype (‘ancestry tract’ or ‘ancestry block’) to be inferred (for a review of approaches to ancestry assignment see Leitwein et al., 2020). Duranton et al. (2019) studied the introgression of Atlantic sea bass (Dicentrarchus labrax) into Mediterranean populations of the same species. By modelling the diffusion of introgressed haplotypes through space (by gene flow) as they are broken down over time (by recombination), the average per-generation dispersal distance could then be estimated. This approach is likely to be useful for reconstructing the spatial extent of introgression in invasive species (see Parts 2 and 5). Finally, adaptive introgression can be accurately detected using haplotype data (see Shchur, Svedberg, Medina, Corbett-Detig, & Nielsen, 2020). In summary, haplotype data open many possibilities in invasion genetics research, representing one of the key advantages of using WGR to study invasive species.
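The way recombination shortens introgressed tracts over time can be illustrated with a commonly used single-pulse approximation, in which the expected tract length in Morgans is 1 / ((1 - m)(t - 1)), where m is the proportion of introgressed ancestry and t is the number of generations since admixture. The sketch below is an assumption-laden illustration only: it is not the spatial diffusion model of Duranton et al. (2019), it ignores ongoing gene flow, selection and recombination rate variation, and both function names are hypothetical.

```python
def expected_tract_length(m, t):
    """Expected introgressed tract length in Morgans.

    Assumes a single admixture pulse t generations ago (t > 1) with
    introgressed ancestry proportion m (0 <= m < 1).
    """
    if not 0 <= m < 1 or t <= 1:
        raise ValueError("require 0 <= m < 1 and t > 1")
    return 1.0 / ((1.0 - m) * (t - 1))


def generations_since_admixture(mean_tract_length, m):
    """Invert the approximation: estimate t from the mean tract length."""
    if mean_tract_length <= 0 or not 0 <= m < 1:
        raise ValueError("require mean_tract_length > 0 and 0 <= m < 1")
    return 1.0 + 1.0 / ((1.0 - m) * mean_tract_length)


# Illustrative values only: 10% introgressed ancestry, 101 generations.
length = expected_tract_length(m=0.1, t=101)
print(f"expected tract length: {length * 100:.2f} cM")
print(f"recovered t: {generations_since_admixture(length, m=0.1):.1f}")
```

Shorter mean tract lengths imply older admixture under this approximation, which is the intuition behind using tract lengths to date introgression.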
However, haplotype information cannot be directly extracted from WGR data generated using short reads. Therefore, until long-read sequencing becomes scalable, direct or indirect methods for inferring gametic phase (i.e., the two DNA sequences on which alleles occur, in the case of diploids) need to be used to leverage haplotype information from WGR data.
Indirect or statistical phasing methods can be applied to whole-genome datasets obtained with short-read sequencing technology (reviewed by Rhee et al., 2016). The accuracy of these methods depends on factors such as the number of samples and the density of nucleotide polymorphisms (Browning & Browning, 2007). Phasing errors can affect the downstream biological interpretations made by analysing haplotypes. Direct phasing methods, on the other hand, record chromosomal haplotypes during the generation of sequence data. Linked-read sequencing is a newly developed family of direct phasing technologies that results in fewer errors than indirect statistical approaches (Amini et al., 2014; Choi, Chan, Kirkness, Telenti, & Schork, 2018).
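One common way to quantify phasing errors is the switch error rate: the proportion of consecutive pairs of heterozygous sites at which the inferred phase flips relative to a trusted reference phase. The sketch below is a minimal illustration for a single diploid individual. It assumes that haplotypes are equal-length strings of alleles and that the inferred and reference genotypes agree at every site; the function name `switch_error_rate` and the example data are hypothetical.

```python
def switch_error_rate(true_h1, true_h2, inf_h1, inf_h2):
    """Switch error rate of an inferred haplotype pair against a reference.

    Only heterozygous sites carry phase information, so homozygous
    sites are skipped. Raises ValueError if genotypes disagree or if
    there are fewer than two heterozygous sites.
    """
    orientation = []
    for a, b, x, y in zip(true_h1, true_h2, inf_h1, inf_h2):
        if a == b:
            continue  # homozygous site: no phase to get wrong
        if (x, y) == (a, b):
            orientation.append(0)  # inferred phase matches reference
        elif (x, y) == (b, a):
            orientation.append(1)  # inferred phase is flipped
        else:
            raise ValueError("inferred genotype differs from reference")
    if len(orientation) < 2:
        raise ValueError("need at least two heterozygous sites")
    # A switch occurs wherever the orientation changes between
    # consecutive heterozygous sites.
    switches = sum(p != q for p, q in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)


# Toy example: six heterozygous sites, with one switch after the third.
rate = switch_error_rate("010110", "101001", "010001", "101110")
print(rate)
```

A single switch error splits a long haplotype into two shorter ones, so even a low rate can bias analyses that depend on haplotype length, such as those based on extended homozygosity or identity by descent.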
Though linked-read sequencing approaches show great promise in population genomics (e.g., Lutgen et al., 2020), many platforms are currently prohibitively expensive. One notable exception is haplotagging, a recent low-cost linked-read sequencing method (Meier et al., 2020). Through haplotagging, kilobase-length DNA fragments are tagged with unique barcodes as they wrap around unique microbeads in solution.