PoolParty2:
An integrated pipeline for analyzing pooled or indexed low coverage
whole genome sequencing data to discover the genetic basis of diversity
Stuart Willis*1, Steven Micheletti2,
Kimberly R. Andrews3, and Shawn
Narum1
1Hagerman Genetics Lab, Columbia River Inter-Tribal
Fish Commission, Hagerman, ID, USA
2Dept. Zoology, University of British Columbia,
Vancouver, BC, Canada
3Institute for Interdisciplinary Data Sciences,
University of Idaho, Moscow, ID, USA
4Dept. Fishery Science, Columbia River Inter-Tribal
Fish Commission, Portland, OR, USA
*corresponding author’s email: swillis@critfc.org
Abstract
Whole
genome sequencing data allow survey of variation from across the genome,
reducing the constraint of balancing genome sub-sampling with
recombination rates and linkage between sampled markers and target loci.
As sequencing costs decrease, low coverage whole genome sequencing of
pooled or indexed-individual samples is commonly utilized to identify
loci associated with phenotypes or environmental axes in non-model
organisms. There are, however, relatively few publicly available
bioinformatic pipelines designed explicitly to analyze these types of
data, and fewer still that process the raw sequencing data, provide
useful metrics of quality control, and then execute analyses. Here, we
present an updated version of a bioinformatics pipeline called
PoolParty2
that can effectively handle either pooled or indexed DNA samples and
includes new features to improve computational efficiency. Using
simulated data, we demonstrate the ability of our pipeline to recover
segregating variants, estimate their allele frequencies accurately, and
identify genomic regions harboring loci under selection. Based on the
simulated data set, we benchmark the efficacy of our pipeline with
another bioinformatic suite, angsd, and illustrate the
compatibility and complementarity of these suites by using
angsd
to generate genotype likelihoods as input for identifying linkage
outlier regions using alignment files and variants provided by
PoolParty2. Finally, we apply our updated pipeline to an
empirical dataset of low coverage whole genomic data from uncurated
population samples of Columbia River steelhead trout (Oncorhynchus
mykiss ), results from which demonstrate the genomic impacts of decades
of artificial selection in a prominent hatchery stock.
Introduction
A primary goal of molecular ecology is to understand the genetic basis
of diversity such as targets of divergent selection or loci underlying
heritable life history variations or ecotypes. Critical to this endeavor
is the ability to survey the genome to discover genetic variants
associated with phenotypic differences or environmental axes (Günther &
Coop, 2013; Hoban et al., 2016; Paril, Balding, & Fournier-Level,
2022). Massively parallel or ‘next-generation’ sequencing has
dramatically decreased the cost of surveying genetic variation across
statistically meaningful numbers of individuals and has made these kinds
of investigations accessible for researchers working with limited
budgets on non-model organisms. However, despite the rapid decrease in
per-base sequencing costs, sequencing the complete genome of each
surveyed individual at high coverage is often not practical, in part
because as the cost of sequencing has decreased but demands for
statistically robust sample sizes have become more ardent
(Schlötterer,
Tobler, Kofler, & Nolte, 2014). As a result, geneticists are still
faced with the task of determining the appropriate compromise between
number of reads devoted to surveying each individual, which in many
cases determines the extent of the genome that can be observed, and the
number of individuals surveyed
(Lou,
Jacobs, Wilder, & Therkildsen, 2021).
This compromise has been addressed in a number of ways depending on the
goals of the individual project. Many researchers have opted to survey
only a fraction of the genome, creating ‘reduced representation’
libraries wherein sequencing coverage is spread across fewer,
reproducible subsets of
loci
(e.g. Baird et al., 2008). With careful tuning of library preparation
and sequencing methods, the coverage at each locus may be sufficient to
confidently infer genotypes across nearly all individuals at hundreds to
tens of thousands of variable loci (Andrews, Good, Miller, Luikart, &
Hohenlohe, 2016; Puritz et al., 2014). For analyses where individual
genotypes are important but high genomic density of loci is not
critical, such as determining relatedness or migration among recently
diverged populations, these reduced representation techniques can
produce data cost effectively for hundreds of individuals (e.g. Willis,
Hollenbeck, Puritz, & Portnoy, 2022). However, while these techniques
provide data on many more loci than what was historically accessible,
only a fraction of the genome is ultimately surveyed, meaning that for
species for which linkage blocks are typically less than 100Kb, many
linkage blocks may not be surveyed. As a result, except in cases of
regions of high linkage disequilibrium such as inversions or strong
selective sweeps, investigations that only survey a few thousand linkage
groups may fail to identify loci strongly associated with selection or
heritable phenotypic variation
(Lowry
et al., 2017; Tiffin & Ross-Ibarra, 2014).
On
the other hand, for many association and genome scan methods the input
is allele frequencies rather than individual genotypes. And notably,
because of sampling variance, sampling more individuals at low coverage
actually provides more accurate estimates of phenotype or population
allele frequencies than sequencing fewer individuals at high coverage
(Futschik & Schlötterer, 2010; Günther & Coop, 2013; Schlötterer et
al., 2014; Zhu, Bergland, González, & Petrov, 2012). Many analyses,
including those that compare allele frequencies between phenotypic
variants or populations situated along an environmental gradient and
depend on high density sampling across linkage groups to discover the
regions of highest divergence, may thus be performed more effectively by
low coverage whole genome sequencing (lcWGS)
(Lamichhaney
et al., 2012; Lou et al., 2021; Schlötterer et al., 2014; Therkildsen &
Palumbi, 2017). Moreover, there have been a proliferation of analyses
that are able to account for uncertainty in the genotype of each
individual (likelihoods), even with data sequenced with <1x
coverage per individual (Lou et al., 2021). This low coverage sequencing
approach provides compromise among the portion of the genome surveyed,
accurate allele frequency estimates, and in many cases analyses that
require individual genotype data (Lou et al., 2021; Therkildsen &
Palumbi, 2017). However, while lcWGS data may be highly appropriate for
these types of investigations and the toolkit for analyzing allele
frequency and genotype probability data is expanding, there remain few
pipelines specifically designed to take unmapped lcWGS reads and convey
the data through quality control and bioinformatic analyses.
To address that need, an integrated, modular bioinformatic pipeline,
PoolParty, was developed that facilitates the use of lcWGS data
to search for genomic regions showing strong divergence between samples
with discrete phenotypic differences or other group-wise characteristics
(Micheletti & Narum, 2018). This pipeline has been applied to detect
genome-wide genetic association across multiple species (e.g.
Aguirre-Ramirez, Velasco-Cuervo, & Toro-Perea, 2021; Horn, Kamphaus,
Murdoch, & Narum, 2020; Lyu et al., 2021; Ren et al., 2021).
Although
most published applications have utilized data from libraries of pooled
DNA, the pipeline can also utilize data from individuals sequenced in
multiplex using indexed or barcoded libraries, which allows a
normalization procedure that corrects for uneven contribution to group
allele frequencies across individuals. This normalization is a
pseudo-genotying method wherein each individual, regardless of total
reads, is allowed to contribute only one or two alleles per locus to the
count from which allele frequencies are calculated, depending on that
individual’s depth of coverage and the ratio of major and minor alleles
(>10:1 is considered a homozygote; Figure 1).
PoolParty
shares this goal of managing uneven contribution among individuals when
estimating allele frequency with another bioinformatic suite designed
for use with
lcWGS
data,
angsd,
which also generates individual genotype likelihood or posterior
probabilities from lcWGS data (Korneliussen, Albrechtsen, & Nielsen,
2014).
However,
PoolParty
takes sequence read files as input, performs sequence cleaning and
mapping to a reference genome, produces numerous assurance reports
regarding sequence and mapping quality, and facilitates several analyses
to identify regions of significant genomic divergence between samples,
while
angsd
requires mapped read alignments produced by other tools as input.
Moreover, as demonstrated below, the alignment files created by
PoolParty
are compatible input to
angsd,
making these complementary bioinformatic tools for lcWGS data analysis.
To demonstrate various utilities and upgrades of the PoolParty2
pipeline and compatibility with angsd, we apply it to two lcWGS
datasets, one simulated and one empirical, that reflect the type of
questions to which PoolParty2 may be routinely applied. We
utilize data simulated to reflect different demographic contexts and
degrees of sequence coverage to show the relative strengths, accuracy,
and complementarity of
PoolParty2
and angsd to identify segregating loci, estimate their allele
frequencies, and identify outlier loci and the boundaries regions
affected by selection. Then, using barcoded lcWGS data from natural and
hatchery populations of steelhead trout (anadromous Oncorhynchus
mykiss ), we demonstrate the potential of integrated application of
these bioinformatic suites to identify regions under selection in
landscape-level population samples.