Bioinformatics
Read processing and variant calling were performed using the Fast-GBS
pipeline with default parameters (Torkamaneh, Laroche, Bastien, Abed, &
Belzile, 2017), which demultiplexes with sabre v1.000 (Joshi, 2011),
trims with cutadapt v1.14 (Martin, 2011), maps with BWA MEM
v0.7.12-r1039 (Li & Durbin, 2009) and calls variants with Platypus
v0.8.1 (Rimmer et al., 2014). This produced a large SNP dataset, which
was then further filtered with plink v1.90b4.6 (Purcell et al., 2007).
Sites were removed if calls were missing for more than 10% of the
samples or if the minor allele frequency was below 0.01. Individuals
missing calls at 50% or more of sites were also removed.
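The filtering logic above can be illustrated with a minimal Python sketch. It is not the plink implementation: the genotype layout (samples as rows, alternate-allele counts 0/1/2, None for a missing call), the function name filter_genotypes, and the order of the filters (sites first, then individuals, as listed in the text) are assumptions made for illustration and may differ from plink's internal order of operations.

```python
def filter_genotypes(genotypes, max_site_missing=0.10, min_maf=0.01,
                     max_sample_missing=0.50):
    """Return (kept_sample_indices, kept_site_indices).

    genotypes: list of rows (one per sample); each entry is the
    alternate-allele count (0, 1 or 2) or None for a missing call.
    """
    n_samples = len(genotypes)
    n_sites = len(genotypes[0]) if n_samples else 0

    kept_sites = []
    for j in range(n_sites):
        calls = [row[j] for row in genotypes if row[j] is not None]
        n_missing = n_samples - len(calls)
        # Drop sites with calls missing in more than 10% of samples.
        if n_missing / n_samples > max_site_missing:
            continue
        # Minor allele frequency from non-missing diploid calls.
        alt_freq = sum(calls) / (2 * len(calls))
        maf = min(alt_freq, 1 - alt_freq)
        if maf < min_maf:
            continue
        kept_sites.append(j)

    if not kept_sites:
        return [], []

    kept_samples = []
    for i, row in enumerate(genotypes):
        n_missing = sum(1 for j in kept_sites if row[j] is None)
        # Drop individuals missing calls at 50% or more of retained sites.
        if n_missing / len(kept_sites) >= max_sample_missing:
            continue
        kept_samples.append(i)

    return kept_samples, kept_sites
```

The thresholds are exposed as keyword arguments so that the same function can be reapplied to a reduced sample set.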
To identify clonal replicates among collected plants, all samples were
included in a principal component analysis (PCA) performed with plink
(Purcell et al., 2007); within each visually identified genotype cluster
in the PCA, a pairwise genetic distance matrix was constructed. To set a
group-specific threshold for identifying identical genotypes, we then
plotted a histogram of pairwise distances and heuristically chose a
threshold that separated the first and second peaks (see Fig.
S1). Refined variant datasets were then created with duplicate genotypes
removed; for each set of duplicate genotypes, only the accession with
the highest coverage was retained.
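The duplicate-removal step can be sketched as follows. This is an illustration rather than the code used in the study: it assumes a precomputed pairwise distance mapping, a threshold already chosen from the histogram, and that samples connected by any chain of below-threshold distances form one duplicate set (single linkage), a detail the text does not specify. The names collapse_clones, distances, and coverage are hypothetical.

```python
def collapse_clones(samples, distances, coverage, threshold):
    """Keep the highest-coverage sample from each set of duplicates.

    samples:   list of sample names within one PCA cluster
    distances: dict mapping frozenset({a, b}) -> pairwise genetic distance
    coverage:  dict mapping sample name -> sequencing coverage
    threshold: distance below which two samples count as one genotype
    """
    # Union-find structure: each sample starts as its own group.
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    # Merge every pair whose distance falls below the threshold.
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if distances[frozenset((a, b))] < threshold:
                parent[find(a)] = find(b)

    # Collect group members under their root.
    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)

    # Retain the accession with the highest coverage in each group.
    return sorted(max(members, key=lambda s: coverage[s])
                  for members in groups.values())
```

Because the threshold is group-specific, the function would be called once per PCA cluster with that cluster's own threshold.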
To prevent biases due to multiple collections of the same genotype,
minor allele frequency and missing data filters were repeated in plink
for the dataset consisting of unique genotypes only. An additional PCA
was then performed on this dataset. To assess population structure
within diploid P. vaginatum, ADMIXTURE v1.3.0 (Alexander, Novembre, &
Lange, 2009) analysis was performed on a further reduced dataset
comprising coarse- and fine-textured diploids, with the assumed number
of populations (K) ranging from 2 to 6. Cross-validation error
estimation was used to determine the optimal K.
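Selecting K by cross-validation amounts to taking the K with the lowest reported error. The sketch below assumes ADMIXTURE was run with its --cv option once per K and that each log contains a line of the form "CV error (K=3): 0.52305". The helper name best_k is hypothetical, and the parsing is an assumption about the log format rather than a documented interface.

```python
import re

# Matches lines such as "CV error (K=3): 0.52305" (assumed log format).
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, dict of all K -> error)."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors
```

Concatenating the logs for K = 2 to 6 and passing the text to best_k would return the K with the lowest cross-validation error together with the full table of errors for inspection.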