Bioinformatics
Read processing and variant calling were performed using the Fast-GBS pipeline with default parameters (Torkamaneh, Laroche, Bastien, Abed, & Belzile, 2017), which demultiplexes with sabre v1.000 (Joshi, 2011), trims with cutadapt v1.14 (Martin, 2011), maps with BWA MEM v0.7.12-r1039 (Li & Durbin, 2009) and calls variants with Platypus v0.8.1 (Rimmer et al., 2014). This produced a large SNP dataset, which was then further filtered with plink v1.90b4.6 (Purcell et al., 2007). Sites were removed if calls were missing for more than 10% of the samples or if the minor allele frequency was below 0.01. Individuals with missing calls at 50% or more of the sites were also removed.
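The site and individual filters can be illustrated with a minimal Python sketch. This is not the plink implementation. The function name `filter_genotypes`, the genotype coding (0/1/2 alternate-allele counts, `None` for a missing call) and the order of operations (sites first, then individuals evaluated over the retained sites) are assumptions made for illustration only.

```python
def filter_genotypes(geno, max_site_missing=0.10, min_maf=0.01,
                     max_ind_missing=0.50):
    """Return (kept individual indices, kept site indices).

    geno: list of rows, one per individual; each row lists alternate-allele
    counts (0, 1 or 2) per site, with None marking a missing call.
    """
    n_ind = len(geno)
    n_sites = len(geno[0]) if n_ind else 0

    keep_sites = []
    for j in range(n_sites):
        calls = [row[j] for row in geno if row[j] is not None]
        # Drop sites with missing calls in more than max_site_missing of samples.
        if (n_ind - len(calls)) / n_ind > max_site_missing:
            continue
        if not calls:
            continue
        # Minor allele frequency from diploid allele counts.
        alt_freq = sum(calls) / (2 * len(calls))
        if min(alt_freq, 1 - alt_freq) < min_maf:
            continue
        keep_sites.append(j)

    keep_inds = []
    for i, row in enumerate(geno):
        n_missing = sum(1 for j in keep_sites if row[j] is None)
        # Drop individuals missing calls at max_ind_missing or more of the sites.
        if keep_sites and n_missing / len(keep_sites) >= max_ind_missing:
            continue
        keep_inds.append(i)
    return keep_inds, keep_sites
```

With the default thresholds, a site is kept only if at most 10% of samples lack a call and its minor allele frequency is at least 0.01.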
To identify clonal replicates among collected plants, all samples were included in a principal component analysis (PCA) performed with plink (Purcell et al., 2007); within each visually identified genotype cluster in the PCA, a pairwise genetic distance matrix was constructed. To derive a group-specific similarity threshold for identifying identical genotypes, we then created a histogram of the pairwise distances and heuristically chose a threshold that separated the first and second peaks (see Fig. S1). Refined variant datasets were then created with duplicate genotypes removed; for each set of duplicate genotypes, only the accession with the highest coverage was retained.
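The thresholding and de-duplication steps can be sketched in Python as follows. In the study the threshold was chosen heuristically from the histogram, so the automated valley search here is only a stand-in. The function names, the bin count, the `frozenset`-keyed distance dictionary and the single-linkage grouping of samples below the threshold are all assumptions made for illustration.

```python
from itertools import combinations


def valley_threshold(distances, n_bins=20):
    """Midpoint of the valley between the first and second histogram peaks.

    Assumes the distances are roughly bimodal (clonal pairs vs. distinct pairs).
    """
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for d in distances:
        counts[min(int((d - lo) / width), n_bins - 1)] += 1
    i = 0
    # Climb to the first peak.
    while i + 1 < n_bins and counts[i + 1] >= counts[i]:
        i += 1
    # Descend to the end of the following valley.
    while i + 1 < n_bins and counts[i + 1] <= counts[i]:
        i += 1
    end = start = i
    # Walk back to the start of the valley floor.
    while start > 0 and counts[start - 1] == counts[end]:
        start -= 1
    return lo + ((start + end) / 2 + 0.5) * width


def deduplicate(samples, dist, coverage, threshold):
    """Keep the highest-coverage sample from each group of near-identical ones.

    dist maps frozenset({a, b}) to the genetic distance between a and b.
    """
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    # Single-linkage grouping: join any pair closer than the threshold.
    for a, b in combinations(samples, 2):
        if dist[frozenset((a, b))] < threshold:
            parent[find(a)] = find(b)

    best = {}
    for s in samples:
        root = find(s)
        if root not in best or coverage[s] > coverage[best[root]]:
            best[root] = s
    return sorted(best.values())
```

A typical call would compute `valley_threshold(list(dist.values()))` once per genotype cluster and pass the result to `deduplicate` together with per-sample coverage.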
To prevent biases due to multiple collections of the same genotype, the minor allele frequency and missing data filters were reapplied in plink to the dataset consisting of unique genotypes only. An additional PCA was then performed on this dataset. To assess population structure within diploid P. vaginatum, ADMIXTURE v1.3.0 (Alexander, Novembre, & Lange, 2009) analysis was performed on a further reduced dataset comprising coarse- and fine-textured diploids, with the assumed number of populations (K) ranging from 2 to 6. Cross-validation error estimation was used to determine the optimal K.
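Selecting K from the cross-validation output can be sketched as below. The sketch assumes that the ADMIXTURE logs for each K have been concatenated into one string and that each contains a line of the form `CV error (K=3): 0.512`. The function name `best_k` is hypothetical, and taking the minimum-error K is an assumed reading of "optimal".

```python
import re

# Matches lines such as "CV error (K=3): 0.512" in ADMIXTURE log output.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, {K: CV error}) parsed from log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors
```

When several values of K have similar errors, inspecting the full dictionary of errors is more informative than the single minimum.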