2.6 Variant calling and filtering
To compare population genetic results between the threatened A. cantabrica and the non-threatened A. halleri, we performed variant calling and population genetic analyses using 35 A. cantabrica samples from six populations and six samples of A. halleri subsp. nuria from a single population. We followed the pipeline and scripts provided by Slimp et al. (2021, available at https://github.com/lindsawi/HybSeq-SNP-Extraction) with some modifications. Slimp et al. (2021) used supercontig sequences in their pipeline and demonstrated that most genetic variation occurred in flanking non-coding regions, which tend to accumulate mutations quickly because of limited functional constraints (Palumbi, 1996). We analyzed supercontig and intron sequences separately for comparison. We prepared reference files for supercontigs and introns using the same approach described above for generating the nQuire reference, in this case selecting the longest supercontig and intron sequence for each gene. Additionally, we excluded any genes that HybPiper flagged with paralogy warnings (Bryc et al., 2013).
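The reference-building step can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `longest_per_gene` and `to_fasta`, the `(gene, sample, sequence)` record layout, and the tie-breaking rule (first record wins) are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (gene, sample, sequence); hypothetical layout


def longest_per_gene(
    records: Iterable[Record], flagged_genes: Set[str]
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest sequence for each gene, skipping flagged genes.

    `flagged_genes` stands in for the genes that received paralogy
    warnings. Ties are resolved in favour of the first record seen
    (an assumption of this sketch).
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene, sample, seq in records:
        if gene in flagged_genes:
            continue  # exclude genes with paralogy warnings
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (sample, seq)
    return best


def to_fasta(reference: Dict[str, Tuple[str, str]]) -> str:
    """Format the selected sequences as FASTA, one entry per gene."""
    lines = []
    for gene in sorted(reference):
        _sample, seq = reference[gene]
        lines.append(f">{gene}")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

The same function would be run once on supercontig sequences and once on intron sequences to yield the two separate references.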
To obtain single-nucleotide polymorphism (SNP) data, we used the framework developed by DePristo et al. (2011) in GATK (McKenna et al., 2010). We merged the reads aligned to the reference with the unaligned reads, removed duplicate sequences, generated preliminary variants for each sample individually, and then performed joint genotype calling across all samples (Poplin et al., 2018) to produce a Variant Call Format (VCF) file. We filtered the initial VCF file by applying a "hard filter" (QD < 5.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0), removing indels and SNPs with missing data in GATK, and removing linked SNPs in PLINK (Chang et al., 2015). We then conducted Base Quality Score Recalibration in GATK and repeated the variant calling step. Polyploidy can artificially increase heterozygosity and allelic richness (Hokanson & Hancock, 1998), so it is essential to filter fixed heterozygotes from SNP datasets of polyploid species (e.g., Douglas et al., 2015; Cornille et al., 2016; Blischak et al., 2018; Pavan et al., 2020). We therefore removed loci with observed heterozygosity (H_O) > 0.5 from the A. cantabrica data (Appendix S1) using the R package "vcfR" (Knaus & Grünwald, 2017). We established this threshold by comparing heterozygosity and inbreeding coefficient results for the diploid A. halleri with those obtained for the tetraploid A. cantabrica (see Appendix S1 and Results). The unfiltered data were retained for comparison.
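The logic of the two filters can be illustrated with a short Python sketch. It is not the GATK or vcfR implementation used in the study: the function names, the dictionary of INFO annotations, and the genotype strings are hypothetical inputs. Two choices are assumptions of the sketch and may differ from how the actual tools behave: a missing annotation is treated as not triggering its condition, and a locus with no called genotypes is dropped.

```python
import re
from typing import Dict, Iterable, Optional

# Thresholds from the hard filter expression quoted in the text.
# A variant is flagged if ANY condition is true (the "||" operators).
HARD_FILTER = {
    "QD": ("<", 5.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}


def fails_hard_filter(info: Dict[str, float]) -> bool:
    """Return True if any hard-filter condition is met.

    Annotations absent from `info` are skipped (sketch assumption).
    """
    for key, (op, threshold) in HARD_FILTER.items():
        value = info.get(key)
        if value is None:
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


def observed_heterozygosity(genotypes: Iterable[str]) -> Optional[float]:
    """Fraction of called genotypes carrying more than one distinct allele.

    Genotypes are VCF-style strings such as "0/1" or "0|0"; any genotype
    containing "." is treated as missing. Returns None if nothing is called.
    """
    called = 0
    het = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if "." in alleles:
            continue
        called += 1
        if len(set(alleles)) > 1:
            het += 1
    return het / called if called else None


def keep_locus(genotypes: Iterable[str], max_ho: float = 0.5) -> bool:
    """Keep a locus unless its observed heterozygosity exceeds `max_ho`."""
    ho = observed_heterozygosity(genotypes)
    return ho is not None and ho <= max_ho
```

For example, a locus at which every sample is called "0/1" has H_O = 1.0 and is removed, which is the fixed-heterozygote pattern the filter targets.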