SNP calling and filtering
No reference genome is available for ponderosa pine (Pinus ponderosa), but one does exist for loblolly pine (Pinus taeda) (Neale et al. 2014; Zimin et al. 2014). Of the conifers that have been sequenced to date, P. taeda is the most closely related to P. ponderosa (Gernandt et al. 2009; Willyard et al. 2009). Furthermore, the P. taeda reference genome was successfully used to design probes for sequence capture in P. contorta (Suren et al. 2016; Yeaman et al. 2016), a distant relative. Based on preliminary analyses, we selected the Stacks v.2.2 pipeline (Rochette & Catchen 2017) with this reference genome (https://treegenesdb.org/FTP/Genomes/Pita/) for SNP calling (Shu 2020). Every step of the Stacks reference-based pipeline is performed by Stacks itself, except read alignment, for which we used BWA v.0.7.17 (Li & Durbin 2009), and the Samtools v.1.9 (Li 2011) step used to obtain read positions. Default settings were used for Stacks, BWA, and Samtools.
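As an illustration of how the external and internal steps fit together, the following minimal sketch assembles the command lines for one sample without running them. It is not the command set used in this study: the choice of `bwa mem` as the alignment algorithm, the reading of the Samtools step as a coordinate sort, the use of the `gstacks` and `populations` programs from Stacks v2, and all file names are assumptions made for illustration.

```python
def build_pipeline_commands(reference, fastq, sample, bam_dir, popmap, out_dir):
    """Return the command lines of a reference-based Stacks run for one sample.

    Nothing is executed here; each value is an argument list that could be
    passed to subprocess.run. All paths are hypothetical placeholders.
    """
    bam_path = f"{bam_dir}/{sample}.bam"
    return {
        # Build the BWA index of the reference genome (done once).
        "index": ["bwa", "index", reference],
        # Align reads; SAM output goes to stdout (bwa mem is an assumption).
        "align": ["bwa", "mem", reference, fastq],
        # Sort alignments by position; "-" reads the piped SAM from stdin.
        "sort": ["samtools", "sort", "-o", bam_path, "-"],
        # Stacks v2: build loci and call SNPs from the sorted BAM files.
        "gstacks": ["gstacks", "-I", bam_dir, "-M", popmap, "-O", out_dir],
        # Stacks v2: export the called SNPs as a VCF file.
        "populations": ["populations", "-P", out_dir, "-M", popmap, "--vcf"],
    }


commands = build_pipeline_commands(
    reference="Pita_reference.fa",   # hypothetical file name
    fastq="sample01.fq.gz",          # hypothetical file name
    sample="sample01",
    bam_dir="aligned",
    popmap="popmap.tsv",
    out_dir="stacks_out",
)
for step, argv in commands.items():
    print(step, " ".join(argv))
```

In this sketch the output of the "align" command would be piped into the "sort" command, and only those two steps (plus indexing) run outside Stacks.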
After calling the SNPs, we ran SnpEff (Cingolani et al. 2012) to determine the location of each SNP relative to annotated genes. We used the annotation database and the reference genome of loblolly pine v.2.01 from TreeGenes (http://treegenesdb.org/FTP/Genomes/Pita/v2.01/). The SnpEff output file assigns each SNP to one of six primary location categories: intragenic variants, intergenic variants, upstream SNPs, downstream SNPs, and synonymous and missense variants in the gene coding sequence. In SnpEff, "intragenic" refers to SNPs in introns, while "missense" refers to any non-synonymous mutation in the transcribed region.
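The tally of SNPs per location category can be sketched as follows. This is a minimal example, assuming SnpEff wrote an annotated VCF whose INFO column carries an `ANN` field with pipe-separated sub-fields (allele, effect, impact, and so on) and with the most severe annotation listed first. The effect names in the toy records are Sequence Ontology style terms; how they map onto the six categories above is not specified here, and the records themselves are invented for illustration.

```python
from collections import Counter


def primary_effect(info_field):
    """Return the effect term of the first ANN annotation, or None if absent."""
    for entry in info_field.split(";"):
        if entry.startswith("ANN="):
            first_annotation = entry[len("ANN="):].split(",")[0]
            # Sub-fields are pipe-separated: Allele|Effect|Impact|...
            fields = first_annotation.split("|")
            if len(fields) > 1:
                # Combined effects are joined with '&'; keep the first term.
                return fields[1].split("&")[0]
    return None


def tally_effects(vcf_lines):
    """Count SNP records per primary effect, skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete VCF record
        counts[primary_effect(columns[7]) or "unannotated"] += 1
    return counts


# Toy records (hypothetical scaffolds and genes), not data from this study.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "scaffold1\t100\t.\tA\tG\t.\tPASS\tANN=G|intergenic_region|MODIFIER|geneA-geneB",
    "scaffold1\t250\t.\tC\tT\t.\tPASS\tANN=T|missense_variant|MODERATE|geneB,T|upstream_gene_variant|MODIFIER|geneC",
    "scaffold2\t75\t.\tG\tC\t.\tPASS\tANN=C|upstream_gene_variant|MODIFIER|geneD",
    "scaffold3\t910\t.\tT\tA\t.\tPASS\tANN=A|intergenic_region|MODIFIER|geneE-geneF",
]

effect_counts = tally_effects(example_vcf)
for effect, n in sorted(effect_counts.items()):
    print(effect, n)
```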
Many SNPs identified by GBS fall in intergenic regions, outside genes and their regulatory regions, and likely have no direct effect on gene expression or function. In addition, because linkage disequilibrium is low in conifers (Namroud et al. 2008; Isik et al. 2016), any association identified between such an intergenic SNP and a phenotype or environment of interest is more likely a false positive than a reflection of linkage between the SNP and a causal variant. Therefore, before running the association analysis, we filtered out the intergenic SNPs using a Python script (https://github.com/shumengjun/LFMM).
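The filtering step can be illustrated with a short sketch. It does not reproduce the script in the repository above; it assumes a SnpEff-annotated VCF in which the first `ANN` annotation of each record gives its primary effect and in which intergenic SNPs are labelled `intergenic_region`. Records without an `ANN` field are kept, which is a design choice of this example rather than a documented behavior of the original script.

```python
def is_intergenic(info_field):
    """True if the first ANN annotation in the INFO column is intergenic."""
    for entry in info_field.split(";"):
        if entry.startswith("ANN="):
            fields = entry[len("ANN="):].split(",")[0].split("|")
            return len(fields) > 1 and fields[1] == "intergenic_region"
    return False  # no annotation found: treat as not intergenic


def drop_intergenic(vcf_lines):
    """Yield header lines and every record whose primary effect is not intergenic."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 8 and is_intergenic(columns[7]):
            continue  # skip intergenic SNP
        yield line


# Toy records (hypothetical), not data from this study.
annotated = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "scaffold1\t100\t.\tA\tG\t.\tPASS\tANN=G|intergenic_region|MODIFIER|geneA-geneB",
    "scaffold1\t250\t.\tC\tT\t.\tPASS\tANN=T|missense_variant|MODERATE|geneB",
    "scaffold2\t75\t.\tG\tC\t.\tPASS\tANN=C|synonymous_variant|LOW|geneD",
]

filtered = list(drop_intergenic(annotated))
print(len(filtered) - 2, "SNPs retained")
```

In practice the input lines would be read from the annotated VCF file and the retained lines written to a new file that feeds the association analysis.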