Identification of adaptive variation
To detect putative loci under selection, we used four methods that vary in whether they account for population structure and whether they allow for correlations among environmental variables. Specifically, we used a population structure outlier detection method (pcadapt ), which does not control for population structure and does not correlate with environmental variables, and three different methods to estimate genotype-environment associations (GEAs): a Latent Factor Mixed Model (LFMM), which is a univariate regression model that summarizes environmental variation in latent factors that correct the model for population structure; a redundancy analysis (RDA), which is a multivariate approach that looks for associations of SNPs with each environmental variable separately, and a partial redundancy analysis (pRDA), which is an RDA that accounts for population structure (Forester, Landguth, et al. 2018; Capblancq et al. 2018).
For the population structure outlier detection analysis we used the function pcadapt from the R package pcadapt version 4.1.0 (Luu et al. 2017) to detect candidate genes under selection (‘adaptive’ loci) based on principal components analysis. To determine the number of KPC to retain, we used a threshold for the singular values of the loading of each principal component (PC) above 5% of variance explained for the axes to be retained and used these KPC to control for population structure in the outlier detection analysis. We selected outlier loci based on an alpha threshold of 0.01 calculated using the qvalue function from theqvalue R package (Dabney et al. 2010), after correcting the p -values for multiple comparisons with a Benjamini-Hochberg Procedure using the function p.adjust .
GEA analyses require that the dataset does not have any missing data, so we split samples by species, and then imputed missing genotypes based on the most frequent allele across all individuals of that species. Imputation did not consider population of origin to avoid inflating population structure, particularly given the low number of samples in some populations. We acknowledge that there will likely be a bias for increasing representation of alleles from populations with more individuals and higher minor allele frequency (MAF) (Mitt et al.2017). However, this approach is more conservative in terms of the detection of local adaptation, and thus less likely to lead to false positives in terms of local adaptation. To test for associations between genetic and environmental data, we obtained bioclimatic data from across the study area using WorldClim (Fick & Hijmans 2017), as well as biomass (ESRI 2016), elevation (USGS 2012), land cover (CCRS et al. 2017), slope degree (ESRI 2013), soil particle size (SoilPartSize) as a proxy for burrow stability and heat retention (CONSBIO 2015a), soil temperature (SoilTemp) (Marchenko 2011) because it is likely important for hibernation (Goldberg 2018), soil orders (SoilOrder) that reflect exposure to climatic factors and biological processes (CONSBIO 2015b), land formation (LandForm) (ESRI 2015), and ecological facets (EcoFacets) which combine elevation, land formation, slope and heat load index (ESRI 2019). For categorical variables (land cover, soil order, land formation and ecological facets), we coded each category as present or absent and analyzed them as independent variables. We used the Shapiro-Wilk normality test in the shapiro.test function from the statsR package (R Core Team 2013) to determine if each variable was normally distributed in order to determine which correlation coefficient to use. For our data, none of the variables were normally distributed, so we then used the chart.Correlation function from thePerformanceAnalytics R package (Peterson et al. 2014) and the Kendall correlation estimator to examine correlations among variables (see Results), and the corr and corrplotfunctions from the corrplot R package (Wei et al. 2017) to plot the correlation matrix. We selected a subset of variables that were not highly correlated <|0.7| for subsequent analyses. In cases where correlations were above |0.7|, we excluded the variable that was correlated with a higher number of other variables or with the highest average correlation values, even if below |0.7|.
To ensure that associations between loci and environmental variables were not confounded by population structure, for the LFMM we performed a Principal Components Analysis (PCA) on our imputed dataset using therda function from the R package vegan (Oksanen et al. 2011). To determine which principal components (PCs) to retain, we then used the broken stick criterion, based on whether the observed eigenvalues were higher than the corresponding random broken stick components (Peres-Neto et al. 2003). We used the retained PCs as conditions when finding associations between the genetic data with the univariate uncorrelated environmental PC axes through the functionlfmm_ridge from the lfmm R package (Caye et al.2019). We then evaluated the genomic inflation factor (GIF) and selected a p -value following François et al. (2016), and then identified candidate loci using a false discovery rate (FDR) of 0.1.
For the RDA, we used the function rda from the vegan R package. We then used the R function RsquareAdj to determine the percentage of genetic variance explained by the RDA axes, after adjusting for the number of predictors. We then used the functionanova.cca from the vegan R package to check the significance of our model, as well as of each RDA axis individually, in explaining the variance in the data with an ANOVA. As a first step, we verified the variance inflation factors (VIF) of each variable using the functionvif.cca from the vegan R package, and excluded the variable with the highest VIF until all variables showed values below 10, i.e. no longer representing redundant constraints (Oksanen et al. 2011). Then, we determined which SNPs were candidates for local adaptation by identifying those present in the tails of the RDA axes loadings that were >2.5 standard deviations (two-tailedp -value = 0.012) from the mean.
For the pRDA, we performed the same analysis as for the RDA, except that before using the rda function, we performed a PCA on the genomic data to summarize neutral population structure by determining which PCs had eigenvalues greater than random, using the broken stick criterion. This analysis identified the number of PCs to condition for in therda function, and then used these PCs to exclude the variance explained by neutral population structure (Rellstab et al. 2015; Forester, Lasky, et al. 2018).
For the interspecific analysis (IDGS dataset), we conducted all four analyses for detecting adaptive loci (pcadapt, LFMM, RDA, and pRDA), and we compared the SNPs identified as candidates in the four analyses using the venn.diagram function from the VennDiagram R package (Chen & Boutros 2011). For the intraspecific analyses (NIDGS and SIDGS datasets), we only conducted the pRDA analysis.