Environmental Association Analysis (EAA)
Environmental values (dependent variable for EAA models) were assigned to the 1x1 km grid cells where individuals had been sampled, using QGIS v2.18.16 (QGIS Delopment Team 2015) and a layer of PCA-scores within polygons defined by sampling locations. The genotypes for each loci (independent variables for EAA models) were recoded as 0, 1 or 2 depending on whether they were homozygous for the reference allele, heterozygous or homozygous for the alternative allele, respectively. As we pruned the SNP dataset by linkage disequilibrium, we avoided including any collinear predictors in the models.
Our EAA was constrained by the fact that the 95 individuals genotyped belonged to only five different populations (18-20 individuals per population). While this ensures a sufficient characterization of genetic variation within populations, it leads to unavoidable pseudoreplication of environmental data (and, depending on the extent of genetic differentiation and aggregation, also of genetic data). To make sure that this problem did not affect our conclusions, we used four randomization approaches, with the same model selection steps applied to each of them, so that they could be compared.
Firstly, we based our EAA on 1,000 random assignments of genotypes within populations to 1x1 km sampled grid cells (hereafter intra-population randomization). Given that environmental variation is several orders of magnitude larger among than within populations (97.6 % of the variance in PCA scores explained by population adscription), we chose to highlight the among-populations component of the models by assuming that all genotypes could occupy every grid cell within the geographical boundaries of their own population (which were determined by the discontinuous distribution of suitable habitat). This is more realistic than assuming a large component of genotype-environment covariation at a local, within-population scale. Moreover, the genetic component of such covariation could not be detected by our methods of outlier detection, which were specifically designed to search for genetic divergence among populations. Our set of 1,000 intra-population randomizations should therefore capture the actual pattern of genotypic and environmental covariation at the scale of the sampled gradient.
Secondly, we randomized 1,000 times the geographical grid cell assigned to each genotype without taking into account its population (hereafter inter-population randomization). This allowed us to produce a null hypothesis of no association between genotypic and environmental variation, but which takes into account the fact that environmental values are geographically structured by population of origin (and thus pseudoreplicated).
Thirdly, we used a randomized set of genetic data to control for the potential effects of genetic structure among populations and genetic aggregation within them (hereafter randomization by neutral loci). For that purpose, we constructed 1,000 new sets by randomly selecting 21 loci from each genotype (i.e. the number of detected outliers out of 6,421 SNP) but without taking into account whether they were characterized as outliers or not. This was done to account for the fact that neutral genetic variation (randomly selected SNPs) is expected to have the same degree of aggregation than variation under selection (outliers). However, while we should expect that at least a fraction of the genetic variation subject to selection should be correlated with environmental variation, the opposite is true for the neutral differentiation of populations.
Finally, we were aware that outlier analyses should effectively sort through the randomized SNP databases to identify those that explain the greatest variance among ‘populations’ (or subgroups). The projection of that variance into environmental PC-space could in turn draw some shape around the five sampled populations’ environments that would be considered as suitable habitat, perhaps leading to significant genotype-environment associations. Because of that reason, the outlier selection step needs to be incorporated into our attempts to identify null expectations. To this end, our fourth randomization approach permuted the 6,421-SNP dataset and reran all the analysis from the outlier detection step onwards (hereafter complete randomization). We randomly assigned genotypes for all the 6,421 SNPs to individuals and ran Bayescan to detect outliers putatively under selection with the same methods we used with real data. Then we took the top 21 outlier SNPs to conduct genotype-environment association models and predict the species’ range. If the environmental signal of the SNPs generated by these simulations (and, as a consequence, expected by pure chance) is still smaller than that of real data, this must be interpreted as evidence that the SNPs selected by our EAA are not only correlated with the environmental gradient portrayed by our sampling populations, but also that they provide reasonably good proxies of genetic differentiation involved in local adaptation. We did not repeat the FLK extension test with this dataset because no significant phylogenetic (i.e. among-population) patterning is expected in a genetic dataset where genotypes are randomly assigned to individuals (and hence to populations). We performed 500 complete randomization tests instead of 1,000 due to computational limitations.
We performed a backward stepwise multiple regression analysis with each randomized data set (N = 3,500 EAAs in total), with SNPs as predictors and environmental scores as the dependent variable using lmfunction in R core. By doing this, we obtained a distribution of adjusted R2 estimates and p-values for each set of randomizations. Final model building was achieved by considering the mean p-values of partial correlations calculated for all the datasets obtained with the intra-population randomization strategy. In each step, we removed all SNPs with a mean p-value > 0.5 (calculated from the p-values of all randomized datasets), and we recalculated all partial correlations with the remaining SNPs. In the last step, when all remaining markers had mean p-values < 0.5 (the mean number of SNPs removed in this step was 14 in intra-population randomizations, 18 in inter-population randomizations, 10.83 in randomizations by neutral loci, and 11.30 in complete randomizations), we removed all SNPs with mean p-values > 0.05. Our final model (genotype-environment association model, or GEAM) was built with the mean intercept and mean beta values of the remaining SNPs.