Data analysis:
Table 1 presents key characteristics of all capture sites. Most site characteristics (e.g., latitude, longitude, altitude) were obtained from Google Earth based on coordinates determined on-site at the time of capture. Other factors were known a priori (e.g., native versus non-native status) or were obtained from peer-reviewed literature (e.g., Colwell’s predictability indices). Yet other factors (e.g., genetic group membership) were determined from ongoing projects (Ravinet et al. 2018), and one factor, urbanization at the capture site, was quantified by us using Google Earth. For this factor, we first found capture sites in Google Earth using latitude and longitude data. A screen shot of the site was then taken such that a 10 km radius transect from the capture site was identifiable. ImageJ was then used to quantify the area within this 10 km circle that was urbanized (e.g., obvious human-built structures, which could be confirmed by higher-resolution images assessable in Goole Earth).
Table 1 also lists dates of birds capture from each location. Whereas time of year (i.e., phase of the breeding season) could have affected gene expression, we could not design our study to avoid any such effect for two reasons. First, field work had to coincide with the availability of our collaborators at each site; alternative timing of sampling was not possible. Second, breeding phenology of many sparrow populations is unknown (e.g., Senegal, Vietnam). We did our best to sample birds outside what we expected to be the breeding season for each population, and except for Israel, rarely did we capture obviously immature individuals.
As gene expression data were non-normal, all were log10transformed before analyses. Our first directive was to evaluate whether there were interactions between gene and tissue shaping gene expression across the populations in our study, which would determine whether we should analyze gene expression for each tissue separately. First, we constructed a univariate model using the lmer function from the lme4 package in r (Bates 2014) with log10 gene expression as a response variable, and gene (factor, 3 levels), tissue (factor, 3 levels), and their interaction as fixed effects. Individual and country were included as random effects to account for non-independence of tissue samples from the same individuals and individuals sampled in the same countries, respectively. We also calculated the adjusted repeatability for random effects (individual and country) following published methods (Nakagawa and Schielzeth 2010). We then used the ‘sim’ function from the ‘arm’ package (Gelman et al. 2007) to generate a posterior distribution of estimates and report the posterior mode and 95% CI for repeatability estimates. These analyses revealed significant gene:tissue interactions, therefore, we analyzed gene expression separately in subsequent analyses to simplify interpretation.
Next, we determined the set of variables that best-explained among country variation in gene expression for each gene by model selection using the “dredge” function from the MuMIn package in R (Barton and Barton 2015). First, we constructed 3 separate global models (i.e., one for each gene) with gene expression as the response variable. All global models included the same explanatory variables: tissue, population type (native or non-native), temperature predictability (high or low), precipitation predictability (high or low), genetic group (1 or 2), urbanization, latitude, altitude, sex, and body mass at capture. Individual was fit as a random effect to account for repeated measures (i.e., tissues) within the same individual. We did not include country as a random effect since this led to model singularity given that the combination of values for fixed effects was unique to each country. The “dredge” function runs all subsets of the global model. We then determined the best-fit models based on AICc values. When alternative models were within Δ2AICc of the top model, we retained these models and used the ‘model.ave’ function from the MuMIn package and report the average model results here.
Finally, given that our initial models revealed significant among country and among individual repeatability, we assessed correlations in gene expression across different levels (i.e., within individuals, among individuals, among countries). Expression levels for each gene were treated as separate response variables, and we included tissue as a fixed effect to account for tissue specific differences in gene expression. Individual ID (band) and study population (country) were also included as random effects, which allowed us to estimate covariation between response variables at the among-country, among-individual, and within-individual (i.e., residual) levels. Models were constructed using the MCMCglmm function in R (Hadfield 2010). Models were run for 106,000 iterations with a burn-in of 6000 and thinning of 100. Thus, 1000 estimates were retained for estimating the posterior distributions. We extracted among country, among individual, and within individual correlations between each pairwise combination of genes following published methods (Houslay and Wilson 2017). Results presented here used a non-informative inverse-Wishart prior, however, we verified that results were robust across different prior specifications, which they were (results not shown). Correlations presented are posterior modes and 95% confidence intervals.