Statistical Analysis:
Preference data are typically analyzed with non-parametric tests, such as the Wilcoxon signed-rank test, in which preferences are expressed as ratios and compared to a null expectation. However, this approach has several drawbacks: (1) it does not measure the strength of preferences, only whether they differ statistically from the null; (2) it fails to account for different numbers of choices per replicate and does not adjust confidence intervals accordingly; and (3) it does not model variation in individual-level choice. For instance, if half of the individuals in a population strongly prefer one option and half the other, the population-level preference will still be 0.5 and may not differ significantly from the null (Fordyce et al. 2011). Given these drawbacks, we used a hierarchical Bayesian model designed for ecological preference and count data, implemented in the R packages bayespref (version 1.0) and coda (version 1.0) (Plummer et al. 2006; Fordyce et al. 2011). The benefits of the hierarchical Bayesian model are that it directly estimates the strength of preference, appropriately models uncertainty, and provides estimates of both individual- and population-level preferences.
The R package bayespref uses Markov chain Monte Carlo (MCMC) sampling to model preference at two levels: at each step in the chain it assesses the probability of the experimental count data given each individual's preference, thus modeling individual preference, and the probability of those individual preferences given the population preference, thus modeling population preference. In a two-choice scenario such as ours, individual preferences are modeled as binomial distributions and population preferences as beta distributions. The first model, run in bayespref on female preference for conspecific vs. heterospecific song, confirmed that females showed the expected preference for conspecific song and that the operant trials therefore reflected preference. The second model, run on Solver vs. Non-Solver song, tested whether females showed a population-level preference for Solver song. Models were run as 10,000-generation MCMC chains with 1,000-generation burn-ins. Mixing of chains was verified via diagnostic plots of the MCMC samples and by checking effective sample sizes.
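The two-level structure described above can be sketched in base R by simulating data from the generative model bayespref assumes: a beta-distributed population preference, individual preferences drawn from it, and binomial choice counts given each individual preference. All parameter values here (numbers of females, hops, and beta shape parameters) are arbitrary assumptions for illustration, not values from our study.

```r
# Illustrative sketch of the beta-binomial hierarchy (not the published analysis).
set.seed(42)

n_females <- 10        # hypothetical number of individuals
hops_per_female <- 80  # hypothetical total choices (hops) per individual

# Population level: Beta(6, 2) implies a mean preference of 6 / (6 + 2) = 0.75
ind_pref <- rbeta(n_females, shape1 = 6, shape2 = 2)

# Individual level: hops toward option A are binomial given each preference
hops_A <- rbinom(n_females, size = hops_per_female, prob = ind_pref)
hops_B <- hops_per_female - hops_A

# One row per female: the two-column count layout the model operates on
counts <- cbind(optionA = hops_A, optionB = hops_B)
counts
```

Inference in bayespref runs this logic in reverse, recovering the individual preferences and the population-level beta distribution from the observed count matrix.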
In the first model we used data for each female, with the total number of conspecific hops over two days as one possible outcome and the total number of heterospecific hops over two days as the alternative outcome. In the second model we likewise used data for each female, with the total number of Solver hops over all test days as one possible outcome and the total number of Non-Solver hops over all test days as the alternative outcome. This controlled for side bias, which in some females was noticeable even in the conspecific vs. heterospecific trials. To evaluate the effect of stimulus set on female preference, and to ensure that no single stimulus set was disproportionately affecting the results, we ran a third hierarchical Bayesian model with stimulus song at the individual level and preference for all stimulus songs in a given category (Solver or Non-Solver) at the population level. In this model we used data for each stimulus set (1–6), with the total number of female hops across all trials by all birds for the Solver song as one possible outcome and the total number of female hops across all trials for the Non-Solver song as the alternative outcome. Just as the first model estimated individual female preferences and confirmed that no single female was driving the population-level preference, this model confirmed that no single song set was disproportionately “attractive” and driving the preference for Solver song.
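A fit of the Solver vs. Non-Solver model might be set up as below. The hop totals are made-up numbers, and the bayesPref argument names and output structure shown in the comments are assumptions based on the package documentation, so they are left commented out; only the coda diagnostic functions (mcmc, effectiveSize) are calls we can state with confidence.

```r
# Hedged sketch of fitting the population- and individual-level preference
# model; 'counts' has one row per female, columns = Solver and Non-Solver
# hop totals (hypothetical values).
library(bayespref)
library(coda)

counts <- cbind(solver    = c(52, 61, 48, 70, 55, 66),
                nonsolver = c(28, 19, 32, 10, 25, 14))

# ASSUMPTION: argument names follow the bayespref documentation and may
# differ by version.
# fit <- bayesPref(pData = counts, mcmcL = 10000)  # 10,000 MCMC generations

# After discarding the first 1,000 generations as burn-in, mixing can be
# checked with coda trace plots and effective sample sizes, e.g.:
# chain <- mcmc(fit[[1]]$PopPref[-(1:1000), ])  # output slot is an assumption
# plot(chain)
# effectiveSize(chain)
```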
Because of concern that one of the Non-Solvers was categorically different in ability from the other five, we also ran these Bayesian models using only data from the five stimulus sets in which the Non-Solvers all failed at the same step.
Although we considered the Bayesian approach sufficient to answer our research questions, we also ran a linear mixed-effects model on the preference data for each trial, including stimulus ID, female ID, and order of presentation as random effects. The results of this model can be found in the Supplementary Materials.
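The supplementary mixed-effects model could be specified with lme4 roughly as follows. The data frame, column names, and simulated values are all placeholders, not our data; the sketch only shows the random-effects structure named above (stimulus ID, female ID, and presentation order as random intercepts).

```r
# Hedged sketch of the supplementary linear mixed-effects model.
library(lme4)
set.seed(1)

# Hypothetical trial-level data: one row per trial, 'preference' is a
# per-trial preference score bounded to [0, 1].
trials <- data.frame(
  femaleID   = factor(rep(1:8, each = 12)),
  stimulusID = factor(rep(1:6, times = 16)),
  order      = factor(rep(1:12, times = 8)),
  preference = pmin(pmax(rnorm(96, mean = 0.6, sd = 0.15), 0), 1)
)

# Intercept-only model with crossed random intercepts for stimulus,
# female, and order of presentation
fit <- lmer(preference ~ 1 + (1 | stimulusID) + (1 | femaleID) + (1 | order),
            data = trials)
summary(fit)
```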
We further tested whether any measures of song complexity (song length, number of phrases, phrase length, number of elements, or number of unique elements) differed between stimulus pairs, using paired t-tests. We corrected for multiple testing (here, five song variables) with a Bonferroni correction, giving an adjusted alpha of 0.05/5 = 0.01.
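In base R, one such paired comparison and the Bonferroni adjustment look like the sketch below. The six paired values per measure and the p-values are invented for illustration only.

```r
# Sketch of the paired comparisons of song-complexity measures between
# stimulus pairs; all values are hypothetical.
solver_len    <- c(2.1, 2.4, 1.9, 2.6, 2.2, 2.0)  # e.g., song length (s)
nonsolver_len <- c(2.0, 2.5, 1.8, 2.4, 2.3, 2.1)

# One paired t-test per complexity measure (five in total)
t.test(solver_len, nonsolver_len, paired = TRUE)

# With five tests, the Bonferroni-adjusted alpha is 0.05 / 5 = 0.01;
# equivalently, the p-values themselves can be adjusted:
p_vals <- c(0.03, 0.20, 0.008, 0.45, 0.12)  # hypothetical p-values
p.adjust(p_vals, method = "bonferroni")
```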
Because we were interested in whether neophobia affected performance, we also tested, using an independent t-test, whether Non-Solvers took longer than Solvers to pass the first stage of the task, in which they were introduced to the block, and the second stage, in which they were introduced to the lids. All statistical analyses were performed in R (version 3.5.0; R Development Core Team 2016).
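The latency comparison corresponds to R's two-sample t.test, which defaults to the Welch (unequal-variance) form; the latencies below are hypothetical placeholders.

```r
# Sketch of the independent t-test comparing time to pass a habituation
# stage; latency values are hypothetical (e.g., number of trials to pass).
solver_latency    <- c(3, 5, 2, 4, 3, 6)
nonsolver_latency <- c(7, 9, 5, 8, 10, 6)

t.test(solver_latency, nonsolver_latency)  # Welch two-sample t-test by default
```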