Statistical Analysis:
Preference data are typically analyzed using non-parametric tests,
such as the Wilcoxon signed-rank test, in which preferences are expressed
as ratios and compared to a null expectation. However, this approach has
several drawbacks: (1) it does not measure the strength of preferences,
only whether they differ statistically from the null; (2) it fails to
account for differing counts per replicate and does not adjust confidence
intervals accordingly; and (3) it does not model variation in
individual-level choice. For instance, if half of the individuals in a
population strongly prefer one option and half the other, the population
preference will still be 0.5 and may not differ significantly from the
null (Fordyce et al. 2011). Given these
drawbacks, we chose to use a hierarchical Bayesian model designed for
ecological preference and count data, implemented using the R packages
bayespref (version 1.0) and coda (version 1.0) (Plummer et al. 2006;
Fordyce et al. 2011). The benefit of the hierarchical Bayesian model is
that it directly estimates the strength of preference, appropriately
models uncertainty, and gives estimates of both individual and
population-level preferences.
The R package bayespref uses Markov chain Monte Carlo (MCMC) sampling:
at each step in the chain it assesses the probability of the experimental
count data given an individual preference (thereby modeling individual
preference), and then the probability of that individual preference given
a population preference (thereby modeling population preference). In a
two-choice scenario like ours, individual preferences are modeled as
binomial distributions and population preferences as beta distributions.
The first model was run using
bayespref for female preference for conspecific vs. heterospecific song,
to confirm that females showed the expected preference for conspecific
song and that the operant trials reflected preference. The second model
was run for Solver vs. Non-Solver song, to determine whether females
showed a population-level preference for Solver song. Models were run
with 10,000-generation MCMC chains and 1,000-generation burn-ins. Mixing
of chains was verified via diagnostic plots of the MCMC samples and by
checking effective sample sizes.
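A run of this kind can be sketched in R as follows. The counts are invented placeholders, and the exact arguments to bayesPref() and the structure of the object it returns are assumptions (consult the bayespref documentation before use); the diagnostics use standard coda functions:

```r
# Sketch only: bayesPref()'s argument names and return structure are
# assumptions here, not confirmed against the package.
library(bayespref)
library(coda)

# Hypothetical counts: one row per female, columns = the two options
counts <- matrix(c(120, 40,
                    95, 55,
                   130, 30),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("conspecific", "heterospecific")))

fit <- bayesPref(pData = counts, mcmcL = 10000)  # 10,000-generation chain

# coda diagnostics: drop the 1,000-generation burn-in, then inspect
# mixing via trace plots and check the effective sample size
chain <- mcmc(fit[[1]]$PopPref, start = 1001)    # component name assumed
plot(chain)            # trace and density plots to assess mixing
effectiveSize(chain)   # effective sample size
```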
In the first model we used data for each female, where the total number
of conspecific hops over two days was modeled as one possible outcome
and the total number of heterospecific hops over two days as the
alternative outcome. In the second model we likewise used data for each
female, where the total number of Solver hops over all test days was
modeled as one possible outcome and the total number of Non-Solver hops
over all test days as the alternative outcome. Pooling hops across all
test days controlled for side bias, which in some females was noticeable
even in the conspecific vs. heterospecific trials. To
evaluate the impact of stimulus set on female preference, and to ensure
that no single stimulus set disproportionately affected the results, we
ran a third hierarchical Bayesian model using stimulus song at the
individual level and preference for all stimulus songs in a given
category (Solver or Non-Solver) at the population level. In this model
we had data for each stimulus set (1–6), where the total number of
female hops across all trials by all birds for the Solver song was
modeled as one possible outcome and the total number of female hops
across all trials for the Non-Solver song as the alternative outcome.
Just as the first model calculated individual female preference and
confirmed that no single female was driving the population-level
preference, this model confirmed that no single song set was
disproportionately “attractive” and driving the preference for Solver
song.
Because of concern that one of the Non-Solvers was categorically
different in ability from the other five, we also ran these Bayesian
models using only data from the five stimulus sets in which the
Non-Solvers all failed at the same step.
Although we considered the Bayesian approach sufficient to answer our
research questions, we also ran a linear mixed-effects model using the
preference data for each trial, including stimulus ID, female ID, and
order of presentation as random effects. The results of this model can
be found in the Supplementary Materials.
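A model of this form could be fit in R with lme4's lmer(), though the source does not name the package used; the data frame and its column names below are hypothetical placeholders:

```r
# Sketch of the per-trial mixed model, assuming lme4 (package not named
# in the text) and hypothetical column names; 'pref' stands in for the
# per-trial preference score.
library(lme4)

set.seed(42)
trials <- data.frame(
  pref   = runif(60),                    # placeholder preference scores
  stimID = factor(rep(1:6, each = 10)),  # stimulus set
  femID  = factor(rep(1:10, times = 6)), # female identity
  order  = factor(rep(1:6, times = 10))  # order of presentation
)

# Intercept-only model with the three random effects named in the text
m <- lmer(pref ~ 1 + (1 | stimID) + (1 | femID) + (1 | order),
          data = trials)
summary(m)
```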
We further tested whether any of five measures of song complexity (song
length, number of phrases, phrase length, number of elements, or number
of unique elements) differed between stimulus pairs, using paired
t-tests. We corrected for multiple testing (here, five song variables)
using a Bonferroni correction, giving an adjusted alpha of 0.01
(0.05/5).
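In base R this amounts to one paired t.test() per song variable compared against the adjusted alpha; the song measurements below are invented placeholders (only two of the five variables are shown):

```r
# Sketch of the paired t-tests with a Bonferroni correction;
# the measurements here are invented placeholders.
set.seed(1)
solver     <- data.frame(length = rnorm(6, 10), phrases = rnorm(6, 5))
non_solver <- data.frame(length = rnorm(6, 10), phrases = rnorm(6, 5))

vars  <- names(solver)
pvals <- sapply(vars, function(v)
  t.test(solver[[v]], non_solver[[v]], paired = TRUE)$p.value)

alpha_adj <- 0.05 / 5    # Bonferroni-adjusted alpha for five variables
pvals < alpha_adj        # which comparisons survive correction
```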
Because we were interested in whether neophobia affected performance, we
also tested, using independent t-tests, whether Non-Solvers took longer
than Solvers to solve the first stage of the task, in which they were
introduced to the block, and the second stage, in which they were
introduced to the lids. All statistical analyses were performed in R
(version 3.5.0; R Development Core Team 2016).
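An independent two-sample comparison of this kind is a direct call to t.test(), which in R defaults to the Welch version; the latencies below are invented placeholders:

```r
# Sketch of the Solver vs. Non-Solver latency comparison;
# latencies (e.g., trials to solve stage 1) are invented placeholders.
solver_latency     <- c(3, 5, 4, 6, 2, 4)
non_solver_latency <- c(7, 9, 6, 8, 10, 7)

# Independent two-sample t-test (Welch's by default in R)
res <- t.test(non_solver_latency, solver_latency)
res$p.value
```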