Statistical analysis
We evaluated the role of microhabitat variables in nestbox occupancy
using Random Forests, a machine learning method (Cutler et al. 2007).
This approach combines multiple regression or classification trees into
an ensemble, allowing the estimation of variable importance and conditional
effects (Breiman 2001). Random Forests were generated from 10,000
classification trees using the function “randomForest” from the R
package “randomForest” (Liaw & Wiener 2002). We first defined a model
with all 74 measured variables (Table 1) plus the height from the ground
to the nestbox and a categorical variable for site (woodland or
hedgeline). We
evaluated variable importance with the package “randomForestExplainer”
(Paluszynska et al. 2020), considering seven metrics: (1) the mean
minimal depth of the variable, measured from the top (root) of each
tree; (2) the total number of nodes that use the variable to split the
data; (3) the total number of trees in which the variable is used;
(4) the mean decrease in prediction accuracy after the variable is
permuted; (5) the mean decrease in the Gini index of node impurity by
splits based on the variable; (6) the total number of trees in which
the variable is used for splitting the root node; and (7) the p-value
from a binomial test comparing the number of nodes in which the
variable was used with the number expected if variables were assigned
to nodes at random. To facilitate the
selection of the most relevant variables, we focused on variables with
significant p-values in the binomial test. These variables were explored
in detail using plots representing all metrics, and their selection was
further confirmed via the function “important_variables” from the
package “randomForestExplainer”. We then built a simplified model for
prediction based on the most important variables (relationships between
importance metrics are shown in Appendix S1). Based on this simplified
model, we generated partial dependence plots to show how each variable
influences the probability of occupancy using the function “partial”
from the R package “pdp” (Greenwell 2017). For the complete and simplified models
we report the out-of-bag (OOB) overall error, false positive rate, and
false negative rate (and their complements: model accuracy, specificity,
and sensitivity). For each tree, the OOB sample comprised approximately
one-third of the observations, namely those not selected when the
training sample was drawn with replacement (the default setting). In
addition to the OOB validation, we further validated the model by
comparing predictions for sampled nestboxes with observed dormouse
occupancy between June and October 2021 (this information was not used
to define occupancy for model fitting).
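
The quantities reported above can be illustrated with a short sketch. It is written in Python using only the standard library and is not part of the R analysis. It shows (i) why roughly one-third of observations are out-of-bag for any given tree, (ii) how the error rates and their complements follow from a 2x2 confusion matrix, and (iii) the form of a binomial test on node counts. All counts are hypothetical, "positive" is assumed to mean an occupied nestbox, and the binomial test is assumed to be one-sided with a success probability of 1/76 (one over the number of predictors in the complete model: 74 measured variables plus nestbox height and site).

```python
from math import comb


def oob_fraction(n_obs: int) -> float:
    """Expected share of observations absent from one bootstrap sample of size n_obs."""
    # Each draw misses a given observation with probability 1 - 1/n_obs.
    return (1 - 1 / n_obs) ** n_obs


def oob_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Error rates and their complements from a 2x2 confusion matrix.

    Assumption of this sketch: 'positive' means an occupied nestbox.
    """
    total = tp + fn + fp + tn
    error = (fp + fn) / total   # overall misclassification rate
    fpr = fp / (fp + tn)        # unoccupied boxes predicted as occupied
    fnr = fn / (fn + tp)        # occupied boxes predicted as unoccupied
    return {
        "error": error, "accuracy": 1 - error,
        "false_positive_rate": fpr, "specificity": 1 - fpr,
        "false_negative_rate": fnr, "sensitivity": 1 - fnr,
    }


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Share of observations expected to be out-of-bag for a single tree.
    print(f"OOB fraction (n = 100): {oob_fraction(100):.3f}")

    # Hypothetical confusion-matrix counts, for illustration only.
    for name, value in oob_rates(tp=30, fn=10, fp=5, tn=55).items():
        print(f"{name}: {value:.3f}")

    # Hypothetical forest: 500 nodes in total, 15 of them split on one variable.
    p_value = binomial_upper_tail(k=15, n=500, p=1 / 76)
    print(f"binomial p-value: {p_value:.4f}")
```

The expected out-of-bag share approaches exp(-1), about 0.37, as the number of observations grows, which corresponds to the "approximately one-third" stated above.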