Statistical analysis
We evaluated the role of microhabitat variables in nestbox occupancy using machine learning Random Forest methods (Cutler et al. 2007). This approach builds an ensemble of regression or classification trees, allowing the estimation of variable importance and conditional effects (Breiman 2001). Random Forests were generated from 10,000 classification trees using the function “randomForest” from the R package “randomForest” (Liaw & Wiener 2002). We first defined a model with all 74 measured variables (Table 1) plus the height from the ground to the nestbox and a categorical variable for site (woodland or hedgeline). We evaluated variable importance with the package “randomForestExplainer” (Paluszynska et al. 2020), considering seven metrics: (1) mean minimal depth (calculated from top trees); (2) total number of nodes that use the variable to split the data; (3) total number of trees in which the variable is used; (4) mean decrease in prediction accuracy after the variable is permuted; (5) mean decrease in the Gini index of node impurity from splits based on the variable; (6) total number of trees in which the variable is used to split the root node; and (7) the p-value from a binomial test comparing the number of nodes in which the variable was used with the number expected if variables were assigned to nodes at random. To facilitate the selection of the most relevant variables, we focused on variables with significant p-values in the binomial test. These were explored in detail using plots representing all metrics and further confirmed via the function “important_variables” from the package “randomForestExplainer”. We then built a simplified model for prediction based on the most important variables (relationships between importance metrics shown in Appendix S1). Based on this simplified model, we generated partial dependence plots to show how each variable influences the probability of occupancy, using the function “partial” from the R package “pdp” (Greenwell 2017). For the complete and simplified models we report the OOB (Out-Of-Bag) overall error, false positive, and false negative rates (and their complements: model accuracy, specificity, and sensitivity). For each tree, the OOB sample consisted of the observations not included in that tree's bootstrap sample, which was drawn with replacement (the default setting); it represented approximately one-third of the observations.
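The binomial test used above to screen variables can be illustrated with a short calculation. The sketch below is written in Python for illustration only and is not the R code used in the analysis. It assumes a one-sided test in which, under the null hypothesis, each split picks one of the 76 predictors with equal probability (1/76). The function name and the node counts are hypothetical and are not results from this study.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each probability mass term is computed in log space so that large
    binomial coefficients do not overflow.
    """
    if k <= 0:
        return 1.0
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical numbers, for illustration only (not results from this study).
n_predictors = 76       # 74 microhabitat variables + nestbox height + site
total_nodes = 5000      # hypothetical number of split nodes in the forest
nodes_using_var = 90    # hypothetical number of nodes splitting on one variable

# Assumption: under the null, every predictor is equally likely at each node.
null_prob = 1 / n_predictors
expected_nodes = total_nodes * null_prob
p_value = binomial_upper_tail(nodes_using_var, total_nodes, null_prob)

print(f"expected nodes under the null: {expected_nodes:.1f}")
print(f"one-sided p-value: {p_value:.4g}")
```

A small p-value here means the variable was chosen for splitting more often than expected by chance, which is why this metric is convenient as a first filter before the other six metrics are inspected.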
In addition to the OOB validation, we validated the model further by comparing predictions for sampled nestboxes with observed dormouse occupancy between June and October 2021 (this information was not used to define occupancy for model fitting).
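The error rates reported for both validations follow directly from a cross-tabulation of observed and predicted occupancy. The sketch below shows that arithmetic in Python. It is illustrative only: the function name and the example vectors are hypothetical, and occupancy is assumed to be coded as 1 (occupied) or 0 (unoccupied).

```python
def occupancy_error_rates(observed, predicted):
    """Compute error rates and their complements for binary occupancy.

    Both inputs are equal-length sequences coded 1 (occupied) or 0 (unoccupied).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both occupied and unoccupied nestboxes are required")

    overall_error = (fp + fn) / len(pairs)
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (fn + tp)
    return {
        "overall_error": overall_error,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
        # Complements of the three error rates above.
        "accuracy": 1 - overall_error,
        "specificity": 1 - false_positive_rate,
        "sensitivity": 1 - false_negative_rate,
    }


# Hypothetical data for ten nestboxes (not observations from this study).
observed = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
rates = occupancy_error_rates(observed, predicted)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The same function applies to OOB predictions and to predictions checked against the independent 2021 observations; only the source of the predicted values differs.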