The Power of Integrated Models
Leveraging multiple data types, whether through pooling or explicit integration (e.g., joint likelihood approaches), has been shown to generally improve SDM performance by estimating more precise and accurate environmental relationships (Fletcher et al. 2019, Paradinas et al. 2023, Braun et al. 2023b). Although recent research has highlighted the value of combining multiple data types in SDMs (Bedriñana-Romano et al. 2018, Rufener et al. 2021, Paradinas et al. 2023, Braun et al. 2023b), few studies have demonstrated the capacity of such models to forecast and project potential distributional shifts under novel environmental conditions (Chevalier et al. 2021). Our study suggests that while all model approaches used here perform well during periods of normal environmental conditions, joint likelihood approaches that explicitly account for the biases in each data source (i.e., iSDMs) maintain robust and ecologically realistic forecasts as environmental conditions become increasingly novel. We demonstrate that iSDMs effectively mitigate issues that are broadly thought to limit a model’s forecast skill. Our findings confirm that explicit integration of diverse datasets represents a promising approach to overcome the potential biases inherent in a single data source, as it enables harnessing the strengths of various data types to facilitate more accurate inferences about a species’ distribution (Isaac et al. 2020). The models we tested all exhibited high predictive skill (average AUC > 0.83, MAE < 0.25) and strong ecological realism. Integrating data sources can be particularly beneficial for highly migratory pelagic species, such as albacore, because a single data source may capture only a portion of their range, such as that represented by a fishery, which could lead to mischaracterizing a species’ realized niche (Paradinas et al. 2023, Braun et al. 2023b).
However, our results also suggest that predictive skill may be higher for fishery-dependent data than for fishery-independent sources, as seen in the deviations observed in early 2016 (Figure 3). This aligns with previous findings (Braun et al. 2023b; Farchadi et al. in revision), where models were more effective at predicting the fishery’s interaction with a species than at predicting broader habitat suitability. These differences underscore the need to carefully consider the representativeness of each data source when interpreting forecasted distributions.
The improved predictive performance of iSDMs under increasing environmental novelty may stem from differences in the fitted species-environmental response curves (Thuiller et al. 2004) and their ability to account for spatiotemporal variation (Muhling et al. 2019, Simmonds et al. 2020). Previous studies evaluating SDM forecasting performance, whether in the near-term (Muhling et al. 2020, Barnes et al. 2022) or long-term (Thuiller et al. 2004, Karp et al. 2023), have emphasized that biased or limited species-environmental response curves can lead to erroneous predictions. This limitation often stems from inherent bias in the training data, such as fishery catch data that capture only a portion of the species’ preferred habitat conditions due to sampling bias (e.g., clustering, gear selectivity, limited spatial and/or temporal coverage), resulting in truncated species-environmental response curves (Chevalier et al. 2021, Barnes et al. 2022, Paradinas et al. 2023). Our results indicate that leveraging diverse data types can help capture the full range of environmental conditions a species occupies, but the resulting response curves depend on how the model framework combines data types. For example, more generalized species-environmental relationships were estimated for both spatially explicit models (i.e., GF, iSDM), which performed better than the spatially implicit model. This is likely due, at least in part, to HE response curves that exhibited greater overfitting and were heavily biased towards the distribution of the more data-rich vessel logbook records, particularly for MLD (Figure 5). Notably, the GF and iSDM response curves for MLD closely matched the known diving behavior of juvenile albacore tuna, which regularly dive to approximately 100 meters (Frawley et al. 2024) but are often vertically limited by colder temperatures below the mixed layer (Graham and Dickson 1981).
In contrast, the HE model suggested albacore suitability declined with deeper MLDs, particularly > 10 meters, a pattern that mirrors the environmental conditions targeted by the pole-and-line and troll fisheries along the U.S. West Coast (Figure S1). This demonstrates that the inclusion of GMRFs in the spatially explicit models helped account for unmeasured variation in albacore distribution. By modeling the spatial structure separately, these models provided more reliable estimates of environmental relationships, reducing the risk of response curves being artifacts of sampling biases in the fishery data.
Our results also highlight how approaches to spatial dependence and to combining disparate data sources can influence an SDM’s capacity to accurately forecast species distributions under novel environmental conditions. Consistent with previous studies, we found that habitat envelope models produce narrower response curves than spatially explicit frameworks, likely due to their inability to capture residual variability (Thorson 2018, Simmonds et al. 2020). Consequently, tightly fit response curves may fail to account for non-stationary species-environment relationships under novel conditions. In contrast, the broader, more generalized response curves generated by iSDMs better capture these dynamics over time (Yates et al. 2018, Muhling et al. 2020; Figure 5). Additionally, the strong performance of spatially explicit models may stem from their ability to incorporate variation across multiple temporal scales. Consistent with prior findings, our analysis suggests that including GMRFs, which act analogously to seasonal or climatological covariates, enhances forecast skill, particularly in the near term (Barnes et al. 2022). Furthermore, differences between the two spatially explicit models, GF and iSDM, highlight the influence of data integration methods. While the GF model pools data sources, potentially masking differences in sampling design (Fletcher et al. 2019), iSDMs estimate data-specific spatial fields, allowing for improved handling of spatiotemporal variation and biases while also balancing disproportionate sample sizes. This, in turn, can lead to a more accurate representation of the underlying ecology of the species. Given the challenges of identifying and addressing bias in different data sources, ongoing evaluation of integration methods remains essential for optimizing predictive performance in species distribution modeling.