Licheng LIU

and 12 more

Improving the estimation of CO2 exchange between the atmosphere and terrestrial ecosystems is critical to reducing the large uncertainty in the global carbon budget. Large amounts of the atmospheric CO2 assimilated by plants return to the atmosphere through ecosystem respiration (Reco), which comprises plant autotrophic respiration (Ra) and soil microbial heterotrophic respiration (Rh). However, Ra and Rh are challenging to estimate at large regional scales because of the limited understanding of the complex interactions among physical, chemical, and biological processes and the resulting high spatio-temporal dynamics. Traditional approaches for estimating Reco, including process-based (PB) models, are constrained by incomplete human knowledge, which limits their accuracy and efficiency. The accumulation of in situ observations of net ecosystem exchange (NEE), weather, and soil, together with satellite data on gross primary production (GPP), leaf area index (LAI), and soil moisture, makes it possible to apply data-driven machine learning (ML) approaches. However, pure ML approaches omit domain knowledge and lack interpretability. Here we propose a novel knowledge-guided machine learning (KGML) method for predicting daily Ra and Rh in US crop fields. Using the Gated Recurrent Unit (GRU) as the basis, we develop KGML models that arrange the ML components in a hierarchical structure subject to a mass balance constraint. The KGML models were pre-trained using synthetic data generated by an advanced agroecosystem model, ecosys, and re-trained with real-world FLUXNET observation data. We extrapolate the best KGML model to crop fields across the US with the help of satellite data, reanalysis climate forcings, and a soil database to reveal the spatio-temporal variations and key controlling factors. We believe this study advances the interpretable machine learning concept for carbon cycle estimation and will shed light on many other areas of process-based biogeochemistry research.
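The abstract does not spell out the form of the mass balance constraint. As a minimal sketch, assume the balance NEE = (Ra + Rh) - GPP (a common sign convention, assumed here) and that it is enforced as a soft penalty added to the training loss. The function names and the `weight` parameter are hypothetical and are not taken from the paper.

```python
def mass_balance_residual(ra, rh, gpp, nee):
    """Residual of the assumed daily balance NEE = (Ra + Rh) - GPP."""
    return (ra + rh) - gpp - nee


def mass_balance_penalty(ra, rh, gpp, nee):
    """Mean squared residual over equally long daily series."""
    residuals = [
        mass_balance_residual(a, h, g, n)
        for a, h, g, n in zip(ra, rh, gpp, nee)
    ]
    return sum(r * r for r in residuals) / len(residuals)


def total_loss(data_loss, penalty, weight=1.0):
    """Data-fit loss plus a weighted mass-balance penalty.

    The weight is a hypothetical tuning knob, not a value from the paper.
    """
    return data_loss + weight * penalty
```

Predictions of Ra and Rh that are consistent with the observed NEE and GPP give a penalty of zero, so the penalty only pushes the model when its components violate the assumed balance.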

Zachary McEachran

and 8 more

We present a knowledge-guided machine learning framework for operational hydrologic forecasting at the catchment scale. Our approach, a Factorized Hierarchical Neural Network (FHNN), has two main components: an inverse model and a forward model. The inverse model uses observed precipitation, temperature, and streamflow data to generate a representation of the current underlying catchment state. The forward model predicts streamflow using the learned catchment state. The FHNN architecture is designed to model multi-scale processes and capture their interactions while providing explainability and interpretability. FHNN also improves forecasts based on real-time data through an inference-based data integration approach. This approach updates forecasts in response to observed data more efficiently than data assimilation methods (e.g., ensemble Kalman filtering) that require computationally intensive optimization. Once an inverse model is trained, it can quickly infer catchment states directly from data in real time. To show the operational performance of FHNN, we compare the FHNN forecasts with those of an expert human hydrologic forecaster using a physics-based model, where both use the same imperfectly known future precipitation forecast. The expert human forecaster creates a more accurate forecast within the first 18 hours of a forecast’s issuance, but FHNN has significantly better predictions at longer lead times. Additionally, FHNN internal states correlate strongly with internal physics-based model states, such as soil moisture, in a synthetic case. This research lays the groundwork for combining the predictive performance of AI-based models with the expertise of forecasting agencies to produce better river forecasts at all lead times.
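The inverse-then-forward data flow can be illustrated with a toy example. This is not the FHNN architecture: a single linear reservoir stands in for both learned networks, the inverse step uses only streamflow (the paper's inverse model also uses precipitation and temperature), and the coefficient `K` and all function names are hypothetical.

```python
K = 0.5  # hypothetical recession coefficient of the toy reservoir (Q = K * S)


def inverse_model(observed_flow):
    """Infer the current storage state from the latest observed streamflow."""
    return observed_flow[-1] / K


def forward_model(state, precip_forecast):
    """Roll the state forward over the forecast horizon and return streamflow."""
    flows = []
    for precip in precip_forecast:
        state = state + precip   # rainfall adds to storage
        flow = K * state         # outflow is proportional to storage
        state = state - flow     # storage drains by the outflow
        flows.append(flow)
    return flows


# Usage: infer the state from recent observations, then forecast.
state = inverse_model([4.0, 2.0])
forecast = forward_model(state, [0.0, 2.0])
```

The point of the split is that a new observation only requires re-running the cheap inverse step to refresh the state, rather than an iterative optimization.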

Xiang Li

and 11 more

Streamflow prediction is a long-standing hydrologic problem. Development of models for streamflow prediction often requires incorporation of catchment physical descriptors to characterize the associated complex hydrological processes. Across catchments of different scales, these physical descriptors also allow models to extrapolate hydrologic information from one catchment to others, a process referred to as “regionalization”. Recently, in gauged basin scenarios, deep learning models have been shown to achieve state-of-the-art regionalization performance by building a global hydrologic model. These models predict streamflow given catchment physical descriptors and weather forcing data. However, these physical descriptors are by their nature uncertain, sometimes incomplete, or even unavailable, which limits the applicability of this approach. In this paper, we show that by assigning a vector of random values as a surrogate for catchment physical descriptors, we can achieve robust regionalization performance under a gauged prediction scenario. Our results show that the deep learning model using our proposed random vector approach achieves predictive performance comparable to that of the model using actual physical descriptors. The random vector approach yields robust performance across different data sparsity scenarios and choices of deep learning model. Furthermore, with random vectors, a high-dimensional characterization improves regionalization performance in gauged basin scenarios when physical descriptors are uncertain or insufficient.
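A minimal sketch of the random vector idea, under stated assumptions: each catchment receives one fixed random vector, which is appended to the weather forcings at every time step in place of physical descriptors. The Gaussian distribution, the seed, and the function names are illustrative assumptions; the abstract specifies only "a vector of random values".

```python
import random


def make_random_descriptors(catchment_ids, dim, seed=0):
    """Assign each catchment a fixed random vector of length dim."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return {
        cid: [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for cid in catchment_ids
    }


def build_inputs(forcings, descriptor):
    """Append the static vector to the forcings of every time step."""
    return [list(step) + list(descriptor) for step in forcings]


# Usage: two time steps of (precipitation, temperature) for one catchment.
descriptors = make_random_descriptors(["basin_a", "basin_b"], dim=4)
inputs = build_inputs([(1.2, 15.0), (0.0, 17.5)], descriptors["basin_a"])
```

Because the vector is constant over time and distinct per catchment, it acts as an identifier that a global model can use to tell catchments apart, which is only meaningful for catchments seen during training (the gauged scenario).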

Licheng LIU

and 11 more

Nitrous oxide (N2O) is an important greenhouse gas (GHG), with a global warming potential 265 times that of carbon dioxide (CO2). About 60% of anthropogenic N2O emissions come from agricultural production. To date, estimating N2O emissions from cropland remains a challenging task because the underlying microbial processes (e.g., incomplete nitrification and denitrification) are controlled by diverse factors of climate, soil, plants, and human activities. In this study, we developed an ML model that incorporates physical and biogeochemical domain knowledge, namely knowledge-guided machine learning (KGML), for simulating daily N2O fluxes from agricultural ecosystems. The Gated Recurrent Unit (GRU) was used as the basis for the model structure. A range of ideas were implemented to optimize model performance, including 1) a hierarchical structure based on causal relations among variables, 2) intermediate variable (IMV) prediction and transfer, 3) inputting initial IMV values as constraints, 4) model pre-training and re-training, and 5) multitask learning. The KGML model was pre-trained on millions of synthetic data points generated by an advanced PB model, ecosys, and then re-trained on observations from six mesocosm chambers over three growing seasons. Six other pure ML models were developed using the same mesocosm chamber data to serve as benchmarks for the KGML model. The results show that KGML consistently outperforms the PB model in efficiency and the pure ML models in accuracy when capturing N2O flux magnitude and dynamics. In addition, the reasonable predictions of IMVs increase the interpretability of KGML. We believe the KGML development in this study will stimulate a new body of research on interpretable machine learning for biogeochemistry and other related geoscience processes.
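The pre-train and re-train step can be illustrated with a toy example. This is a sketch under stated assumptions, not the KGML model: a one-parameter linear model stands in for the GRU, the "synthetic" and "observed" data are made-up numbers, and the learning rates and epoch counts are arbitrary.

```python
def fit(w, data, lr, epochs):
    """Gradient descent on mean squared error for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Plentiful synthetic data from a stand-in "process model" (true slope 2.0).
synthetic = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
# Scarce observations that follow a slightly different relation (slope 2.5).
observed = [(1.0, 2.5), (2.0, 5.0)]

# Pre-train from scratch on synthetic data, then re-train briefly on observations.
w_pre = fit(0.0, synthetic, lr=0.1, epochs=50)
w_fine = fit(w_pre, observed, lr=0.05, epochs=5)
```

Starting the second fit from `w_pre` rather than from zero means a few steps on the scarce observations are enough to move the parameter toward the observed relation while retaining what was learned from the synthetic data.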