Abstract
As science becomes increasingly cross-disciplinary and scientific models
become increasingly cross-coupled, standardized practices of model
evaluation are more important than ever. For normally distributed data,
mean-squared error (MSE) is ideal as an objective and general-purpose
measure of model performance, but it gives little insight into which
aspects of that performance are ‘good’ or ‘bad’. This apparent weakness
has led to a myriad of specialized error metrics, which are often
aggregated to form a composite score. Such scores are inherently
subjective, however, and while their components are interpretable, the
composite itself is not. We contend that a better approach to model
benchmarking and interpretation is to decompose the MSE into more
interpretable components. To demonstrate the versatility of this
approach, we outline some fundamental types of decomposition and apply
them to predictions from three streamflow models at 1,021 streamgages
across the conterminous United States. Through this demonstration, we show
that each component in a decomposition represents a distinct concept and
that simple decompositions can be combined to represent more complex
concepts, forming an expressive language through which to interrogate
models and data.
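
As a minimal illustration of what such a decomposition looks like (an
example of ours, not necessarily one of the decompositions developed in
the paper), the MSE of the errors $e = \hat{y} - y$ splits exactly into a
squared-bias term and an error-variance term:

\[
\mathrm{MSE}
= \mathbb{E}\big[(\hat{y} - y)^2\big]
= \underbrace{\big(\mathbb{E}[\hat{y} - y]\big)^2}_{\text{squared bias}}
+ \underbrace{\mathrm{Var}(\hat{y} - y)}_{\text{error variance}}.
\]

Each term carries a distinct, interpretable meaning: the bias term
captures systematic over- or under-prediction, the variance term captures
random scatter, and the two sum exactly to the original MSE.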