As science becomes increasingly cross-disciplinary and scientific models become increasingly cross-coupled, standardized practices of model evaluation are more important than ever. For normally distributed data, mean squared error (MSE) is ideal as an objective and general-purpose measure of model performance, but it gives little insight into which aspects of model performance are ‘good’ or ‘bad’. This apparent weakness has led to a myriad of specialized error metrics, which are often aggregated to form a composite score. Such scores are inherently subjective, however, and although their components are interpretable, the composite itself is not. We contend that a better approach to model benchmarking and interpretation is to decompose the MSE into more interpretable components. To demonstrate the versatility of this approach, we outline some fundamental types of decomposition and apply them to predictions from three streamflow models at 1,021 streamgages across the conterminous United States. Through this demonstration, we show that each component in a decomposition represents a distinct concept and that simple decompositions can be combined to represent more complex concepts, forming an expressive language through which to interrogate models and data.
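As one concrete illustration of the general idea (a minimal sketch, not necessarily the specific decompositions used in this paper), MSE can be split into a squared-bias term and an error-variance term via the standard identity MSE = (mean error)^2 + variance of the errors, for errors e = prediction − observation. The Python sketch below uses a hypothetical `decompose_mse` helper to show how such a split separates systematic offset from random scatter.

```python
import numpy as np

def decompose_mse(y_true, y_pred):
    """Split MSE into squared bias plus error variance.

    Uses the identity mean(e**2) == mean(e)**2 + var(e),
    where e = y_pred - y_true and var uses ddof=0.
    Illustrative helper; not taken from the paper.
    """
    e = np.asarray(y_pred) - np.asarray(y_true)
    bias_sq = e.mean() ** 2   # systematic component (offset)
    variance = e.var()        # random component (scatter), ddof=0
    mse = bias_sq + variance  # equals np.mean(e**2)
    return mse, bias_sq, variance

# Example: a prediction that is both offset and noisy
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
yhat = y + 0.5 + rng.normal(scale=0.3, size=1000)  # bias 0.5, noise sd 0.3
mse, bias_sq, var = decompose_mse(y, yhat)
print(f"MSE={mse:.3f}  bias^2={bias_sq:.3f}  variance={var:.3f}")
```

In this toy example the squared bias (about 0.25) and the error variance (about 0.09) each map to a distinct, interpretable failure mode, which is the sense in which decomposed components form an "expressive language" for interrogating model performance.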