The Global Carbon Project estimates that the terrestrial biosphere absorbed about one-third of anthropogenic CO2 emissions from 1959 to 2019. This sink estimate is produced by an ensemble of terrestrial biosphere models, collectively referred to as the TRENDY ensemble, and is consistent with the land uptake inferred as the residual of emissions and ocean uptake. The purpose of our study is to understand how well TRENDY models reproduce the processes that drive the terrestrial carbon sink. One challenge is deciding what level of agreement between model output and observation-based reference data is adequate, given that the reference data are themselves uncertain. To define such a level of agreement, we compute benchmark scores that quantify the similarity between independently derived reference datasets using multiple statistical metrics. Models are considered to perform well if their scores reach the benchmark scores. Our results show that reference datasets can differ considerably, causing benchmark scores to be low. Model scores are often of similar magnitude to benchmark scores, implying that model performance is reasonable given how much the reference datasets differ. While model performance is encouraging, ample potential for improvement remains, including reducing a positive bias in leaf area index, improving the representation of processes that govern soil organic carbon in high latitudes, and assessing the causes of the inter-model spread in gross primary productivity in boreal regions and the humid tropics. The success of future model development will increasingly depend on our capacity to reduce and account for observational uncertainties.
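The comparison of model scores against benchmark scores can be illustrated with a minimal Python sketch. The score used here (an exponential of the RMSE normalized by the spread of the reference), the function names, and the toy values are all assumptions made for illustration; they are not the statistical metrics or data used in the study.

```python
import math
from statistics import mean, pstdev


def similarity_score(candidate, reference):
    """Hypothetical similarity score in (0, 1]; 1 means the series are identical.

    Illustrative stand-in only, not one of the study's metrics.
    """
    if len(candidate) != len(reference):
        raise ValueError("series must have the same length")
    spread = pstdev(reference)
    if spread == 0:
        raise ValueError("reference series has zero variance")
    rmse = math.sqrt(mean((c - r) ** 2 for c, r in zip(candidate, reference)))
    return math.exp(-rmse / spread)


def compare_to_benchmark(model, reference_a, reference_b):
    """Score a model and a second reference dataset against the same reference.

    The benchmark score measures how similar two independently derived
    reference datasets are; the model "reaches" the benchmark if its own
    score is at least as high.
    """
    model_score = similarity_score(model, reference_a)
    benchmark_score = similarity_score(reference_b, reference_a)
    return model_score, benchmark_score, model_score >= benchmark_score


# Toy values invented for this example (arbitrary units).
reference_a = [1.0, 2.0, 3.0, 4.0]
reference_b = [1.5, 2.5, 3.5, 4.5]
model = [1.2, 2.1, 3.3, 4.2]

model_score, benchmark_score, reaches = compare_to_benchmark(
    model, reference_a, reference_b
)
print(f"model score: {model_score:.2f}")
print(f"benchmark score: {benchmark_score:.2f}")
print(f"model reaches benchmark: {reaches}")
```

In this toy case the two reference datasets disagree more with each other than the model disagrees with the first reference, so the model reaches the benchmark even though its score is well below 1.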