Discussion
Almost 30 years ago, EBM23 was introduced to a wide medical audience and has subsequently been assessed as one of the most important medical milestones of the past 160 years, in the same category as innovations such as antibiotics and anesthesia.24 At the heart of EBM is the notion that "not all evidence is created equal": some evidence is more credible than other evidence, and the higher the quality of evidence, the more accurate and trustworthy are our estimates of the true effects of health interventions.1 Surprisingly, however, the relationship between CoE and estimates of treatment effects has not been empirically evaluated.
Here, we provide the first empirical support for the foundational EBM principle that estimates based on low CoE change more often than those based on high CoE (Fig 2). However, we found no difference in effect sizes between studies appraised as very low vs high CoE [or very low/low vs moderate/high CoE (Fig 3)]. This implies that effects assessed as less trustworthy or potentially unreliable (as when CoE is low) cannot be distinguished from effects that are presumably more trustworthy and accurate (as when CoE is high). If the magnitude of treatment effects cannot be meaningfully distinguished between evidence appraised as high and evidence appraised as low quality, then the core principle of EBM appears to be challenged.
Our “negative” results should not be construed as a challenge to sound, normative EBM epistemological principles, which hold that the optimal practice of medicine requires explicit and conscientious attention to the nature of medical evidence.1,25,26 Rather, in assessing the relationship between CoE and the “true” effects of health interventions, the more salient question is whether current appraisal methods capture CoE as intended by EBM principles. Critical appraisal of CoE is an integral aspect of the conduct of systematic reviews and of guideline development, and it is widely included in the curricula of most medical and allied health professional schools across the world. Over the years, many critical appraisal methods have been developed,1 culminating in the GRADE methodology, which has been endorsed by more than 110 professional organizations.7 However, as we demonstrate here, despite GRADE’s capacity to distinguish across CoE categories, it could not (and, we suspect, neither could any of the appraisal methods that GRADE has replaced) reliably discern the influence of CoE on estimates of treatment effects. Our results agree with those of Gartlehner et al, who, based on a cumulative meta-analysis of 37 Cochrane reviews, found27 limited value of GRADE in predicting the stability of the strength of evidence as new studies emerged.
The finding that the magnitude of the effect size is not reflected in changes of CoE is surprising, because previous meta-epidemiological studies showed that the various study limitations that affect CoE significantly influence estimates of treatment effects28 (although not always consistently16). For example, as measured by the ROR, inadequate or unclear (vs adequate) random-sequence generation, inadequate or unclear (vs adequate) allocation concealment, and lack of or unclear double-blinding (vs double-blinding) led to statistically significant exaggerations of treatment effects of 11%, 7%, and 13%, respectively.28 These study limitations are taken into account when rating CoE with the GRADE method,6 so one would expect effect sizes to differ between low and high CoE in GRADE assessments. On further examination, however, we note that GRADE combines study limitations such as the adequacy of allocation concealment and blinding (risk of bias) with assessments of inconsistency, imprecision, indirectness, and publication bias to assign the final CoE rating (from very low to high quality) in an additive fashion.12,29 It appears that additively combining domains whose effects on the treatment estimate may run in opposite directions could unhelpfully neutralize their influence and introduce imprecision into the overall rating. Thus, one can have the same estimate of treatment effect but completely different GRADE ratings. This is problematic because the central assumption of GRADE is that estimates underpinned by high CoE are unlikely to change, whereas very low/low CoE estimates are more likely to change.
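To make the point about additive rating concrete, the following is a minimal, hypothetical sketch (not the official GRADE algorithm or any published software); it assumes evidence from randomized trials starts at "high" and loses one level per concerning domain, and it uses two invented meta-analyses with identical pooled effects that nevertheless receive very different CoE ratings.

```python
# Illustrative sketch only: additive downgrading of certainty of evidence (CoE).
# Assumes randomized evidence starts at "high" and each concerning GRADE domain
# subtracts one level, with "very low" as the floor.

RATINGS = ["very low", "low", "moderate", "high"]

def coe_rating(downgrades: dict) -> str:
    """Sum the downgrade levels across the five domains and step down from 'high'."""
    total = sum(downgrades.values())
    return RATINGS[max(0, len(RATINGS) - 1 - total)]

# Two hypothetical meta-analyses with the SAME pooled effect (e.g., OR = 0.80)
# but different domain judgements.
meta_a = {"risk_of_bias": 0, "inconsistency": 0, "indirectness": 0,
          "imprecision": 0, "publication_bias": 0}
meta_b = {"risk_of_bias": 1, "inconsistency": 1, "indirectness": 0,
          "imprecision": 1, "publication_bias": 0}

print(coe_rating(meta_a))  # "high"
print(coe_rating(meta_b))  # "very low"
```

Under this stylized scheme, identical effect estimates map to opposite ends of the CoE scale, which is the scenario described above in which effect size and CoE rating become decoupled.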
A potential limitation of our study is that we did not collect data on the individual factors that drove the assessment of CoE (e.g., study limitations/risk of bias vs inconsistency, imprecision, or indirectness). However, the present empirical report targets, for the first time, the final, overall assessment of CoE according to GRADE specifications, which is how CoE is used in practice to aid the interpretation of evidence and inform the development of clinical guidelines.
We also detected imprecision in the estimates of effect sizes and relatively wide ROR confidence intervals, particularly in the subgroup of meta-analyses describing treatment effects in reviews in which CoE changed from moderate/high to low/very low. It may be argued that current methods of CoE appraisal are simply not sensitive enough and that, with a much larger sample of SRs/MAs, we would be able to differentiate effect sizes across categories of CoE. This point was made by Howick and colleagues,30 who found no change in CoE between original and updated reviews in a set of 48 trials they examined, although they made no attempt to identify changes in effect sizes. However, obtaining larger sample sizes is unrealistic, given that we reviewed almost all SRs in the Cochrane database since the GRADE assessment of CoE was mandated (up to May 2021). Finally, few of the Cochrane Reviews we analyzed included observational studies. It is possible that GRADE does not differentiate the quality of randomized evidence well but performs better when randomized studies are compared with observational studies; Cochrane Reviews, however, are typically based on randomized trials. Therefore, categorization of CoE based on the currently mandated critical appraisal system using GRADE in Cochrane Reviews does not meaningfully separate effect sizes across the existing gradations of CoE (although GRADE's capacity to distinguish the magnitude of effect size between randomized and observational studies outside the purview of Cochrane Reviews remains a worthwhile question for further empirical research).
Given that studies can be well conducted and correctly estimate treatment effects yet be poorly reported,31,32 it is also possible that we could not detect an influence of CoE on estimates of treatment effects because current critical appraisal methods depend on the quality of reporting of the trials selected for meta-analysis. However, if we believe that the quality of reporting does not matter, then the entire critical appraisal enterprise can be considered misplaced to begin with.