DISCUSSION
Compared to composite score modelling20, the IRT approach differentiates the items by their sensitivity level and has shown the potential to reduce trial sample size for detecting drug effects9,22. The sample-size saving is an attractive proposition, especially as the field advances towards increasingly personalized medicine, where a given therapy is expected to be effective in only a small subpopulation of patients.
Multi-variable IRT models with item-level interactions across domains have been published, but they were not readily adaptable to analysis of Part III alone12,22. In this work, we used only the items in Part III, aiming to support early development of PD drugs where a Go/No-Go decision hinges on their effect on the (more objective) motor examinations. Our analysis also has a differentiating methodological feature: the analyses reported by others used the IRT model to simulate the total scores; applied hypothetical drug effects to both the severity endpoint and the simulated total scores; and compared the two endpoints, severity and total score, for the sample size required to detect the drug effect. This approach could potentially bias against the total-score endpoint if the simulation inflated the noise in the total score. In contrast, we applied the drug effect directly to the SoS, just as to the severity, so the two endpoints were treated more fairly.
To compare the sample size requirement between the IRT and the conventional SoS methods, we applied a range of potential reductions in progression rate that a new agent could cause. The normally distributed effects were centered at 0.3 and had a 5th–95th percentile range of 0.1 to 0.5, which has been considered a clinically meaningful effect range for neurodegenerative indications such as Parkinson's disease and Alzheimer's disease9,22. While the center of the range represented an effect that is highly relevant and reasonably plausible, the lower and upper tails were respectively less relevant and less plausible. As such, effect levels further from the center carried less weight in the computation of the overall PoS, which is then effectively the power averaged over the distribution of the effect level. We consider this a useful approach to account for the uncertainty in the eventual effect size that a new agent could produce. The lower panel of Figure 4 illustrates the (expected) difference between the PoS under this effect distribution and the power under the more extreme effect sizes. For the same sample size, the power for detecting a large treatment effect would be higher than the PoS for detecting a range of potential effects. Under this condition, we found that the IRT method could lead to a substantial saving of about 50% in sample size compared to the conventional SoS method. This magnitude of sample-size saving is consistent with our recent analysis of a placebo-controlled clinical trial of ropinirole, an established dopaminergic agent.33
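The weighting scheme above can be sketched as a Monte Carlo computation in which PoS is the power averaged over the assumed effect distribution (normal, mean 0.3, 5th–95th percentile range 0.1–0.5). The power function below is a simple normal-approximation for a two-arm comparison; the variance and sample-size values are illustrative assumptions, not those of the actual trial model.

```python
import math
import random

def power_two_sample(delta, sd, n_per_arm, alpha_z=1.96):
    """Normal-approximation power for a two-sided two-sample z-test.

    delta: true difference in progression-rate reduction between arms
    sd: assumed standard deviation of the endpoint (illustrative)
    The negligible opposite-tail contribution is ignored.
    """
    se = sd * math.sqrt(2.0 / n_per_arm)
    z = abs(delta) / se
    # Phi(z - z_alpha) via the error function
    return 0.5 * (1.0 + math.erf((z - alpha_z) / math.sqrt(2.0)))

def probability_of_success(n_per_arm, sd, mean_effect=0.3,
                           n_sims=100_000, seed=1):
    """PoS = expected power over effect ~ Normal(0.3, sd_eff),
    where sd_eff is set so the 95th percentile is 0.5."""
    sd_eff = (0.5 - mean_effect) / 1.6448536269514722  # ~0.122
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        effect = rng.gauss(mean_effect, sd_eff)
        total += power_two_sample(effect, sd, n_per_arm)
    return total / n_sims
```

As the text notes, for the same sample size the power at a single large effect (e.g. 0.5) exceeds the PoS averaged over the whole effect distribution, since the distribution places weight on smaller, harder-to-detect effects.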
The tremor tests showed poor discrimination power; individually and collectively they held very little information (Table 2). For most of the tremor items, the probability of score 0 (normal) was disproportionately high, regardless of a patient's severity as defined by the overall instrument (Figure 2, lower left and right). Consistent with these observations, the clinical trial PoS was not affected by whether the tremor items were included in the analyses (Figure 4, upper). Interestingly, a Rasch measurement theory analysis revealed disordered thresholds for several tremor-related items.34 These observations support the view that the tremor tests might measure a different construct and perhaps should therefore be assessed using a separate, more sensitive scale.22,31,35
Interestingly, all seven left-side non-tremor items were among the most informative (Table 2). Compared to their right-side counterparts, they showed higher discriminatory power (aj) and generally lower values and narrower ranges of the difficulty parameters (bj1 to bj4). This was also reflected in the left side's better-differentiated ICCs (Figure 2, lower left) and slightly higher proportion of higher scores (Figure 2, lower right). Similarly, Gottipati et al. identified "left hand finger tapping" as the most informative among the sided items12. In a previously reported analysis, we explored the PoS for four different approaches: by IRT and SoS, using all items or only the seven left-side items. For the same sample size, the order of estimated trial PoS was: IRT on all items > IRT on seven items > SoS on seven items > SoS on all items.34 This order illustrates IRT's ability to enhance the signal-to-noise ratio through item differentiation; indeed, its advantage over SoS was reduced when only the most informative items were included in the analysis. These findings were consistent with an earlier analysis of combined Part II and Part III data by Buatois et al.22
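The roles of the discrimination (aj) and difficulty (bj1 to bj4) parameters can be illustrated with Samejima's graded response model, a standard parameterization for ordered-category items such as the 0–4 MDS-UPDRS responses. This is a generic sketch, not a restatement of the fitted model; the parameter values are purely illustrative. It shows why a low-discrimination item (like the tremor items) contributes little Fisher information at any severity level, while a high-discrimination item contributes much more.

```python
import math

def grm_category_probs(theta, a, b):
    """Samejima graded response model: cumulative P(score >= k) is
    logistic(a * (theta - b_k)); category probabilities are differences
    of adjacent cumulative probabilities."""
    pstar = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [pstar[k] - pstar[k + 1] for k in range(len(b) + 1)]

def grm_item_information(theta, a, b):
    """Fisher information of one graded item at severity theta.
    Information scales with a^2, so low-discrimination items are
    uninformative regardless of their difficulty parameters."""
    pstar = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    info = 0.0
    for k in range(len(b) + 1):
        p = pstar[k] - pstar[k + 1]
        if p > 1e-12:
            w = pstar[k] * (1 - pstar[k]) - pstar[k + 1] * (1 - pstar[k + 1])
            info += a * a * w * w / p
    return info
```

With illustrative thresholds b = (-1, 0, 1, 2), raising a from 0.5 to 2.0 sharply increases the item information at a given theta, mirroring the contrast between the tremor items and the left-side items in Table 2.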
A recent cross-sectional analysis also found the discrimination parameters to be higher, and the difficulty parameters lower, for the left-side items than for the right-side items.35 Similar findings were reported from an item-response analysis of multiple latent variables, although that analysis also reported that a majority (58%) of the patients had more advanced baseline disability on the right side of the body.12 The lower difficulty parameters, or worse test performance, for the left-side items may reflect most people being right-handed, despite neuroimaging and meta-analyses suggesting the dominant side might be affected earlier25,26,27. A change of hand preference as the disease progresses has also been reported.36 This is an area to be investigated further, in different datasets and at different stages of symptom progression. Another possible reason for the consistently worse performance of the left side is that this side is always examined later on the UPDRS form. Conceivably, this hypothesis could be tested by randomizing the order of the sided tests.
We introduced an inter-occasion (visit) variability in the longitudinal model to reflect the commonly recognized disease fluctuation; this improved the estimation of the progression rate. The model suggested that patients with lower baseline severity had faster progression, supporting the report that progression, when measured by MDS-UPDRS Part III, was slower at the more advanced stage21. The effects of other factors such as genotype, comorbidity, age, disease history and diagnostic biomarkers on disease progression remain to be assessed.23,24,30
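The longitudinal structure described above can be sketched as a minimal simulation: a linear progression with a per-visit (inter-occasion) random shift and residual error, in which the slope declines with baseline severity. All numeric values and the linear baseline-slope link are hypothetical illustrative choices, not the fitted model.

```python
import random

def simulate_severity_trajectory(times, seed=0,
                                 pop_baseline=20.0, sd_baseline=5.0,
                                 pop_slope=2.0, slope_vs_baseline=-0.05,
                                 sd_iov=1.5, sd_resid=1.0):
    """Simulate one patient's latent severity across visits.

    severity(t) = baseline + slope * t + IOV(visit) + residual,
    where slope decreases with baseline so that lower-baseline patients
    progress faster, as suggested in the text. All parameter values
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    baseline = rng.gauss(pop_baseline, sd_baseline)
    # hypothetical linear link: higher baseline -> slower progression
    slope = pop_slope + slope_vs_baseline * (baseline - pop_baseline)
    scores = []
    for t in times:
        iov = rng.gauss(0.0, sd_iov)      # inter-occasion (visit) shift
        resid = rng.gauss(0.0, sd_resid)  # residual error
        scores.append(baseline + slope * t + iov + resid)
    return scores
```

Setting sd_iov to zero recovers a smooth trajectory; a positive sd_iov reproduces the visit-to-visit fluctuation that, if unmodelled, would be absorbed into the residual and degrade the progression-rate estimate.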
That the IRT analysis of MDS-UPDRS Part III required a smaller sample size is relevant to composite scales used in other indications. Because of their less informative items, composite scores can compromise the signal-to-noise ratio. Some instruments are also long, hence physically and mentally exhausting for debilitated patients, leading to incomplete or poor-quality data. Therefore, a bespoke, shorter instrument is often desired. However, development, validation and user training are costly and time-consuming; and a new instrument risks missing relevant information when used to assess a new drug of unestablished profile, and lacks comparability with existing data. The IRT approach can enhance signal-detection power and reduce sample size by directly accessing and weighting item-level data of a well-established instrument that is accepted by regulators. When item scores are used directly, incomplete data remain useful. By extension, it may be possible to reduce patient burden by asking each patient to take only a stratified partial test. Other potential applications of this approach include bridging between different versions of an evolving instrument for meta-analysis or cross-study comparison,28 and translating clinical trial results into patient outcome expectations. These areas require extensive further research and experience building by the clinical research community.