Standardised and reproducible phenotyping using distributed analytics
and tools in the Data Analysis and Real World Interrogation Network
(DARWIN EU®)
Abstract
Purpose Generating representative disease phenotypes is important for
ensuring that the findings of observational studies are reliable.
The aim of this manuscript is to outline a reproducible framework for
reliable and traceable phenotype generation based on real world data for
use in the Data Analysis and Real World Interrogation Network (DARWIN
EU®). We illustrate the use of this framework by generating phenotypes
for two complex diseases: pancreatic cancer and systemic lupus
erythematosus (SLE). Methods Phenotyping follows a 14-step
process based on a standard operating procedure co-created by the DARWIN
EU® Coordination Centre in collaboration with the European Medicines
Agency. Bespoke R packages were used to generate and review
codelists for the two phenotypes based on real world data mapped to
the OMOP Common Data Model. Results Phenotypes were generated for both
pancreatic cancer and SLE, and cohorts were generated using the Clinical
Practice Research Datalink (UK primary care records) and Pharmetrics (US
health claims data). Diagnostic checks showed that incidence and
prevalence in these cohorts were broadly similar to figures in the
previously published literature. Additionally, co-occurring symptoms,
conditions, and medication use were in keeping with pre-specified
clinical descriptions based on previous knowledge. Conclusions Our
detailed phenotyping process makes use of bespoke tools and allows for
comprehensive codelist generation and review, as well as large-scale
exploration of the characteristics of the generated cohorts. Wider use
of structured phenotyping methods will be important in ensuring the
reliability of observational studies for regulatory purposes.