Methods
This retrospective cohort study was approved by the Children’s Hospital
of Orange County Institutional Review Board (IRB #2008107).
Data Sources, Patients, and Variables
Oracle EHR Real-World Data (OERWD), a large multicenter electronic
health records (EHR) database, was used for this study. The database
contains data from more than 125 US health systems as of September 2023.
OERWD is fully de-identified, encrypted, and secured in compliance with
the Health Insurance Portability and Accountability Act of 1996 privacy
regulation [14,15]. Details about the database are
available in the data descriptor paper by Ehwerhemuepha et al. (2022) [14].
The study cohort was defined as preterm infants with a gestational age of
32 weeks or less and/or a birthweight of less than 1500 grams, and was
retrieved from the database. The variables retrieved were patients’
assigned sex, race, ethnicity, healthcare plan, gestational age,
birthweight, vital signs, FiO2, partial pressure of oxygen (PaO2), and
partial pressure of carbon dioxide (PCO2). Perinatal and maternal
information was not retrieved because mother and infant records could not
be linked in the database. CPAP failure was defined as the introduction of
invasive mechanical ventilation within 72 hours of birth.
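The cohort and outcome definitions above can be illustrated with a minimal Python sketch. It is not the authors' code (the study was carried out in R on the HealtheDataLab platform); the `Infant` record and its field names are hypothetical, and treating exactly 72 hours as inside the window is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Thresholds come from the text; everything else here is illustrative.
MAX_GESTATIONAL_AGE_WEEKS = 32
MAX_BIRTHWEIGHT_GRAMS = 1500
CPAP_FAILURE_WINDOW = timedelta(hours=72)


@dataclass
class Infant:
    # Hypothetical field names; the real database schema is not described.
    gestational_age_weeks: Optional[float]
    birthweight_grams: Optional[float]
    birth_time: datetime
    first_invasive_ventilation_time: Optional[datetime] = None


def in_cohort(infant: Infant) -> bool:
    """Gestational age <= 32 weeks and/or birthweight < 1500 g."""
    ga = infant.gestational_age_weeks
    bw = infant.birthweight_grams
    ga_ok = ga is not None and ga <= MAX_GESTATIONAL_AGE_WEEKS
    bw_ok = bw is not None and bw < MAX_BIRTHWEIGHT_GRAMS
    return ga_ok or bw_ok


def cpap_failed(infant: Infant) -> bool:
    """True if invasive mechanical ventilation began within 72 h of birth."""
    start = infant.first_invasive_ventilation_time
    if start is None:
        return False  # never invasively ventilated
    elapsed = start - infant.birth_time
    # Assumption: the 72-hour boundary itself counts as a failure.
    return timedelta(0) <= elapsed <= CPAP_FAILURE_WINDOW
```

An infant with a missing gestational age can still enter the cohort on birthweight alone, which mirrors the "and/or" in the definition.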
Machine Learning Modeling
The collected data were split into training (75%) and test (25%) sets.
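As a rough illustration of a 75/25 split, the sketch below shuffles the records with a fixed seed and cuts the list at the 75% mark. The text does not say whether the split was stratified by outcome or how it was seeded, so simple random sampling, the function name `split_train_test`, and the `seed` parameter are all assumptions.

```python
import random


def split_train_test(records, train_fraction=0.75, seed=0):
    """Return (train, test) lists from a seeded random shuffle of records."""
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

Fixing the seed makes the partition reproducible, so the test set stays untouched across repeated tuning runs.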
Extreme gradient boosting (XGBoost), an implementation of stochastic
gradient boosting, was selected given its ability to capture complex
nonlinear relationships between outcome and predictor variables in the
presence of missing data [16,17]. Ten-fold
cross-validation on the training set was used to select the optimal
hyperparameters of the model from a grid of values consisting of
learning rates (to control model convergence) and maximum tree depths
(2, 4, or 6; to control the complexity of the trees built), with all
other hyperparameters left at their default values. The optimal
hyperparameters were used to develop a final model on the training
dataset. Variables used in the model were ranked for importance using
the “Gain” metric, which measures the improvement in model performance
contributed by a feature across the tree splits in which it is used.
SHapley Additive exPlanations (SHAP) values were used to visualize how the
predicted risk of CPAP failure changes with the value of a
feature. Higher SHAP values imply greater risk of CPAP failure. The test
set was used to obtain unbiased estimates of model performance, including
the area under the receiver operating characteristic curve (AUROC),
area under the precision-recall curve (AUCPR), F1 score, positive and
negative predictive values, and relative risk of failing CPAP given a
positive prediction.
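Several of the test-set metrics listed above can be sketched in plain Python. This is an illustration, not the authors' R code: the function names are hypothetical, AUCPR is omitted for brevity, and the relative risk is assumed to be the observed failure rate among positive predictions divided by the failure rate among negative predictions, that is, PPV / (1 - NPV).

```python
def auroc(y_true, scores):
    """AUROC as the probability that a failure outranks a non-failure."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case of each class")
    # Count pairwise wins; ties contribute half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _ratio(num, den):
    """Division that returns NaN instead of raising on an empty denominator."""
    return num / den if den else float("nan")


def threshold_metrics(y_true, y_pred):
    """PPV, NPV, F1, and relative risk from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    ppv = _ratio(tp, tp + fp)            # failure rate if predicted positive
    npv = _ratio(tn, tn + fn)
    risk_if_negative = _ratio(fn, fn + tn)  # equals 1 - NPV
    return {
        "ppv": ppv,
        "npv": npv,
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        # Assumed definition of relative risk given a positive prediction.
        "relative_risk": ppv / risk_if_negative if risk_if_negative else float("nan"),
    }
```

The pairwise AUROC is quadratic in sample size, which is acceptable for a sketch; a rank-based formulation would be preferable on a large test set.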
This study was carried out using the Oracle Health HealtheDataLab
platform as well as the R Statistical Programming Language.