Univariate Biomarker Distribution Simulations
Dataset: The National Center for Health Statistics (NCHS) conducts the
National Health and Nutrition Examination Survey (NHANES), an annual
survey that assesses the health and nutritional status of adults and
children in the United States by means of laboratory measurements,
physical examinations, and interviews; the data are released to the
public in two-year cycles [9]. Here, we obtained and pooled the NHANES
data from the 2009-2010, 2011-2012, 2013-2014, 2015-2016, and 2017-2018
cycles.
We identified a set of 16 diverse diabetes-relevant biomarkers for
investigating the utility of GANs for modeling high-dimensional
biomarker joint distributions. The following biomarkers were selected:
urine creatinine, fasting glucose, insulin, body mass index,
glycohemoglobin, triglyceride, total cholesterol, alanine
aminotransferase (ALT), aspartate aminotransferase (AST), gamma glutamyl
transferase (GGT), uric acid, high sensitivity C-reactive protein,
direct HDL-cholesterol, average systolic blood pressure, and ferritin.
Age, sex, and race/ethnicity were obtained as demographic descriptors.
Data Pre-processing: Average systolic blood pressure was a derived
variable, calculated as the average of 3 systolic blood pressure
readings for participants with ≥ 3 readings and as the average of 2
readings for those with only 2 readings.
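As an illustration, a minimal pandas sketch of this derivation is given below; the column names BPXSY1-BPXSY3 (the NHANES systolic reading variables) and the helper function are assumptions for illustration, not the study's actual code.

```python
import numpy as np
import pandas as pd

def average_systolic_bp(df: pd.DataFrame) -> pd.Series:
    # Illustrative column names: BPXSY1-BPXSY3 hold the systolic readings.
    readings = df[["BPXSY1", "BPXSY2", "BPXSY3"]]
    n_readings = readings.notna().sum(axis=1)
    # Mean of the readings that are present: 3 readings if >= 3 were taken,
    # 2 readings if only 2 were taken.
    avg = readings.mean(axis=1, skipna=True)
    # Participants with fewer than 2 readings are left missing.
    return avg.where(n_readings >= 2, np.nan)
```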
The biomarker data were log-transformed and min-max scaled to the range \([-1, 1]\). The pooled data were randomly
split into training (80%) and test (20%) data sets. Listwise exclusion
was employed, i.e., records with missing values were excluded.
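A minimal sketch of these pre-processing steps, assuming the pooled biomarkers sit in a pandas DataFrame `df` with columns `biomarker_cols`, could look as follows; the function name, the scikit-learn split, and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(df, biomarker_cols, seed=42):
    # Listwise exclusion: drop records with any missing biomarker value.
    data = df[biomarker_cols].dropna()

    # Log-transform, then min-max scale each biomarker to [-1, 1].
    x = np.log(data.to_numpy(dtype=float))
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    x_scaled = 2.0 * (x - x_min) / (x_max - x_min) - 1.0

    # Random 80/20 train/test split of the pooled data.
    x_train, x_test = train_test_split(x_scaled, test_size=0.2, random_state=seed)
    return x_train, x_test, (x_min, x_max)
```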
GAN Architecture: A common generator neural network
architecture was used for modeling all 16 univariate biomarker
distributions, i.e., for the 1-dimensional case.
The generator takes as input a sample from a 10-dimensional latent space
and was trained to produce output resembling the training data
distribution. The generator model consisted of two dense
layers. The hidden layers comprised a rectified linear unit
(ReLU) activation [10] and a batch normalization layer [11]. The batch
normalization layers standardize their inputs for each batch and
stabilize the learning process [11]. The final layer of the generator was
tanh activated, consistent with the \([-1, 1]\) scaling of the data.
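The text leaves the layer widths (and, to a degree, the exact layer count) unspecified, so the following Keras sketch shows one plausible reading: two ReLU/batch-normalization hidden blocks with an assumed width of 32 units each, followed by the tanh output.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(latent_dim=10, hidden_units=32, output_dim=1):
    # Two ReLU-activated dense blocks with batch normalization, then a
    # tanh output matching the [-1, 1] scaling of the training data.
    return keras.Sequential([
        keras.Input(shape=(latent_dim,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(hidden_units, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(output_dim, activation="tanh"),
    ])
```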
The discriminator takes either a training sample or a generator output as
input and predicts whether it belongs to the training distribution. The
discriminator model contained two dense layers with ReLU activation. The
output was passed to a sigmoid activation function to obtain a
classification score. The discriminator network was trained using a
binary cross-entropy loss.
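A corresponding Keras sketch of the discriminator, again with assumed layer widths, is shown below; the binary cross-entropy loss is applied in the training sketch that follows.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_discriminator(input_dim=1, hidden_units=32):
    # Two ReLU-activated dense layers followed by a sigmoid output that
    # scores whether a sample looks like it came from the training data.
    return keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
```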
Our GANs were prototyped using Keras and TensorFlow. Keras is a neural
network library integrated with TensorFlow, an open-source library for
AI and machine learning.
Training was conducted for up to 5000 epochs or until the
Kolmogorov-Smirnov test p-value for the training vs. generated sample
distributions was > 0.05.
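A schematic training loop implementing this stopping rule, using the generator and discriminator sketches above and SciPy's two-sample Kolmogorov-Smirnov test, is given below; the batch size, Adam learning rates, and standard normal latent prior are assumptions not stated in the text.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import ks_2samp

def train_gan(generator, discriminator, x_train, latent_dim=10,
              epochs=5000, batch_size=128, ks_alpha=0.05):
    bce = tf.keras.losses.BinaryCrossentropy()
    g_opt = tf.keras.optimizers.Adam(1e-4)
    d_opt = tf.keras.optimizers.Adam(1e-4)
    x_train = np.asarray(x_train, dtype="float32")

    for _ in range(epochs):
        for i in range(0, len(x_train), batch_size):
            real = x_train[i:i + batch_size]
            z = tf.random.normal((len(real), latent_dim))

            # Discriminator step: real samples labeled 1, generated labeled 0.
            with tf.GradientTape() as d_tape:
                fake = generator(z, training=True)
                d_real = discriminator(real, training=True)
                d_fake = discriminator(fake, training=True)
                d_loss = (bce(tf.ones_like(d_real), d_real)
                          + bce(tf.zeros_like(d_fake), d_fake))
            d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
            d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

            # Generator step: try to make the discriminator call fakes real.
            with tf.GradientTape() as g_tape:
                d_fake = discriminator(generator(z, training=True), training=True)
                g_loss = bce(tf.ones_like(d_fake), d_fake)
            g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
            g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))

        # Stopping rule: two-sample KS test of training vs. generated data.
        z = tf.random.normal((len(x_train), latent_dim))
        generated = generator(z, training=False).numpy()
        if ks_2samp(x_train.ravel(), generated.ravel()).pvalue > ks_alpha:
            break
    return generator
```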
Data Analysis: GAN performance was assessed by comparing the
GAN-generated biomarker distributions to the test data. Each biomarker
distribution was visualized with density histograms of 1000
GAN-generated and 1000 test data samples. Quantile-quantile plots of the
test data vs. the GAN-generated data were also assessed.
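For illustration, the following matplotlib sketch produces both comparisons for a single biomarker; apart from the 1000-sample size mentioned above, the bin count, quantile grid, and styling are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

def plot_comparison(generator, x_test, latent_dim=10, n=1000, name="biomarker"):
    z = tf.random.normal((n, latent_dim))
    generated = generator(z, training=False).numpy().ravel()
    test = np.random.choice(x_test.ravel(), size=min(n, x_test.size), replace=False)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Density histograms of test vs. GAN-generated samples.
    ax1.hist(test, bins=50, density=True, alpha=0.5, label="test")
    ax1.hist(generated, bins=50, density=True, alpha=0.5, label="generated")
    ax1.set_title(f"{name}: density histograms")
    ax1.legend()

    # Quantile-quantile plot of test vs. generated quantiles.
    q = np.linspace(0.01, 0.99, 99)
    ax2.scatter(np.quantile(test, q), np.quantile(generated, q), s=10)
    lims = [min(test.min(), generated.min()), max(test.max(), generated.max())]
    ax2.plot(lims, lims, "k--", linewidth=1)  # identity reference line
    ax2.set_xlabel("test quantiles")
    ax2.set_ylabel("generated quantiles")
    ax2.set_title(f"{name}: Q-Q plot")

    fig.tight_layout()
    return fig
```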