Univariate Biomarker Distribution Simulations
Dataset: The National Center for Health Statistics (NCHS) conducts the National Health and Nutrition Examination Survey (NHANES), an annual survey that assesses the health and nutritional status of adults and children in the United States by means of laboratory measurements, physical screening, and questionnaires, with the data released to the public in two-year cycles 9. Here, we obtained and pooled the NHANES data from the 2009-2010, 2011-2012, 2013-2014, 2015-2016, and 2017-2018 cycles.
We identified a set of 16 diverse diabetes-relevant biomarkers for investigating the utility of GANs for modeling high-dimensional biomarker joint distributions. The following biomarkers were selected: urine creatinine, fasting glucose, insulin, body mass index, glycohemoglobin, triglycerides, total cholesterol, alanine aminotransferase (ALT), aspartate aminotransferase (AST), gamma-glutamyl transferase (GGT), uric acid, high-sensitivity C-reactive protein, direct HDL-cholesterol, average systolic blood pressure, and ferritin. Age, sex, and race/ethnicity were obtained as demographic descriptors.
Data Pre-processing: Average systolic blood pressure was a derived variable, calculated as the mean of 3 systolic blood pressure readings for participants with ≥ 3 readings and as the mean of 2 readings for those with only 2 readings.
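As an illustration, this derivation could be implemented along the lines of the sketch below; the column names (sbp1–sbp3) are hypothetical placeholders, not the variable names used in the NHANES files.

```python
import numpy as np
import pandas as pd

SBP_COLS = ["sbp1", "sbp2", "sbp3"]  # hypothetical column names for the systolic readings

def average_sbp(row: pd.Series) -> float:
    """Mean of the available readings: 3 readings if >= 3 are present, 2 if only 2 are present."""
    readings = row[SBP_COLS].dropna()
    if len(readings) >= 2:
        return float(readings.mean())
    return np.nan  # fewer than 2 readings: leave missing; handled later by listwise exclusion

# df["avg_sbp"] = df.apply(average_sbp, axis=1)
```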
The biomarker data were log-transformed and min-max scaled to the range \(\left[-1,1\right]\). The pooled data were randomly split into training (80%) and test (20%) sets. Listwise exclusion was employed, i.e., records with any missing biomarker value were removed.
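A minimal pre-processing sketch under these choices is shown below; whether the scaling constants are estimated before or after the train/test split is not stated in the text, so estimating them on the pooled data is an assumption here, as are the function and variable names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(df: pd.DataFrame, biomarker_cols: list, seed: int = 0):
    """Listwise exclusion, log transform, min-max scaling to [-1, 1], then an 80/20 split."""
    data = df[biomarker_cols].dropna()            # listwise exclusion of incomplete records
    data = np.log(data)                           # log-transform the biomarker values
    lo, hi = data.min(), data.max()               # per-biomarker minima and maxima
    scaled = 2.0 * (data - lo) / (hi - lo) - 1.0  # min-max scale each biomarker to [-1, 1]
    train, test = train_test_split(scaled, test_size=0.2, random_state=seed)
    return train, test, lo, hi                    # keep lo/hi to invert the scaling later
```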
GAN Architecture: A common generator neural network architecture was used to model all 16 univariate biomarker distributions, i.e., the 1-dimensional case.
The generator takes input from a 10-dimensional latent space and is trained to produce output resembling the training data distribution. The generator neural network model consisted of two dense layers. The hidden layers comprised a rectified linear unit (ReLU) activation 10 and a batch normalization layer 11. The batch normalization layers standardize the inputs to the dense layer for each batch and stabilize the learning process 11. The final layer of the generator was tanh-activated, matching the \(\left[-1,1\right]\) scaling of the data.
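A sketch of such a generator in Keras/TensorFlow is given below; the hidden-layer width is an assumption (the text does not report layer sizes), and reading the description as two ReLU/batch-normalized hidden layers plus a tanh output layer is one plausible interpretation.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 10    # dimensionality of the latent input space
HIDDEN_UNITS = 64  # assumed hidden-layer width; not reported in the text

def build_generator(output_dim: int = 1) -> tf.keras.Model:
    """Generator: dense hidden layers with ReLU + batch normalization, tanh output."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(LATENT_DIM,)),
        layers.Dense(HIDDEN_UNITS),
        layers.ReLU(),
        layers.BatchNormalization(),  # standardizes the batch inputs to stabilize training
        layers.Dense(HIDDEN_UNITS),
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Dense(output_dim, activation="tanh"),  # output in [-1, 1], matching the scaled data
    ])
```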
The discriminator takes real (training) or generated samples as input and predicts whether each sample belongs to the training distribution. The discriminator model contained two dense layers with ReLU activation. The output was passed to a sigmoid activation function to obtain a classification score. The discriminator network was trained using a binary cross-entropy loss.
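A matching discriminator sketch is shown below; as with the generator, the layer width and the choice of the Adam optimizer are assumptions.

```python
def build_discriminator(input_dim: int = 1) -> tf.keras.Model:
    """Discriminator: two ReLU dense layers followed by a sigmoid classification score."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(HIDDEN_UNITS, activation="relu"),
        layers.Dense(HIDDEN_UNITS, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability that the input is a real sample
    ])
    # Binary cross-entropy loss, as described in the text; the Adam optimizer is an assumption
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```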
Our GANs were prototyped using Keras/TensorFlow. Keras is a neural network library integrated with TensorFlow, the open-source library for AI and machine learning.
Training was conducted for a maximum of 5000 epochs or until the Kolmogorov-Smirnov test p-value for the training vs. generated sample distributions was > 0.05.
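The loop below sketches this procedure using the standard Keras pattern of alternating discriminator and generator updates, reusing the generator/discriminator builders and LATENT_DIM from the sketches above; the batch size, latent prior (standard normal), and optimizer are assumptions not reported in the text.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import ks_2samp

def train_gan(generator, discriminator, x_train, epochs=5000, batch_size=128):
    """x_train: NumPy array of scaled biomarker values, shape (n, 1)."""
    # The discriminator was compiled before being frozen here, so it still updates via its
    # own train_on_batch calls, while the combined model updates only the generator.
    discriminator.trainable = False
    gan = tf.keras.Sequential([generator, discriminator])
    gan.compile(optimizer="adam", loss="binary_crossentropy")

    n_batches = max(1, len(x_train) // batch_size)
    for epoch in range(epochs):
        for _ in range(n_batches):
            # Discriminator update on a batch of real and generated samples
            idx = np.random.randint(0, len(x_train), batch_size)
            z = np.random.normal(size=(batch_size, LATENT_DIM))
            fake = generator.predict(z, verbose=0)
            x = np.vstack([x_train[idx], fake])
            y = np.vstack([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])
            discriminator.train_on_batch(x, y)
            # Generator update: push the discriminator to label generated samples as real
            z = np.random.normal(size=(batch_size, LATENT_DIM))
            gan.train_on_batch(z, np.ones((batch_size, 1)))
        # Stopping criterion: two-sample KS test of training vs. generated distributions
        z = np.random.normal(size=(1000, LATENT_DIM))
        generated = generator.predict(z, verbose=0).ravel()
        if ks_2samp(x_train.ravel(), generated).pvalue > 0.05:
            break
    return generator
```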
Data Analysis: GAN performance was assessed by comparing the GAN-generated biomarker distributions to the test data. For visualization of each biomarker distribution, density histograms based on samples of 1000 generated and 1000 test data points were used. Quantile-quantile plots of test data vs. GAN-generated data were also assessed.
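These visual comparisons could be produced along the lines of the following sketch; the bin count, quantile grid, and figure layout are arbitrary choices rather than reported settings.

```python
import matplotlib.pyplot as plt
import numpy as np

def compare_distributions(test_vals, generated_vals, name, n=1000, seed=0):
    """Overlaid density histograms and a quantile-quantile plot of test vs. generated data."""
    rng = np.random.default_rng(seed)
    test_s = rng.choice(np.asarray(test_vals).ravel(), n, replace=False)
    gen_s = rng.choice(np.asarray(generated_vals).ravel(), n, replace=False)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(test_s, bins=50, density=True, alpha=0.5, label="test")
    ax1.hist(gen_s, bins=50, density=True, alpha=0.5, label="generated")
    ax1.set_title(f"{name}: density histogram")
    ax1.legend()

    # Q-Q plot: matched empirical quantiles of the test and generated samples
    q = np.linspace(0.01, 0.99, 99)
    ax2.plot(np.quantile(test_s, q), np.quantile(gen_s, q), "o", markersize=3)
    lims = [min(test_s.min(), gen_s.min()), max(test_s.max(), gen_s.max())]
    ax2.plot(lims, lims, "k--")  # identity line: perfect agreement of the two distributions
    ax2.set_xlabel(f"{name}, test quantiles")
    ax2.set_ylabel(f"{name}, generated quantiles")
    fig.tight_layout()
    return fig
```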