Conditional Biomarker Distribution Simulations for Under-represented Groups
A conditional GAN was used to evaluate whether the GAN method could generate biomarker distributions conditioned on race/ethnicity group (Black, Hispanic, Other and White), including under-represented minority groups.
The numbers (%) of Black, Hispanic, Other and White subjects in the test data set were 1730 (20.8%), 2228 (26.8%), 1137 (13.7%) and 3230 (38.8%), respectively; the total number of subjects was 8325.
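As a quick arithmetic check, the reported group percentages can be reproduced from the counts. The sketch below uses only the counts stated above; the variable names are illustrative and are not taken from the study's code.

```python
# Subject counts per race/ethnicity group, as reported for the test data set.
counts = {"Black": 1730, "Hispanic": 2228, "Other": 1137, "White": 3230}

# Total number of subjects across the four groups.
total = sum(counts.values())

# Share of each group, in percent, rounded to one decimal place.
percentages = {group: round(100 * n / total, 1) for group, n in counts.items()}

for group, n in counts.items():
    print(f"{group}: {n} ({percentages[group]}%)")
print(f"Total: {total}")
```

Because each percentage is rounded independently, the rounded values sum to 100.1% rather than exactly 100%.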
The t-SNE projections of the GAN-generated data and the test data distributions for the four race/ethnicity categories are compared in Figure 4. The corresponding UMAP projections are summarized in Supplementary Figure 3. In both the t-SNE and UMAP projections, the GAN-generated data were qualitatively well dispersed across the test data for all four race/ethnicity groups. The box plots in Figure 5 and Supplementary Figure 4 compare the univariate distributions of the 14 biomarkers and demonstrate the concordance of the GAN-generated data with the test data for each race/ethnicity group.
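The box-plot comparison can also be expressed numerically. The following is a minimal sketch, not the authors' analysis code: it assumes that, for a single biomarker, the test and GAN-generated values are available as lists keyed by group label. The helper names (`five_number_summary`, `compare_by_group`) are hypothetical, and the values are randomly generated placeholders rather than study data.

```python
import random
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def compare_by_group(test, generated):
    """For each group, report the absolute gap in median and in IQR."""
    report = {}
    for group in test:
        _, t_q1, t_med, t_q3, _ = five_number_summary(test[group])
        _, g_q1, g_med, g_q3, _ = five_number_summary(generated[group])
        report[group] = {
            "median_diff": abs(t_med - g_med),
            "iqr_diff": abs((t_q3 - t_q1) - (g_q3 - g_q1)),
        }
    return report


# Placeholder values for one hypothetical biomarker (NOT study data).
rng = random.Random(0)
groups = ["Black", "Hispanic", "Other", "White"]
test_data = {g: [rng.gauss(100, 15) for _ in range(500)] for g in groups}
gan_data = {g: [rng.gauss(100, 15) for _ in range(500)] for g in groups}

for group, diffs in compare_by_group(test_data, gan_data).items():
    print(group, diffs)
```

Small median and interquartile-range gaps in every group would correspond to the visual concordance seen in the box plots. A summary of this kind covers only univariate agreement and says nothing about the joint distribution.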
Together, these results demonstrate that the GAN strategy can generate satisfactory approximations of high-dimensional joint biomarker distributions in under-represented groups.