Electronic Health Records and Augmented Data
Roblin and colleagues’ retrospective cohort study used Kaiser Permanente
Georgia’s EHR system from 2006 to 2015 to identify potential TG
patients (n=271).13 The authors describe a 3-step algorithm: (1) an
initial EHR search for International Classification of Diseases, Ninth
Revision (ICD-9) diagnosis codes (Supplemental Table 1) and key
text-strings relevant to TG status in supplemental digitized provider
notes; (2) validation of TG status, either by the presence of at least
two diagnosis codes or by manual review of the text-strings; and (3)
determination of sex assigned at birth for patients included in the
cohort. After internal validation, in which a committee manually
reviewed the key text-strings, they found that key text-strings only,
diagnosis codes only, and both diagnosis codes and key text-strings
yielded positive predictive values (PPV) of 45%, 56%, and 100%,
respectively. A similar study by Quinn and colleagues used Kaiser
Permanente’s Georgia and California EHR systems to identify potential TG
individuals (n=6456) and build the Study of Transition, Outcomes and
Gender (STRONG) cohort.14 The study used the same 3-step algorithm and
published its full list of key text-strings. In this study, only 10% of
patients were identified from diagnosis codes alone, while 61% were
identified from both diagnosis codes and key text-strings. The PPVs for
key text-strings only, diagnosis codes only, and both were 26%, 54%, and
98%, respectively.
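The PPVs reported above are, in each case, the share of algorithm-flagged candidates that manual review confirmed as TG, computed separately for each identification route. The following is a minimal sketch of that calculation, not either study's actual code. The record fields (`has_dx_code`, `has_text_string`, `confirmed`) are hypothetical names, and the toy records are invented purely for illustration.

```python
from collections import defaultdict


def route(record):
    """Label how a candidate was flagged: codes only, text-strings only, or both."""
    if record["has_dx_code"] and record["has_text_string"]:
        return "both"
    if record["has_dx_code"]:
        return "dx_only"
    if record["has_text_string"]:
        return "text_only"
    return "none"


def ppv_by_route(records):
    """PPV per route = review-confirmed candidates / all candidates flagged by that route."""
    flagged = defaultdict(int)
    confirmed = defaultdict(int)
    for r in records:
        key = route(r)
        if key == "none":
            continue  # not flagged by the algorithm, so not part of any PPV
        flagged[key] += 1
        confirmed[key] += int(r["confirmed"])
    return {k: confirmed[k] / flagged[k] for k in flagged}


# Toy records (assumed, not study data)
toy_records = [
    {"has_dx_code": True, "has_text_string": True, "confirmed": True},
    {"has_dx_code": True, "has_text_string": False, "confirmed": True},
    {"has_dx_code": True, "has_text_string": False, "confirmed": False},
    {"has_dx_code": False, "has_text_string": True, "confirmed": False},
]

print(ppv_by_route(toy_records))
```

Stratifying by route, rather than reporting a single pooled PPV, is what allows the two studies to show that combining diagnosis codes with key text-strings outperforms either source alone.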
Gerth and colleagues’ study used the STRONG cohort to assess
agreement between medical records and a self-reported
survey.15 The survey used the recommended self-report method for gender
identity, which asks for both sex assigned at birth and current gender
identity.5,27 They distributed the survey to a subset of cohort members
in order to confirm TG status (transmasculine or transfeminine) that had
been assigned from records of gender-affirming treatment (e.g.,
testosterone or estrogen hormone therapy) and surgery (e.g., chest or
genital reconstruction surgery) received through Kaiser Permanente. They
found high agreement between self-reported gender identity and
gender-affirming treatment records, with a sensitivity of 99% and a
specificity of 99%.15
Guo et al. built upon Quinn and colleagues’ work to apply a CP within
the University of Florida Health integrated data repository, which
included the Epic EHR system from 2012 to 2019.6,14 They considered
gender identity information, ICD-9 and ICD-10 diagnosis codes, Current
Procedural Terminology (CPT) codes, and key text-strings relevant to TG
status in clinician notes as candidate components, in order to find the
best-performing CP for their data. The authors validated their CPs
through manual chart review of selected samples, then identified
subgroups, using natal sex assignment to confirm transmasculine or
transfeminine gender identity. Guo and colleagues found 19,600 potential
TG patients. Their best-performing CP across structured and unstructured
data classified a patient as TG when the patient had a recorded TG
gender identity, or had at least one relevant diagnosis code and at
least one key text-string relevant to TG status; this CP achieved an
F1-score of 0.954.6
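The best-performing rule described above is a simple Boolean combination, and the F1-score is the harmonic mean of precision (PPV) and recall (sensitivity). The sketch below illustrates both under stated assumptions: the function and parameter names are hypothetical, and the counts in the usage lines are invented, not taken from Guo and colleagues' validation.

```python
def meets_best_cp(has_tg_gender_identity, n_dx_codes, n_text_strings):
    """Rule as described in the text: a recorded TG gender identity, OR
    at least one relevant diagnosis code AND at least one key text-string."""
    return bool(has_tg_gender_identity or (n_dx_codes >= 1 and n_text_strings >= 1))


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Usage with invented inputs
print(meets_best_cp(False, n_dx_codes=1, n_text_strings=0))  # code alone is not enough
print(meets_best_cp(False, n_dx_codes=1, n_text_strings=2))  # code plus text-string qualifies
print(f1_score(tp=8, fp=2, fn=2))
```

Because F1 penalizes both false positives and false negatives, it is a reasonable single number for comparing candidate CPs that trade precision against recall.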
Foer and colleagues’ retrospective chart review used Epic data from two
primary academic teaching institutions in Boston, MA from 2015 to 2019
to identify 13,424 potential TG patients.20 They used key text-strings
in clinician notes, F64 ICD-10 diagnosis codes, and gender identity
field entries. Manual chart review of a subset served as the gold
standard for validating patient classification. A legal sex field was
available for all patients (100%), sex assigned at birth for 48.7%, and
a completed gender identity field for 48%. They found 15.7% of TG
patients through both diagnosis codes and key text-strings, 89% through
key text-strings alone, 14% through a gender identity field, 1.2%
through ICD diagnosis codes, and 5.1% through a TG status listing. After
validation via chart review of a subset of 324 patients, they confirmed
8% of patients as TG. Twenty-four patients identified through gender
fields alone were misclassified as TG; chart review showed that they
were cisgender. However, specificity was high: when the algorithm was
applied to a random set of patients, none were found to be TG. In this
study, key text-strings and diagnosis codes were more sensitive than
gender-related fields for identifying TG patients.20
Blosnich and colleagues applied a CP of ICD-9 and ICD-10 diagnosis codes
relevant to TG status to identify 7560 TG patients through the US
Department of Veterans Affairs Corporate Data Warehouse from 2000 to
2016.16 Their validation method used a search
algorithm of clinical text notes to find key text-strings related to TG
status. Their search algorithm reached a sensitivity of 89.30% and a
specificity of 99.95%. As in Roblin and colleagues’ study, false
positives arose from key text-strings in notes that discussed TG
relatives or friends of the patient.13 They were also able to identify
false negatives through key text-strings for 1.1% of
patients.16
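Sensitivity and specificity, as reported here and in the Gerth study, come from cross-tabulating the algorithm's classification against a reference standard. A minimal sketch follows; the pair layout `(algorithm_flag, reference_flag)` and the toy counts are assumptions made for illustration and do not reproduce any study's data.

```python
def confusion_counts(pairs):
    """Tally (algorithm_flag, reference_flag) boolean pairs into TP, FP, FN, TN."""
    tp = fp = fn = tn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(pairs):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, fn, tn = confusion_counts(pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (assumed): 10 reference-positive and 20 reference-negative patients
toy_pairs = (
    [(True, True)] * 9      # correctly flagged
    + [(False, True)] * 1   # missed (false negative)
    + [(False, False)] * 19 # correctly not flagged
    + [(True, False)] * 1   # e.g. a note about a TG relative (false positive)
)

print(sensitivity_specificity(toy_pairs))
```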
Wolfe and colleagues used EHR data from the Veterans Health
Administration (VHA) from 2006 to 2018 to create their cohort of TG
veterans (n=10,769).21 Their CP included: 1) one or more gender
identity disorder diagnosis codes in outpatient or inpatient data during
the study period; 2) a diagnosis code for non-specified endocrine
disorder; 3) a change in the sex marker field lasting at least 1 year,
to reflect stability; and 4) a sex hormone prescription discordant with
sex. They excluded patients with specific non-diabetes endocrine codes,
such as adrenal or thyroid disease, and patients with prostate cancer,
and they required minimum dosage levels for hormones. They used a
hierarchical strategy that prioritized diagnosis codes or hormones, then
non-specific endocrine disorder with a hormone prescription, then
endocrine disorder with a change in sex marker, then hormone therapy
with a change in sex marker, and finally hormone prescriptions only, an
approach very similar to that of Jasuja et al.19 They validated the
algorithm by chart review of a random sample of veterans from each of
the 5 groups. Wolfe and colleagues found that a gender identity disorder
diagnosis code had the highest positive predictive value (83%), compared
with 2% among veterans without a gender identity disorder code, and
concluded that gender identity disorder diagnosis codes were the most
reliable approach for identification of TG patients in the
VHA.21
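A hierarchical strategy of this kind assigns each patient to the first (highest-priority) group whose criteria are met, so that the groups are mutually exclusive and can be sampled separately for chart review. The sketch below is one reading of the five-group ordering described above, not Wolfe and colleagues' actual implementation. The field names are hypothetical, the first tier is assumed to mean a gender identity disorder diagnosis code, and the exclusion criteria and minimum dosage thresholds are deliberately left out.

```python
def assign_group(patient):
    """Return the first matching tier (1 = highest priority), or None if no tier matches.

    `patient` is a dict of booleans with hypothetical keys:
      gid_dx_code, endocrine_nos_code, discordant_hormone_rx, stable_sex_marker_change
    """
    if patient["gid_dx_code"]:
        return 1  # gender identity disorder diagnosis code
    if patient["endocrine_nos_code"] and patient["discordant_hormone_rx"]:
        return 2  # non-specific endocrine disorder + hormone prescription
    if patient["endocrine_nos_code"] and patient["stable_sex_marker_change"]:
        return 3  # endocrine disorder + change in sex marker
    if patient["discordant_hormone_rx"] and patient["stable_sex_marker_change"]:
        return 4  # hormone therapy + change in sex marker
    if patient["discordant_hormone_rx"]:
        return 5  # hormone prescription only
    return None


# Usage with an invented patient record
example = {
    "gid_dx_code": False,
    "endocrine_nos_code": True,
    "discordant_hormone_rx": True,
    "stable_sex_marker_change": True,
}
print(assign_group(example))  # matches tier 2 before tiers 3 and 4 are considered
```

Ordering the checks from most to least specific evidence means a patient who satisfies several criteria is counted once, in the group expected to have the highest PPV.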
Alpert and colleagues’ cross-sectional study used CancerLinQ data from
the American Society of Clinical Oncology (ASCO) Learning HealthCare
System to identify TG cancer patients (n=557).22 Their
CP had three categories: (category 1) a diagnosis related to gender
identity (transsexualism or gender identity disorder); (category 2)
recorded gender male with at least one diagnosis code indicating cancer
of the ovaries, cervix, vulva, vagina, uterus, placenta, or other
related organs; and/or (category 3) recorded gender female with at least
one diagnosis code indicating cancer of the prostate, testes, penis, or
other related organs. In total, 557 individuals matched their inclusion
criteria within CancerLinQ data: 42 in category 1, 316 in category 2,
and 199 in category 3. Of those with an ICD-9 or ICD-10 diagnosis code
relevant to TG status, 76% were confirmed to be TG, while only 2% and 3%
of those in categories 2 and 3, respectively, were
confirmed.22 There was very low specificity for
categories 2 and 3, as many of the patients identified ended up being
false positives (i.e., cisgender).
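The three categories above can be expressed as independent checks, since the "and/or" wording allows a patient to match more than one. The following is a rough sketch under stated assumptions: the parameter names and the plain-text site labels are hypothetical stand-ins for the diagnosis codes the authors used, and the "other related organs" mentioned in the text are not enumerated here.

```python
# Site labels are illustrative placeholders, not actual diagnosis codes
CATEGORY_2_SITES = frozenset({"ovary", "cervix", "vulva", "vagina", "uterus", "placenta"})
CATEGORY_3_SITES = frozenset({"prostate", "testis", "penis"})


def matched_categories(recorded_gender, has_gender_identity_dx, cancer_sites):
    """Return the set of categories (1, 2, 3) that a patient record satisfies."""
    matched = set()
    sites = set(cancer_sites)
    if has_gender_identity_dx:
        matched.add(1)  # diagnosis related to gender identity
    if recorded_gender == "male" and sites & CATEGORY_2_SITES:
        matched.add(2)  # recorded male with a category 2 cancer site
    if recorded_gender == "female" and sites & CATEGORY_3_SITES:
        matched.add(3)  # recorded female with a category 3 cancer site
    return matched


# Usage with invented records
print(matched_categories("male", False, ["ovary"]))
print(matched_categories("female", True, ["prostate"]))
print(matched_categories("female", False, ["ovary"]))
```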
Chyten-Brennan and colleagues created a CP to identify TG patients
(n=213) among people living with HIV through the Montefiore Health
System in New York City from 1997 to 2017.23 Their CP
contained: 1) ICD-9 or ICD-10 diagnosis codes; 2) gender-affirming
medications; 3) key text-strings; and 4) gender identity variables
(e.g., yes/no field for TG). Manual chart review to validate TG
status confirmed 84% of patients (PPV). Only 13.5%
were identified through ICD-9 or ICD-10 diagnosis codes alone, while
60% were found from multiple categories. They were not able to confirm
the TG status of 22% of those found only through ICD-9 or ICD-10
diagnosis codes. However, they were able to accurately identify 15% of
TG patients through HIV funding-related gender identity data, which are
not found in other EHR-based algorithms. Without these data, they would
have differentially misclassified a large portion of TG people, which
would lead to biased estimates.
EHR data overcame the key validation limitation of claims data by
allowing manual chart review, as well as access to self-reported gender
identity when those data were collected and available.
Similar to claims-based CPs, the strongest CPs in EHR data contained
diagnosis codes accompanied by other information, which for EHR data was
key text-strings relevant to TG status. When key text-strings were
available, the PPV of the CP had the potential to reach
100%.13 In terms of algorithm components to identify
TG patients, Wolfe et al. and Alpert et al. found the
highest proportion of TG patients through diagnosis codes
alone.21,22 However, Chyten-Brennan and colleagues
were only able to identify 13.5% of TG patients through diagnosis
codes, and Foer and colleagues found that key text-strings identified
almost 90% of patients.20,23 Additionally,
Chyten-Brennan and colleagues’ access to self-reported gender identity
data added a large number of TG patients who would otherwise have been
classified as cisgender through their medical records alone.