PHI Labeling Metrics

This section analyzes the population characteristics and note attributes of the labeled sample used in our study. We compare these characteristics with the broader STARR-OMOP population to provide context and highlight any significant differences. The analysis covers age distribution, note length distribution, and note types. The PHI (Protected Health Information) distribution is shown for the labeled sample only.

Age Distribution

Below is a histogram of the age distribution of the population represented in the labeled notes. For comparison, the same characteristics are shown for the entire STARR-OMOP population. The patient's age is calculated at the time of extraction, not at the time the note was written.
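
The difference between age at extraction and age at note time can be illustrated with a minimal sketch. The helper names `age_in_years` and `age_group` and all dates are hypothetical and are not taken from the study's pipeline; only the group boundaries follow the age groups reported in this section.

```python
from datetime import date

# Upper bounds (inclusive) of the age groups reported in this section.
AGE_GROUPS = [(17, "0-17"), (44, "18-44"), (64, "45-64")]


def age_in_years(birth_date: date, reference_date: date) -> int:
    """Return whole years elapsed between birth_date and reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_group(age: int) -> str:
    """Map an age in whole years to one of the four reported groups."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_GROUPS:
        if age <= upper:
            return label
    return "65+"


# Hypothetical patient and dates, for illustration only.
birth = date(1958, 6, 15)
note_written = date(2015, 3, 1)
extraction = date(2024, 1, 10)

age_at_note = age_in_years(birth, note_written)
age_at_extraction = age_in_years(birth, extraction)

print(age_group(age_at_note), age_group(age_at_extraction))
```

With these made-up dates the same patient falls into different age groups depending on the reference date, which is why the convention matters when reading the age figures.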

Demographic Groups

This section presents a detailed analysis of the demographic characteristics of the labeled sample used in our study. We compare these characteristics with the broader STARR-OMOP population to provide context and highlight any significant differences. The analysis covers age group, sex, race, and ethnicity. This comparison helps to assess how representative the labeled sample is and to identify the characteristics of the biased sample.

Age Group   Labeled Sample (n)   Labeled Sample (%)   STARR-OMOP (n)   STARR-OMOP (%)
0-17        59                   12.55                24,709,539       10.82
18-44       112                  23.83                51,841,502       22.70
45-64       121                  25.74                53,873,202       23.59
65+         178                  37.87                97,932,528       42.89
Note: 'n' represents the number of notes, not the number of patients.

Sex                   Labeled Sample (n)   Labeled Sample (%)   STARR-OMOP (n)   STARR-OMOP (%)
FEMALE                248                  52.77                128,810,665      56.41
MALE                  220                  46.81                99,506,081       43.57
No matching concept   2                    0.43                 NA               NA
Unknown               NA                   NA                   40,025           0.02
Note: 'n' represents the number of notes, not the number of patients.

Race                                        Labeled Sample (n)   Labeled Sample (%)   STARR-OMOP (n)   STARR-OMOP (%)
American Indian or Alaska Native            23                   4.89                 985,689          0.43
Asian                                       77                   16.38                40,808,962       17.87
Black or African American                   62                   13.19                11,182,306       4.90
Native Hawaiian or Other Pacific Islander   35                   7.45                 2,854,092        1.25
Unknown                                     161                  34.26                51,605,777       22.60
White                                       112                  23.83                120,919,945      52.95
Note: 'n' represents the number of notes, not the number of patients.

Ethnicity                Labeled Sample (n)   Labeled Sample (%)   STARR-OMOP (n)   STARR-OMOP (%)
Hispanic or Latino       151                  32.13                39,152,701       17.15
Not Hispanic or Latino   242                  51.49                173,951,827      76.18
Unknown                  77                   16.38                15,252,243       6.68
Note: 'n' represents the number of notes, not the number of patients.
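
One way to read these tables is to recompute each group's share of notes and divide the labeled-sample share by the STARR-OMOP share. The sketch below does this for the note counts in the Race table above; the ratio is an illustrative convenience and not a metric defined by the study, and the helper name `shares` is hypothetical.

```python
# Note counts copied from the Race table above: (labeled sample, STARR-OMOP).
race_counts = {
    "American Indian or Alaska Native": (23, 985_689),
    "Asian": (77, 40_808_962),
    "Black or African American": (62, 11_182_306),
    "Native Hawaiian or Other Pacific Islander": (35, 2_854_092),
    "Unknown": (161, 51_605_777),
    "White": (112, 120_919_945),
}


def shares(counts):
    """Convert a list of counts into percentages of their total."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


labels = list(race_counts)
sample_pct = shares([race_counts[k][0] for k in labels])
population_pct = shares([race_counts[k][1] for k in labels])

# Ratio > 1: the group is over-represented in the labeled sample.
# Ratio < 1: the group is under-represented in the labeled sample.
ratios = {k: s / p for k, s, p in zip(labels, sample_pct, population_pct)}

for k, s, p in zip(labels, sample_pct, population_pct):
    print(f"{k:42s} sample {s:6.2f}%  population {p:6.2f}%  ratio {ratios[k]:6.2f}")
```

The same calculation applies unchanged to the age, sex, and ethnicity tables by swapping in their counts.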

Note Length Distribution

The table below shows the note length distribution for the clinical notes sampled for the labeling task, alongside the distribution for STARR-OMOP for comparison. Note length is measured in characters.

Range (characters)   STARR-OMOP (n)   STARR-OMOP (%)   Labeled Sample (n)   Labeled Sample (%)
0 - 500              96,583,879       42.30            60                   12.77
500 - 2500           71,191,327       31.18            407                  86.60
2500 - 10000         49,905,133       21.85            3                    0.64
10000 - 50000        10,631,213       4.66             NA                   NA
50000 - 100000       41,626           0.02             NA                   NA
100000 - 1150000     3,593            0.00             NA                   NA
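
Binning notes by character length could be sketched as follows. The bin edges come from the Range column of the table above; the source does not say which bin a boundary value such as 500 belongs to, so the left-closed, right-open convention used here is an assumption, as are the helper name `length_bin` and the made-up notes.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges in characters, taken from the Range column of the table above.
EDGES = [0, 500, 2_500, 10_000, 50_000, 100_000, 1_150_000]


def length_bin(text: str) -> str:
    """Return the label of the length bin that contains len(text)."""
    n = len(text)  # counts Unicode code points, not bytes
    # Index of the first edge strictly greater than n; lengths at or beyond
    # the last edge are clamped into the final bin.
    i = min(bisect_right(EDGES, n), len(EDGES) - 1)
    return f"{EDGES[i - 1]} - {EDGES[i]}"


# Made-up notes, for illustration only.
notes = ["a" * 120, "b" * 800, "c" * 2_600, "d" * 499]

counts = Counter(length_bin(note) for note in notes)
total = sum(counts.values())
for label, n in counts.items():
    print(f"{label:18s} n={n}  {100.0 * n / total:.2f}%")
```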

PHI Distribution

In this section, we present an analysis of the distribution of PHI entities within the labeled sample. The analysis includes the frequency of different PHI entities and their distribution across documents. This helps to understand the prevalence of various PHI types in the dataset and provides insights into the labeling process. The analysis is divided into two parts: the total count of each PHI entity type and the distribution of PHI entities per document. It is important to clarify that DOCTOR and HOSPITAL are not classified as PHI in the Safe-Harbor definition. However, being able to identify and obfuscate such information may be important for potential data sharing use cases.
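
The two parts of this analysis can be sketched with a few lines of counting code. The annotation format, the document ids, and the entity labels other than DOCTOR and HOSPITAL are hypothetical placeholders; the sketch does not reproduce the study's data or tooling.

```python
from collections import Counter

# Hypothetical annotations: (document id, PHI entity type) pairs.
annotations = [
    ("note_1", "NAME"), ("note_1", "DATE"), ("note_1", "DATE"),
    ("note_2", "DOCTOR"), ("note_2", "HOSPITAL"),
    ("note_3", "DATE"),
]
# All labeled documents, including one (note_4) with no PHI annotations.
all_doc_ids = ["note_1", "note_2", "note_3", "note_4"]

# Part 1: total count of each PHI entity type across the sample.
total_by_type = Counter(entity for _, entity in annotations)

# Part 2: number of PHI entities per document, then a histogram of those
# per-document counts (documents without annotations count as zero).
entities_per_doc = Counter(doc for doc, _ in annotations)
per_doc_hist = Counter(entities_per_doc.get(doc, 0) for doc in all_doc_ids)

# Number of distinct documents in which each entity type appears at least once.
docs_by_type = Counter(entity for _, entity in set(annotations))

print(total_by_type.most_common())
print(sorted(per_doc_hist.items()))
print(docs_by_type.most_common())
```

Keeping DOCTOR and HOSPITAL as separate labels makes it easy to report totals with or without the entity types that fall outside the Safe Harbor definition.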

Note Types

This section analyzes the types of clinical notes included in the labeled sample used in our study. We compare these note types with the broader STARR-OMOP population to provide context and highlight any significant differences. This comparison helps to assess how representative the labeled sample is and to identify how the biased sample differs in the types of notes it includes. The analysis covers the distribution of note types and their frequencies in both the labeled sample and the STARR-OMOP population. Several of the types in the distribution below contain radiology and pathology reports. Among them are the notes labeled as procedures, which combine clinical results produced by procedures.

Distribution of Note Types in the Dataset
Note Type   Labeled Sample (n)   Labeled Sample (%)   STARR-OMOP (n)   STARR-OMOP (%)
care plan note 17 3.62 4,139,518 1.81
letter 17 3.62 6,889,804 3.02
procedures 16 3.40 2,542,658 1.11
discharge instructions 14 2.98 1,983,103 0.87
anesthesia postprocedure evaluation 13 2.77 915,373 0.40
care plan 13 2.77 145,491 0.06
nursing note 13 2.77 1,605,802 0.70
progress notes 13 2.77 54,086,167 23.68
imaging 12 2.55 27,170,511 11.90
patient instructions 12 2.55 9,786,849 4.29
assessment & plan note 11 2.34 2,240,039 0.98
lab 11 2.34 24,825,380 10.87
anesthesia procedure notes 10 2.13 762,798 0.33
microbiology culture 10 2.13 1,097,747 0.48
er notes 9 1.91 43,442 0.02
clinic support note 8 1.70 3,041,493 1.33
ecg 8 1.70 527,231 0.23
h&p 8 1.70 1,787,536 0.78
lab panel 8 1.70 118,246 0.05
pathology and cytology 8 1.70 395,258 0.17
rtf letter 8 1.70 3,549,891 1.55
telephone encounter 8 1.70 38,675,938 16.94
microbiology 7 1.49 812,474 0.36
unmapped external results 7 1.49 900,521 0.39
advance care planning 6 1.28 97,282 0.04
consults 6 1.28 3,313,915 1.45
ed notes 6 1.28 6,496,913 2.85
ed provider notes 6 1.28 1,862,895 0.82
pathology 6 1.28 1,812,971 0.79
sign out note 6 1.28 2,080,313 0.91
unmapped external note 6 1.28 277,058 0.12
code blue/rapid response team note 5 1.06 8,238 0.00
documentation clarification 5 1.06 179,822 0.08
h&p interval 5 1.06 244,086 0.11
interval h&p note 5 1.06 200,704 0.09
pft 5 1.06 172,001 0.08
ancillary 4 0.85 1,176,069 0.52
anesthesia post-op follow-up note 4 0.85 50,193 0.02
consult follow-up 4 0.85 1,159,388 0.51
lactation note 4 0.85 274,944 0.12
operative report 4 0.85 969,031 0.42
physical therapy 4 0.85 184,380 0.08
point of care testing 4 0.85 368,205 0.16
pr charge 4 0.85 617,387 0.27
rehab daily note 4 0.85 516,657 0.23
transplant summary 4 0.85 1,282,579 0.56
consult 3 0.64 533,155 0.23
discharge summary 3 0.64 1,040,761 0.46
hiv lab non-restricted 3 0.64 25,412 0.01
hospice 3 0.64 4,221 0.00
manual entry echo 3 0.64 11,058 0.00
miscellaneous 3 0.64 6,922 0.00
procedure note 3 0.64 241,589 0.11
accountable care division cm note 2 0.43 143,998 0.06
anesthesia post-op 2 0.43 88,070 0.04
anesthesia preprocedure evaluation 2 0.43 1,148,683 0.50
blood bank 2 0.43 21,218 0.01
cardiac angio 2 0.43 6,686 0.00
committee review 2 0.43 21,630 0.01
discharge instr - other orders 2 0.43 37,133 0.02
ed triage notes 2 0.43 13,480 0.01
group note 2 0.43 66,342 0.03
h&p (view-only) 2 0.43 171,763 0.08
imaging non-reportable 2 0.43 1,584,644 0.69
lab only 2 0.43 1,341 0.00
lab only - beaker 2 0.43 45,177 0.02
neurology 2 0.43 112,384 0.05
nsg picc refer 2 0.43 127,275 0.06
occupational therapy 2 0.43 193,917 0.08
operative note 2 0.43 521,214 0.23
outpatient letter 2 0.43 307,086 0.13
plan of care 2 0.43 646,309 0.28
result encounter note 2 0.43 37,539 0.02
rn transfer note 2 0.43 238,578 0.10
transfer of care summary 2 0.43 21,769 0.01
addendum note 1 0.21 2,098,264 0.92
anesthesia followup note 1 0.21 48,523 0.02
cardiac monitors 1 0.21 732 0.00
cardiac services 1 0.21 40,949 0.02
care conference 1 0.21 29,554 0.01
case communication 1 0.21 8,315 0.00
cath angio 1 0.21 42,514 0.02
dermatology 1 0.21 5,511 0.00
discharge instr - activity 1 0.21 15,477 0.01
discharge instr - radiology 1 0.21 291 0.00
echo 1 0.21 642,794 0.28
echocardiography 1 0.21 287,578 0.13
electrophysiology 1 0.21 26,980 0.01
hiv lab restricted 1 0.21 162,940 0.07
hospital course 1 0.21 18,606 0.01
immediate post op note 1 0.21 83,068 0.04
ip letter 1 0.21 301,987 0.13
ir procedure notes 1 0.21 13,264 0.01
manual entry imaging 1 0.21 7,752 0.00
manual entry lab 1 0.21 31,434 0.01
medication review 1 0.21 20,691 0.01
nursing 1 0.21 19,401 0.01
nursing referral 1 0.21 205,098 0.09
ob 1 0.21 41,869 0.02
or postop 1 0.21 4,351 0.00
or surgeon 1 0.21 14,425 0.01
patient care conference 1 0.21 25,671 0.01
pharmacy medication review 1 0.21 50,057 0.02
pre-procedure instructions 1 0.21 1,953 0.00
radiation oncology treatment summary 1 0.21 25,144 0.01
reference labs 1 0.21 503,157 0.22
research note 1 0.21 4,243 0.00
respiratory care 1 0.21 10,302 0.00
transfer center follow up clinical screen 1 0.21 9,203 0.00
transfer center initial clinical screen 1 0.21 20,031 0.01
vascular ultrasound 1 0.21 78,447 0.03
wound care 1 0.21 46,420 0.02