PHI Labeling Process

Introduction

In this report, we outline the methodology used to create a dataset of clinical notes with annotated protected health information (PHI). Specifically, we describe the process for sampling clinical notes and performing annotations. This process was designed to achieve the following objectives:

  • Ensure demographic representation across diverse patient populations
  • Maintain annotation consistency across multiple annotators
  • Balance efficiency and quality in the annotation process
  • Manage the resource-intensive nature of medical text annotation

Figure 1 provides an overview of the labeling process.

Figure 1: Labeling Process

Data Sampling Methodology

The data sampling was performed in two steps. First, we conducted a random sampling across the entire set of notes. Then, we applied diversity sampling using a set cover algorithm. The details of these procedures are outlined below.

Note Characteristics Analysis

The sampling process incorporates textual characteristics to ensure diversity in content:

  • Note titles are categorized to represent different types of clinical documentation.
  • Note length distribution is analyzed using a binning approach (a code sketch follows this list):
    • Calculation of the median length per note title
    • Computation of the median of medians
    • Selection of notes between the 25th and 75th percentiles
    • Quantization into four distinct bins for balanced representation
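
A minimal sketch of this binning step is shown below, assuming notes are held in a pandas DataFrame with hypothetical title and text columns. The report leaves the exact role of the median of medians underspecified, so it is returned here only as a reference statistic, and the percentile selection is assumed to run over all note lengths.

```python
import pandas as pd

def assign_length_bins(notes: pd.DataFrame) -> tuple[pd.DataFrame, float]:
    """Bin notes by length following the median-of-medians approach.

    Assumes `notes` has a `title` column (note type) and a `text`
    column; both column names are placeholders.
    """
    notes = notes.copy()
    notes["length"] = notes["text"].str.len()

    # Median length per note title, then the median of those medians,
    # used as a robust reference statistic across note types.
    per_title_median = notes.groupby("title")["length"].median()
    median_of_medians = per_title_median.median()

    # Keep notes whose length falls between the 25th and 75th
    # percentiles (assumed here to be percentiles over all lengths).
    q25, q75 = notes["length"].quantile([0.25, 0.75])
    kept = notes[(notes["length"] >= q25) & (notes["length"] <= q75)].copy()

    # Quantize the retained notes into four equal-frequency length bins.
    kept["length_bin"] = pd.qcut(kept["length"], 4, labels=False,
                                 duplicates="drop")
    return kept, median_of_medians
```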

Set Cover Algorithm Implementation

To optimize sampling while maintaining representation across all demographic and textual characteristics, we implemented a modified set cover algorithm (a greedy sketch follows this list):

  • Initial random sampling of 1M notes to create a computationally manageable subset
  • Application of a set cover algorithm to ensure representation across all defined partitions
  • Optimization for maximum coverage with minimal redundancy
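
Below is a minimal greedy sketch of this step, for illustration only. The mapping from notes to partition labels (demographic strata, note title, length bin) is an assumed data structure, and the actual implementation and tie-breaking may differ.

```python
from typing import Dict, Hashable, List, Set

def greedy_set_cover(note_partitions: Dict[str, Set[Hashable]]) -> List[str]:
    """Greedily pick note IDs until every partition value is covered.

    `note_partitions` maps a note ID to the set of partition labels
    the note belongs to (e.g. demographic stratum, note title,
    length bin).
    """
    uncovered: Set[Hashable] = set().union(*note_partitions.values())
    selected: List[str] = []

    while uncovered:
        # Pick the note that covers the most still-uncovered partitions.
        best_id = max(
            note_partitions,
            key=lambda nid: len(note_partitions[nid] & uncovered),
        )
        gained = note_partitions[best_id] & uncovered
        if not gained:  # no remaining note covers anything new
            break
        selected.append(best_id)
        uncovered -= gained
    return selected
```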

Annotation Guidelines Development

Guideline Structure

Comprehensive annotation guidelines were developed through an iterative process. These guidelines include:

  • Clear definitions of annotation targets
  • Explicit inclusion and exclusion criteria
  • Representative examples
  • Decision trees for ambiguous cases
  • Standard operating procedures for the annotation workflow


Guideline Validation

The guidelines underwent multiple validation cycles, including:

  • Pilot testing with experienced annotators
  • Refinement based on inter-annotator agreement analysis
  • Integration of edge cases identified during initial annotation phases

Pre-labeling and Annotation Distribution Process

LLM Pre-Annotation Implementation

The pre-labeling process used GPT-4 via the Azure OpenAI API (version 2023-05-15), incorporating the following (an illustrative call sketch follows this list):

  • Prompt engineering optimized using the I2B2 dataset
  • Integration of annotation guidelines into prompt design
  • Systematic validation of pre-labeled outputs
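
For illustration, a minimal call sketch using the openai Python client against an Azure OpenAI deployment is shown below; the deployment name, environment variables, and prompt contents are placeholders, not the actual prompts used in this project.

```python
import os
from openai import AzureOpenAI

# The API version matches the one stated in this report; the endpoint,
# key, and deployment name are placeholders for illustration.
client = AzureOpenAI(
    api_version="2023-05-15",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

GUIDELINES_EXCERPT = "..."  # annotation guidelines folded into the prompt

def prelabel(note_text: str) -> str:
    """Ask GPT-4 to propose candidate PHI spans for one clinical note."""
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (assumed)
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label PHI spans in clinical notes.\n"
                        + GUIDELINES_EXCERPT},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content
```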

Annotation Distribution

Notes were assigned to annotators following a structured protocol (a sketch of the assignment step follows this list):

  • Random assignment to ensure unbiased distribution
  • Dual annotator requirement for each note to enable quality assessment
  • Tracking system implementation to monitor annotation progress
  • Regular checkpoints for interim quality assessments
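
A minimal sketch of the dual-assignment step, assuming note IDs and annotator names as plain strings; the actual protocol may also balance workload across annotators, which is omitted here.

```python
import random
from typing import Dict, List, Tuple

def assign_notes(
    note_ids: List[str],
    annotators: List[str],
    seed: int = 0,
) -> Dict[str, Tuple[str, str]]:
    """Randomly assign each note to two distinct annotators."""
    rng = random.Random(seed)  # fixed seed for reproducible assignment
    assignments: Dict[str, Tuple[str, str]] = {}
    for note_id in note_ids:
        # sample() without replacement guarantees two distinct annotators
        first, second = rng.sample(annotators, 2)
        assignments[note_id] = (first, second)
    return assignments
```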

Quality Control Framework

The quality assessment framework employs a bilateral F1 measure, illustrated in the sketch following this list:

  • Primary F1 calculation:
    • Annotator A’s labels serve as the ground truth against Annotator B’s annotations
    • Annotator B’s labels serve as the ground truth against Annotator A’s annotations
    • The final F1 score is computed as the arithmetic mean of both directions
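
A minimal sketch of this computation, assuming exact-match comparison of (start, end, label) spans. Under exact matching the two directions yield the same value (precision and recall simply swap), but both are computed here as described, since a partial-overlap matching criterion would make them differ.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start_offset, end_offset, PHI_label)

def f1(truth: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1 treating `truth` as the ground truth."""
    if not truth and not pred:
        return 1.0
    matched = len(truth & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def bilateral_f1(spans_a: Set[Span], spans_b: Set[Span]) -> float:
    """Arithmetic mean of F1 in both directions (A as truth, B as truth)."""
    return (f1(spans_a, spans_b) + f1(spans_b, spans_a)) / 2
```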

Adjudication Process

All clinical notes without 100% agreement across annotated spans undergo an adjudication process, in which experienced annotators review and resolve the disagreements to ensure annotation quality. Only notes with 100% agreement, or notes that have been adjudicated, are included in the gold standard; a compact sketch of this selection rule follows.
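
This sketch uses assumed record fields (note_id, agreement, adjudicated); the actual data model may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedNote:
    note_id: str
    agreement: float   # bilateral F1 between the two annotators
    adjudicated: bool  # True once disagreements have been resolved

def select_gold(annotated: List[AnnotatedNote]) -> List[AnnotatedNote]:
    """Keep only fully agreed or adjudicated notes as the gold standard."""
    return [n for n in annotated if n.agreement == 1.0 or n.adjudicated]
```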