Data Labeling and Modeling

This page summarizes our data labeling and modeling initiatives, which focus on extracting structured information from unstructured clinical data.

PHI Labeling

Our current focus is on labeling Protected Health Information (PHI) in clinical notes. This labeling work supports:

  • Developing robust de-identification systems
  • Ensuring patient privacy compliance
  • Creating high-quality training data for machine learning models
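As an illustration of how labeled PHI can feed a de-identification step, the sketch below stores each annotation as a character-offset span and replaces the labeled text with a category placeholder. The `PhiSpan` class, the `redact` function, the label names, and the sample note are all hypothetical and made up for this example; they do not describe the project's actual annotation schema or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiSpan:
    """One labeled PHI mention (hypothetical schema for illustration)."""
    start: int  # inclusive character offset into the note
    end: int    # exclusive character offset into the note
    label: str  # assumed category name, e.g. "NAME" or "DATE"


def redact(text: str, spans: list[PhiSpan]) -> str:
    """Replace each labeled span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:span.start])  # text before the PHI mention
        pieces.append(f"[{span.label}]")        # placeholder for the mention
        cursor = span.end
    pieces.append(text[cursor:])                # remaining text after the last span
    return "".join(pieces)


# Synthetic example note; not real patient data.
note = "Patient John Smith was seen on 03/14/2024."
spans = [PhiSpan(8, 18, "NAME"), PhiSpan(31, 41, "DATE")]
redacted = redact(note, spans)
print(redacted)
```

Character offsets are used here because they keep the annotation independent of any particular tokenizer; a token-level tagging scheme would be an equally reasonable choice.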

AI for Automatic Synoptic Reporting

Our team is using AI models to automatically extract structured information from pathology reports. Key aspects of this work include:

  • Identifying key data elements from College of American Pathologists (CAP) cancer protocols
  • Converting unstructured free-text reports into standardized synoptic formats
  • Validating model accuracy against human expert annotation

This initiative aims to improve data standardization, reduce manual extraction effort, and improve the completeness of cancer registry data.
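One simple way to validate extracted fields against expert annotation is per-field exact-match accuracy. The sketch below is a minimal example under stated assumptions: the field names, the `normalize` and `field_accuracy` helpers, and the sample records are hypothetical and invented for illustration; they are not the team's actual data elements, metrics, or results.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase and trim a value; treat a missing value as an empty string."""
    return "" if value is None else value.strip().lower()


def field_accuracy(predictions: list[dict], references: list[dict],
                   fields: list[str]) -> dict[str, float]:
    """Fraction of reports where the model value matches the expert value, per field."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one report is required")
    correct = {field: 0 for field in fields}
    for pred, ref in zip(predictions, references):
        for field in fields:
            if normalize(pred.get(field)) == normalize(ref.get(field)):
                correct[field] += 1
    return {field: correct[field] / len(references) for field in fields}


# Hypothetical field names and made-up records, for illustration only.
fields = ["histologic_type", "histologic_grade", "margin_status"]
expert = [
    {"histologic_type": "Invasive ductal carcinoma", "histologic_grade": "G2",
     "margin_status": "Negative"},
    {"histologic_type": "Invasive lobular carcinoma", "histologic_grade": "G1",
     "margin_status": "Positive"},
]
model = [
    {"histologic_type": "invasive ductal carcinoma ", "histologic_grade": "G2",
     "margin_status": "Negative"},
    {"histologic_type": "Invasive lobular carcinoma", "histologic_grade": "G2",
     "margin_status": None},
]
scores = field_accuracy(model, expert, fields)
print(scores)
```

Exact match after light normalization is a deliberately strict baseline. Fields with free-text values may need a more forgiving comparison, and a missing model value counts as an error here.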

Project Resources

February 2025 Release

May 2025 Release

August 2025 Release

November 2025 Release