Data Dictionary

OMOP-CDM

This document provides a comprehensive overview of the OMOP Common Data Model (CDM) tables available in the dataset. The OMOP-CDM is a standardized data model designed to facilitate the analysis and sharing of healthcare data across different institutions and studies. By using a common structure and terminology, the OMOP-CDM enables researchers to perform large-scale observational research and generate real-world evidence.

In this data dictionary, you will find detailed descriptions of each table and its columns, including data types, requirements, and any specific operations applied to the data. This information is crucial for understanding the structure and content of the dataset, ensuring accurate and meaningful analysis.

Version History: The data lake has evolved from OMOP CDM 5.3.1 (used in February-August 2025 releases) to OMOP CDM 5.4.2 (November 2025 release). The November 2025 upgrade includes new tables, enhanced field linking capabilities, updated naming conventions, and vocabulary modernization from January 2023 to August 2025. For detailed information about CDM changes, see the OHDSI CDM 5.4 Changes.

The following OMOP-CDM 5.4.2 tables are available in the November 2025 release:

The following OMOP-CDM 5.3.1 tables were available in the February, May, and August 2025 releases:

NeuralFrame

The following NeuralFrame tables are available in the release of the dataset:

Phillips ISPM

The following tables are available in the initial release of the dataset:

STAMP Add-On

Epic Media Server Documents

HIPPO Benchmark

HIPPO stands for Human-in-the-loop Identification of PHI with Preserved Output. This internal, fully identified dataset consists of two tables:

  • Gold Standard Full Text: This table contains the clinical text from which the annotations were gathered. It also contains basic demographic information about the patients to whom the notes refer.

  • Gold Standard Spans: This table contains the annotated spans where PHI was identified, along with the corresponding labels and the note identifier to which the spans correspond.
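The two tables link through the note identifier: each annotated span records character offsets into the text of one note. A minimal sketch of resolving spans against their source notes is shown below; the table and column names (note_id, start, end, label) are assumptions for illustration and may differ from the actual HIPPO schemas.

```python
# Hypothetical schemas -- the actual HIPPO table/column names may differ.
notes = {
    101: "Seen by Dr. Smith on 01/02/2020.",
    102: "Patient lives in Palo Alto.",
}
spans = [
    {"note_id": 101, "start": 12, "end": 17, "label": "NAME"},
    {"note_id": 101, "start": 21, "end": 31, "label": "DATE"},
    {"note_id": 102, "start": 17, "end": 26, "label": "LOCATION"},
]

def resolve_spans(notes, spans):
    """Slice each annotated span out of its source note to recover the PHI surface form."""
    out = []
    for s in spans:
        text = notes[s["note_id"]]
        out.append((s["note_id"], s["label"], text[s["start"]:s["end"]]))
    return out

for note_id, label, surface in resolve_spans(notes, spans):
    print(note_id, label, surface)
```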

Lung Resection CAP Forms

This dataset is composed of two groups of tables and one cleaned dataset in QA format:

  • EPIC Clarity tables: The source EPIC Clarity tables required to assemble the information contained in the CAP forms as well as the pathology reports.
  • Processed tables: Two tables, cap_forms and shc_pathology_reports. The first contains the structured information from the CAP forms; the second contains the pathology reports, with each of their constituent elements separated out.
  • QA cleaned dataset: A QA dataset in which the questions, answers, and context are clearly defined. It is formatted as a HuggingFace dataset and can be loaded directly for training.

PHI Scrubbing Operations Definitions

In order to protect patient information, various PHI (Protected Health Information) scrubbing operations are applied to the dataset. These operations are designed to scrub sensitive data elements. Below are the definitions of the PHI scrubbing operations used in this dataset:

Each operation is listed below as PHI-Scrubbing Operation Name: Description.

  • Jitter: Jitter the value by an amount specific to the project and dataset column.
  • Del: Delete the values in this column.
  • Sub: Substitute with a value specific to the project and dataset column.
  • Whitelist: Only allow the values in the ‘allowed list’ table to pass through; otherwise replace with NULL.
  • Hash: Replace with a hash value based on a hashing function specific to the project and dataset column.
  • TiDE: Pass through TiDE, Stanford’s text PHI-scrubbing algorithm.
  • RedZip: Reduce the zipcode precision based on the Stanford University Privacy Office guidelines, Sec. 1.4.
  • Drop: Drop this column in the PHI-scrubbed dataset.
  • Stable Between Data Refreshes: Substitute with a stable identifier, specific to the project and dataset column, that persists between data refreshes. This may be an integer or alphanumeric value depending on other requirements of the table.
  • Not Stable Between Data Refreshes: Substitute with an identifier that is specific and self-consistent only within a single data refresh. This may be an integer or alphanumeric value depending on the requirements of the table.
  • None: Pass through as-is; no PHI-scrubbing operation is applied.
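As an illustration of the Hash operation, the sketch below derives a deterministic, irreversible token from a raw identifier. The salt stands in for the project- and column-specific hashing parameters, which are not published here; the actual hashing function and token format may differ.

```python
import hashlib

# Assumption: a project-specific salt, not the real value.
PROJECT_SALT = "example-project-salt"

def hash_value(value: str, salt: str = PROJECT_SALT) -> str:
    """Replace a raw identifier with a deterministic, irreversible hash token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

# The same input always maps to the same token within a project...
assert hash_value("MRN-1234567") == hash_value("MRN-1234567")
# ...while a different salt (e.g. another project) yields unlinkable tokens.
assert hash_value("MRN-1234567", "other-salt") != hash_value("MRN-1234567")
print(hash_value("MRN-1234567"))
```

The determinism is what allows hashed identifiers to serve as join keys within one dataset release without exposing the underlying value.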

DICOM PHI Scrubbing

For DICOM data, the PHI scrubbing module protects PII and sensitive information through a two-stage process. First, it uses regex patterns to identify and mask specific data types, including dates, times, email addresses, URLs, numeric identifiers, and hexadecimal sequences. Second, it tokenizes the text using configurable delimiters and numeric-to-alpha and alpha-to-numeric transitions, then replaces any token not found in a predefined allowlist with ‘X’ characters of equivalent length. The system preserves the original text structure by maintaining spacing and delimiters while systematically redacting content. Specialized regex patterns match a variety of date formats, ensuring that only approved vocabulary passes through unredacted while all other potentially sensitive information is obscured.

This redaction applies only to the series_description, study_description, and protocol_name attributes (tags).
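The two-stage process can be sketched as follows. The regex patterns and allowlist here are illustrative only; the production module's configuration (including URL and hexadecimal patterns, and the full approved vocabulary) is more extensive.

```python
import re

# Stage 1: illustrative patterns for known-sensitive shapes (dates, times,
# email addresses, numeric identifiers). The real module covers more types.
MASK_PATTERNS = [
    re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),  # dates
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),         # times
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
    re.compile(r"\b\d+\b"),                            # numeric identifiers
]

# Stage 2: an illustrative approved vocabulary; unknown tokens are redacted.
ALLOWLIST = {"CT", "CHEST", "W", "WO", "CONTRAST", "MR", "BRAIN", "AXIAL"}

# Split into alpha runs, digit runs, and delimiter runs (so alpha<->numeric
# transitions break tokens, and delimiters/spacing are preserved on output).
TOKENIZER = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")

def scrub(text: str) -> str:
    # Stage 1: regex masking, replacing each match with same-length 'X' runs.
    for pat in MASK_PATTERNS:
        text = pat.sub(lambda m: "X" * len(m.group(0)), text)
    # Stage 2: allowlist tokenization; non-approved tokens become 'X' runs.
    out = []
    for tok in TOKENIZER.findall(text):
        if tok.isalnum() and tok.upper() not in ALLOWLIST:
            out.append("X" * len(tok))
        else:
            out.append(tok)
    return "".join(out)

print(scrub("CT CHEST W CONTRAST 01/02/2020 smith"))
```

Note how the output keeps the original length and spacing, so downstream consumers of series_description and similar tags see the same text layout with only the sensitive tokens replaced.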