Data Dictionary
OMOP-CDM
This document provides a comprehensive overview of the OMOP Common Data Model (CDM) tables available in the dataset. The OMOP-CDM is a standardized data model designed to facilitate the analysis and sharing of healthcare data across different institutions and studies. By using a common structure and terminology, the OMOP-CDM enables researchers to perform large-scale observational research and generate real-world evidence.
In this data dictionary, you will find detailed descriptions of each table and its columns, including data types, requirements, and any specific operations applied to the data. This information is crucial for understanding the structure and content of the dataset, ensuring accurate and meaningful analysis.
Version History: The data lake has evolved from OMOP CDM 5.3.1 (used in February-August 2025 releases) to OMOP CDM 5.4.2 (November 2025 release). The November 2025 upgrade includes new tables, enhanced field linking capabilities, updated naming conventions, and vocabulary modernization from January 2023 to August 2025. For detailed information about CDM changes, see the OHDSI CDM 5.4 Changes.
The following OMOP-CDM 5.4.2 tables are available in the November 2025 release:
| Category | Table Name |
|---|---|
| Clinical Data | condition_occurrence, drug_exposure, device_exposure, measurement, observation, procedure_occurrence, image_occurrence, visit_occurrence, visit_detail, note, death, specimen |
| Standardized Derived Elements | observation_period, person |
| Vocabularies | concept, concept_ancestor, concept_class, concept_relationship, concept_synonym, domain, relationship, vocabulary, source_to_concept_map, cohort, cohort_definition |
| Health System Data | care_site, location, provider, cost |
| Derived Elements | condition_era, drug_era, dose_era |
| Health Economics | payer_plan_period |
| Metadata | cdm_source, metadata |
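As an illustration of how these tables link together, below is a minimal query sketch using pandas, assuming the release tables are exported as Parquet files. The file paths are hypothetical; the column names follow the standard OMOP CDM field names. It resolves each condition_occurrence record to a human-readable concept name and attaches basic demographics from person.

```python
import pandas as pd

# Hypothetical paths; adjust to wherever the release tables are stored.
person = pd.read_parquet(
    "omop/person.parquet",
    columns=["person_id", "year_of_birth", "gender_concept_id"],
)
condition = pd.read_parquet(
    "omop/condition_occurrence.parquet",
    columns=["person_id", "condition_concept_id", "condition_start_date"],
)
concept = pd.read_parquet(
    "omop/concept.parquet",
    columns=["concept_id", "concept_name"],
)

# Resolve condition_concept_id to a concept name via the concept table,
# then attach demographics from the person table.
conditions_named = (
    condition
    .merge(concept, left_on="condition_concept_id", right_on="concept_id", how="left")
    .merge(person, on="person_id", how="left")
)

print(conditions_named[["person_id", "concept_name", "condition_start_date"]].head())
```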
The following OMOP-CDM 5.3.1 tables were available in the February, May, and August 2025 releases:
| Category | Table Name |
|---|---|
| Clinical Data | condition_occurrence, drug_exposure, device_exposure, measurement, observation, procedure_occurrence, image_occurrence, visit_occurrence, visit_detail, note, death, specimen |
| Standardized Derived Elements | observation_period, person |
| Vocabularies | concept, concept_ancestor, concept_class, concept_relationship, concept_synonym, domain, relationship, vocabulary, source_to_concept_map, cohort_definition |
| Health System Data | care_site, location, provider, cost |
| Derived Elements | condition_era, drug_era, dose_era |
| Health Economics | payer_plan_period |
| Metadata | cdm_source, metadata |
NeuralFrame
The following NeuralFrame tables are available in this release of the dataset:
Phillips ISPM
The following tables are available in the initial release of the dataset:
STAMP Add-On
Epic Media Server Documents
HIPPO Benchmark
HIPPO stands for Human-in-the-loop Identification of PHI with Preserved Output. This internal, fully identified dataset consists of two tables:
Gold Standard Full Text: This table contains the clinical text from which the annotations were gathered. It also contains basic demographic information about the patients to whom the notes refer.
Gold Standard Spans: This table contains the annotated spans where PHI was identified, along with the corresponding labels and the note identifier to which the spans correspond.
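As a minimal sketch of how the two tables relate, the example below joins the annotated spans back onto the note text via the shared note identifier. The file paths and column names (note_id, note_text, start, end, label) are assumptions for illustration; the actual HIPPO schema may differ.

```python
import pandas as pd

# Hypothetical paths and column names; the actual HIPPO schema may differ.
full_text = pd.read_parquet("hippo/gold_standard_full_text.parquet")  # note_id, note_text, demographics
spans = pd.read_parquet("hippo/gold_standard_spans.parquet")          # note_id, start, end, label

# Attach the note text to each annotated PHI span, then recover the span surface form.
annotated = spans.merge(full_text[["note_id", "note_text"]], on="note_id")
annotated["span_text"] = annotated.apply(
    lambda row: row["note_text"][row["start"]:row["end"]], axis=1
)

print(annotated[["note_id", "label", "span_text"]].head())
```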
Lung Resection CAP Forms
This dataset is composed of two groups of tables and one cleaned dataset in QA format:
- EPIC Clarity tables: These are the source EPIC Clarity tables required to assemble the information contained in the CAP forms as well as the pathology reports.
- Processed tables: These are two tables, cap_forms and shc_pathology_reports. The first contains the structured information from the CAP forms; the second contains the pathology reports with each of their constituent elements separated out.
- QA cleaned dataset: This is a QA dataset where the questions, answers, and context are clearly defined. It is formatted as a HuggingFace dataset and can be loaded directly for training (a loading sketch follows this list).
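A minimal loading sketch for the QA cleaned dataset, assuming it is saved to disk in HuggingFace datasets format; the path below is hypothetical.

```python
from datasets import load_from_disk

# Hypothetical on-disk location; point this at wherever the cleaned QA dataset lives.
qa = load_from_disk("lung_resection_cap_forms/qa_cleaned")

# Inspect the schema and one record; fields are expected to include the question,
# the answer, and the supporting context.
print(qa)
print(qa[0])  # if saved as a DatasetDict, use qa["train"][0] instead
```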
PHI Scrubbing Operations Definitions
In order to protect patient information, various PHI (Protected Health Information) scrubbing operations are applied to the dataset. These operations are designed to scrub sensitive data elements. Below are the definitions of the PHI scrubbing operations used in this dataset:
| PHI Scrubbing Operation Name | Description |
|---|---|
| Jitter | Shift the value by a jitter amount specific to the project and dataset column |
| Del | Delete the values in this column |
| Sub | Substitute with a value specific to the project and dataset column |
| Whitelist | Only allow the values in the ‘allowed list’ table to pass through; otherwise replace with NULL |
| Hash | Replace with a hash value based on a hashing function specific to the project and dataset column |
| TiDE | Pass through TiDE, Stanford’s text PHI scrubbing algorithm |
| RedZip | Reduce the ZIP code precision based on the Stanford University Privacy Office guidelines, Sec. 1.4 |
| Drop | Drop this column in the PHI scrubbed dataset |
| Stable Between Data Refreshes | Substitute with a stable identifier specific to the project and dataset column that will persist between data refreshes. This may be an integer or alphanumeric based on other requirements of the table |
| Not Stable between Data Refreshes | Substitute with an identifier that is specific and self-consistent only within a single data refresh; it may be an integer or alphanumeric value based on the requirements of the table |
| None | Pass the value through as is; no PHI scrubbing operation is applied |
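To make the column-level operations concrete, here is a minimal sketch of how a Whitelist and a Hash operation could be applied with pandas. The column names, allowed list, salt, and choice of SHA-256 are illustrative assumptions, not the production configuration.

```python
import hashlib

import pandas as pd


def whitelist(series: pd.Series, allowed: set) -> pd.Series:
    """Keep only values in the allowed list; everything else becomes NULL."""
    return series.where(series.isin(allowed))


def hash_column(series: pd.Series, salt: str) -> pd.Series:
    """Replace each value with a salted SHA-256 digest (illustrative only)."""
    return series.map(lambda v: hashlib.sha256(f"{salt}{v}".encode()).hexdigest())


# Hypothetical columns standing in for a project-specific table.
df = pd.DataFrame({
    "note_type": ["Progress Note", "Discharge Summary", "Unexpected Type"],
    "mrn": ["12345", "67890", "11111"],
})

df["note_type"] = whitelist(df["note_type"], {"Progress Note", "Discharge Summary"})
df["mrn"] = hash_column(df["mrn"], salt="project-specific-salt")
print(df)
```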
DICOM PHI Scrubbing
For DICOM data, the PHI scrubbing module protects PII and sensitive information through a two-stage process. First, it uses regex patterns to identify and mask specific data types including dates, times, email addresses, URLs, numeric identifiers, and hexadecimal sequences. Second, it tokenizes the text using configurable delimiters and numeric-to-alpha and alpha-to-numeric transitions, then replaces any tokens not found in a predefined allowlist with ‘X’ characters of equivalent length. The system preserves the original text structure by maintaining spacing and delimiters while systematically redacting content. It includes specialized date handling through regex patterns that match various date formats, ensuring that only approved vocabulary passes through unredacted while obscuring all other potentially sensitive information.
This redaction only applies to series_description, study_description, and protocol_name attributes (tags).
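Below is a simplified sketch of the two-stage redaction described above. The regex patterns, delimiter handling, and allowlist are illustrative placeholders rather than the production configuration.

```python
import re

# Stage 1: simplified patterns standing in for the production date/time/email/URL/
# numeric/hex patterns.
PATTERNS = [
    re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),  # dates
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),         # times
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
    re.compile(r"\bhttps?://\S+\b"),                   # URLs
    re.compile(r"\b\d{4,}\b"),                         # long numeric identifiers
    re.compile(r"\b[0-9a-fA-F]{8,}\b"),                # hexadecimal sequences
]

# Illustrative allowlist; the real one is a configured vocabulary.
ALLOWLIST = {"mri", "brain", "with", "contrast", "axial", "t1"}


def _mask(match: re.Match) -> str:
    # Replace the match with 'X' characters of equivalent length.
    return "X" * len(match.group(0))


def redact(text: str) -> str:
    # Stage 1: mask structured identifiers while preserving length.
    for pattern in PATTERNS:
        text = pattern.sub(_mask, text)
    # Stage 2: split on non-alphanumeric delimiters (kept in the output) and on
    # alpha<->numeric transitions, then redact any token not in the allowlist.
    tokens = re.split(r"([^A-Za-z0-9]+|(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z]))", text)
    out = []
    for tok in tokens:
        if tok and tok.isalnum() and tok.lower() not in ALLOWLIST:
            out.append("X" * len(tok))
        else:
            out.append(tok)
    return "".join(out)


print(redact("MRI Brain with contrast 01/02/2023 Dr Smith acc#12345"))
```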