Data Dictionary
OMOP-CDM
This document provides a comprehensive overview of the OMOP Common Data Model (CDM) tables available in the dataset. The OMOP-CDM is a standardized data model designed to facilitate the analysis and sharing of healthcare data across different institutions and studies. By using a common structure and terminology, the OMOP-CDM enables researchers to perform large-scale observational research and generate real-world evidence.
In this data dictionary, you will find detailed descriptions of each table and its columns, including data types, requirements, and any specific operations applied to the data. This information is crucial for understanding the structure and content of the dataset, ensuring accurate and meaningful analysis.
Version History: The data lake has evolved from OMOP CDM 5.3.1 (used in February-August 2025 releases) to OMOP CDM 5.4.2 (November 2025 release). The November 2025 upgrade includes new tables, enhanced field linking capabilities, updated naming conventions, and vocabulary modernization from January 2023 to August 2025. For detailed information about CDM changes, see the OHDSI CDM 5.4 Changes.
The following OMOP-CDM 5.4.2 tables are available in the November 2025 release:
| Category | Table Name |
|---|---|
| Clinical Data | condition_occurrence, drug_exposure, device_exposure, measurement, observation, procedure_occurrence, image_occurrence, visit_occurrence, visit_detail, note, death, specimen |
| Standardized Derived Elements | observation_period, person |
| Vocabularies | concept, concept_ancestor, concept_class, concept_relationship, concept_synonym, domain, relationship, vocabulary, source_to_concept_map, cohort, cohort_definition |
| Health System Data | care_site, location, provider, cost |
| Derived Elements | condition_era, drug_era, dose_era |
| Health Economics | payer_plan_period |
| Metadata | cdm_source, metadata |
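As an illustration of how these tables link together, below is a minimal query sketch using pandas, assuming the release tables are exported as Parquet files. The file paths are hypothetical; the column names follow the standard OMOP CDM field names. It resolves each condition_occurrence record to a human-readable concept name and attaches basic demographics from person.

```python
import pandas as pd

# Hypothetical paths; adjust to wherever the release tables are stored.
person = pd.read_parquet(
    "omop/person.parquet",
    columns=["person_id", "year_of_birth", "gender_concept_id"],
)
condition = pd.read_parquet(
    "omop/condition_occurrence.parquet",
    columns=["person_id", "condition_concept_id", "condition_start_date"],
)
concept = pd.read_parquet(
    "omop/concept.parquet",
    columns=["concept_id", "concept_name"],
)

# Resolve condition_concept_id to a concept name via the concept table,
# then attach demographics from the person table.
conditions_named = (
    condition
    .merge(concept, left_on="condition_concept_id", right_on="concept_id", how="left")
    .merge(person, on="person_id", how="left")
)

print(conditions_named[["person_id", "concept_name", "condition_start_date"]].head())
```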
The following OMOP-CDM 5.3.1 tables were available in the February, May, and August 2025 releases:
| Category | Table Name |
|---|---|
| Clinical Data | condition_occurrence, drug_exposure, device_exposure, measurement, observation, procedure_occurrence, image_occurrence, visit_occurrence, visit_detail, note, death, specimen |
| Standardized Derived Elements | observation_period, person |
| Vocabularies | concept, concept_ancestor, concept_class, concept_relationship, concept_synonym, domain, relationship, vocabulary, source_to_concept_map, cohort_definition |
| Health System Data | care_site, location, provider, cost |
| Derived Elements | condition_era, drug_era, dose_era |
| Health Economics | payer_plan_period |
| Metadata | cdm_source, metadata |
NeuralFrame
The following NeuralFrame tables are available in this release of the dataset:
Phillips ISPM
The following tables are available in the initial release of the dataset:
STAMP Add-On
Epic Media Server Documents
HIPPO Benchmark
HIPPO stands for Human-in-the-loop Identification of PHI with Preserved Output. This internal, fully identified dataset consists of two tables:
Gold Standard Full Text: This table contains the clinical text from which the annotations were gathered. It also contains basic demographic information about the patients to whom the notes refer.
Gold Standard Spans: This table contains the annotated spans where PHI was identified, along with the corresponding labels and the note identifier to which the spans correspond.
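As a minimal sketch of how the two tables relate, the example below joins the annotated spans back onto the note text via the shared note identifier. The file paths and column names (note_id, note_text, start, end, label) are assumptions for illustration; the actual HIPPO schema may differ.

```python
import pandas as pd

# Hypothetical paths and column names; the actual HIPPO schema may differ.
full_text = pd.read_parquet("hippo/gold_standard_full_text.parquet")  # note_id, note_text, demographics
spans = pd.read_parquet("hippo/gold_standard_spans.parquet")          # note_id, start, end, label

# Attach the note text to each annotated PHI span, then recover the span surface form.
annotated = spans.merge(full_text[["note_id", "note_text"]], on="note_id")
annotated["span_text"] = annotated.apply(
    lambda row: row["note_text"][row["start"]:row["end"]], axis=1
)

print(annotated[["note_id", "label", "span_text"]].head())
```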
Lung Resection CAP Forms
This dataset is composed of two groups of tables and one cleaned dataset in QA format:
- EPIC Clarity tables: These are the source EPIC Clarity tables required to assemble the information contained in the CAP forms as well as the pathology reports.
- Processed tables: These are two tables, cap_forms and shc_pathology_reports. The first contains the structured information from the CAP forms; the second contains the pathology reports with each of their constituent elements separated out.
- QA cleaned dataset: This is a QA dataset where the questions, answers, and context are clearly defined. It is formatted as a HuggingFace dataset and can be loaded directly for training (a loading sketch follows this list).
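A minimal loading sketch for the QA cleaned dataset, assuming it is saved to disk in HuggingFace datasets format; the path below is hypothetical.

```python
from datasets import load_from_disk

# Hypothetical on-disk location; point this at wherever the cleaned QA dataset lives.
qa = load_from_disk("lung_resection_cap_forms/qa_cleaned")

# Inspect the schema and one record; fields are expected to include the question,
# the answer, and the supporting context.
print(qa)
print(qa[0])  # if saved as a DatasetDict, use qa["train"][0] instead
```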
PHI Scrubbing Operations Definitions
In order to protect patient information, various PHI (Protected Health Information) scrubbing operations are applied to the dataset. These operations are designed to scrub sensitive data elements. Below are the definitions of the PHI scrubbing operations used in this dataset:
| PHI Scrubbing Operation Name | Description |
|---|---|
| Jitter | Shift the value by a jitter amount specific to the project and dataset column |
| Del | Delete the values in this column |
| Sub | Substitute with a value specific to the project and dataset column |
| Whitelist | Only allow the values in the ‘allowed list’ table to pass through; otherwise replace with NULL |
| Hash | Replace with a hash value based on a hashing function specific to the project and dataset column |
| TiDE | Pass through TiDE, Stanford’s text PHI scrubbing algorithm |
| RedZip | Reduce the ZIP code precision based on the Stanford University Privacy Office guidelines, Sec. 1.4 |
| Drop | Drop this column in the PHI scrubbed dataset |
| Stable Between Data Refreshes | Substitute with a stable identifier specific to the project and dataset column that will persist between data refreshes. This may be an integer or alphanumeric based on other requirements of the table |
| Not Stable between Data Refreshes | Substitute with an identifier that is specific and self-consistent only within a single data refresh; it may be an integer or alphanumeric value based on the requirements of the table |
| None | Pass the value through as is; no PHI scrubbing operation is applied |
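To make the column-level operations concrete, here is a minimal sketch of how a Whitelist and a Hash operation could be applied with pandas. The column names, allowed list, salt, and choice of SHA-256 are illustrative assumptions, not the production configuration.

```python
import hashlib

import pandas as pd


def whitelist(series: pd.Series, allowed: set) -> pd.Series:
    """Keep only values in the allowed list; everything else becomes NULL."""
    return series.where(series.isin(allowed))


def hash_column(series: pd.Series, salt: str) -> pd.Series:
    """Replace each value with a salted SHA-256 digest (illustrative only)."""
    return series.map(lambda v: hashlib.sha256(f"{salt}{v}".encode()).hexdigest())


# Hypothetical columns standing in for a project-specific table.
df = pd.DataFrame({
    "note_type": ["Progress Note", "Discharge Summary", "Unexpected Type"],
    "mrn": ["12345", "67890", "11111"],
})

df["note_type"] = whitelist(df["note_type"], {"Progress Note", "Discharge Summary"})
df["mrn"] = hash_column(df["mrn"], salt="project-specific-salt")
print(df)
```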
DICOM PHI Scrubbing
For DICOM data, the PHI scrubbing module protects PII and sensitive information through a two-stage process. First, it uses regex patterns to identify and mask specific data types including dates, times, email addresses, URLs, numeric identifiers, and hexadecimal sequences. Second, it tokenizes the text using configurable delimiters and numeric-to-alpha and alpha-to-numeric transitions, then replaces any tokens not found in a predefined allowlist with ‘X’ characters of equivalent length. The system preserves the original text structure by maintaining spacing and delimiters while systematically redacting content. It includes specialized date handling through regex patterns that match various date formats, ensuring that only approved vocabulary passes through unredacted while obscuring all other potentially sensitive information.
This redaction only applies to series_description, study_description, and protocol_name attributes (tags).
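Below is a simplified sketch of the two-stage redaction described above. The regex patterns, delimiter handling, and allowlist are illustrative placeholders rather than the production configuration.

```python
import re

# Stage 1: simplified patterns standing in for the production date/time/email/URL/
# numeric/hex patterns.
PATTERNS = [
    re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),  # dates
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),         # times
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
    re.compile(r"\bhttps?://\S+\b"),                   # URLs
    re.compile(r"\b\d{4,}\b"),                         # long numeric identifiers
    re.compile(r"\b[0-9a-fA-F]{8,}\b"),                # hexadecimal sequences
]

# Illustrative allowlist; the real one is a configured vocabulary.
ALLOWLIST = {"mri", "brain", "with", "contrast", "axial", "t1"}


def _mask(match: re.Match) -> str:
    # Replace the match with 'X' characters of equivalent length.
    return "X" * len(match.group(0))


def redact(text: str) -> str:
    # Stage 1: mask structured identifiers while preserving length.
    for pattern in PATTERNS:
        text = pattern.sub(_mask, text)
    # Stage 2: split on non-alphanumeric delimiters (kept in the output) and on
    # alpha<->numeric transitions, then redact any token not in the allowlist.
    tokens = re.split(r"([^A-Za-z0-9]+|(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z]))", text)
    out = []
    for tok in tokens:
        if tok and tok.isalnum() and tok.lower() not in ALLOWLIST:
            out.append("X" * len(tok))
        else:
            out.append(tok)
    return "".join(out)


print(redact("MRI Brain with contrast 01/02/2023 Dr Smith acc#12345"))
```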