About
Welcome to the VISTA Oncology Data Lake project. This data lake brings together data from different clinical modalities and sources, including harmonized and standardized Electronic Health Records data in the OMOP Common Data Model, data from the Stanford Clinical Registry, genomic data from Philips ISPM, and additional STAMP Add-On metadata, with more data types coming soon. This website provides documentation for the datasets released as part of the project.
VISTA Inclusion Criteria
In order to be included in any of the VISTA Oncology Data Lake resources, patients must be present in STARR-OMOP and also have one of the following:
- an encounter of type ‘tumor board’ OR
- a case record in Stanford’s NeuralFrame dataset
STARR-OMOP contains all Stanford patients whose information may be used for research and who have at least one clinical event included in the OMOP data model (visit, condition, procedure, drug exposure, device exposure, image occurrence, observation, or note) since Jan 1, 2000. Tumor board encounters are reliably present in the EHR system for tumor board discussions beginning in 2018 (prior years have incomplete coverage). Patients are present in NeuralFrame if they have received their first line of cancer treatment at Stanford.
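The inclusion rule above can be read as a simple boolean condition. The following is a minimal sketch, not the project's actual cohort-building code; the `PatientFlags` class and its field names are hypothetical and do not correspond to real VISTA or OMOP column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientFlags:
    """Hypothetical per-patient flags (illustrative names, not VISTA columns)."""
    in_starr_omop: bool              # patient is present in STARR-OMOP
    has_tumor_board_encounter: bool  # at least one encounter of type 'tumor board'
    has_neuralframe_case: bool       # at least one NeuralFrame case record


def is_vista_eligible(patient: PatientFlags) -> bool:
    """Return True when the patient meets the inclusion criteria as described.

    Presence in STARR-OMOP is required; on top of that, either a tumor board
    encounter or a NeuralFrame case record is sufficient.
    """
    return patient.in_starr_omop and (
        patient.has_tumor_board_encounter or patient.has_neuralframe_case
    )
```

For example, a patient with a NeuralFrame case record who is absent from STARR-OMOP would not qualify under this reading, because presence in STARR-OMOP is a precondition for both branches.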
Dataset Overview
Oncology-OMOP
The OMOP-CDM is a standardized data model designed to facilitate the analysis and sharing of healthcare data across different institutions and studies. By using a common structure and terminology, the OMOP-CDM enables researchers to perform large-scale observational research and generate real-world evidence.
Our EHR data was initially standardized using OMOP CDM version 5.3.1 and subsequently upgraded to OMOP CDM version 5.4.2 to leverage the latest data model enhancements. For more details about the November 2025 upgrade and data evolution across releases, see the Data Evolution Overview page.
The identified dataset is created using Epic Clarity tables, which include patient and encounter data permissible for research. The tables in the OMOP-CDM common data model that are part of this dataset are listed in the Data Dictionary page, along with information on whether they contain PHI and a brief description of each table.
NeuralFrame
NeuralFrame data encompasses research-eligible patients who have case records in NeuralFrame, also known as the Stanford Cancer Registry. This dataset is categorized into four main areas:
Outcome: This category includes information related to patient outcomes, such as survival rates, disease progression, and overall health status.
Diagnoses: This section contains details about the diagnoses made for each patient, including cancer types, staging, and any relevant comorbidities.
Treatment: This category outlines the various treatments administered to patients, including surgical interventions, chemotherapy, radiation therapy, and other therapeutic approaches.
Miscellaneous: This section includes additional data that may not fit into the other categories, such as demographic information, patient-reported outcomes, and other relevant clinical data.
Philips ISPM
Philips ISPM Orders: This table contains order-level information from the Philips IntelliSpace Precision Medicine (ISPM) genomics database at Stanford. The fields in this table are related to diagnostic orders, patient demographics, and specimen accession numbers which can be used to link to the other Philips ISPM tables.
Philips ISPM Aberration: This table contains genomic testing information from the Philips IntelliSpace Precision Medicine (ISPM) genomics database at Stanford. The fields in this table are related to genomic testing details about each sample, as well as the specimen accession number which can be used to link to the Philips ISPM Orders table.
Philips ISPM Specimen: This table includes specimen-related information from the Philips IntelliSpace Precision Medicine (ISPM) genomics database at Stanford. The fields in this table are related to specimen details, including accession numbers and collection information.
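As described above, the specimen accession number is the field that links the Philips ISPM tables to one another. The sketch below illustrates that kind of link on plain Python dictionaries. It is an assumption-laden illustration: the key name `accession_number` and the other field names are hypothetical, and the real tables may use different column names and may contain several orders per accession number.

```python
def link_aberrations_to_orders(orders, aberrations, key="accession_number"):
    """Attach order-level fields to each aberration row via a shared key.

    `orders` and `aberrations` are lists of dicts. The default key name is a
    hypothetical stand-in for the specimen accession number column.
    Aberration rows without a matching order are dropped (inner join).
    This sketch assumes one order per accession number; if several exist,
    the last one seen is used.
    """
    orders_by_accession = {row[key]: row for row in orders}
    linked = []
    for aberration in aberrations:
        order = orders_by_accession.get(aberration[key])
        if order is not None:
            # Aberration fields take precedence when names collide.
            linked.append({**order, **aberration})
    return linked
```

A short usage example with made-up values:

```python
orders = [{"accession_number": "A1", "order_code": "STAMPT"}]
aberrations = [
    {"accession_number": "A1", "gene": "GENE_X"},
    {"accession_number": "A2", "gene": "GENE_Y"},  # no matching order
]
linked = link_aberrations_to_orders(orders, aberrations)
```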
The following assays are included in the Philips ISPM data:
STAMP: The Stanford Actionable Mutation Panel for Solid Tumors (STAMP) is a targeted next-generation sequencing (NGS) assay designed to detect clinically actionable mutations as well as other genes frequently altered in cancer. STAMP employs a target enrichment-based sequencing method to capture specific genomic regions of interest. The sequencing approach and integrated bioinformatics pipeline are optimized for ultra-deep sequencing of formalin-fixed, paraffin-embedded (FFPE) tumor biopsy specimens. This panel focuses on clinically actionable genes, selected based on:
- Gene Selection Criteria:
  - Their utility as targets of current or emerging anti-cancer therapies
  - Their prognostic value
  - Their mutation frequency across known cancer types
- Technical Specifications:
  - Sequencing is performed using an Illumina platform
  - Minimum limit of detection (LOD) of 5% variant allele frequency (VAF)
  - Genomic coordinates reported relative to GRCh37 (hg19)
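To make the 5% VAF limit of detection concrete: variant allele frequency is the fraction of reads at a position that support the variant allele. The sketch below computes it and compares it with the stated LOD. This is only an illustration of the arithmetic; the function names are hypothetical, and the source does not describe how the clinical pipeline applies the threshold.

```python
LOD_VAF = 0.05  # 5% limit of detection, as stated in the specifications


def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Return the fraction of reads supporting the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


def at_or_above_lod(alt_reads: int, total_reads: int, lod: float = LOD_VAF) -> bool:
    """Return True when the observed VAF meets or exceeds the limit of detection."""
    return variant_allele_frequency(alt_reads, total_reads) >= lod
```

For instance, 50 variant-supporting reads out of 1,000 total reads is a VAF of 5%, which sits exactly at the stated limit, while 49 out of 1,000 falls below it.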
Heme-STAMP: The Stanford Actionable Mutation Panel for Hematopoietic and Lymphoid Malignancies (Heme-STAMP) is a targeted next-generation sequencing (NGS) assay designed to detect single nucleotide variants (SNVs), short insertion–deletions (indels), and selected gene fusions across 203 genes recurrently altered in myeloid and lymphoid neoplasms. Heme-STAMP employs a target enrichment–based sequencing method, beginning with acoustic shearing of genomic DNA, followed by sequencing library preparation and capture of genomic regions of interest using custom-designed oligonucleotide probes.
- Gene Selection Criteria:
  - Clinical relevance as targets of existing or emerging anti-cancer therapies
  - Prognostic significance in hematopoietic malignancies
  - Recurrence frequency across patients with myeloid and lymphoid neoplasms
- Technical Specifications:
  - Sequencing performed on an Illumina platform
  - Minimum limit of detection (LOD) of 5% variant allele frequency (VAF) for SNVs and indels
  - Targets 203 genes (either fully or partially covered)
  - Pooled libraries prepared through acoustic shearing and targeted enrichment
FoundationOne Heme is designed to include genes known to be somatically altered in human hematologic malignancies and sarcomas that are validated targets for therapy, either approved or in clinical trials, and/or that are unambiguous drivers of oncogenesis based on current knowledge. The current assay utilizes DNA sequencing to interrogate 406 genes as well as selected introns of 31 genes involved in rearrangements, in addition to RNA sequencing of 265 genes. The assay will be updated periodically to reflect new knowledge about cancer biology.
FoundationOne CDx™ (F1CDx) is a next generation sequencing based in vitro diagnostic device for detection of substitutions, insertion and deletion alterations (indels), and copy number alterations (CNAs) in 324 genes and select gene rearrangements, as well as genomic signatures including microsatellite instability (MSI) and tumor mutational burden (TMB) using DNA isolated from formalin-fixed paraffin embedded (FFPE) tumor tissue specimens.
FoundationOne Liquid CDx is a qualitative next generation sequencing based in vitro diagnostic test that uses targeted high throughput hybridization-based capture technology to detect and report substitutions, insertions and deletions (indels) in 311 genes, including rearrangements in eight (8) genes, and copy number alterations in three (3) genes. FoundationOne Liquid CDx utilizes circulating cell-free DNA (cfDNA) isolated from plasma derived from anti-coagulated peripheral whole blood of cancer patients.
STAMP Add-On
- The STAMP Add-On includes the ‘assay performed’ value, which specifies the type of STAMP test result: for example, the Stanford Actionable Mutation Panel for Solid Tumors (STAMP, Order Code: STAMPT) or Heme-STAMP, which includes both Heme-STAMP, Blood (Order Code: HSTAMPB) and Heme-STAMP, Non-Blood (Order Code: HSTAMPT). In addition, the Add-On contains pipeline version information, along with patient identifiers and accession numbers.
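The order codes listed above can be grouped into assay families with a small lookup. This is a sketch only: the three codes come from the description above, but the function name is hypothetical, and the source does not state whether the ‘assay performed’ field stores the order code, the assay name, or some other value.

```python
# Order codes as listed in the STAMP Add-On description.
ORDER_CODE_TO_ASSAY = {
    "STAMPT": "STAMP",
    "HSTAMPB": "Heme-STAMP, Blood",
    "HSTAMPT": "Heme-STAMP, Non-Blood",
}


def assay_family(order_code: str) -> str:
    """Map an order code to 'STAMP', 'Heme-STAMP', or 'unknown'.

    Input is normalized for surrounding whitespace and letter case, which is
    an assumption about how codes might appear in practice.
    """
    assay = ORDER_CODE_TO_ASSAY.get(order_code.strip().upper())
    if assay is None:
        return "unknown"
    # Both Heme-STAMP variants share the same family name.
    return "Heme-STAMP" if assay.startswith("Heme-STAMP") else "STAMP"
```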
HIPPO Benchmark
HIPPO stands for Human-in-the-loop Identification of PHI with Preserved Output. This is a corpus of 1,394 clinical notes sampled from STARR-OMOP. The PHI within the corpus was annotated by expert annotators. The full methodology describing dataset generation can be found here.
Lung Resection CAP Forms
This is a fully identified dataset that contains a combination of structured and unstructured information. A description of the CAP forms themselves and their purpose can be found here and here. This dataset only contains Lung Resection CAP forms.
Epic Media Server Documents
This is a fully identified dataset that contains all PDFs for cancer patients that currently exist in the Epic Media (BLOB) Server and are included in the VISTA data lake. These documents:
- Are displayed in the Media and Lab tab in Epic Hyperspace
- Typically originate from third party service providers
- Are pulled into VISTA via the Mulesoft API
Metadata
The metadata includes the DICOM dictionary, the listed genes for each STAMP and Heme-STAMP version, and the related BED files.
Dataset Releases
We periodically release updated versions of our datasets to ensure that researchers have access to the most current and comprehensive data available. Below are the details of our latest dataset releases:
Data Privacy and Security
We prioritize patient privacy and data security. Our datasets undergo rigorous PHI scrubbing processes. Most vocabulary tables contain standard terminology, do not vary between institutions, and do not contain any PHI. We have included all populated tables in the CDM, along with descriptions of the fields, on this website.
Project Resources
For more information regarding our source data and methods, please refer to the following resources:
- Source Data
- Methods
- Publications
Acknowledgments
This research was funded, in part, by the Advanced Research Projects Agency for Health (ARPA-H). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.