CAP Forms QA cleaned dataset

This dataset is a HuggingFace Dataset in question-and-answer format. For convenience, it is distributed as a single compressed tar.gz file. The archive contains the files typical of a HuggingFace Dataset serialization using parquet files.
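
The archive has to be extracted before the dataset can be loaded. The sketch below shows one way to do that with Python's standard tarfile module. The function name extract_archive and both of its arguments are hypothetical, chosen for illustration; the actual archive name and the folder layout inside it are not stated here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return dest_dir as a Path.

    Both arguments are placeholders: pass the real archive location and
    any folder where the extracted files should go.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter on Python versions that provide it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return dest
```

The folder produced by the extraction is the path to pass to load_from_disk.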

Once the archive is uncompressed, the dataset can be loaded with the code below:

from datasets import load_from_disk
fpath = "<path to where the uncompressed folder is located>"  # replace with the actual path
dataset = load_from_disk(fpath)

Printing the dataset should show the following:

DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answers', 'metadata'],
        num_rows: 22420
    })
})

An example of a record is shown below:

{'id': '671f2274-7987-4b40-8c76-95b37ggh5d08',
 'context': <pathology report text without synoptic report>,
 'question': 'ADDITIONAL FINDINGS',
 'answers': {'answer_start': [-1],
  'text': ['ATYPICAL ADENOMATOUS HYPERPLASIA']},
 'metadata': {'concept_hierarchy': 'LUNG > ADDITIONAL FINDINGS > Additional Findings > Atypical adenomatous hyperplasia',
  'order_proc_id': 123456789,
  'stanford_patient_uid': '98765432 | 1999-10-14',
  'synoptic_name': 'LUNG: RESECTION',
  'taken_time': '2023-01-10T09:04:00'}}
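
As a rough illustration of how the nested fields in a record like the one above can be read, the sketch below works on a plain Python dict that copies the example values. The helper names first_answer and split_concept_hierarchy are hypothetical, and splitting the hierarchy on '>' is an assumption based only on how the example value is written.

```python
def first_answer(record):
    """Return the first answer text of a record, or None if there is none."""
    texts = record["answers"]["text"]
    return texts[0] if texts else None


def split_concept_hierarchy(record):
    """Split metadata['concept_hierarchy'] into its levels.

    Assumes levels are separated by '>' as in the example record.
    """
    hierarchy = record["metadata"]["concept_hierarchy"]
    return [level.strip() for level in hierarchy.split(">")]


# A plain dict mirroring the example record; the context is a placeholder.
record = {
    "id": "671f2274-7987-4b40-8c76-95b37ggh5d08",
    "context": "<pathology report text without synoptic report>",
    "question": "ADDITIONAL FINDINGS",
    "answers": {"answer_start": [-1],
                "text": ["ATYPICAL ADENOMATOUS HYPERPLASIA"]},
    "metadata": {
        "concept_hierarchy": "LUNG > ADDITIONAL FINDINGS > Additional Findings > Atypical adenomatous hyperplasia",
        "order_proc_id": 123456789,
        "stanford_patient_uid": "98765432 | 1999-10-14",
        "synoptic_name": "LUNG: RESECTION",
        "taken_time": "2023-01-10T09:04:00",
    },
}

print(record["question"], "->", first_answer(record))
print(split_concept_hierarchy(record))
```

The same helpers should work on a record taken from the loaded dataset (for example dataset["train"][0]), since indexing a HuggingFace Dataset by row returns a dict.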

Each field is described below: