Loading Datasets

After installation, Parthenon has all table structures but no clinical data. The Dataset Acquisition Tool lets you browse and load publicly available OHDSI datasets with a guided terminal interface.

Quick Start

# From the Parthenon directory
./parthenon-data

Or equivalently:

python3 -m datasets

The tool checks your system, then presents an interactive menu where you can choose a recommended bundle or pick individual datasets.

Available Datasets

The tool catalogs 17 public datasets across 6 categories:

Vocabulary

Dataset	Size	Description
OMOP Vocabulary (Athena)	~4 GB download / 16 GB loaded	Standard OHDSI concept vocabulary with 7M+ concepts, relationships, and ancestors. Required for most research workflows.

Manual download required

The OMOP vocabulary requires a free account at athena.ohdsi.org. Download the vocabulary ZIP file first, then provide the path when prompted by the tool.
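
Before handing the ZIP path to the tool, it can be useful to confirm that the archive looks like a vocabulary export. The following is a minimal sketch, not part of the tool: the function name is hypothetical, and the list of expected CSV files is an assumption about a typical Athena export that you may need to adjust.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: a typical Athena export contains at least these CSV files.
EXPECTED_FILES = {
    "CONCEPT.csv",
    "CONCEPT_RELATIONSHIP.csv",
    "CONCEPT_ANCESTOR.csv",
    "VOCABULARY.csv",
}


def missing_vocabulary_files(zip_path):
    """Return the expected CSV names that are absent from the ZIP.

    Names are compared case-insensitively and any folder prefix
    inside the archive is ignored.
    """
    with zipfile.ZipFile(zip_path) as archive:
        present = {PurePosixPath(name).name.upper() for name in archive.namelist()}
    return sorted(name for name in EXPECTED_FILES if name.upper() not in present)
```

An empty list means every expected file was found; anything else names the files to look for before re-downloading.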

CDM Datasets

Dataset	Size	Description
Eunomia GiBleed	~38 MB	OHDSI standard demo dataset with 2,694 GI bleeding patients. Includes full CDM tables and Achilles characterization. Perfect for evaluation.
CMS SynPUF 1K	~200 MB	Synthetic Medicare claims for 1,000 patients in OMOP format. (Coming soon)
CMS SynPUF Full	~10 GB	Full CMS Synthetic Public Use Files with 2.3 million patients. (Coming soon)
Synthea 1K	~500 MB	Synthea-generated synthetic patients with realistic clinical histories. (Coming soon)

Genomics

Dataset	Size	Description
ClinVar	~500 MB	NCBI ClinVar database — clinical significance annotations for human genetic variants
GIAB HG001	~3.5 GB	Genome in a Bottle reference — NA12878, the most extensively characterized human genome
GIAB HG002–HG007	~3.5 GB each	Ashkenazim Trio (HG002–HG004: son, father, mother) and Chinese Trio (HG005–HG007: son, father, mother)

Imaging

Dataset	Size	Description
Class-3 Malocclusion CBCT	~21 MB	Orthodontic cone-beam CT imaging dataset
Harvard COVID-19 CT	~242 GB	1,000 subjects with 491K DICOM instances from The Cancer Imaging Archive (manual download)

GIS / Geospatial

Dataset	Size	Description
US Census TIGER/Line	~100 MB	Census tracts, counties, ZIP crosswalks, CDC SVI, EPA air quality, CMS hospital data

Phenotypes

Dataset	Size	Description
OHDSI Phenotype Library	~5 MB	1,100+ community-curated cohort definitions from the OHDSI Phenotype Library

Recommended Bundles

Instead of picking datasets one by one, start with a bundle:

Bundle	What's Included	Total Size	Best For
Quick Start	Eunomia + Phenotype Library	~50 MB	Quick evaluation, demos
Research Ready	Eunomia + Vocabulary + ClinVar + GIAB HG001 + GIS + Phenotypes	~12 GB	Active research teams
Genomics Focus	Eunomia + ClinVar + GIAB HG001 & HG002 + Phenotypes	~8 GB	Genomics workflows
Full Platform	All auto-downloadable datasets	~50 GB	Full capability evaluation

Bundles are just presets — after selecting one, you can add or remove individual datasets before confirming.
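
The preset behavior can be pictured as a set that you copy and then adjust. This is a minimal sketch only: the bundle keys and the `customize` helper are hypothetical, and the dataset keys are borrowed from the `--only` examples in this guide rather than from the tool's internals.

```python
# Bundle contents mirror two rows of the table above. The bundle keys are
# illustrative and are not necessarily the identifiers the tool uses.
BUNDLES = {
    "quick-start": {"eunomia", "phenotype-library"},
    "genomics-focus": {
        "eunomia",
        "clinvar",
        "giab-hg001",
        "giab-hg002",
        "phenotype-library",
    },
}


def customize(bundle, add=(), remove=()):
    """Start from a bundle preset, then add or remove individual datasets."""
    selection = set(BUNDLES[bundle])  # copy, so the preset itself is unchanged
    selection |= set(add)
    selection -= set(remove)
    return sorted(selection)
```

For example, `customize("quick-start", add=["clinvar"])` yields the Quick Start datasets plus ClinVar, while the preset stays as defined.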

TUI Walkthrough

1. System Status Check

The tool verifies your Docker environment:

╭─ Parthenon Dataset Acquisition ─╮
│                                  │
│  Checking system status...       │
│                                  │
│  OK Docker running               │
│  OK PHP container (healthy)      │
│  OK PostgreSQL reachable         │
│  OK 142 GB free disk space       │
│                                  │
│  Already loaded (2):             │
│    OK Eunomia GiBleed            │
│    OK Achilles analysis catalog  │
│                                  │
╰──────────────────────────────────╯
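
If you want a rough pre-check before launching the tool, for example in a provisioning script, something like the sketch below may help. It is an approximation under stated assumptions: it only checks that a `docker` executable is on the PATH and how much disk space is free, and it does not reproduce the tool's container health or PostgreSQL checks. The function name and the 50 GB default are illustrative.

```python
import shutil


def preflight(path=".", required_gb=50.0):
    """Report whether the Docker CLI is on PATH and whether enough disk is free.

    This does not verify that the Docker daemon is running or that any
    container is healthy; it is only a coarse first check.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "docker_cli_found": shutil.which("docker") is not None,
        "free_gb": round(free_gb, 1),
        "enough_space": free_gb >= required_gb,
    }
```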

2. Selection Mode

Choose a bundle for quick setup, or go a la carte:

? How would you like to select datasets?
  > Start with a recommended bundle
    Pick individual datasets (a la carte)
    Exit

3. Dataset Picker

Toggle datasets with the space bar. Already-loaded items are greyed out, and dependencies are auto-resolved:

? Select datasets to install:
  ── Vocabulary ──
  [ ] OMOP Vocabulary (Athena)     ~4 GB    [manual download]

  ── CDM Datasets ──
  [x] Eunomia GiBleed             38 MB    [already loaded]
  [ ] CMS SynPUF 1K               200 MB   [coming soon]

  ── Genomics ──
  [x] ClinVar variant database    500 MB
  [x] GIAB HG001 (NA12878)        3.5 GB

  ── Phenotypes ──
  [x] OHDSI Phenotype Library     5 MB

4. Confirmation and Loading

The tool resolves dependencies, shows a summary with disk estimates, then downloads and loads each dataset with progress bars:

Ready to install 3 dataset(s):

  1. ClinVar variant database
  2. GIAB HG001 (NA12878)
  3. OHDSI Phenotype Library

  Download: ~1.0 GB | Loaded: ~4.0 GB

? Proceed? Yes

── ClinVar variant database ──
  Syncing ClinVar variant database...
  Done.

── GIAB HG001 (NA12878) ──
  HG001.vcf.gz  ━━━━━━━━━━━━━━━  800 MB  2.1 MB/s  eta 0:00:00
  Decompressing HG001.vcf.gz -> HG001.vcf...
  Importing HG001 VCF into database...
  Done.
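
The decompression step shown in the output above can be approximated with the standard library. This is a sketch of the general technique, not the tool's actual implementation; the function name is hypothetical.

```python
import gzip
import shutil


def decompress_gzip(src, dst, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out, chunk_size)
    return dst
```

Streaming in chunks matters here because a decompressed VCF can be several gigabytes.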

CLI Options

List all datasets

./parthenon-data --list

Shows all 17 datasets with sizes and availability status.

Load specific datasets (non-interactive)

# Load just Eunomia and phenotypes
./parthenon-data --only eunomia phenotype-library

# Load vocabulary (will prompt for ZIP path)
./parthenon-data --only vocabulary

# Load selected genomics datasets
./parthenon-data --only clinvar giab-hg001 giab-hg002
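
For scripted setups, the same invocation can be wrapped from Python. The sketch below only assembles the `--only` command line shown above; the helper names are hypothetical, and the assumption that the tool returns a non-zero exit status on failure is not confirmed by this guide.

```python
import subprocess


def build_command(dataset_keys, executable="./parthenon-data"):
    """Build the argument list for a non-interactive load of the given datasets."""
    keys = list(dataset_keys)
    if not keys:
        raise ValueError("at least one dataset key is required")
    return [executable, "--only", *keys]


def load_datasets(dataset_keys):
    """Run the tool from the Parthenon directory.

    Assumption: the tool exits with a non-zero status on failure, in which
    case check=True raises CalledProcessError.
    """
    subprocess.run(build_command(dataset_keys), check=True)
```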

Re-run after partial failure

The tool detects what is already loaded and skips it, so it is safe to re-run at any time:

./parthenon-data --only giab-hg001  # Resumes download if interrupted

Dependency Resolution

Some datasets depend on others. For example, CMS SynPUF requires the OMOP vocabulary (since SynPUF records reference standard concept IDs). The tool automatically detects and adds missing dependencies:

Auto-adding dependencies:
  + OMOP Vocabulary (Athena) (required by CMS SynPUF 1K)
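
Conceptually, this is a depth-first walk over a dependency map that places each dependency before the dataset that needs it. The sketch below illustrates the idea only: the map, the `synpuf-1k` key, and the `resolve` function are hypothetical and are not taken from the tool's source.

```python
# Hypothetical dependency map; "synpuf-1k" is an illustrative key.
DEPENDS_ON = {
    "synpuf-1k": ["vocabulary"],
}


def resolve(selected, already_loaded=()):
    """Return an install order with dependencies first, skipping loaded datasets."""
    order = []
    seen = set(already_loaded)  # loaded datasets are treated as already handled

    def visit(key):
        if key in seen:
            return
        seen.add(key)  # mark before recursing so a cycle cannot loop forever
        for dependency in DEPENDS_ON.get(key, []):
            visit(dependency)
        order.append(key)

    for key in selected:
        visit(key)
    return order
```

With this map, selecting only `synpuf-1k` produces an order that loads `vocabulary` first, and nothing is added if the vocabulary is already loaded.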

Loading from the Installer

During initial installation, Phase 5 handles dataset loading. You can specify which datasets to load in a defaults file:

{
  "datasets": ["eunomia", "phenotype-library", "clinvar"]
}

python3 install.py --defaults-file config.json

If no datasets are specified, the installer loads Eunomia by default. You can always load more datasets later with ./parthenon-data.
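
If you generate the defaults file as part of an automated install, a small helper can write it. This sketch covers only the `datasets` key shown above; any other keys the installer may accept are outside its scope, and the function name is hypothetical.

```python
import json


def write_defaults(path, datasets):
    """Write a defaults file containing only the "datasets" list."""
    config = {"datasets": list(datasets)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
    return config
```

Calling `write_defaults("config.json", ["eunomia", "phenotype-library", "clinvar"])` produces a file equivalent to the JSON example above.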

Data Sources After Loading

After loading datasets, configure them as data sources in Parthenon:

1. Log in as an administrator.
2. Go to Administration > Data Sources.
3. The Eunomia source is automatically registered during loading.
4. For your own OMOP CDM databases, click Add Source and provide connection details.

See Data Sources for detailed configuration.

Dataset Storage

Downloaded files are stored in:

Directory	Contents
`downloads/`	Temporary download files (cleaned up after import)
`vcf/giab_NISTv4.2.1/`	Decompressed GIAB VCF files
`dicom_samples/`	Extracted DICOM imaging files
`GIS/`	Raw geospatial data

All clinical data is loaded into the Docker PostgreSQL database. Downloaded files can be deleted after successful import to reclaim disk space.
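
To see how much space each directory would free before deleting anything, a short script can total the file sizes. This is a sketch with a hypothetical function name; it assumes the directory names from the table above are relative to the directory you run it from.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Total size in bytes of all regular files under root (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for name in ("downloads", "vcf/giab_NISTv4.2.1", "dicom_samples", "GIS"):
        print(f"{name}: {directory_size_bytes(name) / 1e9:.2f} GB")
```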