
ETL Tools

The ETL Tools page is a unified workspace that brings together three data ingestion capabilities under a single tabbed interface. It provides tools for profiling source databases, generating synthetic OMOP data, and ingesting FHIR resources into the CDM. Each tab operates independently, allowing you to switch between workflows without losing state.

Accessing ETL Tools

Navigate to Ingestion > ETL Tools from the main navigation. The page displays three tabs:

  1. Source Profiler — WhiteRabbit-based database profiling
  2. Synthea Generator — Synthetic patient data generation
  3. FHIR Ingestion — FHIR R4 resource ingestion

Tab 1: Source Profiler

The Source Profiler tab provides a streamlined version of the full Source Profiler for quick database scanning. It connects to the WhiteRabbit service to scan your data source and produce a table-by-table quality report.

Usage

  1. Verify the WhiteRabbit service health indicator shows "available" at the top of the tab.
  2. Select a Data Source from the dropdown (all registered Parthenon sources are listed).
  3. Optionally enter a Table Filter — a comma-separated list of table names to restrict the scan scope.
  4. Click Scan Database and wait for the scan to complete.
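
The Table Filter in step 3 is a comma-separated list, so stray spaces, empty entries, and duplicates are easy to introduce when pasting names. The sketch below shows one way to normalize such a string before entering it. The function name `parse_table_filter` and the case-insensitive de-duplication are illustrative assumptions, not the profiler's documented behavior.

```python
def parse_table_filter(raw: str) -> list[str]:
    """Split a comma-separated table filter into clean, de-duplicated names."""
    seen: set[str] = set()
    tables: list[str] = []
    for part in raw.split(","):
        name = part.strip()
        if not name:
            continue  # skip empty entries such as "a,,b"
        key = name.lower()  # assumption: table names compare case-insensitively
        if key not in seen:
            seen.add(key)
            tables.append(name)
    return tables


print(parse_table_filter("person, visit_occurrence,, Person "))
# ['person', 'visit_occurrence']
```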

Results

After scanning, the tab displays:

  • Summary cards — Tables scanned, total columns, total rows, and scan time
  • Data Quality Flags — A warning list of tables that have columns with > 50% null values, identifying specific column names
  • Table accordions — Collapsible rows for each table showing column name, data type (color-coded badge), null percentage bar, distinct value count, and top-5 sample values

Click Export Report to download the scan results as a JSON file.
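
Once exported, the report can be post-processed outside the UI, for example to reproduce the Data Quality Flags list. The following is a minimal sketch that assumes a hypothetical report shape: the keys `tables`, `columns`, `name`, and `null_pct` are placeholders, so check them against an actual export before relying on this.

```python
import json

# Hypothetical report shape; the real export may use different keys.
SAMPLE_REPORT = """
{
  "tables": [
    {"name": "person", "columns": [
      {"name": "person_id", "null_pct": 0.0},
      {"name": "birth_datetime", "null_pct": 72.5}
    ]},
    {"name": "visit_occurrence", "columns": [
      {"name": "visit_occurrence_id", "null_pct": 0.0}
    ]}
  ]
}
"""


def high_null_columns(report: dict, threshold: float = 50.0) -> dict[str, list[str]]:
    """Map each table to its columns whose null percentage exceeds the threshold."""
    flags: dict[str, list[str]] = {}
    for table in report["tables"]:
        cols = [c["name"] for c in table["columns"] if c["null_pct"] > threshold]
        if cols:
            flags[table["name"]] = cols
    return flags


report = json.loads(SAMPLE_REPORT)
print(high_null_columns(report))
# {'person': ['birth_datetime']}
```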

Quick Scan vs Full Profiler

The Source Profiler tab here provides the essential scanning workflow. For advanced features like the completeness heatmap, data quality scorecard with letter grades, table size distribution chart, scan history, and sorting/filtering controls, use the dedicated Source Profiler page.

Tab 2: Synthea Generator

The Synthea Generator loads pre-generated Synthea CSV files into an OMOP CDM source, converting synthetic patient records into properly structured CDM tables.

Configuration

| Field | Description |
| --- | --- |
| Target Source | The registered data source where CDM records will be inserted |
| Patient Count | Number of synthetic patients to load (1 to 100,000) |
| Synthea CSV Output Folder | Absolute filesystem path to the directory containing Synthea CSV output files (e.g., patients.csv, encounters.csv). This path must be accessible from the R runtime container. |
| CDM Version | Target OMOP CDM version: 5.4 (default) or 5.3 |
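
The constraints on these fields can be checked before submitting the form. The sketch below restates them as a small validator. The `SyntheaLoadConfig` class and `validate` function are hypothetical names, and the use of POSIX path rules assumes the R runtime container is Linux-based.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

SUPPORTED_CDM_VERSIONS = ("5.4", "5.3")


@dataclass
class SyntheaLoadConfig:
    """Hypothetical container mirroring the four configuration fields."""
    target_source: str
    patient_count: int
    csv_folder: str
    cdm_version: str = "5.4"  # documented default


def validate(config: SyntheaLoadConfig) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks valid."""
    errors: list[str] = []
    if not config.target_source.strip():
        errors.append("target source is required")
    if not 1 <= config.patient_count <= 100_000:
        errors.append("patient count must be between 1 and 100,000")
    # Assumption: the path is interpreted with POSIX rules inside the container.
    if not PurePosixPath(config.csv_folder).is_absolute():
        errors.append("CSV folder must be an absolute path")
    if config.cdm_version not in SUPPORTED_CDM_VERSIONS:
        errors.append("CDM version must be 5.4 or 5.3")
    return errors


print(validate(SyntheaLoadConfig("my_cdm", 1000, "/data/synthea/output/csv")))
# []
```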

Running the Generator

  1. Verify the Synthea ETL service health indicator shows "available." The status also displays the service version and supported capabilities.
  2. Select a target source, set the patient count, and provide the CSV folder path.
  3. Click Generate to start the ETL process.

CSV Files Must Exist First

The Synthea Generator does not generate the Synthea CSV files itself. You must run Synthea separately to produce the CSV output, then point this tool at the output directory. The generator reads the CSVs and loads them into the CDM.
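
Because the tool only reads existing CSVs, it can help to confirm the output directory is populated before clicking Generate. This is a minimal sketch: it checks only the two file names mentioned above, and `missing_csv_files` is an illustrative helper, not part of the product. Note that it inspects the filesystem where the script runs, which may differ from what the R runtime container can see.

```python
import tempfile
from pathlib import Path

# Only the two file names named in this page; a real Synthea export has more.
EXPECTED_FILES = ("patients.csv", "encounters.csv")


def missing_csv_files(folder: str, expected: tuple[str, ...] = EXPECTED_FILES) -> list[str]:
    """Return the expected CSV file names that are not present in the folder."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration against a throwaway directory holding only patients.csv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "patients.csv").write_text("Id\n")
    print(missing_csv_files(tmp))
    # ['encounters.csv']
```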

Results

After generation completes, the tab shows:

  • Persons Generated — Number of person records created
  • Total Rows Inserted — Sum of all CDM records across tables
  • Elapsed Time — Duration of the ETL process
  • Per-Table Row Counts — A bar chart showing the number of rows inserted into each CDM table, sorted by count
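
The relationship between these figures can be illustrated with a few lines of code: the total is the sum of the per-table counts, and the chart ordering is a sort on those counts. The counts below are made-up sample values, and descending order is an assumption about how the chart is sorted.

```python
def summarize(counts: dict[str, int]) -> tuple[list[tuple[str, int]], int]:
    """Return per-table counts sorted largest first, plus the overall total."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    return ordered, total


# Illustrative numbers only, not real output from the tool.
sample = {"person": 10, "measurement": 50, "visit_occurrence": 30}
ordered, total = summarize(sample)
for table, rows in ordered:
    print(f"{table:20s} {rows}")
print("total rows:", total)
```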

Tab 3: FHIR Ingestion

The FHIR Ingestion tab provides an embedded version of the FHIR Ingestion tool for converting FHIR R4 resources into OMOP CDM records. It supports both JSON Bundle paste and NDJSON file upload, with resource preview, mapping coverage metrics, and error logging.

For complete documentation of the FHIR ingestion workflow, resource type mappings, and API reference, see the dedicated FHIR Ingestion page.
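
The two input formats differ only in framing: a JSON Bundle wraps resources in an `entry` array, while NDJSON carries one resource per line. The sketch below previews the resource types in either format before ingestion. The function names are illustrative, and this is not the tool's own parser.

```python
import json
from collections import Counter


def resources_from_bundle(text: str) -> list[dict]:
    """Extract the resources from a FHIR Bundle given as a JSON string."""
    bundle = json.loads(text)
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return [e["resource"] for e in bundle.get("entry", []) if "resource" in e]


def resources_from_ndjson(text: str) -> list[dict]:
    """Parse NDJSON: one JSON resource per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_type(resources: list[dict]) -> Counter:
    """Count resources by their resourceType field."""
    return Counter(r.get("resourceType", "Unknown") for r in resources)


bundle_text = json.dumps({
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "id": "o1"}},
    ],
})
ndjson_text = '{"resourceType": "Patient", "id": "p1"}\n{"resourceType": "Patient", "id": "p2"}\n'

print(dict(count_by_type(resources_from_bundle(bundle_text))))
print(dict(count_by_type(resources_from_ndjson(ndjson_text))))
```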

Common Workflow: Profile, Generate, Verify

A typical workflow using all three tabs:

  1. Profile your target source using the Source Profiler tab to establish a baseline. Note any existing tables and row counts.
  2. Generate synthetic data using the Synthea Generator tab (or ingest real data via FHIR) to populate the CDM.
  3. Profile again after loading to verify that expected tables are populated, null rates are acceptable, and row counts match expectations.

Use for Development and Testing

The ETL Tools are particularly useful during development. Generate a Synthea dataset for your test environment, then use the Source Profiler to verify the data loaded correctly before running Achilles characterization or cohort generation.
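
The verification step in the workflow above amounts to comparing two profiles. Here is a minimal sketch that assumes each profile has already been reduced to a mapping of table name to row count. That reduction, the helper names, and the sample numbers are all illustrative.

```python
def row_count_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Rows gained (or lost) per table between the baseline and the post-load scan."""
    tables = sorted(set(before) | set(after))
    return {t: after.get(t, 0) - before.get(t, 0) for t in tables}


def unpopulated(after: dict[str, int], expected_tables: list[str]) -> list[str]:
    """Expected tables that are still missing or empty after loading."""
    return [t for t in expected_tables if after.get(t, 0) == 0]


# Made-up row counts standing in for two Source Profiler scans.
baseline = {"person": 0, "visit_occurrence": 0}
post_load = {"person": 1000, "visit_occurrence": 42000, "measurement": 0}

print(row_count_deltas(baseline, post_load))
print(unpopulated(post_load, ["person", "visit_occurrence", "measurement"]))
```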

Service Dependencies

The ETL Tools page depends on several backend services:

| Service | Required By | Health Check |
| --- | --- | --- |
| WhiteRabbit | Source Profiler tab | Shown as status badge |
| Synthea ETL (R Runtime) | Synthea Generator tab | Shown as status badge |
| FHIR Ingestion Service | FHIR Ingestion tab | Shown as status badge |

If a service is unavailable, its corresponding tab will display a warning and operations will fail. Check the System Health Dashboard for service status details.

Data Type Color Coding

Both the Source Profiler and Synthea results use color-coded type badges to help you quickly identify column categories:

| Data Type | Color |
| --- | --- |
| varchar, text | Blue |
| integer, int, bigint | Teal |
| numeric, float, double | Gold |
| date, datetime, timestamp | Purple |
| boolean, bool | Orange |
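
The mapping above is a simple lookup from type name to color. The sketch below expresses it as a dictionary for anyone reproducing the scheme in their own reports. Stripping a length suffix such as `(255)` and falling back to gray for unlisted types are assumptions, since this page does not say how the UI treats either case.

```python
TYPE_COLORS = {
    "varchar": "blue", "text": "blue",
    "integer": "teal", "int": "teal", "bigint": "teal",
    "numeric": "gold", "float": "gold", "double": "gold",
    "date": "purple", "datetime": "purple", "timestamp": "purple",
    "boolean": "orange", "bool": "orange",
}


def badge_color(data_type: str, default: str = "gray") -> str:
    """Look up the badge color for a column type, ignoring case and any '(n)' suffix."""
    base = data_type.strip().lower().split("(")[0].strip()
    return TYPE_COLORS.get(base, default)  # assumption: unlisted types get a neutral color


print(badge_color("VARCHAR(255)"))  # blue
print(badge_color("timestamp"))     # purple
print(badge_color("uuid"))          # gray
```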

Error Handling

Each tab handles errors independently. When a scan, generation, or ingestion fails:

  • An error banner appears with a red background showing the failure reason
  • The operation can be retried without losing your configuration
  • Previous successful results remain visible until you clear them

Check Service Logs for Detailed Errors

The error messages shown in the UI are summaries. For detailed stack traces and debugging information, check the Docker service logs:

  • WhiteRabbit: docker compose logs -f php
  • Synthea ETL: docker compose logs -f r-runtime
  • FHIR Ingestion: docker compose logs -f php

Permissions

Access to the ETL Tools page requires the etl:manage permission. By default, this permission is granted to users with the data-engineer or super-admin role. Standard researcher accounts do not have access to ETL operations. Contact your system administrator to request access if needed.