Source Profiler

The Source Profiler is a WhiteRabbit-based database scanning tool that profiles your source data before and after ETL. It inspects every table and column in a data source, measuring row counts, null fractions, distinct value counts, data types, and sample values. The output is a comprehensive quality assessment that helps you identify data completeness issues, empty tables, and low-cardinality columns that may require attention during the ETL mapping process.

Prerequisites

The Source Profiler requires the WhiteRabbit service to be running. A health indicator at the top of the page shows whether the service is available, along with its version number. If the service is offline, scans will fail.

WhiteRabbit Service Required

The scan operation connects to your data source through the WhiteRabbit backend service. Ensure the service is healthy before initiating large scans. Check the health badge at the top of the profiler page.

Running a Scan

Step 1: Select a Data Source

Choose a configured data source from the Data Source dropdown. All sources registered in Parthenon are available for scanning.

Step 2: Apply a Table Filter (Optional)

Enter a comma-separated list of table names in the Table Filter field to restrict the scan to specific tables. For example: person, visit_occurrence, condition_occurrence. Leave this field empty to scan all tables in the source.

Step 3: Execute the Scan

Click Scan Database to begin profiling. A loading indicator shows scan progress. Large databases with hundreds of tables and millions of rows may take several minutes to complete.

Scan Results

After a successful scan, the profiler displays results in several sections.

Summary Cards

Four metric cards appear at the top of the results:

  • Tables Scanned — Number of tables profiled
  • Total Columns — Sum of columns across all scanned tables
  • Total Rows — Sum of row counts across all tables
  • Scan Time — Wall-clock duration of the scan in seconds
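The first three cards can be derived directly from the per-table profiles. A sketch under assumed field names (row_count and columns are illustrative; Scan Time is measured wall-clock during the scan, so it is not derived here):

```python
def summarize(profiles: list[dict]) -> dict:
    """Aggregate per-table profiles into the summary-card metrics.

    Each profile is assumed to carry a 'row_count' and a 'columns' list.
    """
    return {
        "tables_scanned": len(profiles),
        "total_columns": sum(len(p["columns"]) for p in profiles),
        "total_rows": sum(p["row_count"] for p in profiles),
    }
```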

Data Quality Scorecard

The scorecard provides a letter grade (A through F) representing overall data completeness, calculated from the average null fraction across all columns and tables. Individual quality checks include:

  • High-null columns — more than 50% null (Warning)
  • Nearly-empty columns — more than 99% null (Error)
  • Low cardinality — fewer than 5 distinct values, in tables with more than 100 rows (Info)
  • Single-value columns — exactly 1 distinct value (Warning)
  • Empty tables — 0 rows (Error)

Each check shows a count-of-total ratio (e.g., "3/42") indicating how many columns or tables trigger the condition.
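The column-level checks above are simple threshold rules over the scan statistics. A hedged sketch, assuming each column record carries a null_fraction (0 to 1) and a distinct_count (the field names are assumptions, not the tool's actual schema):

```python
def check_columns(columns: list[dict], table_rows: int) -> list[tuple]:
    """Flag columns per the scorecard rules.

    Returns (column_name, check, severity) tuples. A column can trigger
    at most one null-based check and at most one cardinality-based check.
    """
    flags = []
    for col in columns:
        if col["null_fraction"] > 0.99:
            flags.append((col["name"], "nearly-empty", "error"))
        elif col["null_fraction"] > 0.50:
            flags.append((col["name"], "high-null", "warning"))
        if col["distinct_count"] == 1:
            flags.append((col["name"], "single-value", "warning"))
        elif col["distinct_count"] < 5 and table_rows > 100:
            flags.append((col["name"], "low-cardinality", "info"))
    return flags
```

The count-of-total ratio shown in the UI is then just the number of flagged columns over the number scanned.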

Completeness Heatmap

The heatmap displays a matrix of tables (rows) versus columns (columns), color-coded by null fraction:

  • Green (< 1% null) — Excellent completeness
  • Light green (1-10% null) — Good completeness
  • Gold (10-25% null) — Moderate gaps
  • Dark gold (25-50% null) — Significant gaps
  • Red (> 50% null) — Poor completeness
  • Dark — Column not present in that table

Hover over any cell to see the exact table, column, and null percentage. Up to 30 columns are displayed; export the full report for complete coverage.

Table Size Distribution

A horizontal bar chart shows the top 20 tables by row count. Each bar includes the table's quality grade badge (A-F) based on its average null fraction.

Table Detail Accordion

Below the charts, each scanned table appears as a collapsible accordion row. The header shows the table name, row count, column count, and badges for high-null and low-cardinality column counts. Expand a table to see a detailed column-level breakdown:

  • Column Name — The column name, with a warning icon if the null fraction exceeds 50%
  • Type — The database data type, color-coded by category (text, integer, numeric, date, boolean)
  • Null % — A progress bar and percentage showing the fraction of null values
  • Distinct — Count of distinct non-null values
  • Sample Values — Up to 5 most frequent values with their occurrence counts

View Modes

Toggle between two view modes using the icons in the results toolbar:

  • List View — The default accordion-based table list with expandable column details
  • Heatmap View — The completeness heatmap visualization for a cross-table overview

Scan History

The profiler persists scan results in browser local storage (up to 20 entries). The Scan History panel lists previous scans with their source name, timestamp, table count, row count, scan time, and overall grade. Click any history entry to restore its results without re-scanning. Delete individual entries or clear all history using the provided controls.
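The 20-entry cap amounts to a newest-first list that is pruned on insert. A Python sketch of that rule (the real implementation lives in browser local storage, so this is illustrative only):

```python
MAX_HISTORY = 20  # the profiler's local-storage cap

def push_history(history: list[dict], entry: dict) -> list[dict]:
    """Prepend a new scan entry and drop the oldest entries beyond the cap."""
    return ([entry] + history)[:MAX_HISTORY]
```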

Scan Before and After ETL

Run the profiler on your source data before beginning ETL mapping to establish a baseline. After loading data into the CDM, run it again against the CDM target to verify that null rates and row counts match expectations.
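A baseline-versus-post-ETL comparison can be automated from two exported reports; a sketch, assuming each export has been reduced to a table-name-to-row-count mapping (the real JSON export contains much more detail):

```python
def compare_scans(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Report tables whose row counts changed, or that are missing
    entirely, between a baseline scan and a post-ETL scan."""
    issues = []
    for table, rows_before in before.items():
        rows_after = after.get(table)
        if rows_after is None:
            issues.append(f"{table}: missing after ETL")
        elif rows_after != rows_before:
            issues.append(f"{table}: {rows_before} -> {rows_after} rows")
    return issues
```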

Exporting Results

Click Export Report to download the complete scan results as a JSON file. The export includes all table profiles, column-level statistics, null fractions, distinct counts, and sample values. This file can be shared with data engineers for offline review or archived for audit purposes.

Sorting and Filtering

Within the results view, tables can be sorted by:

  • Name — Alphabetical table name
  • Rows — Row count (ascending or descending)
  • Columns — Column count
  • Null Score — Average null fraction
  • Grade — Quality letter grade

A search field allows filtering tables by name to quickly locate specific tables in large scans.
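Sorting and name filtering are plain list operations over the table profiles; a sketch with assumed field names:

```python
def sort_and_filter(profiles: list[dict], sort_key: str = "rows",
                    descending: bool = True, search: str = "") -> list[dict]:
    """Filter profiles by a case-insensitive name substring, then sort.

    Profiles are assumed to carry 'name', 'row_count', 'column_count',
    and 'null_score' fields.
    """
    keys = {
        "name": lambda p: p["name"],
        "rows": lambda p: p["row_count"],
        "columns": lambda p: p["column_count"],
        "null_score": lambda p: p["null_score"],
    }
    matched = [p for p in profiles if search.lower() in p["name"].lower()]
    return sorted(matched, key=keys[sort_key], reverse=descending)
```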

API Reference

  • POST /api/v1/etl/white-rabbit/scan — Initiate a database scan for a given source
  • GET /api/v1/etl/white-rabbit/health — Check WhiteRabbit service availability

Scan Request

{
  "source_id": 1,
  "tables": ["person", "visit_occurrence"]
}

The tables parameter is optional. When omitted, all tables in the source are scanned.

Quality Grade Scale

The Source Profiler assigns letter grades based on the average null fraction across a table's columns:

  • A (0-5% null) — Excellent data completeness
  • B (5-15% null) — Good completeness with minor gaps
  • C (15-30% null) — Moderate completeness; review recommended
  • D (30-50% null) — Significant gaps; likely needs ETL attention
  • F (> 50% null) — Poor completeness; major data quality concerns

The overall grade is the average of all table-level grades. Individual table grades appear as colored badges next to each table name in the accordion and bar chart views.
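The grade boundaries from the scale above, plus one possible way to average letter grades (the boundary handling and rounding scheme used by the profiler are assumptions):

```python
GRADE_BOUNDS = [(0.05, "A"), (0.15, "B"), (0.30, "C"), (0.50, "D")]

def table_grade(avg_null_fraction: float) -> str:
    """Letter grade from a table's average null fraction (0..1)."""
    for bound, grade in GRADE_BOUNDS:
        if avg_null_fraction <= bound:
            return grade
    return "F"

def overall_grade(grades: list[str]) -> str:
    """Average letter grades numerically (A=4 .. F=0) and round back."""
    scale = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
    inverse = {v: k for k, v in scale.items()}
    avg = round(sum(scale[g] for g in grades) / len(grades))
    return inverse[avg]
```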

Performance Considerations

Scan duration depends on the number of tables and their row counts. General guidelines:

  • Small sources (< 20 tables, < 1M total rows): Scans complete in under 30 seconds
  • Medium sources (20-100 tables, 1-50M rows): Expect 1-5 minutes
  • Large sources (100+ tables, 50M+ rows): May take 5-15 minutes or more

Use the Table Filter to restrict scans to specific tables when you need quick feedback on a subset of your data.

Scan Specific CDM Tables

For routine quality checks after ETL, scan only the core CDM tables: person, visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, measurement, observation. This keeps scan times short while covering the most critical data.