Population Statistics

The Population Statistics section in Data Explorer provides high-level demographic and temporal summaries of the patient population in a data source. These summaries are essential for quickly assessing database coverage, demographic composition, and suitability for a specific research question before investing time in full cohort definition and generation.

Overview Panel

The Overview panel consolidates key database metrics into a single summary view:

| Statistic | Description | Source |
| --- | --- | --- |
| Total Patients | Unique person_id count in the CDM | person table |
| Date Range | Earliest to latest observation period start/end dates | observation_period table |
| Observation Years | Total person-years of follow-up across all patients | Sum of observation period durations |
| Median Follow-up | Median observation period duration in days | Achilles analysis 105 |
| Vocabulary Version | Name and date of the loaded OMOP vocabulary | vocabulary table metadata |
| CDM Version | Detected OMOP CDM schema version | cdm_source table |

Metric interpretation

Total person-years is a better measure of database "size" than patient count alone. A database with 100,000 patients who each average 10 years of follow-up (1M person-years) is more useful for longitudinal studies than one with 500,000 patients who each average 6 months of follow-up (250K person-years).
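The comparison above is simple arithmetic, sketched below in Python. The function names and the 365.25-day year are assumptions made for illustration; the exact day-to-year conversion that Data Explorer uses is not documented here.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average calendar year length


def person_years_from_periods(periods):
    """Sum observation period durations (start to end) and convert days to years."""
    total_days = sum((end - start).days for start, end in periods)
    return total_days / DAYS_PER_YEAR


def person_years_from_mean(patients, mean_followup_years):
    """Shortcut when only the patient count and mean follow-up are known."""
    return patients * mean_followup_years


# The two hypothetical databases from the text above.
long_source = person_years_from_mean(100_000, 10)
short_source = person_years_from_mean(500_000, 0.5)
print(long_source, short_source)  # 1000000 250000.0
```

The smaller database carries four times the follow-up of the larger one, which is why patient count alone can mislead.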

Age and Gender Distribution

A population pyramid chart displays the age-gender distribution in 5-year age bands (0-4, 5-9, ..., 85+). This visualization provides immediate insight into the demographic profile:

  • Medicare databases -- pyramid heavily weighted toward 65+ age bands
  • Medicaid databases -- bimodal distribution (children and young adults)
  • Commercial claims -- concentrated in working-age adults (18-64)
  • EHR databases -- distribution reflects the patient population of the health system
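As a rough illustration of how ages map onto the 5-year bands behind the pyramid, the sketch below bins (age, gender) pairs and counts each cell. The helper names `age_band` and `pyramid_counts` are hypothetical and do not reflect how Data Explorer computes the chart internally.

```python
from collections import Counter


def age_band(age, width=5, top=85):
    """Return the band label for an age: '0-4', '5-9', ..., '85+'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def pyramid_counts(people):
    """Count patients per (age band, gender) cell.

    `people` is an iterable of (age_in_years, gender_label) pairs.
    """
    return Counter((age_band(age), gender) for age, gender in people)


counts = pyramid_counts([(3, "F"), (4, "F"), (67, "M"), (91, "F")])
for (band, gender), n in sorted(counts.items()):
    print(band, gender, n)
```

Dividing each cell by the total patient count gives the percentage view described under Controls.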

Controls

  • Toggle counts/percentages -- switch between absolute counts and relative percentages
  • Hover tooltips -- exact count for each age-gender cell
  • Export -- download the chart as PNG or the underlying data as CSV

Observation Period Timeline

A time-series line chart shows the number of patients with active observation in each calendar month or quarter. This chart reveals critical information about your database:

  • Enrollment patterns -- gradual growth (expanding health system), seasonal drops (academic medical centers), or sharp changes (insurance plan changes)
  • Coverage gaps -- months with anomalously low patient counts may indicate data capture failures or ETL issues
  • Effective study window -- the years with stable, consistent coverage suitable for epidemiological analysis
  • Data currency -- how recently the data extend, and whether the latest months show complete capture

Study period selection

When designing a study, use the timeline chart to identify the period with the most stable coverage. Avoid using the first and last 6 months of data -- enrollment ramp-up and data lag often create incomplete capture at the boundaries.
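The sketch below shows one way to derive a monthly active-patient series from observation periods and then drop the boundary months, as recommended above. It is an illustration only: the function names are hypothetical, a patient is counted as active in any month the observation period touches, and trimming by position assumes the series has no empty months.

```python
from datetime import date


def month_index(d):
    """Map a date to a running month number so that months can be iterated."""
    return d.year * 12 + d.month - 1


def active_patients_by_month(periods):
    """Count distinct patients under observation in each calendar month.

    `periods` is an iterable of (person_id, start_date, end_date) tuples.
    Returns a dict keyed by (year, month), in chronological order.
    """
    active = {}  # month index -> set of person ids
    for person_id, start, end in periods:
        for m in range(month_index(start), month_index(end) + 1):
            active.setdefault(m, set()).add(person_id)
    return {(m // 12, m % 12 + 1): len(ids) for m, ids in sorted(active.items())}


def trim_boundaries(monthly, months=6):
    """Drop the first and last `months` entries of the monthly series."""
    keys = list(monthly)
    kept = keys[months:len(keys) - months]
    return {k: monthly[k] for k in kept}


periods = [
    (1, date(2020, 1, 15), date(2020, 3, 10)),
    (2, date(2020, 2, 1), date(2020, 2, 28)),
]
monthly = active_patients_by_month(periods)
print(monthly)
print(trim_boundaries(monthly, months=1))
```

With real data, the trimmed series is the candidate study window; a sharp dip inside it would still warrant a closer look for coverage gaps.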

Domain Coverage

The domain coverage table shows record counts and patient coverage across all major OMOP CDM clinical domains:

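| Domain | Record Count | Patient Count | % of Population | Avg Records per Patient |
| --- | --- | --- | --- | --- |
| condition_occurrence | -- | -- | -- | -- |
| drug_exposure | -- | -- | -- | -- |
| measurement | -- | -- | -- | -- |
| procedure_occurrence | -- | -- | -- | -- |
| visit_occurrence | -- | -- | -- | -- |
| observation | -- | -- | -- | -- |
| death | -- | -- | -- | -- |
| device_exposure | -- | -- | -- | -- |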

Interpreting Domain Coverage

  • Low patient coverage in a domain relative to the overall population size may indicate incomplete data capture. For example, if only 10% of patients have measurement records in an EHR database, lab data may not be routinely captured.
  • Zero records in a domain means the ETL did not populate that table -- it does not necessarily mean the source system lacks that data.
  • High records-per-patient in measurement (often 50-200+) is normal for EHR databases with lab data; claims databases typically have much lower measurement density.
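To make the table columns concrete, here is a minimal sketch that computes them against a toy in-memory SQLite database using only `person_id` columns. It is not the query Data Explorer runs. In particular, it assumes "Avg Records per Patient" divides by patients who have at least one record in the domain, which the documentation above does not specify.

```python
import sqlite3

# Subset of domain tables for the example; names come from a fixed tuple,
# never from user input, so interpolating them into SQL is safe here.
DOMAIN_TABLES = ("condition_occurrence", "measurement")


def domain_coverage(conn, tables=DOMAIN_TABLES):
    """Return one coverage row per domain table."""
    total = conn.execute("SELECT COUNT(DISTINCT person_id) FROM person").fetchone()[0]
    rows = []
    for table in tables:
        records, patients = conn.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT person_id) FROM {table}"
        ).fetchone()
        rows.append({
            "domain": table,
            "record_count": records,
            "patient_count": patients,
            "pct_of_population": 100.0 * patients / total if total else 0.0,
            # Assumption: average over patients with at least one record.
            "avg_records_per_patient": records / patients if patients else 0.0,
        })
    return rows


# Toy data: 4 patients, 3 condition records for 2 patients, no measurements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER)")
conn.execute("CREATE TABLE condition_occurrence (person_id INTEGER)")
conn.execute("CREATE TABLE measurement (person_id INTEGER)")
conn.executemany("INSERT INTO person VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO condition_occurrence VALUES (?)", [(1,), (1,), (2,)])

coverage = domain_coverage(conn)
for row in coverage:
    print(row)
```

In the toy data, the empty `measurement` table yields zero coverage, which matches the "zero records" case described above: the table exists but was never populated.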

Multi-Source Comparison

When multiple data sources are configured, the Compare Sources view provides side-by-side population statistics for cross-database assessment:

Comparison Table

A multi-column table aligning key metrics across selected sources:

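| Metric | Source A | Source B | Source C |
| --- | --- | --- | --- |
| Total patients | -- | -- | -- |
| Date range | -- | -- | -- |
| Person-years | -- | -- | -- |
| % Female | -- | -- | -- |
| Median age | -- | -- | -- |
| Condition records | -- | -- | -- |
| Drug records | -- | -- | -- |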

Use Cases for Comparison

  • Source selection -- identify which database has the longest follow-up for your target population
  • Multi-site studies -- compare demographic distributions across sites to assess representativeness
  • Data completeness -- compare domain coverage to identify which sources have the richest data for specific domains
  • Feasibility assessment -- quickly determine whether sufficient patients exist across sources for a multi-database study

Overlay Charts

Toggle the Overlay view to superimpose age-gender pyramids and timeline charts from multiple sources on a single plot, with each source shown in a different color.

Research feasibility

Use Population Statistics as your first stop when evaluating feasibility for a new study. Before building cohort definitions, verify that:

  1. The database covers your intended study period
  2. Sufficient patients exist in the age/gender strata of interest
  3. The relevant clinical domains (conditions, drugs, procedures, labs) have adequate coverage
  4. Observation period durations are long enough for your required follow-up time

This 5-minute feasibility check can save hours of cohort development work on an unsuitable database.
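The four checks can also be written down as a small script once the summary numbers have been read off the Population Statistics views. The sketch below is purely illustrative: `SourceSummary`, its fields, and every threshold are hypothetical and are not part of any Data Explorer API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourceSummary:
    """Hypothetical container for values read off the Population Statistics views."""
    first_date: date
    last_date: date
    patients_in_strata: int
    domain_pct: dict  # domain name -> % of population with records
    median_followup_days: int


def feasibility_issues(summary, study_start, study_end, min_patients,
                       required_domains, min_domain_pct, min_followup_days):
    """Return a list of failed checks; an empty list means no obvious blocker."""
    issues = []
    if summary.first_date > study_start or summary.last_date < study_end:
        issues.append("study period not covered")
    if summary.patients_in_strata < min_patients:
        issues.append("too few patients in strata of interest")
    for domain in required_domains:
        if summary.domain_pct.get(domain, 0.0) < min_domain_pct:
            issues.append(f"low coverage: {domain}")
    if summary.median_followup_days < min_followup_days:
        issues.append("follow-up too short")
    return issues


summary = SourceSummary(
    first_date=date(2012, 1, 1),
    last_date=date(2022, 12, 31),
    patients_in_strata=12_000,
    domain_pct={"condition_occurrence": 92.0, "measurement": 8.0},
    median_followup_days=900,
)
issues = feasibility_issues(
    summary,
    study_start=date(2015, 1, 1),
    study_end=date(2020, 12, 31),
    min_patients=5_000,
    required_domains=["condition_occurrence", "measurement"],
    min_domain_pct=50.0,
    min_followup_days=365,
)
print(issues)  # ['low coverage: measurement']
```

The thresholds depend entirely on the study design; the point is only that each item in the checklist maps to one comparison.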

Population Analytics Tab

The Population Analytics tab extends basic statistics with advanced demographic analysis:

  • Age distribution histogram -- continuous distribution with kernel density estimate
  • Enrollment duration distribution -- how long patients typically remain in the database
  • Year-over-year growth -- patient acquisition and attrition rates
  • Geographic distribution -- if location data is available (state, region)
  • Payer mix -- distribution of insurance types (if captured in the CDM)

These analytics draw from both Achilles pre-computed results and real-time queries against the CDM tables, providing a comprehensive demographic profile of each data source.