Characterization

Cohort characterization produces a comprehensive summary of the clinical and demographic features of one or more cohorts. It answers the fundamental question: "Who are the patients in this cohort, and what are their baseline characteristics?" Characterization is typically the first analysis performed in any pharmacoepidemiological, comparative effectiveness, or safety study.

Characterization is one of seven analysis types available in Parthenon. It is accessible from the Analyses page, which displays an AnalysisStatsBar with counts across all types.

What Characterization Produces

For each cohort and each feature, a characterization analysis computes:

Prevalence (binary features): The percentage of cohort members with at least one occurrence of the feature during the specified time window. For example, "42.3% of the target cohort had a diagnosis of hypertension in the 365 days before index."
Mean and standard deviation (continuous features): Computed for numeric features such as HbA1c, BMI, systolic blood pressure, and age. Reported as mean ± SD, together with the count of patients who had at least one measurement.
Distribution statistics: Percentiles (p25, median, p75) for age, observation time, and other stratified continuous variables.
Standardized Mean Difference (SMD): When comparing two cohorts, the SMD quantifies how different each feature is between them. This is the standard metric for assessing covariate balance in observational studies.
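
The summary statistics above can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of Parthenon. The choice of sample standard deviation and of the "inclusive" quantile method is an assumption, since this page does not state which variants Parthenon uses.

```python
from statistics import mean, quantiles, stdev


def prevalence(n_with_feature: int, cohort_size: int) -> float:
    """Percentage of cohort members with at least one occurrence of a feature."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return 100.0 * n_with_feature / cohort_size


def summarize_continuous(values: list[float]) -> dict[str, float]:
    """Mean, SD, and quartiles for a continuous feature (one value per patient).

    Assumption: sample SD (n - 1 denominator) and inclusive quartiles.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute an SD")
    p25, median, p75 = quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),
        "p25": p25,
        "median": median,
        "p75": p75,
    }
```

For instance, `prevalence(423, 1000)` corresponds to the "42.3% of the target cohort" style of statement in the hypertension example.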

Feature Categories

Characterization draws features from all major OMOP CDM domains. Each category has a default time window that can be customized:

Category	CDM Domain	What is captured	Default Window
Demographics	Person	Age, gender, race, ethnicity, year of birth	At index date
Conditions	ConditionOccurrence	All recorded diagnoses	365 days before index
Drug Exposures	DrugExposure	All dispensed/prescribed medications	365 days before index
Procedures	ProcedureOccurrence	All recorded procedures	365 days before index
Measurements	Measurement	Lab tests, vital signs	365 days before index
Observations	Observation	Clinical observations (smoking, BMI, etc.)	365 days before index
Visit Types	VisitOccurrence	Inpatient, outpatient, ED visits	365 days before index
Charlson Comorbidity	Multiple domains	17-condition Charlson Comorbidity Index	365 days before index

Charlson Comorbidity Index

The Charlson index is a validated composite score derived from 17 condition categories (myocardial infarction, congestive heart failure, cerebrovascular disease, diabetes, renal disease, cancer, etc.). Parthenon computes both the individual component prevalences and the weighted composite score. The Charlson index is widely used in outcomes research as a summary measure of comorbidity burden.
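
As an illustration of how a weighted composite score is formed, the sketch below sums per-category weights for the categories present in a patient's history. The weights shown are the commonly cited Charlson/Deyo weights for the 17-category version. Whether Parthenon uses these exact weights, or applies hierarchy rules (for example, counting only the more severe of two related categories), is not stated on this page, so treat both the weights and the category keys as assumptions.

```python
# Commonly cited Charlson/Deyo weights for the 17-category index.
# Assumption: Parthenon's actual weights and category names may differ.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "dementia": 1,
    "chronic_pulmonary_disease": 1,
    "rheumatic_disease": 1,
    "peptic_ulcer_disease": 1,
    "mild_liver_disease": 1,
    "diabetes_without_complications": 1,
    "diabetes_with_complications": 2,
    "hemiplegia_or_paraplegia": 2,
    "renal_disease": 2,
    "any_malignancy": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "aids_hiv": 6,
}


def charlson_score(present_categories: set[str]) -> int:
    """Sum the weights of the categories recorded for one patient.

    This sketch applies no hierarchy rules between related categories.
    """
    unknown = present_categories - CHARLSON_WEIGHTS.keys()
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return sum(CHARLSON_WEIGHTS[c] for c in present_categories)
```

Under these weights, a patient with a prior myocardial infarction and renal disease would score 1 + 2 = 3.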

Creating a Characterization

Navigate to Analyses in the top navigation.
Ensure the Characterizations tab is active (it is the default).
Click New Analysis and select Characterization from the dropdown.
You are taken to the characterization design page.

Design Configuration

The characterization design page includes the following settings:

Cohorts:

Target cohorts: Add one or more cohorts whose members you want to characterize. Click + Add Cohort and select from your generated cohort definitions.
Comparator cohorts (optional): Add cohorts for comparative analysis. When comparator cohorts are specified, the results include SMD calculations between each target-comparator pair.

Feature Types: Select which feature categories to include. The default selection covers the most commonly used categories:

Demographics
Conditions
Drugs

You can expand the selection to include procedures, measurements, observations, visits, and Charlson scoring. Including all categories provides the most complete picture but increases computation time and result size.

Stratification:

Stratify by gender: Break results down by male/female/other
Stratify by age: Break results down by age bands (configurable)

Additional settings:

Top N: Maximum number of features to return per category (default: 100). Features are ranked by prevalence.
Minimum cell count: Suppress features with fewer than N patients for privacy protection (default: 5).
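
To make the interaction of these settings concrete, here is a minimal sketch that models a design as a Python dataclass and applies the Top N and minimum cell count rules to one category of results. The class, field names, and function are hypothetical and do not reflect Parthenon's actual schema or API. The defaults for Top N, minimum cell count, and feature types follow the values described above, while the stratification defaults are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterizationDesign:
    """Hypothetical in-memory model of the design settings."""
    target_cohort_ids: list[int]
    comparator_cohort_ids: list[int] = field(default_factory=list)
    feature_types: list[str] = field(
        default_factory=lambda: ["demographics", "conditions", "drugs"]
    )
    stratify_by_gender: bool = False  # assumed default
    stratify_by_age: bool = False     # assumed default
    top_n: int = 100
    min_cell_count: int = 5


def apply_result_limits(
    features: list[tuple[str, int]], design: CharacterizationDesign
) -> list[tuple[str, int]]:
    """Filter (feature_name, patient_count) pairs for one category and cohort.

    Features with fewer than min_cell_count patients are suppressed, then the
    remainder is ranked by count and truncated to top_n. Within one cohort,
    ranking by count is equivalent to ranking by prevalence.
    """
    kept = [f for f in features if f[1] >= design.min_cell_count]
    kept.sort(key=lambda f: f[1], reverse=True)
    return kept[: design.top_n]
```

Suppressing small cells before ranking means a rare feature never occupies a Top N slot only to be hidden afterwards.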

Time Window Configuration

Each feature category can have a custom time window, defined as days relative to the cohort start date:

Window notation	Meaning
`[-365, 0]`	From 365 days before index through the index date, inclusive
`[-365, -1]`	From 365 days before index through the day before index (index day excluded)
`[0, 30]`	At index through 30 days after
`[-9999, -1]`	Any time before index (full lookback)
`[-365, 365]`	One year before through one year after index
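
The window notation can be read as a pair of day offsets relative to the cohort start (index) date. The following sketch checks whether an event falls inside a window, assuming both endpoints are inclusive, which matches the descriptions in the table. The function name is hypothetical.

```python
from datetime import date


def in_window(event_date: date, index_date: date, window: tuple[int, int]) -> bool:
    """Return True if the event falls within [start, end] days of index.

    Assumption: both offsets are inclusive; negative offsets are before index.
    """
    start, end = window
    if start > end:
        raise ValueError("window start must not exceed window end")
    offset_days = (event_date - index_date).days
    return start <= offset_days <= end
```

With this reading, an event recorded on the index date itself is counted by `[-365, 0]` but not by `[-365, -1]`.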

Window selection impacts results

The time window determines what "baseline" means for your study. A 365-day window captures one year of history, which is standard for most pharmacoepidemiological studies. A shorter window (e.g., 90 days) captures more acute or recent conditions. A longer window (e.g., all prior history) captures lifetime comorbidities but may include resolved conditions.

Executing a Characterization

After configuring the design:

Click Save to persist the configuration.
Select the Data Source to execute against.
Click Execute.

The characterization job is dispatched to the Horizon queue. Execution time depends on the number of cohorts, feature categories, and CDM database size. A typical characterization on a 1M-patient database with 3 feature categories completes in 1 to 5 minutes.

Viewing Results

After execution completes, the results page displays an interactive table:

Column	Description
Feature Name	Human-readable feature description (e.g., "Hypertension", "metformin", "Age 65-69")
Concept ID	OMOP concept ID for the feature
Count	Number of patients with the feature
Percentage	Prevalence as a percentage of the cohort
Mean / SD	Mean and standard deviation, reported for continuous features (e.g., measurements)
SMD	Standardized Mean Difference (when comparing two cohorts)

Interactive Controls

Search: Filter features by name using the search box.
Sort: Click column headers to sort by prevalence, count, SMD, or concept ID.
Category filter: Toggle which feature categories are displayed.
SMD threshold highlight: Features with an absolute SMD greater than 0.1 are highlighted, indicating meaningful imbalance between cohorts.

Understanding SMD

The Standardized Mean Difference (SMD) is calculated as:

SMD = (mean_target - mean_comparator) / sqrt((sd_target^2 + sd_comparator^2) / 2)

For binary features (prevalence), the same formula applies with each mean replaced by the prevalence proportion p and each standard deviation replaced by sqrt(p * (1 - p)).
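
A minimal sketch of the calculation for both feature types follows. The function names are hypothetical, and the handling of a zero pooled standard deviation (returning 0 when the two groups are identical and a signed infinity otherwise) is a convention chosen for this example rather than documented Parthenon behavior.

```python
from math import copysign, inf, sqrt


def _standardize(diff: float, var_target: float, var_comparator: float) -> float:
    """Divide a difference by the pooled SD, sqrt((var_t + var_c) / 2)."""
    pooled_sd = sqrt((var_target + var_comparator) / 2)
    if pooled_sd == 0:
        # Assumed convention: no spread in either group.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / pooled_sd


def smd_continuous(mean_t: float, sd_t: float, mean_c: float, sd_c: float) -> float:
    """SMD for a continuous feature from group means and SDs."""
    return _standardize(mean_t - mean_c, sd_t ** 2, sd_c ** 2)


def smd_binary(p_t: float, p_c: float) -> float:
    """SMD for a binary feature from prevalence proportions in [0, 1]."""
    for p in (p_t, p_c):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must be between 0 and 1")
    return _standardize(p_t - p_c, p_t * (1 - p_t), p_c * (1 - p_c))
```

Note that `smd_binary` expects proportions (0.423), not percentages (42.3).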

Absolute SMD	Interpretation
< 0.1	Well-balanced; unlikely to confound
0.1 to 0.2	Moderate imbalance; may need adjustment
> 0.2	Substantial imbalance; must be addressed in any causal analysis

SMD is direction-independent

SMDs of -0.15 and +0.15 indicate the same magnitude of imbalance. The sign shows which cohort has the higher prevalence or mean, but the absolute value is what matters for balance assessment.
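
The interpretation bands can be expressed as a small helper that works on the absolute value. The function name and returned labels are hypothetical. Treating exactly 0.1 and exactly 0.2 as "moderate" follows the table's ranges and is otherwise an assumption about boundary handling.

```python
def interpret_smd(smd: float) -> str:
    """Map an SMD to the interpretation bands, ignoring its sign."""
    magnitude = abs(smd)
    if magnitude < 0.1:
        return "well-balanced"
    if magnitude <= 0.2:
        return "moderate imbalance"
    return "substantial imbalance"
```

Because the sign is discarded first, `interpret_smd(-0.15)` and `interpret_smd(0.15)` return the same band.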

Exporting Results

Click Export CSV to download the full characterization table. The exported file contains one row per feature per cohort and includes:

Feature name and concept ID
Cohort name and ID
Count, percentage
Mean, SD (for continuous features)
SMD (for comparative analyses)

The CSV export is suitable for inclusion in study reports, regulatory submissions, and "Table 1" generation in publications.

Table 1 in publications

Characterization results supply the content of the standard "Table 1" (baseline characteristics) found in most observational study publications. Export the CSV, filter to the features you want to report, and format according to journal requirements.
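
The filter-and-format step could be sketched as follows. The column names (`feature_name`, `cohort_name`, `count`, `percentage`, `mean`, `sd`) are assumptions based on the export contents listed above, so check them against the header row of your actual export before use. The function name is hypothetical.

```python
import csv
import io


def table1_rows(
    csv_text: str, cohort_name: str, wanted_features: set[str]
) -> list[tuple[str, str]]:
    """Build (feature, formatted value) pairs for one cohort.

    Assumed columns: feature_name, cohort_name, count, percentage, mean, sd.
    Continuous features (non-empty mean) are shown as "mean ± SD";
    binary features are shown as "n (pct%)".
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        if rec["cohort_name"] != cohort_name:
            continue
        if rec["feature_name"] not in wanted_features:
            continue
        if rec["mean"]:
            value = f"{float(rec['mean']):.1f} ± {float(rec['sd']):.1f}"
        else:
            value = f"{int(rec['count'])} ({float(rec['percentage']):.1f}%)"
        rows.append((rec["feature_name"], value))
    return rows
```

To use it, read the exported file into a string (for example with `pathlib.Path(...).read_text()`) and pass the cohort name and the set of features you intend to report.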

Best Practices

Always characterize before estimating: Understanding baseline differences between cohorts is a prerequisite for any causal analysis. SMD values from characterization inform your propensity score model and confounder adjustment strategy.
Use standard feature sets: Start with the predefined feature sets (demographics, conditions, drugs) rather than custom concept sets. This ensures comprehensive coverage and comparability across studies.
Check for empty features: If many features show 0% prevalence, the concept sets may not be resolving correctly against your vocabulary, or the time window may be too narrow.
Compare across data sources: Run the same characterization on multiple data sources to understand how population composition varies. This is essential for interpreting heterogeneity in network study results.
Document your choices: Record which feature categories, time windows, and stratification settings you used. These details are required for study protocol documentation and reproducibility.
