
One Million Patient Embeddings: GPU-Accelerated Similarity Search Comes to Parthenon

· 20 min read
Creator, Parthenon
AI Development Assistant

Two days ago, we shipped the Patient Similarity Engine — a multi-modal system that scores patients across six clinical dimensions on OMOP CDM. The architecture was sound. The algorithms worked. But there was a problem hiding in plain sight: none of our patients had embeddings.

The embedding pipeline had been silently failing since day one. Three type mismatches between our PHP backend and Python AI service meant that every embedding request returned a validation error, was caught by a try/catch block, and logged as a warning that nobody read. The feature vectors were all there — conditions, drugs, measurements, procedures — but the 512-dimensional dense vectors that would make similarity search fast at scale? Zero. For every source. For every patient.
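The failure mode described above (every request failing validation, caught, and logged only as a warning) can be guarded against by letting a run fail loudly once too many requests error out. The following is a minimal sketch of that idea, not Parthenon's actual code: `run_embeddings`, `EmbeddingRunError`, `embed_one`, and the 5% threshold are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple


class EmbeddingRunError(RuntimeError):
    """Raised when too many embedding requests in a run fail."""


def run_embeddings(
    patient_ids: Iterable[int],
    embed_one: Callable[[int], None],  # hypothetical: embeds and stores one patient
    max_failure_rate: float = 0.05,    # assumed threshold, tune per deployment
) -> Tuple[List[int], List[Tuple[int, str]]]:
    succeeded: List[int] = []
    failed: List[Tuple[int, str]] = []

    for pid in patient_ids:
        try:
            embed_one(pid)
            succeeded.append(pid)
        except ValueError as exc:
            # Record the failure instead of only logging a warning.
            failed.append((pid, str(exc)))

    total = len(succeeded) + len(failed)
    if total and len(failed) / total > max_failure_rate:
        # Surface systemic failures (e.g. a type mismatch hitting every request).
        raise EmbeddingRunError(
            f"{len(failed)} of {total} embedding requests failed; "
            f"first error: {failed[0][1]}"
        )
    return succeeded, failed
```

With a guard like this, a schema mismatch that rejects every request stops the run on the first batch rather than leaving zero embeddings behind unnoticed.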

Tonight, we fixed all three bugs, refactored the embedding pipeline from CPU-only SapBERT to GPU-accelerated Ollama, upgraded from 512 to 768 dimensions, introduced batch deduplication that delivered a 123x throughput improvement, and generated embeddings for 1,007,007 patients across three CDM sources. This is the story of what broke, what we built, and what it unlocks.
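As a rough illustration of why batch deduplication can raise throughput, the sketch below embeds each distinct input text once and then fans the result back out to every position that used it. It is an assumption about the general technique, not the pipeline's real implementation: `embed_with_dedup` and the `embed_batch` callable are hypothetical, and `embed_batch` stands in for whatever call actually reaches the embedding model.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_with_dedup(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical model call
) -> List[Vector]:
    # Keep each distinct text once, preserving first-seen order.
    unique: List[str] = list(dict.fromkeys(texts))

    # One model call for the distinct texts only.
    vectors = embed_batch(unique)
    if len(vectors) != len(unique):
        raise ValueError(
            f"expected {len(unique)} vectors, got {len(vectors)}"
        )

    # Map results back so the output lines up with the original input order.
    lookup: Dict[str, Vector] = dict(zip(unique, vectors))
    return [lookup[text] for text in texts]
```

The gain depends entirely on how repetitive the inputs are: if many patients share identical feature text, the model sees far fewer items than there are patients.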

Patients Like Mine: Building a Multi-Modal Patient Similarity Engine on OMOP CDM

· 18 min read
Creator, Parthenon
AI Development Assistant

For twenty years, the question "which patients are most like this one?" has haunted clinical informatics. Molecular tumor boards want to know: of the 300 patients in our pancreatic cancer corpus, which ones had the same pathogenic variants, the same comorbidity profile, the same treatment history — and what happened to them? Population health researchers want to seed cohort definitions not from abstract inclusion criteria but from a concrete index patient. And every clinician who has ever stared at a complex case has wished for a button that says show me others like this.

Today, Parthenon ships that button. The Patient Similarity Engine is a multi-modal matching system that scores patients across six clinical dimensions — demographics, conditions, measurements, drugs, procedures, and genomic variants — with user-adjustable weights, dual algorithmic modes, bidirectional cohort integration, and tiered privacy controls. It works across any OMOP CDM source in the platform, from the 361-patient Pancreatic Cancer Corpus to the million-patient Acumenus CDM.
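One simple way to combine per-dimension scores with user-adjustable weights is a weighted average that skips dimensions with no data. The sketch below assumes each dimension already yields a score in [0, 1] (here Jaccard overlap is used for set-valued dimensions); the function names, the dimension keys, and the choice of Jaccard are illustrative assumptions rather than the engine's documented algorithm.

```python
from typing import Dict, Optional, Set

# The six dimensions named in the post; the key spellings are assumptions.
DIMENSIONS = (
    "demographics", "conditions", "measurements",
    "drugs", "procedures", "genomics",
)


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two concept-ID sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted mean over dimensions that have both a score and a positive weight."""
    total = 0.0
    weight_sum = 0.0
    for dim in DIMENSIONS:
        score = scores.get(dim)
        weight = weights.get(dim, 0.0)
        if score is None or weight <= 0:
            continue  # missing data or a zeroed weight drops the dimension
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

For example, a conditions score of 0.5 with weight 1 and a drugs score of 1.0 with weight 3 combine to (0.5 + 3.0) / 4 = 0.875. Renormalizing by the weights actually used keeps patients without, say, genomic data comparable to those with it.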

This post tells the story of why it was needed, what we studied before building it, how it works under the hood, and what we learned along the way.

Poseidon and Vulcan: The Gods of Continuous Data Ingestion

· 12 min read
Creator, Parthenon

Healthcare data does not arrive in neat packages. It streams — continuously, chaotically, from dozens of transactional systems that were never designed to talk to each other. EHR encounters appear as HL7 ADT messages. Lab results materialize through OBX segments hours after the draw. Radiology reports surface from PACS archives with inconsistent coding. Claims trickle in from clearinghouses days or weeks after the visit. Genomic panels arrive as VCF files from external laboratories with their own nomenclatures and timelines.

Transforming this unruly sea of clinical data into a coherent, research-ready OMOP Common Data Model is the central engineering challenge of any outcomes research platform. And until now, Parthenon handled it the same way most platforms do: as a series of one-time bulk loads. Upload a file. Map the concepts. Write the CDM. Move on.

That era is over.

Today we introduce two new engines to the Parthenon pantheon — Vulcan and Poseidon — purpose-built for the reality of continuous healthcare data integration.

Building a Clinically Intelligent Risk Scoring Engine on OMOP CDM

· 11 min read
Creator, Parthenon
AI Development Assistant

In Greek mythology, Tyche was the goddess of fortune, chance, and prosperity. Depicted with a cornucopia of abundance and the wheel of fate, she governed the unpredictable forces that determined whether a city would flourish or fall. The ancient Greeks understood that outcomes are shaped by forces beyond individual control — health, circumstance, and probability. In the Parthenon pantheon, Tyche presides over population risk scoring: the quantification of clinical probability, the stratification of patients by the likelihood of outcomes they cannot fully control, and the transformation of uncertainty into actionable intelligence.

We built a population risk scoring engine that runs 20 validated clinical risk calculators against any OMOP CDM dataset — then immediately realized the approach was wrong. This post covers what we built, why we tore it apart, and the v2 architecture that replaced "run everything on everyone" with cohort-scoped, recommendation-driven clinical risk analysis.

The Magical Ladies of Parthenon

· 11 min read
Creator, Parthenon
AI Development Assistant

In Greek mythology, the great temple atop the Acropolis housed not just Athena, but an entire pantheon of divine figures — each wielding a unique gift. Parthenon, our unified OHDSI outcomes research platform, follows the same philosophy. Behind the scenes, four mythological women power the intelligence layer that transforms raw clinical data into actionable research: Hecate, Phoebe, Ariadne, and Arachne.

Building the Ingestion Pipeline: File Staging, Project Management, and the Path to Aqueduct

· 5 min read
Creator, Parthenon
AI Development Assistant

A massive day on the ingestion front — 87 commits landed in Parthenon today, almost entirely focused on building out a brand-new end-to-end data ingestion pipeline. We now have a fully wired system for creating ingestion projects, uploading raw files, staging them into a schema-isolated PostgreSQL environment, and handing off to Aqueduct for ETL. This has been a long time coming.

Publication Workflows, Manuscript Generation, and Darkstar Gets a Name

· 5 min read
Creator, Parthenon
AI Development Assistant

A massive day on Parthenon: 193 commits landed across the platform. The headlining work is a near-complete publication/manuscript workflow that takes study analyses all the way to a formatted, auto-numbered document preview, plus a long-overdue rename of the R Analytics Runtime to Darkstar, the name it's been running under in Docker all along.

The Arrival of Ares to Parthenon

· 14 min read
Creator, Parthenon
AI Development Assistant

If you've worked in the OHDSI ecosystem, you know the pain: Atlas for cohort definitions, Achilles Results Viewer for characterization, a DQD dashboard for data quality, spreadsheets for feasibility assessments, and a prayer that everyone's looking at the same release of the same data. Ares changes that. Today we're announcing Ares v2 — Parthenon's network-level data observatory — a single unified module that replaces the fragmented constellation of OHDSI data characterization tools with 10 purpose-built analytical panels, 60+ API endpoints, and a clinical UI designed for researchers who need answers, not workarounds.

This is the biggest feature release in Parthenon's history.

Achilles Reliability Hardening: A Big Day for OHDSI Analytics

· 5 min read
Creator, Parthenon
AI Development Assistant

Today was one of those satisfying days where two major workstreams converged: we pushed the Ares data quality module from skeleton to a fully featured analytics suite with four distinct intelligence phases, and we permanently fixed a cluster of compounding bugs that had been making Achilles characterization runs fragile on large real-world datasets. Both efforts move Parthenon meaningfully closer to being a production-grade OHDSI research platform.

Full HADES Parity: Parthenon Now Supports All 12 OHDSI Database Dialects

· 6 min read
Creator, Parthenon
AI Development Assistant

One of OHDSI's greatest strengths is database agnosticism. The HADES ecosystem — via SqlRender and DatabaseConnector — lets researchers write analyses once and run them against SQL Server, PostgreSQL, Oracle, Snowflake, BigQuery, and seven other platforms without modification. Today, Parthenon achieved full parity with that capability: all 12 HADES-supported database dialects are now covered across both the PHP SQL translator and the R runtime.