
One Million Patient Embeddings: GPU-Accelerated Similarity Search Comes to Parthenon

· 20 min read
Creator, Parthenon
AI Development Assistant

Two days ago, we shipped the Patient Similarity Engine — a multi-modal system that scores patients across six clinical dimensions on OMOP CDM. The architecture was sound. The algorithms worked. But there was a problem hiding in plain sight: none of our patients had embeddings.

The embedding pipeline had been silently failing since day one. Three type mismatches between our PHP backend and Python AI service meant that every embedding request returned a validation error, was caught by a try/catch block, and logged as a warning that nobody read. The feature vectors were all there — conditions, drugs, measurements, procedures — but the 512-dimensional dense vectors that would make similarity search fast at scale? Zero. For every source. For every patient.

Tonight, we fixed all three bugs, refactored the embedding pipeline from CPU-only SapBERT to GPU-accelerated Ollama, upgraded from 512 to 768 dimensions, introduced batch deduplication that delivered a 123x throughput improvement, and generated embeddings for 1,007,007 patients across three CDM sources. This is the story of what broke, what we built, and what it unlocks.


The Silent Failure

The Patient Similarity Engine has two search modes. Interpretable mode computes per-dimension Jaccard and z-score similarity in real time — it's explainable but requires loading candidate patients into memory and scoring each one. Embedding mode uses pgvector's IVFFlat index for approximate nearest neighbor (ANN) search — sub-second lookups across a million patients, with interpretable scoring applied only to the top candidates.

Embedding mode requires pre-computed dense vectors stored in a vector(N) column on patient_feature_vectors. These vectors are generated by a Laravel queue job (ComputePatientFeatureVectors) that:

  1. Extracts clinical features from CDM tables (conditions, drugs, measurements, procedures, genomics)
  2. Stores them as JSONB in the feature vector table
  3. Sends batches to the Python AI service for encoding
  4. Writes the resulting vectors back to pgvector

Steps 1 and 2 worked perfectly. Step 3 failed on every single call. Step 4 never executed.

Bug #1: Integer Concepts vs. String Validation

The Pydantic model for the /patient-similarity/embed endpoint declared concept lists as list[str]:

class PatientFeatures(BaseModel):
    condition_concepts: list[str] = Field(default_factory=list)
    drug_concepts: list[str] = Field(default_factory=list)
    procedure_concepts: list[str] = Field(default_factory=list)

But OMOP concept IDs are integers. PHP's PatientFeatureVector::toArray() serializes the JSONB condition_concepts column as [4120002, 4045900, 4031047] — a list of integers. FastAPI's Pydantic validation rejected every request with:

{"detail": [{"type": "string_type", "loc": ["body", "condition_concepts", 0],
"msg": "Input should be a valid string", "input": 4120002}]}

The EmbeddingClient caught the 422, logged a warning, and returned an empty array. The job continued to the next batch.

Bug #2: Dict Lab Vector vs. List Validation

The lab_vector field was declared as list[float], but PHP sends a JSONB dictionary mapping concept IDs to z-scores:

{"3025315": -0.1184, "3036277": -0.0578, "3036832": 1.5564}

Same failure pattern — Pydantic rejected the dict, the client swallowed the error.

Bug #3: Batch Response Key Mismatch

Even if the first two bugs hadn't existed, the batch endpoint wouldn't have worked. The Python API returns:

{"embeddings": [{"person_id": 142763, "embedding": [...], "dimension": 768}, ...]}

But EmbeddingClient::embedBatch() iterated the response as a flat array:

foreach ($embeddings as $pid => $embedding) {
    // $pid = 0, 1, 2... (array indices, not person IDs)
    // $embedding = {"person_id": 142763, ...} (object, not float array)
}

The UPDATE would have tried to write a JSON object as a pgvector value with index 0 as the person_id.

The Compound Effect

Three bugs, three layers, one outcome: zero embeddings for any patient in any source. The job always "succeeded" — it just never wrote any vectors. The interpretable search mode worked fine as a fallback, masking the problem entirely. We only discovered it because we asked a simple question: why does the embedding column show NULL for every row?

Lesson learned: Silent failure in pipeline stages is worse than a crash. The EmbeddingClient should have thrown on non-200 responses, or at minimum, the job should have asserted that count($embeddings) > 0 per batch.
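A minimal guard of the kind this lesson calls for (function name hypothetical) — the job crashes when a batch comes back incomplete instead of logging and moving on:

```python
def check_batch(embeddings: dict, requested_ids: list) -> dict:
    """Fail loudly: every requested person_id must come back with a vector.

    `embeddings` is the person_id => vector map the client returns;
    a silently empty result becomes a crash instead of a warning.
    """
    missing = set(requested_ids) - set(embeddings)
    if missing:
        raise RuntimeError(
            f"{len(missing)} of {len(requested_ids)} patients missing embeddings, "
            f"e.g. {sorted(missing)[:5]}"
        )
    return embeddings
```

Dropped into the job between steps 3 and 4, this check would have surfaced all three bugs on the first run.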


The Fixes

Type Coercion at the API Boundary

The Pydantic model now accepts what PHP actually sends:

class PatientFeatures(BaseModel):
    condition_concepts: list[int | str] = Field(default_factory=list)
    lab_vector: list[float] | dict[str, float] = Field(default_factory=list)
    drug_concepts: list[int | str] = Field(default_factory=list)
    procedure_concepts: list[int | str] = Field(default_factory=list)
    variant_genes: list[str | dict] = Field(default_factory=list)

The variant_genes field deserves special mention. In the OMOP genomics extension, variant data is stored as [{"gene": "KRAS", "pathogenicity": "Pathogenic"}, ...]. The embedding service now extracts the gene field from dict entries and passes it to the encoder:

variant_genes = [
    g["gene"] if isinstance(g, dict) else str(g)
    for g in raw_genes
]

This means patients with genomic data — like those in the Pancreatic Cancer Corpus with KRAS, BRCA1, and TP53 variants — get genomics-aware embeddings.

Lab Vector Dict Handling

The _encode_measurements function now accepts both forms:

if isinstance(lab_vector, dict):
    values = list(lab_vector.values())  # Extract z-scores from {concept_id: zscore}
else:
    values = lab_vector  # Already a list of floats

The z-scores are clipped to [-5, 5] and normalized to [-1, 1], then packed into the 96-dimensional measurements slice of the patient vector.
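Sketched in plain Python (function name ours; the 96-dim slice width comes from the layout described later), the clip-scale-pack step looks like:

```python
def pack_lab_slice(lab_vector, width=96):
    """Clip z-scores to [-5, 5], scale to [-1, 1], zero-pad to the slice width."""
    values = list(lab_vector.values()) if isinstance(lab_vector, dict) else list(lab_vector)
    scaled = [max(-5.0, min(5.0, float(v))) / 5.0 for v in values]
    return (scaled + [0.0] * width)[:width]  # fixed-width measurements slice
```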

Batch Response Re-keying

The PHP EmbeddingClient::embedBatch() now properly maps the response:

$raw = $response->json('embeddings', []);
$result = [];
foreach ($raw as $entry) {
    if (isset($entry['person_id'], $entry['embedding'])) {
        $result[(int) $entry['person_id']] = $entry['embedding'];
    }
}
return $result;

The caller gets a person_id => float[] map instead of an indexed array, and the UPDATE statement writes the correct vector to the correct patient.


From CPU to GPU: The Ollama Migration

With the type bugs fixed, embeddings started generating — but slowly. The original pipeline used SapBERT (cambridgeltl/SapBERT-from-PubMedBERT-fulltext), a 110M-parameter biomedical language model running on CPU inside the Docker container. SapBERT is an excellent model for clinical concept encoding, but CPU inference is not how you want to embed a million patients.

Meanwhile, Ollama was already running on the host machine with full GPU access, serving MedGemma for Abby's conversational AI. Three embedding models were loaded and ready:

Model                    Parameters   Dimension   Use Case
nomic-embed-text         137M         768         General-purpose embedding, fast
embeddinggemma:300m      300M         768         Google's embedding model
text-embedding-3-large   —            768         OpenAI-compatible embedding

All three produce 768-dimensional embeddings, matching SapBERT's native output dimension. We chose nomic-embed-text for its speed: 27 concepts/second in batch mode, with the GPU doing the heavy lifting.

The Embedding Service Refactor

The sapbert.py service was refactored to try Ollama first, falling back to CPU-based SapBERT only if Ollama is unavailable:

class OllamaEmbeddingService:
    """GPU-accelerated embedding via Ollama's /api/embed endpoint."""

    def encode(self, texts: list[str]) -> list[list[float]]:
        resp = httpx.post(
            f"{self._base_url}/api/embed",
            json={"model": self._model, "input": texts},
            timeout=60.0,
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]


def get_sapbert_service() -> OllamaEmbeddingService | SapBERTService:
    """Return the best available embedding service."""
    if _ollama_service.is_available:
        return _ollama_service
    return _sapbert_service  # CPU fallback

On startup, the service probes Ollama with a single test embedding. If it responds, all subsequent calls go to the GPU. If Ollama is down, the service falls back to loading the SapBERT model into CPU memory — slower, but functional. The interface is identical: both have an .encode(texts: list[str]) -> list[list[float]] method.
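The degrade-gracefully pattern can be sketched independent of the HTTP details (class and method names ours; the real service probes once at startup, whereas this variant demotes the primary on first failure):

```python
class FallbackEncoder:
    """Route encode() to the GPU backend until it fails, then use the CPU one.

    Both backends expose encode(texts) -> list of vectors, mirroring the
    shared interface described above.
    """

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback
        self._use_primary = True

    def encode(self, texts):
        if self._use_primary:
            try:
                return self.primary.encode(texts)
            except Exception:
                self._use_primary = False  # remember the failure, stop retrying
        return self.fallback.encode(texts)
```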


768 Dimensions: The Full Encoder Width

The original design used 512-dimensional patient embeddings, partitioning the vector into six slices that truncated the encoder's 768-dim output:

Old (512-dim):
[0-32]: Demographics (32)
[32-160]: Conditions (128) ← truncated from 768
[160-224]: Measurements (64)
[224-352]: Drugs (128) ← truncated from 768
[352-448]: Procedures (96) ← truncated from 768
[448-512]: Genomics (64) ← truncated from 768

Truncation discards information. The SapBERT and Ollama encoders pack semantic meaning across all 768 dimensions, and lopping off the tail loses the long-range feature interactions that distinguish similar-but-different concepts.

With the move to Ollama, we expanded to 768 dimensions — the encoder's native width:

New (768-dim):
[0-32]: Demographics (32)
[32-224]: Conditions (192) ← 50% more capacity
[224-320]: Measurements (96) ← 50% more capacity
[320-512]: Drugs (192) ← 50% more capacity
[512-672]: Procedures (160) ← 67% more capacity
[672-768]: Genomics (96) ← 50% more capacity

The pgvector column was altered from vector(512) to vector(768), the IVFFlat index was rebuilt, and the migration file was updated to reflect the new dimension for fresh installs.
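The 768-dim layout can be written down as slice constants (names ours) and sanity-checked — the six widths must tile the vector exactly:

```python
# Hypothetical constants mirroring the 768-dim layout above
DIM_SLICES = {
    "demographics": slice(0, 32),
    "conditions":   slice(32, 224),
    "measurements": slice(224, 320),
    "drugs":        slice(320, 512),
    "procedures":   slice(512, 672),
    "genomics":     slice(672, 768),
}

def check_layout(slices, total=768):
    """Verify the slices are contiguous and cover exactly `total` dims."""
    stop = 0
    for name, s in slices.items():
        assert s.start == stop, f"{name} leaves a gap at dim {stop}"
        stop = s.stop
    assert stop == total
    return {name: s.stop - s.start for name, s in slices.items()}
```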

Demographics and Measurements: Numeric, Not Encoded

Two of the six dimensions don't use the language model at all:

Demographics (32 dims): Age is normalized (age_bucket / 20), gender is encoded as +1 (male) / -1 (female) / 0 (unknown), and race uses one-hot encoding in dims 2-31 mapped to OMOP race concept IDs (8516=Black, 8527=White, 8515=Asian, etc.). This is simple, deterministic, and doesn't need a language model.
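A plain-Python sketch of that numeric encoding (the exact age-bucket formula and the race index table here are our assumptions; only the slice layout and concept IDs come from the text):

```python
def encode_demographics(age_years, gender_concept_id, race_concept_id, width=32):
    """Deterministic demographics slice — no language model involved."""
    vec = [0.0] * width
    vec[0] = min(age_years // 5, 20) / 20.0                        # age bucket, capped
    vec[1] = {8507: 1.0, 8532: -1.0}.get(gender_concept_id, 0.0)   # OMOP gender
    race_index = {8527: 2, 8516: 3, 8515: 4}   # White, Black, Asian (illustrative subset)
    if race_concept_id in race_index:
        vec[race_index[race_concept_id]] = 1.0  # one-hot race in dims 2-31
    return vec
```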

Measurements (96 dims): Lab z-scores are clipped to [-5, 5], normalized to [-1, 1], and packed directly into the vector. The z-scores come from population-level statistics computed per source: for each measurement concept, we compute mean and standard deviation across all patients, then express each patient's value as a distance from the population mean. A hemoglobin of 7.2 g/dL means different things depending on whether the population average is 8.5 (critical) or 14.0 (severely anemic).

The remaining four dimensions — conditions, drugs, procedures, and genomics — are encoded through Ollama. OMOP concept IDs (integers) are passed as text strings to the embedding model, which maps them into dense semantic space. Related concepts cluster together: metformin and insulin share neighborhood structure; KRAS and TP53 occupy nearby regions of the genomics subspace. Mean pooling across all concepts in a dimension produces a single representative vector for that clinical aspect of the patient.


The 123x Speedup: Batch Deduplication

The original pipeline processed patients one at a time. Each patient triggered four Ollama calls — one per encoded dimension (conditions, drugs, procedures, genomics). For a batch of 200 patients, that's 800 Ollama calls.

But patients share concepts. In a cancer registry, most patients have the same core ICD-10 codes, the same standard-of-care medications, the same diagnostic procedures. A batch of 200 patients might reference 15 unique condition concepts, not 200 × 10 = 2,000.

The batch-optimized path exploits this:

def compute_patient_embeddings_batch(patients: list[dict]) -> list[list[float]]:
    """4 Ollama calls per batch, not 4 per patient."""

    for field, slc in dim_configs:
        # Collect ALL unique concepts across ALL patients
        all_unique_texts = deduplicate(patients, field)

        # ONE encoding call for all unique concepts
        all_embeddings = svc.encode(all_unique_texts)

        # Mean-pool per patient using shared lookup
        for i, patient in enumerate(patients):
            indices = [text_to_idx[t] for t in patient_texts[i]]
            embeddings[i, slc] = all_embeddings[indices].mean(axis=0)

Instead of 800 Ollama calls for 200 patients, we make 4 calls (one per dimension) with 15-50 unique texts each. The encoding is done once; the per-patient work is just numpy indexing and mean pooling.
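A runnable miniature of the dedup-then-pool idea in plain Python (the real path uses numpy; encode here stands in for the Ollama call):

```python
def embed_batch_deduped(patients, encode):
    """One encode() call for the unique concepts of a whole batch,
    then mean-pool per patient from the shared lookup table."""
    unique = sorted({str(c) for concepts in patients for c in concepts})
    table = dict(zip(unique, encode(unique)))   # single backend call
    pooled = []
    for concepts in patients:
        vecs = [table[str(c)] for c in concepts]
        if not vecs:
            pooled.append([])
            continue
        dim = len(vecs[0])
        pooled.append([sum(v[i] for v in vecs) / len(vecs) for i in range(dim)])
    return pooled
```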

Benchmark results:

Approach            Batch of 200   Rate                 Ollama Calls
Per-patient (old)   14.1s          14 patients/sec      800
Batch dedup (new)   0.1s           1,743 patients/sec   4
Speedup             —              123x                 200x fewer calls

The actual throughput for the full Acumenus CDM run settled at ~130 patients/sec sustained — lower than the benchmark because real patient data has more concept diversity than synthetic test data, and the database UPDATE operations add I/O overhead. But 130/sec on a million patients is still roughly 2 hours, compared to the ~18 hours the per-patient approach would have taken.


The Production Run: 1,007,007 Patients

With all fixes in place, we generated embeddings for three CDM sources:

Source                       Patients    Time       Rate      Notes
IRSF Natural History Study   1,858       22s        84/sec    Rare disease cohort
Pancreatic Cancer Corpus     361         4s         90/sec    Multimodal cancer registry
OHDSI Acumenus CDM           1,005,788   ~2 hours   130/sec   Full clinical data warehouse
Total                        1,007,007

The IRSF and Pancreas sources completed in under 30 seconds each. The Acumenus CDM required multiple runs due to PHP's process limits — artisan tinker stops chunk() iteration after ~500 calls regardless of timeout settings. We ran the embedding loop five times, each run picking up where the previous one left off via whereNull('embedding').

One patient — person_id 1005788 — required special handling. With 51 condition concepts, 12 procedures, and genomic variants (KRAS pathogenic), the full payload triggered a timeout in the batch endpoint. We embedded him individually with his complete clinical profile, ensuring his KRAS variant was encoded in the genomics dimension alongside his full comorbidity burden.

Data Richness Across Sources

The feature vectors capture meaningfully different clinical profiles across sources:

Source     Avg Conditions   Avg Drugs   Avg Labs         Has Genomics
IRSF       3-10             5-15        18-22 z-scores   No
Pancreas   5-51             3-8         5-11 z-scores    Yes (KRAS, BRCA1, TP53)
Acumenus   0-50+            0-30+      0-50 z-scores    Selected patients

The Pancreatic Cancer Corpus is the richest per patient — small cohort, deep phenotyping, genomic annotation. IRSF has consistent depth across a rare disease population. Acumenus is the long tail: a million patients with highly variable data completeness, from single-visit records to decades of longitudinal care.


What This Enables

Before embeddings, similarity search for a patient in the Acumenus CDM required loading all 1M candidates into memory (impractical) or SQL-based pre-screening with PHP re-scoring. With pgvector's IVFFlat index, finding the 20 most similar patients is a single cosine distance query:

SELECT person_id, embedding <=> $1 AS distance
FROM patient_feature_vectors
WHERE source_id = 47
ORDER BY embedding <=> $1
LIMIT 20;

This returns in milliseconds. The interpretable scoring (per-dimension Jaccard, z-score comparison) is then applied only to these 20 candidates, giving the user both fast results and explainable scores.
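pgvector's <=> operator computes cosine distance, i.e. one minus cosine similarity. A reference implementation in plain Python:

```python
import math

def cosine_distance(a, b):
    """pgvector <=>: 1 - (a . b) / (|a| * |b|). 0 = same direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

On the L2-normalized embeddings this pipeline writes, the norms are 1 and the formula reduces to one minus the dot product.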

Cross-Source Phenotypic Matching

All three sources share the same 768-dimensional embedding space with the same encoding model. A clinician studying a rare disease patient in IRSF can ask: "are there any patients in the million-patient Acumenus CDM who look like this?" The vector search doesn't care about source boundaries — it finds the nearest neighbors across the entire embedding space.

This is especially powerful for rare disease research, where individual institutions may have only a handful of cases. Cross-source similarity expands the searchable population from hundreds to millions.

The search-from-cohort endpoint computes the centroid (average embedding) of a defined cohort and finds individual patients nearest to it. Define a cohort of 50 confirmed cases, compute their centroid, and discover 500 more patients with similar clinical profiles who weren't captured by the original inclusion criteria. This is phenotype-driven cohort expansion — the computational equivalent of a clinician saying "find me more patients like these."
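A minimal sketch of that centroid-and-rank step (function name ours; on unit-normalized embeddings, squared Euclidean distance produces the same ranking as cosine):

```python
def expand_cohort(cohort_vecs, candidates, k):
    """Rank candidate patients by distance to the cohort centroid.

    `candidates` maps person_id -> embedding; returns the k nearest ids.
    """
    n, dim = len(cohort_vecs), len(cohort_vecs[0])
    centroid = [sum(v[i] for v in cohort_vecs) / n for i in range(dim)]

    def sq_dist(vec):
        return sum((c - x) ** 2 for c, x in zip(centroid, vec))

    return sorted(candidates, key=lambda pid: sq_dist(candidates[pid]))[:k]
```

In production this ranking runs inside pgvector, not in application code — the sketch only shows the math.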

Embedding-Powered Analytics

With every patient represented as a point in vector space, standard machine learning techniques become applicable:

  • Clustering: K-means or HDBSCAN on patient embeddings reveals natural phenotypic subgroups without pre-specifying features. A cluster analysis of the Pancreatic Cancer Corpus might reveal subtypes that correlate with survival — not from genomics alone, but from the full clinical picture.

  • Outlier Detection: Patients far from any cluster centroid may represent rare phenotypes, coding errors, or unusual disease presentations. In a quality improvement context, outliers in a supposedly homogeneous cohort warrant chart review.

  • Temporal Trajectories: Re-embedding patients at different time windows (diagnosis, 6 months, 12 months) traces how their clinical profile evolves. Patients whose trajectories diverge despite similar starting points are natural candidates for outcome analysis.

  • Treatment Response Similarity: Find patients who looked similar pre-treatment, then compare outcomes. This is observational causal inference bootstrapped by embedding similarity — less rigorous than propensity score matching, but vastly more scalable.

Genomics-Aware Similarity

Patients with genomic data get embeddings that encode molecular profiles alongside clinical features. The 96-dimensional genomics slice captures gene-level similarity through the language model's understanding of gene names and their relationships. KRAS and NRAS cluster together; BRCA1 and BRCA2 share embedding structure.

This makes the similarity engine directly useful for molecular tumor board workflows: given an index patient with a pathogenic KRAS variant, find clinically similar patients who also carry RAS pathway mutations — even if they have different specific variants.

Foundation for Federated Similarity

Patient embeddings are privacy-preserving representations. A 768-dimensional vector does not contain raw clinical data — you cannot reconstruct a patient's medication list or lab values from their embedding. This makes embeddings suitable for sharing across institutional boundaries in a federated learning network.

In the Hive Networks architecture, participating sites could share patient embeddings without sharing PHI. A query like "find patients similar to this one across the network" becomes a vector search across sites — each site returns only the embedding distances, never the underlying data. The requesting site gets a ranked list of similar patients by site, enabling multi-institutional rare disease research without a central data repository.


Architecture: The Final Pipeline

CDM Tables (person, condition_occurrence, drug_exposure,
measurement, procedure_occurrence, genomic_variant)
        ↓
SimilarityFeatureExtractor (Laravel)
├── Demographics: age_bucket, gender, race
├── Conditions: ancestor-rolled concept IDs (3 levels)
├── Measurements: z-score normalized lab values
├── Drugs: ingredient-level concept IDs
├── Procedures: procedure concept IDs
└── Genomics: gene names with pathogenicity tier
        ↓
patient_feature_vectors (PostgreSQL, app schema)
├── source_id, person_id
├── condition_concepts (JSONB)
├── lab_vector (JSONB: {concept_id: z_score, ...})
├── drug_concepts (JSONB)
├── procedure_concepts (JSONB)
├── variant_genes (JSONB: [{gene, pathogenicity}, ...])
└── embedding (pgvector, vector(768))
        ↓
EmbeddingClient (Laravel → Python AI Service)
        ↓
OllamaEmbeddingService (GPU, nomic-embed-text, 768-dim)
├── Batch deduplication: 4 calls per batch, not per patient
├── Per-dimension encoding:
│   ├── Conditions → 192 dims (mean-pooled)
│   ├── Drugs → 192 dims (mean-pooled)
│   ├── Procedures → 160 dims (mean-pooled)
│   └── Genomics → 96 dims (mean-pooled)
├── Direct encoding (no LM):
│   ├── Demographics → 32 dims (numeric)
│   └── Measurements → 96 dims (z-scores)
└── L2 normalization → unit vector
        ↓
pgvector IVFFlat Index (cosine distance, 100 lists)
        ↓
PatientSimilarityService.search()
├── Embedding mode: ANN search → top K → interpretable re-score
└── Interpretable mode: full dimension-wise scoring (fallback)

Fallback Guarantees

The system degrades gracefully at every layer:

  • Ollama down? Falls back to CPU-based SapBERT. Slower, but produces identical-dimension embeddings.
  • No embeddings computed? Falls back to interpretable-only search. No ANN, but full scoring across all dimensions.
  • Source too small for IVFFlat? (< 100 patients) Skips index creation; pgvector does exact scan.
  • Patient missing a dimension? Zero-padded in the embedding; interpretable scoring skips that dimension and re-weights the others.

Performance Characteristics

Embedding Generation

Metric                        Value
Encoding backend              Ollama (nomic-embed-text) on GPU
Embedding dimension           768
Batch size (PHP → Python)     500 patients
Batch dedup calls per batch   4 (one per encoded dimension)
Sustained throughput          ~130 patients/sec
Time for 1M patients          ~2 hours
Peak GPU utilization          ~40% (Ollama, batch encoding)
Peak DB write throughput      ~500 UPDATEs/sec (CASE/WHEN batch)

Similarity Search (Embedding Mode)

Metric                         Value
Index type                     IVFFlat (100 lists)
Distance metric                Cosine
ANN candidates                 20 (configurable)
Search latency (1M patients)   < 50ms
Interpretable re-scoring       ~5ms per candidate
Total search time              < 150ms

Storage

Source     Rows        Embedding Storage              Total Table Size
Acumenus   1,005,788   ~5.8 GB (768 × float32 × 1M)   ~8.2 GB
IRSF       1,858       ~5.4 MB                        ~12 MB
Pancreas   361         ~1.1 MB                        ~3.5 MB

Lessons Learned

1. Silent Failures Are Architecture Bugs

The embedding pipeline "worked" for weeks without generating a single embedding. The EmbeddingClient caught exceptions and returned empty arrays. The job logged warnings that scrolled past in a sea of other output. The search engine fell back to interpretable mode without complaint.

Every pipeline stage should either succeed visibly or fail loudly. A try/catch that returns a default value without raising an alert is not error handling — it's evidence suppression.

2. Validate at the Seam

The type mismatch between PHP (integers) and Python (strings) lived at the service boundary — the HTTP API between Laravel and FastAPI. Both sides were internally correct: PHP correctly serialized OMOP concept IDs as integers; Python correctly expected concept identifiers as strings. Neither side was wrong in isolation.

Service boundaries need explicit contracts. The Pydantic model should have been generated from or validated against the PHP serialization format. In a multi-language architecture, the API schema is the source of truth — not either implementation.

3. Deduplication Beats Parallelism

Our first instinct for performance was to increase batch sizes and add worker parallelism. The 123x speedup came instead from observing that patients share concepts. In a batch of 200 oncology patients, there might be 15 unique condition concepts. Encoding 15 texts once is faster than encoding 2,000 texts (even on a GPU) because the bottleneck is Ollama's tokenization and inference, not the network call.

The general principle: before parallelizing work, check if the work is redundant. Deduplication is free; parallelism has coordination costs.

4. Tinker Has a Hidden Iteration Limit

PHP's artisan tinker (PsySH) silently stops chunk() iteration after approximately 500 calls, regardless of max_execution_time settings. For bulk operations over large datasets, use a proper artisan command or a raw PHP script — not an interactive REPL with undocumented safety limits.

5. One Patient Can Break the Pipeline

Patient 1005788 — with 51 conditions, 12 procedures, and KRAS/BRCA genomic variants — was the single holdout in a million-patient run. The variant_genes field stored as [{"gene": "KRAS", "pathogenicity": "Pathogenic"}] didn't match the list[str] Pydantic type. One patient, one type mismatch, one silent failure.

Robust pipelines handle edge cases in the data model, not in the exception handler. The fix wasn't a special case for patient 1005788 — it was accepting the actual data shape in the Pydantic model and converting dicts to gene names in the encoder.


What's Next

The embedding infrastructure is now production-ready. The immediate roadmap:

  1. IVFFlat Index Rebuild: Create the index on the full 1M-row table with tuned lists parameter for optimal recall/speed tradeoff.

  2. Embedding-Mode Search in the UI: The frontend currently defaults to interpretable mode. With embeddings available, the search controller can route to ANN search for large sources (> 5,000 patients) and fall back to interpretable for small ones.

  3. Cohort Centroid Visualization: Display the centroid embedding of a cohort as a radar chart across the six dimensions, showing where the cohort's "center of mass" lies in clinical space.

  4. Incremental Embedding Updates: New patients added through ETL should trigger embedding generation without reprocessing the entire source. The whereNull('embedding') pattern already supports this — we just need to hook it into the ingestion pipeline.

  5. SynPUF and Eunomia: Two remaining sources (2.3M CMS SynPUF patients and 2.7K Eunomia demo patients) need feature extraction and embedding. SynPUF at 2.3M patients will take approximately 5 hours at current throughput.

  6. Federated Embedding Exchange: Design the protocol for sharing patient embeddings across Hive Network sites — embedding format, distance computation, privacy guarantees, and consent models.

The Patient Similarity Engine now has its index. A million patients, each reduced to 768 numbers that capture their demographics, conditions, labs, medications, procedures, and genomic variants. The question "which patients are most like this one?" is no longer a research project. It's a query.