2 posts tagged with "ollama"

One Million Patient Embeddings: GPU-Accelerated Similarity Search Comes to Parthenon

20 min read
Creator, Parthenon
AI Development Assistant

Two days ago, we shipped the Patient Similarity Engine — a multi-modal system that scores patients across six clinical dimensions on OMOP CDM. The architecture was sound. The algorithms worked. But there was a problem hiding in plain sight: none of our patients had embeddings.

The embedding pipeline had been silently failing since day one. Three type mismatches between our PHP backend and Python AI service meant that every embedding request returned a validation error, was caught by a try/catch block, and logged as a warning that nobody read. The feature vectors were all there — conditions, drugs, measurements, procedures — but the 512-dimensional dense vectors that would make similarity search fast at scale? Zero. For every source. For every patient.
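To make the failure mode concrete, here is a minimal sketch of the pattern (the schema and field names are hypothetical, not our actual contract): a Pydantic model on the Python side rejects a payload whose shape drifted on the PHP side, and a broad exception handler downgrades the error to a warning that nobody reads.

```python
from pydantic import BaseModel, ValidationError
import logging

logger = logging.getLogger("embedding_pipeline")

# Hypothetical request schema, not our real one.
class EmbeddingRequest(BaseModel):
    patient_id: int
    feature_weights: dict[str, float]  # Python expects a JSON object
    dimensions: int = 512

def embed_patient(payload: dict) -> list[float] | None:
    try:
        request = EmbeddingRequest(**payload)
        # ... call the embedding model here ...
        return [0.0] * request.dimensions
    except ValidationError as exc:
        # The anti-pattern: a schema bug becomes a warning, and the
        # caller continues as if nothing happened.
        logger.warning("embedding skipped: %s", exc)
        return None

# The PHP side serialized feature_weights as a JSON array, so every
# request takes the warning path and every patient gets None.
embed_patient({"patient_id": 12345, "feature_weights": [0.4, 0.6]})
```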

Tonight, we fixed all three bugs, refactored the embedding pipeline from CPU-only SapBERT to GPU-accelerated Ollama, upgraded from 512 to 768 dimensions, introduced batch deduplication that delivered a 123x throughput improvement, and generated embeddings for 1,007,007 patients across three CDM sources. This is the story of what broke, what we built, and what it unlocks.
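The deduplication win is easy to sketch: a million patients draw from a finite clinical vocabulary, so the same concept strings recur constantly, and each unique text needs only one forward pass on the GPU. A minimal illustration against Ollama's batch embedding endpoint (the model name and URL are assumptions; the production pipeline adds caching, retries, and chunking):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embed"  # default Ollama port
MODEL = "nomic-embed-text"  # assumed model; emits 768-dim vectors

def embed_with_dedup(texts: list[str]) -> list[list[float]]:
    """Embed a batch, sending each unique text to the GPU only once."""
    # Map each unique text to its slot in the deduplicated request.
    unique: dict[str, int] = {}
    for text in texts:
        unique.setdefault(text, len(unique))

    # One batched call covering only the unique texts.
    response = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "input": list(unique)},
        timeout=300,
    )
    response.raise_for_status()
    vectors = response.json()["embeddings"]

    # Fan the shared vectors back out to the original order.
    return [vectors[unique[text]] for text in texts]

# Three inputs, two forward passes: the duplicate is free.
embeddings = embed_with_dedup(
    ["type 2 diabetes mellitus", "essential hypertension",
     "type 2 diabetes mellitus"]
)
```

The speedup scales with the overlap in each batch, which is why OMOP concept text dedupes so well: the vocabulary is bounded while the patient count is not.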

Making Abby Honest and Fast: ROCm Migration, RAG Overhaul, and the Hunt for an 8MB Memory Lock

13 min read
Creator, Parthenon
AI Development Assistant

What started as "Abby's responses are slow" turned into an 18-hour deep dive that touched every layer of the AI stack — from GPU driver backends to embedding model race conditions to the fundamental question of why a 4-billion-parameter medical LLM was confidently inventing researcher names. By the end, Abby went from 15-25 second hallucinated responses to 2-5 second grounded answers backed by 167,000 vectors of medical knowledge — and we found that an 8-megabyte systemd memory lock was silently killing 25% of all GPU inference requests.