Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is the technique that connects Abby's knowledge base to the language model. Rather than relying solely on the model's training data, Abby retrieves relevant knowledge before generating each response — ensuring answers are grounded in Parthenon documentation, OHDSI literature, and clinical reference data.
How RAG Works
The RAG pipeline executes in five stages for every user query:
Stage 1: Page Context Detection
When a question arrives, the AI service identifies the current page context (e.g., cohort_builder, estimation, vocabulary). This determines:
- Which persona Abby adopts (specialist framing for the response)
- Whether clinical collections are queried (only on clinical pages)
- Which help content is injected into the system prompt
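The routing above can be sketched as a small lookup. This is a minimal illustration, not the service's actual code; the page identifiers mirror the names used in this document, while the return keys (`persona`, `query_clinical`, `help_key`) are hypothetical.

```python
# Sketch of Stage 1 page-context detection (identifiers are illustrative).
CLINICAL_PAGES = {
    "cohort_builder", "vocabulary", "data_explorer", "data_quality",
    "analyses", "incidence_rates", "estimation", "prediction",
    "genomics", "imaging", "patient_profiles", "care_gaps",
}
NON_CLINICAL_PAGES = {"heor", "fhir", "admin", "gis", "studies"}  # assumed names

def resolve_context(page: str) -> dict:
    """Map the current page to a persona, collection gating, and help content."""
    known = page in CLINICAL_PAGES or page in NON_CLINICAL_PAGES
    return {
        "persona": page if known else "general",   # fall back to the General persona
        "query_clinical": page in CLINICAL_PAGES,  # gates the SapBERT collections
        "help_key": page if known else None,       # page help injected into the prompt
    }
```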
Stage 2: Parallel Retrieval
The query is embedded and searched against multiple collections simultaneously:
| Collection | When Queried | Embedder | Top-K | Threshold (cosine distance) |
|---|---|---|---|---|
| docs | Always | MiniLM (384-dim) | 3 | 0.3 |
| conversations_user_{id} | If user authenticated | MiniLM (384-dim) | 3 | 0.3 |
| faq_shared | Always | MiniLM (384-dim) | 3 | 0.3 |
| ohdsi_papers | Clinical pages only | SapBERT (768-dim) | 3 | 0.3 |
| clinical_reference | Clinical pages only | SapBERT (768-dim) | 3 | 0.3 |
Clinical pages include: Cohort Builder, Vocabulary, Data Explorer, Data Quality, Analyses, Incidence Rates, Estimation, Prediction, Genomics, Imaging, Patient Profiles, and Care Gaps.
A cosine-distance threshold of 0.3 admits only chunks with at least 70% similarity (similarity = 1 - distance), keeping irrelevant noise out of the prompt.
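The per-collection filtering step can be sketched as follows. The function name and dict shape are illustrative; the parallel `documents`/`distances` lists mirror the shape a vector store such as ChromaDB returns for one query.

```python
def filter_hits(documents, distances, max_distance=0.3):
    """Keep only chunks within the cosine-distance threshold.

    A distance of 0.3 corresponds to a similarity of 0.70, the cutoff
    described above.
    """
    kept = []
    for doc, dist in zip(documents, distances):
        if dist <= max_distance:
            kept.append({"text": doc, "similarity": round(1.0 - dist, 3)})
    return kept
```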
Stage 3: Context Assembly
Retrieved chunks are formatted into a structured context block:
```
KNOWLEDGE BASE (use this context to inform your response):
Documentation:
- [chunk from platform docs]
- [chunk from platform docs]
Previous conversations:
- [relevant prior Q&A from this user]
Common questions:
- [relevant shared FAQ entry]
OHDSI research literature:
- [chunk from research paper or Book of OHDSI]
- [chunk from HADES vignette]
Clinical reference:
- [relevant OMOP concept description]
```
This context block is injected into the system prompt alongside the page-specific persona instructions and any help content for the current page.
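Assembly of this block is straightforward string formatting. A minimal sketch, assuming retrieved chunks arrive grouped by collection; the section labels match the block shown above, while the function and key names are hypothetical.

```python
SECTION_LABELS = {
    "docs": "Documentation:",
    "conversations": "Previous conversations:",
    "faq": "Common questions:",
    "papers": "OHDSI research literature:",
    "clinical": "Clinical reference:",
}

def build_context_block(retrieved: dict) -> str:
    """Format retrieved chunks into the KNOWLEDGE BASE context block."""
    lines = ["KNOWLEDGE BASE (use this context to inform your response):"]
    for key, label in SECTION_LABELS.items():
        chunks = retrieved.get(key, [])
        if not chunks:
            continue  # omit empty collections entirely
        lines.append(label)
        lines.extend(f"- {chunk}" for chunk in chunks)
    return "\n".join(lines)
```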
Stage 4: Generation
The assembled prompt is sent to MedGemma 1.5 (4B) via Ollama. MedGemma is a medical domain LLM from Google, purpose-built for clinical and biomedical text generation. It runs entirely locally — no API calls to external services.
The model receives:
- System prompt — page persona + behavioral instructions
- RAG context — retrieved knowledge from all collections
- Help content — structured feature documentation for the current page
- User message — the actual question
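The four inputs above can be combined into a chat payload like the sketch below. This assumes an OpenAI/Ollama-style messages list with the persona, RAG context, and help content concatenated into the system turn; the function name is illustrative.

```python
def build_messages(persona_prompt, rag_context, help_content, user_message):
    """Assemble the chat payload: system turn carries persona + RAG context
    + page help; the user turn carries only the question."""
    system = "\n\n".join(
        part for part in (persona_prompt, rag_context, help_content) if part
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
```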
Stage 5: Memory Storage
After generating the response, the Q&A pair is embedded and stored in the user's conversation memory collection. This is a fire-and-forget operation (non-blocking) to avoid adding latency to the response.
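One way to make the write non-blocking is a background thread, as in this sketch; the actual service may use an async task queue instead. The `store` object here is a stand-in for the embed-and-upsert step.

```python
import threading

def store_memory_async(store, user_id, question, answer):
    """Fire-and-forget memory write: the caller returns immediately while
    a daemon thread persists the Q&A pair (stand-in for embed + upsert)."""
    def _write():
        store.append((user_id, question, answer))
    thread = threading.Thread(target=_write, daemon=True)
    thread.start()
    return thread  # returned only so callers/tests can join if needed
```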
Page Personas
Abby maintains 22 specialized personas that activate based on the current page:
| Page | Persona Focus |
|---|---|
| Cohort Builder | Inclusion/exclusion criteria, cohort expressions, temporal logic, era settings |
| Vocabulary Browser | Concept search, hierarchy navigation, domain filtering, semantic matching |
| Concept Set Builder | Descendant flags, exclude flags, concept mapping strategies |
| Population-Level Estimation | Propensity scores, negative controls, study diagnostics, IPTW vs matching |
| Patient-Level Prediction | Feature selection, model evaluation (AUROC, calibration), external validation |
| Characterization | Covariate selection, baseline characteristics, feature extraction settings |
| Incidence Rates | Rate calculation, time-at-risk, age/sex stratification |
| Treatment Pathways | Event sequencing, pathway analysis design, sunburst interpretation |
| Data Explorer | Achilles results, data quality metrics, population distributions |
| Data Quality | DQD check interpretation, threshold configuration, remediation guidance |
| Data Ingestion | Schema mapping, concept mapping, file format handling, ETL guidance |
| Genomics | VCF interpretation, ClinVar annotations, variant pathogenicity, tumor boards |
| Imaging | DICOM viewer, modality guidance, PACS connectivity, NLP extraction |
| HEOR | Cost-effectiveness, care gap analysis, economic modeling |
| FHIR Integration | SMART auth, bulk export, FHIR-to-OMOP mapping, IG compliance |
| GIS Explorer | Spatial analysis, geographic health disparities, SVI data |
| Administration | System configuration, user management, health monitoring |
| SCCS | Self-controlled case series design, risk windows, age/season adjustment |
| Evidence Synthesis | Meta-analysis, forest plots, heterogeneity assessment |
| Studies | Study packages, protocol design, network study coordination |
| Patient Profiles | Timeline navigation, encounter details, longitudinal patient view |
| General | Broad platform knowledge (fallback when no specific page detected) |
Retrieval Quality
Cosine Similarity Scoring
ChromaDB returns results ranked by cosine distance (0 = identical, 2 = opposite). Abby converts this to a similarity score:
```
similarity = 1.0 - cosine_distance
```
Only results with cosine_distance <= 0.3 (similarity >= 0.70) are included. This threshold was tuned to balance recall (finding relevant content) against precision (excluding noise).
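The conversion and cutoff together look like this (an illustrative helper, not the service's code):

```python
def passes_threshold(cosine_distance, max_distance=0.3):
    """Convert a cosine distance to a similarity score and report whether
    the chunk clears the 0.70 inclusion cutoff."""
    similarity = 1.0 - cosine_distance
    return similarity, similarity >= (1.0 - max_distance)
```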
Multi-Collection Deduplication
When the same content appears in multiple collections (e.g., a concept appears in both documentation and clinical reference), the retrieval pipeline deduplicates by text content to avoid injecting redundant context into the prompt.
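A sketch of that deduplication step, assuming chunks are ordered by collection priority so the first occurrence wins; the normalization (strip + lowercase) is an assumption.

```python
def dedupe_chunks(chunks):
    """Drop repeated chunk texts across collections, keeping the first
    occurrence (and therefore the higher-priority collection's copy)."""
    seen, unique = set(), []
    for chunk in chunks:
        key = chunk["text"].strip().lower()  # normalize before comparing
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```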
Source Attribution
Each retrieved chunk carries metadata about its source:
```json
{
  "source": "ohdsi_corpus",
  "title": "Large-scale propensity score analysis...",
  "doi": "10.1038/s12345",
  "year": 2023,
  "chunk_index": 4,
  "total_chunks": 12
}
```
This metadata enables future enhancements like citation grounding — including DOI references in Abby's responses so researchers can verify claims against the source literature.
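Rendering such metadata as an inline citation could look like the following sketch (the function and output format are hypothetical, not a shipped feature):

```python
def format_citation(meta: dict) -> str:
    """Render a chunk's source metadata as a short citation string,
    tolerating missing fields."""
    parts = [meta.get("title", "Untitled")]
    if meta.get("year"):
        parts.append(f"({meta['year']})")
    if meta.get("doi"):
        parts.append(f"doi:{meta['doi']}")
    return " ".join(parts)
```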
Performance
| Metric | Typical Value |
|---|---|
| Retrieval (5 collections) | 50-150ms |
| Prompt assembly | under 10ms |
| MedGemma generation | 2-8 seconds |
| Memory storage | under 50ms (async) |
| Total response time | 3-9 seconds |
The Solr acceleration layer for the 3D vector explorer reduces projection queries from ~8-10 seconds (live PCA+UMAP) to under 500ms (pre-computed).