Abby 2.0 Phase 2: The Intelligence Upgrade — A Hybrid Brain with Safety Rails
Abby now has two brains. Simple queries stay local on MedGemma (fast, free). Complex reasoning escalates to Claude via API (powerful, cloud). A PHI sanitizer blocks any patient data from leaving the network, and a cost tracker with circuit breaker keeps spending within budget. Researchers get smarter answers on hard questions without compromising privacy or breaking the bank.
The Problem: One Brain Isn't Enough
MedGemma 1.5 4B is fast and medically literate — perfect for "What is concept 201826?" or "How many patients in our CDM?" But ask it to design a complex cohort with temporal logic, critique a study methodology for immortal time bias, or synthesize information across multiple memory tiers into a coherent research recommendation — and it hits its ceiling.
The answer isn't to replace MedGemma. It's to route the right questions to the right brain.
The Architecture: Two-Stage Routing
Every message flows through a two-stage router before hitting any LLM:
```
User Message
     |
Stage 1: Deterministic Rules (<1ms, zero cost)
     |   "Create a cohort..."             → CLOUD (action word detected)
     |   "Hello Abby"                     → LOCAL (greeting detected)
     |   "What is concept X?"             → LOCAL (simple lookup)
     |   200+ chars with multiple clauses → CLOUD (complexity signal)
     |
Stage 2: Bootstrap Scoring (when Stage 1 is uncertain)
     |   Complexity indicators → boost cloud score
     |   Simplicity indicators → boost local score
     |   Tie? → Cloud wins (err toward capability)
```
~70% of requests stay local. Greetings, lookups, navigation help, simple questions — all handled by MedGemma in under 2 seconds at zero cost. The remaining 30% — cohort design, methodology review, multi-step reasoning, analysis interpretation — go to Claude.
Users never see the routing. Abby is one persona. The response just gets smarter when the question warrants it.
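The two-stage flow above can be sketched in a few lines. This is a minimal illustration, not the shipped `RuleRouter` — the specific rule phrases, thresholds, and indicator lists are assumptions:

```python
import re

# Assumed rule sets for illustration; the real router's lists differ.
GREETINGS = ("hello", "hi abby", "good morning")
ACTION_WORDS = ("create a cohort", "design a cohort", "critique")
COMPLEXITY_HINTS = ("temporal", "bias", "synthesize", "compare")
SIMPLICITY_HINTS = ("what is", "how many", "define")

def route(message: str) -> str:
    text = message.lower().strip()

    # Stage 1: deterministic rules (<1ms, no model call)
    if any(text.startswith(g) for g in GREETINGS):
        return "LOCAL"
    if any(a in text for a in ACTION_WORDS):
        return "CLOUD"
    if re.match(r"what is concept \d+", text):
        return "LOCAL"
    if len(text) > 200 and text.count(",") + text.count(";") >= 2:
        return "CLOUD"

    # Stage 2: bootstrap scoring when Stage 1 is uncertain
    cloud = sum(h in text for h in COMPLEXITY_HINTS)
    local = sum(h in text for h in SIMPLICITY_HINTS)
    return "CLOUD" if cloud >= local else "LOCAL"  # tie → cloud (err toward capability)
```

Note the deliberate asymmetry: a tie at Stage 2 routes to the cloud, trading a few cents for a better answer rather than risking a weak one.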
Bootstrap → Classifier Pipeline
The router starts with deterministic rules and heuristic scoring. As users give thumbs-up/down feedback, each rating is stored alongside the routing decision (which model answered and why it was routed there). Once 500+ labeled samples accumulate, a fine-tuned DistilBERT classifier can replace the bootstrap scoring for better routing accuracy.
PHI Protection: Defense in Depth
Clinical data must never leave the network. We enforce this at two levels:
Primary Defense: Architectural Boundary
The Context Assembly Pipeline has an explicit allowlist. Only approved data sources can enter a cloud-bound prompt:
| Allowed (cloud-safe) | Blocked (never sent) |
|---|---|
| User's natural language query | cdm.person (individual records) |
| Aggregate statistics | cdm.visit_occurrence |
| Concept/vocabulary names | cdm.condition_occurrence |
| Cohort definition structure | cdm.drug_exposure |
| Help documentation | cdm.measurement |
| Institutional knowledge | All individual-level clinical tables |
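The allowlist boundary amounts to a default-deny filter over context pieces. A minimal sketch, assuming a simple `source`-tagged dict shape for context pieces (the real `CloudSafetyFilter` interface may differ):

```python
# Assumed source labels; anything not listed is dropped by default.
CLOUD_SAFE_SOURCES = {
    "user_query", "aggregate_stats", "concept_names",
    "cohort_definition", "help_docs", "institutional_knowledge",
}

def filter_for_cloud(context_pieces: list[dict]) -> list[dict]:
    """Keep only allowlisted sources. Individual-level clinical tables
    (cdm.person, cdm.drug_exposure, ...) have no allowlisted label,
    so they can never enter a cloud-bound prompt."""
    return [p for p in context_pieces if p.get("source") in CLOUD_SAFE_SOURCES]
```

The key design choice is allowlist over blocklist: a new data source is blocked until someone explicitly approves it, so forgetting a table fails safe.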
Secondary Defense: PHI Sanitizer
A PHISanitizer runs on every cloud-bound prompt as defense-in-depth:
- Regex detection: SSN patterns, MRN (with medical record context), phone numbers (with contact context), email addresses, dates of birth (with DOB context)
- NER detection: spaCy `en_core_web_sm` PERSON entity recognition for name detection
- Clinical context guard: prevents false positives on OMOP concept IDs (e.g., "Concept ID 4329847" is NOT a medical record number)
- Circuit breaker: if PHI is detected, the cloud request is blocked entirely (not just redacted) and falls back to MedGemma locally
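To make the context guard concrete, here is a toy version of the regex tier. The patterns are simplified assumptions (the real `PHISanitizer` uses richer context windows and also runs the spaCy NER tier, omitted here):

```python
import re

# Simplified illustrative patterns — not the production rule set.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MRN = re.compile(r"\b(?:MRN|medical record)\D{0,5}\d{6,}", re.IGNORECASE)
CONCEPT_ID = re.compile(r"\bconcept id\s*\d+", re.IGNORECASE)

def contains_phi(prompt: str) -> bool:
    # Clinical context guard: strip OMOP concept IDs first so a bare
    # number like "Concept ID 4329847" is never mistaken for an MRN.
    scrubbed = CONCEPT_ID.sub("", prompt)
    return bool(SSN.search(scrubbed) or MRN.search(scrubbed))

def safe_for_cloud(prompt: str) -> bool:
    # Circuit breaker: any detection blocks the cloud call entirely;
    # the caller then falls back to MedGemma locally.
    return not contains_phi(prompt)
```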
Audit Trail
Every cloud API call is logged to abby_cloud_usage with: timestamp, user_id, department, token counts, cost, SHA-256 hash of the sanitized payload, redaction count, and routing reason.
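One audit row might look like the sketch below. The column names are assumptions based on the fields listed above; the point is that the payload itself is never stored, only its hash:

```python
import hashlib
import time

def audit_record(user_id: str, department: str, sanitized_prompt: str,
                 tokens_in: int, tokens_out: int, cost_usd: float,
                 redaction_count: int, routing_reason: str) -> dict:
    """Sketch of one abby_cloud_usage row (field names are assumptions)."""
    return {
        "ts": time.time(),
        "user_id": user_id,
        "department": department,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        # Hash, not the text: proves what was sent without retaining it.
        "payload_sha256": hashlib.sha256(sanitized_prompt.encode()).hexdigest(),
        "redaction_count": redaction_count,
        "routing_reason": routing_reason,
    }
```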
Cost Controls: Budget Enforcement with Circuit Breaker
Cloud API costs are tracked per-request and enforced per-month:
- Multi-tier alerting: Alerts at 50%, 80%, and 95% of monthly budget
- Circuit breaker at 95%: Cloud routing is disabled. All requests fall back to MedGemma with a user-visible confidence indicator showing "low" confidence
- Degraded mode: When budget is exhausted, Abby prepends: "Note: This response was generated locally due to usage limits. For a more thorough analysis, try again later."
- Per-user tracking: Every cloud call records the user and department for cost attribution
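The threshold logic is simple enough to show in full. A minimal sketch — the 50/80/95% numbers come from the list above, the class shape is an assumption about the shipped `CostTracker`:

```python
class CostTracker:
    """Toy monthly budget tracker with tiered alerts and a circuit breaker."""

    ALERT_THRESHOLDS = (0.50, 0.80, 0.95)

    def __init__(self, monthly_budget_usd: float):
        self.budget = monthly_budget_usd
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd

    def alerts_fired(self) -> list[float]:
        frac = self.spent / self.budget
        return [t for t in self.ALERT_THRESHOLDS if frac >= t]

    def cloud_allowed(self) -> bool:
        # Circuit breaker at 95%: beyond this, every request
        # falls back to MedGemma in degraded mode.
        return self.spent / self.budget < 0.95
```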
Confidence & Attribution
Every response now carries:
- Confidence indicator: `high` (Claude + context available), `medium` (MedGemma + context), `low` (degraded mode / budget exhausted)
- Routing metadata: which model answered, why it was routed there, which stage decided
- Source attribution: Context pieces are tagged with tier and source for traceability
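The confidence mapping reduces to a small decision function. A sketch under the assumption that only the three labels above exist (the fallback branch for a no-context answer is my addition):

```python
def confidence(model: str, has_context: bool, degraded: bool) -> str:
    """Map routing outcome to the user-visible confidence label."""
    if degraded:
        return "low"      # budget exhausted, forced local fallback
    if model == "claude" and has_context:
        return "high"
    if model == "medgemma" and has_context:
        return "medium"
    return "low"          # assumption: no usable context also reads as low
```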
What Shipped
| Component | Tests | Purpose |
|---|---|---|
| `RuleRouter` | 14 | Two-stage routing with deterministic rules + bootstrap scoring |
| `ClaudeClient` | 5 | Anthropic SDK wrapper with cost estimation |
| `PHISanitizer` | 15 | Regex + spaCy NER detection with clinical context guard |
| `CloudSafetyFilter` | 7 | Allowlist-based content protection |
| `CostTracker` | 6 | Budget enforcement with circuit breaker |
| Context Assembler (Claude profile) | 9 | 28K token budget for Claude's 128K context window |
| Pipeline Integration | — | Hybrid routing wired into chat endpoint |
| Integration Tests | 6 | End-to-end routing, safety, and budget verification |
169 tests passing across the Python AI service.
What's Next: Phase 3 — Semantic Knowledge Graph
Phase 2 gave Abby a bigger brain. Phase 3 gives her deeper understanding.
Abby will gain a structured understanding of OMOP concept relationships — not just keyword matching, but true inference. She'll know that metformin is a drug for Type 2 diabetes mellitus, which is a subtype of diabetes mellitus. She'll know which domains have sparse data in your CDM and warn you before you build a cohort on a foundation of missing records.
Materialized views over the existing concept_relationship and concept_ancestor tables, a Redis cache for hot paths, and a KnowledgeGraphService with hierarchy traversal — no separate graph database needed.
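As a forward-looking sketch only: because OMOP's `concept_ancestor` table stores the transitive closure of the hierarchy, "is-a" checks become flat lookups rather than graph walks. Here is an in-memory stand-in (table contents and service shape are assumptions about unshipped Phase 3 work):

```python
# (descendant → set of all ancestors), mirroring concept_ancestor's
# precomputed transitive closure. Toy data for illustration.
ANCESTORS = {
    "Type 2 diabetes mellitus": {"Diabetes mellitus", "Disorder of glucose metabolism"},
    "Diabetes mellitus": {"Disorder of glucose metabolism"},
}

def is_a(descendant: str, ancestor: str) -> bool:
    """True if `ancestor` appears anywhere up the hierarchy.
    No traversal needed: the closure is already materialized."""
    return ancestor in ANCESTORS.get(descendant, set())
```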
