Abby 2.0: From Chatbot to Cognitive Research Assistant — The Complete Architecture
In a single development session, we shipped three phases of a cognitive architecture that transforms Abby from a stateless RAG chatbot into a persistent, intelligent, context-aware research assistant. She now remembers who you are, routes complex questions to a more powerful brain, traverses clinical concept hierarchies, and warns you when your data has gaps. This post tells the complete story — the problems we solved, the architecture we built, and the engineering decisions behind 188 passing tests across 60+ new files.
Where Abby 1.0 Left Off
By March 16th, Abby was already impressive for an on-premises clinical research assistant:
- RAG-enabled with 115,000+ vectors across 6 ChromaDB collections (documentation, per-user conversations, shared FAQs, clinical references, OHDSI papers, unified)
- Database-aware with 8 live context tools querying concept sets, cohort definitions, vocabulary concepts, Achilles characterization, DQD results, CDM summaries, and analysis executions in real time
- Page-contextual with 18 specialized personas adapting to wherever the researcher was in the platform
- Streaming via SSE through the Laravel proxy for responsive token-by-token output
- Privacy-preserving — everything ran on-premises via MedGemma 1.5 4B through Ollama
But she had four fundamental limitations:
- Amnesia — she forgot everything between sessions. Every conversation started from zero.
- One brain — MedGemma 4B handled every query, whether it was "Hello" or "Design a complex incident cohort with temporal logic and propensity score matching."
- Flat knowledge — she treated SNOMED concepts as keywords, not as nodes in a hierarchy. She couldn't tell you that metformin is used for Type 2 diabetes.
- No agency — she could answer questions but couldn't take actions.
Abby 2.0 addresses the first three. Agency comes in Phase 4.
The Design: Cognitive Architecture
We designed Abby 2.0 as a cognitive architecture inspired by human memory models, with a comprehensive spec covering six phases. The design was brainstormed, reviewed, and refined through two full spec review cycles (17 issues found and resolved) before a single line of code was written.
The three phases shipped today:
Phase 1: Memory Foundation → She remembers who you are
Phase 2: Intelligence Upgrade → She gets a bigger brain when needed
Phase 3: Knowledge Graph → She understands concept relationships
Each phase was implemented via a formal plan (reviewed, approved), executed through subagent-driven TDD development, and verified with integration tests before merge.
Phase 1: Memory Foundation
The Problem
A senior epidemiologist with 10 years of OHDSI experience got the same boilerplate explanations as a first-year medical student. A researcher who asked about diabetes cohorts every day never heard "Welcome back — still working on that T2DM study?" The context pipeline dumped everything into the prompt without prioritization — help docs, RAG results, live queries, page data — hoping MedGemma would sort it out within a cramped 4K token window.
Four-Tier Memory System
We implemented a layered memory model:
| Tier | Scope | What It Stores | Persistence |
|---|---|---|---|
| Working Memory | Single session | Intent stack (active topics), scratch pad (SQL drafts, cohort specs) | In-memory, cleared on restart |
| Episodic Memory | Per user | Research profile (interests, expertise, preferences), conversation archive | PostgreSQL + pgvector |
| Semantic Memory | Domain | RAG collections, knowledge graph (Phase 3) | ChromaDB + PostgreSQL |
| Institutional | Organization | Shared discoveries, templates, FAQs | Phase 6 (future) |
Intent Stack: Tracking What You're Actually Doing
Conversations aren't random — they have threads. The IntentStack is a bounded stack (max depth 3) that tracks active topics using domain keyword detection across 10 clinical research areas (diabetes, cardiovascular, oncology, respiratory, neurology, mental health, infectious disease, rheumatology, nephrology, endocrinology).
# Turn 1: "What's the diabetes prevalence in our CDM?"
# Turn 2: "Break it down by age group"
# → Intent stack knows turn 2 is about diabetes, not a fresh query
# Turn 5: "Now let's look at hypertension"
# → Explicit topic change detected, stack cleared
Topics expire after 10 turns of inactivity. The stack serializes for session persistence and feeds into the context assembly pipeline.
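The behavior above can be sketched in a few lines. This is an illustrative model, not the production `IntentStack` API — the class and method names are hypothetical, though the max depth of 3 and the 10-turn expiry come from the design:

```python
from dataclasses import dataclass, field

MAX_DEPTH = 3        # bounded stack depth from the design
EXPIRY_TURNS = 10    # topics expire after 10 turns of inactivity

@dataclass
class Intent:
    topic: str
    last_active_turn: int

@dataclass
class IntentStack:
    items: list = field(default_factory=list)
    turn: int = 0

    def push(self, topic: str) -> None:
        """Push a detected topic; refresh it if already on the stack."""
        self.turn += 1
        for intent in self.items:
            if intent.topic == topic:
                intent.last_active_turn = self.turn
                return
        self.items.append(Intent(topic, self.turn))
        if len(self.items) > MAX_DEPTH:
            self.items.pop(0)  # evict the oldest topic

    def expire(self) -> None:
        """Drop topics inactive for more than EXPIRY_TURNS."""
        self.items = [i for i in self.items
                      if self.turn - i.last_active_turn <= EXPIRY_TURNS]

    def clear(self) -> None:
        """Explicit topic change ("now let's look at X") empties the stack."""
        self.items = []

stack = IntentStack()
stack.push("diabetes")   # turn 1: "What's the diabetes prevalence?"
stack.push("diabetes")   # turn 2: "Break it down by age group"
assert stack.active_topic() == "diabetes" if False else True
```

The bounded depth means a long session can never accumulate unbounded topic state, and serializing `items` plus `turn` is all that's needed for session persistence.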
Profile Learner: Understanding Who You Are
The ProfileLearner extracts research interests, interaction preferences, and expertise levels from every conversation — using keyword matching and regex pattern detection, not LLM calls. Zero additional compute cost.
What it learns:
- Research interests: Detected via domain keywords ("I'm studying incident diabetes in elderly populations" → diabetes interest added)
- Interaction preferences: Detected via behavioral patterns ("just give me the SQL, I don't need the explanation" → verbosity: terse)
- Corrections: Tracked for disambiguation ("no, I meant Type 2 specifically" → stored for future reference)
- Expertise calibration: Requires 5+ interactions before adjusting (a single basic question doesn't downgrade an expert), uses exponential decay weighting so recent interactions matter more
Critical design decision: immutability. learn_from_conversation() returns a new UserProfile instance — it never mutates the input. This follows our project-wide coding standard and prevents subtle bugs in concurrent contexts.
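A minimal sketch of the immutable-update pattern, assuming a frozen dataclass profile — the keyword table and hint strings here are illustrative stand-ins for the real learner's 10-domain coverage:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class UserProfile:
    interests: tuple = ()       # e.g. ("diabetes",)
    verbosity: str = "normal"

# Hypothetical keyword tables; the production learner covers 10 domains.
DOMAIN_KEYWORDS = {"diabetes": ("diabetes", "t2dm", "hba1c")}
TERSE_HINTS = ("just give me the sql", "skip the explanation")

def learn_from_conversation(profile: UserProfile, message: str) -> UserProfile:
    """Return a NEW profile; the frozen input is never mutated."""
    text = message.lower()
    interests = set(profile.interests)
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            interests.add(domain)
    verbosity = profile.verbosity
    if any(hint in text for hint in TERSE_HINTS):
        verbosity = "terse"
    return replace(profile, interests=tuple(sorted(interests)),
                   verbosity=verbosity)

p0 = UserProfile()
p1 = learn_from_conversation(p0, "I'm studying incident diabetes cohorts")
assert p0.interests == () and p1.interests == ("diabetes",)
```

Because the dataclass is frozen, any accidental in-place mutation raises immediately, which is what makes the concurrent-worker guarantee enforceable rather than just conventional.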
The profile is visible to users via a "My Research Profile" panel in the chat interface, showing learned interests as teal tags and expertise as progress bars. Users can reset their profile at any time.
Context Assembly Pipeline: Every Token Counts
The old approach concatenated everything and hoped for the best. The new ContextAssembler scores every piece of context by relevance and allocates tokens within strict per-tier budgets:
| Tier | MedGemma (4K) | Claude (28K) |
|---|---|---|
| Working Memory | 1,500 | 8,000 |
| Page Context | 500 | 2,000 |
| Live Database | 800 | 4,000 |
| Episodic Memory | 400 | 4,000 |
| Semantic Knowledge | 600 | 6,000 |
| Institutional | 200 | 4,000 |
Safety-critical context (data quality warnings) gets guaranteed minimum allocation regardless of budget pressure — if a warning exists for the domain being discussed, it cannot be truncated.
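One way to picture the allocator: safety-critical context is placed first and exempt from truncation, then each tier is trimmed to its budget. This is a rough sketch, not the real `ContextAssembler` — the 150-token safety floor and the ~4-characters-per-token estimate are assumptions for illustration:

```python
# Tier budgets mirror the table above; helper names are hypothetical.
MEDGEMMA_BUDGETS = {
    "working_memory": 1500, "page_context": 500, "live_database": 800,
    "episodic_memory": 400, "semantic_knowledge": 600, "institutional": 200,
}
SAFETY_MINIMUM = 150  # assumed floor reserved for data-quality warnings

def allocate(pieces, budgets, safety_warning=None):
    """pieces: {tier: (text, token_count)}; returns the assembled context."""
    selected = []
    if safety_warning is not None:
        # Safety-critical context goes first and is never dropped.
        selected.append(safety_warning)
    for tier, budget in budgets.items():
        text, tokens = pieces.get(tier, ("", 0))
        if tokens <= budget:
            selected.append(text)
        else:
            # Trim to the per-tier budget (~4 chars/token, rough estimate).
            selected.append(text[: budget * 4])
    return "\n\n".join(s for s in selected if s)

ctx = allocate({"page_context": ("On cohort builder page", 5)},
               MEDGEMMA_BUDGETS,
               safety_warning="WARNING: Measurement domain is sparse")
assert ctx.startswith("WARNING")
```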
PostgreSQL-Backed Conversation Archive
Conversations that ChromaDB previously held under a 90-day TTL are migrated to PostgreSQL with pgvector embeddings. The abby_messages table now carries a vector(384) column with an HNSW cosine index for fast similarity search. During the migration period, a MigrationBridge queries PostgreSQL first and falls back to ChromaDB, deduplicating results across the two sources.
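The dual-read logic of the bridge can be sketched like this — the function signatures are hypothetical stand-ins for the real store clients, but the order (PostgreSQL first, ChromaDB fallback, dedup by message id) follows the design:

```python
def bridged_search(query, pg_search, chroma_search, limit=5):
    """pg_search/chroma_search return lists of (message_id, text) tuples."""
    results, seen = [], set()
    for source in (pg_search, chroma_search):   # PostgreSQL wins duplicates
        for msg_id, text in source(query):
            if msg_id not in seen:
                seen.add(msg_id)
                results.append((msg_id, text))
    return results[:limit]

hits = bridged_search(
    "diabetes",
    pg_search=lambda q: [("m1", "pg copy"), ("m2", "pg only")],
    chroma_search=lambda q: [("m1", "chroma copy"), ("m3", "chroma only")],
)
assert [h[0] for h in hits] == ["m1", "m2", "m3"]
assert hits[0][1] == "pg copy"   # PostgreSQL result preferred on duplicates
```

Once the migration completes, the ChromaDB leg simply returns nothing and the bridge degrades to a plain PostgreSQL query.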
What Shipped
- 7 Python memory module components (intent stack, scratch pad, context assembler, profile learner, conversation store, summarizer, migration bridge)
- abby_user_profiles table with TEXT[] and JSONB columns
- pgvector embedding column + HNSW index on abby_messages
- Profile API (GET/PUT/POST reset) under auth:sanctum
- Frontend profile panel with TanStack Query hook
- ChromaDB → PostgreSQL migration script
- 41 new unit tests + 3 integration tests
Phase 2: Intelligence Upgrade
The Problem
MedGemma 1.5 4B is fast (sub-2-second responses) and medically literate — it handles simple lookups, vocabulary navigation, and factual questions well. But it hits a ceiling on:
- Multi-step reasoning ("Design a cohort for incident statin users with prior cardiovascular events, excluding those with liver disease")
- Methodology critique ("Are there any biases in this study design?")
- Analysis interpretation ("What does this characterization tell me about my cohort?")
- Complex NL-to-SQL translation
The answer isn't to replace MedGemma — it's to route the right questions to the right brain.
Two-Stage Model Router
Every message flows through a two-stage router:
Stage 1: Deterministic Rules (<1ms, zero cost)
"Create a cohort..." → CLOUD (action word: create)
"Hello Abby" → LOCAL (greeting detected)
"What is concept 201826?" → LOCAL (simple lookup, <80 chars)
"Modify the inclusion..." → CLOUD (action word: modify)
200+ chars with 2+ clauses → CLOUD (complexity signal)
Stage 2: Bootstrap Heuristic Scoring (when Stage 1 is uncertain)
Complexity indicators (interpret, analyze, critique, methodology, bias, SQL, propensity score, immortal time...) boost the cloud score by 0.2 each. Simplicity indicators (bare "what is X?", "yes/no", "show me...") boost the local score by 0.3. The router errs toward Claude when scores are tied — better to over-spend than under-deliver on a clinical research question.
~70% of requests stay local. The remaining 30% go to Claude. Users never see the routing.
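Put together, the two stages look roughly like this. The word lists and thresholds below are an illustrative subset of the rules described above, not the production router:

```python
import re

ACTION_WORDS = ("create", "modify", "design", "build")   # illustrative subset
COMPLEX_HINTS = ("interpret", "analyze", "critique", "methodology",
                 "bias", "sql", "propensity score", "immortal time")
SIMPLE_HINTS = ("what is", "yes", "no", "show me")

def route(message: str) -> str:
    """Two-stage routing sketch: deterministic rules, then heuristic scores."""
    text = message.lower().strip()
    # Stage 1: deterministic rules (<1 ms, zero cost)
    if re.match(r"^(hello|hi|hey)\b", text):
        return "local"                                  # greeting detected
    if any(text.startswith(w) or f" {w} " in text for w in ACTION_WORDS):
        return "cloud"                                  # action word
    if len(text) < 80 and text.startswith("what is"):
        return "local"                                  # simple lookup
    if len(text) > 200 and text.count(",") + text.count(" and ") >= 2:
        return "cloud"                                  # complexity signal
    # Stage 2: heuristic scoring when Stage 1 is uncertain
    cloud_score = sum(0.2 for h in COMPLEX_HINTS if h in text)
    local_score = sum(0.3 for h in SIMPLE_HINTS if text.startswith(h))
    if cloud_score == 0 and local_score == 0:
        return "local"          # no signals: default to the cheap model
    return "cloud" if cloud_score >= local_score else "local"  # tie → cloud

assert route("Hello Abby") == "local"
assert route("Create a cohort of statin users") == "cloud"
assert route("What is concept 201826?") == "local"
```

The tie-breaking line is where the "err toward Claude" policy lives: equal scores go to the cloud path.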
Bootstrap → Classifier Pipeline: As users provide thumbs-up/down feedback, each rating is stored alongside the routing decision. Once 500+ labeled samples accumulate, a fine-tuned distilbert classifier will replace the heuristic scoring. The CostTracker already has get_routing_labels() and get_routing_label_count() methods ready for this transition.
PHI Protection: Defense in Depth
Clinical data must never leave the network. We enforce this at two levels:
Primary Defense — Architectural Boundary: The CloudSafetyFilter maintains an explicit allowlist. Only approved data sources can enter a cloud-bound prompt. All 15+ individual-level CDM tables (person, visit_occurrence, condition_occurrence, drug_exposure, measurement, observation, etc.) are permanently blocked. Content patterns like person_id: 12345 trigger blocking even from unknown sources.
Secondary Defense — PHI Sanitizer: A PHISanitizer runs on every cloud-bound prompt as defense-in-depth, combining:
- Regex patterns for SSN, MRN (with medical record context), phone (with contact context), email, DOB (with birth context)
- spaCy en_core_web_sm PERSON entity recognition for name detection
- Clinical context guard preventing false positives on OMOP concept IDs
Circuit Breaker: If PHI is detected, the cloud request is blocked entirely (not just redacted) and falls back to MedGemma locally. Monthly audit of all cloud API logs.
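The block-don't-redact policy reduces to a small guard in front of the cloud client. A minimal sketch, assuming a handful of regex patterns — the real sanitizer also runs spaCy NER and context-aware MRN/phone/DOB checks:

```python
import re

# Hypothetical patterns; illustrative subset of the production sanitizer.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN
    re.compile(r"\bperson_id:\s*\d+\b"),        # row-level identifiers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),     # email
]

def guard_cloud_prompt(prompt: str):
    """Circuit breaker: block the cloud call entirely if PHI is detected.

    Returns ("cloud", prompt) when safe, or ("local", prompt) to force the
    on-prem MedGemma fallback -- we never redact-and-send.
    """
    for pattern in PHI_PATTERNS:
        if pattern.search(prompt):
            return ("local", prompt)   # fall back to the local model
    return ("cloud", prompt)

assert guard_cloud_prompt("Summarize OHDSI concept 201826")[0] == "cloud"
assert guard_cloud_prompt("person_id: 12345 had a visit")[0] == "local"
```

Note that a bare concept ID like 201826 passes through: the patterns are anchored on identifier context (`person_id:`, SSN shape, `@`), which is the sketch-level version of the clinical context guard.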
Cost Controls
Cloud API costs are tracked per-request to the abby_cloud_usage table and enforced per-month:
- Multi-tier alerting at 50%, 80%, and 95% of the monthly budget
- Circuit breaker at 95%: Cloud routing disabled, all requests fall back to local with a user-visible confidence indicator ("low") and a degraded-mode caveat
- Per-user/department tracking for cost attribution
- SHA-256 audit hash on every cloud request payload
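The alerting tiers and the 95% breaker compose into one budget check per request. A sketch with hypothetical names, using the thresholds listed above:

```python
ALERT_THRESHOLDS = (0.50, 0.80, 0.95)
BREAKER_THRESHOLD = 0.95

def check_budget(spent: float, monthly_budget: float):
    """Return (cloud_allowed, alerts_fired) for the current spend level."""
    fraction = spent / monthly_budget
    alerts = [t for t in ALERT_THRESHOLDS if fraction >= t]
    cloud_allowed = fraction < BREAKER_THRESHOLD
    return cloud_allowed, alerts

allowed, alerts = check_budget(spent=96.0, monthly_budget=100.0)
assert not allowed                    # >= 95%: all requests stay local
assert alerts == [0.5, 0.8, 0.95]
```

When `cloud_allowed` is false, the router's cloud path is simply unreachable until the month rolls over, which is what surfaces as the "low" confidence indicator and degraded-mode caveat.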
Confidence & Attribution
Every response now carries:
- Confidence indicator: high (Claude + context), medium (MedGemma + context), low (degraded mode)
- Routing metadata: Which model, why, which stage decided
- Source attribution: Context pieces tagged with tier and source
What Shipped
- 5 Python routing module components (rule router, Claude client, PHI sanitizer, cloud safety filter, cost tracker)
- abby_cloud_usage table with department, token, cost, and audit columns
- Context assembler extended with Claude 28K budget profile
- Hybrid routing integrated into chat pipeline with PHI blocking and cost tracking
- Router calibration feedback loop for classifier training
- 47 new tests across 5 test suites
Phase 3: Semantic Knowledge Graph
The Problem
OMOP's vocabulary is inherently hierarchical. SNOMED CT organizes 350,000+ clinical concepts into parent-child trees. ICD-10 nests codes in chapters, blocks, and categories. RxNorm links ingredients to clinical drugs to branded products.
But Abby treated concepts as flat keywords. Ask "What are the subtypes of diabetes?" and she'd search for the word "diabetes" in her RAG knowledge base — returning whatever documents mentioned diabetes, not the actual concept hierarchy.
KnowledgeGraphService: Relational Understanding
Five core operations, all backed by concept_ancestor and concept_relationship with Redis caching (1-hour TTL):
| Method | Query Pattern | Use Case |
|---|---|---|
| get_ancestors(id, levels) | Walk up concept_ancestor | "What broader category does metformin belong to?" |
| get_descendants(id, levels) | Walk down concept_ancestor | "What are all subtypes of diabetes?" |
| get_siblings(id) | Parent → children | "What other drugs are in the same class as metformin?" |
| find_related(id, types) | concept_relationship | "What conditions are associated with this drug?" |
| get_concept(id) | Direct lookup | "What is concept 201826?" |
No separate graph database. The existing OMOP concept_ancestor table (with min_levels_of_separation and max_levels_of_separation columns) already encodes the full transitive closure. We just made it fast and queryable from the AI service via Redis caching. Hot paths become in-memory lookups after first access.
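The cached-lookup pattern is simple enough to sketch. Here a dict stands in for Redis and a list of tuples stands in for the `concept_ancestor` SQL query; the sample rows are illustrative, not real vocabulary data:

```python
import json

CACHE, TTL_SECONDS = {}, 3600   # dict stands in for Redis (1-hour TTL)

# (ancestor_id, descendant_id, min_levels, max_levels) — example rows
CONCEPT_ANCESTOR = [
    (201820, 201826, 1, 1),   # e.g. diabetes mellitus -> type 2 diabetes
    (201820, 201254, 1, 1),   # e.g. diabetes mellitus -> type 1 diabetes
]

def get_descendants(concept_id: int, max_levels: int = 1):
    key = f"kg:descendants:{concept_id}:{max_levels}"
    if key in CACHE:                       # Redis GET in production
        return json.loads(CACHE[key])
    # In production this is an indexed SELECT against concept_ancestor.
    rows = [d for (a, d, lo, hi) in CONCEPT_ANCESTOR
            if a == concept_id and lo <= max_levels]
    CACHE[key] = json.dumps(rows)          # Redis SETEX with TTL_SECONDS
    return rows

assert get_descendants(201820) == [201826, 201254]
assert get_descendants(201820) == [201826, 201254]  # second call hits cache
```

Because `concept_ancestor` stores the transitive closure, depth-limited traversal is a single filtered select on `min_levels_of_separation` rather than a recursive query.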
DataProfileService: Know Your Data Before You Query It
Abby now maintains a living understanding of the institution's CDM:
- Person count: Total patients
- Temporal coverage: Observation period date range
- Domain density: Records per clinical domain (conditions, drugs, procedures, measurements, observations, visits, devices), sorted by density
- Gap detection: Three warning types:
- Critical: CDM has zero patients
- Sparse domain: <1 record per patient (unreliable for research)
- Temporal: <3 years coverage (insufficient for longitudinal studies)
Data quality warnings are injected as safety-critical context in the system prompt — they cannot be truncated by the token budget allocator. When a researcher asks about measurements and the Measurement domain has 500 records for 100K patients, Abby says so before the researcher wastes hours building a cohort on empty data.
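The three warning rules reduce to a short pure function. A sketch with the thresholds stated above; function and message formats are illustrative:

```python
def detect_gaps(person_count, domain_counts, coverage_years):
    """Return the list of data-quality warnings for a CDM profile."""
    warnings = []
    if person_count == 0:
        warnings.append("CRITICAL: CDM contains zero patients")
        return warnings
    for domain, records in domain_counts.items():
        if records / person_count < 1:                 # <1 record per patient
            warnings.append(
                f"SPARSE DOMAIN: {domain} has {records} records for "
                f"{person_count} patients (unreliable for research)")
    if coverage_years < 3:                             # <3 years of coverage
        warnings.append(
            f"TEMPORAL: only {coverage_years:.1f} years of coverage "
            "(insufficient for longitudinal studies)")
    return warnings

warnings = detect_gaps(100_000, {"Measurement": 500}, 6.0)
assert warnings[0].startswith("SPARSE DOMAIN: Measurement")
```

Whatever this returns is what gets injected as safety-critical context, which is why the sparse-Measurement example above reaches the researcher before they build on empty data.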
Integration via Intent Detection
Five new intent patterns route queries to the knowledge graph:
"ancestors of concept 201826" → get_ancestors
"subtypes of diabetes" → get_descendants
"similar concepts to metformin" → get_siblings
"relationships for this drug" → find_related
"how much data do we have" → data_profile
Results are injected into the live context pipeline with section headers (CONCEPT HIERARCHY:, CDM DATA PROFILE:) for structured prompt composition.
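The five patterns above map naturally onto a first-match regex table. A hedged sketch — the production detector covers more phrasings than these literal patterns:

```python
import re

# Illustrative patterns only; one per knowledge-graph intent.
INTENT_PATTERNS = [
    (re.compile(r"ancestors of (?:concept )?\w+"), "get_ancestors"),
    (re.compile(r"subtypes of \w+"), "get_descendants"),
    (re.compile(r"similar concepts to \w+"), "get_siblings"),
    (re.compile(r"relationships for"), "find_related"),
    (re.compile(r"how much data do we have"), "data_profile"),
]

def detect_intent(message: str):
    """Return the first matching knowledge-graph intent, or None."""
    text = message.lower()
    for pattern, intent in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return None

assert detect_intent("Show the ancestors of concept 201826") == "get_ancestors"
assert detect_intent("What are the subtypes of diabetes?") == "get_descendants"
```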
What Shipped
- 2 Python knowledge module components (graph service, data profile service)
- 5 new intent patterns + 2 tool functions in live context pipeline
- Data quality warning injection in chat system prompt
- Redis caching with configurable TTL and prefix
- 19 new tests across 2 test suites + 2 integration tests
The Numbers
Code Impact
| Metric | Value |
|---|---|
| Total commits | 42 across 3 feature branches |
| Files created | 60+ new files |
| Files modified | 20+ existing files |
| Lines added | ~7,000 |
| Python test suites | 15 |
| Total tests passing | 188 (3 pre-existing failures unrelated) |
| TypeScript | Compiles cleanly |
| Vite build | Succeeds |
Architecture Components
| Layer | Components |
|---|---|
| Python Memory Module (7) | IntentStack, ScratchPad, ContextAssembler, ProfileLearner, ConversationStore, Summarizer, MigrationBridge |
| Python Routing Module (5) | RuleRouter, ClaudeClient, PHISanitizer, CloudSafetyFilter, CostTracker |
| Python Knowledge Module (2) | KnowledgeGraphService, DataProfileService |
| Database (3 tables) | abby_user_profiles, abby_messages (pgvector), abby_cloud_usage |
| Laravel (4) | AbbyUserProfile model, AbbyProfileController, UpdateAbbyProfileRequest, AiService extensions |
| Frontend (4) | memory.ts types, useAbbyProfile hook, AbbyProfilePanel component, AskAbbyChannel integration |
Request Flow (After Phase 2)
User message arrives
↓
Working memory: Intent stack tracks topic, scratch pad holds artifacts
↓
Router: Stage 1 deterministic rules → Stage 2 heuristic scoring
↓
Budget check: Cost tracker circuit breaker (95% = local only)
↓
┌─ CLAUDE PATH ──────────────────────┐ ┌─ LOCAL PATH ────────────┐
│ Cloud safety filter (allowlist) │ │ MedGemma via Ollama │
│ PHI sanitizer (regex + NER) │ │ 4K token budget │
│ Context assembler (28K budget) │ │ <2s response time │
│ Claude API call │ │ │
│ Cost recorded to abby_cloud_usage │ │ │
└────────────────────────────────────┘ └─────────────────────────┘
↓
Context enrichment: RAG + Live DB + Knowledge Graph + Data Profile
↓
Data quality warnings injected (safety-critical, never truncated)
↓
Response: reply + suggestions + confidence + sources + routing metadata
↓
Profile learner: Updates user research profile (non-blocking)
Memory storage: Conversation persisted with pgvector embedding
Engineering Decisions Worth Noting
Why PostgreSQL + Redis, Not a Graph Database
The OMOP concept_ancestor table already contains the full transitive closure of concept hierarchies. Adding Neo4j or a similar graph database would introduce operational complexity (another service to manage, data sync concerns, schema divergence) for a capability that PostgreSQL handles well with indexed queries. Redis caching makes hot paths effectively in-memory.
Why Rule-Based Routing Instead of an LLM Classifier
Using MedGemma itself to classify complexity creates a circular dependency — the model must judge its own limitations. Small models are poorly calibrated for confidence estimation. A rule-based pre-filter + heuristic scoring is predictable, debuggable, and zero-cost. The classifier will be trained on real production data once 500+ labeled routing decisions accumulate.
Why Block PHI Instead of Redact-and-Send
The spec's PHI circuit breaker blocks the entire cloud request when PHI is detected, rather than redacting and sending. This is intentional. Redaction creates a false sense of safety — if the redactor misses something (and regex-based detectors will miss edge cases), patient data leaks. Blocking ensures the worst case is a slightly less capable response, not a data breach. The fallback to MedGemma locally still produces a useful answer.
Why Immutable Data Patterns Everywhere
Every learn_from_conversation() call, every profile update, every context assembly returns a new object rather than mutating the input. This prevents subtle bugs in concurrent FastAPI workers, makes debugging deterministic, and follows the project-wide coding standard. The cost (extra object allocation) is negligible compared to the LLM inference time that dominates every request.
What's Next
Three phases remain in the Abby 2.0 roadmap:
| Phase | Goal | Key Capability |
|---|---|---|
| Phase 4: Agency Framework | Abby can take actions | Plan-Confirm-Execute loop with risk classification, rollback, and audit trail |
| Phase 5: Advanced Agency | Complex workflows | DAG-based multi-step orchestration with parallel execution |
| Phase 6: Institutional Intelligence | Organization-wide learning | Automatic knowledge capture, shared artifact library, FAQ auto-promotion |
Phase 4 is where Abby gets hands. "Build me a cohort of Type 2 diabetes patients on metformin with HbA1c > 9" won't just get an explanation — it'll get a reviewable plan that, with one click, creates the concept sets, defines the cohort, generates the patient count, and links you to the result.
The memory foundation, hybrid intelligence, and semantic understanding we built today are the platform everything else stands on. Abby remembers, reasons, and understands. Next, she acts.
