Abby 2.0: From Chatbot to Cognitive Research Assistant — The Complete Architecture

15 min read
Creator, Parthenon
AI Development Assistant

In a single development session, we shipped three phases of a cognitive architecture that transforms Abby from a stateless RAG chatbot into a persistent, intelligent, context-aware research assistant. She now remembers who you are, routes complex questions to a more powerful brain, traverses clinical concept hierarchies, and warns you when your data has gaps. This post tells the complete story — the problems we solved, the architecture we built, and the engineering decisions behind 188 passing tests across 60+ new files.

Abby AI assistant

Where Abby 1.0 Left Off

By March 16th, Abby was already impressive for an on-premises clinical research assistant:

  • RAG-enabled with 115,000+ vectors across 6 ChromaDB collections (documentation, per-user conversations, shared FAQs, clinical references, OHDSI papers, unified)
  • Database-aware with 8 live context tools querying concept sets, cohort definitions, vocabulary concepts, Achilles characterization, DQD results, CDM summaries, and analysis executions in real time
  • Page-contextual with 18 specialized personas adapting to wherever the researcher was in the platform
  • Streaming via SSE through the Laravel proxy for responsive token-by-token output
  • Privacy-preserving — everything ran on-premises via MedGemma 1.5 4B through Ollama

But she had four fundamental limitations:

  1. Amnesia — she forgot everything between sessions. Every conversation started from zero.
  2. One brain — MedGemma 4B handled every query, whether it was "Hello" or "Design a complex incident cohort with temporal logic and propensity score matching."
  3. Flat knowledge — she treated SNOMED concepts as keywords, not as nodes in a hierarchy. She couldn't tell you that metformin is used for Type 2 diabetes.
  4. No agency — she could answer questions but couldn't take actions.

Abby 2.0 addresses the first three. Agency comes in Phase 4.


The Design: Cognitive Architecture

We designed Abby 2.0 as a cognitive architecture inspired by human memory models, with a comprehensive spec covering six phases. The design was brainstormed, reviewed, and refined through two full spec review cycles (17 issues found and resolved) before a single line of code was written.

The three phases shipped today:

Phase 1: Memory Foundation    → She remembers who you are
Phase 2: Intelligence Upgrade → She gets a bigger brain when needed
Phase 3: Knowledge Graph      → She understands concept relationships

Each phase was implemented via a formal plan (reviewed, approved), executed through subagent-driven TDD development, and verified with integration tests before merge.


Phase 1: Memory Foundation

The Problem

A senior epidemiologist with 10 years of OHDSI experience got the same boilerplate explanations as a first-year medical student. A researcher who asked about diabetes cohorts every day never heard "Welcome back — still working on that T2DM study?" The context pipeline dumped everything into the prompt without prioritization — help docs, RAG results, live queries, page data — hoping MedGemma would sort it out within a cramped 4K token window.

Four-Tier Memory System

We implemented a layered memory model:

| Tier | Scope | What It Stores | Persistence |
|---|---|---|---|
| Working Memory | Single session | Intent stack (active topics), scratch pad (SQL drafts, cohort specs) | In-memory, cleared on restart |
| Episodic Memory | Per user | Research profile (interests, expertise, preferences), conversation archive | PostgreSQL + pgvector |
| Semantic Memory | Domain | RAG collections, knowledge graph (Phase 3) | ChromaDB + PostgreSQL |
| Institutional | Organization | Shared discoveries, templates, FAQs | Phase 6 (future) |

Intent Stack: Tracking What You're Actually Doing

Conversations aren't random — they have threads. The IntentStack is a bounded stack (max depth 3) that tracks active topics using domain keyword detection across 10 clinical research areas (diabetes, cardiovascular, oncology, respiratory, neurology, mental health, infectious disease, rheumatology, nephrology, endocrinology).

# Turn 1: "What's the diabetes prevalence in our CDM?"
# Turn 2: "Break it down by age group"
# → Intent stack knows turn 2 is about diabetes, not a fresh query
# Turn 5: "Now let's look at hypertension"
# → Explicit topic change detected, stack cleared

Topics expire after 10 turns of inactivity. The stack serializes for session persistence and feeds into the context assembly pipeline.
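The turn-by-turn behavior above can be sketched as a small bounded stack. This is a minimal illustration, not the shipped module: the keyword table here covers only two of the ten domains, and explicit topic-change detection is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_DEPTH = 3       # bounded stack depth, per the design above
EXPIRY_TURNS = 10   # topics expire after 10 turns of inactivity

# Hypothetical keyword table -- the real module covers 10 clinical domains.
DOMAIN_KEYWORDS = {
    "diabetes": {"diabetes", "t2dm", "hba1c", "metformin"},
    "cardiovascular": {"hypertension", "statin", "myocardial"},
}

@dataclass
class Topic:
    domain: str
    last_turn: int

@dataclass
class IntentStack:
    topics: list = field(default_factory=list)
    turn: int = 0

    def push_message(self, message: str) -> None:
        """Detect a domain in the message and push/refresh it on the stack."""
        self.turn += 1
        words = set(message.lower().split())
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if words & keywords:
                # Refresh an existing topic rather than duplicating it
                self.topics = [t for t in self.topics if t.domain != domain]
                self.topics.append(Topic(domain, self.turn))
        # Enforce the depth bound: oldest topics fall off first
        self.topics = self.topics[-MAX_DEPTH:]
        # Drop topics idle for more than EXPIRY_TURNS
        self.topics = [t for t in self.topics
                       if self.turn - t.last_turn <= EXPIRY_TURNS]

    def active_topic(self) -> Optional[str]:
        return self.topics[-1].domain if self.topics else None
```

With this sketch, "Break it down by age group" on turn 2 leaves the diabetes topic active, so the follow-up inherits its context.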

Profile Learner: Understanding Who You Are

The ProfileLearner extracts research interests, interaction preferences, and expertise levels from every conversation — using keyword matching and regex pattern detection, not LLM calls. Zero additional compute cost.

What it learns:

  • Research interests: Detected via domain keywords ("I'm studying incident diabetes in elderly populations" → diabetes interest added)
  • Interaction preferences: Detected via behavioral patterns ("just give me the SQL, I don't need the explanation" → verbosity: terse)
  • Corrections: Tracked for disambiguation ("no, I meant Type 2 specifically" → stored for future reference)
  • Expertise calibration: Requires 5+ interactions before adjusting (a single basic question doesn't downgrade an expert), uses exponential decay weighting so recent interactions matter more

Critical design decision: immutability. learn_from_conversation() returns a new UserProfile instance — it never mutates the input. This follows our project-wide coding standard and prevents subtle bugs in concurrent contexts.
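The immutability contract can be made structural with a frozen dataclass. A minimal sketch, assuming simplified pattern tables (the shipped learner covers ten domains, correction tracking, and decay-weighted expertise calibration):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class UserProfile:
    """Frozen dataclass: any 'update' must construct a new instance."""
    interests: tuple = ()
    verbosity: str = "normal"

# Hypothetical pattern tables, far smaller than the real ones.
INTEREST_KEYWORDS = {"diabetes": ("diabetes", "t2dm", "metformin")}
TERSE_PATTERNS = ("just give me the sql", "skip the explanation")

def learn_from_conversation(profile: UserProfile, message: str) -> UserProfile:
    """Return a NEW profile; the input is never mutated."""
    text = message.lower()
    interests = set(profile.interests)
    for domain, keywords in INTEREST_KEYWORDS.items():
        if any(k in text for k in keywords):
            interests.add(domain)
    verbosity = profile.verbosity
    if any(p in text for p in TERSE_PATTERNS):
        verbosity = "terse"
    # dataclasses.replace builds a fresh instance with the changed fields
    return replace(profile, interests=tuple(sorted(interests)),
                   verbosity=verbosity)
```

Because the dataclass is frozen, accidental mutation raises immediately instead of corrupting a profile shared across concurrent workers.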

The profile is visible to users via a "My Research Profile" panel in the chat interface, showing learned interests as teal tags and expertise as progress bars. Users can reset their profile at any time.

Context Assembly Pipeline: Every Token Counts

The old approach concatenated everything and hoped for the best. The new ContextAssembler scores every piece of context by relevance and allocates tokens within strict per-tier budgets:

| Tier | MedGemma (4K) | Claude (28K) |
|---|---|---|
| Working Memory | 1,500 | 8,000 |
| Page Context | 500 | 2,000 |
| Live Database | 800 | 4,000 |
| Episodic Memory | 400 | 4,000 |
| Semantic Knowledge | 600 | 6,000 |
| Institutional | 200 | 4,000 |

Safety-critical context (data quality warnings) gets guaranteed minimum allocation regardless of budget pressure — if a warning exists for the domain being discussed, it cannot be truncated.
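The allocation rule can be sketched in a few lines. This is a simplified model under assumed interfaces: pieces arrive as tuples, and safety-critical items bypass the budget check entirely rather than receiving a separate floor.

```python
# Per-tier token budgets from the table above (MedGemma 4K profile).
MEDGEMMA_BUDGETS = {
    "working_memory": 1500, "page_context": 500, "live_database": 800,
    "episodic": 400, "semantic": 600, "institutional": 200,
}

def assemble(pieces, budgets):
    """pieces: list of (tier, text, token_count, relevance, is_safety).
    Returns the texts that fit, admitting safety-critical context first."""
    chosen, used = [], {tier: 0 for tier in budgets}
    # Safety-critical pieces sort first, then highest relevance
    ordered = sorted(pieces, key=lambda p: (not p[4], -p[3]))
    for tier, text, tokens, relevance, is_safety in ordered:
        if is_safety or used[tier] + tokens <= budgets.get(tier, 0):
            chosen.append(text)
            used[tier] += tokens
    return chosen
```

A data quality warning larger than its tier's remaining budget is still admitted; lower-relevance pieces in that tier are what get squeezed out.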

PostgreSQL-Backed Conversation Archive

Conversations previously held in ChromaDB under a 90-day TTL are migrated to PostgreSQL with pgvector embeddings. The abby_messages table now carries a vector(384) column with an HNSW cosine index for fast similarity search. During the migration window, a MigrationBridge queries PostgreSQL first and falls back to ChromaDB, deduplicating results across the two sources.
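The bridge's read path is roughly the following sketch. The store interfaces (`similarity_search` returning dicts with an `id` key) are assumed stand-ins, not the actual module API.

```python
def bridged_search(query_embedding, pg_store, chroma_store, k=5):
    """Return up to k similar messages, preferring the PostgreSQL copy."""
    results = pg_store.similarity_search(query_embedding, k)
    seen = {m["id"] for m in results}
    if len(results) < k:
        # Fall back to ChromaDB only for messages not yet migrated
        for m in chroma_store.similarity_search(query_embedding, k):
            if m["id"] not in seen:   # dedupe across sources
                results.append(m)
                seen.add(m["id"])
            if len(results) == k:
                break
    return results
```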

What Shipped

  • 7 Python memory module components (intent stack, scratch pad, context assembler, profile learner, conversation store, summarizer, migration bridge)
  • abby_user_profiles table with TEXT[] and JSONB columns
  • pgvector embedding column + HNSW index on abby_messages
  • Profile API (GET/PUT/POST reset) under auth:sanctum
  • Frontend profile panel with TanStack Query hook
  • ChromaDB → PostgreSQL migration script
  • 41 new unit tests + 3 integration tests

Phase 2: Intelligence Upgrade

The Problem

MedGemma 1.5 4B is fast (sub-2-second responses) and medically literate — it handles simple lookups, vocabulary navigation, and factual questions well. But it hits a ceiling on:

  • Multi-step reasoning ("Design a cohort for incident statin users with prior cardiovascular events, excluding those with liver disease")
  • Methodology critique ("Are there any biases in this study design?")
  • Analysis interpretation ("What does this characterization tell me about my cohort?")
  • Complex NL-to-SQL translation

The answer isn't to replace MedGemma — it's to route the right questions to the right brain.

Two-Stage Model Router

Every message flows through a two-stage router:

Stage 1: Deterministic Rules (<1ms, zero cost)

"Create a cohort..."        → CLOUD  (action word: create)
"Hello Abby"                → LOCAL  (greeting detected)
"What is concept 201826?"   → LOCAL  (simple lookup, <80 chars)
"Modify the inclusion..."   → CLOUD  (action word: modify)
200+ chars with 2+ clauses  → CLOUD  (complexity signal)

Stage 2: Bootstrap Heuristic Scoring (when Stage 1 is uncertain)

Complexity indicators (interpret, analyze, critique, methodology, bias, SQL, propensity score, immortal time...) boost the cloud score by 0.2 each. Simplicity indicators (bare "what is X?", "yes/no", "show me...") boost the local score by 0.3. The router errs toward Claude when scores are tied — better to over-spend than under-deliver on a clinical research question.

~70% of requests stay local. The remaining 30% go to Claude. Users never see the routing.
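The two stages compose into something like the sketch below. The keyword lists, thresholds, and clause heuristic are illustrative stand-ins for the shipped rules.

```python
ACTION_WORDS = ("create", "modify", "design", "build")
COMPLEX_HINTS = ("interpret", "analyze", "critique", "methodology",
                 "bias", "propensity", "immortal time")
SIMPLE_HINTS = ("what is", "show me")

def route(message: str) -> str:
    """Two-stage router sketch: deterministic rules, then heuristic scores."""
    text = message.lower().strip()
    # --- Stage 1: deterministic rules (<1ms, zero cost) ---
    if any(text.startswith(w) for w in ACTION_WORDS):
        return "cloud"
    if text in ("hello", "hi", "hello abby"):
        return "local"
    if text.startswith("what is") and len(text) < 80:
        return "local"
    if len(text) > 200 and text.count(",") + text.count(" and ") >= 2:
        return "cloud"
    # --- Stage 2: heuristic scoring for uncertain cases ---
    cloud = sum(0.2 for h in COMPLEX_HINTS if h in text)
    local = sum(0.3 for h in SIMPLE_HINTS if text.startswith(h))
    # Ties with a nonzero cloud score err toward the larger model
    return "cloud" if cloud > 0 and cloud >= local else "local"
```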

Bootstrap → Classifier Pipeline: As users provide thumbs-up/down feedback, each rating is stored alongside the routing decision. Once 500+ labeled samples accumulate, a fine-tuned distilbert classifier will replace the heuristic scoring. The CostTracker already has get_routing_labels() and get_routing_label_count() methods ready for this transition.

PHI Protection: Defense in Depth

Clinical data must never leave the network. We enforce this at two levels:

Primary Defense — Architectural Boundary: The CloudSafetyFilter maintains an explicit allowlist. Only approved data sources can enter a cloud-bound prompt. All 15+ individual-level CDM tables (person, visit_occurrence, condition_occurrence, drug_exposure, measurement, observation, etc.) are permanently blocked. Content patterns like person_id: 12345 trigger blocking even from unknown sources.

Secondary Defense — PHI Sanitizer: A PHISanitizer runs on every cloud-bound prompt as defense-in-depth, combining:

  • Regex patterns for SSN, MRN (with medical record context), phone (with contact context), email, DOB (with birth context)
  • spaCy en_core_web_sm PERSON entity recognition for name detection
  • Clinical context guard preventing false positives on OMOP concept IDs

Circuit Breaker: If PHI is detected, the cloud request is blocked entirely (not just redacted) and falls back to MedGemma locally. Monthly audit of all cloud API logs.
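The circuit-breaker shape — block, never redact-and-send — looks roughly like this. The patterns below are illustrative only; the shipped sanitizer adds spaCy NER and context-aware MRN/phone/DOB rules.

```python
import re

# Illustrative PHI patterns -- a small subset of the real rule set.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN
    re.compile(r"\bperson_id\s*[:=]\s*\d+", re.I),   # row-level identifier
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),          # email
]
# Clinical-context guard: OMOP concept IDs are vocabulary metadata, not PHI.
CONCEPT_ID = re.compile(r"\bconcept(?:_id)?\s*[:= ]\s*\d+", re.I)

def check_cloud_prompt(prompt: str) -> bool:
    """Circuit-breaker check: True = safe for cloud, False = block the
    entire request and fall back to the local model (never redact-and-send)."""
    stripped = CONCEPT_ID.sub("", prompt)  # protect concept IDs from false positives
    return not any(p.search(stripped) for p in PHI_PATTERNS)
```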

Cost Controls

Cloud API costs are tracked per-request to the abby_cloud_usage table and enforced per-month:

  • Multi-tier alerting at 50%, 80%, and 95% of the monthly budget
  • Circuit breaker at 95%: Cloud routing disabled, all requests fall back to local with a user-visible confidence indicator ("low") and a degraded-mode caveat
  • Per-user/department tracking for cost attribution
  • SHA-256 audit hash on every cloud request payload
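The threshold logic above reduces to a few lines. The dollar budget here is a made-up example; the thresholds are the ones from the design.

```python
THRESHOLDS = (0.50, 0.80, 0.95)  # multi-tier alert levels
MONTHLY_BUDGET_USD = 500.0       # illustrative budget, not the real figure

def spend_status(month_spend: float, budget: float = MONTHLY_BUDGET_USD):
    """Return (cloud_allowed, alerts_fired) for the current month."""
    ratio = month_spend / budget
    alerts = [t for t in THRESHOLDS if ratio >= t]
    cloud_allowed = ratio < 0.95  # circuit breaker: local-only past 95%
    return cloud_allowed, alerts
```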

Confidence & Attribution

Every response now carries:

  • Confidence indicator: high (Claude + context), medium (MedGemma + context), low (degraded mode)
  • Routing metadata: Which model, why, which stage decided
  • Source attribution: Context pieces tagged with tier and source

What Shipped

  • 5 Python routing module components (rule router, Claude client, PHI sanitizer, cloud safety filter, cost tracker)
  • abby_cloud_usage table with department, token, cost, and audit columns
  • Context assembler extended with Claude 28K budget profile
  • Hybrid routing integrated into chat pipeline with PHI blocking and cost tracking
  • Router calibration feedback loop for classifier training
  • 47 new tests across 5 test suites

Phase 3: Semantic Knowledge Graph

The Problem

OMOP's vocabulary is inherently hierarchical. SNOMED CT organizes 350,000+ clinical concepts into parent-child trees. ICD-10 nests codes in chapters, blocks, and categories. RxNorm links ingredients to clinical drugs to branded products.

But Abby treated concepts as flat keywords. Ask "What are the subtypes of diabetes?" and she'd search for the word "diabetes" in her RAG knowledge base — returning whatever documents mentioned diabetes, not the actual concept hierarchy.

KnowledgeGraphService: Relational Understanding

Five core operations, all backed by concept_ancestor and concept_relationship with Redis caching (1-hour TTL):

| Method | Query Pattern | Use Case |
|---|---|---|
| get_ancestors(id, levels) | Walk up concept_ancestor | "What broader category does metformin belong to?" |
| get_descendants(id, levels) | Walk down concept_ancestor | "What are all subtypes of diabetes?" |
| get_siblings(id) | Up to parent, down to its other children | "What other drugs are in the same class as metformin?" |
| find_related(id, types) | concept_relationship | "What conditions are associated with this drug?" |
| get_concept(id) | Direct lookup | "What is concept 201826?" |

No separate graph database. The existing OMOP concept_ancestor table (with min_levels_of_separation and max_levels_of_separation columns) already encodes the full transitive closure. We just made it fast and queryable from the AI service via Redis caching. Hot paths become in-memory lookups after first access.
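The cache-in-front-of-SQL pattern can be sketched as follows. The `db.fetch_column` helper and the cache interface are assumed stand-ins for the service's actual database and Redis clients.

```python
import json

DESCENDANTS_SQL = """
    SELECT descendant_concept_id
    FROM concept_ancestor
    WHERE ancestor_concept_id = %s
      AND max_levels_of_separation <= %s
"""

def get_descendants(concept_id, levels, db, cache, ttl=3600):
    """Walk down concept_ancestor with a Redis-style cache in front."""
    key = f"kg:desc:{concept_id}:{levels}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)           # hot path: cache hit
    rows = db.fetch_column(DESCENDANTS_SQL, (concept_id, levels))
    cache.set(key, json.dumps(rows), ttl)   # 1-hour TTL by default
    return rows
```

After the first access, repeated hierarchy walks for a popular concept never touch PostgreSQL until the TTL expires.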

DataProfileService: Know Your Data Before You Query It

Abby now maintains a living understanding of the institution's CDM:

  • Person count: Total patients
  • Temporal coverage: Observation period date range
  • Domain density: Records per clinical domain (conditions, drugs, procedures, measurements, observations, visits, devices), sorted by density
  • Gap detection: Three warning types:
    • Critical: CDM has zero patients
    • Sparse domain: <1 record per patient (unreliable for research)
    • Temporal: <3 years coverage (insufficient for longitudinal studies)

Data quality warnings are injected as safety-critical context in the system prompt — they cannot be truncated by the token budget allocator. When a researcher asks about measurements and the Measurement domain has 500 records for 100K patients, Abby says so before the researcher wastes hours building a cohort on empty data.
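The three gap rules translate directly into code. A minimal sketch, assuming the service has already counted patients, per-domain records, and coverage years:

```python
def detect_gaps(person_count, domain_counts, coverage_years):
    """Return data-quality warnings per the three rules above.
    domain_counts: {domain_name: record_count}."""
    warnings = []
    if person_count == 0:
        warnings.append(("critical", "CDM has zero patients"))
        return warnings
    for domain, records in sorted(domain_counts.items()):
        if records / person_count < 1:
            warnings.append(("sparse_domain",
                             f"{domain}: <1 record per patient"))
    if coverage_years < 3:
        warnings.append(("temporal", "<3 years of observation coverage"))
    return warnings
```

For the example in the text — 500 Measurement records against 100K patients — the sparse-domain warning fires and gets injected as untruncatable context.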

Integration via Intent Detection

Five new intent patterns route queries to the knowledge graph:

"ancestors of concept 201826"   → get_ancestors
"subtypes of diabetes"          → get_descendants
"similar concepts to metformin" → get_siblings
"relationships for this drug"   → find_related
"how much data do we have"      → data_profile

Results are injected into the live context pipeline with section headers (CONCEPT HIERARCHY:, CDM DATA PROFILE:) for structured prompt composition.
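The pattern-to-tool mapping can be sketched as a first-match regex table. These exact regexes are illustrative, not the shipped patterns.

```python
import re

# Illustrative pattern table mapping query phrasings to graph operations.
INTENT_PATTERNS = [
    (re.compile(r"\bancestors? of\b", re.I), "get_ancestors"),
    (re.compile(r"\bsubtypes? of\b", re.I), "get_descendants"),
    (re.compile(r"\bsimilar concepts? to\b", re.I), "get_siblings"),
    (re.compile(r"\brelationships? (?:for|of)\b", re.I), "find_related"),
    (re.compile(r"\bhow much data\b", re.I), "data_profile"),
]

def detect_graph_intent(message: str):
    """Return the first matching knowledge-graph tool name, or None."""
    for pattern, tool in INTENT_PATTERNS:
        if pattern.search(message):
            return tool
    return None
```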

What Shipped

  • 2 Python knowledge module components (graph service, data profile service)
  • 5 new intent patterns + 2 tool functions in live context pipeline
  • Data quality warning injection in chat system prompt
  • Redis caching with configurable TTL and prefix
  • 19 new tests across 2 test suites + 2 integration tests

The Numbers

Code Impact

| Metric | Value |
|---|---|
| Total commits | 42 across 3 feature branches |
| Files created | 60+ new files |
| Files modified | 20+ existing files |
| Lines added | ~7,000+ |
| Python test suites | 15 |
| Total tests passing | 188 (3 pre-existing failures unrelated) |
| TypeScript | Compiles cleanly |
| Vite build | Succeeds |

Architecture Components

| Layer | Components |
|---|---|
| Python Memory Module (7) | IntentStack, ScratchPad, ContextAssembler, ProfileLearner, ConversationStore, Summarizer, MigrationBridge |
| Python Routing Module (5) | RuleRouter, ClaudeClient, PHISanitizer, CloudSafetyFilter, CostTracker |
| Database (3 tables) | abby_user_profiles, abby_messages (pgvector), abby_cloud_usage |
| Python Knowledge Module (2) | KnowledgeGraphService, DataProfileService |
| Laravel (4) | AbbyUserProfile model, AbbyProfileController, UpdateAbbyProfileRequest, AiService extensions |
| Frontend (4) | memory.ts types, useAbbyProfile hook, AbbyProfilePanel component, AskAbbyChannel integration |

Request Flow (After Phase 2)

User message arrives

Working memory: Intent stack tracks topic, scratch pad holds artifacts

Router: Stage 1 deterministic rules → Stage 2 heuristic scoring

Budget check: Cost tracker circuit breaker (95% = local only)

┌─ CLAUDE PATH ─────────────────────┐  ┌─ LOCAL PATH ────────────┐
│ Cloud safety filter (allowlist)   │  │ MedGemma via Ollama     │
│ PHI sanitizer (regex + NER)       │  │ 4K token budget         │
│ Context assembler (28K budget)    │  │ <2s response time       │
│ Claude API call                   │  │                         │
│ Cost recorded to abby_cloud_usage │  │                         │
└───────────────────────────────────┘  └─────────────────────────┘

Context enrichment: RAG + Live DB + Knowledge Graph + Data Profile

Data quality warnings injected (safety-critical, never truncated)

Response: reply + suggestions + confidence + sources + routing metadata

Profile learner: Updates user research profile (non-blocking)
Memory storage: Conversation persisted with pgvector embedding

Engineering Decisions Worth Noting

Why PostgreSQL + Redis, Not a Graph Database

The OMOP concept_ancestor table already contains the full transitive closure of concept hierarchies. Adding Neo4j or a similar graph database would introduce operational complexity (another service to manage, data sync concerns, schema divergence) for a capability that PostgreSQL handles well with indexed queries. Redis caching makes hot paths effectively in-memory.

Why Rule-Based Routing Instead of an LLM Classifier

Using MedGemma itself to classify complexity creates a circular dependency — the model must judge its own limitations. Small models are poorly calibrated for confidence estimation. A rule-based pre-filter + heuristic scoring is predictable, debuggable, and zero-cost. The classifier will be trained on real production data once 500+ labeled routing decisions accumulate.

Why Block PHI Instead of Redact-and-Send

The spec's PHI circuit breaker blocks the entire cloud request when PHI is detected, rather than redacting and sending. This is intentional. Redaction creates a false sense of safety — if the redactor misses something (and regex-based detectors will miss edge cases), patient data leaks. Blocking ensures the worst case is a slightly less capable response, not a data breach. The fallback to MedGemma locally still produces a useful answer.

Why Immutable Data Patterns Everywhere

Every learn_from_conversation() call, every profile update, every context assembly returns a new object rather than mutating the input. This prevents subtle bugs in concurrent FastAPI workers, makes debugging deterministic, and follows the project-wide coding standard. The cost (extra object allocation) is negligible compared to the LLM inference time that dominates every request.


What's Next

Three phases remain in the Abby 2.0 roadmap:

| Phase | Goal | Key Capability |
|---|---|---|
| Phase 4: Agency Framework | Abby can take actions | Plan-Confirm-Execute loop with risk classification, rollback, and audit trail |
| Phase 5: Advanced Agency | Complex workflows | DAG-based multi-step orchestration with parallel execution |
| Phase 6: Institutional Intelligence | Organization-wide learning | Automatic knowledge capture, shared artifact library, FAQ auto-promotion |

Phase 4 is where Abby gets hands. "Build me a cohort of Type 2 diabetes patients on metformin with HbA1c > 9" won't just get an explanation — it'll get a reviewable plan that, with one click, creates the concept sets, defines the cohort, generates the patient count, and links you to the result.

The memory foundation, hybrid intelligence, and semantic understanding we built today are the platform everything else stands on. Abby remembers, reasons, and understands. Next, she acts.