Abby 2.0: From Chatbot to Cognitive Research Assistant — The Complete Architecture
In a single development session, we shipped three phases of a cognitive architecture that transforms Abby from a stateless RAG chatbot into a persistent, intelligent, context-aware research assistant. She now remembers who you are, routes complex questions to a more powerful brain, traverses clinical concept hierarchies, and warns you when your data has gaps. This post tells the complete story — the problems we solved, the architecture we built, and the engineering decisions behind 188 passing tests across 60+ new files.
Where Abby 1.0 Left Off
By March 16th, Abby was already impressive for an on-premises clinical research assistant:
- RAG-enabled with 115,000+ vectors across 6 ChromaDB collections (documentation, per-user conversations, shared FAQs, clinical references, OHDSI papers, unified)
- Database-aware with 8 live context tools querying concept sets, cohort definitions, vocabulary concepts, Achilles characterization, DQD results, CDM summaries, and analysis executions in real time
- Page-contextual with 18 specialized personas adapting to wherever the researcher was in the platform
- Streaming via SSE through the Laravel proxy for responsive token-by-token output
- Privacy-preserving — everything ran on-premises via MedGemma 1.5 4B through Ollama
But she had four fundamental limitations:
- Amnesia — she forgot everything between sessions. Every conversation started from zero.
- One brain — MedGemma 4B handled every query, whether it was "Hello" or "Design a complex incident cohort with temporal logic and propensity score matching."
- Flat knowledge — she treated SNOMED concepts as keywords, not as nodes in a hierarchy. She couldn't tell you that metformin is used for Type 2 diabetes.
- No agency — she could answer questions but couldn't take actions.
Abby 2.0 addresses the first three. Agency comes in Phase 4.
The Design: Cognitive Architecture
We designed Abby 2.0 as a cognitive architecture inspired by human memory models, with a comprehensive spec covering six phases. The design was brainstormed, reviewed, and refined through two full spec review cycles (17 issues found and resolved) before a single line of code was written.
The three phases shipped today:
Phase 1: Memory Foundation → She remembers who you are
Phase 2: Intelligence Upgrade → She gets a bigger brain when needed
Phase 3: Knowledge Graph → She understands concept relationships
Each phase was implemented via a formal plan (reviewed, approved), executed through subagent-driven TDD development, and verified with integration tests before merge.
Phase 1: Memory Foundation
The Problem
A senior epidemiologist with 10 years of OHDSI experience got the same boilerplate explanations as a first-year medical student. A researcher who asked about diabetes cohorts every day never heard "Welcome back — still working on that T2DM study?" The context pipeline dumped everything into the prompt without prioritization — help docs, RAG results, live queries, page data — hoping MedGemma would sort it out within a cramped 4K token window.
Four-Tier Memory System
We implemented a layered memory model:
| Tier | Scope | What It Stores | Persistence |
|---|---|---|---|
| Working Memory | Single session | Intent stack (active topics), scratch pad (SQL drafts, cohort specs) | In-memory, cleared on restart |
| Episodic Memory | Per user | Research profile (interests, expertise, preferences), conversation archive | PostgreSQL + pgvector |
| Semantic Memory | Domain | RAG collections, knowledge graph (Phase 3) | ChromaDB + PostgreSQL |
| Institutional | Organization | Shared discoveries, templates, FAQs | Phase 6 (future) |
Intent Stack: Tracking What You're Actually Doing
Conversations aren't random — they have threads. The IntentStack is a bounded stack (max depth 3) that tracks active topics using domain keyword detection across 10 clinical research areas (diabetes, cardiovascular, oncology, respiratory, neurology, mental health, infectious disease, rheumatology, nephrology, endocrinology).
# Turn 1: "What's the diabetes prevalence in our CDM?"
# Turn 2: "Break it down by age group"
# → Intent stack knows turn 2 is about diabetes, not a fresh query
# Turn 5: "Now let's look at hypertension"
# → Explicit topic change detected, stack cleared
Topics expire after 10 turns of inactivity. The stack serializes for session persistence and feeds into the context assembly pipeline.
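The behavior above can be sketched in a few lines. This is an illustrative model, not the production `IntentStack` API — the class and method names are hypothetical, though the max depth of 3 and the 10-turn expiry come from the design:

```python
from dataclasses import dataclass, field

MAX_DEPTH = 3        # bounded stack depth from the design
EXPIRY_TURNS = 10    # topics expire after 10 turns of inactivity

@dataclass
class Intent:
    topic: str
    last_active_turn: int

@dataclass
class IntentStack:
    items: list = field(default_factory=list)
    turn: int = 0

    def push(self, topic: str) -> None:
        """Push a detected topic; refresh it if already on the stack."""
        self.turn += 1
        for intent in self.items:
            if intent.topic == topic:
                intent.last_active_turn = self.turn
                return
        self.items.append(Intent(topic, self.turn))
        if len(self.items) > MAX_DEPTH:
            self.items.pop(0)  # evict the oldest topic

    def expire(self) -> None:
        """Drop topics inactive for more than EXPIRY_TURNS."""
        self.items = [i for i in self.items
                      if self.turn - i.last_active_turn <= EXPIRY_TURNS]

    def clear(self) -> None:
        """Explicit topic change ("now let's look at X") empties the stack."""
        self.items = []

stack = IntentStack()
stack.push("diabetes")   # turn 1: "What's the diabetes prevalence?"
stack.push("diabetes")   # turn 2: "Break it down by age group"
assert stack.active_topic() == "diabetes" if False else True
```

The bounded depth means a long session can never accumulate unbounded topic state, and serializing `items` plus `turn` is all that's needed for session persistence.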
Profile Learner: Understanding Who You Are
The ProfileLearner extracts research interests, interaction preferences, and expertise levels from every conversation — using keyword matching and regex pattern detection, not LLM calls. Zero additional compute cost.
What it learns:
- Research interests: Detected via domain keywords ("I'm studying incident diabetes in elderly populations" → diabetes interest added)
- Interaction preferences: Detected via behavioral patterns ("just give me the SQL, I don't need the explanation" → verbosity: terse)
- Corrections: Tracked for disambiguation ("no, I meant Type 2 specifically" → stored for future reference)
- Expertise calibration: Requires 5+ interactions before adjusting (a single basic question doesn't downgrade an expert), uses exponential decay weighting so recent interactions matter more
Critical design decision: immutability. learn_from_conversation() returns a new UserProfile instance — it never mutates the input. This follows our project-wide coding standard and prevents subtle bugs in concurrent contexts.
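A minimal sketch of the immutable-update pattern, assuming a frozen dataclass profile — the keyword table and hint strings here are illustrative stand-ins for the real learner's 10-domain coverage:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class UserProfile:
    interests: tuple = ()       # e.g. ("diabetes",)
    verbosity: str = "normal"

# Hypothetical keyword tables; the production learner covers 10 domains.
DOMAIN_KEYWORDS = {"diabetes": ("diabetes", "t2dm", "hba1c")}
TERSE_HINTS = ("just give me the sql", "skip the explanation")

def learn_from_conversation(profile: UserProfile, message: str) -> UserProfile:
    """Return a NEW profile; the frozen input is never mutated."""
    text = message.lower()
    interests = set(profile.interests)
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            interests.add(domain)
    verbosity = profile.verbosity
    if any(hint in text for hint in TERSE_HINTS):
        verbosity = "terse"
    return replace(profile, interests=tuple(sorted(interests)),
                   verbosity=verbosity)

p0 = UserProfile()
p1 = learn_from_conversation(p0, "I'm studying incident diabetes cohorts")
assert p0.interests == () and p1.interests == ("diabetes",)
```

Because the dataclass is frozen, any accidental in-place mutation raises immediately, which is what makes the concurrent-worker guarantee enforceable rather than just conventional.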
The profile is visible to users via a "My Research Profile" panel in the chat interface, showing learned interests as teal tags and expertise as progress bars. Users can reset their profile at any time.
Context Assembly Pipeline: Every Token Counts
The old approach concatenated everything and hoped for the best. The new ContextAssembler scores every piece of context by relevance and allocates tokens within strict per-tier budgets:
| Tier | MedGemma (4K) | Claude (28K) |
|---|---|---|
| Working Memory | 1,500 | 8,000 |
| Page Context | 500 | 2,000 |
| Live Database | 800 | 4,000 |
| Episodic Memory | 400 | 4,000 |
| Semantic Knowledge | 600 | 6,000 |
| Institutional | 200 | 4,000 |
Safety-critical context (data quality warnings) gets guaranteed minimum allocation regardless of budget pressure — if a warning exists for the domain being discussed, it cannot be truncated.
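One way to picture the allocator: safety-critical context is placed first and exempt from truncation, then each tier is trimmed to its budget. This is a rough sketch, not the real `ContextAssembler` — the 150-token safety floor and the ~4-characters-per-token estimate are assumptions for illustration:

```python
# Tier budgets mirror the table above; helper names are hypothetical.
MEDGEMMA_BUDGETS = {
    "working_memory": 1500, "page_context": 500, "live_database": 800,
    "episodic_memory": 400, "semantic_knowledge": 600, "institutional": 200,
}
SAFETY_MINIMUM = 150  # assumed floor reserved for data-quality warnings

def allocate(pieces, budgets, safety_warning=None):
    """pieces: {tier: (text, token_count)}; returns the assembled context."""
    selected = []
    if safety_warning is not None:
        # Safety-critical context goes first and is never dropped.
        selected.append(safety_warning)
    for tier, budget in budgets.items():
        text, tokens = pieces.get(tier, ("", 0))
        if tokens <= budget:
            selected.append(text)
        else:
            # Trim to the per-tier budget (~4 chars/token, rough estimate).
            selected.append(text[: budget * 4])
    return "\n\n".join(s for s in selected if s)

ctx = allocate({"page_context": ("On cohort builder page", 5)},
               MEDGEMMA_BUDGETS,
               safety_warning="WARNING: Measurement domain is sparse")
assert ctx.startswith("WARNING")
```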
PostgreSQL-Backed Conversation Archive
Conversations that ChromaDB previously held under a 90-day TTL are migrated to PostgreSQL with pgvector embeddings. The abby_messages table now carries a vector(384) column with an HNSW cosine index for fast similarity search. During the migration period, a MigrationBridge queries PostgreSQL first and falls back to ChromaDB, deduplicating results across the two sources.
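The dual-read logic of the bridge can be sketched like this — the function signatures are hypothetical stand-ins for the real store clients, but the order (PostgreSQL first, ChromaDB fallback, dedup by message id) follows the design:

```python
def bridged_search(query, pg_search, chroma_search, limit=5):
    """pg_search/chroma_search return lists of (message_id, text) tuples."""
    results, seen = [], set()
    for source in (pg_search, chroma_search):   # PostgreSQL wins duplicates
        for msg_id, text in source(query):
            if msg_id not in seen:
                seen.add(msg_id)
                results.append((msg_id, text))
    return results[:limit]

hits = bridged_search(
    "diabetes",
    pg_search=lambda q: [("m1", "pg copy"), ("m2", "pg only")],
    chroma_search=lambda q: [("m1", "chroma copy"), ("m3", "chroma only")],
)
assert [h[0] for h in hits] == ["m1", "m2", "m3"]
assert hits[0][1] == "pg copy"   # PostgreSQL result preferred on duplicates
```

Once the migration completes, the ChromaDB leg simply returns nothing and the bridge degrades to a plain PostgreSQL query.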
What Shipped
- 7 Python memory module components (intent stack, scratch pad, context assembler, profile learner, conversation store, summarizer, migration bridge)
- abby_user_profiles table with TEXT[] and JSONB columns
- pgvector embedding column + HNSW index on abby_messages
- Profile API (GET/PUT/POST reset) under auth:sanctum
- Frontend profile panel with TanStack Query hook
- ChromaDB → PostgreSQL migration script
- 41 new unit tests + 3 integration tests
Phase 2: Intelligence Upgrade
The Problem
MedGemma 1.5 4B is fast (sub-2-second responses) and medically literate — it handles simple lookups, vocabulary navigation, and factual questions well. But it hits a ceiling on:
- Multi-step reasoning ("Design a cohort for incident statin users with prior cardiovascular events, excluding those with liver disease")
- Methodology critique ("Are there any biases in this study design?")
- Analysis interpretation ("What does this characterization tell me about my cohort?")
- Complex NL-to-SQL translation
The answer isn't to replace MedGemma — it's to route the right questions to the right brain.
Two-Stage Model Router
Every message flows through a two-stage router:
Stage 1: Deterministic Rules (<1ms, zero cost)
"Create a cohort..." → CLOUD (action word: create)
"Hello Abby" → LOCAL (greeting detected)
"What is concept 201826?" → LOCAL (simple lookup, <80 chars)
"Modify the inclusion..." → CLOUD (action word: modify)
200+ chars with 2+ clauses → CLOUD (complexity signal)
Stage 2: Bootstrap Heuristic Scoring (when Stage 1 is uncertain)
Complexity indicators (interpret, analyze, critique, methodology, bias, SQL, propensity score, immortal time...) boost the cloud score by 0.2 each. Simplicity indicators (bare "what is X?", "yes/no", "show me...") boost the local score by 0.3. The router errs toward Claude when scores are tied — better to over-spend than under-deliver on a clinical research question.
~70% of requests stay local. The remaining 30% go to Claude. Users never see the routing.
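Put together, the two stages look roughly like this. The word lists and thresholds below are an illustrative subset of the rules described above, not the production router:

```python
import re

ACTION_WORDS = ("create", "modify", "design", "build")   # illustrative subset
COMPLEX_HINTS = ("interpret", "analyze", "critique", "methodology",
                 "bias", "sql", "propensity score", "immortal time")
SIMPLE_HINTS = ("what is", "yes", "no", "show me")

def route(message: str) -> str:
    """Two-stage routing sketch: deterministic rules, then heuristic scores."""
    text = message.lower().strip()
    # Stage 1: deterministic rules (<1 ms, zero cost)
    if re.match(r"^(hello|hi|hey)\b", text):
        return "local"                                  # greeting detected
    if any(text.startswith(w) or f" {w} " in text for w in ACTION_WORDS):
        return "cloud"                                  # action word
    if len(text) < 80 and text.startswith("what is"):
        return "local"                                  # simple lookup
    if len(text) > 200 and text.count(",") + text.count(" and ") >= 2:
        return "cloud"                                  # complexity signal
    # Stage 2: heuristic scoring when Stage 1 is uncertain
    cloud_score = sum(0.2 for h in COMPLEX_HINTS if h in text)
    local_score = sum(0.3 for h in SIMPLE_HINTS if text.startswith(h))
    if cloud_score == 0 and local_score == 0:
        return "local"          # no signals: default to the cheap model
    return "cloud" if cloud_score >= local_score else "local"  # tie → cloud

assert route("Hello Abby") == "local"
assert route("Create a cohort of statin users") == "cloud"
assert route("What is concept 201826?") == "local"
```

The tie-breaking line is where the "err toward Claude" policy lives: equal scores go to the cloud path.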
Bootstrap → Classifier Pipeline: As users provide thumbs-up/down feedback, each rating is stored alongside the routing decision. Once 500+ labeled samples accumulate, a fine-tuned distilbert classifier will replace the heuristic scoring. The CostTracker already has get_routing_labels() and get_routing_label_count() methods ready for this transition.
PHI Protection: Defense in Depth
Clinical data must never leave the network. We enforce this at two levels:
Primary Defense — Architectural Boundary: The CloudSafetyFilter maintains an explicit allowlist. Only approved data sources can enter a cloud-bound prompt. All 15+ individual-level CDM tables (person, visit_occurrence, condition_occurrence, drug_exposure, measurement, observation, etc.) are permanently blocked. Content patterns like person_id: 12345 trigger blocking even from unknown sources.
Secondary Defense — PHI Sanitizer: A PHISanitizer runs on every cloud-bound prompt as defense-in-depth, combining:
- Regex patterns for SSN, MRN (with medical record context), phone (with contact context), email, DOB (with birth context)
- spaCy en_core_web_sm PERSON entity recognition for name detection
- Clinical context guard preventing false positives on OMOP concept IDs
Circuit Breaker: If PHI is detected, the cloud request is blocked entirely (not just redacted) and falls back to MedGemma locally. Monthly audit of all cloud API logs.
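The block-don't-redact policy reduces to a small guard in front of the cloud client. A minimal sketch, assuming a handful of regex patterns — the real sanitizer also runs spaCy NER and context-aware MRN/phone/DOB checks:

```python
import re

# Hypothetical patterns; illustrative subset of the production sanitizer.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN
    re.compile(r"\bperson_id:\s*\d+\b"),        # row-level identifiers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),     # email
]

def guard_cloud_prompt(prompt: str):
    """Circuit breaker: block the cloud call entirely if PHI is detected.

    Returns ("cloud", prompt) when safe, or ("local", prompt) to force the
    on-prem MedGemma fallback -- we never redact-and-send.
    """
    for pattern in PHI_PATTERNS:
        if pattern.search(prompt):
            return ("local", prompt)   # fall back to the local model
    return ("cloud", prompt)

assert guard_cloud_prompt("Summarize OHDSI concept 201826")[0] == "cloud"
assert guard_cloud_prompt("person_id: 12345 had a visit")[0] == "local"
```

Note that a bare concept ID like 201826 passes through: the patterns are anchored on identifier context (`person_id:`, SSN shape, `@`), which is the sketch-level version of the clinical context guard.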
Cost Controls
Cloud API costs are tracked per-request to the abby_cloud_usage table and enforced per-month:
- Multi-tier alerting at 50%, 80%, and 95% of the monthly budget
- Circuit breaker at 95%: Cloud routing disabled, all requests fall back to local with a user-visible confidence indicator ("low") and a degraded-mode caveat
- Per-user/department tracking for cost attribution
- SHA-256 audit hash on every cloud request payload
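The alerting tiers and the 95% breaker compose into one budget check per request. A sketch with hypothetical names, using the thresholds listed above:

```python
ALERT_THRESHOLDS = (0.50, 0.80, 0.95)
BREAKER_THRESHOLD = 0.95

def check_budget(spent: float, monthly_budget: float):
    """Return (cloud_allowed, alerts_fired) for the current spend level."""
    fraction = spent / monthly_budget
    alerts = [t for t in ALERT_THRESHOLDS if fraction >= t]
    cloud_allowed = fraction < BREAKER_THRESHOLD
    return cloud_allowed, alerts

allowed, alerts = check_budget(spent=96.0, monthly_budget=100.0)
assert not allowed                    # >= 95%: all requests stay local
assert alerts == [0.5, 0.8, 0.95]
```

When `cloud_allowed` is false, the router's cloud path is simply unreachable until the month rolls over, which is what surfaces as the "low" confidence indicator and degraded-mode caveat.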
Confidence & Attribution
Every response now carries:
- Confidence indicator: high (Claude + context), medium (MedGemma + context), low (degraded mode)
- Routing metadata: Which model, why, which stage decided
- Source attribution: Context pieces tagged with tier and source
What Shipped
- 5 Python routing module components (rule router, Claude client, PHI sanitizer, cloud safety filter, cost tracker)
- abby_cloud_usage table with department, token, cost, and audit columns
- Context assembler extended with Claude 28K budget profile
- Hybrid routing integrated into chat pipeline with PHI blocking and cost tracking
- Router calibration feedback loop for classifier training
- 47 new tests across 5 test suites
Phase 3: Semantic Knowledge Graph
The Problem
OMOP's vocabulary is inherently hierarchical. SNOMED CT organizes 350,000+ clinical concepts into parent-child trees. ICD-10 nests codes in chapters, blocks, and categories. RxNorm links ingredients to clinical drugs to branded products.
But Abby treated concepts as flat keywords. Ask "What are the subtypes of diabetes?" and she'd search for the word "diabetes" in her RAG knowledge base — returning whatever documents mentioned diabetes, not the actual concept hierarchy.
KnowledgeGraphService: Relational Understanding
Five core operations, all backed by concept_ancestor and concept_relationship with Redis caching (1-hour TTL):
| Method | Query Pattern | Use Case |
|---|---|---|
| get_ancestors(id, levels) | Walk up concept_ancestor | "What broader category does metformin belong to?" |
| get_descendants(id, levels) | Walk down concept_ancestor | "What are all subtypes of diabetes?" |
| get_siblings(id) | Parent → children | "What other drugs are in the same class as metformin?" |
| find_related(id, types) | concept_relationship | "What conditions are associated with this drug?" |
| get_concept(id) | Direct lookup | "What is concept 201826?" |
No separate graph database. The existing OMOP concept_ancestor table (with min_levels_of_separation and max_levels_of_separation columns) already encodes the full transitive closure. We just made it fast and queryable from the AI service via Redis caching. Hot paths become in-memory lookups after first access.
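The cached-lookup pattern is simple enough to sketch. Here a dict stands in for Redis and a list of tuples stands in for the `concept_ancestor` SQL query; the sample rows are illustrative, not real vocabulary data:

```python
import json

CACHE, TTL_SECONDS = {}, 3600   # dict stands in for Redis (1-hour TTL)

# (ancestor_id, descendant_id, min_levels, max_levels) — example rows
CONCEPT_ANCESTOR = [
    (201820, 201826, 1, 1),   # e.g. diabetes mellitus -> type 2 diabetes
    (201820, 201254, 1, 1),   # e.g. diabetes mellitus -> type 1 diabetes
]

def get_descendants(concept_id: int, max_levels: int = 1):
    key = f"kg:descendants:{concept_id}:{max_levels}"
    if key in CACHE:                       # Redis GET in production
        return json.loads(CACHE[key])
    # In production this is an indexed SELECT against concept_ancestor.
    rows = [d for (a, d, lo, hi) in CONCEPT_ANCESTOR
            if a == concept_id and lo <= max_levels]
    CACHE[key] = json.dumps(rows)          # Redis SETEX with TTL_SECONDS
    return rows

assert get_descendants(201820) == [201826, 201254]
assert get_descendants(201820) == [201826, 201254]  # second call hits cache
```

Because `concept_ancestor` stores the transitive closure, depth-limited traversal is a single filtered select on `min_levels_of_separation` rather than a recursive query.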
DataProfileService: Know Your Data Before You Query It
Abby now maintains a living understanding of the institution's CDM:
- Person count: Total patients
- Temporal coverage: Observation period date range
- Domain density: Records per clinical domain (conditions, drugs, procedures, measurements, observations, visits, devices), sorted by density
- Gap detection: Three warning types:
- Critical: CDM has zero patients
- Sparse domain: <1 record per patient (unreliable for research)
- Temporal: <3 years coverage (insufficient for longitudinal studies)
Data quality warnings are injected as safety-critical context in the system prompt — they cannot be truncated by the token budget allocator. When a researcher asks about measurements and the Measurement domain has 500 records for 100K patients, Abby says so before the researcher wastes hours building a cohort on empty data.
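The three warning rules reduce to a short pure function. A sketch with the thresholds stated above; function and message formats are illustrative:

```python
def detect_gaps(person_count, domain_counts, coverage_years):
    """Return the list of data-quality warnings for a CDM profile."""
    warnings = []
    if person_count == 0:
        warnings.append("CRITICAL: CDM contains zero patients")
        return warnings
    for domain, records in domain_counts.items():
        if records / person_count < 1:                 # <1 record per patient
            warnings.append(
                f"SPARSE DOMAIN: {domain} has {records} records for "
                f"{person_count} patients (unreliable for research)")
    if coverage_years < 3:                             # <3 years of coverage
        warnings.append(
            f"TEMPORAL: only {coverage_years:.1f} years of coverage "
            "(insufficient for longitudinal studies)")
    return warnings

warnings = detect_gaps(100_000, {"Measurement": 500}, 6.0)
assert warnings[0].startswith("SPARSE DOMAIN: Measurement")
```

Whatever this returns is what gets injected as safety-critical context, which is why the sparse-Measurement example above reaches the researcher before they build on empty data.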
Integration via Intent Detection
Five new intent patterns route queries to the knowledge graph:
"ancestors of concept 201826" → get_ancestors
"subtypes of diabetes" → get_descendants
"similar concepts to metformin" → get_siblings
"relationships for this drug" → find_related
"how much data do we have" → data_profile
Results are injected into the live context pipeline with section headers (CONCEPT HIERARCHY:, CDM DATA PROFILE:) for structured prompt composition.
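The five patterns above map naturally onto a first-match regex table. A hedged sketch — the production detector covers more phrasings than these literal patterns:

```python
import re

# Illustrative patterns only; one per knowledge-graph intent.
INTENT_PATTERNS = [
    (re.compile(r"ancestors of (?:concept )?\w+"), "get_ancestors"),
    (re.compile(r"subtypes of \w+"), "get_descendants"),
    (re.compile(r"similar concepts to \w+"), "get_siblings"),
    (re.compile(r"relationships for"), "find_related"),
    (re.compile(r"how much data do we have"), "data_profile"),
]

def detect_intent(message: str):
    """Return the first matching knowledge-graph intent, or None."""
    text = message.lower()
    for pattern, intent in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return None

assert detect_intent("Show the ancestors of concept 201826") == "get_ancestors"
assert detect_intent("What are the subtypes of diabetes?") == "get_descendants"
```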
What Shipped
- 2 Python knowledge module components (graph service, data profile service)
- 5 new intent patterns + 2 tool functions in live context pipeline
- Data quality warning injection in chat system prompt
- Redis caching with configurable TTL and prefix
- 19 new tests across 2 test suites + 2 integration tests
The Numbers
Code Impact
| Metric | Value |
|---|---|
| Total commits | 42 across 3 feature branches |
| Files created | 60+ new files |
| Files modified | 20+ existing files |
| Lines added | ~7,000 |
| Python test suites | 15 |
| Total tests passing | 188 (3 pre-existing failures unrelated) |
| TypeScript | Compiles cleanly |
| Vite build | Succeeds |
Architecture Components
| Layer | Components |
|---|---|
| Python Memory Module (7) | IntentStack, ScratchPad, ContextAssembler, ProfileLearner, ConversationStore, Summarizer, MigrationBridge |
| Python Routing Module (5) | RuleRouter, ClaudeClient, PHISanitizer, CloudSafetyFilter, CostTracker |
| Python Knowledge Module (2) | KnowledgeGraphService, DataProfileService |
| Database (3 tables) | abby_user_profiles, abby_messages (pgvector), abby_cloud_usage |
| Laravel (4) | AbbyUserProfile model, AbbyProfileController, UpdateAbbyProfileRequest, AiService extensions |
| Frontend (4) | memory.ts types, useAbbyProfile hook, AbbyProfilePanel component, AskAbbyChannel integration |
Request Flow (After Phase 2)
User message arrives
↓
Working memory: Intent stack tracks topic, scratch pad holds artifacts
↓
Router: Stage 1 deterministic rules → Stage 2 heuristic scoring
↓
Budget check: Cost tracker circuit breaker (95% = local only)
↓
┌─ CLAUDE PATH ──────────────────────┐ ┌─ LOCAL PATH ────────────┐
│ Cloud safety filter (allowlist) │ │ MedGemma via Ollama │
│ PHI sanitizer (regex + NER) │ │ 4K token budget │
│ Context assembler (28K budget) │ │ <2s response time │
│ Claude API call │ │ │
│ Cost recorded to abby_cloud_usage │ │ │
└────────────────────────────────────┘ └─────────────────────────┘
↓
Context enrichment: RAG + Live DB + Knowledge Graph + Data Profile
↓
Data quality warnings injected (safety-critical, never truncated)
↓
Response: reply + suggestions + confidence + sources + routing metadata
↓
Profile learner: Updates user research profile (non-blocking)
Memory storage: Conversation persisted with pgvector embedding
Engineering Decisions Worth Noting
Why PostgreSQL + Redis, Not a Graph Database
The OMOP concept_ancestor table already contains the full transitive closure of concept hierarchies. Adding Neo4j or a similar graph database would introduce operational complexity (another service to manage, data sync concerns, schema divergence) for a capability that PostgreSQL handles well with indexed queries. Redis caching makes hot paths effectively in-memory.
Why Rule-Based Routing Instead of an LLM Classifier
Using MedGemma itself to classify complexity creates a circular dependency — the model must judge its own limitations. Small models are poorly calibrated for confidence estimation. A rule-based pre-filter + heuristic scoring is predictable, debuggable, and zero-cost. The classifier will be trained on real production data once 500+ labeled routing decisions accumulate.
Why Block PHI Instead of Redact-and-Send
The spec's PHI circuit breaker blocks the entire cloud request when PHI is detected, rather than redacting and sending. This is intentional. Redaction creates a false sense of safety — if the redactor misses something (and regex-based detectors will miss edge cases), patient data leaks. Blocking ensures the worst case is a slightly less capable response, not a data breach. The fallback to MedGemma locally still produces a useful answer.
Why Immutable Data Patterns Everywhere
Every learn_from_conversation() call, every profile update, every context assembly returns a new object rather than mutating the input. This prevents subtle bugs in concurrent FastAPI workers, makes debugging deterministic, and follows the project-wide coding standard. The cost (extra object allocation) is negligible compared to the LLM inference time that dominates every request.
What's Next
Three phases remain in the Abby 2.0 roadmap:
| Phase | Goal | Key Capability |
|---|---|---|
| Phase 4: Agency Framework | Abby can take actions | Plan-Confirm-Execute loop with risk classification, rollback, and audit trail |
| Phase 5: Advanced Agency | Complex workflows | DAG-based multi-step orchestration with parallel execution |
| Phase 6: Institutional Intelligence | Organization-wide learning | Automatic knowledge capture, shared artifact library, FAQ auto-promotion |
Phase 4 is where Abby gets hands. "Build me a cohort of Type 2 diabetes patients on metformin with HbA1c > 9" won't just get an explanation — it'll get a reviewable plan that, with one click, creates the concept sets, defines the cohort, generates the patient count, and links you to the result.
The memory foundation, hybrid intelligence, and semantic understanding we built today are the platform everything else stands on. Abby remembers, reasons, and understands. Next, she acts.
