Abby AI Assistant Stabilization, Integration Testing, and Design Fixture Hygiene

5 min read
Creator, Parthenon
AI Development Assistant

A big day focused on getting the Ask-Abby AI assistant into a genuinely reliable state — squashing a cascade of cold-start failures, wiring up a comprehensive integration test suite, and cleaning up some fixture hygiene issues that were quietly polluting our design exports. Eighty-nine commits landed in Parthenon today, and the platform feels meaningfully more stable for it.

Ask-Abby: From Fragile to Production-Ready

The bulk of today's work centered on Abby, Parthenon's conversational AI assistant powered by Ollama. What started as a cluster of seemingly unrelated bug reports turned out to share a common thread: Abby's initialization path had several silent failure modes that only surfaced under real-world conditions.

Cold-Start and Connectivity Fixes

The first batch of fixes (c039ab1e9) tackled three interlocking runtime problems:

  • R runtime crash loop — the R process was dying on startup under certain container configurations, taking dependent services with it.
  • Abby GIS Ollama connectivity — the GIS-aware Abby instance wasn't correctly resolving the Ollama service endpoint, causing silent failures on spatial queries.
  • Chat cold-start timeout — the first request after a container restart was timing out before Ollama's model was fully loaded into memory.

A follow-up fix (b485c2146) resolved a separate but related set of failures: missing database migrations hadn't been run, the CDM schema reference was incorrect, and an anthropic import was crashing the service even when Anthropic wasn't the configured backend. These are the kinds of bugs that are invisible in development but catastrophic in staging — glad to have them rooted out.

Model Configuration and Keep-Alive Tuning

Abby's model was switched from its previous configuration to Q4_K_M (d84544c6e), a quantization level that hits a better balance between response quality and memory footprint for our hardware profile. Keep-alive was also capped at one hour — previously it was unbounded, which meant idle model instances were holding GPU memory indefinitely.
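To make the keep-alive change concrete, here is a minimal sketch of what a capped Ollama request payload looks like. The model tag and helper name are illustrative assumptions, not Parthenon's actual configuration — Ollama's API does accept a `keep_alive` duration on generate/chat requests, and an unset value falls back to the server default rather than unloading immediately.

```php
<?php
// Hedged sketch: capping Ollama's keep_alive so an idle model releases
// GPU memory after an hour instead of holding it indefinitely.
// The model tag below is a hypothetical example, not Parthenon's real one.
function buildOllamaPayload(string $prompt): array
{
    return [
        'model'      => 'medgemma:q4_k_m', // illustrative Q4_K_M-quantized tag
        'prompt'     => $prompt,
        'keep_alive' => '1h',              // previously unbounded; now capped
    ];
}
```

A longer `keep_alive` trades GPU memory for warm-start latency; the one-hour cap splits the difference for interactive workloads with bursty traffic.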

Suggestion Chips and Prompt Framing

Two UX-level fixes improved Abby's interactive surface. First, the suggestion chips shown after Abby responses were reframed as user-voice action prompts (89df15329) — phrased as things the user would say, not things Abby would say. This is a subtle but important UX distinction that makes the chips feel like natural conversation continuations rather than canned bot responses.

Second, Abby now correctly parses MedGemma's Suggestion: response format (a1584eecc). MedGemma formats its suggested follow-up prompts differently than the default Ollama response schema, and the parser was silently dropping them. Chips now render correctly regardless of which underlying model is serving the request.
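The parsing fix boils down to recognizing line-prefixed suggestions. The sketch below shows the general idea under the assumption that MedGemma emits one `Suggestion:` line per chip; Parthenon's actual parser and its field names are not shown here.

```php
<?php
// Hedged sketch: extract MedGemma-style "Suggestion:" lines from a raw
// model response so they can be rendered as chips. Real parser internals
// in Parthenon may differ.
function extractSuggestions(string $response): array
{
    $chips = [];
    foreach (preg_split('/\R/', $response) as $line) {
        // Case-insensitive match on a "Suggestion: ..." prefix.
        if (preg_match('/^Suggestion:\s*(.+)$/i', trim($line), $m)) {
            $chips[] = $m[1];
        }
    }
    return $chips;
}
```

Because the default Ollama schema carries suggestions in structured fields rather than prose, a parser that only handled that schema would silently drop these lines — exactly the failure mode described above.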

FAQ Table Name Fix

A particularly insidious bug (5855a016c) had the FAQ promoter querying a table called abby_conversation_memory — a table that doesn't exist. The correct table is abby_messages. This was causing silent failures in the FAQ surface that were hard to trace because the error was swallowed upstream. Fixed and verified.


52-Test Abby Integration Suite

With the bug surface better understood, we added a comprehensive 52-test integration suite for Abby (bc6852954). The suite covers:

  • Conversation isolation — ensuring that concurrent sessions don't bleed history into each other, which was a real failure mode we observed in multi-user testing.
  • History persistence — verifying that message history is correctly stored and retrieved across requests within a session.
  • Model response parsing — covering both standard Ollama format and MedGemma's Suggestion: format.
  • Edge cases — empty inputs, malformed responses, cold-start behavior, and timeout handling.
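The conversation-isolation property the suite guards can be illustrated with a minimal in-memory store standing in for the real message table — this is a shape sketch, not Parthenon's actual storage layer, and the class and method names are hypothetical.

```php
<?php
// Hedged sketch: the invariant the isolation tests assert — history keyed
// strictly by session, so concurrent sessions never see each other's
// messages. An in-memory map stands in for the real persistence layer.
final class ConversationStore
{
    /** @var array<string, string[]> */
    private array $histories = [];

    public function append(string $sessionId, string $message): void
    {
        $this->histories[$sessionId][] = $message;
    }

    /** @return string[] */
    public function history(string $sessionId): array
    {
        return $this->histories[$sessionId] ?? [];
    }
}
```

The observed failure mode was effectively the opposite of this: history lookups that were not scoped tightly enough to the session, letting one user's messages leak into another's context.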

Having this suite in place means future Abby changes have a meaningful regression harness. The conversation isolation fix in particular is important for a healthcare analytics platform — cross-session data leakage would be a serious issue.


ResearchProfile Validator: PHP Empty Array Coercion

A quieter but important fix (f185bc1e): PHP's json_encode() serializes empty arrays as [] (a JSON array), but the ResearchProfile validator expects empty collections to be {} (a JSON object/dict). This caused validation failures for new research profiles with unpopulated array fields. The fix coerces empty PHP arrays to stdClass objects before encoding, which serializes correctly as {}.
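The coercion can be sketched as a recursive walk that replaces only *empty* arrays with `stdClass`, leaving populated lists alone. The function name here is an assumption, not the actual Parthenon helper.

```php
<?php
// Hedged sketch of the empty-array coercion described above. Only empty
// arrays become stdClass (serialized as {}); non-empty arrays are left
// intact so real lists still encode as JSON arrays.
function coerceEmptyArrays(array $data): array
{
    foreach ($data as $key => $value) {
        if (is_array($value)) {
            $data[$key] = $value === []
                ? new stdClass()               // json_encode() emits {} not []
                : coerceEmptyArrays($value);   // recurse into nested structures
        }
    }
    return $data;
}
```

PHP's built-in `JSON_FORCE_OBJECT` flag would also fix the empty case, but it converts *every* array to an object, breaking genuine list fields — hence the targeted coercion.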


Design Fixture Hygiene: Evicting Faker Data

Carried over from yesterday's cleanup work (documented in faker-fixture-cleanup-2026-03-16.md), the parthenon:export-designs command was exporting faker-generated test cohort definitions alongside real clinical designs — 11 files with Lorem ipsum names like ut-doloremque.json were appearing as untracked git changes after seeding.

The fix adds an isFakerGenerated() filter to DesignFixtureExporter that checks names and descriptions against 40 common words from PHP Faker's Lorem provider. A threshold of 3 matching words is required before a record is flagged — conservative enough to avoid false positives on real clinical terminology. The ExportSummary class now tracks a skipped count, and the CLI command reports it in its output.
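The threshold check can be sketched as a word-overlap count. The word list below is a small illustrative sample, not the full 40-word set used in DesignFixtureExporter, and the function name is an assumption.

```php
<?php
// Hedged sketch of the lorem-word threshold filter: tokenize the text,
// count how many distinct tokens appear in a lorem word list, and flag
// the record only if the count meets the threshold.
function looksFakerGenerated(string $text, int $threshold = 3): bool
{
    // Illustrative sample of Faker Lorem vocabulary, not the full set.
    $loremWords = ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'ut',
                   'doloremque', 'consectetur', 'quia', 'voluptatem'];

    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $hits   = count(array_intersect(array_unique($tokens), $loremWords));

    return $hits >= $threshold;
}
```

Requiring three distinct matches is what keeps a clinical design whose description happens to contain one Latin-looking word (e.g. "in situ") from being wrongly skipped.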

The auto-export CI job (d543cdc05) confirmed the fix: zero faker files exported on today's run.


Users Page: last_active_at and is_active Population

A small but visible fix (157a23dc3) ensures that the Users admin page correctly populates last_active_at and is_active fields. These were rendering as null/empty for all users, making the activity view useless for administrators monitoring platform usage.


What's Next

  • Abby performance profiling — now that connectivity and cold-start issues are resolved, the next focus is response latency. The Q4_K_M switch should help, but we want telemetry to confirm.
  • ResearchProfile validation coverage — the empty array coercion fix exposed a gap in our validator test coverage; we'll add targeted tests for edge-case serialization behavior.
  • Design fixture export pipeline — the faker filter is a good start, but we're considering a more robust fixture tagging system (e.g., a is_fixture boolean on design entities) to make the distinction between test and production data explicit at the schema level rather than inferred from content.
  • Abby session management UI — now that conversation isolation is working correctly at the backend level, surfacing session management controls in the frontend is the logical next step.