Achilles Reliability Hardening: A Big Day for OHDSI Analytics
Today was one of those satisfying days where two major workstreams converged: we pushed the Ares data quality module from skeleton to a fully featured analytics suite with four distinct intelligence phases, and we permanently fixed a cluster of compounding bugs that had been making Achilles characterization runs fragile on large real-world datasets. Both efforts move Parthenon meaningfully closer to being a production-grade OHDSI research platform.
Ares Parity+ Milestone: From Stub to Suite
The headline work today was the completion of the Ares Parity+ milestone — a multi-phase build that brings Ares data quality analytics into Parthenon as a first-class citizen. The full design spec and devlog were committed alongside the code (docs(ares): add devlog and design specs), so future contributors have a clear paper trail for every architectural decision.
Backend Foundation
The backend work started with AresController, wiring up release and annotation API routes, and the ares:backfill-releases Artisan command for migrating legacy release data into the new schema. These two pieces together mean Parthenon can ingest historical Ares output and track new releases going forward without any manual data surgery.
Frontend Shell: Hub Dashboard, Releases & Annotations
The first frontend phase (feat: add Ares tab frontend) established the hub dashboard, releases list, and annotations views. This is the scaffolding everything else hangs off — a consistent navigation frame and data-loading pattern that the subsequent phase components slot into cleanly.
Phase 2 — Quality Intelligence
Phase 2 (feat(ares): implement Phase 2 Quality Intelligence) delivers the analytical meat that Ares users expect: DQ history trending, unmapped source codes exploration, and domain continuity checks. These views surface the data quality signals that are otherwise buried in raw Ares JSON exports, making them actionable directly inside Parthenon rather than requiring a separate Ares UI instance.
Phase 3 — Network Intelligence
Phase 3 (feat(ares): implement Phase 3 Network Intelligence) adds the collaborative research layer: site comparison, population coverage metrics, demographic diversity analysis, and feasibility assessment. This is particularly valuable for multi-site OHDSI network studies where understanding which sites have sufficient data for a given research question is half the battle.
Phase 4 — Cost Analysis
Phase 4 (feat(ares): implement Phase 4 Cost Analysis) rounds out the milestone with CostService, dedicated cost endpoints, and CostView. The hub skeletons are in place for further expansion. Healthcare cost data is notoriously messy in CDM mappings, so having a dedicated analysis surface for it — rather than treating it as just another domain — reflects how researchers actually use it.
CI Cleanup
The Ares build also came with a round of CI fixes: a recharts Tooltip formatter cast to any for strict TypeScript compatibility, PHPStan and TypeScript error resolution across Ares components, and a Pint auto-fix pass that also removed a stale AchillesRunSummary import and corrected react-joyride export references. These aren't glamorous commits, but a green CI pipeline is what lets us ship confidently.
Achilles Engine Reliability Hardening
Separately, the Phase 14 devlog, Achilles Engine Reliability Hardening, documents the root cause analysis and fixes for a set of compounding bugs that had made every characterization run on the SynPUF dataset (source 47, with a measurement table of 100M+ rows) fragile. Smaller datasets like Eunomia never surfaced these issues, which is exactly why production-scale testing matters.
Four bugs were identified and fixed:
Bug 1 (the killer): Non-resumable retries. RunAchillesJob used AchillesRun::create(), which hit a unique constraint on retry after a timeout. Replaced with AchillesRun::updateOrCreate() — the job is now fully idempotent across retry attempts.
Bug 2: Timeout too short. The 1-hour timeout ($timeout = 3600) was simply not enough — analysis 1811 alone (measurement records by concept by year-month) takes ~116 minutes on SynPUF. Bumped to 3 hours ($timeout = 10800), with $tries increased to 3 and a 30-second backoff.
Bug 3: No analysis-level resume. A run that completed 111 of 127 analyses and then died would restart from analysis 1 on retry, throwing away up to 175 minutes of completed work. The fix adds resume capability to AchillesEngineService so restarts pick up where they left off.
Bug 4: Zombie "running" status. Without a failed() method on RunAchillesJob, any failed run stayed in status=running indefinitely. The UI showed perpetually active jobs with no recovery path. The new failed() handler marks runs as failed with a timestamp, restoring operator visibility.
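To make Bugs 1 and 4 concrete, here is a minimal, language-neutral sketch in Python rather than the actual PHP. It assumes the unique constraint is on a single run key; the post does not say which column is involved. RunStore, run_key, and mark_failed are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Run:
    run_key: str                      # assumed unique key; the real column is not named in the post
    source_id: int
    status: str = "running"
    failed_at: Optional[datetime] = None


class RunStore:
    """In-memory stand-in for the runs table, keyed by a unique run_key."""

    def __init__(self) -> None:
        self._rows: Dict[str, Run] = {}

    def create(self, run_key: str, source_id: int) -> Run:
        # Original failure mode: a retry inserts the same key again and is rejected.
        if run_key in self._rows:
            raise KeyError(f"unique constraint violated for run_key={run_key}")
        self._rows[run_key] = Run(run_key, source_id)
        return self._rows[run_key]

    def update_or_create(self, run_key: str, source_id: int) -> Run:
        # Idempotent variant: a retry reuses the existing row and resets it to running.
        run = self._rows.get(run_key)
        if run is None:
            run = Run(run_key, source_id)
            self._rows[run_key] = run
        run.source_id = source_id
        run.status = "running"
        run.failed_at = None
        return run

    def mark_failed(self, run_key: str) -> None:
        # Counterpart of a job-level failure hook: a dead run must not stay "running".
        run = self._rows[run_key]
        run.status = "failed"
        run.failed_at = datetime.now(timezone.utc)

    def get(self, run_key: str) -> Run:
        return self._rows[run_key]
```

In this sketch, calling update_or_create twice with the same key returns the same row, while a second create raises. That is the difference between a retry that resumes and a retry that crashes on insert.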
Worth noting: status remains excluded from $fillable per HIGHSEC 3.1 — all status transitions go through explicit update() calls, not mass assignment. The reliability improvements don't compromise that security invariant.
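The general idea behind that invariant can be sketched as follows, again in Python rather than the project's PHP. The field names in FILLABLE, the status values, and the transition_to helper are assumptions for illustration, not the real model definition.

```python
from typing import Any, Dict

# Hypothetical set of valid statuses; the post only mentions "running" and "failed".
ALLOWED_STATUSES = frozenset({"pending", "running", "completed", "failed"})


class RunRecord:
    # Hypothetical allowlist; "status" is deliberately absent.
    FILLABLE = frozenset({"source_id", "total_analyses"})

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {"status": "pending"}

    def fill(self, payload: Dict[str, Any]) -> "RunRecord":
        # Mass assignment: keys outside the allowlist are dropped, so a
        # request payload cannot smuggle in a status change.
        for key, value in payload.items():
            if key in self.FILLABLE:
                self.attributes[key] = value
        return self

    def transition_to(self, status: str) -> "RunRecord":
        # Explicit, validated status change; the only path that touches status.
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.attributes["status"] = status
        return self
```

Here a payload containing a status key leaves the stored status untouched, and only the explicit transition call can change it.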
What's Next
With the Ares Parity+ milestone shipped, the immediate priority is integration testing across all four phases against a real Ares output directory, particularly the network comparison views, which depend on multi-source data being present. We'll also look at paginating the cost endpoint responses, since cost data can be voluminous.
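As a rough illustration of what paginating those responses might look like, here is a simple offset-based sketch in Python. The response envelope (data, page, per_page, total, last_page) and the default page size are assumptions, not the shape the cost endpoints actually return.

```python
from math import ceil
from typing import Any, Dict, Sequence


def paginate(rows: Sequence[Any], page: int = 1, per_page: int = 50) -> Dict[str, Any]:
    """Return one page of rows plus the metadata a client needs to fetch the rest."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    total = len(rows)
    start = (page - 1) * per_page
    return {
        "data": list(rows[start:start + per_page]),  # empty list past the last page
        "page": page,
        "per_page": per_page,
        "total": total,
        "last_page": max(1, ceil(total / per_page)),
    }
```

In a real endpoint the slicing would happen in the database query rather than in memory; the sketch only shows the contract between server and client.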
On the Achilles side, the next step is validating the resume logic under controlled timeout conditions in a staging environment before we consider SynPUF source 47 fully unblocked. Once that's confirmed stable, we can look at parallelizing the slower analyses (1811 in particular) to bring total characterization time down to something more reasonable for routine use.
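One way to reason about that validation is a small simulation of timeouts and analysis-level resume. The sketch below is in Python and is not the AchillesEngineService code. Only the 116-minute figure for analysis 1811 comes from the measurements above; the other analysis IDs and durations are made up for illustration, and the function names are hypothetical.

```python
from typing import Dict, Optional, Set


def run_once(durations: Dict[int, int], completed: Set[int], timeout_minutes: int) -> bool:
    """Run pending analyses in ID order until done or the time budget is spent.

    `completed` stands in for persisted per-analysis progress and is mutated
    in place. Returns True when every analysis has finished.
    """
    elapsed = 0
    for analysis_id in sorted(durations):
        if analysis_id in completed:
            continue  # resume: skip work finished by an earlier attempt
        cost = durations[analysis_id]
        if elapsed + cost > timeout_minutes:
            return False  # simulated timeout: the in-flight analysis is lost
        elapsed += cost
        completed.add(analysis_id)
    return True


def attempts_needed(durations: Dict[int, int], timeout_minutes: int,
                    max_tries: int, resume: bool = True) -> Optional[int]:
    """Return the attempt on which the run completes, or None if it never does."""
    completed: Set[int] = set()
    for attempt in range(1, max_tries + 1):
        if not resume:
            completed = set()  # old behaviour: every retry starts from scratch
        if run_once(durations, completed, timeout_minutes):
            return attempt
    return None


# Durations in minutes; only 1811 -> 116 is taken from the post.
EXAMPLE_DURATIONS = {101: 40, 401: 50, 1811: 116, 1814: 30}
```

With these example numbers, a 60-minute budget can never finish analysis 1811 no matter how many retries are allowed, a 180-minute budget without resume also never finishes because the total exceeds one attempt, and a 180-minute budget with resume completes on the second attempt.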
It was a dense day, but the platform is measurably more capable and more reliable for it.