Building the Ingestion Pipeline: File Staging, Project Management, and the Path to Aqueduct
A massive day on the ingestion front — 87 commits landed in Parthenon today, almost entirely focused on building out a brand-new end-to-end data ingestion pipeline. We now have a fully wired system for creating ingestion projects, uploading raw files, staging them into a schema-isolated PostgreSQL environment, and handing off to Aqueduct for ETL. This has been a long time coming.
The Ingestion Pipeline: From Zero to Staged Data
The headline work today is the ingestion subsystem — a cohesive feature that takes a researcher from "I have some CSV files" to "my data is staged and ready for CDM mapping," all within the Parthenon UI.
Project Model and Access Control
Everything starts with IngestionProject — a new Eloquent model and accompanying Laravel policy (aacf41c93). Projects act as the top-level container for a researcher's raw data, tracking lifecycle state from initial creation through file upload, staging, and ultimately a ready status that unlocks downstream actions. The policy enforces ownership and role-based access from the start, ensuring researchers only see and act on their own projects.
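That lifecycle is easiest to picture as a small state machine. A minimal Python sketch, with the caveat that the status names and allowed transitions here are illustrative guesses, not the actual column values on IngestionProject:

```python
from enum import Enum

class ProjectStatus(Enum):
    # illustrative status names; the real model's values may differ
    CREATED = "created"
    UPLOADING = "uploading"
    STAGING = "staging"
    READY = "ready"

# allowed transitions; staging falls back to uploading on error (assumption)
TRANSITIONS = {
    ProjectStatus.CREATED: {ProjectStatus.UPLOADING},
    ProjectStatus.UPLOADING: {ProjectStatus.STAGING},
    ProjectStatus.STAGING: {ProjectStatus.READY, ProjectStatus.UPLOADING},
    ProjectStatus.READY: set(),
}

def can_transition(current: ProjectStatus, target: ProjectStatus) -> bool:
    return target in TRANSITIONS[current]
```

Centralizing the transition table like this is what lets the controller's status-transition endpoints reject illegal jumps with a single check.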
A dedicated set of form requests and a full IngestionProjectController (f48992b5b) wire up the REST surface — create, list, show, and status-transition endpoints — all sitting behind properly scoped middleware. Notably, a follow-up fix (60bd93bf7) closed a gap where the ingestion routes were missing permission middleware entirely, a reminder to audit new route groups at the point of creation rather than after the fact.
Queue-Based File Staging
The core of the pipeline is StageFileJob (58ed82726), a queued Laravel job that handles the heavy lifting of getting uploaded files into a usable database structure. Each file gets dispatched independently, meaning multi-file uploads process in parallel without blocking the UI. The job hands off to StagingService (28797e458), which is responsible for:
- Schema creation: Each ingestion project gets its own isolated PostgreSQL schema, preventing cross-project data bleed during the staging phase.
- Data loading: Reads uploaded files and bulk-loads rows into the staging schema, handling type inference at the column level.
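In spirit, those two responsibilities reduce to deriving an isolated schema name per project and inferring each column's type from its raw values. A rough Python sketch — the actual StagingService is PHP, and the `ingest_` naming scheme and type ladder below are hypothetical:

```python
def staging_schema(project_id: int) -> str:
    # one PostgreSQL schema per project; the "ingest_" prefix is a guess
    return f"ingest_{project_id}"

def infer_column_type(values: list[str]) -> str:
    # column-level inference: pick the narrowest type that fits every value
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "text"
    for sql_type, parse in (("bigint", int), ("double precision", float)):
        try:
            for v in non_empty:
                parse(v)
            return sql_type
        except ValueError:
            continue
    return "text"
```

Because each file's job runs independently, inference only ever needs one file's worth of values in memory at a time.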
Alongside staging, we introduced a column and table name sanitizer (aacf41c93) that handles the unglamorous but critical job of cleaning arbitrary user-supplied headers into valid SQL identifiers. It handles reserved word collisions, strips illegal characters, and deduplicates columns — exactly the kind of defensive logic that prevents subtle downstream failures when researchers upload files with headers like "order", "select", or "patient id (v2) [final]".
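The sanitizer's core logic looks roughly like the following Python sketch (the real implementation is PHP; the reserved-word list and suffixing conventions here are illustrative):

```python
import re

# small illustrative subset of PostgreSQL reserved words
RESERVED = {"order", "select", "group", "table", "user"}

def sanitize_identifiers(headers: list[str]) -> list[str]:
    seen: dict[str, int] = {}
    out = []
    for header in headers:
        # collapse runs of illegal characters into underscores
        name = re.sub(r"[^a-z0-9_]+", "_", header.strip().lower()).strip("_") or "col"
        if name[0].isdigit():
            name = f"c_{name}"          # identifiers can't start with a digit
        if name in RESERVED:
            name = f"{name}_col"        # dodge reserved-word collisions
        count = seen.get(name, 0)
        seen[name] = count + 1
        out.append(name if count == 0 else f"{name}_{count + 1}")
    return out
```

Feeding it the pathological headers from above shows why this matters: `["order", "patient id (v2) [final]", "order"]` comes out as three distinct, valid identifiers instead of two collisions and a syntax error.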
Frontend: Project List, Detail View, and Multi-File Upload
The UI side kept pace with the backend. New React hooks and API bindings (01a657dd0) wrap all the ingestion endpoints, and a project list component gives researchers a dashboard view of their active and completed ingestion projects. The Upload Files tab was restructured (a7b2c59d4) to support multi-file selection with per-file status indicators — upload progress, staging status, and any errors surface inline rather than in a toast that disappears.
The project detail view is the centrepiece here: it shows project metadata, file status, and — once the project reaches ready — an Open in Aqueduct button.
Auto-Creation and Aqueduct Handoff
Two commits tie the lifecycle together neatly. When a project transitions to ready status (all files staged without error), the system automatically creates a staging Source record (e0efbb89b) — the entity that Aqueduct uses to know where to pull data from. No manual configuration step required.
The Open in Aqueduct button (fbea80b04) then deep-links into Aqueduct with that source pre-selected, dropping the researcher directly into the ETL mapping workflow with their data already wired up. This is the kind of cross-tool integration that makes the platform feel like a platform rather than a collection of loosely related tools.
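The deep link itself is likely nothing more than Aqueduct's ETL route with the source ID as a query parameter. A sketch for flavor — the route path and parameter name are guesses, not Aqueduct's actual URL scheme:

```python
from urllib.parse import urlencode

def aqueduct_deep_link(base_url: str, source_id: int) -> str:
    # "/aqueduct/etl" and the "source" parameter are hypothetical
    return f"{base_url}/aqueduct/etl?{urlencode({'source': source_id})}"
```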
On the Horizon: Abby 2.0 Phase 3
While today's work was all ingestion, the devlog notes from last week signal what's coming next on the AI side. Abby 2.0 Phase 3 — the Semantic Knowledge Graph — is in active planning. The design calls for a KnowledgeGraphService that traverses concept_ancestor and concept_relationship tables with Redis-backed caching, paired with a DataProfileService that builds a living coverage profile of the institution's CDM: temporal range, domain density, vocabulary completeness, and proactive gap warnings.
The goal is to give Abby genuine relational understanding of clinical concepts — so when a researcher asks about a condition with thin data at this institution, she warns them before they build a cohort on a foundation of 12 patients. That work will touch ai/app/knowledge/, the live context pipeline in chroma/live_context.py, and the context assembler. Expect those commits to start landing soon.
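The planned hierarchy traversal can be sketched as a memoized walk up a parent map, with a plain dict standing in for Redis. The real service will query concept_ancestor/concept_relationship; everything below is an assumption about shape, not the actual API:

```python
def collect_ancestors(concept_id: int, parents: dict, cache: dict) -> set:
    """Transitive ancestors of a concept; `cache` plays the role of Redis."""
    if concept_id in cache:
        return cache[concept_id]
    result = set()
    for parent in parents.get(concept_id, ()):
        result.add(parent)
        result |= collect_ancestors(parent, parents, cache)
    cache[concept_id] = result
    return result
```

Because OMOP concept ancestry is a DAG, memoizing each node's ancestor set means shared ancestors are computed once, which is exactly the access pattern a Redis-backed cache rewards.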
What's Next
- Ingestion error handling: Surface per-row staging errors back to the UI, and define retry semantics for StageFileJob on transient failures.
- Schema lifecycle management: Staged schemas need a cleanup path, either on project deletion or after successful CDM load in Aqueduct.
- Abby Phase 3 kickoff: KnowledgeGraphService and DataProfileService implementation, starting with the OMOP hierarchy traversal and Redis caching layer.
- Staging source permissions: Review whether auto-created Sources inherit project-level ACLs correctly or need explicit permission wiring.
Solid day. The ingestion pipeline has been a missing piece for researchers who want to bring their own data into the platform without going through a manual DBA-assisted ETL setup. Today's work makes that self-service path real.