Administering Abby

This guide covers day-to-day management of Abby's knowledge base, health monitoring, and the ChromaDB Studio admin panel.

ChromaDB Studio

ChromaDB Studio is accessible at Admin > Services > ChromaDB. It provides the panels described below.

Collection Overview

A dropdown selector lists all ChromaDB collections with vector counts. Selecting a collection shows:

  • Vector count — total embedded chunks
  • Dimensions — embedding size (384 or 768)
  • Metadata fields — available filter keys
  • Facet distribution — breakdown of metadata values (source, year, package, etc.)
  • Sample records — preview of stored chunks with metadata tags

Ingestion Actions

Five action buttons trigger knowledge base updates:

| Button | Endpoint | What It Does | Duration |
| --- | --- | --- | --- |
| Ingest Docs | POST /chroma/ingest-docs | Re-embeds platform documentation | ~1 min |
| Ingest Clinical | POST /chroma/ingest-clinical | Embeds OMOP concepts with SapBERT | ~5 min |
| Promote FAQ | POST /chroma/promote-faq | Promotes frequent user questions | ~30 s |
| Ingest OHDSI Papers | POST /chroma/ingest-ohdsi-papers | Embeds research PDFs | ~15-30 min |
| Ingest OHDSI Knowledge | POST /chroma/ingest-ohdsi-knowledge | Embeds Book, HADES vignettes, forums | ~2-5 min |

Ingestion timing

Paper ingestion processes 2,000+ PDFs with SapBERT embeddings and can take 15-30 minutes, so run it during a maintenance window. All other ingestions are idempotent: unchanged content is skipped via content hashing.
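
The skip-on-unchanged behaviour can be pictured with a minimal sketch. It assumes SHA-256 over the chunk text and an in-memory dictionary of previously recorded hashes; the hash function and storage Abby actually uses are not stated in this guide, and `content_hash`, `needs_embedding`, and `seen_hashes` are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable hex digest for a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_embedding(chunk_id: str, text: str, seen_hashes: dict[str, str]) -> bool:
    """Return True if the chunk is new or changed, and record its hash.

    `seen_hashes` is a hypothetical store mapping chunk IDs to the hash
    recorded by the previous ingestion run.
    """
    digest = content_hash(text)
    if seen_hashes.get(chunk_id) == digest:
        return False  # unchanged content: skip re-embedding
    seen_hashes[chunk_id] = digest
    return True
```

Running the same ingestion twice under this scheme embeds each chunk once; only edited chunks are embedded again.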

The Retrieval tab lets you test semantic queries against any collection:

  1. Select a collection from the dropdown
  2. Enter a natural language query
  3. Adjust K (number of results, 1-50)
  4. Results show matched chunks with cosine distance scores and metadata

This is useful for verifying that Abby's knowledge base contains the content you expect and that retrieval quality is adequate.
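
When reading the scores, keep in mind that a smaller cosine distance means a closer match. The sketch below shows the idea in plain Python, assuming distance is defined as 1 minus cosine similarity and that no vector is all zeros; `cosine_distance` and `top_k` are illustrative helpers, not part of ChromaDB Studio.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k chunk IDs closest to the query, smallest distance first."""
    scored = [(chunk_id, cosine_distance(query, vec)) for chunk_id, vec in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

A real collection uses an approximate nearest-neighbour index rather than a full scan, so this only illustrates how the ranking and scores should be interpreted.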

3D Vector Explorer

The Semantic Map visualizes the vector space as an interactive 3D point cloud:

  • Points colored by cluster assignment
  • Rotate, zoom, and pan with mouse controls
  • Click points to inspect metadata
  • Outliers and detected duplicates are highlighted
  • Powered by PCA+UMAP projection, accelerated by Solr pre-computation

Solr Acceleration

The 3D vector explorer uses Apache Solr to cache pre-computed projections. Without Solr, each projection request requires ~8-10 seconds of live PCA+UMAP computation. With Solr, cached projections load in under 500ms.

Updating the Solr Index

After ingesting new content, update the Solr index:

# Index a specific collection
docker compose exec php php artisan solr:index-vector-explorer --collection=ohdsi_papers

# Index all collections
docker compose exec php php artisan solr:index-vector-explorer

# Fresh re-index (delete existing, then index)
docker compose exec php php artisan solr:index-vector-explorer --fresh

# Custom sample size (default: 5000 vectors)
docker compose exec php php artisan solr:index-vector-explorer --sample-size=10000

Solr Core Schema

The vector_explorer Solr core stores:

| Field | Type | Description |
| --- | --- | --- |
| point_id | string | collection:chroma_id (composite key) |
| collection_name | string | ChromaDB collection name |
| x, y, z | float | 3D projected coordinates |
| cluster_id | int | HDBSCAN cluster assignment |
| cluster_label | string | Auto-generated cluster label |
| is_outlier | boolean | Outlier flag from HDBSCAN |
| source | string | Content source tag |
| title | string | Document/paper title |
| meta_s_* | dynamic string | String metadata fields |
| meta_i_* | dynamic int | Integer metadata fields |
| meta_f_* | dynamic float | Float metadata fields |
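
To make the composite key and the dynamic-field prefixes concrete, here is a rough sketch of how one projected point might be shaped into a document for this core. The function `build_solr_doc` is hypothetical, and so are two choices it makes: every metadata key is routed to a dynamic field by Python type, and boolean metadata is skipped because the table lists no boolean dynamic field.

```python
def build_solr_doc(collection: str, chroma_id: str, xyz: tuple[float, float, float],
                   cluster_id: int, is_outlier: bool, metadata: dict) -> dict:
    """Shape one projected vector into a dict matching the schema table."""
    doc = {
        "point_id": f"{collection}:{chroma_id}",  # composite key
        "collection_name": collection,
        "x": xyz[0],
        "y": xyz[1],
        "z": xyz[2],
        "cluster_id": cluster_id,
        "is_outlier": is_outlier,
    }
    for key, value in metadata.items():
        if isinstance(value, bool):
            continue  # assumption: no boolean dynamic field, so skip
        if isinstance(value, int):
            doc[f"meta_i_{key}"] = value
        elif isinstance(value, float):
            doc[f"meta_f_{key}"] = value
        elif isinstance(value, str):
            doc[f"meta_s_{key}"] = value
    return doc
```

For example, a paper chunk with `{"year": 2021}` in its metadata would carry a `meta_i_year` field under these assumptions.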

Health Monitoring

Abby's components are monitored on the System Health dashboard:

| Component | Check | Healthy | Degraded |
| --- | --- | --- | --- |
| ChromaDB | HTTP heartbeat | Connected, collections queryable | Timeout or connection refused |
| Ollama | Model availability | MedGemma loaded and responsive | Model not found or OOM |
| Solr | Core ping | vector_explorer core reachable | Core missing or Solr down |
| Redis | Connection test | Connected | Connection refused |
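
The healthy/degraded split in the table can be read as "the probe returned" versus "the probe raised". A minimal sketch of that pattern follows; `check_component` and the idea of passing a probe callable are assumptions made for illustration and do not describe how the dashboard is implemented.

```python
from typing import Callable


def check_component(name: str, probe: Callable[[], None]) -> dict:
    """Run a probe; report 'healthy' if it returns, 'degraded' if it raises."""
    try:
        probe()
    except Exception as exc:  # e.g. timeout, refused connection, missing model
        return {"component": name, "status": "degraded", "detail": str(exc)}
    return {"component": name, "status": "healthy", "detail": ""}
```

Each row of the table would supply its own probe, for instance an HTTP heartbeat for ChromaDB or a core ping for Solr.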

Conversation Memory Management

Pruning Old Conversations

Conversation memory entries older than the TTL (default: 90 days) can be pruned per user:

# Prune conversations older than 90 days for user ID 1
curl -X POST "http://localhost:8002/chroma/prune-conversations/1?ttl_days=90"

# Custom TTL
curl -X POST "http://localhost:8002/chroma/prune-conversations/1?ttl_days=30"

FAQ Promotion

The FAQ promotion algorithm scans all user conversation collections and promotes questions that meet the following criteria:

  • Frequency: Asked 5+ times
  • Breadth: By 3+ distinct users
  • Similarity: 0.85 cosine similarity threshold for grouping

# Promote FAQ from last 7 days of conversations
curl -X POST "http://localhost:8002/chroma/promote-faq?days=7"

# Promote from last 30 days
curl -X POST "http://localhost:8002/chroma/promote-faq?days=30"
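
The three criteria can be sketched as a grouping pass followed by a filter. This is one plausible reading, not the service's algorithm: it assumes greedy grouping against the first question of each group, assumes all three thresholds must hold, and uses hypothetical names (`QuestionGroup`, `promote_faq`).

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class QuestionGroup:
    text: str                      # first question seen, used as the label
    embedding: list[float]         # embedding of that first question
    users: set[int] = field(default_factory=set)
    count: int = 0


def promote_faq(questions: list[tuple[str, list[float], int]], min_count: int = 5,
                min_users: int = 3, threshold: float = 0.85) -> list[str]:
    """Group (text, embedding, user_id) rows and return the promoted labels."""
    groups: list[QuestionGroup] = []
    for text, embedding, user_id in questions:
        for group in groups:
            if cosine_similarity(embedding, group.embedding) >= threshold:
                break  # joins the first sufficiently similar group
        else:
            group = QuestionGroup(text=text, embedding=embedding)
            groups.append(group)
        group.count += 1
        group.users.add(user_id)
    return [g.text for g in groups
            if g.count >= min_count and len(g.users) >= min_users]
```

Under this reading, a question asked five times by a single user is not promoted, because it fails the breadth criterion.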

Updating the Knowledge Base

After Documentation Changes

# Rebuild docs and re-ingest
./deploy.sh --docs
curl -X POST http://localhost:8002/chroma/ingest-docs

After Vocabulary Updates

curl -X POST http://localhost:8002/chroma/ingest-clinical

Harvesting New Research Papers

cd OHDSI-scraper

# Run the full harvester (scrapes new papers since last run)
python3 harvester.py --email your@email.com

# Then ingest into ChromaDB
curl -X POST http://localhost:8002/chroma/ingest-ohdsi-papers

Refreshing Forum Content

cd OHDSI-scraper
python3 scrape_forums.py
curl -X POST http://localhost:8002/chroma/ingest-ohdsi-knowledge

After Any Ingestion: Update Solr

docker compose exec php php artisan solr:index-vector-explorer --fresh
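
If you run the ingest-then-reindex sequence often, it can be wrapped in a small script. The sketch below uses only the endpoint and command shown in this guide; it assumes the ingestion endpoint blocks until ingestion finishes, that an empty POST body is accepted, and that the script runs from the directory containing the Compose file. The helper names and the one-hour timeout are hypothetical.

```python
import subprocess
import urllib.request

BASE_URL = "http://localhost:8002"  # AI service address used in this guide


def ingest_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for an ingestion endpoint."""
    return urllib.request.Request(f"{BASE_URL}/chroma/{endpoint}", data=b"", method="POST")


def reindex_command() -> list[str]:
    """Return the fresh re-index command from this guide as an argument list."""
    # When no terminal is attached, `docker compose exec` may also need -T.
    return ["docker", "compose", "exec", "php", "php", "artisan",
            "solr:index-vector-explorer", "--fresh"]


def ingest_then_reindex(endpoint: str) -> None:
    """POST to the ingestion endpoint, then refresh the Solr index."""
    with urllib.request.urlopen(ingest_request(endpoint), timeout=3600) as resp:
        resp.read()  # raises on HTTP errors, so the re-index is skipped on failure
    subprocess.run(reindex_command(), check=True)
```

Calling `ingest_then_reindex("ingest-docs")` would then perform both steps in order and stop if either fails.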

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CHROMA_HOST | chromadb | ChromaDB hostname |
| CHROMA_PORT | 8000 | ChromaDB port |
| OLLAMA_BASE_URL | http://host.docker.internal:11434 | Ollama API base URL |
| OLLAMA_MODEL | MedAIBase/MedGemma1.5:4b | LLM model identifier |
| OHDSI_CORPUS_DIR | /app/ohdsi_corpus | Path to harvested PDF corpus |
| OHDSI_BOOK_DIR | /app/book_of_ohdsi | Path to Book of OHDSI chapters |
| OHDSI_VIGNETTES_DIR | /app/hades_vignettes | Path to HADES vignettes |
| OHDSI_FORUMS_DIR | /app/ohdsi_forums | Path to forum threads |
| DOCS_DIR | /app/docs | Path to platform documentation |