GBC-AI: a sermon RAG for my church, running entirely on my own hardware

I attend Grace Bible Church. Every Sunday there's a sermon, posted to YouTube on Monday, and an archive going back years that nobody actually queries. The information is there — it's just locked inside ~30-minute videos with no transcripts, no timestamps, and no way to ask "what did Pastor Bryan say about Romans 8 the last time he preached on it?"

GBC-AI is the answer. It's a local-first sermon RAG that ingests every video, transcribes and diarizes it, indexes the chunks, and lets a congregant ask questions with cited answers that link back to the exact timestamp in the source video.

The shape of it

Three entry points, one pipeline behind them:

app.py — Streamlit chat UI, the thing congregants actually use
api/main.py — FastAPI REST surface, the thing future integrations consume
launch.py — unified launcher with venv setup, preflight checks, model warm-up

The data flow:

YouTube / local video
        │
        ▼  yt-dlp
download_manager.py
        │
        ▼  5-tier pipeline
ingest.py
   ├─ faster-whisper        (large-v3-turbo, GPU)
   ├─ pyannote.audio        (speaker diarization)
   ├─ chunker               (semantic + windowed)
   ├─ BGE embeddings        (1024-dim)
   └─ ChromaDB              (persistent vector store)
        │
        ▼
chat_engine.py
   ├─ RAG retrieval         (top-k + hybrid search)
   └─ LM Studio streaming   (currently Gemma-4-26b-a4b)

A central llm_client.py wraps the streaming client with retries and timeouts. SQLite via database.py replaced an earlier JSON store once the dataset grew. Pydantic env config in settings.py keeps all the secrets and tunables in one place.

Ten phases, all done

The implementation plan ran across ten phases — every one is complete and merged:

Structured logging — Loguru everywhere, JSON sinks for prod
Ruff — locked down style + lint at CI time
Exception hierarchy — IngestError, RetrievalError, etc., with structured fields
LLM client centralization — retries, timeouts, token tracking
DB migration — JSON → SQLite for query-shaped data, ChromaDB stays for vectors
API layer — FastAPI with versioned routes
Auth — token-based, integrates with our existing church accounts
Docker — Compose stack for the API + DB; Streamlit still runs natively
Observability — Langfuse for trace inspection
Pipeline orchestration — Prefect for the ingest jobs

Last meaningful work was the multi-model routing layer and a Redis-backed task queue. There's an IMPROVEMENT_PLAN_V2.md with the next round of work.

What runs where

This whole thing is local-first by design. The 179 GB on disk breaks down roughly as:

sermon-rag/videos/ — raw MP4/MKV, the actual sermons (>100 GB)
sermon-rag/audio/ — extracted audio + voice separation models
sermon-rag/db/ — ChromaDB persistent vector store + SQLite
venv with CUDA torch — ~20–40 GB

The LLM weights live in LM Studio's own directory, not in the repo. The 24 GB on the RTX 3090 is enough to run a Gemma-4-26b-a4b for the retrieval-augmented Q&A while keeping ChromaDB warm.

The hard parts

Whisper accuracy on theological vocabulary. "Propitiation" and "soteriology" don't appear in Whisper's training data the way "weather" does. I tried hot-word lists and they hurt more than they helped — Whisper's a sequence model, not a dictionary. The fix was a post-process pass that uses an LLM with a custom Bible-specific vocabulary prompt to rewrite obvious mishearings.

Citations with timestamps. The answer "Pastor Bryan talked about the imputed righteousness of Christ in his Romans 4 series" is useless without a deeplink. Every chunk in ChromaDB carries (video_id, start_seconds, end_seconds). The chat engine surfaces those as YouTube ?t= links in its output, so the user can jump to the moment.

Cache race conditions. Documented in cache.py:43–46 and IMPROVEMENT_PLAN_V2.md. Multiple tabs hitting the API simultaneously could trigger duplicate work. The fix is a proper double-checked lock with a TTL guard, but I'd rather rip the cache out and front the LLM with Redis instead.

What's next

The roadmap I have written down:

Fix the 26 failing tests in test_ingest.py (heavy module-level imports block clean mocking)
Replace the homegrown cache with Redis-fronted memoization
Add the email-digest feature — weekly auto-summary mailed to opted-in congregants
Sermon comparison — "show me every time this pastor preached on grace, sorted by year"

This was the project that taught me how much you can do with a single 24 GB GPU if you're disciplined about model sizes. It's also the project that's most directly useful to people in my actual life — which is its own kind of vindication.

The code is at github.com/Raymondriter/GBC-AI.