RAG Knowledge Assistant
A self-hostable assistant that answers questions over internal docs with citations, running on OpenAI in the cloud or on Ollama on-prem.
- Problem
- Teams kept re-asking the same questions because answers were buried across wikis, PDFs and chat history.
- Solution
- A retrieval-augmented pipeline over PostgreSQL/pgvector with a FastAPI service, swappable between OpenAI and local Ollama models.
- Impact
- Replaced minutes of searching for common internal questions with a single grounded reply that cites its sources.
- Python
- FastAPI
- PostgreSQL
- pgvector
- OpenAI
- Ollama
- Redis
Context
Internal knowledge was spread across wikis, exported PDFs and months of chat history. People burned real time re-discovering answers that already existed somewhere.
Architecture
Documents are chunked, embedded and stored in PostgreSQL with pgvector. A FastAPI service handles retrieval (similarity search followed by reranking) and generation. A thin abstraction swaps the model provider between OpenAI and a local Ollama runtime, so the same deployment works in the cloud or fully on-prem. Redis caches hot queries and embeddings to keep latency low.
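To make the retrieve-then-generate flow concrete, here is a minimal sketch rather than the project's actual code. It assumes a hypothetical `Provider` protocol, a `FakeProvider` with a toy keyword embedding so the example runs offline, and an in-memory cosine search standing in for the pgvector query. Reranking and the Redis cache are left out.

```python
import re
from dataclasses import dataclass
from math import sqrt
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Chunk:
    source: str                    # where the chunk came from, used for citations
    text: str
    embedding: tuple[float, ...]


class Provider(Protocol):
    """Hypothetical interface: anything that can embed text and generate a reply."""

    def embed(self, text: str) -> tuple[float, ...]: ...
    def generate(self, prompt: str) -> str: ...


class FakeProvider:
    """Deterministic stand-in for an OpenAI- or Ollama-backed provider."""

    KEYWORDS = ("vpn", "vacation", "deploy")

    def embed(self, text: str) -> tuple[float, ...]:
        # Toy embedding: keyword counts, not a real embedding model.
        words = re.findall(r"[a-z]+", text.lower())
        return tuple(float(words.count(k)) for k in self.KEYWORDS)

    def generate(self, prompt: str) -> str:
        return f"[fake answer built from a {len(prompt)}-character prompt]"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float], chunks: Sequence[Chunk], k: int = 2) -> list[Chunk]:
    # In-memory stand-in for a pgvector similarity query over stored chunks.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]


def answer(question: str, chunks: Sequence[Chunk], provider: Provider, k: int = 2) -> dict:
    hits = retrieve(provider.embed(question), chunks, k)
    context = "\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(hits))
    prompt = f"Answer using only the numbered context.\n{context}\n\nQuestion: {question}"
    # Return the sources alongside the answer so the reply can cite them.
    return {"answer": provider.generate(prompt), "sources": [c.source for c in hits]}
```

Because `answer` only depends on the `Provider` protocol, swapping in a cloud or local backend would mean writing another class with the same two methods; the retrieval and prompt-building code would stay unchanged.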
Details
- Answers always cite the source chunks they are grounded in, so users can verify them.
- Ingestion is incremental: only changed documents are re-embedded.
- Provider abstraction means no lock-in: switch to local models for sensitive data.
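One common way to get incremental ingestion is to compare content hashes, sketched below. The write-up does not say how changes are detected, so the hashing approach and the names `content_hash` and `changed_documents` are assumptions for illustration only.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(current: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text differs from the stored hash.

    `current` maps document id to its latest text; `stored_hashes` maps document id
    to the hash recorded at the last ingestion (hypothetical bookkeeping table).
    """
    return sorted(
        doc_id
        for doc_id, text in current.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
```

Only the ids returned by `changed_documents` would need to be re-chunked and re-embedded; everything else keeps its existing vectors.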
What I’d do next
Add evaluation harnesses for answer quality and a feedback loop that promotes frequently-confirmed answers into a faster cache.
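As a rough idea of what the feedback loop could look like, the sketch below promotes an answer once users have confirmed it a set number of times. This is not built yet, so everything here is hypothetical: the `PromotionCache` class, the threshold of 3, and the in-memory dictionaries, which a real version would presumably replace with Redis.

```python
from collections import Counter
from typing import Optional


class PromotionCache:
    """Promote an answer to a fast lookup once it has enough confirmations."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._confirmations: Counter = Counter()
        self._promoted: dict[str, str] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings match.
        return " ".join(question.lower().split())

    def confirm(self, question: str, answer: str) -> None:
        """Record that a user marked this answer as correct."""
        key = self._key(question)
        self._confirmations[(key, answer)] += 1
        if self._confirmations[(key, answer)] >= self.threshold:
            self._promoted[key] = answer

    def lookup(self, question: str) -> Optional[str]:
        """Return the promoted answer, or None if the full pipeline should run."""
        return self._promoted.get(self._key(question))
```

A caller would check `lookup` first and fall back to retrieval and generation on a miss.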