RAG Knowledge Assistant

A self-hostable assistant that answers questions over internal docs with citations, running on OpenAI in the cloud or Ollama on-prem.

Problem: Teams kept re-asking the same questions because answers were buried across wikis, PDFs and chat history.
Solution: A retrieval-augmented pipeline over PostgreSQL/pgvector with a FastAPI service, swappable between OpenAI and local Ollama models.
Impact: Cut time-to-answer for common internal questions from minutes of searching to a single grounded reply with sources.

Stack

Python
FastAPI
PostgreSQL
pgvector
OpenAI
Ollama
Redis

Context

Internal knowledge was spread across wikis, exported PDFs and months of chat history. People burned real time re-discovering answers that already existed somewhere.

Architecture

Documents are chunked, embedded and stored in PostgreSQL with pgvector. A FastAPI service handles retrieval (similarity search + reranking) and generation. A thin abstraction swaps the model provider between OpenAI and a local Ollama runtime, so the same deployment works in the cloud or fully on-prem. Redis caches hot queries and embeddings to keep latency low.
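
As a rough illustration of the provider swap described above, here is a minimal sketch. It assumes a single hypothetical interface (ChatProvider) and a stand-in backend (EchoProvider) where the real service would call OpenAI or Ollama. The names build_provider and answer, and the prompt wording, are also illustrative assumptions rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatProvider(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in backend for illustration only.

    A real implementation would call the OpenAI API or a local Ollama
    runtime here; this one just echoes the prompt with its name.
    """

    name: str

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def build_provider(kind: str) -> ChatProvider:
    # Hypothetical registry, keyed by a config value (e.g. an env var).
    providers = {
        "openai": EchoProvider("openai"),
        "ollama": EchoProvider("ollama"),
    }
    if kind not in providers:
        raise ValueError(f"unknown provider: {kind}")
    return providers[kind]


def answer(question: str, chunks: list[tuple[str, str]], provider: ChatProvider) -> dict:
    """Build a grounded prompt from (source_id, text) chunks.

    Returns the generated answer together with the ids of the chunks
    it was grounded on, so the caller can render citations.
    """
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in chunks)
    prompt = (
        "Answer using only the sources below.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    return {
        "answer": provider.generate(prompt),
        "sources": [source_id for source_id, _ in chunks],
    }
```

Because the calling code depends only on generate, switching between cloud and on-prem becomes a configuration change (build_provider("openai") versus build_provider("ollama")) rather than a code change.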

Details

Grounded answers always cite their source chunks, so users can verify each claim.
Ingestion is incremental: only changed documents are re-embedded.
Provider abstraction means no lock-in: switch to local models for sensitive data.
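
One common way to make ingestion incremental is to compare a content hash per document against the hash stored at the last run. The sketch below assumes that approach; the source does not say how changes are detected. The function names and the in-memory stored mapping (which would live in PostgreSQL in practice) are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(docs: dict[str, str], stored: dict[str, str]) -> tuple[list[str], list[str]]:
    """Decide which documents need work on this run.

    docs:   current doc_id -> text
    stored: doc_id -> hash recorded at the previous ingestion
    Returns (ids to embed or re-embed, ids whose chunks should be deleted).
    """
    # New documents have no stored hash; changed ones have a different hash.
    to_embed = sorted(
        doc_id for doc_id, text in docs.items()
        if stored.get(doc_id) != content_hash(text)
    )
    # Documents that disappeared from the source should be removed from the index.
    to_delete = sorted(doc_id for doc_id in stored if doc_id not in docs)
    return to_embed, to_delete
```

Unchanged documents produce the same hash and are skipped, so embedding cost scales with the size of the change set rather than the size of the corpus.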

What I’d do next

Add evaluation harnesses for answer quality, plus a feedback loop that promotes frequently confirmed answers into a faster cache.