If you are building AI features in 2026, retrieval-augmented generation (RAG) is still the most practical way to ship reliable answers on private data without fine-tuning a huge model for every use case. In this guide, you will build a production-ready RAG API with FastAPI, PostgreSQL + pgvector, hybrid search (vector + keyword), and reranking for better relevance.
Why this stack works in 2026
Many teams overcomplicate RAG systems with too many moving parts too early. A clean Postgres-first architecture is often enough for docs, support knowledge bases, internal wikis, product specs, and engineering runbooks.
- FastAPI gives a simple high-performance API layer.
- PostgreSQL + pgvector keeps vectors and metadata in one trusted database.
- Hybrid search improves recall by combining semantic and lexical matching.
- Reranking improves precision before generation.
Architecture
- Ingest documents and split into chunks.
- Generate embeddings and store in Postgres.
- At query time, run vector search + keyword search.
- Merge and rerank top candidates.
- Send best context to the LLM and return answer with citations.
1) Database setup (PostgreSQL + pgvector)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,
  title TEXT,
  chunk_text TEXT NOT NULL,
  chunk_index INT NOT NULL,
  metadata JSONB DEFAULT '{}'::jsonb,
  embedding vector(1536),
  tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || chunk_text)) STORED
);
CREATE INDEX idx_documents_embedding ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_documents_tsv ON documents USING GIN (tsv);
Use a dimension that matches your embedding model. If you switch models, migrate carefully and re-embed everything consistently; vectors from different models are not comparable. pgvector also supports HNSW indexes, which usually give better recall than IVFFlat at the cost of slower index builds.
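A cheap guard is to assert the dimension at ingestion time, before anything reaches Postgres. A minimal sketch, assuming `EMBEDDING_DIM` matches the `vector(1536)` column in the schema:

```python
EMBEDDING_DIM = 1536  # must match vector(1536) in the schema above

def validate_embedding(emb: list[float], expected: int = EMBEDDING_DIM) -> list[float]:
    """Fail fast if a model swap silently changed the vector size."""
    if len(emb) != expected:
        raise ValueError(f"embedding has {len(emb)} dims, expected {expected}")
    return emb
```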
2) Ingestion pipeline
Ingestion quality determines answer quality. Keep chunks semantically coherent and preserve section context.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Chunk:
    source: str
    title: str
    text: str
    index: int

def split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return parts
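Before wiring the splitter into ingestion, it is worth sanity-checking the overlap behavior. The splitter is repeated here so the snippet runs standalone:

```python
def split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    # same splitter as in the ingestion pipeline, repeated for a standalone check
    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return parts

doc = "".join(str(i % 10) for i in range(3000))
chunks = split_text(doc)
assert all(len(c) <= 1200 for c in chunks)  # chunks never exceed max_chars
assert chunks[0][-150:] == chunks[1][:150]  # consecutive chunks share overlap chars
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which matters for retrieval quality.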
import psycopg
from pgvector.psycopg import register_vector  # adapts Python lists to the vector type

def upsert_chunks(conn, chunks: Iterable[Chunk], embed_fn):
    register_vector(conn)  # without this, psycopg cannot send embeddings to a vector column
    with conn.cursor() as cur:
        for c in chunks:
            emb = embed_fn(c.text)  # returns list[float]
            cur.execute("""
                INSERT INTO documents (source, title, chunk_text, chunk_index, metadata, embedding)
                VALUES (%s, %s, %s, %s, %s::jsonb, %s)
            """, (c.source, c.title, c.text, c.index, '{"lang":"en"}', emb))
    conn.commit()
3) Hybrid retrieval query
Run both searches and fuse the results. Vector search catches semantic matches; keyword search catches exact terms, version numbers, API names, and error codes. Cosine similarity and ts_rank_cd scores live on different scales, so fuse by rank (reciprocal rank fusion) rather than by raw score:
WITH vec AS (
  SELECT id, source, title, chunk_text,
         ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
  FROM documents
  ORDER BY embedding <=> $1::vector
  LIMIT 40
),
lex AS (
  SELECT id, source, title, chunk_text,
         ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, plainto_tsquery('english', $2)) DESC) AS rank
  FROM documents
  WHERE tsv @@ plainto_tsquery('english', $2)
  LIMIT 40
)
SELECT id, source, title, chunk_text,
       SUM(1.0 / (60 + rank)) AS fused_score  -- reciprocal rank fusion, k = 60
FROM (
  SELECT * FROM vec
  UNION ALL
  SELECT * FROM lex
) x
GROUP BY id, source, title, chunk_text
ORDER BY fused_score DESC
LIMIT 20;
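When the two searches run as separate round-trips instead of one SQL statement, you can fuse in the application. Reciprocal rank fusion (RRF) is a common choice because it ignores incomparable score scales; a minimal sketch over ID lists, with the conventional constant k = 60:

```python
def rrf_fuse(*rankings: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vec_hits = ["a", "b", "c"]  # IDs from vector search, best first
lex_hits = ["b", "d"]       # IDs from keyword search, best first
fused = rrf_fuse(vec_hits, lex_hits)
# "b" ranks first: it is the only ID both searches agree on
```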
4) FastAPI endpoint with reranking
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskReq(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskReq):
    # embed, hybrid_search, rerank, and generate are your own wrappers around
    # the embedding model, the SQL above, a reranker, and the LLM.
    q_emb = embed(req.question)
    candidates = hybrid_search(q_emb, req.question, limit=20)
    reranked = rerank(req.question, candidates)[:6]  # cross-encoder or hosted reranker
    context = "\n\n".join(f"[{i + 1}] {c['chunk_text']}" for i, c in enumerate(reranked))
    prompt = f"""Answer using only the context below.
If unsure, say you don't know.
Add citation numbers like [1], [2].

Question: {req.question}

Context:
{context}
"""
    answer = generate(prompt)
    return {
        "answer": answer,
        "citations": [
            {"n": i + 1, "source": c["source"], "title": c["title"]}
            for i, c in enumerate(reranked)
        ],
    }
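The context and citation assembly is worth pulling into a pure function so it can be unit-tested without a model in the loop. A sketch mirroring the endpoint's formatting:

```python
def build_context(chunks: list[dict]) -> tuple[str, list[dict]]:
    """Number chunks for citation; return (prompt_context, citation_list)."""
    context = "\n\n".join(
        f"[{i + 1}] {c['chunk_text']}" for i, c in enumerate(chunks)
    )
    citations = [
        {"n": i + 1, "source": c["source"], "title": c["title"]}
        for i, c in enumerate(chunks)
    ]
    return context, citations

ctx, cites = build_context(
    [{"chunk_text": "Use pgvector.", "source": "docs/db.md", "title": "DB"}]
)
# ctx == "[1] Use pgvector." and cites[0]["n"] == 1
```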
5) Practical production guardrails
A) Metadata filtering
Always filter by tenant, product, locale, or permission scope before retrieval. This prevents data leakage across customers or teams.
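One way to enforce this is to build the retrieval SQL through a helper that refuses to run without a tenant, so an unscoped query cannot be expressed at all. A sketch, assuming a tenant_id key inside the metadata JSONB (not part of the schema above):

```python
def scoped_vector_search_args(tenant_id: str, query_embedding: list[float]):
    """Build retrieval SQL that always filters by tenant before ranking."""
    if not tenant_id:
        raise ValueError("tenant_id is required for retrieval")
    sql = (
        "SELECT id, chunk_text FROM documents "
        "WHERE metadata->>'tenant_id' = %s "
        "ORDER BY embedding <=> %s::vector LIMIT 40"
    )
    return sql, (tenant_id, query_embedding)
```

Filtering in the WHERE clause also keeps the result set full-sized: filtering after retrieval can silently return fewer than 40 usable candidates.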
B) Freshness strategy
Add an updated_at column and a lightweight re-index queue; most teams do not need a full re-index every hour.
C) Observability
- Track retrieval hit rate.
- Track citation acceptance rate (did users click cited sources).
- Track no-answer rate and hallucination feedback.
D) Caching
Cache embeddings for repeated questions and cache final answers with short TTL for high-traffic endpoints.
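For the embedding cache, a small in-process memo keyed on a hash of the normalized question covers the repeated-question case. A sketch; in multi-process deployments you would swap the dict for Redis with a TTL:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(question: str, embed_fn) -> list[float]:
    """Memoize embeddings by a hash of the normalized question."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(question)
    return _embedding_cache[key]

calls = []
fake_embed = lambda q: calls.append(q) or [0.1, 0.2]
cached_embed("What is RAG?", fake_embed)
cached_embed("  what is rag?", fake_embed)  # normalizes to the same key
# fake_embed ran only once
```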
E) Safety policy
Use a response policy layer for sensitive actions: code execution instructions, secrets, legal/medical guidance, and internal-only docs.
6) Evaluation loop you can run weekly
Create a benchmark set of 100 real user questions with expected sources. Then score:
- Recall@k for retrieval
- MRR / nDCG for ranking quality
- Answer groundedness (citation-backed claims)
- User-rated helpfulness
Small, regular eval cycles usually outperform one-time architecture rewrites.
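The retrieval metrics above are only a few lines each, so the weekly eval can live in the repo next to the benchmark set. A sketch, where expected is the set of relevant source IDs for a question:

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Fraction of the expected sources found in the top-k results."""
    if not expected:
        return 0.0
    return len(expected & set(retrieved[:k])) / len(expected)

def mrr(retrieved: list[str], expected: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

# With retrieved = ["a", "b", "c"] and expected = {"b", "z"}:
# recall_at_k -> 0.5 (one of two expected sources found)
# mrr -> 0.5 (first relevant hit at rank 2)
```

Averaging these over the 100-question benchmark gives a single number per week, which makes regressions from chunking or index changes visible immediately.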
Common mistakes to avoid
- Using very large chunks that blur topics.
- Skipping keyword search entirely.
- No reranking step for top candidates.
- Returning answers without citations.
- Ignoring permissions in retrieval filters.
Final thoughts
For most developer teams in 2026, a Postgres-based RAG stack is the fastest path to useful AI features in production. Start with clean ingestion, hybrid retrieval, and reranking. Add observability from day one, and improve through weekly evaluation instead of guesswork. You will ship faster, reduce hallucinations, and keep infrastructure understandable for your team.
