If you are building AI features in 2026, retrieval-augmented generation (RAG) is still the most practical way to ship reliable answers on private data without fine-tuning a huge model for every use case. In this guide, you will build a production-ready RAG API with FastAPI, PostgreSQL + pgvector, hybrid search (vector + keyword), and reranking for better relevance.
Why this stack works in 2026
Many teams overcomplicate RAG systems with too many moving parts too early. A clean Postgres-first architecture is often enough for docs, support knowledge bases, internal wikis, product specs, and engineering runbooks.
- FastAPI gives a simple high-performance API layer.
- PostgreSQL + pgvector keeps vectors and metadata in one trusted database.
- Hybrid search improves recall by combining semantic and lexical matching.
- Reranking improves precision before generation.
Architecture
- Ingest documents and split into chunks.
- Generate embeddings and store in Postgres.
- At query time, run vector search + keyword search.
- Merge and rerank top candidates.
- Send best context to the LLM and return answer with citations.
1) Database setup (PostgreSQL + pgvector)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,
  title TEXT,
  chunk_text TEXT NOT NULL,
  chunk_index INT NOT NULL,
  metadata JSONB DEFAULT '{}'::jsonb,
  embedding vector(1536),
  tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || chunk_text)) STORED
);
CREATE INDEX idx_documents_embedding ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_documents_tsv ON documents USING GIN (tsv);
Use a dimension that matches your embedding model. If you switch models, migrate carefully and re-embed everything consistently; vectors from different models are not comparable. pgvector also supports HNSW indexes, which usually give better recall than IVFFlat at the cost of slower index builds.
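A cheap guard is to assert the dimension at ingestion time, before anything reaches Postgres. A minimal sketch, assuming `EMBEDDING_DIM` matches the `vector(1536)` column in the schema:

```python
EMBEDDING_DIM = 1536  # must match vector(1536) in the schema above

def validate_embedding(emb: list[float], expected: int = EMBEDDING_DIM) -> list[float]:
    """Fail fast if a model swap silently changed the vector size."""
    if len(emb) != expected:
        raise ValueError(f"embedding has {len(emb)} dims, expected {expected}")
    return emb
```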
2) Ingestion pipeline
Ingestion quality determines answer quality. Keep chunks semantically coherent and preserve section context.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Chunk:
    source: str
    title: str
    text: str
    index: int

def split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return parts
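Before wiring the splitter into ingestion, it is worth sanity-checking the overlap behavior. The splitter is repeated here so the snippet runs standalone:

```python
def split_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    # same splitter as in the ingestion pipeline, repeated for a standalone check
    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return parts

doc = "".join(str(i % 10) for i in range(3000))
chunks = split_text(doc)
assert all(len(c) <= 1200 for c in chunks)  # chunks never exceed max_chars
assert chunks[0][-150:] == chunks[1][:150]  # consecutive chunks share overlap chars
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which matters for retrieval quality.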
import psycopg
from pgvector.psycopg import register_vector  # adapts Python lists to the vector type

def upsert_chunks(conn, chunks: Iterable[Chunk], embed_fn):
    register_vector(conn)  # without this, psycopg cannot send embeddings to a vector column
    with conn.cursor() as cur:
        for c in chunks:
            emb = embed_fn(c.text)  # returns list[float]
            cur.execute("""
                INSERT INTO documents (source, title, chunk_text, chunk_index, metadata, embedding)
                VALUES (%s, %s, %s, %s, %s::jsonb, %s)
            """, (c.source, c.title, c.text, c.index, '{"lang":"en"}', emb))
    conn.commit()
3) Hybrid retrieval query
Run both searches and fuse the results. Vector search catches semantic matches; keyword search catches exact terms, version numbers, API names, and error codes. Cosine similarity and ts_rank_cd scores live on different scales, so fuse by rank (reciprocal rank fusion) rather than by raw score:
WITH vec AS (
  SELECT id, source, title, chunk_text,
         ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
  FROM documents
  ORDER BY embedding <=> $1::vector
  LIMIT 40
),
lex AS (
  SELECT id, source, title, chunk_text,
         ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, plainto_tsquery('english', $2)) DESC) AS rank
  FROM documents
  WHERE tsv @@ plainto_tsquery('english', $2)
  LIMIT 40
)
SELECT id, source, title, chunk_text,
       SUM(1.0 / (60 + rank)) AS fused_score  -- reciprocal rank fusion, k = 60
FROM (
  SELECT * FROM vec
  UNION ALL
  SELECT * FROM lex
) x
GROUP BY id, source, title, chunk_text
ORDER BY fused_score DESC
LIMIT 20;
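When the two searches run as separate round-trips instead of one SQL statement, you can fuse in the application. Reciprocal rank fusion (RRF) is a common choice because it ignores incomparable score scales; a minimal sketch over ID lists, with the conventional constant k = 60:

```python
def rrf_fuse(*rankings: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vec_hits = ["a", "b", "c"]  # IDs from vector search, best first
lex_hits = ["b", "d"]       # IDs from keyword search, best first
fused = rrf_fuse(vec_hits, lex_hits)
# "b" ranks first: it is the only ID both searches agree on
```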
4) FastAPI endpoint with reranking
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskReq(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskReq):
    # embed, hybrid_search, rerank, and generate are your own wrappers around
    # the embedding model, the SQL above, a reranker, and the LLM.
    q_emb = embed(req.question)
    candidates = hybrid_search(q_emb, req.question, limit=20)
    reranked = rerank(req.question, candidates)[:6]  # cross-encoder or hosted reranker
    context = "\n\n".join(f"[{i + 1}] {c['chunk_text']}" for i, c in enumerate(reranked))
    prompt = f"""Answer using only the context below.
If unsure, say you don't know.
Add citation numbers like [1], [2].

Question: {req.question}

Context:
{context}
"""
    answer = generate(prompt)
    return {
        "answer": answer,
        "citations": [
            {"n": i + 1, "source": c["source"], "title": c["title"]}
            for i, c in enumerate(reranked)
        ],
    }
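The context and citation assembly is worth pulling into a pure function so it can be unit-tested without a model in the loop. A sketch mirroring the endpoint's formatting:

```python
def build_context(chunks: list[dict]) -> tuple[str, list[dict]]:
    """Number chunks for citation; return (prompt_context, citation_list)."""
    context = "\n\n".join(
        f"[{i + 1}] {c['chunk_text']}" for i, c in enumerate(chunks)
    )
    citations = [
        {"n": i + 1, "source": c["source"], "title": c["title"]}
        for i, c in enumerate(chunks)
    ]
    return context, citations

ctx, cites = build_context(
    [{"chunk_text": "Use pgvector.", "source": "docs/db.md", "title": "DB"}]
)
# ctx == "[1] Use pgvector." and cites[0]["n"] == 1
```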
5) Practical production guardrails
A) Metadata filtering
Always filter by tenant, product, locale, or permission scope before retrieval. This prevents data leakage across customers or teams.
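One way to enforce this is to build the retrieval SQL through a helper that refuses to run without a tenant, so an unscoped query cannot be expressed at all. A sketch, assuming a tenant_id key inside the metadata JSONB (not part of the schema above):

```python
def scoped_vector_search_args(tenant_id: str, query_embedding: list[float]):
    """Build retrieval SQL that always filters by tenant before ranking."""
    if not tenant_id:
        raise ValueError("tenant_id is required for retrieval")
    sql = (
        "SELECT id, chunk_text FROM documents "
        "WHERE metadata->>'tenant_id' = %s "
        "ORDER BY embedding <=> %s::vector LIMIT 40"
    )
    return sql, (tenant_id, query_embedding)
```

Filtering in the WHERE clause also keeps the result set full-sized: filtering after retrieval can silently return fewer than 40 usable candidates.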
B) Freshness strategy
Add an updated_at column and a lightweight re-index queue; most teams do not need a full re-index every hour.
C) Observability
- Track retrieval hit rate.
- Track citation acceptance rate (did users click cited sources).
- Track no-answer rate and hallucination feedback.
D) Caching
Cache embeddings for repeated questions and cache final answers with short TTL for high-traffic endpoints.
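For the embedding cache, a small in-process memo keyed on a hash of the normalized question covers the repeated-question case. A sketch; in multi-process deployments you would swap the dict for Redis with a TTL:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(question: str, embed_fn) -> list[float]:
    """Memoize embeddings by a hash of the normalized question."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(question)
    return _embedding_cache[key]

calls = []
fake_embed = lambda q: calls.append(q) or [0.1, 0.2]
cached_embed("What is RAG?", fake_embed)
cached_embed("  what is rag?", fake_embed)  # normalizes to the same key
# fake_embed ran only once
```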
E) Safety policy
Use a response policy layer for sensitive actions: code execution instructions, secrets, legal/medical guidance, and internal-only docs.
6) Evaluation loop you can run weekly
Create a benchmark set of 100 real user questions with expected sources. Then score:
- Recall@k for retrieval
- MRR / nDCG for ranking quality
- Answer groundedness (citation-backed claims)
- User-rated helpfulness
Small, regular eval cycles usually outperform one-time architecture rewrites.
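The retrieval metrics above are only a few lines each, so the weekly eval can live in the repo next to the benchmark set. A sketch, where expected is the set of relevant source IDs for a question:

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Fraction of the expected sources found in the top-k results."""
    if not expected:
        return 0.0
    return len(expected & set(retrieved[:k])) / len(expected)

def mrr(retrieved: list[str], expected: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

# With retrieved = ["a", "b", "c"] and expected = {"b", "z"}:
# recall_at_k -> 0.5 (one of two expected sources found)
# mrr -> 0.5 (first relevant hit at rank 2)
```

Averaging these over the 100-question benchmark gives a single number per week, which makes regressions from chunking or index changes visible immediately.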
Common mistakes to avoid
- Using very large chunks that blur topics.
- Skipping keyword search entirely.
- No reranking step for top candidates.
- Returning answers without citations.
- Ignoring permissions in retrieval filters.
Final thoughts
For most developer teams in 2026, a Postgres-based RAG stack is the fastest path to useful AI features in production. Start with clean ingestion, hybrid retrieval, and reranking. Add observability from day one, and improve through weekly evaluation instead of guesswork. You will ship faster, reduce hallucinations, and keep infrastructure understandable for your team.
