AI/ML in 2026: Build a Production-Ready Hybrid RAG API with FastAPI, pgvector, and Reranking

If your AI feature still depends on plain vector search, you are likely missing relevant context and paying more than needed. In 2026, the most reliable retrieval-augmented generation (RAG) stacks combine dense vectors, keyword signals, and reranking before the LLM sees anything. In this tutorial, you will build a practical hybrid RAG API using FastAPI, PostgreSQL + pgvector, BM25-style scoring, and a lightweight reranker, with code you can adapt for production.

Why hybrid RAG is the default in 2026

Dense embeddings are great for semantic similarity, but they can miss exact terms such as error codes, function names, versions, and product SKUs. Keyword search catches these. Reranking then improves the final shortlist by evaluating full query-document relevance.

  • Vector search: captures semantic intent
  • Keyword scoring: catches exact tokens and rare terms
  • Reranking: improves precision at top-k

This three-step retrieval flow usually beats vector-only search on both accuracy and user trust.

Architecture overview

  1. Ingest docs and split into chunks
  2. Generate embeddings and store in PostgreSQL (pgvector)
  3. Run hybrid retrieval: vector + text score
  4. Rerank top candidates
  5. Send best context to LLM through a FastAPI endpoint

1) Database schema (PostgreSQL + pgvector)

Enable pgvector and create a table that supports both vector similarity and full-text ranking.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS doc_chunks (
  id BIGSERIAL PRIMARY KEY,
  doc_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

CREATE INDEX IF NOT EXISTS idx_doc_chunks_embedding
ON doc_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 200);

CREATE INDEX IF NOT EXISTS idx_doc_chunks_tsv
ON doc_chunks USING GIN (tsv);

Tip: for small datasets, exact vector search can be enough. For larger corpora, tune lists and query-time probes for speed/quality balance.
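As one example of query-time tuning (assuming the ivfflat index above), pgvector exposes a probes setting; higher values scan more inverted lists per query, trading speed for recall:

```sql
-- Default is 1; a common starting point is lists / 10.
-- Set per session or per transaction (SET LOCAL) before querying.
SET ivfflat.probes = 20;
```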

2) Ingestion and embedding pipeline (Python)

The pipeline below uses simple character windows with overlap to avoid hard context breaks; sentence-aware splitting is a worthwhile upgrade once the basics work. Keep chunk size stable so retrieval quality remains predictable.

from openai import OpenAI
from pgvector.psycopg import register_vector  # pip install pgvector numpy
import numpy as np
import psycopg

client = OpenAI()

def chunk_text(text, size=900, overlap=150):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

def embed(texts):
    res = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )
    return [d.embedding for d in res.data]

def ingest_document(conn, doc_id, text):
    chunks = chunk_text(text)
    vectors = embed(chunks)

    register_vector(conn)  # adapts numpy arrays to the vector column type
    with conn.cursor() as cur:
        for idx, (chunk, vec) in enumerate(zip(chunks, vectors)):
            cur.execute(
                """
                INSERT INTO doc_chunks (doc_id, chunk_index, content, embedding)
                VALUES (%s, %s, %s, %s)
                """,
                (doc_id, idx, chunk, np.array(vec))
            )
    conn.commit()
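To sanity-check the chunker above (repeated here so the snippet is self-contained), note how the 150-character overlap shifts each 900-character window forward by 750 characters:

```python
def chunk_text(text, size=900, overlap=150):
    # Same sliding-window chunker as in the ingestion pipeline above.
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(text)
print(len(chunks))                          # 3 windows: 0-900, 750-1650, 1500-2000
print(chunks[0][750:] == chunks[1][:150])   # True: overlap region is shared
```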

3) Hybrid retrieval SQL

The query below blends cosine similarity with text rank. Note that ts_rank_cd is not bounded to [0, 1] the way cosine similarity is, so adjust the weights (or normalize the scores) for your data.

WITH q AS (
  SELECT
    %(query_embedding)s::vector AS emb,
    plainto_tsquery('english', %(query_text)s) AS tsq
),
vec_hits AS (
  SELECT id FROM doc_chunks
  ORDER BY embedding <=> (SELECT emb FROM q)
  LIMIT 40
),
txt_hits AS (
  SELECT id FROM doc_chunks
  WHERE tsv @@ (SELECT tsq FROM q)
  ORDER BY ts_rank_cd(tsv, (SELECT tsq FROM q)) DESC
  LIMIT 40
),
scored AS (
  SELECT
    id, doc_id, chunk_index, content,
    1 - (embedding <=> (SELECT emb FROM q)) AS vector_score,
    ts_rank_cd(tsv, (SELECT tsq FROM q)) AS text_score
  FROM doc_chunks
  WHERE id IN (SELECT id FROM vec_hits UNION SELECT id FROM txt_hits)
)
SELECT *,
       (0.72 * vector_score + 0.28 * text_score) AS hybrid_score
FROM scored
ORDER BY hybrid_score DESC
LIMIT 40;

Why over-fetch to 40? Because reranking needs a broader candidate set than your final context window.
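One caveat when blending scores: ts_rank_cd is unbounded while cosine similarity lives in [0, 1], so raw weights can let one signal dominate. A minimal sketch (plain Python, with hypothetical score lists) min-max normalizes both signals before the weighted sum:

```python
def normalize(scores):
    # Min-max scale a list of scores into [0, 1]; constant lists map to 0.
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores, text_scores, w_vec=0.72, w_txt=0.28):
    # Blend the two signals only after putting them on the same scale.
    v = normalize(vector_scores)
    t = normalize(text_scores)
    return [w_vec * a + w_txt * b for a, b in zip(v, t)]

scores = hybrid_scores([0.91, 0.80, 0.75], [0.0, 2.4, 1.2])
```

The same normalization can be done in SQL with window functions if you prefer to keep it in the query.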

4) Reranking and answer generation API (FastAPI)

Rerank the top candidates, keep the best 6 to 10, then call your chat model.

from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()

class AskRequest(BaseModel):
    query: str

def rerank(query, docs):
    # Placeholder bi-encoder reranker: score each candidate by cosine
    # similarity between the query embedding and the chunk embedding.
    # Swap in a real cross-encoder or API reranker for production quality.
    texts = [query] + [d["content"] for d in docs]
    data = client.embeddings.create(model="text-embedding-3-small", input=texts).data
    q_vec, doc_vecs = data[0].embedding, [d.embedding for d in data[1:]]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

    ranked = sorted(zip(docs, doc_vecs), key=lambda p: cosine(q_vec, p[1]), reverse=True)
    return [d for d, _ in ranked]

@app.post("/ask")
def ask(req: AskRequest):
    q_emb = client.embeddings.create(
        model="text-embedding-3-large",
        input=req.query
    ).data[0].embedding

    candidates = hybrid_retrieve(req.query, q_emb)  # your DB function
    reranked = rerank(req.query, candidates)[:8]

    context = "\n\n---\n\n".join([d["content"] for d in reranked])

    prompt = f"""
Use ONLY the context below to answer.
If uncertain, say what is missing.

Context:
{context}

Question: {req.query}
"""

    answer = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "answer": answer.choices[0].message.content,
        "sources": [
            {"doc_id": d["doc_id"], "chunk_index": d["chunk_index"]}
            for d in reranked
        ]
    }

Production checklist

Evaluation first, not vibes

Create a small labeled dataset (50 to 200 queries) and track:

  • Recall@k for retrieval
  • Answer groundedness (citation-backed)
  • Hallucination rate
  • P95 latency and cost per request
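Recall@k from the checklist above takes only a few lines; here relevant is the labeled set of chunk ids for one query and retrieved is what your pipeline returned (both names are illustrative):

```python
def recall_at_k(relevant, retrieved, k):
    # Fraction of labeled-relevant ids that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / len(relevant)

r = recall_at_k({"c1", "c7", "c9"}, ["c7", "c2", "c1", "c5"], k=3)
print(r)  # 2 of 3 relevant ids in the top 3 -> 0.666...
```

Average this over your labeled query set and track it per release.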

Guardrails that matter

  • Strip secrets and credentials before indexing
  • Apply tenant filters in SQL for multi-tenant apps
  • Return citations with every answer
  • Set max context tokens to control cost spikes
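Capping context size can be as simple as a greedy budget; this sketch uses a character budget as a crude proxy for tokens (swap in a real tokenizer such as tiktoken when you need exact counts):

```python
def build_context(chunks, max_chars=12000, sep="\n\n---\n\n"):
    # Greedily add chunks (already rerank-ordered) until the budget is spent.
    out, used = [], 0
    for c in chunks:
        cost = len(c) + (len(sep) if out else 0)
        if used + cost > max_chars:
            break
        out.append(c)
        used += cost
    return sep.join(out)

ctx = build_context(["a" * 5000, "b" * 5000, "c" * 5000], max_chars=12000)
# Only the first two chunks fit; the third would blow the budget.
```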

Ops and scaling tips

  • Cache query embeddings for repeated prompts
  • Use async batching during ingestion
  • Re-embed only changed documents
  • Monitor retrieval misses and add synonym mappings
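Caching query embeddings can be sketched with a plain dict around whatever function calls your embedding API (embed_fn below is a stand-in); in production use an LRU or Redis with a TTL so the cache does not grow unbounded:

```python
_cache = {}

def embed_query_cached(query, embed_fn):
    # Memoize the embedding for repeated identical queries.
    if query not in _cache:
        _cache[query] = embed_fn(query)
    return _cache[query]

calls = []
def fake_embed(q):
    # Stub standing in for a real embeddings API call.
    calls.append(q)
    return [0.1, 0.2, 0.3]

embed_query_cached("how do I reset my key?", fake_embed)
embed_query_cached("how do I reset my key?", fake_embed)
print(len(calls))  # 1: the second lookup hits the cache
```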

Common pitfalls

  • Chunking too large: decreases retrieval precision
  • No keyword component: misses exact matches
  • Skipping rerank: relevant docs buried below top-k
  • No evaluation loop: regressions go unnoticed

Final thoughts

Hybrid RAG is practical, not exotic. With PostgreSQL, pgvector, and a thin FastAPI layer, you can deliver reliable AI answers that are fast, traceable, and affordable. Start with a measurable baseline, add reranking, and iterate with real user queries. That simple discipline is what separates demo-grade AI from production-grade developer tools in 2026.


© 7Tech – Programming and Tech Tutorials