AI/ML in 2026: Build a Production-Ready Hybrid RAG API with FastAPI, pgvector, and Reranking

If your AI feature still depends on plain vector search, you are likely missing relevant context and paying more than needed. In 2026, the most reliable retrieval-augmented generation (RAG) stacks combine dense vectors, keyword signals, and reranking before the LLM sees anything. In this tutorial, you will build a practical hybrid RAG API using FastAPI, PostgreSQL + pgvector, BM25-style scoring, and a lightweight reranker, with code you can adapt for production.

Why hybrid RAG is the default in 2026

Dense embeddings are great for semantic similarity, but they can miss exact terms such as error codes, function names, versions, and product SKUs. Keyword search catches these. Reranking then improves the final shortlist by evaluating full query-document relevance.

  • Vector search: captures semantic intent
  • Keyword scoring: catches exact tokens and rare terms
  • Reranking: improves precision at top-k

This three-step retrieval flow usually beats vector-only search on both accuracy and user trust.

Architecture overview

  1. Ingest docs and split into chunks
  2. Generate embeddings and store in PostgreSQL (pgvector)
  3. Run hybrid retrieval: vector + text score
  4. Rerank top candidates
  5. Send best context to LLM through a FastAPI endpoint

1) Database schema (PostgreSQL + pgvector)

Enable pgvector and create a table that supports both vector similarity and full-text ranking.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS doc_chunks (
  id BIGSERIAL PRIMARY KEY,
  doc_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

CREATE INDEX IF NOT EXISTS idx_doc_chunks_embedding
ON doc_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 200);

CREATE INDEX IF NOT EXISTS idx_doc_chunks_tsv
ON doc_chunks USING GIN (tsv);

Tip: for small datasets, exact vector search can be enough. For larger corpora, tune lists and query-time probes for speed/quality balance.
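As one example of query-time tuning (assuming the ivfflat index above), pgvector exposes a probes setting; higher values scan more inverted lists per query, trading speed for recall:

```sql
-- Default is 1; a common starting point is lists / 10.
-- Set per session or per transaction (SET LOCAL) before querying.
SET ivfflat.probes = 20;
```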

2) Ingestion and embedding pipeline (Python)

The pipeline below uses simple character windows with overlap to avoid hard context breaks; sentence-aware splitting is a worthwhile upgrade once the basics work. Keep chunk size stable so retrieval quality remains predictable.

from openai import OpenAI
from pgvector.psycopg import register_vector  # pip install pgvector numpy
import numpy as np
import psycopg

client = OpenAI()

def chunk_text(text, size=900, overlap=150):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

def embed(texts):
    res = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )
    return [d.embedding for d in res.data]

def ingest_document(conn, doc_id, text):
    chunks = chunk_text(text)
    vectors = embed(chunks)

    register_vector(conn)  # adapts numpy arrays to the vector column type
    with conn.cursor() as cur:
        for idx, (chunk, vec) in enumerate(zip(chunks, vectors)):
            cur.execute(
                """
                INSERT INTO doc_chunks (doc_id, chunk_index, content, embedding)
                VALUES (%s, %s, %s, %s)
                """,
                (doc_id, idx, chunk, np.array(vec))
            )
    conn.commit()
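To sanity-check the chunker above (repeated here so the snippet is self-contained), note how the 150-character overlap shifts each 900-character window forward by 750 characters:

```python
def chunk_text(text, size=900, overlap=150):
    # Same sliding-window chunker as in the ingestion pipeline above.
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(text)
print(len(chunks))                          # 3 windows: 0-900, 750-1650, 1500-2000
print(chunks[0][750:] == chunks[1][:150])   # True: overlap region is shared
```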

3) Hybrid retrieval SQL

The query below blends cosine similarity with text rank. Note that ts_rank_cd is not bounded to [0, 1] the way cosine similarity is, so adjust the weights (or normalize the scores) for your data.

WITH q AS (
  SELECT
    %(query_embedding)s::vector AS emb,
    plainto_tsquery('english', %(query_text)s) AS tsq
),
vec_hits AS (
  SELECT id FROM doc_chunks
  ORDER BY embedding <=> (SELECT emb FROM q)
  LIMIT 40
),
txt_hits AS (
  SELECT id FROM doc_chunks
  WHERE tsv @@ (SELECT tsq FROM q)
  ORDER BY ts_rank_cd(tsv, (SELECT tsq FROM q)) DESC
  LIMIT 40
),
scored AS (
  SELECT
    id, doc_id, chunk_index, content,
    1 - (embedding <=> (SELECT emb FROM q)) AS vector_score,
    ts_rank_cd(tsv, (SELECT tsq FROM q)) AS text_score
  FROM doc_chunks
  WHERE id IN (SELECT id FROM vec_hits UNION SELECT id FROM txt_hits)
)
SELECT *,
       (0.72 * vector_score + 0.28 * text_score) AS hybrid_score
FROM scored
ORDER BY hybrid_score DESC
LIMIT 40;

Why over-fetch to 40? Because reranking needs a broader candidate set than your final context window.
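One caveat when blending scores: ts_rank_cd is unbounded while cosine similarity lives in [0, 1], so raw weights can let one signal dominate. A minimal sketch (plain Python, with hypothetical score lists) min-max normalizes both signals before the weighted sum:

```python
def normalize(scores):
    # Min-max scale a list of scores into [0, 1]; constant lists map to 0.
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores, text_scores, w_vec=0.72, w_txt=0.28):
    # Blend the two signals only after putting them on the same scale.
    v = normalize(vector_scores)
    t = normalize(text_scores)
    return [w_vec * a + w_txt * b for a, b in zip(v, t)]

scores = hybrid_scores([0.91, 0.80, 0.75], [0.0, 2.4, 1.2])
```

The same normalization can be done in SQL with window functions if you prefer to keep it in the query.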

4) Reranking and answer generation API (FastAPI)

Rerank the top candidates, keep the best 6 to 10, then call your chat model.

from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()

class AskRequest(BaseModel):
    query: str

def rerank(query, docs):
    # Placeholder bi-encoder reranker: score each candidate by cosine
    # similarity between the query embedding and the chunk embedding.
    # Swap in a real cross-encoder or API reranker for production quality.
    texts = [query] + [d["content"] for d in docs]
    data = client.embeddings.create(model="text-embedding-3-small", input=texts).data
    q_vec, doc_vecs = data[0].embedding, [d.embedding for d in data[1:]]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

    ranked = sorted(zip(docs, doc_vecs), key=lambda p: cosine(q_vec, p[1]), reverse=True)
    return [d for d, _ in ranked]

@app.post("/ask")
def ask(req: AskRequest):
    q_emb = client.embeddings.create(
        model="text-embedding-3-large",
        input=req.query
    ).data[0].embedding

    candidates = hybrid_retrieve(req.query, q_emb)  # your DB function
    reranked = rerank(req.query, candidates)[:8]

    context = "\n\n---\n\n".join([d["content"] for d in reranked])

    prompt = f"""
Use ONLY the context below to answer.
If uncertain, say what is missing.

Context:
{context}

Question: {req.query}
"""

    answer = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "answer": answer.choices[0].message.content,
        "sources": [
            {"doc_id": d["doc_id"], "chunk_index": d["chunk_index"]}
            for d in reranked
        ]
    }

Production checklist

Evaluation first, not vibes

Create a small labeled dataset (50 to 200 queries) and track:

  • Recall@k for retrieval
  • Answer groundedness (citation-backed)
  • Hallucination rate
  • P95 latency and cost per request
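Recall@k from the checklist above takes only a few lines; here relevant is the labeled set of chunk ids for one query and retrieved is what your pipeline returned (both names are illustrative):

```python
def recall_at_k(relevant, retrieved, k):
    # Fraction of labeled-relevant ids that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / len(relevant)

r = recall_at_k({"c1", "c7", "c9"}, ["c7", "c2", "c1", "c5"], k=3)
print(r)  # 2 of 3 relevant ids in the top 3 -> 0.666...
```

Average this over your labeled query set and track it per release.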

Guardrails that matter

  • Strip secrets and credentials before indexing
  • Apply tenant filters in SQL for multi-tenant apps
  • Return citations with every answer
  • Set max context tokens to control cost spikes
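Capping context size can be as simple as a greedy budget; this sketch uses a character budget as a crude proxy for tokens (swap in a real tokenizer such as tiktoken when you need exact counts):

```python
def build_context(chunks, max_chars=12000, sep="\n\n---\n\n"):
    # Greedily add chunks (already rerank-ordered) until the budget is spent.
    out, used = [], 0
    for c in chunks:
        cost = len(c) + (len(sep) if out else 0)
        if used + cost > max_chars:
            break
        out.append(c)
        used += cost
    return sep.join(out)

ctx = build_context(["a" * 5000, "b" * 5000, "c" * 5000], max_chars=12000)
# Only the first two chunks fit; the third would blow the budget.
```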

Ops and scaling tips

  • Cache query embeddings for repeated prompts
  • Use async batching during ingestion
  • Re-embed only changed documents
  • Monitor retrieval misses and add synonym mappings
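Caching query embeddings can be sketched with a plain dict around whatever function calls your embedding API (embed_fn below is a stand-in); in production use an LRU or Redis with a TTL so the cache does not grow unbounded:

```python
_cache = {}

def embed_query_cached(query, embed_fn):
    # Memoize the embedding for repeated identical queries.
    if query not in _cache:
        _cache[query] = embed_fn(query)
    return _cache[query]

calls = []
def fake_embed(q):
    # Stub standing in for a real embeddings API call.
    calls.append(q)
    return [0.1, 0.2, 0.3]

embed_query_cached("how do I reset my key?", fake_embed)
embed_query_cached("how do I reset my key?", fake_embed)
print(len(calls))  # 1: the second lookup hits the cache
```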

Common pitfalls

  • Chunking too large: decreases retrieval precision
  • No keyword component: misses exact matches
  • Skipping rerank: relevant docs buried below top-k
  • No evaluation loop: regressions go unnoticed

Final thoughts

Hybrid RAG is practical, not exotic. With PostgreSQL, pgvector, and a thin FastAPI layer, you can deliver reliable AI answers that are fast, traceable, and affordable. Start with a measurable baseline, add reranking, and iterate with real user queries. That simple discipline is what separates demo-grade AI from production-grade developer tools in 2026.


© 7Tech – Programming and Tech Tutorials