Shipping an AI feature is easy. Shipping one that stays accurate as your data, prompts, and models change is the hard part. In 2026, the teams moving fastest are the ones treating LLM quality like a CI problem: every prompt change gets evaluated, every retrieval change gets measured, and regressions are blocked before production.
In this guide, you will build a practical RAG evaluation pipeline using Python, synthetic test generation, and rubric-based scoring. The goal is simple: catch bad answers before users do.
Why classic unit tests are not enough for AI features
If your app uses retrieval-augmented generation (RAG), failures usually come from one of three places:
- Retrieval failure: the right chunk was never retrieved.
- Grounding failure: the chunk was retrieved, but the answer ignored it.
- Formatting/behavior failure: the model answered, but not in required structure, tone, or policy.
Traditional assertions like assert output == expected fail because LLM outputs are non-deterministic: the same prompt can produce different but equally valid phrasings. We need scored checks and thresholds instead of strict equality.
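To make this concrete, here is a minimal sketch of a scored check. The overlap heuristic and the 0.6 threshold are illustrative choices, not a recommendation; the point is that a graded score with a threshold survives paraphrasing where exact equality does not:

```python
def overlap_score(got: str, expected: str) -> float:
    """Fraction of the expected answer's tokens that appear in the model output."""
    exp_tokens = set(expected.lower().split())
    got_tokens = set(got.lower().split())
    return len(exp_tokens & got_tokens) / max(1, len(exp_tokens))

expected = "Tokens are cached for 15 minutes."
got = "The gateway caches tokens for 15 minutes."

# Strict equality fails on a perfectly good paraphrase...
assert got != expected
# ...while a scored check with a threshold passes.
assert overlap_score(got, expected) >= 0.6
```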
Architecture of a robust evaluation loop
- Build a versioned test set from your documents.
- Generate synthetic Q&A cases per document chunk.
- Run your RAG pipeline for each test case.
- Score answer quality with rubric judges.
- Fail CI if key metrics drop below thresholds.
Project setup
python -m venv .venv
source .venv/bin/activate
pip install openai pydantic pandas numpy rich

Create this structure:
rag-eval/
  data/docs/
  data/testset.jsonl
  src/retriever.py
  src/rag.py
  src/eval.py

Step 1: Build synthetic test cases from your docs
Use your source documents to generate realistic user questions. Each test should include the expected evidence chunk so you can measure retrieval precision.
# src/eval.py
from pydantic import BaseModel
from typing import List

class TestCase(BaseModel):
    question: str
    expected_answer: str
    source_doc_id: str
    source_chunk_id: str
    tags: List[str] = []

def save_testset(cases: List[TestCase], path="data/testset.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for c in cases:
            f.write(c.model_dump_json() + "\n")

# In production, generate with an LLM from each chunk.
seed_cases = [
    TestCase(
        question="How long are API tokens cached in the gateway?",
        expected_answer="Tokens are cached for 15 minutes and proactively refreshed at 12 minutes.",
        source_doc_id="gateway-config",
        source_chunk_id="chunk-17",
        tags=["auth", "cache"],
    )
]
save_testset(seed_cases)

Tip: keep 20 to 50 hand-verified gold cases even if most of your suite is synthetic. Gold cases protect against synthetic drift.
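One way to sketch the synthetic-generation step itself: walk each chunk and emit a test case whose expected evidence is that chunk. The make_question helper below is a hypothetical placeholder for a real LLM call, so the questions it produces are crude; the structure is what matters. It returns plain dicts matching the TestCase schema, so you can feed them straight to TestCase(**c):

```python
from typing import Dict, List

def make_question(chunk_text: str) -> str:
    # Placeholder: in production, ask an LLM to write a realistic user
    # question grounded in this chunk. Here we just lift the first sentence.
    topic = chunk_text.split(".")[0].strip()
    return f"What does the documentation say about: {topic}?"

def cases_from_chunks(doc_id: str, chunks: List[Dict]) -> List[Dict]:
    # Each synthetic case records the chunk it came from, so retrieval
    # precision stays measurable.
    return [
        {
            "question": make_question(c["text"]),
            "expected_answer": c["text"],
            "source_doc_id": doc_id,
            "source_chunk_id": c["chunk_id"],
            "tags": ["synthetic"],
        }
        for c in chunks
    ]
```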
Step 2: Instrument your RAG pipeline for evaluation
Your RAG function should return not just the final answer, but also retrieved chunk IDs. Without this, you cannot diagnose whether failures are retrieval or reasoning issues.
# src/rag.py
from typing import Dict, List

# Mock interfaces for clarity
def retrieve(query: str, k: int = 5) -> List[Dict]:
    # Return docs like: {"chunk_id": "chunk-17", "text": "..."}
    ...

def generate_answer(query: str, contexts: List[Dict]) -> str:
    ...

def answer_with_trace(query: str) -> Dict:
    ctx = retrieve(query, k=5)
    answer = generate_answer(query, ctx)
    return {
        "answer": answer,
        "retrieved_chunk_ids": [c["chunk_id"] for c in ctx],
    }

Step 3: Add rubric-based judges
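To sanity-check the trace shape without a live index or model, you can wire the same function to stubs. The stub return values below are illustrative stand-ins for real retrieval and generation, which keeps this runnable in local tests:

```python
from typing import Dict, List

# Stub implementations; swap in your real vector search and model call.
def retrieve(query: str, k: int = 5) -> List[Dict]:
    return [{"chunk_id": "chunk-17", "text": "Tokens are cached for 15 minutes."}]

def generate_answer(query: str, contexts: List[Dict]) -> str:
    return "Tokens are cached for 15 minutes."

def answer_with_trace(query: str) -> Dict:
    ctx = retrieve(query, k=5)
    return {
        "answer": generate_answer(query, ctx),
        "retrieved_chunk_ids": [c["chunk_id"] for c in ctx],
    }

trace = answer_with_trace("How long are API tokens cached?")
# The chunk IDs in the trace are what let evaluation separate
# retrieval failures from reasoning failures.
assert trace["retrieved_chunk_ids"] == ["chunk-17"]
```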
Instead of exact matching, score each answer on groundedness and correctness. In 2026, a common pattern is to run a small, cheap judge model first and escalate to an expensive judge only on borderline cases.
# src/eval.py
from dataclasses import dataclass

@dataclass
class Score:
    correctness: float        # 0..1
    groundedness: float       # 0..1
    retrieved_expected: bool

def score_case(
    question: str,
    expected: str,
    got: str,
    expected_chunk: str,
    retrieved_chunks: list[str],
) -> Score:
    # Replace with judge-model call in production.
    # Heuristic fallback keeps local CI fast.
    got_low = got.lower()
    exp_low = expected.lower()
    token_overlap = len(set(got_low.split()) & set(exp_low.split())) / max(1, len(set(exp_low.split())))
    correctness = min(1.0, token_overlap * 1.2)
    retrieved_expected = expected_chunk in retrieved_chunks
    groundedness = 1.0 if retrieved_expected else 0.2
    return Score(
        correctness=round(correctness, 3),
        groundedness=groundedness,
        retrieved_expected=retrieved_expected,
    )

Step 4: Run batch evaluation and compute release gates
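The cheap-then-expensive judge cascade can be sketched generically. The low/high band below is an assumed tuning knob, and both judges are passed in as callables so you can plug in any models; only scores the cheap judge is unsure about pay for the stronger model:

```python
from typing import Callable

def cascade_judge(
    cheap: Callable[[str, str], float],
    expensive: Callable[[str, str], float],
    expected: str,
    got: str,
    low: float = 0.4,
    high: float = 0.8,
) -> float:
    """Score with the cheap judge; escalate only borderline cases."""
    score = cheap(expected, got)
    if low <= score < high:
        # Cheap judge is unsure: spend budget on the stronger model.
        score = expensive(expected, got)
    return score
```

A confident cheap score (say 0.9) is returned directly; a borderline one (say 0.5) triggers the expensive judge.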
# src/run_eval.py
import json
from statistics import mean
from src.rag import answer_with_trace
from src.eval import score_case

THRESHOLDS = {
    "correctness": 0.78,
    "groundedness": 0.90,
    "retrieval_hit_rate": 0.85,
}

rows = []
with open("data/testset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        tc = json.loads(line)
        out = answer_with_trace(tc["question"])
        s = score_case(
            question=tc["question"],
            expected=tc["expected_answer"],
            got=out["answer"],
            expected_chunk=tc["source_chunk_id"],
            retrieved_chunks=out["retrieved_chunk_ids"],
        )
        rows.append(s)

metrics = {
    "correctness": mean(r.correctness for r in rows),
    "groundedness": mean(r.groundedness for r in rows),
    "retrieval_hit_rate": mean(1.0 if r.retrieved_expected else 0.0 for r in rows),
}
print("Metrics:", metrics)

failed = [k for k, v in metrics.items() if v < THRESHOLDS[k]]
if failed:
    raise SystemExit(f"Evaluation failed gates: {failed}")
print("All quality gates passed.")

Run it in CI on every pull request. If you deploy multiple models, run the same suite per model and compare deltas, not just absolute scores.
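Comparing deltas rather than absolutes can be a few lines. The 0.02 tolerance below is an assumed noise band you would calibrate against your own run-to-run variance; it flags only regressions larger than measurement noise:

```python
def compare_models(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Return the metrics where the candidate regresses beyond the noise tolerance."""
    return [
        k for k in baseline
        if baseline[k] - candidate.get(k, 0.0) > tolerance
    ]

regressions = compare_models(
    {"correctness": 0.82, "groundedness": 0.93},
    {"correctness": 0.81, "groundedness": 0.88},
)
# A 0.01 correctness dip sits inside the noise band; the 0.05
# groundedness drop does not, so only groundedness is flagged.
assert regressions == ["groundedness"]
```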
Step 5: Add regression triage that engineers will actually use
When a gate fails, developers need a fast path to root cause. Log these fields for every failed case:
- Question
- Expected answer
- Model answer
- Expected chunk ID
- Retrieved chunk IDs
- Prompt version + retriever version
This lets you quickly classify failures into indexing, retrieval, prompt, or model regressions.
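A rough triage heuristic over those logged fields might look like the sketch below. The cutoffs (0.5 groundedness, the 0.78 correctness gate) are illustrative, and the buckets mirror the three failure types from earlier: retrieval, grounding, and formatting/behavior:

```python
def classify_failure(score: dict) -> str:
    """Hypothetical triage heuristic over logged per-case evaluation fields."""
    if not score["retrieved_expected"]:
        return "retrieval"        # right chunk never came back: fix index or retriever
    if score["groundedness"] < 0.5:
        return "grounding"        # chunk retrieved but ignored: fix the prompt
    if score["correctness"] < 0.78:
        return "answer-quality"   # grounded but wrong or incomplete: prompt or model
    return "format-or-policy"     # content fine: check structure, tone, policy

bucket = classify_failure(
    {"retrieved_expected": False, "groundedness": 0.2, "correctness": 0.1}
)
assert bucket == "retrieval"
```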
Production best practices for 2026
- Version everything: prompts, retriever config, embedding model, chunking strategy.
- Use stratified test sets: include easy, medium, and adversarial questions.
- Track cost and latency with quality: best model is not always best product.
- Run canary evals on fresh data weekly: static test sets hide drift.
- Add policy checks: verify redaction, citation requirements, and refusal behavior.
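"Version everything" can be enforced with a single fingerprint stored alongside each eval run; if the fingerprint changes, you know the configuration changed. The config keys and the embedding-model name below are illustrative, not a prescribed schema:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable short hash of the full eval configuration; log it with every run."""
    # sort_keys makes the hash independent of dict ordering.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

fp = config_fingerprint({
    "prompt_version": "v14",
    "retriever": {"k": 5, "embedding_model": "example-embed-3"},
    "chunking": {"size": 512, "overlap": 64},
})
```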
Common mistakes to avoid
- Judging answers without checking retrieved evidence.
- Using only synthetic tests and no human-verified gold set.
- Changing chunking strategy without re-baselining thresholds.
- Failing builds on tiny metric noise instead of meaningful deltas.
Final takeaway
If your AI feature matters, evaluation is not optional plumbing. It is your reliability system. Start with a small test set, add retrieval-aware scoring, and wire quality gates into CI. Within a few releases, your team will ship faster with fewer silent regressions, and your users will feel the difference.
Want a follow-up? I can publish a companion guide on integrating this pipeline with GitHub Actions and posting evaluation summaries directly into pull request comments.
