RAG demos are easy; production reliability is hard. In 2026, teams are shipping AI features weekly, and the bottleneck is no longer model access but confidence: can you prove your retriever finds the right context, your answers are faithful to it, and regressions are blocked before release? In this hands-on guide, you will build a practical RAG evaluation pipeline with automated test sets, LLM-as-judge scoring, trace-level diagnostics, and CI gates that fail bad builds before users see them.
Why RAG evaluation needs engineering discipline
Most RAG outages come from three sources:
- Retrieval drift: embeddings, chunking, or metadata filters silently reduce recall.
- Hallucinated synthesis: the model generates plausible but unsupported claims.
- Prompt regressions: small prompt/template changes degrade quality.
A robust pipeline should score each release across:
- Answer correctness
- Groundedness (faithfulness to sources)
- Context precision/recall
- Latency and token cost
Architecture: what we are building
We will implement a Python pipeline that runs on every pull request:
- Load evaluation dataset (questions + expected facts).
- Call your RAG endpoint and collect answer + retrieved chunks + trace IDs.
- Run an LLM judge for correctness and groundedness.
- Compute metrics and compare with baseline.
- Fail CI if thresholds are violated.
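Wired together, the five steps form a single driver. Here is a minimal sketch in which each stage is passed in as a callable, so the orchestration itself can be tested without a live endpoint or judge model (the concrete stages are built out in the steps that follow; the signatures here are an assumption about how you wire them):

```python
def evaluate_release(load_cases, call_rag, judge, summarize, gate):
    """Run the full pipeline: collect outputs, judge them, aggregate, gate.

    Each stage is injected as a callable so the wiring stays testable
    without a live RAG endpoint or judge model.
    """
    scored = []
    for case in load_cases():
        out = call_rag(case["question"])      # answer, chunks, trace_id, ...
        verdict = judge({**case, **out})      # correctness, groundedness, ...
        scored.append({**case, **out, "judge": verdict})
    report = summarize(scored)
    return report, gate(report)               # gate returns True (pass) / False (fail)
```

Attaching the judge verdict under a `"judge"` key is what lets the aggregation step later read per-case scores from one flat list.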
Project structure
rag-evals/
    eval_data.jsonl   # test cases: question + expected facts
    run_eval.py       # calls the RAG endpoint, captures traces
    judge.py          # LLM-as-judge scoring
    metrics.py        # aggregation into a report
    ci_gate.py        # threshold checks that fail the build
    baseline.json     # committed metrics from the last good release
Step 1: Evaluation dataset design
Each test case should include not just a question, but expected facts and optional must-cite constraints.
{"id": "tc_001", "question": "What is our refund window for annual plans?", "expected_facts": ["30-day refund window", "annual plans only"], "must_cite": ["policy_refund_v3"]}

Keep 50 to 200 high-signal examples per domain. Start with difficult edge cases (negations, multi-hop questions, policy versioning, acronym collisions).
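A loader that validates each row pays off quickly: a single malformed test case should fail the run loudly, not silently skew the averages. A minimal sketch, assuming the field names from the example above (`must_cite` stays optional):

```python
import json

REQUIRED_KEYS = {"id", "question", "expected_facts"}  # must_cite is optional

def load_cases(path="eval_data.jsonl"):
    """Load JSONL test cases, failing loudly on malformed rows."""
    cases = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            case = json.loads(line)
            missing = REQUIRED_KEYS - case.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            cases.append(case)
    return cases
```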
Step 2: Run your RAG system and capture traces
Your evaluation runner should capture both functional output and observability context. The trace ID lets you debug regressions quickly in your telemetry stack.
import json
import time

import requests

RAG_URL = "https://api.example.com/rag/answer"

def call_rag(question: str):
    """Call the RAG endpoint, capturing the answer plus observability context."""
    t0 = time.time()
    r = requests.post(RAG_URL, json={"question": question}, timeout=30)
    r.raise_for_status()
    data = r.json()
    return {
        "answer": data["answer"],
        "chunks": data.get("retrieved_chunks", []),
        "trace_id": data.get("trace_id"),
        "latency_ms": int((time.time() - t0) * 1000),
        "tokens": data.get("usage", {}).get("total_tokens", 0),
    }

def run_eval(path="eval_data.jsonl"):
    """Run every test case against the live system and keep inputs + outputs together."""
    results = []
    with open(path) as f:
        for line in f:
            tc = json.loads(line)
            out = call_rag(tc["question"])
            results.append({**tc, **out})
    return results

Step 3: LLM-as-judge for correctness and groundedness
Use a deterministic judge prompt and request strict JSON output. Keep judge temperature at 0 for stability.
from openai import OpenAI
import json

client = OpenAI()

JUDGE_PROMPT = """
You are a strict evaluator. Score the answer using only the provided context.
Return JSON with keys:
- correctness: 0..1
- groundedness: 0..1
- missing_facts: string[]
- unsupported_claims: string[]

Question: {question}
Expected facts: {expected_facts}
Retrieved context: {chunks}
Answer: {answer}
"""

def judge_case(case):
    prompt = JUDGE_PROMPT.format(
        question=case["question"],
        expected_facts=case["expected_facts"],
        chunks=case["chunks"],
        answer=case["answer"],
    )
    resp = client.responses.create(
        model="gpt-5-mini",
        temperature=0,
        input=prompt,
        # The Responses API nests JSON mode under `text`; `response_format`
        # is the Chat Completions parameter and is not accepted here.
        text={"format": {"type": "json_object"}},
    )
    return json.loads(resp.output_text)

If you need higher robustness, run two judges and average their scores, or add rule-based checks for forbidden claims.
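Both variants are easy to bolt on. A sketch of the ensemble average and a rule-based guard (the score fields mirror the judge schema above; the forbidden-phrase list is a hypothetical example, not part of the original pipeline):

```python
def combine_judges(verdicts):
    """Average numeric scores and union qualitative findings across judges."""
    n = len(verdicts)
    return {
        "correctness": sum(v["correctness"] for v in verdicts) / n,
        "groundedness": sum(v["groundedness"] for v in verdicts) / n,
        "missing_facts": sorted({f for v in verdicts for f in v["missing_facts"]}),
        "unsupported_claims": sorted({c for v in verdicts for c in v["unsupported_claims"]}),
    }

def forbidden_claims(answer: str, forbidden_phrases):
    """Rule-based guard: return every forbidden phrase found in the answer."""
    lower = answer.lower()
    return [p for p in forbidden_phrases if p.lower() in lower]
```

The rule-based check is deterministic and free, so it is worth running even when you trust the judge: it catches the claims your compliance team has explicitly banned.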
Step 4: Compute metrics that actually protect quality
Aggregate technical and product metrics in one report:
- Correctness@mean and Groundedness@mean
- Context precision (retrieved chunks that were truly useful)
- P95 latency
- Avg tokens / cost per answer
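Context precision is the one metric the summary function does not compute directly. A minimal sketch, assuming each scored case records which retrieved chunk IDs the answer actually cited (the `cited_ids` input is an assumption; the runner above does not produce it):

```python
def context_precision(retrieved_ids, cited_ids):
    """Fraction of retrieved chunks that were actually useful (cited in the answer)."""
    if not retrieved_ids:
        return 0.0
    useful = set(retrieved_ids) & set(cited_ids)
    return len(useful) / len(retrieved_ids)
```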
import statistics

def summarize(scored):
    """Aggregate scored cases; each case carries its judge verdict under "judge"."""
    latencies = sorted(x["latency_ms"] for x in scored)
    p95_index = max(0, int(0.95 * len(latencies)) - 1)  # guard small sample sizes
    return {
        "correctness_mean": round(statistics.mean(x["judge"]["correctness"] for x in scored), 3),
        "groundedness_mean": round(statistics.mean(x["judge"]["groundedness"] for x in scored), 3),
        "p95_latency_ms": latencies[p95_index],
        "avg_tokens": int(statistics.mean(x["tokens"] for x in scored)),
    }

Step 5: Enforce CI quality gates
This is where evaluation moves from dashboard theater to real protection. Compare against a committed baseline and fail fast on regressions.
import json
import sys

THRESHOLDS = {
    "correctness_mean": 0.82,
    "groundedness_mean": 0.88,
    "p95_latency_ms": 3500,
}

with open("report.json") as f:
    current = json.load(f)

if current["correctness_mean"] < THRESHOLDS["correctness_mean"]:
    print("FAIL: correctness dropped")
    sys.exit(1)
if current["groundedness_mean"] < THRESHOLDS["groundedness_mean"]:
    print("FAIL: groundedness dropped")
    sys.exit(1)
if current["p95_latency_ms"] > THRESHOLDS["p95_latency_ms"]:
    print("FAIL: latency too high")
    sys.exit(1)
print("PASS: quality gates satisfied")

GitHub Actions workflow example
name: rag-evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python run_eval.py
      - run: python ci_gate.py
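The gate above checks absolute floors; the committed baseline.json from the project structure also enables relative checks, so a PR that drops correctness from 0.95 to 0.85 fails even though it still clears the 0.82 floor. A minimal sketch (the 0.02 allowed drop is an illustrative choice, tune it to your run-to-run noise):

```python
ALLOWED_DROP = 0.02  # tolerate small noise between runs

def regressions(current, baseline, keys=("correctness_mean", "groundedness_mean")):
    """Return the metrics that fell more than ALLOWED_DROP below the baseline."""
    return [k for k in keys if current[k] < baseline[k] - ALLOWED_DROP]
```

In ci_gate.py you would load both report.json and baseline.json, then exit non-zero if `regressions(...)` returns anything.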
Operational tips for 2026 teams
- Version your prompts and retriever config together with code.
- Track per-segment metrics (new users vs power users, region, language).
- Store failed cases and automatically add them to the regression set.
- Run full evals nightly and fast smoke evals (a small subset) on every PR.
- Budget guardrails matter: quality without cost control will not survive production.
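The failed-case tip is easy to automate at the end of each eval run. A sketch, assuming scored cases carry the judge verdict under a `judge` key and the thresholds mirror the CI gate:

```python
import json

def harvest_failures(scored, path="eval_data.jsonl",
                     min_correctness=0.82, min_groundedness=0.88):
    """Append newly failed cases to the regression set, skipping known IDs."""
    known = set()
    try:
        with open(path) as f:
            known = {json.loads(line)["id"] for line in f if line.strip()}
    except FileNotFoundError:
        pass  # first run: regression file does not exist yet
    added = 0
    with open(path, "a") as f:
        for case in scored:
            verdict = case["judge"]
            failed = (verdict["correctness"] < min_correctness
                      or verdict["groundedness"] < min_groundedness)
            if failed and case["id"] not in known:
                row = {"id": case["id"], "question": case["question"],
                       "expected_facts": case.get("expected_facts", [])}
                f.write(json.dumps(row) + "\n")
                added += 1
    return added
```

Deduplicating by ID keeps the file from growing every time the same case fails, so nightly runs stay idempotent.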
Conclusion
In 2026, successful AI teams treat RAG quality like application reliability: measured, tested, and enforced. By combining trace-aware evaluation, LLM-as-judge scoring, and CI gates, you can ship faster while reducing hallucination risk and surprise regressions. Start with a small high-quality dataset this week, wire it to CI, and iterate. The compounding payoff is huge.
