RAG demos are easy; production reliability is hard. In 2026, teams are shipping AI features weekly, and the bottleneck is no longer model access but confidence: can you prove your retriever finds the right context, your answers are faithful to it, and regressions are blocked before release? In this hands-on guide, you will build a practical RAG evaluation pipeline with automated test sets, LLM-as-judge scoring, trace-level diagnostics, and CI gates that fail bad builds before users see them.
Why RAG evaluation needs engineering discipline
Most RAG outages come from three sources:
- Retrieval drift: embeddings, chunking, or metadata filters silently reduce recall.
- Hallucinated synthesis: the model generates plausible but unsupported claims.
- Prompt regressions: small prompt/template changes degrade quality.
A robust pipeline should score each release across:
- Answer correctness
- Groundedness (faithfulness to sources)
- Context precision/recall
- Latency and token cost
Architecture: what we are building
We will implement a Python pipeline that runs on every pull request:
- Load evaluation dataset (questions + expected facts).
- Call your RAG endpoint and collect answer + retrieved chunks + trace IDs.
- Run an LLM judge for correctness and groundedness.
- Compute metrics and compare with baseline.
- Fail CI if thresholds are violated.
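Wired together, the five steps form a single driver. Here is a minimal sketch in which each stage is passed in as a callable, so the orchestration itself can be tested without a live endpoint or judge model (the concrete stages are built out in the steps that follow; the signatures here are an assumption about how you wire them):

```python
def evaluate_release(load_cases, call_rag, judge, summarize, gate):
    """Run the full pipeline: collect outputs, judge them, aggregate, gate.

    Each stage is injected as a callable so the wiring stays testable
    without a live RAG endpoint or judge model.
    """
    scored = []
    for case in load_cases():
        out = call_rag(case["question"])      # answer, chunks, trace_id, ...
        verdict = judge({**case, **out})      # correctness, groundedness, ...
        scored.append({**case, **out, "judge": verdict})
    report = summarize(scored)
    return report, gate(report)               # gate returns True (pass) / False (fail)
```

Attaching the judge verdict under a `"judge"` key is what lets the aggregation step later read per-case scores from one flat list.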
Project structure
rag-evals/
    eval_data.jsonl   # test cases: question + expected facts
    run_eval.py       # calls the RAG endpoint, captures traces
    judge.py          # LLM-as-judge scoring
    metrics.py        # aggregation into a report
    ci_gate.py        # threshold checks that fail the build
    baseline.json     # committed metrics from the last good release
Step 1: Evaluation dataset design
Each test case should include not just a question, but expected facts and optional must-cite constraints.
{"id": "tc_001", "question": "What is our refund window for annual plans?", "expected_facts": ["30-day refund window", "annual plans only"], "must_cite": ["policy_refund_v3"]}

Keep 50 to 200 high-signal examples per domain. Start with difficult edge cases (negations, multi-hop questions, policy versioning, acronym collisions).
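A loader that validates each row pays off quickly: a single malformed test case should fail the run loudly, not silently skew the averages. A minimal sketch, assuming the field names from the example above (`must_cite` stays optional):

```python
import json

REQUIRED_KEYS = {"id", "question", "expected_facts"}  # must_cite is optional

def load_cases(path="eval_data.jsonl"):
    """Load JSONL test cases, failing loudly on malformed rows."""
    cases = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            case = json.loads(line)
            missing = REQUIRED_KEYS - case.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            cases.append(case)
    return cases
```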
Step 2: Run your RAG system and capture traces
Your evaluation runner should capture both functional output and observability context. The trace ID lets you debug regressions quickly in your telemetry stack.
import json
import time

import requests

RAG_URL = "https://api.example.com/rag/answer"

def call_rag(question: str):
    """Call the RAG endpoint, capturing the answer plus observability context."""
    t0 = time.time()
    r = requests.post(RAG_URL, json={"question": question}, timeout=30)
    r.raise_for_status()
    data = r.json()
    return {
        "answer": data["answer"],
        "chunks": data.get("retrieved_chunks", []),
        "trace_id": data.get("trace_id"),
        "latency_ms": int((time.time() - t0) * 1000),
        "tokens": data.get("usage", {}).get("total_tokens", 0),
    }

def run_eval(path="eval_data.jsonl"):
    """Run every test case against the live system and keep inputs + outputs together."""
    results = []
    with open(path) as f:
        for line in f:
            tc = json.loads(line)
            out = call_rag(tc["question"])
            results.append({**tc, **out})
    return results

Step 3: LLM-as-judge for correctness and groundedness
Use a deterministic judge prompt and request strict JSON output. Keep judge temperature at 0 for stability.
from openai import OpenAI
import json

client = OpenAI()

JUDGE_PROMPT = """
You are a strict evaluator. Score the answer using only the provided context.
Return JSON with keys:
- correctness: 0..1
- groundedness: 0..1
- missing_facts: string[]
- unsupported_claims: string[]

Question: {question}
Expected facts: {expected_facts}
Retrieved context: {chunks}
Answer: {answer}
"""

def judge_case(case):
    prompt = JUDGE_PROMPT.format(
        question=case["question"],
        expected_facts=case["expected_facts"],
        chunks=case["chunks"],
        answer=case["answer"],
    )
    resp = client.responses.create(
        model="gpt-5-mini",
        temperature=0,
        input=prompt,
        # The Responses API nests JSON mode under `text`; `response_format`
        # is the Chat Completions parameter and is not accepted here.
        text={"format": {"type": "json_object"}},
    )
    return json.loads(resp.output_text)

If you need higher robustness, run two judges and average their scores, or add rule-based checks for forbidden claims.
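Both variants are easy to bolt on. A sketch of the ensemble average and a rule-based guard (the score fields mirror the judge schema above; the forbidden-phrase list is a hypothetical example, not part of the original pipeline):

```python
def combine_judges(verdicts):
    """Average numeric scores and union qualitative findings across judges."""
    n = len(verdicts)
    return {
        "correctness": sum(v["correctness"] for v in verdicts) / n,
        "groundedness": sum(v["groundedness"] for v in verdicts) / n,
        "missing_facts": sorted({f for v in verdicts for f in v["missing_facts"]}),
        "unsupported_claims": sorted({c for v in verdicts for c in v["unsupported_claims"]}),
    }

def forbidden_claims(answer: str, forbidden_phrases):
    """Rule-based guard: return every forbidden phrase found in the answer."""
    lower = answer.lower()
    return [p for p in forbidden_phrases if p.lower() in lower]
```

The rule-based check is deterministic and free, so it is worth running even when you trust the judge: it catches the claims your compliance team has explicitly banned.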
Step 4: Compute metrics that actually protect quality
Aggregate technical and product metrics in one report:
- Correctness@mean and Groundedness@mean
- Context precision (retrieved chunks that were truly useful)
- P95 latency
- Avg tokens / cost per answer
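Context precision is the one metric the summary function does not compute directly. A minimal sketch, assuming each scored case records which retrieved chunk IDs the answer actually cited (the `cited_ids` input is an assumption; the runner above does not produce it):

```python
def context_precision(retrieved_ids, cited_ids):
    """Fraction of retrieved chunks that were actually useful (cited in the answer)."""
    if not retrieved_ids:
        return 0.0
    useful = set(retrieved_ids) & set(cited_ids)
    return len(useful) / len(retrieved_ids)
```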
import statistics

def summarize(scored):
    """Aggregate scored cases; each case carries its judge verdict under "judge"."""
    latencies = sorted(x["latency_ms"] for x in scored)
    p95_index = max(0, int(0.95 * len(latencies)) - 1)  # guard small sample sizes
    return {
        "correctness_mean": round(statistics.mean(x["judge"]["correctness"] for x in scored), 3),
        "groundedness_mean": round(statistics.mean(x["judge"]["groundedness"] for x in scored), 3),
        "p95_latency_ms": latencies[p95_index],
        "avg_tokens": int(statistics.mean(x["tokens"] for x in scored)),
    }

Step 5: Enforce CI quality gates
This is where evaluation moves from dashboard theater to real protection. Compare against a committed baseline and fail fast on regressions.
import json
import sys

THRESHOLDS = {
    "correctness_mean": 0.82,
    "groundedness_mean": 0.88,
    "p95_latency_ms": 3500,
}

with open("report.json") as f:
    current = json.load(f)

if current["correctness_mean"] < THRESHOLDS["correctness_mean"]:
    print("FAIL: correctness dropped")
    sys.exit(1)
if current["groundedness_mean"] < THRESHOLDS["groundedness_mean"]:
    print("FAIL: groundedness dropped")
    sys.exit(1)
if current["p95_latency_ms"] > THRESHOLDS["p95_latency_ms"]:
    print("FAIL: latency too high")
    sys.exit(1)
print("PASS: quality gates satisfied")

GitHub Actions workflow example
name: rag-evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python run_eval.py
      - run: python ci_gate.py
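The gate above checks absolute floors; the committed baseline.json from the project structure also enables relative checks, so a PR that drops correctness from 0.95 to 0.85 fails even though it still clears the 0.82 floor. A minimal sketch (the 0.02 allowed drop is an illustrative choice, tune it to your run-to-run noise):

```python
ALLOWED_DROP = 0.02  # tolerate small noise between runs

def regressions(current, baseline, keys=("correctness_mean", "groundedness_mean")):
    """Return the metrics that fell more than ALLOWED_DROP below the baseline."""
    return [k for k in keys if current[k] < baseline[k] - ALLOWED_DROP]
```

In ci_gate.py you would load both report.json and baseline.json, then exit non-zero if `regressions(...)` returns anything.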
Operational tips for 2026 teams
- Version your prompts and retriever config together with code.
- Track per-segment metrics (new users vs power users, region, language).
- Store failed cases and automatically add them to the regression set.
- Run full evals nightly and fast smoke evals (a small subset) on every PR.
- Budget guardrails matter: quality without cost control will not survive production.
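The failed-case tip is easy to automate at the end of each eval run. A sketch, assuming scored cases carry the judge verdict under a `judge` key and the thresholds mirror the CI gate:

```python
import json

def harvest_failures(scored, path="eval_data.jsonl",
                     min_correctness=0.82, min_groundedness=0.88):
    """Append newly failed cases to the regression set, skipping known IDs."""
    known = set()
    try:
        with open(path) as f:
            known = {json.loads(line)["id"] for line in f if line.strip()}
    except FileNotFoundError:
        pass  # first run: regression file does not exist yet
    added = 0
    with open(path, "a") as f:
        for case in scored:
            verdict = case["judge"]
            failed = (verdict["correctness"] < min_correctness
                      or verdict["groundedness"] < min_groundedness)
            if failed and case["id"] not in known:
                row = {"id": case["id"], "question": case["question"],
                       "expected_facts": case.get("expected_facts", [])}
                f.write(json.dumps(row) + "\n")
                added += 1
    return added
```

Deduplicating by ID keeps the file from growing every time the same case fails, so nightly runs stay idempotent.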
Conclusion
In 2026, successful AI teams treat RAG quality like application reliability: measured, tested, and enforced. By combining trace-aware evaluation, LLM-as-judge scoring, and CI gates, you can ship faster while reducing hallucination risk and surprise regressions. Start with a small high-quality dataset this week, wire it to CI, and iterate. The compounding payoff is huge.
