RAG Evaluation Playbook
How to build an eval harness for a RAG system that actually catches regressions: retrieval hit rate, faithfulness, citation correctness, and CI gates.
Why separate retrieval eval from generation eval
Most failures in a RAG system are retrieval failures. If you only score final answers, you conflate retrieval and generation and can't tell which layer to fix. Score retrieval in isolation first — hit rate @ k on a labeled gold set. Only then move to answer faithfulness.
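The two-layer split can be sketched as a harness that scores each layer independently. Everything here is illustrative: `retrieve`, `generate`, and `judge` are assumed callables for your own stack, and `EvalCase` is a hypothetical schema.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    gold_chunk_ids: set     # chunk ids known to support the answer
    gold_answer: str

def eval_retrieval(case, retrieve, k=5):
    """Score retrieval alone: did any gold chunk make the top-k?"""
    retrieved_ids = [c.id for c in retrieve(case.question, k=k)]
    return bool(case.gold_chunk_ids & set(retrieved_ids))

def eval_generation(case, retrieve, generate, judge):
    """Score generation given its retrieved context, so a retrieval
    miss upstream is not misread as a generation failure."""
    context = retrieve(case.question, k=5)
    answer = generate(case.question, context)
    return judge(answer, context)  # e.g. faithfulness score in [0, 1]
```

Running `eval_retrieval` over the whole gold set before ever looking at answers tells you which layer owns a regression.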
Build the gold set
Start small: 30–50 real questions with known-good answers and the specific chunks that support them. Grow the set as real traffic surfaces edge cases. Version it in the repo next to prompts.
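One JSONL record per question keeps the set diffable and reviewable in PRs. The schema below is a hypothetical example (field names, the file path, and the sample values are all assumptions, not a standard).

```python
import json

# Hypothetical gold-set record, stored as JSONL in the repo,
# e.g. evals/gold.jsonl next to the prompts.
GOLD_EXAMPLE = {
    "id": "q-0042",
    "question": "What is the uptime commitment for the enterprise tier?",
    "answer": "99.95% monthly uptime, excluding scheduled maintenance.",
    "supporting_chunk_ids": ["docs/sla.md#enterprise"],
    "source": "support-ticket",  # provenance: where the question surfaced
}

def load_gold_set(path):
    """One JSON object per line; blank lines ignored."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Recording `source` lets you check that the set keeps growing from real traffic rather than staying synthetic.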
Hit rate @ k, not accuracy
For each gold question, check whether the correct chunk(s) appear in the top-k retrieval results. Report hit rate @ 1, 5, 10. This is the number that determines whether the model has the context to answer at all.
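A minimal sketch of the metric, assuming `retrieve(question, k)` returns a ranked list of chunk ids:

```python
def hit_rate_at_k(cases, retrieve, ks=(1, 5, 10)):
    """Fraction of gold questions whose supporting chunk(s) appear
    in the top-k results, reported for each k.

    cases: iterable of (question, gold_chunk_ids) pairs.
    """
    cases = list(cases)
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for question, gold_ids in cases:
        ranked = retrieve(question, k=max_k)  # one call covers all ks
        for k in ks:
            if set(gold_ids) & set(ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / len(cases) for k in ks}
```

Retrieving once at the largest k and slicing keeps the eval cheap even as the gold set grows.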
Faithfulness with LLM-as-judge
Claim-level grading: extract claims from the answer, check each against the retrieved context. Use a capable judge model (Claude Opus or GPT-4.1). Sanity-check judge scores against human labels on 10% of items until you trust it.
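The claim-level loop can be sketched as two judge calls per answer: one to extract claims, one per claim to verify. `call_judge` is an assumed wrapper around your judge model's API, and the prompts are illustrative, not tuned.

```python
EXTRACT_PROMPT = (
    "List every factual claim made in the answer below, "
    "one claim per line.\n\nAnswer:\n{answer}"
)
VERIFY_PROMPT = (
    "Context:\n{context}\n\nClaim: {claim}\n"
    "Is this claim fully supported by the context above? "
    "Reply with exactly SUPPORTED or UNSUPPORTED."
)

def faithfulness(answer, context, call_judge):
    """Fraction of the answer's claims supported by the context."""
    raw = call_judge(EXTRACT_PROMPT.format(answer=answer))
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    if not claims:
        return 1.0  # nothing asserted, nothing to contradict
    supported = sum(
        call_judge(VERIFY_PROMPT.format(context=context, claim=c))
        .strip().startswith("SUPPORTED")
        for c in claims
    )
    return supported / len(claims)
```

Per-claim verdicts are also what you compare against human labels on the 10% calibration sample, since claim-level disagreements are easier to adjudicate than a single answer-level score.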
Gate every PR
Eval lives in CI. Changes to prompts, chunking, embeddings, rerankers, or models run the full battery. Merging is blocked on pre-agreed gates (e.g., faithfulness >= 0.9, PII leakage == 0). Regressions get caught before prod.
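The gate itself can be a small script CI runs after the eval battery; a nonzero exit blocks the merge. The metric names and thresholds below are illustrative stand-ins for whatever your team pre-agreed.

```python
# Each gate is (direction, threshold): "min" means the metric must be
# at least the threshold, "max" at most. Values are illustrative.
GATES = {
    "retrieval_hit_rate@5": ("min", 0.85),
    "faithfulness":         ("min", 0.90),
    "pii_leakage":          ("max", 0.0),
}

def check_gates(results, gates=GATES):
    """Return a list of human-readable failures; empty means merge is OK."""
    failures = []
    for metric, (direction, threshold) in gates.items():
        value = results[metric]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            failures.append(f"{metric}={value} violates {direction} {threshold}")
    return failures
```

In CI, `sys.exit(1 if check_gates(results) else 0)` is the whole enforcement mechanism; the hard part is agreeing on the thresholds before the PR, not during it.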
Production sampling
Sample 1–5% of prod traffic into a pipeline that scores retrieval and faithfulness continuously. Alert on drift. Surface low-scoring items into the gold set so the eval grows with the product.
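The sampling hook and the gold-set feedback loop can be sketched as two small functions. `enqueue_for_scoring` is an assumed handoff to your offline scoring pipeline, and the 2% rate and 0.7 threshold are placeholders.

```python
import random

SAMPLE_RATE = 0.02  # 2% of traffic; tune to scoring budget

def maybe_sample(record, enqueue_for_scoring, rate=SAMPLE_RATE, rng=random):
    """Call on every answered request. Sampled records go to the
    async scoring pipeline; returns True if this one was sampled."""
    if rng.random() < rate:
        enqueue_for_scoring(record)
        return True
    return False

def gold_set_candidates(scored_records, faithfulness_floor=0.7):
    """Low-scoring sampled items are the ones worth labeling and
    promoting into the gold set."""
    return [r for r in scored_records if r["faithfulness"] < faithfulness_floor]
```

Scoring happens off the request path, so the hook adds only a random draw and an enqueue to production latency.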
We turn these playbooks into paid engagements. Book a call and we'll scope it.
