
RAG Evaluation Playbook

How to build an eval harness for a RAG system that actually catches regressions — retrieval hit rate, faithfulness, citation correctness, and CI gates.


Why separate retrieval eval from generation eval

Most failures in a RAG system are retrieval failures. If you only score final answers, you conflate retrieval and generation and can't tell which layer to fix. Score retrieval in isolation first — hit rate @ k on a labeled gold set. Only then move to answer faithfulness.

Build the gold set

Start small: 30–50 real questions with known-good answers and the specific chunks that support them. Grow the set as real traffic surfaces edge cases. Version it in the repo next to prompts.
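A gold record needs just three fields: the question, the known-good answer, and the IDs of the supporting chunks. A minimal sketch (field names and the example record are illustrative, not a prescribed schema):

```python
# gold_set.py -- one record per labeled question.
# Field names and the sample record are illustrative.
from dataclasses import dataclass


@dataclass
class GoldItem:
    question: str
    answer: str                  # known-good answer
    gold_chunk_ids: list[str]    # chunks that support the answer


GOLD_SET = [
    GoldItem(
        question="What is the refund window?",
        answer="30 days from delivery.",
        gold_chunk_ids=["policies.md#refunds-0"],
    ),
]
```

Serialize it as JSON or YAML next to your prompts so every change shows up in code review.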

Hit rate @ k, not accuracy

For each gold question, check whether the correct chunk(s) appear in the top-k retrieval results. Report hit rate @ 1, 5, 10. This is the number that determines whether the model has the context to answer at all.
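The metric itself is a few lines. A minimal sketch, assuming retrieval results and gold labels are keyed by question ID (the toy data below is illustrative):

```python
def hit_rate_at_k(results, gold, k):
    """Fraction of questions whose top-k retrieved chunks include
    at least one gold chunk.

    results: {question_id: [chunk_id, ...]}  ranked retrieval output
    gold:    {question_id: {chunk_id, ...}}  labeled supporting chunks
    """
    hits = sum(
        1 for qid, ranked in results.items()
        if set(ranked[:k]) & gold[qid]
    )
    return hits / len(results)


# Toy example: q1 hits at k=2 (c1 is ranked second); q2 never hits.
results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c7", "c2"]}
gold = {"q1": {"c1"}, "q2": {"c8"}}
# hit_rate_at_k(results, gold, 1) -> 0.0
# hit_rate_at_k(results, gold, 2) -> 0.5
```

Report the number at several k values side by side: a big gap between hit rate @ 1 and @ 10 points at reranking, not recall, as the fix.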

Faithfulness with LLM-as-judge

Claim-level grading: extract claims from the answer, check each against the retrieved context. Use a capable judge model (Claude Opus or GPT-4.1). Sanity-check judge scores against human labels on 10% of items until you trust it.
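The harness around the judge is model-agnostic. A minimal sketch where `extract_claims` and `judge_claim` are hypothetical stand-ins for your LLM calls (the toy lambdas below exist only to make the harness runnable without a model):

```python
def faithfulness(answer, context, extract_claims, judge_claim):
    """Share of the answer's claims the judge finds supported by context.

    extract_claims(answer) -> list of claim strings   (LLM call in practice)
    judge_claim(claim, context) -> bool               (LLM call in practice)
    """
    claims = extract_claims(answer)
    if not claims:
        return 1.0  # vacuously faithful; flag empty answers upstream
    supported = sum(judge_claim(c, context) for c in claims)
    return supported / len(claims)


# Toy stand-ins: naive sentence split, substring "judge".
score = faithfulness(
    "Refunds take 30 days. Shipping is free.",
    "Refunds are processed within 30 days.",
    extract_claims=lambda a: [s for s in a.split(". ") if s],
    judge_claim=lambda claim, ctx: "30 days" in claim and "30 days" in ctx,
)
# score == 0.5: one of two claims is supported
```

Keeping the judge behind a function boundary like this makes the 10% human-label sanity check easy: run the same claims through both judges and compare.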

Gate every PR

Eval lives in CI. Changes to prompts, chunking, embeddings, rerankers, or models run the full battery. Merging is blocked on pre-agreed gates (e.g., faithfulness >= 0.9, PII leakage == 0). Regressions get caught before prod.
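The gate check reduces to a script that exits nonzero on any violation, which any CI system treats as a blocked merge. A minimal sketch; the metric names and thresholds here are examples, not recommendations:

```python
# ci_gate.py -- fail the build if any eval metric misses its gate.
import sys

# Pre-agreed gates: ("min", x) means metric >= x; ("max", x) means <= x.
GATES = {
    "faithfulness": ("min", 0.90),
    "hit_rate_at_5": ("min", 0.85),  # illustrative threshold
    "pii_leakage": ("max", 0.0),
}


def check_gates(metrics):
    """Return a list of human-readable gate violations (empty = pass)."""
    failures = []
    for name, (kind, bound) in GATES.items():
        value = metrics[name]
        ok = value >= bound if kind == "min" else value <= bound
        if not ok:
            failures.append(f"{name}={value} violates {kind} bound {bound}")
    return failures


if __name__ == "__main__":
    metrics = {"faithfulness": 0.92, "hit_rate_at_5": 0.88, "pii_leakage": 0.0}
    failures = check_gates(metrics)
    for f in failures:
        print(f, file=sys.stderr)
    sys.exit(1 if failures else 0)
```

The important property is that the thresholds live in the repo: loosening a gate is itself a reviewable diff.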

Production sampling

Sample 1–5% of prod traffic into a pipeline that scores retrieval and faithfulness continuously. Alert on drift. Surface low-scoring items into the gold set so the eval grows with the product.
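For the sampling decision itself, hashing the request ID beats a random coin flip: the same request is sampled consistently across services and retries. A minimal sketch, assuming requests carry a stable string ID:

```python
import hashlib


def sampled(request_id: str, rate: float = 0.02) -> bool:
    """True for roughly `rate` of request ids; deterministic (no RNG),
    so the decision is reproducible across services and replays."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

Downstream, route sampled requests (with their retrieved chunks) into the same scoring code the CI harness uses, so offline and online numbers stay comparable.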

Want us to run this for you?

We turn these playbooks into paid engagements. Book a call and we'll scope it.

