RAG Evaluation Playbook
How to build an eval harness for a RAG system that actually catches regressions: retrieval hit rate, faithfulness, citation correctness, and CI gates.
Why separate retrieval eval from generation eval
Most failures in a RAG system are retrieval failures. If you only score final answers, you conflate retrieval and generation and can't tell which layer to fix. Score retrieval in isolation first — hit rate @ k on a labeled gold set. Only then move to answer faithfulness.
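The two-layer split can be sketched as a harness that scores each layer independently. Everything here is illustrative: `retrieve`, `generate`, and `judge` are assumed callables for your own stack, and `EvalCase` is a hypothetical schema.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    gold_chunk_ids: set     # chunk ids known to support the answer
    gold_answer: str

def eval_retrieval(case, retrieve, k=5):
    """Score retrieval alone: did any gold chunk make the top-k?"""
    retrieved_ids = [c.id for c in retrieve(case.question, k=k)]
    return bool(case.gold_chunk_ids & set(retrieved_ids))

def eval_generation(case, retrieve, generate, judge):
    """Score generation given its retrieved context, so a retrieval
    miss upstream is not misread as a generation failure."""
    context = retrieve(case.question, k=5)
    answer = generate(case.question, context)
    return judge(answer, context)  # e.g. faithfulness score in [0, 1]
```

Running `eval_retrieval` over the whole gold set before ever looking at answers tells you which layer owns a regression.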
Build the gold set
Start small: 30–50 real questions with known-good answers and the specific chunks that support them. Grow the set as real traffic surfaces edge cases. Version it in the repo next to prompts.
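One JSONL record per question keeps the set diffable and reviewable in PRs. The schema below is a hypothetical example (field names, the file path, and the sample values are all assumptions, not a standard).

```python
import json

# Hypothetical gold-set record, stored as JSONL in the repo,
# e.g. evals/gold.jsonl next to the prompts.
GOLD_EXAMPLE = {
    "id": "q-0042",
    "question": "What is the uptime commitment for the enterprise tier?",
    "answer": "99.95% monthly uptime, excluding scheduled maintenance.",
    "supporting_chunk_ids": ["docs/sla.md#enterprise"],
    "source": "support-ticket",  # provenance: where the question surfaced
}

def load_gold_set(path):
    """One JSON object per line; blank lines ignored."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Recording `source` lets you check that the set keeps growing from real traffic rather than staying synthetic.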
Hit rate @ k, not accuracy
For each gold question, check whether the correct chunk(s) appear in the top-k retrieval results. Report hit rate @ 1, 5, 10. This is the number that determines whether the model has the context to answer at all.
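A minimal sketch of the metric, assuming `retrieve(question, k)` returns a ranked list of chunk ids:

```python
def hit_rate_at_k(cases, retrieve, ks=(1, 5, 10)):
    """Fraction of gold questions whose supporting chunk(s) appear
    in the top-k results, reported for each k.

    cases: iterable of (question, gold_chunk_ids) pairs.
    """
    cases = list(cases)
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for question, gold_ids in cases:
        ranked = retrieve(question, k=max_k)  # one call covers all ks
        for k in ks:
            if set(gold_ids) & set(ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / len(cases) for k in ks}
```

Retrieving once at the largest k and slicing keeps the eval cheap even as the gold set grows.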
Faithfulness with LLM-as-judge
Claim-level grading: extract claims from the answer, check each against the retrieved context. Use a capable judge model (Claude Opus or GPT-4.1). Sanity-check judge scores against human labels on 10% of items until you trust it.
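The claim-level loop can be sketched as two judge calls per answer: one to extract claims, one per claim to verify. `call_judge` is an assumed wrapper around your judge model's API, and the prompts are illustrative, not tuned.

```python
EXTRACT_PROMPT = (
    "List every factual claim made in the answer below, "
    "one claim per line.\n\nAnswer:\n{answer}"
)
VERIFY_PROMPT = (
    "Context:\n{context}\n\nClaim: {claim}\n"
    "Is this claim fully supported by the context above? "
    "Reply with exactly SUPPORTED or UNSUPPORTED."
)

def faithfulness(answer, context, call_judge):
    """Fraction of the answer's claims supported by the context."""
    raw = call_judge(EXTRACT_PROMPT.format(answer=answer))
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    if not claims:
        return 1.0  # nothing asserted, nothing to contradict
    supported = sum(
        call_judge(VERIFY_PROMPT.format(context=context, claim=c))
        .strip().startswith("SUPPORTED")
        for c in claims
    )
    return supported / len(claims)
```

Per-claim verdicts are also what you compare against human labels on the 10% calibration sample, since claim-level disagreements are easier to adjudicate than a single answer-level score.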
Gate every PR
Eval lives in CI. Changes to prompts, chunking, embeddings, rerankers, or models run the full battery. Merging is blocked on pre-agreed gates (e.g., faithfulness >= 0.9, PII leakage == 0). Regressions get caught before prod.
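The gate itself can be a small script CI runs after the eval battery; a nonzero exit blocks the merge. The metric names and thresholds below are illustrative stand-ins for whatever your team pre-agreed.

```python
# Each gate is (direction, threshold): "min" means the metric must be
# at least the threshold, "max" at most. Values are illustrative.
GATES = {
    "retrieval_hit_rate@5": ("min", 0.85),
    "faithfulness":         ("min", 0.90),
    "pii_leakage":          ("max", 0.0),
}

def check_gates(results, gates=GATES):
    """Return a list of human-readable failures; empty means merge is OK."""
    failures = []
    for metric, (direction, threshold) in gates.items():
        value = results[metric]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            failures.append(f"{metric}={value} violates {direction} {threshold}")
    return failures
```

In CI, `sys.exit(1 if check_gates(results) else 0)` is the whole enforcement mechanism; the hard part is agreeing on the thresholds before the PR, not during it.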
Production sampling
Sample 1–5% of prod traffic into a pipeline that scores retrieval and faithfulness continuously. Alert on drift. Surface low-scoring items into the gold set so the eval grows with the product.
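The sampling hook and the gold-set feedback loop can be sketched as two small functions. `enqueue_for_scoring` is an assumed handoff to your offline scoring pipeline, and the 2% rate and 0.7 threshold are placeholders.

```python
import random

SAMPLE_RATE = 0.02  # 2% of traffic; tune to scoring budget

def maybe_sample(record, enqueue_for_scoring, rate=SAMPLE_RATE, rng=random):
    """Call on every answered request. Sampled records go to the
    async scoring pipeline; returns True if this one was sampled."""
    if rng.random() < rate:
        enqueue_for_scoring(record)
        return True
    return False

def gold_set_candidates(scored_records, faithfulness_floor=0.7):
    """Low-scoring sampled items are the ones worth labeling and
    promoting into the gold set."""
    return [r for r in scored_records if r["faithfulness"] < faithfulness_floor]
```

Scoring happens off the request path, so the hook adds only a random draw and an enqueue to production latency.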
We turn these playbooks into paid engagements. Book a call and we'll scope it.
