Score retrieval before you score anything else
If you only grade final answers, you can't tell whether a failure is retrieval or generation. Grade retrieval in isolation first.
Most RAG failures are retrieval failures. When a RAG system gives a wrong answer, it's usually because the right chunk never made it into the context — not because the model fabricated something out of thin air. If your eval pipeline only scores end-to-end answers, you blur the two, and the team argues about prompt tweaks when the real problem is that top-5 retrieval is at 40% hit rate on your data.
Build a labeled gold set first — 30–50 real questions, each paired with the specific chunks that support the known-good answer. Then score hit rate @ k (k = 1, 5, 10) before you score anything else. That one number tells you whether the model even has the context to succeed.
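The metric itself is a few lines of code. A minimal sketch, assuming a gold set of dicts with hypothetical `question` and `gold_chunk_ids` fields and a `retrieve` function you supply — these names are illustrative, not any particular library's API:

```python
def hit_rate_at_k(gold_set, retrieve, k):
    """Fraction of questions where at least one gold chunk appears in the top-k results.

    gold_set: list of {"question": str, "gold_chunk_ids": list[str]}  (assumed schema)
    retrieve: callable(question, k) -> list of chunk ids, best-first  (assumed signature)
    """
    hits = 0
    for example in gold_set:
        retrieved_ids = retrieve(example["question"], k=k)
        # A "hit" means any supporting chunk made it into the top k.
        if set(retrieved_ids) & set(example["gold_chunk_ids"]):
            hits += 1
    return hits / len(gold_set)
```

Run it at each k you care about; the gap between hit rate @ 1 and @ 10 tells you whether re-ranking or a bigger k would help at all.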
Only after retrieval is measured do you move to faithfulness. Claim-level grading: extract claims from the answer, check each against the retrieved context, report grounded / unsupported / contradicted rates. An LLM judge is fine here — just sanity-check 20% of judge scores against human labels until you trust it.
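The aggregation step is mechanical once the judge has labeled each claim. A sketch, assuming claim extraction and per-claim verdicts come from your LLM judge upstream and arrive as plain label strings:

```python
from collections import Counter

def faithfulness_rates(verdicts):
    """Roll per-claim judge verdicts into grounded / unsupported / contradicted rates.

    verdicts: list of label strings, one per extracted claim (assumed label set).
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    return {
        label: counts.get(label, 0) / total
        for label in ("grounded", "unsupported", "contradicted")
    }
```

Report all three rates, not just grounded: a rising contradicted rate is a different (worse) failure than a rising unsupported rate, and they call for different fixes.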
Gate CI on both. Retrieval hit rate @ 5 and faithfulness both need floors; PRs that regress either get blocked. Everything downstream — prompt tuning, model routing, cost optimization — gets much easier when you know which layer you're actually tuning.
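The gate itself can be a tiny script your CI job runs after the eval suite. A sketch with illustrative floor values — the thresholds here are placeholders, not recommendations; set them from your own baseline numbers:

```python
RETRIEVAL_FLOOR = 0.80     # minimum hit rate @ 5 (illustrative value)
FAITHFULNESS_FLOOR = 0.95  # minimum grounded-claim rate (illustrative value)

def gate(hit_rate_at_5, grounded_rate):
    """Return a list of failure messages; an empty list means the PR may merge."""
    failures = []
    if hit_rate_at_5 < RETRIEVAL_FLOOR:
        failures.append(
            f"retrieval hit rate @ 5 is {hit_rate_at_5:.2f}, floor is {RETRIEVAL_FLOOR}"
        )
    if grounded_rate < FAITHFULNESS_FLOOR:
        failures.append(
            f"grounded-claim rate is {grounded_rate:.2f}, floor is {FAITHFULNESS_FLOOR}"
        )
    return failures
```

In CI, exit nonzero when the list is non-empty and print each message; a PR that regresses either metric then fails visibly with the specific floor it broke.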