Hallucination Audit Template
A structured audit you can run in a day to quantify hallucination rate on a live LLM workflow and rank the fixes that will move the number most.
Define hallucination precisely
Three disjoint failure modes: (a) fabricated fact not in context, (b) unsupported-but-plausible claim, (c) contradicted claim (context says otherwise). Score each separately.
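One way to keep the three categories disjoint in your scoring code is to make them an explicit label set. A minimal sketch (the enum and function names here are hypothetical, not from any particular eval library):

```python
from collections import Counter
from enum import Enum

class ClaimLabel(Enum):
    GROUNDED = "grounded"          # supported by the retrieved context
    FABRICATED = "fabricated"      # fact absent from context entirely
    UNSUPPORTED = "unsupported"    # plausible, but context doesn't back it
    CONTRADICTED = "contradicted"  # context explicitly says otherwise

def failure_rates(labels):
    """Per-category rate over a list of ClaimLabel values."""
    counts = Counter(labels)
    total = len(labels)
    return {lab.value: counts.get(lab, 0) / total for lab in ClaimLabel}

# Example: four claims, one fabricated and one contradicted
rates = failure_rates([ClaimLabel.GROUNDED, ClaimLabel.FABRICATED,
                       ClaimLabel.GROUNDED, ClaimLabel.CONTRADICTED])
```

Scoring each category separately matters because the fixes differ: fabrication usually points at prompting, while contradiction often points at retrieval quality.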
Build a 100-item evaluation set
Stratified across top intents, doc sources, and question difficulties. Include 10–20 adversarial items designed to tempt the model into fabrication.
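Stratified sampling here just means drawing from each bucket (intent, doc source, or difficulty) in proportion to its share of real traffic. A rough sketch, assuming you can tag each logged item with its stratum (the helper name is ours, not a library call):

```python
import random
from collections import defaultdict

def stratified_sample(items, key, n_total, seed=0):
    """Sample ~n_total items, proportionally per stratum given by key(item)."""
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    out = []
    for bucket in strata.values():
        # Proportional allocation, with at least one item per stratum
        k = max(1, round(n_total * len(bucket) / len(items)))
        out.extend(rng.sample(bucket, min(k, len(bucket))))
    return out[:n_total]
```

The 10–20 adversarial items are best written by hand on top of this sample, e.g. questions whose answer is deliberately absent from the corpus.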
Run a claim-level judge
For each answer, extract claims, mark each as grounded / unsupported / contradicted. Aggregate into rates per category. Sanity-check 20% against human labels.
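The aggregation and the 20% human sanity check are both a few lines once the judge emits per-claim labels. A sketch under the assumption that the judge returns one label string per claim (function names are illustrative):

```python
from collections import Counter

LABELS = ("grounded", "unsupported", "contradicted")

def claim_rates(judged_answers):
    """judged_answers: list of per-answer lists of claim labels."""
    counts = Counter(lab for ans in judged_answers for lab in ans)
    total = sum(counts.values())
    return {lab: counts.get(lab, 0) / total for lab in LABELS}

def judge_human_agreement(judge_labels, human_labels):
    """Fraction of claims where the LLM judge matches a human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

If agreement on the human-labeled 20% slice is low, fix the judge prompt before trusting the aggregate rates.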
Rank the fixes
Common winners: better retrieval (reranking, smaller top-k with higher-quality chunks), an explicit refusal prompt, and a citation requirement. Apply one change at a time and retest, so you know which fix moved the number.
Gate future changes
Bake the final eval into CI. New prompts, models, or retrievers must meet the agreed-on hallucination ceiling before merging.
We turn these playbooks into paid engagements. Book a call and we'll scope it.
