Hallucination Audit Template
A structured audit you can run in a day to quantify hallucination rate on a live LLM workflow and rank the fixes that will move the number most.
Define hallucination precisely
Three disjoint failure modes: (a) fabricated fact not in context, (b) unsupported-but-plausible claim, (c) contradicted claim (context says otherwise). Score each separately.
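One way to keep the three categories disjoint in your scoring code is to make them an explicit label set. A minimal sketch (the enum and function names here are hypothetical, not from any particular eval library):

```python
from collections import Counter
from enum import Enum

class ClaimLabel(Enum):
    GROUNDED = "grounded"          # supported by the retrieved context
    FABRICATED = "fabricated"      # fact absent from context entirely
    UNSUPPORTED = "unsupported"    # plausible, but context doesn't back it
    CONTRADICTED = "contradicted"  # context explicitly says otherwise

def failure_rates(labels):
    """Per-category rate over a list of ClaimLabel values."""
    counts = Counter(labels)
    total = len(labels)
    return {lab.value: counts.get(lab, 0) / total for lab in ClaimLabel}

# Example: four claims, one fabricated and one contradicted
rates = failure_rates([ClaimLabel.GROUNDED, ClaimLabel.FABRICATED,
                       ClaimLabel.GROUNDED, ClaimLabel.CONTRADICTED])
```

Scoring each category separately matters because the fixes differ: fabrication usually points at prompting, while contradiction often points at retrieval quality.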
Build a 100-item evaluation set
Stratified across top intents, doc sources, and question difficulties. Include 10–20 adversarial items designed to tempt the model into fabrication.
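Stratified sampling here just means drawing from each bucket (intent, doc source, or difficulty) in proportion to its share of real traffic. A rough sketch, assuming you can tag each logged item with its stratum (the helper name is ours, not a library call):

```python
import random
from collections import defaultdict

def stratified_sample(items, key, n_total, seed=0):
    """Sample ~n_total items, proportionally per stratum given by key(item)."""
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    out = []
    for bucket in strata.values():
        # Proportional allocation, with at least one item per stratum
        k = max(1, round(n_total * len(bucket) / len(items)))
        out.extend(rng.sample(bucket, min(k, len(bucket))))
    return out[:n_total]
```

The 10–20 adversarial items are best written by hand on top of this sample, e.g. questions whose answer is deliberately absent from the corpus.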
Run a claim-level judge
For each answer, extract claims, mark each as grounded / unsupported / contradicted. Aggregate into rates per category. Sanity-check 20% against human labels.
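The aggregation and the 20% human sanity check are both a few lines once the judge emits per-claim labels. A sketch under the assumption that the judge returns one label string per claim (function names are illustrative):

```python
from collections import Counter

LABELS = ("grounded", "unsupported", "contradicted")

def claim_rates(judged_answers):
    """judged_answers: list of per-answer lists of claim labels."""
    counts = Counter(lab for ans in judged_answers for lab in ans)
    total = sum(counts.values())
    return {lab: counts.get(lab, 0) / total for lab in LABELS}

def judge_human_agreement(judge_labels, human_labels):
    """Fraction of claims where the LLM judge matches a human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

If agreement on the human-labeled 20% slice is low, fix the judge prompt before trusting the aggregate rates.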
Rank the fixes
Common winners: better retrieval (reranking, smaller top-k with higher-quality chunks), an explicit refusal prompt, and a citation requirement. Apply one change at a time and retest, so you know which fix moved the number.
Gate future changes
Bake the final eval into CI. New prompts, models, or retrievers must meet the agreed-on hallucination ceiling before merging.
We turn these playbooks into paid engagements. Book a call and we'll scope it.
