Leasey
Template

Hallucination Audit Template

A structured audit you can run in a day to quantify the hallucination rate of a live LLM workflow and rank the fixes most likely to reduce it.

Define hallucination precisely

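Separate three disjoint failure modes: (a) fabricated fact: a specific detail (name, number, quote, citation) that appears nowhere in the context; (b) unsupported-but-plausible claim: a statement that sounds reasonable but that the context neither states nor implies; (c) contradicted claim: the context says otherwise. Score each mode separately rather than as one blended rate.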
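
One way to keep the three modes apart in tooling is a small label type plus a per-mode rate. This is a minimal sketch; the names `ClaimLabel` and `score_separately` are illustrative assumptions, not part of any library.

```python
from collections import Counter
from enum import Enum


class ClaimLabel(Enum):
    """Hypothetical label set: one grounded label plus the three failure modes."""
    GROUNDED = "grounded"          # supported by the provided context
    FABRICATED = "fabricated"      # (a) specific fact absent from the context
    UNSUPPORTED = "unsupported"    # (b) plausible, but the context does not back it
    CONTRADICTED = "contradicted"  # (c) the context says otherwise


FAILURE_MODES = (
    ClaimLabel.FABRICATED,
    ClaimLabel.UNSUPPORTED,
    ClaimLabel.CONTRADICTED,
)


def score_separately(labels):
    """Return one rate per failure mode instead of a single blended number."""
    total = len(labels)
    if total == 0:
        return {mode.value: 0.0 for mode in FAILURE_MODES}
    counts = Counter(labels)
    return {mode.value: counts[mode] / total for mode in FAILURE_MODES}
```

Because every claim receives exactly one label, the three rates never double-count a claim.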

Build a 100-item evaluation set

Stratify the set across your top intents, document sources, and question difficulty levels. Include 10–20 adversarial items designed to tempt the model into fabrication.
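
The sampling step could look like the sketch below. It assumes each candidate question is a dict with hypothetical keys `intent`, `source`, and `difficulty`, and that adversarial items are written by hand and passed in separately. It allocates items equally across strata by round-robin; weighting strata by traffic share is a reasonable alternative.

```python
import random
from collections import defaultdict


def build_eval_set(candidates, adversarial, size=100, n_adversarial=15, seed=0):
    """Draw a stratified sample and append a fixed number of adversarial items.

    candidates:  dicts with (assumed) keys "intent", "source", "difficulty".
    adversarial: hand-written items meant to tempt the model into fabrication.
    """
    rng = random.Random(seed)  # fixed seed so the set is reproducible

    # Group candidates by stratum and shuffle within each group.
    strata = defaultdict(list)
    for item in candidates:
        key = (item["intent"], item["source"], item["difficulty"])
        strata[key].append(item)
    for bucket in strata.values():
        rng.shuffle(bucket)

    # Round-robin across strata so each one is represented before any
    # stratum contributes a second item.
    n_regular = size - n_adversarial
    keys = sorted(strata)
    picked = []
    while len(picked) < n_regular and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(picked) < n_regular:
                picked.append(strata[k].pop())

    n_adv = min(n_adversarial, len(adversarial))
    return picked + rng.sample(adversarial, n_adv)
```

Freezing the seed and storing the resulting set keeps later runs comparable with the baseline.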

Run a claim-level judge

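For each answer, extract the individual claims and mark each one as grounded or as one of the three failure modes (fabricated, unsupported, contradicted). Aggregate the labels into a rate per category. Sanity-check the judge by comparing its labels with human labels on 20% of the items.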
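
The aggregation and the human sanity check can be sketched as follows. Claim extraction and the judge call itself are left out, since they depend on your model and prompt; the functions below only assume the judge has already produced one label string per claim. All function names are illustrative.

```python
from collections import Counter


def claim_rates(judged_answers):
    """judged_answers: one list of claim labels per answer.

    Returns the share of all claims that received each label.
    """
    counts = Counter(label for claims in judged_answers for label in claims)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def answers_with_failures(judged_answers, ok_label="grounded"):
    """Share of answers containing at least one claim that is not grounded."""
    if not judged_answers:
        return 0.0
    flagged = sum(
        1 for claims in judged_answers
        if any(label != ok_label for label in claims)
    )
    return flagged / len(judged_answers)


def judge_human_agreement(judge_labels, human_labels):
    """Plain percent agreement on the subset that humans also labelled."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Percent agreement is the simplest check; a chance-corrected statistic such as Cohen's kappa is stricter when one label dominates.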

Rank the fixes

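Common winners: better retrieval (reranking, plus a smaller top-k of higher-quality chunks), an explicit refusal prompt, and a citation requirement. Apply one change at a time and rerun the evaluation set after each, so every change in the rate can be attributed to a single fix.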
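
Ranking then reduces to sorting fixes by how far each one moved the rate from the baseline. A minimal sketch, assuming each fix was tested on its own against the same evaluation set; the function name and the numbers in the usage comment are placeholders, not measured results.

```python
def rank_fixes(baseline_rate, rate_after_fix):
    """Sort fixes by absolute reduction in hallucination rate, largest first.

    baseline_rate:  rate measured before any change.
    rate_after_fix: dict mapping fix name -> rate measured with only that fix.
    A negative reduction means the fix made things worse.
    """
    reductions = [
        (name, baseline_rate - rate) for name, rate in rate_after_fix.items()
    ]
    return sorted(reductions, key=lambda pair: pair[1], reverse=True)


# Usage with made-up numbers, for illustration only:
# ranked = rank_fixes(0.20, {"reranking": 0.12, "refusal prompt": 0.15})
# for name, reduction in ranked:
#     print(f"{name}: {reduction:+.3f}")
```

With a 100-item set, small differences between fixes can be noise, so treat close rankings as ties.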

Gate future changes

Bake the final eval into CI. New prompts, models, or retrievers must meet the agreed-on hallucination ceiling before merging.
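
A CI gate can be as small as a function that compares measured rates with the agreed ceilings and returns a process exit code. This is a sketch under assumptions: the category names, the ceiling values in the usage comment, and the function names are all hypothetical, and how the rates get produced depends on your eval harness.

```python
def find_violations(rates, ceilings):
    """Return (category, measured_rate, ceiling) for each breached ceiling.

    A category with no measured rate counts as a violation, so a broken
    eval run cannot pass the gate silently.
    """
    violations = []
    for category, ceiling in ceilings.items():
        rate = rates.get(category)
        if rate is None or rate > ceiling:
            violations.append((category, rate, ceiling))
    return violations


def gate_exit_code(rates, ceilings):
    """Print any violations and return 0 (pass) or 1 (fail) for the CI job."""
    violations = find_violations(rates, ceilings)
    for category, rate, ceiling in violations:
        shown = "missing" if rate is None else f"{rate:.3f}"
        print(f"FAIL {category}: {shown} (ceiling {ceiling:.3f})")
    return 1 if violations else 0


# In a CI script (ceiling values here are placeholders):
# import sys
# sys.exit(gate_exit_code(measured_rates, {"fabricated": 0.02}))
```

A rate exactly at the ceiling passes; use `>=` in the comparison if the team agreed on a strict limit.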
