Leasey
Evaluations

Prove your LLM workflow is actually working.

We audit existing AI systems against your data and your definition of good — then fix the top failures and gate future changes on evals in CI.

6 areas
from retrieval to safety
Real data
your traffic, not benchmarks
CI gates
block regressions on PR
Ranked fixes
biggest-impact first

What we measure.

Six evaluation areas, each with a concrete checklist. We run what fits your system and skip what doesn’t.

Retrieval Quality

Measure whether the right context reaches the model. We test retrieval in isolation so you know if bad answers are a retrieval problem or a generation problem.

  • Hit rate @ k on curated question sets
  • Recall vs. a labeled gold set
  • Reranker uplift analysis
  • Permission-aware filtering correctness
  • Chunking and embedding model comparison
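
Hit rate @ k, for instance, reduces to a simple ratio once a curated question set exists. A minimal TypeScript sketch, where `retrieve` stands in for your retriever (the `RetrievalCase` shape and the signature are illustrative, not a real API):

```typescript
// Illustrative sketch: hit rate @ k over a curated question set.
// Each case pairs a question with the IDs of documents known to answer it.
type RetrievalCase = { question: string; relevantIds: string[] };

function hitRateAtK(
  cases: RetrievalCase[],
  retrieve: (q: string, k: number) => string[], // your retriever, returning doc IDs
  k: number,
): number {
  // A case "hits" if any of the top-k retrieved docs is a known-relevant doc.
  const hits = cases.filter((c) =>
    retrieve(c.question, k).some((id) => c.relevantIds.includes(id)),
  ).length;
  return hits / cases.length;
}
```

Because the retriever is injected, the same function scores a reranked pipeline, a different chunking strategy, or a new embedding model, which is what makes side-by-side comparisons cheap.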

Answer Faithfulness & Hallucination

Grade whether answers are supported by retrieved context. Separate fabricated facts from unsupported-but-plausible claims.

  • Claim-level faithfulness scoring
  • Hallucination rate by topic and document source
  • Citation coverage and correctness
  • Refusal behavior on out-of-scope queries
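
Citation coverage, for example, is a plain ratio once claims are extracted. A minimal sketch; the claim splitting and the `[n]` marker convention are assumptions, and citation correctness still needs a grader on top:

```typescript
// Illustrative sketch: fraction of answer claims carrying at least one
// citation marker like [1]. Coverage only; correctness needs a judge.
function citationCoverage(claims: string[]): number {
  if (claims.length === 0) return 0;
  const cited = claims.filter((c) => /\[\d+\]/.test(c)).length;
  return cited / claims.length;
}
```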

Task Success for Agents

For agentic workflows, measure end-to-end completion, not just token-level quality.

  • End-to-end task success rate on gold tasks
  • Tool-call correctness and argument validation
  • Recovery behavior on tool failures
  • Step efficiency and cost per successful task
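
Cost per successful task, for example, divides total spend by completed tasks rather than averaging per attempt, so failed runs still count against you. A minimal sketch with an illustrative `TaskResult` shape:

```typescript
// Illustrative sketch: cost per successful task across an agent eval run.
type TaskResult = { success: boolean; costUsd: number; steps: number };

function costPerSuccess(results: TaskResult[]): number {
  const totalCost = results.reduce((sum, r) => sum + r.costUsd, 0);
  const successes = results.filter((r) => r.success).length;
  // Zero successes means every dollar was wasted; report Infinity.
  return successes === 0 ? Infinity : totalCost / successes;
}
```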

Cost, Latency, and Reliability

Production-readiness metrics that decide whether a system survives real traffic.

  • p50 / p95 latency per route and model
  • Token and dollar cost per request, per user, per tenant
  • Error rate and retry behavior
  • Model routing and fallback correctness
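
As an illustration, p50/p95 are just order statistics over observed latencies. This sketch uses the nearest-rank method; real harnesses may interpolate:

```typescript
// Illustrative sketch: nearest-rank percentile over per-request latencies (ms).
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank position
  return sorted[Math.max(0, rank - 1)];
}
```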

Safety, PII, and Policy

Detect what your system should never do, and prove it doesn’t.

  • PII leakage on input and output
  • Prompt-injection resistance on untrusted content
  • Jailbreak and policy-violation rate
  • Sensitive topic handling per vertical (health, legal, finance)
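
A first-pass PII screen can be as crude as pattern matching; the sketch below is illustrative only, and production checks layer NER models and allowlists on top of regexes like these:

```typescript
// Illustrative sketch: a crude regex screen for common PII patterns.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  phone: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/,
};

// Returns the names of PII categories detected in the text.
function piiLeaks(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}
```

Run the same screen on model inputs and outputs separately: leakage on input means your redaction is failing before the model ever sees the data.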

Regression & Drift

AI systems change when you change prompts, models, or data. Catch regressions before users do.

  • Prompt version A/B eval on frozen test sets
  • Model upgrade regression sweeps
  • Data drift alarms on production traffic
  • CI eval gates on pull requests

Eval as code

Versioned, diffable, gated on CI.

Your eval battery lives in your repo next to your prompts. Every PR runs it. Regressions get blocked, not discovered in prod.

eval.config.ts
// eval.config.ts — version your eval battery in code
export const config = {
  datasets: ["gold_v3", "adversarial_v2", "prod_sample_7d"],
  evaluators: [
    "retrieval_hit_rate",
    "answer_faithfulness",
    "pii_leakage",
    "jailbreak_resistance",
    "cost_per_request",
    "p95_latency_ms",
  ],
  gates: { faithfulness: ">=0.9", pii_leakage: "==0", p95_latency_ms: "<=1800" },
}
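
A gate like ">=0.9" is just a comparison against a measured score. A minimal sketch of how a CI step could parse and enforce such gates; the parsing logic and names here are illustrative, not the actual harness:

```typescript
// Illustrative sketch: enforce gates like ">=0.9" against measured scores.
type Gates = Record<string, string>;

function checkGates(gates: Gates, scores: Record<string, number>): string[] {
  const failures: string[] = [];
  for (const [metric, gate] of Object.entries(gates)) {
    const m = gate.match(/^(>=|<=|==)\s*([\d.]+)$/);
    if (!m) throw new Error(`unparseable gate: ${gate}`);
    const op = m[1];
    const threshold = Number(m[2]);
    const score = scores[metric] ?? NaN; // a missing metric fails its gate
    const pass =
      op === ">=" ? score >= threshold :
      op === "<=" ? score <= threshold :
      score === threshold;
    if (!pass) failures.push(`${metric}: ${score} fails gate ${gate}`);
  }
  return failures; // non-empty means block the PR
}
```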

How the engagement works.

Typically 2–6 weeks end to end, depending on surface area and access.

01

Discovery

We map your current LLM workflow, success criteria, and risk surface: a short, focused session to agree on what “good” means for this system.

02

Dataset & Harness

We build a versioned eval dataset from real traffic, gold examples, and adversarial cases, plus the harness to run it.

03

Baseline & Report

We run the full battery against your current system and deliver a report with numbers, failure modes, and ranked recommendations.

04

Fix & Re-run

We implement the top fixes — retrieval, prompts, routing, guardrails — and re-run evals so every change is proven, not guessed.

05

CI Integration

We wire evals into your CI/CD so future changes are gated on quality, not just tests.

Ready when you are

Let’s ship your AI system.

Whether you’re scoping a new LLM product, hardening an existing one, or standing up the infra behind it — we’ll map the shortest path to production.

Email the team
Other ways to reach us