Alex Chernysh · Agentic behaviorist · Tel Aviv


How to Run LLM Evals in Production

How to run LLM evals in production with gold sets, graders, trace checks, online signals, and release gates.

February 3, 2026 · 6 min read
Evals · Reliability

The useful question stopped being whether the model seems good in a demo. The useful question is whether the system survives a prompt change, a model swap, and a bad Tuesday without quietly getting worse.

Production default

Treat evals as part of the operating system. If a change matters enough to ship, it matters enough to measure.

Minimum viable eval harness

  • a small trusted dataset for hard gates
  • a wider regression set for drift monitoring
  • task-specific graders instead of one vague quality score
  • latency, cost, abstention, and escalation signals next to answer quality
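The four pieces above can be sketched as a small data structure. This is an illustrative shape, not a prescribed schema; the class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str
    expected: str
    failure_class: str  # e.g. "citation", "format", "abstention"

@dataclass
class EvalHarness:
    # small, hand-audited set used for hard release gates
    trusted: list[EvalCase] = field(default_factory=list)
    # wider, noisier set used for drift monitoring
    monitoring: list[EvalCase] = field(default_factory=list)
    # one grader per failure class instead of one vague quality score
    graders: dict[str, object] = field(default_factory=dict)

harness = EvalHarness()
harness.trusted.append(EvalCase("What is 2+2?", "4", "correctness"))
```

The point of the split fields is that trusted and monitoring cases never get mixed into one bucket by accident.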
Eval loop
The healthy pattern is simple: define success, measure it, ship carefully, then feed production failures back into the suite.

Evals are not a benchmark hobby

A lot of teams still talk about evals as if they were an optional research habit. That was tolerable when the system was one prompt and one model.

It stops being tolerable once the product has retrieval, tools, policies, response formats, multiple failure cases, and people who will notice when it breaks.

OpenAI's current evaluation guidance is explicit. Evals are how you make sense of non-deterministic systems in production. Anthropic makes the same point from a different angle: once the system acts over several turns, you need a deliberate way to define success, otherwise every regression turns into anecdote.

That is the real role of evals. They turn "it feels worse" into something a team can argue about productively.

Start with one small trusted set

The most useful eval dataset is smaller than people expect and more carefully reviewed than they want.

I split evaluation data into two tiers.

Trusted set

This is the set you are willing to use for release decisions.

It should be:

  • hand-audited
  • representative of real work
  • small enough to maintain
  • clear enough that graders and humans usually agree

Monitoring set

This set is wider, noisier, and closer to production traffic.

It is useful for:

  • drift detection
  • finding new failure classes
  • catching prompt or routing side effects
  • estimating what changed in the wild
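Drift detection on the monitoring set can be as simple as comparing pass rates across windows. A minimal sketch, assuming boolean pass/fail results per case; the 5-point tolerance is a placeholder, not a recommendation.

```python
def pass_rate(results: list[bool]) -> float:
    # fraction of monitoring cases that passed in a window
    return sum(results) / len(results) if results else 0.0

def drifted(baseline: list[bool], current: list[bool],
            tolerance: float = 0.05) -> bool:
    # flag when the current window's pass rate falls more than
    # `tolerance` below the baseline window's pass rate
    return (pass_rate(baseline) - pass_rate(current)) > tolerance

baseline_window = [True] * 90 + [False] * 10  # 90% pass
current_window = [True] * 80 + [False] * 20   # 80% pass
```

Noisy as this is, it is exactly the kind of signal the monitoring set is for: it points at a new failure class before anyone proposes gating on it.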

But it is not automatically clean enough for hard release gates.

The distinction matters because teams pour everything into one giant eval bucket and then wonder why half the suite feels unreliable. The dataset is not one thing. It has jobs.

Grade the thing that actually failed

A single score is tidy. Often useless.

The better pattern is task-specific grading.

For a RAG or legal system, for example, I want different signals for:

  • answer correctness
  • support quality
  • citation or provenance integrity
  • abstention behavior
  • format compliance
  • latency

For an agent, OpenAI's trace grading and Anthropic's recent eval guidance point in the same direction: do not grade only the final answer if the behavior inside the trace matters. Grade the trace when the trace explains success or failure.

That means asking questions like:

  • did the agent choose the right tool?
  • did it call too many tools?
  • did it cross an approval boundary incorrectly?
  • did it retrieve the right evidence and still misuse it?

If the system failed in the middle of the loop, a final-answer score alone will hide the reason.
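Some of those questions reduce to mechanical trace checks. A hedged sketch, assuming a trace is an ordered list of step dicts; the step shape and the `gated_tools` set are assumptions about your logging, not a standard.

```python
trace = [
    {"type": "tool_call", "tool": "search", "approved": True},
    {"type": "tool_call", "tool": "search", "approved": True},
    {"type": "tool_call", "tool": "write_file", "approved": False},
    {"type": "final_answer"},
]

def tool_call_count(trace: list[dict]) -> int:
    # "did it call too many tools?" starts with counting them
    return sum(1 for step in trace if step["type"] == "tool_call")

def crossed_approval_boundary(trace: list[dict],
                              gated_tools: frozenset = frozenset({"write_file"})) -> bool:
    # "did it cross an approval boundary?" as a deterministic scan
    return any(
        step["type"] == "tool_call"
        and step["tool"] in gated_tools
        and not step["approved"]
        for step in trace
    )
```

Tool choice and evidence misuse usually need a model judge or a human; call counts and approval boundaries do not.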

LLM judges have a narrow remit

Judge models are useful when the task is open-ended and the grading criteria are legible enough to write down.

They are weaker when the rubric is underspecified, when the answer format should have been deterministic in the first place, or when the task is sensitive to one factual slip hidden inside otherwise good prose.

Practical rule. Code-based checks where you can. Model-based grading where you must. Calibrate the second against the first and against periodic human review.
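Calibrating the second against the first can start as a simple agreement rate over a shared sample. A sketch, assuming boolean labels from the judge and from the reference (code checks or human review); the 0.9 floor is an assumption to tune per task.

```python
def agreement_rate(judge_labels: list[bool],
                   reference_labels: list[bool]) -> float:
    # fraction of cases where the model judge matches the trusted reference
    assert len(judge_labels) == len(reference_labels)
    matches = sum(j == r for j, r in zip(judge_labels, reference_labels))
    return matches / len(reference_labels)

judge = [True, True, False, True, False]
human = [True, True, True, True, False]

rate = agreement_rate(judge, human)
trust_judge = rate >= 0.9  # illustrative floor; below it, the judge's scores are noise
```

A judge that disagrees with the reference too often should not grade anything that gates a release.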

Anthropic's eval writing makes this point clearly. The value of automated evals compounds, but only if the task definition is concrete. Otherwise you build automation around ambiguity and call it rigor.

Production signals belong in the eval conversation

A lot of teams measure answer quality and then bolt operations on later. That split looks clean on a slide. In a real system it is fake.

A model change that preserves headline answer quality but doubles latency, suppresses abstention, or increases tool churn has still changed the product.

The production eval pack should therefore track more than semantic quality:

  • time to first token
  • end-to-end latency
  • token cost
  • refusal rate
  • escalation rate
  • tool-call count
  • trace length
  • retry rate

You do not need to worship every metric. You do need to know what changed.
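Knowing what changed can be one diff over a metrics dict. A minimal sketch with illustrative metric names and an assumed 10% relative tolerance.

```python
baseline = {"accuracy": 0.91, "p50_latency_s": 1.2, "tool_calls_per_task": 3.0}
candidate = {"accuracy": 0.91, "p50_latency_s": 2.6, "tool_calls_per_task": 3.1}

def what_changed(baseline: dict, candidate: dict,
                 rel_tolerance: float = 0.10) -> dict:
    # report every metric that moved more than rel_tolerance from baseline,
    # as (before, after) pairs
    return {
        k: (baseline[k], candidate[k])
        for k in baseline
        if abs(candidate[k] - baseline[k]) > rel_tolerance * abs(baseline[k])
    }

changes = what_changed(baseline, candidate)
```

Here the headline accuracy is flat, but the latency shift surfaces anyway, which is exactly the model-swap failure mode described above.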

Release gates should be boring

The best release gate is legible, not impressive.

A practical gate looks like this:

  1. hard-fail if trusted-set accuracy drops below threshold
  2. hard-fail if format compliance breaks
  3. hard-fail if high-risk refusal cases regress
  4. warn if latency or cost crosses budget
  5. inspect trace deltas when agent behavior shifts materially

Enough to keep a team honest without building a religion around dashboards. The mistake is treating every model or prompt change as a fresh act of intuition. Evals are there so you can stop arguing from memory.
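The gate above fits in one boring function. A sketch with placeholder thresholds and metric names; the point is the legible hard-fail/warn split, not the specific numbers.

```python
def release_gate(metrics: dict) -> tuple[bool, list[str]]:
    warnings: list[str] = []
    # hard failures block the release outright
    if metrics["trusted_accuracy"] < 0.90:
        return False, ["trusted-set accuracy below threshold"]
    if metrics["format_compliance"] < 1.0:
        return False, ["format compliance broke"]
    if metrics["high_risk_refusal_regressions"] > 0:
        return False, ["high-risk refusal cases regressed"]
    # soft budget breaches warn but do not block
    if metrics["p50_latency_s"] > 2.0:
        warnings.append("latency over budget")
    if metrics["cost_per_task_usd"] > 0.05:
        warnings.append("cost over budget")
    return True, warnings

ok, notes = release_gate({
    "trusted_accuracy": 0.94,
    "format_compliance": 1.0,
    "high_risk_refusal_regressions": 0,
    "p50_latency_s": 2.4,
    "cost_per_task_usd": 0.03,
})
```

Trace-delta inspection stays a human step in this sketch; it resists a threshold.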

Production incidents feed the suite quickly

The fastest way to build a stale eval program is to treat the dataset as a museum.

A healthier loop: a user report or monitoring finds a failure; the failure becomes an eval case; the fixed system must pass it forever after.

This is the compounding part. A good eval harness gets more valuable with every embarrassing bug it absorbs.
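That loop can be made concrete as an append-only regression suite. A sketch; the record shape and the incident id are illustrative, but the idea of pinning the bad output alongside the expected one is the point.

```python
regression_suite: list[dict] = []

def absorb_incident(prompt: str, bad_output: str,
                    expected: str, incident_id: str) -> None:
    # turn a production failure into a permanent eval case
    regression_suite.append({
        "prompt": prompt,
        "must_not_equal": bad_output,  # the embarrassing output, pinned forever
        "expected": expected,
        "source": incident_id,         # traceability back to the incident
    })

absorb_incident(
    prompt="Summarize the refund policy",
    bad_output="We offer no refunds.",  # hypothetical hallucination seen in prod
    expected="Refunds within 30 days.",
    incident_id="INC-1042",
)
```

Cases only ever get added, so the suite's value compounds with every bug it absorbs.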

The eval program needs an owner

A shared dashboard is not ownership. Someone has to decide:

  • which evals are trusted enough to gate releases
  • which failures are signal versus noise
  • when a new task belongs in the suite
  • who signs off on rubric changes

Anthropic's recent write-up mentions dedicated eval ownership plus domain experts contributing tasks. That matches what I see in practice. Central infrastructure matters, but the people closest to the product define what success means.

A useful first version is small

If I had to stand up evals for a production system tomorrow morning:

  1. collect 30-50 trusted examples
  2. separate them by failure class
  3. define a code-based or model-based grader for each class
  4. track latency and refusal next to quality
  5. block releases on the small set before trying to boil the ocean

The first eval system does not need to feel impressive. It needs to catch the failures you are actually shipping. That is enough.
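The five steps above fit in a page of code. A deliberately tiny sketch: a trusted set split by failure class, one grader per class, and a verdict that can block a release; `fake_system` and all names are stand-ins for your actual pipeline.

```python
from collections import defaultdict

# step 1-2: a small trusted set, each case tagged with its failure class
trusted = [
    {"prompt": "2+2?", "expected": "4", "failure_class": "arithmetic"},
    {"prompt": "Capital of France?", "expected": "Paris", "failure_class": "factual"},
]

# step 3: one grader per failure class; the default is exact match,
# and a class can override it with a stricter or model-based check
graders = defaultdict(lambda: (lambda out, case: out == case["expected"]))

def fake_system(prompt: str) -> str:
    # stand-in for the real model call
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]

# step 5: run the small set and block the release if anything fails
results = [
    graders[case["failure_class"]](fake_system(case["prompt"]), case)
    for case in trusted
]
block_release = not all(results)
```

Step 4 (latency and refusal tracking) bolts onto the same loop by timing each call and counting abstentions; it is omitted here to keep the sketch honest about its size.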

Related reading

  • Building agentic AI systems that hold up
  • Prompt engineering: from phrasing to policy

Further reading

  • OpenAI: Evaluation best practices
  • OpenAI: Agent evals
  • OpenAI: Trace grading
  • Anthropic: Demystifying evals for AI agents


