Alex Chernysh · Agentic behaviorist · Tel Aviv

Building Agentic AI Systems That Hold Up

Practical guidance on tool contracts, context engineering, evals, approvals, and telemetry.

March 2, 2026 · 5 min read
Agents

Tool-calling stopped being interesting around mid-2024. The harder question now is whether the agent does it predictably, leaves evidence, and stops when it should.

North star

Start with the smallest loop that solves the real task. Anthropic's production guidance still draws the line I trust. Use workflows when the sequence is known. Reach for more autonomous agents only when runtime adaptation is actually necessary.

Default production stance

Build the narrowest loop that preserves evidence, approvals, and eval coverage. More autonomy is useful only after the narrow version has become the bottleneck.

What I keep coming back to

  • workflows before autonomy
  • tools as contracts, not personality traits
  • human approval on the expensive edge
  • engineered context, not sentimental memory
  • evaluate traces and outcomes, not the final prose
Small-loop default: the useful pattern is usually narrower than the first architecture draft.

Workflows first

The fastest way to ship a fragile system is to start with a roaming planner because it feels advanced. Teams know this and do it anyway.

A workflow with a known sequence buys you cheaper debugging, more legible cost and latency, and a smaller blast radius when the model misreads the room. Reach for the more autonomous agent only after the workflow version has actually become too rigid. Before that, extra autonomy is just more surface area.
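What "a known sequence" buys you is easiest to see in code. A minimal sketch, with illustrative step names rather than any real framework: the routing is deterministic and the sequence lives in plain code, so cost, latency, and failure points stay legible.

```python
# Hypothetical workflow-first sketch. Every name here is illustrative.

def classify(ticket: str) -> str:
    # Deterministic routing: no planner improvisation on a solved problem.
    return "refund" if "refund" in ticket.lower() else "general"

def validate_order(text: str) -> str:
    return text + " | order validated"

def check_policy(text: str) -> str:
    return text + " | policy ok"

def draft_reply(text: str) -> str:
    return text + " | reply drafted"

# The sequence is data, inspectable at a glance.
STEPS = {
    "refund": [validate_order, check_policy, draft_reply],
    "general": [draft_reply],
}

def run_workflow(ticket: str) -> str:
    result = ticket
    for step in STEPS[classify(ticket)]:
        result = step(result)  # known sequence, known blast radius
    return result
```

When this version becomes too rigid, you have earned the right to replace one step with a more autonomous loop, not the whole pipeline.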

Tools are contracts

Web search, file search, connectors, shell, computer use. They are execution surfaces. The quality bar is the same as any other integration.

A tool contract worth trusting has a narrow input schema, obvious failure modes, explicit permissions, and deterministic post-processing. The contract is what closes off ways for the model to be clever at the wrong moment. A model being able to invoke a tool is not the same thing as the tool being safe to invoke.
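One way to make that concrete, as a sketch rather than any particular framework's API: validate arguments against a narrow schema, check permissions explicitly, and post-process deterministically. All names here are illustrative.

```python
# Hypothetical tool contract: the model can request anything; the contract
# decides what actually runs.

from dataclasses import dataclass

ALLOWED_BUCKETS = {"docs", "tickets"}  # explicit permissions, not vibes

@dataclass(frozen=True)
class SearchArgs:
    query: str
    bucket: str
    limit: int = 5

    def validate(self) -> None:
        # Narrow input schema with obvious failure modes.
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if self.bucket not in ALLOWED_BUCKETS:
            raise PermissionError(f"bucket {self.bucket!r} not allowed")
        if not (1 <= self.limit <= 20):
            raise ValueError("limit must be between 1 and 20")

def fake_backend(query: str, bucket: str) -> list[str]:
    # Stand-in for a real search index.
    return [f"{bucket}/{query}-{i}" for i in (2, 0, 1)]

def search_tool(args: SearchArgs) -> list[str]:
    args.validate()
    raw = fake_backend(args.query, args.bucket)
    return sorted(raw)[: args.limit]  # deterministic post-processing
```

The interesting property is that rejections happen before execution, with a named exception the caller can log, rather than as a confused model downstream.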

Retrieval is control, not flavor

A surprising number of agent failures are retrieval failures in reasoning costume.

Grounded systems behave better when retrieval is a control layer. Retrieve only what the current step needs. Rerank or filter before generation. Preserve document identity through the whole run. Let the model abstain when support is thin. Inside an agentic loop one weak retrieval step poisons everything downstream.

Approvals on the expensive edge

Approval gates do not belong everywhere. They belong where the system crosses a boundary a human will care about later.

The honest list:

  • sending, deleting, or publishing
  • changing financial or legal state
  • mutating code or infrastructure with real side effects
  • answering with confidence in a high-stakes domain

Everything else should be automated, logged, and reversible.
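The gate itself can be small. A sketch, assuming actions carry a category tag at the tool layer and that the approval callback is a stand-in for a real human-in-the-loop channel:

```python
# Hypothetical approval gate; categories mirror the list above.

REQUIRES_APPROVAL = {"send", "delete", "publish", "financial", "infra_mutation"}

def execute(action: dict, do, request_approval) -> str:
    """Run an action, pausing for a human only at the expensive edge."""
    if action["category"] in REQUIRES_APPROVAL:
        if not request_approval(action):
            return "blocked"  # human said no; nothing ran
    do(action)                # automated path: logged and reversible
    return "done"
```

The useful property is the asymmetry: the cheap path never waits on a human, and the expensive path never runs without one.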

Context engineering, not sentimental memory

Teams often say they need memory when what they really need is a stable thread of state and evidence. That thread comes from current run state, durable preferences worth reusing, and retrievable artifacts (receipts, summaries, prior outputs).

The job is to decide what context belongs in the loop, what should be retrieved on demand, what should expire, and what stays inspectable. Accumulating more text is not the job.

A growing blob of previous conversation that nobody can audit becomes mythology within a quarter. If the memory cannot be inspected, expired, or replayed from source artifacts, it is already mythology.
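The inspectable alternative is not complicated. A sketch, with illustrative field names: every entry carries the source artifact it can be replayed from and an expiry, and loading is explicit rather than ambient.

```python
# Hypothetical context store: state you can inspect, expire, and replay.

from dataclasses import dataclass

@dataclass
class ContextEntry:
    key: str
    value: str
    source: str        # receipt, summary, or prior output it came from
    expires_at: float  # nothing lives in the loop forever

class ContextStore:
    def __init__(self) -> None:
        self._entries: dict[str, ContextEntry] = {}

    def put(self, entry: ContextEntry) -> None:
        self._entries[entry.key] = entry

    def load(self, keys: list[str], now: float) -> dict[str, str]:
        # Explicit loading: the loop sees only what it asked for,
        # and only while the entry is still valid.
        return {k: e.value for k, e in self._entries.items()
                if k in keys and e.expires_at > now}
```

Because every value names its source artifact, "why does the agent believe this?" has an answer a quarter from now.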

Evals are the operating system

When the system takes several steps, touches tools, or branches under uncertainty, evals are the thing that lets you sleep.

The eval packs I trust combine task success checks, tool-call correctness, source grounding, refusal and escalation, latency and cost budgets, and trace-level grading when internal behavior matters. An agent without evals is a workflow you have chosen not to measure.
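One shape such a pack can take, sketched over an assumed trace format: each check is a plain function over the recorded trace, so task success, tool correctness, and budgets run in one pass and fail independently.

```python
# Hypothetical eval pack; the trace fields are an assumption about what a
# run recorder might emit, not any real framework's schema.

def check_task_success(trace: dict) -> bool:
    return trace["answer"] is not None and not trace.get("error")

def check_tool_calls(trace: dict) -> bool:
    # Tool-call correctness at its bluntest: only allowed tools were used.
    return all(c["tool"] in trace["allowed_tools"] for c in trace["tool_calls"])

def check_budget(trace: dict, max_cost: float = 0.05) -> bool:
    return trace["cost_usd"] <= max_cost

def run_eval_pack(trace: dict) -> dict[str, bool]:
    return {
        "task_success": check_task_success(trace),
        "tool_correctness": check_tool_calls(trace),
        "cost_budget": check_budget(trace),
    }
```

Grounding, refusal, and trace-level grading slot in the same way: one named check per property, graded per trace, not per final paragraph.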

Telemetry should explain decisions

Most teams log failures. Fewer log reasoning boundaries, tool choices, retrieval snapshots, approval branches, or policy triggers. That missing context is what makes agent incidents expensive to understand a week later.

At minimum, log:

  • tool selected
  • arguments used
  • documents retrieved
  • policy or guardrail events
  • human approval requests
  • final answer shape and confidence posture

The trace I want lets another engineer answer one plain question:

Why did this system believe it was allowed to do that?
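Mechanically, this is cheap: one structured event per decision boundary, not one log line per failure. A sketch, where the field names are an assumption about what a trace store might want:

```python
# Hypothetical decision-level telemetry emitter.

import json
import time

def decision_event(kind: str, **fields) -> str:
    """Serialize one decision boundary as a structured, greppable event."""
    event = {"ts": time.time(), "kind": kind, **fields}
    return json.dumps(event, sort_keys=True)

# One event per boundary the list above names, e.g.:
#   decision_event("tool_selected", tool="web_search", args={"q": "..."})
#   decision_event("approval_requested", action="send_email", approved=False)
#   decision_event("retrieval_snapshot", doc_ids=["kb/412", "kb/77"])
```

Sorted keys and flat JSON are deliberate: a week later, the incident review is a grep, not an archaeology dig.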

The pattern that holds up

Modest, in roughly this order: classify the task, load only relevant context, choose from a constrained set of tools, execute with receipts, run checks, then answer, abstain, or escalate.

That is most of why it works.
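The whole pattern fits in a page when every stage is a plain function. A sketch under the usual caveat that all names are illustrative:

```python
# Hypothetical skeleton of the loop described above.

def classify(task: str) -> str:
    return "lookup" if task.endswith("?") else "action"

def lookup_tool(task, context):
    # Returns (result, receipt); the receipt records what actually happened.
    return f"answer: {task}", {"tool": "lookup", "context": context}

def run(task, load_context, tools, checks) -> dict:
    kind = classify(task)                  # classify the task
    context = load_context(kind)           # load only relevant context
    tool = tools[kind]                     # choose from a constrained set
    result, receipt = tool(task, context)  # execute with receipts
    if result is None:
        return {"status": "abstain", "receipt": receipt}
    if all(check(result) for check in checks):
        return {"status": "answer", "result": result, "receipt": receipt}
    return {"status": "escalate", "receipt": receipt}
```

Every branch returns the receipt, including the ones that refuse to answer; that is what keeps the escalation path debuggable.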

Tomorrow-morning pass

  • narrow the tool contracts before touching the planner
  • add approval boundaries where cost or risk becomes durable
  • log receipts for every external action
  • make context loading explicit, not ambient
  • build a small eval pack around real top failures
Related reading

  • How to run LLM evals in production
  • LLM product safety without theater

Further reading

  • OpenAI: Agents guide
  • OpenAI: Agent evals
  • OpenAI: Trace grading
  • Anthropic: Building effective agents
  • Anthropic: Effective context engineering for AI agents
