Alex Chernysh
AI systems / retrieval / evals / architecture

Building Agentic AI Systems That Hold Up

Practical guidance on tool contracts, context engineering, evals, approvals, and telemetry.

March 2, 2026 · 6 min read · By Alex Chernysh
Agents

Agent systems are no longer impressive because they can call tools. The quieter question is whether they can do that predictably, leave evidence behind, and stop when they should.

North star

Start with the smallest loop that solves the real task. Anthropic's production guidance still draws the healthiest line in the field: use workflows when the sequence is known, and reach for more autonomous agents only when runtime adaptation is actually necessary.

Default production stance

Build the narrowest loop that preserves evidence, approvals, and eval coverage. Extra autonomy is only useful after the narrower version has clearly become the bottleneck.

What tends to hold up

  • start with workflows before autonomy
  • treat tools as contracts, not as magical powers
  • keep human approval at the expensive edge
  • engineer context deliberately instead of accumulating sentimental memory
  • evaluate traces and outcomes, not just the final prose
The useful pattern is usually narrower than the first architecture draft.

1. Prefer workflows until you have earned agents

The fastest way to build a fragile system is to begin with a roaming planner because it feels advanced.

That sounds obvious. Teams skip it anyway.

If the task has a mostly known sequence, a workflow gives you three things almost for free:

  • clearer debugging
  • more legible cost and latency
  • smaller scope of failure when the model misreads the room

Use a more autonomous agent only after the workflow version has become too rigid for the real task. Before that, extra autonomy is usually just more surface area.

2. Treat tool use as an API contract, not a personality trait

OpenAI's current Agents documentation is useful precisely because it keeps tools grounded in reality. Web search, file search, connectors, shell, or computer use are all execution surfaces. The quality bar is therefore the same as any other integration.

A good tool contract has four properties:

  1. narrow input schema
  2. obvious failure modes
  3. explicit permissions
  4. deterministic post-processing

A tool is not valuable because a model can invoke it. It is valuable because the contract leaves fewer ways for the model to be clever at the wrong moment.
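The four contract properties can be sketched in a few lines. This is a minimal illustration, not any real SDK's API: the names `SearchArgs`, `run_search`, and the `backend` callable are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchArgs:
    """Narrow input schema: only the fields the tool actually needs."""
    query: str
    max_results: int = 5

    def validate(self) -> None:
        # Obvious failure modes: fail loudly, before anything executes.
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if not 1 <= self.max_results <= 20:
            raise ValueError("max_results out of range")

# Explicit permissions, enforced outside the model.
ALLOWED_TOOLS = {"search"}

def run_search(args: SearchArgs, backend) -> list[str]:
    if "search" not in ALLOWED_TOOLS:
        raise PermissionError("search is not permitted in this context")
    args.validate()
    raw = backend(args.query)[: args.max_results]
    # Deterministic post-processing: the same raw output always yields
    # the same normalized result, independent of model behavior.
    return sorted({r.strip().lower() for r in raw})
```

The point of the sketch is where the lines sit: validation, permissions, and normalization all live outside the model's reach, so the contract holds even when the model does not.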

3. Retrieval is part of control, not just context

A surprising number of agent failures are retrieval failures wearing a reasoning costume.

In practice, grounded systems behave better when retrieval is treated as a control layer:

  • retrieve only what the current step needs
  • rerank or filter before generation
  • preserve document identity through the whole run
  • allow the model to abstain when support is thin

This matters even more in agentic loops, because one weak retrieval step can poison everything that follows.
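The four bullets above can be expressed as one small gate in front of generation. A sketch under stated assumptions: the `retrieve` callable, the score scale, and the threshold values are illustrative, not a specific vector-store API.

```python
ABSTAIN = "NOT_ENOUGH_SUPPORT"

def controlled_retrieve(query, retrieve, min_score=0.6, top_k=3):
    """Retrieve only what the step needs, filter before generation,
    preserve document identity, and abstain when support is thin."""
    hits = retrieve(query)  # assumed shape: [(doc_id, text, score), ...]
    strong = [h for h in hits if h[2] >= min_score]
    strong.sort(key=lambda h: h[2], reverse=True)
    strong = strong[:top_k]
    if not strong:
        return ABSTAIN, []
    # Keep doc_id attached to the text so identity survives the run.
    return "OK", [(doc_id, text) for doc_id, text, _ in strong]
```

An abstain signal here is what lets the downstream step decline or escalate instead of generating over weak support.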

4. Human approval belongs on the expensive edge

Approval gates should not appear everywhere. They should appear where the system crosses a boundary that a human will care about later.

Typical approval points:

  • sending, deleting, or publishing something
  • changing financial or legal state
  • mutating code or infrastructure with real side effects
  • answering with confidence in a high-stakes domain

Everything else should be automated, logged, and reversible where possible.
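The boundary can be a single gate in the executor. A minimal sketch, assuming a list of expensive action names and an `ask_human` callable; both are placeholders for whatever your system actually uses.

```python
# Actions that cross a boundary a human will care about later.
EXPENSIVE = {"send_email", "delete_record", "publish", "change_billing"}

def execute(action, payload, ask_human, log):
    """Gate only boundary-crossing actions; automate and log the rest."""
    if action in EXPENSIVE:
        if not ask_human(action, payload):
            log.append(("rejected", action))
            return "blocked"
    log.append(("executed", action))
    return "done"
```

Everything still gets logged either way; the gate only decides whether a human sees it before it happens.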

5. Context engineering is healthier than sentimental memory

Teams often say they need memory when what they really need is a stable thread of state and evidence.

That usually comes from three things:

  • current run state
  • durable preferences worth reusing
  • retrievable artifacts such as receipts, summaries, and prior outputs

Anthropic's recent writing on context engineering is the right mental correction here. The main job is not to accumulate more text. It is to decide what context belongs in the loop, what should be retrieved on demand, what should expire, and what needs to stay inspectable.

What you do not want is a growing blob of previous conversation that nobody can audit. If the memory cannot be inspected, expired, or replayed from source artifacts, it will become mythology.
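A context store with explicit expiry is enough to avoid the mythology problem. This is a sketch, not a recommendation of any particular memory library; the class and field names are invented for illustration.

```python
import time

class ContextStore:
    """Explicit context: every entry is inspectable and can expire."""

    def __init__(self):
        self._entries = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        # Durable preferences get no TTL; run-scoped state gets one.
        expires_at = time.time() + ttl if ttl is not None else None
        self._entries[key] = (value, expires_at)

    def assemble(self):
        """Return only live context for the current step."""
        now = time.time()
        return {k: v for k, (v, exp) in self._entries.items()
                if exp is None or exp > now}
```

Because `assemble` rebuilds the working set from named entries on every step, any entry can be inspected, expired, or replayed from its source artifact.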

6. Evals are the operating system

OpenAI's current eval guidance and Anthropic's work on agent evals converge on the same operational point: if the system can take several steps, touch tools, or branch under uncertainty, evals stop being a research accessory and become the thing that lets you sleep.

The strongest eval setups usually combine:

  • task success checks
  • tool-call correctness checks
  • source-grounding or citation checks
  • refusal and escalation checks
  • latency and cost budgets
  • trace-level grading when internal behavior matters

An agent without evals is just a workflow you have chosen not to measure yet.
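A tiny eval pack combining several of the check types above might look like this. The trace structure and check names are assumptions for illustration, not a specific eval framework.

```python
def eval_trace(trace):
    """Grade one run trace against a small pack of checks."""
    checks = {
        # Task success: an answer was produced, or the run escalated.
        "task_success": trace["answer"] is not None or trace["escalated"],
        # Tool-call correctness: only permitted tools were invoked.
        "tool_calls_valid": all(t in trace["allowed_tools"]
                                for t in trace["tools_used"]),
        # Source grounding: any answer must carry citations.
        "grounded": trace["answer"] is None or len(trace["citations"]) > 0,
        # Cost budget.
        "within_budget": trace["cost_usd"] <= trace["budget_usd"],
    }
    return checks, all(checks.values())
```

Per-check results matter more than the overall boolean: a run that fails only `grounded` points at retrieval, not at the planner.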

7. Telemetry should explain decisions, not just errors

Most teams now log failures. Fewer teams log reasoning boundaries, tool choices, retrieval snapshots, approval branches, and policy triggers.

That missing context is what makes agent incidents expensive to understand.

At minimum, you want telemetry for:

  • tool selected
  • arguments used
  • documents retrieved
  • policy or guardrail events
  • human approval requests
  • final answer shape and confidence posture

The ideal trace lets another engineer answer a very plain question:

Why did this system believe it was allowed to do that?
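One structured record per step is enough to answer that question. A sketch of the fields listed above; the names are illustrative, not a telemetry standard.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class StepTrace:
    """Decision-level telemetry for a single agent step."""
    tool: str
    arguments: dict
    documents: list = field(default_factory=list)      # retrieval snapshot
    policy_events: list = field(default_factory=list)  # guardrail triggers
    approval_requested: bool = False
    answer_shape: str = "unknown"  # e.g. "cited_answer", "abstain"

    def to_log(self) -> dict:
        """Emit a structured record another engineer can replay."""
        return asdict(self)
```

The point is that the record captures what the system believed it was allowed to do, not just what went wrong afterward.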

8. The winning pattern is smaller than people expect

The production pattern I trust most still looks modest:

  1. classify the task
  2. retrieve or load only relevant context
  3. choose from a constrained set of tools
  4. execute with receipts
  5. run checks
  6. answer, abstain, or escalate

That is not glamorous. It is also why it works.
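The six steps above fit in one short loop. Every callable here (`classify`, `load_context`, the tool registry, the checks) is a stand-in you would supply; this is a shape, not an implementation.

```python
def run_task(task, classify, load_context, tools, checks):
    kind = classify(task)                    # 1. classify the task
    ctx = load_context(kind, task)           # 2. load only relevant context
    tool = tools.get(kind)                   # 3. constrained tool choice
    if tool is None:
        return {"status": "escalate", "receipts": []}
    result, receipt = tool(task, ctx)        # 4. execute with receipts
    if not all(check(result) for check in checks):
        return {"status": "abstain",         # 5. run checks; abstain on fail
                "receipts": [receipt]}
    return {"status": "answer",              # 6. answer with evidence attached
            "result": result, "receipts": [receipt]}
```

Note that every exit path carries its receipts, and the unknown-task path escalates rather than improvising.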

Tomorrow-morning pass

  • narrow the tool contracts before touching the planner
  • add approval boundaries where cost or risk becomes durable
  • log receipts for every external action
  • make context loading explicit instead of relying on ambient memory
  • build a small eval pack around the top real failure cases
Related reading

  • How to run LLM evals in production
  • LLM product safety without theater
Further reading

  • OpenAI: Agents guide
  • OpenAI: Agent evals
  • OpenAI: Trace grading
  • Anthropic: Building effective agents
  • Anthropic: Effective context engineering for agents

Part of the public notes on grounded AI systems, retrieval, evals, and delivery under real constraints.

  • How to Build Legal Answering Systems That Can Be Trusted
  • Which Query Transformation Techniques Actually Help RAG?
  • Spec-Driven Development: the workflow I actually use