Agent systems are no longer impressive because they can call tools. The quieter question is whether they can do that predictably, leave evidence behind, and stop when they should.
1. Prefer workflows until you have earned agents
The fastest way to build a fragile system is to begin with a roaming planner because it feels advanced.
That sounds obvious. Teams skip it anyway.
If the task has a mostly known sequence, a workflow gives you three things almost for free:
- clearer debugging
- more legible cost and latency
- smaller scope of failure when the model misreads the room
Use a more autonomous agent only after the workflow version has become too rigid for the real task. Before that, extra autonomy is usually just more surface area.
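The workflow-first idea can be sketched in a few lines. This is a toy, not a framework: the step names and the keyword-based classifier are hypothetical stand-ins for model calls, but the point survives — the sequence is fixed in code, so every step is a known call site with its own failure scope.

```python
def classify(ticket: str) -> str:
    # Placeholder classifier; a real system would call a model here.
    return "refund" if "refund" in ticket.lower() else "other"

def draft_reply(ticket: str, category: str) -> str:
    # Placeholder drafting step; also a model call in practice.
    return f"[{category}] Thanks for writing in about: {ticket}"

def run_workflow(ticket: str) -> dict:
    # The sequence is explicit. If step 2 misbehaves, you know exactly
    # where to look, what it cost, and what its inputs were.
    category = classify(ticket)
    reply = draft_reply(ticket, category)
    return {"category": category, "reply": reply}
```

A planner could reorder or skip these steps; the workflow cannot, and that is the feature.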
2. Treat tool use as an API contract, not a personality trait
OpenAI's current Agents documentation is useful precisely because it keeps tools grounded in reality. Web search, file search, connectors, shell, and computer use are all execution surfaces. The quality bar is therefore the same as any other integration.
A good tool contract has four properties:
- narrow input schema
- obvious failure modes
- explicit permissions
- deterministic post-processing
A tool is not valuable because a model can invoke it. It is valuable because the contract leaves fewer ways for the model to be clever at the wrong moment.
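A minimal sketch of such a contract, with hypothetical names throughout (`lookup_order`, `ALLOWED_REGIONS`, the returned fields): the input schema is narrow, every failure mode is an enumerated exception, and the output shape is fixed rather than free text.

```python
from dataclasses import dataclass

ALLOWED_REGIONS = {"us", "eu"}  # explicit permissions: nothing else exists

@dataclass(frozen=True)
class LookupOrderInput:
    # Narrow input schema: two typed fields, nothing optional.
    order_id: str
    region: str

class ToolError(Exception):
    """Every failure mode the model can observe goes through this type."""

def lookup_order(args: LookupOrderInput) -> dict:
    # Obvious failure modes, checked before anything executes.
    if args.region not in ALLOWED_REGIONS:
        raise ToolError(f"unknown region: {args.region}")
    if not args.order_id.isdigit():
        raise ToolError("order_id must be numeric")
    # Deterministic post-processing: a fixed output shape, not prose.
    return {"order_id": args.order_id, "region": args.region, "status": "shipped"}
```

The narrower the schema, the fewer ways the model can improvise with it.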
3. Retrieval is part of control, not just context
A surprising number of agent failures are retrieval failures wearing a reasoning costume.
In practice, grounded systems behave better when retrieval is treated as a control layer:
- retrieve only what the current step needs
- rerank or filter before generation
- preserve document identity through the whole run
- allow the model to abstain when support is thin
This matters even more in agentic loops, because one weak retrieval step can poison everything that follows.
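The control-layer framing can be sketched as follows, assuming an upstream scorer has already attached a relevance score to each candidate document (the `score`, `id`, and `min_score` fields are illustrative, not any particular library's API):

```python
def retrieve(scored_docs: list[dict], k: int = 3, min_score: float = 0.5) -> dict:
    # Take only what the current step needs: top-k, then filter by score
    # before anything reaches generation.
    top = sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:k]
    kept = [d for d in top if d["score"] >= min_score]
    if not kept:
        # Allow abstention when support is thin, instead of forcing
        # the model to reason over weak evidence.
        return {"abstain": True, "docs": []}
    # Preserve document identity through the whole run, so later steps
    # (and later humans) can trace every claim back to a source.
    return {"abstain": False,
            "docs": [{"id": d["id"], "text": d["text"]} for d in kept]}
```

The abstain branch is the part agent loops most often omit, and the part that prevents one weak step from poisoning the rest.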
4. Human approval belongs on the expensive edge
Approval gates should not appear everywhere. They should appear where the system crosses a boundary that a human will care about later.
Typical approval points:
- sending, deleting, or publishing something
- changing financial or legal state
- mutating code or infrastructure with real side effects
- answering with confidence in a high-stakes domain
Everything else should be automated, logged, and reversible where possible.
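A sketch of the gate, with a hypothetical action taxonomy: the approval check lives on the boundary-crossing actions only, and everything else flows through the automated path.

```python
# Hypothetical set of boundary-crossing actions; in a real system this
# would come from policy configuration, not a hardcoded set.
EXPENSIVE_EDGE = {"send_email", "delete_record", "publish", "transfer_funds"}

def needs_approval(action: str) -> bool:
    return action in EXPENSIVE_EDGE

def execute(action: str, approved: bool = False) -> str:
    if needs_approval(action) and not approved:
        # Pause and hand the decision to a human.
        return "escalate"
    # The automated path: logged, and reversible where possible.
    return "done"
```

Note the asymmetry: the gate is a set-membership check, not a judgment call made by the model at runtime.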
5. Context engineering is healthier than sentimental memory
Teams often say they need memory when what they really need is a stable thread of state and evidence.
That usually comes from three things:
- current run state
- durable preferences worth reusing
- retrievable artifacts such as receipts, summaries, and prior outputs
Anthropic's recent writing on context engineering is the right mental correction here. The main job is not to accumulate more text. It is to decide what context belongs in the loop, what should be retrieved on demand, what should expire, and what needs to stay inspectable.
What you do not want is a growing blob of previous conversation that nobody can audit. If the memory cannot be inspected, expired, or replayed from source artifacts, it will become mythology.
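The three-part split above can be sketched as an assembly step rather than an append-only log. Field names and the TTL policy are illustrative assumptions:

```python
def assemble_context(run_state: dict,
                     preferences: dict,
                     artifacts: list[dict],
                     now: float,
                     ttl_seconds: float = 3600) -> dict:
    # Expire stale artifacts instead of letting them accumulate.
    fresh = [a for a in artifacts if now - a["created_at"] <= ttl_seconds]
    return {
        "state": run_state,                      # current run state, in the loop
        "preferences": preferences,              # durable, worth reusing
        "artifacts": [a["id"] for a in fresh],   # retrieved on demand, by id,
                                                 # so the run stays replayable
    }
```

Because only artifact ids enter the loop, the context stays inspectable: anything the model saw can be replayed from the source artifact rather than from mythology.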
6. Evals are the operating system
OpenAI's current eval guidance and Anthropic's work on agent evals converge on the same operational point: if the system can take several steps, touch tools, or branch under uncertainty, evals stop being a research accessory and become the thing that lets you sleep.
The strongest eval setups usually combine:
- task success checks
- tool-call correctness checks
- source-grounding or citation checks
- refusal and escalation checks
- latency and cost budgets
- trace-level grading when internal behavior matters
An agent without evals is just a workflow you have chosen not to measure yet.
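The checks above compose naturally over a recorded run. A sketch, assuming a hypothetical trace record with `answer`, `tool_calls`, `citations`, and `cost_usd` fields (no particular eval framework implied):

```python
def grade(trace: dict, budget_usd: float = 0.05) -> tuple[dict, bool]:
    checks = {
        # Task success: did the run produce the expected outcome?
        "task_success": trace["answer"] == trace["expected"],
        # Tool-call correctness: only tools from the allowed set.
        "tool_calls_ok": all(c["name"] in trace["allowed_tools"]
                             for c in trace["tool_calls"]),
        # Source grounding: every citation maps to a retrieved document.
        "grounded": all(cit in trace["retrieved_ids"]
                        for cit in trace["citations"]),
        # Cost budget: the run stayed inside its allowance.
        "within_budget": trace["cost_usd"] <= budget_usd,
    }
    return checks, all(checks.values())
```

Returning the per-check breakdown, not just the boolean, is what makes the eval a debugging tool rather than a pass/fail stamp.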
7. Telemetry should explain decisions, not just errors
Most teams now log failures. Fewer teams log reasoning boundaries, tool choices, retrieval snapshots, approval branches, and policy triggers.
That missing context is what makes agent incidents expensive to understand.
At minimum, you want telemetry for:
- tool selected
- arguments used
- documents retrieved
- policy or guardrail events
- human approval requests
- final answer shape and confidence posture
The ideal trace lets another engineer answer a very plain question:
Why did this system believe it was allowed to do that?
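A decision-level trace entry can be as simple as a structured record emitted at every step. The field names below mirror the list above and are illustrative, not any vendor's schema:

```python
import json

def trace_event(tool: str,
                args: dict,
                retrieved_ids: list[str],
                policy_events: list[str],
                approval_requests: list[str],
                answer_shape: str) -> str:
    # One record per decision, not per error. Serialized with sorted keys
    # so traces are diffable across runs.
    event = {
        "tool_selected": tool,
        "arguments": args,
        "documents_retrieved": retrieved_ids,
        "policy_events": policy_events,
        "approval_requests": approval_requests,
        "answer_shape": answer_shape,
    }
    return json.dumps(event, sort_keys=True)
```

With records like this, "why did it believe it was allowed to do that" becomes a query over the trace instead of an archaeology project.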
8. The winning pattern is smaller than people expect
The production pattern I trust most still looks modest:
- classify the task
- retrieve or load only relevant context
- choose from a constrained set of tools
- execute with receipts
- run checks
- answer, abstain, or escalate
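Strung together, the whole pattern fits in one short function. Everything here is a stand-in (the keyword classifier, the `kind` field, the tool registry); the shape is the point:

```python
def run(task: str, index: list[dict], tools: dict) -> dict:
    # Classify the task (toy heuristic standing in for a model call).
    kind = "lookup" if "order" in task else "other"
    # Retrieve or load only relevant context.
    docs = [d for d in index if d["kind"] == kind]
    if not docs:
        return {"outcome": "abstain", "receipts": []}
    # Choose from a constrained set of tools.
    tool = tools.get(kind)
    if tool is None:
        return {"outcome": "escalate", "receipts": []}
    # Execute with receipts: record which documents backed the step.
    result = tool(docs)
    receipts = [d["id"] for d in docs]
    # Run checks, then answer, abstain, or escalate.
    ok = result is not None
    return {"outcome": "answer" if ok else "escalate",
            "receipts": receipts, "result": result}
```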
That is not glamorous. It is also why it works.