Tool-calling stopped being interesting around mid-2024. The harder question now is whether the agent does it predictably, leaves evidence, and stops when it should.
Workflows first
The fastest way to ship a fragile system is to start with a roaming planner because it feels advanced. Teams know this and do it anyway.
A workflow with a known sequence buys you cheaper debugging, more legible cost and latency, and a smaller blast radius when the model misreads the room. Reach for the more autonomous agent only after the workflow version has actually become too rigid. Before that, extra autonomy is just more surface area.
Tools are contracts
Web search, file search, connectors, shell, computer use. They are execution surfaces. The quality bar is the same as any other integration.
A tool contract worth trusting has a narrow input schema, obvious failure modes, explicit permissions, and deterministic post-processing. The contract is what closes off ways for the model to be clever at the wrong moment. A model being able to invoke a tool is not the same thing as the tool being safe to invoke.
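A minimal sketch of what that contract can look like in code. Everything here is illustrative: `ToolContract`, the schema shape, and the permission strings are assumptions, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolContract:
    name: str
    schema: dict[str, type]             # narrow input schema
    required_permission: str            # explicit permission, not implied
    postprocess: Callable[[Any], Any]   # deterministic post-processing

    def validate(self, args: dict[str, Any]) -> None:
        # Reject unknown keys and wrong types: the model cannot be
        # clever with arguments the contract never mentioned.
        unknown = set(args) - set(self.schema)
        if unknown:
            raise ValueError(f"unknown arguments: {sorted(unknown)}")
        for key, typ in self.schema.items():
            if key not in args:
                raise ValueError(f"missing argument: {key}")
            if not isinstance(args[key], typ):
                raise TypeError(f"{key} must be {typ.__name__}")

    def invoke(self, fn, args, granted_permissions):
        # Being able to call the tool is not the same as being allowed to.
        if self.required_permission not in granted_permissions:
            raise PermissionError(self.required_permission)
        self.validate(args)
        return self.postprocess(fn(**args))

search = ToolContract(
    name="file_search",
    schema={"query": str, "limit": int},
    required_permission="read:files",
    postprocess=lambda hits: hits[:5],  # cap output deterministically
)
```

The contract sits between the model and the execution surface, so every failure mode is a raised error rather than an improvised tool call.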
Retrieval is control, not flavor
A surprising number of agent failures are retrieval failures in reasoning costume.
Grounded systems behave better when retrieval is a control layer. Retrieve only what the current step needs. Rerank or filter before generation. Preserve document identity through the whole run. Let the model abstain when support is thin. Inside an agentic loop one weak retrieval step poisons everything downstream.
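Sketched as code, with the `Passage` shape, threshold, and abstain rule as assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    doc_id: str        # document identity survives the whole run
    text: str
    score: float

def ground(step_query: str, candidates: list[Passage],
           min_score: float = 0.6, k: int = 3):
    """Filter and rank before generation; abstain when support is thin."""
    kept = sorted((p for p in candidates if p.score >= min_score),
                  key=lambda p: p.score, reverse=True)[:k]
    if not kept:
        # Abstaining here stops one weak retrieval step from
        # poisoning every downstream step of the loop.
        return {"abstain": True, "reason": f"no support for: {step_query}"}
    return {"abstain": False,
            "context": [(p.doc_id, p.text) for p in kept]}
```

The point of the explicit `abstain` branch is that the generation step never sees unsupported context; it sees either grounded passages with their document identities attached, or nothing.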
Approvals on the expensive edge
Approval gates do not belong everywhere. They belong where the system crosses a boundary a human will care about later.
The honest list:
- sending, deleting, or publishing
- changing financial or legal state
- mutating code or infrastructure with real side effects
- answering with confidence in a high-stakes domain
Everything else should be automated, logged, and reversible.
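The list above fits in a few lines of code. This is a hypothetical gate, with the action names as placeholder examples; the shape is what matters, not the vocabulary.

```python
# Only actions crossing a boundary a human will care about later
# require approval; everything else is automated and logged.
IRREVERSIBLE = {"send", "delete", "publish", "transfer_funds", "deploy"}

def needs_approval(action: str, high_stakes: bool = False) -> bool:
    return action in IRREVERSIBLE or high_stakes

def execute(action: str, run, log, approve, high_stakes: bool = False):
    if needs_approval(action, high_stakes) and not approve(action):
        log(f"blocked: {action}")
        return None
    log(f"ran: {action}")
    return run()
```

Note that the default path is automated with a log entry; approval is the exception, triggered only on the expensive edge.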
Context engineering, not sentimental memory
Teams often say they need memory when what they really need is a stable thread of state and evidence. That thread comes from current run state, durable preferences worth reusing, and retrievable artifacts (receipts, summaries, prior outputs).
The job is to decide what context belongs in the loop, what should be retrieved on demand, what should expire, and what stays inspectable. Accumulating more text is not the job.
A growing blob of previous conversation that nobody can audit becomes mythology within a quarter. If the memory cannot be inspected, expired, or replayed from source artifacts, it is already mythology.
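A minimal sketch of memory that cannot become mythology, assuming a simple record shape: every entry carries its source artifact and an expiry, so it can be audited or replayed rather than trusted on faith.

```python
import time

class ContextStore:
    def __init__(self, now=time.time):
        self._now = now
        self._entries = []  # (key, value, source_artifact, expires_at)

    def put(self, key, value, source, ttl_seconds):
        # Nothing enters the store without a source and an expiry.
        self._entries.append((key, value, source, self._now() + ttl_seconds))

    def live(self):
        # Only unexpired state is visible to the loop.
        now = self._now()
        return {k: (v, src) for k, v, src, exp in self._entries if exp > now}

    def audit(self):
        # Inspectable: every live value maps back to a source artifact.
        return [(k, src) for k, _, src, exp in self._entries
                if exp > self._now()]
```

Run state gets a short TTL, durable preferences a long one, and anything whose source artifact has been lost simply ages out instead of accreting.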
Evals are the operating system
When the system takes several steps, touches tools, or branches under uncertainty, evals are the thing that lets you sleep.
The eval packs I trust combine task success checks, tool-call correctness, source grounding, refusal and escalation, latency and cost budgets, and trace-level grading when internal behavior matters. An agent without evals is a workflow you have chosen not to measure.
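An eval pack of that shape can be sketched as a set of named checks over a run trace. The trace fields and budget keys here are assumptions for the sketch, not a specific framework:

```python
def eval_run(trace: dict, budgets: dict) -> dict:
    checks = {
        # task success
        "task_success": trace["final_answer"] == trace["expected_answer"],
        # tool-call correctness: only tools the run was allowed to use
        "tool_calls_ok": all(c["tool"] in trace["allowed_tools"]
                             for c in trace["tool_calls"]),
        # source grounding: every citation maps to a retrieved document
        "grounded": all(a["doc_id"] in trace["retrieved_doc_ids"]
                        for a in trace["citations"]),
        # latency and cost budgets
        "within_latency": trace["latency_s"] <= budgets["latency_s"],
        "within_cost": trace["cost_usd"] <= budgets["cost_usd"],
    }
    checks["pass"] = all(checks.values())
    return checks
```

Each check fails independently, so a regression report says which property broke, not just that something did.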
Telemetry should explain decisions
Most teams log failures. Fewer log reasoning boundaries, tool choices, retrieval snapshots, approval branches, or policy triggers. That missing context is what makes agent incidents expensive to understand a week later.
At minimum, log:
- tool selected
- arguments used
- documents retrieved
- policy or guardrail events
- human approval requests
- final answer shape and confidence posture
The trace I want lets another engineer answer one plain question:
Why did this system believe it was allowed to do that?
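A decision-level trace event that can answer that question might look like this. The field names are assumptions; the point is that each entry records not just what happened but what permitted it.

```python
import json
import time

def trace_event(tool, args, doc_ids, policy_events, approval, answer_shape):
    event = {
        "ts": time.time(),
        "tool": tool,                   # tool selected
        "args": args,                   # arguments used
        "retrieved_docs": doc_ids,      # documents retrieved
        "policy_events": policy_events, # guardrails that fired
        "approval": approval,           # who or what allowed this
        "answer_shape": answer_shape,   # e.g. "cited", "abstained"
    }
    # One JSON line per decision, greppable a week later.
    return json.dumps(event, sort_keys=True)
```

Emitting one line per decision boundary, rather than one line per failure, is what turns an incident review into reading instead of archaeology.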
The pattern that holds up
The pattern is modest, and runs in roughly this order: classify the task, load only relevant context, choose from a constrained set of tools, execute with receipts, run checks, then answer, abstain, or escalate.
That is most of why it works.
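The whole loop fits in a dozen lines. Every callable here is a stand-in, a sketch of the shape rather than an implementation:

```python
def run_task(task, classify, load_context, tools, check):
    kind = classify(task)                    # 1. classify the task
    context = load_context(task, kind)       # 2. only relevant context
    if kind not in tools:                    # 3. constrained tool set
        return {"outcome": "escalate", "receipts": []}
    result, receipt = tools[kind](task, context)  # 4. execute with receipts
    if check(result, context):               # 5. run checks
        return {"outcome": "answer", "result": result,
                "receipts": [receipt]}
    return {"outcome": "abstain", "receipts": [receipt]}
```

Every exit path is one of the three named outcomes, and every executed step leaves a receipt behind.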