Alex ChernyshAlex ChernyshAgentic behaviorist · Tel Aviv
WritingAssistant
Back to notes

Note

LLM Product Safety Without Theater

A practical guide to LLM product safety: prompt injection, excessive agency, unsafe outputs, evals, and sober boundaries.

March 9, 2026·5 min read
DesignSafetySecurity
On this page(11)
Safety is product behaviourPrompt injection belongs in the normal threat modelExcessive agency is a design bugOutput validation, because downstream systems are literalWhere to put which checkCritical pathMonitoring and reviewEvals make safety harder to fakeMonitoring should explain decisionsA mature safety stack feels soberWhat I would do firstRelated readingFurther reading

Most LLM products do not fail for lack of someone saying "safety" out loud. They fail because the safety story stayed in the slide deck while the rest of the system kept shipping around it.

Working rule

Safety is real once it changes what the system is allowed to do, what it has to log, and what it must refuse. Until then it is theatre with extra meetings.

Minimum viable safety posture

  • treat prompt injection as expected input, not exotic sabotage
  • constrain tool permissions and approval boundaries
  • validate outputs before they hit downstream systems
  • run evals and red-team cases on the behaviours that matter most
  • preserve enough telemetry to explain why the system acted
Layered safety path
A healthy posture is layered. Control what the model sees, what it can do, what can leave the system, and what gets reviewed later.

Safety is product behaviour

Some teams hear "safety" and reach for policy binders. Others hear it and reach for censorship. Neither is the job.

In product terms safety covers what the system can see, what it can do, what it can claim, what it must refuse, and how failures are observed and contained. That is a long list and it should be. Each item is a place where something specific gets written down.

Good safety work looks dull in code. Narrower permissions. Clearer approval points. Safer defaults. Auditable traces. Release gates around the rare action that can break something for real. Dull is the whole point.

Prompt injection belongs in the normal threat model

OWASP's LLM Top 10 still leads with prompt injection, and it deserves the spot. Too many systems still trust model-consumed text more than they should.

The rule is short.

Untrusted content does not get to redefine the system's instructions or its permissions.

Retrieved documents, emails, web pages, third-party data. Treat them as hostile by default in any path that matters. A model is allowed to read a document. The document does not get to write the system's policy.

Excessive agency is a design bug

OWASP now calls excessive agency out by name. Overdue.

The problem was never agency. The problem is broad permissions sitting next to vague boundaries with thin review. That combination breaks even when nothing about the model is wrong.

A healthier pattern looks narrower. Tool scopes that fit the job. Typed contracts. Explicit approvals for durable side effects. Reversible operations where the choice exists. Telemetry on every external action.

If your system can email, purchase, delete, deploy, or mutate records, the permission model is product infrastructure. It is not prompt decoration.

Output validation, because downstream systems are literal

Unsafe output is more than offensive text. It is malformed JSON entering a workflow. SQL or code suggestions reaching execution paths without checks. Confident legal or medical claims with no support. Links and commands that inherit too much trust from the surrounding interface.

This is why OWASP's categories around insecure output handling and sensitive information disclosure stay practical. The output is where a fuzzy model meets a literal system. That meeting needs a chaperone.

Where to put which check

Not every defence belongs on the critical path.

Critical path

Anything that prevents immediate damage. Permission boundaries. Output schema validation. Approval gates for the dangerous calls. High-confidence blocks for known-forbidden behaviour.

Monitoring and review

The slower or noisier work. Deeper red-team analysis. Trend monitoring. Judge-model grading. Broad anomaly review.

Plenty of teams get this wired backwards. Either the critical path drowns in expensive checks, or dangerous behaviour quietly waits for the postmortem.

Evals make safety harder to fake

A safety story should survive contact with a real eval suite.

Useful test cases. Prompt injection attempts. Unsupported-claim scenarios. Unsafe tool-call proposals. Data exfiltration attempts. Refusal cases driven by policy. Escalation boundaries.

Anthropic's recent writing on agent evals keeps coming back to one discipline. Define the task, define the grading logic, measure repeatedly. Safety work gets a lot better the moment it stops sounding like posture and starts looking like test design.

Monitoring should explain decisions

A dashboard that tells you something bad happened beats nothing. A dashboard that tells you why is the one you actually want at 02:00.

What you want to see for a real incident. Which input triggered the behaviour. What context was present. Which tool the system tried to call. What policy or approval boundary fired. What finally reached the user or the downstream system.

Without that, the review turns into archaeology with worse morale.

A mature safety stack feels sober

The systems I trust never feel paranoid. They feel disciplined.

They do not promise perfection. They do not claim the model is now safe in some mystical global sense. They just shrink the number of ways the system can cause expensive trouble. That is most of the job.

What I would do first

If I had to harden an LLM product this month, the order I would work in.

  1. map the real side effects and data exposures
  2. narrow tool permissions and approval boundaries
  3. add high-value safety evals on the top risky behaviours
  4. validate outputs before they hit literal downstream systems
  5. lift telemetry until incident review stops feeling like guesswork

Ceremony can wait. Controls cannot.

Related reading

Related reading

  • Preventing hallucinations in LLM systems
  • Building agentic AI systems that hold up
References

Further reading

  • OWASP Top 10 for LLM Applications
  • NIST AI Risk Management Framework
  • Anthropic: Demystifying evals for AI agents

✓ Reading complete

Alex ChernyshAlex ChernyshApplied AI Systems & Platform Engineer

More on Design

Part of the public notes on grounded AI systems, retrieval, evals, and shipping under real constraints.

  • →Need a job? Sip your drink. We'll look for you.Apr 23, 2026·4 min read
  • →Interface Design for Serious ProductsMar 6, 2026·7 min read
  • →Preventing Hallucinations in LLM SystemsFeb 18, 2026·5 min read
On this page
  • 01Safety is product behaviour
  • 02Prompt injection belongs in the normal threat model
  • 03Excessive agency is a design bug
  • 04Output validation, because downstream systems are literal
  • 05Where to put which check
  • Critical path
  • Monitoring and review
  • 06Evals make safety harder to fake
  • 07Monitoring should explain decisions
  • 08A mature safety stack feels sober
  • 09What I would do first
  • 10Related reading
  • 11Further reading