Note

LLM Product Safety Without Theater

A practical guide to LLM product safety: prompt injection, excessive agency, unsafe outputs, evals, and sober boundaries.

March 9, 20265 min read

Design Safety Security

On this page(11)

Most LLM products do not fail for lack of someone saying "safety" out loud. They fail because the safety story stayed in the slide deck while the rest of the system kept shipping around it.

Layered safety path

A healthy posture is layered. Control what the model sees, what it can do, what can leave the system, and what gets reviewed later.

Safety is product behaviour

Some teams hear "safety" and reach for policy binders. Others hear it and reach for censorship. Neither is the job.

In product terms safety covers what the system can see, what it can do, what it can claim, what it must refuse, and how failures are observed and contained. That is a long list and it should be. Each item is a place where something specific gets written down.

Good safety work looks dull in code. Narrower permissions. Clearer approval points. Safer defaults. Auditable traces. Release gates around the rare action that can break something for real. Dull is the whole point.

Prompt injection belongs in the normal threat model

OWASP's LLM Top 10 still leads with prompt injection, and it deserves the spot. Too many systems still trust model-consumed text more than they should.

The rule is short.

Untrusted content does not get to redefine the system's instructions or its permissions.

Retrieved documents, emails, web pages, third-party data. Treat them as hostile by default in any path that matters. A model is allowed to read a document. The document does not get to write the system's policy.

Excessive agency is a design bug

OWASP now calls excessive agency out by name. Overdue.

The problem was never agency. The problem is broad permissions sitting next to vague boundaries with thin review. That combination breaks even when nothing about the model is wrong.

A healthier pattern looks narrower. Tool scopes that fit the job. Typed contracts. Explicit approvals for durable side effects. Reversible operations where the choice exists. Telemetry on every external action.

If your system can email, purchase, delete, deploy, or mutate records, the permission model is product infrastructure. It is not prompt decoration.

Output validation, because downstream systems are literal

Unsafe output is more than offensive text. It is malformed JSON entering a workflow. SQL or code suggestions reaching execution paths without checks. Confident legal or medical claims with no support. Links and commands that inherit too much trust from the surrounding interface.

This is why OWASP's categories around insecure output handling and sensitive information disclosure stay practical. The output is where a fuzzy model meets a literal system. That meeting needs a chaperone.

Where to put which check

Not every defence belongs on the critical path.

Critical path

Anything that prevents immediate damage. Permission boundaries. Output schema validation. Approval gates for the dangerous calls. High-confidence blocks for known-forbidden behaviour.

Monitoring and review

The slower or noisier work. Deeper red-team analysis. Trend monitoring. Judge-model grading. Broad anomaly review.

Plenty of teams get this wired backwards. Either the critical path drowns in expensive checks, or dangerous behaviour quietly waits for the postmortem.

Evals make safety harder to fake

A safety story should survive contact with a real eval suite.

Useful test cases. Prompt injection attempts. Unsupported-claim scenarios. Unsafe tool-call proposals. Data exfiltration attempts. Refusal cases driven by policy. Escalation boundaries.

Anthropic's recent writing on agent evals keeps coming back to one discipline. Define the task, define the grading logic, measure repeatedly. Safety work gets a lot better the moment it stops sounding like posture and starts looking like test design.

Monitoring should explain decisions

A dashboard that tells you something bad happened beats nothing. A dashboard that tells you why is the one you actually want at 02:00.

What you want to see for a real incident. Which input triggered the behaviour. What context was present. Which tool the system tried to call. What policy or approval boundary fired. What finally reached the user or the downstream system.

Without that, the review turns into archaeology with worse morale.

A mature safety stack feels sober

The systems I trust never feel paranoid. They feel disciplined.

They do not promise perfection. They do not claim the model is now safe in some mystical global sense. They just shrink the number of ways the system can cause expensive trouble. That is most of the job.

What I would do first

If I had to harden an LLM product this month, the order I would work in.

map the real side effects and data exposures
narrow tool permissions and approval boundaries
add high-value safety evals on the top risky behaviours
validate outputs before they hit literal downstream systems
lift telemetry until incident review stops feeling like guesswork

Ceremony can wait. Controls cannot.

Safety is product behaviour

Prompt injection belongs in the normal threat model

Excessive agency is a design bug

Output validation, because downstream systems are literal

Where to put which check

Critical path

Monitoring and review

Evals make safety harder to fake

Monitoring should explain decisions

A mature safety stack feels sober

What I would do first

Related reading

Further reading