Hallucination prevention stopped being one trick. It is a stack: retrieval discipline, clearer response formats, explicit abstention, claim checks, and guardrails that are honest about what they can and cannot prove.
Thin defenses
The model is asked to be accurate.
- weak retrieval
- no abstention rule
- free-form outputs
- one vague quality score
Grounded system
The system is designed to stay support-bound.
- retrieval is scoped and inspectable
- unsupported claims are allowed to stop the answer
- outputs are checkable by contract
- evals and validators target specific failure classes
Hallucination is a system-design failure
A lot of hallucinations are not model failures. They are system failures.
The model gets blamed for facts it was never given, formats it was never shown, policies it was never allowed to follow honestly. In production, hallucinations come from weak context, ambiguous tasks, unconstrained output shape, vague refusal rules, and missing downstream checks.
If you only swap models, you might reduce the symptoms. You probably will not fix the disease.
Retrieval discipline over volume
Groundedness improves when the system retrieves less but better.
- retrieve only evidence relevant to the specific question
- preserve source identity through ranking and generation
- require the model to answer from the retrieved set or abstain
- analyse retrieval failures separately from generation failures
The common anti-pattern is to stuff the prompt with everything remotely related and hope the model becomes wiser through saturation. It just gets noisier.
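The steps above can be sketched as a toy retriever. This is an illustration, not a production ranker: the corpus, the overlap score, and the threshold are all stand-ins, but the two properties that matter are real: a relevance floor that lets the result set be empty, and a source id that survives ranking.

```python
# Illustrative sketch of scoped retrieval: less but better, with source
# identity preserved. The scoring function and threshold are stand-ins.
from dataclasses import dataclass

@dataclass
class Passage:
    source_id: str   # preserved through ranking and into the prompt
    text: str
    score: float = 0.0

def scoped_retrieve(query: str, corpus: list[Passage],
                    k: int = 3, min_score: float = 0.34) -> list[Passage]:
    """Return at most k passages above a relevance floor, or nothing at all."""
    terms = set(query.lower().split())
    for p in corpus:
        words = set(p.text.lower().split())
        p.score = len(terms & words) / max(len(terms), 1)  # crude overlap score
    ranked = sorted(corpus, key=lambda p: p.score, reverse=True)
    return [p for p in ranked[:k] if p.score >= min_score]  # less but better

corpus = [
    Passage("kb/refunds.md", "A refund is issued within 14 days of approval"),
    Passage("kb/shipping.md", "Standard shipping takes 5 business days"),
]
hits = scoped_retrieve("how many days until a refund is issued", corpus)
# An empty result is a signal to abstain, not an invitation to improvise.
```

The deliberate choice is that `scoped_retrieve` can return nothing: downstream code treats an empty set as an abstention trigger rather than padding the prompt with near-misses.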
Make uncertainty legal in the prompt
Prompting matters most when it sets boundaries.
A good high-stakes prompt defines the task precisely, defines the expected output format, defines what counts as enough evidence, and explicitly allows the model to say the answer is unsupported.
If the prompt implies that an answer must always appear, an answer will appear. That is not intelligence. It is leakage from your incentives.
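A hedged sketch of what such a prompt can look like. The exact wording is illustrative; the structure is the point: the task, the output format, the evidence bar, and a legal refusal are all spelled out.

```python
# Assumed template for a high-stakes prompt that makes abstention legal.
GROUNDED_PROMPT = """\
Task: answer the user's question using ONLY the evidence below.

Evidence:
{evidence}

Rules:
- Cite the source id for every factual claim.
- Sufficient evidence means at least one passage directly supports the claim.
- If the evidence is insufficient, reply exactly:
  "The retrieved material does not support a reliable answer."

Output format: JSON with keys "answer" and "citations".

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    evidence="[kb/refunds.md] A refund is issued within 14 days of approval",
    question="How long do refunds take?",
)
```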
Examples shape behaviour
Strong examples do more than make the answer look nicer.
They teach the model to:
- cite only when evidence exists
- stay concise when support is thin
- preserve a strict JSON or markdown schema
- refuse when a field cannot be justified
A few good examples often outperform one more paragraph of elegant instructions.
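One way to encode this: pair a grounded answer with a refusal in the few-shot set, so the model has seen the refusal shape before it needs it. The field names and schema below are assumptions for illustration, not a standard.

```python
# Illustrative few-shot pairs: one grounded answer, one refusal.
# Schema and field names are assumptions, not a standard.
FEW_SHOT = [
    {
        "evidence": "[kb/refunds.md] A refund is issued within 14 days of approval",
        "question": "How long do refunds take?",
        "answer": '{"answer": "Up to 14 days after approval.", '
                  '"citations": ["kb/refunds.md"]}',
    },
    {
        # No supporting passage: this example teaches the refusal shape.
        "evidence": "[kb/shipping.md] Standard shipping takes 5 business days",
        "question": "What is the refund deadline?",
        "answer": '{"answer": null, "citations": [], '
                  '"note": "The retrieved material does not support a reliable answer."}',
    },
]
```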
Guardrails inside a threat model
Guardrails help when they are honest about what they cover. They are useful for policy checks, structured domain rules, bounded post-answer validation, and specific high-risk behaviours that can be classified reliably.
They are not a spell that makes the whole response true.
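A bounded guardrail looks less like a truth oracle and more like this sketch: a few named, classifiable checks that return what they found and nothing more. The specific rules here (uncited numbers, an SSN-shaped pattern, a length cap) are assumptions for illustration.

```python
# A bounded guardrail sketch: it checks a few classifiable risks and
# names them, rather than claiming to certify the whole answer as true.
import re

def guard(answer: str, citations: list[str]) -> list[str]:
    """Return named violations; an empty list means 'no known risk found'."""
    violations = []
    if re.search(r"\b\d", answer) and not citations:
        violations.append("numeric-claim-without-citation")
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", answer):   # US-SSN-shaped string
        violations.append("possible-pii")
    if len(answer) > 2000:
        violations.append("overlong-for-policy-review")
    return violations
```

Note the return value is a list of named violations, not a pass/fail verdict: an empty list means "no known risk found", which is an honest claim, unlike "true".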
OWASP's LLM Top 10 is still useful because it forces teams to think beyond "the model might be wrong." Prompt injection, data leakage, insecure output handling, and excessive agency turn hallucination into something more expensive than a bad paragraph.
Claim-level verification beats answer-level vibes
The production systems I trust break answers into checks that can be evaluated independently.
Instead of "does this answer seem fine?", ask:
- which claims depend on retrieved evidence
- which are date- or number-sensitive
- which are policy-bound
- which should trigger abstention if unsupported
This lets the system trim or block the unsafe parts instead of throwing away the entire answer every time something feels suspicious.
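A minimal sketch of that idea, assuming naive sentence splitting as the claim decomposer and stub classifiers; real systems would use an entailment model or a claim extractor here.

```python
# Sketch: split an answer into claims so each can be trimmed or blocked
# on its own. The splitter and classifiers are deliberately naive stubs.
import re
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    needs_evidence: bool
    date_or_number: bool

def decompose(answer: str) -> list[Claim]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [
        Claim(
            text=s,
            # Stub heuristic: hedged openers do not demand evidence.
            needs_evidence=not s.lower().startswith(("i cannot", "in general")),
            date_or_number=bool(re.search(r"\d", s)),
        )
        for s in sentences
    ]

def enforce(claims: list[Claim], supported: set[str]) -> list[str]:
    """Keep supported claims, drop the rest, instead of rejecting everything."""
    return [c.text for c in claims if not c.needs_evidence or c.text in supported]
```

The payoff is in `enforce`: an answer with one unsupported sentence loses that sentence, not the whole response.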
Evals catch what prompt reviews miss
OpenAI's current eval guidance is still the right operational lens. If you care about truthfulness in production, build evals into the shipping path.
A layered pack:
- answer-grounding checks
- unsupported-claim refusal checks
- citation integrity checks
- structured-output checks
- risky-domain red-team cases
The important part is not the dataset. It is the habit. Rerun the same checks after prompt changes, retrieval changes, model swaps, and ranking tweaks.
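The habit is easier to keep when the pack is a single callable. A minimal sketch, assuming a flat case format and two of the checks from the list above; the case fields and check names are illustrative.

```python
# A minimal eval-pack sketch: named checks, rerun on every prompt,
# retrieval, or model change. Case fields and check names are assumed.
from typing import Callable

EvalCase = dict  # {"question", "answer", "citations", "unsupported"}

def grounding_check(case: EvalCase) -> bool:
    # Any non-refusal answer must carry at least one citation.
    return bool(case["citations"]) or case["answer"] is None

def refusal_check(case: EvalCase) -> bool:
    # Unsupported questions must produce the refusal, not a guess.
    return not case["unsupported"] or case["answer"] is None

CHECKS: dict[str, Callable[[EvalCase], bool]] = {
    "answer-grounding": grounding_check,
    "unsupported-claim-refusal": refusal_check,
}

def run_pack(cases: list[EvalCase]) -> dict[str, float]:
    """Pass rate per check; wire this into the shipping path."""
    return {
        name: sum(check(c) for c in cases) / len(cases)
        for name, check in CHECKS.items()
    }
```

Because the pack returns a rate per named check, a regression after a model swap points at a failure class, not at a vague drop in quality.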
Streaming creates a special problem
Streaming is good product design. It also shortens your time to regret.
Because streaming emits partial text, a final-answer check alone is not enough. If unsupported or sensitive content cannot appear in public, you need one or more of:
- generation scoped tightly enough that a bad answer is less likely
- buffering or delayed release for guarded fields
- chunk-level sanitation with withheld tails
- a non-streaming path for the riskiest answer types
This is one reason formal post-generation validation tools often sit behind non-streaming or semi-buffered flows.
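Chunk-level sanitation with a withheld tail can be sketched as a generator that only releases complete sentences, screening each before emission. The token source and the `screen` rule are stand-ins; screening per sentence is one way to keep a guarded pattern from straddling the emitted/withheld boundary.

```python
# Sketch of buffered streaming: emit complete, screened sentences and
# withhold the unfinished tail. Token source and screen rule are stand-ins.
import re

def screen(text: str) -> str:
    # Stand-in guard: redact anything shaped like a US SSN.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[redacted]", text)

def buffered_stream(tokens):
    """Release text a sentence at a time, screening each before emission."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # Emit complete sentences only; the unfinished tail stays withheld.
        while (m := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:m.end()], buffer[m.end():]
            yield screen(sentence)
    if buffer:
        yield screen(buffer)  # flush the tail once generation is complete
```

The user still sees text incrementally; they just never see a sentence the guard has not finished reading.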
The best fallback is a useful refusal
A good refusal is a precise boundary, not a generic apology.
Examples:
- "The retrieved material does not support a reliable answer."
- "I can summarise the available facts, but I should not infer beyond them."
- "This claim needs a source-backed check before I answer directly."
The refusal should preserve trust and keep the next step obvious.
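One small mechanism that helps here: map each named failure class to a refusal that states the boundary and a next step, so the fallback is chosen rather than improvised. The failure labels and wording below are assumptions for illustration.

```python
# Sketch: pick a refusal that names the boundary and keeps the next
# step obvious. Failure labels and wording are illustrative.
REFUSALS = {
    "no-evidence": (
        "The retrieved material does not support a reliable answer. "
        "Try narrowing the question or adding a source."
    ),
    "thin-evidence": (
        "I can summarise the available facts, but I should not infer beyond them."
    ),
    "needs-verification": (
        "This claim needs a source-backed check before I answer directly."
    ),
}

def refuse(failure: str) -> str:
    # Unknown failure classes fall back to the strictest refusal.
    return REFUSALS.get(failure, REFUSALS["no-evidence"])
```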