Hallucination prevention stopped being one trick. It is a stack: retrieval discipline, clearer response formats, explicit abstention, claim checks, and guardrails that are honest about what they can and cannot prove.
Thin defenses
The model is asked to be accurate.
- weak retrieval
- no abstention rule
- free-form outputs
- one vague quality score
Grounded system
The system is designed to stay support-bound.
- retrieval is scoped and inspectable
- unsupported claims are allowed to stop the answer
- outputs are checkable by contract
- evals and validators target specific failure classes
Hallucination is a system-design failure
A lot of hallucinations are not model failures. They are system failures.
The model gets blamed for facts it was never given, formats it was never shown, policies it was never allowed to follow honestly. In production, hallucinations come from weak context, ambiguous tasks, unconstrained output shape, vague refusal rules, and missing downstream checks.
If you only swap models, you might reduce the symptoms. You probably will not fix the disease.
Retrieval discipline over volume
Groundedness improves when the system retrieves less but better.
- retrieve only evidence relevant to the specific question
- preserve source identity through ranking and generation
- require the model to answer from the retrieved set or abstain
- analyse retrieval failures separately from generation failures
The common anti-pattern is to stuff the prompt with everything remotely related and hope the model becomes wiser through saturation. It just gets noisier.
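The steps above can be sketched as a toy retriever. This is an illustration, not a production ranker: the corpus, the overlap score, and the threshold are all stand-ins, but the two properties that matter are real: a relevance floor that lets the result set be empty, and a source id that survives ranking.

```python
# Illustrative sketch of scoped retrieval: less but better, with source
# identity preserved. The scoring function and threshold are stand-ins.
from dataclasses import dataclass

@dataclass
class Passage:
    source_id: str   # preserved through ranking and into the prompt
    text: str
    score: float = 0.0

def scoped_retrieve(query: str, corpus: list[Passage],
                    k: int = 3, min_score: float = 0.34) -> list[Passage]:
    """Return at most k passages above a relevance floor, or nothing at all."""
    terms = set(query.lower().split())
    for p in corpus:
        words = set(p.text.lower().split())
        p.score = len(terms & words) / max(len(terms), 1)  # crude overlap score
    ranked = sorted(corpus, key=lambda p: p.score, reverse=True)
    return [p for p in ranked[:k] if p.score >= min_score]  # less but better

corpus = [
    Passage("kb/refunds.md", "A refund is issued within 14 days of approval"),
    Passage("kb/shipping.md", "Standard shipping takes 5 business days"),
]
hits = scoped_retrieve("how many days until a refund is issued", corpus)
# An empty result is a signal to abstain, not an invitation to improvise.
```

The deliberate choice is that `scoped_retrieve` can return nothing: downstream code treats an empty set as an abstention trigger rather than padding the prompt with near-misses.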
Make uncertainty legal in the prompt
Prompting matters most when it sets boundaries.
A good high-stakes prompt defines the task precisely, defines the expected output format, defines what counts as enough evidence, and explicitly allows the model to say the answer is unsupported.
If the prompt implies that an answer must always appear, an answer will appear. That is not intelligence. It is leakage from your incentives.
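A hedged sketch of what such a prompt can look like. The exact wording is illustrative; the structure is the point: the task, the output format, the evidence bar, and a legal refusal are all spelled out.

```python
# Assumed template for a high-stakes prompt that makes abstention legal.
GROUNDED_PROMPT = """\
Task: answer the user's question using ONLY the evidence below.

Evidence:
{evidence}

Rules:
- Cite the source id for every factual claim.
- Sufficient evidence means at least one passage directly supports the claim.
- If the evidence is insufficient, reply exactly:
  "The retrieved material does not support a reliable answer."

Output format: JSON with keys "answer" and "citations".

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    evidence="[kb/refunds.md] A refund is issued within 14 days of approval",
    question="How long do refunds take?",
)
```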
Examples shape behaviour
Strong examples do more than make the answer look nicer.
They teach the model to:
- cite only when evidence exists
- stay concise when support is thin
- preserve a strict JSON or markdown schema
- refuse when a field cannot be justified
A few good examples often outperform one more paragraph of elegant instructions.
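One way to encode this: pair a grounded answer with a refusal in the few-shot set, so the model has seen the refusal shape before it needs it. The field names and schema below are assumptions for illustration, not a standard.

```python
# Illustrative few-shot pairs: one grounded answer, one refusal.
# Schema and field names are assumptions, not a standard.
FEW_SHOT = [
    {
        "evidence": "[kb/refunds.md] A refund is issued within 14 days of approval",
        "question": "How long do refunds take?",
        "answer": '{"answer": "Up to 14 days after approval.", '
                  '"citations": ["kb/refunds.md"]}',
    },
    {
        # No supporting passage: this example teaches the refusal shape.
        "evidence": "[kb/shipping.md] Standard shipping takes 5 business days",
        "question": "What is the refund deadline?",
        "answer": '{"answer": null, "citations": [], '
                  '"note": "The retrieved material does not support a reliable answer."}',
    },
]
```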
Guardrails inside a threat model
Guardrails help when they are honest about what they cover. They are useful for policy checks, structured domain rules, bounded post-answer validation, and specific high-risk behaviours that can be classified reliably.
They are not a spell that makes the whole response true.
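A bounded guardrail looks less like a truth oracle and more like this sketch: a few named, classifiable checks that return what they found and nothing more. The specific rules here (uncited numbers, an SSN-shaped pattern, a length cap) are assumptions for illustration.

```python
# A bounded guardrail sketch: it checks a few classifiable risks and
# names them, rather than claiming to certify the whole answer as true.
import re

def guard(answer: str, citations: list[str]) -> list[str]:
    """Return named violations; an empty list means 'no known risk found'."""
    violations = []
    if re.search(r"\b\d", answer) and not citations:
        violations.append("numeric-claim-without-citation")
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", answer):   # US-SSN-shaped string
        violations.append("possible-pii")
    if len(answer) > 2000:
        violations.append("overlong-for-policy-review")
    return violations
```

Note the return value is a list of named violations, not a pass/fail verdict: an empty list means "no known risk found", which is an honest claim, unlike "true".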
OWASP's LLM Top 10 is still useful because it forces teams to think beyond "the model might be wrong." Prompt injection, data leakage, insecure output handling, and excessive agency turn hallucination into something more expensive than a bad paragraph.
Claim-level verification beats answer-level vibes
The production systems I trust break answers into checks that can be evaluated independently.
Instead of "does this answer seem fine?", ask:
- which claims depend on retrieved evidence
- which are date- or number-sensitive
- which are policy-bound
- which should trigger abstention if unsupported
This lets the system trim or block the unsafe parts instead of throwing away the entire answer every time something feels suspicious.
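A minimal sketch of that idea, assuming naive sentence splitting as the claim decomposer and stub classifiers; real systems would use an entailment model or a claim extractor here.

```python
# Sketch: split an answer into claims so each can be trimmed or blocked
# on its own. The splitter and classifiers are deliberately naive stubs.
import re
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    needs_evidence: bool
    date_or_number: bool

def decompose(answer: str) -> list[Claim]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [
        Claim(
            text=s,
            # Stub heuristic: hedged openers do not demand evidence.
            needs_evidence=not s.lower().startswith(("i cannot", "in general")),
            date_or_number=bool(re.search(r"\d", s)),
        )
        for s in sentences
    ]

def enforce(claims: list[Claim], supported: set[str]) -> list[str]:
    """Keep supported claims, drop the rest, instead of rejecting everything."""
    return [c.text for c in claims if not c.needs_evidence or c.text in supported]
```

The payoff is in `enforce`: an answer with one unsupported sentence loses that sentence, not the whole response.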
Evals catch what prompt reviews miss
OpenAI's current eval guidance is still the right operational lens. If you care about truthfulness in production, build evals into the shipping path.
A layered pack:
- answer-grounding checks
- unsupported-claim refusal checks
- citation integrity checks
- structured-output checks
- risky-domain red-team cases
The important part is not the dataset. It is the habit. Rerun the same checks after prompt changes, retrieval changes, model swaps, and ranking tweaks.
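The habit is easier to keep when the pack is a single callable. A minimal sketch, assuming a flat case format and two of the checks from the list above; the case fields and check names are illustrative.

```python
# A minimal eval-pack sketch: named checks, rerun on every prompt,
# retrieval, or model change. Case fields and check names are assumed.
from typing import Callable

EvalCase = dict  # {"question", "answer", "citations", "unsupported"}

def grounding_check(case: EvalCase) -> bool:
    # Any non-refusal answer must carry at least one citation.
    return bool(case["citations"]) or case["answer"] is None

def refusal_check(case: EvalCase) -> bool:
    # Unsupported questions must produce the refusal, not a guess.
    return not case["unsupported"] or case["answer"] is None

CHECKS: dict[str, Callable[[EvalCase], bool]] = {
    "answer-grounding": grounding_check,
    "unsupported-claim-refusal": refusal_check,
}

def run_pack(cases: list[EvalCase]) -> dict[str, float]:
    """Pass rate per check; wire this into the shipping path."""
    return {
        name: sum(check(c) for c in cases) / len(cases)
        for name, check in CHECKS.items()
    }
```

Because the pack returns a rate per named check, a regression after a model swap points at a failure class, not at a vague drop in quality.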
Streaming creates a special problem
Streaming is good product design. It also shortens your time to regret.
Because streaming emits partial text, a final-answer check alone is not enough. If unsupported or sensitive content cannot appear in public, you need one or more of:
- generation scoped tightly enough that a bad answer is less likely
- buffering or delayed release for guarded fields
- chunk-level sanitation with withheld tails
- a non-streaming path for the riskiest answer types
This is one reason formal post-generation validation tools often sit behind non-streaming or semi-buffered flows.
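Chunk-level sanitation with a withheld tail can be sketched as a generator that only releases complete sentences, screening each before emission. The token source and the `screen` rule are stand-ins; screening per sentence is one way to keep a guarded pattern from straddling the emitted/withheld boundary.

```python
# Sketch of buffered streaming: emit complete, screened sentences and
# withhold the unfinished tail. Token source and screen rule are stand-ins.
import re

def screen(text: str) -> str:
    # Stand-in guard: redact anything shaped like a US SSN.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[redacted]", text)

def buffered_stream(tokens):
    """Release text a sentence at a time, screening each before emission."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # Emit complete sentences only; the unfinished tail stays withheld.
        while (m := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:m.end()], buffer[m.end():]
            yield screen(sentence)
    if buffer:
        yield screen(buffer)  # flush the tail once generation is complete
```

The user still sees text incrementally; they just never see a sentence the guard has not finished reading.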
The best fallback is a useful refusal
A good refusal is a precise boundary, not a generic apology.
Examples:
- "The retrieved material does not support a reliable answer."
- "I can summarise the available facts, but I should not infer beyond them."
- "This claim needs a source-backed check before I answer directly."
The refusal should preserve trust and keep the next step obvious.
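One small mechanism that helps here: map each named failure class to a refusal that states the boundary and a next step, so the fallback is chosen rather than improvised. The failure labels and wording below are assumptions for illustration.

```python
# Sketch: pick a refusal that names the boundary and keeps the next
# step obvious. Failure labels and wording are illustrative.
REFUSALS = {
    "no-evidence": (
        "The retrieved material does not support a reliable answer. "
        "Try narrowing the question or adding a source."
    ),
    "thin-evidence": (
        "I can summarise the available facts, but I should not infer beyond them."
    ),
    "needs-verification": (
        "This claim needs a source-backed check before I answer directly."
    ),
}

def refuse(failure: str) -> str:
    # Unknown failure classes fall back to the strictest refusal.
    return REFUSALS.get(failure, REFUSALS["no-evidence"])
```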