Alex Chernysh · Agentic behaviorist · Tel Aviv

Preventing Hallucinations in LLM Systems

How to reduce hallucinations in LLM systems with better retrieval, abstention, verification, evals, and guardrails.

February 18, 2026 · 5 min read
RAG · Reliability · Safety

Hallucination prevention stopped being one trick. It is a stack: retrieval discipline, clearer response formats, explicit abstention, claim checks, and guardrails that are honest about what they can and cannot prove.

Plain truth

No serious strategy is called "trust the model more." The serious ones combine better support, better verification, and the discipline to stop when support is missing.

Verification stack

  • retrieval that can actually support the claim
  • prompts that permit useful abstention
  • response formats that make checking possible
  • claim-level or slot-level verification where the risk justifies it

Thin defenses

The model is asked to be accurate.

  • weak retrieval
  • no abstention rule
  • free-form outputs
  • one vague quality score

Grounded system

The system is designed to stay support-bound.

  • retrieval is scoped and inspectable
  • unsupported claims are allowed to stop
  • outputs are checkable by contract
  • evals and validators target specific failure classes

Grounded answer path

Support first. If support is weak, decline without drama.

Hallucination is a system-design failure

A lot of hallucinations are not model failures. They are system failures.

The model gets blamed for facts it was never given, formats it was never shown, policies it was never allowed to follow honestly. In production, hallucinations come from weak context, ambiguous tasks, unconstrained output shape, refusal rules that are too vague, and missing downstream checks.

If you only swap models, you might reduce the symptoms. You probably will not fix the disease.

Retrieval discipline over volume

Groundedness improves when the system retrieves less but better.

Retrieve only evidence relevant to the specific question. Preserve source identity through ranking and generation. Require the model to answer from the retrieved set or abstain. Analyse retrieval failures separately from generation failures.

The common anti-pattern is to stuff the prompt with everything remotely related and hope the model becomes wiser through saturation. It just gets noisier.
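
A minimal sketch of that discipline in Python. The `search` callable and the score threshold are assumptions, not a specific retriever's API; the point is that relevance is filtered, source identity survives into the prompt, and an empty evidence set is a signal to abstain rather than improvise.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Passage:
    source_id: str   # preserved through ranking so citations stay checkable
    text: str
    score: float     # relevance score from the retriever or reranker

def build_evidence(
    query: str,
    search: Callable[[str], list[Passage]],  # hypothetical retriever
    k: int = 5,
    min_score: float = 0.6,
) -> list[Passage]:
    """Retrieve less but better: keep only passages that clear a relevance bar."""
    candidates = search(query)
    return [p for p in candidates if p.score >= min_score][:k]

def format_context(passages: list[Passage]) -> str:
    """Source ids stay visible in the prompt; an empty string means abstain."""
    return "\n\n".join(f"[{p.source_id}] {p.text}" for p in passages)
```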

Make uncertainty legal in the prompt

Prompting matters most when it sets boundaries.

A good high-stakes prompt defines the task precisely, defines the expected output format, defines what counts as enough evidence, and explicitly allows the model to say the answer is unsupported.

If the prompt implies that an answer must always appear, an answer will appear. That is not intelligence. It is leakage from your incentives.
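
One way to make uncertainty legal, sketched as a system prompt. The wording and the UNSUPPORTED marker are illustrative, not a standard; what matters is that the cheapest valid output is an honest one.

```python
SYSTEM_PROMPT = """\
Answer the question using ONLY the numbered sources provided below.

Rules:
- Cite a source id like [doc-12] after every factual claim.
- If the sources do not contain enough evidence, reply exactly:
  UNSUPPORTED: <one sentence naming what is missing>
- Never guess dates, numbers, or names that are not in the sources.

Format: a short answer, then a "Sources:" line listing the ids you used.
"""
```

Downstream code can then treat a line starting with UNSUPPORTED as a routed refusal instead of a failed generation.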

Examples shape behaviour

Strong examples do more than make the answer look nicer.

They teach the model to cite only when evidence exists, stay concise when support is thin, preserve a strict JSON or markdown schema, and refuse when a field cannot be justified.

A few good examples often outperform one more paragraph of elegant instructions.
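
A pair in that spirit, one grounded answer and one refusal over the same context. The schema and the field names are invented for the sketch; what the pair teaches is that an empty answer is a legal output.

```python
FEW_SHOT = [
    {   # evidence exists: cite it and stay inside the schema
        "context": "[hr-7] Parental leave is 16 weeks for all full-time employees.",
        "question": "How long is parental leave?",
        "answer": '{"answer": "16 weeks for full-time employees", "sources": ["hr-7"]}',
    },
    {   # evidence is missing: refuse the field instead of inventing it
        "context": "[hr-7] Parental leave is 16 weeks for all full-time employees.",
        "question": "How long is sabbatical leave?",
        "answer": '{"answer": null, "sources": [], "note": "No retrieved source covers sabbatical leave."}',
    },
]
```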

Guardrails inside a threat model

Guardrails help when they are honest about what they cover. They are useful for policy checks, structured domain rules, bounded post-answer validation, and specific high-risk behaviours that can be classified reliably.

They are not a spell that makes the whole response true.

OWASP's LLM Top 10 is still useful because it forces teams to think beyond "the model might be wrong." Prompt injection, data leakage, insecure output handling, and excessive agency turn hallucination into something more expensive than a bad paragraph.
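
A bounded post-answer check of the kind that does earn its keep, assuming the JSON contract from the examples above. It proves only what it can prove: structure and citation integrity, not truth.

```python
import json

def validate_response(raw: str, allowed_source_ids: set[str]) -> tuple[bool, str]:
    """Guardrail with an honest scope: schema and citation integrity only."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"

    if "answer" not in data or not isinstance(data.get("sources"), list):
        return False, "missing required fields"

    unknown = [s for s in data["sources"] if s not in allowed_source_ids]
    if unknown:
        return False, f"cites sources that were never retrieved: {unknown}"

    if data["answer"] is not None and not data["sources"]:
        return False, "non-null answer with no supporting sources"

    return True, "ok"
```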

Claim-level verification beats answer-level vibes

The production systems I trust break answers into checks that can be evaluated independently.

Instead of "does this answer seem fine?", ask: which claims depend on retrieved evidence, which are date- or number-sensitive, which are policy-bound, which should trigger abstention if unsupported.

This lets the system trim or block the unsafe parts instead of throwing away the entire answer every time something feels suspicious.
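
Roughly, in code. `is_supported` stands in for whatever entailment or string-match check the system actually trusts; the shape of the decomposition is the point.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Claim:
    text: str
    needs_evidence: bool   # date-, number-, or policy-sensitive claims

def filter_claims(
    claims: list[Claim],
    evidence: list[str],
    is_supported: Callable[[str, list[str]], bool],  # placeholder for the real check
) -> tuple[list[Claim], list[Claim]]:
    """Split an answer into claims the evidence supports and claims to trim or block."""
    kept, blocked = [], []
    for claim in claims:
        if not claim.needs_evidence or is_supported(claim.text, evidence):
            kept.append(claim)
        else:
            blocked.append(claim)
    return kept, blocked
```

Rebuild the answer from `kept`; if anything in `blocked` is policy-bound, route to the refusal path instead of shipping a partial answer.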

Evals catch what prompt reviews miss

OpenAI's current eval guidance is still the right operational lens. If you care about truthfulness in production, build evals into the shipping path.

A layered pack:

  1. answer-grounding checks
  2. unsupported-claim refusal checks
  3. citation integrity checks
  4. structured-output checks
  5. risky-domain red-team cases

The important part is not the dataset. It is the habit. Rerun the same checks after prompt changes, retrieval changes, model swaps, ranking tweaks.
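
A minimal harness for that habit, assuming each stored case names its layer and carries its own pass predicate; the field names are invented for the sketch.

```python
from collections import defaultdict
from typing import Callable

def run_eval_pack(
    cases: list[dict],
    run_system: Callable[[str, str], str],  # your question+context -> answer entry point
) -> dict[str, float]:
    """Rerun the same layered checks after prompt, retrieval, ranking, or model changes."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        output = run_system(case["question"], case["context"])
        layer = case["layer"]          # "grounding", "refusal", "citation", "schema", "red-team"
        total[layer] += 1
        if case["check"](output):      # each case carries its own pass predicate
            passed[layer] += 1
    return {layer: passed[layer] / total[layer] for layer in total}
```

Gate releases on the per-layer scores, not a blended number; a drop in the refusal layer hides easily inside an average that the schema layer props up.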

Streaming creates a special problem

Streaming is good product design. It also shortens your time to regret.

Because streaming emits partial text, a final-answer check alone is not enough. If unsupported or sensitive content cannot appear in public, you need at least one of:

  • generation scoped tightly enough that the bad answer is less likely
  • buffering or delayed release for guarded fields
  • chunk-level sanitation with withheld tails
  • a non-streaming path for the riskiest answer types

This is one reason formal post-generation validation tools often sit behind non-streaming or semi-buffered flows.
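
One version of the buffered-release idea, sketched as a generator that withholds a short tail so a chunk-level check can stop the stream before guarded text reaches the user. `looks_unsafe` is a placeholder for whatever classifier or pattern check the system trusts.

```python
from typing import Callable, Iterable, Iterator

def guarded_stream(
    tokens: Iterable[str],
    looks_unsafe: Callable[[str], bool],   # placeholder chunk-level check
    hold_back: int = 20,
) -> Iterator[str]:
    """Emit tokens only after a trailing window of text has cleared the check."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) > hold_back:
            if looks_unsafe("".join(buffer)):
                yield "\n[stopped: this answer needs the non-streaming path]"
                return
            yield buffer.pop(0)
    tail = "".join(buffer)
    if tail and not looks_unsafe(tail):
        yield tail
```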

The best fallback is a useful refusal

A good refusal is a precise boundary, not a generic apology.

Examples:

  • "The retrieved material does not support a reliable answer."
  • "I can summarise the available facts, but I should not infer beyond them."
  • "This claim needs a source-backed check before I answer directly."

The refusal should preserve trust and keep the next step obvious.
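
In code this is a small mapping from failure reason to boundary, plus a next step; the categories are only an example.

```python
REFUSAL_TEMPLATES = {
    "no_evidence": "The retrieved material does not support a reliable answer.",
    "partial_evidence": "I can summarise the available facts, but I should not infer beyond them.",
    "needs_verification": "This claim needs a source-backed check before I answer directly.",
}

def refuse(reason: str, next_step: str) -> str:
    """A precise boundary plus an obvious next step, not a generic apology."""
    boundary = REFUSAL_TEMPLATES.get(reason, REFUSAL_TEMPLATES["no_evidence"])
    return f"{boundary} {next_step}"

# e.g. refuse("no_evidence", "Point me at the policy document and I will re-check.")
```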

Related reading

  • LLM product safety without theater
  • How to run LLM evals in production

Further reading

  • OpenAI: Evaluation best practices
  • OWASP Top 10 for LLM Applications
  • NIST AI Risk Management Framework
  • Anthropic prompt engineering overview


