Preventing Hallucinations in LLM Systems

How to reduce hallucinations in LLM systems with better retrieval, abstention, verification, evals, and guardrails.

February 18, 2026 · 6 min read · By Alex Chernysh
Tags: RAG · Reliability · Safety
Jump to section

  1. Stop treating hallucination as a model-only problem
  2. Retrieval discipline beats retrieval volume
  3. Prompts should make uncertainty legal
  4. Examples shape behavior more than they shape tone
  5. Guardrails are useful only inside a clear threat model
  6. Claim-level verification is stronger than answer-level vibes
  7. Evals catch regressions that prompt reviews miss
  8. Streaming creates a special problem
  9. The best fallback is a useful refusal
  Related reading
  Further reading

Prefer a shorter pass first?

Hallucination prevention is no longer one trick. It is a stack: retrieval discipline, clearer response formats, explicit abstention, claim checks, and guardrails that are honest about what they can and cannot prove.

Plain truth

There is no serious strategy called "trust the model more." There is only better support, better verification, and a willingness to stop when support is missing.

Verification stack

  • retrieval that can actually support the claim
  • prompts that permit useful abstention
  • response formats that make checking possible
  • claim-level or slot-level verification where the risk justifies it

Thin defenses

The model is asked to be accurate.

  • weak retrieval
  • no abstention rule
  • free-form outputs
  • one vague quality score

Grounded system

The system is designed to stay support-bound.

  • retrieval is scoped and inspectable
  • unsupported claims are allowed to stop
  • outputs are checkable by contract
  • evals and validators target specific failure classes

Grounded answer path

Support first. If support is weak, decline without drama.

1. Stop treating hallucination as a model-only problem

A lot of hallucinations are system-design failures.

The model is often blamed for facts it was never given, formats it was never shown, or policies it was never allowed to follow honestly. In production, hallucinations usually come from some combination of:

  • weak context
  • ambiguous tasks
  • unconstrained output shape
  • refusal rules that are too weak or too vague
  • no downstream checks

If you only swap models, you might reduce the symptoms. You probably will not fix the disease.

2. Retrieval discipline beats retrieval volume

Groundedness improves when the system retrieves less but better.

The healthier pattern is:

  • retrieve only evidence relevant to the specific question
  • preserve source identity through ranking and generation
  • require the model to answer from the retrieved set or abstain
  • analyze retrieval failures separately from generation failures

The common anti-pattern is to stuff the prompt with everything remotely related and hope the model becomes wiser through saturation. It usually becomes noisier instead.
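The healthier pattern can be sketched as a small context builder. This is a minimal sketch, assuming a hypothetical `Passage` record from whatever retriever you use: passages below a relevance threshold are dropped, source identity stays visible to the model, and an empty result tells the caller to abstain rather than answer.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source_id: str   # preserved so citations can point back to the source
    text: str
    score: float     # relevance score from the retriever

def build_scoped_context(passages: list[Passage], min_score: float = 0.6,
                         max_passages: int = 5) -> str:
    """Keep only passages relevant enough to support a claim,
    highest-scored first, with source identity preserved."""
    kept = sorted((p for p in passages if p.score >= min_score),
                  key=lambda p: p.score, reverse=True)[:max_passages]
    if not kept:
        return ""  # empty context: the caller should route to abstention
    return "\n\n".join(f"[{p.source_id}] {p.text}" for p in kept)
```

The thresholds are illustrative; the point is that "retrieve less but better" is an explicit filter, not a hope.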

3. Prompts should make uncertainty legal

Prompting matters most when it sets boundaries.

A good high-stakes prompt does at least four things:

  • defines the task precisely
  • defines the expected output format
  • defines what counts as enough evidence
  • explicitly allows the model to say the answer is unsupported

If the prompt implies that an answer must always appear, an answer will often appear. That is not intelligence. It is leakage from your incentives.
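A minimal template that does all four things at once might look like this. The wording is illustrative, not a tested prompt, and the placeholder names are assumptions:

```python
PROMPT_TEMPLATE = """\
Task: Answer the user's question using ONLY the evidence below.

Output format: JSON with keys "answer", "citations", "supported".

Evidence threshold: a claim counts as supported only if at least one
evidence passage states it directly.

If the evidence does not support an answer, set "supported": false,
leave "citations" empty, and say the material is insufficient.

Evidence:
{evidence}

Question: {question}
"""

def render_prompt(evidence: str, question: str) -> str:
    # Fill the template; the abstention rule travels with every request.
    return PROMPT_TEMPLATE.format(evidence=evidence, question=question)
```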

4. Examples shape behavior more than they shape tone

Strong examples do more than make the answer look nicer.

They teach the model to:

  • cite only when evidence exists
  • stay concise when support is thin
  • preserve a strict JSON or markdown schema
  • refuse when a field cannot be justified

This is why a few good examples often outperform one more paragraph of elegant instructions.
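A hypothetical few-shot pair makes this concrete: the first example teaches citing only evidence that exists, the second teaches refusing a field that cannot be justified, and both preserve the strict JSON schema.

```python
import json

# Hypothetical examples: one cited answer, one justified refusal.
EXAMPLES = [
    {
        "question": "When was the contract signed?",
        "answer": {"value": "2021-03-04", "citations": ["doc-12"], "supported": True},
    },
    {
        "question": "What penalty applies to late delivery?",
        "answer": {"value": None, "citations": [], "supported": False},
    },
]

def render_examples(examples: list[dict]) -> str:
    """Render few-shot examples in the same strict schema the model
    is expected to produce."""
    lines = []
    for ex in examples:
        lines.append(f"Q: {ex['question']}")
        lines.append("A: " + json.dumps(ex["answer"]))
    return "\n".join(lines)
```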

5. Guardrails are useful only inside a clear threat model

Guardrails help when they are honest about what they cover.

They are useful for:

  • policy checks
  • structured domain rules
  • bounded post-answer validation
  • specific high-risk behaviors that can be classified reliably

They are not a magic spell that makes the whole response true.

OWASP's LLM Top 10 is still useful here because it forces teams to think beyond "the model might be wrong". Prompt injection, data leakage, insecure output handling, and excessive agency often turn hallucination into something more expensive than a bad paragraph.
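As one example of a bounded check in the OWASP "insecure output handling" sense, a validator can cover a single classifiable risk, here script tags appearing in output that will be rendered as HTML, while claiming nothing about truth. This is a naive sketch; a real deployment would use a proper HTML sanitizer rather than a regex.

```python
import re

# One narrow, classifiable risk: script injection in rendered output.
SCRIPT_RE = re.compile(r"<\s*script\b", re.IGNORECASE)

def safe_to_render(model_output: str) -> bool:
    """Bounded post-answer validation: blocks one specific risk,
    says nothing about whether the answer is correct."""
    return SCRIPT_RE.search(model_output) is None
```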

6. Claim-level verification is stronger than answer-level vibes

The most practical production systems now break answers into checks that can be evaluated independently.

Instead of asking, "Does this answer seem fine?", ask:

  • which claims depend on retrieved evidence?
  • which claims are date- or number-sensitive?
  • which claims are policy-bound?
  • which claims should trigger abstention if unsupported?

This lets the system trim or block the unsafe parts instead of throwing away the entire answer every time something feels suspicious.
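A sketch of that partition, assuming claims have already been extracted upstream; `naive_supports` is a deliberately crude stand-in for a real entailment or NLI check:

```python
from typing import Callable

def split_verdicts(claims: list[str], evidence: list[str],
                   supports: Callable[[str, str], bool]) -> tuple[list[str], list[str]]:
    """Partition claims into supported and unsupported, so the system
    can trim the unsafe parts instead of discarding the whole answer."""
    kept, blocked = [], []
    for claim in claims:
        if any(supports(claim, passage) for passage in evidence):
            kept.append(claim)
        else:
            blocked.append(claim)
    return kept, blocked

def naive_supports(claim: str, passage: str, threshold: float = 0.5) -> bool:
    """Crude word-overlap stand-in for a proper entailment check."""
    claim_words = set(claim.lower().split())
    overlap = len(claim_words & set(passage.lower().split()))
    return overlap / max(len(claim_words), 1) >= threshold
```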

7. Evals catch regressions that prompt reviews miss

OpenAI's current eval guidance is still the right operational lens: if you care about truthfulness in production, build evals into the shipping path.

For hallucination prevention, I like a layered pack:

  1. answer-grounding checks
  2. unsupported-claim refusal checks
  3. citation integrity checks
  4. structured-output checks
  5. risky-domain red-team cases

The important part is not just the dataset. It is the habit: rerun the same checks after prompt changes, retrieval changes, model swaps, and ranking tweaks.
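That habit is easy to mechanize. A minimal harness, with a hypothetical `answer_fn` and check names, reruns the same pack and reports a pass rate per failure class:

```python
from typing import Callable

def run_eval_pack(cases: list[dict],
                  answer_fn: Callable[[str], dict],
                  checks: dict[str, Callable[[dict, dict], bool]]) -> dict[str, float]:
    """Rerun the same checks after every prompt, retrieval, ranking,
    or model change; report pass rates per failure class."""
    passed = {name: 0 for name in checks}
    for case in cases:
        output = answer_fn(case["question"])
        for name, check in checks.items():
            if check(output, case):
                passed[name] += 1
    return {name: count / len(cases) for name, count in passed.items()}
```

Wiring this into CI is what turns the dataset into a regression net rather than a one-off audit.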

8. Streaming creates a special problem

Streaming is good product design. It also shortens your time to regret.

Because streaming emits partial text, you cannot rely on a final answer check alone. If unsupported or sensitive content is not allowed to appear in public, you need one of these approaches:

  • generation scoped tightly enough that the bad answer is less likely
  • buffering or delayed release for guarded fields
  • chunk-level sanitation with withheld tails
  • a non-streaming path for the riskiest answer types

This is one reason formal post-generation validation tools often sit behind non-streaming or semi-buffered flows.
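The buffering option can be sketched as a generator that streams freely but always withholds a short tail, releasing it only after a final check. `validate` is a hypothetical hook for whatever guarded-field check applies:

```python
from typing import Callable, Iterable, Iterator, Optional

def stream_with_held_tail(chunks: Iterable[str], tail_size: int = 40,
                          validate: Optional[Callable[[str], bool]] = None) -> Iterator[str]:
    """Release text as it streams, but always withhold the last
    `tail_size` characters until validation confirms them."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) > tail_size:
            release, buffer = buffer[:-tail_size], buffer[-tail_size:]
            yield release
    # Final flush: the held tail is released only if it passes the check.
    if validate is None or validate(buffer):
        yield buffer
```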

9. The best fallback is a useful refusal

A good refusal is not a generic apology. It is a precise boundary.

Examples of useful fallbacks:

  • "The retrieved material does not support a reliable answer."
  • "I can summarize the available facts, but I should not infer beyond them."
  • "This claim needs a source-backed check before I answer directly."

The refusal should preserve trust and keep the next step obvious.
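In code this can be as simple as mapping verification outcomes to those precise boundaries; the outcome names here are hypothetical:

```python
# Hypothetical outcome names mapped to precise, trust-preserving refusals.
REFUSALS = {
    "no_support": "The retrieved material does not support a reliable answer.",
    "partial_support": "I can summarize the available facts, but I should not infer beyond them.",
    "unverified_claim": "This claim needs a source-backed check before I answer directly.",
}

def fallback_message(outcome: str) -> str:
    """Pick the precise refusal; unknown outcomes get the strictest boundary."""
    return REFUSALS.get(outcome, REFUSALS["no_support"])
```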

Related reading

  • LLM product safety without theater
  • How to run LLM evals in production

Further reading

  • OpenAI: Evaluation best practices
  • OWASP Top 10 for LLM Applications
  • NIST AI Risk Management Framework
  • Anthropic prompt engineering overview

Part of the public notes on grounded AI systems, retrieval, evals, and delivery under real constraints.