Alex Chernysh · Agentic behaviorist · Tel Aviv


Most RAG Failures Start in the Documents

Chunking, titles, metadata, parent-child structure, reranking, and corpus QA for RAG systems.

February 12, 2026 · 5 min read
RAG · Retrieval · Grounding

When a RAG system fails, the model usually gets the blame. The documents usually had a head start.

Flat corpus

Documents are dumped into the index with minimal structure.

  • weak titles
  • generic chunking
  • missing metadata
  • duplicates and stale versions survive ingestion

Prepared corpus

The corpus preserves identity, structure, and retrieval hints.

  • chunks know what document they belong to
  • titles carry real signal
  • metadata and filters narrow the search space
  • stale and low-quality inputs are caught early

What actually helps

  • chunks should be self-contained enough to retrieve and specific enough to rerank
  • document identity should survive every stage of the pipeline
  • metadata is not decoration; it is search control
  • ingestion QA often matters more than another round of prompt tuning
Ingestion path
The useful RAG pipeline starts before embeddings.

Chunking is a retrieval decision, not a chore

Teams still talk about chunking as a preprocessing step. It is one of the main retrieval decisions.

Good chunks are small enough to match the question precisely, preserve enough context to remain intelligible, and carry enough identity to be trusted later. That balance is why fixed-size chunking ages badly.

The better starting point is structural chunking. Markdown by headings. Policies by sections. Contracts by clauses and definitions. Case law by facts, reasoning, and holdings. Docs with tables or figures by layout-aware extraction where possible.

You can still backstop this with token limits. The structure speaks first.
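As a minimal sketch of that idea, here is a heading-based markdown chunker with a character-limit backstop. The function name and the character budget are illustrative choices, not a prescribed API; a real pipeline would count tokens rather than characters.

```python
import re

def chunk_markdown_by_headings(text: str, max_chars: int = 2000) -> list[dict]:
    """Split markdown at headings; backstop oversized sections with a size limit."""
    chunks = []
    section_title = "untitled"
    buf: list[str] = []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            # backstop: split an oversized section at the character limit
            for i in range(0, len(body), max_chars):
                chunks.append({"title": section_title, "text": body[i:i + max_chars]})
        buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                          # structure speaks first
            section_title = m.group(2).strip()
        else:
            buf.append(line)
    flush()
    return chunks
```

Every chunk leaves with a title attached, which matters for the next point.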

Strong titles quietly improve retrieval

One of the simplest upgrades is the most neglected. Give chunks useful titles.

A chunk called "Section 4" tells the retriever almost nothing. A chunk called "Notice periods for termination in enterprise plans" gives both the retriever and the later generator a better chance.

Not glamorous. Effective.

A surprising amount of retrieval quality comes from small signals: document title, section path, subsection name, version or effective date, source type. If those fields are noisy, the vector index works harder than it should.

Parent-child is still the cleanest trade-off

Small chunks retrieve better. Larger chunks are easier for the model to read. That tension never went away.

Parent-child retrieval remains the cleanest compromise. Index smaller child chunks for matching. Return the larger parent section for reading. Preserve the link between them all the way to generation.

You get better recall without forcing the model to answer from disconnected sentence fragments. Provenance stays cleaner, because retrieved evidence still belongs to a recognisable section rather than an orphan paragraph with good embeddings and no life.
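The mechanics are simple enough to sketch. In this toy version, keyword overlap stands in for the vector search; the point is that every child chunk keeps a link back to its parent, and retrieval returns the parent.

```python
from dataclasses import dataclass, field

@dataclass
class ParentChildIndex:
    """Index small child chunks for matching; return the parent section for reading."""
    parents: dict = field(default_factory=dict)   # parent_id -> full section text
    children: list = field(default_factory=list)  # (child_text, parent_id)

    def add_section(self, parent_id: str, section_text: str, child_size: int = 200):
        self.parents[parent_id] = section_text
        for i in range(0, len(section_text), child_size):
            self.children.append((section_text[i:i + child_size], parent_id))

    def retrieve(self, query: str):
        # toy matcher: keyword overlap stands in for embedding similarity
        best = max(
            self.children,
            key=lambda c: sum(w in c[0].lower() for w in query.lower().split()),
            default=None,
        )
        # match on the child, but return the parent it belongs to
        return self.parents[best[1]] if best else None
```

The child-to-parent link is the part worth preserving all the way to generation.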

Metadata narrows the search space

RAG systems often look smarter when they are allowed to search a smaller, more relevant space.

Useful metadata:

  • source or repository
  • document type
  • effective date or version
  • language
  • team, product, or policy domain
  • confidentiality level where relevant

Once you have this, natural-language questions can be paired with structured filtering. That reduces the burden on reranking and generation. Without this layer, the model spends expensive tokens sorting out mistakes the ingestion pipeline should have prevented.
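The pairing looks roughly like this, assuming each chunk carries a `meta` dict; the keyword score is a stand-in for vector similarity, and the function name is illustrative.

```python
def filtered_search(chunks: list[dict], query: str, filters: dict) -> list[dict]:
    """Apply structured filters first, then score only the survivors."""
    def passes(meta: dict) -> bool:
        return all(meta.get(k) == v for k, v in filters.items())

    candidates = [c for c in chunks if passes(c["meta"])]

    # toy relevance score standing in for embedding similarity
    def score(c: dict) -> int:
        return sum(w in c["text"].lower() for w in query.lower().split())

    return sorted(candidates, key=score, reverse=True)
```

The filter runs before any similarity scoring, so the expensive stages never see the wrong document type in the first place.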

Sentence-window after the corpus is sane

Sentence-window and other fine-grained retrieval patterns can improve precision when a small factual span matters.

They help when the corpus has been cleaned, chunk identity is preserved, and the pipeline can expand from the hit sentence to its local context.

They help less when the underlying corpus is duplicated, stale, or structurally broken. The system becomes precise about the wrong thing. That is not progress.
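The expansion step itself is small. A sketch, assuming the hit is identified by sentence index and the sentence splitter is a naive regex (real pipelines use a proper sentence tokenizer):

```python
import re

def sentence_window(text: str, hit_index: int, window: int = 1) -> str:
    """Expand from the matched sentence to its local context."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])
```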

Reranking is the second stage, not the first miracle

A reranker can materially improve quality. It cannot redeem a bad corpus.

The healthy order:

  1. clean and structure the data
  2. retrieve a broad but plausible top-k
  3. rerank for final relevance
  4. pass only the smallest defensible set to generation

Skip step one and the reranker chooses from junk with excellent confidence.

Corpus QA deserves its own checklist

Every serious RAG system needs a corpus-quality pass separate from answer evals.

Explicit checks for:

  • duplicate documents
  • stale superseded versions
  • broken OCR or parse failures
  • missing titles or headings
  • chunks with no document identity
  • malformed tables or invisible text
  • missing effective dates where the domain depends on them

Tedious. Cheaper than endlessly tuning prompts around a bad index.
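A few of these checks are mechanical enough to automate on day one. A sketch, assuming documents are dicts with `doc_id`, `title`, and `text`; the alphabetic-ratio heuristic for parse failures is crude and the threshold is a guess to tune per corpus.

```python
import hashlib

def corpus_qa(documents: list[dict]) -> list[tuple[int, str]]:
    """Flag duplicates, missing titles, missing identity, and suspect parses."""
    issues = []
    seen_hashes: dict[str, int] = {}
    for i, doc in enumerate(documents):
        text = doc.get("text", "")
        h = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if h in seen_hashes:
            issues.append((i, f"duplicate of document {seen_hashes[h]}"))
        else:
            seen_hashes[h] = i
        if not doc.get("title"):
            issues.append((i, "missing title"))
        if not doc.get("doc_id"):
            issues.append((i, "missing document identity"))
        # crude OCR/parse heuristic: too few alphabetic characters
        if text and sum(ch.isalpha() for ch in text) / len(text) < 0.5:
            issues.append((i, "suspected parse or OCR failure"))
    return issues
```

Run it at ingestion time, not after the answers start looking wrong.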

Most retrieval problems are ingestion problems

When people say "the retriever is inconsistent", "the reranker is unstable", or "the model is missing the point", they often mean the documents were chunked badly, the titles were weak, duplicates survived, metadata was absent, or document identity got lost.

The pipeline then looks mysterious. It is not mysterious. It is underprepared.

The right question is not "what chunk size?"

The better question:

What document unit should still make sense when retrieved on its own?

The answer changes by domain. Make the decision consciously.

Technical docs: a heading-bounded section. Contracts: a clause plus local definitions. Regulations: a section path with effective-date metadata.

No universal chunk size. Only better or worse fit for the corpus you actually have.

What I would fix first

If a RAG system felt shaky and I had one morning, I would start here:

  1. remove duplicates and stale versions
  2. improve chunk titles and section paths
  3. preserve parent-child identity
  4. add metadata filters for the main corpus dimensions
  5. inspect retrieval failures before touching the prompt

The model may still need work. The documents usually go first.

Related reading

  • Which query transformation techniques actually help RAG?
  • How to build legal answering systems that can be trusted

Further reading

  • Lost in the Middle: How Language Models Use Long Contexts
  • How to build legal answering systems that can be trusted
