When a RAG system fails, the model usually gets the blame. More often than not, the documents had a head start.
Flat corpus
Documents are dumped into the index with minimal structure.
- weak titles
- generic chunking
- missing metadata
- duplicates and stale versions survive ingestion
Prepared corpus
The corpus preserves identity, structure, and retrieval hints.
- chunks know what document they belong to
- titles carry real signal
- metadata and filters narrow the search space
- stale and low-quality inputs are caught early
Chunking is a retrieval decision, not a chore
Teams still talk about chunking as a preprocessing step. It is one of the main retrieval decisions.
Good chunks are small enough to match the question precisely, preserve enough context to remain intelligible, and carry enough identity to be trusted later. That balance is why fixed-size chunking ages badly.
The better starting point is structural chunking. Markdown by headings. Policies by sections. Contracts by clauses and definitions. Case law by facts, reasoning, and holdings. Docs with tables or figures by layout-aware extraction where possible.
You can still backstop this with token limits. The structure speaks first.
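As a rough illustration, here is a minimal structural chunker for markdown: headings define the sections, and a size cap (a stand-in for a real token budget) only kicks in when a section runs long.

```python
import re

MAX_CHARS = 2000  # stand-in for a real token budget


def chunk_markdown(text: str) -> list[dict]:
    """Split a markdown document on headings; backstop with a size cap."""
    chunks = []
    current_title, buffer = "untitled", []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:  # a heading closes the previous section
            if buffer:
                chunks.append({"title": current_title, "text": "\n".join(buffer)})
            current_title, buffer = m.group(2).strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append({"title": current_title, "text": "\n".join(buffer)})

    # backstop: split any oversized section, but the structure spoke first
    out = []
    for c in chunks:
        body = c["text"]
        if len(body) <= MAX_CHARS:
            out.append(c)
        else:
            for i in range(0, len(body), MAX_CHARS):
                out.append({"title": c["title"], "text": body[i:i + MAX_CHARS]})
    return out
```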
Strong titles quietly improve retrieval
One of the simplest upgrades is also the most neglected: give chunks useful titles.
A chunk called "Section 4" tells the retriever almost nothing. A chunk called "Notice periods for termination in enterprise plans" gives both the retriever and the later generator a better chance.
Not glamorous. Effective.
A surprising amount of retrieval quality comes from small signals: document title, section path, subsection name, version or effective date, source type. If those fields are noisy, the vector index works harder than it should.
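A sketch of how those small signals can be composed into a chunk title. The field names are purely illustrative; map them onto whatever your ingestion pipeline actually captures.

```python
def build_chunk_title(doc_title: str, section_path: list[str],
                      version: str | None = None) -> str:
    """Compose a retrieval-friendly title from small signals."""
    parts = [doc_title, *section_path]
    title = " > ".join(p for p in parts if p)
    return f"{title} (v{version})" if version else title


# "Enterprise MSA > Termination > Notice periods (v2024-03)"
print(build_chunk_title("Enterprise MSA", ["Termination", "Notice periods"],
                        version="2024-03"))
```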
Parent-child is still the cleanest trade-off
Small chunks retrieve better. Larger chunks are easier for the model to read. That tension never went away.
Parent-child retrieval remains the cleanest compromise. Index smaller child chunks for matching. Return the larger parent section for reading. Preserve the link between them all the way to generation.
You get better recall without forcing the model to answer from disconnected sentence fragments. Provenance stays cleaner, because retrieved evidence still belongs to a recognisable section rather than an orphan paragraph with good embeddings and no lineage.
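A minimal sketch of the bookkeeping, with the vector search itself elided: children carry a parent_id at indexing time, and hits are expanded back to parent sections before generation.

```python
from dataclasses import dataclass


@dataclass
class Parent:
    id: str
    section_title: str
    text: str


@dataclass
class Child:
    id: str
    parent_id: str
    text: str


def expand_to_parents(hits: list[Child],
                      parents: dict[str, Parent]) -> list[Parent]:
    """Match on small child chunks, read from their parent sections.

    Deduplicates so the generator never sees the same section twice
    just because several of its children matched.
    """
    seen, out = set(), []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            out.append(parents[child.parent_id])
    return out
```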
Metadata narrows the search space
RAG systems often look smarter when they are allowed to search a smaller, more relevant space.
Useful metadata:
- source or repository
- document type
- effective date or version
- language
- team, product, or policy domain
- confidentiality level where relevant
Once you have this, natural-language questions can be paired with structured filtering. That reduces the burden on reranking and generation. Without this layer, the model spends expensive tokens sorting out mistakes the ingestion pipeline should have prevented.
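A sketch of the filtering step, under the assumption that each chunk carries a metadata dict. The keys here are illustrative, not a fixed schema.

```python
corpus = [
    {"text": "...", "metadata": {"doc_type": "policy", "version": "current"}},
    {"text": "...", "metadata": {"doc_type": "policy", "version": "2021"}},
]


def filter_candidates(chunks: list[dict], filters: dict) -> list[dict]:
    """Apply structured filters before any vector scoring happens."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in filters.items())]


# only current policies get embedding-similarity scored at all
candidates = filter_candidates(corpus, {"doc_type": "policy",
                                        "version": "current"})
```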
Sentence-window after the corpus is sane
Sentence-window and other fine-grained retrieval patterns can improve precision when a small factual span matters.
They help when the corpus has been cleaned, chunk identity is preserved, and the pipeline can expand from the hit sentence to its local context.
They help less when the underlying corpus is duplicated, stale, or structurally broken. The system becomes precise about the wrong thing. That is not progress.
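A minimal sketch of the expansion step, assuming sentence order within each source chunk was preserved at ingestion. Without that identity, there is nothing safe to expand into.

```python
def sentence_window(sentences: list[str], hit_index: int,
                    window: int = 2) -> str:
    """Return the matched sentence plus `window` neighbours on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])
```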
Reranking is the second stage, not the first miracle
A reranker can materially improve quality. It cannot redeem a bad corpus.
The healthy order:
- clean and structure the data
- retrieve a broad but plausible top-k
- rerank for final relevance
- pass only the smallest defensible set to generation
Skip step one and the reranker chooses from junk with excellent confidence.
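As a pipeline sketch of that order, with index.search and rerank_score standing in for whatever retriever and cross-encoder you actually run:

```python
def answer(query: str, index, rerank_score, k: int = 50, final_n: int = 5):
    """The healthy order, end to end.

    `index.search` and `rerank_score` are placeholders, not real APIs.
    """
    candidates = index.search(query, top_k=k)   # broad but plausible top-k
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c["text"]),
                    reverse=True)
    return ranked[:final_n]                     # smallest defensible set
```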
Corpus QA deserves its own checklist
Every serious RAG system needs a corpus-quality pass separate from answer evals.
Explicit checks for:
- duplicate documents
- stale superseded versions
- broken OCR or parse failures
- missing titles or headings
- chunks with no document identity
- malformed tables or invisible text
- missing effective dates where the domain depends on them
Tedious. Cheaper than endlessly tuning prompts around a bad index.
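A few of these checks are mechanical enough to script. In this sketch the field names (id, text, title, superseded) are assumptions about what ingestion records:

```python
import hashlib
from collections import defaultdict


def corpus_qa(docs: list[dict]) -> dict[str, list]:
    """Flag duplicates, missing titles, and superseded versions."""
    report = defaultdict(list)
    seen_hashes = {}
    for d in docs:
        h = hashlib.sha256(d["text"].encode()).hexdigest()
        if h in seen_hashes:  # exact duplicate content
            report["duplicates"].append((seen_hashes[h], d["id"]))
        else:
            seen_hashes[h] = d["id"]
        if not d.get("title"):
            report["missing_title"].append(d["id"])
        if d.get("superseded"):
            report["stale_versions"].append(d["id"])
    return dict(report)
```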
Most retrieval problems are ingestion problems
When people say "the retriever is inconsistent" or "the reranker is unstable" or "the model is missing the point", they often mean the documents were chunked badly, the titles were weak, duplicates survived, metadata was absent, and document identity got lost.
The pipeline then looks mysterious. It is not mysterious. It is underprepared.
The right question is not "what chunk size?"
The better question:
What document unit should still make sense when retrieved on its own?
The answer changes by domain. Make the decision consciously.
Technical docs: a heading-bounded section. Contracts: a clause plus local definitions. Regulations: a section path with effective-date metadata.
No universal chunk size. Only better or worse fit for the corpus you actually have.
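One way to keep that decision conscious is to name the unit per corpus in configuration rather than bury a number in the chunker. The entries below mirror the examples above and are illustrative only.

```python
# Naming the retrieval unit per corpus keeps the decision
# visible and reviewable; values are illustrative labels.
CHUNK_UNITS = {
    "technical_docs": "heading_bounded_section",
    "contracts": "clause_with_local_definitions",
    "regulations": "section_path_with_effective_date",
}
```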
What I would fix first
If a RAG system felt shaky and I had one morning, I would start here:
- remove duplicates and stale versions
- improve chunk titles and section paths
- preserve parent-child identity
- add metadata filters for the main corpus dimensions
- inspect retrieval failures before touching the prompt
The model may still need work. The documents usually need it first.
Related reading
- Which query transformation techniques actually help RAG?
- How to build legal answering systems that can be trusted