When a RAG system fails, the model usually gets the blame. More often than not, the documents had a head start.
Flat corpus
Documents are dumped into the index with minimal structure.
- weak titles
- generic chunking
- missing metadata
- duplicates and stale versions survive ingestion
Prepared corpus
The corpus preserves identity, structure, and retrieval hints.
- chunks know what document they belong to
- titles carry real signal
- metadata and filters narrow the search space
- stale and low-quality inputs are caught early
Chunking is a retrieval decision, not a chore
Teams still talk about chunking as a preprocessing step. It is one of the main retrieval decisions.
Good chunks are small enough to match the question precisely, preserve enough context to remain intelligible, and carry enough identity to be trusted later. That balance is why fixed-size chunking ages badly.
The better starting point is structural chunking. Markdown by headings. Policies by sections. Contracts by clauses and definitions. Case law by facts, reasoning, and holdings. Docs with tables or figures by layout-aware extraction where possible.
You can still backstop this with token limits. The structure speaks first.
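As a rough illustration, here is a minimal structural chunker for markdown: headings define the sections, and a size cap (a stand-in for a real token budget) only kicks in when a section runs long.

```python
import re

MAX_CHARS = 2000  # stand-in for a real token budget


def chunk_markdown(text: str) -> list[dict]:
    """Split a markdown document on headings; backstop with a size cap."""
    chunks = []
    current_title, buffer = "untitled", []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:  # a heading closes the previous section
            if buffer:
                chunks.append({"title": current_title, "text": "\n".join(buffer)})
            current_title, buffer = m.group(2).strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append({"title": current_title, "text": "\n".join(buffer)})

    # backstop: split any oversized section, but the structure spoke first
    out = []
    for c in chunks:
        body = c["text"]
        if len(body) <= MAX_CHARS:
            out.append(c)
        else:
            for i in range(0, len(body), MAX_CHARS):
                out.append({"title": c["title"], "text": body[i:i + MAX_CHARS]})
    return out
```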
Strong titles quietly improve retrieval
One of the simplest upgrades is also the most neglected: give chunks useful titles.
A chunk called "Section 4" tells the retriever almost nothing. A chunk called "Notice periods for termination in enterprise plans" gives both the retriever and the later generator a better chance.
Not glamorous. Effective.
A surprising amount of retrieval quality comes from small signals: document title, section path, subsection name, version or effective date, source type. If those fields are noisy, the vector index works harder than it should.
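A sketch of how those small signals can be composed into a chunk title. The field names are purely illustrative; map them onto whatever your ingestion pipeline actually captures.

```python
def build_chunk_title(doc_title: str, section_path: list[str],
                      version: str | None = None) -> str:
    """Compose a retrieval-friendly title from small signals."""
    parts = [doc_title, *section_path]
    title = " > ".join(p for p in parts if p)
    return f"{title} (v{version})" if version else title


# "Enterprise MSA > Termination > Notice periods (v2024-03)"
print(build_chunk_title("Enterprise MSA", ["Termination", "Notice periods"],
                        version="2024-03"))
```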
Parent-child is still the cleanest trade-off
Small chunks retrieve better. Larger chunks are easier for the model to read. That tension never went away.
Parent-child retrieval remains the cleanest compromise. Index smaller child chunks for matching. Return the larger parent section for reading. Preserve the link between them all the way to generation.
You get better recall without forcing the model to answer from disconnected sentence fragments. Provenance stays cleaner, because retrieved evidence still belongs to a recognisable section rather than an orphan paragraph with good embeddings and no lineage.
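A minimal sketch of the bookkeeping, with the vector search itself elided: children carry a parent_id at indexing time, and hits are expanded back to parent sections before generation.

```python
from dataclasses import dataclass


@dataclass
class Parent:
    id: str
    section_title: str
    text: str


@dataclass
class Child:
    id: str
    parent_id: str
    text: str


def expand_to_parents(hits: list[Child],
                      parents: dict[str, Parent]) -> list[Parent]:
    """Match on small child chunks, read from their parent sections.

    Deduplicates so the generator never sees the same section twice
    just because several of its children matched.
    """
    seen, out = set(), []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            out.append(parents[child.parent_id])
    return out
```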
Metadata narrows the search space
RAG systems often look smarter when they are allowed to search a smaller, more relevant space.
Useful metadata:
- source or repository
- document type
- effective date or version
- language
- team, product, or policy domain
- confidentiality level where relevant
Once you have this, natural-language questions can be paired with structured filtering. That reduces the burden on reranking and generation. Without this layer, the model spends expensive tokens sorting out mistakes the ingestion pipeline should have prevented.
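A sketch of the filtering step, under the assumption that each chunk carries a metadata dict. The keys here are illustrative, not a fixed schema.

```python
corpus = [
    {"text": "...", "metadata": {"doc_type": "policy", "version": "current"}},
    {"text": "...", "metadata": {"doc_type": "policy", "version": "2021"}},
]


def filter_candidates(chunks: list[dict], filters: dict) -> list[dict]:
    """Apply structured filters before any vector scoring happens."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in filters.items())]


# only current policies get embedding-similarity scored at all
candidates = filter_candidates(corpus, {"doc_type": "policy",
                                        "version": "current"})
```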
Sentence-window after the corpus is sane
Sentence-window and other fine-grained retrieval patterns can improve precision when a small factual span matters.
They help when the corpus has been cleaned, chunk identity is preserved, and the pipeline can expand from the hit sentence to its local context.
They help less when the underlying corpus is duplicated, stale, or structurally broken. The system becomes precise about the wrong thing. That is not progress.
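A minimal sketch of the expansion step, assuming sentence order within each source chunk was preserved at ingestion. Without that identity, there is nothing safe to expand into.

```python
def sentence_window(sentences: list[str], hit_index: int,
                    window: int = 2) -> str:
    """Return the matched sentence plus `window` neighbours on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])
```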
Reranking is the second stage, not the first miracle
A reranker can materially improve quality. It cannot redeem a bad corpus.
The healthy order:
- clean and structure the data
- retrieve a broad but plausible top-k
- rerank for final relevance
- pass only the smallest defensible set to generation
Skip step one and the reranker chooses from junk with excellent confidence.
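As a pipeline sketch of that order, with index.search and rerank_score standing in for whatever retriever and cross-encoder you actually run:

```python
def answer(query: str, index, rerank_score, k: int = 50, final_n: int = 5):
    """The healthy order, end to end.

    `index.search` and `rerank_score` are placeholders, not real APIs.
    """
    candidates = index.search(query, top_k=k)   # broad but plausible top-k
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c["text"]),
                    reverse=True)
    return ranked[:final_n]                     # smallest defensible set
```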
Corpus QA deserves its own checklist
Every serious RAG system needs a corpus-quality pass separate from answer evals.
Explicit checks for:
- duplicate documents
- stale superseded versions
- broken OCR or parse failures
- missing titles or headings
- chunks with no document identity
- malformed tables or invisible text
- missing effective dates where the domain depends on them
Tedious. Cheaper than endlessly tuning prompts around a bad index.
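A few of these checks are mechanical enough to script. In this sketch the field names (id, text, title, superseded) are assumptions about what ingestion records:

```python
import hashlib
from collections import defaultdict


def corpus_qa(docs: list[dict]) -> dict[str, list]:
    """Flag duplicates, missing titles, and superseded versions."""
    report = defaultdict(list)
    seen_hashes = {}
    for d in docs:
        h = hashlib.sha256(d["text"].encode()).hexdigest()
        if h in seen_hashes:  # exact duplicate content
            report["duplicates"].append((seen_hashes[h], d["id"]))
        else:
            seen_hashes[h] = d["id"]
        if not d.get("title"):
            report["missing_title"].append(d["id"])
        if d.get("superseded"):
            report["stale_versions"].append(d["id"])
    return dict(report)
```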
Most retrieval problems are ingestion problems
When people say "the retriever is inconsistent" or "the reranker is unstable" or "the model is missing the point", they often mean the documents were chunked badly, the titles were weak, duplicates survived, metadata was absent, and document identity got lost.
The pipeline then looks mysterious. It is not mysterious. It is underprepared.
The right question is not "what chunk size?"
The better question:
What document unit should still make sense when retrieved on its own?
The answer changes by domain. Make the decision consciously.
Technical docs: a heading-bounded section. Contracts: a clause plus local definitions. Regulations: a section path with effective-date metadata.
No universal chunk size. Only better or worse fit for the corpus you actually have.
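One way to keep that decision conscious is to name the unit per corpus in configuration rather than bury a number in the chunker. The entries below mirror the examples above and are illustrative only.

```python
# Naming the retrieval unit per corpus keeps the decision
# visible and reviewable; values are illustrative labels.
CHUNK_UNITS = {
    "technical_docs": "heading_bounded_section",
    "contracts": "clause_with_local_definitions",
    "regulations": "section_path_with_effective_date",
}
```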
What I would fix first
If a RAG system felt shaky and I had one morning, I would start here:
- remove duplicates and stale versions
- improve chunk titles and section paths
- preserve parent-child identity
- add metadata filters for the main corpus dimensions
- inspect retrieval failures before touching the prompt
The model may still need work. The documents usually need it first.
Related reading
- Which query transformation techniques actually help RAG?
- How to build legal answering systems that can be trusted