Alex Chernysh · Agentic behaviorist · Tel Aviv

How to Build Legal Answering Systems That Can Be Trusted

A practical blueprint for legal QA, shaped in part by work around the Agentic RAG Legal Challenge: document identity, hybrid retrieval, structured answers, page-level grounding, telemetry, and evals.

March 10, 2026 · 22 min read
RAG · Legal · Reliability

The fastest way to ship a dangerous legal assistant is to optimise fluency before evidence. The real work is not making the model sound persuasive. It is making sure the answer is anchored to the right document, the right page, and the smallest defensible set of supporting facts.

Challenge context

This piece is informed by work on the Agentic RAG Legal Challenge (38th of 356 registered participants), where legal QA systems are judged as full pipelines rather than as standalone prompts. The competition pipeline is open-source as Shafi, named after al-Shafi'i (767–820), the father of Islamic legal methodology.

Hard truth

If a legal QA system retrieves the wrong family member of the right law, or the right clause from the wrong document, the generator is already in trouble. Most failures that look like "reasoning problems" are evidence-discipline problems in disguise. Downstream generation gets blamed for what are usually retrieval failures.

What trustworthy legal answerers need

  • structure-aware ingestion, not flat PDF text dumping
  • retrieval that preserves document identity all the way to the answer
  • typed answer contracts instead of one generic free-form response style
  • provenance that is both minimal and complete
  • evals that separate answer quality, grounding quality, and latency
Reference Architecture

01 Ingestion

Keep document identity intact

  • parse PDFs with OCR fallback
  • chunk by structure, not by raw size
  • preserve title, document type, section path, and page number

02 Retrieval

Shortlist the right document family

  • hybrid retrieval: dense plus lexical
  • aggregate by document or law family
  • rerank clauses inside the shortlist, not across the whole corpus

03 Answering

Answer under a strict contract

  • route by question shape
  • keep booleans, dates, names, and comparisons in different answer modes
  • send only the smallest defensible evidence set to generation

04 Provenance and checks

Do not let the answer leave unaccompanied

  • attach page-level provenance
  • keep telemetry bound to the final answer
  • separate answer quality, grounding quality, and latency in evals
A legal answering system is easier to read as four narrow layers than as one long pipe. The work is preserving document identity, narrowing the evidence set, and only then letting the answer leave the system.

The one thing that actually worked

I tried every retrieval trick in the literature on a legal corpus. BM25 tuning, RAG Fusion, HyDE, step-back prompting. None of them moved the needle. Not one. The only change that reliably improved end-to-end answer quality was making the system faster. Prompt caching cut latency from 5,086ms to 1,931ms, and accuracy went up with it.

That was not the only surprise. My internal eval scored the system around 0.92. The platform's held-out eval returned 0.48. A gap of roughly 44 percentage points from distribution shift alone. The questions I had been testing against did not represent the questions the system would actually face.

Those two findings shaped everything that follows. Retrieval engineering matters more than retrieval sophistication. If you are not evaluating against realistic question distributions, you are not evaluating at all.

Wrong evidence, not weak prose

When people say a legal system "hallucinated," they often mean one of four different failures:

  • it retrieved from the wrong document
  • it retrieved the right document but the wrong clause
  • it answered from one supporting page and silently dropped the second page the answer depended on
  • it forced a free-form answer where the task really wanted a typed result

Treating hallucination as a prompt-engineering problem misses most of the failure surface.

In legal work, a confident answer built on the wrong statutory family is worse than a visible refusal. The system feels precise right up until someone checks the source and realizes the clause came from a neighboring instrument, an older consolidated version, or a similar-looking notice that does not actually govern the question at hand.

The real architectural question is not "How smart is the model?" It is:

Can the system preserve the identity of the governing source all the way from ingestion to the final answer?

Legal corpora are hostile input

They are long, repetitive, structurally similar, and full of high-stakes near-matches. That combination breaks a surprising number of otherwise competent retrieval systems.

Research in 2025 gave this failure mode a useful name: Document-Level Retrieval Mismatch. Markus Reuter and colleagues showed that legal retrievers often select chunks from the wrong source document because boilerplate and formal language are so repetitive across the corpus. Their proposed fix, Summary-Augmented Chunking, is attractive precisely because it is simple: inject document-level identity back into each chunk before retrieval, instead of pretending local chunk text is enough on its own.

Research signal

The practical lesson from the DRM/SAC work is not "add more summarization everywhere." It is "do not let chunks forget what document they belong to." In legal search, global document identity is not decoration. It is part of the retrieval signal.

That has three design consequences.

Preserve page and section identity from day one

You want at least:

  • canonical document ID
  • document title
  • document type
  • section path or heading trail
  • page number
  • raw chunk text

If you lose page identity early, you end up rebuilding provenance later with heuristics and regret.
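
As a concrete sketch, a chunk record carrying that identity might look like this; the field names are illustrative, not the actual pipeline schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LegalChunk:
    # Identity that must survive retrieval, reranking, and provenance.
    doc_id: str        # canonical document ID
    doc_title: str     # full instrument title as published
    doc_type: str      # e.g. "law", "regulation", "enforcement_decision"
    section_path: str  # heading trail, e.g. "Part 4 > Article 16 > (1)(c)"
    page_number: int   # 1-based page in the source PDF
    text: str          # raw chunk text, unmodified
```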

Use OCR as a fallback, not as an afterthought

Legal PDFs are not always born clean. Some are scans, some have image-based signatures or notices, and some bury decisive text inside low-quality page images. OCR should not sit in the online path, but it should absolutely sit in ingestion.
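
A minimal sketch of that fallback, assuming pypdf for the text layer, pdf2image for rasterisation, and pytesseract for OCR; none of these libraries are mandated, they just illustrate the shape:

```python
from pypdf import PdfReader
from pdf2image import convert_from_path   # requires poppler installed
import pytesseract

def extract_page_text(pdf_path: str, page_index: int, min_chars: int = 40) -> str:
    """Prefer the embedded text layer; fall back to OCR for scanned pages."""
    text = (PdfReader(pdf_path).pages[page_index].extract_text() or "").strip()
    if len(text) >= min_chars:
        return text
    # Likely a scan or image-based page: rasterise just this page and OCR it.
    image = convert_from_path(pdf_path, first_page=page_index + 1,
                              last_page=page_index + 1)[0]
    return pytesseract.image_to_string(image)
```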

Keep chunking structural

The wrong default is still "fixed-size chunking and hope." In legal material, structure matters:

  • statutes want article/section-aware chunks
  • case law wants reasoning/facts/holding boundaries
  • contracts want clause and definition boundaries

Flat chunking works until it splits the definition from the obligation it governs.
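
A minimal sketch of structural splitting for statutes, assuming articles are introduced by lines like "Article 16"; the regex is a guess about the corpus formatting, and case law and contracts would each need their own boundary rules:

```python
import re

# Assumed heading pattern: lines beginning with "Article 16" or "ARTICLE 16".
ARTICLE_RE = re.compile(r"(?m)^(?=ARTICLE\s+\d+|Article\s+\d+)")

def chunk_statute(page_text: str) -> list[str]:
    """Split on article boundaries so a definition stays with the obligation it governs."""
    parts = [p.strip() for p in ARTICLE_RE.split(page_text) if p.strip()]
    return parts or [page_text.strip()]
```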

Retrieve the right page

The most useful retrieval stack for legal QA is still hybrid:

  • dense retrieval for semantic similarity
  • lexical retrieval for exact language, law numbers, article numbers, and party names

But hybrid search alone is not enough. The more important design choice is what happens after first-stage retrieval.

I like this sequence (a code sketch follows the list):

  1. retrieve wide enough to preserve recall
  2. aggregate candidates by document or law family
  3. apply a document-consistency sanity layer
  4. rerank within that shortlist
  5. send only the smallest defensible evidence set to generation
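
Here is the shortlist-shaping part of that sequence as a sketch; `family_id` and the pluggable `rerank` scorer are assumptions, not the competition pipeline's API:

```python
from collections import defaultdict

def shortlist_by_family(candidates, rerank, top_families=3, per_family=4):
    """candidates: list of (chunk, score) pairs from a wide hybrid first stage.
    Aggregate by document family first, then rerank only inside the shortlist."""
    by_family = defaultdict(list)
    for chunk, score in candidates:
        by_family[chunk.family_id].append((chunk, score))

    # Rank families by their best first-stage hit, not by raw chunk count.
    families = sorted(by_family, key=lambda f: max(s for _, s in by_family[f]),
                      reverse=True)[:top_families]

    shortlist = [c for f in families for c, _ in by_family[f]]
    # Rerank clauses inside the shortlist only; `rerank` is any per-chunk scorer.
    return sorted(shortlist, key=rerank, reverse=True)[:top_families * per_family]
```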

This is where many systems quietly fail. They retrieve the right page somewhere in the top set, then let it die during shortlist shaping because a cousin document looks semantically similar enough.

That cousin is often the real enemy in legal retrieval:

  • the consolidated version of the same law
  • an amendment law
  • an enactment notice
  • a related schedule
  • a neighboring regulation with nearly identical phrasing

The system should not treat those as interchangeable. It should understand them as a document family and preserve the correct family member depending on the question.

For example:

  • an effective-date question may need the law body and the enactment notice
  • an administration question may need the canonical law page, not the consolidated surrogate
  • a comparison question may need one page from each named law, not the most semantically similar two pages in the corpus

This is why expanding context size does not solve the problem. It raises recall and noise at the same time. In legal QA, the goal is not maximal context. It is correct support-family survival.

Stop forcing one answer style onto every question

One of the simplest ways to improve a legal answerer is to stop pretending every question wants a paragraph.

Some legal questions are naturally free text. Many are not.

There is a world of difference between:

  • "What is the date?"
  • "Who are the claimants?"
  • "Does Article X make this restriction effective?"
  • "Compare how two laws treat the same concept."

These should not be routed through the same answer contract.

One generic paragraph

Every question is pushed through the same free-form answer style.

  • verification becomes fuzzy
  • format compliance drifts
  • the model invents unnecessary prose

Typed answer contracts

The answer format follows the question shape.

  • booleans stay boolean
  • dates stay dates
  • analytical comparisons stay short and bounded

Here is the pattern I trust most:

Question shape           Better answer contract
boolean                  JSON true / false or explicit abstention
number                   JSON number
date                     ISO date
name                     exact string
names                    list of strings
analytical comparison    short free text with explicit support boundaries

This matters for two reasons.

First, typed answers are easier to verify.

Second, they reduce the number of ways the model can be "creative" when the task did not actually call for creativity.
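
A minimal sketch of contract routing, using Pydantic-style models purely for illustration; the type names and question-type keys are assumptions, not the competition code:

```python
from datetime import date
from pydantic import BaseModel

class BooleanAnswer(BaseModel):
    value: bool | None = None   # None means explicit abstention

class DateAnswer(BaseModel):
    value: date                 # serialises to an ISO 8601 date

class NamesAnswer(BaseModel):
    value: list[str]

class ComparisonAnswer(BaseModel):
    value: str                  # short free text
    supported_by: list[int]     # page numbers that bound the claim

CONTRACTS = {"boolean": BooleanAnswer, "date": DateAnswer,
             "names": NamesAnswer, "comparison": ComparisonAnswer}

def parse_answer(question_type: str, raw_json: str) -> BaseModel:
    """Validate the model output against the contract for this question shape."""
    return CONTRACTS[question_type].model_validate_json(raw_json)
```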

The same principle applies to what leaves the system. Many systems need two distinct answer layers:

  1. an internal reasoning or evidence-rich representation
  2. a final user-facing or API-facing contract

That split is healthy. The mistake is to collapse them.

Provenance: minimal and complete

A citation stack can fail in two opposite ways:

  • it can cite too much
  • it can cite too little

Most teams notice the first failure because it looks noisy. In legal systems, the second failure is often worse.

If the answer depends on two factual atoms that live on different pages, you need both pages. Not one "best" page chosen for neatness.

That sounds trivial. It is not.

In practice, a legal answer item may contain several support slots:

  • the title of the instrument
  • the enactment date
  • the effective date
  • the amended law
  • the administration clause
  • the common element being compared

If those slots localize to different pages, provenance pruning is not allowed to collapse them into one page just because the answer still sounds plausible.

That is why I prefer item-level and slot-level provenance over sentence-level vibes. It also fits the broader diagnostic direction behind frameworks like RAGChecker: if you want to understand a retrieval-augmented system, you need to score where support is correct, missing, or attached to the wrong evidence unit.

The rule is simple:

Minimal support is good only if it is still complete support.

This also means page-spanning answers need special handling. If the answer starts on one page and continues on the next, both pages belong in the final support set. A lot of legal systems miss this because they optimize for single-page neatness instead of evidentiary continuity.
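
One way to make that concrete is to let pruning finish only when every support slot still maps to a kept page; a sketch, with slot names as assumptions:

```python
def prune_support(slot_pages: dict[str, set[int]]) -> set[int]:
    """slot_pages maps each factual slot (e.g. 'effective_date',
    'administration_clause') to the pages that evidence it. Keep a small page
    set that still covers every slot; never drop a slot's only page just to
    make the citation list shorter."""
    kept: set[int] = set()
    for slot, pages in slot_pages.items():
        if not pages:
            raise ValueError(f"slot '{slot}' has no supporting page; abstain or re-retrieve")
        if not (pages & kept):      # slot not yet covered by an already-kept page
            kept.add(min(pages))    # greedy choice: reuse pages where possible
    return kept
```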

Main design bet

Optimize for minimal justified support, not minimal citations at any cost. In high-stakes domains, missing one required support page is usually worse than carrying one extra clearly related page.

Streaming and telemetry are part of the product

Legal answerers are often built as if latency and telemetry were observability concerns. They are not. They are product behavior.

If your first token arrives late, the system feels hesitant.

If your telemetry is incomplete, you cannot explain failures.

If your streaming path and your final-answer path diverge, you create a shadow system that behaves differently in public than it does in traces.

The production pattern I trust most is:

  • stream as early as the answer contract safely allows
  • keep final answer canonical
  • emit stage timings, token counts, provider identity, and retrieved/used sources
  • never buffer the whole answer just to feel clean

That does not mean "stream recklessly." It means streaming should be designed together with:

  • answer-type routing
  • provenance
  • verification boundaries
  • failure reporting

The system should be able to answer a very boring but very important question:

Why did we think this answer was allowed to leave the system?
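
A sketch of binding telemetry to the final answer itself, so that question always has an artifact behind it; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerTelemetry:
    question_id: str
    provider: str                                   # which model/provider answered
    stage_ms: dict[str, float] = field(default_factory=dict)   # retrieval, rerank, generate...
    prompt_tokens: int = 0
    completion_tokens: int = 0
    retrieved_pages: list[int] = field(default_factory=list)   # everything retrieval saw
    used_pages: list[int] = field(default_factory=list)        # what the answer actually cites
    contract: str = "free_text"                     # which answer contract was applied
    first_token_ms: float | None = None             # TTFT, measured not estimated
```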

Evals: separate answer quality from grounding quality

A lot of teams still use one headline score and call it evaluation.

That is not enough.

For legal answering systems, I want separate signals for:

  • answer correctness
  • grounding recall
  • wrong-document rate
  • wrong-family rate
  • orphan-page rate
  • format compliance
  • latency

And I want one more distinction that becomes critical as the system matures:

  • trusted benchmark tier
  • monitoring tier

This sounds bureaucratic until you have lived through mislabeled gold pages, inherited regression seeds, or eval cases that are useful as monitors but not honest hard gates.

The practical lesson is simple:

  • use a small, audited, trusted tier for hard acceptance
  • keep the wider, noisier tier for drift monitoring and triage

I also strongly prefer evaluation that can diagnose where the system is failing: source-family selection, page survival, answer formatting, or support completeness. A judge score is useful signal. It is not robustness.
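
A per-question record that keeps those signals apart might look like the sketch below; the fields and the 0.9 gate threshold are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question_id: str
    tier: str                  # "trusted" (hard gate) or "monitor" (drift/triage)
    answer_correct: bool
    format_compliant: bool
    grounding_recall: float    # fraction of required gold pages actually cited
    wrong_document: bool       # cited pages from a document that does not govern
    wrong_family: bool         # right law family missed entirely
    orphan_pages: int          # cited pages that support nothing in the answer
    latency_ms: float

def hard_gate(records: list[EvalRecord]) -> bool:
    """Only the trusted tier gates a release; the monitor tier is for triage."""
    trusted = [r for r in records if r.tier == "trusted"]
    return all(r.answer_correct and r.grounding_recall >= 0.9 for r in trusted)
```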

Your eval harness is only as good as your test distribution. I learned this the hard way.

If I were shipping this tomorrow

If I were building a legal answerer today, I'd default to something close to this:

Layer                 Practical default
parsing               robust PDF extraction with OCR fallback
chunking              structure-aware chunks with section path preserved
retrieval identity    doc title, doc family, doc summary, page number
retrieval             hybrid dense + lexical
reranking             shortlist by document family, then rerank by clause relevance
answering             typed contracts for strict questions, concise free text for analytical questions
provenance            used pages, not visible pages
output formatting     separate submission formatting from internal reasoning representation
evals                 trusted hard-gate set + broader drift monitor

Notice what is not on that list:

  • giant prompt pyramids
  • gratuitous multi-agent loops (there are cases where coordinated multi-agent systems are warranted, more on that below)
  • broad context stuffing
  • blind trust in a single frontier model

The more mature these systems become, the less magical they look.

Build order

If I had to build a strong legal answering system from scratch again, I'd do these in order:

  1. ingestion with page identity and OCR fallback
  2. structure-aware chunking with document identity preserved
  3. hybrid retrieval with document-family sanity checks
  4. typed answer contracts for strict question classes
  5. page-level provenance with complete support coverage
  6. a small trusted benchmark before broad optimization
  7. streaming plus telemetry that explain every stage

That sequence matters.

Most weak systems do the reverse:

  1. pick a model
  2. write prompts
  3. hope retrieval is fine
  4. add evaluation later

That ordering typically produces systems that work on demos but degrade quickly under real queries.

Operating standard

For legal AI, the standard should be boringly high.

Not:

  • "The model sounds smart."

But:

  • "The system found the right source."
  • "It kept the right page."
  • "It used the smallest support set that still covers the whole claim."
  • "It can tell me why it answered that way."
  • "It knows when to abstain."

The emphasis shifts from context size and model eloquence to evidence quality and answer contracts.

The principles above are design-time constraints. The next section is about what happens when you apply them under a real deadline, with a real scoring formula, and a real team. Some of the team were AI agents. The specifics come from one competition sprint. The failure modes and surprises do not.

Competition deep-dive

Competition context

Everything below comes from one team's experience in the Agentic RAG Legal Challenge 2026. The dataset was 300+ DIFC legal PDFs and 900 questions. The scoring formula was multiplicative. The timeline was 13 days. Most of the numbers I cite are internal measurements, not leaderboard scores. I share them because the patterns are more useful than the rankings.

What actually happened: a 13-day sprint

I entered the Agentic AI RAG Challenge to test these principles against a real benchmark. 300+ DIFC legal PDFs. 900 questions. Multiplicative scoring formula. 13 days. This system was built during active conflict in Israel, which imposed unpredictable constraints on development time and drove the engineering philosophy toward resilience and efficiency over complexity. What follows is a condensed account of what worked, what didn't, and what surprised me.

The formula

Every strategic choice traces back to this equation:

Total = (0.7 * Det + 0.3 * Asst) * G * T * F
Factor    What it measures                               Leverage (per 1pp)
G         page-level grounding (F-beta, beta=2.5)        +0.93pp total
Det       exact-match accuracy on typed answers          +0.72pp total
Asst      LLM-judged free-text quality                   +0.31pp total
T         telemetry schema compliance                    multiplicative gate
F         time-to-first-token coefficient (0.85–1.05)    multiplicative gate

One percentage point of grounding is worth 3x one percentage point of free-text quality. The implication was immediate: protect the multipliers at all costs, then push answer quality higher inside that safety envelope.
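
The leverage figures depend on the operating point, so it is worth being able to recompute them; a small sketch, with an assumed baseline rather than the actual leaderboard state:

```python
def total(det: float, asst: float, g: float, t: float, f: float) -> float:
    return (0.7 * det + 0.3 * asst) * g * t * f

def leverage(base: dict, factor: str, bump: float = 0.01) -> float:
    """Finite difference: how much total score moves per +1pp on one factor."""
    bumped = dict(base, **{factor: base[factor] + bump})
    return total(**bumped) - total(**base)

# Assumed operating point, not the real submission scores.
base = dict(det=0.95, asst=0.80, g=0.96, t=1.00, f=1.03)
for factor in ("g", "det", "asst"):
    print(f"+1pp {factor}: {leverage(base, factor) * 100:+.2f}pp total")
```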

Act 1: The ablation graveyard

Most of what you read in RAG tutorials will actively hurt you in a legal domain. I ran 16 serious experiments across 13 days. The rejection rate for retrieval-expansion experiments was 100 percent. Here are the four that taught me the most.

RAG Fusion was the worst regression: -7.8pp grounding, +246ms. The technique rewrites a query into three to five paraphrased variants and merges their results. A query like "What is the Date of Issue of CFI 022/2025?" became "provisions related to CFI 022/2025" and "date associated with case filing CFI 022." The merge promoted tangentially related pages over precisely correct ones. Query diversity adds noise when legal queries are already optimally specific.

HyDE generated a hypothetical answer document and used its embedding to retrieve similar real documents. The LLM produced hypothetical answers referencing plausible but non-existent provisions, like "Article 47(2) of the Regulatory Law". Those embedded into vector neighborhoods far from the actual enforcement decisions. Result: -0.65pp grounding, +560ms. Hallucinated legal context poisons retrieval.

Step-back rewriting abstracted questions into broader forms. "What does Article 16(1)(c) of the Employment Law say about payroll deductions?" became "What are the payroll deduction rules in DIFC?" This lost the exact article reference that anchors retrieval and added 1,542ms of latency. Abstraction loses the precision legal queries require.

BM25 hybrid retrieval added sparse keyword search alongside dense embeddings. Zero grounding gain, +260ms. The domain-tuned dense encoder already captured the lexical signal. Keyword signal is not additive with a domain-tuned dense encoder.

Full ablation table with all 16 experiments in the technical supplement below.

The pattern is consistent: techniques designed for ambiguous general-domain queries do not transfer to precise legal text.

The climax of Act 1

The single largest improvement to total score came from prompt caching. I restructured system prompts to exceed OpenAI's 1,024-token caching threshold (as of March 2026). Free-text TTFT dropped from 5,086ms to 1,931ms. The F coefficient jumped from 0.973 to 1.006: +3.0pp of total score from a change that touched zero answer logic and zero retrieval code. Not better retrieval. Faster infrastructure.
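
The change is mostly prompt structure: keep a long, byte-identical prefix first and push anything per-question to the end, so the provider's prefix cache can hit. A sketch with the OpenAI Python client; the model name and prompt file path are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Static, identical on every call, and long enough (>1,024 tokens) to be cacheable.
STATIC_SYSTEM_PROMPT = open("prompts/legal_answering_system.md").read()

def answer(question: str, evidence: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cache-friendly prefix
            {"role": "user", "content": f"Evidence:\n{evidence}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```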

Combined with early token emission (moving mark_first_token() to fire before the grounding sidecar finished), the final F coefficient reached 1.032. That's +3.19% of total score, earned purely from when I stopped the clock.

I was confident. Internal evaluation showed near-perfect scores.

Act 2: The gap

My internal grounding proxy showed G=0.9956. Telemetry was perfect at T=1.000. Speed was optimized at F=1.032. I estimated a total score around 0.92.

The platform returned 0.48.

The gap was not a bug. It was a distribution shift. The warmup set had 100 questions:

Type         Warmup    Private    Multiplier
boolean      32        193        6.0x
free_text    30        270        9.0x
date         1         93         93.0x
name         15        95         6.3x
names        5         90         18.0x
number       17        159        9.4x

Date questions went from 1 to 93. Names went from 5 to 90. And 279 questions referenced document types that appeared zero times in the warmup set. My eval harness was tuned to a distribution that didn't represent the real test.

The grounding proxy was the other failure. It measured whether cited pages existed in the corpus, not whether they were the right pages. An internal 0.9956 masked a platform 0.589. The proxy was answering the wrong question.

The full distribution shift analysis is in the technical supplement.

The multi-agent sprint

The multi-agent development process (12 Claude Code agents working in parallel on a single laptop, closing 737 tickets in 47 hours) eventually grew into Bernstein, an open-source orchestrator. The competition pipeline itself is open-source as Shafi, named, as noted above, after the father of Islamic legal methodology. A fitting reference for a system that reasons about law.

What held up

The techniques that survived the distribution shift didn't depend on the training distribution. Prompt caching delivered the same TTFT improvement regardless of question type. Metadata lookups worked for any question referencing a case number, whether it appeared in the warmup set or not. Type-specific routing generalized because it keyed on question structure, not question content.

The techniques that broke were tuned to the warmup distribution: grounding thresholds calibrated on 100 questions, answer patterns overfit to a handful of document types, a proxy metric that never tested against ground truth. Infrastructure beat sophistication. Domain engineering beat generic RAG.

For the full ablation study, score progression charts, distribution shift analysis, and per-type performance breakdown, see the research deep-dive in the second tab above.

What I would do differently

I'd invest in eval distribution analysis on day 1. The warmup set had 1 date question. The private set had 93. That's a 93x multiplier on a type I barely tested. Comparing the warmup type distribution against reasonable expectations for a legal corpus (how many enforcement decisions, how many regulations, how many practice directions) would have revealed the coverage gap immediately. I didn't look at the distribution until after the final score came back. That analysis takes 30 minutes and would have changed the entire testing strategy.
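
The analysis is small enough to script; a sketch of the check that would have flagged the gap, with the question format and type names as assumptions:

```python
from collections import Counter

def type_distribution(questions: list[dict]) -> dict[str, float]:
    counts = Counter(q["type"] for q in questions)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def coverage_gaps(warmup: list[dict], expected_types: list[str], min_count: int = 5) -> list[str]:
    """Flag question types the warmup set barely exercises."""
    counts = Counter(q["type"] for q in warmup)
    return [t for t in expected_types if counts.get(t, 0) < min_count]

# e.g. coverage_gaps(warmup_questions,
#                    ["boolean", "date", "name", "names", "number", "free_text"])
# would have flagged "date" (1 question) and "names" (5 questions) immediately.
```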

I'd skip the retrieval experiments entirely. With a domain-tuned dense encoder already achieving 90%+ page recall, retrieval was near-ceiling. Every hour spent on BM25, RAG Fusion, HyDE, and step-back rewriting was wasted. Six retrieval-expansion experiments, all rejected, all consuming time I didn't have. The returns were in answer quality and grounding calibration, not in retrieval. I'd redirect that effort toward free-text answer quality, which turned out to be the largest scoring gap.

I'd build the multi-agent system from the start. The 1.78× measured throughput from 12 parallel agents was real. 737 tickets in 47 hours. But spinning up the coordination infrastructure cold on day 10, under deadline pressure, meant the first 8 hours were spent on wakeup protocols, bulletin boards, and directive files instead of on the actual problem. Starting the multi-agent system on day 3, with module boundaries already in place, would have amortized that coordination overhead across the full sprint and reduced the merge conflicts that plagued the final 48 hours.

I'd measure TTFT from day 1. The F coefficient was the cheapest multiplier in the formula. Prompt caching alone delivered +3.0pp of total score. Early token emission added another +2.0pp. Combined, that's more than any retrieval or answer-quality improvement I found. But I didn't focus on TTFT until submission v6 on day 3. The first two days of submissions ran with F below 1.0, leaving free points on the table. In a multiplicative formula, the cheapest multiplier is always the first one worth optimizing.

A cautionary tale about agent permissions

While preparing this article for publication, an AI agent tasked with cleaning the repository ran rm -rf on several gitignored directories. 4.2 GB of development artifacts, agent coordination logs, golden labels, research archives. Not in git history. Not recoverable. Two weeks of development traces, gone in under a second.

The agent had full filesystem permissions because the task seemed simple: "clean up the repo." It understood that gitignored files aren't needed on GitHub. What it didn't consider is that "not needed on GitHub" and "not needed at all" are different things.

If you give an AI agent unrestricted filesystem access, treat it like giving root to an intern on their first day. Use allowlists, not blocklists. Require confirmation for destructive operations. Or at minimum, make sure everything is backed up before the agent starts.

Related reading

  • Most RAG failures start in the documents
  • How to run LLM evals in production

Further reading

  • Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
  • LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain
  • RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
  • OpenAI: Evaluation best practices
  • Anthropic: Demystifying evals for AI agents
