The fastest way to ship a dangerous legal assistant is to optimize fluency before evidence. The real work is not making the model sound persuasive. It is making sure the answer is anchored to the right document, the right page, and the smallest defensible set of supporting facts.
01 Ingestion
Keep document identity intact
- parse PDFs with OCR fallback
- chunk by structure, not by raw size
- preserve title, document type, section path, and page number
02 Retrieval
Shortlist the right document family
- hybrid retrieval: dense plus lexical
- aggregate by document or law family
- rerank clauses inside the shortlist, not across the whole corpus
03 Answering
Answer under a strict contract
- route by question shape
- keep booleans, dates, names, and comparisons in different answer modes
- send only the smallest defensible evidence set to generation
04 Provenance and checks
Do not let the answer leave unaccompanied
- attach page-level provenance
- keep telemetry bound to the final answer
- separate answer quality, grounding quality, and latency in evals
The one thing that actually worked
I tried every retrieval trick in the literature on a legal corpus. BM25 tuning, RAG Fusion, HyDE, step-back prompting. None of them moved the needle. Not one. The only change that reliably improved end-to-end answer quality was making the system faster. Prompt caching cut latency from 5,086ms to 1,931ms, and accuracy went up with it.
That was not the only surprise. My internal eval scored the system around 0.92. The platform's held-out eval returned 0.48. A 44-point gap from distribution shift alone. The questions I had been testing against did not represent the questions the system would actually face.
Those two findings shaped everything that follows. Retrieval engineering matters more than retrieval sophistication. If you are not evaluating against realistic question distributions, you are not evaluating at all.
Wrong evidence, not weak prose
When people say a legal system "hallucinated," they often mean one of four different failures:
- it retrieved from the wrong document
- it retrieved the right document but the wrong clause
- it answered from one supporting page and silently dropped the second page the answer depended on
- it forced a free-form answer where the task really wanted a typed result
Treating hallucination as a prompt-engineering problem misses most of the failure surface.
In legal work, a confident answer built on the wrong statutory family is worse than a visible refusal. The system feels precise right up until someone checks the source and realizes the clause came from a neighboring instrument, an older consolidated version, or a similar-looking notice that does not actually govern the question at hand.
The real architectural question is not "How smart is the model?" It is:
Can the system preserve the identity of the governing source all the way from ingestion to the final answer?
Legal corpora are hostile input
They are long, repetitive, structurally similar, and full of high-stakes near-matches. That combination breaks a surprising number of otherwise competent retrieval systems.
Research in 2025 gave this failure mode a useful name: Document-Level Retrieval Mismatch. Markus Reuter and colleagues showed that legal retrievers often select chunks from the wrong source document because boilerplate and formal language are so repetitive across the corpus. Their proposed fix, Summary-Augmented Chunking, is attractive precisely because it is simple: inject document-level identity back into each chunk before retrieval, instead of pretending local chunk text is enough on its own.
That has three design consequences.
Preserve page and section identity from day one
You want at least:
- canonical document ID
- document title
- document type
- section path or heading trail
- page number
- raw chunk text
If you lose page identity early, you end up rebuilding provenance later with heuristics and regret.
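As a concrete sketch, the per-chunk identity record can be a small stdlib dataclass. The field names and sample values here are illustrative, not taken from any particular pipeline:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Chunk:
    """Minimal per-chunk identity record; field names are illustrative."""
    doc_id: str        # canonical document ID
    title: str         # document title
    doc_type: str      # e.g. "law", "regulation", "notice"
    section_path: str  # heading trail, e.g. "Part 2 > Article 16 > (1)(c)"
    page: int          # page number in the source PDF
    text: str          # raw chunk text

chunk = Chunk(
    doc_id="DIFC-LAW-2-2019",
    title="Employment Law",
    doc_type="law",
    section_path="Part 2 > Article 16 > (1)(c)",
    page=14,
    text="An employer may deduct from an employee's remuneration ...",
)
# every chunk carries enough identity to rebuild page-level provenance later
record = asdict(chunk)
```

Freezing the dataclass is deliberate: chunk identity should be set once at ingestion and never mutated downstream.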
Use OCR as a fallback, not as an afterthought
Legal PDFs are not always born clean. Some are scans, some have image-based signatures or notices, and some bury decisive text inside low-quality page images. OCR should not sit in the online path, but it should absolutely sit in ingestion.
Keep chunking structural
The wrong default is still "fixed-size chunking and hope." In legal material, structure matters:
- statutes want article/section-aware chunks
- case law wants reasoning/facts/holding boundaries
- contracts want clause and definition boundaries
Flat chunking works until it splits the definition from the obligation it governs.
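A minimal sketch of article-aware chunking for statutes. The `Article N` heading pattern is a deliberate simplification; real instruments need a richer grammar, and this is not the chunker from the system described later:

```python
import re

def chunk_by_articles(pages):
    """Structure-aware chunking sketch: split on article headings and record
    the page where each article starts. `pages` is [(page_number, text), ...].
    Text between headings stays attached to its article, even across pages."""
    heading = re.compile(r"(?m)^(Article \d+)")
    chunks = []
    current = None
    for page_no, text in pages:
        last = 0
        for m in heading.finditer(text):
            if current:
                current["text"] += text[last:m.start()]
            current = {"section": m.group(1), "page": page_no, "text": ""}
            chunks.append(current)
            last = m.start()
        if current:
            current["text"] += text[last:]
    return chunks

pages = [
    (1, "Article 1\nScope.\nArticle 2\nDefinitions."),
    (2, "continued text\nArticle 3\nDuties."),
]
chunks = chunk_by_articles(pages)
```

Note how Article 2's continuation on page 2 stays in the Article 2 chunk instead of being orphaned, which is exactly the failure fixed-size chunking invites.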
Retrieve the right page
The most useful retrieval stack for legal QA is still hybrid:
- dense retrieval for semantic similarity
- lexical retrieval for exact language, law numbers, article numbers, and party names
But hybrid search alone is not enough. The more important design choice is what happens after first-stage retrieval.
I like this sequence:
- retrieve wide enough to preserve recall
- aggregate candidates by document or law family
- apply a document-consistency sanity layer
- rerank within that shortlist
- send only the smallest defensible evidence set to generation
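The sequence above can be sketched end to end. The reciprocal-rank-fusion merge and the `family_of` mapping are my own stand-ins for whatever hybrid merge and family resolver a real system uses:

```python
from collections import defaultdict

def shortlist_by_family(dense_hits, lexical_hits, family_of, keep_families=2, k=60):
    """Hybrid merge (reciprocal-rank fusion), then aggregation by document
    family before any reranking. Hits are ordered lists of chunk IDs;
    `family_of` maps a chunk ID to its law/document family."""
    fused = defaultdict(float)
    for hits in (dense_hits, lexical_hits):
        for rank, chunk_id in enumerate(hits):
            fused[chunk_id] += 1.0 / (k + rank + 1)
    # aggregate chunk scores per document family
    family_score = defaultdict(float)
    for chunk_id, score in fused.items():
        family_score[family_of[chunk_id]] += score
    top_families = sorted(family_score, key=family_score.get, reverse=True)[:keep_families]
    # rerank only inside the surviving families (a cross-encoder would go here)
    shortlist = [c for c in sorted(fused, key=fused.get, reverse=True)
                 if family_of[c] in top_families]
    return shortlist, top_families

dense = ["L1:p3", "L1:p4", "L2:p1"]
lexical = ["L1:p3", "L3:p9"]
family = {"L1:p3": "L1", "L1:p4": "L1", "L2:p1": "L2", "L3:p9": "L3"}
shortlist, families = shortlist_by_family(dense, lexical, family)
```

The point of the family aggregation step is that a page's survival depends on its family's total evidence, not on its individual similarity score alone.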
This is where many systems quietly fail. They retrieve the right page somewhere in the top set, then let it die during shortlist shaping because a cousin document looks semantically similar enough.
That cousin is often the real enemy in legal retrieval:
- the consolidated version of the same law
- an amendment law
- an enactment notice
- a related schedule
- a neighboring regulation with nearly identical phrasing
The system should not treat those as interchangeable. It should understand them as a document family and preserve the correct family member depending on the question.
For example:
- an effective-date question may need the law body and the enactment notice
- an administration question may need the canonical law page, not the consolidated surrogate
- a comparison question may need one page from each named law, not the most semantically similar two pages in the corpus
This is why expanding context size does not solve the problem. It raises recall and noise at the same time. In legal QA, the goal is not maximal context. It is correct support-family survival.
Stop forcing one answer style onto every question
One of the simplest ways to improve a legal answerer is to stop pretending every question wants a paragraph.
Some legal questions are naturally free text. Many are not.
There is a world of difference between:
- "What is the date?"
- "Who are the claimants?"
- "Does Article X make this restriction effective?"
- "Compare how two laws treat the same concept."
These should not be routed through the same answer contract.
One generic paragraph
Every question is pushed through the same free-form answer style.
- verification becomes fuzzy
- format compliance drifts
- the model invents unnecessary prose
Typed answer contracts
The answer format follows the question shape.
- booleans stay boolean
- dates stay dates
- analytical comparisons stay short and bounded
Here is the pattern I trust most:
| Question shape | Better answer contract |
|---|---|
| boolean | JSON true / false or explicit abstention |
| number | JSON number |
| date | ISO date |
| name | exact string |
| names | list of strings |
| analytical comparison | short free text with explicit support boundaries |
This matters for two reasons.
First, typed answers are easier to verify.
Second, they reduce the number of ways the model can be "creative" when the task did not actually call for creativity.
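A sketch of contract enforcement using only the stdlib. The shape names mirror the table above; a production version would also accept an explicit abstention token per shape:

```python
import json
from datetime import date

def validate_answer(question_shape, raw):
    """Enforce a typed answer contract per question shape.
    Returns the parsed value or raises ValueError."""
    if question_shape == "boolean":
        value = json.loads(raw)
        if not isinstance(value, bool):
            raise ValueError("expected JSON true/false")
        return value
    if question_shape == "number":
        value = json.loads(raw)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("expected JSON number")
        return value
    if question_shape == "date":
        return date.fromisoformat(raw.strip())  # ISO 8601 only
    if question_shape == "names":
        value = json.loads(raw)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            raise ValueError("expected JSON list of strings")
        return value
    return raw.strip()  # name / free text: exact string
```

A malformed model output fails loudly at the contract boundary instead of drifting silently into the final answer.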
The same principle applies to what leaves the system. Many systems need two distinct answer layers:
- an internal reasoning or evidence-rich representation
- a final user-facing or API-facing contract
That split is healthy. The mistake is to collapse them.
Provenance: minimal and complete
A citation stack can fail in two opposite ways:
- it can cite too much
- it can cite too little
Most teams notice the first failure because it looks noisy. In legal systems, the second failure is often worse.
If the answer depends on two factual atoms that live on different pages, you need both pages. Not one "best" page chosen for neatness.
That sounds trivial. It is not.
In practice, a legal answer item may contain several support slots:
- the title of the instrument
- the enactment date
- the effective date
- the amended law
- the administration clause
- the common element being compared
If those slots localize to different pages, provenance pruning is not allowed to collapse them into one page just because the answer still sounds plausible.
That is why I prefer item-level and slot-level provenance over sentence-level vibes. It also fits the broader diagnostic direction behind frameworks like RAGChecker: if you want to understand a retrieval-augmented system, you need to score where support is correct, missing, or attached to the wrong evidence unit.
The rule is simple:
Minimal support is good only if it is still complete support.
This also means page-spanning answers need special handling. If the answer starts on one page and continues on the next, both pages belong in the final support set. A lot of legal systems miss this because they optimize for single-page neatness instead of evidentiary continuity.
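Slot-level completeness can be checked mechanically. A sketch with hypothetical slot names, where each slot records the page that supplies its fact:

```python
def complete_support(slots, cited_pages):
    """Slot-level completeness check: every answer slot's source page must
    survive provenance pruning. `slots` maps slot name -> page number;
    returns the pages still missing from the citation set."""
    required = set(slots.values())
    return sorted(required - set(cited_pages))

slots = {
    "instrument_title": 1,
    "enactment_date": 1,
    "effective_date": 2,  # the answer spans two pages
}
# pruning to a single "best" page silently drops page 2
missing = complete_support(slots, cited_pages=[1])
```

If `missing` is non-empty, the answer is not allowed to leave, however plausible it sounds.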
Streaming and telemetry are part of the product
Legal answerers are often built as if latency and telemetry were observability concerns. They are not. They are product behavior.
If your first token arrives late, the system feels hesitant.
If your telemetry is incomplete, you cannot explain failures.
If your streaming path and your final-answer path diverge, you create a shadow system that behaves differently in public than it does in traces.
The production pattern I trust most is:
- stream as early as the answer contract safely allows
- keep final answer canonical
- emit stage timings, token counts, provider identity, and retrieved/used sources
- never buffer the whole answer just to feel clean
That does not mean "stream recklessly." It means streaming should be designed together with:
- answer-type routing
- provenance
- verification boundaries
- failure reporting
The system should be able to answer a very boring but very important question:
Why did we think this answer was allowed to leave the system?
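One way to make that question answerable is a trace object bound to the final answer. Method and field names here are illustrative, not a real telemetry schema:

```python
import time

class AnswerTrace:
    """Telemetry bound to the final answer: stage timings plus the sources
    that were retrieved versus actually used."""
    def __init__(self):
        self.t0 = time.monotonic()
        self.stages = {}
        self.first_token_ms = None

    def mark(self, stage):
        self.stages[stage] = round((time.monotonic() - self.t0) * 1000, 1)

    def mark_first_token(self):
        if self.first_token_ms is None:  # only the first call counts
            self.first_token_ms = round((time.monotonic() - self.t0) * 1000, 1)

    def finalize(self, answer, retrieved, used):
        return {
            "answer": answer,
            "first_token_ms": self.first_token_ms,
            "stages_ms": self.stages,
            "retrieved_pages": retrieved,
            "used_pages": used,  # must be a subset of retrieved_pages
        }

trace = AnswerTrace()
trace.mark("retrieval")
trace.mark_first_token()
trace.mark("generation")
event = trace.finalize("2020-07-01", retrieved=[3, 4, 9], used=[3, 4])
```

Because the event is emitted with the canonical final answer, the streaming path and the traced path cannot silently diverge.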
Evals: separate answer quality from grounding quality
A lot of teams still use one headline score and call it evaluation.
That is not enough.
For legal answering systems, I want separate signals for:
- answer correctness
- grounding recall
- wrong-document rate
- wrong-family rate
- orphan-page rate
- format compliance
- latency
And I want one more distinction that becomes critical as the system matures:
- trusted benchmark tier
- monitoring tier
This sounds bureaucratic until you have lived through mislabeled gold pages, inherited regression seeds, or eval cases that are useful as monitors but not honest hard gates.
The practical lesson is simple:
- use a small, audited, trusted tier for hard acceptance
- keep the wider, noisier tier for drift monitoring and triage
I also strongly prefer evaluation that can diagnose where the system is failing: source-family selection, page survival, answer formatting, or support completeness. A judge score is useful signal. It is not robustness.
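Those separated diagnostics are cheap to compute once cases carry gold and predicted document IDs and page sets. A sketch over a toy case set, with field names of my own:

```python
def diagnose(cases):
    """Separate grounding diagnostics from the headline answer score.
    Each case carries gold/predicted doc IDs and page lists."""
    n = len(cases)
    wrong_doc = sum(c["pred_doc"] != c["gold_doc"] for c in cases) / n
    grounding_recall = sum(
        len(set(c["pred_pages"]) & set(c["gold_pages"])) / len(c["gold_pages"])
        for c in cases
    ) / n
    # orphan pages: cited pages that support nothing in the gold set
    orphan = sum(
        len(set(c["pred_pages"]) - set(c["gold_pages"])) / max(len(c["pred_pages"]), 1)
        for c in cases
    ) / n
    return {"wrong_doc_rate": wrong_doc,
            "grounding_recall": grounding_recall,
            "orphan_page_rate": orphan}

cases = [
    {"gold_doc": "L1", "pred_doc": "L1", "gold_pages": [3, 4], "pred_pages": [3]},
    {"gold_doc": "L1", "pred_doc": "L2", "gold_pages": [7],    "pred_pages": [9]},
]
report = diagnose(cases)
```

A single headline score would average these three failure modes into one number; separating them tells you whether to fix retrieval, family selection, or pruning.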
Your eval harness is only as good as your test distribution. I learned this the hard way.
If I were shipping this tomorrow
If I were building a legal answerer today, I'd default to something close to this:
| Layer | Practical default |
|---|---|
| parsing | robust PDF extraction with OCR fallback |
| chunking | structure-aware chunks with section path preserved |
| retrieval identity | doc title, doc family, doc summary, page number |
| retrieval | hybrid dense + lexical |
| reranking | shortlist by document family, then rerank by clause relevance |
| answering | typed contracts for strict questions, concise free text for analytical questions |
| provenance | used pages, not visible pages |
| output formatting | separate submission formatting from internal reasoning representation |
| evals | trusted hard-gate set + broader drift monitor |
Notice what is not on that list:
- giant prompt pyramids
- gratuitous multi-agent loops (there are cases where coordinated multi-agent systems are warranted, more on that below)
- broad context stuffing
- blind trust in a single frontier model
The more mature these systems become, the less magical they look.
If I had to build a strong legal answering system from scratch again, I'd do these in order:
- ingestion with page identity and OCR fallback
- structure-aware chunking with document identity preserved
- hybrid retrieval with document-family sanity checks
- typed answer contracts for strict question classes
- page-level provenance with complete support coverage
- a small trusted benchmark before broad optimization
- streaming plus telemetry that explain every stage
That sequence matters.
Most weak systems do the reverse:
- pick a model
- write prompts
- hope retrieval is fine
- add evaluation later
That ordering typically produces systems that work on demos but degrade quickly under real queries.
For legal AI, the standard should be boringly high.
Not:
- "The model sounds smart."
But:
- "The system found the right source."
- "It kept the right page."
- "It used the smallest support set that still covers the whole claim."
- "It can tell me why it answered that way."
- "It knows when to abstain."
The emphasis shifts from context size and model eloquence to evidence quality and answer contracts.
The principles above are design-time constraints. The next section is about what happens when you apply them under a real deadline, with a real scoring formula, and a real team. Some of the team were AI agents. The specifics come from one competition sprint. The failure modes and surprises do not.
What actually happened: a 13-day sprint
I entered the Agentic AI RAG Challenge to test these principles against a real benchmark. 300+ DIFC legal PDFs. 900 questions. Multiplicative scoring formula. 13 days. This system was built during active conflict in Israel, which imposed unpredictable constraints on development time and drove the engineering philosophy toward resilience and efficiency over complexity. What follows is a condensed account of what worked, what didn't, and what surprised me.
The formula
Every strategic choice traces back to this equation:
Total = (0.7 * Det + 0.3 * Asst) * G * T * F

| Factor | What it measures | Leverage (per 1pp) |
|---|---|---|
| G | page-level grounding (F-beta, beta=2.5) | +0.93pp total |
| Det | exact-match accuracy on typed answers | +0.72pp total |
| Asst | LLM-judged free-text quality | +0.31pp total |
| T | telemetry schema compliance | multiplicative gate |
| F | time-to-first-token coefficient (0.85–1.05) | multiplicative gate |
One percentage point of grounding is worth 3x one percentage point of free-text quality. The implication was immediate: protect the multipliers at all costs, then push answer quality higher inside that safety envelope.
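To make the leverage concrete, here is the formula evaluated at a plausible operating point (the input values are invented, not competition numbers):

```python
def total(det, asst, g, t, f):
    """The competition formula as stated above."""
    return (0.7 * det + 0.3 * asst) * g * t * f

base = total(det=0.80, asst=0.70, g=0.90, t=1.0, f=1.0)
# +1pp of grounding moves the total roughly 3x more than +1pp of free-text quality
dg = total(0.80, 0.70, 0.91, 1.0, 1.0) - base  # grounding leverage
da = total(0.80, 0.71, 0.90, 1.0, 1.0) - base  # free-text leverage
```

At this operating point `dg` is about 0.0077 and `da` about 0.0027, which is where the roughly 3x ratio comes from.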
Act 1: The ablation graveyard
Most of what you read in RAG tutorials will actively hurt you in a legal domain. I ran 16 serious experiments across 13 days. The rejection rate for retrieval-expansion experiments was 100 percent. Here are the four that taught me the most.
RAG Fusion was the worst regression: -7.8pp grounding, +246ms. The technique rewrites a query into three to five paraphrased variants and merges their results. A query like "What is the Date of Issue of CFI 022/2025?" became "provisions related to CFI 022/2025" and "date associated with case filing CFI 022." The merge promoted tangentially related pages over precisely correct ones. Query diversity adds noise when legal queries are already optimally specific.
HyDE generated a hypothetical answer document and used its embedding to retrieve similar real documents. The LLM produced hypothetical answers referencing plausible but non-existent provisions, like "Article 47(2) of the Regulatory Law". Those embedded into vector neighborhoods far from the actual enforcement decisions. Result: -0.65pp grounding, +560ms. Hallucinated legal context poisons retrieval.
Step-back rewriting abstracted questions into broader forms. "What does Article 16(1)(c) of the Employment Law say about payroll deductions?" became "What are the payroll deduction rules in DIFC?" This lost the exact article reference that anchors retrieval and added 1,542ms of latency. Abstraction loses the precision legal queries require.
BM25 hybrid retrieval added sparse keyword search alongside dense embeddings. Zero grounding gain, +260ms. The domain-tuned dense encoder already captured the lexical signal. Keyword signal is not additive with a domain-tuned dense encoder.
Full ablation table with all 16 experiments in the technical supplement below.
The pattern is consistent: techniques designed for ambiguous general-domain queries do not transfer to precise legal text.
The climax of Act 1
The single largest improvement to total score came from prompt caching. I restructured system prompts to exceed OpenAI's 1,024-token caching threshold (as of March 2026). Free-text TTFT dropped from 5,086ms to 1,931ms. The F coefficient jumped from 0.973 to 1.006: +3.0pp of total score from a change that touched zero answer logic and zero retrieval code. Not better retrieval. Faster infrastructure.
Combined with early token emission (moving mark_first_token() to fire before the grounding sidecar finished), the final F coefficient reached 1.032. That's +3.19% of total score, earned purely from when I stopped the clock.
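The caching win comes purely from message ordering: keep the long, unchanging system prompt byte-identical at the front of every request so a provider that caches by prefix can reuse it, and put the per-request parts last. A sketch with an invented system prompt and question:

```python
def build_messages(static_system_prompt, question, evidence):
    """Prompt-caching-friendly ordering: the static prefix comes first and
    never changes between requests; evidence and question come last."""
    return [
        {"role": "system", "content": static_system_prompt},  # cacheable prefix
        {"role": "user",
         "content": f"Evidence:\n{evidence}\n\nQuestion: {question}"},
    ]

# a long static prompt, padded past a prefix-caching threshold (illustrative)
msgs = build_messages(
    "You are a DIFC legal answerer. " * 200,
    "What is the Date of Issue of CFI 022/2025?",
    "[page 3] Date of Issue: 12 June 2025 ...",
)
```

Anything dynamic placed before the static block (a timestamp, a request ID) breaks the prefix match and forfeits the cache hit.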
I was confident. Internal evaluation showed near-perfect scores.
Act 2: The gap
My internal grounding proxy showed G=0.9956. Telemetry was perfect at T=1.000. Speed was optimized at F=1.032. I estimated a total score around 0.92.
The platform returned 0.48.
The gap was not a bug. It was a distribution shift. The warmup set had 100 questions:
| Type | Warmup | Private | Multiplier |
|---|---|---|---|
| boolean | 32 | 193 | 6.0x |
| free_text | 30 | 270 | 9.0x |
| date | 1 | 93 | 93.0x |
| name | 15 | 95 | 6.3x |
| names | 5 | 90 | 18.0x |
| number | 17 | 159 | 9.4x |
Date questions went from 1 to 93. Names went from 5 to 90. And 279 questions referenced document types that appeared zero times in the warmup set. My eval harness was tuned to a distribution that didn't represent the real test.
The grounding proxy was the other failure. It measured whether cited pages existed in the corpus, not whether they were the right pages. An internal 0.9956 masked a platform 0.589. The proxy was answering the wrong question.
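For contrast, a proxy that answers the right question scores cited pages against gold pages. This is my reconstruction of a page-level F-beta matching the beta=2.5 stated in the formula table, not the competition's exact scorer:

```python
def fbeta_pages(pred_pages, gold_pages, beta=2.5):
    """Page-level grounding as F-beta (beta=2.5 weights recall heavily),
    scored against gold pages. The failed proxy only checked that cited
    pages existed in the corpus, never that they matched gold."""
    pred, gold = set(pred_pages), set(gold_pages)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# both cited pages exist in the corpus (the old proxy passes),
# but one gold page is missed and one citation is an orphan
score = fbeta_pages(pred_pages=[3, 9], gold_pages=[3, 4])
```

An existence check would score this citation set perfectly; the gold-referenced metric does not.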
The full distribution shift analysis is in the technical supplement.
The multi-agent sprint
The multi-agent development process (12 Claude Code agents working in parallel on a single laptop, closing 737 tickets in 47 hours) eventually grew into Bernstein, an open-source orchestrator. The competition pipeline itself is open-source as Shafi, named after al-Shafi'i (767–820), the father of Islamic legal methodology. A fitting reference for a system that reasons about law.
What held up
The techniques that survived the distribution shift didn't depend on the training distribution. Prompt caching delivered the same TTFT improvement regardless of question type. Metadata lookups worked for any question referencing a case number, whether it appeared in the warmup set or not. Type-specific routing generalized because it keyed on question structure, not question content.
The techniques that broke were tuned to the warmup distribution: grounding thresholds calibrated on 100 questions, answer patterns overfit to a handful of document types, a proxy metric that never tested against ground truth. Infrastructure beat sophistication. Domain engineering beat generic RAG.
For the full ablation study, score progression charts, distribution shift analysis, and per-type performance breakdown, see the research deep-dive in the second tab above.
What I would do differently
I'd invest in eval distribution analysis on day 1. The warmup set had 1 date question. The private set had 93. That's a 93x multiplier on a type I barely tested. Comparing the warmup type distribution against reasonable expectations for a legal corpus (how many enforcement decisions, how many regulations, how many practice directions) would have revealed the coverage gap immediately. I didn't look at the distribution until after the final score came back. That analysis takes 30 minutes and would have changed the entire testing strategy.
I'd skip the retrieval experiments entirely. With a domain-tuned dense encoder already achieving 90%+ page recall, retrieval was near-ceiling. Every hour spent on BM25, RAG Fusion, HyDE, and step-back rewriting was wasted. Six retrieval-expansion experiments, all rejected, all consuming time I didn't have. The returns were in answer quality and grounding calibration, not in retrieval. I'd redirect that effort toward free-text answer quality, which turned out to be the largest scoring gap.
I'd build the multi-agent system from the start. The 1.78× measured throughput from 12 parallel agents was real. 737 tickets in 47 hours. But spinning up the coordination infrastructure cold on day 10, under deadline pressure, meant the first 8 hours were spent on wakeup protocols, bulletin boards, and directive files instead of on the actual problem. Starting the multi-agent system on day 3, with module boundaries already in place, would have amortized that coordination overhead across the full sprint and reduced the merge conflicts that plagued the final 48 hours.
I'd measure TTFT from day 1. The F coefficient was the cheapest multiplier in the formula. Prompt caching alone delivered +3.0pp of total score. Early token emission added another +2.0pp. Combined, that's more than any retrieval or answer-quality improvement I found. But I didn't focus on TTFT until submission v6 on day 3. The first two days of submissions ran with F below 1.0, leaving free points on the table. In a multiplicative formula, the cheapest multiplier is always the first one worth optimizing.
Further reading
- Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
- LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain
- RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
- OpenAI: Evaluation best practices
- Anthropic: Demystifying evals for AI agents