Alex Chernysh · Agentic behaviorist · Tel Aviv

How to Build Legal Answering Systems That Can Be Trusted

A practical blueprint for legal QA, shaped in part by work around the Agentic RAG Legal Challenge: document identity, hybrid retrieval, structured answers, page-level grounding, telemetry, and evals.

March 10, 2026 · 22 min read
RAG · Legal · Reliability

The fastest way to ship a dangerous legal assistant is to optimise fluency before evidence. The real work is not making the model sound persuasive. It is making sure the answer is anchored to the right document, the right page, and the smallest defensible set of supporting facts.

Challenge context

This piece is informed by work on the Agentic RAG Legal Challenge (38th of 356 registered participants), where legal QA systems are judged as full pipelines rather than as standalone prompts. The competition pipeline is open-source as Shafi, named after al-Shafi'i (767–820), the father of Islamic legal methodology.

Hard truth

If a legal QA system retrieves the wrong family member of the right law, or the right clause from the wrong document, the generator is already in trouble. Most failures that look like "reasoning problems" are evidence-discipline problems in disguise. Downstream generation gets blamed for what are usually retrieval failures.

What trustworthy legal answerers need

  • structure-aware ingestion, not flat PDF text dumping
  • retrieval that preserves document identity all the way to the answer
  • typed answer contracts instead of one generic free-form response style
  • provenance that is both minimal and complete
  • evals that separate answer quality, grounding quality, and latency
Reference Architecture

01 Ingestion

Keep document identity intact

  • parse PDFs with OCR fallback
  • chunk by structure, not by raw size
  • preserve title, document type, section path, and page number

02 Retrieval

Shortlist the right document family

  • hybrid retrieval: dense plus lexical
  • aggregate by document or law family
  • rerank clauses inside the shortlist, not across the whole corpus

03 Answering

Answer under a strict contract

  • route by question shape
  • keep booleans, dates, names, and comparisons in different answer modes
  • send only the smallest defensible evidence set to generation

04 Provenance and checks

Do not let the answer leave unaccompanied

  • attach page-level provenance
  • keep telemetry bound to the final answer
  • separate answer quality, grounding quality, and latency in evals
A legal answering system is easier to read as four narrow layers than as one long pipe. The work is preserving document identity, narrowing the evidence set, and only then letting the answer leave the system.

The one thing that actually worked

I tried every retrieval trick in the literature on a legal corpus. BM25 tuning, RAG Fusion, HyDE, step-back prompting. None of them moved the needle. Not one. The only change that reliably improved end-to-end answer quality was making the system faster. Prompt caching cut latency from 5,086ms to 1,931ms, and accuracy went up with it.

That was not the only surprise. My internal eval scored the system around 0.92. The platform's held-out eval returned 0.48. A gap of roughly 44 percentage points from distribution shift alone. The questions I had been testing against did not represent the questions the system would actually face.

Those two findings shaped everything that follows. Retrieval engineering matters more than retrieval sophistication. If you are not evaluating against realistic question distributions, you are not evaluating at all.

Wrong evidence, not weak prose

When people say a legal system "hallucinated," they often mean one of four different failures:

  • it retrieved from the wrong document
  • it retrieved the right document but the wrong clause
  • it answered from one supporting page and silently dropped the second page the answer depended on
  • it forced a free-form answer where the task really wanted a typed result

Treating hallucination as a prompt-engineering problem misses most of the failure surface.

In legal work, a confident answer built on the wrong statutory family is worse than a visible refusal. The system feels precise right up until someone checks the source and realizes the clause came from a neighboring instrument, an older consolidated version, or a similar-looking notice that does not actually govern the question at hand.

The real architectural question is not "How smart is the model?" It is:

Can the system preserve the identity of the governing source all the way from ingestion to the final answer?

Legal corpora are hostile input

They are long, repetitive, structurally similar, and full of high-stakes near-matches. That combination breaks a surprising number of otherwise competent retrieval systems.

Research in 2025 gave this failure mode a useful name: Document-Level Retrieval Mismatch. Markus Reuter and colleagues showed that legal retrievers often select chunks from the wrong source document because boilerplate and formal language are so repetitive across the corpus. Their proposed fix, Summary-Augmented Chunking, is attractive precisely because it is simple: inject document-level identity back into each chunk before retrieval, instead of pretending local chunk text is enough on its own.

Research signal

The practical lesson from the DRM/SAC work is not "add more summarization everywhere." It is "do not let chunks forget what document they belong to." In legal search, global document identity is not decoration. It is part of the retrieval signal.

That has three design consequences.

Preserve page and section identity from day one

You want at least:

  • canonical document ID
  • document title
  • document type
  • section path or heading trail
  • page number
  • raw chunk text

If you lose page identity early, you end up rebuilding provenance later with heuristics and regret.
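
As a concrete sketch, a chunk record carrying that identity might look like this; the field names are illustrative, not the actual pipeline schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LegalChunk:
    # Identity that must survive retrieval, reranking, and provenance.
    doc_id: str        # canonical document ID
    doc_title: str     # full instrument title as published
    doc_type: str      # e.g. "law", "regulation", "enforcement_decision"
    section_path: str  # heading trail, e.g. "Part 4 > Article 16 > (1)(c)"
    page_number: int   # 1-based page in the source PDF
    text: str          # raw chunk text, unmodified
```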

Use OCR as a fallback, not as an afterthought

Legal PDFs are not always born clean. Some are scans, some have image-based signatures or notices, and some bury decisive text inside low-quality page images. OCR should not sit in the online path, but it should absolutely sit in ingestion.
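
A minimal sketch of that fallback, assuming pypdf for the text layer, pdf2image for rasterisation, and pytesseract for OCR; none of these libraries are mandated, they just illustrate the shape:

```python
from pypdf import PdfReader
from pdf2image import convert_from_path   # requires poppler installed
import pytesseract

def extract_page_text(pdf_path: str, page_index: int, min_chars: int = 40) -> str:
    """Prefer the embedded text layer; fall back to OCR for scanned pages."""
    text = (PdfReader(pdf_path).pages[page_index].extract_text() or "").strip()
    if len(text) >= min_chars:
        return text
    # Likely a scan or image-based page: rasterise just this page and OCR it.
    image = convert_from_path(pdf_path, first_page=page_index + 1,
                              last_page=page_index + 1)[0]
    return pytesseract.image_to_string(image)
```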

Keep chunking structural

The wrong default is still "fixed-size chunking and hope." In legal material, structure matters:

  • statutes want article/section-aware chunks
  • case law wants reasoning/facts/holding boundaries
  • contracts want clause and definition boundaries

Flat chunking works until it splits the definition from the obligation it governs.
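
A minimal sketch of structural splitting for statutes, assuming articles are introduced by lines like "Article 16"; the regex is a guess about the corpus formatting, and case law and contracts would each need their own boundary rules:

```python
import re

# Assumed heading pattern: lines beginning with "Article 16" or "ARTICLE 16".
ARTICLE_RE = re.compile(r"(?m)^(?=ARTICLE\s+\d+|Article\s+\d+)")

def chunk_statute(page_text: str) -> list[str]:
    """Split on article boundaries so a definition stays with the obligation it governs."""
    parts = [p.strip() for p in ARTICLE_RE.split(page_text) if p.strip()]
    return parts or [page_text.strip()]
```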

Retrieve the right page

The most useful retrieval stack for legal QA is still hybrid:

  • dense retrieval for semantic similarity
  • lexical retrieval for exact language, law numbers, article numbers, and party names

But hybrid search alone is not enough. The more important design choice is what happens after first-stage retrieval.

I like this sequence (a code sketch follows the list):

  1. retrieve wide enough to preserve recall
  2. aggregate candidates by document or law family
  3. apply a document-consistency sanity layer
  4. rerank within that shortlist
  5. send only the smallest defensible evidence set to generation
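
Here is the shortlist-shaping part of that sequence as a sketch; `family_id` and the pluggable `rerank` scorer are assumptions, not the competition pipeline's API:

```python
from collections import defaultdict

def shortlist_by_family(candidates, rerank, top_families=3, per_family=4):
    """candidates: list of (chunk, score) pairs from a wide hybrid first stage.
    Aggregate by document family first, then rerank only inside the shortlist."""
    by_family = defaultdict(list)
    for chunk, score in candidates:
        by_family[chunk.family_id].append((chunk, score))

    # Rank families by their best first-stage hit, not by raw chunk count.
    families = sorted(by_family, key=lambda f: max(s for _, s in by_family[f]),
                      reverse=True)[:top_families]

    shortlist = [c for f in families for c, _ in by_family[f]]
    # Rerank clauses inside the shortlist only; `rerank` is any per-chunk scorer.
    return sorted(shortlist, key=rerank, reverse=True)[:top_families * per_family]
```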

This is where many systems quietly fail. They retrieve the right page somewhere in the top set, then let it die during shortlist shaping because a cousin document looks semantically similar enough.

That cousin is often the real enemy in legal retrieval:

  • the consolidated version of the same law
  • an amendment law
  • an enactment notice
  • a related schedule
  • a neighboring regulation with nearly identical phrasing

The system should not treat those as interchangeable. It should understand them as a document family and preserve the correct family member depending on the question.

For example:

  • an effective-date question may need the law body and the enactment notice
  • an administration question may need the canonical law page, not the consolidated surrogate
  • a comparison question may need one page from each named law, not the most semantically similar two pages in the corpus

This is why expanding context size does not solve the problem. It raises recall and noise at the same time. In legal QA, the goal is not maximal context. It is correct support-family survival.

Stop forcing one answer style onto every question

One of the simplest ways to improve a legal answerer is to stop pretending every question wants a paragraph.

Some legal questions are naturally free text. Many are not.

There is a world of difference between:

  • "What is the date?"
  • "Who are the claimants?"
  • "Does Article X make this restriction effective?"
  • "Compare how two laws treat the same concept."

These should not be routed through the same answer contract.

One generic paragraph

Every question is pushed through the same free-form answer style.

  • verification becomes fuzzy
  • format compliance drifts
  • the model invents unnecessary prose

Typed answer contracts

The answer format follows the question shape.

  • booleans stay boolean
  • dates stay dates
  • analytical comparisons stay short and bounded

Here is the pattern I trust most:

Question shape           Better answer contract
boolean                  JSON true / false or explicit abstention
number                   JSON number
date                     ISO date
name                     exact string
names                    list of strings
analytical comparison    short free text with explicit support boundaries

This matters for two reasons.

First, typed answers are easier to verify.

Second, they reduce the number of ways the model can be "creative" when the task did not actually call for creativity.
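
A minimal sketch of contract routing, using Pydantic-style models purely for illustration; the type names and question-type keys are assumptions, not the competition code:

```python
from datetime import date
from pydantic import BaseModel

class BooleanAnswer(BaseModel):
    value: bool | None = None   # None means explicit abstention

class DateAnswer(BaseModel):
    value: date                 # serialises to an ISO 8601 date

class NamesAnswer(BaseModel):
    value: list[str]

class ComparisonAnswer(BaseModel):
    value: str                  # short free text
    supported_by: list[int]     # page numbers that bound the claim

CONTRACTS = {"boolean": BooleanAnswer, "date": DateAnswer,
             "names": NamesAnswer, "comparison": ComparisonAnswer}

def parse_answer(question_type: str, raw_json: str) -> BaseModel:
    """Validate the model output against the contract for this question shape."""
    return CONTRACTS[question_type].model_validate_json(raw_json)
```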

The same principle applies to what leaves the system. Many systems need two distinct answer layers:

  1. an internal reasoning or evidence-rich representation
  2. a final user-facing or API-facing contract

That split is healthy. The mistake is to collapse them.

Provenance: minimal and complete

A citation stack can fail in two opposite ways:

  • it can cite too much
  • it can cite too little

Most teams notice the first failure because it looks noisy. In legal systems, the second failure is often worse.

If the answer depends on two factual atoms that live on different pages, you need both pages. Not one "best" page chosen for neatness.

That sounds trivial. It is not.

In practice, a legal answer item may contain several support slots:

  • the title of the instrument
  • the enactment date
  • the effective date
  • the amended law
  • the administration clause
  • the common element being compared

If those slots localize to different pages, provenance pruning is not allowed to collapse them into one page just because the answer still sounds plausible.

That is why I prefer item-level and slot-level provenance over sentence-level vibes. It also fits the broader diagnostic direction behind frameworks like RAGChecker: if you want to understand a retrieval-augmented system, you need to score where support is correct, missing, or attached to the wrong evidence unit.

The rule is simple:

Minimal support is good only if it is still complete support.

This also means page-spanning answers need special handling. If the answer starts on one page and continues on the next, both pages belong in the final support set. A lot of legal systems miss this because they optimize for single-page neatness instead of evidentiary continuity.
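
One way to make that concrete is to let pruning finish only when every support slot still maps to a kept page; a sketch, with slot names as assumptions:

```python
def prune_support(slot_pages: dict[str, set[int]]) -> set[int]:
    """slot_pages maps each factual slot (e.g. 'effective_date',
    'administration_clause') to the pages that evidence it. Keep a small page
    set that still covers every slot; never drop a slot's only page just to
    make the citation list shorter."""
    kept: set[int] = set()
    for slot, pages in slot_pages.items():
        if not pages:
            raise ValueError(f"slot '{slot}' has no supporting page; abstain or re-retrieve")
        if not (pages & kept):      # slot not yet covered by an already-kept page
            kept.add(min(pages))    # greedy choice: reuse pages where possible
    return kept
```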

Main design bet

Optimize for minimal justified support, not minimal citations at any cost. In high-stakes domains, missing one required support page is usually worse than carrying one extra clearly related page.

Streaming and telemetry are part of the product

Legal answerers are often built as if latency and telemetry were observability concerns. They are not. They are product behavior.

If your first token arrives late, the system feels hesitant.

If your telemetry is incomplete, you cannot explain failures.

If your streaming path and your final-answer path diverge, you create a shadow system that behaves differently in public than it does in traces.

The production pattern I trust most is:

  • stream as early as the answer contract safely allows
  • keep final answer canonical
  • emit stage timings, token counts, provider identity, and retrieved/used sources
  • never buffer the whole answer just to feel clean

That does not mean "stream recklessly." It means streaming should be designed together with:

  • answer-type routing
  • provenance
  • verification boundaries
  • failure reporting

The system should be able to answer a very boring but very important question:

Why did we think this answer was allowed to leave the system?
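
A sketch of binding telemetry to the final answer itself, so that question always has an artifact behind it; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerTelemetry:
    question_id: str
    provider: str                                   # which model/provider answered
    stage_ms: dict[str, float] = field(default_factory=dict)   # retrieval, rerank, generate...
    prompt_tokens: int = 0
    completion_tokens: int = 0
    retrieved_pages: list[int] = field(default_factory=list)   # everything retrieval saw
    used_pages: list[int] = field(default_factory=list)        # what the answer actually cites
    contract: str = "free_text"                     # which answer contract was applied
    first_token_ms: float | None = None             # TTFT, measured not estimated
```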

Evals: separate answer quality from grounding quality

A lot of teams still use one headline score and call it evaluation.

That is not enough.

For legal answering systems, I want separate signals for:

  • answer correctness
  • grounding recall
  • wrong-document rate
  • wrong-family rate
  • orphan-page rate
  • format compliance
  • latency

And I want one more distinction that becomes critical as the system matures:

  • trusted benchmark tier
  • monitoring tier

This sounds bureaucratic until you have lived through mislabeled gold pages, inherited regression seeds, or eval cases that are useful as monitors but not honest hard gates.

The practical lesson is simple:

  • use a small, audited, trusted tier for hard acceptance
  • keep the wider, noisier tier for drift monitoring and triage

I also strongly prefer evaluation that can diagnose where the system is failing: source-family selection, page survival, answer formatting, or support completeness. A judge score is useful signal. It is not robustness.
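
A per-question record that keeps those signals apart might look like the sketch below; the fields and the 0.9 gate threshold are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question_id: str
    tier: str                  # "trusted" (hard gate) or "monitor" (drift/triage)
    answer_correct: bool
    format_compliant: bool
    grounding_recall: float    # fraction of required gold pages actually cited
    wrong_document: bool       # cited pages from a document that does not govern
    wrong_family: bool         # right law family missed entirely
    orphan_pages: int          # cited pages that support nothing in the answer
    latency_ms: float

def hard_gate(records: list[EvalRecord]) -> bool:
    """Only the trusted tier gates a release; the monitor tier is for triage."""
    trusted = [r for r in records if r.tier == "trusted"]
    return all(r.answer_correct and r.grounding_recall >= 0.9 for r in trusted)
```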

Your eval harness is only as good as your test distribution. I learned this the hard way.

If I were shipping this tomorrow

If I were building a legal answerer today, I'd default to something close to this:

Layer                 Practical default
parsing               robust PDF extraction with OCR fallback
chunking              structure-aware chunks with section path preserved
retrieval identity    doc title, doc family, doc summary, page number
retrieval             hybrid dense + lexical
reranking             shortlist by document family, then rerank by clause relevance
answering             typed contracts for strict questions, concise free text for analytical questions
provenance            used pages, not visible pages
output formatting     separate submission formatting from internal reasoning representation
evals                 trusted hard-gate set + broader drift monitor

Notice what is not on that list:

  • giant prompt pyramids
  • gratuitous multi-agent loops (there are cases where coordinated multi-agent systems are warranted, more on that below)
  • broad context stuffing
  • blind trust in a single frontier model

The more mature these systems become, the less magical they look.

Build order

If I had to build a strong legal answering system from scratch again, I'd do these in order:

  1. ingestion with page identity and OCR fallback
  2. structure-aware chunking with document identity preserved
  3. hybrid retrieval with document-family sanity checks
  4. typed answer contracts for strict question classes
  5. page-level provenance with complete support coverage
  6. a small trusted benchmark before broad optimization
  7. streaming plus telemetry that explain every stage

That sequence matters.

Most weak systems do the reverse:

  1. pick a model
  2. write prompts
  3. hope retrieval is fine
  4. add evaluation later

That ordering typically produces systems that work on demos but degrade quickly under real queries.

Operating standard

For legal AI, the standard should be boringly high.

Not:

  • "The model sounds smart."

But:

  • "The system found the right source."
  • "It kept the right page."
  • "It used the smallest support set that still covers the whole claim."
  • "It can tell me why it answered that way."
  • "It knows when to abstain."

The emphasis shifts from context size and model eloquence to evidence quality and answer contracts.

The principles above are design-time constraints. The next section is about what happens when you apply them under a real deadline, with a real scoring formula, and a real team. Some of the team were AI agents. The specifics come from one competition sprint. The failure modes and surprises do not.

Competition deep-dive

Competition context

Everything below comes from one team's experience in the Agentic RAG Legal Challenge 2026. The dataset was 300+ DIFC legal PDFs and 900 questions. The scoring formula was multiplicative. The timeline was 13 days. Most of the numbers I cite are internal measurements, not leaderboard scores. I share them because the patterns are more useful than the rankings.

What actually happened: a 13-day sprint

I entered the Agentic AI RAG Challenge to test these principles against a real benchmark. 300+ DIFC legal PDFs. 900 questions. Multiplicative scoring formula. 13 days. This system was built during active conflict in Israel, which imposed unpredictable constraints on development time and drove the engineering philosophy toward resilience and efficiency over complexity. What follows is a condensed account of what worked, what didn't, and what surprised me.

The formula

Every strategic choice traces back to this equation:

Total = (0.7 * Det + 0.3 * Asst) * G * T * F
Factor    What it measures                               Leverage (per 1pp)
G         page-level grounding (F-beta, beta=2.5)        +0.93pp total
Det       exact-match accuracy on typed answers          +0.72pp total
Asst      LLM-judged free-text quality                   +0.31pp total
T         telemetry schema compliance                    multiplicative gate
F         time-to-first-token coefficient (0.85–1.05)    multiplicative gate

One percentage point of grounding is worth 3x one percentage point of free-text quality. The implication was immediate: protect the multipliers at all costs, then push answer quality higher inside that safety envelope.
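
The leverage figures depend on the operating point, so it is worth being able to recompute them; a small sketch, with an assumed baseline rather than the actual leaderboard state:

```python
def total(det: float, asst: float, g: float, t: float, f: float) -> float:
    return (0.7 * det + 0.3 * asst) * g * t * f

def leverage(base: dict, factor: str, bump: float = 0.01) -> float:
    """Finite difference: how much total score moves per +1pp on one factor."""
    bumped = dict(base, **{factor: base[factor] + bump})
    return total(**bumped) - total(**base)

# Assumed operating point, not the real submission scores.
base = dict(det=0.95, asst=0.80, g=0.96, t=1.00, f=1.03)
for factor in ("g", "det", "asst"):
    print(f"+1pp {factor}: {leverage(base, factor) * 100:+.2f}pp total")
```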

Act 1: The ablation graveyard

Most of what you read in RAG tutorials will actively hurt you in a legal domain. I ran 16 serious experiments across 13 days. The rejection rate for retrieval-expansion experiments was 100 percent. Here are the four that taught me the most.

RAG Fusion was the worst regression: -7.8pp grounding, +246ms. The technique rewrites a query into three to five paraphrased variants and merges their results. A query like "What is the Date of Issue of CFI 022/2025?" became "provisions related to CFI 022/2025" and "date associated with case filing CFI 022." The merge promoted tangentially related pages over precisely correct ones. Query diversity adds noise when legal queries are already optimally specific.

HyDE generated a hypothetical answer document and used its embedding to retrieve similar real documents. The LLM produced hypothetical answers referencing plausible but non-existent provisions, like "Article 47(2) of the Regulatory Law". Those embedded into vector neighborhoods far from the actual enforcement decisions. Result: -0.65pp grounding, +560ms. Hallucinated legal context poisons retrieval.

Step-back rewriting abstracted questions into broader forms. "What does Article 16(1)(c) of the Employment Law say about payroll deductions?" became "What are the payroll deduction rules in DIFC?" This lost the exact article reference that anchors retrieval and added 1,542ms of latency. Abstraction loses the precision legal queries require.

BM25 hybrid retrieval added sparse keyword search alongside dense embeddings. Zero grounding gain, +260ms. The domain-tuned dense encoder already captured the lexical signal. Keyword signal is not additive with a domain-tuned dense encoder.

Full ablation table with all 16 experiments in the technical supplement below.

The pattern is consistent: techniques designed for ambiguous general-domain queries do not transfer to precise legal text.

The climax of Act 1

The single largest improvement to total score came from prompt caching. I restructured system prompts to exceed OpenAI's 1,024-token caching threshold (as of March 2026). Free-text TTFT dropped from 5,086ms to 1,931ms. The F coefficient jumped from 0.973 to 1.006: +3.0pp of total score from a change that touched zero answer logic and zero retrieval code. Not better retrieval. Faster infrastructure.
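
The change is mostly prompt structure: keep a long, byte-identical prefix first and push anything per-question to the end, so the provider's prefix cache can hit. A sketch with the OpenAI Python client; the model name and prompt file path are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Static, identical on every call, and long enough (>1,024 tokens) to be cacheable.
STATIC_SYSTEM_PROMPT = open("prompts/legal_answering_system.md").read()

def answer(question: str, evidence: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cache-friendly prefix
            {"role": "user", "content": f"Evidence:\n{evidence}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```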

Combined with early token emission (moving mark_first_token() to fire before the grounding sidecar finished), the final F coefficient reached 1.032. That's +3.19% of total score, earned purely from when I stopped the clock.

I was confident. Internal evaluation showed near-perfect scores.

Act 2: The gap

My internal grounding proxy showed G=0.9956. Telemetry was perfect at T=1.000. Speed was optimized at F=1.032. I estimated a total score around 0.92.

The platform returned 0.48.

The gap was not a bug. It was a distribution shift. The warmup set had 100 questions:

Type         Warmup    Private    Multiplier
boolean      32        193        6.0x
free_text    30        270        9.0x
date         1         93         93.0x
name         15        95         6.3x
names        5         90         18.0x
number       17        159        9.4x

Date questions went from 1 to 93. Names went from 5 to 90. And 279 questions referenced document types that appeared zero times in the warmup set. My eval harness was tuned to a distribution that didn't represent the real test.

The grounding proxy was the other failure. It measured whether cited pages existed in the corpus, not whether they were the right pages. An internal 0.9956 masked a platform 0.589. The proxy was answering the wrong question.

The full distribution shift analysis is in the technical supplement.

The multi-agent sprint

The multi-agent development process (12 Claude Code agents working in parallel on a single laptop, closing 737 tickets in 47 hours) eventually grew into Bernstein, an open-source orchestrator. The competition pipeline itself is open-source as Shafi, named, as noted above, after the father of Islamic legal methodology. A fitting reference for a system that reasons about law.

What held up

The techniques that survived the distribution shift didn't depend on the training distribution. Prompt caching delivered the same TTFT improvement regardless of question type. Metadata lookups worked for any question referencing a case number, whether it appeared in the warmup set or not. Type-specific routing generalized because it keyed on question structure, not question content.

The techniques that broke were tuned to the warmup distribution: grounding thresholds calibrated on 100 questions, answer patterns overfit to a handful of document types, a proxy metric that never tested against ground truth. Infrastructure beat sophistication. Domain engineering beat generic RAG.

For the full ablation study, score progression charts, distribution shift analysis, and per-type performance breakdown, see the research deep-dive in the second tab above.

What I would do differently

I'd invest in eval distribution analysis on day 1. The warmup set had 1 date question. The private set had 93. That's a 93x multiplier on a type I barely tested. Comparing the warmup type distribution against reasonable expectations for a legal corpus (how many enforcement decisions, how many regulations, how many practice directions) would have revealed the coverage gap immediately. I didn't look at the distribution until after the final score came back. That analysis takes 30 minutes and would have changed the entire testing strategy.
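
The analysis is small enough to script; a sketch of the check that would have flagged the gap, with the question format and type names as assumptions:

```python
from collections import Counter

def type_distribution(questions: list[dict]) -> dict[str, float]:
    counts = Counter(q["type"] for q in questions)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def coverage_gaps(warmup: list[dict], expected_types: list[str], min_count: int = 5) -> list[str]:
    """Flag question types the warmup set barely exercises."""
    counts = Counter(q["type"] for q in warmup)
    return [t for t in expected_types if counts.get(t, 0) < min_count]

# e.g. coverage_gaps(warmup_questions,
#                    ["boolean", "date", "name", "names", "number", "free_text"])
# would have flagged "date" (1 question) and "names" (5 questions) immediately.
```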

I'd skip the retrieval experiments entirely. With a domain-tuned dense encoder already achieving 90%+ page recall, retrieval was near-ceiling. Every hour spent on BM25, RAG Fusion, HyDE, and step-back rewriting was wasted. Six retrieval-expansion experiments, all rejected, all consuming time I didn't have. The returns were in answer quality and grounding calibration, not in retrieval. I'd redirect that effort toward free-text answer quality, which turned out to be the largest scoring gap.

I'd build the multi-agent system from the start. The 1.78× measured throughput from 12 parallel agents was real. 737 tickets in 47 hours. But spinning up the coordination infrastructure cold on day 10, under deadline pressure, meant the first 8 hours were spent on wakeup protocols, bulletin boards, and directive files instead of on the actual problem. Starting the multi-agent system on day 3, with module boundaries already in place, would have amortized that coordination overhead across the full sprint and reduced the merge conflicts that plagued the final 48 hours.

I'd measure TTFT from day 1. The F coefficient was the cheapest multiplier in the formula. Prompt caching alone delivered +3.0pp of total score. Early token emission added another +2.0pp. Combined, that's more than any retrieval or answer-quality improvement I found. But I didn't focus on TTFT until submission v6 on day 3. The first two days of submissions ran with F below 1.0, leaving free points on the table. In a multiplicative formula, the cheapest multiplier is always the first one worth optimizing.

A cautionary tale about agent permissions

While preparing this article for publication, an AI agent tasked with cleaning the repository ran rm -rf on several gitignored directories. 4.2 GB of development artifacts, agent coordination logs, golden labels, research archives. Not in git history. Not recoverable. Two weeks of development traces, gone in under a second.

The agent had full filesystem permissions because the task seemed simple: "clean up the repo." It understood that gitignored files aren't needed on GitHub. What it didn't consider is that "not needed on GitHub" and "not needed at all" are different things.

If you give an AI agent unrestricted filesystem access, treat it like giving root to an intern on their first day. Use allowlists, not blocklists. Require confirmation for destructive operations. Or at minimum, make sure everything is backed up before the agent starts.

Related reading

  • Most RAG failures start in the documents
  • How to run LLM evals in production

Further reading

  • Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
  • LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain
  • RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
  • OpenAI: Evaluation best practices
  • Anthropic: Demystifying evals for AI agents
