Improving Answer Accuracy with RAG Systems

Executive Summary

A legal tech startup's RAG system answered only 65% of queries correctly, which is unacceptable for legal professionals. By implementing multi-stage retrieval, self-critique, and citation verification, the team reached 94% accuracy and passed 3 law firm pilot programs.

Key Outcomes

▹ 65% → 94% answer accuracy
▹ 0% hallucination rate on verified queries
▹ 3 enterprise law firm contracts secured

Client Situation

Law firms testing the product found too many incorrect citations and hallucinated case law, making it unusable for client work.

Key Challenges

⚠ 65% accuracy meant roughly 1 in 3 answers was wrong
⚠ Hallucinated case citations damaging trust
⚠ Inability to cite specific paragraph numbers

Existing Architecture

Single-stage vector retrieval with naive concatenation of the retrieved chunks, followed by a single LLM call for answer generation.

No verification of retrieved documents
No multi-turn reasoning for complex queries
No citation granularity beyond document level

Solution Design

Multi-stage RAG with HyDE retrieval, self-critique verification, and paragraph-level citations.

Key Decisions

✓ HyDE (Hypothetical Document Embeddings) for better retrieval
✓ Self-critique step verifying answer against retrieved chunks
✓ Paragraph-level citations for legal-grade references

LangChain, Weaviate, GPT-4, Cohere, Legal BERT
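
The HyDE step named above can be sketched as follows. This is a minimal illustration, not the project's implementation: the bag-of-words `embed` function stands in for a real embedding model, `fake_llm` stands in for the LLM that drafts a hypothetical answer, and the three-sentence corpus is invented for the example. The idea is that the query is not embedded directly; a hypothetical answer is generated first, and that answer is what gets matched against the corpus.

```python
import math
from collections import Counter
from typing import Callable, List

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hyde_retrieve(query: str, corpus: List[str],
                  generate_hypothetical: Callable[[str], str],
                  k: int = 1) -> List[str]:
    """Embed a hypothetical answer to the query, then rank documents by it."""
    hypothetical = generate_hypothetical(query)
    q_vec = embed(hypothetical)
    ranked = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)
    return ranked[:k]

# Hypothetical corpus and query, invented for illustration only.
corpus = [
    "the court held that the non-compete clause was unenforceable as overbroad",
    "the lease was terminated for failure to pay rent",
    "the defendant was convicted of burglary",
]
query = "Can my employer stop me working for a rival?"

def fake_llm(q: str) -> str:
    """Stub for the LLM call that drafts a plausible answer."""
    return "the court held the non-compete clause unenforceable because it was overbroad"

hyde_top = hyde_retrieve(query, corpus, fake_llm)
raw_top = hyde_retrieve(query, corpus, lambda q: q)  # baseline: embed the raw query
```

In this toy setup the raw query shares only the word "for" with the corpus and so surfaces the lease document, while the hypothetical answer shares legal vocabulary with the non-compete ruling and surfaces that instead.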

Implementation

Iterative improvement with legal experts scoring 1,000 test queries after each change.
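
The scoring rule used by the legal experts (an answer counts only if it is correct and every citation matches its claim) can be written down as a short function. The sketch below is an assumption about how such scores might be recorded; `ExpertScore` and its fields are hypothetical names, and the sample scores are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExpertScore:
    """One expert judgement for one test query (hypothetical record shape)."""
    answer_correct: bool
    citations_match: List[bool] = field(default_factory=list)

def is_accurate(score: ExpertScore) -> bool:
    """Accurate only if the answer is correct AND every citation matches.

    Note: all([]) is True, so an answer with no citations passes on
    correctness alone. Whether that is desired is an open assumption here.
    """
    return score.answer_correct and all(score.citations_match)

def accuracy(scores: List[ExpertScore]) -> float:
    """Fraction of test queries judged accurate."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(is_accurate(s) for s in scores) / len(scores)

# Invented sample: two of four answers pass both conditions.
sample = [
    ExpertScore(True, [True, True]),    # correct, all citations match
    ExpertScore(True, [True, False]),   # correct, one bad citation
    ExpertScore(False, [True]),         # wrong answer, good citation
    ExpertScore(True, [True]),          # correct, citation matches
]
sample_accuracy = accuracy(sample)
```

Under this rule a single mismatched citation fails the whole answer, which is stricter than scoring the answer text alone.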

Phase 1: Multi-Stage Retrieval
Added HyDE and cross-encoder re-ranking, which improved accuracy to 82%.
Phase 2: Self-Critique
An LLM validates the answer against the retrieved chunks, which reduced hallucinations to near zero.
Phase 3: Citation Granularity
Added paragraph citations and direct quotes for legal validation.
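
One part of the verification in Phases 2 and 3 can be done without a model at all: checking that each direct quote really appears in the paragraph it cites. The sketch below shows that deterministic check only; it is not the LLM critique step itself, and the `Citation` type, the `(doc_id, paragraph)` keying, and the sample text are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Citation:
    """A paragraph-level citation with a direct quote (hypothetical shape)."""
    doc_id: str
    paragraph: int
    quote: str

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()

def failed_citations(citations: List[Citation],
                     paragraphs: Dict[Tuple[str, int], str]) -> List[Citation]:
    """Return citations whose quote is empty, or not found in the cited paragraph."""
    failures = []
    for c in citations:
        source = paragraphs.get((c.doc_id, c.paragraph))
        quote = normalize(c.quote)
        if source is None or not quote or quote not in normalize(source):
            failures.append(c)
    return failures

# Invented paragraph store and citations for illustration.
paragraphs = {
    ("case-A", 12): "The covenant was unenforceable because it lacked\n"
                    "a reasonable geographic limit.",
}
good = Citation("case-A", 12, "lacked a reasonable geographic limit")
wrong_paragraph = Citation("case-A", 13, "lacked a reasonable geographic limit")
hallucinated = Citation("case-B", 3, "the covenant was upheld in full")

failures = failed_citations([good, wrong_paragraph, hallucinated], paragraphs)
```

A check like this catches citations to documents or paragraphs that do not exist, but it cannot tell whether a genuine quote actually supports the claim; that judgement still needs the critique model or a human reviewer.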

Technical Challenges

Self-critique latency

Impact: Self-critique doubled inference time (5 seconds → 10 seconds), which was unacceptable

Resolution: Parallel verification + smaller critique model for speed
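
The parallel part of this resolution might look like the sketch below, which fans per-claim checks out over a thread pool so total latency approaches that of the slowest single check. `critique_claim` is a stub that simulates a model call with a short sleep and a substring test; the real critique model, its prompt, and how the answer was split into claims are not described in the case study.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def critique_claim(claim: str, chunks: List[str]) -> bool:
    """Stub for one critique-model call: is the claim supported by any chunk?"""
    time.sleep(0.05)  # simulated network and inference latency
    return any(claim.lower() in chunk.lower() for chunk in chunks)

def verify_parallel(claims: List[str], chunks: List[str],
                    max_workers: int = 4) -> Dict[str, bool]:
    """Check all claims concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda c: critique_claim(c, chunks), claims))
    return dict(zip(claims, verdicts))

# Invented chunks and claims for illustration.
chunks = ["The clause was held unenforceable.", "Damages were capped at the deposit."]
claims = ["the clause was held unenforceable", "the tenant waived all claims"]
verdicts = verify_parallel(claims, chunks)
```

Threads suit this case because the work is dominated by waiting on model responses rather than by local computation.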

Legal terminology embedding

Impact: Standard embeddings missed legal-specific relationships

Resolution: Fine-tuned Legal BERT on case law corpus

Results

Answer accuracy (legal expert evaluation)
Before: 65%
After: 94%
Improvement: 45% relative increase (29 percentage points)

Hallucination rate
Before: 15%
After: 0.5%
Improvement: 97% reduction

Citation precision
Before: N/A
After: paragraph-level
Improvement: court-admissible
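
The improvement figures above are relative changes measured against the starting value, not differences in percentage points. A quick check of the arithmetic, using only the before and after numbers reported here:

```python
def relative_change(before: float, after: float) -> float:
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before

accuracy_gain = relative_change(65, 94)        # (94 - 65) / 65
hallucination_drop = relative_change(15, 0.5)  # (0.5 - 15) / 15

accuracy_gain_pct = round(accuracy_gain * 100)             # 45
hallucination_drop_pct = round(-hallucination_drop * 100)  # 97
accuracy_gain_points = 94 - 65                             # 29 percentage points
```

So accuracy rose by 29 percentage points, which is about a 45% increase relative to the 65% baseline.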

Lessons Learned

📘 Self-critique reduced hallucinations from 15% to <1%, which is critical for legal use
📘 HyDE retrieval improved recall by 25% for complex queries
📘 Legal experts preferred lower accuracy with citations over higher accuracy without them

What We Would Do Differently

💡 Implement RAGAS evaluation framework from day one
💡 Use DSPy for automated prompt optimization

Role Relevance

RAG engineers designed the verification pipeline that made legal-grade accuracy possible, transforming a toy demo into an enterprise product.

Critical Skills Demonstrated

Multi-stage retrieval, Self-critique pipelines, Citation granularity, Legal domain adaptation

Related Roles

RAG Engineer, LLM Engineer, ML Engineer

Frequently Asked Questions

How do you define answer accuracy for legal queries?

Legal experts scored an answer as accurate only if it correctly answered the question and every citation matched the claim it supported.

What was the toughest query type?

Questions requiring reasoning across multiple cases, which were solved with multi-turn retrieval.
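
Multi-turn retrieval for cross-case questions can be pictured as a loop: retrieve, inspect the evidence for cases that are mentioned but not yet fetched, and query again. The sketch below is one possible shape under stated assumptions; `INDEX`, `retrieve`, and `propose_followup` are toy stand-ins invented for the example (in practice the follow-up query would likely come from an LLM), and nothing here is taken from the project's code.

```python
from typing import Callable, List, Optional

def multi_turn_retrieve(question: str,
                        retrieve: Callable[[str], List[str]],
                        propose_followup: Callable[[str, List[str]], Optional[str]],
                        max_turns: int = 3) -> List[str]:
    """Accumulate evidence over several retrieval turns, up to max_turns."""
    evidence: List[str] = []
    query: Optional[str] = question
    for _ in range(max_turns):
        if query is None:  # nothing further to look up
            break
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        query = propose_followup(question, evidence)
    return evidence

# Toy keyword index, invented for illustration.
INDEX = {
    "case a": "Case A applies the test established in Case B.",
    "case b": "Case B sets out a three-part test.",
}

def retrieve(query: str) -> List[str]:
    """Return the text of every case named in the query."""
    return [text for key, text in INDEX.items() if key in query.lower()]

def propose_followup(question: str, evidence: List[str]) -> Optional[str]:
    """Ask about the first case that the evidence mentions but has not fetched."""
    for key, text in INDEX.items():
        mentioned = any(key in chunk.lower() for chunk in evidence)
        if mentioned and text not in evidence:
            return f"What does {key} hold?"
    return None

evidence = multi_turn_retrieve("What test does Case A apply?", retrieve, propose_followup)
```

A single retrieval pass on this question would return only the Case A paragraph; the second turn is what brings in the Case B paragraph that the answer depends on.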