LAB (Legal Agent Benchmark) is an open-source benchmark Harvey released on May 6, 2026 that measures AI agent performance on long-horizon legal work. The v1 release includes 1,200+ tasks spanning 24 legal practice areas, graded against more than 75,000 expert-written rubric criteria. It's the first major benchmark to focus on multi-day, multi-step legal workflows rather than single-question legal QA, and it's designed to be the legal industry's equivalent of SWE-Bench Verified.

Why did Harvey build LAB?

Harvey is the leading legal AI vendor (used by hundreds of AmLaw 200 firms and corporate legal departments) and has spent years arguing that legal AI evaluation is broken: single-question multiple-choice benchmarks like LegalBench don't reflect actual legal work, which is messy, multi-step, document-heavy, and requires extended reasoning. LAB is Harvey's attempt to set the industry-standard benchmark. It's the same playbook OpenAI ran with HumanEval and SWE-Bench Verified.

What kinds of tasks are in Harvey LAB?

LAB covers 24 legal practice areas including M&A due diligence, contract negotiation, litigation discovery, regulatory analysis, IP filings, and immigration. Tasks are multi-day, multi-document, and require agents to read briefs, draft responses, cite specific case law, request additional documents when needed, and produce work product graded by practicing lawyers. The 75,000+ rubric criteria are written by lawyers in each practice area.

Is Harvey LAB actually open source or just open access?

Open source per Harvey's announcement, with the task suite, evaluation harness, and rubric framework published on GitHub. Some specific corporate documents in the test suite are synthetic (for confidentiality), but the methodology and evaluation code are fully open. Any vendor or buyer can run LAB against their own legal AI agent and compare to published baselines from Harvey Assistant, Co-Counsel (Thomson Reuters), and several Anthropic/OpenAI direct submissions.

Quick Answer

What is Harvey LAB? The Long-Horizon Legal AI Benchmark Explained

Published: May 24, 2026

What is Harvey LAB? (May 2026)

LAB (Legal Agent Benchmark) is an open-source benchmark Harvey released on May 6, 2026 that measures AI agent performance on multi-day, multi-step legal work. It’s the first credible attempt to benchmark legal AI on the work lawyers actually do, rather than single-question legal trivia.

Last verified: May 24, 2026.

TL;DR

What it is: Open-source benchmark for long-horizon legal AI agents.
Released by: Harvey, May 6, 2026.
Scale: 1,200+ tasks across 24 legal practice areas.
Grading: 75,000+ expert-written rubric criteria by practicing lawyers.
Closest analog: SWE-Bench Verified, but for legal work instead of coding.
Why it matters: Legal AI procurement has had no credible benchmark until now.

Why legal AI needed a new benchmark

Legal AI has been a benchmark wasteland for years. The standard evaluations were:

LegalBench (2023) — multiple-choice and short-answer legal reasoning. Useful for measuring model knowledge, useless for measuring whether an agent can do a real legal job.
Bar exam scores — GPT-4 famously passed the bar in 2023. Cool. Doesn’t tell you whether a model can handle an M&A due diligence engagement.
Vendor-specific demos — Harvey, Co-Counsel, Spellbook, and others all showed impressive demos, but each measured itself against its own internal task suite. No apples-to-apples comparison.

Meanwhile, legal AI procurement has exploded — AmLaw 200 firms collectively spent more than $2B on legal AI in 2025. Buyers had no shared way to compare vendors. RFPs asked vendors to demo capabilities, then GCs voted on vibes.

LAB is Harvey’s structural fix.

How LAB actually works

The v1 benchmark, per Harvey’s announcement and the GitHub release:

1. Tasks: 1,200+ multi-step legal workflows across 24 practice areas, including:

M&A due diligence (review data room → produce issues list)
Contract review and negotiation (markup → counterproposal)
Litigation discovery (document review → privilege log → production)
Regulatory analysis (research → memo → client letter)
IP work (prior art search → patent claim drafting)
Immigration (case preparation → filing → response to RFE)
Employment, real estate, tax, bankruptcy, etc.

2. Agent harness: LAB defines a standard agent interface — tool calls for document retrieval, web research, citation lookup, and draft submission. Any agent that implements the interface can be evaluated.

3. Multi-day workflows: Tasks span what Harvey calls “long horizons” — 10-100+ tool calls across multiple sessions, mimicking how real legal work unfolds over days or weeks.

4. Expert rubric grading: Each task has a rubric of 30-100 specific criteria written by lawyers practicing in that area. Examples: “Did the agent cite the controlling case in this jurisdiction? (Y/N)”, “Did the privilege log include all attorney-client communications? (Y/N)”, “Was the issues list ranked by materiality? (Y/N)”. Total: 75,000+ criteria across the v1 task suite.

5. Reporting: Each agent gets a per-practice-area score plus an aggregate, plus token cost and wall-clock time. The “Pareto frontier” view shows the cost-quality tradeoff.
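
Steps 4 and 5 above can be sketched in a few lines. This is a minimal illustration, not LAB's actual harness code: it assumes each rubric criterion is a plain Y/N check with equal weight (the post does not say how criteria are weighted), and the names `rubric_score`, `AgentResult`, and `pareto_frontier` are hypothetical. All numbers in the demo are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    name: str
    score: float     # aggregate score on a 0-100 scale, higher is better
    cost_usd: float  # hypothetical total token cost for the full run


def rubric_score(criteria_met: list[bool]) -> float:
    """Percentage of binary (Y/N) rubric criteria the agent satisfied.

    Assumes equal weighting, which is an assumption of this sketch.
    """
    if not criteria_met:
        raise ValueError("a rubric needs at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Keep agents that no other agent matches or beats on both cost and score."""
    frontier = []
    for r in results:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.score >= r.score
            and (o.cost_usd < r.cost_usd or o.score > r.score)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda r: r.cost_usd)


if __name__ == "__main__":
    # Illustrative values only; not taken from any published LAB run.
    print(rubric_score([True, True, False, True]))
    runs = [
        AgentResult("agent_a", 70.0, 300.0),
        AgentResult("agent_b", 64.0, 100.0),
        AgentResult("agent_c", 60.0, 150.0),  # costs more and scores less than agent_b
    ]
    print([r.name for r in pareto_frontier(runs)])
```

In the demo, `agent_c` drops off the frontier because `agent_b` is both cheaper and higher-scoring. `agent_a` stays on it despite costing the most, since nothing cheaper matches its score.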

Why this is structurally important

LAB does three things no prior legal AI benchmark has done:

1. Open methodology. Anyone can read the rubrics, anyone can re-grade, anyone can dispute a score. That’s the difference between a vendor benchmark and an industry benchmark.

2. Long-horizon workflows. Real legal work is not “answer this question.” It’s “read these 800 documents, find the issues, produce a memo, cite case law, respond to opposing counsel’s positions.” LAB is the first benchmark that measures this end-to-end.

3. Lawyer-graded. The 75,000+ rubric criteria come from practicing lawyers, not from ML researchers. That dramatically increases the benchmark’s credibility with GCs and CLOs who actually buy legal AI.

Published baselines (May 2026)

Harvey’s launch blog reports baselines for the major legal AI agents:

Agent	LAB v1 Aggregate Score	Best practice area	Weakest area
Harvey Assistant (default)	68.3	M&A due diligence (78.1)	Tax (54.2)
Co-Counsel (Thomson Reuters)	61.7	Litigation discovery (72.4)	Immigration (48.9)
Spellbook	54.2	Contract review (71.0)	Regulatory (39.6)
Claude Opus 4.7 (raw, no legal harness)	47.5	Regulatory (59.1)	Litigation discovery (35.2)
GPT-5.5 (raw, no legal harness)	45.8	Tax (54.7)	M&A (33.4)
Practicing lawyer baseline (control)	81.2	M&A (88.4)	Tax (74.1)

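The story: vertical legal AI agents (Harvey, Co-Counsel) outperform raw frontier models by roughly 14-23 points, because the harness, retrieval, and citation tooling matter as much as the model. Practicing lawyers still beat the best agent by about 13 points (81.2 vs. 68.3) and Co-Counsel by about 20; the gap widens to 27 points for Spellbook and to more than 33 for the raw models. Nobody is above human in May 2026.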
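
The gaps are easy to recompute from the aggregate column of the table, which is a quick sanity check on any range quoted for them. The sketch below copies the aggregate scores from the table; the variable and function names are just illustrative.

```python
# Aggregate scores copied from the baseline table above.
scores = {
    "Harvey Assistant": 68.3,
    "Co-Counsel": 61.7,
    "Spellbook": 54.2,
    "Claude Opus 4.7 (raw)": 47.5,
    "GPT-5.5 (raw)": 45.8,
}
LAWYER_BASELINE = 81.2

VERTICAL = ["Harvey Assistant", "Co-Counsel"]
RAW = ["Claude Opus 4.7 (raw)", "GPT-5.5 (raw)"]


def gap(a: float, b: float) -> float:
    """Difference in aggregate points, rounded to one decimal place."""
    return round(a - b, 1)


# How far each agent sits below the practicing-lawyer control.
lawyer_gaps = {name: gap(LAWYER_BASELINE, s) for name, s in scores.items()}

# How far each vertical agent sits above each raw frontier model.
harness_gaps = {
    (v, r): gap(scores[v], scores[r]) for v in VERTICAL for r in RAW
}

for name, g in lawyer_gaps.items():
    print(f"lawyers lead {name} by {g} points")
for (v, r), g in harness_gaps.items():
    print(f"{v} leads {r} by {g} points")
```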

How buyers should use LAB

If you’re a CLO, GC, or law firm IT lead evaluating legal AI in 2026:

1. Stop accepting vendor-specific demos. Ask every vendor for their LAB v1 score, broken down by the practice areas you care about.

2. Run your own tasks through LAB. The eval harness is open — you can add your own firm-specific tasks (with synthetic data for confidentiality) and re-run any vendor’s agent.

3. Compare on cost-adjusted score, not raw score. Harvey Assistant might lead on aggregate but cost 3x more than a competitor that reaches 92% of its score. LAB reports both score and cost.

4. Don’t expect superhuman. Practicing lawyers still beat the best agent by 13 points. Plan for augmentation, not replacement.
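
To make point 3 concrete, here is a rough score-per-dollar comparison. It reads the example as "a competitor at 92% of the leader's score for one third of the cost", which is an interpretation, and the cost figures are hypothetical units rather than anything LAB has published. `score_per_dollar` is an illustrative helper, not part of the LAB harness.

```python
def score_per_dollar(score: float, cost_usd: float) -> float:
    """Benchmark points bought per dollar of run cost."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return score / cost_usd


# Hypothetical inputs: the leader's aggregate score comes from the table above,
# but both costs are invented for the sake of the example.
leader_score, leader_cost = 68.3, 3.0
rival_score, rival_cost = 0.92 * leader_score, 1.0

leader_value = score_per_dollar(leader_score, leader_cost)
rival_value = score_per_dollar(rival_score, rival_cost)

# How many times more score per dollar the cheaper rival delivers.
value_ratio = round(rival_value / leader_value, 2)
print(f"rival delivers {value_ratio}x the score per dollar")
```

Under these assumptions the cheaper agent delivers 0.92 × 3 = 2.76 times the score per dollar. Whether the missing 8% of quality is worth the savings depends on the practice area and the cost of a missed issue.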

How vendors should respond

If you’re shipping a legal AI agent:

Publish your LAB v1 score immediately. Vendors who hide will look like they have something to hide.
Break down by practice area. Aggregate scores are coarse; show buyers where you win and where you don’t.
Set up continuous LAB CI. Run LAB against every release. Track score deltas over time.
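
A continuous LAB check can be as simple as diffing per-practice-area scores between releases and flagging drops beyond a tolerance. The sketch below assumes you already have the scores as plain dictionaries; how they get exported from the harness is not covered here. The function names and the 1-point tolerance are hypothetical choices, and the "current" numbers are invented.

```python
def score_deltas(
    previous: dict[str, float], current: dict[str, float]
) -> dict[str, float]:
    """Per-practice-area score change for areas present in both releases."""
    return {
        area: round(current[area] - previous[area], 1)
        for area in previous
        if area in current
    }


def regressions(deltas: dict[str, float], tolerance: float = 1.0) -> list[str]:
    """Practice areas whose score dropped by more than `tolerance` points."""
    return sorted(area for area, d in deltas.items() if d < -tolerance)


def ci_exit_code(failed: list[str]) -> int:
    """Non-zero exit code when any area regressed, so a CI job can fail on it."""
    return 1 if failed else 0


if __name__ == "__main__":
    # Previous values borrow two cells from the baseline table;
    # the current values are made up for illustration.
    previous = {"M&A due diligence": 78.1, "Tax": 54.2}
    current = {"M&A due diligence": 79.0, "Tax": 51.9}
    deltas = score_deltas(previous, current)
    failed = regressions(deltas)
    print(deltas, failed, ci_exit_code(failed))
```

A tolerance matters because rubric-graded scores from non-deterministic agents will wobble between runs. Without one, the job would fail on noise.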

How LAB compares to other vertical benchmarks

Benchmark	Vertical	Released	Scale
Harvey LAB	Legal	May 6, 2026	1,200+ tasks, 75,000+ rubric criteria
GDPval-AA	General agent	Q2 2026	~500 tasks
STATE-Bench	Memory (any agent)	May 19, 2026	200 tasks (more in repo)
HealthBench (OpenAI)	Healthcare	2025	Multiple categories
SWE-Bench Verified	Coding	2024	500 verified tasks

The pattern: 2026 is the year vertical benchmarks went from research curiosities to procurement requirements. Legal, healthcare, coding, and general-agent benchmarks now exist with credible methodology. Finance and engineering are next.

Caveats

v1 is English-language only. EU/Asia legal markets need v2.
Some test documents are synthetic for confidentiality reasons. Real-document evaluation requires a separate licensed track.
Practice area coverage is US-centric (M&A reflects Delaware law, litigation reflects FRCP, etc.). International expansion is on Harvey’s roadmap.
Harvey runs it. Even with open methodology, the benchmark designer has a structural advantage — Harvey will be first to optimize against the benchmark they designed. Watch for independent verification by Stanford CodeX or similar.

Verdict

Harvey LAB is the legal AI benchmark the industry has needed for three years. It will become a standard vendor marketing claim by Q3 2026 and a procurement requirement by Q4.

The fact that Harvey published it open-source — rather than keeping it as a proprietary advantage — tells you they think they win on the benchmark even when competitors can study it. That’s a strong signal.