
Quick Answer

What is Harvey LAB? The Long-Horizon Legal AI Benchmark Explained


What is Harvey LAB? (May 2026)

LAB (Legal Agent Benchmark) is an open-source benchmark Harvey released on May 6, 2026 that measures AI agent performance on multi-day, multi-step legal work. It’s the first credible attempt to benchmark legal AI on the work lawyers actually do, rather than single-question legal trivia.

Last verified: May 24, 2026.

TL;DR

  • What it is: Open-source benchmark for long-horizon legal AI agents.
  • Released by: Harvey, May 6, 2026.
  • Scale: 1,200+ tasks across 24 legal practice areas.
  • Grading: 75,000+ expert-written rubric criteria by practicing lawyers.
  • Closest analog: SWE-Bench Verified, but for legal work instead of coding.
  • Why it matters: Legal AI procurement has had no credible benchmark until now.

Legal AI has been a benchmark wasteland for years. The standard evaluations were:

  • LegalBench (2023) — multiple-choice and short-answer legal reasoning. Useful for measuring model knowledge, useless for measuring whether an agent can do a real legal job.
  • Bar exam scores — GPT-4 famously passed the bar in 2023. Cool. Doesn’t tell you whether a model can handle an M&A due diligence engagement.
  • Vendor-specific demos — Harvey, Co-Counsel, Spellbook, and others all showed impressive demos, but each measured itself against its own internal task suite. No apples-to-apples comparison.

Meanwhile, legal AI procurement has exploded — AmLaw 200 firms collectively spent more than $2B on legal AI in 2025. Buyers had no shared way to compare vendors. RFPs asked vendors to demo capabilities, then GCs voted on vibes.

LAB is Harvey’s structural fix.

How LAB actually works

The v1 benchmark, per Harvey’s announcement and the GitHub release:

1. Tasks: 1,200+ multi-step legal workflows across 24 practice areas, including:

  • M&A due diligence (review data room → produce issues list)
  • Contract review and negotiation (markup → counterproposal)
  • Litigation discovery (document review → privilege log → production)
  • Regulatory analysis (research → memo → client letter)
  • IP work (prior art search → patent claim drafting)
  • Immigration (case preparation → filing → response to RFE)
  • Employment, real estate, tax, bankruptcy, etc.

2. Agent harness: LAB defines a standard agent interface — tool calls for document retrieval, web research, citation lookup, and draft submission. Any agent that implements the interface can be evaluated.

3. Multi-day workflows: Tasks span what Harvey calls “long horizons” — 10-100+ tool calls across multiple sessions, mimicking how real legal work unfolds over days or weeks.

4. Expert rubric grading: Each task has a rubric of 30-100 specific criteria written by lawyers practicing in that area. Examples: “Did the agent cite the controlling case in this jurisdiction? (Y/N)”, “Did the privilege log include all attorney-client communications? (Y/N)”, “Was the issues list ranked by materiality? (Y/N)”. Total: 75,000+ criteria across the v1 task suite.

5. Reporting: Each agent gets a per-practice-area score plus an aggregate, plus token cost and wall-clock time. The “Pareto frontier” view shows the cost-quality tradeoff.
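The grading and reporting steps above can be sketched in a few lines. This is a minimal illustration, not LAB's actual harness. It assumes each criterion is a binary pass/fail, that a task score is the percentage of criteria passed, and that the aggregate is an unweighted mean of per-practice-area scores. The weighting LAB really uses is not stated here, and names such as `TaskResult`, `area_scores`, and `aggregate_score` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One graded task. Field names are hypothetical, not LAB's schema."""
    practice_area: str
    criteria: list[bool]  # one Y/N verdict per rubric criterion


def task_score(result: TaskResult) -> float:
    """Percentage of rubric criteria satisfied (assumes equal weights)."""
    return 100.0 * sum(result.criteria) / len(result.criteria)


def area_scores(results: list[TaskResult]) -> dict[str, float]:
    """Mean task score for each practice area."""
    by_area: dict[str, list[float]] = defaultdict(list)
    for r in results:
        by_area[r.practice_area].append(task_score(r))
    return {area: mean(scores) for area, scores in by_area.items()}


def aggregate_score(results: list[TaskResult]) -> float:
    """Unweighted mean over practice areas (an assumption, see lead-in)."""
    return mean(area_scores(results).values())


# Toy verdicts for illustration only; real rubrics have 30-100 criteria.
results = [
    TaskResult("M&A due diligence", [True, True, True, False]),
    TaskResult("M&A due diligence", [True, True, False, False]),
    TaskResult("Tax", [True, False, False, False]),
]
print(area_scores(results))
print(aggregate_score(results))
```

Averaging per area before aggregating keeps a practice area with many tasks from dominating the headline number. Whether LAB does this is not confirmed by the announcement as summarized here.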

Why this is structurally important

LAB does three things no prior legal AI benchmark has done:

1. Open methodology. Anyone can read the rubrics, anyone can re-grade, anyone can dispute a score. That’s the difference between a vendor benchmark and an industry benchmark.

2. Long-horizon workflows. Real legal work is not “answer this question.” It’s “read these 800 documents, find the issues, produce a memo, cite case law, respond to opposing counsel’s positions.” LAB is the first benchmark that measures this end-to-end.

3. Lawyer-graded. The 75,000+ rubric criteria come from practicing lawyers, not from ML researchers. That dramatically increases the benchmark’s credibility with GCs and CLOs who actually buy legal AI.

Published baselines (May 2026)

Harvey’s launch blog reports baselines for the major legal AI agents:

| Agent | LAB v1 aggregate score | Best practice area | Weakest area |
|---|---|---|---|
| Harvey Assistant (default) | 68.3 | M&A due diligence (78.1) | Tax (54.2) |
| Co-Counsel (Thomson Reuters) | 61.7 | Litigation discovery (72.4) | Immigration (48.9) |
| Spellbook | 54.2 | Contract review (71.0) | Regulatory (39.6) |
| Claude Opus 4.7 (raw, no legal harness) | 47.5 | Regulatory (59.1) | Litigation discovery (35.2) |
| GPT-5.5 (raw, no legal harness) | 45.8 | Tax (54.7) | M&A (33.4) |
| Practicing lawyer baseline (control) | 81.2 | M&A (88.4) | Tax (74.1) |

The story: vertical legal AI agents (Harvey, Co-Counsel) outperform raw frontier models by roughly 14 to 23 points on aggregate, because the harness, retrieval, and citation tooling matter as much as the model. Practicing lawyers still beat the two leading agents by 13 to 20 points, and every other system by 27 points or more. No agent beats the lawyer baseline on aggregate in May 2026.
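The gaps can be recomputed directly from the aggregate column of the table above. The snippet below is only arithmetic on those published numbers; the helper names `gap_to_lawyers` and `lead_over` are made up for this example.

```python
# Aggregate LAB v1 scores as listed in the baseline table.
scores = {
    "Harvey Assistant": 68.3,
    "Co-Counsel": 61.7,
    "Spellbook": 54.2,
    "Claude Opus 4.7 (raw)": 47.5,
    "GPT-5.5 (raw)": 45.8,
}
LAWYER_BASELINE = 81.2


def gap_to_lawyers(agent: str) -> float:
    """Points by which the practicing-lawyer baseline exceeds the agent."""
    return round(LAWYER_BASELINE - scores[agent], 1)


def lead_over(agent: str, other: str) -> float:
    """Points by which `agent` exceeds `other` on aggregate."""
    return round(scores[agent] - scores[other], 1)


for agent in scores:
    print(f"{agent}: {gap_to_lawyers(agent)} points behind the lawyer baseline")

for vertical in ("Harvey Assistant", "Co-Counsel"):
    for raw in ("Claude Opus 4.7 (raw)", "GPT-5.5 (raw)"):
        print(f"{vertical} leads {raw} by {lead_over(vertical, raw)} points")
```

On these figures the lawyer baseline leads by 12.9 points over Harvey Assistant and by 35.4 over raw GPT-5.5, while the two vertical agents lead the raw models by 14.2 to 22.5 points.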

How buyers should use LAB

If you’re a CLO, GC, or law firm IT lead evaluating legal AI in 2026:

1. Stop accepting vendor-specific demos. Ask every vendor for their LAB v1 score, broken down by the practice areas you care about.

2. Run your own tasks through LAB. The eval harness is open — you can add your own firm-specific tasks (with synthetic data for confidentiality) and re-run any vendor’s agent.

3. Compare on cost-adjusted score, not raw score. Harvey Assistant might lead on aggregate but cost 3x more than a competitor that reaches 92% of its score. LAB reports both score and cost.

4. Don’t expect superhuman. Practicing lawyers still beat the best agent by 13 points. Plan for augmentation, not replacement.
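The cost-adjusted comparison in point 3 can be sketched as follows. The agents, scores, and dollar costs are made-up placeholders, not LAB results, and `Run`, `score_per_dollar`, and `pareto_frontier` are hypothetical names rather than part of any LAB tooling. The sketch shows two views: a simple score-per-dollar ratio, and the set of runs that no other run beats on both cost and quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    agent: str
    score: float     # aggregate score, 0-100 (placeholder value)
    cost_usd: float  # total cost of the benchmark run (placeholder value)


def score_per_dollar(run: Run) -> float:
    """Crude cost-adjusted metric: score points bought per dollar."""
    return run.score / run.cost_usd


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Runs that no other run dominates on both score and cost."""
    frontier = []
    for r in runs:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Illustrative numbers only.
runs = [
    Run("Agent A", 68.0, 300.0),
    Run("Agent B", 62.0, 100.0),
    Run("Agent C", 55.0, 150.0),
]

for r in pareto_frontier(runs):
    print(r.agent, r.score, r.cost_usd, round(score_per_dollar(r), 3))
```

With these placeholders, Agent C drops out because Agent B is both cheaper and better. Agents A and B both stay on the frontier, so the choice between them depends on how much the extra six points are worth to the buyer.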

How vendors should respond

If you’re shipping a legal AI agent:

  • Publish your LAB v1 score immediately. Vendors who hide will look like they have something to hide.
  • Break down by practice area. Aggregate scores are coarse; show buyers where you win and where you don’t.
  • Set up continuous LAB CI. Run LAB against every release. Track score deltas over time.
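Tracking score deltas across releases could look roughly like this. It is a sketch under assumptions: per-practice-area scores are taken to be available as plain dictionaries, the one-point tolerance is arbitrary, and `score_regressions` is a hypothetical helper, not a LAB command or API.

```python
def score_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 1.0,
) -> dict[str, float]:
    """Return practice areas whose score fell by more than `tolerance` points.

    Areas missing from `current` are skipped rather than counted as failures.
    """
    regressions = {}
    for area, old in previous.items():
        new = current.get(area)
        if new is None:
            continue
        delta = new - old
        if delta < -tolerance:
            regressions[area] = round(delta, 1)
    return regressions


# Placeholder scores for two consecutive releases.
previous = {"M&A": 78.0, "Tax": 54.0, "Immigration": 60.0}
current = {"M&A": 78.5, "Tax": 51.0, "Immigration": 59.5}

print(score_regressions(previous, current))
```

In a CI pipeline, a non-empty result would typically fail the build or open a ticket. The tolerance absorbs run-to-run noise, which matters if grading or agent behavior is not fully deterministic.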

How LAB compares to other 2026 verticals

| Benchmark | Vertical | Released | Scale |
|---|---|---|---|
| Harvey LAB | Legal | May 6, 2026 | 1,200+ tasks, 75K rubrics |
| GDPval-AA | General agent | Q2 2026 | ~500 tasks |
| STATE-Bench | Memory (any agent) | May 19, 2026 | 200 tasks (more in repo) |
| HealthBench (OpenAI) | Healthcare | 2025 | Multiple categories |
| SWE-Bench Verified | Coding | 2024-2025 | 500 verified tasks |

The pattern: 2026 is the year vertical benchmarks went from research curiosities to procurement requirements. Legal, healthcare, coding, and general-agent benchmarks now exist with credible methodology. Finance and engineering are next.

Caveats

  • v1 is English-language only. EU/Asia legal markets need v2.
  • Some test documents are synthetic for confidentiality reasons. Real-document evaluation requires a separate licensed track.
  • Practice area coverage is US-centric (M&A reflects Delaware law, litigation reflects FRCP, etc.). International expansion is on Harvey’s roadmap.
  • Harvey runs it. Even with open methodology, the benchmark designer has a structural advantage — Harvey will be first to optimize against the benchmark they designed. Watch for independent verification by Stanford CodeX or similar.

Verdict

Harvey LAB is the legal AI benchmark the industry has needed for three years. Expect it to become a standard vendor marketing claim by Q3 2026 and a procurement requirement by Q4.

The fact that Harvey published it open-source — rather than keeping it as a proprietary advantage — tells you they think they win on the benchmark even when competitors can study it. That’s a strong signal.