What is the difference between Harvey LAB and LegalBench?

LegalBench (2023) is a multiple-choice and short-answer benchmark measuring a model's legal knowledge across 162 tasks. Harvey LAB (May 6, 2026) is an end-to-end agent benchmark measuring whether an AI agent can complete real long-horizon legal workflows — like a multi-day M&A due diligence — graded against 75,000+ expert-written rubric criteria. LegalBench tests legal knowledge; LAB tests legal work.

Why is LAB considered the new standard for legal AI?

Three reasons: (1) open-source methodology that any vendor or buyer can run, (2) tasks cover 24 practice areas of real multi-step legal work rather than synthetic Q&A, and (3) rubrics are written by practicing lawyers, not ML researchers. Harvey published baselines that show practicing lawyers still beat the best agents by 13-20 points, which makes the benchmark feel honest rather than vendor-flattering. Expect LAB scores in legal AI RFPs by Q4 2026.

Is Co-Counsel's internal evaluation public?

Thomson Reuters has published partial methodology for Co-Counsel's internal evaluation but has not open-sourced the task suite or rubrics. That makes it a vendor evaluation rather than an industry benchmark — you can read about it but can't reproduce it independently. Co-Counsel scores 61.7 on LAB v1 according to Harvey's published baselines, behind Harvey Assistant (68.3) but ahead of Spellbook (54.2).

Which legal AI benchmark should procurement use?

For 2026 procurement, demand: (1) Harvey LAB v1 score broken down by your relevant practice areas, (2) cost-adjusted score (score per dollar of inference cost), (3) ideally a re-run against your own firm-specific synthetic tasks. LegalBench can supplement for raw model knowledge questions. Stop accepting vendor-curated demos as evidence of capability.

Quick Answer

Harvey LAB vs LegalBench vs Co-Counsel Bench (May 2026)

Published: May 24, 2026

Harvey released LAB on May 6, 2026 — the first long-horizon, lawyer-graded, open-source benchmark for legal AI agents. It changes the procurement game for legal AI, and it’s the right moment to compare the three benchmarks that matter for legal buyers in 2026.

Last verified: May 24, 2026.

TL;DR table

	Harvey LAB	LegalBench	Co-Counsel Internal Eval
Released	May 6, 2026	August 2023	Ongoing (Thomson Reuters)
Maintainer	Harvey	Stanford CodeX + Hugging Face	Thomson Reuters
License	Open source (GitHub)	Open (Hugging Face, MIT-like)	Proprietary
Format	Multi-day agent workflows	Multiple choice + short answer	Mixed
Tasks	1,200+ tasks, 24 practice areas	162 tasks	Not public (estimated 200+)
Grading	75,000+ lawyer-written rubric criteria	Auto-graded vs. ground truth	Internal lawyer review
Agent harness	Yes — standard tool interface	No — single-turn Q&A	Co-Counsel-specific
Long horizon	Yes (10-100+ tool calls)	No (single question)	Partial
What it measures	Can an agent finish real legal work?	Does the model know legal facts?	Co-Counsel’s own progress
Best for	Procurement, vendor comparison	Model knowledge research	Co-Counsel customers

What each benchmark is actually for

Harvey LAB — agent task success

LAB is the legal-industry equivalent of SWE-Bench Verified for coding. It asks: can an AI agent, given the same tools and document access as a junior associate, produce work product that meets practicing-lawyer standards?

The tasks are realistic: M&A diligence on a fictional data room, contract markup with negotiation goals, litigation discovery with privilege issues, regulatory memos with specific citation requirements. Each is graded against 30-100 specific rubric criteria written by practicing lawyers in that area.
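To make rubric grading concrete, here is a minimal sketch of how a per-task score could be derived from pass/fail rubric criteria. The source does not describe how LAB actually aggregates criteria, so the equal-weight pass rate, the `Criterion` type, the `task_score` function, and the example criteria are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure, not LAB's schema)."""
    description: str
    met: bool


def task_score(criteria: list[Criterion]) -> float:
    """Return the share of criteria met on a 0-100 scale.

    Assumes equal weighting; a real rubric may weight criteria differently.
    """
    if not criteria:
        raise ValueError("a task needs at least one rubric criterion")
    met = sum(1 for c in criteria if c.met)
    return 100.0 * met / len(criteria)


# Illustrative criteria for a made-up diligence task.
example = [
    Criterion("Flags the change-of-control clause", True),
    Criterion("Cites the governing agreement section", True),
    Criterion("Identifies the missing third-party consent", False),
    Criterion("Summarizes indemnity caps accurately", True),
]
print(f"{task_score(example):.1f}")  # 3 of 4 criteria met
```

With 30-100 criteria per task, a single missed criterion moves a score like this by only 1 to 3 points, which is part of why fine-grained rubrics give a smoother signal than a single pass/fail grade.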

LegalBench — legal knowledge

LegalBench (Stanford CodeX, 2023) was the first widely adopted academic legal benchmark. It’s a collection of 162 tasks covering legal reasoning, rule application, issue spotting, and rhetorical understanding, but each task is single-turn: read a passage, answer a question.

LegalBench remains useful for measuring model legal knowledge (“does GPT-5.5 understand the rule against perpetuities?”) but doesn’t measure whether an agent can run a legal engagement.

Co-Counsel Internal Eval — vendor self-report

Thomson Reuters’ Co-Counsel publishes accuracy claims based on its internal evaluation suite. The methodology is described in marketing materials but the task suite isn’t open. As a result, it functions as a vendor self-report — useful context but not independently verifiable.

Why the difference matters for buyers

Legal AI procurement in 2024-2025 was a vibes-based exercise:

Vendors demoed their best-case scenarios.
General counsel chose based on which demo felt most impressive.
ROI was measured anecdotally (“our associates say it saves 5 hours/week”).
No vendor could be directly compared to another.

LAB ends that era. With an open benchmark:

Buyers can demand a LAB score from every vendor.
Vendors can be ranked apples-to-apples.
Cost-quality tradeoffs become explicit (Harvey reports cost per task).
Independent third parties (Stanford, Berkeley Law, etc.) can re-run the benchmark to verify claims.

LegalBench is still useful, but it’s the wrong tool for procurement. A model that scores 92% on LegalBench can still fail at a 5-day M&A engagement because the work requires document retrieval, multi-step planning, citation discipline, and writing quality — none of which LegalBench measures.

Published LAB v1 baseline numbers

From Harvey’s announcement post (May 6, 2026):

Agent	LAB v1 aggregate	Notes
Practicing lawyer baseline	81.2	Human control
Harvey Assistant	68.3	Harvey’s own product
Co-Counsel	61.7	Thomson Reuters
Spellbook	54.2	Strong on contract review
Claude Opus 4.7 (raw)	47.5	No legal harness
GPT-5.5 (raw)	45.8	No legal harness
Gemini 3.1 Pro (raw)	43.6	No legal harness

Key takeaway: legal-specific harnesses (Harvey, Co-Counsel) outperform raw frontier models by roughly 14 to 25 points, because retrieval, citation tooling, and document management matter as much as raw model intelligence.
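The gaps can be re-derived directly from the table. The short sketch below uses only the aggregate scores listed above; the helper name `gap_range` and the variable names are illustrative choices, not part of any LAB tooling.

```python
# Aggregate LAB v1 scores copied from the baseline table above.
HUMAN_BASELINE = 81.2
HARNESSED = {"Harvey Assistant": 68.3, "Co-Counsel": 61.7}
RAW_MODELS = {"Claude Opus 4.7": 47.5, "GPT-5.5": 45.8, "Gemini 3.1 Pro": 43.6}


def gap_range(score: float, others: dict[str, float]) -> tuple[float, float]:
    """Return (smallest, largest) lead of `score` over the scores in `others`."""
    leads = [round(score - other, 1) for other in others.values()]
    return min(leads), max(leads)


# Lead of each legal-specific harness over the raw frontier models.
harness_leads = {name: gap_range(s, RAW_MODELS) for name, s in HARNESSED.items()}

# Lead of the practicing-lawyer baseline over the two harnessed products.
human_lead = gap_range(HUMAN_BASELINE, HARNESSED)

for name, (low, high) in harness_leads.items():
    print(f"{name}: +{low} to +{high} over raw models")
print(f"Lawyers: +{human_lead[0]} to +{human_lead[1]} over harnessed agents")
```

On these numbers the harness lead runs from about 14 points (Co-Counsel over Claude Opus 4.7) to about 25 points (Harvey Assistant over Gemini 3.1 Pro), and the lawyer baseline leads the two harnessed products by about 13 to 20 points.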

Where LegalBench still wins

1. Knowledge measurement. If you want to know whether a model understands a specific area of law (admin law, contracts, IP), LegalBench’s targeted tasks are the cleanest way to measure that.

2. Research baselines. Academic papers will continue to use LegalBench because it has stable, well-documented evaluation protocols and a long history of comparable scores.

3. Cheap to run. No tool harness, no infrastructure: you can run LegalBench on any model API in an afternoon. LAB requires more setup.

4. Pre-training signal. Model developers use LegalBench to track whether their model’s legal capability is improving. It’s a leading indicator that’s easier to move than LAB.

Where Co-Counsel’s eval still wins

1. For Co-Counsel customers, it’s directly relevant. Thomson Reuters can tune their internal eval to the workflows their customers actually use.

2. Real-document evaluation. Because Co-Counsel works with actual customer documents (under NDA), their internal eval can use real-world data in a way open benchmarks can’t.

3. Continuous improvement signal. Co-Counsel can publish “we improved by X% this quarter” with credibility because the eval is consistent over time.

The structural change LAB represents

For 18 months, legal AI vendors have argued that legal work is “too nuanced to benchmark” — which conveniently meant nobody could compare them. LAB calls that bluff.

Publishing it as open source is a power move. It says: we win on this benchmark and we’re confident enough to let everyone study it. If a competitor outscores Harvey on LAB by Q3 2026, Harvey will face hard questions. If none does, Harvey has effectively defined the legal AI market.

It’s the same playbook OpenAI ran with HumanEval (2021) and again with SWE-Bench Verified (2024). The vendor who defines the benchmark usually wins the category for years.

What to demand in a 2026 legal AI RFP

If you’re procuring legal AI in mid-2026:

LAB v1 score (aggregate + your top 3 practice areas).
LAB cost per task (token cost + wall-clock time).
LegalBench score (as a model-knowledge sanity check).
Ability to add custom tasks to LAB (firm-specific workflows).
Re-run option: vendor agrees to support a Stanford CodeX or equivalent third-party re-evaluation.
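A cost-adjusted comparison can be as simple as dividing each vendor's score by its cost per task. The sketch below assumes that definition (score per dollar of inference cost). The vendor names, scores, and dollar figures are made-up placeholders, not published results, and `VendorResult` and `score_per_dollar` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorResult:
    """One vendor's RFP answers (hypothetical fields)."""
    name: str
    lab_score: float          # LAB v1 aggregate, 0-100
    cost_per_task_usd: float  # average inference cost per task


def score_per_dollar(result: VendorResult) -> float:
    """Cost-adjusted score: benchmark points per dollar of inference cost."""
    if result.cost_per_task_usd <= 0:
        raise ValueError("cost per task must be positive")
    return result.lab_score / result.cost_per_task_usd


# Placeholder inputs for illustration only; not real vendor data.
vendors = [
    VendorResult("Vendor A", lab_score=68.0, cost_per_task_usd=4.00),
    VendorResult("Vendor B", lab_score=62.0, cost_per_task_usd=2.00),
]

ranked = sorted(vendors, key=score_per_dollar, reverse=True)
for v in ranked:
    print(f"{v.name}: {score_per_dollar(v):.1f} points per dollar")
```

In this made-up case the cheaper vendor ranks first despite the lower raw score, which is why the cost-adjusted figure is best read alongside the raw score and the practice-area breakdown rather than instead of them.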

Stop accepting vendor demos as evidence. The era of vibes-based legal AI procurement is over.

Verdict

Best for procurement (mid-2026 onward): Harvey LAB — the new industry standard.
Best for model knowledge research: LegalBench — still the cleanest legal-knowledge benchmark.
Best for Co-Counsel customers tracking vendor progress: Co-Counsel internal eval.

For everyone else: LAB is the score that will appear in every legal AI conversation by Q4 2026.
