

Harvey LAB vs LegalBench vs Co-Counsel Bench (May 2026)


Harvey released LAB on May 6, 2026 — the first long-horizon, lawyer-graded, open-source benchmark for legal AI agents. It changes the procurement game for legal AI, and it’s the right moment to compare the three benchmarks that matter for legal buyers in 2026.

Last verified: May 24, 2026.

TL;DR table

| | Harvey LAB | LegalBench | Co-Counsel Internal Eval |
|---|---|---|---|
| Released | May 6, 2026 | August 2023 | Ongoing (Thomson Reuters) |
| Maintainer | Harvey | Stanford CodeX + Hugging Face | Thomson Reuters |
| License | Open source (GitHub) | Open (Hugging Face, MIT-like) | Proprietary |
| Format | Multi-day agent workflows | Multiple choice + short answer | Mixed |
| Tasks | 1,200+ tasks, 24 practice areas | 162 tasks | Not public (estimated 200+) |
| Grading | 75,000+ lawyer-written rubric criteria | Auto-graded vs. ground truth | Internal lawyer review |
| Agent harness | Yes (standard tool interface) | No (single-turn Q&A) | Co-Counsel-specific |
| Long horizon | Yes (10-100+ tool calls) | No (single question) | Partial |
| What it measures | Can an agent finish real legal work? | Does the model know legal facts? | Co-Counsel’s own progress |
| Best for | Procurement, vendor comparison | Model knowledge research | Co-Counsel customers |

What each benchmark is actually for

Harvey LAB — agent task success

LAB is the legal-industry equivalent of SWE-Bench Verified for coding. It asks: can an AI agent, given the same tools and document access as a junior associate, produce work product that meets practicing-lawyer standards?

The tasks are realistic: M&A diligence on a fictional data room, contract markup with negotiation goals, litigation discovery with privilege issues, regulatory memos with specific citation requirements. Each is graded against 30-100 specific rubric criteria written by practicing lawyers in that area.
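To make rubric grading concrete, here is a minimal sketch of how a per-task score could be computed from pass/fail rubric criteria. The source does not say how LAB aggregates criteria, so the weighted-fraction formula, the `Criterion` type, and the example criteria are all assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One lawyer-written rubric item (hypothetical structure)."""
    description: str
    met: bool
    weight: float = 1.0  # assumption: the source does not say criteria are weighted


def task_score(criteria: list[Criterion]) -> float:
    """Return the weighted share of criteria met, on a 0-100 scale."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weighted criteria")
    earned = sum(c.weight for c in criteria if c.met)
    return 100.0 * earned / total


# Illustrative rubric for a contract-markup task (made-up criteria).
rubric = [
    Criterion("Flags the uncapped indemnity clause", met=True),
    Criterion("Proposes a liability cap consistent with the negotiation goals", met=True),
    Criterion("Cites the governing-law clause correctly", met=False),
    Criterion("Preserves defined terms in the redline", met=True),
]

print(f"{task_score(rubric):.1f}")  # 3 of 4 equally weighted criteria met: 75.0
```

A real task would carry 30-100 such criteria, and an aggregate score would then average the per-task scores across tasks or practice areas.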

LegalBench: single-turn legal knowledge

LegalBench (Stanford CodeX, 2023) was the first widely adopted academic legal benchmark. It’s a collection of 162 tasks covering legal reasoning, rule application, issue spotting, and rhetorical understanding, but each task is single-turn: read a passage, answer a question.

LegalBench remains useful for measuring model legal knowledge (“does GPT-5.5 understand the rule against perpetuities?”) but doesn’t measure whether an agent can run a legal engagement.
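Single-turn auto-grading of the kind LegalBench relies on is simple enough to sketch in a few lines. The example below assumes exact-match scoring after light normalization; the source only says answers are auto-graded against ground truth, so treat the `normalize` rule, the `ask_model` callable, and the sample items as hypothetical stand-ins rather than LegalBench's actual protocol.

```python
from typing import Callable


def normalize(answer: str) -> str:
    """Strip whitespace, lowercase, and drop a trailing period."""
    return answer.strip().lower().rstrip(".")


def accuracy(items: list[tuple[str, str]], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model's single answer matches ground truth."""
    if not items:
        raise ValueError("no items to grade")
    correct = sum(normalize(ask_model(q)) == normalize(gold) for q, gold in items)
    return correct / len(items)


# Made-up items; a real run would load a task's questions and labels.
items = [
    ("Does this clause contain an anti-assignment provision? Answer Yes or No.", "Yes"),
    ("Is this statement hearsay? Answer Yes or No.", "No"),
]


def stub_model(question: str) -> str:
    """Stand-in for a model API call; always answers 'yes.'"""
    return "yes."


print(accuracy(items, stub_model))  # 0.5
```

Nothing here involves tools, documents, or multiple steps, which is why this style of evaluation is cheap to run and says little about agent behavior.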

Co-Counsel Internal Eval — vendor self-report

Thomson Reuters’ Co-Counsel publishes accuracy claims based on its internal evaluation suite. The methodology is described in marketing materials but the task suite isn’t open. As a result, it functions as a vendor self-report — useful context but not independently verifiable.

Why the difference matters for buyers

Legal AI procurement in 2024-2025 was a vibes-based exercise:

  • Vendors demoed their best-case scenarios.
  • General counsels voted based on which demo felt impressive.
  • ROI was measured anecdotally (“our associates say it saves 5 hours/week”).
  • No vendor could be directly compared to another.

LAB ends that era. With an open benchmark:

  • Buyers can demand a LAB score from every vendor.
  • Vendors can be ranked apples-to-apples.
  • Cost-quality tradeoffs become explicit (Harvey reports cost per task).
  • Independent third parties (Stanford, Berkeley Law, etc.) can re-run the benchmark to verify claims.
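One way to make the cost-quality tradeoff explicit is a Pareto filter: keep only the vendors that no other vendor matches or beats on both score and cost. The sketch below uses invented vendor names, scores, and per-task costs purely for illustration; none of these numbers come from Harvey's report.

```python
from typing import NamedTuple


class Result(NamedTuple):
    vendor: str
    score: float          # benchmark aggregate, higher is better
    cost_per_task: float  # in dollars, lower is better


def pareto_front(results: list[Result]) -> list[Result]:
    """Keep results that are not dominated on (score, cost_per_task)."""
    def dominated(r: Result) -> bool:
        return any(
            o.score >= r.score
            and o.cost_per_task <= r.cost_per_task
            and (o.score > r.score or o.cost_per_task < r.cost_per_task)
            for o in results
        )
    return [r for r in results if not dominated(r)]


# Hypothetical data, not published figures.
results = [
    Result("vendor_a", 70.0, 40.0),
    Result("vendor_b", 60.0, 15.0),
    Result("vendor_c", 55.0, 25.0),  # lower score and higher cost than vendor_b
    Result("vendor_d", 45.0, 5.0),
]

for r in pareto_front(results):
    print(r.vendor, r.score, r.cost_per_task)
```

With this made-up data, `vendor_c` drops out because `vendor_b` is both better and cheaper; the remaining three represent genuine tradeoffs a buyer has to choose between.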

LegalBench is still useful, but it’s the wrong tool for procurement. A model that scores 92% on LegalBench can still fail at a 5-day M&A engagement because the work requires document retrieval, multi-step planning, citation discipline, and writing quality — none of which LegalBench measures.

Published LAB v1 baseline numbers

From Harvey’s announcement post (May 6, 2026):

| Agent | LAB v1 aggregate | Notes |
|---|---|---|
| Practicing lawyer baseline | 81.2 | Human control |
| Harvey Assistant | 68.3 | Harvey’s own product |
| Co-Counsel | 61.7 | Thomson Reuters |
| Spellbook | 54.2 | Strong on contract review |
| Claude Opus 4.7 (raw) | 47.5 | No legal harness |
| GPT-5.5 (raw) | 45.8 | No legal harness |
| Gemini 3.1 Pro (raw) | 43.6 | No legal harness |

Key takeaway: legal-specific harnesses (Harvey, Co-Counsel) outperform raw frontier models by roughly 14 to 25 points, because retrieval, citation tooling, and document management matter as much as raw model intelligence. Even the best harness still trails the practicing-lawyer baseline by about 13 points.
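The size of the harness gap can be checked directly from the aggregates listed above. This snippet uses only the scores reported in this article and subtracts each raw-model score from each harness score.

```python
# Aggregates as listed in the baseline table.
harness = {"Harvey Assistant": 68.3, "Co-Counsel": 61.7}
raw = {"Claude Opus 4.7": 47.5, "GPT-5.5": 45.8, "Gemini 3.1 Pro": 43.6}

# Point gap for every harness/raw-model pair, rounded to one decimal.
gaps = {
    (h, r): round(h_score - r_score, 1)
    for h, h_score in harness.items()
    for r, r_score in raw.items()
}

for (h, r), gap in gaps.items():
    print(f"{h} vs {r}: +{gap}")

print("smallest gap:", min(gaps.values()))  # Co-Counsel vs Claude Opus 4.7
print("largest gap:", max(gaps.values()))   # Harvey Assistant vs Gemini 3.1 Pro
```

The smallest gap is 14.2 points and the largest is 24.7, so the harness advantage is real for every pairing, though its size depends heavily on which harness and which raw model are compared.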

Where LegalBench still wins

1. Knowledge measurement. If you want to know whether a model understands a specific area of law (admin law, contracts, IP), LegalBench’s targeted tasks are the cleanest way to measure that.

2. Research baselines. Academic papers will continue to use LegalBench because it has stable, well-documented evaluation protocols and a long history of comparable scores.

3. Free. No tool harness, no infrastructure — you can run LegalBench on any model API in an afternoon. LAB requires more setup.

4. Pre-training signal. Model developers use LegalBench to track whether their model’s legal capability is improving. It’s a leading indicator that’s easier to move than LAB.

Where Co-Counsel’s eval still wins

1. For Co-Counsel customers, it’s directly relevant. Thomson Reuters can tune their internal eval to the workflows their customers actually use.

2. Real-document evaluation. Because Co-Counsel works with actual customer documents (under NDA), their internal eval can use real-world data in a way open benchmarks can’t.

3. Continuous improvement signal. Co-Counsel can publish “we improved by X% this quarter” with credibility because the eval is consistent over time.

The structural change LAB represents

For 18 months, legal AI vendors have argued that legal work is “too nuanced to benchmark” — which conveniently meant nobody could compare them. LAB calls that bluff.

The fact that Harvey published it open-source is a power move. It says: we win on this benchmark and we’re confident enough to let everyone study it. If a competitor outscores Harvey on LAB by Q3 2026, Harvey will face hard questions. If none does, Harvey has effectively defined the legal AI market.

It’s the same playbook OpenAI ran with HumanEval (2021) and SWE-bench Verified (2024). The vendor who defines the benchmark usually wins the category for years.

If you’re procuring legal AI in mid-2026, ask every vendor for:

  1. LAB v1 score (aggregate + your top 3 practice areas).
  2. LAB cost per task (token cost + wall-clock time).
  3. LegalBench score (as a model-knowledge sanity check).
  4. Ability to add custom tasks to LAB (firm-specific workflows).
  5. Re-run option: vendor agrees to support a Stanford CodeX or equivalent third-party re-evaluation.
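The checklist above can be turned into a simple completeness check on vendor responses. The `VendorResponse` type, its field names, and the sample response are hypothetical; they only mirror the five items in the list, and the numbers are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VendorResponse:
    """What a vendor returned for each checklist item (hypothetical schema)."""
    vendor: str
    lab_aggregate: Optional[float] = None
    lab_by_practice_area: dict[str, float] = field(default_factory=dict)
    lab_cost_per_task_usd: Optional[float] = None
    lab_minutes_per_task: Optional[float] = None
    legalbench_score: Optional[float] = None
    supports_custom_tasks: Optional[bool] = None
    agrees_to_third_party_rerun: Optional[bool] = None


def missing_items(resp: VendorResponse, top_areas: list[str]) -> list[str]:
    """List checklist items the vendor has not satisfied."""
    gaps = []
    if resp.lab_aggregate is None:
        gaps.append("LAB v1 aggregate score")
    for area in top_areas:
        if area not in resp.lab_by_practice_area:
            gaps.append(f"LAB v1 score for {area}")
    if resp.lab_cost_per_task_usd is None or resp.lab_minutes_per_task is None:
        gaps.append("LAB cost per task (token cost and wall-clock time)")
    if resp.legalbench_score is None:
        gaps.append("LegalBench score")
    if not resp.supports_custom_tasks:
        gaps.append("Custom task support")
    if not resp.agrees_to_third_party_rerun:
        gaps.append("Third-party re-evaluation agreement")
    return gaps


# Made-up response from a fictional vendor.
response = VendorResponse(
    vendor="example_vendor",
    lab_aggregate=60.0,
    lab_by_practice_area={"M&A": 62.0},
    legalbench_score=85.0,
    supports_custom_tasks=True,
)
top_areas = ["M&A", "Litigation", "Regulatory"]

for item in missing_items(response, top_areas):
    print("missing:", item)
```

Treating an unanswered item the same as a refusal is a deliberate choice here: a vendor that skips the re-run question has not agreed to it.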

Stop accepting vendor demos as evidence. The era of vibes-based legal AI procurement is over.

Verdict

  • Best for procurement (mid-2026 onward): Harvey LAB — the new industry standard.
  • Best for model knowledge research: LegalBench — still the cleanest legal-knowledge benchmark.
  • Best for Co-Counsel customers tracking vendor progress: Co-Counsel internal eval.

For everyone else: LAB is the score that will appear in every legal AI conversation by Q4 2026.