SWE-bench Pro vs SWE-bench Verified: What’s Different
If you’ve noticed every AI lab now reports two SWE-bench numbers, here’s why. SWE-bench Verified is the Python benchmark that became the de facto standard in 2024. SWE-bench Pro is its harder 2025 cousin — multi-language, longer tasks, and much closer to what your actual engineers do. By April 2026, Pro is the number that matters; Verified is approaching saturation.
Last verified: April 20, 2026
Quick comparison
| Property | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Languages | Python | Python, TypeScript, Go, Rust, Java |
| Issues | 500 curated | 1,200+ |
| Avg task length | ~50 lines changed | ~150 lines changed |
| Verification | Human review for clarity | Human review + CI + runtime tests |
| Contamination defense | Static test set | Rotates 25%/quarter |
| Top score (April 2026) | 87.6% (Opus 4.7) | 64.3% (Opus 4.7) |
| Maintained by | OpenAI + Princeton | Multi-lab coalition |
| Released | August 2024 | September 2025 |
What SWE-bench actually tests
SWE-bench (both versions) gives a model:
- A real open-source repository at a specific commit
- A GitHub issue describing a bug or feature
- The full repo context (files, history, tests)
The model must generate a patch that, when applied, makes the issue’s failing tests pass without breaking the existing test suite.
This is significantly harder than typical coding benchmarks (HumanEval, MBPP) because it requires:
- Reading and understanding a large codebase
- Navigating dependencies
- Writing code that integrates with existing architecture
- Not breaking other tests
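In harness terms, each SWE-bench task carries two test lists: the tests that fail before the fix (FAIL_TO_PASS) and a sample of tests that already pass (PASS_TO_PASS). The grading rule reduces to a simple check; the sketch below uses illustrative names, not the harness’s actual API:

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """A patch counts as resolved only if every test that failed
    before the fix now passes, and no previously passing test broke."""
    return (all(results.get(t) == "PASS" for t in fail_to_pass) and
            all(results.get(t) == "PASS" for t in pass_to_pass))

# Example: the issue's repro test now passes and the suite is intact.
results = {"test_issue_repro": "PASS", "test_existing_api": "PASS"}
print(is_resolved(["test_issue_repro"], ["test_existing_api"], results))  # True
```

Note the second condition: a patch that fixes the issue but breaks an unrelated test scores zero, which is what separates this from single-function benchmarks.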
Why Verified exists
Original SWE-bench (2023) had issues where test cases were ambiguous or the expected fix was underspecified. OpenAI built SWE-bench Verified by having human engineers review all 2,294 original tasks and keeping only the 500 that were:
- Unambiguously specified
- Verifiable by a clear test
- Free from “you’d need more context to solve this”
Verified became the clean benchmark the industry could agree on. Top labs report Verified scores in every launch announcement from late 2024 onward.
Why Pro exists
By early 2025, Verified scores were crossing 75% and the benchmark’s signal-to-noise ratio had degraded. A model scoring 81% versus 82% on Verified tells you little about real-world capability: both are solving the easy half plus most of the hard half.
SWE-bench Pro was built to restore signal:
- Five languages instead of one. Python is over-represented in training data; TypeScript, Go, Rust, and Java push models harder.
- Longer tasks. Pro’s average fix is ~3× the code volume of Verified.
- Real CI runs. Pro requires the model’s patch to pass the repo’s actual CI pipeline, not just the unit tests.
- Rotating test set. 25% of tasks are replaced each quarter from held-out private repo data, reducing the contamination advantage of models trained on scraped GitHub.
The result: Pro scores in April 2026 still show meaningful spread across frontier models, while Verified has compressed.
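The rotation policy has a measurable effect on contamination over time. Assuming replaced tasks are drawn uniformly at random (if the oldest 25% were always retired instead, the full set would turn over in exactly four quarters), the surviving share of the launch-day set decays geometrically. A back-of-envelope sketch:

```python
def original_fraction_remaining(quarters, rotation=0.25):
    # Each quarter, 25% of tasks are swapped for held-out private-repo
    # issues; under uniform random replacement, the surviving fraction
    # of the original task set decays geometrically.
    return (1 - rotation) ** quarters

# One year after launch, under a third of the original tasks remain.
print(round(original_fraction_remaining(4), 3))  # 0.316
```

So any contamination advantage a model gained from the public set at launch fades substantially within a year.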
April 2026 leaderboard
SWE-bench Verified (saturating)
| Model | Score |
|---|---|
| Claude Mythos Preview | ~92% (leaked) |
| Claude Opus 4.7 | 87.6% |
| GPT-5.4 | 84.1% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 79.4% |
| GPT-5.4-mini | 76.2% |
| Grok 4.20 | 74.8% |
| Claude Sonnet 4.6 | 73.1% |
| DeepSeek V4 | 72.5% |
SWE-bench Pro (still differentiating)
| Model | Score |
|---|---|
| Claude Mythos Preview | ~72% (leaked) |
| Claude Opus 4.7 | 64.3% |
| GPT-5.4 | 57.7% |
| Gemini 3.1 Pro | 54.2% |
| Claude Opus 4.6 | 53.4% |
| DeepSeek V4 | 48.9% |
| Gemma 4 31B (open) | 41.7% |
Look at the gaps: 6.6 points between Opus 4.7 and GPT-5.4 on Pro, vs 3.5 points on Verified. Pro is where the differentiation lives in April 2026.
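Those gap figures are easy to verify from the tables above (scores hard-coded from this page; the leaked Mythos Preview number is excluded since it is unofficial):

```python
# Official top-two scores from the April 2026 leaderboards above.
verified = {"Claude Opus 4.7": 87.6, "GPT-5.4": 84.1}
pro = {"Claude Opus 4.7": 64.3, "GPT-5.4": 57.7}

def top_gap(scores):
    # Gap between the two highest scores on a leaderboard.
    first, second = sorted(scores.values(), reverse=True)[:2]
    return round(first - second, 1)

print(top_gap(verified), top_gap(pro))  # 3.5 6.6
```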
Which score should you care about?
Use Verified to filter the top tier
If a model scores under 75% on Verified, it’s not a frontier coding model in 2026. Verified is still a useful floor — fail it, and you’re not in the conversation.
Use Pro to pick between top-tier models
For any decision between Opus 4.7, GPT-5.4, Gemini 3.1 Pro, and similar models, Pro is where the real answer lives. The 6.6-point Pro gap between Opus 4.7 and GPT-5.4 translates into a noticeably different real-world experience on TypeScript codebases, Rust projects, or long Java refactors.
Don’t obsess over either in isolation
Neither benchmark captures:
- Latency (some tasks take 30s vs 3 minutes)
- Tool use (check MCP-Atlas for this — see our Opus 4.7 comparison)
- Cost per task
- Code style / maintainability
- Behavior in real agentic loops (use OSWorld or Terminal-Bench 2.0)
A great coding model also needs to win on MCP-Atlas, Terminal-Bench 2.0, and real integration tests with your stack.
Contamination: the elephant in the room
Training on SWE-bench issues is cheating. As of April 2026, the coalition defends against contamination in three ways:
- Quarterly rotation — 25% of Pro test cases are replaced every 3 months with held-out private-repo issues
- Contamination audits — labs must disclose training-data filters applied against SWE-bench repos
- Held-out variants — the coalition runs a sealed private variant to catch discrepancies between public and private performance
You can check contamination disclosures in each lab’s April 2026 launch posts. Anthropic, OpenAI, and Google DeepMind all publish them now; smaller labs are less consistent.
The benchmarks that will matter next
SWE-bench Pro is already showing early signs of eventual saturation. The next-generation benchmarks attracting attention in April 2026:
- MCP-Atlas — scaled tool use with 100+ MCP servers
- Terminal-Bench 2.0 — autonomous terminal-only coding, no IDE
- OSWorld-Verified — computer use, not just code
- Finance Agent v1.1 — multi-step financial analysis
- CharXiv — scientific chart reasoning
Expect “SWE-bench Pro 2” or equivalent by end of 2026 as Opus 4.7 and its successors push Pro scores past 75%.
Verdict
For April 2026, report and compare SWE-bench Pro first. Verified still matters as a floor, but it’s saturating — Opus 4.7 at 87.6%, Mythos Preview at ~92%. Pro still has room and still separates the frontier from the rest.
If a launch announcement skips Pro entirely, assume the model didn’t do well on it. That’s the unwritten rule in AI launches right now.
Whichever benchmark you use, cross-check with MCP-Atlas and a real integration test on your own codebase. No single benchmark is the complete picture — and if you’re buying a model to ship code, only shipping speaks for itself.