SWE-bench Pro vs SWE-bench Verified: What’s Different
If you’ve noticed every AI lab now reports two SWE-bench numbers, here’s why. SWE-bench Verified is the Python benchmark that became the de facto standard in 2024. SWE-bench Pro is its harder 2025 cousin — multi-language, longer tasks, and much closer to what your actual engineers do. By April 2026, Pro is the number that matters; Verified is approaching saturation.
Last verified: April 20, 2026
Quick comparison
| Property | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Languages | Python | Python, TypeScript, Go, Rust, Java |
| Issues | 500 curated | 1,200+ |
| Avg task length | ~50 lines changed | ~150 lines changed |
| Verification | Human review for clarity | Human review + CI + runtime tests |
| Contamination defense | Static test set | Rotates 25%/quarter |
| Top score (April 2026) | 87.6% (Opus 4.7) | 64.3% (Opus 4.7) |
| Maintained by | OpenAI + Princeton | Multi-lab coalition |
| Released | August 2024 | September 2025 |
What SWE-bench actually tests
SWE-bench (both versions) gives a model:
- A real open-source repository at a specific commit
- A GitHub issue describing a bug or feature
- The full repo context (files, history, tests)
The model must generate a patch that, when applied, makes the issue’s failing tests pass without breaking the existing test suite.
This is significantly harder than typical coding benchmarks (HumanEval, MBPP) because it requires:
- Reading and understanding a large codebase
- Navigating dependencies
- Writing code that integrates with existing architecture
- Not breaking other tests
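In harness terms, each SWE-bench task carries two test lists: the tests that fail before the fix (FAIL_TO_PASS) and a sample of tests that already pass (PASS_TO_PASS). The grading rule reduces to a simple check; the sketch below uses illustrative names, not the harness’s actual API:

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """A patch counts as resolved only if every test that failed
    before the fix now passes, and no previously passing test broke."""
    return (all(results.get(t) == "PASS" for t in fail_to_pass) and
            all(results.get(t) == "PASS" for t in pass_to_pass))

# Example: the issue's repro test now passes and the suite is intact.
results = {"test_issue_repro": "PASS", "test_existing_api": "PASS"}
print(is_resolved(["test_issue_repro"], ["test_existing_api"], results))  # True
```

Note the second condition: a patch that fixes the issue but breaks an unrelated test scores zero, which is what separates this from single-function benchmarks.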
Why Verified exists
Original SWE-bench (2023) had issues where test cases were ambiguous or the expected fix was underspecified. OpenAI built SWE-bench Verified by having human engineers review all 2,294 original tasks and keeping only the 500 that were:
- Unambiguously specified
- Verifiable by a clear test
- Free from “you’d need more context to solve this”
Verified became the clean benchmark the industry could agree on. Top labs report Verified scores in every launch announcement from late 2024 onward.
Why Pro exists
By early 2025, Verified scores were crossing 75% and the benchmark’s signal-to-noise ratio had degraded. A model scoring 81% versus 82% on Verified tells you little about real-world capability: both are solving the easy half plus most of the hard half.
SWE-bench Pro was built to restore signal:
- Five languages instead of one. Python is over-represented in training data; TypeScript, Go, Rust, and Java push models harder.
- Longer tasks. Pro’s average fix is ~3× the code volume of Verified.
- Real CI runs. Pro requires the model’s patch to pass the repo’s actual CI pipeline, not just the unit tests.
- Rotating test set. 25% of tasks are replaced each quarter from held-out private repo data, reducing the contamination advantage of models trained on scraped GitHub.
The result: Pro scores in April 2026 still show meaningful spread across frontier models, while Verified has compressed.
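The rotation policy has a measurable effect on contamination over time. Assuming replaced tasks are drawn uniformly at random (if the oldest 25% were always retired instead, the full set would turn over in exactly four quarters), the surviving share of the launch-day set decays geometrically. A back-of-envelope sketch:

```python
def original_fraction_remaining(quarters, rotation=0.25):
    # Each quarter, 25% of tasks are swapped for held-out private-repo
    # issues; under uniform random replacement, the surviving fraction
    # of the original task set decays geometrically.
    return (1 - rotation) ** quarters

# One year after launch, under a third of the original tasks remain.
print(round(original_fraction_remaining(4), 3))  # 0.316
```

So any contamination advantage a model gained from the public set at launch fades substantially within a year.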
April 2026 leaderboard
SWE-bench Verified (saturating)
| Model | Score |
|---|---|
| Claude Mythos Preview | ~92% (leaked) |
| Claude Opus 4.7 | 87.6% |
| GPT-5.4 | 84.1% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 79.4% |
| GPT-5.4-mini | 76.2% |
| Grok 4.20 | 74.8% |
| Claude Sonnet 4.6 | 73.1% |
| DeepSeek V4 | 72.5% |
SWE-bench Pro (still differentiating)
| Model | Score |
|---|---|
| Claude Mythos Preview | ~72% (leaked) |
| Claude Opus 4.7 | 64.3% |
| GPT-5.4 | 57.7% |
| Gemini 3.1 Pro | 54.2% |
| Claude Opus 4.6 | 53.4% |
| DeepSeek V4 | 48.9% |
| Gemma 4 31B (open) | 41.7% |
Look at the gaps: 6.6 points between Opus 4.7 and GPT-5.4 on Pro, vs 3.5 points on Verified. Pro is where the differentiation lives in April 2026.
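Those gap figures are easy to verify from the tables above (scores hard-coded from this page; the leaked Mythos Preview number is excluded since it is unofficial):

```python
# Official top-two scores from the April 2026 leaderboards above.
verified = {"Claude Opus 4.7": 87.6, "GPT-5.4": 84.1}
pro = {"Claude Opus 4.7": 64.3, "GPT-5.4": 57.7}

def top_gap(scores):
    # Gap between the two highest scores on a leaderboard.
    first, second = sorted(scores.values(), reverse=True)[:2]
    return round(first - second, 1)

print(top_gap(verified), top_gap(pro))  # 3.5 6.6
```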
Which score should you care about?
Use Verified to filter the top tier
If a model scores under 75% on Verified, it’s not a frontier coding model in 2026. Verified is still a useful floor — fail it, and you’re not in the conversation.
Use Pro to pick between top-tier models
For any decision between Opus 4.7, GPT-5.4, Gemini 3.1 Pro, and similar models, Pro is where the real answer lives. The 6.6-point Pro gap between Opus 4.7 and GPT-5.4 translates into a noticeably different real-world experience on TypeScript codebases, Rust projects, or long Java refactors.
Don’t obsess over either in isolation
Neither benchmark captures:
- Latency (some tasks take 30s vs 3 minutes)
- Tool use (check MCP-Atlas for this — see our Opus 4.7 comparison)
- Cost per task
- Code style / maintainability
- Behavior in real agentic loops (use OSWorld or Terminal-Bench 2.0)
A great coding model also needs to win on MCP-Atlas, Terminal-Bench 2.0, and real integration tests with your stack.
Contamination: the elephant in the room
Training on SWE-bench issues is cheating. As of April 2026, the coalition defends against contamination in three ways:
- Quarterly rotation — 25% of Pro test cases are replaced every 3 months with held-out private-repo issues
- Contamination audits — labs must disclose training-data filters applied against SWE-bench repos
- Held-out variants — the coalition runs a sealed private variant to catch discrepancies between public and private performance
You can check contamination disclosures in each lab’s April 2026 launch posts. Anthropic, OpenAI, and Google DeepMind all publish them now; smaller labs are less consistent.
The benchmarks that will matter next
SWE-bench Pro is already showing early signs of eventual saturation. The next-generation benchmarks attracting attention in April 2026:
- MCP-Atlas — scaled tool use with 100+ MCP servers
- Terminal-Bench 2.0 — autonomous terminal-only coding, no IDE
- OSWorld-Verified — computer use, not just code
- Finance Agent v1.1 — multi-step financial analysis
- CharXiv — scientific chart reasoning
Expect “SWE-bench Pro 2” or equivalent by end of 2026 as Opus 4.7 and its successors push Pro scores past 75%.
Verdict
For April 2026, report and compare SWE-bench Pro first. Verified still matters as a floor, but it’s saturating — Opus 4.7 at 87.6%, Mythos Preview at ~92%. Pro still has room and still separates the frontier from the rest.
If a launch announcement skips Pro entirely, assume the model didn’t do well on it. That’s the unwritten rule in AI launches right now.
Whichever benchmark you use, cross-check with MCP-Atlas and a real integration test on your own codebase. No single benchmark is the complete picture — and if you’re buying a model to ship code, only shipping speaks for itself.