SWE-bench Verified Leaderboard May 2026: Top 10 Models
SWE-bench Verified is the most-cited coding benchmark, and the frontier scores have crept into the high 80s. Here’s the May 2026 leaderboard, what the scores actually mean, and how to use them for picking a model.
Last verified: May 11, 2026
SWE-bench Verified — top 10 (May 11, 2026)
| Rank | Model | Vendor | SWE-bench Verified | Notes |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 88.7% | Released April 23, 2026 |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% | Released April 16, 2026 (GA) |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Prior generation |
| 4 | Gemini 3.1 Pro | Google | 80.6% | LMSYS Arena #4 |
| 5 | Kimi K2.6 | Moonshot | 80.2% | Strong open coding |
| 6 | DeepSeek V4-Flash | DeepSeek | ~79% | Open weights, MIT |
| 7 | DeepSeek V4-Pro | DeepSeek | ~79% (per vendor reports) | Open weights |
| 8 | GPT-5.4 | OpenAI | (prior gen) | Released earlier in 2026 |
| 9 | Qwen 3 Coder | Alibaba | (mid 70s) | Open weights, code-specialized |
| 10 | Mistral Large 3 | Mistral | (mid 70s) | EU-hosted option |
Scores are vendor-reported maximums or top independent third-party measurements as of May 11, 2026. Harness, prompt, and reasoning budget vary by row.
SWE-bench Pro (private, Scale AI)
The harder SWE-bench Pro leaderboard tells a complementary story:
| Model | SWE-bench Pro | Notes |
|---|---|---|
| Claude Opus 4.7 | 64.3% | Top private leaderboard result |
| GPT-5.4 | 57.7% | Newer GPT-5.5 expected to climb |
| Gemini 3.1 Pro | 54.2% | Strong second tier |
Pro uses newer issues that are less likely to be in training data — closer to a measure of genuine capability.
What SWE-bench actually measures
SWE-bench Verified is 500 human-verified GitHub issues from popular open-source repos. For each task, the model has to:
- Read the issue.
- Navigate the repo.
- Identify which files to change.
- Write a code patch.
The harness then applies the patch and runs the repo’s test suite; pass/fail is determined by whether the relevant tests pass. A minimal sketch of that final step follows below.
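Roughly, the per-task check looks like this. This is a simplified sketch, not the official SWE-bench harness: the patch format, repo paths, and test selection (`fail_to_pass_tests`) are placeholder assumptions.

```python
# Simplified sketch of the per-task pass/fail check (not the official harness).
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, fail_to_pass_tests: list[str]) -> bool:
    """Apply a model-generated patch and check whether the target tests now pass."""
    # Apply the patch to a clean checkout of the repo at the issue's base commit.
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True, text=True
    )
    if apply.returncode != 0:
        return False  # a malformed patch counts as a failure

    # Run only the tests associated with the issue (FAIL_TO_PASS in SWE-bench terms).
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass_tests],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0  # pass/fail is just the test runner's exit code
```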
It’s a useful proxy for real-world software engineering, but it has limits:
- Only Python repos (mostly).
- Only certain types of issues (bug fixes, feature additions).
- Only one workload — no chat coding, no terminal agents, no tool-heavy workflows.
The harness effect
Same model + different harness = different SWE-bench score.
Cursor published data in 2026 showing GPT-5.5 scoring noticeably higher on functionality tests when run inside Cursor’s harness than in OpenAI’s native harness. The agent loop (file navigation tooling, error feedback, retry logic) matters as much as the model.
Practical implication: when comparing vendor-reported scores, also note the harness. A “lower” score in a different harness might still be the right model for your harness.
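To make the harness effect concrete, here is a hypothetical minimal agent loop. The callables are stand-ins for your model call and test runner, not any vendor’s API; the retry budget and the way test output is fed back are exactly the knobs that move scores between harnesses.

```python
# Hypothetical minimal agent loop -- illustrates why the harness matters.
from typing import Callable, Optional

def solve_issue(
    issue_text: str,
    generate_patch: Callable[[str, str], str],     # (issue, feedback) -> patch text
    run_tests: Callable[[str], tuple[bool, str]],  # (patch) -> (passed, test output)
    max_attempts: int = 3,                         # retry budget is a harness choice
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(issue_text, feedback)  # prompt includes prior failures
        passed, output = run_tests(patch)
        if passed:
            return patch                              # first passing patch wins
        feedback = output                             # error feedback drives the retry
    return None                                       # retry budget exhausted
```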
Beyond SWE-bench — the broader benchmark stack
| Benchmark | What it measures | Leader (May 2026) |
|---|---|---|
| SWE-bench Verified | GitHub issue resolution | GPT-5.5 (88.7%) |
| SWE-bench Pro | Harder, contamination-resistant | Claude Opus 4.7 (64.3%) |
| Terminal-Bench 2.0 | Unattended terminal agents | GPT-5.5 (82.7%) |
| MCP-Atlas | MCP-style tool use | Claude Opus 4.7 (77.3%) |
| LMSYS Arena | Chat coding (human preference) | Claude Opus 4.7 (#1, thinking mode) |
| Aider Polyglot | Cross-language coding | Varies by language |
| LiveCodeBench | Competitive programming | GPT-5.5 / Opus 4.7 close |
Pick the benchmark that matches your workload, not just the one with the biggest number.
How to use the leaderboard
Step 1: Identify your workload.
- Repo-scale issue resolution → SWE-bench Verified / Pro
- Terminal-driven CI agents → Terminal-Bench 2.0
- MCP-tool-heavy workflows → MCP-Atlas
- Chat coding inside an IDE → LMSYS Arena, Cursor’s internal evals
Step 2: Look at the right benchmark, not just SWE-bench.
A model 5 points below the SWE-bench leader might be the right model for your actual workload.
Step 3: Factor in cost-per-task.
Claude Opus 4.7 leads SWE-bench Pro but is the most expensive model. GPT-5.5 is competitive at meaningfully lower cost per task (~72% fewer output tokens). DeepSeek V4-Flash trails by 8 points but costs ~30x less.
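A back-of-envelope calculation shows how output-token frugality translates into cost per resolved issue. The prices and token counts below are illustrative placeholders, not published rates; substitute your own.

```python
# Illustrative cost-per-task arithmetic (placeholder prices and token counts).
def cost_per_task(input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one resolved issue, given token usage and per-million prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Example: a model using 72% fewer output tokens per task can be much cheaper
# per resolved issue even at a similar per-token price.
baseline = cost_per_task(60_000, 25_000, price_in_per_m=5.0, price_out_per_m=25.0)
frugal   = cost_per_task(60_000, int(25_000 * 0.28), price_in_per_m=5.0, price_out_per_m=25.0)
print(f"baseline ${baseline:.2f} vs frugal ${frugal:.2f} per task")
```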
Step 4: Test in your harness.
Vendor-reported scores are upper bounds. Your harness, prompts, and codebase will deliver different numbers. Run a small benchmark on your own issues.
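A sketch of what that mini-benchmark might look like, assuming you have a couple dozen internal issues with reproducible tests; `resolve` stands in for whatever model-plus-harness combination you are evaluating.

```python
# Sketch of a tiny internal benchmark: same issues, different model/harness candidates.
from typing import Callable

def mini_benchmark(issues: list[dict], resolve: Callable[[dict], bool]) -> float:
    """Fraction of your own issues a model+harness resolves end to end."""
    passed = sum(1 for issue in issues if resolve(issue))
    return passed / len(issues)

# Usage: compare candidates on the same 20-30 issues before committing.
# rate_a = mini_benchmark(my_issues, resolve_with_model_a)
# rate_b = mini_benchmark(my_issues, resolve_with_model_b)
```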
Why the leaderboard moved so fast in 2026
The window from April 16 to May 11, 2026 saw four frontier releases:
- April 16: Claude Opus 4.7 GA (Anthropic)
- April 23: GPT-5.5 launched (OpenAI, “Spud”)
- April 24: DeepSeek V4 preview
- April 30: Grok 4.3 full API rollout
Each release pushed scores higher. Claude Opus 4.6’s 80.8% led the board in March; by May it had slipped to third.
Decisions made on coding-model leaderboards from Q1 2026 should be revisited.
What to watch next
- Claude Mythos preview → public release. Anthropic’s flagship is in restricted preview with ~50 partners and reportedly scores 93% on SWE-bench Verified.
- DeepSeek V4 full launch following the April 24 preview.
- Google I/O 2026 (May 19) — Gemini updates expected.
- SWE-bench Pro V2 — Scale AI rotation of issues to keep the benchmark fresh.
Related reading
- Claude Mythos preview SWE-bench 93%
- SWE-bench Pro vs SWE-bench Verified
- Terminal Bench 2 results — GPT-5.5 vs Opus vs Gemini
- Grok 4.3 vs Claude Opus 4.7 vs GPT-5.5 coding
Last verified: May 11, 2026 — sources: Vellum.ai benchmark coverage, OpenRouter benchmarks, MindStudio coding comparisons, Anthropic Opus 4.7 release notes, OpenAI GPT-5.5 release notes, Cursor’s “Continually improving our agent harness” blog post, llm-stats.com.