
SWE-bench Verified Leaderboard May 2026: Top 10 Models

SWE-bench Verified is the most-cited coding benchmark, and the frontier scores have crept into the high 80s. Here’s the May 2026 leaderboard, what the scores actually mean, and how to use them for picking a model.

Last verified: May 11, 2026

SWE-bench Verified — top 10 (May 11, 2026)

| Rank | Model | Vendor | SWE-bench Verified | Notes |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 88.7% | Released April 23, 2026 |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% | Released April 16, 2026 (GA) |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Prior generation |
| 4 | Gemini 3.1 Pro | Google | 80.6% | LMSYS Arena #4 |
| 5 | Kimi K2.6 | Moonshot | 80.2% | Strong open coding |
| 6 | DeepSeek V4-Flash | DeepSeek | ~79% | Open weights, MIT |
| 7 | DeepSeek V4-Pro | DeepSeek | ~79% (per vendor reports) | Open weights |
| 8 | GPT-5.4 | OpenAI | (prior gen) | Released earlier in 2026 |
| 9 | Qwen 3 Coder | Alibaba | (mid 70s) | Open weights, code-specialized |
| 10 | Mistral Large 3 | Mistral | (mid 70s) | EU-hosted option |

Scores are vendor-reported maximums or top independent third-party measurements as of May 11, 2026. Harness, prompt, and reasoning budget vary by row.

SWE-bench Pro (private, Scale AI)

The harder SWE-bench Pro leaderboard tells a complementary story:

| Model | SWE-bench Pro | Notes |
|---|---|---|
| Claude Opus 4.7 | 64.3% | Top private leaderboard result |
| GPT-5.4 | 57.7% | Newer GPT-5.5 expected to climb |
| Gemini 3.1 Pro | 54.2% | Strong second tier |

Pro uses newer issues that are less likely to be in training data — closer to a measure of genuine capability.

What SWE-bench actually measures

SWE-bench Verified is a set of 500 human-verified GitHub issues from popular open-source repos. For each task, the model has to:

  1. Read the issue.
  2. Navigate the repo.
  3. Identify which files to change.
  4. Write a code patch.

The patch is then applied and the repo's test suite is run; the task passes only if the relevant tests pass (see the sketch below).
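To make that scoring step concrete, here is a minimal sketch of the pass/fail check, assuming a checked-out repo, a model-generated patch file, and a known list of relevant tests. It is a simplification, not the official SWE-bench harness, which also resets the environment, installs pinned dependencies, and distinguishes FAIL_TO_PASS from PASS_TO_PASS tests.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_files: list[str]) -> bool:
    """Apply a model-generated patch and run the repo's relevant tests."""
    # A patch that does not apply cleanly counts as a failure.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False

    # Run only the tests tied to this issue; the exit code decides pass/fail.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_files],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0
```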

It’s a useful proxy for real-world software engineering, but it has limits:

  • Only Python repos (mostly).
  • Only certain types of issues (bug fixes, feature additions).
  • Only one workload — no chat coding, no terminal agents, no tool-heavy workflows.

The harness effect

Same model + different harness = different SWE-bench score.

Cursor published data in 2026 showing that GPT-5.5 scored noticeably higher on functional tests when run inside Cursor's harness than inside OpenAI's native harness. The agent loop (file navigation tooling, error feedback, retry logic) matters as much as the model.

Practical implication: when comparing vendor-reported scores, note the harness as well. A model that scores "lower" in someone else's harness may still be the right choice for yours.
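To see why the harness matters so much, here is a minimal sketch of an agent loop with error feedback and retries. `generate_patch` is a stand-in for whatever model call you use (it is not a real library API), and the three-attempt budget is an arbitrary choice.

```python
import subprocess

MAX_ATTEMPTS = 3  # arbitrary retry budget; real harnesses tune this per task

def solve_issue(repo_dir: str, issue_text: str, generate_patch) -> bool:
    """Propose a patch, test it, feed the errors back, and retry."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        # generate_patch(issue, feedback) returns a unified diff as a string.
        diff = generate_patch(issue_text, feedback)

        # Try to apply the proposed diff from stdin.
        applied = subprocess.run(
            ["git", "apply", "-"],
            cwd=repo_dir,
            input=diff.encode(),
            capture_output=True,
        )
        if applied.returncode != 0:
            feedback = "Patch failed to apply:\n" + applied.stderr.decode()
            continue

        # Run the test suite; failures become feedback for the next attempt.
        tests = subprocess.run(
            ["python", "-m", "pytest", "-q"], cwd=repo_dir, capture_output=True
        )
        if tests.returncode == 0:
            return True

        feedback = "Tests failed:\n" + tests.stdout.decode()[-2000:]
        # Undo the failed patch before retrying.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
    return False
```

Two harnesses wrapping the same model can differ in every one of these choices (how errors are surfaced, how many retries are allowed, which tests are run), and that is enough to move the headline score.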

Beyond SWE-bench — the broader benchmark stack

| Benchmark | What it measures | Leader (May 2026) |
|---|---|---|
| SWE-bench Verified | GitHub issue resolution | GPT-5.5 (88.7%) |
| SWE-bench Pro | Harder, contamination-resistant | Claude Opus 4.7 (64.3%) |
| Terminal-Bench 2.0 | Unattended terminal agents | GPT-5.5 (82.7%) |
| MCP-Atlas | MCP-style tool use | Claude Opus 4.7 (77.3%) |
| LMSYS Arena | Chat coding (human preference) | Claude Opus 4.7 (#1, thinking mode) |
| Aider Polyglot | Cross-language coding | Varies by language |
| LiveCodeBench | Competitive programming | GPT-5.5 / Opus 4.7 (close) |

Pick the benchmark that matches your workload, not just the one with the biggest number.

How to use the leaderboard

Step 1: Identify your workload.

  • Repo-scale issue resolution → SWE-bench Verified / Pro
  • Terminal-driven CI agents → Terminal-Bench 2.0
  • MCP-tool-heavy workflows → MCP-Atlas
  • Chat coding inside an IDE → LMSYS Arena, Cursor’s internal evals

Step 2: Look at the right benchmark, not just SWE-bench.

A model 5 points below the SWE-bench leader might be the right model for your actual workload.

Step 3: Factor in cost-per-task.

Claude Opus 4.7 leads SWE-bench Pro but is the most expensive model. GPT-5.5 is competitive at meaningfully lower cost per task (~72% fewer output tokens). DeepSeek V4-Flash trails by 8 points but costs ~30x less.
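A useful way to frame this is cost per resolved task rather than cost per attempt: divide the expected cost of an attempt by the resolve rate. The sketch below shows the arithmetic; the prices, token counts, and rates are placeholders, not real list prices or benchmark numbers.

```python
def cost_per_resolved_task(price_per_mtok_out: float,
                           out_tokens_per_task: float,
                           resolve_rate: float) -> float:
    """Expected dollars per successfully resolved task.

    Ignores input-token cost for simplicity; resolve_rate is the benchmark
    score expressed as a fraction (e.g. 0.80 for 80%).
    """
    cost_per_attempt = price_per_mtok_out * out_tokens_per_task / 1_000_000
    return cost_per_attempt / resolve_rate

# Hypothetical model A: expensive per token, but strong.
print(cost_per_resolved_task(60.0, 40_000, 0.80))  # 3.0
# Hypothetical model B: 30x cheaper per token, somewhat weaker.
print(cost_per_resolved_task(2.0, 40_000, 0.72))   # ~0.11
```

A cheaper model can come out far ahead on cost per resolved task even after accounting for a lower resolve rate; where the crossover sits depends on your own token usage.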

Step 4: Test in your harness.

Vendor-reported scores are upper bounds. Your harness, prompts, and codebase will deliver different numbers. Run a small benchmark on your own issues.
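As a starting point, a mini-benchmark over your own issues can reuse a loop like `solve_issue` from the harness sketch above and report a resolve rate per model. `models` and `issues` are placeholders for your own model callables and issue data.

```python
def mini_benchmark(models: dict, issues: list[dict]) -> dict[str, float]:
    """Resolve rate per model on your own issue set."""
    rates = {}
    for name, generate_patch in models.items():
        solved = sum(
            solve_issue(issue["repo_dir"], issue["text"], generate_patch)
            for issue in issues
        )
        rates[name] = solved / len(issues)
    return rates
```

Even a few dozen issues drawn from your own backlog will tell you more about relative ranking on your codebase than a decimal point on a public leaderboard.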

Why the leaderboard moved so fast in 2026

April 16 to May 11, 2026 saw four frontier releases:

  • April 16: Claude Opus 4.7 GA (Anthropic)
  • April 23: GPT-5.5 launched (OpenAI, “Spud”)
  • April 24: DeepSeek V4 preview
  • April 30: Grok 4.3 full API rollout

Each release pushed scores. Claude Opus 4.6’s 80.8% led the board in March; by May it has slipped to third.

Decisions made on coding-model leaderboards from Q1 2026 should be revisited.

What to watch next

  • Claude Mythos preview → public release. Anthropic’s flagship is in restricted preview with ~50 partners and reportedly scores 93% on SWE-bench Verified.
  • DeepSeek V4 full launch following the April 24 preview.
  • Google I/O 2026 (May 19) — Gemini updates expected.
  • SWE-bench Pro V2 — Scale AI rotation of issues to keep the benchmark fresh.

Last verified: May 11, 2026 — sources: Vellum.ai benchmark coverage, OpenRouter benchmarks, MindStudio coding comparisons, Anthropic Opus 4.7 release notes, OpenAI GPT-5.5 release notes, Cursor’s “Continually improving our agent harness” blog post, llm-stats.com.