
SWE-bench Verified Leaderboard May 2026: Top 10 Models

SWE-bench Verified is the most-cited coding benchmark, and the frontier scores have crept into the high 80s. Here’s the May 2026 leaderboard, what the scores actually mean, and how to use them for picking a model.

Last verified: May 11, 2026

SWE-bench Verified — top 10 (May 11, 2026)

| Rank | Model | Vendor | SWE-bench Verified | Notes |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 88.7% | Released April 23, 2026 |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% | Released April 16, 2026 (GA) |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Prior generation |
| 4 | Gemini 3.1 Pro | Google | 80.6% | LMSYS Arena #4 |
| 5 | Kimi K2.6 | Moonshot | 80.2% | Strong open coding |
| 6 | DeepSeek V4-Flash | DeepSeek | ~79% | Open weights, MIT |
| 7 | DeepSeek V4-Pro | DeepSeek | ~79% (per vendor reports) | Open weights |
| 8 | GPT-5.4 | OpenAI | (prior gen) | Released earlier in 2026 |
| 9 | Qwen 3 Coder | Alibaba | (mid 70s) | Open weights, code-specialized |
| 10 | Mistral Large 3 | Mistral | (mid 70s) | EU-hosted option |

Scores are vendor-reported maximums or top independent third-party measurements as of May 11, 2026. Harness, prompt, and reasoning budget vary by row.

SWE-bench Pro (private, Scale AI)

The harder SWE-bench Pro leaderboard tells a complementary story:

| Model | SWE-bench Pro | Notes |
|---|---|---|
| Claude Opus 4.7 | 64.3% | Top private leaderboard result |
| GPT-5.4 | 57.7% | Newer GPT-5.5 expected to climb |
| Gemini 3.1 Pro | 54.2% | Strong second tier |

Pro uses newer issues that are less likely to be in training data — closer to a measure of genuine capability.

What SWE-bench actually measures

SWE-bench Verified is a set of 500 human-verified GitHub issues from popular open-source repos. For each task, the model has to:

  1. Read the issue.
  2. Navigate the repo.
  3. Identify which files to change.
  4. Write a code patch.

The patch is then applied and the repo's test suite is run; the task passes only if the relevant tests pass (see the sketch below).
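To make that scoring step concrete, here is a minimal sketch of the pass/fail check, assuming a checked-out repo, a model-generated patch file, and a known list of relevant tests. It is a simplification, not the official SWE-bench harness, which also resets the environment, installs pinned dependencies, and distinguishes FAIL_TO_PASS from PASS_TO_PASS tests.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_files: list[str]) -> bool:
    """Apply a model-generated patch and run the repo's relevant tests."""
    # A patch that does not apply cleanly counts as a failure.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False

    # Run only the tests tied to this issue; the exit code decides pass/fail.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_files],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0
```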

It’s a useful proxy for real-world software engineering, but it has limits:

  • Only Python repos (mostly).
  • Only certain types of issues (bug fixes, feature additions).
  • Only one workload — no chat coding, no terminal agents, no tool-heavy workflows.

The harness effect

Same model + different harness = different SWE-bench score.

Cursor published data in 2026 showing that GPT-5.5 scored noticeably higher on functional tests when run inside Cursor's harness than inside OpenAI's native harness. The agent loop (file navigation tooling, error feedback, retry logic) matters as much as the model.

Practical implication: when comparing vendor-reported scores, note the harness as well. A model that scores "lower" in someone else's harness may still be the right choice for yours.
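To see why the harness matters so much, here is a minimal sketch of an agent loop with error feedback and retries. `generate_patch` is a stand-in for whatever model call you use (it is not a real library API), and the three-attempt budget is an arbitrary choice.

```python
import subprocess

MAX_ATTEMPTS = 3  # arbitrary retry budget; real harnesses tune this per task

def solve_issue(repo_dir: str, issue_text: str, generate_patch) -> bool:
    """Propose a patch, test it, feed the errors back, and retry."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        # generate_patch(issue, feedback) returns a unified diff as a string.
        diff = generate_patch(issue_text, feedback)

        # Try to apply the proposed diff from stdin.
        applied = subprocess.run(
            ["git", "apply", "-"],
            cwd=repo_dir,
            input=diff.encode(),
            capture_output=True,
        )
        if applied.returncode != 0:
            feedback = "Patch failed to apply:\n" + applied.stderr.decode()
            continue

        # Run the test suite; failures become feedback for the next attempt.
        tests = subprocess.run(
            ["python", "-m", "pytest", "-q"], cwd=repo_dir, capture_output=True
        )
        if tests.returncode == 0:
            return True

        feedback = "Tests failed:\n" + tests.stdout.decode()[-2000:]
        # Undo the failed patch before retrying.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
    return False
```

Two harnesses wrapping the same model can differ in every one of these choices (how errors are surfaced, how many retries are allowed, which tests are run), and that is enough to move the headline score.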

Beyond SWE-bench — the broader benchmark stack

| Benchmark | What it measures | Leader (May 2026) |
|---|---|---|
| SWE-bench Verified | GitHub issue resolution | GPT-5.5 (88.7%) |
| SWE-bench Pro | Harder, contamination-resistant | Claude Opus 4.7 (64.3%) |
| Terminal-Bench 2.0 | Unattended terminal agents | GPT-5.5 (82.7%) |
| MCP-Atlas | MCP-style tool use | Claude Opus 4.7 (77.3%) |
| LMSYS Arena | Chat coding (human preference) | Claude Opus 4.7 (#1, thinking mode) |
| Aider Polyglot | Cross-language coding | Varies by language |
| LiveCodeBench | Competitive programming | GPT-5.5 / Opus 4.7 (close) |

Pick the benchmark that matches your workload, not just the one with the biggest number.

How to use the leaderboard

Step 1: Identify your workload.

  • Repo-scale issue resolution → SWE-bench Verified / Pro
  • Terminal-driven CI agents → Terminal-Bench 2.0
  • MCP-tool-heavy workflows → MCP-Atlas
  • Chat coding inside an IDE → LMSYS Arena, Cursor’s internal evals

Step 2: Look at the right benchmark, not just SWE-bench.

A model 5 points below the SWE-bench leader might be the right model for your actual workload.

Step 3: Factor in cost-per-task.

Claude Opus 4.7 leads SWE-bench Pro but is the most expensive model. GPT-5.5 is competitive at meaningfully lower cost per task (~72% fewer output tokens). DeepSeek V4-Flash trails by 8 points but costs ~30x less.
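A useful way to frame this is cost per resolved task rather than cost per attempt: divide the expected cost of an attempt by the resolve rate. The sketch below shows the arithmetic; the prices, token counts, and rates are placeholders, not real list prices or benchmark numbers.

```python
def cost_per_resolved_task(price_per_mtok_out: float,
                           out_tokens_per_task: float,
                           resolve_rate: float) -> float:
    """Expected dollars per successfully resolved task.

    Ignores input-token cost for simplicity; resolve_rate is the benchmark
    score expressed as a fraction (e.g. 0.80 for 80%).
    """
    cost_per_attempt = price_per_mtok_out * out_tokens_per_task / 1_000_000
    return cost_per_attempt / resolve_rate

# Hypothetical model A: expensive per token, but strong.
print(cost_per_resolved_task(60.0, 40_000, 0.80))  # 3.0
# Hypothetical model B: 30x cheaper per token, somewhat weaker.
print(cost_per_resolved_task(2.0, 40_000, 0.72))   # ~0.11
```

A cheaper model can come out far ahead on cost per resolved task even after accounting for a lower resolve rate; where the crossover sits depends on your own token usage.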

Step 4: Test in your harness.

Vendor-reported scores are upper bounds. Your harness, prompts, and codebase will deliver different numbers. Run a small benchmark on your own issues.
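As a starting point, a mini-benchmark over your own issues can reuse a loop like `solve_issue` from the harness sketch above and report a resolve rate per model. `models` and `issues` are placeholders for your own model callables and issue data.

```python
def mini_benchmark(models: dict, issues: list[dict]) -> dict[str, float]:
    """Resolve rate per model on your own issue set."""
    rates = {}
    for name, generate_patch in models.items():
        solved = sum(
            solve_issue(issue["repo_dir"], issue["text"], generate_patch)
            for issue in issues
        )
        rates[name] = solved / len(issues)
    return rates
```

Even a few dozen issues drawn from your own backlog will tell you more about relative ranking on your codebase than a decimal point on a public leaderboard.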

Why the leaderboard moved so fast in 2026

April 16 to May 11, 2026 saw four frontier releases:

  • April 16: Claude Opus 4.7 GA (Anthropic)
  • April 23: GPT-5.5 launched (OpenAI, “Spud”)
  • April 24: DeepSeek V4 preview
  • April 30: Grok 4.3 full API rollout

Each release pushed scores. Claude Opus 4.6’s 80.8% led the board in March; by May it has slipped to third.

Decisions made on coding-model leaderboards from Q1 2026 should be revisited.

What to watch next

  • Claude Mythos preview → public release. Anthropic’s flagship is in restricted preview with ~50 partners and reportedly scores 93% on SWE-bench Verified.
  • DeepSeek V4 full launch following the April 24 preview.
  • Google I/O 2026 (May 19) — Gemini updates expected.
  • SWE-bench Pro V2 — Scale AI rotation of issues to keep the benchmark fresh.

Last verified: May 11, 2026 — sources: Vellum.ai benchmark coverage, OpenRouter benchmarks, MindStudio coding comparisons, Anthropic Opus 4.7 release notes, OpenAI GPT-5.5 release notes, Cursor’s “Continually improving our agent harness” blog post, llm-stats.com.