GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro Coding (May 2026)
Three frontier coding models shipped within weeks of each other this spring. By May 9, 2026, we have enough real-world data to write a defensible comparison. Here it is: benchmarks, token costs, and per-task winners for developers who actually have to choose.
Last verified: May 9, 2026
The release timeline
| Model | Release | Status as of May 9, 2026 |
|---|---|---|
| Gemini 3.1 Pro | Developer preview February 2026; ongoing | GA in Workspace + Vertex AI |
| Claude Opus 4.7 | April 16, 2026 | GA on Claude, API, Bedrock, Vertex, Foundry |
| GPT-5.5 | April 23, 2026 (ChatGPT); April 24, 2026 (API) | GA across OpenAI surfaces |
Three weeks. Three frontier models. The benchmark wars were ferocious. The reality is more nuanced than the marketing slides suggest.
The benchmark scoreboard
Verified numbers as of May 9, 2026 (sourced from Anthropic, OpenAI, Google official reports plus llm-stats.com and Vellum AI cross-references):
| Benchmark | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 87.6% | ~83% | 80.6% |
| SWE-bench Pro | 64.3% | 57.7% (5.4 Pro) | 54.2% |
| CursorBench | 70% | ~62% | ~58% |
| MCP-Atlas (tool use) | 77.3% | 68.1% (5.4) | 73.9% |
| OSWorld-Verified (computer use) | 78.0% | ~80% (5.5 estimated) | ~70% |
| GPQA Diamond (reasoning) | 94.2% | 94.4% (5.4 Pro) | 94.3% |
| Humanity’s Last Exam (no tools) | 46.9% | 42.7% (5.4 Pro) | 44.4% |
| Visual Acuity | 98.5% | ~92% | ~94% |
Notes:
- Claude Mythos (Anthropic’s research-preview-only model) scores 56.8% on Humanity’s Last Exam — meaningfully ahead of all three GA models — but is not generally available.
- GPT-5.5 vs GPT-5.4 Pro: where GPT-5.5 numbers are public, they're comparable to or slightly ahead of GPT-5.4 Pro on coding; the bigger 5.5 gains are on agentic tasks.
- Visual Acuity matters more than it sounds. Opus 4.7’s 98.5% (up from 54.5% on 4.6) drives its computer-use and screenshot-analysis lead.
What each model is actually best at
Claude Opus 4.7: the multi-file coding champion
Opus 4.7’s strengths cluster around complex code work that involves reading and modifying many files coherently:
- SWE-bench Pro lead (64.3%) — this is the real signal. SWE-bench Pro is harder than Verified and rewards models that can hold large state across multiple files and tools.
- MCP-Atlas lead (77.3%) — strongest tool orchestration of the three. Matters for agentic coding where the model is calling many MCP servers in sequence.
- CursorBench 70% — directly correlated with Cursor IDE performance.
- Vision uplift (54.5% → 98.5%) — drives computer-use and screenshot-analysis. The single biggest version-over-version improvement of the three releases.
- xhigh effort level + task budgets — finer control over reasoning vs latency tradeoff.
Use Opus 4.7 for: large monorepo refactors, multi-file migrations, agentic coding with MCP-heavy workflows, code review with visual context, document/PDF/screenshot analysis.
Caveats: the new tokenizer produces roughly 1.0-1.35x the token count of Opus 4.6 for the same content. Same per-million pricing, higher effective bills. Plan accordingly.
GPT-5.5: the agentic terminal champion
GPT-5.5’s strengths cluster around autonomous task execution that involves driving external systems:
- OSWorld-Verified lead — the model that runs longest without losing the plot when driving a browser or shell.
- Agentic discount tiers on Bedrock and Foundry — economic advantage for high-volume agentic workloads.
- GPT-5.5 Instant as the new default for ChatGPT — reduced hallucination in sensitive fields, useful for customer-facing agents.
Use GPT-5.5 for: autonomous browser agents, terminal-driving agents, computer-use workflows, long-running multi-step tasks where the model needs to maintain coherence across many tool calls and external state changes.
Caveats: weaker than Opus 4.7 on multi-file precision coding, and best deployed through the agentic tiers rather than the standard API.
Gemini 3.1 Pro: the long-context analyst
Gemini 3.1 Pro’s strengths cluster around understanding very large bodies of content:
- 2M+ token multimodal context window, including direct video file uploads.
- Cheapest at scale, especially for long-context analytical work.
- Strong on research-style tasks — summarize, plan, compare across hundreds of pages.
- Reasoning parity with the other two on GPQA Diamond.
Use Gemini 3.1 Pro for: analyzing 500K+ LOC codebases, video-based coding (UI walk-through to specs), research and architectural planning, cross-document analysis, long-context tasks where token cost dominates.
Caveats: trails Opus 4.7 on direct coding precision benchmarks. Best for the “understand and plan” half of the workflow rather than the “write the code” half.
Cost as of May 9, 2026
API-direct pricing:
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| Claude Opus 4.7 | $5/M | $25/M | 1M | 128k max output. New tokenizer: 1.0-1.35x the tokens of Opus 4.6. |
| GPT-5.5 | ~$5/M (standard) | ~$25/M (standard) | 1M+ | Agentic tier discount 20-40% on Bedrock/Foundry. |
| Gemini 3.1 Pro | Volume-tiered, often <$3/M | Volume-tiered | 2M+ | Cheapest at scale. |
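To make the tokenizer caveat concrete, here is a rough per-call cost sketch using the list prices above. The per-call token volumes, the flat Gemini rate, and the calls-per-month figure are illustrative assumptions, not vendor numbers, and agentic-tier discounts are ignored.

```python
# Back-of-the-envelope per-call cost using the list prices above.
# Assumed workload (not a vendor figure): ~60K input and ~4K output tokens per
# agentic coding call; Opus 4.7 scaled by the worst-case 1.35x tokenizer multiplier.

PRICES_PER_MILLION = {                 # (input $/M, output $/M)
    "claude-opus-4.7": (5.00, 25.00),
    "gpt-5.5":         (5.00, 25.00),  # standard tier, before agentic discounts
    "gemini-3.1-pro":  (3.00, 3.00),   # placeholder; real pricing is volume-tiered
}

def call_cost(model: str, input_tokens: int, output_tokens: int,
              token_multiplier: float = 1.0) -> float:
    """Dollar cost of one call, scaling token counts by a tokenizer multiplier."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) * token_multiplier / 1e6

opus = call_cost("claude-opus-4.7", 60_000, 4_000, token_multiplier=1.35)
gpt5 = call_cost("gpt-5.5", 60_000, 4_000)
print(f"Opus 4.7: ${opus:.2f} per call, GPT-5.5: ${gpt5:.2f} per call")
# Opus 4.7: $0.54 per call, GPT-5.5: $0.40 per call
# At ~500 such calls per developer per month that is roughly $200-270,
# inside the $50-300/developer/month range quoted below.
```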
Subscription tiers:
| Tier | Claude | OpenAI | Google |
|---|---|---|---|
| Pro / Plus / Premium | $20/mo | $20/mo | varies |
| Max / Team / Enterprise | $100-200/mo | $25-200+/mo | enterprise contract |
For active agentic coding workloads, expect $50-300/developer/month in API or Pro+ tier costs at standard usage. Heavy parallel-agent workflows (Cursor 3 Power, Claude Code Max with agent teams) regularly land in the $300-800/developer/month range.
Per-task picker for May 2026
If you’re not using Cursor 3’s Best-of-N or IBM Bob’s auto-routing, here’s the manual heuristic:
| Task | Best model | Why |
|---|---|---|
| Multi-file refactor across 20+ files | Claude Opus 4.7 | SWE-bench Pro 64.3%, holds state across files |
| Code migration (e.g., Vue 2 → Vue 3) | Claude Opus 4.7 | Same — multi-file coherence |
| Autonomous browser agent | GPT-5.5 | OSWorld-Verified lead, agentic tier pricing |
| Terminal-driving agent (CI, deployment) | GPT-5.5 | Same — long-running coherence |
| Analyze 1M+ LOC codebase | Gemini 3.1 Pro | Long context + cheapest at scale |
| Plan a large architectural change | Gemini 3.1 Pro | Long-context analytical work |
| Tool-heavy MCP orchestration | Claude Opus 4.7 | MCP-Atlas 77.3% |
| Code review with screenshots / PDFs | Claude Opus 4.7 | Visual Acuity 98.5% |
| Security analysis (Snyk integration) | Claude Opus 4.7 | Snyk+Claude shipped May 7, 2026 |
| Generic single-file code completion | Any | Differences below noise on simple tasks |
| Video-to-code (UI walk-through) | Gemini 3.1 Pro | Native multimodal video |
| Computer-use / screenshot analysis | Claude Opus 4.7 or GPT-5.5 | Both strong; depends on tooling |
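If you want to codify that table as a team default rather than deciding per prompt, a dictionary-based router is enough to start with. A minimal sketch follows; the task-type labels and model identifier strings are illustrative, not official API model names.

```python
# Minimal manual router mirroring the per-task table above.
# Task-type labels and model identifiers are illustrative, not official names.

ROUTES = {
    "multi_file_refactor":   "claude-opus-4.7",
    "code_migration":        "claude-opus-4.7",
    "mcp_orchestration":     "claude-opus-4.7",
    "visual_code_review":    "claude-opus-4.7",
    "security_analysis":     "claude-opus-4.7",
    "browser_agent":         "gpt-5.5",
    "terminal_agent":        "gpt-5.5",
    "codebase_analysis":     "gemini-3.1-pro",
    "architecture_planning": "gemini-3.1-pro",
    "video_to_code":         "gemini-3.1-pro",
}

# For simple single-file completions the differences are below noise,
# so fall back to whichever model the team already has wired up.
FALLBACK = "claude-opus-4.7"

def pick_model(task_type: str) -> str:
    """Return the team-default model for a task type."""
    return ROUTES.get(task_type, FALLBACK)

print(pick_model("browser_agent"))            # gpt-5.5
print(pick_model("single_file_completion"))   # claude-opus-4.7 (fallback)
```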
How to actually deploy this in May 2026
The naive approach — “pick one model, use it for everything” — is leaving 20-40% productivity on the table in May 2026. Better approaches:
Option 1: Cursor 3 Best-of-N
Cursor 3’s Agents Window has native Best-of-N — send the same prompt to all three models simultaneously, see all outputs, accept the best. Costs roughly 3x per prompt, but for important tasks the productivity gain is worth it.
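If you are not on Cursor 3 but want the same pattern, the fan-out itself is easy to approximate: send one prompt to all three providers concurrently and compare the answers by hand. A minimal sketch, assuming you already have async wrappers around each vendor's SDK (the ask_* functions below are hypothetical placeholders, not real library calls):

```python
# Hand-rolled Best-of-N: fan the same prompt out to all three models at once.
# ask_claude / ask_gpt / ask_gemini are hypothetical stand-ins for whatever
# SDK wrappers you already have.
import asyncio

async def ask_claude(prompt: str) -> str: ...
async def ask_gpt(prompt: str) -> str: ...
async def ask_gemini(prompt: str) -> str: ...

async def best_of_n(prompt: str) -> dict[str, str]:
    """Run one prompt against all three models concurrently and return every answer."""
    answers = await asyncio.gather(
        ask_claude(prompt), ask_gpt(prompt), ask_gemini(prompt),
    )
    return dict(zip(["claude-opus-4.7", "gpt-5.5", "gemini-3.1-pro"], answers))

# answers = asyncio.run(best_of_n("Refactor the payments module to async I/O"))
# Review the three outputs side by side and accept the best, as Cursor 3 does natively.
```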
Option 2: IBM Bob auto-routing
If you’re an enterprise customer running IBM Bob, the platform routes tasks to the best model automatically. You give up some developer-facing transparency in exchange for not having to choose manually.
Option 3: Multi-agent specialization
Run Claude Code agent teams with model-per-specialist:
- Frontend specialist → Claude Opus 4.7 (visual + multi-file)
- Backend specialist → Claude Opus 4.7 (MCP-heavy)
- Test specialist → Gemini 3.1 Pro (long context across test suites)
- Computer-use specialist → GPT-5.5
Option 4: Workload-based defaults
Establish team defaults per workload type:
- IDE coding (Cursor / Claude Code) → Opus 4.7 default
- Long-running autonomous agents → GPT-5.5 default
- Codebase analysis / research → Gemini 3.1 Pro default
The honest caveat: benchmarks ≠ production
Benchmark gaps narrow quickly in real workflows. Three weeks of hands-on usage matters more than any single benchmark. Common surprises:
- Opus 4.7’s instruction-following changed. Prompts tuned for 4.6 sometimes need re-tuning. Opus 4.7 takes instructions more literally.
- GPT-5.5’s agentic tier matters more than the standard tier. Standard-tier GPT-5.5 looks similar to 5.4. Agentic-tier on Bedrock or Foundry is where the real productivity is.
- Gemini 3.1 Pro’s long context wins compound. When you can fit your entire codebase in context, you stop needing to engineer chunking and retrieval. The productivity gain isn’t visible in single-prompt benchmarks.
The honest May 2026 answer: use all three, route by task type, and stop pretending there’s one winner. The marketing wars want you to standardize. The benchmarks say you shouldn’t.
Related on andrew.ooo
- Cursor 3 Agents Window vs Claude Code Parallel Agents
- GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro (April 2026 benchmarks)
- IBM Bob vs Claude Code vs Cursor 3 (Enterprise SDLC, May 2026)
- Best AI coding tools with spec-driven mode (May 2026)
Sources: Anthropic Claude Opus 4.7 release notes (anthropic.com/news/claude-opus-4-7), OpenAI GPT-5.5 announcements, Google Gemini 3.1 Pro documentation, llm-stats.com, Vellum AI benchmark cross-references, and Mashable and Livemint coverage. Last verified May 9, 2026.