GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro (May 2026)
The April 2026 model wave is settled, and the picture is more interesting than the headlines suggest. OpenAI shipped GPT-5.5 in April with a major Codex upgrade. Anthropic shipped Claude Opus 4.7 the same week. Google’s Gemini 3.1 Pro had landed in March. By May 1, 2026, three weeks of real-world testing has made the actual split clear — and it isn’t what most major outlets reported.
Last verified: May 1, 2026
The headline split
- Claude Opus 4.7 wins on coding, careful reasoning, hallucination rate, and long-context retrieval.
- GPT-5.5 wins on autonomous agent tasks, computer use, tool orchestration in long pipelines, and absolute Terminal-Bench 2.0.
- Gemini 3.1 Pro wins on multimodal (especially video and image-heavy reasoning), long-form research synthesis, and price for context-heavy workloads.
Benchmark by benchmark, it’s roughly a 6-4 split on capability, with OpenAI winning the harness-and-tools layer and Anthropic winning the model layer. Outlets that report a single overall winner are either summing whichever benchmarks the writer happens to favor or quoting the vendor.
The benchmark numbers
| Benchmark | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 87.6% | ~75% | ~72% |
| Terminal-Bench 2.0 (bare model) | 69.4% | ~75% | 68.5% |
| Terminal-Bench 2.0 (Codex CLI harness) | n/a | 82.7% | n/a |
| Graphwalks BFS 256k | wins | second | third |
| Graphwalks parents 256k | wins | second | third |
| Graphwalks parents 1M | wins | second | third |
| GDPval (research workflows) | 54.7% | 52.2% | 51.4% |
| ECI (Epoch Capabilities Index) | second | first (GPT-5.4 Pro tops the overall index) | third |
| Hallucination rate (Artificial Analysis) | lowest | medium | medium |
| Computer use / autonomous agent tasks | second | best | third |
| Multimodal video reasoning | third | second | best |
Reading this table: Opus 4.7 has more first-place finishes on capability benchmarks. GPT-5.5 has more first-place finishes on agentic and tool-orchestration benchmarks. Gemini 3.1 Pro is the multimodal winner.
Why the headlines say “GPT-5.5 wins”
Two reasons. First, OpenAI marketed GPT-5.5 around “AI you can delegate to,” which framed reviews around agentic delegation tasks — a category GPT-5.5 actually wins. Second, the Terminal-Bench 2.0 number reviewers cite for GPT-5.5 (82.7%) is the harness score running GPT-5.5 inside the Codex CLI, not the bare model score. The bare model lands closer to 75% on Terminal-Bench 2.0; the Codex harness is where the extra ~8 points come from. That’s still a real win, but it’s a Codex win, not a model win.
When Opus 4.7 runs inside Claude Code with an equivalent harness, the gap narrows. Bare model against bare model, GPT-5.5 still leads Terminal-Bench 2.0, but by roughly six points (~75% vs 69.4%) rather than the thirteen the headline numbers imply.
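A back-of-the-envelope sketch of that arithmetic, using the figures above (the ~75% bare-model number is this article's estimate, not a published score):

```python
# Terminal-Bench 2.0 figures discussed above. The ~75% bare-model
# score for GPT-5.5 is an estimate, not a vendor-published number.
gpt55_codex_harness = 82.7  # GPT-5.5 inside the Codex CLI harness
gpt55_bare = 75.0           # estimated bare-model score
opus47_bare = 69.4          # Opus 4.7 bare-model score

harness_uplift = gpt55_codex_harness - gpt55_bare  # ~8 points from Codex
bare_gap = gpt55_bare - opus47_bare                # ~6 points, model vs model

print(f"Codex harness uplift: {harness_uplift:.1f} points")
print(f"Bare-model gap:       {bare_gap:.1f} points")
```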
Where each model genuinely wins
Claude Opus 4.7 — the careful reasoning model
Best for:
- Production-grade coding (highest SWE-bench Verified score in May 2026).
- Long documents, large codebases, deep research synthesis where retrieval accuracy matters.
- Anything trust-sensitive — legal, medical, financial — because of the lowest measured hallucination rate.
- Working inside Claude Code on real engineering work.
Pricing: Claude Pro $20/mo, Claude Max $100/mo; API $15/$75 per million input/output tokens.
GPT-5.5 — the agent model
Best for:
- Computer use — Codex CLI, Codex Cloud, OpenAI Operator-style automation.
- Long autonomous pipelines (instruction persistence over many steps was the headline GPT-5.5 improvement).
- Multi-tool orchestration where the model has to chain calls reliably (a minimal loop is sketched below).
- ChatGPT-native workflows — voice mode, custom GPTs, the broader OpenAI ecosystem.
Pricing: ChatGPT Plus $20/mo, Pro $200/mo; API competitive with Anthropic’s, slightly cheaper on input.
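To see what the multi-tool chaining bullet means in practice, here is a minimal tool-calling loop sketched with the OpenAI Python SDK. The `gpt-5.5` model string and the `get_weather` tool are illustrative assumptions, not confirmed identifiers; the loop structure itself is the standard SDK tool-calling pattern.

```python
# Minimal multi-tool orchestration loop (OpenAI Python SDK).
# The "gpt-5.5" model string and get_weather tool are illustrative
# assumptions; substitute your real model ID and tool implementations.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny, 18C in {city}"  # stub standing in for a real API

TOOL_IMPLS = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Weather in Oslo and in Bergen?"}]
for _ in range(8):  # bound the chain so the sketch cannot loop forever
    msg = client.chat.completions.create(
        model="gpt-5.5", messages=messages, tools=tools
    ).choices[0].message
    if not msg.tool_calls:   # no more tool requests: final answer
        print(msg.content)
        break
    messages.append(msg)     # keep the tool-call turn in the transcript
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = TOOL_IMPLS[call.function.name](**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
```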
Gemini 3.1 Pro — the multimodal model
Best for:
- Video understanding and long-form video reasoning (the only one of the three with native multi-hour video context).
- Long-form research synthesis where you’re feeding 500k+ tokens.
- Image-heavy workflows and Google Workspace integrations.
- Cost — Gemini 3.1 Pro is the cheapest on long-context tokens.
Pricing: Google AI Pro $20/mo; Ultra plan with extended Gemini 3.1 Pro at $250/mo; very competitive per-token API pricing (see the cost sketch below).
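To turn those pricing notes into something comparable, here is a rough per-workload cost sketch. Only Anthropic's $15/$75 per-million-token rates appear in this article; the GPT-5.5 and Gemini figures are placeholder assumptions and should be replaced with current list prices.

```python
# Rough API cost comparison. Anthropic's rates are from the pricing
# note above; the other two are PLACEHOLDER assumptions consistent
# with "slightly cheaper on input" / "cheapest on long context".
PRICES = {  # model: (input, output) in USD per million tokens
    "claude-opus-4.7": (15.00, 75.00),  # stated above
    "gpt-5.5":         (13.00, 75.00),  # assumed placeholder
    "gemini-3.1-pro":  (10.00, 40.00),  # assumed placeholder
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD for one month, token volumes given in millions of tokens."""
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

# Example: a context-heavy month of 200M input tokens, 5M output tokens.
for model in PRICES:
    print(f"{model:>16}: ${monthly_cost(model, 200, 5):>9,.2f}")
```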
What changed in late April 2026
Three things shifted the picture in the last two weeks:
- Anthropic shipped Opus 4.7 mid-April with major gains on GDPval (research workflows) and hallucination reduction — the latter especially visible on Artificial Analysis tracking.
- OpenAI shipped GPT-5.5 + Codex updates late April with cloud agents, parallel work, and persistent computer-use sessions. The agent layer is OpenAI’s clear differentiator now.
- Independent reviewers spent April benchmarking with neutral harnesses, which is when the 6-4 capability split surfaced — versus the vendor-flavored ‘overall winner’ framing that dominated launch coverage.
Decision tree for May 2026
- You write a lot of code → Claude Opus 4.7 (Claude Code or Cursor 3 with Anthropic backend).
- You delegate long autonomous tasks → GPT-5.5 + Codex Cloud.
- You work with video, large multimodal docs, or research synthesis → Gemini 3.1 Pro.
- You want one for everything → Claude Opus 4.7 is the safest default in May 2026, with the trade-off that you’ll occasionally wish for stronger agentic chains; in those cases hop to GPT-5.5.
- You’re budget-constrained → Gemini 3.1 Pro for context-heavy work, GPT-5.5 mini for everyday chat. (The whole tree is sketched as code below.)
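Encoded as code, the tree above becomes a small routing function. The workload labels are invented for illustration; the routes just restate the recommendations in this section.

```python
# The May 2026 decision tree as a routing function. Workload labels
# are invented for illustration; routes restate the bullets above.
def pick_model(workload: str, budget_constrained: bool = False) -> str:
    if budget_constrained:
        return ("gemini-3.1-pro"
                if workload in {"research", "video", "long-context"}
                else "gpt-5.5-mini")
    routes = {
        "coding": "claude-opus-4.7",      # Claude Code or Cursor 3
        "autonomous-agents": "gpt-5.5",   # plus Codex Cloud
        "video": "gemini-3.1-pro",
        "multimodal-docs": "gemini-3.1-pro",
        "research": "gemini-3.1-pro",
    }
    # "One for everything": Opus 4.7 is the safest default, hopping
    # to GPT-5.5 when a long agentic chain comes up.
    return routes.get(workload, "claude-opus-4.7")

print(pick_model("coding"))                              # claude-opus-4.7
print(pick_model("autonomous-agents"))                   # gpt-5.5
print(pick_model("research", budget_constrained=True))   # gemini-3.1-pro
```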
Does Claude Mythos change this?
Not yet. Claude Mythos Preview, gated through Project Glasswing, is reportedly stronger than Opus 4.7 on cybersecurity, autonomous coding, and long-running agents. As of May 1, 2026 it is not on the API and not in Claude.ai for general users. Watch the Q3 2026 timeframe — if Anthropic widens access, the picture shifts again. For now, don’t make a buying decision based on Mythos.
Bottom line
The May 2026 frontier-model picture is a split, not a sweep. Claude Opus 4.7 leads on capability and careful work, GPT-5.5 leads on agents and computer use, Gemini 3.1 Pro leads on multimodal and price. Match the model to the workload — and budget for two subscriptions if your work straddles two of these axes, because the gap between “best fit” and “second best” is meaningful for productive work.