Qwen 3.6-Max vs Claude Mythos vs GPT-5.5 on SWE-Bench (May 2026)
The three most-discussed coding models in May 2026 — Alibaba’s Qwen 3.6-Max-Preview, Anthropic’s restricted Claude Mythos Preview, and OpenAI’s flagship GPT-5.5 — are pushing each other across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0, and beyond. Here’s how they actually compare.
Last verified: May 17, 2026
TL;DR
| | Qwen 3.6-Max-Preview | Claude Mythos Preview | GPT-5.5 |
|---|---|---|---|
| Vendor | Alibaba Cloud | Anthropic | OpenAI |
| Released | April 20, 2026 | Restricted (Project Glasswing) | April 23, 2026 |
| Generally available? | Yes (API) | No (restricted) | Yes |
| SWE-Bench Verified | competitive (top tier) | 93.9% | strong |
| SWE-Bench Pro | leads (claimed top spot) | close behind | 58.6% |
| Terminal-Bench 2.0 | leads (claimed top spot) | strong | 82.7% |
| SciCode | leads (claimed top spot) | strong | strong |
| GDPval (agentic) | strong | n/a (restricted) | 84.9% |
| OSWorld-Verified | n/a | n/a | 78.7% |
| Pricing | Cheapest | Restricted | $5/M in, $30/M out |
| Open-weight family | Qwen 3.6-Plus, 3.6-35B-A3B | No | No |
| Best for | Cost-conscious coding agents | Restricted cyber/SWE research | Agentic coding generalist |
What’s actually new in May 2026
- GPT-5.5 became the new ChatGPT default across all tiers on May 5, following its full release on April 23. It is built for agentic workflows, multi-step tool use, and computer use.
- Claude Mythos Preview continues to sit under Project Glasswing — Anthropic restricts access because of its autonomous zero-day discovery capabilities. AISI’s evaluation confirmed Mythos can saturate cybersecurity benchmarks like CyberGym and Cybench.
- Qwen 3.6-Max-Preview dropped April 20 and claimed top scores on six major programming benchmarks including SWE-Bench Pro, Terminal-Bench 2.0, and SciCode. It’s ranked second overall on the Artificial Analysis Intelligence Index, behind only GPT-5.5.
- Qwen 3.6-Plus (1M context, ~April 2 release) and Qwen 3.6-35B-A3B (MoE) sit underneath as open-weight options.
Claude Mythos Preview — the restricted frontier
Why Mythos is the most talked-about model that most teams cannot use:
- 93.9% on SWE-Bench Verified — best in class for software engineering benchmarks.
- 97.6% on USAMO 2026 — one of the largest single-generation reasoning jumps.
- Saturates CyberGym and Cybench — autonomously finds zero-days in major OSes and browsers.
- Restricted under Project Glasswing — limited to vetted organizations for vulnerability identification and remediation.
For most teams, Mythos is a benchmark and a preview of what is coming, not a model you can call in your IDE today. The publicly deployable Anthropic model for coding remains Claude Opus 4.7 (launched April 16; available on Bedrock, Vertex, and Foundry).
GPT-5.5 — the agentic generalist
GPT-5.5 is the strongest all-rounder and agentic coding generalist of the three:
- GDPval 84.9%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0% — leads on agentic benchmarks.
- Terminal-Bench 2.0: 82.7% — strong agent coding.
- SWE-Bench Pro: 58.6% — trails Claude Opus 4.7 and Mythos here, but still excellent.
- FrontierMath Tier 4: 35.4% (39.6% with GPT-5.5 Pro).
- Artificial Analysis Intelligence Index: 60 — slightly ahead of Gemini 3.1 Pro at 57.
- Cybersecurity: 71.4% on expert tasks, completes end-to-end cyberattack simulations.
Pricing: $5/M input, $30/M output tokens. That makes it the most expensive generally available option compared here (Mythos has no public pricing), but it also has the broadest ecosystem (ChatGPT, Codex, every major IDE integration, every major agent framework).
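To make the per-million-token pricing concrete, here is a minimal sketch of the arithmetic. The function name `token_cost_usd` and the example token counts are hypothetical; only the $5/M input and $30/M output rates come from the figures quoted above.

```python
def token_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 5.0,    # USD per 1M input tokens (GPT-5.5 rate quoted above)
    output_price_per_m: float = 30.0,  # USD per 1M output tokens (GPT-5.5 rate quoted above)
) -> float:
    """Estimate the cost in USD of a workload priced per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical agent session: 2M input tokens and 200k output tokens.
# Input:  2.0M * $5  = $10
# Output: 0.2M * $30 = $6
cost = token_cost_usd(2_000_000, 200_000)
print(f"${cost:.2f}")  # $16.00
```

Output tokens cost six times as much as input tokens at these rates, so agents that generate long patches or verbose reasoning can run up the bill faster than the input rate alone suggests.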
Qwen 3.6-Max-Preview — the surprise contender
Alibaba’s flagship in the Qwen 3.6 family:
- Top scores claimed on six major programming benchmarks — SWE-Bench Pro, Terminal-Bench 2.0, SciCode, and three others.
- Second on the Artificial Analysis Intelligence Index — behind only GPT-5.5.
- Strong agentic coding — built for instruction following and multi-step tool use.
- Dramatically cheaper than GPT-5.5 or Claude — pricing via Alibaba Cloud Model Studio is a fraction of US frontier model prices.
- API-only as of May 2026, but the broader Qwen 3.6 family is open-weight — Qwen 3.6-Plus (1M context) and Qwen 3.6-35B-A3B (MoE, 262K context, multimodal) are downloadable.
Caveats:
- Benchmark vs reality gap: Qwen 3.6-Max’s published scores are excellent, but independent evaluations are still catching up, so treat the vendor-reported numbers as provisional.
- Geopolitics: US enterprise buyers may have China-vendor risk concerns.
- Smaller tooling ecosystem than OpenAI’s: fewer IDE integrations, and fewer agent frameworks default to it.
Head-to-head
Pure SWE-Bench Verified
- Claude Mythos Preview: 93.9% (restricted).
- Claude Opus 4.7 (Mythos’s GA sibling): high 80s.
- Qwen 3.6-Max-Preview: strong.
- GPT-5.5: strong.
SWE-Bench Pro (harder benchmark)
- Qwen 3.6-Max-Preview — claimed top spot.
- Claude Mythos Preview / Opus 4.7 — close behind.
- GPT-5.5 — 58.6%.
Terminal-Bench 2.0 (real-world terminal tasks)
- Qwen 3.6-Max-Preview — claimed top spot.
- GPT-5.5 — 82.7%.
- Claude Mythos / Opus 4.7 — strong.
Agentic real-world tasks (GDPval, OSWorld)
- GPT-5.5 — leads.
- Claude Opus 4.7 — strong second.
- Qwen 3.6-Max — strong third.
Cybersecurity tasks
- Claude Mythos Preview — saturates benchmarks (restricted).
- GPT-5.5 — 71.4% on expert cyber tasks.
- Qwen 3.6-Max — strong but less specialized.
Cost per million tokens
- Qwen 3.6-Max — cheapest by far.
- Claude Opus 4.7 — mid-tier.
- GPT-5.5 — most expensive ($5 in / $30 out).
Ecosystem and integration
- GPT-5.5 — broadest.
- Claude Opus 4.7 — strong.
- Qwen 3.6-Max — growing.
When to use which
Use GPT-5.5 in your coding agent if:
- You’re on Cursor, OpenAI’s Codex or Codex CLI, or any GPT-default tool.
- You value the broadest ecosystem and tightest tooling.
- You care most about agentic workflows and computer use.
Use Claude Opus 4.7 (Mythos’s GA sibling) if:
- You’re on Claude Code, Cline, Aider, or anything that defaults to Anthropic.
- SWE-Bench Verified accuracy is your top priority.
- You’re not eligible for the restricted Mythos preview.
Use Qwen 3.6-Max-Preview (or 3.6-Plus open-weight) if:
- Cost is a major constraint.
- You want to push SWE-Bench Pro / Terminal-Bench 2.0 / SciCode performance.
- You’re comfortable with Alibaba Cloud or self-hosting (Qwen 3.6-Plus / 35B-A3B).
- You want a strong default for budget-conscious open-source coding agents.
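The guidance above can be compressed into a small decision helper. This is only a sketch: the `TeamProfile` fields and the `recommend` function are hypothetical names, and the precedence (cost or self-hosting first, then Anthropic-oriented needs, then GPT-5.5 as the fallback) is an assumption made for illustration, not something the comparison prescribes.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    """Hypothetical inputs describing a team's constraints (illustrative only)."""
    cost_sensitive: bool = False
    wants_self_hosting: bool = False
    anthropic_default_tooling: bool = False    # e.g. Claude Code, Cline, Aider
    swe_bench_verified_priority: bool = False


def recommend(profile: TeamProfile) -> str:
    """Map a team profile to one of the generally available options.

    The precedence order is an assumption: budget and self-hosting
    constraints are checked first because they are the hardest to relax.
    """
    if profile.cost_sensitive or profile.wants_self_hosting:
        return "Qwen 3.6-Max-Preview (or open-weight Qwen 3.6-Plus)"
    if profile.anthropic_default_tooling or profile.swe_bench_verified_priority:
        return "Claude Opus 4.7"
    # Default: broadest ecosystem and strongest agentic/computer-use results.
    return "GPT-5.5"


print(recommend(TeamProfile(cost_sensitive=True)))
print(recommend(TeamProfile(swe_bench_verified_priority=True)))
print(recommend(TeamProfile()))
```

Mythos is deliberately absent from the return values, since it is restricted and not something most teams can select.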
Strengths and weaknesses
| | Strengths | Weaknesses |
|---|---|---|
| Claude Mythos Preview | SWE-Bench Verified leader, USAMO record, cybersecurity saturation | Restricted access — not generally usable |
| GPT-5.5 | Best agentic generalist, broadest ecosystem, computer use | Most expensive, trails on SWE-Bench Pro |
| Qwen 3.6-Max-Preview | Top scores on SWE-Bench Pro / Terminal-Bench 2.0 / SciCode, cheapest, has open-weight siblings | Smaller tooling ecosystem, geopolitics, benchmark/reality gap |
What’s next
- Anthropic Mythos GA — unclear; depends on Project Glasswing safety review.
- Qwen 4 — rumored for late 2026.
- GPT-5.5 Pro — already available with higher-tier reasoning (39.6% on FrontierMath Tier 4).
- Open-weight Qwen 3.6 35B variants — broader adoption expected via Ollama, vLLM, OpenRouter.
TL;DR
If you want the best generally available coding model in your agent today, run GPT-5.5 for agentic workflows or Claude Opus 4.7 for raw SWE-Bench accuracy, and seriously trial Qwen 3.6-Max-Preview (or open-weight Qwen 3.6-Plus) for cost-sensitive workloads. For most teams, Mythos is a research curiosity, not a tool.
Sources: AISI evaluation of Claude Mythos Preview, llm-stats.com, DataCamp, MindStudio benchmarks, Qubrid, Qwen.ai blog, OpenAI GPT-5.5 release notes — May 2026.