Which model is best for coding in May 2026 — Qwen 3.6, Claude Mythos, or GPT-5.5?

Claude Mythos Preview leads SWE-Bench Verified at 93.9% but is restricted under Project Glasswing — most teams can't use it. Of the generally-available models, Qwen 3.6-Max-Preview is the surprise leader, claiming top spots on SWE-Bench Pro, Terminal-Bench 2.0, and SciCode. GPT-5.5 is the strongest agentic-coding generalist (Terminal-Bench 2.0 82.7%, GDPval 84.9%) with the broadest integration ecosystem. For everyday coding agents, GPT-5.5 is the safest pick; for max coding benchmark performance on an open model, Qwen 3.6-Max is the standout.

Is Qwen 3.6-Max actually open-weight?

Qwen 3.6-Max-Preview is hosted/API only as of May 2026. Other Qwen 3.6 family members are open-weight — Qwen 3.6-Plus (1M context, released ~April 2, 2026) and Qwen 3.6-35B-A3B (Mixture of Experts) are openly available with permissive licensing for most use cases. Mythos is closed and restricted; GPT-5.5 is closed-API.

How does pricing compare?

Qwen 3.6 family is dramatically cheaper — Qwen 3.6-Plus runs at fractions of frontier pricing via Alibaba Cloud Model Studio, and self-hosted Qwen 3.6-35B-A3B is essentially infrastructure cost. GPT-5.5 is $5/M input and $30/M output tokens — the most expensive of the three. Mythos pricing is irrelevant for now since it's not publicly available.

Which model should I use in my coding agent (Cursor, Claude Code, Cline, Aider, Codex) today?

For Cursor and Codex, GPT-5.5 is the default and well-tuned. For Claude Code, Claude Opus 4.7 (publicly available) is the canonical choice — Mythos is not yet generally accessible. For Cline / Aider / Roo Code with budget constraints, Qwen 3.6-Plus via OpenRouter or Alibaba Cloud is the new value pick. For maximum benchmark performance via a generally-available API, Qwen 3.6-Max-Preview is worth a serious look.

Quick Answer

Qwen 3.6 Max vs Claude Mythos vs GPT-5.5 on SWE-Bench (2026)

Published: May 17, 2026

Qwen 3.6-Max vs Claude Mythos vs GPT-5.5 on SWE-Bench (May 2026)

The three most-discussed coding models in May 2026 — Alibaba’s Qwen 3.6-Max-Preview, Anthropic’s restricted Claude Mythos Preview, and OpenAI’s flagship GPT-5.5 — are pushing each other across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0, and beyond. Here’s how they actually compare.

Last verified: May 17, 2026

TL;DR

	Qwen 3.6-Max-Preview	Claude Mythos Preview	GPT-5.5
Vendor	Alibaba Cloud	Anthropic	OpenAI
Released	April 20, 2026	Restricted (Project Glasswing)	April 23, 2026
Generally available?	Yes (API)	No (restricted)	Yes
SWE-Bench Verified	competitive (top tier)	93.9%	strong
SWE-Bench Pro	leads (claimed top spot)	leads	58.6%
Terminal-Bench 2.0	leads (claimed top spot)	strong	82.7%
SciCode	leads (claimed top spot)	strong	strong
GDPval (agentic)	strong	n/a (restricted)	84.9%
OSWorld-Verified	n/a	n/a	78.7%
Pricing	Cheapest	Restricted	$5/M in, $30/M out
Open-weight family	Qwen 3.6-Plus, 3.6-35B-A3B	No	No
Best for	Cost-conscious coding agents	Restricted cyber/SWE research	Agentic coding generalist

What’s actually new in May 2026

GPT-5.5 became the new ChatGPT default across all tiers on May 5, after the full April 23 release. Built for agentic workflows, multi-step tool use, and computer use.
Claude Mythos Preview continues to sit under Project Glasswing — Anthropic restricts access because of its autonomous zero-day discovery capabilities. AISI’s evaluation confirmed Mythos can saturate cybersecurity benchmarks like CyberGym and Cybench.
Qwen 3.6-Max-Preview dropped April 20 and claimed top scores on six major programming benchmarks including SWE-Bench Pro, Terminal-Bench 2.0, and SciCode. It’s ranked second overall on the Artificial Analysis Intelligence Index, behind only GPT-5.5.
Qwen 3.6-Plus (1M context, ~April 2 release) and Qwen 3.6-35B-A3B (MoE) sit underneath as open-weight options.

Claude Mythos Preview — the restricted frontier

Why Mythos is the most talked-about model nobody can use:

93.9% on SWE-Bench Verified — best in class for software engineering benchmarks.
97.6% on USAMO 2026 — one of the largest single-generation reasoning jumps.
Saturates CyberGym and Cybench — autonomously finds zero-days in major OSes and browsers.
Restricted under Project Glasswing — limited to vetted organizations for vulnerability identification and remediation.

For most teams, Mythos is a benchmark and a future preview — not a model you can actually call in your IDE today. The publicly-deployable Anthropic model for coding remains Claude Opus 4.7 (April 16 launch, available on Bedrock, Vertex, Foundry).

GPT-5.5 — the agentic generalist

GPT-5.5 is the best all-rounder + agentic coding generalist of the three:

GDPval 84.9%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0% — leads on agentic benchmarks.
Terminal-Bench 2.0: 82.7% — strong agent coding.
SWE-Bench Pro: 58.6% — trails Claude Opus 4.7 and Mythos here, but still excellent.
FrontierMath Tier 4: 35.4% (39.6% with GPT-5.5 Pro).
Artificial Analysis Intelligence Index: 60 — slightly ahead of Gemini 3.1 Pro at 57.
Cybersecurity: 71.4% on expert tasks, completes end-to-end cyberattack simulations.

Pricing: $5/M input, $30/M output tokens — most expensive of the three, but with the broadest ecosystem (ChatGPT, Codex, every major IDE integration, every major agent framework).

Qwen 3.6-Max-Preview — the surprise contender

Alibaba’s flagship in the Qwen 3.6 family:

Top scores claimed on six major programming benchmarks — SWE-Bench Pro, Terminal-Bench 2.0, SciCode, and three others.
Second on the Artificial Analysis Intelligence Index — behind only GPT-5.5.
Strong agentic coding — built for instruction following and multi-step tool use.
Dramatically cheaper than GPT-5.5 or Claude — pricing via Alibaba Cloud Model Studio is a fraction of US frontier model prices.
API-only as of May 2026, but the broader Qwen 3.6 family is open-weight — Qwen 3.6-Plus (1M context) and Qwen 3.6-35B-A3B (MoE, 262K context, multimodal) are downloadable.

Caveats:

Benchmark vs reality gap — Qwen 3.6-Max’s published scores are excellent; some independent evals are still catching up.
Geopolitics — US enterprise buyers may have China-vendor risk concerns.
Tooling ecosystem is smaller than OpenAI’s — fewer IDE integrations, fewer agent frameworks default to it.

Head-to-head

Pure SWE-Bench Verified

Claude Mythos Preview — 93.9% (restricted).
Claude Opus 4.7 (Mythos’s GA sibling) — high 80s%.
Qwen 3.6-Max-Preview — strong.
GPT-5.5 — strong.

SWE-Bench Pro (harder benchmark)

Qwen 3.6-Max-Preview — claimed top spot.
Claude Mythos Preview / Opus 4.7 — close behind.
GPT-5.5 — 58.6%.

Terminal-Bench 2.0 (real-world terminal tasks)

Qwen 3.6-Max-Preview — claimed top spot.
GPT-5.5 — 82.7%.
Claude Mythos / Opus 4.7 — strong.

Agentic real-world tasks (GDPval, OSWorld)

GPT-5.5 — leads.
Claude Opus 4.7 — strong second.
Qwen 3.6-Max — strong third.

Cybersecurity tasks

Claude Mythos Preview — saturates benchmarks (restricted).
GPT-5.5 — 71.4% on expert cyber tasks.
Qwen 3.6-Max — strong but less specialized.

Cost per million tokens

Qwen 3.6-Max — cheapest by far.
Claude Opus 4.7 — mid-tier.
GPT-5.5 — most expensive ($5 in / $30 out).

Ecosystem and integration

GPT-5.5 — broadest.
Claude Opus 4.7 — strong.
Qwen 3.6-Max — growing.

When to use which

Use GPT-5.5 in your coding agent if:

You’re on Cursor, Codex, OpenAI’s Codex CLI, or any GPT-default tool.
You value the broadest ecosystem and tightest tooling.
You care most about agentic workflows + computer use.

Use Claude Opus 4.7 (Mythos’s GA sibling) if:

You’re on Claude Code, Cline, Aider, or anything that defaults to Anthropic.
SWE-Bench Verified accuracy is your top priority.
You’re not eligible for the restricted Mythos preview.

Use Qwen 3.6-Max-Preview (or 3.6-Plus open-weight) if:

Cost is a major constraint.
You want to push SWE-Bench Pro / Terminal-Bench 2.0 / SciCode performance.
You’re comfortable with Alibaba Cloud or self-hosting (Qwen 3.6-Plus / 35B-A3B).
You want a strong default for budget-conscious open-source coding agents.

Strengths and weaknesses

	Strengths	Weaknesses
Claude Mythos Preview	SWE-Bench Verified leader, USAMO record, cybersecurity saturation	Restricted access — not generally usable
GPT-5.5	Best agentic generalist, broadest ecosystem, computer use	Most expensive, trails on SWE-Bench Pro
Qwen 3.6-Max-Preview	Top scores on SWE-Bench Pro / Terminal-Bench 2.0 / SciCode, cheapest, has open-weight siblings	Smaller tooling ecosystem, geopolitics, benchmark/reality gap

What’s next

Anthropic Mythos GA — unclear; depends on Project Glasswing safety review.
Qwen 4 — rumored for late 2026.
GPT-5.5 Pro — already available with higher-tier reasoning (39.6% on FrontierMath Tier 4).
Open-weight Qwen 3.6 35B variants — broader adoption expected via Ollama, vLLM, OpenRouter.

TL;DR

If you want the best generally-available coding model in your agent today, run GPT-5.5 for agentic workflows or Claude Opus 4.7 for raw SWE-Bench accuracy — and seriously trial Qwen 3.6-Max-Preview (or open-weight Qwen 3.6-Plus) for cost-sensitive workloads. Mythos is a research curiosity for most teams, not a tool.

Sources: AISI evaluation of Claude Mythos Preview, llm-stats.com, DataCamp, MindStudio benchmarks, Qubrid, Qwen.ai blog, OpenAI GPT-5.5 release notes — May 2026.

Qwen 3.6-Max vs Claude Mythos vs GPT-5.5 on SWE-Bench (May 2026)

TL;DR

What’s actually new in May 2026

Claude Mythos Preview — the restricted frontier

GPT-5.5 — the agentic generalist

Qwen 3.6-Max-Preview — the surprise contender

Head-to-head

Pure SWE-Bench Verified

SWE-Bench Pro (harder benchmark)

Terminal-Bench 2.0 (real-world terminal tasks)

Agentic real-world tasks (GDPval, OSWorld)

Cybersecurity tasks

Cost per million tokens

Ecosystem and integration

When to use which

Strengths and weaknesses

What’s next

TL;DR

Related reading