Qwen 3.6 Max vs Claude Mythos vs GPT-5.5 on SWE-Bench (2026)

The three most-discussed coding models in May 2026 — Alibaba’s Qwen 3.6-Max-Preview, Anthropic’s restricted Claude Mythos Preview, and OpenAI’s flagship GPT-5.5 — are pushing each other across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0, and beyond. Here’s how they actually compare.

Last verified: May 17, 2026

TL;DR

| | Qwen 3.6-Max-Preview | Claude Mythos Preview | GPT-5.5 |
| --- | --- | --- | --- |
| Vendor | Alibaba Cloud | Anthropic | OpenAI |
| Released | April 20, 2026 | Restricted (Project Glasswing) | April 23, 2026 |
| Generally available? | Yes (API) | No (restricted) | Yes |
| SWE-Bench Verified | competitive (top tier) | 93.9% | strong |
| SWE-Bench Pro | leads (claimed top spot) | close behind | 58.6% |
| Terminal-Bench 2.0 | leads (claimed top spot) | strong | 82.7% |
| SciCode | leads (claimed top spot) | strong | strong |
| GDPval (agentic) | strong | n/a (restricted) | 84.9% |
| OSWorld-Verified | n/a | n/a | 78.7% |
| Pricing | Cheapest | Restricted | $5/M input, $30/M output |
| Open-weight family | Qwen 3.6-Plus, 3.6-35B-A3B | No | No |
| Best for | Cost-conscious coding agents | Restricted cyber/SWE research | Agentic coding generalist |

What’s actually new in May 2026

  • GPT-5.5 became the ChatGPT default across all tiers on May 5, following its full release on April 23. It is built for agentic workflows, multi-step tool use, and computer use.
  • Claude Mythos Preview continues to sit under Project Glasswing — Anthropic restricts access because of its autonomous zero-day discovery capabilities. AISI’s evaluation confirmed Mythos can saturate cybersecurity benchmarks like CyberGym and Cybench.
  • Qwen 3.6-Max-Preview dropped April 20 and claimed top scores on six major programming benchmarks including SWE-Bench Pro, Terminal-Bench 2.0, and SciCode. It’s ranked second overall on the Artificial Analysis Intelligence Index, behind only GPT-5.5.
  • Qwen 3.6-Plus (1M context, ~April 2 release) and Qwen 3.6-35B-A3B (MoE) sit underneath as open-weight options.

Claude Mythos Preview — the restricted frontier

Why Mythos is the most talked-about model nobody can use:

  • 93.9% on SWE-Bench Verified — best in class for software engineering benchmarks.
  • 97.6% on USAMO 2026 — one of the largest single-generation reasoning jumps.
  • Saturates CyberGym and Cybench — autonomously finds zero-days in major OSes and browsers.
  • Restricted under Project Glasswing — limited to vetted organizations for vulnerability identification and remediation.

For most teams, Mythos is a benchmark result and a preview of what is coming, not a model you can call from your IDE today. The publicly deployable Anthropic model for coding remains Claude Opus 4.7 (launched April 16; available on Bedrock, Vertex, and Foundry).

GPT-5.5 — the agentic generalist

GPT-5.5 is the strongest all-rounder and agentic coding generalist of the three:

  • GDPval 84.9%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0% — leads on agentic benchmarks.
  • Terminal-Bench 2.0: 82.7% — strong agent coding.
  • SWE-Bench Pro: 58.6% — trails Claude Opus 4.7 and Mythos here, but still excellent.
  • FrontierMath Tier 4: 35.4% (39.6% with GPT-5.5 Pro).
  • Artificial Analysis Intelligence Index: 60 — slightly ahead of Gemini 3.1 Pro at 57.
  • Cybersecurity: 71.4% on expert tasks, completes end-to-end cyberattack simulations.

Pricing: $5 per million input tokens and $30 per million output tokens. That makes GPT-5.5 the most expensive option with published pricing in this comparison, but it also has the broadest ecosystem (ChatGPT, Codex, and integrations across the major IDEs and agent frameworks).
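
To make the per-token prices above concrete, here is a minimal cost sketch in Python. The two prices come from the figures quoted in this article; the function name and the example token counts (a 40,000-token context with a 2,000-token reply, repeated for 50 agent steps) are illustrative assumptions, not measurements of any real workload.

```python
# Prices quoted in this article for GPT-5.5, in USD per million tokens.
GPT55_INPUT_PER_M = 5.00
GPT55_OUTPUT_PER_M = 30.00


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float = GPT55_INPUT_PER_M,
    output_per_m: float = GPT55_OUTPUT_PER_M,
) -> float:
    """Return the USD cost of one request at per-million-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_per_m + output_tokens * output_per_m
    return total / 1_000_000


# Hypothetical agent step: 40,000 tokens in, 2,000 tokens out.
step_cost = request_cost_usd(40_000, 2_000)
print(f"${step_cost:.2f} per step")            # $0.26 per step
print(f"${step_cost * 50:.2f} per 50-step run")  # $13.00 per 50-step run
```

Passing different `input_per_m` and `output_per_m` values lets you run the same arithmetic for another provider once you have its published prices. In this example, input tokens account for most of the bill ($0.20 of the $0.26 per step) even though output tokens cost six times as much each, because agent loops tend to resend a large context on every step.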

Qwen 3.6-Max-Preview — the surprise contender

Alibaba’s flagship in the Qwen 3.6 family:

  • Top scores claimed on six major programming benchmarks — SWE-Bench Pro, Terminal-Bench 2.0, SciCode, and three others.
  • Second on the Artificial Analysis Intelligence Index — behind only GPT-5.5.
  • Strong agentic coding — built for instruction following and multi-step tool use.
  • Dramatically cheaper than GPT-5.5 or Claude — pricing via Alibaba Cloud Model Studio is a fraction of US frontier model prices.
  • API-only as of May 2026, but the broader Qwen 3.6 family is open-weight — Qwen 3.6-Plus (1M context) and Qwen 3.6-35B-A3B (MoE, 262K context, multimodal) are downloadable.

Caveats:

  • Benchmark vs reality gap: Qwen 3.6-Max’s published scores are excellent, but many are vendor-claimed and independent evaluations have not yet caught up with all of them.
  • Geopolitics — US enterprise buyers may have China-vendor risk concerns.
  • Tooling ecosystem is smaller than OpenAI’s — fewer IDE integrations, fewer agent frameworks default to it.

Head-to-head

Pure SWE-Bench Verified

  1. Claude Mythos Preview — 93.9% (restricted).
  2. Claude Opus 4.7 (Mythos’s GA sibling): high 80s (percent).
  3. Qwen 3.6-Max-Preview — strong.
  4. GPT-5.5 — strong.

SWE-Bench Pro (harder benchmark)

  1. Qwen 3.6-Max-Preview — claimed top spot.
  2. Claude Mythos Preview / Opus 4.7 — close behind.
  3. GPT-5.5 — 58.6%.

Terminal-Bench 2.0 (real-world terminal tasks)

  1. Qwen 3.6-Max-Preview — claimed top spot.
  2. GPT-5.5 — 82.7%.
  3. Claude Mythos / Opus 4.7 — strong.

Agentic real-world tasks (GDPval, OSWorld)

  1. GPT-5.5 — leads.
  2. Claude Opus 4.7 — strong second.
  3. Qwen 3.6-Max — strong third.

Cybersecurity tasks

  1. Claude Mythos Preview — saturates benchmarks (restricted).
  2. GPT-5.5 — 71.4% on expert cyber tasks.
  3. Qwen 3.6-Max — strong but less specialized.

Cost per million tokens

  1. Qwen 3.6-Max — cheapest by far.
  2. Claude Opus 4.7 — mid-tier.
  3. GPT-5.5 — most expensive ($5 in / $30 out).

Ecosystem and integration

  1. GPT-5.5 — broadest.
  2. Claude Opus 4.7 — strong.
  3. Qwen 3.6-Max — growing.

When to use which

Use GPT-5.5 in your coding agent if:

  • You’re on Cursor, OpenAI’s Codex (including the Codex CLI), or any tool that defaults to GPT.
  • You value the broadest ecosystem and tightest tooling.
  • You care most about agentic workflows + computer use.

Use Claude Opus 4.7 (Mythos’s GA sibling) if:

  • You’re on Claude Code, Cline, Aider, or anything that defaults to Anthropic.
  • SWE-Bench Verified accuracy is your top priority.
  • You’re not eligible for the restricted Mythos preview.

Use Qwen 3.6-Max-Preview (or 3.6-Plus open-weight) if:

  • Cost is a major constraint.
  • You want to push SWE-Bench Pro / Terminal-Bench 2.0 / SciCode performance.
  • You’re comfortable with Alibaba Cloud or self-hosting (Qwen 3.6-Plus / 35B-A3B).
  • You want a strong default for budget-conscious open-source coding agents.
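
The guidance above can be sketched as a small decision helper. This is an illustration only: the `Needs` fields, the `recommend` function, and the order in which the rules are checked are assumptions made for this sketch, not part of any vendor's tooling, and real selection should weigh more factors than these four.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical summary of a team's priorities; field names are illustrative."""
    cost_sensitive: bool = False
    self_hosting: bool = False
    glasswing_access: bool = False
    # One of: "agentic", "swe_bench_verified", "swe_bench_pro"
    top_priority: str = "agentic"


def recommend(needs: Needs) -> str:
    """Map priorities to one of this article's suggestions.

    The precedence between rules is an assumption of this sketch.
    """
    if needs.self_hosting:
        # Per the article, only the open-weight Qwen 3.6 models are downloadable.
        return "Qwen 3.6-Plus or Qwen 3.6-35B-A3B (open-weight)"
    if needs.cost_sensitive or needs.top_priority == "swe_bench_pro":
        return "Qwen 3.6-Max-Preview"
    if needs.top_priority == "swe_bench_verified":
        # Mythos is restricted to organizations vetted under Project Glasswing.
        return "Claude Mythos Preview" if needs.glasswing_access else "Claude Opus 4.7"
    return "GPT-5.5"


print(recommend(Needs()))                                   # GPT-5.5
print(recommend(Needs(top_priority="swe_bench_verified")))  # Claude Opus 4.7
print(recommend(Needs(cost_sensitive=True)))                # Qwen 3.6-Max-Preview
```

Checking self-hosting first reflects that it is a hard constraint (only some models can be downloaded), while cost and benchmark priorities are preferences that can be traded off.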

Strengths and weaknesses

| Model | Strengths | Weaknesses |
| --- | --- | --- |
| Claude Mythos Preview | SWE-Bench Verified leader, USAMO record, cybersecurity saturation | Restricted access, not generally usable |
| GPT-5.5 | Best agentic generalist, broadest ecosystem, computer use | Most expensive, trails on SWE-Bench Pro |
| Qwen 3.6-Max-Preview | Top scores on SWE-Bench Pro / Terminal-Bench 2.0 / SciCode, cheapest, has open-weight siblings | Smaller tooling ecosystem, geopolitics, benchmark/reality gap |

What’s next

  • Anthropic Mythos GA — unclear; depends on Project Glasswing safety review.
  • Qwen 4 — rumored for late 2026.
  • GPT-5.5 Pro — already available with higher-tier reasoning (39.6% on FrontierMath Tier 4).
  • Open-weight Qwen 3.6 35B variants — broader adoption expected via Ollama, vLLM, OpenRouter.

TL;DR

If you want the best generally available coding model in your agent today, run GPT-5.5 for agentic workflows or Claude Opus 4.7 for raw SWE-Bench Verified accuracy, and seriously trial Qwen 3.6-Max-Preview (or open-weight Qwen 3.6-Plus) for cost-sensitive workloads. For most teams, Mythos is a research curiosity, not a tool.


Sources: AISI evaluation of Claude Mythos Preview, llm-stats.com, DataCamp, MindStudio benchmarks, Qubrid, Qwen.ai blog, OpenAI GPT-5.5 release notes — May 2026.