
Claude Mythos Preview Hits 93.9% on SWE-bench Verified

Quick Answer

As of May 1, 2026, Claude Mythos Preview leads the SWE-bench Verified leaderboard at 93.9% — ahead of Claude Opus 4.7 (Adaptive) at 87.6% and GPT-5.3 Codex at 85%. That makes Mythos Preview the first model to credibly approach saturation on the standard agentic coding benchmark. Here’s what it means and what it doesn’t.

Last verified: May 2, 2026

The leaderboard

Per BenchLM as of May 1, 2026:

Rank  Model                        SWE-bench Verified  Status
1     Claude Mythos Preview        93.9%               Limited preview
2     Claude Opus 4.7 (Adaptive)   87.6%               GA
3     GPT-5.3 Codex                85.0%               GA
4     Claude Opus 4.7 (Standard)   84.2%               GA
5     GPT-5.5 (High)               83.8%               GA
6     Gemini 3.1 Pro               81.4%               GA

The gap between Mythos Preview (93.9%) and the next available model (Opus 4.7 Adaptive at 87.6%) is 6.3 percentage points — substantial on a benchmark that’s becoming saturated.

What Claude Mythos Preview is

Claude Mythos Preview is Anthropic’s frontier model, in limited preview through enterprise partners since March 2026. Public reporting through April 2026 (Medium, Build Fast With AI, AI Magicx) consistently described:

  • Roughly 10 trillion total parameters (Mixture-of-Experts).
  • 800 billion to 1.2 trillion active parameters per forward pass.
  • A new tier above Opus in Anthropic’s lineup, referred to by the codename “Capybara” in some internal Anthropic materials.
  • Strong gains on coding, academic reasoning, and long-context workloads versus Opus 4.6 / 4.7.

Anthropic has confirmed the model exists and is in limited preview but has not committed to pricing or general availability dates. As of May 2, 2026, Mythos Preview is not available in Claude.ai consumer apps, not in Claude Code by default, and not generally available through the API.

What “93.9% on SWE-bench Verified” actually means

SWE-bench Verified is the curated 500-issue subset of SWE-bench, where each issue has been verified by humans to be solvable with the provided context. It’s the standard benchmark for evaluating agentic coding capability. A score of 93.9% means Mythos Preview correctly resolved about 469 out of 500 real GitHub issues from popular open-source repositories.

This matters because:

  1. SWE-bench Verified is approaching saturation. When the leading model hits 93.9%, the benchmark is no longer effective at differentiating frontier models. Future model comparisons will rely more on SWE-Bench Pro, where top models still score around 23%.
  2. Real-world coding correlates with SWE-bench but isn’t identical. Models that score 93%+ on SWE-bench Verified still fail at production coding tasks that involve unfamiliar codebases, sparse documentation, or multi-day context. The benchmark validates capability ceiling, not real-world reliability.
  3. The 6.3-point gap between Mythos and Opus 4.7 (Adaptive) represents roughly 32 additional issues correctly resolved (see the sketch after this list): meaningful on a 500-issue benchmark, but not transformative for most developers.
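For concreteness, here is the arithmetic behind those issue counts as a minimal Python sketch. The scores are the BenchLM figures from the table above; nothing else is assumed.

```python
# Convert SWE-bench Verified percentages into approximate
# resolved-issue counts on the 500-issue test set.
TOTAL_ISSUES = 500

scores = {  # per the BenchLM leaderboard above
    "Claude Mythos Preview": 93.9,
    "Claude Opus 4.7 (Adaptive)": 87.6,
}

for model, pct in scores.items():
    resolved = pct / 100 * TOTAL_ISSUES
    print(f"{model}: ~{resolved:.1f} of {TOTAL_ISSUES} issues resolved")

# The 6.3-point gap, expressed as issues:
gap_issues = (93.9 - 87.6) / 100 * TOTAL_ISSUES
print(f"Gap: ~{gap_issues:.1f} issues")  # ~31.5, i.e. roughly 32
```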

What it means for developers in May 2026

Don’t wait for Mythos to start adopting frontier coding models. Claude Opus 4.7 is GA, priced, and available everywhere: Claude Pro, Claude Max, the API, Claude Code, Cursor 3, JetBrains Air. In Adaptive mode it resolves 87.6% of SWE-bench Verified issues, and it is roughly equal to or better than GPT-5.5 on most real-world agentic-coding benchmarks (SWE-Bench Pro, Graphwalks, long-context).
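If you are adopting through the API, the call is the standard Anthropic Messages API. A minimal sketch in Python follows; note that the model identifier is an assumption for illustration, so check Anthropic’s published model list for the actual ID.

```python
# Minimal sketch: calling a GA Claude model via the Anthropic Messages API.
# "claude-opus-4-7" is a hypothetical model ID used for illustration only;
# substitute the identifier Anthropic actually publishes for Opus 4.7.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical ID, confirm against Anthropic's docs
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Fix the failing test in utils/date_parse.py"},
    ],
)
print(response.content[0].text)
```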

Mythos Preview matters most for enterprise adoption planning. If you’re a large enterprise on Claude Enterprise, Mythos Preview is what your roadmap should plan around for late 2026 / early 2027 frontier coding capability. The 93.9% number is the data point that justifies multi-year Anthropic commitments.

SWE-Bench Pro is the new frontier benchmark. Top models score around 23% on SWE-Bench Pro versus 80%+ on SWE-bench Verified. A model approaching or exceeding 40% on SWE-Bench Pro would mark the next major step. As of May 2026, no model is there yet.

What’s still unknown about Mythos Preview

  • Pricing. Anthropic hasn’t committed. Expect Opus-class pricing or higher given the parameter count.
  • GA timeline. No public commitment. Limited preview is widening through Q2 2026.
  • Inference latency and cost at scale. A 10-trillion-parameter MoE model with over a trillion active parameters per forward pass is expensive to serve, and how Anthropic handles inference economics will shape pricing; a rough estimate follows this list.
  • Multi-modality. Whether Mythos Preview matches Gemini 3.1 Pro on multimodal tasks (image, video, document analysis) is unclear from public benchmarks.
  • Long-context behavior. Public reporting suggests strong long-context reasoning, but no published Graphwalks or 1M-token results yet.
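To make the inference-economics point concrete, here is a rough sketch using the common approximation that a decoder forward pass costs about 2 FLOPs per active parameter per generated token. The active-parameter figure is the publicly reported one above; the dense-model comparison point is an assumption chosen purely for scale.

```python
# Back-of-envelope decode compute for a large MoE model.
# Rule of thumb: ~2 FLOPs per ACTIVE parameter per generated token.
# Total parameters (~10T here) drive memory footprint, not per-token compute.
active_params = 1.0e12  # ~1T active per forward pass (reported range: 0.8-1.2T)
flops_per_token = 2 * active_params
print(f"~{flops_per_token:.1e} FLOPs per generated token")  # ~2.0e12

# For scale: a hypothetical 200B-parameter dense model as a comparison point.
dense_params = 2.0e11
ratio = flops_per_token / (2 * dense_params)
print(f"~{ratio:.0f}x the per-token compute of a 200B dense model")  # ~5x
```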

How Mythos Preview changes the May 2026 model picture

Three implications:

  1. The Anthropic-vs-OpenAI race is broader than just coding. Mythos Preview’s 93.9% widens Anthropic’s coding lead, but GPT-5.5 still leads on ARC-AGI-2 (83.3% vs Opus 4.7’s 68.3%) and Humanity’s Last Exam. Anthropic owns coding; OpenAI owns reasoning benchmarks.
  2. The frontier is a tier of its own. Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and DeepSeek V4 are the “second tier.” Mythos Preview is alone at the top, and OpenAI’s response (presumably GPT-6 or GPT-5.6) hasn’t shipped yet.
  3. Benchmark saturation is real. SWE-bench Verified is closing out as a frontier benchmark. The next year of model evaluation will be SWE-Bench Pro, longer real-world tasks, and agentic-loop benchmarks rather than single-shot test-set scores.

Bottom line

Claude Mythos Preview’s 93.9% on SWE-bench Verified is the new frontier number for agentic coding capability, but it’s a limited preview without public pricing or GA. Adopt Claude Opus 4.7 today for production coding work — it’s only 6.3 points behind on SWE-bench Verified and is generally available everywhere. Plan for Mythos GA in late 2026 if you’re an enterprise on Claude Enterprise. And start watching SWE-Bench Pro: with top models at around 23%, that’s where the next two years of model progress will be visible.
