
Scale SWE-Bench Pro Private Codebase Results (May 2026)

Quick Answer

Scale Labs published updated SWE-Bench Pro Private results on May 3, 2026. The numbers tell an uncomfortable story: AI coding models perform meaningfully worse on private, previously unseen codebases than on the public SWE-Bench Pro benchmark. Claude Opus 4.1 drops from 22.7% to 17.8%. GPT-5 drops from 23.1% to 14.9%. Here’s what this means for AI coding agent selection.

Last verified: May 4, 2026

The numbers

| Model | Public SWE-Bench Pro | Private SWE-Bench Pro | Drop |
|---|---|---|---|
| Claude Opus 4.1 | 22.7% | 17.8% | -4.9 pp |
| GPT-5 | 23.1% | 14.9% | -8.2 pp |

Source: Scale Labs, SWE-Bench Pro Public leaderboard (May 3, 2026 update). The private benchmark uses codebases that were not available during model training.

The pattern: GPT-5’s drop is roughly 1.7x Claude Opus 4.1’s. Both models lose ground on private codebases, but the OpenAI flagship loses more, in absolute and relative terms alike. This suggests Claude Opus 4.1 generalizes better to novel codebases, while GPT-5’s public-benchmark performance relied more heavily on patterns seen during training.
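
For a quick sense of how large each drop is relative to the model’s public score, here is a small Python sketch that uses only the numbers from the table above.

```python
# Drops from public to private SWE-Bench Pro, using the scores in the table above.
scores = {
    "Claude Opus 4.1": (22.7, 17.8),
    "GPT-5": (23.1, 14.9),
}

for model, (public, private) in scores.items():
    drop_pp = public - private            # absolute drop, percentage points
    drop_rel = 100 * drop_pp / public     # share of the public score lost
    print(f"{model}: -{drop_pp:.1f} pp ({drop_rel:.0f}% of its public score)")

# Claude Opus 4.1: -4.9 pp (22% of its public score)
# GPT-5: -8.2 pp (35% of its public score)
```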

Why this matters

Three reasons:

  1. Real-world coding work is private codebases. Your internal codebase is private. Your client’s codebase is private. Public-benchmark scores tell you how well a model performs on the kind of code most likely to be in its training set — not the kind of code it’ll see in production. The Scale private leaderboard gives you the closer-to-real-world number.

  2. The contamination problem is real and measurable. AI labs train on massive code corpora. SWE-Bench public issues exist on GitHub. It’s been suspected for years that public benchmark scores include some contamination effect. Scale’s private version quantifies it: 4.9-8.2 percentage points for top models.

  3. Vendor selection should use private-codebase numbers. If you’re an enterprise picking a coding agent, the Scale private leaderboard is now the most credible single number. Public SWE-Bench Pro is still useful for relative ordering, but for absolute capability expectations, use private results.

What about Claude Mythos Preview?

Anthropic’s Claude Mythos Preview leads the public SWE-Bench Pro at ~77.8% per llm-stats.com (May 2026). Its private score is the most-watched data point of late Q2 2026. If Mythos holds most of its lead on private codebases, Anthropic’s coding moat is real. If it drops to GPT-5 territory, much of the lead was contamination-driven.

Expected timeline:

  • Mid-May 2026: Scale publishes private results for Mythos Preview.
  • June 2026: Independent third-party private evals (likely from Vals AI, BenchLM, etc.) confirm or contradict Scale’s numbers.
  • Q3 2026: Mythos GA changes the leaderboard again.

How to read benchmark numbers in May 2026

Three rules:

  1. Use private benchmarks for absolute capability expectations. Scale’s private SWE-Bench Pro, vals.ai private evals, and any internal-codebase eval you run yourself give you the realistic number.

  2. Use public benchmarks for relative ordering. If model A scores higher than model B on public SWE-Bench Pro, that ordering tends to hold on private — even if the absolute numbers drop.

  3. Discount any vendor-cited single benchmark by roughly 20-35% in relative terms. A vendor saying “our model scores 80% on benchmark X” likely translates to ~50-65% on novel real-world tasks (a quick way to compute this range is sketched below). Plan accordingly.
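
As a rough illustration of rule 3, here is a minimal Python sketch that applies a relative discount band to a vendor-cited score. The 20-35% band is an assumption drawn from the drops discussed above, not an official figure.

```python
def realistic_range(cited_score: float,
                    discount_low: float = 0.20,
                    discount_high: float = 0.35) -> tuple[float, float]:
    """Estimate a (pessimistic, optimistic) real-world range for a vendor-cited
    benchmark score, using an assumed 20-35% relative discount band."""
    return cited_score * (1 - discount_high), cited_score * (1 - discount_low)

low, high = realistic_range(80.0)
print(f"Cited 80% -> plan for roughly {low:.0f}-{high:.0f}% on novel tasks")
# Cited 80% -> plan for roughly 52-64% on novel tasks
```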

What this means for picking a coding agent

For enterprise coding-agent selection in May 2026:

  • If you have access to private benchmark data (Scale’s leaderboard, vals.ai, your own internal eval), use it. It’s worth more than any public number.
  • If you’re picking based on public benchmarks, expect 4-10 percentage points of overstatement on real-world tasks. Plan for the floor, not the ceiling.
  • Claude Opus 4.1 generalizes better than GPT-5 on Scale’s private codebases as of May 2026 — for novel-codebase work, this matters.
  • Mythos Preview is unproven on private codebases as of May 4, 2026. If you’re betting on it, run your own internal eval before standardizing.

How to run your own private codebase eval

Best-practice setup for a 2-week internal eval:

  1. Pick 5 internal repos representing the codebase types your team actually works on.
  2. Pick 20-50 real, recently-resolved tickets with clear problem descriptions and known fixes.
  3. Run each candidate model / agent setup on all tickets without supervision.
  4. Score by: did it produce a passing change? did the change match the actual fix? did tests pass?
  5. Track tokens spent and time-to-first-passing-attempt.

Two weeks of this gives you better signal than any public benchmark. Most enterprises picking between Codex on Bedrock, Claude Code on Bedrock, and Pi should run this eval in May-June 2026 before standardizing on a single tool.
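
For concreteness, here is a minimal Python sketch of the scoring loop described in the steps above. The `solve`, `tests_pass`, and `matches_fix` callables are placeholders you wire up to your own agent invocation, test harness, and fix-comparison logic; none of them refer to a real vendor API.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TicketResult:
    ticket_id: str
    tests_passed: bool        # did the change pass the repo's tests?
    matched_known_fix: bool   # did it roughly match the actual fix?
    tokens_used: int
    seconds_elapsed: float

def run_eval(tickets: list[dict],
             solve: Callable[[dict], dict],
             tests_pass: Callable[[dict, dict], bool],
             matches_fix: Callable[[dict, dict], bool]) -> list[TicketResult]:
    """Run one candidate agent over the ticket set, unsupervised, and record the
    signals from steps 4-5: pass/fail, fix match, tokens, and wall-clock time."""
    results = []
    for ticket in tickets:
        start = time.monotonic()
        attempt = solve(ticket)  # one unsupervised agent run on one ticket
        results.append(TicketResult(
            ticket_id=ticket["id"],
            tests_passed=tests_pass(ticket, attempt),
            matched_known_fix=matches_fix(ticket, attempt),
            tokens_used=attempt.get("tokens", 0),
            seconds_elapsed=time.monotonic() - start,
        ))
    pass_rate = sum(r.tests_passed for r in results) / len(results)
    print(f"pass rate: {pass_rate:.0%} over {len(results)} tickets")
    return results
```

Run each candidate agent over the same ticket set and compare pass rates and token usage side by side; the relative ordering on your own tickets is the number worth trusting.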

Implication for AI coding investment

Two structural takeaways:

  1. The “best coding model” claim should always include “on which benchmark.” A vendor’s leaderboard win on the public version is partial information. Private-benchmark wins matter more.

  2. The benchmark ecosystem is maturing fast. Scale’s private leaderboard, vals.ai, BenchLM, and the next wave of private evals will dominate vendor-selection discussions through 2026. Public benchmarks become marketing; private benchmarks become procurement.

Bottom line

Scale’s May 3, 2026 SWE-Bench Pro Private update shows that public benchmark scores overstate real-world coding ability by 4.9-8.2 percentage points for top frontier models. Claude Opus 4.1 generalizes better than GPT-5 on novel codebases. Mythos Preview’s private score is the next major data point. For enterprise coding-agent selection, treat public benchmarks as relative-ordering signals only, and use private-codebase numbers — Scale’s leaderboard or your own internal eval — for absolute capability expectations.

Sources: Scale Labs SWE-Bench Pro Public leaderboard (labs.scale.com/leaderboard/swe_bench_pro_public, updated May 3, 2026), llm-stats.com SWE-Bench Pro leaderboard (May 2026), benchlm.ai coding benchmarks (March-May 2026), vals.ai SWE-Bench leaderboard.