
Scale SWE-Bench Pro Private Codebase Results (May 2026)

Quick Answer

Scale Labs published updated SWE-Bench Pro Private results on May 3, 2026. The numbers tell an uncomfortable story: AI coding models perform meaningfully worse on private, previously unseen codebases than on the public SWE-Bench Pro benchmark. Claude Opus 4.1 drops from 22.7% to 17.8%. GPT-5 drops from 23.1% to 14.9%. Here’s what this means for AI coding agent selection.

Last verified: May 4, 2026

The numbers

| Model | Public SWE-Bench Pro | Private SWE-Bench Pro | Drop |
|---|---|---|---|
| Claude Opus 4.1 | 22.7% | 17.8% | -4.9 pp |
| GPT-5 | 23.1% | 14.9% | -8.2 pp |

Source: Scale Labs, SWE-Bench Pro Public leaderboard (May 3, 2026 update). The private benchmark uses codebases that were not available during model training.

The pattern: GPT-5’s drop is roughly 1.7x Claude Opus 4.1’s. Both models lose ground on private codebases, but the OpenAI flagship loses more, in absolute and relative terms alike. This suggests Claude Opus 4.1 generalizes better to novel codebases, while GPT-5’s public-benchmark performance relied more heavily on patterns seen during training.
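
For a quick sense of how large each drop is relative to the model’s public score, here is a small Python sketch that uses only the numbers from the table above.

```python
# Drops from public to private SWE-Bench Pro, using the scores in the table above.
scores = {
    "Claude Opus 4.1": (22.7, 17.8),
    "GPT-5": (23.1, 14.9),
}

for model, (public, private) in scores.items():
    drop_pp = public - private            # absolute drop, percentage points
    drop_rel = 100 * drop_pp / public     # share of the public score lost
    print(f"{model}: -{drop_pp:.1f} pp ({drop_rel:.0f}% of its public score)")

# Claude Opus 4.1: -4.9 pp (22% of its public score)
# GPT-5: -8.2 pp (35% of its public score)
```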

Why this matters

Three reasons:

  1. Real-world coding work is private codebases. Your internal codebase is private. Your client’s codebase is private. Public-benchmark scores tell you how well a model performs on the kind of code most likely to be in its training set — not the kind of code it’ll see in production. The Scale private leaderboard gives you the closer-to-real-world number.

  2. The contamination problem is real and measurable. AI labs train on massive code corpora. SWE-Bench public issues exist on GitHub. It’s been suspected for years that public benchmark scores include some contamination effect. Scale’s private version quantifies it: 4.9-8.2 percentage points for top models.

  3. Vendor selection should use private-codebase numbers. If you’re an enterprise picking a coding agent, the Scale private leaderboard is now the most credible single number. Public SWE-Bench Pro is still useful for relative ordering, but for absolute capability expectations, use private results.

What about Claude Mythos Preview?

Anthropic’s Claude Mythos Preview leads the public SWE-Bench Pro at ~77.8% per llm-stats.com (May 2026). Its private score is the most-watched data point of late Q2 2026. If Mythos holds most of its lead on private codebases, Anthropic’s coding moat is real. If it drops to GPT-5 territory, much of the lead was contamination-driven.

Expected timeline:

  • Mid-May 2026: Scale publishes private results for Mythos Preview.
  • June 2026: Independent third-party private evals (likely from Vals AI, BenchLM, etc.) confirm or contradict Scale’s numbers.
  • Q3 2026: Mythos GA changes the leaderboard again.

How to read benchmark numbers in May 2026

Three rules:

  1. Use private benchmarks for absolute capability expectations. Scale’s private SWE-Bench Pro, vals.ai private evals, and any internal-codebase eval you run yourself give you the realistic number.

  2. Use public benchmarks for relative ordering. If model A scores higher than model B on public SWE-Bench Pro, that ordering tends to hold on private — even if the absolute numbers drop.

  3. Discount any vendor-cited single benchmark by roughly 20-35% in relative terms. A vendor saying “our model scores 80% on benchmark X” likely translates to ~50-65% on novel real-world tasks (a quick way to compute this range is sketched below). Plan accordingly.
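
As a rough illustration of rule 3, here is a minimal Python sketch that applies a relative discount band to a vendor-cited score. The 20-35% band is an assumption drawn from the drops discussed above, not an official figure.

```python
def realistic_range(cited_score: float,
                    discount_low: float = 0.20,
                    discount_high: float = 0.35) -> tuple[float, float]:
    """Estimate a (pessimistic, optimistic) real-world range for a vendor-cited
    benchmark score, using an assumed 20-35% relative discount band."""
    return cited_score * (1 - discount_high), cited_score * (1 - discount_low)

low, high = realistic_range(80.0)
print(f"Cited 80% -> plan for roughly {low:.0f}-{high:.0f}% on novel tasks")
# Cited 80% -> plan for roughly 52-64% on novel tasks
```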

What this means for picking a coding agent

For enterprise coding-agent selection in May 2026:

  • If you have access to private benchmark data (Scale’s leaderboard, vals.ai, your own internal eval), use it. It’s worth more than any public number.
  • If you’re picking based on public benchmarks, expect 4-10 percentage points of overstatement on real-world tasks. Plan for the floor, not the ceiling.
  • Claude Opus 4.1 generalizes better than GPT-5 on Scale’s private codebases as of May 2026 — for novel-codebase work, this matters.
  • Mythos Preview is unproven on private codebases as of May 4, 2026. If you’re betting on it, run your own internal eval before standardizing.

How to run your own private codebase eval

Best-practice setup for a 2-week internal eval:

  1. Pick 5 internal repos representing the codebase types your team actually works on.
  2. Pick 20-50 real, recently-resolved tickets with clear problem descriptions and known fixes.
  3. Run each candidate model / agent setup on all tickets without supervision.
  4. Score by: did it produce a passing change? did the change match the actual fix? did tests pass?
  5. Track tokens spent and time-to-first-passing-attempt.

Two weeks of this gives you better signal than any public benchmark. Most enterprises picking between Codex on Bedrock, Claude Code on Bedrock, and Pi should run this eval in May-June 2026 before standardizing on a single tool.
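
For concreteness, here is a minimal Python sketch of the scoring loop described in the steps above. The `solve`, `tests_pass`, and `matches_fix` callables are placeholders you wire up to your own agent invocation, test harness, and fix-comparison logic; none of them refer to a real vendor API.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TicketResult:
    ticket_id: str
    tests_passed: bool        # did the change pass the repo's tests?
    matched_known_fix: bool   # did it roughly match the actual fix?
    tokens_used: int
    seconds_elapsed: float

def run_eval(tickets: list[dict],
             solve: Callable[[dict], dict],
             tests_pass: Callable[[dict, dict], bool],
             matches_fix: Callable[[dict, dict], bool]) -> list[TicketResult]:
    """Run one candidate agent over the ticket set, unsupervised, and record the
    signals from steps 4-5: pass/fail, fix match, tokens, and wall-clock time."""
    results = []
    for ticket in tickets:
        start = time.monotonic()
        attempt = solve(ticket)  # one unsupervised agent run on one ticket
        results.append(TicketResult(
            ticket_id=ticket["id"],
            tests_passed=tests_pass(ticket, attempt),
            matched_known_fix=matches_fix(ticket, attempt),
            tokens_used=attempt.get("tokens", 0),
            seconds_elapsed=time.monotonic() - start,
        ))
    pass_rate = sum(r.tests_passed for r in results) / len(results)
    print(f"pass rate: {pass_rate:.0%} over {len(results)} tickets")
    return results
```

Run each candidate agent over the same ticket set and compare pass rates and token usage side by side; the relative ordering on your own tickets is the number worth trusting.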

Implication for AI coding investment

Two structural takeaways:

  1. The “best coding model” claim should always include “on which benchmark.” A vendor’s leaderboard win on the public version is partial information. Private-benchmark wins matter more.

  2. The benchmark ecosystem is maturing fast. Scale’s private leaderboard, vals.ai, BenchLM, and the next wave of private evals will dominate vendor-selection discussions through 2026. Public benchmarks become marketing; private benchmarks become procurement.

Bottom line

Scale’s May 3, 2026 SWE-Bench Pro Private update shows that public benchmark scores overstate real-world coding ability by 4.9-8.2 percentage points for top frontier models. Claude Opus 4.1 generalizes better than GPT-5 on novel codebases. Mythos Preview’s private score is the next major data point. For enterprise coding-agent selection, treat public benchmarks as relative-ordering signals only, and use private-codebase numbers — Scale’s leaderboard or your own internal eval — for absolute capability expectations.

Sources: Scale Labs SWE-Bench Pro Public leaderboard (labs.scale.com/leaderboard/swe_bench_pro_public, updated May 3, 2026), llm-stats.com SWE-Bench Pro leaderboard (May 2026), benchlm.ai coding benchmarks (March-May 2026), vals.ai SWE-Bench leaderboard.