What Is GDPval-AA? The Agentic AI Benchmark (May 2026)
GDPval-AA is the agentic-AI benchmark that’s become the May 2026 reference for picking models for production agent loops. Instead of scoring models on isolated coding tickets (SWE-Bench Pro) or trivia (MMLU), GDPval-AA scores models on real-world, multi-step economic-value tasks — research, customer support, multi-system integrations, autonomous loops with many tool calls. Per Artificial Analysis (April 2026), DeepSeek V4 Pro Max leads open-weights models at 1554, with GLM-5.1 (1535) and Kimi K2.6 (1484) trailing. Here’s why this matters more than SWE-Bench Pro for most enterprise procurement.
Last verified: May 5, 2026
What GDPval-AA measures
GDPval (“GDP value”) is a benchmark concept aimed at measuring AI usefulness on tasks that have real economic value. The “AA” suffix refers to Artificial Analysis (the firm publishing the leaderboard), distinguishing it from related benchmarks like OpenAI’s GDPval research evaluation.
GDPval-AA tests models on:
- Multi-step research tasks — gather data from multiple sources, synthesize, deliver a structured output.
- Tool-use sequences — chain 5-20 API calls correctly to accomplish a workflow.
- Long-horizon planning — multi-hour or multi-session tasks where intermediate state matters.
- Mixed-modality work — text + structured data + sometimes image / table interpretation.
- Recovery from errors — when a tool call fails, does the model adapt or get stuck?
Tasks are scored by whether the agent reaches a correct, complete output. Partial credit is limited — production agents either work or they don’t.
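To make the task shape concrete, here is a minimal sketch of the kind of tool-calling loop these tasks exercise. Everything in it (the `choose_step` model wrapper, the `tools` dict, the `ToolError` type) is a hypothetical stand-in, not part of any published GDPval-AA harness:

```python
# A minimal sketch of the agent-loop shape GDPval-AA-style tasks exercise:
# plan, call a tool, observe, recover from failures, finish or exhaust budget.
# choose_step and the tools dict are hypothetical stand-ins, not benchmark APIs.
from dataclasses import dataclass
from typing import Any, Callable

class ToolError(Exception):
    """Raised by a tool wrapper when an underlying API call fails."""

@dataclass
class Step:
    tool: str | None                  # None means the model is finished
    args: dict[str, Any]
    answer: str | None = None         # final output when tool is None

MAX_STEPS = 20                        # tasks chain roughly 5-20 tool calls

def run_agent(goal: str,
              choose_step: Callable[[str, list], Step],
              tools: dict[str, Callable[..., str]]) -> str | None:
    history: list[tuple[str, str]] = []
    for _ in range(MAX_STEPS):
        step = choose_step(goal, history)
        if step.tool is None:
            return step.answer        # scored pass/fail on correctness and completeness
        try:
            result = tools[step.tool](**step.args)
        except ToolError as err:
            # Error recovery: feed the failure back so the model can adapt,
            # rather than halting on the first broken tool call.
            result = f"TOOL_FAILED: {err}"
        history.append((step.tool, result))
    return None                       # stuck or wandering agents exhaust the budget
```

The two branches at the end of the loop body are exactly the failure modes the benchmark separates: a model that adapts to `TOOL_FAILED` keeps making progress; one that repeats the same broken call runs out the step budget and scores zero.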
May 2026 leaderboard (open weights)
Per Artificial Analysis “DeepSeek is back among the leading open weights models” (April 2026):
| Rank | Model | GDPval-AA Score |
|---|---|---|
| 1 | DeepSeek V4 Pro Max | 1554 |
| 2 | GLM-5.1 | 1535 |
| 3 | Kimi K2.6 | 1484 |
| 4 | GLM-5 | 1402 |
Closed frontier models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, Mythos Preview) sit above this open-weights range, but exact closed-model GDPval-AA scores are less consistently published. Internal estimates put Claude Opus 4.7 in the 1700-1800 range and Mythos Preview higher.
How GDPval-AA differs from SWE-Bench Pro
The two benchmarks measure different things:
| Aspect | SWE-Bench Pro | GDPval-AA |
|---|---|---|
| Task type | Single coding tickets | Multi-step economic-value tasks |
| Tool use | Limited (file ops, tests) | Heavy (5-20 tool calls per task) |
| Horizon | Minutes per ticket | Hours per task |
| Failure mode | Patch doesn’t pass tests | Agent gets stuck or wanders |
| Best at | Coding capability ceiling | Production agent reliability |
A model can ace SWE-Bench Pro by being smart at code generation but still fail GDPval-AA by being unreliable in long agent loops. The reverse is also possible.
Real example: Kimi K2.6 scores slightly higher on SWE-Bench Pro (58.6% vs ~58% for DeepSeek V4 Pro), but DeepSeek V4 Pro Max scores 70 points higher on GDPval-AA (1554 vs 1484). For a single coding ticket, Kimi K2.6 is competitive. For an autonomous coding agent running for hours, DeepSeek V4 Pro Max is materially better.
Why GDPval-AA is becoming the procurement benchmark
Three reasons enterprises are weighting GDPval-AA more heavily:
1. Production workloads are agents, not single calls. Most enterprise AI deployments in 2026 are running agents — multi-step workflows with tool use. SWE-Bench Pro tells you how good a model is at coding in isolation; GDPval-AA tells you how good it is at being an agent.
2. Failure modes are different. A model that gets a coding ticket wrong is a quality problem. A model that gets stuck in an agent loop is an operations problem — it consumes tokens, blocks workflows, and generates support tickets. GDPval-AA captures these failure modes (a guard sketch follows this list).
3. SWE-Bench Pro is increasingly contaminated. Public benchmarks suffer from training-data contamination over time. Scale’s private SWE-Bench Pro leaderboard showed 5-8 percentage point drops on private codebases. GDPval-AA’s task structure is less amenable to memorization, making it a more reliable signal of real-world capability.
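To make the operations point concrete, here is a hypothetical guard that catches the stuck-loop failure mode before it burns the token budget. The class, thresholds, and fingerprinting scheme are illustrative assumptions, not from any published harness:

```python
# Hypothetical operations guard for a production agent loop: catch the
# "stuck agent" failure mode before it burns tokens and blocks workflows.
# Thresholds are illustrative; tune them to your own traffic.
from collections import Counter

class StuckAgentGuard:
    def __init__(self, max_tokens: int = 200_000, max_repeats: int = 3):
        self.max_tokens = max_tokens              # hard token budget per task
        self.max_repeats = max_repeats            # identical calls before bailing
        self.tokens_used = 0
        self.calls: Counter[tuple[str, str]] = Counter()

    def check(self, tool: str, args_fingerprint: str, tokens: int) -> None:
        """Call once per tool invocation; raises when the agent should be stopped."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exhausted; escalate to a human")
        self.calls[(tool, args_fingerprint)] += 1
        if self.calls[(tool, args_fingerprint)] > self.max_repeats:
            raise RuntimeError(f"looping on {tool}; abort and escalate")
```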
Limitations of GDPval-AA
GDPval-AA isn’t perfect:
- Less mature than SWE-Bench Pro. The benchmark methodology is newer and still evolving; scores from different time periods may not be directly comparable.
- Closed-model coverage is uneven. Anthropic and OpenAI don’t always submit models to third-party agentic benchmarks, so the closed-model side of the leaderboard has gaps.
- Task distribution may not match yours. GDPval-AA tasks lean toward research, customer support, and multi-system workflows. If your workloads are dominated by something else (e.g., creative writing, code review), GDPval-AA may not predict your model’s performance well.
- Single-attempt scoring. Most scores reflect single-attempt performance; in production, retries and human-in-the-loop fallbacks change the calculus (see the back-of-envelope sketch after this list).
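As a back-of-envelope illustration of how retries change the calculus, the snippet below computes success-within-k-attempts from a single-attempt rate. The independence assumption is optimistic, since real agent failures tend to correlate across attempts:

```python
# Back-of-envelope: how retries change single-attempt numbers, assuming
# (optimistically) that attempts fail independently; real failures correlate.
def success_within(p: float, k: int) -> float:
    """P(at least one success in k attempts) given per-attempt success rate p."""
    return 1 - (1 - p) ** k

for p in (0.60, 0.75, 0.90):
    print(f"p={p:.2f}: 1 try={success_within(p, 1):.2f}, "
          f"2 tries={success_within(p, 2):.2f}, 3 tries={success_within(p, 3):.2f}")
# p=0.60: 1 try=0.60, 2 tries=0.84, 3 tries=0.94
# p=0.75: 1 try=0.75, 2 tries=0.94, 3 tries=0.98
# p=0.90: 1 try=0.90, 2 tries=0.99, 3 tries=1.00
```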
How to use GDPval-AA in model selection
Practical guidance for May 2026:
- For general agentic work, weight GDPval-AA above SWE-Bench Pro. Most production deployments run agents; GDPval-AA is the better single number.
- For coding-specific work, look at both. SWE-Bench Pro tells you ceiling capability; GDPval-AA tells you operational reliability. The best model is usually the one that’s competitive on both.
- For procurement evaluation, run your own internal eval. Public benchmarks are relative-ordering signals. Your own internal eval on your real workflows is the only number that tells you absolute performance on your tasks (a minimal harness sketch follows this list).
- Don’t pick a model on a single benchmark. Use GDPval-AA + SWE-Bench Pro + your own internal eval together. Models that win on all three are safe defaults; models that win on one and lose on the others are usually overfit.
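A minimal internal-eval harness might look like the sketch below. The `run_task` callable, the task format, and the model IDs are placeholders for your own stack, not a GDPval-AA API:

```python
# A minimal internal-eval harness sketch. Nothing here is a GDPval-AA API;
# you supply run_task (your agent runner) and tasks (your real workflows).
import statistics
from typing import Any, Callable

def evaluate(model: str,
             tasks: list[dict[str, Any]],
             run_task: Callable[[str, Any], Any]) -> float:
    """Fraction of tasks the model completes end to end (pass/fail, no partial credit)."""
    outcomes = [1.0 if task["check"](run_task(model, task["input"])) else 0.0
                for task in tasks]
    return statistics.mean(outcomes)

# Usage: rank candidates on *your* tasks, not the public leaderboard's.
# scores = {m: evaluate(m, my_tasks, my_runner)
#           for m in ("deepseek-v4-pro-max", "glm-5.1", "kimi-k2.6")}
```

Keeping the scoring pass/fail, like GDPval-AA does, makes the internal number directly comparable in spirit: an agent that almost finishes a workflow still fails it.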
Models to consider by GDPval-AA tier
Based on May 2026 data:
Tier 1 (closed frontier, GDPval-AA est. 1700+):
- Claude Opus 4.7
- Mythos Preview
- GPT-5.5
- Gemini 3.1 Pro
Tier 2 (open weights, GDPval-AA 1500-1600):
- DeepSeek V4 Pro Max (1554)
- GLM-5.1 (1535)
Tier 3 (open weights, GDPval-AA 1400-1500):
- Kimi K2.6 (1484)
- GLM-5 (1402)
For most enterprises, a router that defaults to Tier 2 open weights and escalates to Tier 1 closed frontier on hard tasks is the cost-optimized path.
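A minimal version of that router might look like the sketch below. The model IDs and the difficulty threshold are illustrative assumptions; the difficulty signal would come from your own task classifier:

```python
# Hypothetical tiered router matching the pattern above: default to a Tier 2
# open-weights model, escalate to a Tier 1 frontier model on hard tasks.
TIER2_DEFAULT = "deepseek-v4-pro-max"    # illustrative IDs, not real endpoints
TIER1_ESCALATION = "claude-opus-4.7"

def route(task_difficulty: float, requires_long_horizon: bool) -> str:
    """Pick a model; difficulty in [0, 1] comes from your own classifier."""
    if task_difficulty > 0.8 or requires_long_horizon:
        return TIER1_ESCALATION   # pay frontier prices only when the task demands it
    return TIER2_DEFAULT
```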
What’s coming
Three benchmark developments to watch through Q2-Q3 2026:
- GDPval-AA expansion. Artificial Analysis is reportedly adding more domains (legal, healthcare, finance) to the benchmark.
- Closed-model coverage. Pressure is building on Anthropic and OpenAI to submit models to third-party agentic benchmarks for transparency.
- Private versions. Following Scale’s private SWE-Bench Pro pattern, expect private GDPval-AA variants within 6-12 months that test on novel agentic tasks.
Bottom line
GDPval-AA is the agentic-AI benchmark to watch in May 2026. It captures real-world agent reliability that SWE-Bench Pro doesn’t, and rankings can differ meaningfully — DeepSeek V4 Pro Max leads open weights at 1554 despite trailing slightly on SWE-Bench Pro. For enterprise model selection, weight GDPval-AA heavily for agentic workloads, run your own internal eval as the absolute-capability check, and use SWE-Bench Pro as a ceiling-capability sanity check.
Sources: Artificial Analysis “DeepSeek is back among the leading open weights models with V4 Pro and V4 Flash” (April 2026), BenchLM.ai Chinese leaderboard (April 2026), Atlas Cloud comparison (April 2026), Scale Labs SWE-Bench Pro private leaderboard (May 2026).