What Is GDPval-AA? The Agentic AI Benchmark (May 2026)
GDPval-AA is the agentic-AI benchmark that’s become the May 2026 reference for picking models for production agent loops. Instead of scoring models on isolated coding tickets (SWE-Bench Pro) or trivia (MMLU), GDPval-AA scores models on real-world, multi-step economic-value tasks — research, customer support, multi-system integrations, autonomous loops with many tool calls. Per Artificial Analysis (April 2026), DeepSeek V4 Pro Max leads open-weights models at 1554, with GLM-5.1 (1535) and Kimi K2.6 (1484) trailing. Here’s why this matters more than SWE-Bench Pro for most enterprise procurement.
Last verified: May 5, 2026
What GDPval-AA measures
GDPval (“GDP value”) is a benchmark concept aimed at measuring AI usefulness on tasks that have real economic value. The “AA” suffix refers to Artificial Analysis (the firm publishing the leaderboard), distinguishing it from related benchmarks like OpenAI’s GDPval research evaluation.
GDPval-AA tests models on:
- Multi-step research tasks — gather data from multiple sources, synthesize, deliver a structured output.
- Tool-use sequences — chain 5-20 API calls correctly to accomplish a workflow.
- Long-horizon planning — multi-hour or multi-session tasks where intermediate state matters.
- Mixed-modality work — text + structured data + sometimes image / table interpretation.
- Recovery from errors — when a tool call fails, does the model adapt or get stuck?
Tasks are scored by whether the agent reaches a correct, complete output. Partial credit is limited — production agents either work or they don’t.
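To make the task shape concrete, here is a minimal sketch of the kind of tool-calling loop these tasks exercise. Everything in it (the `choose_step` model wrapper, the `tools` dict, the `ToolError` type) is a hypothetical stand-in, not part of any published GDPval-AA harness:

```python
# A minimal sketch of the agent-loop shape GDPval-AA-style tasks exercise:
# plan, call a tool, observe, recover from failures, finish or exhaust budget.
# choose_step and the tools dict are hypothetical stand-ins, not benchmark APIs.
from dataclasses import dataclass
from typing import Any, Callable

class ToolError(Exception):
    """Raised by a tool wrapper when an underlying API call fails."""

@dataclass
class Step:
    tool: str | None                  # None means the model is finished
    args: dict[str, Any]
    answer: str | None = None         # final output when tool is None

MAX_STEPS = 20                        # tasks chain roughly 5-20 tool calls

def run_agent(goal: str,
              choose_step: Callable[[str, list], Step],
              tools: dict[str, Callable[..., str]]) -> str | None:
    history: list[tuple[str, str]] = []
    for _ in range(MAX_STEPS):
        step = choose_step(goal, history)
        if step.tool is None:
            return step.answer        # scored pass/fail on correctness and completeness
        try:
            result = tools[step.tool](**step.args)
        except ToolError as err:
            # Error recovery: feed the failure back so the model can adapt,
            # rather than halting on the first broken tool call.
            result = f"TOOL_FAILED: {err}"
        history.append((step.tool, result))
    return None                       # stuck or wandering agents exhaust the budget
```

The two branches at the end of the loop body are exactly the failure modes the benchmark separates: a model that adapts to `TOOL_FAILED` keeps making progress; one that repeats the same broken call runs out the step budget and scores zero.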
May 2026 leaderboard (open weights)
Per Artificial Analysis “DeepSeek is back among the leading open weights models” (April 2026):
| Rank | Model | GDPval-AA Score |
|---|---|---|
| 1 | DeepSeek V4 Pro Max | 1554 |
| 2 | GLM-5.1 | 1535 |
| 3 | Kimi K2.6 | 1484 |
| 4 | GLM-5 | 1402 |
Closed frontier models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, Mythos Preview) sit above this open-weights range, but exact closed-model GDPval-AA scores are less consistently published. Internal estimates put Claude Opus 4.7 in the 1700-1800 range and Mythos Preview higher.
How GDPval-AA differs from SWE-Bench Pro
The two benchmarks measure different things:
| Aspect | SWE-Bench Pro | GDPval-AA |
|---|---|---|
| Task type | Single coding tickets | Multi-step economic-value tasks |
| Tool use | Limited (file ops, tests) | Heavy (5-20 tool calls per task) |
| Horizon | Minutes per ticket | Hours per task |
| Failure mode | Patch doesn’t pass tests | Agent gets stuck or wanders |
| Best at | Coding capability ceiling | Production agent reliability |
A model can ace SWE-Bench Pro by being smart at code generation but still fail GDPval-AA by being unreliable in long agent loops. The reverse is also possible.
Real example: Kimi K2.6 scores slightly higher on SWE-Bench Pro (58.6% vs ~58% for DeepSeek V4 Pro), but DeepSeek V4 Pro Max scores 70 points higher on GDPval-AA (1554 vs 1484). For a single coding ticket, Kimi K2.6 is competitive. For an autonomous coding agent running for hours, DeepSeek V4 Pro Max is materially better.
Why GDPval-AA is becoming the procurement benchmark
Three reasons enterprises are weighting GDPval-AA more heavily:
1. Production workloads are agents, not single calls. Most enterprise AI deployments in 2026 are running agents — multi-step workflows with tool use. SWE-Bench Pro tells you how good a model is at coding in isolation; GDPval-AA tells you how good it is at being an agent.
2. Failure modes are different. A model that gets a coding ticket wrong is a quality problem. A model that gets stuck in an agent loop is an operations problem — it consumes tokens, blocks workflows, and generates support tickets. GDPval-AA captures these failure modes (a guard sketch follows this list).
3. SWE-Bench Pro is increasingly contaminated. Public benchmarks suffer from training-data contamination over time. Scale’s private SWE-Bench Pro leaderboard showed 5-8 percentage point drops on private codebases. GDPval-AA’s task structure is less amenable to memorization, making it a more reliable signal of real-world capability.
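To make the operations point concrete, here is a hypothetical guard that catches the stuck-loop failure mode before it burns the token budget. The class, thresholds, and fingerprinting scheme are illustrative assumptions, not from any published harness:

```python
# Hypothetical operations guard for a production agent loop: catch the
# "stuck agent" failure mode before it burns tokens and blocks workflows.
# Thresholds are illustrative; tune them to your own traffic.
from collections import Counter

class StuckAgentGuard:
    def __init__(self, max_tokens: int = 200_000, max_repeats: int = 3):
        self.max_tokens = max_tokens              # hard token budget per task
        self.max_repeats = max_repeats            # identical calls before bailing
        self.tokens_used = 0
        self.calls: Counter[tuple[str, str]] = Counter()

    def check(self, tool: str, args_fingerprint: str, tokens: int) -> None:
        """Call once per tool invocation; raises when the agent should be stopped."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exhausted; escalate to a human")
        self.calls[(tool, args_fingerprint)] += 1
        if self.calls[(tool, args_fingerprint)] > self.max_repeats:
            raise RuntimeError(f"looping on {tool}; abort and escalate")
```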
Limitations of GDPval-AA
GDPval-AA isn’t perfect:
- Less mature than SWE-Bench Pro. The benchmark methodology is newer and still evolving; scores from different time periods may not be directly comparable.
- Closed-model coverage is uneven. Anthropic and OpenAI don’t always submit models to third-party agentic benchmarks, so the closed-model side of the leaderboard has gaps.
- Task distribution may not match yours. GDPval-AA tasks lean toward research, customer support, and multi-system workflows. If your workloads are dominated by something else (e.g., creative writing, code review), GDPval-AA may not predict your model’s performance well.
- Single-attempt scoring. Most scores reflect single-attempt performance; in production, retries and human-in-the-loop fallbacks change the calculus (see the back-of-envelope sketch after this list).
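As a back-of-envelope illustration of how retries change the calculus, the snippet below computes success-within-k-attempts from a single-attempt rate. The independence assumption is optimistic, since real agent failures tend to correlate across attempts:

```python
# Back-of-envelope: how retries change single-attempt numbers, assuming
# (optimistically) that attempts fail independently; real failures correlate.
def success_within(p: float, k: int) -> float:
    """P(at least one success in k attempts) given per-attempt success rate p."""
    return 1 - (1 - p) ** k

for p in (0.60, 0.75, 0.90):
    print(f"p={p:.2f}: 1 try={success_within(p, 1):.2f}, "
          f"2 tries={success_within(p, 2):.2f}, 3 tries={success_within(p, 3):.2f}")
# p=0.60: 1 try=0.60, 2 tries=0.84, 3 tries=0.94
# p=0.75: 1 try=0.75, 2 tries=0.94, 3 tries=0.98
# p=0.90: 1 try=0.90, 2 tries=0.99, 3 tries=1.00
```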
How to use GDPval-AA in model selection
Practical guidance for May 2026:
- For general agentic work, weight GDPval-AA above SWE-Bench Pro. Most production deployments run agents; GDPval-AA is the better single number.
- For coding-specific work, look at both. SWE-Bench Pro tells you ceiling capability; GDPval-AA tells you operational reliability. The best model is usually the one that’s competitive on both.
- For procurement evaluation, run your own internal eval. Public benchmarks are relative-ordering signals. Your own internal eval on your real workflows is the only number that tells you absolute performance on your tasks (a minimal harness sketch follows this list).
- Don’t pick a model on a single benchmark. Use GDPval-AA + SWE-Bench Pro + your own internal eval together. Models that win on all three are safe defaults; models that win on one and lose on the others are usually overfit.
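A minimal internal-eval harness might look like the sketch below. The `run_task` callable, the task format, and the model IDs are placeholders for your own stack, not a GDPval-AA API:

```python
# A minimal internal-eval harness sketch. Nothing here is a GDPval-AA API;
# you supply run_task (your agent runner) and tasks (your real workflows).
import statistics
from typing import Any, Callable

def evaluate(model: str,
             tasks: list[dict[str, Any]],
             run_task: Callable[[str, Any], Any]) -> float:
    """Fraction of tasks the model completes end to end (pass/fail, no partial credit)."""
    outcomes = [1.0 if task["check"](run_task(model, task["input"])) else 0.0
                for task in tasks]
    return statistics.mean(outcomes)

# Usage: rank candidates on *your* tasks, not the public leaderboard's.
# scores = {m: evaluate(m, my_tasks, my_runner)
#           for m in ("deepseek-v4-pro-max", "glm-5.1", "kimi-k2.6")}
```

Keeping the scoring pass/fail, like GDPval-AA does, makes the internal number directly comparable in spirit: an agent that almost finishes a workflow still fails it.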
Models to consider by GDPval-AA tier
Based on May 2026 data:
Tier 1 (closed frontier, GDPval-AA est. 1700+):
- Claude Opus 4.7
- Mythos Preview
- GPT-5.5
- Gemini 3.1 Pro
Tier 2 (open weights, GDPval-AA 1500-1600):
- DeepSeek V4 Pro Max (1554)
- GLM-5.1 (1535)
Tier 3 (open weights, GDPval-AA 1400-1500):
- Kimi K2.6 (1484)
- GLM-5 (1402)
For most enterprises, a router that defaults to Tier 2 open weights and escalates to Tier 1 closed frontier on hard tasks is the cost-optimized path.
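A minimal version of that router might look like the sketch below. The model IDs and the difficulty threshold are illustrative assumptions; the difficulty signal would come from your own task classifier:

```python
# Hypothetical tiered router matching the pattern above: default to a Tier 2
# open-weights model, escalate to a Tier 1 frontier model on hard tasks.
TIER2_DEFAULT = "deepseek-v4-pro-max"    # illustrative IDs, not real endpoints
TIER1_ESCALATION = "claude-opus-4.7"

def route(task_difficulty: float, requires_long_horizon: bool) -> str:
    """Pick a model; difficulty in [0, 1] comes from your own classifier."""
    if task_difficulty > 0.8 or requires_long_horizon:
        return TIER1_ESCALATION   # pay frontier prices only when the task demands it
    return TIER2_DEFAULT
```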
What’s coming
Three benchmark developments to watch through Q2-Q3 2026:
- GDPval-AA expansion. Artificial Analysis is reportedly adding more domains (legal, healthcare, finance) to the benchmark.
- Closed-model coverage. Pressure is building on Anthropic and OpenAI to submit models to third-party agentic benchmarks for transparency.
- Private versions. Following Scale’s private SWE-Bench Pro pattern, expect private GDPval-AA variants within 6-12 months that test on novel agentic tasks.
Bottom line
GDPval-AA is the agentic-AI benchmark to watch in May 2026. It captures real-world agent reliability that SWE-Bench Pro doesn’t, and rankings can differ meaningfully — DeepSeek V4 Pro Max leads open weights at 1554 despite trailing slightly on SWE-Bench Pro. For enterprise model selection, weight GDPval-AA heavily for agentic workloads, run your own internal eval as the absolute-capability check, and use SWE-Bench Pro as a ceiling-capability sanity check.
Sources: Artificial Analysis “DeepSeek is back among the leading open weights models with V4 Pro and V4 Flash” (April 2026), BenchLM.ai Chinese leaderboard (April 2026), Atlas Cloud comparison (April 2026), Scale Labs SWE-Bench Pro private leaderboard (May 2026).