What is Terminal-Bench 2.0? The New Agent Benchmark
Terminal-Bench 2.0 is the benchmark AI labs cite when they want to measure agent capability, not just coding ability. It’s why the April 2026 model debates keep coming back to “GPT-5.5 at 82.7%, Opus 4.7 at 69.4%.” Here’s what the benchmark actually tests and why the scores matter.
Last verified: April 24, 2026
The one-sentence definition
Terminal-Bench 2.0 is an evaluation suite that runs an AI agent inside a real shell, gives it a real task (e.g., “debug this failing build,” “set up this repository on a fresh Ubuntu VM”), and scores whether the agent achieves the goal using only terminal commands.
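The benchmark’s internal task format isn’t reproduced here, but the definition above implies a consistent shape: an environment, a natural-language goal, and a binary verifier. A minimal sketch of what such a task record could look like — all field names and the `score` helper are hypothetical, not the benchmark’s actual schema:

```python
# Hypothetical task record illustrating the shape implied by the definition:
# a real environment, a natural-language instruction, and a binary check.
task = {
    "id": "fix-failing-build",
    "image": "ubuntu:22.04",           # fresh container the agent works in
    "instruction": "The build is broken. Make the build command exit 0.",
    "check": "make",                   # verifier command; exit code 0 = pass
    "max_steps": 200,
}

def score(run_result: dict) -> int:
    """All-or-nothing scoring: 1 only if the verifier command passed."""
    return 1 if run_result.get("check_exit_code") == 0 else 0
```

The key design point is the last function: there is no notion of “mostly fixed the build” — the verifier either exits 0 or the task scores 0.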
Why terminal tasks matter as a benchmark
A model can ace SWE-bench by generating a perfect patch in a single response. But real-world agent work looks different:
- Run a command
- Read the output
- Hit something unexpected
- Adapt, try a different command
- Check progress
- Repeat for 20–200 steps
This is what Terminal-Bench measures: sustained multi-step tool use with real feedback loops. A model that’s brilliant at one-shot coding can still fail miserably at running `make test`, noticing a segfault, and debugging it.
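The loop described above can be sketched in a few lines. This is not the benchmark’s actual harness — `agent_loop`, `propose_command`, and `goal_check` are illustrative names — but it captures the observe-act cycle being measured:

```python
import os
import subprocess
import tempfile

def run(cmd):
    """Run a shell command; return (exit_code, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(goal_check, propose_command, max_steps=200):
    """Minimal observe-act loop: propose a command, run it, observe the
    result, repeat until the goal check passes or the step budget runs out."""
    history = []
    for step in range(max_steps):
        if goal_check():
            return True, step
        cmd = propose_command(history)   # in a real harness, an LLM call
        code, output = run(cmd)
        history.append((cmd, code, output))
    return goal_check(), max_steps

# Toy usage: the "goal" is that a marker file exists.
marker = os.path.join(tempfile.mkdtemp(), "done")
ok, steps = agent_loop(
    goal_check=lambda: os.path.exists(marker),
    propose_command=lambda hist: f"touch {marker}",
)
```

The `history` list is what distinguishes agent evaluation from one-shot coding: every command’s output feeds the next decision, for potentially hundreds of steps.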
What changed in 2.0
Terminal-Bench 1.0 launched in 2024 as one of the earliest pure-agent benchmarks. By late 2025, the top models were saturating it — scores in the 90%s, lots of reward hacking, and edge cases that didn’t reflect real work.
Terminal-Bench 2.0 (early 2026) shipped three changes:
- Harder tasks. More open-ended goals, fewer “copy-paste this command” shortcuts.
- Stricter scoring. Partial credit removed; tasks either succeed or fail end-to-end.
- Adversarial setups. Broken Docker images, missing dependencies, conflicting packages — the kind of mess real engineers face.
The result: scores dropped by 15–25 points across the board. GPT-5.4 went from ~85% on Terminal-Bench 1.0 to ~64% on 2.0.
April 2026 leaderboard
| Model | Terminal-Bench 2.0 | Released |
|---|---|---|
| GPT-5.5 | 82.7% | Apr 23, 2026 |
| Claude Mythos Preview | 82.0% | Apr 2026 |
| Claude Opus 4.7 | 69.4% | Apr 16, 2026 |
| Gemini 3.1 Ultra | 66.0% | Apr 2026 |
| GPT-5.4 | ~64% | Mar 2026 |
| Claude Sonnet 4.6 | 62.8% | Feb 2026 |
| Gemini 3.1 Pro | 58.0% | Apr 2026 |
| Claude 3.7 Sonnet | ~45% | 2025 |
Three takeaways:
- Only two models crack 80% — GPT-5.5 and the still-restricted Claude Mythos Preview.
- Opus 4.7’s lead on SWE-bench doesn’t transfer. On Terminal-Bench 2.0, it lags GPT-5.5 by 13 points.
- The generational gap is stark. Claude 3.7 Sonnet scored around 45%; Sonnet 4.6 scores 62.8%. Across model generations, agentic capability is improving faster than raw reasoning is.
What Terminal-Bench 2.0 actually tests
Categories in the 2.0 task set:
| Category | Example task |
|---|---|
| Debugging | “The tests are failing — figure out why and fix it” |
| Dependency hell | “Set up this Python 3.11 project on Ubuntu 22.04” |
| Git workflows | “Rebase this branch, resolve conflicts, push” |
| Server setup | “Configure nginx with SSL on this fresh VM” |
| Data wrangling | “Extract and transform this CSV into JSON by these rules” |
| Search & replace | “Rename this API across all files without breaking tests” |
| Package management | “Upgrade these dependencies safely” |
| Log analysis | “Find why the service crashed at 3am from these logs” |
| Permissions | “Fix this file-permissions issue blocking the build” |
| Recovery | “This deploy partially rolled out — clean it up” |
Each task has a strict pass/fail check (all tests green, service responds correctly, etc.). No partial credit.
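The per-task verifiers aren’t public in this detail, but the all-or-nothing scoring described above could look like the following sketch — `task_passes`, the `make test` default, and the `health_url` parameter are illustrative assumptions, not the benchmark’s real API:

```python
import subprocess
import urllib.request

def task_passes(test_cmd="make test", health_url=None, timeout=120):
    """Binary end-to-end check in the spirit of Terminal-Bench 2.0 scoring:
    every probe must succeed, or the whole task is a failure."""
    try:
        tests = subprocess.run(test_cmd, shell=True, timeout=timeout,
                               capture_output=True)
    except subprocess.TimeoutExpired:
        return False  # hung tests count as a failure, not partial credit
    if tests.returncode != 0:
        return False  # any red test fails the whole task
    if health_url is not None:
        try:
            with urllib.request.urlopen(health_url, timeout=10) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False  # unreachable service fails the task
    return True
```

Note there is no partial-credit branch anywhere: an agent that fixes nine of ten failing tests scores exactly the same as one that does nothing.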
How labs use Terminal-Bench 2.0
OpenAI cited Terminal-Bench 2.0 in the GPT-5.5 launch as a primary agentic benchmark. The model card lists 82.7% as the headline agentic score.
Anthropic cites Terminal-Bench in Claude model cards and uses derived variants internally for training signal.
Google includes Terminal-Bench 2.0 in Gemini 3.1’s technical report.
Third parties like LLM-Stats, BenchLM, and Artificial Analysis include it in aggregate leaderboards.
Is Terminal-Bench 2.0 the “right” benchmark?
It’s one of the most useful agent benchmarks available in April 2026, but it has limits:
Good for:
- Measuring sustained multi-step tool use
- Measuring adaptation to unexpected output
- Differentiating pure coders from pure agents
Not good for:
- Predicting real-world Cursor or Claude Code quality (SWE-bench is better)
- Measuring creative coding or architecture
- Measuring quality of code written (just whether tasks succeed)
- Measuring user-facing polish
What high Terminal-Bench 2.0 predicts
If you’re choosing a model for:
- Production AI agents that run autonomously → Terminal-Bench 2.0 matters a lot
- Computer use / browser automation → Terminal-Bench 2.0 correlates well
- DevOps and infrastructure automation → Terminal-Bench 2.0 is the single best predictor
- SRE / incident response agents → Terminal-Bench 2.0 is your benchmark
- Background “fix this overnight” agents → Terminal-Bench 2.0 matters more than SWE-bench
If instead you’re choosing a model for:
- IDE autocomplete → SWE-bench and HumanEval matter more
- Deep refactors → SWE-bench Verified is still king
- Pair programming → Real user testing beats any benchmark
Why GPT-5.5’s 82.7% matters
OpenAI explicitly trained GPT-5.5 for agentic computer use. The 82.7% Terminal-Bench 2.0 score is the clearest evidence that this training strategy worked — it’s a 19-point jump over GPT-5.4 on a benchmark designed to resist gaming.
For production agent builders, this is the data point that justified switching defaults to GPT-5.5 within 24 hours of launch.
Last verified: April 24, 2026. Sources: OpenAI GPT-5.5 announcement, Anthropic Opus 4.7 model card, Google Gemini 3.1 technical report, Terminal-Bench maintainers, VentureBeat, MarkTechPost, LLM-Stats.