
What is Terminal-Bench 2.0? The New Agent Benchmark

What is Terminal-Bench 2.0?

Terminal-Bench 2.0 is the benchmark AI labs cite when they want to measure agent capability, not just coding ability. It’s the reason the April 2026 model debates keep saying “GPT-5.5 at 82.7%, Opus 4.7 at 69.4%.” Here’s what the benchmark actually tests and why the scores matter.

Last verified: April 24, 2026

The one-sentence definition

Terminal-Bench 2.0 is an evaluation suite that runs an AI agent inside a real shell, gives it a real task (e.g., “debug this failing build,” “set up this repository on a fresh Ubuntu VM”), and scores whether the agent achieves the goal using only terminal commands.

Why terminal tasks matter as a benchmark

A model can ace SWE-bench by generating a perfect patch in a single response. But real-world agent work looks different:

  1. Run a command
  2. Read the output
  3. Hit something unexpected
  4. Adapt and try a different command
  5. Check progress
  6. Repeat for 20–200 steps

This is what Terminal-Bench measures: sustained multi-step tool use with real feedback loops. A model that’s brilliant at one-shot coding can still fail miserably at running `make test`, noticing a segfault, and debugging it.
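The loop above can be sketched as a minimal harness. Everything here is illustrative: `next_command` stands in for the model, `goal_check` for the task’s verifier, and neither matches the real Terminal-Bench API.

```python
import subprocess

def run_step(command, timeout=60):
    """Run one shell command; return (exit code, combined stdout+stderr)."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(task_goal, next_command, goal_check, max_steps=200):
    """Observe-act loop: ask the model for a command, run it, feed the
    output back, and repeat until the model stops or the step budget
    runs out. `next_command` and `goal_check` are hypothetical stand-ins
    for the model and the task's pass/fail verifier."""
    history = []
    for _ in range(max_steps):
        command = next_command(task_goal, history)
        if command is None:  # model decides it is done (or gives up)
            break
        code, output = run_step(command)
        history.append((command, code, output))
    # Scored end-to-end: only the verifier's verdict counts, not effort.
    return goal_check()
```

The key design point this illustrates is that the model never sees a curated environment: every observation is raw command output, and the score is decided by an external check after the loop ends.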

What changed in 2.0

Terminal-Bench 1.0 launched in 2024 as one of the earliest pure-agent benchmarks. By late 2025, the top models were saturating it — scores in the 90%s, lots of reward hacking, and edge cases that didn’t reflect real work.

Terminal-Bench 2.0 (early 2026) shipped three changes:

  1. Harder tasks. More open-ended goals, fewer “copy-paste this command” shortcuts.
  2. Stricter scoring. Partial credit removed; tasks either succeed or fail end-to-end.
  3. Adversarial setups. Broken Docker images, missing dependencies, conflicting packages — the kind of mess real engineers face.

The result: scores dropped by 15–25 points across the board. GPT-5.4 went from ~85% on Terminal-Bench 1.0 to ~64% on 2.0.

April 2026 leaderboard

| Model | Terminal-Bench 2.0 | Released |
| --- | --- | --- |
| GPT-5.5 | 82.7% | Apr 23, 2026 |
| Claude Mythos Preview | 82.0% | Apr 2026 |
| Claude Opus 4.7 | 69.4% | Apr 16, 2026 |
| Gemini 3.1 Ultra | 66.0% | Apr 2026 |
| GPT-5.4 | ~64% | Mar 2026 |
| Claude Sonnet 4.6 | 62.8% | Feb 2026 |
| Gemini 3.1 Pro | 58.0% | Apr 2026 |
| Claude 3.7 Sonnet | ~45% | 2025 |

Three takeaways:

  1. Only two models crack 80% — GPT-5.5 and the still-restricted Claude Mythos Preview.
  2. Opus 4.7’s lead on SWE-bench doesn’t transfer. On Terminal-Bench 2.0, it lags GPT-5.5 by 13 points.
  3. The generation-to-generation gap is widening. Claude 3.7 Sonnet scored around 45%; Sonnet 4.6 scores 62.8%, an ~18-point jump in one generation. The agentic gap is growing faster than the raw reasoning gap.

What Terminal-Bench 2.0 actually tests

Categories in the 2.0 task set:

| Category | Example task |
| --- | --- |
| Debugging | “The tests are failing — figure out why and fix it” |
| Dependency hell | “Set up this Python 3.11 project on Ubuntu 22.04” |
| Git workflows | “Rebase this branch, resolve conflicts, push” |
| Server setup | “Configure nginx with SSL on this fresh VM” |
| Data wrangling | “Extract and transform this CSV into JSON by these rules” |
| Search & replace | “Rename this API across all files without breaking tests” |
| Package management | “Upgrade these dependencies safely” |
| Log analysis | “Find why the service crashed at 3am from these logs” |
| Permissions | “Fix this file-permissions issue blocking the build” |
| Recovery | “This deploy partially rolled out — clean it up” |

Each task has a strict pass/fail check (all tests green, service responds correctly, etc.). No partial credit.
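A strict check of this kind can be as simple as requiring expected artifacts to exist and a verification command to exit 0. This is a sketch under those assumptions, not the actual Terminal-Bench verifier; `check_command` and `required_files` are placeholder parameters.

```python
import pathlib
import subprocess

def task_passed(check_command, required_files=(), timeout=300):
    """Binary verdict for a task: pass only if every expected artifact
    exists and the check command exits 0. Each real task ships its own
    verifier (tests green, service responding, repo in the right state);
    this function is a generic stand-in."""
    for path in required_files:
        if not pathlib.Path(path).exists():
            return False  # missing artifact: fail outright, no partial credit
    try:
        result = subprocess.run(
            check_command, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging check also counts as a failure
    return result.returncode == 0
```

Collapsing everything to a single boolean is what makes the 2.0 scoring resistant to reward hacking: an agent that gets 90% of the way there scores the same as one that never started.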

How labs use Terminal-Bench 2.0

OpenAI cited Terminal-Bench 2.0 in the GPT-5.5 launch as a primary agentic benchmark. The model card lists 82.7% as the headline agentic score.

Anthropic cites Terminal-Bench in Claude model cards and uses derived variants internally for training signal.

Google includes Terminal-Bench 2.0 in Gemini 3.1’s technical report.

Third parties like LLM-Stats, BenchLM, and Artificial Analysis include it in aggregate leaderboards.

Is Terminal-Bench 2.0 the “right” benchmark?

It’s one of the most useful agent benchmarks available in April 2026, but it has limits:

Good for:

  • Measuring sustained multi-step tool use
  • Measuring adaptation to unexpected output
  • Differentiating pure coders from pure agents

Not good for:

  • Predicting real-world Cursor or Claude Code quality (SWE-bench is better)
  • Measuring creative coding or architecture
  • Measuring the quality of the code written (it scores only whether tasks succeed)
  • Measuring user-facing polish

What high Terminal-Bench 2.0 predicts

If you’re choosing a model for:

  • Production AI agents that run autonomously → Terminal-Bench 2.0 matters a lot
  • Computer use / browser automation → Terminal-Bench 2.0 correlates well
  • DevOps and infrastructure automation → Terminal-Bench 2.0 is the single best predictor
  • SRE / incident response agents → Terminal-Bench 2.0 is your benchmark
  • Background “fix this overnight” agents → Terminal-Bench 2.0 matters more than SWE-bench

If you’re choosing a model for:

  • IDE autocomplete → SWE-bench and HumanEval matter more
  • Deep refactors → SWE-bench Verified is still king
  • Pair programming → Real user testing beats any benchmark

Why GPT-5.5’s 82.7% matters

OpenAI explicitly trained GPT-5.5 for agentic computer use. The 82.7% Terminal-Bench 2.0 score is the clearest evidence that this training strategy worked — it’s a 19-point jump over GPT-5.4 on a benchmark designed to resist gaming.

For production agent builders, this is the data point that justified switching defaults to GPT-5.5 within 24 hours of launch.


Sources: OpenAI GPT-5.5 announcement, Anthropic Opus 4.7 model card, Google Gemini 3.1 technical report, Terminal-Bench maintainers, VentureBeat, MarkTechPost, LLM-Stats.