What is Terminal-Bench 2.0? The New Agent Benchmark
Terminal-Bench 2.0 is the benchmark AI labs cite when they want to measure agent capability, not just coding ability. It’s why the April 2026 model debates keep coming back to “GPT-5.5 at 82.7%, Opus 4.7 at 69.4%.” Here’s what the benchmark actually tests and why the scores matter.
Last verified: April 24, 2026
The one-sentence definition
Terminal-Bench 2.0 is an evaluation suite that runs an AI agent inside a real shell, gives it a real task (e.g., “debug this failing build,” “set up this repository on a fresh Ubuntu VM”), and scores whether the agent achieves the goal using only terminal commands.
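The benchmark’s internal task format isn’t reproduced here, but the definition above implies a consistent shape: an environment, a natural-language goal, and a binary verifier. A minimal sketch of what such a task record could look like — all field names and the `score` helper are hypothetical, not the benchmark’s actual schema:

```python
# Hypothetical task record illustrating the shape implied by the definition:
# a real environment, a natural-language instruction, and a binary check.
task = {
    "id": "fix-failing-build",
    "image": "ubuntu:22.04",           # fresh container the agent works in
    "instruction": "The build is broken. Make the build command exit 0.",
    "check": "make",                   # verifier command; exit code 0 = pass
    "max_steps": 200,
}

def score(run_result: dict) -> int:
    """All-or-nothing scoring: 1 only if the verifier command passed."""
    return 1 if run_result.get("check_exit_code") == 0 else 0
```

The key design point is the last function: there is no notion of “mostly fixed the build” — the verifier either exits 0 or the task scores 0.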
Why terminal tasks matter as a benchmark
A model can ace SWE-bench by generating a perfect patch in a single response. But real-world agent work looks different:
- Run a command
- Read the output
- Hit something unexpected
- Adapt, try a different command
- Check progress
- Repeat for 20–200 steps
This is what Terminal-Bench measures: sustained multi-step tool use with real feedback loops. A model that’s brilliant at one-shot coding can still fail miserably at running `make test`, noticing a segfault, and debugging it.
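The loop described above can be sketched in a few lines. This is not the benchmark’s actual harness — `agent_loop`, `propose_command`, and `goal_check` are illustrative names — but it captures the observe-act cycle being measured:

```python
import os
import subprocess
import tempfile

def run(cmd):
    """Run a shell command; return (exit_code, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(goal_check, propose_command, max_steps=200):
    """Minimal observe-act loop: propose a command, run it, observe the
    result, repeat until the goal check passes or the step budget runs out."""
    history = []
    for step in range(max_steps):
        if goal_check():
            return True, step
        cmd = propose_command(history)   # in a real harness, an LLM call
        code, output = run(cmd)
        history.append((cmd, code, output))
    return goal_check(), max_steps

# Toy usage: the "goal" is that a marker file exists.
marker = os.path.join(tempfile.mkdtemp(), "done")
ok, steps = agent_loop(
    goal_check=lambda: os.path.exists(marker),
    propose_command=lambda hist: f"touch {marker}",
)
```

The `history` list is what distinguishes agent evaluation from one-shot coding: every command’s output feeds the next decision, for potentially hundreds of steps.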
What changed in 2.0
Terminal-Bench 1.0 launched in 2024 as one of the earliest pure-agent benchmarks. By late 2025, the top models were saturating it — scores in the 90%s, lots of reward hacking, and edge cases that didn’t reflect real work.
Terminal-Bench 2.0 (early 2026) shipped three changes:
- Harder tasks. More open-ended goals, fewer “copy-paste this command” shortcuts.
- Stricter scoring. Partial credit removed; tasks either succeed or fail end-to-end.
- Adversarial setups. Broken Docker images, missing dependencies, conflicting packages — the kind of mess real engineers face.
The result: scores dropped by 15–25 points across the board. GPT-5.4 went from ~85% on Terminal-Bench 1.0 to ~64% on 2.0.
April 2026 leaderboard
| Model | Terminal-Bench 2.0 | Released |
|---|---|---|
| GPT-5.5 | 82.7% | Apr 23, 2026 |
| Claude Mythos Preview | 82.0% | Apr 2026 |
| Claude Opus 4.7 | 69.4% | Apr 16, 2026 |
| Gemini 3.1 Ultra | 66.0% | Apr 2026 |
| GPT-5.4 | ~64% | Mar 2026 |
| Claude Sonnet 4.6 | 62.8% | Feb 2026 |
| Gemini 3.1 Pro | 58.0% | Apr 2026 |
| Claude 3.7 Sonnet | ~45% | 2025 |
Three takeaways:
- Only two models crack 80% — GPT-5.5 and the still-restricted Claude Mythos Preview.
- Opus 4.7’s lead on SWE-bench doesn’t transfer. On Terminal-Bench 2.0, it lags GPT-5.5 by 13 points.
- The generational gap is stark. Claude 3.7 Sonnet scored around 45%; Sonnet 4.6 scores 62.8%. Across model generations, agentic capability is improving faster than raw reasoning is.
What Terminal-Bench 2.0 actually tests
Categories in the 2.0 task set:
| Category | Example task |
|---|---|
| Debugging | “The tests are failing — figure out why and fix it” |
| Dependency hell | “Set up this Python 3.11 project on Ubuntu 22.04” |
| Git workflows | “Rebase this branch, resolve conflicts, push” |
| Server setup | “Configure nginx with SSL on this fresh VM” |
| Data wrangling | “Extract and transform this CSV into JSON by these rules” |
| Search & replace | “Rename this API across all files without breaking tests” |
| Package management | “Upgrade these dependencies safely” |
| Log analysis | “Find why the service crashed at 3am from these logs” |
| Permissions | “Fix this file-permissions issue blocking the build” |
| Recovery | “This deploy partially rolled out — clean it up” |
Each task has a strict pass/fail check (all tests green, service responds correctly, etc.). No partial credit.
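The per-task verifiers aren’t public in this detail, but the all-or-nothing scoring described above could look like the following sketch — `task_passes`, the `make test` default, and the `health_url` parameter are illustrative assumptions, not the benchmark’s real API:

```python
import subprocess
import urllib.request

def task_passes(test_cmd="make test", health_url=None, timeout=120):
    """Binary end-to-end check in the spirit of Terminal-Bench 2.0 scoring:
    every probe must succeed, or the whole task is a failure."""
    try:
        tests = subprocess.run(test_cmd, shell=True, timeout=timeout,
                               capture_output=True)
    except subprocess.TimeoutExpired:
        return False  # hung tests count as a failure, not partial credit
    if tests.returncode != 0:
        return False  # any red test fails the whole task
    if health_url is not None:
        try:
            with urllib.request.urlopen(health_url, timeout=10) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False  # unreachable service fails the task
    return True
```

Note there is no partial-credit branch anywhere: an agent that fixes nine of ten failing tests scores exactly the same as one that does nothing.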
How labs use Terminal-Bench 2.0
OpenAI cited Terminal-Bench 2.0 in the GPT-5.5 launch as a primary agentic benchmark. The model card lists 82.7% as the headline agentic score.
Anthropic cites Terminal-Bench in Claude model cards and uses derived variants internally for training signal.
Google includes Terminal-Bench 2.0 in Gemini 3.1’s technical report.
Third parties like LLM-Stats, BenchLM, and Artificial Analysis include it in aggregate leaderboards.
Is Terminal-Bench 2.0 the “right” benchmark?
It’s one of the most useful agent benchmarks available in April 2026, but it has limits:
Good for:
- Measuring sustained multi-step tool use
- Measuring adaptation to unexpected output
- Differentiating pure coders from pure agents
Not good for:
- Predicting real-world Cursor or Claude Code quality (SWE-bench is better)
- Measuring creative coding or architecture
- Measuring quality of code written (just whether tasks succeed)
- Measuring user-facing polish
What high Terminal-Bench 2.0 predicts
If you’re choosing a model for:
- Production AI agents that run autonomously → Terminal-Bench 2.0 matters a lot
- Computer use / browser automation → Terminal-Bench 2.0 correlates well
- DevOps and infrastructure automation → Terminal-Bench 2.0 is the single best predictor
- SRE / incident response agents → Terminal-Bench 2.0 is your benchmark
- Background “fix this overnight” agents → Terminal-Bench 2.0 matters more than SWE-bench
If instead you’re choosing a model for:
- IDE autocomplete → SWE-bench and HumanEval matter more
- Deep refactors → SWE-bench Verified is still king
- Pair programming → Real user testing beats any benchmark
Why GPT-5.5’s 82.7% matters
OpenAI explicitly trained GPT-5.5 for agentic computer use. The 82.7% Terminal-Bench 2.0 score is the clearest evidence that this training strategy worked — it’s a 19-point jump over GPT-5.4 on a benchmark designed to resist gaming.
For production agent builders, this is the data point that justified switching defaults to GPT-5.5 within 24 hours of launch.
Last verified: April 24, 2026. Sources: OpenAI GPT-5.5 announcement, Anthropic Opus 4.7 model card, Google Gemini 3.1 technical report, Terminal-Bench maintainers, VentureBeat, MarkTechPost, LLM-Stats.