Which AI coding CLI tops Terminal-Bench 2.1?

OpenAI's Codex CLI paired with GPT-5.5 leads at 83.4% as of June 9, 2026 — the highest score on the public tbench.ai leaderboard. Anthropic's Claude Code paired with Claude Opus 4.8 is second at 78.9%, ahead of Opus 4.7's 69.7%. Google's Gemini CLI paired with Gemini 3.1 Pro is at 70.7%. The leaderboard measures the agent and model together because the same model scores differently inside different agents.

Does Terminal-Bench 2.1 leadership mean Codex CLI is the best coding agent overall?

Not necessarily. Terminal-Bench rewards driving a terminal end-to-end: editing files, running commands, fixing failures. SWE-Bench Pro rewards fixing real GitHub issues. On SWE-Bench Pro, Claude Opus 4.8 leads at 69.2% (up from Opus 4.7's 64.3%), ahead of GPT-5.5 and Gemini 3.1 Pro. So Codex CLI + GPT-5.5 wins terminal-driven tasks, while Claude Code + Opus 4.8 wins issue-fixing tasks. Pick the agent that matches the work you actually do.

Which CLI should I install if I want a free, model-agnostic option?

Three good options as of June 2026: (1) Gemini CLI: 60 requests per minute and 1,000 per day free with Gemini 3.1 Pro / 3.5 Flash; the best free option for occasional use. (2) Aider: open source and model-agnostic across all major providers; you pay only LLM API costs. (3) OpenCode: MIT-licensed, 172K+ GitHub stars, supports 75+ providers; the most-starred open-source agent. For a paid baseline, GitHub Copilot Pro at $10/mo is still the cheapest, while Codex CLI is the highest-scoring paid agent on Terminal-Bench 2.1.

Quick Answer

Codex CLI vs Claude Code vs Gemini CLI: Terminal-Bench 2.1 (June 2026)

What about Claude Fable 5? Is it on the leaderboard?

Not yet, as of June 14, 2026. The Terminal-Bench 2.1 results referenced here are from the June 9 leaderboard, which predates broad Fable 5 deployment in coding agents. Early reports suggest Fable 5 will push Claude Code's Terminal-Bench score above its current 78.9% and will likely lift its SWE-Bench Pro score above Opus 4.8's 69.2%, but neither has been verified on a public leaderboard yet. Watch for an updated Terminal-Bench 2.2 leaderboard in late June or July 2026.

Published: June 14, 2026

As of June 9, 2026, the public Terminal-Bench 2.1 leaderboard at tbench.ai puts OpenAI’s Codex CLI + GPT-5.5 at #1 with 83.4%, Anthropic’s Claude Code + Opus 4.8 at #2 with 78.9%, and Google’s Gemini CLI + Gemini 3.1 Pro at 70.7%. The picture on SWE-Bench Pro inverts — Claude Opus 4.8 leads there at 69.2%. This page maps which CLI to pick for which kind of work.

Last verified: June 14, 2026

TL;DR

Terminal-Bench 2.1 leader: Codex CLI + GPT-5.5 at 83.4% (#1 overall).
SWE-Bench Pro leader: Claude Opus 4.8 at 69.2% (best on issue-fixing).
Best free option: Gemini CLI (1,000 free requests/day) or open-source Aider / OpenCode.
Best paid baseline: GitHub Copilot Pro at $10/mo for inline completions; Codex CLI for top-end terminal performance.
For Claude-primary stacks: Claude Code + Opus 4.8 remains the practical leader despite its #2 Terminal-Bench ranking.

The Terminal-Bench 2.1 leaderboard (June 9, 2026)

Terminal-Bench measures an agent driving a real terminal to complete development tasks: editing files, running commands, fixing failures. The rows below are selected from the public tbench.ai leaderboard, so some ranks are skipped.

Rank	Agent + Model	Score
#1	Codex CLI + GPT-5.5	83.4%
#2	Claude Code + Opus 4.8	78.9%
#3	Terminus 2 + GPT-5.5	78.2%
#5	Terminus 2 + Gemini 3 Pro	74.4%
#6	Gemini CLI + Gemini 3.1 Pro	70.7%
#8	Claude Code + Opus 4.7	69.7%

Two takeaways:

The same model scores differently inside different agents. GPT-5.5 inside Codex CLI scores 83.4%; inside Terminus 2 it scores 78.2%. That 5.2-point gap comes from the agent harness alone, so the harness matters alongside the model.
The leaderboard ranks the pairing. When you “pick an AI coding agent,” you’re picking both the CLI and the default model. Most users don’t realize this.
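
To make the harness effect concrete, here is a minimal Python sketch that groups the table rows above by model and reports the gap between the best and worst agent for any model listed under more than one harness. The function name `harness_spread` and the row layout are illustrative assumptions; the scores are copied from the table.

```python
from collections import defaultdict

# Rows copied from the table above: (rank, agent, model, score in %).
LEADERBOARD = [
    (1, "Codex CLI", "GPT-5.5", 83.4),
    (2, "Claude Code", "Opus 4.8", 78.9),
    (3, "Terminus 2", "GPT-5.5", 78.2),
    (5, "Terminus 2", "Gemini 3 Pro", 74.4),
    (6, "Gemini CLI", "Gemini 3.1 Pro", 70.7),
    (8, "Claude Code", "Opus 4.7", 69.7),
]


def harness_spread(rows):
    """Return {model: (best_agent, worst_agent, gap_in_points)}.

    Only models that appear under at least two agents are included.
    """
    by_model = defaultdict(list)
    for _rank, agent, model, score in rows:
        by_model[model].append((score, agent))

    spread = {}
    for model, entries in by_model.items():
        if len(entries) < 2:
            continue  # a single harness gives nothing to compare
        best_score, best_agent = max(entries)
        worst_score, worst_agent = min(entries)
        # Round to one decimal to hide floating-point noise.
        spread[model] = (best_agent, worst_agent, round(best_score - worst_score, 1))
    return spread


if __name__ == "__main__":
    for model, (best, worst, gap) in harness_spread(LEADERBOARD).items():
        print(f"{model}: {best} beats {worst} by {gap} points")
```

Among the rows listed here, only GPT-5.5 appears under two agents, so it is the only model the sketch reports.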

Why SWE-Bench Pro tells a different story

Terminal-Bench rewards end-to-end terminal driving. SWE-Bench Pro rewards fixing real GitHub issues — the work most engineers actually do.

Model	SWE-Bench Pro score	Trend vs prior version
Claude Opus 4.8	69.2%	+4.9 pts vs Opus 4.7
GPT-5.5 (Codex)	~65% area (vendor-reported)	—
Gemini 3.1 Pro	~58% area (vendor-reported)	—
Claude Opus 4.7	64.3%	—

On the separate, self-reported SWE-Bench Verified leaderboard at llm-stats.com, Claude Opus 4.8 sits at 88.6% and Opus 4.7 at 87.6%.

Terminal-Bench 2.1 and SWE-Bench Pro disagree on the top model, and that’s fine, because they test different things. Codex CLI + GPT-5.5 wins terminal-driven tasks. Claude Code + Opus 4.8 wins issue-fixing tasks. Pick the agent that matches the work you actually do.

Quick agent comparison

Dimension	Codex CLI	Claude Code	Gemini CLI
Vendor	OpenAI	Anthropic	Google
License	Apache 2.0 (open)	Proprietary	Apache 2.0 (open)
Implementation	Rust + TypeScript	TypeScript	TypeScript
Default model	GPT-5.5	Claude Opus 4.8 (Fable 5 selectable, US-only)	Gemini 3.1 Pro
Context window	200K (Codex model)	200K (1M on Sonnet beta)	1M tokens
Sandboxing	Docker sandbox	OS-level permissions	Configurable trusted folders
Approval modes	Suggest / Auto-Edit / Full Auto	Per-action prompts	Trusted folders + Yolo mode
Free tier / entry plan	Via ChatGPT Plus / API metering	$20/mo Pro (post-June 22 credits)	60 req/min, 1,000/day free
Terminal-Bench 2.1	83.4% (#1)	78.9% (#2)	70.7% (#6)
SWE-Bench Pro	~65% (vendor)	69.2% (best)	~58% (vendor)

When to pick Codex CLI

Pick Codex CLI if:

You want the highest current Terminal-Bench score (83.4% with GPT-5.5).
You want open-source Apache 2.0 code with Docker sandboxing for safety.
You’re already in the OpenAI / ChatGPT Plus ecosystem.
You’re doing multi-step terminal tasks — file edits, command runs, failure recovery.
You want rapid model upgrades — OpenAI ships new Codex models frequently.

Skip Codex CLI if:

You need 1M-token context (use Gemini CLI or Claude Sonnet 4.5 beta).
You’re standardized on Anthropic for safety/compliance (Claude Code fits better).
You want the highest SWE-Bench Pro depth (Claude Code + Opus 4.8 wins there).

When to pick Claude Code

Pick Claude Code if:

You’re doing deep multi-file refactors or hard GitHub issue fixes (SWE-Bench Pro leader at 69.2%).
You want Sub-agents and Dynamic Workflows (Opus 4.8 native; Fable 5 in selected regions).
You’re terminal-first and want the best agentic CLI experience for Claude users.
Your team has standardized on Anthropic for safety/compliance.

Skip Claude Code if:

You need GPT-5.5 / Gemini 3.1 Pro access (it’s Claude-only).
Fable 5 access matters and you’re outside the US.
You’re budget-constrained and the June 22 credit paywall makes Pro economics worse for your usage.

When to pick Gemini CLI

Pick Gemini CLI if:

You want the best free tier (1,000 requests/day at the Gemini 3.1 Pro tier).
You need 1M-token context as a default (Gemini 3.1 Pro / 3.5 Flash).
You’re a Google Cloud customer and want tight integration.
You’re doing research-heavy work — long-document analysis, code+docs reasoning.
You want open-source Apache 2.0 transparency.

Skip Gemini CLI if:

You want the highest benchmark performance (Codex CLI and Claude Code both score higher on Terminal-Bench 2.1).
You need the deepest agentic CLI UX (Claude Code is more mature).
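
If you script against the free tier, it can help to track the quoted limits (60 requests per minute, 1,000 per day) on the client side before sending anything. The sketch below is a hypothetical helper, not part of Gemini CLI or any Google SDK. It assumes rolling 60-second and 24-hour windows, which may not match how the real quota resets.

```python
import time
from collections import deque


class FreeTierBudget:
    """Client-side request budget (hypothetical helper, not a Gemini CLI API).

    Assumes rolling windows: at most `per_minute` requests in any 60 s
    and at most `per_day` requests in any 24 h.
    """

    def __init__(self, per_minute=60, per_day=1000, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock          # injectable so the class is testable
        self._stamps = deque()      # timestamps of accepted requests

    def _prune(self, now):
        # Forget requests that are at least 24 hours old.
        while self._stamps and now - self._stamps[0] >= 86400:
            self._stamps.popleft()

    def try_acquire(self):
        """Record one request and return True if both limits allow it."""
        now = self.clock()
        self._prune(now)
        if len(self._stamps) >= self.per_day:
            return False
        in_last_minute = sum(1 for t in self._stamps if now - t < 60)
        if in_last_minute >= self.per_minute:
            return False
        self._stamps.append(now)
        return True


if __name__ == "__main__":
    budget = FreeTierBudget()
    if budget.try_acquire():
        print("OK to send a request")
    else:
        print("Free-tier budget exhausted; wait before retrying")
```

Call `try_acquire()` before each request and back off when it returns False. The server's own quota accounting remains the source of truth.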

What about Claude Fable 5?

Fable 5 launched June 9, 2026 (Anthropic’s first publicly accessible Mythos-class model) at $10 input / $50 output per million tokens. It is US-only following a US government restriction on Mythos-class capabilities.

As of June 14, Fable 5 is not yet on the public Terminal-Bench 2.1 leaderboard, because the June 9 leaderboard predates broad Fable 5 deployment in coding agents. Early reports suggest Fable 5 will push Claude Code’s Terminal-Bench score above its current 78.9% and will likely lift its SWE-Bench Pro score above Opus 4.8’s 69.2%, but neither has been verified.

Watch for an updated Terminal-Bench 2.2 leaderboard in late June or July 2026.

If you’re in the US, you can already select Fable 5 inside Claude Code and Cursor — see How to switch to Claude Fable 5.

Free / open-source alternatives

If you don’t want to pay any vendor, three options stand out as of June 2026:

Tool	License	Stars	Notes
Gemini CLI	Apache 2.0	—	1,000 free requests/day with Gemini 3.1 Pro
Aider	Apache 2.0	~30K	You pay only LLM API costs; model-agnostic; Polyglot benchmark reference
OpenCode	MIT	172,198	Most-starred open source agent; supports 75+ providers

OpenCode is the most popular open-source agent by GitHub stars and adopts a model-agnostic posture similar to Aider. Both are strong choices if you want to avoid vendor lock-in.

The decision tree

Question 1: Do you want the highest benchmark score on terminal tasks?
  Yes → Codex CLI + GPT-5.5 (83.4% Terminal-Bench 2.1)
  No  → Continue to Q2.

Question 2: Do you want the highest depth on real GitHub issue fixes?
  Yes → Claude Code + Opus 4.8 (69.2% SWE-Bench Pro)
  No  → Continue to Q3.

Question 3: Are you budget-constrained and want a generous free tier?
  Yes → Gemini CLI (1,000 free req/day at Gemini 3.1 Pro tier)
  No  → Continue to Q4.

Question 4: Do you want model-agnostic open source?
  Yes → Aider or OpenCode
  No  → GitHub Copilot CLI (cheapest paid at $10/mo)
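
The same tree can be written as a small Python function, which makes the question order explicit: an earlier "yes" always wins over a later one. The function and parameter names are illustrative assumptions; the recommendations are the ones listed in the tree.

```python
def pick_cli(
    top_terminal_score=False,
    deepest_issue_fixing=False,
    needs_free_tier=False,
    wants_model_agnostic_oss=False,
):
    """Walk the four questions in order and return the first match."""
    if top_terminal_score:           # Question 1
        return "Codex CLI + GPT-5.5"
    if deepest_issue_fixing:         # Question 2
        return "Claude Code + Opus 4.8"
    if needs_free_tier:              # Question 3
        return "Gemini CLI"
    if wants_model_agnostic_oss:     # Question 4
        return "Aider or OpenCode"
    return "GitHub Copilot CLI"      # fallback: cheapest paid option


if __name__ == "__main__":
    # Someone who cares about issue fixing and also wants a free tier
    # is routed by the earlier question.
    print(pick_cli(deepest_issue_fixing=True, needs_free_tier=True))
```

Because the questions are checked in sequence, reordering them changes the answer for anyone who says yes to more than one. Put the criterion you care about most first.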

Scores from the public tbench.ai Terminal-Bench 2.1 leaderboard and llm-stats.com SWE-Bench Verified leaderboard, verified June 14, 2026. Watch for Terminal-Bench 2.2 results with Fable 5 in late June or July 2026.
