Codex CLI vs Claude Code vs Gemini CLI: Terminal-Bench 2.1 (June 2026)

As of June 9, 2026, the public Terminal-Bench 2.1 leaderboard at tbench.ai puts OpenAI’s Codex CLI + GPT-5.5 at #1 with 83.4%, Anthropic’s Claude Code + Opus 4.8 at #2 with 78.9%, and Google’s Gemini CLI + Gemini 3.1 Pro at #6 with 70.7%. The picture inverts on SWE-Bench Pro, where Claude Opus 4.8 leads at 69.2%. This page maps which CLI to pick for which kind of work.

Last verified: June 14, 2026

TL;DR

  • Terminal-Bench 2.1 leader: Codex CLI + GPT-5.5 at 83.4% (#1 overall).
  • SWE-Bench Pro leader: Claude Opus 4.8 at 69.2% (best on issue-fixing).
  • Best free option: Gemini CLI (1,000 free requests/day) or open-source Aider / OpenCode.
  • Best paid baseline: GitHub Copilot Pro at $10/mo for inline; Codex CLI for top-end terminal performance.
  • For Claude-primary stacks: Claude Code + Opus 4.8 remains the practical leader despite its #2 Terminal-Bench ranking.

The Terminal-Bench 2.1 leaderboard (June 9, 2026)

Terminal-Bench measures an agent driving a real terminal to complete development tasks: editing files, running commands, fixing failures. The scores below are from the public tbench.ai leaderboard.

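| Rank | Agent + Model | Score |
|------|---------------|-------|
| #1 | Codex CLI + GPT-5.5 | 83.4% |
| #2 | Claude Code + Opus 4.8 | 78.9% |
| #3 | Terminus 2 + GPT-5.5 | 78.2% |
| #5 | Terminus 2 + Gemini 3 Pro | 74.4% |
| #6 | Gemini CLI + Gemini 3.1 Pro | 70.7% |
| #8 | Claude Code + Opus 4.7 | 69.7% |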

Two takeaways:

  1. The same model scores differently inside different agents. GPT-5.5 scores 83.4% inside Codex CLI but 78.2% inside Terminus 2, a 5.2-point gap that comes from the agent harness alone. The agent harness matters as much as the model.
  2. The leaderboard ranks the pairing. When you “pick an AI coding agent,” you’re picking both the CLI and its default model, which is easy to overlook.
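
To make the first takeaway concrete, the spread caused by the harness and the spread caused by the model can be computed from the leaderboard rows quoted above. This is only a sketch: `ROWS` and `spread` are illustrative names, not part of any Terminal-Bench tooling, and the scores are hard-coded from the table rather than fetched from tbench.ai.

```python
# Scores copied by hand from the June 9, 2026 leaderboard rows quoted above.
# (agent, model, score in percent)
ROWS = [
    ("Codex CLI", "GPT-5.5", 83.4),
    ("Claude Code", "Opus 4.8", 78.9),
    ("Terminus 2", "GPT-5.5", 78.2),
    ("Terminus 2", "Gemini 3 Pro", 74.4),
    ("Gemini CLI", "Gemini 3.1 Pro", 70.7),
    ("Claude Code", "Opus 4.7", 69.7),
]


def spread(rows, *, agent=None, model=None):
    """Return max - min score (in points) over rows matching the filters.

    Fix `model` to see how much the harness moves the score;
    fix `agent` to see how much the model moves it.
    """
    scores = [
        score
        for row_agent, row_model, score in rows
        if (agent is None or row_agent == agent)
        and (model is None or row_model == model)
    ]
    if len(scores) < 2:
        raise ValueError("need at least two matching rows to compute a spread")
    return round(max(scores) - min(scores), 1)


if __name__ == "__main__":
    print("GPT-5.5 across harnesses:", spread(ROWS, model="GPT-5.5"))
    print("Claude Code across models:", spread(ROWS, agent="Claude Code"))
    print("Terminus 2 across models:", spread(ROWS, agent="Terminus 2"))
```

With these six rows, changing the harness around GPT-5.5 moves the score by 5.2 points, while changing the model inside a fixed harness moves it by 3.8 points (Terminus 2) or 9.2 points (Claude Code). The two effects are of comparable size, which is the point of the takeaway.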

Why SWE-Bench Pro tells a different story

Terminal-Bench rewards end-to-end terminal driving. SWE-Bench Pro rewards fixing real GitHub issues — the work most engineers actually do.

| Model | SWE-Bench Pro score | Trend vs prior version |
|-------|---------------------|------------------------|
| Claude Opus 4.8 | 69.2% | +4.9 pts vs Opus 4.7 |
| GPT-5.5 (Codex) | ~65% (vendor-reported) | |
| Gemini 3.1 Pro | ~58% (vendor-reported) | |
| Claude Opus 4.7 | 64.3% | |

On the self-reported SWE-Bench Verified leaderboard at llm-stats.com, Claude Opus 4.8 sits at 88.6% and Opus 4.7 at 87.6%.

The two leaderboards disagree on the top model — and that’s fine, because they test different things. Codex CLI + GPT-5.5 wins terminal-driven tasks. Claude Code + Opus 4.8 wins issue-fixing tasks. Pick the agent that matches the work you actually do.

Quick agent comparison

| Dimension | Codex CLI | Claude Code | Gemini CLI |
|-----------|-----------|-------------|------------|
| Vendor | OpenAI | Anthropic | Google |
| License | Apache 2.0 (open) | Proprietary | Apache 2.0 (open) |
| Implementation | Rust + TypeScript | TypeScript | TypeScript |
| Default model | GPT-5.5 | Claude Opus 4.8 (Fable 5 selectable, US-only) | Gemini 3.1 Pro |
| Context window | 200K (Codex model) | 200K (1M on Sonnet beta) | 1M tokens |
| Sandboxing | Docker sandbox | OS-level permissions | Configurable trusted folders |
| Approval modes | Suggest / Auto-Edit / Full Auto | Per-action prompts | Trusted folders + Yolo mode |
| Free tier | Via ChatGPT Plus / API metering | $20/mo Pro (post-June 22 credits) | 60 req/min, 1,000/day free |
| Terminal-Bench 2.1 | 83.4% (#1) | 78.9% (#2) | 70.7% (#6) |
| SWE-Bench Pro | ~65% (vendor) | 69.2% (best) | ~58% (vendor) |

When to pick Codex CLI

Pick Codex CLI if:

  • You want the highest current Terminal-Bench score (83.4% with GPT-5.5).
  • You want open-source Apache 2.0 code with Docker sandboxing for safety.
  • You’re already in the OpenAI / ChatGPT Plus ecosystem.
  • You’re doing multi-step terminal tasks — file edits, command runs, failure recovery.
  • You want rapid model upgrades — OpenAI ships new Codex models frequently.

Skip Codex CLI if:

  • You need 1M-token context (use Gemini CLI or Claude Sonnet 4.5 beta).
  • You’re standardized on Anthropic for safety/compliance (Claude Code fits better).
  • You want the highest SWE-Bench Pro depth (Claude Code + Opus 4.8 wins there).

When to pick Claude Code

Pick Claude Code if:

  • You’re doing deep multi-file refactors or hard GitHub issue fixes (SWE-Bench Pro leader at 69.2%).
  • You want Sub-agents and Dynamic Workflows (native on Opus 4.8; on Fable 5 in the US only).
  • You’re terminal-first and want the best agentic CLI experience for Claude users.
  • Your team has standardized on Anthropic as its primary vendor for safety/compliance.

Skip Claude Code if:

  • You need GPT-5.5 / Gemini 3.1 Pro access (it’s Claude-only).
  • Fable 5 access matters and you’re outside the US.
  • You’re budget-constrained and the June 22 credit paywall makes Pro economics worse for your usage.

When to pick Gemini CLI

Pick Gemini CLI if:

  • You want the best free tier (1,000 requests/day at the Gemini 3.1 Pro tier).
  • You need 1M-token context as a default (Gemini 3.1 Pro / 3.5 Flash).
  • You’re a Google Cloud customer and want tight integration.
  • You’re doing research-heavy work — long-document analysis, code+docs reasoning.
  • You want open-source Apache 2.0 transparency.

Skip Gemini CLI if:

  • You want the highest benchmark performance (Codex CLI and Claude Code both score higher on Terminal-Bench 2.1 and SWE-Bench Pro).
  • You need the deepest agentic CLI UX (Claude Code is more mature).
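
The free-tier numbers quoted on this page (60 requests/minute, 1,000 requests/day) are easier to plan around with a little arithmetic. The sketch below assumes both limits behave as simple caps, which Google's actual quota accounting may not match exactly; the function names are illustrative only.

```python
REQUESTS_PER_MINUTE = 60   # per-minute cap quoted in the comparison table
REQUESTS_PER_DAY = 1_000   # daily free quota quoted in the comparison table


def minutes_until_exhausted(per_minute=REQUESTS_PER_MINUTE, per_day=REQUESTS_PER_DAY):
    """Minutes of sustained max-rate use before the daily quota runs out."""
    return per_day / per_minute


def seconds_between_requests(working_hours, per_day=REQUESTS_PER_DAY):
    """Average spacing (seconds) that spreads the daily quota over a working day."""
    if working_hours <= 0:
        raise ValueError("working_hours must be positive")
    return working_hours * 3600 / per_day


if __name__ == "__main__":
    print(f"Quota gone after {minutes_until_exhausted():.1f} min at full rate")
    print(f"One request every {seconds_between_requests(8):.1f} s lasts an 8-hour day")
```

Under these assumptions, an agent loop running at the full per-minute rate would use up the daily quota in under 17 minutes, while pacing to roughly one request every 29 seconds stretches it across an 8-hour day. An agentic session can issue many requests per task, so the daily cap is likely the limit that matters in practice.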

What about Claude Fable 5?

Fable 5 launched June 9, 2026 (Anthropic’s first publicly accessible Mythos-class model) at $10 input / $50 output per million tokens. It is US-only following a US government restriction on Mythos-class capabilities.
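
To get a feel for what those prices mean per task, here is a rough cost estimator. It assumes flat per-token billing at the quoted rates with no caching discounts, batch pricing, or long-context surcharges (none of which this page documents), and the token counts in the example are hypothetical.

```python
from decimal import Decimal

# List prices quoted above, in USD per million tokens.
INPUT_USD_PER_MTOK = Decimal("10")
OUTPUT_USD_PER_MTOK = Decimal("50")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one session under flat per-token pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    cost = (
        Decimal(input_tokens) * INPUT_USD_PER_MTOK
        + Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    ) / TOKENS_PER_MTOK
    return cost.quantize(Decimal("0.01"))  # round to whole cents


if __name__ == "__main__":
    # Hypothetical agent session: 200K tokens read, 20K tokens written.
    print(estimate_cost_usd(200_000, 20_000))
```

For the hypothetical session above, the estimate is $2.00 for input plus $1.00 for output, or $3.00 in total. Output tokens cost five times as much as input tokens, so verbose agent transcripts raise the bill faster than large context reads.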

As of June 14, Fable 5 is not yet on the public Terminal-Bench 2.1 leaderboard, because the June 9 leaderboard predates broad Fable 5 deployment in coding agents. Early reports suggest Fable 5 will lift Claude Code above its current 78.9% on Terminal-Bench and likely above Opus 4.8’s 69.2% on SWE-Bench Pro, but neither claim has been verified.

Watch for an updated Terminal-Bench 2.2 leaderboard in late June or July 2026.

If you’re in the US, you can already select Fable 5 inside Claude Code and Cursor — see How to switch to Claude Fable 5.

Free / open-source alternatives

If you don’t want to pay for the agent itself, three options stand out as of June 2026:

| Tool | License | Stars | Notes |
|------|---------|-------|-------|
| Gemini CLI | Apache 2.0 | | 1,000 free requests/day with Gemini 3.1 Pro |
| Aider | Apache 2.0 | ~30K | You pay only LLM API costs; model-agnostic; Polyglot benchmark reference |
| OpenCode | MIT | 172,198 | Most-starred open source agent; supports 75+ providers |

OpenCode is the most popular open-source agent by GitHub stars and adopts a model-agnostic posture similar to Aider. Both are strong choices if you want to avoid vendor lock-in.

The decision tree

Question 1: Do you want the highest benchmark score on terminal tasks?
  Yes → Codex CLI + GPT-5.5 (83.4% Terminal-Bench 2.1)
  No  → Continue to Q2.

Question 2: Do you want the highest depth on real GitHub issue fixes?
  Yes → Claude Code + Opus 4.8 (69.2% SWE-Bench Pro)
  No  → Continue to Q3.

Question 3: Are you budget-constrained and want a generous free tier?
  Yes → Gemini CLI (1,000 free req/day at Gemini 3.1 Pro tier)
  No  → Continue to Q4.

Question 4: Do you want model-agnostic open source?
  Yes → Aider or OpenCode
  No  → GitHub Copilot CLI (cheapest paid at $10/mo)
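
The same four questions can be written as a small function, which makes the priority order explicit: an earlier "yes" always wins over a later one. This is a sketch of the tree above, not a tool shipped by any vendor; the `Priorities` fields and `pick_agent` name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Priorities:
    """Answers to Questions 1-4, in order. All default to 'No'."""
    top_terminal_score: bool = False        # Q1
    top_issue_fixing: bool = False          # Q2
    needs_free_tier: bool = False           # Q3
    wants_model_agnostic_oss: bool = False  # Q4


def pick_agent(p: Priorities) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if p.top_terminal_score:
        return "Codex CLI + GPT-5.5"
    if p.top_issue_fixing:
        return "Claude Code + Opus 4.8"
    if p.needs_free_tier:
        return "Gemini CLI"
    if p.wants_model_agnostic_oss:
        return "Aider or OpenCode"
    return "GitHub Copilot CLI"


if __name__ == "__main__":
    print(pick_agent(Priorities(top_issue_fixing=True)))
    print(pick_agent(Priorities(needs_free_tier=True, wants_model_agnostic_oss=True)))
```

Because the checks run in order, someone who answers yes to both the free-tier and open-source questions lands on Gemini CLI. If your priorities are ordered differently, reorder the checks to match.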

Scores are from the public tbench.ai Terminal-Bench 2.1 leaderboard and the llm-stats.com SWE-Bench Verified leaderboard, verified June 14, 2026. Watch for Terminal-Bench 2.2 results with Fable 5 in late June or July 2026.