Codex CLI vs Claude Code vs Gemini CLI: Terminal-Bench 2.1 (June 2026)
As of June 9, 2026, the public Terminal-Bench 2.1 leaderboard at tbench.ai puts OpenAI’s Codex CLI + GPT-5.5 at #1 with 83.4%, Anthropic’s Claude Code + Opus 4.8 at #2 with 78.9%, and Google’s Gemini CLI + Gemini 3.1 Pro at #6 with 70.7%. On SWE-Bench Pro the order flips: Claude Opus 4.8 leads there at 69.2%. This page maps which CLI to pick for which kind of work.
Last verified: June 14, 2026
TL;DR
- Terminal-Bench 2.1 leader: Codex CLI + GPT-5.5 at 83.4% (#1 overall).
- SWE-Bench Pro leader: Claude Opus 4.8 at 69.2% (best on issue-fixing).
- Best free option: Gemini CLI (1,000 free requests/day) or open-source Aider / OpenCode.
- Best paid baseline: GitHub Copilot Pro at $10/mo for inline completions; Codex CLI for top-end terminal performance.
- For Claude-primary stacks: Claude Code + Opus 4.8 remains the practical leader despite its #2 Terminal-Bench ranking.
The Terminal-Bench 2.1 leaderboard (June 9, 2026)
Terminal-Bench measures an agent driving a real terminal to complete development tasks: editing files, running commands, fixing failures. The scores below are selected rows from the public tbench.ai leaderboard, so some ranks are omitted.
| Rank | Agent + Model | Score |
|---|---|---|
| #1 | Codex CLI + GPT-5.5 | 83.4% |
| #2 | Claude Code + Opus 4.8 | 78.9% |
| #3 | Terminus 2 + GPT-5.5 | 78.2% |
| #5 | Terminus 2 + Gemini 3 Pro | 74.4% |
| #6 | Gemini CLI + Gemini 3.1 Pro | 70.7% |
| #8 | Claude Code + Opus 4.7 | 69.7% |
Two takeaways:
- The same model scores differently inside different agents. GPT-5.5 scores 83.4% inside Codex CLI and 78.2% inside Terminus 2, a 5.2-point gap that comes from the agent harness rather than the model.
- The leaderboard ranks the pairing. When you “pick an AI coding agent,” you’re picking both the CLI and the default model. Most users don’t realize this.
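To make the first takeaway concrete, here is a minimal sketch that computes the score spread when one side of the pairing is held fixed. The scores are copied from the table above; the function names and data layout are illustrative assumptions, not part of any leaderboard tooling.

```python
from typing import Optional

# Terminal-Bench 2.1 scores copied from the table above: (agent, model) -> percent.
SCORES = {
    ("Codex CLI", "GPT-5.5"): 83.4,
    ("Claude Code", "Opus 4.8"): 78.9,
    ("Terminus 2", "GPT-5.5"): 78.2,
    ("Terminus 2", "Gemini 3 Pro"): 74.4,
    ("Gemini CLI", "Gemini 3.1 Pro"): 70.7,
    ("Claude Code", "Opus 4.7"): 69.7,
}


def _spread(values: list) -> Optional[float]:
    """Best minus worst score in points, or None if there is nothing to compare."""
    if len(values) < 2:
        return None
    return round(max(values) - min(values), 1)


def harness_gap(model: str) -> Optional[float]:
    """Spread across agents for a single model (same model, different harness)."""
    return _spread([s for (_, m), s in SCORES.items() if m == model])


def model_gap(agent: str) -> Optional[float]:
    """Spread across models for a single agent (same harness, different model)."""
    return _spread([s for (a, _), s in SCORES.items() if a == agent])


print(harness_gap("GPT-5.5"))         # 5.2  (Codex CLI vs Terminus 2)
print(model_gap("Claude Code"))       # 9.2  (Opus 4.8 vs Opus 4.7)
print(harness_gap("Gemini 3.1 Pro"))  # None (only one agent listed)
```

With only six rows, these spreads are illustrative rather than statistically meaningful, but they show why the pairing, and not the model alone, is the unit to compare.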
Why SWE-Bench Pro tells a different story
Terminal-Bench rewards end-to-end terminal driving. SWE-Bench Pro rewards fixing real GitHub issues — the work most engineers actually do.
| Model | SWE-Bench Pro score | Trend vs prior version |
|---|---|---|
| Claude Opus 4.8 | 69.2% | +4.9 pts vs Opus 4.7 |
| GPT-5.5 (Codex) | ~65% (vendor-reported) | — |
| Gemini 3.1 Pro | ~58% (vendor-reported) | — |
| Claude Opus 4.7 | 64.3% | — |
On the self-reported SWE-Bench Verified leaderboard at llm-stats.com, Claude Opus 4.8 sits at 88.6% and Opus 4.7 at 87.6%.
The two leaderboards disagree on the top model — and that’s fine, because they test different things. Codex CLI + GPT-5.5 wins terminal-driven tasks. Claude Code + Opus 4.8 wins issue-fixing tasks. Pick the agent that matches the work you actually do.
Quick agent comparison
| Dimension | Codex CLI | Claude Code | Gemini CLI |
|---|---|---|---|
| Vendor | OpenAI | Anthropic | Google |
| License | Apache 2.0 (open) | Proprietary | Apache 2.0 (open) |
| Implementation | Rust + TypeScript | TypeScript | TypeScript |
| Default model | GPT-5.5 | Claude Opus 4.8 (Fable 5 selectable, US-only) | Gemini 3.1 Pro |
| Context window | 200K (Codex model) | 200K (1M on Sonnet beta) | 1M tokens |
| Sandboxing | Docker sandbox | OS-level permissions | Configurable trusted folders |
| Approval modes | Suggest / Auto-Edit / Full Auto | Per-action prompts | Trusted folders + Yolo mode |
| Free tier / entry pricing | Via ChatGPT Plus / API metering | $20/mo Pro (post-June 22 credits) | 60 req/min, 1,000/day free |
| Terminal-Bench 2.1 | 83.4% (#1) | 78.9% (#2) | 70.7% (#6) |
| SWE-Bench Pro | ~65% (vendor) | 69.2% (best) | ~58% (vendor) |
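The two Gemini CLI free-tier limits in the table (60 requests per minute, 1,000 per day) interact in a way that is easy to miss: the per-minute ceiling is far higher than the daily cap can sustain. The sketch below works through that arithmetic. It assumes every model call counts as one request against both limits, which may not match how the CLI actually meters agentic sessions; the function names are hypothetical.

```python
# Free-tier limits as quoted in the comparison table.
PER_MINUTE_LIMIT = 60
PER_DAY_LIMIT = 1_000


def minutes_until_daily_cap(requests_per_minute: float) -> float:
    """Minutes of continuous use before the daily cap is hit at a given rate."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    # The per-minute limit caps the effective rate.
    effective_rate = min(requests_per_minute, PER_MINUTE_LIMIT)
    return PER_DAY_LIMIT / effective_rate


def sustainable_rate(working_hours: float) -> float:
    """Average requests per minute that lasts a whole working day."""
    if working_hours <= 0:
        raise ValueError("working_hours must be positive")
    return min(PER_DAY_LIMIT / (working_hours * 60), PER_MINUTE_LIMIT)


print(round(minutes_until_daily_cap(60), 1))  # 16.7
print(round(sustainable_rate(8), 2))          # 2.08
```

Under these assumptions, running flat out exhausts the daily allowance in under 17 minutes, while spreading it over an eight-hour day leaves roughly two requests per minute.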
When to pick Codex CLI
Pick Codex CLI if:
- You want the highest current Terminal-Bench score (83.4% with GPT-5.5).
- You want open-source Apache 2.0 code with Docker sandboxing for safety.
- You’re already in the OpenAI / ChatGPT Plus ecosystem.
- You’re doing multi-step terminal tasks — file edits, command runs, failure recovery.
- You want rapid model upgrades — OpenAI ships new Codex models frequently.
Skip Codex CLI if:
- You need 1M-token context (use Gemini CLI or Claude Sonnet 4.5 beta).
- You’re standardized on Anthropic for safety/compliance (Claude Code fits better).
- You want the highest SWE-Bench Pro depth (Claude Code + Opus 4.8 wins there).
When to pick Claude Code
Pick Claude Code if:
- You’re doing deep multi-file refactors or hard GitHub issue fixes (SWE-Bench Pro leader at 69.2%).
- You want Sub-agents and Dynamic Workflows (Opus 4.8 native; Fable 5 in the US only).
- You’re terminal-first and want the best agentic CLI experience for Claude users.
- Your team has standardized on Anthropic for safety/compliance.
Skip Claude Code if:
- You need GPT-5.5 / Gemini 3.1 Pro access (it’s Claude-only).
- Fable 5 access matters and you’re outside the US.
- You’re budget-constrained and the June 22 credit paywall makes Pro economics worse for your usage.
When to pick Gemini CLI
Pick Gemini CLI if:
- You want the best free tier (1,000 requests/day at the Gemini 3.1 Pro tier).
- You need 1M-token context as a default (Gemini 3.1 Pro / 3.5 Flash).
- You’re a Google Cloud customer and want tight integration.
- You’re doing research-heavy work — long-document analysis, code+docs reasoning.
- You want open-source Apache 2.0 transparency.
Skip Gemini CLI if:
- You want the highest benchmark performance (Codex CLI and Claude Code both score higher on Terminal-Bench 2.1).
- You need the deepest agentic CLI UX (Claude Code is more mature).
What about Claude Fable 5?
Fable 5 launched June 9, 2026 (Anthropic’s first publicly accessible Mythos-class model) at $10 input / $50 output per million tokens. It is US-only following a US government restriction on Mythos-class capabilities.
As of June 14, Fable 5 is not yet on the public Terminal-Bench 2.1 leaderboard; the June 9 leaderboard predates broad Fable 5 deployment in coding agents. Early reports suggest Fable 5 will push Claude Code above its current 78.9% on Terminal-Bench and likely above Opus 4.8’s 69.2% on SWE-Bench Pro, but neither has been verified.
Watch for an updated Terminal-Bench 2.2 leaderboard in late June or July 2026.
If you’re in the US, you can already select Fable 5 inside Claude Code and Cursor — see How to switch to Claude Fable 5.
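For budgeting, the Fable 5 list prices quoted above ($10 input / $50 output per million tokens) translate into a per-session cost as sketched below. The token counts in the example are hypothetical, and the sketch ignores any caching discounts, subscription credits, or other billing details the source does not describe.

```python
# List prices quoted in this section, in USD per million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def fable5_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated list-price cost of one session, in USD."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical session: 200K tokens read, 20K tokens generated.
print(fable5_cost_usd(200_000, 20_000))  # 3.0
```

Because output tokens cost five times as much as input tokens, sessions that generate a lot of code are priced quite differently from sessions that mostly read a large repository.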
Free / open-source alternatives
If you don’t want to pay any vendor, three options stand out as of June 2026:
| Tool | License | Stars | Notes |
|---|---|---|---|
| Gemini CLI | Apache 2.0 | — | 1,000 free requests/day with Gemini 3.1 Pro |
| Aider | Apache 2.0 | ~30K | You pay only LLM API costs; model-agnostic; Polyglot benchmark reference |
| OpenCode | MIT | 172,198 | Most-starred open source agent; supports 75+ providers |
OpenCode is the most popular open-source agent by GitHub stars and adopts a model-agnostic posture similar to Aider. Both are strong choices if you want to avoid vendor lock-in.
The decision tree
Question 1: Do you want the highest benchmark score on terminal tasks?
Yes → Codex CLI + GPT-5.5 (83.4% Terminal-Bench 2.1)
No → Continue to Q2.
Question 2: Do you want the highest depth on real GitHub issue fixes?
Yes → Claude Code + Opus 4.8 (69.2% SWE-Bench Pro)
No → Continue to Q3.
Question 3: Are you budget-constrained and want a generous free tier?
Yes → Gemini CLI (1,000 free req/day at Gemini 3.1 Pro tier)
No → Continue to Q4.
Question 4: Do you want model-agnostic open source?
Yes → Aider or OpenCode
No → GitHub Copilot CLI (cheapest paid at $10/mo)
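The four questions above can also be sketched as a small function. This is only an illustration of the ordering (earlier questions win when several apply); the `Priorities` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Priorities:
    """One yes/no answer per question in the decision tree."""
    top_terminal_score: bool = False          # Question 1
    deep_issue_fixing: bool = False           # Question 2
    needs_free_tier: bool = False             # Question 3
    model_agnostic_open_source: bool = False  # Question 4


def recommend(p: Priorities) -> str:
    """Walk the questions in order and return the first matching pick."""
    if p.top_terminal_score:
        return "Codex CLI + GPT-5.5"
    if p.deep_issue_fixing:
        return "Claude Code + Opus 4.8"
    if p.needs_free_tier:
        return "Gemini CLI"
    if p.model_agnostic_open_source:
        return "Aider or OpenCode"
    return "GitHub Copilot CLI"


print(recommend(Priorities(deep_issue_fixing=True, needs_free_tier=True)))
# Claude Code + Opus 4.8
```

The example shows the main consequence of a linear tree: a user who wants both deep issue fixing and a free tier is routed to Claude Code, because Question 2 is asked before Question 3.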
Related reading
- Cursor 4 vs Claude Code vs Claude Fable 5 (June 2026)
- Claude Fable 5 vs GPT-5.5 vs Gemini 3.5 Pro on SWE-Bench
- How to Switch to Claude Fable 5 in Claude Code / Cursor
- What is Terminal-Bench 2.0 Benchmark?
Scores from the public tbench.ai Terminal-Bench 2.1 leaderboard and llm-stats.com SWE-Bench Verified leaderboard, verified June 14, 2026. Watch for Terminal-Bench 2.2 results with Fable 5 in late June or July 2026.