Long-Context AI: Opus 4.7 vs GPT-5.5 vs Gemini at 1M Tokens


The April 2026 model wave settled the long-context picture for most workloads. Claude Opus 4.7 wins on retrieval accuracy at 256k and 1M. Gemini 3.1 Pro wins on cost and raw window size. GPT-5.5 sits in the middle. Here’s what the May 2026 numbers actually look like and which one to use for which kind of long-context job.

Last verified: May 1, 2026

The benchmark headline

On Graphwalks, the synthetic long-context evaluation that tests retrieval and reasoning over graph structure:

| Test | Winner |
| --- | --- |
| BFS at 256k | Claude Opus 4.7 |
| parents at 256k | Claude Opus 4.7 |
| parents at 1M | Claude Opus 4.7 |
| BFS at 1M | GPT-5.5 (close) |

Anthropic took three of the four published Graphwalks long-context categories at the April 2026 release. GPT-5.5 was second on most. Gemini 3.1 Pro was third on Graphwalks specifically, despite having the largest declared context window (2M tokens).

Why Graphwalks matters

Graphwalks tests two things real long-context work demands:

  1. Retrieval at distance. Can the model correctly pull information from a specific location in a long context window? At 1M tokens, that’s about retrieving from the equivalent of a 2,000-page document.
  2. Reasoning over retrieved context. Once retrieved, can the model reason correctly across multiple retrieved pieces? Graph traversals (BFS, parent lookups) require chaining retrievals.

A model that can retrieve but can't reason over what it retrieved only passes "needle in a haystack" tests; a model that can do both is genuinely usable at long context. Opus 4.7's Graphwalks win demonstrates the latter.
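To make the two tasks concrete: Graphwalks-style grading compares the model's answer against ground truth computed locally over the edge list that was embedded in the prompt. A minimal sketch of that ground truth (illustrative function names; this is not the benchmark's actual harness):

```python
def bfs_at_depth(edges, source, depth):
    """Ground truth for a BFS probe: the set of nodes exactly
    `depth` hops from `source` in a directed edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    frontier, seen = {source}, {source}
    for _ in range(depth):
        nxt = set()
        for u in frontier:
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    nxt.add(v)
        frontier = nxt
    return frontier

def parents(edges, target):
    """Ground truth for the 'parents' task: every node with an
    edge pointing into `target`."""
    return {u for u, v in edges if v == target}

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(bfs_at_depth(edges, "a", 2))  # {'d'}
print(parents(edges, "d"))          # {'b', 'c'}
```

The chained nature of the BFS case is why it stresses reasoning and not just lookup: each hop requires retrieving a fresh set of edges scattered across the context, conditioned on the previous hop's result.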

The full long-context picture

| Dimension | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| Declared max context | 1M | 1M (400k cost-effective) | 2M (1M default) |
| Practical retrieval limit | ~1M | ~500k | ~1M+ |
| Multi-step reasoning at long context | best | second | third |
| Cost per 1M input tokens | highest | medium | lowest |
| Prompt caching support | yes (5min/1hr) | yes (prefix) | yes |
| Multimodal long context | text-strong | text-strong | best (video, image, mixed) |

Reading the table: Opus 4.7 wins where retrieval correctness is binding. Gemini 3.1 Pro wins where cost or multimodal length is binding. GPT-5.5 is the all-rounder.
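That reading of the table can be captured as a routing heuristic. A rough sketch (the model-name strings and the 400k threshold are assumptions; tune the rules against your own evals):

```python
def pick_model(tokens, multimodal=False, cost_sensitive=False, agentic=False):
    """Route a long-context job to a model based on which constraint
    is binding. Thresholds and names are illustrative, not official."""
    if multimodal:
        return "gemini-3.1-pro"   # only option for long video/image context
    if cost_sensitive and tokens > 400_000:
        return "gemini-3.1-pro"   # cheapest at length
    if agentic:
        return "gpt-5.5"          # strongest tool-loop harness/ecosystem
    return "opus-4.7"             # default when retrieval correctness is binding

print(pick_model(800_000, cost_sensitive=True))  # gemini-3.1-pro
print(pick_model(300_000))                       # opus-4.7
```

The point of writing it down is that the decision is about which constraint binds, not about a single "best model."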

When to use each

Opus 4.7 — the careful long-context model

Use Opus 4.7 when:

  • You’re doing legal, regulatory, or compliance work over 100k+ tokens of source material and a wrong retrieval is expensive.
  • You’re navigating a large codebase end-to-end (the SWE-bench leadership transfers here).
  • You’re synthesizing research across long technical documents and you need cross-reference correctness.
  • Hallucination rate matters — Opus 4.7 has the lowest measured rate as of May 2026.

The trade-off: highest cost per 1M input tokens of the three.

GPT-5.5 — the agentic long-context model

Use GPT-5.5 when:

  • You’re feeding long context into an agent loop with tool calls — Codex Cloud, OpenAI Operator, computer use.
  • You’re under the 400k-token tier (much cheaper than 1M tier) and you want strong reasoning.
  • You need ChatGPT-native features (custom GPTs, voice mode) on long-context inputs.
  • You want a balance of accuracy, cost, and ecosystem features.

The trade-off: practical retrieval quality drops earlier than Opus 4.7 above ~500k tokens.

Gemini 3.1 Pro — the cost-and-multimodal long-context model

Use Gemini 3.1 Pro when:

  • Your context is huge (1M+ tokens) and cost is binding.
  • You’re working with video, mixed multimodal content, or image-heavy long context.
  • You’re inside Google Workspace and Gemini’s distribution is convenient.
  • You can tolerate slightly weaker multi-step reasoning at long context for a 50%+ cost reduction.

The trade-off: weaker retrieval correctness on Graphwalks-style structured retrieval.

Real-world long-context patterns and which to pick

Pattern 1: High-stakes document Q&A

You have a 200k–1M token corpus and need to answer specific questions correctly.

Opus 4.7. Retrieval correctness is the binding constraint. Cost is secondary.

Pattern 2: Codebase navigation and refactoring

You have a 500k+ token codebase and need to find call sites, trace data flow, refactor across files.

Opus 4.7 for accuracy; Cursor 3 as the IDE that bundles it well. SWE-bench leadership transfers here.

Pattern 3: Long-context agent loop

You have long context plus tool calls plus a multi-step plan.

GPT-5.5 + Codex Cloud. The agentic harness layer matters as much as the model.
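The harness pattern is simple in shape: the long context rides in the system message, and the model alternates between tool calls and a final answer under a hard iteration cap. A toy sketch of that loop (`fake_model` is a scripted stand-in for a real chat-completions call, and the tool names are hypothetical):

```python
def fake_model(messages, tools):
    """Stand-in for a real API call. First turn requests a tool;
    once a tool result is in the transcript, it answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "grep_corpus", "args": {"query": "refund policy"}}}
    return {"final": "The refund window is 30 days."}

TOOLS = {"grep_corpus": lambda query: f"...matched 3 passages for {query!r}..."}

def run_agent(task, corpus, max_steps=8):
    messages = [
        {"role": "system", "content": corpus},  # the long context rides here
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):                  # hard cap on loop iterations
        reply = fake_model(messages, TOOLS)
        if "final" in reply:
            return reply["final"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not converge")

print(run_agent("What is the refund window?", "<500k-token corpus>"))
```

Note that the corpus is resent on every iteration of the loop, which is exactly why prompt caching (below in this article) dominates the economics of agentic long context.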

Pattern 4: Video, mixed multimodal long content

You have hours of video, a stack of images, a long mixed-modality document.

Gemini 3.1 Pro. It is the only one of the three that accepts multi-hour video as context, and the best on mixed multimodal long context.
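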

Pattern 5: One-off huge analysis on a budget

You have a 1M-token document and a tight budget.

Gemini 3.1 Pro. Cheapest at length. For most analytical tasks, retrieval correctness is good enough.

Caching changes the math

All three offer prompt caching that materially reduces effective cost when the same long context is reused:

  • Anthropic prompt caching — 5-minute and 1-hour caches; 90% input cost reduction on cache hits.
  • OpenAI prefix caching — automatic for stable prefixes, similar effective discount.
  • Google Gemini caching — context caching is available for Gemini 3.x.

If your workflow is “same 500k-token corpus, many different questions” — RAG-style with a stable knowledge base or an agent loop with a persistent system prompt — caching cuts effective cost to less than half of headline pricing on all three. At that point, the model choice should be capability-driven, not cost-driven.
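The arithmetic behind that claim is worth seeing once. Assuming a 90% discount on cache hits and ignoring any cache-write premium some vendors charge (prices here are placeholders, not any vendor's actual rate card):

```python
def effective_cost(context_tokens, queries, price_per_mtok, cache_discount=0.90):
    """Effective input cost when one long context is reused across many
    queries with prompt caching. Illustrative: real pricing varies by
    vendor, and some charge a small premium on the initial cache write."""
    first = context_tokens / 1e6 * price_per_mtok  # first call pays full price
    rest = (queries - 1) * context_tokens / 1e6 * price_per_mtok * (1 - cache_discount)
    return first + rest

# 500k-token corpus, 20 questions, $10 per 1M input tokens (assumed price):
no_cache = 20 * 0.5 * 10                    # $100.00 without caching
cached = effective_cost(500_000, 20, 10)    # $5 + 19 * $5 * 0.10 = $14.50
print(no_cache, cached)
```

At 20 reuses the cached cost is roughly 15% of the uncached cost, which is why "less than half of headline pricing" is a conservative statement for corpus-reuse workflows.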

What about Claude Mythos?

Claude Mythos Preview is reportedly stronger than Opus 4.7 on long-running tasks, which implies long-context strength. As of May 1, 2026 it is still gated to Project Glasswing and not available on the public API. Watch Q3 2026 if Anthropic widens access.

Bottom line

For May 2026, the long-context picture:

  • Most accurate at long context: Claude Opus 4.7. Use it when you can’t afford a wrong retrieval.
  • Cheapest at long context: Gemini 3.1 Pro. Use it when cost is binding and the workload tolerates marginally weaker reasoning.
  • Best agentic long context: GPT-5.5. Use it when long context is one input to a tool-using agent.
  • Best multimodal long context: Gemini 3.1 Pro. Alone on this dimension.

Most teams in May 2026 use two of these depending on the task. Opus 4.7 plus Gemini 3.1 Pro is a common pairing — accuracy for serious work, cost for cheap analysis. Opus 4.7 plus GPT-5.5 is the other common pairing — accuracy plus agentic harness.
