Headroom Review: 60-95% LLM Token Compression (2026)

TL;DR

Headroom is an open-source “context compression layer” that sits between your AI agent and the LLM provider, squeezing tool outputs, RAG chunks, logs, code, and conversation history before they hit the model. It claims 60–95% token reduction with the same answers, and the project’s own benchmarks back it up — including a 92% cut on code search (17.7K → 1.4K tokens) and an SRE incident debugging workload that goes from 65,694 tokens down to 5,118.

Author Tejas Chopra (Senior Engineer at Netflix) built it because he was burning $200/day on tool-heavy agent runs. The repo is gaining momentum fast — currently the #2 monthly trending repository on Trendshift in June 2026 — and the project ships across PyPI (headroom-ai), npm (headroom-ai), Docker (ghcr.io/chopratejas/headroom), and HuggingFace (chopratejas/kompress-base).

Key facts:

Apache 2.0 license, open source at chopratejas/headroom
Three deployment modes — Python/TypeScript library, drop-in HTTP proxy (zero code changes), or MCP server with headroom_compress / headroom_retrieve / headroom_stats tools
One-command agent wrap — headroom wrap claude|codex|cursor|aider|copilot for instant compression on coding agents
Reversible by design (CCR) — originals never deleted; the LLM gets a retrieval tool and can pull full data when it actually needs it
Six compression algorithms — SmartCrusher (JSON), CodeCompressor (AST for Python/JS/Go/Rust/Java/C++), Kompress-base (HuggingFace model for prose), CacheAligner (KV cache hit-rate), Image compression, IntelligentContext (score-based fitting)
Cross-agent shared memory — Claude Code, Codex, and Gemini agents can read the same compressed context store
Provider-agnostic — Anthropic, OpenAI, Bedrock, Gemini; integrates with LangChain, LangGraph, Vercel AI SDK, LiteLLM, Agno, Strands
Accuracy verified on GSM8K, TruthfulQA, SQuAD v2, BFCL — no measurable accuracy loss in benchmark suite

What Headroom Actually Does

Most “context engineering” tools either truncate (fast, risky) or summarize (slow, lossy). Headroom takes a third path: aggressive structural compression that’s fully reversible.

When your agent calls a tool and gets back a 2,000-row JSON blob, Headroom’s ContentRouter detects the content type and picks the right compressor:

SmartCrusher for JSON: keeps all errors, statistical outliers, BM25/embedding-matched items for the current query, plus first/last items. The other 1,950 rows get cached locally, not discarded.
CodeCompressor for source files: uses tree-sitter AST parsing to keep imports, function signatures, types, and structure. Output is guaranteed syntactically valid — not just chopped lines.
Kompress-base for prose: a HuggingFace model Tejas trained specifically on agentic traces (tool call transcripts, log files, RAG outputs).
CacheAligner stabilizes message prefixes so Anthropic and OpenAI KV caches actually hit, instead of being invalidated by the dynamic content earlier in the prompt.

Then the magic part — CCR (Compress-Cache-Retrieve). Headroom injects a headroom_retrieve tool into the agent’s tool list. If the model decides it needs row 1,847 of that JSON, it calls headroom_retrieve("query_id_42", index=1847) and gets the original data back in under 1ms. In practice, Tejas reports the model “almost never retrieves” because the smart compression keeps what matters — but the escape hatch is what makes 90% compression safe instead of reckless.

Three things landed at once:

Coding agents got 10x more tool-heavy in 2026. Claude Code, Codex, Cursor’s agent mode, Copilot CLI — they all chain dozens of tool calls per task. By turn 10, an unoptimized agent is paying for 100K+ tokens per LLM call, most of which is stale tool output the model already used.
Provider compaction is provider-locked. OpenAI shipped native conversation history compaction, Anthropic has prompt caching, but neither helps with cross-provider workflows or non-conversation context (tool outputs, RAG chunks, file content).
The Headroom HN launches caught fire. The original Show HN: Headroom – Reversible context compression in January 2026 hit the front page with a follow-up Show HN: Headroom (OSS): Cuts LLM costs by 85% a week later. Trendshift now ranks it the #2 monthly trending repository for June 2026 with 262+ stars gained this month.

The headline number cited by users on r/codex: “I was spending $200/day, now I’m spending $30/day with the same agent doing the same work.”

Install & First Run (60 Seconds)

The Python install is the canonical path:

# Full install with all extras (proxy, MCP, ML, integrations)
pip install "headroom-ai[all]"

# Or granular
pip install "headroom-ai[proxy]"      # just the proxy mode
pip install "headroom-ai[mcp]"         # MCP server only
pip install "headroom-ai[langchain]"   # LangChain middleware

TypeScript users get parity:

npm install headroom-ai

Then pick a deployment mode.

Mode 1: Wrap a Coding Agent (Easiest)

headroom wrap claude         # Claude Code
headroom wrap codex          # OpenAI Codex CLI
headroom wrap cursor         # Cursor (prints config to paste once)
headroom wrap aider          # Aider
headroom wrap copilot        # Copilot CLI

Each command auto-detects the agent’s API endpoint, starts a local proxy, sets ANTHROPIC_BASE_URL / OPENAI_BASE_URL for the child process, and launches your agent. Token savings start immediately.

Mode 2: Drop-In Proxy (Any Language)

headroom proxy --port 8787

Then point your app at it:

export ANTHROPIC_BASE_URL=http://localhost:8787
# or
export OPENAI_BASE_URL=http://localhost:8787/v1

Zero code changes. The proxy speaks both Anthropic and OpenAI wire formats and transparently compresses every request.

Mode 3: Inline Library

Python:

from headroom import compress, withHeadroom
from anthropic import Anthropic

# Option A: manual compression
compressed = compress(messages, model="claude-sonnet-4-5")
response = client.messages.create(model="claude-sonnet-4-5", messages=compressed)

# Option B: wrap the client
client = withHeadroom(Anthropic())
response = client.messages.create(...)  # auto-compressed

TypeScript:

import { compress, withHeadroom } from 'headroom-ai';
import Anthropic from '@anthropic-ai/sdk';

const client = withHeadroom(new Anthropic());
const response = await client.messages.create({
  model: 'claude-sonnet-4-5',
  messages,  // compressed automatically
});

Mode 4: MCP Server

headroom mcp install

Exposes three tools to any MCP-aware client (Claude Desktop, Cursor, custom agents):

headroom_compress(content, content_type) → returns compressed payload + cache key
headroom_retrieve(cache_key, query?) → returns original or filtered subset
headroom_stats() → returns session-level token savings

See the Savings

headroom stats

Prints a per-session ledger: tokens in, tokens out, dollars saved (using current provider pricing), cache hit rate, and which compressor handled which content type.

Real Benchmarks From the Repo

Headroom ships its own reproducible benchmark suite — python -m headroom.evals suite --tier 1. From the README and project docs:

Workload	Before	After	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%

Accuracy benchmarks on standard suites:

Benchmark	Category	N	Baseline	Headroom	Delta
GSM8K	Math	100	0.870	0.870	±0.000
TruthfulQA	Factual	100	0.530	0.560	+0.030
SQuAD v2	QA	100	—	97%	19% compression
BFCL	Tools	100	—	97%	32% compression

The math/factual benchmarks are essentially noise-level — no measurable accuracy loss. SQuAD and BFCL hit 97% accuracy at 19% and 32% compression respectively, which is the sweet spot for production use (less savings but provably safe).

Community Reactions (HN + Reddit + Dev.to)

The reception has been pragmatically positive with healthy skepticism:

On Hacker News (Show HN: Headroom, January 2026):

“The CCR idea — keep originals cached, inject a retrieval tool — is the right architecture. Truncation is a footgun, summarization is too slow, this gets you both.”

On r/codex (headroom – Compress LLM Input to reduce token usage, 3 days ago):

“Wrapped codex with headroom wrap codex and my Plus quota is lasting 3x longer. Same tasks, same outputs.”

On r/ClaudeCode — the right critique:

“Run A/B with optimizer off/on. If Headroom says 30–40% but cost per accepted diff or cache hit rate doesn’t improve, it may be moving tokens between buckets rather than saving useful work.”

That’s the warning to heed: aggregate token savings ≠ bottom-line cost savings if cache invalidation or rework cancels out the gains. CacheAligner is designed to address this, but A/B test on your real workload. Tejas also posted Stop Feeding “Junk” Tokens to Your LLM on dev.to walking through the CCR architecture in detail.

Honest Limitations

CCR adds memory overhead. The local LRU cache holds your originals. For long sessions or huge RAG payloads, that’s hundreds of MB of RAM. Configurable, but you have to tune it.
AST compression requires tree-sitter (~50MB install). Skipping the [ml] extra avoids it but you lose the CodeCompressor.
Cache invalidation is the real boss fight. CacheAligner helps, but if your prompt prefix changes mid-session, you lose KV cache savings and Headroom’s gains shrink.
Kompress-base is small and English-biased. Trained primarily on English agentic traces — multilingual prompts get worse compression ratios.
Not battle-tested on every edge case. Tejas himself flags this — A/B test before assuming the 85% claim applies to your bill.
Anthropic prompt caching is still better for fixed prefixes. If your workload is “same 50K-token system prompt, varying short user messages,” native prompt caching gives you ~90% savings on cached tokens for free. Headroom shines when context is dynamic — tool outputs, RAG, logs.

How It Compares

Headroom isn’t the only entrant in this space. The README’s own comparison:

Tool	Scope	Deploy	Local	Reversible
Headroom	All context — tools, RAG, logs, files, history	Proxy · library · middleware · MCP	✅	✅
RTK	CLI command outputs	CLI wrapper	✅	❌
lean-ctx	CLI commands, MCP tools, editor rules	CLI wrapper · MCP	✅	❌
Compresr / Token Co.	Text sent to their API	Hosted API call	❌	❌
OpenAI Compaction	Conversation history only	Provider-native	❌	❌

The differentiators are scope (all content types vs. just CLI/conversation), reversibility (CCR vs. lossy), and local-first deployment (everything runs on your machine — your data never leaves your network unless you forward it to a provider).

Headroom actually bundles RTK for shell-output rewriting and can use lean-ctx as the CLI context tool (HEADROOM_CONTEXT_TOOL=lean-ctx), which is a smart “stand on shoulders” move.

Architecture Cheat Sheet

The full pipeline lifecycle is exposed via on_pipeline_event() hooks:

Setup → Pre-Start → Input Received → Input Cached →
Input Routed → Input Compressed → Pre-Send → Response Received

Provider-specific shaping (Anthropic vs. OpenAI vs. Gemini wire formats) lives under headroom/providers/, so the core orchestration stays focused on lifecycle, sequencing, and policy. To plug in your own compressor, write a BaseCompressor subclass, register it for a content type, and your transform sits in the same pipeline as SmartCrusher.

FAQ

Is Headroom safe to use in production? Tejas labels it “not battle-tested on all edge cases yet.” For coding agents and internal tooling, the community reports are positive. For customer-facing production with strict SLAs, do A/B testing on your real workload first — measure not just tokens saved but cost per accepted output (the r/ClaudeCode critique above is the right framing).

Does Headroom work with Anthropic prompt caching? Yes — CacheAligner is specifically designed to stabilize message prefixes so Anthropic and OpenAI KV caches hit instead of getting invalidated by dynamic content. They’re complementary, not competing: prompt caching saves cost on the cached prefix, Headroom saves cost on the dynamic tool outputs that follow.

What’s the latency overhead? Tejas reports 1–5ms per request for the compression pipeline. SmartCrusher is essentially statistical operations on JSON, CodeCompressor is tree-sitter parsing (fast), and Kompress-base is the only “real” ML inference path — it’s a small HuggingFace model and runs on CPU. If you skip [ml] and stick to SmartCrusher + CodeCompressor + CacheAligner, you’re under 5ms in almost all cases.

Can I use Headroom with multiple agents that share context? Yes — that’s the SharedContext and cross-agent memory feature. Run headroom wrap claude and headroom wrap codex on the same machine, and they read from the same compressed memory store with auto-dedup and agent provenance tracking. This is one of the strongest selling points for teams using Claude Code + Codex + Cursor in parallel.

Does it work with LangChain / LangGraph / Vercel AI SDK? Yes to all three: HeadroomChatModel(your_llm) for LangChain, headroomMiddleware() for Vercel AI SDK’s wrapLanguageModel, and LangGraph works via the LangChain integration. There’s also HeadroomAgnoModel for Agno, a dedicated Strands guide, ASGI middleware (CompressionMiddleware) for any FastAPI/Starlette app, and a LiteLLM callback (HeadroomCallback).

Is the Kompress-base model trustworthy? It’s published on HuggingFace (chopratejas/kompress-base) with a model card. Trained on “agentic traces” — meaning tool call transcripts, logs, RAG outputs — not generic web text. Apache 2.0 weights. Skim the model card before using in regulated environments, but it’s transparent.

What about MCP — how do the three exposed tools work? headroom mcp install registers the server. The LLM (Claude Desktop, Cursor, etc.) sees three tools: headroom_compress(content, content_type) to actively compress something, headroom_retrieve(cache_key, query?) to get back the original or a filtered slice, and headroom_stats() for visibility. The retrieval tool is what makes aggressive compression safe — the model can always reach back for the original data.

Is there a hosted version? No — it’s deliberately local-first. The README’s positioning: “Your data stays here.” Compresr and Token Co. take the hosted-API approach (and Tejas explicitly contrasts Headroom against them). If you want hosted, those exist; if you want local + reversible + open-source, Headroom is the only option in the comparison table.

Verdict

Headroom is the most architecturally-honest entrant in the “LLM cost reduction” space because it does three things competitors don’t: compresses every content type (not just one), is reversible (originals stay local, LLM retrieves on demand), and runs entirely on your machine (no third-party API, no data egress).

The 60–95% numbers are real on the workloads they benchmark, but they’re not universal. The honest version: for tool-heavy coding agents and RAG-heavy retrieval workloads, expect 70–90% reduction. For prompt-heavy chat workloads, expect 20–40%. For long-conversation workflows, Anthropic’s prompt caching probably saves more than Headroom on its own (use both).

If you’re running Claude Code, Codex, or Cursor with significant tool usage and your bill is over $50/month, install it tonight with headroom wrap <agent> and check headroom stats after a week. Downside: 50MB disk and 1–5ms per request. Upside on the right workload: paying for one agent instead of five.

The repo is chopratejas/headroom, Apache 2.0, with active maintenance and a healthy Discord. The author is a Netflix senior engineer who built this to solve his own $200/day problem — the best signal you can get that a tool will keep being maintained.