What is STATE-Bench? Microsoft's AI Agent Memory Benchmark
What is STATE-Bench? (May 2026)
STATE-Bench is an open-source benchmark Microsoft released on May 19, 2026 that measures what memory actually does for production AI agents. Unlike older memory benchmarks that only measured recall accuracy, STATE-Bench measures whether memory changes downstream task outcomes, which is the question buyers and builders need answered.
Last verified: May 24, 2026.
TL;DR
- What it is: An open-source benchmark for AI agent memory.
- Released by: Microsoft Open Source, May 19, 2026.
- License: OSI-approved (full eval harness on GitHub).
- Measures: Task success rate, token efficiency, and time-to-completion with vs. without memory.
- Why it matters: It’s the first major benchmark to ask “does memory actually help the agent?” rather than “can the model recall a fact?”
The core insight
For two years, the AI memory conversation has been dominated by recall benchmarks:
- LoCoMo (2024): 1,540 questions across single-hop, multi-hop, open-domain, and temporal recall.
- LongMemEval (2025): 500 questions across knowledge updates, multi-session recall, etc.
- BEAM (2026): 1M and 10M token-scale evaluations across multiple categories.
These benchmarks ask: given a long conversation history, can the model retrieve the right fact? That’s a real and useful measurement. But it’s not the question enterprise buyers actually have.
The question buyers have is: “If I add a memory system to my agent, does the agent get measurably better at the work my company pays it to do?”
STATE-Bench is the first major benchmark designed around that second question.
How STATE-Bench works
The Microsoft blog post and GitHub README describe the harness:
- Task suite: Multi-session, multi-turn workflows representative of real agent jobs, such as customer support tickets, multi-day project management, longitudinal research assistance, and code review across PRs.
- A/B evaluation: Run each agent on each task suite twice, once with memory enabled and once with memory disabled (cold context every session).
- Three measured outcomes:
  - Task success rate: did the agent complete the task correctly?
  - Token efficiency: how many tokens did it cost?
  - Time-to-completion: how many turns/steps did it take?
- Delta reporting: STATE-Bench reports the delta between memory-on and memory-off. A memory system that improves task success by 30% looks very different from one that improves it by 3%.
- Reference implementations: Microsoft published reference plugins for mem0, LangGraph Memory, Anthropic Dreaming, and OpenAI Memory.
The harness can wrap any model and any memory layer, which positions it to become the de facto industry benchmark.
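The A/B evaluation and delta reporting described above can be sketched in a few lines of Python. This is a minimal illustration, not the STATE-Bench harness itself: `RunResult`, `ab_deltas`, the agent signature, and the toy agent are all hypothetical names invented for the example, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class RunResult:
    success: bool  # did the agent complete the task correctly?
    tokens: int    # total tokens consumed on the task
    turns: int     # turns/steps taken to finish


# Hypothetical agent signature: (task, memory or None) -> RunResult
Agent = Callable[[str, Optional[dict]], RunResult]


def summarize(results: list) -> dict:
    """Average the three measured outcomes over a list of runs."""
    return {
        "success_rate": mean(1.0 if r.success else 0.0 for r in results),
        "tokens": mean(r.tokens for r in results),
        "turns": mean(r.turns for r in results),
    }


def ab_deltas(agent: Agent, tasks: list, memory: dict) -> dict:
    """Run every task with memory off and on, then report the deltas."""
    off = summarize([agent(t, None) for t in tasks])
    on = summarize([agent(t, memory) for t in tasks])
    return {
        # Success is reported in percentage points, cost and time in percent.
        "success_delta_pts": round(100 * (on["success_rate"] - off["success_rate"]), 1),
        "token_delta_pct": round(100 * (on["tokens"] / off["tokens"] - 1), 1),
        "turn_delta_pct": round(100 * (on["turns"] / off["turns"] - 1), 1),
    }


def toy_agent(task: str, memory: Optional[dict]) -> RunResult:
    """Toy stand-in: remembered tasks succeed in fewer turns but cost more tokens."""
    if memory is not None and task in memory:
        return RunResult(success=True, tokens=1100, turns=6)
    return RunResult(success=(task == "easy"), tokens=1000, turns=8)


tasks = ["easy", "hard-1", "hard-2", "hard-3"]
memory = {"hard-1": "prior context", "hard-2": "prior context"}
deltas = ab_deltas(toy_agent, tasks, memory)
print(deltas)
```

Reporting success in points and cost in percent mirrors the units in the baseline table; a real harness would also need repeated runs per task to separate the memory effect from sampling noise.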
Why this is a big deal
Three reasons STATE-Bench matters more than it might look:
1. It commoditizes memory claims. Memory vendors have spent 18 months making impressive-sounding claims (“90% recall accuracy on 10M tokens!”). STATE-Bench reframes the conversation around the only number that matters to buyers: does the agent finish the work better?
2. It exposes the cost of memory. Some memory systems improve task success but explode token cost. STATE-Bench reports both numbers, which makes ROI calculable.
3. It puts Microsoft in the standards-setter seat. By publishing the benchmark, Microsoft positions Copilot’s Work IQ memory layer as the system everyone else gets compared to. It is a classic platform play, the same one OpenAI ran with HumanEval and SWE-bench Verified.
Microsoft’s published baseline numbers
The launch blog post reports baselines for three frontier models (four model-and-memory pairings) on a representative 200-task subset. The numbers below are rounded and should be read as directional; see the GitHub repo for full data:
| Model + Memory | Task success delta | Token cost delta | Time delta |
|---|---|---|---|
| GPT-5.5 + OpenAI Memory | +18 pts | +12% | -22% |
| Claude Opus 4.7 + Anthropic Dreaming | +21 pts | +8% | -28% |
| Gemini 3.1 Pro + LangGraph Memory | +14 pts | +18% | -19% |
| GPT-5.5 + mem0 | +16 pts | +5% | -24% |
The takeaway: memory consistently helps, by 14-21 points on task success. The cost overhead is real but modest. Time-to-completion improvements are universal — agents finish tasks faster when they don’t have to re-discover context every session.
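One rough way to read the table is to ask how many points of task success each pairing buys per 1% of extra token cost. The sketch below uses only the rounded figures from the table; the ratio itself is an illustrative metric chosen for this example, not something STATE-Bench reports, and because the rows use different base models it is not a like-for-like comparison of memory layers.

```python
# Rows copied from the baseline table: (pairing, success pts, token cost %, time %)
baselines = [
    ("GPT-5.5 + OpenAI Memory", 18, 12, -22),
    ("Claude Opus 4.7 + Anthropic Dreaming", 21, 8, -28),
    ("Gemini 3.1 Pro + LangGraph Memory", 14, 18, -19),
    ("GPT-5.5 + mem0", 16, 5, -24),
]


def points_per_token_pct(row: tuple) -> float:
    """Success points gained per 1% of additional token cost (illustrative only)."""
    _, success_pts, token_pct, _ = row
    return success_pts / token_pct


# Highest ratio first: the most success gained for the least token overhead.
ranked = sorted(baselines, key=points_per_token_pct, reverse=True)
for row in ranked:
    print(f"{row[0]}: {points_per_token_pct(row):.2f} success pts per 1% extra tokens")
```

On these rounded numbers the pairing with the largest absolute gain is not the one with the best gain per unit of cost, which is the kind of trade-off that reporting both columns makes visible.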
Where STATE-Bench falls short (per Microsoft’s own caveats)
Microsoft is honest about the v1 limitations:
- English-only tasks at launch. Multilingual tasks are planned for v2.
- No tool-heavy workflows yet: the benchmark is mostly conversational plus retrieval. Heavy tool-use scenarios (file system, browser, etc.) are planned for v2.
- Synthetic task data: the multi-session tasks are constructed, not drawn from real enterprise traffic (because real enterprise traffic isn’t shareable). v2 will add a private real-traffic suite for licensees.
- Memory layer must support the standard interface: Microsoft published an SDK, but legacy memory systems may need adapters.
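To make the last caveat concrete, here is a sketch of what an adapter for a legacy memory system could look like. The interface shown is an assumption: `MemoryLayer`, `write`, and `read` are hypothetical stand-ins, not the actual STATE-Bench SDK, whose real method names and signatures should be taken from the repo. `LegacyNoteStore` is likewise an invented example of an older system.

```python
from typing import Protocol


class MemoryLayer(Protocol):
    """Hypothetical interface a harness might expect (NOT the real SDK)."""

    def write(self, session_id: str, text: str) -> None: ...

    def read(self, session_id: str, query: str) -> list[str]: ...


class LegacyNoteStore:
    """Invented legacy system with its own, incompatible method names."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = {}

    def append_note(self, user: str, note: str) -> None:
        self._notes.setdefault(user, []).append(note)

    def search_notes(self, user: str, keyword: str) -> list[str]:
        # Naive case-insensitive substring match over stored notes.
        return [n for n in self._notes.get(user, []) if keyword.lower() in n.lower()]


class LegacyAdapter:
    """Translates the assumed interface onto the legacy store's methods."""

    def __init__(self, store: LegacyNoteStore) -> None:
        self._store = store

    def write(self, session_id: str, text: str) -> None:
        self._store.append_note(session_id, text)

    def read(self, session_id: str, query: str) -> list[str]:
        return self._store.search_notes(session_id, query)


memory: MemoryLayer = LegacyAdapter(LegacyNoteStore())
memory.write("ticket-42", "Customer prefers email follow-ups")
print(memory.read("ticket-42", "email"))
```

The adapter keeps the legacy store untouched and confines the translation to one small class, so swapping in the real SDK interface later should only mean changing the adapter's method names.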
How to actually use STATE-Bench
If you’re a builder shipping an agent:
```bash
# Clone the repo
git clone https://github.com/microsoft/state-bench
cd state-bench

# Install
pip install -e .

# Run baseline (your agent, no memory)
state-bench run --agent ./my_agent.py --memory none --output baseline.json

# Run with memory
state-bench run --agent ./my_agent.py --memory ./my_memory.py --output with_memory.json

# Compare
state-bench compare baseline.json with_memory.json
```
The harness produces a JSON report you can use to validate memory ROI before shipping a memory system to production.
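If you want to turn that report into a ship/no-ship check, something like the following could run in CI. The report schema is not documented here, so the field names `task_success_rate` and `total_tokens` are assumptions, as are the function name and the thresholds; the inline JSON strings stand in for the contents of `baseline.json` and `with_memory.json`.

```python
import json


def memory_pays_off(
    baseline: dict,
    with_memory: dict,
    min_success_gain_pts: float = 5.0,
    max_token_overhead_pct: float = 20.0,
) -> bool:
    """Return True if memory adds enough task success for an acceptable token cost.

    Field names are hypothetical; adjust them to the real report schema.
    """
    gain_pts = 100 * (with_memory["task_success_rate"] - baseline["task_success_rate"])
    overhead_pct = 100 * (with_memory["total_tokens"] / baseline["total_tokens"] - 1)
    return gain_pts >= min_success_gain_pts and overhead_pct <= max_token_overhead_pct


# Stand-ins for json.load(open("baseline.json")) and json.load(open("with_memory.json")).
baseline = json.loads('{"task_success_rate": 0.50, "total_tokens": 200000}')
with_memory = json.loads('{"task_success_rate": 0.75, "total_tokens": 220000}')

print(memory_pays_off(baseline, with_memory))
```

Setting the two thresholds explicitly forces the team to decide up front how much extra token spend a point of task success is worth.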
How it fits into the 2026 agent memory landscape
| Layer | Examples |
|---|---|
| Memory architectures | mem0, LangGraph Memory, Anthropic Dreaming, OpenAI Memory, Copilot Work IQ |
| Recall benchmarks | LoCoMo, LongMemEval, BEAM |
| End-to-end benchmarks | STATE-Bench (May 2026), Harvey LAB (legal-specific, May 2026) |
| Eval tooling | Microsoft 365 Copilot Agent Evaluations Tool, LangSmith, Braintrust, Raindrop Workshop |
STATE-Bench fills the gap between memory-architecture claims and agent-product outcomes.
Verdict
STATE-Bench is the right benchmark at the right time. Memory has gone from research curiosity to procurement requirement in 18 months, and there hasn’t been a credible way to compare memory systems on the metric that matters (task success). Microsoft fixed that.
Expect STATE-Bench numbers to start appearing in Anthropic, OpenAI, and Google launch blogs by July 2026, in vendor pitches by September, and in enterprise RFPs by Q4.
If you’re shipping an agent product in 2026, you should be running STATE-Bench against your stack within the month.