STATE-Bench is an open-source benchmark that Microsoft released on May 19, 2026 to measure the impact of long-term memory on AI agent performance in production-like workflows. Unlike single-turn memory evaluations (LoCoMo, LongMemEval) that ask 'can the model recall this fact?', STATE-Bench measures whether memory actually changes task outcomes: does the agent complete real multi-session tasks better with memory enabled than without it? It's the first widely recognized benchmark to focus on memory's downstream task impact rather than recall accuracy alone.

How is STATE-Bench different from LoCoMo and LongMemEval?

LoCoMo (1,540 questions) and LongMemEval (500 questions) measure recall accuracy — given a long conversation history, can the model retrieve the right fact? STATE-Bench measures something different: does adding memory to an agent improve task success rate, reduce token cost, and shorten time-to-completion on real multi-session workflows? It evaluates the end-to-end agent, not just the memory layer. The two types of benchmarks are complementary.

Why does Microsoft care about agent memory benchmarks?

Microsoft has shipped multiple agent memory and evaluation products in 2026: Microsoft 365 Copilot's Work IQ, Copilot Studio's agent context, and the Agent Evaluations Tool released in May 2026. STATE-Bench gives Microsoft (and the rest of the industry) a credible way to claim 'our memory layer improves agent task success by X%', a measurable claim that pricing and procurement can be built on. It's part of Microsoft's broader strategy of making agentic AI measurable and auditable for enterprise buyers.

Is STATE-Bench actually open source?

Yes. Microsoft released STATE-Bench under an OSI-approved license on GitHub on May 19, 2026, with the full evaluation harness, task suite, and reference implementations for several memory architectures (mem0, LangGraph Memory, Anthropic Dreaming, OpenAI Memory). Anyone can run it against their own agent and memory stack. Microsoft published baseline numbers for GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro with and without memory.

What is STATE-Bench? Microsoft's AI Agent Memory Benchmark

Published: May 24, 2026

What is STATE-Bench? (May 2026)

STATE-Bench is an open-source benchmark Microsoft released on May 19, 2026 that measures what memory actually does for production AI agents. Unlike older memory benchmarks that only measured recall accuracy, STATE-Bench measures whether memory changes downstream task outcomes — and that distinction matters enormously for buyers and builders.

Last verified: May 24, 2026.

TL;DR

What it is: An open-source benchmark for AI agent memory.
Released by: Microsoft Open Source, May 19, 2026.
License: OSI-approved (full eval harness on GitHub).
Measures: Task success rate, token efficiency, and time-to-completion with vs. without memory.
Why it matters: It’s the first benchmark to ask “does memory actually help the agent?” rather than “can the model recall a fact?”

The core insight

For two years, the AI memory conversation has been dominated by recall benchmarks:

LoCoMo (2024): 1,540 questions across single-hop, multi-hop, open-domain, and temporal recall.
LongMemEval (2025): 500 questions across knowledge updates, multi-session recall, and other categories.
BEAM (2026): 1M and 10M token-scale evaluations across multiple categories.

These benchmarks ask: given a long conversation history, can the model retrieve the right fact? That’s a real and useful measurement. But it’s not the question enterprise buyers actually have.

The question buyers have is: “If I add a memory system to my agent, does the agent get measurably better at the work my company pays it to do?”

STATE-Bench is the first major benchmark designed around that second question.

How STATE-Bench works

The Microsoft blog post and GitHub README describe the harness:

Task suite: Multi-session, multi-turn workflows representative of real agent jobs — customer support tickets, multi-day project management, longitudinal research assistance, code review across PRs.
A/B evaluation: Run each agent on the task suite twice, once with memory enabled and once with memory disabled (cold context every session).
Three measured outcomes:
- Task success rate — did the agent complete the task correctly?
- Token efficiency — how many tokens did it cost?
- Time-to-completion — how many turns/steps?
Delta reporting: STATE-Bench reports the delta between memory-on and memory-off. A memory system that improves task success by 30 points looks very different from one that improves it by 3 points.
Reference implementations: Microsoft published reference plugins for mem0, LangGraph Memory, Anthropic Dreaming, and OpenAI Memory.

The harness can wrap any model and any memory layer, which is why it’s likely to become the de facto industry benchmark.
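
The A/B delta reporting described above can be illustrated with a minimal Python sketch. The `RunResult` record and its field names are hypothetical and are not taken from the STATE-Bench harness; the sketch only shows the arithmetic, with the success delta in percentage points and the token and turn deltas as relative percentages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One task attempt. Field names are illustrative, not the harness schema."""
    success: bool
    tokens: int
    turns: int


def summarize(runs):
    """Aggregate a list of RunResult into the three headline metrics."""
    return {
        "success_rate": 100.0 * mean(1.0 if r.success else 0.0 for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
        "avg_turns": mean(r.turns for r in runs),
    }


def memory_delta(memory_off, memory_on):
    """Compare a memory-on run against its memory-off baseline."""
    off, on = summarize(memory_off), summarize(memory_on)
    return {
        # Absolute difference in percentage points.
        "success_delta_pts": on["success_rate"] - off["success_rate"],
        # Relative change versus the memory-off baseline, in percent.
        "token_delta_pct": 100.0 * (on["avg_tokens"] - off["avg_tokens"]) / off["avg_tokens"],
        "turn_delta_pct": 100.0 * (on["avg_turns"] - off["avg_turns"]) / off["avg_turns"],
    }


# Toy data: two tasks per condition.
off_runs = [RunResult(True, 1000, 10), RunResult(False, 1000, 10)]
on_runs = [RunResult(True, 1100, 8), RunResult(True, 1100, 8)]
print(memory_delta(off_runs, on_runs))
```

Reporting success in points and cost in relative percent mirrors the units used in the published baseline table; a real harness would also need per-task pairing and variance estimates, which this sketch omits.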

Why this is a big deal

Three reasons STATE-Bench matters more than it might look:

1. It commoditizes memory claims. Memory vendors have spent 18 months making impressive-sounding claims (“90% recall accuracy on 10M tokens!”). STATE-Bench reframes the conversation around the only number that matters to buyers: does the agent finish the work better?

2. It exposes the cost of memory. Some memory systems improve task success but explode token cost. STATE-Bench reports both numbers, which makes ROI calculable.

3. It puts Microsoft in the standards-setter seat. By publishing the benchmark, Microsoft positions Copilot’s Work IQ memory layer as the system everyone else gets compared to. It is a classic platform play, the same one OpenAI ran with HumanEval and SWE-bench Verified.

Microsoft’s published baseline numbers

The launch blog post reports baselines for three frontier models on a representative 200-task subset. The numbers below are rounded and should be read as directional; see the GitHub repo for full data:

Model + Memory	Task success delta	Token cost delta	Time-to-completion delta
GPT-5.5 + OpenAI Memory	+18 pts	+12%	-22%
Claude Opus 4.7 + Anthropic Dreaming	+21 pts	+8%	-28%
Gemini 3.1 Pro + LangGraph Memory	+14 pts	+18%	-19%
GPT-5.5 + mem0	+16 pts	+5%	-24%

The takeaway: memory consistently helps, by 14-21 points on task success. The cost overhead is real but modest (5-18% more tokens). Time-to-completion improves in every configuration, by 19-28%, because agents finish tasks faster when they don’t have to re-discover context every session.
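
To make the ROI point concrete, the two deltas can be combined into a single figure: the change in tokens spent per successfully completed task. The sketch below is a back-of-the-envelope calculation, not part of STATE-Bench. Because the table reports only deltas, the memory-off baseline success rate (`ASSUMED_BASELINE_SUCCESS`) is a hypothetical input you would replace with your own measurement.

```python
# Hypothetical memory-off success rate in percent; NOT from Microsoft's data.
ASSUMED_BASELINE_SUCCESS = 50.0


def cost_per_success_change(baseline_success, success_delta_pts, token_delta_pct):
    """Relative change (%) in tokens spent per successfully completed task.

    baseline_success is the memory-off success rate in percent.
    success_delta_pts and token_delta_pct are the table's deltas.
    """
    new_success = baseline_success + success_delta_pts
    if baseline_success <= 0 or new_success <= 0:
        raise ValueError("success rates must be positive")
    token_ratio = 1.0 + token_delta_pct / 100.0
    success_ratio = new_success / baseline_success
    return 100.0 * (token_ratio / success_ratio - 1.0)


# (success delta in points, token cost delta in percent) from the table above.
rows = {
    "GPT-5.5 + OpenAI Memory": (18, 12),
    "Claude Opus 4.7 + Anthropic Dreaming": (21, 8),
    "Gemini 3.1 Pro + LangGraph Memory": (14, 18),
    "GPT-5.5 + mem0": (16, 5),
}

for name, (pts, tok) in rows.items():
    change = cost_per_success_change(ASSUMED_BASELINE_SUCCESS, pts, tok)
    print(f"{name}: {change:+.1f}% tokens per successful task")
```

Under the assumed 50% baseline, the GPT-5.5 + mem0 row works out to roughly 20% fewer tokens per successful task, even though raw token cost rose by 5%. The result is sensitive to the baseline, so a higher starting success rate shrinks the benefit.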

Where STATE-Bench falls short (per Microsoft’s own caveats)

Microsoft is honest about the v1 limitations:

English-only tasks at launch. Multilingual tasks are planned for v2.
No tool-heavy workflows yet: the tasks are mostly conversational and retrieval-based. Heavy tool-use scenarios (file system, browser, etc.) are planned for v2.
Synthetic task data: the multi-session tasks are constructed, not drawn from real enterprise traffic (because real enterprise traffic isn’t shareable). v2 will add a private real-traffic suite for licensees.
Memory layer must support the standard interface: Microsoft published an SDK, but legacy memory systems may need adapters.
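
The adapter caveat is easiest to see in code. The article does not describe the actual SDK interface, so the `MemoryLayer` protocol below, with its `store` and `recall` methods, is purely hypothetical, as is the `LegacyNotes` class standing in for an older memory system. The sketch shows only the general adapter pattern.

```python
from typing import Protocol


class MemoryLayer(Protocol):
    """Hypothetical interface; the real SDK's method names may differ."""

    def store(self, session_id: str, text: str) -> None: ...

    def recall(self, query: str, limit: int = 5) -> list[str]: ...


class LegacyNotes:
    """Stand-in for an older memory system with its own method names."""

    def __init__(self) -> None:
        self._notes: list[str] = []

    def add_note(self, note: str) -> None:
        self._notes.append(note)

    def find(self, keyword: str) -> list[str]:
        # Naive case-insensitive substring match.
        return [n for n in self._notes if keyword.lower() in n.lower()]


class LegacyAdapter:
    """Exposes LegacyNotes through the assumed MemoryLayer interface."""

    def __init__(self, legacy: LegacyNotes) -> None:
        self._legacy = legacy

    def store(self, session_id: str, text: str) -> None:
        # Tag each note with its session so provenance survives.
        self._legacy.add_note(f"[{session_id}] {text}")

    def recall(self, query: str, limit: int = 5) -> list[str]:
        return self._legacy.find(query)[:limit]


memory: MemoryLayer = LegacyAdapter(LegacyNotes())
memory.store("session-1", "Customer prefers email follow-ups")
print(memory.recall("email"))
```

The adapter keeps the legacy system untouched and translates calls at the boundary, which is usually cheaper than rewriting the memory layer to match a benchmark's interface.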

How to actually use STATE-Bench

If you’re a builder shipping an agent:

# Clone the repo
git clone https://github.com/microsoft/state-bench
cd state-bench

# Install
pip install -e .

# Run baseline (your agent, no memory)
state-bench run --agent ./my_agent.py --memory none --output baseline.json

# Run with memory
state-bench run --agent ./my_agent.py --memory ./my_memory.py --output with_memory.json

# Compare
state-bench compare baseline.json with_memory.json

The harness produces a JSON report you can use to validate memory ROI before shipping a memory system to production.
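
One way to use the two reports is as a release gate. The sketch below assumes each JSON report contains top-level `task_success_rate` (percent) and `avg_tokens` fields. Those key names are hypothetical, since the report schema is not described here, and the thresholds are placeholders to tune for your own product.

```python
import json


def load_metrics(path):
    """Read one report file. The expected keys are an assumption, not the real schema."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def passes_gate(baseline, with_memory,
                min_success_gain_pts=5.0, max_token_increase_pct=20.0):
    """Return True if memory adds enough task success without excessive token cost."""
    gain = with_memory["task_success_rate"] - baseline["task_success_rate"]
    token_increase = (
        100.0 * (with_memory["avg_tokens"] - baseline["avg_tokens"])
        / baseline["avg_tokens"]
    )
    return gain >= min_success_gain_pts and token_increase <= max_token_increase_pct


# In-memory example; in practice these would come from
# load_metrics("baseline.json") and load_metrics("with_memory.json").
baseline = {"task_success_rate": 50.0, "avg_tokens": 1000}
with_memory = {"task_success_rate": 66.0, "avg_tokens": 1050}
print(passes_gate(baseline, with_memory))
```

Wiring a check like this into CI would turn the memory decision into a repeatable, reviewable threshold rather than a one-off judgment.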

How it fits into the 2026 agent memory landscape

Layer	Examples
Memory architectures	mem0, LangGraph Memory, Anthropic Dreaming, OpenAI Memory, Copilot Work IQ
Recall benchmarks	LoCoMo, LongMemEval, BEAM
End-to-end benchmarks	STATE-Bench (May 2026), Harvey LAB (legal-specific, May 2026)
Eval tooling	Microsoft 365 Copilot Agent Evaluations Tool, LangSmith, Braintrust, Raindrop Workshop

STATE-Bench fills the gap between memory-architecture claims and agent-product outcomes.

Verdict

STATE-Bench is the right benchmark at the right time. Memory has gone from research curiosity to procurement requirement in 18 months, and there hasn’t been a credible way to compare memory systems on the metric that matters (task success). Microsoft fixed that.

Expect STATE-Bench numbers to start appearing in Anthropic, OpenAI, and Google launch blogs by July 2026, in vendor pitches by September, and in enterprise RFPs by Q4.

If you’re shipping an agent product in 2026, you should be running STATE-Bench against your stack within the month.