What's the difference between STATE-Bench, LoCoMo, and LongMemEval?

LoCoMo (1,540 questions) and LongMemEval (500 questions) are recall benchmarks — they measure whether a model can retrieve specific facts from long conversation histories. STATE-Bench, released by Microsoft on May 19, 2026, is an end-to-end agent benchmark — it measures whether adding memory to an agent improves task success rate, token cost, and time-to-completion on multi-session workflows. Recall benchmarks test the memory layer; STATE-Bench tests the agent.

Which AI memory benchmark should I use?

Use LoCoMo or LongMemEval if you're building a memory architecture and need to optimize recall. Use STATE-Bench if you're a buyer or product team deciding whether memory is worth the cost overhead. Use BEAM if you need to evaluate memory at 1M-10M token scale. In practice, serious memory teams run all four: the two recall benchmarks to tune, STATE-Bench to validate ROI, and BEAM for scale stress tests.

Is STATE-Bench replacing LoCoMo and LongMemEval?

No — they answer different questions and are complementary. A memory system can ace LoCoMo (high recall) but fail STATE-Bench (recall doesn't translate to task success). Or it can win STATE-Bench (huge task improvement) without leading on LoCoMo (it does aggressive summarization that loses some recall but improves agent decisions). Both numbers matter, and 2026's memory teams report both.

Are these benchmarks free and open source?

Yes. STATE-Bench is OSI-licensed on GitHub. LoCoMo is on Hugging Face under MIT license. LongMemEval is on GitHub under Apache 2.0. BEAM has a free tier and a paid enterprise tier. All four can be run against any frontier model API or self-hosted model — you just pay for the inference.

STATE-Bench vs LoCoMo vs LongMemEval: AI Memory Benchmarks 2026

Published: May 24, 2026

Microsoft released STATE-Bench on May 19, 2026. It is a new open-source benchmark that measures the impact of memory on AI agent task success, and the first widely recognized benchmark to ask “does memory help the agent?” rather than “can the model recall a fact?” Here’s how it stacks up against the two established memory benchmarks, LoCoMo and LongMemEval.

Last verified: May 24, 2026.

TL;DR table

	STATE-Bench	LoCoMo	LongMemEval
Released	May 19, 2026 (Microsoft)	2024	2024
License	OSI-approved (GitHub)	MIT (Hugging Face)	Apache 2.0 (GitHub)
Question it answers	“Does memory improve agent task success?”	“Can the model recall facts from long conversations?”	“Can the model handle multi-session memory tasks?”
Size	200 task baseline (more in repo)	1,540 questions	500 questions
Categories	Customer support, project mgmt, longitudinal research, code review	Single-hop, multi-hop, open-domain, temporal	Knowledge updates, multi-session recall, reasoning
Eval style	A/B (memory on vs. memory off)	Recall accuracy	Recall accuracy + reasoning
What it measures	Task success, token cost, time-to-completion	Answer correctness	Answer correctness + chain-of-thought
Best for	Buyers, product teams, ROI validation	Memory architecture optimization	Memory + reasoning research
Vendor focus	Microsoft-led, vendor-neutral harness	Academic + industry collab	Academic

What each benchmark is actually measuring

LoCoMo — recall accuracy

LoCoMo was the first widely used memory benchmark to escape research-paper land and get adopted by memory startups. It works like this:

Generate a long, plausible conversation history (hundreds of turns spread across many sessions).
Ask the model questions whose answers are buried in the history.
Score answer accuracy across four categories: single-hop, multi-hop, open-domain, temporal.
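
The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not LoCoMo's official scorer: it assumes normalized exact-match scoring, and the record layout and the function names `normalize` and `accuracy_by_category` are hypothetical. Published evaluations may use other metrics, such as token-level F1 or an LLM judge.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_category(records):
    """Return per-category accuracy.

    records: iterable of (category, predicted_answer, gold_answer) tuples.
    Scoring is normalized exact match, which is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if normalize(predicted) == normalize(gold):
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical records, for illustration only.
records = [
    ("single-hop", "Paris", "paris"),
    ("single-hop", "Berlin", "Madrid"),
    ("temporal", "last  March", "last march"),
]
scores = accuracy_by_category(records)
print(scores)
```

Reporting the four categories separately matters because an average can hide a memory layer that handles single-hop lookups well but fails temporal questions.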

It’s a clean, well-designed benchmark for the question “can your memory layer surface the right fact when asked directly?”

LongMemEval — recall + reasoning

LongMemEval goes beyond LoCoMo-style recall with harder reasoning categories:

Knowledge updates: the user’s preference changed in turn 47 — does the model use the new preference?
Multi-session recall: facts spread across multiple separate conversations.
Implicit reasoning: facts that require combining multiple memory pieces.

At 500 questions it is smaller than LoCoMo, but the questions are harder and more representative of real assistant workflows.
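
To make the knowledge-update category concrete, here is a minimal sketch of the behavior it tests: when the same fact is stated twice, the later statement should win. The `Fact` type, the `latest_facts` helper, and the sample data are all hypothetical and are not part of LongMemEval itself.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str    # what the fact is about, e.g. "preferred_airline"
    value: str  # the stated value
    turn: int   # conversation turn in which it was stated


def latest_facts(history):
    """Resolve conflicting facts by keeping the most recently stated value per key."""
    latest = {}
    for fact in history:
        current = latest.get(fact.key)
        if current is None or fact.turn >= current.turn:
            latest[fact.key] = fact
    return {key: fact.value for key, fact in latest.items()}


# Hypothetical history: the user changes a preference in turn 47.
history = [
    Fact("preferred_airline", "Delta", 12),
    Fact("home_city", "Austin", 20),
    Fact("preferred_airline", "United", 47),
]
resolved = latest_facts(history)
print(resolved["preferred_airline"])
```

A memory layer that retrieves both statements without ordering them leaves the model to guess, which is the failure this category is designed to expose.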

STATE-Bench — end-to-end agent task success

STATE-Bench takes a completely different angle. Instead of asking the model questions, it gives an agent a multi-session task and measures whether the agent completes it better with memory enabled vs. disabled.

The tasks are realistic agent workflows:

A 5-session customer-support investigation (escalation, root cause, resolution).
A multi-day project-management workflow (kickoff, status updates, deliverable).
A longitudinal research task (read 10 docs over 4 sessions, write a synthesis).
A multi-PR code review (context across PRs over a week).

For each task, STATE-Bench runs the agent twice, once with memory enabled and once with memory disabled, and reports the delta on three metrics:

Task success rate (did the agent finish correctly?)
Token cost (how expensive was the run?)
Time-to-completion (how many turns/steps?)

The output is a delta report: “memory improves Claude Opus 4.7 task success by 21 points, with 8% more tokens, and 28% fewer turns.”
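
The arithmetic behind such a delta report is straightforward. The sketch below is an assumption about how the three deltas could be computed from paired runs and is not STATE-Bench's actual harness. The `Run` type, the function names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    success: bool  # did the agent finish the task correctly?
    tokens: int    # total tokens consumed by the run
    turns: int     # number of turns/steps taken


def summarize(runs):
    """Average the three metrics over a list of runs."""
    return {
        "success_rate": mean(1.0 if r.success else 0.0 for r in runs),
        "tokens": mean(r.tokens for r in runs),
        "turns": mean(r.turns for r in runs),
    }


def delta_report(with_memory, without_memory):
    """Compare memory-on against memory-off.

    Success is reported in percentage points; tokens and turns are reported
    as percent change relative to the memory-off baseline.
    """
    on, off = summarize(with_memory), summarize(without_memory)
    return {
        "success_delta_pts": round(100 * (on["success_rate"] - off["success_rate"]), 1),
        "token_change_pct": round(100 * (on["tokens"] - off["tokens"]) / off["tokens"], 1),
        "turn_change_pct": round(100 * (on["turns"] - off["turns"]) / off["turns"], 1),
    }


# Hypothetical runs, for illustration only.
memory_off = [Run(True, 1000, 10), Run(False, 1000, 10), Run(False, 1000, 10), Run(False, 1000, 10)]
memory_on = [Run(True, 1100, 8), Run(True, 1100, 8), Run(False, 1100, 8), Run(False, 1100, 8)]
report = delta_report(memory_on, memory_off)
print(report)
```

Note the units: a success delta in points is an absolute difference between two rates, while the token and turn figures are relative changes, so the three numbers are not directly comparable.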

Why STATE-Bench is the benchmark that matters for buyers

Recall benchmarks like LoCoMo and LongMemEval are essential for memory architects — the people building mem0, Anthropic Dreaming, OpenAI Memory, etc. They’re how you tune a memory system.

But for buyers — the CTO deciding whether to enable Microsoft 365 Copilot’s Work IQ, the founder deciding whether to add mem0 to their agent stack — recall accuracy isn’t the right question. The right question is: does my agent do better work?

That’s STATE-Bench’s exact frame. It’s the benchmark you cite when you tell your CFO “this memory layer is worth the $X/month.”

How they compose

The mature memory team in mid-2026 runs all four:

Stage	Benchmark	Purpose
Design	LoCoMo	Does the memory layer recall the right facts?
Optimize	LongMemEval	Does it handle multi-session reasoning?
Validate ROI	STATE-Bench	Does the agent actually get better?
Stress test at scale	BEAM	Does it hold up at 1M-10M tokens?

You can’t skip stages. A memory system that aces LoCoMo but loses on STATE-Bench is theoretically interesting but commercially useless. A system that wins STATE-Bench but fails LoCoMo is suspicious — it’s probably memorizing patterns rather than recalling facts, and will break on new task types.
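
One way to read the two failure cases above is as a simple triage over a recall score and a task-success delta. The sketch below is illustrative only: the function name `triage` and both thresholds are hypothetical, and neither benchmark publishes such cut-offs.

```python
def triage(locomo_recall, state_bench_delta_pts, recall_floor=0.70, delta_floor=10.0):
    """Classify a memory system from its recall score and its task-success delta.

    recall_floor and delta_floor are hypothetical thresholds chosen for this
    sketch; tune them to your own baselines.
    """
    strong_recall = locomo_recall >= recall_floor
    strong_delta = state_bench_delta_pts >= delta_floor
    if strong_recall and strong_delta:
        return "proceed to scale testing"
    if strong_recall:
        return "recall is not translating into task success"
    if strong_delta:
        return "check for overfitting to the task suite"
    return "revisit the memory design"


print(triage(0.82, 21.0))
print(triage(0.82, 3.0))
print(triage(0.40, 21.0))
```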

Where each benchmark gets gamed

Every benchmark gets gamed eventually. Here’s the current state:

LoCoMo: Memory layers can train on the publicly known LoCoMo distribution and inflate scores. Mitigation: hold-out sets and contamination checks. Several vendors were caught doing this in 2025.

LongMemEval: Same contamination risk, and the smaller dataset is easier to overfit. Mitigated by frequent task rotation in 2026.

STATE-Bench: The task suite is public, so vendors can train against it. Microsoft has signaled a private real-traffic suite is coming in v2 for licensees to verify against. Treat the public score as a ceiling, not a guarantee.
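
A contamination check can be as simple as measuring n-gram overlap between benchmark items and the data a memory system was trained or tuned on. The sketch below shows that general technique under stated assumptions: whitespace tokenization, a caller-chosen n, and hypothetical function names. It is not the procedure used by any of the benchmarks discussed here.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams that also appear in corpus_doc.

    A ratio near 1.0 suggests the item appears nearly verbatim in the corpus.
    Returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and training documents.
question = "when did alice move to berlin"
leaked_doc = "note: when did alice move to berlin last year"
clean_doc = "bob likes hiking in the alps"

leaked = overlap_ratio(question, leaked_doc, n=3)
clean = overlap_ratio(question, clean_doc, n=3)
print(leaked, clean)
```

Exact n-gram matching misses paraphrased leaks, so a low ratio is weak evidence of a clean corpus; it is best used to flag obvious verbatim copies.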

What the numbers look like in practice (May 2026)

Approximate published / reported numbers as of May 24, 2026:

Memory system	LoCoMo recall	LongMemEval accuracy	STATE-Bench Δ task success
OpenAI Memory (GPT-5.5)	~78%	~71%	+18 pts
Anthropic Dreaming (Opus 4.7)	~82%	~76%	+21 pts
LangGraph Memory (any model)	~74%	~68%	+14 pts
mem0 (any model)	~76%	~70%	+16 pts
Microsoft Work IQ (M365 Copilot)	—	—	+19 pts (Microsoft-reported)

Numbers vary by configuration and are best treated as directional. The pattern: Anthropic Dreaming has been the strongest performer across categories in May 2026 reports, but the field is competitive and shifting fast.

Which to pick for your job

You’re building a memory layer: LoCoMo + LongMemEval for tuning, STATE-Bench for validation, BEAM for scale.
You’re a buyer evaluating memory systems: STATE-Bench is the only number you should care about.
You’re an agent product team: STATE-Bench, run against your own agent and your candidate memory layers.
You’re a researcher: LoCoMo + LongMemEval — they have the clearest published methodology and easiest comparisons.
You’re a procurement officer: STATE-Bench is the benchmark that will end up in RFPs by Q4 2026. Get familiar now.

Verdict

LoCoMo and LongMemEval pioneered the field and remain essential for memory architects. STATE-Bench is the missing buyer-side benchmark that turns memory from a research metric into a procurable capability with measurable ROI.

Expect STATE-Bench numbers in every memory vendor pitch by Q3 2026 and in enterprise procurement requirements by Q4.