STATE-Bench vs LoCoMo vs LongMemEval: AI Memory Benchmarks 2026
STATE-Bench vs LoCoMo vs LongMemEval (May 2026)
Microsoft released STATE-Bench on May 19, 2026: a new open-source benchmark that measures the impact of memory on AI agent task success. It’s the first widely recognized benchmark to ask “does memory help the agent?” rather than “can the model recall a fact?” Here’s how it stacks up against the two established memory benchmarks, LoCoMo and LongMemEval.
Last verified: May 24, 2026.
TL;DR table
| | STATE-Bench | LoCoMo | LongMemEval |
|---|---|---|---|
| Released | May 19, 2026 (Microsoft) | 2024 | 2024 (published at ICLR 2025) |
| License | OSI-approved (GitHub) | MIT (Hugging Face) | Apache 2.0 (GitHub) |
| Question it answers | “Does memory improve agent task success?” | “Can the model recall facts from long conversations?” | “Can the model handle multi-session memory tasks?” |
| Size | 200-task baseline (more in repo) | 1,540 questions | 500 questions |
| Categories | Customer support, project mgmt, longitudinal research, code review | Single-hop, multi-hop, open-domain, temporal | Knowledge updates, multi-session recall, reasoning |
| Eval style | A/B (memory on vs. memory off) | Recall accuracy | Recall accuracy + reasoning |
| What it measures | Task success, token cost, time-to-completion | Answer correctness | Answer correctness + chain-of-thought |
| Best for | Buyers, product teams, ROI validation | Memory architecture optimization | Memory + reasoning research |
| Vendor focus | Microsoft-led, vendor-neutral harness | Academic + industry collab | Academic |
What each benchmark is actually measuring
LoCoMo — recall accuracy
LoCoMo was the first widely used memory benchmark to escape research-paper land and get adopted by memory startups. It works like this:
- Generate a long, plausible conversation history (hundreds of turns spread across many sessions).
- Ask the model questions whose answers are buried in the history.
- Score answer accuracy across four categories: single-hop, multi-hop, open-domain, temporal.
It’s a clean, well-designed benchmark for the question “can your memory layer surface the right fact when asked directly?”
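As a rough illustration of the scoring step, here is a minimal sketch of per-category accuracy. The category names come from the list above; the graded records and the `accuracy_by_category` helper are hypothetical and are not part of any LoCoMo tooling.

```python
from collections import defaultdict

# Hypothetical graded results: (category, was_the_answer_correct).
# The records are made up for illustration only.
results = [
    ("single-hop", True),
    ("single-hop", True),
    ("multi-hop", False),
    ("multi-hop", True),
    ("open-domain", True),
    ("temporal", False),
]


def accuracy_by_category(records):
    """Return {category: fraction correct} for graded question records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


per_category = accuracy_by_category(results)
overall = sum(is_correct for _, is_correct in results) / len(results)

for category, score in sorted(per_category.items()):
    print(f"{category}: {score:.0%}")
print(f"overall: {overall:.0%}")
```

Reporting the per-category breakdown alongside the overall number matters because a strong single-hop score can hide weak temporal or multi-hop recall.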
LongMemEval — recall + reasoning
LongMemEval goes beyond LoCoMo-style direct recall with harder reasoning categories:
- Knowledge updates: the user’s preference changed in turn 47 — does the model use the new preference?
- Multi-session recall: facts spread across multiple separate conversations.
- Implicit reasoning: facts that require combining multiple memory pieces.
At 500 questions it is smaller than LoCoMo, but harder and more representative of real assistant workflows.
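To make the knowledge-update category concrete, here is a minimal sketch of the behavior it probes: when a fact changes, the memory layer should return the most recent value. The `Fact` record, the `latest_value` helper, and the sample history are all hypothetical; real memory systems use far richer retrieval than this.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    turn: int    # conversation turn in which the fact was stated
    key: str     # what the fact is about
    value: str   # the stated value


def latest_value(facts, key):
    """Return the most recently stated value for key, or None if never stated."""
    matching = [fact for fact in facts if fact.key == key]
    if not matching:
        return None
    return max(matching, key=lambda fact: fact.turn).value


# Made-up history: the user changes their airline preference in turn 47.
history = [
    Fact(turn=3, key="preferred_airline", value="Delta"),
    Fact(turn=12, key="home_city", value="Austin"),
    Fact(turn=47, key="preferred_airline", value="United"),
]

print(latest_value(history, "preferred_airline"))
```

A system that answers with the turn-3 value here has recalled a fact correctly and still failed the task, which is the failure mode this category is designed to catch.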
STATE-Bench — end-to-end agent task success
STATE-Bench takes a completely different angle. Instead of asking the model questions, it gives an agent a multi-session task and measures whether the agent completes it better with memory enabled vs. disabled.
The tasks are realistic agent workflows:
- A 5-session customer-support investigation (escalation, root cause, resolution).
- A multi-day project-management workflow (kickoff, status updates, deliverable).
- A longitudinal research task (read 10 docs over 4 sessions, write a synthesis).
- A multi-PR code review (context across PRs over a week).
For each task, STATE-Bench runs the agent twice, once with memory enabled and once with memory disabled, and reports the delta on three metrics:
- Task success rate (did the agent finish correctly?)
- Token cost (how expensive was the run?)
- Time-to-completion (how many turns/steps?)
The output is a delta report: “memory improves Claude Opus 4.7 task success by 21 points, with 8% more tokens, and 28% fewer turns.”
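The arithmetic behind a delta report like that can be sketched as follows. This assumes success is reported as a percentage-point difference and tokens and turns as relative changes; `RunSummary`, `delta_report`, and the input numbers are hypothetical, chosen only to reproduce the shape of the example above, and do not reflect the actual STATE-Bench harness or its output format.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    success_rate: float  # fraction of tasks completed correctly, 0..1
    tokens: int          # total tokens consumed across the suite
    turns: int           # total turns/steps taken across the suite


def delta_report(memory_on, memory_off):
    """Compare a memory-enabled run against a memory-disabled baseline."""
    return {
        # Absolute difference, in percentage points.
        "success_delta_pts": round(
            (memory_on.success_rate - memory_off.success_rate) * 100, 1
        ),
        # Relative change versus the baseline, in percent.
        "token_change_pct": round(
            (memory_on.tokens / memory_off.tokens - 1) * 100, 1
        ),
        "turn_change_pct": round(
            (memory_on.turns / memory_off.turns - 1) * 100, 1
        ),
    }


# Made-up aggregate numbers for illustration.
baseline = RunSummary(success_rate=0.50, tokens=100_000, turns=50)
with_memory = RunSummary(success_rate=0.71, tokens=108_000, turns=36)

report = delta_report(with_memory, baseline)
print(report)
```

Note the difference in units: a 21-point gain from a 50% baseline is a 42% relative improvement, so it is worth checking which of the two a vendor is quoting.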
Why STATE-Bench is the benchmark that matters for buyers
Recall benchmarks like LoCoMo and LongMemEval are essential for memory architects — the people building mem0, Anthropic Dreaming, OpenAI Memory, etc. They’re how you tune a memory system.
But for buyers — the CTO deciding whether to enable Microsoft 365 Copilot’s Work IQ, the founder deciding whether to add mem0 to their agent stack — recall accuracy isn’t the right question. The right question is: does my agent do better work?
That’s STATE-Bench’s exact frame. It’s the benchmark you cite when you tell your CFO “this memory layer is worth the $X/month.”
How they compose
The mature memory team in mid-2026 runs all three:
| Stage | Benchmark | Purpose |
|---|---|---|
| Design | LoCoMo | Does the memory layer recall the right facts? |
| Optimize | LongMemEval | Does it handle multi-session reasoning? |
| Validate ROI | STATE-Bench | Does the agent actually get better? |
| Stress test at scale | BEAM | Does it hold up at 1M-10M tokens? |
You can’t skip stages. A memory system that aces LoCoMo but loses on STATE-Bench is theoretically interesting but commercially useless. A system that wins STATE-Bench but fails LoCoMo is suspicious — it’s probably memorizing patterns rather than recalling facts, and will break on new task types.
Where each benchmark gets gamed
Every benchmark gets gamed eventually. Here’s the current state:
LoCoMo: Memory layers can train on the publicly known LoCoMo distribution and inflate scores. Mitigations: held-out sets and contamination checks. Several vendors were caught doing this in 2025.
LongMemEval: Same contamination risk, and the smaller dataset makes overfitting easier. Mitigated by frequent task rotation in 2026.
STATE-Bench: The task suite is public, so vendors can train against it. Microsoft has signaled a private real-traffic suite is coming in v2 for licensees to verify against. Treat the public score as a ceiling, not a guarantee.
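One simple form of contamination check is n-gram overlap between benchmark items and a training corpus. The sketch below is a generic heuristic, not the method used by any of the benchmarks or vendors mentioned here; the function names, the 5-gram window, and the sample strings are all assumptions for illustration.

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=5):
    """Fraction of the benchmark item's n-grams that also appear in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Made-up examples.
question = "what city did alice move to in march last year"
leaked_doc = "Q: What city did Alice move to in March last year A: Lisbon"
clean_doc = "the quarterly report covers revenue growth across three product lines"

print(overlap_ratio(question, leaked_doc))
print(overlap_ratio(question, clean_doc))
```

A high ratio flags an item for manual review rather than proving contamination, and paraphrased leaks will slip past an exact-match check like this one.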
What the numbers look like in practice (May 2026)
Approximate published / reported numbers as of May 24, 2026:
| Memory system | LoCoMo recall | LongMemEval accuracy | STATE-Bench Δ task success |
|---|---|---|---|
| OpenAI Memory (GPT-5.5) | ~78% | ~71% | +18 pts |
| Anthropic Dreaming (Opus 4.7) | ~82% | ~76% | +21 pts |
| LangGraph Memory (any model) | ~74% | ~68% | +14 pts |
| mem0 (any model) | ~76% | ~70% | +16 pts |
| Microsoft Work IQ (M365 Copilot) | — | — | +19 pts (Microsoft-reported) |
Numbers vary by configuration and are best treated as directional. The pattern: Anthropic Dreaming has been the strongest performer across categories in May 2026 reports, but the field is competitive and shifting fast.
Which to pick for your job
- You’re building a memory layer: LoCoMo + LongMemEval for tuning, STATE-Bench for validation, BEAM for scale.
- You’re a buyer evaluating memory systems: STATE-Bench is the only number you should care about.
- You’re an agent product team: STATE-Bench, run against your own agent and your candidate memory layers.
- You’re a researcher: LoCoMo + LongMemEval — they have the clearest published methodology and easiest comparisons.
- You’re a procurement officer: STATE-Bench is the benchmark that will end up in RFPs by Q4 2026. Get familiar now.
Verdict
LoCoMo and LongMemEval pioneered the field and remain essential for memory architects. STATE-Bench is the missing buyer-side benchmark that turns memory from a research metric into a procurable capability with measurable ROI.
Expect STATE-Bench numbers in every memory vendor pitch by Q3 2026 and in enterprise procurement requirements by Q4.