STATE-Bench vs LoCoMo vs LongMemEval: AI Memory Benchmarks 2026

Microsoft released STATE-Bench on May 19, 2026: a new open-source benchmark that measures how much memory contributes to AI agent task success. It’s the first widely recognized benchmark to ask “does memory help the agent?” rather than “can the model recall a fact?” Here’s how it stacks up against the two established memory benchmarks, LoCoMo and LongMemEval.

Last verified: May 24, 2026.

TL;DR table

| | STATE-Bench | LoCoMo | LongMemEval |
|---|---|---|---|
| Released | May 19, 2026 (Microsoft) | 2024 | 2025 |
| License | OSI-approved (GitHub) | MIT (Hugging Face) | Apache 2.0 (GitHub) |
| Question it answers | “Does memory improve agent task success?” | “Can the model recall facts from long conversations?” | “Can the model handle multi-session memory tasks?” |
| Size | 200-task baseline (more in repo) | 1,540 questions | 500 questions |
| Categories | Customer support, project mgmt, longitudinal research, code review | Single-hop, multi-hop, open-domain, temporal | Knowledge updates, multi-session recall, reasoning |
| Eval style | A/B (memory on vs. memory off) | Recall accuracy | Recall accuracy + reasoning |
| What it measures | Task success, token cost, time-to-completion | Answer correctness | Answer correctness + chain-of-thought |
| Best for | Buyers, product teams, ROI validation | Memory architecture optimization | Memory + reasoning research |
| Vendor focus | Microsoft-led, vendor-neutral harness | Academic + industry collab | Academic |

What each benchmark is actually measuring

LoCoMo — recall accuracy

LoCoMo was the first widely used memory benchmark to escape research-paper land and get adopted by memory startups. It works like this:

  1. Generate a long, plausible conversation history (hundreds of turns spread across many sessions).
  2. Ask the model questions whose answers are buried in the history.
  3. Score answer accuracy across four categories: single-hop, multi-hop, open-domain, temporal.
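The scoring step above can be sketched in a few lines. This is a minimal illustration, assuming each question is stored as a record with a category, a gold answer, and the model's prediction, and assuming normalized exact match as the correctness test. The record layout, the `normalize` helper, and the toy data are all hypothetical; the benchmark's own harness may score answers differently (for example with token-overlap or judge-based metrics).

```python
from collections import defaultdict
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def accuracy_by_category(records):
    """Return {category: fraction of predictions matching the gold answer}.

    `records` is an iterable of dicts with 'category', 'gold', 'prediction'
    keys (a hypothetical layout, not the benchmark's actual file format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if normalize(record["prediction"]) == normalize(record["gold"]):
            hits[record["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Made-up toy data, for illustration only.
records = [
    {"category": "single-hop", "gold": "Lisbon", "prediction": "lisbon."},
    {"category": "single-hop", "gold": "a red bike", "prediction": "a blue bike"},
    {"category": "temporal", "gold": "March 2023", "prediction": "March 2023"},
    {"category": "multi-hop", "gold": "her sister", "prediction": "Her sister"},
]
scores = accuracy_by_category(records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.0%}")
```

Reporting per category rather than one pooled number is the useful part: a memory layer can look fine overall while failing one category, such as temporal questions.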

It’s a clean, well-designed benchmark for the question “can your memory layer surface the right fact when asked directly?”

LongMemEval — recall + reasoning

LongMemEval targets the same kind of long-term recall as LoCoMo and adds harder reasoning categories:

  • Knowledge updates: the user’s preference changed in turn 47 — does the model use the new preference?
  • Multi-session recall: facts spread across multiple separate conversations.
  • Implicit reasoning: facts that require combining multiple memory pieces.

At 500 questions it’s smaller than LoCoMo, but harder and more representative of real assistant workflows.
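To make the knowledge-update category concrete, here is a toy sketch of the behavior it probes: when a fact is restated later, the newer value should win, even if older facts are written again out of order. `Fact`, `LatestWinsMemory`, and the sample data are hypothetical names invented for this illustration; they are not part of LongMemEval or of any vendor's memory API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    session: int
    turn: int


class LatestWinsMemory:
    """Toy key-value memory: the most recent fact per key wins."""

    def __init__(self) -> None:
        self._facts = {}

    def write(self, fact: Fact) -> None:
        current = self._facts.get(fact.key)
        # Compare (session, turn) so a later session always beats an earlier one.
        if current is None or (fact.session, fact.turn) >= (current.session, current.turn):
            self._facts[fact.key] = fact

    def read(self, key: str) -> Optional[str]:
        fact = self._facts.get(key)
        return fact.value if fact else None


memory = LatestWinsMemory()
memory.write(Fact("coffee_order", "oat latte", session=1, turn=12))
memory.write(Fact("coffee_order", "black coffee", session=1, turn=47))  # preference changed
memory.write(Fact("home_city", "Porto", session=2, turn=3))             # a separate session
memory.write(Fact("coffee_order", "oat latte", session=1, turn=12))     # stale replay, ignored

print(memory.read("coffee_order"))
```

A real memory layer has to infer that two statements concern the same fact before it can apply a rule like this, which is a large part of what makes the category hard.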

STATE-Bench — end-to-end agent task success

STATE-Bench takes a completely different angle. Instead of asking the model questions, it gives an agent a multi-session task and measures whether the agent completes it better with memory enabled vs. disabled.

The tasks are realistic agent workflows:

  • A 5-session customer-support investigation (escalation, root cause, resolution).
  • A multi-day project-management workflow (kickoff, status updates, deliverable).
  • A longitudinal research task (read 10 docs over 4 sessions, write a synthesis).
  • A multi-PR code review (context across PRs over a week).

For each task, STATE-Bench runs the agent twice, once with memory enabled and once with memory disabled, and reports the delta on three metrics:

  • Task success rate (did the agent finish correctly?)
  • Token cost (how expensive was the run?)
  • Time-to-completion (how many turns/steps?)

The output is a delta report: “memory improves Claude Opus 4.7 task success by 21 points, with 8% more tokens, and 28% fewer turns.”
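The arithmetic behind a delta report like that is simple enough to sketch. The snippet below assumes success is reported as a percentage-point difference while token cost and turns are reported as relative changes, which matches the phrasing of the example sentence. `RunSummary`, `delta_report`, and the input numbers are hypothetical and chosen only to reproduce the shape of that example; this is not STATE-Bench's actual harness or output format.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    tasks: int
    successes: int
    total_tokens: int
    total_turns: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.tasks


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


def delta_report(with_memory: RunSummary, without_memory: RunSummary) -> dict:
    return {
        # Percentage points: difference of two rates, not a ratio.
        "success_delta_pts": 100.0 * (with_memory.success_rate - without_memory.success_rate),
        "token_change_pct": pct_change(with_memory.total_tokens, without_memory.total_tokens),
        "turn_change_pct": pct_change(with_memory.total_turns, without_memory.total_turns),
    }


# Made-up inputs, for illustration only.
memory_off = RunSummary(tasks=200, successes=100, total_tokens=1_000_000, total_turns=5_000)
memory_on = RunSummary(tasks=200, successes=142, total_tokens=1_080_000, total_turns=3_600)

report = delta_report(memory_on, memory_off)
print({name: round(value, 1) for name, value in report.items()})
```

With these inputs the report rounds to +21 points of task success, 8% more tokens, and 28% fewer turns. Keeping points and percent distinct matters: going from 50% to 71% success is 21 points, but a 42% relative improvement.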

Why STATE-Bench is the benchmark that matters for buyers

Recall benchmarks like LoCoMo and LongMemEval are essential for memory architects — the people building mem0, Anthropic Dreaming, OpenAI Memory, etc. They’re how you tune a memory system.

But for buyers — the CTO deciding whether to enable Microsoft 365 Copilot’s Work IQ, the founder deciding whether to add mem0 to their agent stack — recall accuracy isn’t the right question. The right question is: does my agent do better work?

That’s STATE-Bench’s exact frame. It’s the benchmark you cite when you tell your CFO “this memory layer is worth the $X/month.”

How they compose

The mature memory team in mid-2026 runs all three:

| Stage | Benchmark | Purpose |
|---|---|---|
| Design | LoCoMo | Does the memory layer recall the right facts? |
| Optimize | LongMemEval | Does it handle multi-session reasoning? |
| Validate ROI | STATE-Bench | Does the agent actually get better? |
| Stress test at scale | BEAM | Does it hold up at 1M-10M tokens? |

You can’t skip stages. A memory system that aces LoCoMo but loses on STATE-Bench is theoretically interesting but commercially useless. A system that wins STATE-Bench but fails LoCoMo is suspicious — it’s probably memorizing patterns rather than recalling facts, and will break on new task types.
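The two failure patterns above can be written as a small triage rule. This is only a sketch: the function name, the 0.70 recall floor, and the 5-point delta floor are hypothetical cut-offs picked for illustration, not thresholds published by any of the benchmarks.

```python
def triage(locomo_recall: float, state_bench_delta_pts: float,
           recall_floor: float = 0.70, delta_floor: float = 5.0) -> str:
    """Classify a memory system by recall (0-1) and task-success delta (points).

    Both floors are illustrative assumptions; tune them to your own bar.
    """
    recalls = locomo_recall >= recall_floor
    helps = state_bench_delta_pts >= delta_floor
    if recalls and helps:
        return "consistent"             # recalls facts and improves outcomes
    if recalls:
        return "recall without payoff"  # aces recall, agent does not improve
    if helps:
        return "suspicious"             # wins on tasks without reliable recall
    return "weak on both"


print(triage(0.82, 21.0))
print(triage(0.82, 1.0))
print(triage(0.40, 15.0))
```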

Where each benchmark gets gamed

Every benchmark gets gamed eventually. Here’s the current state:

LoCoMo: Memory layers can train on the publicly known LoCoMo distribution and inflate their scores. Mitigations: hold-out sets and contamination checks. Several vendors were caught doing this in 2025.

LongMemEval: Same contamination risk, and the smaller dataset makes overfitting easier. Frequent task rotation in 2026 has mitigated this.

STATE-Bench: The task suite is public, so vendors can train against it. Microsoft has signaled a private real-traffic suite is coming in v2 for licensees to verify against. Treat the public score as a ceiling, not a guarantee.
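One common form of the contamination check mentioned above is verbatim n-gram overlap between benchmark items and a training or retrieval corpus. The sketch below shows that generic technique under simple assumptions (whitespace tokenization, exact n-gram matches); it is not a procedure documented by any of the three benchmarks, and the sample strings are made up.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams in `text` (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs, n: int = 8) -> float:
    """Share of the item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Made-up corpus and items; n=3 only because the toy strings are short.
corpus = [
    "faq: where did the user move after leaving lisbon answer porto",
    "unrelated text about benchmark design",
]
leaked = "where did the user move after leaving lisbon"
clean = "which editor does the user prefer for python"

print(overlap_ratio(leaked, corpus, n=3), overlap_ratio(clean, corpus, n=3))
```

A high ratio flags an item for manual review rather than proving leakage, and paraphrased contamination will slip past an exact-match check like this one.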

What the numbers look like in practice (May 2026)

Approximate published / reported numbers as of May 24, 2026:

| Memory system | LoCoMo recall | LongMemEval accuracy | STATE-Bench Δ task success |
|---|---|---|---|
| OpenAI Memory (GPT-5.5) | ~78% | ~71% | +18 pts |
| Anthropic Dreaming (Opus 4.7) | ~82% | ~76% | +21 pts |
| LangGraph Memory (any model) | ~74% | ~68% | +14 pts |
| mem0 (any model) | ~76% | ~70% | +16 pts |
| Microsoft Work IQ (M365 Copilot) | not listed | not listed | +19 pts (Microsoft-reported) |

Numbers vary by configuration and are best treated as directional. The pattern: Anthropic Dreaming has been the strongest performer across categories in May 2026 reports, but the field is competitive and shifting fast.

Which to pick for your job

  • You’re building a memory layer: LoCoMo + LongMemEval for tuning, STATE-Bench for validation, BEAM for scale.
  • You’re a buyer evaluating memory systems: STATE-Bench is the only number you should care about.
  • You’re an agent product team: STATE-Bench, run against your own agent and your candidate memory layers.
  • You’re a researcher: LoCoMo + LongMemEval — they have the clearest published methodology and easiest comparisons.
  • You’re a procurement officer: STATE-Bench is the benchmark that will end up in RFPs by Q4 2026. Get familiar now.

Verdict

LoCoMo and LongMemEval pioneered the field and remain essential for memory architects. STATE-Bench is the missing buyer-side benchmark that turns memory from a research metric into a procurable capability with measurable ROI.

Expect STATE-Bench numbers in every memory vendor pitch by Q3 2026 and in enterprise procurement requirements by Q4.