SL#60 - Your Agent Scores 94% on Memory. It Still Thinks You Live in New York.


Memory benchmarks are nearly saturated on recall. A new benchmark that penalizes stale memory shows the same systems falling apart, and getting worse the longer they run.

If you shopped for an agent memory layer this year, you saw the numbers. 92.5 on LoCoMo. 94.4 on LongMemEval. The pitch writes itself: memory used to be a longer prompt and a prayer, now it is a measured, near-solved component with a leaderboard and an SDK. Wire it in over an afternoon and your agent remembers the user across sessions.

Then you ship it, and three weeks later a user who moved from New York to San Francisco gets a restaurant recommendation in Manhattan. The agent did not fail to remember. It remembered too well. It still has the old city, retrieves it with high confidence, and acts on it.

That gap between the score and the failure is not a rough edge. It is the thing the score was never measuring. The benchmarks that made memory look solved test whether an agent can pull back a fact it was told once. The failure your users actually hit is whether the agent can let go of a fact that stopped being true. Those are different problems, and the second one is barely on the leaderboard.

What the leaderboards actually test

LoCoMo and LongMemEval are the two benchmarks every memory vendor quotes. They are real, reproducible, and a genuine improvement over the era when memory quality was self-reported. LoCoMo asks 1,540 questions across multi-session conversations: single-hop recall, multi-hop, open-domain, temporal. LongMemEval adds knowledge-update and multi-session categories. Both test something harder than single-pass attention over a fixed input.

But look at how much history a question actually requires. A team from Genies and Arizona State went through both benchmarks and found that in LoCoMo, 94% of questions need grounding from no more than two prior sessions. In LongMemEval, 85% do. Averaged across the standard benchmarks, the typical question depends on about one prior session. The task most of these questions reduce to is: a fact was stated once, can you find it again.

That is recall. It is useful, and explicit memory stores are genuinely good at it. But it quietly assumes that a stored fact stays valid forever. Real long-term interaction is not like that. People correct themselves, change jobs, move cities, abandon goals, swap preferences. The benchmarks barely touch this. The same analysis measured how many updates or deletions each benchmark applies to a fact before asking about it, and for LoCoMo, PerLTQA, and MemDaily the answer is zero. LongMemEval includes knowledge updates but caps them at two sessions; PersonaMem at three. So a 94% on LongMemEval tells you the system is good at finding things. It tells you almost nothing about whether the system knows which of two contradictory things is currently true.

What happens when you score forgetting

The Genies and ASU group built a benchmark, Memora, to stress the part the others skip. It spans weeks-to-months conversations and pushes both consolidation (how many past sessions a query depends on) and mutation (how many times a fact gets updated or deleted before you ask). Where existing benchmarks average about one session of consolidation and roughly zero mutations, Memora's monthly setting averages 17.3 sessions of consolidation and 8.8 mutations per query. That is closer to what a real long-running assistant accumulates.

The more interesting contribution is the metric. They call it Forgetting-Aware Memory Accuracy, FAMA. Standard memory scoring checks whether the required fact shows up in the response. FAMA also penalizes the response for relying on a fact that has been invalidated. If the user moved to San Francisco and the agent answers using the New York memory, presence-based scoring can still give partial credit because the city is "in there." FAMA does not. It asks whether the answer reflects the user's current state.
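The difference between the two scoring styles can be sketched in a few lines. This is an illustration only, not the Memora authors' actual FAMA formula: the function names, the substring matching, and the `penalty` weight are all assumptions made for the example.

```python
def presence_score(response: str, required: list[str]) -> float:
    """Presence-based scoring: fraction of currently valid facts mentioned."""
    if not required:
        return 0.0
    text = response.lower()
    hits = sum(1 for fact in required if fact.lower() in text)
    return hits / len(required)


def forgetting_aware_score(
    response: str,
    required: list[str],
    invalidated: list[str],
    penalty: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Presence score minus a penalty for leaning on invalidated facts."""
    text = response.lower()
    base = presence_score(response, required)
    stale_hits = sum(1 for fact in invalidated if fact.lower() in text)
    stale_fraction = stale_hits / len(invalidated) if invalidated else 0.0
    # Clamp at zero so a stale answer cannot score below "no answer".
    return max(0.0, base - penalty * stale_fraction)


required = ["San Francisco"]   # the user's current city
invalidated = ["New York"]     # the city they moved away from

stale_answer = "Now that you are in San Francisco, try this bistro in New York."
clean_answer = "Here is a bistro near you in San Francisco."

stale_presence = presence_score(stale_answer, required)
stale_fama_like = forgetting_aware_score(stale_answer, required, invalidated)
clean_fama_like = forgetting_aware_score(clean_answer, required, invalidated)
```

Under these assumptions the stale answer gets full credit from the presence check and none from the forgetting-aware one, while the clean answer scores the same either way. A real scorer would need something sturdier than substring matching to decide whether a response relies on a fact.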

Apply that penalty and the comfortable numbers collapse. Across every model and every memory agent they tested, scoring for forgetting produces large reductions. The detail that should worry anyone running this in production is the direction. For the memory agents, the penalty gets bigger as the timeline gets longer: an 18.2 point reduction at the weekly scale grows to 29.5 at the quarterly scale. The systems sold specifically as memory get worse at staying current the longer they run, because they keep accumulating old facts and keep retrieving them. Retaining access to everything without an effective way to discard amplifies the inconsistency instead of reducing it.

The raw recall scores show the same brittleness once the horizon stretches. On the remembering task, MemoBase drops from 43.6 at weekly to 15.18 at quarterly. MemoryOS goes 51.84 to 25.05. Mem-0 goes 40.42 to 19.90. These are not edge cases buried in a long tail. This is the headline capability, recall, degrading by half or more as the conversation grows into the range where memory was supposed to matter most.
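The "half or more" claim is easy to check from the three pairs of scores quoted above. A minimal sketch of the arithmetic, with the dictionary layout and function name chosen for illustration:

```python
# (weekly, quarterly) remembering scores as quoted in the text above.
scores = {
    "MemoBase": (43.6, 15.18),
    "MemoryOS": (51.84, 25.05),
    "Mem-0": (40.42, 19.90),
}


def relative_drop(weekly: float, quarterly: float) -> float:
    """Percentage of the weekly score lost by the quarterly setting."""
    return (weekly - quarterly) / weekly * 100


drops = {
    name: round(relative_drop(weekly, quarterly), 1)
    for name, (weekly, quarterly) in scores.items()
}
# drops == {"MemoBase": 65.2, "MemoryOS": 51.7, "Mem-0": 50.8}
```

Each system loses at least half of its weekly score, and MemoBase loses close to two thirds.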

Why this is a production problem, not a benchmark nitpick

It would be easy to wave this off as one research group moving the goalposts. Memory vendors are not hiding the issue, which is the tell that it is real. Mem0's own state-of-the-field writeup lists memory staleness as an open problem in plain language: a highly-retrieved memory about a user's employer is accurate until they change jobs, at which point it becomes confidently wrong. Decay handles low-relevance memories that fade on their own. Staleness in high-relevance, frequently-retrieved memories is the hard, unsolved case, and it is exactly the case that hurts, because the more central a fact is to a user, the more often the agent will surface it and the more damage it does when it is out of date.

There is a second-order failure that makes this worse. The Memora results separate two ways a system can go wrong on stale data. The memory agents fail by retrieving an old fact and using it. The plain language models, oddly, get a shrinking penalty as the timeline grows, but not because they improve. Their histories simply overflow the context window, so the stale fact never gets retrieved at all and the relevant information is omitted altogether. One failure is acting on the wrong memory; the other is acting on no memory. Neither is the thing you want, and a recall benchmark scores both of them generously.

And the capability that would let an agent reconcile all this, reasoning over temporally distributed facts, is the worst-scoring task in the whole study. On a scale where each task sums to 300, the best memory agents average 27.55 on reasoning. Several configurations score zero in the monthly setting. The authors' conclusion is blunt: the bottleneck is not memory capacity, not context length, not storage size. It is the failure to maintain a coherent, up-to-date state under frequent change. You cannot buy your way out of that with a bigger context window or another vector store, which is precisely what most of the current tooling is selling.

Yes, but recall still matters

The honest objection is that recall is not a strawman. For a lot of products, retrieving the right fact is most of the job, and the explicit memory layers earn their keep there. The same study shows it: on remembering, memory agents average 119.45 versus 65.60 for bare language models. That is a real, large advantage, and if your agent's main task is "surface what the user told you," you should still use one of these systems. Recommendation is even more forgiving, because a plausible suggestion that is broadly consistent with the user's tastes gets credit even when retrieval is incomplete. Plenty of useful agents live entirely in that tolerant zone.

So the claim is not that memory layers are useless or that the benchmarks are fraudulent. It is narrower and harder to dodge: the number on the box measures the easy half of the problem, the half these systems are already good at, and is silent on the half that breaks in production. When a vendor quotes 94.4, ask what that 94.4 would look like if every point earned by relying on an invalidated fact were taken back. Based on Memora, the answer is double-digit reductions that grow over time. The benchmark is not lying. It is answering a question you did not ask.

What to do on Monday

Stop treating a single recall score as a memory evaluation. If you are choosing a memory layer, the question that separates the contenders is not "what do you score on LongMemEval," it is "what happens when I tell the agent something, contradict it three sessions later, and query a month after that." Build that test from your own data. Take ten facts your users actually revise (employer, city, plan tier, dietary restriction, project status, and so on), write the update path for each, then check whether the agent answers from the current value or the original one. That ten-fact harness will tell you more than any leaderboard.
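A rough sketch of what such a harness could look like follows. Everything here is hypothetical: the `tell`/`ask` interface stands in for whatever your memory layer exposes, the two toy agents exist only to show the harness separating stale from current answers, and substring matching is a crude stand-in for real answer grading.

```python
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    """Hypothetical interface; adapt to your own agent or memory SDK."""
    def tell(self, text: str) -> None: ...
    def ask(self, question: str) -> str: ...


@dataclass
class RevisedFact:
    name: str       # e.g. "city"
    original: str   # value stated first
    updated: str    # value that supersedes it
    question: str   # query asked after the update


def classify(answer: str, fact: RevisedFact) -> str:
    text = answer.lower()
    if fact.original.lower() in text:
        return "stale"    # any reliance on the old value counts as stale
    if fact.updated.lower() in text:
        return "current"
    return "missing"


def run_harness(agent: Agent, facts: list[RevisedFact]) -> dict[str, str]:
    for fact in facts:
        agent.tell(f"My {fact.name} is {fact.original}.")
    for fact in facts:  # updates arrive after all the originals
        agent.tell(f"Update: my {fact.name} is now {fact.updated}.")
    return {fact.name: classify(agent.ask(fact.question), fact) for fact in facts}


class AppendOnlyAgent:
    """Toy agent: stores everything, answers with the OLDEST matching note."""
    def __init__(self) -> None:
        self.notes: list[str] = []

    def tell(self, text: str) -> None:
        self.notes.append(text)

    def _candidates(self) -> list[str]:
        return self.notes

    def ask(self, question: str) -> str:
        words = [w.strip("?.,").lower() for w in question.split()]
        keywords = [w for w in words if len(w) > 3]
        for note in self._candidates():
            if any(k in note.lower() for k in keywords):
                return note
        return "I don't know."


class LatestWinsAgent(AppendOnlyAgent):
    """Toy agent: same store, but searches the newest notes first."""
    def _candidates(self) -> list[str]:
        return list(reversed(self.notes))


facts = [
    RevisedFact("city", "New York", "San Francisco", "What city do I live in?"),
    RevisedFact("employer", "Acme", "Globex", "Who is my employer?"),
]
append_only_report = run_harness(AppendOnlyAgent(), facts)
latest_wins_report = run_harness(LatestWinsAgent(), facts)
```

In practice you would replace the toy agents with a thin adapter around the system under test, insert unrelated sessions between the original statement and the update, and count how many of your facts come back as `"stale"`.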

Then design for invalidation, not just storage. Most current systems treat a change as a new memory stacked on top of the old one and lean on retrieval ranking to surface the right version, which is the exact behavior that decays over time. Treat memory writes as having an owner and a lifecycle: when a fact changes, the old value should be retired, not merely outranked. If your stack cannot express "this supersedes that," you are accumulating confidently wrong answers and the bill comes due slowly, one stale recommendation at a time, in the part of the curve no benchmark is watching.
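As a minimal sketch of "retired, not merely outranked," here is one possible shape for a store that records supersession explicitly. The class and field names are invented for illustration and do not correspond to any particular memory product; ownership is left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MemoryRecord:
    key: str
    value: str
    written_at: datetime
    retired_at: Optional[datetime] = None   # set when a newer value replaces it
    superseded_by: Optional[int] = None     # index of the replacing record


class LifecycleStore:
    """Keeps full history, but only one active value per key."""

    def __init__(self) -> None:
        self._records: list[MemoryRecord] = []
        self._active: dict[str, int] = {}   # key -> index of the live record

    def write(self, key: str, value: str, now: Optional[datetime] = None) -> int:
        now = now or datetime.now(timezone.utc)
        new_index = len(self._records)
        old_index = self._active.get(key)
        if old_index is not None:
            # Retire the old value explicitly instead of leaving it to ranking.
            old = self._records[old_index]
            old.retired_at = now
            old.superseded_by = new_index
        self._records.append(MemoryRecord(key, value, now))
        self._active[key] = new_index
        return new_index

    def current(self, key: str) -> Optional[str]:
        index = self._active.get(key)
        return None if index is None else self._records[index].value

    def history(self, key: str) -> list[MemoryRecord]:
        return [r for r in self._records if r.key == key]


store = LifecycleStore()
store.write("city", "New York")
store.write("city", "San Francisco")

live_city = store.current("city")    # "San Francisco"
city_history = store.history("city")  # both records, the first one retired
```

The design choice is that retrieval for answering goes through `current`, which can never return a retired value, while `history` stays available for questions about the past such as "where did I live before?". The hard part this sketch skips is deciding that a new statement is an update to an existing key in the first place.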
