AI Agent Memory Benchmarks Explained
LongMemEval is the emerging standard for measuring AI agent memory quality. Here is what it tests, who has published scores, and what to look for beyond benchmarks.
TL;DR: LongMemEval (ICLR 2025) tests 5 memory capabilities across 500 questions. Published scores: OMEGA 95.4% (GPT-4.1), Mastra OM 94.87% (gpt-5-mini), Zep/Graphiti 71.2% (GPT-4o). Most memory systems (Mem0, Letta, and others) have not published scores. Benchmarks matter, but they are not the whole picture.
What LongMemEval Tests
Five core capabilities that define whether a memory system actually works.
Information Extraction
Can the system extract and store specific facts from conversations? Tests whether memories are correctly captured from natural dialogue.
Example: User mentions their dog's name is Max. Later asked: 'What is the user's dog's name?'
Multi-Session Reasoning
Can the system combine information from multiple past sessions to answer a question? Tests cross-session knowledge synthesis.
Example: Session 1: User likes Italian food. Session 5: User is lactose intolerant. Question: 'What Italian dishes should I recommend?'
Temporal Reasoning
Can the system track how information changes over time? Tests understanding of chronological knowledge updates.
Example: January: User works at Google. March: User starts at Anthropic. Question: 'Where does the user work?'
Knowledge Updates
When information changes, does the system return the most current version? Tests whether old facts are properly superseded.
Example: User's address changed twice. System should return the most recent address, not the original.
Abstention
When the system does not have the information, does it correctly say 'I don't know'? Tests resistance to hallucination from false memories.
Example: Asked about something never discussed. System should abstain rather than guess.
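The five capabilities above can be illustrated with a toy memory store. This is a minimal sketch, not LongMemEval's harness or any real system's API: `MemoryStore`, `remember`, and `recall` are hypothetical names, and real systems extract facts from free-form dialogue rather than key-value pairs.

```python
from dataclasses import dataclass, field

# Hypothetical toy store illustrating extraction, updates, and abstention.
# This is NOT LongMemEval's API; all names here are invented for illustration.
@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)  # key -> list of (session, value)

    def remember(self, session: int, key: str, value: str) -> None:
        # Information extraction: capture a fact observed in a session.
        self.facts.setdefault(key, []).append((session, value))

    def recall(self, key: str):
        # Knowledge updates: the most recent session's value supersedes older ones.
        history = self.facts.get(key)
        if not history:
            return None  # Abstention: nothing stored, so say "I don't know".
        return max(history)[1]  # tuples sort by session number first

store = MemoryStore()
store.remember(1, "dog_name", "Max")        # extraction
store.remember(2, "employer", "Google")
store.remember(5, "employer", "Anthropic")  # temporal update

print(store.recall("dog_name"))  # Max
print(store.recall("employer"))  # Anthropic — newest fact wins
print(store.recall("cat_name"))  # None — abstain rather than guess
```

Real memory systems face the hard versions of these problems: extracting the fact from natural dialogue, deciding when a new fact supersedes an old one, and resisting the urge to answer from a plausible-sounding false memory.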
Published Scores
| System | Score | Model | Notes |
|---|---|---|---|
| OMEGA | 95.4% | GPT-4.1 | Dedicated memory system. Local-first, 25 MCP tools. |
| Mastra OM | 94.87% | gpt-5-mini | In-context observational memory. No persistence layer. |
| Zep / Graphiti | 71.2% | GPT-4o + GPT-4o-mini | Graph-based memory. Requires Neo4j. |
| Mem0 | Not published | — | Cloud-first memory platform. 47.3K GitHub stars. |
| Letta | Not published | — | Full agent framework with memory. 21.1K GitHub stars. |
Scores as of April 2026. See /benchmarks for detailed methodology and per-category breakdowns.
Why Most Systems Haven't Published
As of April 2026, only three memory systems have published LongMemEval scores. The rest — including Mem0 (47.3K GitHub stars) and Letta (21.1K stars) — have not. There are several possible reasons:
- Cost and complexity: Running LongMemEval properly requires 500 LLM-graded evaluations. With GPT-4-class models as judges, a single run costs $50-200 and takes hours. Many teams have not invested the engineering effort.
- Architecture mismatch: Some memory systems are designed for different use cases (consumer chatbots, team knowledge bases) and may not map cleanly to LongMemEval's conversational agent framing.
- Scores may not be competitive: Teams that run LongMemEval internally and get low scores have little incentive to publish. Silence is not evidence of poor quality, but it is not evidence of good quality either.
- Focus on other metrics: Some teams prioritize latency, scalability, or integration simplicity over retrieval accuracy benchmarks. These are legitimate priorities, but benchmarks provide a common comparison point.
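The cost figure in the first bullet can be sanity-checked with back-of-envelope arithmetic. The token counts and judge pricing below are illustrative assumptions (long histories plus grading can easily reach tens of thousands of tokens per question), not measured figures.

```python
# Rough cost of one LongMemEval run with an LLM-as-judge.
# Token counts and price are illustrative assumptions, not measured values.
QUESTIONS = 500
TOKENS_PER_EVAL = 20_000   # conversation history + candidate answer + grading
PRICE_PER_MTOK = 5.00      # USD per million tokens for a GPT-4-class judge

cost = QUESTIONS * TOKENS_PER_EVAL / 1_000_000 * PRICE_PER_MTOK
print(f"~${cost:.0f} per run")  # ~$50 under these assumptions
```

Doubling the context length or using a pricier judge pushes the estimate toward the upper end of the $50-200 range, which is consistent with the bullet above.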
Beyond Benchmarks
Benchmarks measure retrieval accuracy. Real-world performance depends on more than that.
Scalability
Does accuracy hold at 10K memories? 100K? Without intelligent forgetting, retrieval quality degrades as memory grows.
Operational Simplicity
How many external services are required? Docker? Neo4j? API keys? The simpler the setup, the faster you ship.
Data Privacy
Where is the data stored? Who has access? For regulated industries, this can be a dealbreaker regardless of benchmark scores.
Contradiction Handling
When the agent stores conflicting information over time, does the system detect and surface the conflict?
Multi-Agent Support
Can multiple agents share memory safely? This requires coordination primitives, not just a shared database.
Cost
Cloud memory systems charge per API call or per stored memory. Local-first systems have no recurring fees after installation; the ongoing cost is your own compute.
Frequently Asked
What is LongMemEval?
LongMemEval is a benchmark published at ICLR 2025 that evaluates long-term memory in AI conversational agents. It tests 5 core capabilities — information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention — across 500 questions derived from real conversation histories.
Why haven't most memory systems published LongMemEval scores?
Running LongMemEval properly is expensive (500 LLM-graded questions) and time-consuming. Some systems focus on different evaluation methods. Others may not score well. As of April 2026, only OMEGA (95.4%), Mastra Observational Memory (94.87%), and Zep/Graphiti (71.2%) have published scores.
Can I trust benchmark scores alone?
No. Benchmarks measure retrieval accuracy under controlled conditions. They do not test real-world factors like performance at scale (10K+ memories), contradiction handling, multi-agent coordination, or operational simplicity. Use benchmarks as one signal among many when evaluating memory systems.
How do I evaluate memory systems for my use case?
Start with LongMemEval scores for a baseline accuracy comparison. Then evaluate: architecture (local-first vs cloud), dependencies (what external services are required), privacy (where data is stored), scalability (intelligent forgetting), and integration complexity (how many lines of config to get started). The best memory system is one that works well and that you can actually deploy.
95.4% on LongMemEval
The highest published score among dedicated memory systems. Free, open source, and local-first.