AI Agent Memory Benchmarks Explained
LongMemEval is the emerging standard for measuring AI agent memory quality. Here is what it tests, who has published scores, and what to look for beyond benchmarks.
TL;DR: LongMemEval (ICLR 2025) tests 5 memory capabilities across 500 questions. Published scores: OMEGA 95.4% (GPT-4.1), Mastra OM 94.87% (gpt-5-mini), Zep/Graphiti 71.2% (GPT-4o). Most memory systems (Mem0, Letta, and others) have not published scores. Benchmarks matter, but they are not the whole picture.
What LongMemEval Tests
Five core capabilities that define whether a memory system actually works.
Information Extraction
Can the system extract and store specific facts from conversations? Tests whether memories are correctly captured from natural dialogue.
Example: User mentions their dog's name is Max. Later asked: 'What is the user's dog's name?'
Multi-Session Reasoning
Can the system combine information from multiple past sessions to answer a question? Tests cross-session knowledge synthesis.
Example: Session 1: User likes Italian food. Session 5: User is lactose intolerant. Question: 'What Italian dishes should I recommend?'
Temporal Reasoning
Can the system track how information changes over time? Tests understanding of chronological knowledge updates.
Example: January: User works at Google. March: User starts at Anthropic. Question: 'Where does the user work?'
Knowledge Updates
When information changes, does the system return the most current version? Tests whether old facts are properly superseded.
Example: User's address changed twice. System should return the most recent address, not the original.
Abstention
When the system does not have the information, does it correctly say 'I don't know'? Tests resistance to hallucination from false memories.
Example: Asked about something never discussed. System should abstain rather than guess.
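The five capabilities above can be illustrated with a toy memory store. This is a minimal sketch, not LongMemEval's harness or any real system's API: `MemoryStore`, `remember`, and `recall` are hypothetical names, and real systems extract facts from free-form dialogue rather than key-value pairs.

```python
from dataclasses import dataclass, field

# Hypothetical toy store illustrating extraction, updates, and abstention.
# This is NOT LongMemEval's API; all names here are invented for illustration.
@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)  # key -> list of (session, value)

    def remember(self, session: int, key: str, value: str) -> None:
        # Information extraction: capture a fact observed in a session.
        self.facts.setdefault(key, []).append((session, value))

    def recall(self, key: str):
        # Knowledge updates: the most recent session's value supersedes older ones.
        history = self.facts.get(key)
        if not history:
            return None  # Abstention: nothing stored, so say "I don't know".
        return max(history)[1]  # tuples sort by session number first

store = MemoryStore()
store.remember(1, "dog_name", "Max")        # extraction
store.remember(2, "employer", "Google")
store.remember(5, "employer", "Anthropic")  # temporal update

print(store.recall("dog_name"))  # Max
print(store.recall("employer"))  # Anthropic — newest fact wins
print(store.recall("cat_name"))  # None — abstain rather than guess
```

Real memory systems face the hard versions of these problems: extracting the fact from natural dialogue, deciding when a new fact supersedes an old one, and resisting the urge to answer from a plausible-sounding false memory.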
Published Scores
| System | Score | Model | Notes |
|---|---|---|---|
| OMEGA | 95.4% | GPT-4.1 | Dedicated memory system. Local-first, 25 MCP tools. |
| Mastra OM | 94.87% | gpt-5-mini | In-context observational memory. No persistence layer. |
| Zep / Graphiti | 71.2% | GPT-4o + GPT-4o-mini | Graph-based memory. Requires Neo4j. |
| Mem0 | Not published | — | Cloud-first memory platform. 47.3K GitHub stars. |
| Letta | Not published | — | Full agent framework with memory. 21.1K GitHub stars. |
Scores as of April 2026. See /benchmarks for detailed methodology and per-category breakdowns.
Why Most Systems Haven't Published
As of April 2026, only three memory systems have published LongMemEval scores. The rest — including Mem0 (47.3K GitHub stars) and Letta (21.1K stars) — have not. There are several possible reasons:
- Cost and complexity: Running LongMemEval properly requires 500 LLM-graded evaluations. With GPT-4-class models as judges, a single run costs $50-200 and takes hours. Many teams have not invested the engineering effort.
- Architecture mismatch: Some memory systems are designed for different use cases (consumer chatbots, team knowledge bases) and may not map cleanly to LongMemEval's conversational agent framing.
- Scores may not be competitive: Teams that run LongMemEval internally and get low scores have little incentive to publish. Silence is not evidence of poor quality, but it is not evidence of good quality either.
- Focus on other metrics: Some teams prioritize latency, scalability, or integration simplicity over retrieval accuracy benchmarks. These are legitimate priorities, but benchmarks provide a common comparison point.
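The cost figure in the first bullet can be sanity-checked with back-of-envelope arithmetic. The token counts and judge pricing below are illustrative assumptions (long histories plus grading can easily reach tens of thousands of tokens per question), not measured figures.

```python
# Rough cost of one LongMemEval run with an LLM-as-judge.
# Token counts and price are illustrative assumptions, not measured values.
QUESTIONS = 500
TOKENS_PER_EVAL = 20_000   # conversation history + candidate answer + grading
PRICE_PER_MTOK = 5.00      # USD per million tokens for a GPT-4-class judge

cost = QUESTIONS * TOKENS_PER_EVAL / 1_000_000 * PRICE_PER_MTOK
print(f"~${cost:.0f} per run")  # ~$50 under these assumptions
```

Doubling the context length or using a pricier judge pushes the estimate toward the upper end of the $50-200 range, which is consistent with the bullet above.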
Beyond Benchmarks
Benchmarks measure retrieval accuracy. Real-world performance depends on more than that.
Scalability
Does accuracy hold at 10K memories? 100K? Without intelligent forgetting, retrieval quality degrades as memory grows.
Operational Simplicity
How many external services are required? Docker? Neo4j? API keys? The simpler the setup, the faster you ship.
Data Privacy
Where is the data stored? Who has access? For regulated industries, this can be a dealbreaker regardless of benchmark scores.
Contradiction Handling
When the agent stores conflicting information over time, does the system detect and surface the conflict?
Multi-Agent Support
Can multiple agents share memory safely? This requires coordination primitives, not just a shared database.
Cost
Cloud memory systems charge per API call or per stored memory. Local-first systems have no recurring fees after installation; the ongoing cost is your own compute.
Frequently Asked
What is LongMemEval?
LongMemEval is a benchmark published at ICLR 2025 that evaluates long-term memory in AI conversational agents. It tests 5 core capabilities — information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention — across 500 questions derived from real conversation histories.
Why haven't most memory systems published LongMemEval scores?
Running LongMemEval properly is expensive (500 LLM-graded questions) and time-consuming. Some systems focus on different evaluation methods. Others may not score well. As of April 2026, only OMEGA (95.4%), Mastra Observational Memory (94.87%), and Zep/Graphiti (71.2%) have published scores.
Can I trust benchmark scores alone?
No. Benchmarks measure retrieval accuracy under controlled conditions. They do not test real-world factors like performance at scale (10K+ memories), contradiction handling, multi-agent coordination, or operational simplicity. Use benchmarks as one signal among many when evaluating memory systems.
How do I evaluate memory systems for my use case?
Start with LongMemEval scores for a baseline accuracy comparison. Then evaluate: architecture (local-first vs cloud), dependencies (what external services are required), privacy (where data is stored), scalability (intelligent forgetting), and integration complexity (how many lines of config to get started). The best memory system is one that works well and that you can actually deploy.
95.4% on LongMemEval
The highest published score among dedicated memory systems. Free, open source, and local-first.