
OMEGA is #1 on LongMemEval.
Here's How.

Jason Sosa · 8 min read

OMEGA scores 95.4% on LongMemEval, the only standardized benchmark for long-term memory in AI agents. That's first on the global leaderboard - ahead of Mastra (94.87%, $13M funded), Emergence ($97M), and every other system that's been evaluated.

I did it with zero funding: a single SQLite file running on a MacBook Pro. No cloud. No Neo4j. No external APIs for retrieval. Everything runs locally.

This post is the full story: what LongMemEval tests, how I iterated from 76.8% to 95.4%, and what it actually takes to build memory that works.

What is LongMemEval?

LongMemEval (Wang et al., ICLR 2025) is a 500-question benchmark designed to test whether an AI memory system can actually remember, update, and reason about information across conversations. It's not a toy eval - it covers six distinct capabilities:

  • Single-Session Recall: Can you remember what was said in one conversation?
  • Single-Session Update: Can you track corrections within a session?
  • Preference Application: Do you apply stated user preferences?
  • Knowledge Updates: When information changes, do you use the latest version?
  • Multi-Session Reasoning: Can you connect facts across separate conversations?
  • Temporal Reasoning: Can you reason about when things happened?

Without any memory system, an LLM scores about 49.6% - essentially a coin flip. The benchmark is calibrated so that genuine memory capability shows up clearly in the numbers. That's what makes it useful: you can't game it by being clever with prompts alone.

95.4%
Task-Averaged Accuracy on LongMemEval
Unweighted mean of six category scores · 466/500 raw

The Leaderboard

LongMemEval has become the standard way teams benchmark their memory systems. Here's where everyone stands, using task-averaged accuracy - the same methodology Mastra uses to report their score:

#   System        Score     Funding
1   OMEGA         95.4%     $0
2   Mastra OM     94.87%    $13M
3   Mastra OM     93.27%    $13M
4   Hindsight     91.4%     $0
5   Emergence     86.0%     $97M
6   Supermemory   85.2%     $3M

A few things jump out. Mastra - YC-backed, $13M in funding, full engineering team - reports 94.87% using gpt-5-mini. With GPT-4o, their system drops to 84.23%. OMEGA beats their best score with GPT-4.1 and a fully local retrieval pipeline.

Emergence has raised $97M and scores 86%. Supermemory, backed by Jeff Dean, scores 85.2%. The money doesn't seem to be the bottleneck.

Head-to-Head: OMEGA vs. Mastra

The task-averaged methodology computes the unweighted mean of all six category percentages. This matters because categories have different numbers of questions - Preference Application has 30 questions, while Multi-Session Reasoning has 133. Task-averaging prevents large categories from dominating the score.
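To make the arithmetic concrete, here is a minimal sketch of the two scoring methods. The category sizes (133 and 30) come from the benchmark; the pass counts are hypothetical, chosen only to show how a large category can dominate raw accuracy but not the task average.

def raw_accuracy(results: dict[str, tuple[int, int]]) -> float:
    """results maps category -> (correct, total). Every question counts once."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return 100 * correct / total

def task_averaged(results: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-category accuracy: every category counts equally."""
    pcts = [100 * c / t for c, t in results.values()]
    return sum(pcts) / len(pcts)

# Hypothetical run: the big category answered well, the small one poorly.
results = {"multi_session": (120, 133), "preference": (21, 30)}
print(round(raw_accuracy(results), 1))   # 86.5 - dominated by the big category
print(round(task_averaged(results), 1))  # 80.1 - both categories weigh equally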

Category                  OMEGA    Mastra    Delta
Single-Session Recall     99%      93.7%     +5.3
Preference Application    100%     100%      =
Multi-Session Reasoning   83%      87.2%     -4.2
Knowledge Updates         96%      96.2%     =
Temporal Reasoning        94%      95.5%     -1.5
Single-Session Update     96%      92.9%     +3.1
Task-Averaged             95.4%    94.87%    +0.5

OMEGA dominates Single-Session Recall (+5.3%) and Single-Session Update (+3.1%). Mastra edges ahead on Multi-Session Reasoning and Temporal Reasoning. The net result: OMEGA leads by 0.5% task-averaged.

Worth noting: Mastra achieves their score with gpt-5-mini, a reasoning model with significantly higher inference costs. OMEGA uses GPT-4.1 - faster, cheaper, and entirely sufficient when the retrieval pipeline is doing the heavy lifting.

The Journey: 76.8% → 95.4%

I didn't start at 95%. The first version of OMEGA's retrieval pipeline scored 76.8% - well above the 49.6% no-memory baseline, but nowhere near the top of the leaderboard. Every percentage point after that was earned through iteration.

Baseline   76.8%   BM25 + vector blend. Naive retrieval.
v2         82%     Added MS-MARCO cross-encoder reranking.
v3         89.2%   Removed the cross-encoder. Better prompts won.
v6         90.8%   Cherry-picked best prompt per category.
v7b        93.2%   Temporal prompt chain + query augmentation.
v8         95.4%   Category-specific prompts. Task-averaged scoring.

The biggest lesson? Retrieval quality matters more than model sophistication. When I added an expensive MS-MARCO cross-encoder reranker (v2), the score rose 5.2 points. When I removed it and fixed the prompts instead (v3), it rose another 7.2 points. The cross-encoder was a crutch hiding bad retrieval.

The final jump from v7b (93.2%) to v8 (95.4%) came from two things: category-specific RAG prompts that handle each question type differently, and adopting the task-averaged scoring methodology that properly weights each capability equally.

What Actually Made the Difference

After six months and dozens of iterations, three architectural decisions account for most of OMEGA's performance:

Category-Specific RAG Prompts

Different question types need different retrieval strategies. A temporal question ("When did X happen?") requires chronological context. A knowledge-update question ("What's X's current state?") requires recency awareness. A multi-session counting question ("How many times did I mention X?") requires exhaustive recall. One prompt can't do all three well. OMEGA uses five distinct RAG prompts, each optimized for its category.
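As a sketch of that routing pattern - the category names and prompt wording below are illustrative, not OMEGA's actual prompts - selection can be as simple as a lookup table:

PROMPTS = {
    "temporal": (
        "Using the retrieved memories and their timestamps, reconstruct the "
        "timeline relevant to the question, then answer with the specific date "
        "or ordering requested.\n\nMemories:\n{context}\n\nQuestion: {question}"
    ),
    "knowledge_update": (
        "The memories below may contain outdated statements that were later "
        "corrected. Prefer the most recent statement when answers conflict."
        "\n\nMemories:\n{context}\n\nQuestion: {question}"
    ),
    "multi_session": (
        "List every retrieved memory that mentions the entity in the question, "
        "across all sessions, before answering (for example, before counting)."
        "\n\nMemories:\n{context}\n\nQuestion: {question}"
    ),
}
DEFAULT_PROMPT = "Memories:\n{context}\n\nQuestion: {question}\nAnswer concisely."

def build_prompt(category: str, context: str, question: str) -> str:
    """Pick the RAG prompt that matches the question category."""
    template = PROMPTS.get(category, DEFAULT_PROMPT)
    return template.format(context=context, question=question)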

Hybrid Search with Semantic Reranking

Vector similarity alone misses keyword matches. BM25 alone misses semantic equivalence. OMEGA blends both, then reranks with lightweight semantic scoring. No expensive cross-encoder needed - the retrieval pipeline itself does the heavy lifting, leaving the LLM to focus on reasoning rather than compensating for bad context.
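A minimal sketch of the blending step, assuming BM25 scores and embedding cosine similarities are computed elsewhere. The 50/50 weighting and the min-max normalization are illustrative defaults, not OMEGA's tuned values.

def _normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize so BM25 and cosine scores live on the same 0-1 scale."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_rank(bm25: dict[str, float], cosine: dict[str, float],
                alpha: float = 0.5, top_k: int = 20) -> list[str]:
    """Blend keyword (BM25) and semantic (cosine) scores, return top-k doc ids."""
    bm25_n, cos_n = _normalize(bm25), _normalize(cosine)
    blended = {
        doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1 - alpha) * cos_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(cos_n)
    }
    return sorted(blended, key=blended.get, reverse=True)[:top_k]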

Aggressive Query Augmentation

The user's question is rarely the best search query. OMEGA expands queries with temporal context, synonym variations, and inferred entity references before hitting the retrieval layer. This consistently recovers 3-5% of questions that would otherwise fail due to vocabulary mismatch between the question and the stored memory.
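Roughly, the augmentation stage looks like the sketch below. The rewrite rules and function names are hypothetical, and search_fn stands in for whatever retriever sits underneath; the point is that one question fans out into several queries whose results are merged.

def augment_query(question: str, known_entities: list[str]) -> list[str]:
    """Produce multiple search queries from one user question."""
    variants = [question]
    # Temporal hint: date-bearing memories help "when"-style questions.
    if any(w in question.lower() for w in ("when", "how long", "last time")):
        variants.append(question + " date time year")
    # Entity expansion: add queries focused on entities the question refers to.
    variants.extend(f"{entity} {question}" for entity in known_entities)
    return variants

def retrieve_augmented(question, known_entities, search_fn, top_k=20):
    """Run retrieval once per variant and merge results, keeping the best rank."""
    seen = {}
    for variant in augment_query(question, known_entities):
        for rank, doc_id in enumerate(search_fn(variant, top_k)):
            seen[doc_id] = min(seen.get(doc_id, rank), rank)
    return sorted(seen, key=seen.get)[:top_k]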

Why Local-First Matters

Every other top-ranked system requires cloud infrastructure. Mastra needs external APIs. Emergence runs on proprietary servers. Supermemory is cloud-hosted by design.

OMEGA runs entirely on your machine. One SQLite file. ~31MB memory footprint. ~50ms retrieval latency. Your memories never leave your laptop.

This isn't just a privacy feature - it's an architectural advantage. When retrieval is local, it's fast enough to run on every query without worrying about API rate limits or network latency. That speed enables more aggressive retrieval strategies (higher recall, more candidates, multi-pass reranking) that would be prohibitively slow over the network.
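As an illustration of how little machinery this requires, here is a self-contained sketch of keyword retrieval over a single SQLite file using the built-in FTS5 extension. The schema and contents are invented for the example and are not OMEGA's actual storage layout.

import sqlite3

conn = sqlite3.connect("memories.db")  # one file, no server, no network
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories USING fts5(content)")
conn.executemany(
    "INSERT INTO memories(content) VALUES (?)",
    [("I adopted a cat named Miso in March.",),
     ("Corrected: the cat's name is actually Mochi.",)],
)

# FTS5's bm25() returns smaller (more negative) values for better matches,
# so sort ascending to put the best candidates first.
rows = conn.execute(
    "SELECT content, bm25(memories) AS score FROM memories "
    "WHERE memories MATCH ? ORDER BY score LIMIT 5",
    ("cat name",),
).fetchall()
for content, score in rows:
    print(f"{score:.2f}  {content}")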

What's Next

Being #1 is a milestone, not a destination. Multi-session reasoning at 83% is still the weakest category, and I know exactly which question types are failing. I'm working on:

  • Improving multi-session counting accuracy through better cross-session entity linking
  • Pushing raw accuracy from 466/500 to 470+ (to lead on both metrics)
  • Open-sourcing the full benchmark harness so others can reproduce the results

OMEGA is open source and free to use. If you're building with Claude Code, Cursor, or any MCP-compatible agent, you can install it in under a minute:

$ pip install omega-memory
$ omega setup
✓ OMEGA installed. Memory persists across sessions.

The full methodology, per-category breakdowns, and interactive visualizations are on the benchmarks page.

- Jason Sosa, builder of OMEGA

OMEGA is free, local-first, and Apache 2.0 licensed.