
Why I Built MemoryStress

Every AI memory system claims high recall. None have been tested at 1,000 sessions. So I built the benchmark that does.

Jason Sosa · 10 min read
1,000 sessions · 583 facts · 10 simulated months

The first benchmark that measures what happens when memory systems age.

OMEGA scores 95.4% on LongMemEval. That number measures recall across 40 static sessions. But here's the question nobody is asking: what happens at session 500? At session 1,000? When your memory store has ingested ten months of daily conversations and must still find a fact mentioned once, six months ago?

No existing benchmark answers this. So I built one.

MemoryStress is a longitudinal memory benchmark - 583 facts embedded naturally across 1,000 GPT-4o-generated conversation sessions spanning 10 simulated months. It tests retention under accumulation pressure, contradiction chains, cross-agent handoffs, and the slow entropy that destroys memory systems over time.

The Gap in Memory Benchmarks

Every memory system on the market publishes LongMemEval scores. Mastra claims 94.87%. OMEGA claims 95.4%. But LongMemEval tests recall from ~40 clean sessions with no accumulation pressure, no eviction, and no multi-agent complexity. It's a great test of retrieval quality. It tells you nothing about what happens when memory ages.

Benchmark | Sessions | What It Misses
LongMemEval | ~40 | No accumulation pressure
MemoryAgentBench | Short | No degradation curves
BEAM | Synthetic | No realistic noise
MemoryStress | 1,000 | First longitudinal benchmark

The architectural question MemoryStress exposes is this: systems that compress all memories into a fixed-size context window (Mastra's Observational Memory, MemGPT's context packing) hit a ceiling when the data exceeds that window. At session 200, maybe session 300, they're forced to evict or summarize - and old facts start disappearing.

Persistent architectures like OMEGA's don't have this problem. SQLite doesn't run out of context. The question is whether retrieval degrades as the store grows. MemoryStress measures exactly that.

How MemoryStress Works

The benchmark runs in three phases, each designed to add more pressure to the memory system:

Phase 1: Foundation

Sessions 1–100

Clean, low noise. Core facts are established. This is the baseline — if you can't recall facts from here, your system has a fundamental problem.

Phase 2: Growth

Sessions 101–500

Volume increases. Some contradictions appear. Topics multiply. This phase simulates a few months of real usage where the memory store grows significantly.

Phase 3: Stress

Sessions 501–1,000

Dense, high-entropy, multi-topic sessions. Facts compete for retrieval space. Contradictions chain. This is where compression-based systems would hit their cliff.

At phase boundaries, the benchmark asks 300 questions across 7 types: fact recall, temporal ordering, preference recall, contradiction resolution, single-mention recall, cross-agent recall, and cold start recall. Each question is graded by GPT-4o using type-aware prompts.
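The grading step is easy to reproduce in outline. Below is a minimal sketch of a type-aware grading call, assuming the OpenAI Python SDK; the TYPE_RUBRICS wording and the grade() helper are illustrative, not the benchmark's actual prompts:

# Illustrative type-aware grader; the rubric text is hypothetical, not the
# benchmark's shipped prompts. Requires the OpenAI Python SDK (openai>=1.0).
from openai import OpenAI

client = OpenAI()

TYPE_RUBRICS = {
    "contradiction_resolution": "Credit only answers that reflect the MOST RECENT version of the fact.",
    "temporal_ordering": "Credit only answers that place the events in the correct order.",
    "fact_recall": "Credit answers that state the fact; paraphrases are acceptable.",
}

def grade(question_type: str, question: str, expected: str, answer: str) -> bool:
    """Ask GPT-4o to judge an answer with a rubric specific to the question type."""
    rubric = TYPE_RUBRICS.get(question_type, "Judge whether the answer matches the expected fact.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric: {rubric}\n"
                f"Question: {question}\nExpected: {expected}\nAnswer: {answer}\n"
                "Reply with exactly CORRECT or INCORRECT."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")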

32.7% (98/300) on MemoryStress v1
Intentionally brutal benchmark · 1,000 sessions · 583 facts · 7 question types

The Degradation Curve

This is the key metric. Not the absolute score - the shape of the curve as sessions accumulate:

[Degradation curve: accuracy by benchmark phase. Phase 1 (100 sessions): 31.4% · Phase 2 (500 sessions): 42.4%, the peak where more data helps · Phase 3 (1,000 sessions): 27.3% · Phase 4 (Recovery): 25.5%]

The Phase 2 peak at 42.4% is the important signal. OMEGA's persistent architecture means more data actually helps retrieval - a richer embedding space produces better semantic matches. The Phase 3 dip is noise dilution, not data loss. The memories are still there; they're just harder to find in a larger store.

A compression-based system would show a different shape entirely: flat or rising through Phase 1 as the context window fills, then a steep cliff at the point where eviction begins. Early facts don't gradually get harder to find - they're gone.

Is 32.7% Good?

Yes - for what this benchmark tests. MemoryStress asks questions about facts buried in noisy conversations from hundreds of sessions ago, including single-mention facts, contradicted facts, and cross-agent facts. A null adapter (always answers “I don't know”) scores 0%. A raw context-window approach would hit its token ceiling around session 200 and fail everything after that.

For reference, OMEGA scores 95.4% on LongMemEval - which tests recall from ~40 clean sessions. MemoryStress is 25× the session volume with adversarial conditions. The absolute number will go up as I optimize, but the benchmark is calibrated to be hard enough that it reveals real architectural differences.

Per-Type Breakdown

Seven question types expose different failure modes. The spread between best (41.2%) and worst (21.4%) tells you exactly where the retrieval pipeline succeeds and struggles:

Question Type | Score
Temporal ordering | 41.2%
Fact recall | 37.5%
Cold start recall | 37.5%
Preference recall | 37.1%
Cross-agent recall | 31.2%
Single-mention recall | 27.7%
Contradiction resolution | 21.4%

Contradiction resolution (21.4%) is the hardest category. The LLM retrieves both old and new versions of a fact, and despite strong prompting to prefer the most recent, sometimes picks the wrong one. This is a fundamental retrieval+reasoning problem that every memory system must solve.

The Optimization Journey

I iterated through five configurations, each testing a different retrieval strategy. The baseline was 27.3%. I improved to 32.7% by combining multiple techniques:

v1 | Default retrieval, no enhancements | 82/300
v2 | Contradiction +1.8pp | 86/300
v3 | Contradiction +8.9pp vs v1 | 89/300
v4 | Single-mention +6.4pp | 93/300
v5 | Best overall, +5.4pp | 98/300

The five techniques that contributed:

Contradiction-Aware RAG Prompt

Explicitly instructing the LLM: "when multiple notes discuss the same topic, ALWAYS use the MOST RECENT note." Notes are sorted chronologically (oldest→newest) so recency is structurally communicated. +5 correct answers.
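A rough sketch of how that instruction can be wired into the answer prompt; build_answer_prompt and the note dict shape ({"date", "text"}) are assumptions for illustration, and the wording paraphrases the idea rather than quoting the benchmark's prompt:

# Contradiction-aware answer prompt: notes sorted oldest -> newest, with an
# explicit instruction to prefer the most recent note on any repeated topic.
from datetime import datetime

def build_answer_prompt(question: str, notes: list[dict]) -> str:
    ordered = sorted(notes, key=lambda n: datetime.fromisoformat(n["date"]))
    context = "\n".join(f"[{n['date']}] {n['text']}" for n in ordered)
    return (
        "Answer the question using the notes below. When multiple notes discuss "
        "the same topic, ALWAYS use the MOST RECENT note; notes are listed oldest to newest.\n\n"
        f"Notes:\n{context}\n\nQuestion: {question}\nAnswer:"
    )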

Query Augmentation

Using gpt-4.1-mini to generate 3 alternative search queries per question, then merging results from all retrieval passes. This gives single-mention facts more chances of matching, since the original question's wording may not overlap with the stored conversation. +4 correct.
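Roughly, the flow looks like the sketch below; augmented_search and the memory.query(q, top_k=...) call are stand-ins for whatever retrieval interface your system exposes:

# Query augmentation: ask a small model for three paraphrased queries, retrieve
# for each, then merge and deduplicate hits. memory.query() is a stand-in for
# your system's retrieval call.
from openai import OpenAI

client = OpenAI()

def augmented_search(memory, question: str, top_k: int = 10) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.7,
        messages=[{
            "role": "user",
            "content": "Rewrite this question as 3 alternative search queries, one per line:\n" + question,
        }],
    )
    queries = [question] + [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    seen, merged = set(), []
    for q in queries:
        for hit in memory.query(q, top_k=top_k):
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged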

Recency Boosting

A 1.0→1.8× multiplicative boost to retrieval relevance scores based on note date. Adapted from OMEGA's LongMemEval optimizations. +3 correct.
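A sketch of one way to apply that boost, assuming each hit carries a date and a relevance score; the linear ramp from oldest to newest is an assumption, not necessarily OMEGA's exact formula:

# Recency boost: scale each hit's relevance score by 1.0x (oldest) up to
# 1.8x (newest). The linear ramp between the two is an assumption.
from datetime import datetime

def boost_by_recency(hits: list[dict], lo_boost: float = 1.0, hi_boost: float = 1.8) -> list[dict]:
    if not hits:
        return hits
    dates = [datetime.fromisoformat(h["date"]) for h in hits]
    oldest, newest = min(dates), max(dates)
    span = (newest - oldest).total_seconds() or 1.0
    for hit, date in zip(hits, dates):
        age_frac = (date - oldest).total_seconds() / span   # 0.0 = oldest, 1.0 = newest
        hit["score"] *= lo_boost + (hi_boost - lo_boost) * age_frac
    return sorted(hits, key=lambda h: h["score"], reverse=True)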

Fact Extraction at Ingest

Extracting discrete facts from session conversations and storing them as separate memory entries creates additional semantic hooks. The benefit is diffuse rather than targeted. +3 correct overall.
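In outline, ingest could look like the sketch below; ingest_session, the memory.store(...) signature, and the extraction prompt are hypothetical:

# Fact extraction at ingest: store the raw session, then store each extracted
# fact as its own entry so it gets its own embedding and retrieval hook.
from openai import OpenAI

client = OpenAI()

def ingest_session(memory, session_id: str, transcript: str, date: str) -> None:
    memory.store(f"{session_id}:raw", transcript, date)
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "List the discrete, durable facts stated in this conversation, one per line:\n" + transcript,
        }],
    )
    facts = [f.strip() for f in resp.choices[0].message.content.splitlines() if f.strip()]
    for i, fact in enumerate(facts):
        memory.store(f"{session_id}:fact:{i}", fact, date)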

Cross-Agent Fallback

When agent-scoped retrieval returns fewer than 5 results, a secondary unscoped pass catches facts planted by other agents. +2 correct.
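A minimal sketch of that fallback, assuming the adapter's query() accepts an agent filter (the agent keyword is an assumption):

# Cross-agent fallback: if the agent-scoped pass returns fewer than 5 hits,
# re-run the query unscoped and append anything new.
def query_with_fallback(memory, question: str, agent: str, top_k: int = 10) -> list[dict]:
    hits = memory.query(question, top_k=top_k, agent=agent)
    if len(hits) < 5:
        seen = {h["id"] for h in hits}
        unscoped = memory.query(question, top_k=top_k, agent=None)
        hits += [h for h in unscoped if h["id"] not in seen]
    return hits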

The Architectural Insight

MemoryStress was designed to expose a specific failure mode: what happens when your memory architecture can't scale beyond a fixed context window.

Systems like Mastra's Observational Memory pack all memories into a ~70k token context. At session 100, that might be fine. At session 500, you're compressing aggressively. At session 1,000, you're throwing information away. The degradation isn't graceful - it's a cliff.

OMEGA's persistent vector store doesn't have this failure mode. Nothing is evicted. The trade-off is that retrieval gets harder as the store grows - which is exactly what the Phase 3 dip shows. But harder-to-find is fundamentally different from gone. You can improve retrieval. You can't recover evicted memories.

Cost Transparency

The full benchmark run costs $4.06 using GPT-4o for generation, answering, and grading. That's 4¢ per correct answer.

Critically, the cost scales linearly with sessions, not quadratically. Compression-based architectures that regenerate their entire context block on every cycle face quadratic cost growth as the memory store expands. OMEGA's retrieval-based approach stays constant per query regardless of store size.
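A toy model makes the difference concrete. The token counts below are assumptions chosen only to show the shape of the two curves, not measured numbers:

# Toy cost model with assumed token counts: per-query retrieval reads a roughly
# fixed budget per session, while regenerating the full accumulated context
# every cycle re-reads everything so far and grows roughly quadratically.
TOKENS_PER_SESSION = 1_500    # assumed average transcript size
RETRIEVAL_BUDGET = 4_000      # assumed fixed tokens read per query

def cumulative_tokens(n_sessions: int) -> tuple[int, int]:
    retrieval = n_sessions * RETRIEVAL_BUDGET
    full_regen = sum(s * TOKENS_PER_SESSION for s in range(1, n_sessions + 1))
    return retrieval, full_regen

for n in (100, 500, 1_000):
    linear, quadratic = cumulative_tokens(n)
    print(f"{n:>5} sessions  retrieval≈{linear:>13,}  full-context regen≈{quadratic:>15,}")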

$4.06 total cost · 4¢ per correct answer · 41 min runtime · GPT-4o

Run It on Your System

MemoryStress is open source. You can generate the dataset, write an adapter for your memory system, and see your own degradation curve. Here's how:

# Generate dataset (~$5, ~45 min)
$ python scripts/memorystress_generate.py \
    --model gpt-4o --seed 42 --output dataset.json

# Run your adapter (~$4, ~40 min)
$ python scripts/memorystress_harness.py \
    --dataset dataset.json --adapter your_system \
    --model gpt-4o --grade --output-dir results/

# Check results
$ cat results/metrics.json
✓ degradation_curve.json, per_type.json, summary.md

The harness outputs a degradation curve, per-type breakdown, and full metrics. Write an adapter that implements store() and query() for your system and you're done.
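Here is a minimal adapter skeleton, assuming the store()/query() shape described above; the exact signatures and the naive keyword scoring are placeholders you would swap for your system's real persistence and retrieval:

# Minimal adapter skeleton. store()/query() are the two hooks named above;
# their signatures and the in-memory keyword search are illustrative guesses.
class MyMemoryAdapter:
    def __init__(self):
        self.notes: list[dict] = []

    def store(self, note_id: str, text: str, date: str, agent: str | None = None) -> None:
        """Persist one memory entry produced while ingesting a session."""
        self.notes.append({"id": note_id, "text": text, "date": date, "agent": agent})

    def query(self, text: str, top_k: int = 10, agent: str | None = None) -> list[dict]:
        """Return the top_k most relevant entries for a question."""
        scoped = [n for n in self.notes if agent is None or n["agent"] == agent]
        scored = [(sum(word in n["text"].lower() for word in text.lower().split()), n) for n in scoped]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for score, note in scored[:top_k] if score > 0]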

- Jason Sosa, builder of OMEGA

MemoryStress is part of the OMEGA project. Apache 2.0 licensed.