
Why I Built MemoryStress

Every AI memory system claims high recall. None have been tested at 1,000 sessions. So I built the benchmark that does.

Jason Sosa · 10 min read
1,000 sessions · 583 facts · 10 simulated months

The first benchmark that measures what happens when memory systems age.

OMEGA scores 95.4% on LongMemEval. That number measures recall across 40 static sessions. But here's the question nobody is asking: what happens at session 500? At session 1,000? When your memory store has ingested ten months of daily conversations and must still find a fact mentioned once, six months ago?

No existing benchmark answers this. So I built one.

MemoryStress is a longitudinal memory benchmark - 583 facts embedded naturally across 1,000 GPT-4o-generated conversation sessions spanning 10 simulated months. It tests retention under accumulation pressure, contradiction chains, cross-agent handoffs, and the slow entropy that destroys memory systems over time.

The Gap in Memory Benchmarks

Every memory system on the market publishes LongMemEval scores. Mastra claims 94.87%. OMEGA claims 95.4%. But LongMemEval tests recall from ~40 clean sessions with no accumulation pressure, no eviction, and no multi-agent complexity. It's a great test of retrieval quality. It tells you nothing about what happens when memory ages.

Benchmark | Sessions | What It Misses
LongMemEval | ~40 | No accumulation pressure
MemoryAgentBench | Short | No degradation curves
BEAM | Synthetic | No realistic noise
MemoryStress | 1,000 | First longitudinal benchmark

The architectural question MemoryStress exposes is this: systems that compress all memories into a fixed-size context window (Mastra's Observational Memory, MemGPT's context packing) hit a ceiling when the data exceeds that window. At session 200, maybe session 300, they're forced to evict or summarize - and old facts start disappearing.

Persistent architectures like OMEGA's don't have this problem. SQLite doesn't run out of context. The question is whether retrieval degrades as the store grows. MemoryStress measures exactly that.

How MemoryStress Works

The benchmark runs in three phases, each designed to add more pressure to the memory system:

Phase 1: Foundation

Sessions 1–100

Clean, low noise. Core facts are established. This is the baseline — if you can't recall facts from here, your system has a fundamental problem.

Phase 2: Growth

Sessions 101–500

Volume increases. Some contradictions appear. Topics multiply. This phase simulates a few months of real usage where the memory store grows significantly.

Phase 3: Stress

Sessions 501–1,000

Dense, high-entropy, multi-topic sessions. Facts compete for retrieval space. Contradictions chain. This is where compression-based systems would hit their cliff.

At phase boundaries, the benchmark asks 300 questions across 7 types: fact recall, temporal ordering, preference recall, contradiction resolution, single-mention recall, cross-agent recall, and cold start recall. Each question is graded by GPT-4o using type-aware prompts.
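The grading step is easy to reproduce in outline. Below is a minimal sketch of a type-aware grading call, assuming the OpenAI Python SDK; the TYPE_RUBRICS wording and the grade() helper are illustrative, not the benchmark's actual prompts:

# Illustrative type-aware grader; the rubric text is hypothetical, not the
# benchmark's shipped prompts. Requires the OpenAI Python SDK (openai>=1.0).
from openai import OpenAI

client = OpenAI()

TYPE_RUBRICS = {
    "contradiction_resolution": "Credit only answers that reflect the MOST RECENT version of the fact.",
    "temporal_ordering": "Credit only answers that place the events in the correct order.",
    "fact_recall": "Credit answers that state the fact; paraphrases are acceptable.",
}

def grade(question_type: str, question: str, expected: str, answer: str) -> bool:
    """Ask GPT-4o to judge an answer with a rubric specific to the question type."""
    rubric = TYPE_RUBRICS.get(question_type, "Judge whether the answer matches the expected fact.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric: {rubric}\n"
                f"Question: {question}\nExpected: {expected}\nAnswer: {answer}\n"
                "Reply with exactly CORRECT or INCORRECT."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")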

32.7% (98/300) on MemoryStress v1
Intentionally brutal benchmark · 1,000 sessions · 583 facts · 7 question types

The Degradation Curve

This is the key metric. Not the absolute score - the shape of the curve as sessions accumulate:

[Degradation curve: accuracy by benchmark phase. Phase 1 (100 sessions): 31.4% · Phase 2 (500 sessions): 42.4%, the peak where more data helps · Phase 3 (1,000 sessions): 27.3% · Phase 4 (Recovery): 25.5%]

The Phase 2 peak at 42.4% is the important signal. OMEGA's persistent architecture means more data actually helps retrieval - a richer embedding space produces better semantic matches. The Phase 3 dip is noise dilution, not data loss. The memories are still there; they're just harder to find in a larger store.

A compression-based system would show a different shape entirely: flat or rising through Phase 1 as the context window fills, then a steep cliff at the point where eviction begins. Early facts don't gradually get harder to find - they're gone.

Is 32.7% Good?

Yes - for what this benchmark tests. MemoryStress asks questions about facts buried in noisy conversations from hundreds of sessions ago, including single-mention facts, contradicted facts, and cross-agent facts. A null adapter (always answers “I don't know”) scores 0%. A raw context-window approach would hit its token ceiling around session 200 and fail everything after that.

For reference, OMEGA scores 95.4% on LongMemEval - which tests recall from ~40 clean sessions. MemoryStress is 25× the session volume with adversarial conditions. The absolute number will go up as I optimize, but the benchmark is calibrated to be hard enough that it reveals real architectural differences.

Per-Type Breakdown

Seven question types expose different failure modes. The spread between best (41.2%) and worst (21.4%) tells you exactly where the retrieval pipeline succeeds and struggles:

Question Type | Score
Temporal ordering | 41.2%
Fact recall | 37.5%
Cold start recall | 37.5%
Preference recall | 37.1%
Cross-agent recall | 31.2%
Single-mention recall | 27.7%
Contradiction resolution | 21.4%

Contradiction resolution (21.4%) is the hardest category. The LLM retrieves both old and new versions of a fact, and despite strong prompting to prefer the most recent, sometimes picks the wrong one. This is a fundamental retrieval+reasoning problem that every memory system must solve.

The Optimization Journey

I iterated through five configurations, each testing a different retrieval strategy. The baseline was 27.3%. I improved to 32.7% by combining multiple techniques:

v1 | Default retrieval, no enhancements | 82/300
v2 | Contradiction +1.8pp | 86/300
v3 | Contradiction +8.9pp vs v1 | 89/300
v4 | Single-mention +6.4pp | 93/300
v5 | Best overall, +5.4pp | 98/300

The five techniques that contributed:

Contradiction-Aware RAG Prompt

Explicitly instructing the LLM: "when multiple notes discuss the same topic, ALWAYS use the MOST RECENT note." Notes are sorted chronologically (oldest→newest) so recency is structurally communicated. +5 correct answers.
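A rough sketch of how that instruction can be wired into the answer prompt; build_answer_prompt and the note dict shape ({"date", "text"}) are assumptions for illustration, and the wording paraphrases the idea rather than quoting the benchmark's prompt:

# Contradiction-aware answer prompt: notes sorted oldest -> newest, with an
# explicit instruction to prefer the most recent note on any repeated topic.
from datetime import datetime

def build_answer_prompt(question: str, notes: list[dict]) -> str:
    ordered = sorted(notes, key=lambda n: datetime.fromisoformat(n["date"]))
    context = "\n".join(f"[{n['date']}] {n['text']}" for n in ordered)
    return (
        "Answer the question using the notes below. When multiple notes discuss "
        "the same topic, ALWAYS use the MOST RECENT note; notes are listed oldest to newest.\n\n"
        f"Notes:\n{context}\n\nQuestion: {question}\nAnswer:"
    )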

Query Augmentation

Using gpt-4.1-mini to generate 3 alternative search queries per question, then merging results from all retrieval passes. This gives single-mention facts more chances of matching, since the original question's wording may not overlap with the stored conversation. +4 correct.
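Roughly, the flow looks like the sketch below; augmented_search and the memory.query(q, top_k=...) call are stand-ins for whatever retrieval interface your system exposes:

# Query augmentation: ask a small model for three paraphrased queries, retrieve
# for each, then merge and deduplicate hits. memory.query() is a stand-in for
# your system's retrieval call.
from openai import OpenAI

client = OpenAI()

def augmented_search(memory, question: str, top_k: int = 10) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.7,
        messages=[{
            "role": "user",
            "content": "Rewrite this question as 3 alternative search queries, one per line:\n" + question,
        }],
    )
    queries = [question] + [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    seen, merged = set(), []
    for q in queries:
        for hit in memory.query(q, top_k=top_k):
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged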

Recency Boosting

A 1.0→1.8× multiplicative boost to retrieval relevance scores based on note date. Adapted from OMEGA's LongMemEval optimizations. +3 correct.
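A sketch of one way to apply that boost, assuming each hit carries a date and a relevance score; the linear ramp from oldest to newest is an assumption, not necessarily OMEGA's exact formula:

# Recency boost: scale each hit's relevance score by 1.0x (oldest) up to
# 1.8x (newest). The linear ramp between the two is an assumption.
from datetime import datetime

def boost_by_recency(hits: list[dict], lo_boost: float = 1.0, hi_boost: float = 1.8) -> list[dict]:
    if not hits:
        return hits
    dates = [datetime.fromisoformat(h["date"]) for h in hits]
    oldest, newest = min(dates), max(dates)
    span = (newest - oldest).total_seconds() or 1.0
    for hit, date in zip(hits, dates):
        age_frac = (date - oldest).total_seconds() / span   # 0.0 = oldest, 1.0 = newest
        hit["score"] *= lo_boost + (hi_boost - lo_boost) * age_frac
    return sorted(hits, key=lambda h: h["score"], reverse=True)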

Fact Extraction at Ingest

Extracting discrete facts from session conversations and storing them as separate memory entries creates additional semantic hooks. The benefit is diffuse rather than targeted. +3 correct overall.
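In outline, ingest could look like the sketch below; ingest_session, the memory.store(...) signature, and the extraction prompt are hypothetical:

# Fact extraction at ingest: store the raw session, then store each extracted
# fact as its own entry so it gets its own embedding and retrieval hook.
from openai import OpenAI

client = OpenAI()

def ingest_session(memory, session_id: str, transcript: str, date: str) -> None:
    memory.store(f"{session_id}:raw", transcript, date)
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "List the discrete, durable facts stated in this conversation, one per line:\n" + transcript,
        }],
    )
    facts = [f.strip() for f in resp.choices[0].message.content.splitlines() if f.strip()]
    for i, fact in enumerate(facts):
        memory.store(f"{session_id}:fact:{i}", fact, date)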

Cross-Agent Fallback

When agent-scoped retrieval returns fewer than 5 results, a secondary unscoped pass catches facts planted by other agents. +2 correct.
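A minimal sketch of that fallback, assuming the adapter's query() accepts an agent filter (the agent keyword is an assumption):

# Cross-agent fallback: if the agent-scoped pass returns fewer than 5 hits,
# re-run the query unscoped and append anything new.
def query_with_fallback(memory, question: str, agent: str, top_k: int = 10) -> list[dict]:
    hits = memory.query(question, top_k=top_k, agent=agent)
    if len(hits) < 5:
        seen = {h["id"] for h in hits}
        unscoped = memory.query(question, top_k=top_k, agent=None)
        hits += [h for h in unscoped if h["id"] not in seen]
    return hits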

The Architectural Insight

MemoryStress was designed to expose a specific failure mode: what happens when your memory architecture can't scale beyond a fixed context window.

Systems like Mastra's Observational Memory pack all memories into a ~70k token context. At session 100, that might be fine. At session 500, you're compressing aggressively. At session 1,000, you're throwing information away. The degradation isn't graceful - it's a cliff.

OMEGA's persistent vector store doesn't have this failure mode. Nothing is evicted. The trade-off is that retrieval gets harder as the store grows - which is exactly what the Phase 3 dip shows. But harder-to-find is fundamentally different from gone. You can improve retrieval. You can't recover evicted memories.

Cost Transparency

The full benchmark run costs $4.06 using GPT-4o for generation, answering, and grading. That's 4¢ per correct answer.

Critically, the cost scales linearly with sessions, not quadratically. Compression-based architectures that regenerate their entire context block on every cycle face quadratic cost growth as the memory store expands. OMEGA's retrieval-based approach stays constant per query regardless of store size.
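A toy model makes the difference concrete. The token counts below are assumptions chosen only to show the shape of the two curves, not measured numbers:

# Toy cost model with assumed token counts: per-query retrieval reads a roughly
# fixed budget per session, while regenerating the full accumulated context
# every cycle re-reads everything so far and grows roughly quadratically.
TOKENS_PER_SESSION = 1_500    # assumed average transcript size
RETRIEVAL_BUDGET = 4_000      # assumed fixed tokens read per query

def cumulative_tokens(n_sessions: int) -> tuple[int, int]:
    retrieval = n_sessions * RETRIEVAL_BUDGET
    full_regen = sum(s * TOKENS_PER_SESSION for s in range(1, n_sessions + 1))
    return retrieval, full_regen

for n in (100, 500, 1_000):
    linear, quadratic = cumulative_tokens(n)
    print(f"{n:>5} sessions  retrieval≈{linear:>13,}  full-context regen≈{quadratic:>15,}")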

$4.06 total cost · 4¢ per correct answer · 41 min runtime · GPT-4o

Run It on Your System

MemoryStress is open source. You can generate the dataset, write an adapter for your memory system, and see your own degradation curve. Here's how:

# Generate dataset (~$5, ~45 min)
$ python scripts/memorystress_generate.py \
    --model gpt-4o --seed 42 --output dataset.json

# Run your adapter (~$4, ~40 min)
$ python scripts/memorystress_harness.py \
    --dataset dataset.json --adapter your_system \
    --model gpt-4o --grade --output-dir results/

# Check results
$ cat results/metrics.json
✓ degradation_curve.json, per_type.json, summary.md

The harness outputs a degradation curve, per-type breakdown, and full metrics. Write an adapter that implements store() and query() for your system and you're done.
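Here is a minimal adapter skeleton, assuming the store()/query() shape described above; the exact signatures and the naive keyword scoring are placeholders you would swap for your system's real persistence and retrieval:

# Minimal adapter skeleton. store()/query() are the two hooks named above;
# their signatures and the in-memory keyword search are illustrative guesses.
class MyMemoryAdapter:
    def __init__(self):
        self.notes: list[dict] = []

    def store(self, note_id: str, text: str, date: str, agent: str | None = None) -> None:
        """Persist one memory entry produced while ingesting a session."""
        self.notes.append({"id": note_id, "text": text, "date": date, "agent": agent})

    def query(self, text: str, top_k: int = 10, agent: str | None = None) -> list[dict]:
        """Return the top_k most relevant entries for a question."""
        scoped = [n for n in self.notes if agent is None or n["agent"] == agent]
        scored = [(sum(word in n["text"].lower() for word in text.lower().split()), n) for n in scoped]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for score, note in scored[:top_k] if score > 0]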

- Jason Sosa, builder of OMEGA

MemoryStress is part of the OMEGA project. Apache 2.0 licensed.