
95.4%

466 / 500 on LongMemEval (ICLR 2025)

Measured February 2026 · Open methodology · Honest numbers

Where OMEGA Excels.
Where It Doesn't.

LongMemEval tests 5 capability areas across 500 questions. OMEGA scores above 83% in every category. Multi-session reasoning (83%) is the hardest - connecting facts across separate conversations requires deep retrieval.

Single-Session Recall · 125 / 126 · 99%
Preference Application · 30 / 30 · 100%
Multi-Session Reasoning · 111 / 133 · 83%
Knowledge Updates · 75 / 78 · 96%
Temporal Reasoning · 125 / 133 · 94%
Overall · 466 / 500 · 95.4% (task-averaged)

Category scores from our 95.4% task-averaged accuracy (466/500 raw). Methodology: LongMemEval.
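The per-category percentages follow directly from the raw counts in the table above; a quick sanity check in plain Python (category names and counts taken from the table, nothing else assumed):

```python
# Raw per-category counts from the 466/500 LongMemEval run.
counts = {
    "Single-Session Recall": (125, 126),
    "Preference Application": (30, 30),
    "Multi-Session Reasoning": (111, 133),
    "Knowledge Updates": (75, 78),
    "Temporal Reasoning": (125, 133),
}

for name, (correct, total) in counts.items():
    print(f"{name}: {correct}/{total} = {100 * correct / total:.0f}%")

raw_correct = sum(c for c, _ in counts.values())
raw_total = sum(t for _, t in counts.values())
print(f"Raw overall: {raw_correct}/{raw_total} = {100 * raw_correct / raw_total:.1f}%")
```

Note that the headline 95.4% is task-averaged per the LongMemEval methodology, which is why it differs from the raw ratio (466/500 = 93.2%).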

How Others Compare

Most memory systems don't publish LongMemEval scores. Scores are shown where available; “N/A” means no published benchmark.

OMEGA · 12 tools · local · 95.4%
Mastra · 94.87%
Emergence AI · 86%
Zep / Graphiti · 10 tools · 71.2%
Mem0 · 9 tools · N/A
Letta · 7 tools · local · N/A
mcp-memory-service · 12 tools · local · N/A
Claude Native · 0 tools · local · N/A

Tool counts are approximate, based on public docs as of Feb 2026.

Real Numbers, Real Hardware

~31 MB · Cold Start · RSS before first query
~337 MB · Warm · after ONNX model loads
<50 ms · Retrieval · vector + FTS5 combined
~8 ms · Embedding · per query, bge-small ONNX
~12 ms · Store · write + embed + auto-relate
None · GPU Required · CPU-only inference

M1 MacBook Pro · ~240 memories · bge-small-en-v1.5 ONNX · RSS via Activity Monitor

MemoryStress: Memory Under Pressure

LongMemEval tests recall from 40 clean sessions. MemoryStress tests what happens at 1,000 sessions: 583 facts, 10 simulated months, and the degradation curve that reveals whether an architecture can survive longitudinal pressure.

[Degradation curve · accuracy by benchmark phase: Phase 1 (100 sessions) 31.4% · Phase 2 (500 sessions) 42.4% peak, more data helps · Phase 3 27.3% · Phase 4 (1,000 sessions) 25.5% · plus a recovery phase]

32.7%
98 / 300 overall

OMEGA scores 32.7% on an intentionally brutal benchmark - 25× the session volume of LongMemEval with adversarial conditions. The Phase 2 peak (42.4%) shows that persistent architectures improve with more data. Compression-based systems would cliff here.

$ python scripts/memorystress_harness.py \
--dataset dataset.json --adapter omega \
--model gpt-4o --grade --output-dir results/
✓ degradation_curve.json, per_type.json, summary.md

OMEGA Core: Architecture & Design

A deep dive into how OMEGA stores, retrieves, and manages long-term memory for AI coding agents. This section serves as the technical reference for the open-source core.

1. Search Pipeline

Every query passes through a seven-stage pipeline that combines vector similarity with full-text search, type weighting, contextual boosting, cross-encoder reranking, deduplication, and time-decay. The result is high-precision retrieval that improves with use.

1. Vector Similarity · sqlite-vec cosine distance, 384-dim bge-small-en-v1.5 · primary recall
2. Full-Text Search · FTS5 keyword matching for exact phrases · precision boost
3. Type-Weighted Scoring · decisions and lessons weighted 2× · +15% relevance
4. Contextual Re-ranking · boosts by tag, project, file context · +8% relevance
5. Cross-Encoder Reranking · neural re-scoring of top 20 via ms-marco-MiniLM-L-6-v2 ONNX · precision refinement
6. Deduplication · SHA256 hash + 0.85 embedding similarity · noise reduction
7. Time-Decay Weighting · old unaccessed memories rank lower (floor 0.35) · freshness bias
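A minimal sketch of how these stages might compose into one ranking score. The 2× type weight, the keyword boost, the context boost, and the 0.35 decay floor come from the stage list above; the additive/multiplicative mixing, the boost magnitudes as constants, and the 30-day half-life are this sketch's assumptions, not OMEGA's actual implementation:

```python
import math
import time

TYPE_WEIGHTS = {"decision": 2.0, "lesson": 2.0}  # stage 3: decisions/lessons weighted 2x
DECAY_FLOOR = 0.35                               # stage 7: never fully erased
HALF_LIFE_DAYS = 30.0                            # assumption: doc fixes only the floor

def score(memory, vec_sim, fts_hit, context_match):
    """Combine pipeline signals into a single ranking score (sketch)."""
    s = vec_sim                                  # stage 1: cosine similarity in [0, 1]
    if fts_hit:                                  # stage 2: exact keyword match
        s += 0.15
    s *= TYPE_WEIGHTS.get(memory["type"], 1.0)   # stage 3: type weighting
    if context_match:                            # stage 4: tag/project/file boost
        s *= 1.08
    days_idle = (time.time() - memory["last_access"]) / 86400
    decay = max(DECAY_FLOOR,
                math.exp(-math.log(2) * days_idle / HALF_LIFE_DAYS))
    return s * decay                             # stage 7: time-decay with floor

fresh = {"type": "decision", "last_access": time.time()}
stale = {"type": "note", "last_access": time.time() - 120 * 86400}
print(score(fresh, 0.8, True, False) > score(stale, 0.8, True, False))  # True
```

Stages 5 (cross-encoder re-scoring of the top 20) and 6 (deduplication) operate on the candidate set rather than on individual scores, so they are omitted here.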

2. Storage Architecture

A single SQLite database with three core layers. No external services, no network calls, no GPU.

Memory Table (SQLite + FTS5)

Typed records (decision, lesson, error, preference, session_summary) with SHA256 deduplication, auto-tags, timestamps, and access counters.

Vector Index (sqlite-vec + ONNX)

384-dimensional embeddings from bge-small-en-v1.5 (ONNX Runtime, CPU-only). Cosine similarity search via the sqlite-vec extension.

Graph Layer (adjacency table + BFS)

Typed edges (related, supersedes, contradicts) between memory nodes. BFS traversal up to 5 hops. Auto-created for similarity ≥ 0.45.
~/.omega/omega.db
memories · typed records + SHA256 dedup
vec_memories · 384-dim float32 embeddings
edges · weighted typed relationships
fts_memories · FTS5 full-text index
forgetting_log · deletion audit trail
~10.5 MB for ~240 memories · 0o600 permissions
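A rough sketch of that table layout using Python's built-in sqlite3. Column names are guesses from the descriptions above, not OMEGA's actual schema, and the real vec_memories is a sqlite-vec virtual table (stubbed here as a plain table, since the extension isn't bundled with Python):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # the real file lives at ~/.omega/omega.db
db.executescript("""
CREATE TABLE memories (            -- typed records + SHA256 dedup
    id INTEGER PRIMARY KEY,
    type TEXT CHECK (type IN ('decision','lesson','error',
                              'preference','session_summary')),
    content TEXT NOT NULL,
    sha256 TEXT UNIQUE,            -- exact-duplicate rejection
    created_at REAL,
    last_access REAL,
    access_count INTEGER DEFAULT 0
);
CREATE TABLE vec_memories (        -- stand-in for the sqlite-vec virtual table
    memory_id INTEGER REFERENCES memories(id),
    embedding BLOB                 -- 384-dim float32 = 1536 bytes per row
);
CREATE TABLE edges (               -- weighted typed relationships
    src INTEGER, dst INTEGER,
    kind TEXT CHECK (kind IN ('related','supersedes','contradicts')),
    weight REAL
);
CREATE VIRTUAL TABLE fts_memories USING fts5(content);  -- full-text index
CREATE TABLE forgetting_log (      -- deletion audit trail
    memory_id INTEGER, reason TEXT, deleted_at REAL
);
""")
print(db.execute("SELECT count(*) FROM memories").fetchone()[0])  # 0 rows yet
```

Everything lives in one file, which is what makes the no-services, no-network, no-GPU claim hold.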

3. Memory Lifecycle

Memories aren't static. They evolve, get consolidated, and decay over time. This lifecycle prevents unbounded growth while preserving what matters.

Ingest

SHA256 exact dedup + embedding similarity 0.85+ (semantic) + Jaccard per-type

Evolve

Similar content (55-95%) appends new insights to existing memories rather than creating duplicates

Relate

Auto-creates 'related' edges (cosine similarity ≥ 0.45) to top-3 similar memories on store

TTL

Session summaries expire after 1 day. Decisions, lessons, and preferences are permanent

Compact

Clusters related memories by Jaccard similarity, creates summary nodes, marks originals as superseded

Decay

Unaccessed memories lose ranking weight over time. Floor at 0.35. Preferences and errors exempt
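The ingest/evolve/store decision above can be sketched as a single dispatch function. Thresholds come from the lifecycle list; note the doc's evolve range (55-95%) overlaps the semantic-dedup threshold (≥ 0.85), and this hypothetical sketch resolves the overlap in favor of dedup:

```python
import hashlib

def cosine(a, b):
    """Plain-Python cosine similarity (the real system uses sqlite-vec)."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def ingest_action(text, vec, existing):
    """Decide what to do with an incoming memory (sketch).

    existing: list of (text, vec) pairs already stored.
    """
    digest = hashlib.sha256(text.encode()).hexdigest()
    best = 0.0
    for old_text, old_vec in existing:
        if hashlib.sha256(old_text.encode()).hexdigest() == digest:
            return "drop: exact duplicate"           # SHA256 exact dedup
        best = max(best, cosine(vec, old_vec))
    if best >= 0.85:
        return "drop: semantic duplicate"            # embedding-similarity dedup
    if best >= 0.55:
        return "evolve: append insight to existing"  # merge instead of duplicating
    return "store: new memory"                       # auto-relate kicks in at >= 0.45

print(ingest_action("use uv", [1.0, 0.0], [("use uv", [1.0, 0.0])]))
```

The per-type Jaccard check mentioned under Ingest is omitted for brevity.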

4. Forgetting Intelligence

Memory systems that only accumulate become noisy over time. OMEGA includes principled forgetting - every deletion is audited, decay is transparent, and conflicts are detected automatically.

100%
deletions tracked
Audit Trail

Every deletion is logged with the reason - TTL expiry, LRU eviction, consolidation, negative feedback, or manual deletion. Query the full log anytime via omega_forgetting_log.

0.35
decay floor
Decay Curves

Memories that haven't been accessed lose ranking weight over time. The decay follows an exponential curve with a floor at 0.35 - old memories are deprioritized, never fully erased. Preferences and error patterns are exempt from decay.

Auto
conflict resolution
Conflict Detection

When a new memory contradicts an existing one, OMEGA detects it automatically. For decisions, the newest wins (auto-resolve). For lessons and other types, the conflict is flagged for manual review.
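That policy fits in a few lines. This is a sketch of the stated behavior (function name and record shape are illustrative, not OMEGA's API):

```python
def resolve_conflict(existing, incoming):
    """Apply the stated conflict policy: decisions auto-resolve with
    newest-wins; all other types are flagged for manual review."""
    if existing["type"] == "decision" and incoming["type"] == "decision":
        # Newest wins; the old record would get a 'supersedes' edge.
        return ("auto-resolve", incoming["id"])
    return ("flag-for-review", (existing["id"], incoming["id"]))

old = {"id": 1, "type": "decision"}
new = {"id": 2, "type": "decision"}
print(resolve_conflict(old, new))  # ('auto-resolve', 2)
```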

Decay Curve Visualization

[Chart: ranking weight vs. days since last access (0-90d) · weight falls from 1.00 toward the 0.35 floor · preferences & errors exempt, flat at 1.00]
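The curve is easy to state as code. The 0.35 floor and the preference/error exemption come from the text above; the 30-day half-life is an assumption to match the chart's shape, since the doc only fixes the floor:

```python
import math

FLOOR = 0.35                      # doc: deprioritized, never fully erased
HALF_LIFE_DAYS = 30.0             # assumption about the curve's steepness
EXEMPT = {"preference", "error"}  # doc: exempt from decay

def decay_weight(mem_type, days_since_access):
    """Exponential ranking-weight decay with a floor (sketch)."""
    if mem_type in EXEMPT:
        return 1.0                # preferences & errors keep full weight
    w = math.exp(-math.log(2) * days_since_access / HALF_LIFE_DAYS)
    return max(FLOOR, w)

for d in (0, 30, 60, 90):
    print(d, round(decay_weight("lesson", d), 2))  # 1.0, 0.5, 0.35, 0.35
```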

5. Embedding Model

OMEGA uses bge-small-en-v1.5 via ONNX Runtime for local, CPU-only embedding. No API calls, no GPU, no network dependency.

bge-small-en-v1.5
Model
384
Dimensions
~90 MB
Size
ONNX (CPU)
Runtime

Why bge-small? It ranks in the top tier for retrieval quality at its size class on MTEB. The small footprint (~90 MB, 384 dims) means fast inference on any laptop CPU - typically under 8ms per embedding. Larger models (bge-large, e5-large) score marginally better on benchmarks but add 4-10× latency and memory overhead. For a memory system that embeds on every store and query, the speed-quality tradeoff favors small.

6. Hook System

Four lifecycle hooks connect OMEGA to Claude Code. All dispatch via a Unix domain socket with fail-open semantics - if the daemon is down, the IDE continues unaffected.

Hook · Handler · Purpose
SessionStart · session_start · Welcome briefing with recent memories, profile, reminders
Stop · session_stop · Auto-generate session summary
UserPromptSubmit · auto_capture · Extract decisions, lessons, and preferences from conversation
PostToolUse · surface_memories · Surface relevant memories before file edits

Fail-open design. Hooks dispatch via fast_hook.py to a Unix domain socket. If the OMEGA daemon isn't running, the hook exits silently with code 0 - the IDE never blocks. Average hook latency is under 15ms for session start (the heaviest hook) and under 3ms for post-tool-use surfacing.
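The fail-open contract can be sketched in a few lines of POSIX-only Python (the socket path and payload shape are illustrative; the real dispatcher is fast_hook.py):

```python
import json
import socket

SOCKET_PATH = "/tmp/omega.sock"   # illustrative; the real socket path may differ

def dispatch(hook: str, payload: dict) -> int:
    """Send a hook event to the daemon; fail open so the IDE never blocks."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(0.05)    # bound the worst-case added latency
            s.connect(SOCKET_PATH)
            s.sendall(json.dumps({"hook": hook, **payload}).encode())
    except OSError:
        pass                      # daemon down: swallow the error, exit code 0
    return 0

print(dispatch("SessionStart", {"cwd": "."}))  # 0, whether or not the daemon is up
```

The key design point is the bare `except OSError`: a missing socket, a refused connection, and a timeout all degrade to a silent no-op rather than an error the IDE would surface.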

How I Tested

I used the LongMemEval benchmark (Wang et al., ICLR 2025), which evaluates long-term memory systems across 500 questions in 5 categories. The benchmark was designed to test real-world memory capabilities - not just retrieval, but reasoning, temporal understanding, and the ability to abstain when information is missing.

The test environment: OMEGA v1.0.0, GPT-4.1 as the generation and grading LLM, bge-small-en-v1.5 ONNX embeddings, running on an M1 MacBook Pro with 16GB RAM. No cloud services involved in retrieval - all local inference.

The task-averaged score of 95.4% (466/500 raw) represents OMEGA's best run. Multi-session reasoning (83%) is the most challenging category - I'm actively working to improve it.