95.4%
466/500 on LongMemEval (ICLR 2025)
Measured February 2026 · Open methodology · Honest numbers
Where OMEGA Excels. Where It Doesn't.
LongMemEval tests 5 capability areas across 500 questions. OMEGA scores 83% or higher in every category. Multi-session reasoning (83%) is the hardest - connecting facts across separate conversations requires deep retrieval.
Category scores from our 95.4% task-averaged accuracy (466/500 raw). Methodology: LongMemEval.
How Others Compare
Most memory systems don't publish LongMemEval scores. Where available, scores are shown. “N/A” means no published benchmark.
Tool counts approximate, based on public docs as of Feb 2026.
Real Numbers, Real Hardware
M1 MacBook Pro · ~240 memories · bge-small-en-v1.5 ONNX · RSS via Activity Monitor
MemoryStress: Memory Under Pressure
LongMemEval tests recall from 40 clean sessions. MemoryStress tests what happens at 1,000 sessions - 583 facts, 10 simulated months, and the degradation curve that reveals whether your architecture can survive longitudinal pressure.
OMEGA scores 32.7% on an intentionally brutal benchmark - 25× the session volume of LongMemEval with adversarial conditions. The Phase 2 peak (42.4%) shows that persistent architectures improve with more data. Compression-based systems would cliff here.
OMEGA Core: Architecture & Design
A deep dive into how OMEGA stores, retrieves, and manages long-term memory for AI coding agents. This section serves as the technical reference for the open-source core.
1. Search Pipeline
Every query passes through a seven-stage pipeline that combines vector similarity, full-text search, type weighting, contextual boosting, neural reranking, deduplication, and time-decay. The result is high-precision retrieval that improves with use.
1. Vector search: sqlite-vec cosine distance, 384-dim bge-small-en-v1.5
2. Full-text search: FTS5 keyword matching for exact phrases
3. Type weighting: decisions and lessons weighted 2×
4. Contextual boosting: boosts by tag, project, and file context
5. Reranking: neural re-scoring of top 20 via ms-marco-MiniLM-L-6-v2 ONNX
6. Deduplication: SHA256 hash + 0.85 embedding similarity
7. Time-decay: old, unaccessed memories rank lower (floor 0.35)
2. Storage Architecture
A single SQLite database with three core layers. No external services, no network calls, no GPU.
Records layer: typed records (decision, lesson, error, preference, session_summary) with SHA256 deduplication, auto-tags, timestamps, and access counters.
Vector layer: 384-dimensional embeddings from bge-small-en-v1.5 (ONNX Runtime, CPU-only). Cosine similarity search via the sqlite-vec extension.
Graph layer: typed edges (related, supersedes, contradicts) between memory nodes. BFS traversal up to 5 hops. Auto-created for similarity ≥ 0.45.
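The three layers map naturally onto three tables. A schema sketch - table and column names are illustrative, not OMEGA's actual schema, and the real build stores vectors in a sqlite-vec virtual table rather than a plain BLOB column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Records layer: typed memories with dedup hash and access tracking.
CREATE TABLE memories (
    id           INTEGER PRIMARY KEY,
    type         TEXT CHECK (type IN
                   ('decision','lesson','error','preference','session_summary')),
    content      TEXT NOT NULL,
    sha256       TEXT UNIQUE,          -- exact-duplicate detection
    tags         TEXT,                 -- auto-generated tags
    created_at   INTEGER,
    accessed_at  INTEGER,
    access_count INTEGER DEFAULT 0
);

-- Vector layer: 384-dim bge-small-en-v1.5 embeddings. Shown as a BLOB
-- here; sqlite-vec exposes this as a virtual table with cosine search.
CREATE TABLE embeddings (
    memory_id INTEGER REFERENCES memories(id),
    vector    BLOB                     -- 384 float32 values
);

-- Graph layer: typed edges, BFS-traversable up to 5 hops.
CREATE TABLE edges (
    src  INTEGER REFERENCES memories(id),
    dst  INTEGER REFERENCES memories(id),
    kind TEXT CHECK (kind IN ('related','supersedes','contradicts'))
);
""")
```

Keeping all three layers in one SQLite file is what makes the no-services, no-network claim possible: backup is a file copy, and joins across layers are ordinary SQL.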
3. Memory Lifecycle
Memories aren't static. They evolve, get consolidated, and decay over time. This lifecycle prevents unbounded growth while preserving what matters.
1. Deduplication: SHA256 exact dedup + embedding similarity ≥ 0.85 (semantic) + per-type Jaccard
2. Merging: similar content (55-95%) appends new insights to existing memories rather than creating duplicates
3. Linking: auto-creates 'related' edges (cosine similarity ≥ 0.45) to the top-3 similar memories on store
4. Expiry: session summaries expire after 1 day; decisions, lessons, and preferences are permanent
5. Consolidation: clusters related memories by Jaccard similarity, creates summary nodes, marks originals as superseded
6. Decay: unaccessed memories lose ranking weight over time (floor 0.35); preferences and errors are exempt
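The dedup and merge gates above can be sketched in a few lines. Assumptions to flag: tokenization is simplified to whitespace splitting, and the ≥ 0.95 "skip" threshold is inferred (the docs give the 55-95% merge band but not what happens above it):

```python
import hashlib

def exact_dup(content: str, seen_hashes: set[str]) -> bool:
    """SHA256 exact-duplicate check, the first lifecycle gate."""
    h = hashlib.sha256(content.encode()).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, as used for per-type near-dup
    checks and consolidation clustering (whitespace tokens assumed)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def lifecycle_action(sim: float) -> str:
    """Map a similarity score to a lifecycle decision."""
    if sim >= 0.95:
        return "skip"    # near-exact duplicate (assumed threshold)
    if sim >= 0.55:
        return "merge"   # 55-95% band: append insight to existing memory
    return "store"       # distinct memory; may still get 'related' edges
```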
4. Forgetting Intelligence
Memory systems that only accumulate become noisy over time. OMEGA includes principled forgetting - every deletion is audited, decay is transparent, and conflicts are detected automatically.
Every deletion is logged with the reason - TTL expiry, LRU eviction, consolidation, negative feedback, or manual deletion. Query the full log anytime via omega_forgetting_log.
Memories that haven't been accessed lose ranking weight over time. The decay follows an exponential curve with a floor at 0.35 - old memories are deprioritized, never fully erased. Preferences and error patterns are exempt from decay.
When a new memory contradicts an existing one, OMEGA detects it automatically. For decisions, the newest wins (auto-resolve). For lessons and other types, the conflict is flagged for manual review.
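The type-dependent resolution policy is simple enough to show directly. A sketch under stated assumptions - the field names and return strings are illustrative, not OMEGA's internal types:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    id: int
    type: str        # 'decision', 'lesson', 'error', 'preference', ...
    created_at: int  # Unix timestamp

def resolve_conflict(existing: Memory, incoming: Memory) -> str:
    """Newest decision wins automatically; every other type is
    flagged for manual review rather than silently overwritten."""
    if existing.type == "decision" and incoming.type == "decision":
        winner = incoming if incoming.created_at >= existing.created_at else existing
        return f"auto-resolve: keep memory {winner.id}"
    return "flag-for-review"
```

The asymmetry is deliberate: decisions supersede each other naturally over a project's life, while a contradicted lesson may mean either memory is wrong, so a human should look.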
Decay Curve Visualization
5. Embedding Model
OMEGA uses bge-small-en-v1.5 via ONNX Runtime for local, CPU-only embedding. No API calls, no GPU, no network dependency.
Why bge-small? It ranks in the top tier for retrieval quality at its size class on MTEB. The small footprint (~90 MB, 384 dims) means fast inference on any laptop CPU - typically under 8ms per embedding. Larger models (bge-large, e5-large) score marginally better on benchmarks but add 4-10× latency and memory overhead. For a memory system that embeds on every store and query, the speed-quality tradeoff favors small.
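The similarity measure over those 384-dim vectors is plain cosine. For reference (sqlite-vec computes this natively and reports the complement, cosine distance = 1 - similarity):

```python
import math

DIM = 384  # bge-small-en-v1.5 output dimension

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors;
    in OMEGA both would be DIM-element embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```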
6. Hook System
Four lifecycle hooks connect OMEGA to Claude Code. All dispatch via a Unix domain socket with fail-open semantics - if the daemon is down, the IDE continues unaffected.
Fail-open design. Hooks dispatch via fast_hook.py to a Unix domain socket. If the OMEGA daemon isn't running, the hook exits silently with code 0 - the IDE never blocks. Average hook latency is under 15ms for session start (the heaviest hook) and under 3ms for post-tool-use surfacing.
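A fail-open dispatcher in this style is only a few lines. A minimal sketch - the socket path, payload shape, and timeout here are hypothetical, not fast_hook.py's actual protocol:

```python
import json
import socket

SOCKET_PATH = "/tmp/omega-daemon.sock"  # hypothetical path

def dispatch(event: str, payload: dict) -> None:
    """Fire-and-forget hook dispatch with fail-open semantics: if the
    daemon is down or slow, swallow the error so the caller can exit 0
    and the IDE never blocks."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(0.05)  # keep worst-case latency in the ms range
            s.connect(SOCKET_PATH)
            s.sendall(json.dumps({"event": event, **payload}).encode())
    except OSError:
        pass  # daemon not running: stay silent, succeed anyway
```

The key design choice is catching OSError broadly: connection refused, missing socket file, and timeouts all collapse into the same "do nothing" path, which is exactly what a hook on the IDE's critical path should do.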
How I Tested
I used the LongMemEval benchmark (Wu et al., ICLR 2025), which evaluates long-term memory systems across 500 questions in 5 categories. The benchmark was designed to test real-world memory capabilities - not just retrieval, but reasoning, temporal understanding, and the ability to abstain when information is missing.
The test environment: OMEGA v1.0.0, GPT-4.1 as the generation and grading LLM, bge-small-en-v1.5 ONNX embeddings, running on an M1 MacBook Pro with 16GB RAM. No cloud services involved in retrieval - all local inference.
The task-averaged score of 95.4% (466/500 raw) represents OMEGA's best run. Multi-session reasoning (83%) is the most challenging category - I'm actively working to improve it.