
95.4%

466 / 500 on LongMemEval (ICLR 2025)

Measured February 2026 · Open methodology · Honest numbers

Where OMEGA Excels.
Where It Doesn't.

LongMemEval tests 5 capability areas across 500 questions. OMEGA scores above 83% in every category. Multi-session reasoning (83%) is the hardest - connecting facts across separate conversations requires deep retrieval.

Single-Session Recall · 125 / 126 · 99%
Preference Application · 30 / 30 · 100%
Multi-Session Reasoning · 111 / 133 · 83%
Knowledge Updates · 75 / 78 · 96%
Temporal Reasoning · 125 / 133 · 94%
Overall · 466 / 500 · 95.4% (task-averaged)

Category scores from our 95.4% task-averaged accuracy (466/500 raw). Methodology: LongMemEval.
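The per-category percentages follow directly from the raw counts in the table above; a quick sanity check in plain Python (category names and counts taken from the table, nothing else assumed):

```python
# Raw per-category counts from the 466/500 LongMemEval run.
counts = {
    "Single-Session Recall": (125, 126),
    "Preference Application": (30, 30),
    "Multi-Session Reasoning": (111, 133),
    "Knowledge Updates": (75, 78),
    "Temporal Reasoning": (125, 133),
}

for name, (correct, total) in counts.items():
    print(f"{name}: {correct}/{total} = {100 * correct / total:.0f}%")

raw_correct = sum(c for c, _ in counts.values())
raw_total = sum(t for _, t in counts.values())
print(f"Raw overall: {raw_correct}/{raw_total} = {100 * raw_correct / raw_total:.1f}%")
```

Note that the headline 95.4% is task-averaged per the LongMemEval methodology, which is why it differs from the raw ratio (466/500 = 93.2%).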

How Others Compare

Most memory systems don't publish LongMemEval scores. Scores are shown where available; “N/A” means no published benchmark.

OMEGA · 12 tools · local · 95.4%
Mastra · 94.87%
Emergence AI · 86%
Zep / Graphiti · 10 tools · 71.2%
Mem0 · 9 tools · N/A
Letta · 7 tools · local · N/A
mcp-memory-service · 12 tools · local · N/A
Claude Native · 0 tools · local · N/A

Tool counts are approximate, based on public docs as of Feb 2026.

Real Numbers, Real Hardware

~31 MB · Cold Start · RSS before first query
~337 MB · Warm · after ONNX model loads
<50 ms · Retrieval · vector + FTS5 combined
~8 ms · Embedding · per query, bge-small ONNX
~12 ms · Store · write + embed + auto-relate
None · GPU Required · CPU-only inference

M1 MacBook Pro · ~240 memories · bge-small-en-v1.5 ONNX · RSS via Activity Monitor

MemoryStress: Memory Under Pressure

LongMemEval tests recall from 40 clean sessions. MemoryStress tests what happens at 1,000 sessions: 583 facts, 10 simulated months, and the degradation curve that reveals whether an architecture can survive longitudinal pressure.

[Degradation curve · accuracy by benchmark phase: Phase 1 (100 sessions) 31.4% · Phase 2 (500 sessions) 42.4% peak, more data helps · Phase 3 27.3% · Phase 4 (1,000 sessions) 25.5% · plus a recovery phase]

32.7%
98 / 300 overall

OMEGA scores 32.7% on an intentionally brutal benchmark - 25× the session volume of LongMemEval with adversarial conditions. The Phase 2 peak (42.4%) shows that persistent architectures improve with more data. Compression-based systems would cliff here.

$ python scripts/memorystress_harness.py \
--dataset dataset.json --adapter omega \
--model gpt-4o --grade --output-dir results/
✓ degradation_curve.json, per_type.json, summary.md

OMEGA Core: Architecture & Design

A deep dive into how OMEGA stores, retrieves, and manages long-term memory for AI coding agents. This section serves as the technical reference for the open-source core.

1. Search Pipeline

Every query passes through a seven-stage pipeline that combines vector similarity with full-text search, type weighting, contextual boosting, cross-encoder reranking, deduplication, and time-decay. The result is high-precision retrieval that improves with use.

1. Vector Similarity · sqlite-vec cosine distance, 384-dim bge-small-en-v1.5 · primary recall
2. Full-Text Search · FTS5 keyword matching for exact phrases · precision boost
3. Type-Weighted Scoring · decisions and lessons weighted 2× · +15% relevance
4. Contextual Re-ranking · boosts by tag, project, file context · +8% relevance
5. Cross-Encoder Reranking · neural re-scoring of top 20 via ms-marco-MiniLM-L-6-v2 ONNX · precision refinement
6. Deduplication · SHA256 hash + 0.85 embedding similarity · noise reduction
7. Time-Decay Weighting · old unaccessed memories rank lower (floor 0.35) · freshness bias
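A minimal sketch of how these stages might compose into one ranking score. The 2× type weight, the keyword boost, the context boost, and the 0.35 decay floor come from the stage list above; the additive/multiplicative mixing, the boost magnitudes as constants, and the 30-day half-life are this sketch's assumptions, not OMEGA's actual implementation:

```python
import math
import time

TYPE_WEIGHTS = {"decision": 2.0, "lesson": 2.0}  # stage 3: decisions/lessons weighted 2x
DECAY_FLOOR = 0.35                               # stage 7: never fully erased
HALF_LIFE_DAYS = 30.0                            # assumption: doc fixes only the floor

def score(memory, vec_sim, fts_hit, context_match):
    """Combine pipeline signals into a single ranking score (sketch)."""
    s = vec_sim                                  # stage 1: cosine similarity in [0, 1]
    if fts_hit:                                  # stage 2: exact keyword match
        s += 0.15
    s *= TYPE_WEIGHTS.get(memory["type"], 1.0)   # stage 3: type weighting
    if context_match:                            # stage 4: tag/project/file boost
        s *= 1.08
    days_idle = (time.time() - memory["last_access"]) / 86400
    decay = max(DECAY_FLOOR,
                math.exp(-math.log(2) * days_idle / HALF_LIFE_DAYS))
    return s * decay                             # stage 7: time-decay with floor

fresh = {"type": "decision", "last_access": time.time()}
stale = {"type": "note", "last_access": time.time() - 120 * 86400}
print(score(fresh, 0.8, True, False) > score(stale, 0.8, True, False))  # True
```

Stages 5 (cross-encoder re-scoring of the top 20) and 6 (deduplication) operate on the candidate set rather than on individual scores, so they are omitted here.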

2. Storage Architecture

A single SQLite database with three core layers. No external services, no network calls, no GPU.

Memory Table (SQLite + FTS5)

Typed records (decision, lesson, error, preference, session_summary) with SHA256 deduplication, auto-tags, timestamps, and access counters.

Vector Index (sqlite-vec + ONNX)

384-dimensional embeddings from bge-small-en-v1.5 (ONNX Runtime, CPU-only). Cosine similarity search via the sqlite-vec extension.

Graph Layer (adjacency table + BFS)

Typed edges (related, supersedes, contradicts) between memory nodes. BFS traversal up to 5 hops. Auto-created for similarity ≥ 0.45.
~/.omega/omega.db
memories · typed records + SHA256 dedup
vec_memories · 384-dim float32 embeddings
edges · weighted typed relationships
fts_memories · FTS5 full-text index
forgetting_log · deletion audit trail
~10.5 MB for ~240 memories · 0o600 permissions
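A rough sketch of that table layout using Python's built-in sqlite3. Column names are guesses from the descriptions above, not OMEGA's actual schema, and the real vec_memories is a sqlite-vec virtual table (stubbed here as a plain table, since the extension isn't bundled with Python):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # the real file lives at ~/.omega/omega.db
db.executescript("""
CREATE TABLE memories (            -- typed records + SHA256 dedup
    id INTEGER PRIMARY KEY,
    type TEXT CHECK (type IN ('decision','lesson','error',
                              'preference','session_summary')),
    content TEXT NOT NULL,
    sha256 TEXT UNIQUE,            -- exact-duplicate rejection
    created_at REAL,
    last_access REAL,
    access_count INTEGER DEFAULT 0
);
CREATE TABLE vec_memories (        -- stand-in for the sqlite-vec virtual table
    memory_id INTEGER REFERENCES memories(id),
    embedding BLOB                 -- 384-dim float32 = 1536 bytes per row
);
CREATE TABLE edges (               -- weighted typed relationships
    src INTEGER, dst INTEGER,
    kind TEXT CHECK (kind IN ('related','supersedes','contradicts')),
    weight REAL
);
CREATE VIRTUAL TABLE fts_memories USING fts5(content);  -- full-text index
CREATE TABLE forgetting_log (      -- deletion audit trail
    memory_id INTEGER, reason TEXT, deleted_at REAL
);
""")
print(db.execute("SELECT count(*) FROM memories").fetchone()[0])  # 0 rows yet
```

Everything lives in one file, which is what makes the no-services, no-network, no-GPU claim hold.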

3. Memory Lifecycle

Memories aren't static. They evolve, get consolidated, and decay over time. This lifecycle prevents unbounded growth while preserving what matters.

Ingest

SHA256 exact dedup + embedding similarity 0.85+ (semantic) + Jaccard per-type

Evolve

Similar content (55-95%) appends new insights to existing memories rather than creating duplicates

Relate

Auto-creates 'related' edges (cosine similarity ≥ 0.45) to top-3 similar memories on store

TTL

Session summaries expire after 1 day. Decisions, lessons, and preferences are permanent

Compact

Clusters related memories by Jaccard similarity, creates summary nodes, marks originals as superseded

Decay

Unaccessed memories lose ranking weight over time. Floor at 0.35. Preferences and errors exempt
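The ingest/evolve/store decision above can be sketched as a single dispatch function. Thresholds come from the lifecycle list; note the doc's evolve range (55-95%) overlaps the semantic-dedup threshold (≥ 0.85), and this hypothetical sketch resolves the overlap in favor of dedup:

```python
import hashlib

def cosine(a, b):
    """Plain-Python cosine similarity (the real system uses sqlite-vec)."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def ingest_action(text, vec, existing):
    """Decide what to do with an incoming memory (sketch).

    existing: list of (text, vec) pairs already stored.
    """
    digest = hashlib.sha256(text.encode()).hexdigest()
    best = 0.0
    for old_text, old_vec in existing:
        if hashlib.sha256(old_text.encode()).hexdigest() == digest:
            return "drop: exact duplicate"           # SHA256 exact dedup
        best = max(best, cosine(vec, old_vec))
    if best >= 0.85:
        return "drop: semantic duplicate"            # embedding-similarity dedup
    if best >= 0.55:
        return "evolve: append insight to existing"  # merge instead of duplicating
    return "store: new memory"                       # auto-relate kicks in at >= 0.45

print(ingest_action("use uv", [1.0, 0.0], [("use uv", [1.0, 0.0])]))
```

The per-type Jaccard check mentioned under Ingest is omitted for brevity.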

4. Forgetting Intelligence

Memory systems that only accumulate become noisy over time. OMEGA includes principled forgetting - every deletion is audited, decay is transparent, and conflicts are detected automatically.

100%
deletions tracked
Audit Trail

Every deletion is logged with the reason - TTL expiry, LRU eviction, consolidation, negative feedback, or manual deletion. Query the full log anytime via omega_forgetting_log.

0.35
decay floor
Decay Curves

Memories that haven't been accessed lose ranking weight over time. The decay follows an exponential curve with a floor at 0.35 - old memories are deprioritized, never fully erased. Preferences and error patterns are exempt from decay.

Auto
conflict resolution
Conflict Detection

When a new memory contradicts an existing one, OMEGA detects it automatically. For decisions, the newest wins (auto-resolve). For lessons and other types, the conflict is flagged for manual review.
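That policy fits in a few lines. This is a sketch of the stated behavior (function name and record shape are illustrative, not OMEGA's API):

```python
def resolve_conflict(existing, incoming):
    """Apply the stated conflict policy: decisions auto-resolve with
    newest-wins; all other types are flagged for manual review."""
    if existing["type"] == "decision" and incoming["type"] == "decision":
        # Newest wins; the old record would get a 'supersedes' edge.
        return ("auto-resolve", incoming["id"])
    return ("flag-for-review", (existing["id"], incoming["id"]))

old = {"id": 1, "type": "decision"}
new = {"id": 2, "type": "decision"}
print(resolve_conflict(old, new))  # ('auto-resolve', 2)
```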

Decay Curve Visualization

[Chart: ranking weight vs. days since last access (0-90d) · weight falls from 1.00 toward the 0.35 floor · preferences & errors exempt, flat at 1.00]
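The curve is easy to state as code. The 0.35 floor and the preference/error exemption come from the text above; the 30-day half-life is an assumption to match the chart's shape, since the doc only fixes the floor:

```python
import math

FLOOR = 0.35                      # doc: deprioritized, never fully erased
HALF_LIFE_DAYS = 30.0             # assumption about the curve's steepness
EXEMPT = {"preference", "error"}  # doc: exempt from decay

def decay_weight(mem_type, days_since_access):
    """Exponential ranking-weight decay with a floor (sketch)."""
    if mem_type in EXEMPT:
        return 1.0                # preferences & errors keep full weight
    w = math.exp(-math.log(2) * days_since_access / HALF_LIFE_DAYS)
    return max(FLOOR, w)

for d in (0, 30, 60, 90):
    print(d, round(decay_weight("lesson", d), 2))  # 1.0, 0.5, 0.35, 0.35
```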

5. Embedding Model

OMEGA uses bge-small-en-v1.5 via ONNX Runtime for local, CPU-only embedding. No API calls, no GPU, no network dependency.

bge-small-en-v1.5
Model
384
Dimensions
~90 MB
Size
ONNX (CPU)
Runtime

Why bge-small? It ranks in the top tier for retrieval quality at its size class on MTEB. The small footprint (~90 MB, 384 dims) means fast inference on any laptop CPU - typically under 8ms per embedding. Larger models (bge-large, e5-large) score marginally better on benchmarks but add 4-10× latency and memory overhead. For a memory system that embeds on every store and query, the speed-quality tradeoff favors small.

6. Hook System

Four lifecycle hooks connect OMEGA to Claude Code. All dispatch via a Unix domain socket with fail-open semantics - if the daemon is down, the IDE continues unaffected.

Hook · Handler · Purpose
SessionStart · session_start · Welcome briefing with recent memories, profile, reminders
Stop · session_stop · Auto-generate session summary
UserPromptSubmit · auto_capture · Extract decisions, lessons, and preferences from conversation
PostToolUse · surface_memories · Surface relevant memories before file edits

Fail-open design. Hooks dispatch via fast_hook.py to a Unix domain socket. If the OMEGA daemon isn't running, the hook exits silently with code 0 - the IDE never blocks. Average hook latency is under 15ms for session start (the heaviest hook) and under 3ms for post-tool-use surfacing.
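The fail-open contract can be sketched in a few lines of POSIX-only Python (the socket path and payload shape are illustrative; the real dispatcher is fast_hook.py):

```python
import json
import socket

SOCKET_PATH = "/tmp/omega.sock"   # illustrative; the real socket path may differ

def dispatch(hook: str, payload: dict) -> int:
    """Send a hook event to the daemon; fail open so the IDE never blocks."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(0.05)    # bound the worst-case added latency
            s.connect(SOCKET_PATH)
            s.sendall(json.dumps({"hook": hook, **payload}).encode())
    except OSError:
        pass                      # daemon down: swallow the error, exit code 0
    return 0

print(dispatch("SessionStart", {"cwd": "."}))  # 0, whether or not the daemon is up
```

The key design point is the bare `except OSError`: a missing socket, a refused connection, and a timeout all degrade to a silent no-op rather than an error the IDE would surface.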

How I Tested

I used the LongMemEval benchmark (Wang et al., ICLR 2025), which evaluates long-term memory systems across 500 questions in 5 categories. The benchmark was designed to test real-world memory capabilities - not just retrieval, but reasoning, temporal understanding, and the ability to abstain when information is missing.

The test environment: OMEGA v1.0.0, GPT-4.1 as the generation and grading LLM, bge-small-en-v1.5 ONNX embeddings, running on an M1 MacBook Pro with 16GB RAM. No cloud services involved in retrieval - all local inference.

The task-averaged score of 95.4% (466/500 raw) represents OMEGA's best run. Multi-session reasoning (83%) is the most challenging category - I'm actively working to improve it.