
Observational Memory Is Not Enough.

Mastra just shipped observational memory for coding agents. It's clever, well-designed, and genuinely useful within a session. But when the session ends, every observation vanishes.

Jason Sosa · 10 min read
Abstract visualization of ephemeral in-context memory versus persistent crystalline memory

Mastra launched observational memory as part of Mastra Code, their AI coding agent. Two background LLM agents, an Observer and a Reflector, compress your conversation into timestamped observations that stay inside the context window. No database. No embeddings. No external storage. It's elegant, and it works.

I built OMEGA, which takes the opposite approach: external storage in SQLite with semantic search and entity graphs. I'm biased. But the architectural differences between these two approaches are real, and they matter more than any benchmark score.

What Mastra Gets Right

Credit where it's due. Mastra's observational memory is a genuinely clever design:

Zero infrastructure. No database, no vector store, no files on disk. Everything lives in the prompt. For developers who want memory without any operational overhead, this is the simplest possible answer.
Prompt-cache friendly. Observations are append-only. New observations go at the end. This means the prefix (system prompt + older observations) stays stable, which plays nicely with prompt caching on providers that support it.
No retrieval failure. Because all observations are always in context, there is no semantic search that could miss a relevant memory. Every observation is visible to the model on every turn.
Clean abstraction. The Observer watches your conversation and summarizes when the token count gets high. The Reflector compresses when observations grow too large. Two agents, clear responsibilities, easy to reason about.

If your use case is “make this single coding session smarter,” Mastra's approach is genuinely good. The 94.87% LongMemEval score proves it works within a single evaluation run.

The Problem LongMemEval Doesn't Test

LongMemEval is a single-run benchmark. It loads ~40 sessions of conversation history and asks 500 questions. It tests whether a memory system can recall facts, handle updates, track preferences, reason temporally, and connect information across sessions.

What it does not test:

Cross-session persistence. When LongMemEval ends, does the memory survive? OMEGA: yes, it's in SQLite. Mastra: no, the observations are gone when the context window closes.
Months of accumulated context. What happens after 500 sessions instead of 40? OMEGA's SQLite file grows on disk. Mastra's observations hit the context window ceiling and the Reflector starts discarding detail.
Multi-agent memory sharing. Can two agents working on the same project share memories? OMEGA: yes, through shared SQLite storage and coordination tools. Mastra: no, each agent has its own context window.
Contradiction handling. When a user changes their preference ("actually, I prefer tabs over spaces now"), does the system detect and resolve the conflict? OMEGA flags contradictions explicitly. Mastra's Reflector may or may not preserve the latest version during rewriting.

This is the fundamental architectural difference. In-context memory is session-scoped. External memory is permanent. For a quick coding session, session-scoped might be enough. For an agent that works with you across weeks and months, it isn't.
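The contradiction case ("actually, I prefer tabs over spaces now") is worth making concrete. Here's a toy sketch of explicit conflict flagging: when a new memory shares a subject and attribute with an old one but carries a different value, the old one is marked superseded rather than deleted. The schema and function names are illustrative, not OMEGA's actual implementation.

```python
# Toy sketch of contradiction flagging. A real system would also need
# entity resolution and fuzzier matching; this shows only the core idea:
# conflicts are detected and resolved explicitly, with an audit trail.

def add_memory(store, subject, attribute, value):
    """Append a new memory; flag any conflicting active memory as superseded."""
    for m in store:
        same_slot = (m["subject"], m["attribute"]) == (subject, attribute)
        if same_slot and m["value"] != value and not m.get("superseded"):
            m["superseded"] = True  # kept for the audit trail, not deleted
    store.append({"subject": subject, "attribute": attribute, "value": value})

store = []
add_memory(store, "user", "indentation", "spaces")
add_memory(store, "user", "indentation", "tabs")  # "actually, I prefer tabs now"
# Retrieval would now surface only the "tabs" memory; the "spaces" memory
# survives in storage, flagged as superseded.
```

Contrast this with lossy rewriting: a Reflector that compresses both statements into one summary has no guarantee of keeping the later one, and no record that a conflict ever existed.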

The Benchmark Numbers

Both systems score well on LongMemEval. But the more interesting signal isn't the raw scores. It's what happens when you change the actor model:

10.6 points
Mastra's score drop when switching the actor model from gpt-5-mini to gpt-4o
94.87% → 84.23%. Same memory architecture, different actor model.
System                 | Score  | Note
OMEGA                  | 95.4%  | #1 overall. External SQLite + ONNX embeddings. Category-tuned answer prompts.
Mastra OM (gpt-5-mini) | 94.87% | Observer + Reflector with gemini-2.5-flash. In-context only.
Mastra OM (gpt-4o)     | 84.23% | Same architecture, weaker actor model. 10+ point drop.
Zep / Graphiti         | 71.2%  | Graph-based approach. Self-reported.

The key insight from these numbers: in-context memory is inherently coupled to the actor model. The model must parse the observation block, find relevant facts, and reason about them. A stronger model does this better. That's why Mastra drops 10+ points when switching from gpt-5-mini to gpt-4o. Same memory architecture, same observations, dramatically different results.

External retrieval decouples the memory system from the reasoning model. OMEGA's retrieval pipeline (BM25 + vector search + reranking) works the same regardless of which LLM answers the questions. Your memory quality doesn't degrade when you switch to a cheaper or faster model.
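To make the decoupling concrete, here's a sketch of one common way to combine a keyword ranking with an embedding ranking: reciprocal rank fusion. The two input lists are hard-coded here; in a real pipeline they would come from a BM25 index and a vector index. Whether OMEGA uses RRF specifically is my assumption for illustration.

```python
from collections import Counter

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids into one combined ranking.

    Each list contributes 1/(k + rank + 1) per memory; memories that rank
    well in multiple lists accumulate the highest fused score.
    """
    scores = Counter()
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking):
            scores[mem_id] += 1.0 / (k + rank + 1)
    return [mem_id for mem_id, _ in scores.most_common()]

lexical = ["m2", "m1", "m3"]  # BM25-style keyword ranking
vector = ["m1", "m4", "m2"]   # embedding-similarity ranking
fused = reciprocal_rank_fusion([lexical, vector])
# "m1" wins: it ranks near the top of both lists
```

Notice that nothing in this pipeline touches the answering LLM. The fused ranking is the same whether gpt-5-mini or gpt-4o consumes the results, which is the whole point of decoupling.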

Transparency note: OMEGA's 95.4% uses category-tuned answer prompts (different prompts per question type), making a direct score-to-score comparison misleading. The architectural advantages above hold regardless of benchmark methodology.

How They Actually Work

The architectural difference explains everything else. Here's the step-by-step:

OMEGA: External storage

  1. Agent learns something new or completes a task
  2. OMEGA extracts the memory and stores it in SQLite with ONNX embeddings (local, no API)
  3. Consolidation engine merges duplicates, flags contradictions, decays stale memories
  4. On next query: hybrid BM25 + vector search retrieves the top 5-10 relevant memories
  5. ~1,500 tokens injected into context. The rest stays in storage.
  6. Memories persist forever across sessions, agents, projects, and tools
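The external-storage loop above can be sketched in a few lines of Python. The schema, the character-count "embedding," and the dot-product scoring are stand-ins for illustration only; OMEGA's actual schema and its ONNX embedding model are not shown here.

```python
import json
import sqlite3

def embed(text):
    # Stand-in for a real embedding model: a tiny bag-of-letters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

conn = sqlite3.connect(":memory:")  # a real system points this at a file on disk
conn.execute(
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)"
)

def remember(text):
    conn.execute(
        "INSERT INTO memories (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embed(text))),
    )

def recall(query, top_k=2):
    q = embed(query)
    rows = conn.execute("SELECT text, embedding FROM memories").fetchall()

    def score(row):
        return sum(a * b for a, b in zip(q, json.loads(row[1])))

    return [text for text, _ in sorted(rows, key=score, reverse=True)[:top_k]]

remember("User prefers tabs over spaces")
remember("Project uses Python 3.12")
```

The property that matters is in the `connect` call: swap `":memory:"` for a file path and every memory survives process exit. That single line is the architectural difference this post is about.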

Mastra: In-context compression

  1. Conversation accumulates messages normally
  2. At ~30K tokens, the Observer agent (gemini-2.5-flash) summarizes recent messages into timestamped observations
  3. Observations are appended to a context block (append-only, prompt-cache-friendly)
  4. At ~40K tokens, the Reflector agent rewrites the observations into a shorter summary
  5. Rewriting is lossy: detail is permanently discarded during reflection
  6. When the session ends, all observations are gone. No external persistence.
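The observe/reflect loop can be sketched as threshold-triggered compression. The thresholds mirror the numbers above, but the token counter, Observer, and Reflector here are crude stand-ins; in Mastra these are LLM agents, not string slicing.

```python
OBSERVE_AT = 30_000  # summarize recent messages past this point
REFLECT_AT = 40_000  # compress the observation block past this point

def token_count(texts):
    # Crude stand-in: roughly 4 characters per token.
    return sum(len(t) for t in texts) // 4

def observe(messages):
    # Stand-in for the Observer agent: one short observation per message.
    return [f"[obs] {m[:40]}" for m in messages]

def reflect(observations):
    # Stand-in for the Reflector: lossy rewrite, detail is discarded.
    return observations[: len(observations) // 2]

def step(messages, observations, new_message):
    messages.append(new_message)
    if token_count(messages) > OBSERVE_AT:
        observations.extend(observe(messages))  # append-only: cache-friendly
        messages.clear()
    if token_count(observations) > REFLECT_AT:
        observations[:] = reflect(observations)  # lossy compression
    return messages, observations
```

Note what's absent: nothing here writes to disk. When the process holding `observations` exits, the memory exits with it.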

The key tradeoff: Mastra's approach means every observation is always visible to the model. No retrieval can miss anything, because there is no retrieval. But it also means the context window carries an ever-growing block of compressed text, and the Reflector's rewriting is lossy. Details that seem unimportant during rewriting are gone forever.

OMEGA's approach means retrieval can theoretically miss a relevant memory. But memories are never lost, only harder to find. Hybrid BM25 + vector search with semantic reranking minimizes retrieval failures, and the consolidation engine actively manages memory quality over time.

The Token Economics

Here is where the architectural difference hits your budget. In-context memory means every API call carries the full observation block. External memory means you pay for retrieval results only.

Dimension               | OMEGA                          | Mastra
Memory storage cost     | $0 (local SQLite file)         | LLM API cost for Observer + Reflector (gemini-2.5-flash per invocation)
Tokens per memory query | ~1,500 (top-k retrieval)       | ~30,000-70,000 (full context block)
At 10K queries/month    | ~15M tokens in, near-zero cost | ~300M-700M tokens in, significant cost at scale
Embedding cost          | $0 (local ONNX, no API)        | N/A (no embeddings)
Infrastructure          | None (single SQLite file)      | None (in-context)

At small scale (a few sessions a day), the cost difference is negligible. At production scale (thousands of agent sessions per month), the difference between ~1,500 and ~50,000 tokens per memory query compounds into significant spend.
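The back-of-envelope math is simple enough to write out. The token figures are the article's estimates; the $3-per-million-input-tokens price is a hypothetical round number, not any provider's actual rate.

```python
# Monthly input-token spend on memory alone, using the estimates above.
queries_per_month = 10_000
omega_tokens = 1_500 * queries_per_month    # top-k retrieval only
mastra_tokens = 50_000 * queries_per_month  # mid-range observation block

price_per_token = 3 / 1_000_000  # hypothetical $3 per 1M input tokens
omega_cost = omega_tokens * price_per_token    # $45/month
mastra_cost = mastra_tokens * price_per_token  # $1,500/month
```

Same query volume, roughly 33x the spend at the 50K-token midpoint, before prompt-cache discounts are applied.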

Mastra partially offsets this with prompt caching. Because observations are append-only, the cached prefix stays valid across turns within a session. This is a real advantage for providers that support prompt caching. But it doesn't help across sessions, because there are no cross-session observations.
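The reason append-only observations are cache-friendly is that the prompt prefix stays byte-identical from turn to turn, and providers that cache by prefix can reuse it. A simplified prompt assembly (not Mastra's actual code) makes the invariant visible:

```python
def build_prompt(system, observations, new_messages):
    prefix = system + "\n" + "\n".join(observations)  # stable across turns
    return prefix + "\n" + "\n".join(new_messages)

turn_1 = build_prompt("sys", ["obs1", "obs2"], ["msg1"])
# Next turn: a new observation is appended; earlier entries are untouched.
turn_2 = build_prompt("sys", ["obs1", "obs2", "obs3"], ["msg2"])
# turn_2 begins with turn_1's exact prefix, so that span can be served
# from the provider's prompt cache.
```

The moment the Reflector rewrites the block, the prefix changes and the cache is invalidated, which is why reflection is deferred as long as possible.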

What In-Context Memory Can't Do

These aren't feature gaps that Mastra could add later. They're architectural constraints of in-context memory:

Cross-session memory

When the context window closes, observations are gone. Starting a new session means starting from zero. OMEGA's SQLite persists across sessions, tools, projects, and reboots.

Selective retrieval

You can't retrieve specific memories from an observation block. The entire block is injected into every turn. With 50K tokens of observations, every API call pays for all of them, even when asking about one specific fact.

Memory lifecycle management

There is no way to decay, archive, or selectively forget observations. The Reflector compresses everything uniformly. OMEGA's consolidation engine decays stale memories, merges duplicates, and flags contradictions with full audit trails.
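What decay-based lifecycle management looks like can be sketched with a simple scoring function. The half-life and formula are illustrative assumptions in the spirit of a consolidation engine, not OMEGA's actual implementation.

```python
import math

HALF_LIFE_DAYS = 30.0  # assumed half-life for memory relevance

def decay_score(base_relevance, age_days, access_count):
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)  # exponential time decay
    reinforcement = math.log1p(access_count)      # frequent recall slows decay
    return base_relevance * recency * (1.0 + reinforcement)

fresh = decay_score(1.0, age_days=1, access_count=5)
stale = decay_score(1.0, age_days=180, access_count=0)
# A consolidation pass could archive anything scoring below a threshold,
# per memory, with the decision logged.
```

The point is granularity: each memory gets its own score and its own fate. A Reflector rewriting one monolithic text block has no per-memory handle to decay, archive, or forget.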

Multi-agent memory sharing

Each Mastra agent has its own context window. Two agents working on the same project can't share observations. OMEGA's shared SQLite store with entity graphs enables genuine multi-agent coordination.

Unbounded growth

After thousands of interactions, OMEGA's retrieval stays O(log n). Mastra's observation block grows until the Reflector starts discarding, and you have no control over what gets discarded.

When Mastra Is the Right Choice

I'm not going to pretend OMEGA is better in every scenario. Mastra's approach genuinely wins if:

You only need memory within a single session and never need to recall across sessions
Prompt cache friendliness is critical to your latency budget
You want zero external state: no database, no files, no disk writes
You're building in TypeScript and want memory integrated into an agent framework, not a separate tool
Your sessions are short enough that the observation block never triggers the Reflector's lossy compression

These are legitimate use cases. If your agent does one task per session and doesn't need to remember anything from yesterday, Mastra's zero-infra approach is arguably simpler than running any external memory system.

The Real Question: What Does “Memory” Mean?

Here's what this comparison really comes down to: what do you mean when you say you want your AI agent to have “memory”?

If you mean “remember more within this conversation,” Mastra's observational memory is a good solution. It extends effective context length by compressing older messages into observations.

If you mean “remember across conversations, learn over time, and build up knowledge,” you need external persistent memory. That's what OMEGA does. Memories survive after sessions end. They accumulate. The consolidation engine manages them over time. Your agent on day 100 is meaningfully different from your agent on day 1.

The first is context extension. The second is memory. Both are valuable. They're just solving different problems.

Up to 47x
Token cost difference per memory query
~1,500 tokens (OMEGA) vs ~30,000-70,000 (Mastra). Every turn, every session.

If persistent, cross-session memory is what you need, OMEGA takes 30 seconds to set up:

$ pip install omega-memory
$ omega setup
✓ Memory persists across sessions. No API keys. No Docker.

- Jason Sosa, builder of OMEGA

See the full feature matrix with Mastra, Mem0, Zep, and Letta.