Research · 5 min read

OMEGA on a Phone

We ran a feasibility study: can a phone run local semantic memory for AI agents? The Samsung S25 Ultra made the answer obvious.

[Image: Samsung S25 Ultra with a golden neural-network memory visualization on screen, floating in dark space]

AI agents are headed to phones. Not as cloud clients that pipe everything to a server, but as local processes that reason, remember, and act on the device itself. Samsung ships Gemini Nano on the S25 for exactly this reason. Apple is building on-device inference into every new chip.

But inference alone is not enough. An agent that can think but cannot remember is a party trick. It will answer your question, then forget you asked. The missing piece is memory, and on mobile, that memory needs to be local. You cannot afford 200ms round-trips to a cloud memory API while the user is mid-conversation. You cannot send private context to someone else's server every time the agent needs to recall something.

So we asked the question: can OMEGA, which already runs on laptops and desktops as a local MCP server, run on a phone?

The Hardware Is Already There

The Samsung S25 Ultra ships with a Hexagon NPU rated at 45 TOPS (trillion operations per second). For context, OMEGA's embedding model (bge-small-en-v1.5, 384 dimensions) needs a fraction of that. The NPU is designed for exactly this kind of workload: small model inference, high throughput, low power.
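A rough back-of-envelope makes the gap concrete. The numbers below are our assumptions, not measurements: roughly 33M parameters for bge-small-en-v1.5, the usual ~2 ops per parameter per token for transformer inference, a 128-token input, and a pessimistic 10% effective NPU utilization:

```rust
// Back-of-envelope check on embedding cost vs. the NPU's budget.
// Assumptions (ours, not measured): ~33M parameters, ~2 ops per
// parameter per token, 128 tokens, 10% effective NPU utilization.
fn main() {
    let params = 33.4e6_f64;
    let tokens = 128.0_f64;
    let ops_per_query = 2.0 * params * tokens; // ~8.6e9 ops
    let effective_ops = 45.0e12 * 0.10;        // 45 TOPS at 10% utilization
    let ms = ops_per_query / effective_ops * 1e3;
    println!("~{ms:.1} ms per embedding");     // ~1.9 ms of raw compute
}
```

Even under those pessimistic assumptions, a single embedding costs about 2 ms of compute, which is why the sub-5ms search figure below is unsurprising.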

We modeled the full OMEGA stack against the S25 Ultra's specs. The results were not close. The phone is massively overpowered for the job.

| Metric | On S25 Ultra |
| --- | --- |
| Total package size (INT8 quantized) | ~100-120 MB |
| Semantic search latency (NPU) | < 5 ms |
| Semantic search latency (CPU fallback) | < 20 ms |
| RAM usage (1,500 memories active) | ~160 MB |
| RAM when idle | ~50 MB |
| Database on disk (1,500 memories) | ~10 MB |
| Battery impact | Minimal (burst compute, auto-unload after 10 min idle) |

120 MB is smaller than most social media apps. 160 MB of RAM is nothing on a device with 12 GB. And sub-5ms search means the memory lookup completes before the UI can even render the loading spinner.

Why Local Matters More on Mobile

On a laptop connected to fiber, you can tolerate a cloud memory API. The latency is annoying but survivable. On a phone, the calculus changes completely.

Phones lose connectivity. They switch between WiFi and cellular. They go through tunnels, elevators, airplanes. A cloud-dependent memory system fails in all of these situations. Not degrades. Fails. The agent cannot recall anything until the connection comes back.

Then there is privacy. A phone is the most personal device people carry. The conversations you have with an AI assistant on your phone, the decisions it helps you make, the context it accumulates about your work, your schedule, your thinking patterns. That data should not leave the device. Full stop.

Local memory on mobile is not a nice-to-have. It is the only architecture that works.

What Scales and What Doesn't

The S25 Ultra can handle far more than a demo. We modeled four usage tiers to find the limits.

| Scenario | Memories | DB Size | RAM | Search |
| --- | --- | --- | --- | --- |
| Demo | 1,500 | ~10 MB | ~160 MB | < 5 ms |
| Daily use | 10K-50K | 60-300 MB | 200-400 MB | < 30 ms |
| Power user | 100K-200K | 0.6-1.2 GB | 0.7-1.4 GB | < 150 ms |
| With INT8 + HNSW | 500K-1M | < 2 GB | < 1 GB | < 50 ms |

A million memories in under a gigabyte of RAM, with sub-50ms search. On a phone. The constraint is not hardware. It is software, and that is the part we know how to build.
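A quick storage sanity check (our arithmetic, not a measurement) shows why INT8 quantization is the lever in that last row:

```rust
// Raw vector storage for one million 384-dimensional embeddings,
// before HNSW graph overhead.
fn main() {
    let memories: u64 = 1_000_000;
    let dims: u64 = 384; // bge-small-en-v1.5 output size
    let int8_mb = memories * dims / 1_000_000;    // 1 byte per dimension
    let f32_mb = memories * dims * 4 / 1_000_000; // unquantized baseline
    println!("INT8 vectors: {int8_mb} MB");       // 384 MB
    println!("f32 vectors:  {f32_mb} MB");        // 1536 MB
}
```

At one byte per dimension, a million vectors fit in 384 MB, leaving room for the HNSW index inside the 1 GB budget. Unquantized float32 vectors alone would blow past it.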

The Architecture

OMEGA on desktop is Python. OMEGA on mobile would not be. The production path is a Rust core with a Kotlin UI layer. Rust gives us iOS portability later without rewriting the engine. The core components that move over: SQLite storage, ONNX embedding inference, vector search via sqlite-vec, and the query/store pipeline.
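As a sketch of what that core could look like, here is a minimal query path using the rusqlite and sqlite-vec crates. The `vec_memories` table and its bare schema are illustrative, not OMEGA's actual schema:

```rust
// Minimal sketch of a Rust query path over sqlite-vec. Assumes the
// sqlite-vec crate's extension registration pattern; the table name
// and schema are illustrative.
use rusqlite::{ffi::sqlite3_auto_extension, Connection, Result};
use sqlite_vec::sqlite3_vec_init;

fn open_store(path: &str) -> Result<Connection> {
    // Register sqlite-vec before opening so vec0 tables are available.
    unsafe {
        sqlite3_auto_extension(Some(std::mem::transmute(
            sqlite3_vec_init as *const (),
        )));
    }
    let db = Connection::open(path)?;
    db.execute_batch(
        "CREATE VIRTUAL TABLE IF NOT EXISTS vec_memories
         USING vec0(embedding float[384]);",
    )?;
    Ok(db)
}

fn top_k(db: &Connection, query: &[f32; 384], k: usize) -> Result<Vec<(i64, f64)>> {
    // sqlite-vec accepts query vectors as raw little-endian f32 blobs.
    let blob: Vec<u8> = query.iter().flat_map(|f| f.to_le_bytes()).collect();
    let mut stmt = db.prepare(
        "SELECT rowid, distance FROM vec_memories
         WHERE embedding MATCH ?1 ORDER BY distance LIMIT ?2",
    )?;
    let rows = stmt.query_map((blob, k as i64), |r| Ok((r.get(0)?, r.get(1)?)))?;
    rows.collect()
}
```

Everything lives in one SQLite file, which is exactly what a mobile storage budget wants.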

What stays behind: the MCP server stack, PDF parsing, cloud sync, the HTTP bridge. Mobile does not need any of that. It needs fast local search, smart caching (512-entry LRU for embeddings, top 50 memories in hot cache), and aggressive memory management (model auto-unload after 10 minutes idle saves 90-170 MB).
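The auto-unload policy is simple enough to sketch. Everything below is illustrative; `OnnxModel` is a stub standing in for the real ONNX session:

```rust
// Sketch of the idle auto-unload policy described above.
use std::time::{Duration, Instant};

const IDLE_LIMIT: Duration = Duration::from_secs(10 * 60); // 10 min

struct OnnxModel;

impl OnnxModel {
    fn load() -> Self {
        // Stub: the real version maps ~90-170 MB of model weights.
        OnnxModel
    }
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 384] // stub 384-dim embedding
    }
}

struct EmbeddingEngine {
    model: Option<OnnxModel>,
    last_used: Instant,
}

impl EmbeddingEngine {
    fn embed(&mut self, text: &str) -> Vec<f32> {
        // Lazily (re)load on first use after an unload.
        let model = self.model.get_or_insert_with(OnnxModel::load);
        self.last_used = Instant::now();
        model.embed(text)
    }

    // Called from a periodic timer or a memory-pressure callback.
    fn maybe_unload(&mut self) {
        if self.model.is_some() && self.last_used.elapsed() > IDLE_LIMIT {
            self.model = None; // drop the weights; the SQLite index stays
        }
    }
}
```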

On the S25 specifically, Samsung's QNN/SNPE SDK exposes the Hexagon NPU directly. The embedding model runs on dedicated silicon rather than competing with the app for CPU time. Gemini Nano, which ships on the device, can handle simple classification queries with no network round-trip and no per-query cost.

What This Means

The phones people carry today have enough compute to run a full semantic memory system for AI agents. Not a stripped-down version. Not a cache that calls home. The real thing, with embeddings, vector search, intelligent forgetting, and sub-5ms retrieval.

This changes what mobile AI assistants can be. Instead of stateless chatbots that reset every session, imagine an assistant that remembers your decisions, learns from your corrections, and builds context over months. All on the device. All private. All instant.

Current Status

This is a completed feasibility study, not a shipped product. OMEGA runs today on macOS, Linux, and Windows as an MCP server for coding agents. The mobile feasibility work is a research milestone that shows the architecture is portable. If you are working on on-device AI and want to talk about local memory, reach out.

Today, OMEGA runs on your laptop. The question was never whether it could run on your phone. The question was whether the phone could keep up. It can.

Install OMEGA · Benchmark results · GitHub