Open-Model Transition Runbook — replacing Claude Code / Codex inference cost with DeepSeek + GLM

Name: OMEGA
Author: OMEGA

Status: Draft / investigation deliverable Date: 2026-06-24 Author: Moss Objective: Run the daily claude (and, where viable, codex) workflow on open models (DeepSeek-v4, Z.AI GLM) to cut inference cost, without disrupting the workflow — same commands, same OMEGA hooks/memory/MCP, with a one-keystroke fallback to real Claude.

This runbook changes nothing until you execute the steps. It is reference, not applied state. Related memory: mem-b68c84d1590d (seamless transition), mem-94bae3cee62f (router architecture), mem-c7c5f6ee34ca (key location).

0. What this does and does not touch

Replaces: your interactive claude / codex inference spend.
Does NOT touch: OMEGA's own daemon, hooks, memory, MCP — all verified backend-agnostic (fast_hook.py only tags client="claude-code"; hooks fire on CLI lifecycle, not the model). They keep working unchanged.
Does NOT redirect OMEGA's internal LLM calls (router / llm.py) — as long as you scope the env to the claude process (Section 4). Do not export ANTHROPIC_BASE_URL globally in your shell rc, or you will silently redirect OMEGA's own Anthropic calls too.

1. Prerequisites

Item	State	Notes
`DEEPSEEK_API_KEY`	✅ in `~/.omega/secrets.json`	prefix `sk-8…`
`Z_AI_API_KEY`	✅ in `~/.omega/secrets.json`	prefix `22e1…`, pay-as-you-go key
`ANTHROPIC_API_KEY` (real Claude)	keep available	this is your fallback tier — do not remove
`claude` CLI	required	`claude --version`
`codex` CLI	optional	only if pursuing Codex (Section 8 — deferred)

Load the keys into the shell session that will launch claude (scoped, not exported globally — see Section 4):

# convenience: pull from OMEGA secrets without hardcoding
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['DEEPSEEK_API_KEY'])")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['Z_AI_API_KEY'])")

2. PRE-FLIGHT GATES (blocking — verify before rollout)

These two facts are unverified and each changes the plan. Settle them first.

Gate A — Does the PAYG Z.AI key work on the Anthropic endpoint?

GLM-via-Claude-Code uses https://api.z.ai/api/anthropic. Whether a pay-as-you-go key is accepted there (vs. requiring the $18/mo Coding Plan) is undocumented.

curl -s -w "\nHTTP %{http_code}\n" https://api.z.ai/api/anthropic/v1/messages \
  -H "x-api-key: $Z_AI_API_KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" \
  -d '{"model":"glm-4.6","max_tokens":16,"messages":[{"role":"user","content":"pong"}]}'

HTTP 200 → GLM-via-CLI is viable on PAYG. Proceed with GLM (Section 5).
HTTP 401/403 → /api/anthropic needs the Coding Plan. Either buy it or skip GLM-via-CLI and use DeepSeek only. DeepSeek is unaffected.

Gate B — Does each provider honor prompt caching?

DeepSeek's Anthropic shim ignores cache_control (confirmed in docs) → Claude Code's cache economics vanish, inflating real cost. Confirm GLM's behavior and treat caching loss as the dominant cost variable. There is no clean curl for this; rely on the cost instrumentation in Section 9 (compare billed input tokens across two identical repeated turns — flat or near-flat = caching works; full re-bill = ignored).

3. Verified provider facts (use these exact strings)

	DeepSeek	Z.AI GLM
`ANTHROPIC_BASE_URL`	`https://api.deepseek.com/anthropic`	`https://api.z.ai/api/anthropic`
Auth env	`ANTHROPIC_AUTH_TOKEN` (or `ANTHROPIC_API_KEY`) = key	`ANTHROPIC_AUTH_TOKEN` = key
Opus-tier model	`deepseek-v4-pro`	`glm-4.6` (or `glm-4.7`)
Sonnet/Haiku-tier	`deepseek-v4-flash`	`glm-4.6` / lighter variant
Context / max out	1M / 384K	200K / 128K (glm-4.6)
Caching	`cache_control` ignored	unverified (Gate B)
Images / MCP-content blocks	not supported	unverified

Model names drift (GLM 4.7/5.x, DeepSeek v4). Treat them as config; re-verify in the provider console if a call 404s on the model.

4. Phase 1 — Claude Code → DeepSeek (primary path, scoped wrappers)

The recommended mechanism is scoped shell wrappers (env applies to the claude process only — never pollutes OMEGA or other tools). Add to ~/.zshrc:

# --- open-model Claude Code wrappers -------------------------------------
# Keys (scoped to this shell; do NOT export ANTHROPIC_BASE_URL globally)
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('DEEPSEEK_API_KEY',''))")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('Z_AI_API_KEY',''))")

# DeepSeek-backed Claude Code
claude-ds() {
  env \
    ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic" \
    ANTHROPIC_AUTH_TOKEN="$DEEPSEEK_API_KEY" \
    ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-pro" \
    ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash" \
    ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash" \
    command claude "$@"
}

# Real-Claude escape hatch (unsets the open-model vars, falls back to login/API key)
claude-pro() {
  env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN command claude "$@"
}

# OPTIONAL: make cheap the default. Comment out to keep `claude` = real Claude.
# alias claude=claude-ds
# -------------------------------------------------------------------------

Reload: source ~/.zshrc.

Why wrappers over settings.json: ~/.claude/settings.json env blocks are valid too, but they are sticky and global to every project, and make the escape hatch awkward (you must override with --settings). Wrappers keep the toggle one word (claude-ds vs claude-pro) and guarantee OMEGA's environment is untouched.

Per-project alternative (if you want one repo always cheap): create <project>/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "<deepseek-key>",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-v4-pro",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-v4-flash",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-v4-flash"
  }
}

Verify Phase 1

claude-ds -p "reply with exactly: pong"      # expect: pong

Then in an interactive claude-ds session run /status — confirm the active endpoint is the DeepSeek base URL (not your subscription). Confirm OMEGA's SessionStart briefing still fires (proves hooks/MCP intact).

5. Phase 2 — GLM as an alternative (only if Gate A = 200)

Add a sibling wrapper:

claude-glm() {
  env \
    ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
    ANTHROPIC_AUTH_TOKEN="$Z_AI_API_KEY" \
    ANTHROPIC_DEFAULT_OPUS_MODEL="glm-4.6" \
    ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.6" \
    ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.6" \
    command claude "$@"
}

Verify identically: claude-glm -p "reply with exactly: pong".

6. The seams — mitigations to set up alongside the switch

Seam	Mitigation in this runbook
Web search tool breaks (Anthropic server-side)	Add an MCP web-search server (e.g. a Tavily/Brave MCP — select + configure separately) so search still works in-session. Until then, web research degrades.
Image inputs (DeepSeek rejects)	Use `claude-pro` for any task that pastes images.
Caching loss	Trim per-session context where possible; rely on Section 9 to quantify the real cost.
MCP tools load into prefix on custom endpoint	Settle your MCP server set at session start; avoid connecting/disconnecting mid-session.

7. The escape hatch (this is what makes it "seamless")

Rule of thumb: cheap by default, real Claude on demand. Reach for claude-pro when:

the task is hard / architectural and the cheap model is thrashing,
you need web search or image input,
you hit any model-side error you can't explain.

No state to change — it's just which wrapper you type.

8. Codex (DEFERRED — not seamless)

Current Codex requires the OpenAI Responses API; both DeepSeek and GLM serve only Chat Completions, so a direct model_providers entry will fail. Two options, both non-trivial:

Recommended: migrate Codex usage into claude-ds / claude-glm and retire Codex from the cost equation.

If you must keep Codex: stand up a LiteLLM proxy translating Responses→Chat, then point Codex at it:

# ~/.codex/config.toml  (requires a running LiteLLM proxy at :4000)
model = "deepseek-v4-flash"
[model_providers.litellm]
base_url = "http://localhost:4000"
env_key = "LITELLM_KEY"
wire_api = "responses"

This adds a always-on local service — defer until Phase 1 proves value.

9. Cost instrumentation (prove the savings are real)

Baseline: before switching, note 1 week of real-Claude spend (Anthropic console) for comparable work.
After: the provider consoles are ground truth — DeepSeek dashboard / Z.AI console show actual billed tokens & $. /cost inside Claude Code estimates against Anthropic pricing and will be wrong for third-party endpoints; don't trust it for $.
Caching check (Gate B): run the same prompt twice in one session; compare billed input tokens on the provider dashboard. Flat = caching honored; full re-bill = ignored (raises real cost).
Track fallback rate — how often you reach for claude-pro. High fallback rate erodes the savings and signals the cheap tier isn't carrying your workload.

10. Capability-parity eval (the gate before trusting cheap-by-default)

Before flipping alias claude=claude-ds, run a fixed eval:

Pick 8–10 representative tasks from your real work (a bug fix, a refactor, a feature, a multi-file change, a tool-heavy task).
Run each under claude-ds (and claude-glm if live). Score: pass / partial / fail, plus latency and whether you had to fall back.
Compare against a claude-pro run of the same tasks.
Decision rule: if cheap tier passes ≥ ~80% unaided and fallback rate is low, make it default. Otherwise keep claude-pro as default and use cheap selectively.

11. Rollback

Nothing structural changes, so rollback is trivial:

Stop using claude-ds/claude-glm; use claude-pro (or plain claude).
If you set alias claude=claude-ds, remove the alias line and source ~/.zshrc.
Delete any per-project .claude/settings.json env block.
OMEGA, hooks, MCP, memory are unaffected throughout — no cleanup needed.

12. Open items carried from investigation

Gate A: GLM /api/anthropic PAYG entitlement (curl in Section 2).
Gate B: caching behavior per provider (Section 9 method).
Select + configure an MCP web-search server to replace the broken native tool.
Run the parity eval (Section 10) before cheap-by-default.
Decide Codex: migrate to Claude Code vs. stand up LiteLLM proxy.