Skip to main content

Open-Model Transition Runbook — replacing Claude Code / Codex inference cost with DeepSeek + GLM

Status: Draft / investigation deliverable Date: 2026-06-24 Author: Moss Objective: Run the daily claude (and, where viable, codex) workflow on open models (DeepSeek-v4, Z.AI GLM) to cut inference cost, without disrupting the workflow — same commands, same OMEGA hooks/memory/MCP, with a one-keystroke fallback to real Claude.

This runbook changes nothing until you execute the steps. It is reference, not applied state. Related memory: mem-b68c84d1590d (seamless transition), mem-94bae3cee62f (router architecture), mem-c7c5f6ee34ca (key location).

0. What this does and does not touch

  • Replaces: your interactive claude / codex inference spend.
  • Does NOT touch: OMEGA's own daemon, hooks, memory, MCP — all verified backend-agnostic (fast_hook.py only tags client="claude-code"; hooks fire on CLI lifecycle, not the model). They keep working unchanged.
  • Does NOT redirect OMEGA's internal LLM calls (router / llm.py) — as long as you scope the env to the claude process (Section 4). Do not export ANTHROPIC_BASE_URL globally in your shell rc, or you will silently redirect OMEGA's own Anthropic calls too.

1. Prerequisites

ItemStateNotes
DEEPSEEK_API_KEY✅ in ~/.omega/secrets.jsonprefix sk-8…
Z_AI_API_KEY✅ in ~/.omega/secrets.jsonprefix 22e1…, pay-as-you-go key
ANTHROPIC_API_KEY (real Claude)keep availablethis is your fallback tier — do not remove
claude CLIrequiredclaude --version
codex CLIoptionalonly if pursuing Codex (Section 8 — deferred)

Load the keys into the shell session that will launch claude (scoped, not exported globally — see Section 4):

# convenience: pull from OMEGA secrets without hardcoding
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['DEEPSEEK_API_KEY'])")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['Z_AI_API_KEY'])")

2. PRE-FLIGHT GATES (blocking — verify before rollout)

These two facts are unverified and each changes the plan. Settle them first.

Gate A — Does the PAYG Z.AI key work on the Anthropic endpoint?

GLM-via-Claude-Code uses https://api.z.ai/api/anthropic. Whether a pay-as-you-go key is accepted there (vs. requiring the $18/mo Coding Plan) is undocumented.

curl -s -w "\nHTTP %{http_code}\n" https://api.z.ai/api/anthropic/v1/messages \
  -H "x-api-key: $Z_AI_API_KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" \
  -d '{"model":"glm-4.6","max_tokens":16,"messages":[{"role":"user","content":"pong"}]}'
  • HTTP 200 → GLM-via-CLI is viable on PAYG. Proceed with GLM (Section 5).
  • HTTP 401/403/api/anthropic needs the Coding Plan. Either buy it or skip GLM-via-CLI and use DeepSeek only. DeepSeek is unaffected.

Gate B — Does each provider honor prompt caching?

DeepSeek's Anthropic shim ignores cache_control (confirmed in docs) → Claude Code's cache economics vanish, inflating real cost. Confirm GLM's behavior and treat caching loss as the dominant cost variable. There is no clean curl for this; rely on the cost instrumentation in Section 9 (compare billed input tokens across two identical repeated turns — flat or near-flat = caching works; full re-bill = ignored).

3. Verified provider facts (use these exact strings)

DeepSeekZ.AI GLM
ANTHROPIC_BASE_URLhttps://api.deepseek.com/anthropichttps://api.z.ai/api/anthropic
Auth envANTHROPIC_AUTH_TOKEN (or ANTHROPIC_API_KEY) = keyANTHROPIC_AUTH_TOKEN = key
Opus-tier modeldeepseek-v4-proglm-4.6 (or glm-4.7)
Sonnet/Haiku-tierdeepseek-v4-flashglm-4.6 / lighter variant
Context / max out1M / 384K200K / 128K (glm-4.6)
Cachingcache_control ignoredunverified (Gate B)
Images / MCP-content blocksnot supportedunverified

Model names drift (GLM 4.7/5.x, DeepSeek v4). Treat them as config; re-verify in the provider console if a call 404s on the model.

4. Phase 1 — Claude Code → DeepSeek (primary path, scoped wrappers)

The recommended mechanism is scoped shell wrappers (env applies to the claude process only — never pollutes OMEGA or other tools). Add to ~/.zshrc:

# --- open-model Claude Code wrappers -------------------------------------
# Keys (scoped to this shell; do NOT export ANTHROPIC_BASE_URL globally)
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('DEEPSEEK_API_KEY',''))")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('Z_AI_API_KEY',''))")

# DeepSeek-backed Claude Code
claude-ds() {
  env \
    ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic" \
    ANTHROPIC_AUTH_TOKEN="$DEEPSEEK_API_KEY" \
    ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-pro" \
    ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash" \
    ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash" \
    command claude "$@"
}

# Real-Claude escape hatch (unsets the open-model vars, falls back to login/API key)
claude-pro() {
  env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN command claude "$@"
}

# OPTIONAL: make cheap the default. Comment out to keep `claude` = real Claude.
# alias claude=claude-ds
# -------------------------------------------------------------------------

Reload: source ~/.zshrc.

Why wrappers over settings.json: ~/.claude/settings.json env blocks are valid too, but they are sticky and global to every project, and make the escape hatch awkward (you must override with --settings). Wrappers keep the toggle one word (claude-ds vs claude-pro) and guarantee OMEGA's environment is untouched.

Per-project alternative (if you want one repo always cheap): create <project>/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "<deepseek-key>",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-v4-pro",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-v4-flash",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-v4-flash"
  }
}

Verify Phase 1

claude-ds -p "reply with exactly: pong"      # expect: pong

Then in an interactive claude-ds session run /status — confirm the active endpoint is the DeepSeek base URL (not your subscription). Confirm OMEGA's SessionStart briefing still fires (proves hooks/MCP intact).

5. Phase 2 — GLM as an alternative (only if Gate A = 200)

Add a sibling wrapper:

claude-glm() {
  env \
    ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
    ANTHROPIC_AUTH_TOKEN="$Z_AI_API_KEY" \
    ANTHROPIC_DEFAULT_OPUS_MODEL="glm-4.6" \
    ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.6" \
    ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.6" \
    command claude "$@"
}

Verify identically: claude-glm -p "reply with exactly: pong".

6. The seams — mitigations to set up alongside the switch

SeamMitigation in this runbook
Web search tool breaks (Anthropic server-side)Add an MCP web-search server (e.g. a Tavily/Brave MCP — select + configure separately) so search still works in-session. Until then, web research degrades.
Image inputs (DeepSeek rejects)Use claude-pro for any task that pastes images.
Caching lossTrim per-session context where possible; rely on Section 9 to quantify the real cost.
MCP tools load into prefix on custom endpointSettle your MCP server set at session start; avoid connecting/disconnecting mid-session.

7. The escape hatch (this is what makes it "seamless")

Rule of thumb: cheap by default, real Claude on demand. Reach for claude-pro when:

  • the task is hard / architectural and the cheap model is thrashing,
  • you need web search or image input,
  • you hit any model-side error you can't explain.

No state to change — it's just which wrapper you type.

8. Codex (DEFERRED — not seamless)

Current Codex requires the OpenAI Responses API; both DeepSeek and GLM serve only Chat Completions, so a direct model_providers entry will fail. Two options, both non-trivial:

  • Recommended: migrate Codex usage into claude-ds / claude-glm and retire Codex from the cost equation.
  • If you must keep Codex: stand up a LiteLLM proxy translating Responses→Chat, then point Codex at it:
    # ~/.codex/config.toml  (requires a running LiteLLM proxy at :4000)
    model = "deepseek-v4-flash"
    [model_providers.litellm]
    base_url = "http://localhost:4000"
    env_key = "LITELLM_KEY"
    wire_api = "responses"
    
    This adds a always-on local service — defer until Phase 1 proves value.

9. Cost instrumentation (prove the savings are real)

  1. Baseline: before switching, note 1 week of real-Claude spend (Anthropic console) for comparable work.
  2. After: the provider consoles are ground truth — DeepSeek dashboard / Z.AI console show actual billed tokens & $. /cost inside Claude Code estimates against Anthropic pricing and will be wrong for third-party endpoints; don't trust it for $.
  3. Caching check (Gate B): run the same prompt twice in one session; compare billed input tokens on the provider dashboard. Flat = caching honored; full re-bill = ignored (raises real cost).
  4. Track fallback rate — how often you reach for claude-pro. High fallback rate erodes the savings and signals the cheap tier isn't carrying your workload.

10. Capability-parity eval (the gate before trusting cheap-by-default)

Before flipping alias claude=claude-ds, run a fixed eval:

  1. Pick 8–10 representative tasks from your real work (a bug fix, a refactor, a feature, a multi-file change, a tool-heavy task).
  2. Run each under claude-ds (and claude-glm if live). Score: pass / partial / fail, plus latency and whether you had to fall back.
  3. Compare against a claude-pro run of the same tasks.
  4. Decision rule: if cheap tier passes ≥ ~80% unaided and fallback rate is low, make it default. Otherwise keep claude-pro as default and use cheap selectively.

11. Rollback

Nothing structural changes, so rollback is trivial:

  • Stop using claude-ds/claude-glm; use claude-pro (or plain claude).
  • If you set alias claude=claude-ds, remove the alias line and source ~/.zshrc.
  • Delete any per-project .claude/settings.json env block.
  • OMEGA, hooks, MCP, memory are unaffected throughout — no cleanup needed.

12. Open items carried from investigation

  • Gate A: GLM /api/anthropic PAYG entitlement (curl in Section 2).
  • Gate B: caching behavior per provider (Section 9 method).
  • Select + configure an MCP web-search server to replace the broken native tool.
  • Run the parity eval (Section 10) before cheap-by-default.
  • Decide Codex: migrate to Claude Code vs. stand up LiteLLM proxy.