Open-Model Transition Runbook — replacing Claude Code / Codex inference cost with DeepSeek + GLM
Status: Draft / investigation deliverable
Date: 2026-06-24
Author: Moss
Objective: Run the daily claude (and, where viable, codex) workflow on open models (DeepSeek-v4, Z.AI GLM) to cut inference cost, without disrupting the workflow — same commands, same OMEGA hooks/memory/MCP, with a one-keystroke fallback to real Claude.
This runbook changes nothing until you execute the steps. It is reference, not applied state. Related memory:
mem-b68c84d1590d(seamless transition),mem-94bae3cee62f(router architecture),mem-c7c5f6ee34ca(key location).
0. What this does and does not touch
- Replaces: your interactive
claude/codexinference spend. - Does NOT touch: OMEGA's own daemon, hooks, memory, MCP — all verified backend-agnostic (
fast_hook.pyonly tagsclient="claude-code"; hooks fire on CLI lifecycle, not the model). They keep working unchanged. - Does NOT redirect OMEGA's internal LLM calls (router /
llm.py) — as long as you scope the env to theclaudeprocess (Section 4). Do notexport ANTHROPIC_BASE_URLglobally in your shell rc, or you will silently redirect OMEGA's own Anthropic calls too.
1. Prerequisites
| Item | State | Notes |
|---|---|---|
DEEPSEEK_API_KEY | ✅ in ~/.omega/secrets.json | prefix sk-8… |
Z_AI_API_KEY | ✅ in ~/.omega/secrets.json | prefix 22e1…, pay-as-you-go key |
ANTHROPIC_API_KEY (real Claude) | keep available | this is your fallback tier — do not remove |
claude CLI | required | claude --version |
codex CLI | optional | only if pursuing Codex (Section 8 — deferred) |
Load the keys into the shell session that will launch claude (scoped, not exported globally — see Section 4):
# convenience: pull from OMEGA secrets without hardcoding
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['DEEPSEEK_API_KEY'])")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text())['Z_AI_API_KEY'])")
2. PRE-FLIGHT GATES (blocking — verify before rollout)
These two facts are unverified and each changes the plan. Settle them first.
Gate A — Does the PAYG Z.AI key work on the Anthropic endpoint?
GLM-via-Claude-Code uses https://api.z.ai/api/anthropic. Whether a pay-as-you-go key is accepted there (vs. requiring the $18/mo Coding Plan) is undocumented.
curl -s -w "\nHTTP %{http_code}\n" https://api.z.ai/api/anthropic/v1/messages \
-H "x-api-key: $Z_AI_API_KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" \
-d '{"model":"glm-4.6","max_tokens":16,"messages":[{"role":"user","content":"pong"}]}'
- HTTP 200 → GLM-via-CLI is viable on PAYG. Proceed with GLM (Section 5).
- HTTP 401/403 →
/api/anthropicneeds the Coding Plan. Either buy it or skip GLM-via-CLI and use DeepSeek only. DeepSeek is unaffected.
Gate B — Does each provider honor prompt caching?
DeepSeek's Anthropic shim ignores cache_control (confirmed in docs) → Claude Code's cache economics vanish, inflating real cost. Confirm GLM's behavior and treat caching loss as the dominant cost variable. There is no clean curl for this; rely on the cost instrumentation in Section 9 (compare billed input tokens across two identical repeated turns — flat or near-flat = caching works; full re-bill = ignored).
3. Verified provider facts (use these exact strings)
| DeepSeek | Z.AI GLM | |
|---|---|---|
ANTHROPIC_BASE_URL | https://api.deepseek.com/anthropic | https://api.z.ai/api/anthropic |
| Auth env | ANTHROPIC_AUTH_TOKEN (or ANTHROPIC_API_KEY) = key | ANTHROPIC_AUTH_TOKEN = key |
| Opus-tier model | deepseek-v4-pro | glm-4.6 (or glm-4.7) |
| Sonnet/Haiku-tier | deepseek-v4-flash | glm-4.6 / lighter variant |
| Context / max out | 1M / 384K | 200K / 128K (glm-4.6) |
| Caching | cache_control ignored | unverified (Gate B) |
| Images / MCP-content blocks | not supported | unverified |
Model names drift (GLM 4.7/5.x, DeepSeek v4). Treat them as config; re-verify in the provider console if a call 404s on the model.
4. Phase 1 — Claude Code → DeepSeek (primary path, scoped wrappers)
The recommended mechanism is scoped shell wrappers (env applies to the claude process only — never pollutes OMEGA or other tools). Add to ~/.zshrc:
# --- open-model Claude Code wrappers -------------------------------------
# Keys (scoped to this shell; do NOT export ANTHROPIC_BASE_URL globally)
DEEPSEEK_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('DEEPSEEK_API_KEY',''))")
Z_AI_API_KEY=$(python3.11 -c "import json,pathlib;print(json.loads((pathlib.Path.home()/'.omega/secrets.json').read_text()).get('Z_AI_API_KEY',''))")
# DeepSeek-backed Claude Code
claude-ds() {
env \
ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic" \
ANTHROPIC_AUTH_TOKEN="$DEEPSEEK_API_KEY" \
ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-pro" \
ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash" \
command claude "$@"
}
# Real-Claude escape hatch (unsets the open-model vars, falls back to login/API key)
claude-pro() {
env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN command claude "$@"
}
# OPTIONAL: make cheap the default. Comment out to keep `claude` = real Claude.
# alias claude=claude-ds
# -------------------------------------------------------------------------
Reload: source ~/.zshrc.
Why wrappers over settings.json: ~/.claude/settings.json env blocks are valid too, but they are sticky and global to every project, and make the escape hatch awkward (you must override with --settings). Wrappers keep the toggle one word (claude-ds vs claude-pro) and guarantee OMEGA's environment is untouched.
Per-project alternative (if you want one repo always cheap): create <project>/.claude/settings.json:
{
"env": {
"ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
"ANTHROPIC_AUTH_TOKEN": "<deepseek-key>",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-v4-pro",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-v4-flash",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-v4-flash"
}
}
Verify Phase 1
claude-ds -p "reply with exactly: pong" # expect: pong
Then in an interactive claude-ds session run /status — confirm the active endpoint is the DeepSeek base URL (not your subscription). Confirm OMEGA's SessionStart briefing still fires (proves hooks/MCP intact).
5. Phase 2 — GLM as an alternative (only if Gate A = 200)
Add a sibling wrapper:
claude-glm() {
env \
ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
ANTHROPIC_AUTH_TOKEN="$Z_AI_API_KEY" \
ANTHROPIC_DEFAULT_OPUS_MODEL="glm-4.6" \
ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.6" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.6" \
command claude "$@"
}
Verify identically: claude-glm -p "reply with exactly: pong".
6. The seams — mitigations to set up alongside the switch
| Seam | Mitigation in this runbook |
|---|---|
| Web search tool breaks (Anthropic server-side) | Add an MCP web-search server (e.g. a Tavily/Brave MCP — select + configure separately) so search still works in-session. Until then, web research degrades. |
| Image inputs (DeepSeek rejects) | Use claude-pro for any task that pastes images. |
| Caching loss | Trim per-session context where possible; rely on Section 9 to quantify the real cost. |
| MCP tools load into prefix on custom endpoint | Settle your MCP server set at session start; avoid connecting/disconnecting mid-session. |
7. The escape hatch (this is what makes it "seamless")
Rule of thumb: cheap by default, real Claude on demand. Reach for claude-pro when:
- the task is hard / architectural and the cheap model is thrashing,
- you need web search or image input,
- you hit any model-side error you can't explain.
No state to change — it's just which wrapper you type.
8. Codex (DEFERRED — not seamless)
Current Codex requires the OpenAI Responses API; both DeepSeek and GLM serve only Chat Completions, so a direct model_providers entry will fail. Two options, both non-trivial:
- Recommended: migrate Codex usage into
claude-ds/claude-glmand retire Codex from the cost equation. - If you must keep Codex: stand up a LiteLLM proxy translating Responses→Chat, then point Codex at it:
This adds a always-on local service — defer until Phase 1 proves value.# ~/.codex/config.toml (requires a running LiteLLM proxy at :4000) model = "deepseek-v4-flash" [model_providers.litellm] base_url = "http://localhost:4000" env_key = "LITELLM_KEY" wire_api = "responses"
9. Cost instrumentation (prove the savings are real)
- Baseline: before switching, note 1 week of real-Claude spend (Anthropic console) for comparable work.
- After: the provider consoles are ground truth — DeepSeek dashboard / Z.AI console show actual billed tokens & $.
/costinside Claude Code estimates against Anthropic pricing and will be wrong for third-party endpoints; don't trust it for $. - Caching check (Gate B): run the same prompt twice in one session; compare billed input tokens on the provider dashboard. Flat = caching honored; full re-bill = ignored (raises real cost).
- Track fallback rate — how often you reach for
claude-pro. High fallback rate erodes the savings and signals the cheap tier isn't carrying your workload.
10. Capability-parity eval (the gate before trusting cheap-by-default)
Before flipping alias claude=claude-ds, run a fixed eval:
- Pick 8–10 representative tasks from your real work (a bug fix, a refactor, a feature, a multi-file change, a tool-heavy task).
- Run each under
claude-ds(andclaude-glmif live). Score: pass / partial / fail, plus latency and whether you had to fall back. - Compare against a
claude-prorun of the same tasks. - Decision rule: if cheap tier passes ≥ ~80% unaided and fallback rate is low, make it default. Otherwise keep
claude-proas default and use cheap selectively.
11. Rollback
Nothing structural changes, so rollback is trivial:
- Stop using
claude-ds/claude-glm; useclaude-pro(or plainclaude). - If you set
alias claude=claude-ds, remove the alias line andsource ~/.zshrc. - Delete any per-project
.claude/settings.jsonenv block. - OMEGA, hooks, MCP, memory are unaffected throughout — no cleanup needed.
12. Open items carried from investigation
- Gate A: GLM
/api/anthropicPAYG entitlement (curl in Section 2). - Gate B: caching behavior per provider (Section 9 method).
- Select + configure an MCP web-search server to replace the broken native tool.
- Run the parity eval (Section 10) before cheap-by-default.
- Decide Codex: migrate to Claude Code vs. stand up LiteLLM proxy.