Skip to main content
OpenAIintermediate

How to Reduce Codex CLI & OpenAI API Token Costs

Cut your OpenAI API bill from Codex CLI and the OpenAI SDK. Use prompt caching, smaller models, batching, context trimming, and local routing to spend fewer tokens without losing quality.

10 min readUpdated June 2026

Want us to handle this for you?

Get expert help →

OpenAI Codex CLI and the OpenAI API bill per token, and a busy day of coding can add up faster than expected. The good news is that most spend is avoidable: it comes from re-sending context, using a premium model for trivial tasks, and paying full price for requests that could have been cached or routed elsewhere. This guide walks through the highest-impact ways to lower the bill without giving up the assistance you rely on.

Understand Where Tokens Go

You are billed for input tokens (everything you send) and output tokens (everything the model generates). In an agentic CLI like Codex, input usually dominates because every turn re-sends:

  • The system prompt and tool definitions
  • The running conversation history
  • File context Codex has read into the session

A single long session over a large file can re-send tens of thousands of input tokens per turn. Output is comparatively small. That means your biggest savings come from controlling context, not from writing shorter questions.

Check your actual usage before optimizing:

# OpenAI usage dashboard breaks spend down by model and day
# platform.openai.com/usage

Tactic 1: Trim and Reset Context

Long-lived sessions accumulate history that you pay for on every turn.

  • Start a fresh session for unrelated tasks. Do not ask Codex about your auth module in the same session you were debugging CSS.
  • Avoid pulling huge files into context when a snippet will do. Reference the specific function rather than the whole 2,000-line file.
  • Summarize and restart for very long sessions instead of letting history grow without bound.

This single habit often produces the largest reduction because it attacks the multiplier (history re-sent per turn), not a one-time cost.

Tactic 2: Lean on Prompt Caching

OpenAI automatically caches identical prompt prefixes longer than 1,024 tokens and bills the cached portion at a steep discount. You do not enable it, but you can structure prompts to benefit:

  • Keep the stable content at the front. Put the system message, instructions, and unchanging file context first; put the variable part (your specific question) last. A stable prefix stays cacheable across calls.
  • Reuse the same context shape. Re-reading the same file in the same order keeps the prefix identical, so repeat calls hit the cache.

For scripted OpenAI SDK usage, the same rule applies: identical leading messages across requests get the cached discount automatically.

Tactic 3: Match the Model to the Task

The most capable coding model is also the most expensive. Most day-to-day requests do not need it.

Set up profiles in ~/.codex/config.toml so switching is one flag:

# ~/.codex/config.toml

# Cheap default for routine work
model = "gpt-5.2-codex-mini"

[profiles.heavy]
model_provider = "openai"
model = "gpt-5.2-codex"
# Routine — cheap model
codex "add a docstring to this function"

# Hard — premium model, only when it earns its cost
codex --profile heavy "refactor this module to use dependency injection"

Reserving the premium model for genuinely hard problems is one of the easiest structural savings available.

Tactic 4: Batch Independent Requests

If you are scripting many small, independent completions (for example generating docstrings for 200 functions), do not send them as one giant interactive session. Two cheaper paths:

  • The Batch API processes large request sets asynchronously at a significant discount versus synchronous calls, when you do not need an instant response.
  • Group locally first. Combine several tiny related asks into one request rather than paying the fixed prompt overhead 200 times.

Tactic 5: Route Routine Work to a GPU You Own

The tactics above shrink the bill; local routing can remove it for a large share of requests. Because Codex CLI and the OpenAI SDK speak the OpenAI-compatible API, you can point them at a local coding model (Qwen-Coder, DeepSeek-Coder, CodeLlama) served as an OpenAI-compatible endpoint. Requests that model handles well, explanations, simple edits, docstrings, cost nothing per token.

# Route to a local OpenAI-compatible server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="local"
codex --model qwen2.5-coder:14b "explain this function"

The catch with a hard split is that local hardware is not always the right place to run a request: a complex refactor still wants cloud quality, and your GPU is sometimes busy or offline. That is where an edge-first gateway helps. A gateway sits at the base URL Codex already uses, serves repeated prompts from an edge cache, runs eligible requests on the GPU you own for free, and bursts to the cloud only on failover or for tasks that need it. Wide Area AI implements exactly this routing, so you get the cost profile of local inference with the reliability of the cloud, without changing how Codex is configured.

Putting It Together

A practical low-cost setup combines all five tactics:

  1. Reset sessions between unrelated tasks and keep context tight.
  2. Order prompts so the stable prefix stays cacheable.
  3. Default to a cheap or local model; escalate to premium per command.
  4. Batch large independent jobs through the Batch API.
  5. Route routine requests to hardware you own and burst to cloud only when needed.

Each tactic is independent, so adopt them incrementally and watch the usage dashboard confirm the drop.

Troubleshooting

My Bill Did Not Drop After Switching Models

Confirm the cheap model is actually the default. Run codex --version and check that no environment variable or profile is forcing the premium model. Verify against the per-model breakdown in the usage dashboard.

Caching Does Not Seem to Apply

Cached discounts require an identical prefix over 1,024 tokens. If the variable part of your prompt is near the front, every call invalidates the cache. Move stable content forward.

Local Model Quality Is Too Low for Some Tasks

That is expected. Use local routing for routine work and escalate hard tasks to the cloud. A gateway automates this split so you do not have to decide per request.

Next Steps

Cut your token bill

Stop paying per token — route to your own GPU

Wide Area Intelligence is an edge-first AI gateway: repeated requests hit an edge cache, the rest run on GPUs you own for free, and the cloud is only a burst-capacity failover. OpenAI-compatible, bring your own key.

Start routing — free

Frequently Asked Questions

Find answers to common questions

Input tokens usually dominate. Codex re-sends conversation history and file context on every turn, so a long session or a large included file costs tokens on each request. Trimming context and starting fresh sessions for unrelated tasks cuts the bill more than shortening your prompts.

Yes. OpenAI automatically caches identical prompt prefixes over 1,024 tokens and bills cached input tokens at a large discount. Keep the stable part of your prompt (system message, file context) at the front so it stays cacheable, and the discount applies on repeat calls.

For routine tasks, yes. Reserve the most expensive coding model for complex multi-file work and use a smaller or local model for explanations, docstrings, and simple edits. A profile-based setup lets you switch per command instead of paying premium rates for everything.

Requests served by a GPU you own cost nothing per token after the hardware. If half your daily Codex calls are routine tasks a local coding model handles well, routing those locally and bursting to the cloud only for hard problems can cut a large share of the API bill.