The repeats problem
Production LLM traffic is far more repetitive than most teams assume. Health checks and warm-up pings hit the same endpoint with the same canned prompt on a schedule. Retries replay an identical request after a timeout. Autocomplete, classification, moderation, and FAQ flows send the same handful of inputs thousands of times a day. RAG pipelines re-ask the same grounded question whenever a popular document is viewed.
Every one of those repeats, by default, runs the model from scratch — and you pay full freight each time. With output tokens billed at 3x–6x input across nearly every provider (Claude is 5x), and reasoning/"thinking" tokens billed as output, a single repeated answer can be expensive. Re-running it is pure waste: the inputs are identical, so the output is (or should be) too.
Caching turns that waste into a near-free lookup. An LLM call is a function of its inputs: the same model, prompt, and sampling parameters yield the same distribution over outputs. If you have already computed the answer, store it and serve it again. This is the same reasoning behind provider-side prompt caching, which cuts cached-input cost to roughly 10% of standard (Anthropic's prompt caching cuts cached input by up to 90%; DeepSeek-V3 cache-hit input drops to ~$0.014 vs ~$0.14 per million tokens). Edge response caching takes the same idea one tier further out and makes the repeat cost effectively zero.
How LLM caching works
The mechanics are the same as any cache, with one twist: the key has to capture everything that determines the answer.
- A request arrives. You normalize it (trim whitespace, canonicalize JSON field order, strip fields that don't affect output) and compute a cache key — typically a hash of model ID + full prompt + sampling parameters.
- You look the key up. On a hit, return the stored response immediately — no inference, no GPU, no per-token billing.
- On a miss, run inference, then write the response under that key with a TTL before returning it.
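The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production cache: the in-memory `_store` dict stands in for a real edge KV store, the list of key fields is an assumption about which request fields affect output, message `content` is assumed to be a plain string, and `run_inference` is a hypothetical callable you would supply.

```python
import hashlib
import json
import time
from typing import Callable

# In-memory stand-in for an edge KV store: key -> (expiry timestamp, response).
_store: dict[str, tuple[float, str]] = {}

# Assumed set of request fields that influence the output; adjust per API.
KEY_FIELDS = ("model", "messages", "temperature", "top_p", "max_tokens", "seed", "stop")


def cache_key(request: dict) -> str:
    """Normalize the request and hash model + prompt + sampling parameters."""
    # Drop fields that are assumed not to affect the output (e.g. a user tag).
    relevant = {k: request[k] for k in KEY_FIELDS if k in request}
    # Trim whitespace so trivially different prompts share a key.
    if "messages" in relevant:
        relevant["messages"] = [
            {**m, "content": m["content"].strip()} for m in relevant["messages"]
        ]
    # sort_keys canonicalizes JSON field order before hashing.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_completion(
    request: dict,
    run_inference: Callable[[dict], str],
    ttl_seconds: float = 3600.0,
) -> str:
    key = cache_key(request)
    entry = _store.get(key)
    if entry is not None and entry[0] > time.time():
        return entry[1]  # Hit: return the stored response, no inference.
    response = run_inference(request)  # Miss: run the model.
    _store[key] = (time.time() + ttl_seconds, response)  # Write with a TTL.
    return response
```

Note that this sketch treats `0` and `0.0` as different temperatures because they serialize differently; a real normalizer would probably coerce numeric types first.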
Where the cache lives matters as much as the logic. A cache on the inference box (llama.cpp's prompt cache, vLLM's prefix cache via PagedAttention) only helps after the request has already traveled to your hardware. An edge cache — a key/value lookup in front of inference, close to the user — returns hits in single-digit milliseconds and never wakes a GPU at all. The two compose: check the edge first, fall through to the box's prefix cache, fall through to live inference only when both miss.
Exact-match vs semantic caching
There are two fundamentally different ways to decide "have I seen this before."
| | Exact-match cache | Semantic cache |
|---|---|---|
| Match rule | Byte-identical key (hash of model + prompt + params) | Embedding within a cosine-similarity threshold |
| Hit rate | Lower — only literal repeats | Higher — catches paraphrases |
| False positives | None | Possible — similar-looking, different-meaning prompts |
| Per-request cost | One hash + one KV lookup | Embed the prompt + vector search |
| Tuning | None (it's deterministic) | Threshold, embedding model, index — all need tuning |
| Best for | Health checks, retries, idempotent calls, fixed prompts | FAQ/support phrasings, search, high-variation NL input |
| Failure mode | Misses a near-duplicate (cheap mistake) | Serves a wrong-but-similar answer (expensive mistake) |
Exact-match is the safe default: zero false positives, no extra inference, trivial to reason about. Its weakness is that "What's your refund policy?" and "what is the refund policy" are different keys, so it misses obvious paraphrases.
Semantic caching closes that gap by embedding the prompt and returning a cached answer when an existing entry is within a similarity threshold (cosine similarity ≥ ~0.95, tuned per workload). The payoff is a much higher hit rate on natural-language input. The danger is real: set the threshold too loose and "How do I cancel my subscription?" can match "How do I upgrade my subscription?" and serve the wrong answer with full confidence. Start with exact-match, add a semantic tier only where you measure enough paraphrase traffic to justify it, and keep the threshold conservative.
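To make the threshold check concrete, here is a rough sketch of a semantic cache in plain Python. The `SemanticCache` class and its linear scan are illustrative assumptions; a real deployment would use an actual embedding model (passed in here as the hypothetical `embed` callable) and a vector index rather than a list.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as matching nothing.
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy semantic cache: linear scan over stored (embedding, answer) pairs."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # Hypothetical embedding function.
        self.threshold = threshold  # Conservative default; tune per workload.
        self.entries: list[tuple[Vector, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        # Serve the closest entry only if it clears the threshold.
        return best_answer if best_score >= self.threshold else None

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((self.embed(prompt), answer))
```

Every `get` pays for one embedding call plus a search, which is the per-request cost the table above attributes to the semantic tier.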
What to cache — and what never to
The deciding question is simple: is this output a stable function of its inputs? If yes, cache it. If the same input can legitimately produce a different correct answer, don't.
| Cache it | Never cache it |
|---|---|
| Deterministic calls (temperature: 0, fixed seed) | High-temperature / creative generation |
| FAQ, docs, and support answers | Personalized answers (per user/account/session) |
| Classification, moderation, extraction, tagging | Time-sensitive data (prices, stock, "today's…") |
| Retrieval answers over stable context | Retrieval over context that changes per request |
| Idempotent tool calls and health checks | Anything legally required to be freshly generated |
| Embeddings (pure function of input text) | Streaming partials you can't replay safely |
Two operational levers govern the boundary:
- TTL. Pick a time-to-live that matches how fast the underlying truth changes. Documentation answers can live for days; anything touching live data should be minutes or seconds — or not cached at all.
- Invalidation. When the source of truth changes, the cache must follow. Bump a version segment in the key (e.g. include a `kb_version` or model snapshot) so a deploy or content update instantly orphans every stale entry without a manual purge. This is far safer than chasing individual keys.
If you do cache personalized output despite the default above, scope the key to the tenant or user so one person's answer can never leak to another (see Pitfalls).
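One way to combine version-segment invalidation with tenant scoping is to wrap the request hash in a second key that carries both. The sketch below is an assumption about how that could look: `scoped_key`, its parameters, and the key layout are hypothetical names, and `request_hash` is whatever hash you already compute over model, prompt, and sampling parameters.

```python
import hashlib
import json
from typing import Optional


def scoped_key(request_hash: str, kb_version: str, tenant_id: Optional[str] = None) -> str:
    """Wrap a request hash with a version segment and an optional tenant scope."""
    # A JSON list keeps the segments unambiguous even if an ID contains
    # a character that a simple delimiter-joined string would confuse.
    segments = [
        "tenant" if tenant_id is not None else "shared",
        tenant_id,
        kb_version,   # Bump on deploy or content change to orphan old entries.
        request_hash,
    ]
    raw = json.dumps(segments, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Bumping `kb_version` changes every key at once, so stale entries are simply never looked up again and age out through their TTL.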
Caching at the edge
Inference is the expensive tier. The whole point of an edge cache is to answer as many requests as possible before a request ever reaches a GPU. Picture three tiers, cheapest first:
Tier 1 — the edge cache — is the cheapest token you will ever serve: a KV lookup, no inference, billed at essentially nothing. Tier 2 is your own hardware, where there are no per-token fees on requests the cache missed. Tier 3 is cloud burst for the long tail and for failover when your hardware is saturated or down.
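The three tiers can be expressed as a simple fall-through. This is a sketch of the routing idea only, under stated assumptions: `route`, `NodeUnavailable`, `local_infer`, and `cloud_infer` are hypothetical names, a plain dict stands in for the edge KV store, and TTLs and health checks are left out.

```python
from typing import Callable


class NodeUnavailable(Exception):
    """Hypothetical error raised when your own hardware is saturated or down."""


def route(
    key: str,
    request: dict,
    edge_cache: dict[str, str],
    local_infer: Callable[[dict], str],
    cloud_infer: Callable[[dict], str],
) -> tuple[str, str]:
    """Return (tier that answered, response), trying the cheapest tier first."""
    # Tier 1: edge cache. A hit never reaches a GPU.
    cached = edge_cache.get(key)
    if cached is not None:
        return "edge-cache", cached
    try:
        # Tier 2: your own hardware handles the cache miss.
        tier, response = "own-hardware", local_infer(request)
    except NodeUnavailable:
        # Tier 3: burst to the cloud only when the local node cannot serve.
        tier, response = "cloud-burst", cloud_infer(request)
    # Store the result so the next identical request is a Tier 1 hit.
    edge_cache[key] = response
    return tier, response
```

Returning the tier alongside the response makes it easy to log hit rates per tier, which is the number that tells you how much traffic the cache is actually absorbing.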
This is precisely the shape of an edge-first AI gateway like WideAreaAI (WAI) — one OpenAI-compatible endpoint that does request-level routing: it checks an edge cache, routes the miss to your own llama.cpp node over a Cloudflare Tunnel, and bursts to a cloud provider for failover when you choose. The model is "own your baseline, burst to the cloud" — a markup-free baseline on hardware you control, with the edge cache tier handling the repeats for free. (To be clear, this is request-level routing and caching across whole nodes, not tensor-parallel model-splitting across machines — that's a different problem.) Reach for a gateway like this when you actually hit the wall: you need a single endpoint, you need failover, or you need a shared cache in front of inference. If you just want to know whether your own box can carry the baseline, size it first with the self-hosted LLM cost calculator and the LLM token counter.
Pitfalls
Caching LLM responses goes wrong in specific, avoidable ways.
- Under-keyed responses (the big one). If your key omits the model, the sampling parameters, or the system prompt, you will serve answers that no longer match the request. A `temperature: 0` answer cached under a key that ignores temperature gets returned to a `temperature: 1.2` creative request. Hash everything that influences output: model ID and version, the full normalized prompt (all roles/messages), `temperature`, `top_p`, `top_k`, `max_tokens`, `seed`, `stop` sequences, and any tool/function schema. When in doubt, include it.
- Privacy leakage across tenants. A shared cache without a tenant/user scope can serve User A's personalized answer to User B, which is a data breach, not a cache miss. Namespace keys by tenant for anything personalized, and never cache content that contains another user's PII under a globally shared key.
- Cache poisoning. If untrusted input can write to the cache, an attacker can seed a malicious or wrong answer that you then serve to everyone. Only cache responses from trusted inference paths, validate before storing, and treat the cache as part of your trust boundary.
- Staleness. A TTL that is too long serves yesterday's truth. Match TTL to the volatility of the underlying data, and prefer version-segment invalidation (bump a key segment on deploy/content change) over hoping individual entries expire in time.
- Semantic mis-serve. Covered above: too-loose a similarity threshold returns a confidently wrong answer. Keep it conservative and measure false-positive rate on real traffic before loosening.
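Measuring the false-positive rate before loosening a semantic threshold can be as simple as replaying labeled traffic. The sketch below assumes you have collected pairs of (similarity score, whether the two prompts really had the same intent); the function name and the sample format are illustrative, and the rate is defined here as the share of would-be hits that are wrong.

```python
def false_positive_rate(samples: list[tuple[float, bool]], threshold: float) -> float:
    """Share of would-be cache hits at this threshold that are wrong matches.

    Each sample is (similarity_score, same_intent), where same_intent is a
    human or offline label saying the cached answer would have been correct.
    """
    # Labels of every pair the cache would have served at this threshold.
    hits = [same_intent for score, same_intent in samples if score >= threshold]
    if not hits:
        return 0.0  # No hits at this threshold, so nothing was mis-served.
    wrong = sum(1 for same_intent in hits if not same_intent)
    return wrong / len(hits)


# Hypothetical labeled samples, for illustration only.
labeled = [(0.99, True), (0.96, True), (0.93, False), (0.91, True)]
strict = false_positive_rate(labeled, 0.95)  # Only the two correct pairs hit.
loose = false_positive_rate(labeled, 0.90)   # The mismatched pair now hits too.
```

Sweeping the threshold over real traffic this way shows how much hit rate each step buys and how many wrong answers it costs.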
Conclusion
A large share of LLM traffic is repeats, and repeats don't need a GPU — they need a lookup. Start with an exact-match cache keyed on model, prompt, and every sampling parameter; cache only deterministic, non-personalized, time-stable outputs; and put the cache at the edge so hits never reach inference. Add a conservative semantic tier only where you can measure the paraphrase traffic to justify it. Tier the rest — edge cache, then your own hardware, then cloud burst — and the repeats become the cheapest token you will ever serve. Size the economics first with the self-hosted LLM cost calculator and the LLM token counter.