Skip to main content
Home/Blog/Edge Caching for LLM Requests: Stop Paying to Answer the Same Question Twice
Artificial Intelligence

Edge Caching for LLM Requests: Stop Paying to Answer the Same Question Twice

A surprising share of LLM traffic is repeats — identical prompts re-run from scratch. Caching responses at the edge serves those instantly for near-zero cost. Here's how LLM caching works, what to cache, and the pitfalls.

By InventiveHQ Team

The repeats problem

Production LLM traffic is far more repetitive than most teams assume. Health checks and warm-up pings hit the same endpoint with the same canned prompt on a schedule. Retries replay an identical request after a timeout. Autocomplete, classification, moderation, and FAQ flows send the same handful of inputs thousands of times a day. RAG pipelines re-ask the same grounded question whenever a popular document is viewed.

Every one of those repeats, by default, runs the model from scratch — and you pay full freight each time. With output tokens billed at 3x–6x input across nearly every provider (Claude is 5x), and reasoning/"thinking" tokens billed as output, a single repeated answer can be expensive. Re-running it is pure waste: the inputs are identical, so the output is (or should be) too.

Caching turns that waste into a near-free lookup. An LLM call is a function of its inputs — same model, same prompt, same sampling parameters yields the same distribution over outputs. If you have already computed the answer, store it and serve it again. This is exactly why providers ship prompt caching that cuts cached-input cost to roughly 10% of standard (Anthropic's prompt caching cuts cached input by up to 90%; DeepSeek-V3 cache-hit input drops to ~$0.014 vs ~$0.14). Edge response caching takes the same idea one tier further out — and makes the repeat cost effectively zero.

How LLM caching works

The mechanics are the same as any cache, with one twist: the key has to capture everything that determines the answer.

  1. A request arrives. You normalize it (trim whitespace, canonicalize JSON field order, strip fields that don't affect output) and compute a cache key — typically a hash of model ID + full prompt + sampling parameters.
  2. You look the key up. On a hit, return the stored response immediately — no inference, no GPU, no per-token billing.
  3. On a miss, run inference, then write the response under that key with a TTL before returning it.

Where the cache lives matters as much as the logic. A cache on the inference box (llama.cpp's prompt cache, vLLM's prefix cache via PagedAttention) only helps after the request has already traveled to your hardware. An edge cache — a key/value lookup in front of inference, close to the user — returns hits in single-digit milliseconds and never wakes a GPU at all. The two compose: check the edge first, fall through to the box's prefix cache, fall through to live inference only when both miss.

Exact-match vs semantic caching

There are two fundamentally different ways to decide "have I seen this before."

Exact-match cacheSemantic cache
Match ruleByte-identical key (hash of model + prompt + params)Embedding within a cosine-similarity threshold
Hit rateLower — only literal repeatsHigher — catches paraphrases
False positivesNonePossible — similar-looking, different-meaning prompts
Per-request costOne hash + one KV lookupEmbed the prompt + vector search
TuningNone (it's deterministic)Threshold, embedding model, index — all need tuning
Best forHealth checks, retries, idempotent calls, fixed promptsFAQ/support phrasings, search, high-variation NL input
Failure modeMisses a near-duplicate (cheap mistake)Serves a wrong-but-similar answer (expensive mistake)

Exact-match is the safe default: zero false positives, no extra inference, trivial to reason about. Its weakness is that "What's your refund policy?" and "what is the refund policy" are different keys, so it misses obvious paraphrases.

Semantic caching closes that gap by embedding the prompt and returning a cached answer when an existing entry is within a similarity threshold (cosine similarity ≥ ~0.95, tuned per workload). The payoff is a much higher hit rate on natural-language input. The danger is real: set the threshold too loose and "How do I cancel my subscription?" can match "How do I upgrade my subscription?" and serve the wrong answer with full confidence. Start with exact-match, add a semantic tier only where you measure enough paraphrase traffic to justify it, and keep the threshold conservative.

What to cache — and what never to

The deciding question is simple: is this output a stable function of its inputs? If yes, cache it. If the same input can legitimately produce a different correct answer, don't.

Cache itNever cache it
Deterministic calls (temperature: 0, fixed seed)High-temperature / creative generation
FAQ, docs, and support answersPersonalized answers (per user/account/session)
Classification, moderation, extraction, taggingTime-sensitive data (prices, stock, "today's…")
Retrieval answers over stable contextRetrieval over context that changes per request
Idempotent tool calls and health checksAnything legally required to be freshly generated
Embeddings (pure function of input text)Streaming partials you can't replay safely

Two operational levers govern the boundary:

  • TTL. Pick a time-to-live that matches how fast the underlying truth changes. Documentation answers can live for days; anything touching live data should be minutes or seconds — or not cached at all.
  • Invalidation. When the source of truth changes, the cache must follow. Bump a version segment in the key (e.g. include a kb_version or model snapshot) so a deploy or content update instantly orphans every stale entry without a manual purge. This is far safer than chasing individual keys.

When you cache personalized output, scope the key to the tenant or user so one person's answer can never leak to another (see Pitfalls).

Caching at the edge

Inference is the expensive tier. The whole point of an edge cache is to answer as many requests as possible before a request ever reaches a GPU. Picture three tiers, cheapest first:

Three-tier LLM request path: edge cache, own hardware, cloud burst Request 1. Edge cache ~ms, ~$0/token HIT → return miss 2. Own hardware no per-token fee prefix cache burst 3. Cloud full $/token failover Cost per answered request increases left → right cheapest most expensive

Tier 1 — the edge cache — is the cheapest token you will ever serve: a KV lookup, no inference, billed at essentially nothing. Tier 2 is your own hardware, where there are no per-token fees on requests the cache missed. Tier 3 is cloud burst for the long tail and for failover when your hardware is saturated or down.

This is precisely the shape of an edge-first AI gateway like WideAreaAI (WAI) — one OpenAI-compatible endpoint that does request-level routing: it checks an edge cache, routes the miss to your own llama.cpp node over a Cloudflare Tunnel, and bursts to a cloud provider for failover when you choose. The model is "own your baseline, burst to the cloud" — a markup-free baseline on hardware you control, with the edge cache tier handling the repeats for free. (To be clear, this is request-level routing and caching across whole nodes, not tensor-parallel model-splitting across machines — that's a different problem.) Reach for a gateway like this when you actually hit the wall: you need a single endpoint, you need failover, or you need a shared cache in front of inference. If you just want to know whether your own box can carry the baseline, size it first with the self-hosted LLM cost calculator and the LLM token counter.

Pitfalls

Caching LLM responses goes wrong in specific, avoidable ways.

  • Under-keyed responses (the big one). If your key omits the model, the sampling parameters, or the system prompt, you will serve answers that no longer match the request. A temperature: 0 answer cached under a key that ignores temperature gets returned to a temperature: 1.2 creative request. Hash everything that influences output: model ID and version, the full normalized prompt (all roles/messages), temperature, top_p, top_k, max_tokens, seed, stop sequences, and any tool/function schema. When in doubt, include it.
  • Privacy leakage across tenants. A shared cache without a tenant/user scope can serve User A's personalized answer to User B — a data breach, not a cache miss. Namespace keys by tenant for anything personalized, and never cache content that contains another user's PII under a globally shared key.
  • Cache poisoning. If untrusted input can write to the cache, an attacker can seed a malicious or wrong answer that you then serve to everyone. Only cache responses from trusted inference paths, validate before storing, and treat the cache as part of your trust boundary.
  • Staleness. A TTL that is too long serves yesterday's truth. Match TTL to the volatility of the underlying data, and prefer version-segment invalidation (bump a key segment on deploy/content change) over hoping individual entries expire in time.
  • Semantic mis-serve. Covered above: too-loose a similarity threshold returns a confidently wrong answer. Keep it conservative and measure false-positive rate on real traffic before loosening.

Conclusion

A large share of LLM traffic is repeats, and repeats don't need a GPU — they need a lookup. Start with an exact-match cache keyed on model, prompt, and every sampling parameter; cache only deterministic, non-personalized, time-stable outputs; and put the cache at the edge so hits never reach inference. Add a conservative semantic tier only where you can measure the paraphrase traffic to justify it. Tier the rest — edge cache, then your own hardware, then cloud burst — and the repeats become the cheapest token you will ever serve. Size the economics first with the self-hosted LLM cost calculator and the LLM token counter.

Frequently Asked Questions

Find answers to common questions

Yes. An LLM API call is a pure function of its inputs (model, prompt, and sampling parameters), so identical — or semantically similar — requests can be served from a cache instead of re-running inference. Exact-match caching is simple and safe and is what providers already do internally for prompt prefixes. Semantic caching, which matches 'close enough' prompts via embeddings, raises hit rate but introduces a mis-serve risk you have to tune for.

Exact-match caching returns a stored response only when the request hashes byte-for-byte identical (same model, same prompt, same parameters). It is predictable and has zero false positives, but only catches literal repeats. Semantic caching embeds the prompt and returns a cached answer when an existing entry is within a cosine-similarity threshold — higher hit rate, but it can serve the wrong answer when two prompts look similar yet mean different things, so it needs a tuned threshold and a per-request embedding lookup.

Never cache high-temperature or deliberately creative outputs, personalized answers (anything keyed to a specific user, account, or session), time-sensitive data (prices, stock, 'today's…'), or anything legally required to be freshly generated. Cache the deterministic, repeatable, non-personalized queries — FAQ answers, classification, retrieval-grounded responses with stable context, and idempotent tool/health-check calls.

It scales with your repeat rate. If 30% of requests are exact repeats and you cache them, you cut inference cost and latency on that 30% to near zero — a cached hit is the cheapest token you will ever serve. For FAQ-style and retrieval workloads repeat rates are often higher. Model your own numbers with the self-hosted LLM cost calculator and the LLM token counter.

Both, in tiers. An edge cache (e.g. a Workers KV lookup) sits in front of inference and returns hits in single-digit milliseconds without ever waking a GPU, so it should be checked first. A local cache on the inference box still helps for prefix reuse and KV-cache warmth, but it only triggers after the request has already reached your hardware. Check the edge first, fall through to the box, fall through to live inference.

Hash everything that changes the output: the model ID and version, the full normalized prompt (all messages/roles), and every sampling parameter that affects determinism — temperature, top_p, top_k, max_tokens, seed, stop sequences, and any tool/function schema. Add a tenant or user scope when responses are personalized so one user's answer can never be served to another. Omit only fields that provably do not change the output.

Let's turn this knowledge into action

Our experts can help you apply these insights to your specific situation. No sales pitch — just a technical conversation.