Local LLM Performance: What Tokens-Per-Second to Expect From Your Hardware

Why local inference is memory-bandwidth bound, what tokens/sec you'll realistically get from a 4090, a 5090, an H100, or an M-series Mac, and how model size, quantization, and context change the numbers.

By InventiveHQ Team

The one idea that explains local LLM speed

Almost every "why is my local model slow?" question has the same answer: single-stream text generation is memory-bandwidth bound, not compute bound. A dense transformer has to read every one of its weights out of memory to produce each new token. So the ceiling on generation speed is just arithmetic:

tok/s  ≈  memory_bandwidth  ÷  active_bytes_per_token

where active_bytes_per_token is essentially the size of the weights the model touches per token. For a dense model that is the whole weight file; for a mixture-of-experts model it is only the active experts. Real-world throughput lands at roughly 50–80% of this ceiling once you account for attention over the KV cache, kernel overhead, and sampling.

A worked example. An 8B model quantized to Q4_K_M is about 5 GB of weights. On an RTX 4090 with ~1008 GB/s of bandwidth, the ceiling is 1008 / 5 ≈ 200 tok/s, and you measure 90–140 tok/s in practice: about 45–70% of the ceiling, at or just under the 50–80% band. Multiply the weights ~8x (a 70B at the same quant is ~40 GB) and the ceiling drops to 1008 / 40 ≈ 25 tok/s, and that is before accounting for the fact that 40 GB does not fit in the card's 24 GB of VRAM.
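The arithmetic in this example can be sketched in a few lines of Python. This is only a minimal illustration of the formula: the function names are made up for this sketch, and the 50–80% band is the rough rule of thumb quoted above, not a measured constant.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on single-stream decode speed, in tokens/sec."""
    return bandwidth_gb_s / active_weight_gb


def realistic_range_tps(ceiling_tps: float, low: float = 0.5, high: float = 0.8) -> tuple[float, float]:
    """Scale a ceiling by the rough 50-80% efficiency band (a rule of thumb)."""
    return ceiling_tps * low, ceiling_tps * high


# RTX 4090 (~1008 GB/s) running an 8B model at Q4_K_M (~5 GB of weights).
ceiling_8b = decode_ceiling_tps(1008, 5)
low_8b, high_8b = realistic_range_tps(ceiling_8b)
print(f"8B Q4:  ceiling {ceiling_8b:.0f} tok/s, band {low_8b:.0f}-{high_8b:.0f} tok/s")

# Same card, 70B at Q4 (~40 GB). This is the bandwidth ceiling only;
# it ignores the fact that 40 GB does not fit in 24 GB of VRAM.
ceiling_70b = decode_ceiling_tps(1008, 40)
print(f"70B Q4: ceiling {ceiling_70b:.0f} tok/s")
```

For a mixture-of-experts model, pass the size of the active experts rather than the whole weight file as `active_weight_gb`.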

The two levers fall straight out of the formula: raise bandwidth (a faster card) or shrink bytes per token (a smaller model or a more aggressive quant). Raw TFLOPs are almost irrelevant at batch size 1 — which is exactly why a 4090 and an H100 are closer on small-model decode than their FLOP ratings suggest, and why an Apple Mac with modest GPU compute but unified memory is competitive at all. Plug your own numbers into the LLM inference speed calculator to see the ceiling for any card/model/quant combination.

Prefill vs decode: two workloads, two metrics

Inference has two phases, and conflating them is the most common source of confused benchmarks.

| Phase | What it does | Bottleneck | Parallel? | Metric it sets |
| --- | --- | --- | --- | --- |
| Prefill | Processes the whole prompt at once | Compute (large GEMMs) | Yes: all prompt tokens together | Time-to-first-token (TTFT) |
| Decode | Generates the reply, one token at a time | Memory bandwidth | No: token t needs token t−1 | Tokens/sec |

Prefill runs your entire prompt through the network in a single batched pass. Because all prompt tokens are processed together, the GPU does big, dense matrix multiplies and saturates its compute units — this is the one phase where TFLOPs matter. It is fast per token, and it determines how long you wait before the first word appears.

Decode is the opposite. Each new token depends on the one before it, so generation is an inherently serial loop. Every step re-reads the full set of weights to emit a single token, which is why it is bandwidth-bound and why it dominates the felt speed of a chat. A long prompt makes TTFT worse (more to prefill); a long reply takes time proportional to decode tok/s. When a vendor quotes "throughput," ask which phase they mean.
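To see how the two phases combine into what a user actually waits for, here is a rough latency model. It is a sketch under simplifying assumptions: the prefill and decode rates below are hypothetical placeholders, not benchmarks, and the model ignores the slowdown of decode as context grows.

```python
def response_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> dict:
    """Split end-to-end latency into time-to-first-token and decode time."""
    ttft = prompt_tokens / prefill_tps        # prompt processed in one batched pass
    decode_time = output_tokens / decode_tps  # reply generated one token at a time
    return {"ttft_s": ttft, "decode_s": decode_time, "total_s": ttft + decode_time}


# Hypothetical rates for illustration only.
PREFILL_TPS = 5000.0
DECODE_TPS = 100.0

short_prompt = response_latency_s(200, 500, PREFILL_TPS, DECODE_TPS)
long_prompt = response_latency_s(8000, 500, PREFILL_TPS, DECODE_TPS)

for name, result in (("short prompt", short_prompt), ("long prompt", long_prompt)):
    print(f"{name}: TTFT {result['ttft_s']:.2f} s, "
          f"decode {result['decode_s']:.2f} s, total {result['total_s']:.2f} s")
```

With these placeholder rates, the longer prompt only moves TTFT, while the 500-token reply costs the same decode time in both cases.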

What to expect by hardware

Here are realistic single-stream decode numbers at Q4, synthesized from the bandwidth-bound model and community benchmarks (llama.cpp / Ollama / vLLM). Treat them as ballparks: actual figures move with engine, flash-attention, quant kernel, and context length. The key column is bandwidth — decode scales with it almost linearly.

| GPU | VRAM | Bandwidth | 8B Q4 (~5 GB) | 70B Q4 (~40 GB) |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | ~1008 GB/s | ~90–140 tok/s | doesn't fit; ~15–20 tok/s w/ CPU offload, ~18–25 on 2×4090 |
| RTX 5090 | 32 GB | ~1792 GB/s | ~140–220 tok/s | needs 2×5090 → ~30–45 tok/s |
| A100 80GB | 80 GB | ~2039 GB/s | ~120–200 tok/s | ~25–40 tok/s |
| H100 80GB | 80 GB | ~3350 GB/s | ~180–280 tok/s | ~40–60 tok/s |
| Apple M3 Max | up to 128 GB unified | ~300–400 GB/s | ~30–60 tok/s | ~6–10 tok/s |
| Apple M4 Max | up to 128 GB unified | ~410–546 GB/s | ~40–80 tok/s | ~8–13 tok/s |

Two things to notice. First, a 70B Q4 does not fit on a single 24 or 32 GB consumer card: the single-4090 figure in that column assumes slow CPU/RAM offload, so you want two cards, an 80 GB data-center GPU, or an Apple Mac whose unified memory holds the whole model. Second, the Apple machines punch above their compute weight precisely because unified memory gives the GPU 300–546 GB/s straight to a large pool. A 70B is slow on them, but it runs entirely in memory on a laptop, which a single 4090 cannot do.
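The fit-then-speed reasoning can be sketched as a small lookup over the table's VRAM and bandwidth figures. Treat it as illustrative: `headroom_gb` is a hypothetical allowance for the KV cache and runtime buffers, and the M4 Max entry assumes the top 128 GB / 546 GB/s configuration.

```python
# (VRAM in GB, bandwidth in GB/s), taken from the table above.
GPUS = {
    "RTX 4090": (24, 1008),
    "RTX 5090": (32, 1792),
    "A100 80GB": (80, 2039),
    "H100 80GB": (80, 3350),
    "Apple M4 Max (128 GB)": (128, 546),
}
MODELS_GB = {"8B Q4": 5, "70B Q4": 40}


def plan(vram_gb: float, bandwidth_gb_s: float, weight_gb: float,
         headroom_gb: float = 2.0) -> tuple[bool, float]:
    """Return (fits on one device, bandwidth ceiling in tok/s)."""
    fits = weight_gb + headroom_gb <= vram_gb
    return fits, bandwidth_gb_s / weight_gb


results = {}
for gpu, (vram, bandwidth) in GPUS.items():
    for model, weight in MODELS_GB.items():
        fits, ceiling = plan(vram, bandwidth, weight)
        results[(gpu, model)] = (fits, ceiling)
        status = f"ceiling ~{ceiling:.0f} tok/s" if fits else "does not fit on one device"
        print(f"{gpu:24s} {model:7s} {status}")
```

The ceilings printed here are upper bounds; the table's ranges sit below them, consistent with the 50–80% rule of thumb.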

The next chart shows the same decode numbers (midpoint of each range) for an 8B at Q4. Decode speed broadly tracks memory bandwidth rather than price or FLOPs, but the ranges overlap — clock speed, cache, kernel maturity, and quant implementation also move the numbers, so a higher-bandwidth card does not always post a higher midpoint.

[Chart: 8B model at Q4, single-stream decode tokens/sec by GPU (range midpoints). Apple M3 Max ~45, Apple M4 Max ~60, RTX 4090 ~115, A100 80GB ~160, RTX 5090 ~180, H100 80GB ~230.]

To map these to feel: human reading is ~5–10 tok/s, so even the slowest row here generates faster than you can read. Above ~20 tok/s a chat feels snappy; the difference between 140 and 230 tok/s is invisible to a single human reader and only matters when you batch or run agents. Check what your GPU can do with the LLM GPU benchmark, which measures real WebGPU bandwidth and tokens/sec in the browser, or start from a spec lookup in what LLM can I run.

How quantization and model size move the needle

Both levers act through the same active_bytes_per_token term. Model size scales it linearly: an 8B is ~5 GB at Q4, a 70B is ~40 GB, so on identical hardware the 70B runs roughly 8x slower (and may not fit). Quantization scales the bytes-per-weight:

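| Precision | Bytes/weight | 8B weights | Relative decode speed |
| --- | --- | --- | --- |
| FP16 / BF16 | 2.0 | ~16 GB | 1.0x (baseline) |
| Q8_0 | ~1.06 | ~8.5 GB | ~1.9x |
| Q5_K_M | ~0.71 | ~5.7 GB | ~2.8x |
| Q4_K_M | ~0.55 | ~4.4–4.9 GB | ~3.6x |
| Q3_K_M | ~0.49 | ~3.9 GB | ~4.1x |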

Quality loss is mild down to about 4-bit; below that, quality falls off a cliff. Q4_K_M is the default for a reason: roughly +3% perplexity over FP16 (Llama-3.1-8B: ~7.56 vs ~7.32) for a ~3.6x speedup and a bit over a quarter of the VRAM. Q5_K_M and Q6_K are near-lossless; Q8_0 is effectively indistinguishable from FP16. Below Q4, quality degrades sharply: Q3_K_M is usable but noticeably weaker, and Q2_K is clearly compromised (prefer an IQ2 i-quant if you are truly that tight on VRAM). The practical rule: quantize to the lowest tier that still fits comfortably and stays at or above Q4, and only go below 4-bit when nothing else fits. Size a specific model and quant against your card with the LLM VRAM calculator.
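The size and speed columns of the quantization table follow directly from bytes per weight, as this small sketch shows. The bytes-per-weight values are the approximate figures from the table, and the helper names are made up for this example.

```python
# Approximate bytes per weight, as listed in the table above.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.06,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.55,
    "Q3_K_M": 0.49,
}


def weight_size_gb(params_billions: float, bytes_per_weight: float) -> float:
    """Weight file size in GB: (params * 1e9) * bytes / 1e9."""
    return params_billions * bytes_per_weight


def relative_decode_speed(bytes_per_weight: float, baseline_bytes: float = 2.0) -> float:
    """Bandwidth-bound speedup versus the FP16 baseline."""
    return baseline_bytes / bytes_per_weight


for name, bpw in BYTES_PER_WEIGHT.items():
    size = weight_size_gb(8, bpw)
    speed = relative_decode_speed(bpw)
    print(f"{name:7s} {size:5.1f} GB  {speed:.1f}x")
```

The speedup column is a bandwidth-bound estimate; it assumes the quantized kernels are efficient enough that decode stays limited by memory reads.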

Context length and the KV cache

Two things get worse as your context grows. First, VRAM: the KV cache stores keys and values for every token, and it grows linearly with sequence length. A 70B-class model with grouped-query attention (GQA) costs about 0.3125 MB/token in FP16 — ~1.25 GB at 4K tokens but ~40 GB at 128K, on par with the weights themselves. Quantizing the KV cache to FP8/INT8 halves that. Second, speed: each decode step now also attends over the entire KV cache, so longer context means slower tokens/sec on top of the memory pressure.

| Context (70B GQA) | KV cache (FP16) | KV cache (FP8/INT8) |
| --- | --- | --- |
| 4K | ~1.25 GB | ~0.625 GB |
| 32K | ~10 GB | ~5 GB |
| 128K | ~40 GB | ~20 GB |

GQA (Llama-3 70B shares 8 KV heads across its 64 query heads, so the KV cache is 8x smaller than with one KV head per query head) and KV quantization are what make long context feasible at all; without GQA that 128K figure would be ~320 GB. If you mostly run short prompts, ignore this; if you are doing long-document RAG, budget VRAM for the KV cache and expect decode to slow as the window fills.
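The KV cache figures can be reproduced from the model's shape. The sketch below assumes the Llama-3 70B layout (80 layers, 8 KV heads, head dimension 128) and uses binary units (MiB, GiB), which is what makes the round numbers in the table come out exactly.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Total KV cache size in GiB for one sequence of the given length."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 2**30


# Assumed Llama-3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

print(kv_bytes_per_token(**LLAMA3_70B) / 2**20, "MiB per token (FP16)")
for label, tokens in (("4K", 4 * 1024), ("32K", 32 * 1024), ("128K", 128 * 1024)):
    fp16 = kv_cache_gib(tokens, **LLAMA3_70B)
    fp8 = kv_cache_gib(tokens, **LLAMA3_70B, bytes_per_value=1)
    print(f"{label:5s} FP16 {fp16:6.3f} GiB   FP8/INT8 {fp8:6.3f} GiB")

# Without GQA there would be one KV head per query head (64 instead of 8).
no_gqa_128k = kv_cache_gib(128 * 1024, n_layers=80, n_kv_heads=64, head_dim=128)
print(f"128K without GQA: {no_gqa_128k:.0f} GiB")
```

This counts a single sequence; a server holding many concurrent sequences multiplies the KV cache by the number of sequences in flight.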

Single user vs serving many (batching)

Everything above assumes batch size 1 — one person, one stream. Batching changes the regime. When a server like vLLM processes many requests concurrently with continuous (in-flight) batching, it reads each weight from memory once and reuses it across every sequence in the batch. The workload shifts from bandwidth-bound toward compute-bound: per-request tok/s stays roughly the same, but aggregate throughput can climb 10–20x before compute saturates.

That is why your single-user llama.cpp number and a production serving benchmark can differ by an order of magnitude on the same GPU — they are measuring different things. vLLM's PagedAttention stores the KV cache in non-contiguous fixed-size pages, which kills fragmentation and lets you pack far more concurrent sequences into the same VRAM, sustaining high batch sizes in practice.
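The shift from bandwidth-bound to compute-bound can be illustrated with a toy roofline model. Everything here is an assumption for illustration: an 8B model in FP16 (16 GB of weights), ~3350 GB/s of bandwidth, about 2 FLOPs per parameter per token, and a hypothetical 400 TFLOP/s of achievable compute. The model ignores per-sequence KV-cache reads, attention cost, and scheduling overhead, so it is more optimistic than real servers.

```python
WEIGHT_BYTES = 16e9          # assumed: 8B params at FP16
BANDWIDTH_BYTES_S = 3350e9   # assumed: ~3350 GB/s memory bandwidth
FLOPS_PER_TOKEN = 2 * 8e9    # approximation: 2 FLOPs per parameter per token
ACHIEVABLE_FLOPS_S = 400e12  # hypothetical sustained compute, not a spec-sheet value


def step_time_s(batch: int) -> float:
    """Time for one decode step that emits one token for every sequence in the batch."""
    memory_time = WEIGHT_BYTES / BANDWIDTH_BYTES_S               # weights read once per step
    compute_time = batch * FLOPS_PER_TOKEN / ACHIEVABLE_FLOPS_S  # math grows with batch size
    return max(memory_time, compute_time)


def per_request_tps(batch: int) -> float:
    """Tokens/sec seen by each individual request."""
    return 1.0 / step_time_s(batch)


def aggregate_tps(batch: int) -> float:
    """Tokens/sec summed over the whole batch."""
    return batch / step_time_s(batch)


for batch in (1, 8, 32, 128, 512):
    print(f"batch {batch:4d}: per-request {per_request_tps(batch):7.1f} tok/s, "
          f"aggregate {aggregate_tps(batch):9.1f} tok/s")
```

While the step is memory-bound, per-request speed is flat and aggregate throughput grows with batch size; once compute time dominates, aggregate throughput plateaus and per-request speed starts to fall. Real gains are smaller than this toy model suggests because each sequence also reads its own KV cache.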

The practical split:

  • One user / interactive chat / a laptop: optimize for single-stream decode. Bandwidth and quant are your levers; llama.cpp, Ollama, or LM Studio are the right tools.
  • Many users / an API / agents firing in parallel: optimize for aggregate throughput. Use vLLM with continuous batching on a data-center GPU, and measure tokens/sec/dollar, not single-request latency.

Find the break-even point between renting cloud API tokens and running your own batched server with the self-hosted LLM cost calculator.

Measure your own

Formulas give you the ceiling; your stack determines how close you get. Two tools close the loop without installing anything:

  • LLM GPU Benchmark — runs a real WebGPU workload in your browser and reports measured memory bandwidth and tokens/sec for your actual hardware.
  • LLM inference speed calculator — estimates tok/s from bandwidth, model size, and quant, and compares GPUs side by side so you can plan before you buy.
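If you would rather time a local stack directly, the same prefill/decode split can be measured around any token stream. This is a generic sketch, not tied to a particular engine: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for whatever iterator your runtime's streaming API yields.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and decode tokens/sec separately."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token marks the end of prefill
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    # Decode rate excludes the first token, whose latency is dominated by prefill.
    decode_tps = (count - 1) / (last - first) if count > 1 and last > first else float("nan")
    return {"tokens": count, "ttft_s": first - start, "decode_tps": decode_tps}


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a real engine: sleeps to imitate prefill and per-token decode."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(20, prefill_s=0.05, per_token_s=0.005)))
```

Reporting the two numbers separately keeps a long prompt from dragging down the apparent decode rate.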

Conclusion

Local LLM speed is not mysterious. For single-stream generation, memory bandwidth sets the ceiling and tok/s ≈ bandwidth ÷ active_bytes_per_token gets you within a factor of two before you run anything. Shrink the bytes (quantize, stay at or above Q4) or raise the bandwidth (a faster card, or unified memory on Apple Silicon) to go faster; remember that prefill and decode are different metrics, that long context taxes both VRAM and speed, and that batching is a separate throughput regime entirely. Then stop guessing and measure — the ceiling tells you what is possible, the benchmark tells you what you have.

Frequently Asked Questions


For single-stream generation, memory bandwidth is the bottleneck, not raw compute. A dense model reads every weight once per token, so tokens/sec ≈ memory bandwidth ÷ active model bytes per token. Quantizing from FP16 to Q4 roughly quarters the bytes moved and so roughly quadruples the ceiling. FLOPs barely matter at batch size 1.

Human reading speed is about 5–10 tok/s, so anything above ~10 tok/s feels readable and 20–40+ feels snappy. An 8B model at Q4 hits 90–140 tok/s on an RTX 4090 and 30–80 tok/s on an Apple M-series Max; a 70B at Q4 is more like 15–60 tok/s on NVIDIA hardware, depending on the card, and 6–13 tok/s on an M-series Max. Estimate your own with the LLM inference speed calculator and confirm with the in-browser GPU benchmark.

They are two different workloads. Prefill (processing your prompt) runs all tokens through the model in parallel as large matrix multiplies — it is compute-bound and fast, and it sets time-to-first-token. Decode (generating the reply) produces one token at a time, each requiring a full re-read of the weights — it is sequential and bandwidth-bound, and it sets your tokens/sec. The same GPU can show very different numbers for each.

Yes. When you serve many requests at once with vLLM's continuous batching, the weights read from memory are reused across every sequence in the batch, so the workload shifts from bandwidth-bound toward compute-bound. Per-request speed stays roughly the same, but aggregate throughput can rise 10–20x. Single-user latency and multi-user throughput are genuinely different regimes — optimize for the one you actually run.
