LLM Inference Speed Calculator
Estimate LLM tokens per second from memory bandwidth, model size, quantization, and context window. Compare generation speed across GPUs and understand the memory-bandwidth bottleneck.
Why LLM Inference Speed Is About Memory, Not Compute
The most counterintuitive fact about running LLMs: the GPU's math performance (TFLOPS) barely matters. What matters is memory bandwidth.
Here is why. To generate one token, the model must read every active parameter from memory exactly once, and perform only about 2 floating-point operations per parameter read. Modern GPUs can execute hundreds of operations in the time it takes to fetch one value from VRAM. The compute units spend most of their time waiting.
So the speed formula is simple: tokens/sec = memory bandwidth / bytes read per token, multiplied by a real-world engine efficiency of 50-75%.
This explains otherwise-strange benchmark results: an RTX 3090 (936 GB/s) nearly matches an RTX 4080 (717 GB/s) despite being far behind on paper compute. It is also why datacenter GPUs cost what they do — an H100's 3350 GB/s of HBM bandwidth is its real product, not its TFLOPS.
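The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function name and the 5 GB-per-token workload are assumptions, while the bandwidth figures and the 50-75% efficiency range come from the text.

```python
def tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float,
                      efficiency: float) -> float:
    """Decode-speed estimate: effective bandwidth / data read per token."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return bandwidth_gb_s * efficiency / gb_read_per_token


# Memory bandwidth figures quoted in the text (GB/s).
GPUS = {"RTX 3090": 936, "RTX 4080": 717, "H100": 3350}

# Hypothetical workload: a model that reads 5 GB per generated token.
GB_PER_TOKEN = 5.0

for name, bandwidth in GPUS.items():
    low = tokens_per_second(bandwidth, GB_PER_TOKEN, 0.50)
    high = tokens_per_second(bandwidth, GB_PER_TOKEN, 0.75)
    print(f"{name}: {low:.0f}-{high:.0f} tokens/sec")
```

Note that TFLOPS never appears as an input: under this model, the ranking of the three cards is decided by bandwidth alone.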
Prefill vs. Decode: The Two Phases of Inference
LLM inference has two phases with completely different performance characteristics:
Prefill (prompt processing) happens once, before the first token appears. All prompt tokens are processed in parallel, which saturates the GPU's compute units — this phase IS compute-bound. Time to first token = prompt length / prefill speed. This is why pasting a long document into a local model causes a long pause before anything happens.
Decode (generation) produces tokens one at a time, each requiring a full read of the active weights and KV cache — memory-bound, as described above.
The practical consequences: a GPU with strong compute but modest bandwidth (like an RTX 4060 Ti) will have snappy prompt processing but mediocre generation speed. Apple Silicon is the opposite — generation speed is respectable for its bandwidth, but prefill on long prompts is slow because its GPU compute is far below discrete cards. If your workload involves repeatedly feeding long documents, prefill speed may matter more to you than tokens/sec.
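To see how the two phases combine, here is a rough sketch that adds time to first token to generation time. The function names and every speed in the example are hypothetical, chosen only to show how a long prompt can let the prefill-strong machine finish first even though it decodes more slowly.

```python
def time_to_first_token(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Prefill time: prompt length / prefill speed (seconds)."""
    return prompt_tokens / prefill_tokens_per_s


def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tokens_per_s: float, decode_tokens_per_s: float) -> float:
    """Total time for one reply: prefill phase plus decode phase (seconds)."""
    ttft = time_to_first_token(prompt_tokens, prefill_tokens_per_s)
    return ttft + output_tokens / decode_tokens_per_s


# Hypothetical machines: (prefill tokens/sec, decode tokens/sec).
machines = {
    "strong compute, modest bandwidth": (5000.0, 30.0),
    "modest compute, strong bandwidth": (500.0, 50.0),
}

PROMPT_TOKENS = 32_000   # a long pasted document
OUTPUT_TOKENS = 500      # a medium-length answer

for label, (prefill, decode) in machines.items():
    ttft = time_to_first_token(PROMPT_TOKENS, prefill)
    total = response_time(PROMPT_TOKENS, OUTPUT_TOKENS, prefill, decode)
    print(f"{label}: TTFT {ttft:.1f}s, total {total:.1f}s")
```

With a short prompt the result flips, because the decode term dominates and the machine with more bandwidth wins.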
How to Actually Get More Tokens Per Second
Ranked by impact:
1. Make sure the model fully fits in VRAM. Partial CPU offloading is the number one cause of disappointing speed — even 10% of layers on the CPU can halve your throughput. Check with the VRAM calculator first.
2. Use a smaller quantization. Q4_K_M reads ~70% less data per token than FP16 — that is a direct ~3x speedup. Quality loss is modest; this is the best speed-per-quality trade available.
3. Pick a MoE model. Qwen3 30B-A3B reads ~3B parameters per token; Qwen3 32B (dense) reads all 32B. The MoE model generates roughly 8-10x faster on identical hardware with broadly comparable capability.
4. Buy bandwidth, not compute. When choosing hardware for LLMs, sort by memory bandwidth per dollar. Used RTX 3090s (936 GB/s) routinely beat newer, more expensive cards.
5. Keep context lean. Past ~32K tokens, the KV cache meaningfully adds to per-token reads. Summarize or truncate long conversations, or use KV cache quantization.
6. Use tensor parallelism for multi-GPU. If you have matched GPUs, vLLM with tensor parallelism scales speed; llama.cpp's default layer split does not.
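Items 2 and 3 both come down to reading fewer bytes per token, which a short sketch can make concrete. The helper names are illustrative, and the 4.8 bits per weight for Q4_K_M is an assumption derived from the "~70% less data than FP16" figure above rather than an exact format specification.

```python
# Bits stored per weight. The Q4_K_M value is an assumption derived from
# the "~70% less than FP16" figure in the list above.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q4_K_M": 4.8}


def gb_read_per_token(active_params_billions: float, quant: str) -> float:
    """Weight data read per generated token, in GB (weights only)."""
    return active_params_billions * BITS_PER_WEIGHT[quant] / 8


def ideal_speedup(baseline_gb: float, candidate_gb: float) -> float:
    """Bandwidth-bound speed ratio: fewer bytes per token means faster decode."""
    return baseline_gb / candidate_gb


dense_fp16 = gb_read_per_token(32, "FP16")    # dense 32B model at FP16
dense_q4 = gb_read_per_token(32, "Q4_K_M")    # same model, quantized
moe_q4 = gb_read_per_token(3, "Q4_K_M")       # MoE with ~3B active parameters

print(f"FP16 -> Q4_K_M: {ideal_speedup(dense_fp16, dense_q4):.1f}x")
print(f"Dense 32B -> MoE 3B active: {ideal_speedup(dense_q4, moe_q4):.1f}x")
```

These are idealized ratios that ignore KV cache reads and engine overhead, so measured gains should land somewhat below them.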
Frequently Asked Questions
Common questions about the LLM Inference Speed Calculator
For interactive chat: 20+ tokens/sec feels instant (faster than you can read), 10-20 tokens/sec is comfortable, 5-10 tokens/sec is usable but noticeably slow, and below 5 tokens/sec gets frustrating. For agentic workloads (coding assistants, tool use) where the model generates long outputs you skim rather than read, higher speeds matter much more — 50+ tokens/sec is the difference between a 30-second and a 3-minute task.
Single-user generation is memory-bound: every new token requires reading all active model weights plus the KV cache from memory. So tokens/sec = effective memory bandwidth / bytes read per token. An 8B model at Q4_K_M is about 5 GB of reads per token; on an RTX 4090 (1008 GB/s at 50-75% real-world efficiency) that gives roughly 85-125 tokens/sec. GPU compute (TFLOPS) barely matters for this; bandwidth is everything.
GPU compute matters so little because generating one token requires reading every active weight from memory but doing only about 2 floating-point operations per weight read. Modern GPUs can do hundreds of operations in the time it takes to read one value from memory, so the math units sit idle waiting for data. This is why an RTX 3090 (936 GB/s, 36 TFLOPS) generates tokens almost as fast as an RTX 4080 (717 GB/s, 49 TFLOPS) despite being a generation older: it has more bandwidth.
Mixture-of-Experts models only route each token through a few experts, so they read a fraction of their parameters per token. GPT-OSS 120B reads only ~5.1B parameters per token — so it generates about as fast as a 5B dense model while having 120B-model knowledge. This is why MoE has taken over: Qwen3 30B-A3B, Gemma 4 26B-A4B, and DeepSeek V4 all generate at small-model speeds. The catch: you still need enough memory to hold all the parameters.
Time to first token (TTFT) is the delay between sending your prompt and seeing the first word of the response. It is dominated by prefill (processing your prompt), which unlike generation is compute-bound because all prompt tokens are processed in parallel. Long prompts on weak hardware mean long waits: a 32K-token prompt on Apple Silicon can take 30+ seconds before anything appears, while the same prompt on an H100 takes about a second. This calculator estimates both TTFT and generation speed.
Multiple GPUs increase generation speed only with tensor parallelism (vLLM, TensorRT-LLM), which splits every layer across GPUs and scales bandwidth to roughly 90% per added card, so 2 GPUs give about 1.9x speed. The default in llama.cpp and Ollama is layer splitting, which adds VRAM capacity but NOT speed: GPUs process their layers in turn, so you get one GPU's worth of throughput. If you have two identical GPUs and want speed, use vLLM with tensor parallelism.
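The two splitting strategies can be compared with a small sketch. The function and mode names are hypothetical, and extending the "roughly 90% per added card" figure linearly beyond two GPUs is an assumption; real scaling depends on the interconnect and the engine.

```python
def multi_gpu_speedup(num_gpus: int, mode: str,
                      scaling_per_added_gpu: float = 0.9) -> float:
    """Estimated generation speedup relative to a single GPU.

    mode is "tensor_parallel" or "layer_split" (hypothetical labels).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if mode == "layer_split":
        # GPUs take turns on their layers: more capacity, same speed.
        return 1.0
    if mode == "tensor_parallel":
        # Each added card contributes ~90% of one card's bandwidth
        # (linear extrapolation beyond 2 GPUs is an assumption).
        return 1.0 + scaling_per_added_gpu * (num_gpus - 1)
    raise ValueError(f"unknown mode: {mode}")


for mode in ("layer_split", "tensor_parallel"):
    print(mode, f"{multi_gpu_speedup(2, mode):.1f}x")
```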
Common reasons your real speed falls short of the estimate: 1) Layers offloaded to CPU because the model does not quite fit in VRAM (the biggest one; check with the VRAM calculator). 2) The inference engine is not optimized for your hardware (llama.cpp Metal vs CUDA vs ROCm differ a lot). 3) Long context: the KV cache adds reads per token as the conversation grows. 4) Power/thermal limits, especially on laptops. 5) Batch size 1 is assumed here; serving multiple users splits bandwidth between them.
Every generated token must read the entire KV cache (the stored attention state for all previous tokens) in addition to the model weights. At short contexts this is negligible, but at 64K+ tokens of conversation, the KV cache can add gigabytes of reads per token and cut generation speed by 30-50%. Models with MLA (DeepSeek), sliding-window attention (Gemma), or aggressive grouped-query attention degrade much less with long context.
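As a rough illustration of the KV cache cost, the sketch below uses the standard size formula for attention with grouped-query heads: one key and one value vector per layer, per KV head, per token. The model configuration (32 layers, 8 KV heads, head dimension 128, FP16 cache) and the 16 GB of weights are hypothetical, and the formula does not apply as written to MLA or sliding-window attention.

```python
def kv_cache_bytes(context_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys + values for every layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def decode_slowdown(weight_bytes: float, kv_bytes: float) -> float:
    """Fraction of bandwidth-bound decode speed lost to KV cache reads."""
    return kv_bytes / (weight_bytes + kv_bytes)


# Hypothetical configuration, for illustration only.
kv = kv_cache_bytes(context_tokens=65_536, num_layers=32,
                    num_kv_heads=8, head_dim=128)   # 8,589,934,592 bytes
weights = 16e9                                      # e.g. 8B params at FP16

print(f"KV cache at 64K context: {kv / 1e9:.1f} GB")
print(f"Estimated slowdown: {decode_slowdown(weights, kv):.0%}")
```

Halving `bytes_per_value` (KV cache quantization) or the number of KV heads halves the cache, which is why those techniques help most at long context.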
Apple Silicon trades speed for capacity. An M4 Max (546 GB/s) generates at roughly half the speed of an RTX 4090 (1008 GB/s) for models that fit on both — think 40-60 tokens/sec for an 8B Q4 model vs 85-125 on the 4090. But the M4 Max can have 128 GB of unified memory, letting it run 70B+ models that need two or three consumer GPUs. For large models, a Mac is often the cheapest way to run them at all; for small models, a discrete GPU is much faster.