Question 1

How many tokens per second do I need for a good experience?

Accepted Answer

For interactive chat: **20+ tokens/sec feels instant** (faster than you can read), **10-20 tokens/sec is comfortable**, **5-10 tokens/sec is usable but noticeably slow**, and below 5 tokens/sec gets frustrating. For agentic workloads (coding assistants, tool use) where the model generates long outputs you skim rather than read, higher speeds matter much more — 50+ tokens/sec is the difference between a 30-second and a 3-minute task.
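The 30-second versus 3-minute comparison is plain division, sketched below. The 1,500-token output length and the `seconds_to_generate` helper are illustrative assumptions, and the estimate ignores time to first token.

```python
def seconds_to_generate(output_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to stream `output_tokens` at a steady decode rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    # Hypothetical agentic task that emits 1,500 tokens of output.
    for rate in (8, 20, 50):
        print(f"{rate:>3} tok/s -> {seconds_to_generate(1500, rate):.0f} s")
```

At 50 tokens/sec the assumed task finishes in 30 seconds; at 8 tokens/sec it takes just over 3 minutes.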

Question 2

How is token generation speed calculated?

Accepted Answer

Single-user generation is **memory-bound**: every new token requires reading all active model weights plus the KV cache from memory. So tokens/sec = effective memory bandwidth / bytes read per token. An 8B model at Q4_K_M is about 5 GB of weights; on an RTX 4090 (1008 GB/s at 50-75% real-world efficiency) that gives roughly 85-125 tokens/sec once KV cache reads and runtime overhead are added to the per-token total. GPU compute (TFLOPS) barely matters for this; bandwidth is what sets the ceiling.
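A minimal sketch of that formula follows. The function name is hypothetical, and the 6 GB per-token figure is an assumption (about 5 GB of weights plus roughly 1 GB for KV cache and overhead) chosen because it reproduces the 85-125 tokens/sec range quoted above. With exactly 5 GB of reads the same formula gives about 100-150.

```python
def decode_tokens_per_sec(read_gb_per_token: float,
                          peak_bandwidth_gbs: float,
                          efficiency: float) -> float:
    """tokens/sec = effective memory bandwidth / bytes read per token."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return peak_bandwidth_gbs * efficiency / read_gb_per_token


if __name__ == "__main__":
    # RTX 4090: 1008 GB/s peak, 50-75% real-world efficiency (from the text).
    # 6.0 GB per token is an assumed total of weights + KV cache + overhead.
    low = decode_tokens_per_sec(6.0, 1008, 0.50)
    high = decode_tokens_per_sec(6.0, 1008, 0.75)
    print(f"estimated range: {low:.0f}-{high:.0f} tokens/sec")
```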

Question 3

Why is memory bandwidth more important than GPU compute for LLMs?

Accepted Answer

Because generating one token requires reading every active weight from memory but performs only 2 floating-point operations (one multiply-add) per weight read. Modern GPUs can do hundreds of operations in the time it takes to read one value from memory, so the math units sit idle waiting for data. This is why an RTX 3090 (936 GB/s, 36 TFLOPS) generates tokens about as fast as an RTX 4080 (717 GB/s, 49 TFLOPS) despite being a generation older and having less compute: it has more bandwidth.
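One way to see the imbalance is to compare how many FLOPs a GPU can execute per byte it reads with how many FLOPs decoding needs per byte. The sketch below uses the TFLOPS and bandwidth figures quoted above. The helper names are hypothetical, and tensor-core throughput, which is higher, is not modeled.

```python
def gpu_flops_per_byte(peak_tflops: float, bandwidth_gbs: float) -> float:
    """FLOPs the GPU can execute in the time it takes to read one byte."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


def decode_flops_per_byte(bytes_per_weight: float) -> float:
    """FLOPs decoding needs per byte read: 2 FLOPs (one multiply-add) per weight."""
    return 2.0 / bytes_per_weight


if __name__ == "__main__":
    for name, tflops, bw in (("RTX 3090", 36, 936), ("RTX 4080", 49, 717)):
        print(f"{name}: can do {gpu_flops_per_byte(tflops, bw):.0f} FLOPs per byte read")
    # FP16 weights are 2 bytes each, so decoding needs about 1 FLOP per byte.
    print(f"decode needs {decode_flops_per_byte(2.0):.0f} FLOP per byte (FP16)")
```

Under these figures both cards have far more compute than single-user decoding can use, so the card with more bandwidth wins.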

Question 4

Why do MoE models generate faster than dense models of the same size?

Accepted Answer

Mixture-of-Experts models route each token through only a few experts, so they read a fraction of their parameters per token. GPT-OSS 120B reads only ~5.1B parameters per token, so it generates about as fast as a 5B dense model while retaining the knowledge capacity of a 120B model. This is why MoE has taken over: Qwen3 30B-A3B, Gemma 4 26B-A4B, and DeepSeek V4 all generate at small-model speeds. The catch: you still need enough memory to hold all the parameters.
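The split between "memory you must have" and "memory you read per token" can be sketched as follows. The `moe_footprint` helper is hypothetical, and the 4 bits per weight is an assumed quantization level, not a property of any specific model file.

```python
def moe_footprint(total_params_b: float,
                  active_params_b: float,
                  bits_per_weight: float) -> dict:
    """Return memory needed (all experts) and bytes read per token (active experts).

    Parameter counts are in billions, so the results are in GB (1e9 bytes).
    """
    bytes_per_weight = bits_per_weight / 8
    return {
        "memory_gb": total_params_b * bytes_per_weight,
        "read_per_token_gb": active_params_b * bytes_per_weight,
    }


if __name__ == "__main__":
    # 120B total / 5.1B active, from the text; 4 bits per weight is an assumption.
    fp = moe_footprint(120, 5.1, 4.0)
    print(f"must hold ~{fp['memory_gb']:.0f} GB, reads ~{fp['read_per_token_gb']:.2f} GB per token")
```

Speed follows the per-token reads, while the hardware requirement follows the total footprint.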

Question 5

What is time to first token (TTFT)?

Accepted Answer

TTFT is the delay between sending your prompt and seeing the first word of the response. It is dominated by **prefill** — processing your prompt — which unlike generation is compute-bound (all prompt tokens process in parallel). Long prompts on weak hardware mean long waits: a 32K-token prompt on Apple Silicon can take 30+ seconds before anything appears, while the same prompt on an H100 takes about a second. This calculator estimates both TTFT and generation speed.
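A rough prefill estimate, assuming the common approximation of 2 FLOPs per parameter per prompt token, looks like this. The function name, the 50% utilization, and the two TFLOPS figures are illustrative assumptions rather than measured values. Attention cost, which grows with prompt length, is ignored, so real TTFT will be higher.

```python
def prefill_seconds(active_params_b: float,
                    prompt_tokens: int,
                    peak_tflops: float,
                    utilization: float) -> float:
    """Estimate prefill time as (2 * params * tokens) / effective FLOP/s."""
    flops_needed = 2 * active_params_b * 1e9 * prompt_tokens
    return flops_needed / (peak_tflops * 1e12 * utilization)


if __name__ == "__main__":
    prompt = 32_768  # 32K-token prompt, as in the text
    # Assumed peak throughputs for a slower and a faster device.
    for label, tflops in (("~30 TFLOPS device", 30), ("~1000 TFLOPS device", 1000)):
        t = prefill_seconds(8, prompt, tflops, 0.5)
        print(f"{label}: ~{t:.1f} s to first token for an 8B model")
```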

Question 6

Does adding a second GPU double my generation speed?

Accepted Answer

Only with **tensor parallelism** (vLLM, TensorRT-LLM), which splits every layer across GPUs so that each added card contributes roughly 90% of its bandwidth; 2 GPUs give about 1.9x speed. The default in llama.cpp and Ollama is **layer splitting**, which adds VRAM capacity but NOT speed: GPUs process their layers in turn, so you get one GPU's worth of throughput. If you have two identical GPUs and want speed, use vLLM with tensor parallelism.
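The two splitting modes can be compared with a small model of the scaling rule above. The `multi_gpu_speedup` helper and its mode names are hypothetical, and the 0.9 default is the rough per-card figure from the text, not a guarantee.

```python
def multi_gpu_speedup(num_gpus: int, mode: str, scaling: float = 0.9) -> float:
    """Estimated single-user decode speedup relative to one GPU.

    mode="tensor": every layer is split across GPUs; each extra card adds
                   `scaling` of one GPU's bandwidth.
    mode="layer":  GPUs run their layers in turn; capacity grows, speed does not.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if mode == "tensor":
        return 1.0 + scaling * (num_gpus - 1)
    if mode == "layer":
        return 1.0
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("tensor", "layer"):
        print(f"2 GPUs, {mode} split: {multi_gpu_speedup(2, mode):.1f}x")
```

In vLLM the degree of tensor parallelism is set with the `tensor_parallel_size` engine argument (`--tensor-parallel-size` on the command line).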

Question 7

Why is my actual speed lower than this estimate?

Accepted Answer

Common reasons: **1)** Layers offloaded to CPU because the model does not quite fit in VRAM (the biggest one — check with the VRAM calculator). **2)** The inference engine is not optimized for your hardware (llama.cpp Metal vs CUDA vs ROCm differ a lot). **3)** Long context — the KV cache adds reads per token as the conversation grows. **4)** Power/thermal limits, especially on laptops. **5)** Batch size 1 is assumed here; serving multiple users splits bandwidth between them.
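To see why CPU offload tops the list, the per-token time can be modeled as the GPU-resident bytes read at GPU bandwidth plus the CPU-resident bytes read at system RAM bandwidth. This is a simplified sketch: the helper name and both effective bandwidth figures (600 GB/s for the GPU, 60 GB/s for system RAM) are assumptions, and transfer overhead is ignored.

```python
def offload_tokens_per_sec(model_gb: float,
                           gpu_fraction: float,
                           gpu_bw_gbs: float,
                           cpu_bw_gbs: float) -> float:
    """Decode speed when only `gpu_fraction` of the weights live in VRAM."""
    if not 0 <= gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in [0, 1]")
    gpu_time = model_gb * gpu_fraction / gpu_bw_gbs        # seconds per token on GPU
    cpu_time = model_gb * (1 - gpu_fraction) / cpu_bw_gbs  # seconds per token on CPU
    return 1.0 / (gpu_time + cpu_time)


if __name__ == "__main__":
    # Assumed effective bandwidths: 600 GB/s GPU, 60 GB/s system RAM.
    for frac in (1.0, 0.9, 0.8):
        tps = offload_tokens_per_sec(5.0, frac, 600, 60)
        print(f"{frac:.0%} of weights on GPU -> ~{tps:.0f} tokens/sec")
```

Under these assumptions, moving even a fifth of the weights to system RAM costs well over half the speed, because the slow portion dominates the per-token time.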

Question 8

How does context length affect generation speed?

Accepted Answer

Every generated token must read the entire KV cache (the stored attention state for all previous tokens) in addition to the model weights. At short contexts this is negligible, but at 64K+ tokens of conversation, the KV cache can add gigabytes of reads per token and cut generation speed by 30-50%. Models with MLA (DeepSeek), sliding-window attention (Gemma), or aggressive grouped-query attention degrade much less with long context.
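The size of the KV cache, and the resulting slowdown, can be estimated from the attention shape. The sketch below assumes a standard grouped-query layout with FP16 cache values. The 80-layer, 8-KV-head, 128-dimension configuration and the 40 GB weight size are illustrative assumptions for a 70B-class model at 4-bit quantization.

```python
def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values for every layer, head, and token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


def decode_slowdown(weights_gb: float, kv_gb: float) -> float:
    """Fractional drop in tokens/sec when KV cache reads are added to weight reads."""
    return kv_gb / (weights_gb + kv_gb)


if __name__ == "__main__":
    # Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    kv = kv_cache_gb(65_536, layers=80, kv_heads=8, head_dim=128)
    print(f"KV cache at 64K context: ~{kv:.1f} GB")
    print(f"slowdown vs. empty context: ~{decode_slowdown(40.0, kv):.0%}")
```

With these assumed numbers the cache adds about 21 GB of reads per token and costs roughly a third of the generation speed, which falls within the 30-50% range described above.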

Question 9

How fast is Apple Silicon for running LLMs?

Accepted Answer

Apple Silicon trades speed for capacity. An M4 Max (546 GB/s) generates at roughly half the speed of an RTX 4090 (1008 GB/s) for models that fit on both: think 40-60 tokens/sec for an 8B Q4 model vs 85-125 on the 4090. But the M4 Max can be configured with up to 128 GB of unified memory, letting it run 70B+ models that would otherwise need two or three consumer GPUs. For large models, a Mac is often the cheapest way to run them at all; for small models, a discrete GPU is much faster.
