Flash attention is one of the most-cited GPU optimizations in local-LLM circles. The llama.cpp -fa flag is how you're told to turn it on. Setup guides bake it into their boilerplate launch commands, blog posts recommend adding it, and the promise is concrete: lower VRAM at long context and faster generation, especially once the KV cache starts to dominate at 16K and 32K tokens. So we did the obvious thing — we measured it.
We ran a clean 3×2 sweep on an RTX 5060 Ti: three context lengths (4096, 16384, 32768), each run once without the -fa flag and once with -fa on, recording generation throughput and VRAM in all six cells. We expected a tidy "here's where flash attention earns its keep" chart. Instead we got an anticlimax, and the anticlimax turned out to be the interesting part. Everything below is reproducible on a stock llama.cpp release, and all the raw data is open source.
The results, up front
Toggling -fa changed essentially nothing. Throughput moved by less than 1 tok/s at every context length — inside run-to-run noise — and VRAM was identical, even at 32K where flash attention should have its biggest effect. Here is the entire experiment in one table:
| Context | Throughput, no -fa | Throughput, -fa on | VRAM, no -fa | VRAM, -fa on |
|---|---|---|---|---|
| 4K (4096) | 81.3 tok/s | 82.1 tok/s | 4673 MB | 4673 MB |
| 16K (16384) | 82.3 tok/s | 82.7 tok/s | 5357 MB | 5357 MB |
| 32K (32768) | 83.0 tok/s | 83.0 tok/s | 6269 MB | 6273 MB |
The short version: on this build, -fa is a no-op — not because flash attention does nothing, but because it's already on by default. The flag re-enables something the runtime had quietly opted you into. Below is exactly why we're confident in that read, and how to check your own build in two minutes.
What flash attention does — when it's actually doing something
To understand why a flat result is meaningful, you have to know what flash attention changes.
Standard attention computes Q × Kᵀ → softmax → × V as three distinct steps, writing an N×N intermediate matrix to high-bandwidth memory (HBM; on a consumer card like ours, the equivalent is GDDR VRAM) between each one. At large context lengths that matrix grows quadratically, so the round-trips between the compute units and HBM start to dominate latency, and VRAM fills up fast. Flash attention (Dao et al., 2022) fuses those three steps into a single tiled kernel that keeps its working data in on-chip SRAM throughout, so the full N×N matrix is never written out. The KV cache itself stays the same size; what shrinks is the scratch memory attention needs on top of it, and attention gets faster at long context. Same math, shorter path through memory.
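To make "same math, shorter path" concrete, here is a toy sketch in pure Python for a single query vector. It is not llama.cpp's CUDA kernel: the function names, the tile size, and the sample vectors are all made up for illustration, and the usual 1/sqrt(d) scaling is left out. The first function builds every score before normalising; the second streams over tiles and keeps only a running max, a running sum, and a running output, which is the online-softmax idea that tiled attention relies on.

```python
import math

def attention_reference(q, keys, values):
    """Standard path: materialise every score, softmax, then weight the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Fused path: walk the keys tile by tile, never holding all scores at once."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Illustrative inputs only (not taken from any model).
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(ref, tiled)
```

Both functions should agree to floating-point precision; the only thing that changes is how much intermediate state exists at any one moment.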
The payoff is biggest at long context: the attention matrix grows quadratically with sequence length, so HBM pressure rises fast. Flash attention's fused pass keeps that under control — you'd expect both lower VRAM and faster generation at 16K–32K if standard attention were the baseline. That's exactly the difference we set out to measure. (If you're sizing hardware around context, our context-length vs VRAM benchmark and KV-cache memory guide go deeper on why long context gets expensive.)
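For a rough sense of scale, the quadratic growth can be put into numbers. This is back-of-the-envelope arithmetic under loud assumptions: fp16 scores (2 bytes each), one attention head, and the whole N×N matrix materialised in one piece. Real runtimes process prompts in batches, so actual scratch buffers can be far smaller; the helper name is ours.

```python
def score_matrix_bytes(n_tokens, bytes_per_element=2):
    """Size of a fully materialised N x N score matrix for one head (assumes fp16)."""
    return n_tokens * n_tokens * bytes_per_element

for n in (4096, 16384, 32768):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per head")
```

Under those assumptions the three context lengths in our sweep come out at 32 MiB, 512 MiB, and 2,048 MiB per head: an 8× increase in context costs 64× the memory.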
The setup — reproducible on any CUDA GPU
We kept the methodology deliberately boring so the numbers are comparable: a single GPU with all layers resident, greedy decoding (temperature 0), 256 generated tokens per prompt, and a fixed 12-prompt suite. Throughput is the server's own predicted_per_second — pure generation rate. VRAM is the nvidia-smi used-memory delta measured after model load on an otherwise-idle GPU.
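The two measurements can be sketched in a few lines of Python. This is not the actual harness, just a minimal illustration: the function names are ours, and the response layout (a `timings` object holding `predicted_per_second`) is our assumption about how the server reports the figure, so check it against your build's output.

```python
import subprocess

def parse_used_mib(nvidia_smi_output, gpu_index=0):
    """Parse the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    (one integer per GPU, one GPU per line)."""
    lines = [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]
    return int(lines[gpu_index])

def read_used_mib(gpu_index=0):
    """Ask nvidia-smi how much memory is in use right now, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out, gpu_index)

def vram_delta_mib(idle_mib, loaded_mib):
    """Memory attributable to the model: reading after load minus idle reading."""
    return loaded_mib - idle_mib

def generation_rate(response):
    """Pull tokens/s from a parsed completion response (assumed layout)."""
    return float(response["timings"]["predicted_per_second"])
```

In use, you would call `read_used_mib()` once on the idle GPU and once after the server has loaded the model, then pass both readings to `vram_delta_mib`.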
| Component | What we used |
|---|---|
| GPU | RTX 5060 Ti — Blackwell, 16 GB VRAM |
| Model | Qwen2.5-Coder-7B-Instruct (Q4_K_M, ~5.4 GB on disk) |
| Runtime | llama.cpp, CUDA backend |
| Context sweep | 4096 / 16384 / 32768 tokens |
| Conditions | No -fa flag (runtime default) vs -fa on |
The two launch commands were as plain as they get — the only difference between conditions is the flag:
# baseline — no -fa flag (uses whatever the runtime default is)
llama-server -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -c 4096 -ngl 999
# flash-attention on — explicit flag
llama-server -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -c 4096 -fa on -ngl 999
Three context lengths times two flash-attention states equals six cells. That's the whole experiment.
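If you would rather generate the six launch commands than type them, a small sketch follows. The constant and function names are ours; the flags are the ones in the two commands above, and `None` stands for "flag omitted, runtime default applies".

```python
from itertools import product

MODEL = "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
CONTEXTS = (4096, 16384, 32768)
FA_STATES = (None, "on")  # None = no -fa flag at all

def build_command(ctx, fa):
    """Return the llama-server argv for one cell of the sweep."""
    cmd = ["llama-server", "-m", MODEL, "-c", str(ctx)]
    if fa is not None:
        cmd += ["-fa", fa]
    cmd += ["-ngl", "999"]
    return cmd

# 3 context lengths x 2 flag states = 6 cells
cells = [build_command(ctx, fa) for ctx, fa in product(CONTEXTS, FA_STATES)]
for cmd in cells:
    print(" ".join(cmd))
```

Each entry in `cells` is a plain argument list, so it can be handed to `subprocess.Popen` as-is.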
Reading the (non-)result
Walk the table cell by cell and the pattern is the same everywhere. At 4K context, the baseline (no -fa) produced 81.3 tok/s and -fa on produced 82.1 tok/s: a 0.8 tok/s gap, under 1% of the total and squarely within measurement noise, with VRAM identical at 4673 MB on both sides. At 16K the throughput is 82.3 vs 82.7 tok/s and the VRAM is 5357 MB vs 5357 MB, literally the same number. At 32K, where flash attention's fused kernel should make the biggest difference, throughput is 83.0 vs 83.0 tok/s and VRAM is 6269 vs 6273 MB: a 4 MB spread, under 0.1% of the total, and notably in the wrong direction (the flag-on run uses fractionally more).
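The same walk-through can be done mechanically. The rows below are copied from the results table; the helper name and the output field names are ours.

```python
# (context, tok/s without -fa, tok/s with -fa on, VRAM MB without, VRAM MB with)
ROWS = [
    (4096, 81.3, 82.1, 4673, 4673),
    (16384, 82.3, 82.7, 5357, 5357),
    (32768, 83.0, 83.0, 6269, 6273),
]

def deltas(rows):
    """Per-context change when -fa on is added to the baseline command."""
    out = []
    for ctx, tps_base, tps_fa, vram_base, vram_fa in rows:
        out.append({
            "ctx": ctx,
            "tps_delta": round(tps_fa - tps_base, 1),
            "tps_pct": round(100 * (tps_fa - tps_base) / tps_base, 2),
            "vram_delta_mb": vram_fa - vram_base,
        })
    return out

for row in deltas(ROWS):
    print(row)
```

Every throughput change comes out below 1%, and the only non-zero VRAM delta is the +4 MB at 32K.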
There is no story in these numbers. And that absence of a story is the story.
What we learned
Flash attention is already on by default. The only coherent explanation for identical on/off results is that the "off" baseline wasn't actually running standard attention — it was already running flash attention. Current llama.cpp builds enable it by default when the CUDA backend is active. The -fa on flag is redundant; it confirms what the runtime was already doing rather than changing it.
The VRAM gap at long context is the real diagnostic. If flash attention were truly disabled and then switched on, you'd see a clear VRAM drop at 32K — potentially hundreds of MB, because standard attention has to materialise that N×N matrix in HBM. We measured a 4 MB gap, with the flag-on side using slightly more. A noise-level delta in the wrong direction is exactly what you'd expect if both conditions execute the same internal code path. This is also why we anchored the experiment at long context rather than 4K: it's the most sensitive place to catch a difference, and there wasn't one.
Don't cargo-cult flags — verify them. The broader lesson isn't about flash attention specifically. It's about assuming a flag does what its name implies without checking. A two-minute sweep — three context lengths, tok/s and VRAM both — gave a definitive answer. If you're adding -fa on to your scripts because a guide recommended it, keep it for documentation value if you like, but don't budget for a runtime benefit on a current CUDA build.
When -fa does still matter
None of this means the flag is pointless everywhere — only that it's a no-op on this build. Flash attention defaults vary across llama.cpp versions and backends, and there are real situations where flipping the flag changes your numbers:
- Older llama.cpp builds. Flash attention was opt-in for a long stretch before it became the default. On an older release, -fa may be the difference between the standard path and the fused one, and there it genuinely cuts VRAM at long context.
- Non-CUDA backends. Metal, Vulkan, and CPU inference have followed their own timelines for flash-attention support and defaults. A backend that doesn't enable it automatically will respond to the flag where CUDA shrugs.
- Builds where it's explicitly off. Some distributions and custom builds ship with conservative defaults. If yours is one of them, the flag does real work.
The test is always the same, regardless of build: compare VRAM at 32K context with the flag omitted versus set to on. A real saving of hundreds of MB means flash attention wasn't on by default and the flag matters. A difference of a few MB or less means it already was, and you're fine either way.
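That decision rule fits in one small function. The 100 MB threshold is our own illustrative stand-in for "hundreds of MB", not a measured cut-off, and the function name is hypothetical; adjust the threshold to taste.

```python
def classify_fa_flag(vram_without_mb, vram_with_mb, threshold_mb=100):
    """Decide whether adding -fa on changed anything, from two 32K VRAM readings."""
    saved = vram_without_mb - vram_with_mb
    if saved >= threshold_mb:
        return "flag matters"  # a real drop: the fused path was not the default
    return "no-op"             # noise-level difference: it was already on

# Our 32K readings from the results table
print(classify_fa_flag(6269, 6273))
```

With our readings the saving is -4 MB, well under any sensible threshold, so the function reports a no-op.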
Practical takeaway
On a current llama.cpp CUDA build, -fa on is almost certainly a no-op — you don't need it, but it won't hurt. To verify for your own setup, run with -c 32768 and compare VRAM on both sides of the toggle on an idle GPU. Identical VRAM is good news: it means your runtime is already taking the fast path. A large VRAM drop means the flag is real and you should keep it. Either way, you've replaced a recommendation you read somewhere with a number you measured yourself — which is the entire point of this series. (For more on what to realistically expect from local inference speed, see local LLM performance: what to expect.)
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with a public Qwen2.5-Coder GGUF. Drop Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf in your models directory and:
python scripts/bench.py --gpu 0 # sweeps 6 cells (3 ctx x fa off/on)
python scripts/aggregate.py # -> results/data.json
python scripts/inject.py # bakes results into the writeup
This is where you come in. This is experiment #7 in our open-source local-LLM benchmark series, and a single machine can't map every build and backend. We have a current CUDA result showing flash attention is already on without -fa; we want yours. An older llama.cpp release? A Vulkan or Metal build? An Apple Silicon Mac, or CPU-only inference? Those are exactly the configurations where the flag might not be a no-op, and a contrary data point is the most useful kind. The benchmark harness auto-captures your hardware into a submission; open a pull request with your results/community/<name>.json file and we'll fold it into a growing community comparison.
New to running models on your own hardware? Start with how much VRAM you need to run an LLM, then the complete guide to running local AI. And if you're chasing throughput, our companion experiment on speculative decoding on consumer GPUs covers a flag that very much is not a no-op.
