InventiveHQ Lab

Flash Attention in llama.cpp: -fa Is Free Because It's Already On

We swept llama.cpp's -fa flag across 4K, 16K, and 32K context on an RTX 5060 Ti. Speed and VRAM were identical on and off — because this build already defaults flash attention on.

By InventiveHQ Team

Flash attention is one of the most-cited GPU optimizations in local-LLM circles. The llama.cpp -fa flag is how you're told to turn it on. Setup guides bake it into their boilerplate launch commands, blog posts recommend adding it, and the promise is concrete: lower VRAM at long context and faster generation, especially once the KV cache starts to dominate at 16K and 32K tokens. So we did the obvious thing — we measured it.

We ran a clean 3×2 sweep on an RTX 5060 Ti: three context lengths (4096, 16384, 32768), each run once without the -fa flag (the baseline) and once with -fa on, recording generation throughput and VRAM in each of the six cells. We expected a tidy "here's where flash attention earns its keep" chart. Instead we got an anticlimax, and the anticlimax turned out to be the interesting part. Everything below is reproducible on a stock llama.cpp release, and all the raw data is open source.

The results, up front

Toggling -fa changed essentially nothing. Throughput moved by less than 1 tok/s at every context length, which is inside run-to-run noise, and VRAM matched to within 4 MB, even at 32K where flash attention should have its biggest effect. Here is the entire experiment in one table:

Context | -fa off | -fa on | VRAM off | VRAM on
4K (4096) | 81.3 tok/s | 82.1 tok/s | 4673 MB | 4673 MB
16K (16384) | 82.3 tok/s | 82.7 tok/s | 5357 MB | 5357 MB
32K (32768) | 83.0 tok/s | 83.0 tok/s | 6269 MB | 6273 MB

The short version: on this build, -fa is a no-op — not because flash attention does nothing, but because it's already on by default. The flag re-enables something the runtime had quietly opted you into. Below is exactly why we're confident in that read, and how to check your own build in two minutes.

What flash attention does — when it's actually doing something

To understand why a flat result is meaningful, you have to know what flash attention changes.

Standard attention computes Q × Kᵀ → softmax → × V as three distinct steps, writing an N×N intermediate matrix to off-chip GPU memory between them (high-bandwidth memory, or HBM, in the original paper; GDDR VRAM on a consumer card like this one). That matrix grows quadratically with context length, so at long context the round-trips between the compute units and memory start to dominate latency and VRAM fills up fast. Flash attention (Dao et al., 2022) fuses the three steps into a single tiled kernel that keeps its working data in on-chip SRAM throughout and never writes the full intermediate matrix out. The result is lower peak memory for the attention intermediates (the KV cache itself stays the same size) and faster attention at long context. Same math, shorter path through memory.

[Diagram. Standard attention: Q × Kᵀ → HBM write → softmax → HBM write → × V → output. Flash attention: one fused, tiled kernel (Q × Kᵀ → softmax → × V) that stays SRAM-resident with no HBM round-trips → output. Both paths produce the same result.]
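To make "same math, shorter path" concrete, here is a minimal pure-Python sketch of the core trick for a single query row: a running (online) softmax that visits keys and values one tile at a time. It is an illustration of the idea only, not llama.cpp's CUDA kernel; the function names, the tile size, and the toy vectors are all ours.

```python
import math

def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def standard_attention(q, keys, values):
    """Materialise the whole score row, then softmax, then weight V."""
    scores = [score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Same output, but only one tile of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [score(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data: one query, five keys/values of dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

print(standard_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

The two printed vectors should agree up to floating-point rounding. The tiled version never holds more than `tile` scores at once, which is the property that lets a real kernel keep its working set on-chip.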

The payoff is biggest at long context: the attention matrix grows quadratically with sequence length, so HBM pressure rises fast. Flash attention's fused pass keeps that under control — you'd expect both lower VRAM and faster generation at 16K–32K if standard attention were the baseline. That's exactly the difference we set out to measure. (If you're sizing hardware around context, our context-length vs VRAM benchmark and KV-cache memory guide go deeper on why long context gets expensive.)
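A quick back-of-the-envelope calculation shows how steep "quadratic" is. The sketch below sizes a fully materialised N×N score matrix for one attention head, assuming 4-byte (fp32) scores. Treat it as an upper-bound illustration: a real runtime processes the prompt in smaller batches and never holds the whole matrix, so actual savings are smaller than these figures.

```python
BYTES_PER_SCORE = 4  # assumption: fp32 attention scores

def score_matrix_mib(n_tokens, bytes_per_score=BYTES_PER_SCORE):
    """Size of one head's full N x N score matrix, in MiB."""
    return n_tokens * n_tokens * bytes_per_score / 2**20

for n in (4096, 16384, 32768):
    print(f"{n} tokens: {score_matrix_mib(n):.0f} MiB per head")
```

Under those assumptions the matrix is 64 MiB at 4K, 1024 MiB at 16K, and 4096 MiB at 32K: an 8× longer context costs 64× the memory.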

The setup — reproducible on any CUDA GPU

We kept the methodology deliberately boring so the numbers are comparable: a single GPU with all layers resident, greedy decoding (temperature 0), 256 generated tokens per prompt, and a fixed 12-prompt suite. Throughput is the server's own predicted_per_second — pure generation rate. VRAM is the nvidia-smi used-memory delta measured after model load on an otherwise-idle GPU.
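The two measurements described above can be sketched in a few lines of Python. This is not the harness from our repository; it is a minimal illustration that assumes llama-server is listening on its default port (8080) and that nvidia-smi is on the PATH. The helper names are hypothetical.

```python
import json
import subprocess
import urllib.request

def parse_tok_per_s(response_body):
    """Pull the generation rate out of a llama-server /completion response."""
    return float(json.loads(response_body)["timings"]["predicted_per_second"])

def parse_used_mib(smi_output):
    """Parse the first line of `nvidia-smi --query-gpu=memory.used` output."""
    return int(smi_output.strip().splitlines()[0])

def gpu_used_mib(gpu_index=0):
    """Current used memory on one GPU, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_used_mib(result.stdout)

def generate(prompt, url="http://127.0.0.1:8080/completion"):
    """Greedy-decode 256 tokens and return the server-reported tok/s."""
    payload = json.dumps(
        {"prompt": prompt, "n_predict": 256, "temperature": 0}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_tok_per_s(response.read())

# Usage sketch (needs a GPU and a running server, so it is left commented out):
# before = gpu_used_mib()          # idle GPU, before launching llama-server
# ...launch llama-server and wait for the model to load...
# vram_delta = gpu_used_mib() - before
# tok_per_s = generate("Write a function that reverses a string.")
```

Reading the rate from the server's own timings keeps HTTP and client overhead out of the throughput figure, which is why we prefer it to timing the request from the outside.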

Component | What we used
GPU | RTX 5060 Ti (Blackwell, 16 GB VRAM)
Model | Qwen2.5-Coder-7B-Instruct (Q4_K_M, ~5.4 GB on disk)
Runtime | llama.cpp, CUDA backend
Context sweep | 4096 / 16384 / 32768 tokens
Conditions | Flash attention off (no -fa) vs -fa on

The two launch commands were as plain as they get — the only difference between conditions is the flag:

# baseline — no -fa flag (uses whatever the runtime default is)
llama-server -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -c 4096 -ngl 999

# flash-attention on — explicit flag
llama-server -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -c 4096 -fa on -ngl 999

Three context lengths times two flash-attention states equals six cells. That's the whole experiment.
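For reference, the six cells can be enumerated mechanically. The sketch below only builds and prints the launch commands from the two templates above; it does not start anything. The `FA_STATES` labels and the `build_command` helper are our own naming, not part of llama.cpp.

```python
from itertools import product

MODEL = "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
CONTEXTS = (4096, 16384, 32768)
FA_STATES = ("default", "on")  # "default" means the flag is omitted entirely

def build_command(ctx, fa_state):
    """Return the llama-server argv for one cell of the sweep."""
    cmd = ["llama-server", "-m", MODEL, "-c", str(ctx)]
    if fa_state == "on":
        cmd += ["-fa", "on"]
    cmd += ["-ngl", "999"]
    return cmd

cells = [build_command(ctx, fa) for ctx, fa in product(CONTEXTS, FA_STATES)]
for cmd in cells:
    print(" ".join(cmd))
```

Building the commands from one template makes it harder for a stray argument to differ between conditions, so the flag stays the only variable.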

Reading the (non-)result

Walk the table cell by cell and the pattern is the same everywhere. At 4K context, -fa off produced 81.3 tok/s and on produced 82.1 tok/s — a 0.8 tok/s gap that sits squarely within measurement noise, with VRAM identical at 4673 MB on both sides. At 16K the throughput is 82.3 vs 82.7 tok/s and the VRAM is 5357 MB vs 5357 MB — literally the same number. At 32K, where flash attention's fused kernel should make the biggest difference, throughput is 83.0 vs 83.0 tok/s and VRAM is 6269 vs 6273 MB — a 4 MB spread, less than a single allocation block, and notably in the wrong direction (the flag-on run uses fractionally more).

There is no story in these charts. And that absence of a story is the story.

What we learned

Flash attention is already on by default. The simplest explanation for identical results is that the baseline, which omitted the flag and so ran the runtime default, wasn't running standard attention at all; it was already running flash attention. Recent llama.cpp builds accept on, off, or auto for -fa and default to auto, which enables flash attention whenever the backend supports it, as CUDA does here. The -fa on flag is redundant: it confirms what the runtime was already doing rather than changing it.

The VRAM gap at long context is the real diagnostic. If flash attention were truly disabled and then switched on, you'd see a clear VRAM drop at 32K — potentially hundreds of MB, because standard attention has to materialise that N×N matrix in HBM. We measured a 4 MB gap, with the flag-on side using slightly more. A noise-level delta in the wrong direction is exactly what you'd expect if both conditions execute the same internal code path. This is also why we anchored the experiment at long context rather than 4K: it's the most sensitive place to catch a difference, and there wasn't one.
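One way to sanity-check that reading is to ask whether VRAM grows linearly with context, as a KV cache does, or faster. The sketch below compares the measured growth between context lengths with a simple KV-cache estimate. The architecture constants (28 layers, 4 KV heads, head dimension 128) and the f16 cache are our assumptions for a Qwen2.5 7B-class model, not values taken from the benchmark, and we treat the reported MB as comparable to MiB.

```python
# Assumed architecture for a Qwen2.5 7B-class model (not from the benchmark).
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # assumes an f16 KV cache

def kv_cache_mib(n_ctx):
    """Estimated KV-cache size for a given context length, in MiB."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
    return n_ctx * per_token / 2**20

# Flag-on VRAM readings from the results table, in MB.
measured = {4096: 4673, 16384: 5357, 32768: 6273}

for lo, hi in ((4096, 16384), (16384, 32768)):
    predicted = kv_cache_mib(hi) - kv_cache_mib(lo)
    observed = measured[hi] - measured[lo]
    print(f"{lo} -> {hi}: KV cache predicts +{predicted:.0f}, measured +{observed}")
```

Under those assumptions the estimate is +672 and +896 against measured growth of +684 and +916. Nearly all of the added VRAM is accounted for by a linearly growing KV cache, with no sign of a quadratic term, which is what a flash-attention path would look like.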

Don't cargo-cult flags — verify them. The broader lesson isn't about flash attention specifically. It's about assuming a flag does what its name implies without checking. A two-minute sweep — three context lengths, tok/s and VRAM both — gave a definitive answer. If you're adding -fa on to your scripts because a guide recommended it, keep it for documentation value if you like, but don't budget for a runtime benefit on a current CUDA build.

When -fa does still matter

None of this means the flag is pointless everywhere — only that it's a no-op on this build. Flash attention defaults vary across llama.cpp versions and backends, and there are real situations where flipping the flag changes your numbers:

  • Older llama.cpp builds. Flash attention was opt-in for a long stretch before it became the default. On an older release, -fa may be the difference between the standard path and the fused one — and there it genuinely cuts VRAM at long context.
  • Non-CUDA backends. Metal, Vulkan, and CPU inference have followed their own timelines for flash-attention support and defaults. A backend that doesn't enable it automatically will respond to the flag where CUDA shrugs.
  • Builds where it's explicitly off. Some distributions and custom builds ship with conservative defaults. If yours is one of them, the flag does real work.

The test is always the same, regardless of build: compare VRAM at 32K context with the flag explicitly off versus on. A real savings of hundreds of MB means flash attention wasn't on by default and the flag matters. Zero MB difference means it already was, and you're fine either way.
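That decision rule is simple enough to write down. In the sketch below the 100 MB threshold is a hypothetical cut-off we picked to stand in for "hundreds of MB"; adjust it to whatever noise floor you see on your own GPU.

```python
def interpret_fa_toggle(vram_off_mb, vram_on_mb, threshold_mb=100):
    """Classify a flag-off vs flag-on VRAM pair measured at long context."""
    saved = vram_off_mb - vram_on_mb
    if saved >= threshold_mb:
        return "flag matters: flash attention was off by default"
    return "no-op: flash attention was already on"

# Our 32K readings from the results table.
print(interpret_fa_toggle(6269, 6273))
```

With our 32K numbers the function reports a no-op, since the flag-on run used 4 MB more rather than hundreds of MB less.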

Practical takeaway

On a current llama.cpp CUDA build, -fa on is almost certainly a no-op — you don't need it, but it won't hurt. To verify for your own setup, run with -c 32768 and compare VRAM on both sides of the toggle on an idle GPU. Identical VRAM is good news: it means your runtime is already taking the fast path. A large VRAM drop means the flag is real and you should keep it. Either way, you've replaced a recommendation you read somewhere with a number you measured yourself — which is the entire point of this series. (For more on what to realistically expect from local inference speed, see local LLM performance: what to expect.)

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with a public Qwen2.5-Coder GGUF. Drop Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf in your models directory and:

python scripts/bench.py --gpu 0    # sweeps 6 cells (3 ctx x fa off/on)
python scripts/aggregate.py        # -> results/data.json
python scripts/inject.py           # bakes results into the writeup

This is where you come in. This is experiment #7 in our open-source local-LLM benchmark series, and a single machine can't map every build and backend. We have a current CUDA result showing flash attention is already on; we want yours. An older llama.cpp release? A Vulkan or Metal build? An Apple Silicon Mac, or CPU-only inference? Those are exactly the configurations where the flag might not be a no-op, and a contrary data point is the most useful kind. The benchmark harness auto-captures your hardware into a submission; open a pull request with your results/community/<name>.json file and we'll fold it into a growing community comparison.


New to running models on your own hardware? Start with how much VRAM you need to run an LLM, then the complete guide to running local AI. And if you're chasing throughput, our companion experiment on speculative decoding on consumer GPUs covers a flag that very much is not a no-op.

Frequently Asked Questions

Does llama.cpp's -fa flag actually speed up generation?

On a current CUDA build, not measurably — because flash attention is already running by default. We swept the -fa flag off and on across three context lengths (4K, 16K, 32K) on an RTX 5060 Ti and the throughput gap was under 1 tok/s at every cell, which is inside run-to-run measurement noise. The flag isn't broken; it's redundant. It re-enables something the runtime had already turned on for you.

Why didn't enabling flash attention save any VRAM?

Because no VRAM was being wasted to begin with. Flash attention's main memory win comes from not materialising the large N×N attention matrix in HBM, which matters most at long context. At 32K context our VRAM was 6269 MB with the flag off and 6273 MB with it on — a 4 MB difference, smaller than a single allocation block, and in the wrong direction. If the baseline had truly been standard attention, you'd expect to see hundreds of MB drop when the flag flips on. We saw nothing, which means both runs took the same fused code path.

Should I still put -fa on in my llama.cpp launch command?

It's harmless either way on a current CUDA build, so keep it if you like the documentation value of being explicit. Just don't expect it to change your speed or memory use. The one situation where it genuinely matters is an older build or a backend where flash attention is off by default — there the flag does real work. The honest takeaway is to verify rather than cargo-cult it.

How do I check whether flash attention is on by default on my build?

Run the same model at a long context length (for example -c 32768) twice — once with the flag explicitly off and once on — and compare VRAM with nvidia-smi on an otherwise-idle GPU. Identical VRAM means flash attention was already on and the flag is a no-op (good news, you're on the fast path). A large VRAM drop of hundreds of MB when you enable it means it was off by default and the flag is doing real work, so keep it.

What actually is flash attention?

Flash attention (Dao et al., 2022) is a reformulation of the attention computation that produces identical math by a faster route through memory. Standard attention computes Q×Kᵀ, softmax, then ×V as three separate steps, writing a large intermediate matrix to high-bandwidth memory (HBM) between each. Flash attention fuses all three into a single tiled kernel that keeps its working data in fast on-chip SRAM, avoiding those HBM round-trips. The result is lower peak memory and faster attention at long context — same output, shorter path.

Does this mean flash attention is useless?

Not at all — it means it's so useful that llama.cpp adopted it as the default. The optimization is doing real work on every one of our runs; it's just that the -fa flag no longer toggles it because the runtime already opts you in. The non-result here is a story about flags and defaults, not about the technique. Flash attention remains one of the most important reasons long-context inference is practical on consumer GPUs.

Why test VRAM at 32K context specifically?

Because that's where the difference between standard and flash attention would be largest. The attention matrix grows quadratically with sequence length, so HBM pressure rises fast as context grows. At 4K the gap between the two approaches would be small; at 32K it should be obvious. Testing at long context is the most sensitive way to tell whether flash attention is genuinely active — and getting a flat result there is strong evidence the fast path was already in use.
