Local LLMs run hot. A GPU pulling 170-plus watts continuously — for a model you're querying one prompt at a time — adds up to real electricity, real heat, and real fan noise. The obvious fix is to cap the power limit. The obvious fear is that you'll trade away the one thing you care about: tokens per second.
So we measured it. We took an RTX 5060 Ti (Blackwell, 16 GB), running Qwen2.5-Coder-7B-Instruct in llama.cpp, and swept its power limit from the card's 180 W default down to its 150 W hardware floor — measuring generation speed and actual power draw at every step. The result is one of the cleanest free wins we've found in this whole benchmark series, and it falls straight out of how local decoding actually works. Everything below is reproducible on stock llama.cpp, and the raw data and harness are open source.
The result, up front
We cut the power cap by 17% (180 W → 150 W) and the GPU lost just under 1% of its throughput, while getting 16% more efficient and drawing about 25 fewer watts on average (172.7 W → 147.2 W, as reported by nvidia-smi rather than measured at the wall).
| Power cap | tok/s | Avg draw | Efficiency |
|---|---|---|---|
| 180 W (default) | 81.2 | 172.7 W | 0.470 tok/s/W |
| 162 W | 81.4 | 159.3 W | 0.511 tok/s/W |
| 150 W (floor) | 80.4 | 147.2 W | 0.546 tok/s/W |
The short version: on a single-user local LLM, the top of the power budget is mostly wasted. You can cap the card hard, run it cooler and quieter, and barely feel it in the token stream. Here's exactly why.
Why power doesn't buy speed here
Normally you'd assume more power means more performance — that's true for gaming, for training, for anything compute-bound. Single-user LLM decoding is none of those things. At batch size 1 — which is every local chat, every code completion, every one-prompt-at-a-time run on your own machine — the GPU is memory-bandwidth-bound.
Each token's forward pass has to read all of the model's weights out of VRAM: several gigabytes streamed, every single token. The arithmetic on each tile of weights finishes well before the next tile can arrive from VRAM, so the time per token is set almost entirely by the memory transfer. The math units spend most of their time idle, tapping their feet, waiting on the memory bus.
Power throttling reduces clock frequency, which reduces how fast the arithmetic runs. But the arithmetic was never the bottleneck. So speed stays flat — right up until you throttle so hard that the memory controller itself clocks down, which on this card does not happen anywhere between 150 W and 180 W. (This is the same bandwidth ceiling that explains why fancy new silicon doesn't help single-user decode as much as the spec sheet implies — see what tokens-per-second to expect from local models and our breakdown of how much VRAM you actually need.)
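The bandwidth argument above can be sanity-checked with a back-of-envelope ceiling: if every token has to stream all the weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below does that division. Both inputs are assumptions rather than anything we measured: a spec-sheet bandwidth of 448 GB/s for the card and a weight file of roughly 4.7 GB for the Q4_K_M quant. Check your own card and model file before trusting the output.

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound GPU.
# ASSUMPTIONS (not measured in this article): spec-sheet memory bandwidth
# of 448 GB/s and a Q4_K_M weight file of roughly 4.7 GB.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


ASSUMED_BANDWIDTH_GB_S = 448.0  # assumption: spec-sheet figure
ASSUMED_WEIGHTS_GB = 4.7        # assumption: approximate GGUF file size
MEASURED_TOK_S = 81.2           # from the 180 W row of the results table

ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
fraction = MEASURED_TOK_S / ceiling
print(f"ceiling ~{ceiling:.0f} tok/s, measured is {fraction:.0%} of it")
```

Under those assumptions the measured speed already sits at a large fraction of the bandwidth ceiling, which is consistent with extra compute clock having little left to buy. The estimate ignores KV-cache reads and kernel overheads, so treat it as a rough bound, not a prediction.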
What the numbers show
The sweep has three measured points on the 5060 Ti: the default cap, a midpoint, and the floor.
At the 150 W cap — the 5060 Ti's hardware floor — measured draw falls to 147 W while generation speed holds at 80.4 tok/s, essentially unchanged from the 81.2 tok/s at 180 W. Efficiency rises from 0.470 to 0.546 tok/s/W.
The middle point matters too, because it shows the trend isn't a fluke at the extremes: at a 162 W cap the card drew 159 W, ran at 81.4 tok/s, and hit 0.511 tok/s/W. (That is 0.2 tok/s faster than the default, a gap small enough that it is probably run-to-run noise.) Each step down in the cap improved efficiency without meaningfully denting speed. That's the memory-bandwidth-bound regime in action: the same work gets done at the same rate, it just costs fewer joules.
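The efficiency column and the headline percentages can be re-derived from the three table rows. The sketch below is a quick check of that arithmetic, not part of the published harness; the helper names are ours.

```python
# Rows copied from the results table: (cap_w, tok_s, avg_draw_w).
ROWS = [
    (180, 81.2, 172.7),
    (162, 81.4, 159.3),
    (150, 80.4, 147.2),
]


def tokens_per_watt(tok_s: float, avg_draw_w: float) -> float:
    """Efficiency in tok/s per watt (equivalently, tokens per joule)."""
    return tok_s / avg_draw_w


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


eff = {cap: tokens_per_watt(tok_s, draw) for cap, tok_s, draw in ROWS}
for cap, value in eff.items():
    print(f"{cap} W cap: {value:.3f} tok/s/W")

baseline, floor = ROWS[0], ROWS[-1]
speed_delta = pct_change(floor[1], baseline[1])   # throughput change, %
eff_delta = pct_change(eff[150], eff[180])        # efficiency change, %
draw_delta_w = baseline[2] - floor[2]             # average watts saved
print(f"speed {speed_delta:+.2f}%, efficiency {eff_delta:+.1f}%, "
      f"draw -{draw_delta_w:.1f} W")
```

Swapping in your own rows gives the same three summary numbers for any card.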
The setup
We kept it deliberately boring so the numbers are comparable: one model (Qwen2.5-Coder-7B-Instruct, Q4_K_M), one CUDA GPU with all layers resident, a 4096-token context, greedy decoding (temperature 0), 256 tokens generated per prompt, and the same fixed 12-prompt suite we use across the series. The reference rig is an Alienware Aurora R7 — Intel Core i7-8700, 64 GB DDR4, the RTX 5060 Ti running at PCIe 3.0 ×8, Windows Server 2025.
For each step, the harness queries the GPU's default power limit via nvidia-smi, sets the cap, launches the server fresh, runs the full suite measuring generation tok/s, and samples actual power draw in a background thread throughout, then computes efficiency in tok/s per watt. Because the weights are already resident in VRAM, changing the power limit affects how fast the GPU clocks run, not where the data lives, which is exactly what makes the bandwidth bottleneck visible.
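The background-sampling step can be sketched as follows. This is a minimal illustration, not the code in scripts/bench.py: the class and function names are hypothetical, and the 0.5 s interval is an arbitrary choice. The nvidia-smi query uses the power.draw field with CSV output. The reader is injected so the demo at the bottom runs without a GPU.

```python
import statistics
import subprocess
import threading
import time


def read_watts_nvidia_smi(gpu_index: int = 0) -> float:
    """Read the current power draw (watts) of one GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


class PowerSampler:
    """Collects power readings on a background thread while a benchmark runs."""

    def __init__(self, read_watts=read_watts_nvidia_smi, interval_s=0.5):
        self._read_watts = read_watts
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self.samples = []

    def _loop(self):
        # Take a reading, then sleep until the next tick or until stopped.
        while not self._stop.is_set():
            self.samples.append(self._read_watts())
            self._stop.wait(self._interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    def mean_watts(self) -> float:
        return statistics.fmean(self.samples)


def tokens_per_watt(tok_s: float, mean_watts: float) -> float:
    return tok_s / mean_watts


# Demo with a stand-in reader so the sketch runs without a GPU;
# omit the read_watts argument to sample a real card.
with PowerSampler(read_watts=lambda: 147.2, interval_s=0.01) as sampler:
    time.sleep(0.05)  # the benchmark suite would run here
print(f"{tokens_per_watt(80.4, sampler.mean_watts()):.3f} tok/s/W")
```

Averaging many samples over the whole run matters because instantaneous draw fluctuates; a single reading at the end would be a poor stand-in for the average.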
The RTX 5060 Ti's driver-enforced minimum power limit is 150 W; it will not go lower. This experiment therefore tested only three points: 150 W, 162 W, and 180 W (default). Cards with lower hardware floors (e.g. older 30-series and 40-series GPUs) can be capped further; we expect the same pattern there, with room for larger efficiency gains, but we have not measured it.
What we learned
Almost no speed lost — ~1% for 25 fewer watts
Going from 180 W to 150 W costs 0.8 tok/s (81.2 → 80.4). That rounds to zero in any practical context. The compute clocks throttle, but compute was never the bottleneck — VRAM bandwidth is. Weight tensors stream at the same rate regardless of how fast the arithmetic units run once they receive a tile.
+16% efficiency is real and free
0.470 → 0.546 tok/s/W is a genuine 16% improvement in useful work per joule. A server generating around the clock at the 150 W cap instead of the 180 W default draws about 25.5 W less, which saves roughly 610 Wh per day, or about 223 kWh per year; lighter duty cycles save proportionally less. That is small in dollars (around 9 cents a day, or roughly $34/year, assuming about $0.15/kWh), but it also means a cooler card and quieter fans, every hour of every day. The efficiency gain costs nothing except a one-time nvidia-smi -pl 150 call before you start the server.
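The savings arithmetic is simple enough to rerun for your own duty cycle and tariff. The sketch below uses the measured average draws from the table. The electricity price is an assumption for illustration, and the default of 24 hours assumes the GPU is generating continuously, so pass fewer hours for a server that mostly sits idle.

```python
def daily_savings_wh(draw_before_w: float, draw_after_w: float,
                     hours_per_day: float = 24.0) -> float:
    """Energy saved per day, in watt-hours, while the GPU is generating."""
    return (draw_before_w - draw_after_w) * hours_per_day


def yearly_cost_usd(wh_per_day: float, usd_per_kwh: float) -> float:
    """Yearly cost of a given daily energy amount at a flat tariff."""
    return wh_per_day / 1000.0 * 365 * usd_per_kwh


ASSUMED_USD_PER_KWH = 0.15  # assumption: substitute your own rate

# Measured average draw at the 180 W default and the 150 W cap.
saved_wh = daily_savings_wh(172.7, 147.2)
saved_usd = yearly_cost_usd(saved_wh, ASSUMED_USD_PER_KWH)
print(f"{saved_wh:.0f} Wh/day, about ${saved_usd:.0f}/year")
```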
Why bandwidth-bound workloads ignore watts
At batch size 1, the GPU processes one token at a time, and each token pass reads all of the model's weights from VRAM. The arithmetic on each tile finishes well before the next tile arrives. Power throttling slows the arithmetic, which doesn't matter, because the limiting step is the memory bus. Speed only drops once you throttle hard enough to clock down the memory controller itself, which for the 5060 Ti doesn't happen between 150 W and 180 W. The same logic is why adding CPU threads stops helping past a point and why spilling layers off the GPU falls off a cliff: in all three cases, the thing you're adding isn't the thing you're waiting on.
The 150 W floor is a real limit
We'd have loved to push lower. The driver wouldn't let us — nvidia-smi -pl 149 is simply rejected. So this sweep has only three points, and the true bottom of the efficiency curve on this card is hidden behind that floor. A GPU with a lower minimum would expose a wider range and potentially larger gains; the bandwidth-bound shape would look identical, just longer.
How to do it yourself
If you run local inference on an NVIDIA GPU, lower the power limit toward its hardware floor, confirm that tok/s holds, and then forget about it. On an RTX 5060 Ti the floor itself is safe:
# Check the allowed range first
nvidia-smi -q -d POWER | grep -i "power limit"
# Set the cap (watts). 150 is this card's floor.
nvidia-smi -pl 150
# ...then launch your server as normal
llama-server -m qwen2.5-coder-7b-instruct-q4_k_m.gguf -ngl 999
Three caveats worth knowing before you rely on it:
- It needs elevated privileges. Run it from an Administrator prompt on Windows, or with sudo on Linux. A normal shell will get permission-denied.
- It doesn't survive a reboot. The limit resets to default on restart. Put the nvidia-smi -pl call in a startup script, a scheduled task, or the same script that launches your inference server.
- 150 W is this card's floor, not yours. Check nvidia-smi -q -d POWER for your GPU's min/max. Many 30-series and 40-series cards go lower, which means more efficiency headroom than we could measure here.
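For a startup script, the third caveat can be handled by reading the card's allowed range and clamping whatever cap you request into it. The sketch below is one way to do that, assuming the power.min_limit, power.max_limit, and power.default_limit query fields return numeric values on your driver (some GPUs report them as unsupported, in which case the float conversion will fail). The function names and the sample line are illustrative, not real output from our rig.

```python
import subprocess

QUERY = "power.min_limit,power.max_limit,power.default_limit"


def parse_limits(csv_line: str) -> dict:
    """Parse one 'min, max, default' CSV line (watts, units stripped)."""
    min_w, max_w, default_w = (float(x) for x in csv_line.split(","))
    return {"min": min_w, "max": max_w, "default": default_w}


def query_limits(gpu_index: int = 0) -> dict:
    """Ask nvidia-smi for the allowed power-limit range of one GPU."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_limits(out.strip().splitlines()[0])


def clamp_cap(requested_w: float, limits: dict) -> float:
    """Pull a requested cap into the range the driver will accept."""
    return max(limits["min"], min(requested_w, limits["max"]))


# Illustrative line only (hypothetical values, not captured output).
sample = parse_limits("150.00, 180.00, 180.00")
print(clamp_cap(120, sample))  # a 120 W request is raised to the 150 W floor
```

On a real machine, `clamp_cap(target, query_limits())` gives a value that nvidia-smi -pl should accept instead of rejecting.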
The principle should generalize beyond this one card: single-user, batch-size-1 decode is typically memory-bandwidth-bound, and power caps only hurt throughput when compute is the bottleneck, which at batch 1 on consumer hardware it usually isn't. The flat part of the curve is a property of the regime, not of this specific GPU; where the cliff begins depends on the card, and on the 5060 Ti it sits somewhere below the 150 W floor we could reach. (If you're standing up a full home setup, the complete guide to running local AI ties this together with the rest of the tuning knobs.)
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B GGUF. The harness sweeps the power limit and samples real draw via nvidia-smi for you:
python scripts/bench.py --gpu 0
python scripts/aggregate.py && python scripts/inject.py
Because the script changes GPU power limits, it requires administrator/elevated privileges (Windows: Administrator prompt; Linux: sudo). It always restores the default limit in a finally block, so a crash mid-run won't leave your card throttled.
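The restore-on-exit behaviour described above can be sketched as a context manager. This is an illustration of the pattern, not the code in scripts/bench.py; the names are hypothetical, and the nvidia-smi runner is injected so the demo can simulate a crash without touching a real GPU.

```python
import contextlib
import subprocess


def run_nvidia_smi(args):
    """Run nvidia-smi with the given arguments; needs elevated privileges."""
    subprocess.run(["nvidia-smi", *args], check=True)


@contextlib.contextmanager
def power_cap(cap_w, default_w, gpu_index=0, run=run_nvidia_smi):
    """Apply a power cap for the duration of a block, then restore the default."""
    gpu = ["-i", str(gpu_index)]
    run([*gpu, "-pl", str(cap_w)])
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        run([*gpu, "-pl", str(default_w)])


# Demo: record the calls instead of executing them, and crash mid-run.
calls = []
try:
    with power_cap(150, 180, run=calls.append):
        raise RuntimeError("simulated crash mid-run")
except RuntimeError:
    pass
print(calls)
```

The recorded calls show the cap being applied and then the default being restored even though the body raised. A hard kill of the process would still skip the finally block, so a reboot or a manual nvidia-smi -pl remains the fallback.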
This is where you come in. This is an experiment in our open-source local-LLM benchmark series, and the whole point is a community map of how the speed/efficiency trade-off behaves across real hardware. We have one card with a 150 W floor; we want the cards we couldn't test — a 3090 that caps at 100 W, a 4090, an older mining card with a wide power range. Every GPU with a lower floor fills in part of the curve we can't see. Full reproduction and submission steps are in the repo: experiments/tokens-per-watt.
