InventiveHQ Lab

I Capped My GPU to 150W and Barely Lost Any Speed

We cut an RTX 5060 Ti's power limit by 17% and lost essentially zero tokens per second while gaining 16% efficiency. Local LLM decoding is memory-bandwidth-bound, not compute-bound — so watts are mostly wasted.

By InventiveHQ Team

Local LLMs run hot. A GPU pulling 170-plus watts continuously — for a model you're querying one prompt at a time — adds up to real electricity, real heat, and real fan noise. The obvious fix is to cap the power limit. The obvious fear is that you'll trade away the one thing you care about: tokens per second.

So we measured it. We took an RTX 5060 Ti (Blackwell, 16 GB), running Qwen2.5-Coder-7B-Instruct in llama.cpp, and swept its power limit from the card's 180 W default down to its 150 W hardware floor — measuring generation speed and actual power draw at every step. The result is one of the cleanest free wins we've found in this whole benchmark series, and it falls straight out of how local decoding actually works. Everything below is reproducible on stock llama.cpp, and the raw data and harness are open source.

The result, up front

We cut the power cap by 17% (180 W → 150 W) and the GPU lost less than 1% of its throughput, while becoming 16% more efficient and drawing about 26 fewer watts as reported by nvidia-smi.

Power cap         tok/s   Avg draw   Efficiency
180 W (default)   81.2    172.7 W    0.470 tok/s/W
162 W             81.4    159.3 W    0.511 tok/s/W
150 W (floor)     80.4    147.2 W    0.546 tok/s/W
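The efficiency column is simply throughput divided by average draw. A minimal Python sketch that recomputes it from the table above follows; the function and variable names are ours for illustration and are not taken from the benchmark harness.

```python
# Measurements copied from the table above: (power cap W, tok/s, average draw W).
MEASUREMENTS = [
    (180, 81.2, 172.7),
    (162, 81.4, 159.3),
    (150, 80.4, 147.2),
]


def tokens_per_watt(tok_per_s: float, avg_draw_w: float) -> float:
    """Efficiency in tok/s/W, which is the same thing as tokens per joule."""
    return tok_per_s / avg_draw_w


def percent_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


_, baseline_speed, baseline_draw = MEASUREMENTS[0]
baseline_eff = tokens_per_watt(baseline_speed, baseline_draw)

for cap, speed, draw in MEASUREMENTS:
    eff = tokens_per_watt(speed, draw)
    print(
        f"{cap} W cap: {eff:.3f} tok/s/W, "
        f"speed {percent_change(baseline_speed, speed):+.1f}%, "
        f"efficiency {percent_change(baseline_eff, eff):+.1f}%"
    )
```

For the 150 W row this comes out at roughly a 1% speed loss and a 16% efficiency gain, matching the headline numbers.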

The short version: on a single-user local LLM, watts are mostly wasted. You can cap the card hard, run it cooler and quieter, and barely feel it in the token stream. Here's exactly why.

Why power doesn't buy speed here

Normally you'd assume more power means more performance — that's true for gaming, for training, for anything compute-bound. Single-user LLM decoding is none of those things. At batch size 1 — which is every local chat, every code completion, every one-prompt-at-a-time run on your own machine — the GPU is memory-bandwidth-bound.

Each token's forward pass has to read all of the model's weights out of VRAM: several gigabytes streamed sequentially, every single token. The arithmetic on any one tile of weights finishes in microseconds. Streaming the full set of weights takes milliseconds per token. The math units spend most of their time idle, tapping their feet, waiting on the memory bus.

Power throttling reduces clock frequency, which reduces how fast the arithmetic runs. But the arithmetic was never the bottleneck. So speed stays flat — right up until you throttle so hard that the memory controller itself clocks down, which on this card does not happen anywhere between 150 W and 180 W. (This is the same bandwidth ceiling that explains why fancy new silicon doesn't help single-user decode as much as the spec sheet implies — see what tokens-per-second to expect from local models and our breakdown of how much VRAM you actually need.)
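A back-of-the-envelope check makes the bandwidth argument concrete. If every token has to stream all the weights once, the decode ceiling is roughly memory bandwidth divided by model size. The sketch below assumes a model file of about 4.7 GB for a 7B Q4_K_M GGUF and a peak memory bandwidth of 448 GB/s for the RTX 5060 Ti; both figures are our assumptions from public specs, not measurements from this experiment.

```python
def bandwidth_bound_tok_per_s(model_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed if each token streams all weights once."""
    return mem_bandwidth_bytes_per_s / model_bytes


# ASSUMPTIONS (not measured in this article):
MODEL_GB = 4.7          # approximate size of a 7B Q4_K_M GGUF
BANDWIDTH_GB_S = 448.0  # published peak memory bandwidth of the RTX 5060 Ti

ceiling = bandwidth_bound_tok_per_s(MODEL_GB * 1e9, BANDWIDTH_GB_S * 1e9)
ms_per_token = 1000.0 / ceiling
measured = 81.2  # tok/s at the 180 W default, from the results table

print(f"bandwidth ceiling: {ceiling:.1f} tok/s ({ms_per_token:.1f} ms per token)")
print(f"measured speed is {measured / ceiling:.0%} of that ceiling")
```

Under those assumptions the ceiling is about 95 tok/s, so the measured 81 tok/s already sits at roughly 85% of what the memory bus could deliver. The estimate ignores KV-cache reads and activations, so the real ceiling is somewhat lower and the card is closer to it than this suggests.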

What the numbers show

Here are the three measurable points on the 5060 Ti — at the default cap, in the middle, and at the floor:

[Figure: three panels comparing the 180 W default with the 150 W cap. Power draw: 173 W → 147 W (−26 W). Speed: 81.2 → 80.4 tok/s (≈ flat, −1%). Efficiency: 0.47 → 0.55 tok/s/W (+16%).]

At the 150 W cap — the 5060 Ti's hardware floor — measured draw falls to 147 W while generation speed holds at 80.4 tok/s, essentially unchanged from the 81.2 tok/s at 180 W. Efficiency rises from 0.470 to 0.546 tok/s/W.

The middle point matters too, because it proves the trend isn't a fluke at the extremes: at a 162 W cap the card drew 159 W, ran at 81.4 tok/s, and hit 0.511 tok/s/W. Every watt removed improved efficiency without meaningfully denting speed. That's the memory-bandwidth-bound regime in action — the same work gets done at the same rate, it just costs fewer joules.
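The "fewer joules" framing can be read straight off the same numbers: a watt is a joule per second, so average draw divided by tokens per second gives joules per token. A small sketch, using the 256-token generation length from the benchmark setup (the function name is ours):

```python
def joules_per_token(avg_draw_w: float, tok_per_s: float) -> float:
    """Energy per generated token: W / (tok/s) = (J/s) / (tok/s) = J/tok."""
    return avg_draw_w / tok_per_s


TOKENS_PER_REPLY = 256  # generation length used for each prompt in this benchmark

# (power cap W, average draw W, tok/s) for the three measured points.
for cap_w, draw_w, speed in [(180, 172.7, 81.2), (162, 159.3, 81.4), (150, 147.2, 80.4)]:
    j_tok = joules_per_token(draw_w, speed)
    print(
        f"{cap_w} W cap: {j_tok:.2f} J/token, "
        f"{j_tok * TOKENS_PER_REPLY:.0f} J per {TOKENS_PER_REPLY}-token reply"
    )
```

That works out to roughly 2.13 J per token at the default cap and 1.83 J per token at the floor.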

The setup

We kept it deliberately boring so the numbers are comparable: one model (Qwen2.5-Coder-7B-Instruct, Q4_K_M), one CUDA GPU with all layers resident, a 4096-token context, greedy decoding (temperature 0), 256 tokens generated per prompt, and the same fixed 12-prompt suite we use across the series. The reference rig is an Alienware Aurora R7 — Intel Core i7-8700, 64 GB DDR4, the RTX 5060 Ti running at PCIe 3.0 ×8, Windows Server 2025.

For each step, the harness queries the GPU's default power limit via nvidia-smi, sets the cap, launches the server fresh, runs the full suite while measuring generation tok/s, and samples actual power draw in a background thread throughout. It then computes tokens per second per watt. Because the model's weights are fully resident in VRAM at every step, changing the power limit changes only the clocks, not where the data lives, which is exactly what makes the bandwidth bottleneck visible.
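The background sampling step can be sketched as follows. This is not the harness's actual code, only a minimal illustration of the idea: poll nvidia-smi's power.draw field on a thread while the benchmark runs, then average the readings. The class name, the 0.5 s interval, and the injectable read_watts callable are our own choices.

```python
import subprocess
import threading
from typing import Callable, List


def read_power_draw_w(gpu_index: int = 0) -> float:
    """Read the current board power draw (watts) as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


class PowerSampler:
    """Collects power readings on a background thread while a benchmark runs."""

    def __init__(self, read_watts: Callable[[], float] = read_power_draw_w,
                 interval_s: float = 0.5) -> None:
        self._read_watts = read_watts
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[float] = []

    def _run(self) -> None:
        # Sample first, then wait, so at least one reading is always taken.
        while True:
            self.samples.append(self._read_watts())
            if self._stop.wait(self._interval_s):
                break

    def __enter__(self) -> "PowerSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()

    @property
    def average_w(self) -> float:
        return sum(self.samples) / len(self.samples)


# Usage (run_prompt_suite is a hypothetical stand-in for the benchmark loop):
# with PowerSampler() as sampler:
#     run_prompt_suite()
# print(f"average draw: {sampler.average_w:.1f} W")
```

Passing the reader in as a callable keeps the sampler testable on a machine without an NVIDIA GPU.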

The RTX 5060 Ti's driver-enforced minimum power limit is 150 W; it will not go lower. This experiment therefore tested only three points: 150 W, 162 W, and 180 W (default). Cards with lower hardware floors (e.g. older 30-series and 40-series GPUs) can be capped further; the pattern should be the same, but the range over which efficiency can improve is wider.

What we learned

Almost no speed lost: ~1% for about 26 fewer watts

Going from 180 W to 150 W costs 0.8 tok/s (81.2 → 80.4). That rounds to zero in any practical context. The compute clocks throttle, but compute was never the bottleneck — VRAM bandwidth is. Weight tensors stream at the same rate regardless of how fast the arithmetic units run once they receive a tile.

+16% efficiency is real and free

Going from 0.470 to 0.546 tok/s/W is a genuine 16% improvement in useful work per joule. The gap in measured draw is about 25.5 W under load, so a server that generates around the clock would save roughly 610 Wh per day, and one that is busy about 40% of the time would save roughly 250 Wh per day. That is small in dollars (a few cents a day at average US rates, on the order of $11/year at the 250 Wh figure), but it also means a cooler card and quieter fans, every hour of every day. The efficiency gain costs nothing except a one-time nvidia-smi -pl 150 call before you start the server.
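The daily saving depends on how many hours the card actually spends generating, because the 25.5 W gap was measured under load. Here is a minimal sketch of the arithmetic; the duty cycles and the electricity price are illustrative assumptions of ours, not measured values, so substitute your own.

```python
def daily_savings_wh(delta_w: float, duty_cycle: float = 1.0, hours: float = 24.0) -> float:
    """Energy saved per day, in watt-hours, for a given reduction in draw."""
    return delta_w * hours * duty_cycle


def yearly_cost_usd(wh_per_day: float, usd_per_kwh: float) -> float:
    """Yearly cost of a daily energy amount at a flat electricity price."""
    return wh_per_day / 1000.0 * 365.0 * usd_per_kwh


DELTA_W = 172.7 - 147.2   # measured average draw at the 180 W cap minus the 150 W cap
PRICE_USD_PER_KWH = 0.12  # ASSUMPTION: illustrative price, use your own tariff

for duty in (1.0, 0.4):   # ASSUMPTION: fraction of the day spent generating
    wh = daily_savings_wh(DELTA_W, duty_cycle=duty)
    usd = yearly_cost_usd(wh, PRICE_USD_PER_KWH)
    print(f"duty cycle {duty:.0%}: {wh:.0f} Wh/day, ${usd:.2f}/year")
```

At a 40% duty cycle this lands near the 250 Wh per day and $11 per year quoted above; continuous generation would save closer to 612 Wh per day.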

Why bandwidth-bound workloads ignore watts

At batch size 1, the GPU processes one token at a time, and each token's pass reads the full set of model weights from VRAM. The arithmetic on each tile finishes in microseconds; streaming all the weights takes milliseconds per token. Power throttling slows the arithmetic, which doesn't matter, because the limiting step is the memory bus. Speed only drops once you throttle hard enough to clock down the memory controller itself, which for the 5060 Ti doesn't happen between 150 W and 180 W. The same logic is why adding CPU threads stops helping past a point and why spilling layers off the GPU falls off a cliff: in all three cases, the thing you're adding isn't the thing you're waiting on.

The 150 W floor is a real limit

We'd have loved to push lower. The driver wouldn't let us — nvidia-smi -pl 149 is simply rejected. So this sweep has only three points, and the true bottom of the efficiency curve on this card is hidden behind that floor. A GPU with a lower minimum would expose a wider range and potentially larger gains; the bandwidth-bound shape would look identical, just longer.

How to do it yourself

If you run local inference on an NVIDIA GPU, set the power limit to its hardware floor and forget about it. On an RTX 5060 Ti:

# Check the allowed range first
# (on Windows, swap grep for: findstr /i /c:"power limit")
nvidia-smi -q -d POWER | grep -i "power limit"

# Set the cap (watts). 150 is this card's floor.
nvidia-smi -pl 150

# ...then launch your server as normal
llama-server -m qwen2.5-coder-7b-instruct-q4_k_m.gguf -ngl 999

Three caveats worth knowing before you rely on it:

  1. It needs elevated privileges. Run it from an Administrator prompt on Windows, or with sudo on Linux. From an unprivileged shell, nvidia-smi refuses the change with a permissions error.
  2. It doesn't survive a reboot. The limit resets to default on restart. Put the nvidia-smi -pl call in a startup script, a scheduled task, or the same script that launches your inference server.
  3. 150 W is this card's floor, not yours. Check nvidia-smi -q -d POWER for your GPU's min/max. Many 30-series and 40-series cards go lower, which means more efficiency headroom than we could measure here.
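For a startup script, the range check in the third caveat can be automated instead of read by eye. The sketch below queries nvidia-smi's power.min_limit, power.default_limit, and power.max_limit fields and clamps a requested cap into that range. The helper names are ours, and avoiding grep means it also works on Windows.

```python
import subprocess
from typing import NamedTuple


class PowerLimits(NamedTuple):
    min_w: float
    default_w: float
    max_w: float


def parse_power_limits(csv_line: str) -> PowerLimits:
    """Parse one 'min, default, max' line of nvidia-smi CSV output (no units)."""
    min_w, default_w, max_w = (float(field) for field in csv_line.split(","))
    return PowerLimits(min_w, default_w, max_w)


def query_power_limits(gpu_index: int = 0) -> PowerLimits:
    """Ask nvidia-smi for the allowed power-limit range of one GPU."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.min_limit,power.default_limit,power.max_limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_power_limits(out.strip().splitlines()[0])


def clamp_cap(requested_w: float, limits: PowerLimits) -> float:
    """Clamp a requested cap into the range the driver will accept."""
    return max(limits.min_w, min(requested_w, limits.max_w))


# Usage: pick the floor of whatever card is installed.
# limits = query_power_limits(0)
# print(f"nvidia-smi -pl {clamp_cap(0, limits):.0f}")
```

Querying is read-only and does not need elevation; only the nvidia-smi -pl call itself does.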

The principle generalizes beyond this one card: single-user, batch-size-1 decoding is typically memory-bandwidth-bound, and power caps only hurt throughput when compute is the bottleneck, which at batch 1 on consumer hardware it usually isn't. The flat-then-cliff efficiency curve is a property of the regime, not of this specific GPU, although on this card the 150 W floor meant we only ever saw the flat part. (If you're standing up a full home setup, the complete guide to running local AI ties this together with the rest of the tuning knobs.)

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B GGUF. The harness sweeps the power limit and samples real draw via nvidia-smi for you:

python scripts/bench.py --gpu 0
python scripts/aggregate.py && python scripts/inject.py

Because the script changes GPU power limits, it requires administrator/elevated privileges (Windows: Administrator prompt; Linux: sudo). It always restores the default limit in a finally block, so a crash mid-run won't leave your card throttled.
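The restore-on-exit pattern is worth copying into your own launch scripts. The following is a minimal sketch of the idea as a context manager, not the harness's actual code; the names power_cap and run_checked are hypothetical.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it fails (for example when not elevated)."""
    subprocess.run(list(cmd), check=True)


@contextmanager
def power_cap(cap_w: int, default_w: int, gpu_index: int = 0,
              run: Runner = run_checked) -> Iterator[None]:
    """Apply a power cap for the duration of a block, then restore the default."""
    base = ["nvidia-smi", "-i", str(gpu_index), "-pl"]
    run(base + [str(cap_w)])
    try:
        yield
    finally:
        # Runs on normal exit and when the block raises an exception.
        run(base + [str(default_w)])


# Usage (requires Administrator/sudo; run_benchmark is a hypothetical placeholder):
# with power_cap(150, default_w=180):
#     run_benchmark()
```

A finally block cannot run if the process is killed outright or the machine loses power, but in that case the limit resets to the default on the next reboot anyway.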

This is where you come in. This is an experiment in our open-source local-LLM benchmark series, and the whole point is a community map of how the speed/efficiency trade-off behaves across real hardware. We have one card with a 150 W floor; we want the cards we couldn't test — a 3090 that caps at 100 W, a 4090, an older mining card with a wide power range. Every GPU with a lower floor fills in part of the curve we can't see. Full reproduction and submission steps are in the repo: experiments/tokens-per-watt.

Frequently Asked Questions

Will capping my GPU's power limit slow down local LLM inference?

Barely. In our test, dropping an RTX 5060 Ti from its 180 W default to its 150 W floor cost less than 1% of throughput — 81.2 tok/s fell to 80.4 tok/s. Single-user decoding at batch size 1 is bound by memory bandwidth, not compute, so the GPU spends most of each token streaming weights from VRAM rather than doing math. Power caps throttle the arithmetic clocks, but those clocks were never the bottleneck, so speed stays essentially flat.

How do I cap my NVIDIA GPU's power limit?

Use nvidia-smi. Run nvidia-smi -pl 150 to set a 150-watt cap (substitute your card's floor). The command needs elevated privileges — an Administrator prompt on Windows or sudo on Linux — and it does not survive a reboot, so add it to a startup script or run it before launching your inference server. You can check the allowed range with nvidia-smi -q -d POWER.

How much efficiency do you actually gain from a power cap?

We measured a 16% improvement in tokens per second per watt, from 0.470 tok/s/W at the 180 W default to 0.546 tok/s/W at the 150 W cap, for almost no speed loss. Average draw as reported by nvidia-smi fell from 173 W to 147 W. Depending on how much of the day the card spends generating, that works out to roughly 250 Wh (about 40% busy) to 610 Wh (continuously busy) saved per day, plus meaningful heat and fan-noise reduction.

Why doesn't more power make local inference faster?

Because the limiting step is memory, not math. Each token's forward pass reads the model's full multi-gigabyte set of weights from VRAM. Applying any one tile of weights finishes in microseconds; streaming all of them takes milliseconds per token. Extra power buys faster compute clocks, but the arithmetic units are already idle waiting on the memory bus, so the extra clocks have nothing to speed up.

What is the lowest I can set my GPU's power limit?

It depends on the card. The RTX 5060 Ti enforces a 150 W hardware floor and the driver refuses anything lower, so our sweep only had three usable points (150, 162, and 180 W). Other cards — including many 30-series and 40-series GPUs — have lower floors and can be capped further, which exposes a wider efficiency curve. The bandwidth-bound shape is the same on all of them.

Does this apply to AMD or Apple Silicon GPUs too?

The underlying principle does. Any single-user, batch-size-1 decode workload is memory-bandwidth-bound, so power-limited compute clocks rarely cap throughput. We measured this specific result on an NVIDIA RTX 5060 Ti with nvidia-smi; the exact tooling and floors differ on AMD (e.g. rocm-smi) and are largely abstracted away on Apple Silicon, but the flat-then-cliff efficiency curve is a property of the workload, not the vendor.

Can I reproduce this benchmark myself?

Yes. It runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B GGUF, and the harness sweeps the power limit while sampling real draw via nvidia-smi. Everything — the script, the methodology, and our raw results — is in the open-source repo, and we are collecting community results from other GPUs via pull request. Because the script changes GPU power limits, it requires administrator/elevated privileges.

