There is a moment, when you load a local model that is slightly too big for your graphics card, where everything falls apart. Not gradually — all at once. We wanted to find exactly where that edge is and how far the drop goes, so we took a single 14B model on one GPU and measured generation speed at every step from "all CPU" to "all GPU."
The result is not a gentle slope you can tune your way along. It is a cliff. Full GPU runs at 43.18 tok/s; full CPU crawls at 2.89 tok/s — a roughly 15× gap on the exact same model and quant. And almost the entire recovery happens in one final step. Everything below is reproducible on a stock llama.cpp release, and all of our raw data and the benchmark harness are open source.
The results, up front
On Qwen2.5-Coder-14B (Q4_K_M, 48 layers) on an RTX 5060 Ti, sweeping -ngl from 0 to 99:
-ngl (layers on GPU) | Throughput | VRAM used |
|---|---|---|
| 0 (all CPU) | 2.89 tok/s | 265 MB |
| 8 | 3.58 tok/s | 2,161 MB |
| 16 | 4.24 tok/s | 3,509 MB |
| 24 | 5.61 tok/s | 4,875 MB |
| 32 | 7.26 tok/s | 6,241 MB |
| 40 | 12.5 tok/s | 7,593 MB |
| 99 (all 48) | 43.18 tok/s | 9,179 MB |
The short version: moving layers onto the GPU one batch at a time buys you almost nothing until the very end. From -ngl 0 to 40 — putting 83% of the model on the GPU — throughput climbs only from 2.89 to 12.5 tok/s. Then the last 8 layers come off the CPU and speed leaps to 43.18 tok/s, a 3.5× jump from a single step. Below is exactly why, and how to make sure you land on the right side of it.
Why this matters: the hidden cost of VRAM spill
When a model doesn't fully fit in your GPU's VRAM, llama.cpp offloads the overflow layers to system RAM. The -ngl flag controls how many layers go on the GPU; every layer past that limit runs on the CPU. Most people assume this is a gradual trade-off — more GPU layers means a bit faster, fewer means a bit slower, and the system finds its own equilibrium somewhere in the middle. It does not.
The penalty for even a few CPU layers is severe, because the GPU must stall and pull those weights across PCIe on every single decoding step. As the data shows, the curve barely begins to steepen until the final step. This experiment sweeps every meaningful -ngl value on a real 14B model and measures exactly where the cliff is and how steep the fall is.
How GPU layer offloading works
A transformer model is a stack of layers processed sequentially for each generated token. On the GPU, each layer is served at a memory bandwidth of roughly 480 GB/s on the RTX 5060 Ti's GDDR7. On the CPU, system memory peaks an order of magnitude lower — and every time the decode pass hits a CPU layer, the GPU must cross PCIe (practical throughput around 16 GB/s one-way) to fetch those weights, process them, and send activations back. The GPU stalls while it waits. Not just for one layer — for every CPU layer, on every forward pass, for every token you generate.
The relationship between layer count and throughput is therefore not linear. Each CPU layer forces a PCIe round trip that the GPU cannot overlap with useful work, and the stalls compound. The practical consequence is a threshold: as long as any layer is on the CPU, every step still pays the crossing, so throughput stays pinned near the bottom — and only breaks free once the model is entirely resident in fast GPU memory.
Three -ngl settings on the same 48-layer model. As layers fall to the CPU, throughput collapses. At -ngl 40 — with only 8 of 48 layers still on the CPU — you are already 71% below full-GPU speed.
The absolute gain from
-ngl40 → 99 (+30.7 tok/s) is more than three times the total throughput gained across all five previous steps combined (+9.6 tok/s). One final push — 8 fewer layers on the CPU — does what the entire partial-offload range could not. The last layers matter most.
The setup
We kept the methodology deliberately boring so the numbers are comparable: a fixed model and quant, greedy decoding (temperature 0), 256 tokens generated per prompt, a 4096-token context, and a shared 12-prompt suite run at each -ngl step. Throughput is the server's own predicted_per_second — pure generation rate. VRAM is the nvidia-smi used-memory delta on model load, measured on an otherwise-idle GPU.
| Component | What we used |
|---|---|
| GPU | NVIDIA RTX 5060 Ti — Blackwell (sm_120), 16 GB GDDR7 |
| Runtime | llama.cpp (CUDA 12.4 build) |
| Model | Qwen2.5-Coder-14B-Instruct (Q4_K_M, ~9.2 GB on disk) |
| Sweep | -ngl 0, 8, 16, 24, 32, 40, 99 — same 12-prompt suite at each step |
| Metrics | tok/s (predicted_per_second) · VRAM MB (nvidia-smi delta on model load) |
Reference test rig. Every reference number was measured on one machine — an Alienware Aurora R7: Intel Core i7-8700 (6 cores / 12 threads, Coffee Lake); Intel 300-series board with a PCIe 3.0 root complex; 64 GB DDR4-2400 (dual-channel); the RTX 5060 Ti running at PCIe 3.0 ×8; models on a SATA SSD; Windows Server 2025. In-VRAM throughput is GPU-memory-bandwidth bound, so bus, RAM, and SSD speed barely move the full-GPU number — but they shape the offload and CPU results. A faster PCIe link would soften the partial-offload penalty somewhat, but it will not turn the cliff into a slope.
# Sweep -ngl from 0 (all CPU) to 99 (all GPU) — repeated for each value
llama-server -m Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
-ngl 0 --temp 0 --top-k 1 -n 256 # repeat with 8 16 24 32 40 99
# Or run the automated sweep via bench.py
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py # -> results/data.json
Reading the cliff
The throughput line tells a precise story. From -ngl 0 to 40, each increment of 8 GPU layers buys modest, compounding gains — 2.89 → 3.58 → 4.24 → 5.61 → 7.26 → 12.5 tok/s, a 4.3× improvement spanning the entire partial-offload range. Then from -ngl 40 to 99 the last 8 CPU layers come off and throughput leaps to 43.18 tok/s — a 3.5× jump for just those final layers.
The VRAM table tracks the cost of getting there. At -ngl 0, only 265 MB of VRAM is in use — just the KV cache and runtime overhead; the model itself lives in system RAM. Each batch of 8 GPU layers adds roughly 1,300–1,600 MB, climbing steadily to 9,179 MB at -ngl 99. A 16 GB card handles this with over 6 GB of headroom at a 4K context. (If you want to understand where that growing number comes from, our breakdowns of how much VRAM you need to run an LLM and the KV cache memory cost go layer by layer.)
What we learned
The gap is ~15× — not 2×, not 5×
Full GPU (-ngl 99) runs at 43.18 tok/s. Full CPU (-ngl 0) delivers 2.89 tok/s. That is a 14.9× gap on the exact same model and quant. If you assume partial offload will recover "most of the GPU performance," the data disagrees sharply: at -ngl 40, with 83% of layers on the GPU, you are still at only 12.5 tok/s — less than 30% of full-GPU throughput.
The non-linear last-layers effect
The biggest surprise in the sweep is not the size of the cliff — it is where the cliff is. Going from -ngl 32 to 40 (7.26 → 12.5 tok/s) is already the largest single-step gain in the partial range. But removing the final 8 CPU layers (40 → 99) yields 43.18 tok/s — a 3.5× multiplier on one step. Once the GPU has every layer in fast GDDR7, it stops stalling on PCIe entirely; the decode pipeline becomes fully bandwidth-bound on local memory and throughput jumps dramatically.
Partial offload is a cliff, not a slope
There is no sweet spot in the partial-offload zone that recovers most of the GPU's performance. Even at -ngl 40 — the high end of the partial range, with 40 of 48 layers on the GPU — throughput is 12.5 tok/s against a full-GPU ceiling of 43.18. The curve does not level off; it stays near the bottom until the very last CPU layers are removed. You are either fully on the GPU, or you are not. Tuning -ngl to find a partial compromise does not work.
The model fits if your card has 10 GB+
At -ngl 99, VRAM usage is 9,179 MB. An RTX 5060 Ti (16 GB), RTX 3090 (24 GB), or RTX 4090 (24 GB) all clear this easily. A 10 GB card (RTX 3080, RTX 2080 Ti) still fits -ngl 99 with roughly 800 MB of headroom before the KV cache fills it — keep context to 2K or use -ctk q8_0 to stay within bounds. An 8 GB card cannot fit every layer of this model and quant.
How to read your own offload
You do not have to guess where you are on this curve — llama.cpp tells you at startup. When the server loads, it prints a line like offloaded 40/49 layers to GPU. (The count is layers plus the output head, so it can be one higher than the transformer-layer count.) If that number is anything less than "all," you are paying the cliff penalty on every token, and the fix is not to nudge -ngl upward by a few — it is to get all the way up or change the model.
The practical decision tree is short:
- It already says all layers offloaded. You are on the fast side of the cliff. Leave
-nglhigh (99 is the standard "everything" value) and move on. - It is offloading some but not all. Check VRAM headroom with
nvidia-smi. If you have room and it still won't take every layer, your context or KV cache is eating the budget — shrink the context, quantize the KV cache (-ctk q8_0 -ctv q8_0), or drop to a smaller model. Do not settle for a partial split as a "compromise"; it is the worst of both worlds. - It physically cannot fit. Move to a heavier quant or a smaller model that runs entirely on the GPU. See below.
The practical takeaway
Size your model to fit entirely in VRAM, or accept the cliff. There is no useful partial-offload zone — the curve doesn't gradually recover, it collapses and stays down. The right move is to pick a model that runs at -ngl 99 on your card:
- 16 GB card (RTX 5060 Ti, RTX 4080): Q4_K_M 14B fits cleanly at ~9.2 GB VRAM. Use
-ngl 99and get 43+ tok/s. - 10–12 GB card (RTX 3080, RTX 2080 Ti): Q4_K_M 7B (~4.4 GB) fits with headroom. Avoid partial-offloading a 14B — you'll be stuck at 12.5 tok/s or below.
- 8 GB card: Q4_K_M 7B fits. A 14B at any partial
-nglsetting will live near the bottom of the cliff.
If you need 14B output quality on a card with less VRAM, consider a Q2_K quant (~5.5 GB). A fully-GPU Q2_K will significantly outperform a partial-offload Q4_K_M — the cliff penalty overwhelms the quant penalty every time. And if you have two smaller cards instead of one big one, splitting the model across both GPUs keeps every layer on fast memory and sidesteps the cliff entirely. For a sense of what "good" looks like on your hardware once you're over the edge, see what tokens-per-second to expect from a local LLM.
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp CUDA build with the public Qwen2.5-Coder-14B GGUF. Place Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf (or set SPECBENCH_MODEL_BASE) in your models directory (SPECBENCH_MODELS_DIR), then:
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py && python scripts/inject.py
This is where you come in. This is experiment #5 in our open-source local-LLM benchmark series, and the whole point is to map where the cliff sits across real consumer hardware. A faster PCIe 4.0 or 5.0 link, a different memory configuration, an Apple Silicon Mac with unified memory, an older card — every rig moves the partial-offload numbers, and we want to see how far. The make_submission.py script auto-captures your CPU, RAM, GPU, PCIe link, board, and storage into a single submission file; open a pull request adding it and we'll fold it into a growing community comparison.
Full instructions, the sweep script, and our raw results are in the repo: github.com/InventiveHQ/local-llm-benchmarks/tree/main/experiments/vram-offload-cliff. Submitted data is licensed CC BY 4.0.
New to running models on your own hardware? Start with running local AI: the complete guide, then size your card with how much VRAM you need to run an LLM.
