InventiveHQ Lab

The VRAM Cliff: 15× Slower the Moment Layers Spill to CPU

We swept -ngl from 0 to 99 on a 14B model: 2.89 → 43 tok/s as it moves onto the GPU. Partial offload is a cliff, not a slope — and the last 8 layers matter most.

By InventiveHQ Team

There is a moment, when you load a local model that is slightly too big for your graphics card, where everything falls apart. Not gradually — all at once. We wanted to find exactly where that edge is and how far the drop goes, so we took a single 14B model on one GPU and measured generation speed at every step from "all CPU" to "all GPU."

The result is not a gentle slope you can tune your way along. It is a cliff. Full GPU runs at 43.18 tok/s; full CPU crawls at 2.89 tok/s — a roughly 15× gap on the exact same model and quant. And almost the entire recovery happens in one final step. Everything below is reproducible on a stock llama.cpp release, and all of our raw data and the benchmark harness are open source.

The results, up front

On Qwen2.5-Coder-14B (Q4_K_M, 48 layers) on an RTX 5060 Ti, sweeping -ngl from 0 to 99:

| -ngl (layers on GPU) | Throughput | VRAM used |
|---|---|---|
| 0 (all CPU) | 2.89 tok/s | 265 MB |
| 8 | 3.58 tok/s | 2,161 MB |
| 16 | 4.24 tok/s | 3,509 MB |
| 24 | 5.61 tok/s | 4,875 MB |
| 32 | 7.26 tok/s | 6,241 MB |
| 40 | 12.5 tok/s | 7,593 MB |
| 99 (all 48) | 43.18 tok/s | 9,179 MB |

The short version: moving layers onto the GPU one batch at a time buys you almost nothing until the very end. From -ngl 0 to 40 — putting 83% of the model on the GPU — throughput climbs only from 2.89 to 12.5 tok/s. Then the last 8 layers come off the CPU and speed leaps to 43.18 tok/s, a 3.5× jump from a single step. Below is exactly why, and how to make sure you land on the right side of it.
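The headline ratios can be re-derived directly from the table. The short Python sketch below uses only the figures reported above; the variable names are ours, and the key 48 stands in for the -ngl 99 run.

```python
# Throughput (tok/s) at each -ngl step, copied from the results table.
# Key 48 represents the -ngl 99 run (all 48 layers on the GPU).
TOTAL_LAYERS = 48
throughput = {0: 2.89, 8: 3.58, 16: 4.24, 24: 5.61, 32: 7.26, 40: 12.5, 48: 43.18}

full_cpu = throughput[0]
full_gpu = throughput[48]

gap = full_gpu / full_cpu                        # all-GPU vs all-CPU
last_step = full_gpu / throughput[40]            # -ngl 40 -> all layers
share_on_gpu_at_40 = 40 / TOTAL_LAYERS           # fraction of layers offloaded
share_of_speed_at_40 = throughput[40] / full_gpu # fraction of full-GPU speed

print(f"all-GPU vs all-CPU:        {gap:.1f}x")
print(f"last step (40 -> all):     {last_step:.1f}x")
print(f"layers on GPU at -ngl 40:  {share_on_gpu_at_40:.0%}")
print(f"speed recovered at -ngl 40: {share_of_speed_at_40:.0%}")
```

With 83% of the layers on the GPU, the run recovers only about 29% of full-GPU speed.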

Why this matters: the hidden cost of VRAM spill

When a model doesn't fully fit in your GPU's VRAM, llama.cpp offloads the overflow layers to system RAM. The -ngl flag controls how many layers go on the GPU; every layer past that limit runs on the CPU. Most people assume this is a gradual trade-off — more GPU layers means a bit faster, fewer means a bit slower, and the system finds its own equilibrium somewhere in the middle. It does not.

The penalty for even a few CPU layers is severe, because the GPU must stall and pull those weights across PCIe on every single decoding step. As the data shows, throughput climbs slowly across the partial range and only takes off at the final step. This experiment sweeps every meaningful -ngl value on a real 14B model and measures exactly where the cliff is and how steep the fall is.

How GPU layer offloading works

A transformer model is a stack of layers processed sequentially for each generated token. On the GPU, each layer is served at a memory bandwidth of roughly 448 GB/s from the RTX 5060 Ti's GDDR7. On the CPU, system memory peaks an order of magnitude lower, and every time the decode pass hits a CPU layer, the GPU must cross PCIe (at best about 8 GB/s one-way on this rig's PCIe 3.0 ×8 link, roughly 16 GB/s on a PCIe 4.0 ×8 host) to fetch those weights, process them, and send activations back. The GPU stalls while it waits: not just for one layer, but for every CPU layer, on every forward pass, for every token you generate.

The relationship between layer count and throughput is therefore not linear. Each CPU layer forces a PCIe round trip that the GPU cannot overlap with useful work, and the stalls compound. The practical consequence is a threshold: as long as any layer is on the CPU, every step still pays the crossing, so throughput stays pinned near the bottom — and only breaks free once the model is entirely resident in fast GPU memory.
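One way to see why the curve bends so sharply at the end is a simple additive latency model. The sketch below is our simplification, not a measured breakdown: it assumes every layer costs the same slice of time per token, takes the per-layer cost of a slow (CPU-side) layer from the all-CPU run and of a fast layer from the all-GPU run, and assumes slow and fast layers are processed back to back. It makes no claim about where the slow time is actually spent.

```python
TOTAL_LAYERS = 48
CPU_TOK_S = 2.89    # measured at -ngl 0
GPU_TOK_S = 43.18   # measured at -ngl 99

# Per-layer time per token (seconds), assuming all layers cost the same.
cpu_layer_s = (1 / CPU_TOK_S) / TOTAL_LAYERS
gpu_layer_s = (1 / GPU_TOK_S) / TOTAL_LAYERS

def predicted_tok_s(gpu_layers: int) -> float:
    """Throughput if slow and fast layers simply run one after another."""
    cpu_layers = TOTAL_LAYERS - gpu_layers
    token_time = cpu_layers * cpu_layer_s + gpu_layers * gpu_layer_s
    return 1 / token_time

# Measured values from the sweep, for side-by-side comparison.
measured = {0: 2.89, 8: 3.58, 16: 4.24, 24: 5.61, 32: 7.26, 40: 12.5, 48: 43.18}
for n, real in measured.items():
    print(f"-ngl {n:>2}: model {predicted_tok_s(n):5.2f} tok/s, measured {real:5.2f} tok/s")
```

Under these assumptions the model predicts about 13 tok/s at -ngl 40, close to the measured 12.5: the eight slow layers still account for roughly three quarters of each token's time, which is why removing them produces such a large jump.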

[Figure: layer placement and throughput at three -ngl settings. All CPU (-ngl 0): 48 CPU layers, 2.89 tok/s. Partial offload (-ngl 40): 40 GPU layers and 8 CPU layers, 12.5 tok/s. All GPU (-ngl 99): 48 GPU layers, 43.18 tok/s, ≈ 15× faster than all-CPU. Qwen2.5-Coder-14B-Instruct Q4_K_M · RTX 5060 Ti 16 GB · llama.cpp CUDA.]

Three -ngl settings on the same 48-layer model. As layers fall to the CPU, throughput collapses. At -ngl 40 — with only 8 of 48 layers still on the CPU — you are already 71% below full-GPU speed.

The absolute gain from -ngl 40 → 99 (+30.7 tok/s) is more than three times the total throughput gained across all five previous steps combined (+9.6 tok/s). One final push — 8 fewer layers on the CPU — does what the entire partial-offload range could not. The last layers matter most.

The setup

We kept the methodology deliberately boring so the numbers are comparable: a fixed model and quant, greedy decoding (temperature 0), 256 tokens generated per prompt, a 4096-token context, and a shared 12-prompt suite run at each -ngl step. Throughput is the server's own predicted_per_second — pure generation rate. VRAM is the nvidia-smi used-memory delta on model load, measured on an otherwise-idle GPU.
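For readers who want to take the same throughput reading by hand, here is a minimal sketch of one measurement against a running llama-server. It assumes the server is listening on the default http://127.0.0.1:8080 and uses the /completion endpoint, whose JSON response carries a timings object with predicted_per_second. The function names are ours, and the article's bench.py may structure this differently.

```python
import json
import urllib.request

def build_payload(prompt: str, n_predict: int = 256) -> dict:
    """Request body for /completion with greedy decoding and 256 new tokens."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0, "top_k": 1}

def tokens_per_second(response: dict) -> float:
    """Pull the pure generation rate out of a /completion response."""
    return float(response["timings"]["predicted_per_second"])

def measure(prompt: str, base_url: str = "http://127.0.0.1:8080") -> float:
    """POST one prompt to a running llama-server and return its tok/s."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return tokens_per_second(json.load(resp))

# Example (requires a running server):
# print(measure("Write a binary search in Python."))
```

Averaging the value over a fixed prompt suite at each -ngl setting gives one row of the results table.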

| Component | What we used |
|---|---|
| GPU | NVIDIA RTX 5060 Ti, Blackwell (sm_120), 16 GB GDDR7 |
| Runtime | llama.cpp (CUDA 12.4 build) |
| Model | Qwen2.5-Coder-14B-Instruct (Q4_K_M, ~9.2 GB on disk) |
| Sweep | -ngl 0, 8, 16, 24, 32, 40, 99; same 12-prompt suite at each step |
| Metrics | tok/s (predicted_per_second) · VRAM MB (nvidia-smi delta on model load) |
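The VRAM metric can be captured the same way by hand. The sketch below reads used memory with nvidia-smi's --query-gpu=memory.used option before and after the model loads and takes the difference. The helper names are ours, and nvidia-smi reports this value in MiB.

```python
import subprocess

def parse_used_mb(smi_output: str, gpu_index: int = 0) -> int:
    """Parse memory.used output: one integer per GPU, one GPU per line."""
    lines = [ln.strip() for ln in smi_output.splitlines() if ln.strip()]
    return int(lines[gpu_index])

def used_vram_mb(gpu_index: int = 0) -> int:
    """Ask nvidia-smi how much memory the chosen GPU is using right now."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mb(out, gpu_index)

def vram_delta_mb(before_mb: int, after_mb: int) -> int:
    """Memory attributable to the model load on an otherwise-idle GPU."""
    return after_mb - before_mb

# Usage: call used_vram_mb() before starting llama-server, again once the
# model has finished loading, then pass both readings to vram_delta_mb().
```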

Reference test rig. Every reference number was measured on one machine — an Alienware Aurora R7: Intel Core i7-8700 (6 cores / 12 threads, Coffee Lake); Intel 300-series board with a PCIe 3.0 root complex; 64 GB DDR4-2400 (dual-channel); the RTX 5060 Ti running at PCIe 3.0 ×8; models on a SATA SSD; Windows Server 2025. In-VRAM throughput is GPU-memory-bandwidth bound, so bus, RAM, and SSD speed barely move the full-GPU number — but they shape the offload and CPU results. A faster PCIe link would soften the partial-offload penalty somewhat, but it will not turn the cliff into a slope.

# Sweep -ngl from 0 (all CPU) to 99 (all GPU), repeated for each value
llama-server -m Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
  -c 4096 -ngl 0 --temp 0 --top-k 1 -n 256   # repeat with 8 16 24 32 40 99

# Or run the automated sweep via bench.py
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py   # -> results/data.json

Reading the cliff

The throughput line tells a precise story. From -ngl 0 to 40, each increment of 8 GPU layers buys modest, compounding gains — 2.89 → 3.58 → 4.24 → 5.61 → 7.26 → 12.5 tok/s, a 4.3× improvement spanning the entire partial-offload range. Then from -ngl 40 to 99 the last 8 CPU layers come off and throughput leaps to 43.18 tok/s — a 3.5× jump for just those final layers.

[Chart: generation throughput (tokens / sec) versus GPU layers (-ngl; 0 = all CPU, all = all GPU), rising from 2.89 at -ngl 0 to 12.5 at -ngl 40 and 43.18 with all layers on the GPU.]

The VRAM table tracks the cost of getting there. At -ngl 0, only 265 MB of VRAM is in use, which is runtime overhead and working buffers; the model itself lives in system RAM. The first batch of 8 GPU layers adds about 1,900 MB, each of the next four adds roughly 1,350 MB, and the final step to -ngl 99 adds about 1,600 MB, for a total of 9,179 MB. A 16 GB card handles this with over 6 GB of headroom at a 4K context. (If you want to understand where that growing number comes from, our breakdowns of how much VRAM you need to run an LLM and the KV cache memory cost go layer by layer.)
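The per-step VRAM cost is easy to read off the table with a few lines of Python. The figures are the measured ones from the results table; the key 48 stands in for the -ngl 99 run.

```python
# VRAM used (MB) at each -ngl step, copied from the results table.
vram_mb = {0: 265, 8: 2161, 16: 3509, 24: 4875, 32: 6241, 40: 7593, 48: 9179}

steps = sorted(vram_mb)
# Extra VRAM consumed by each additional batch of 8 GPU layers.
deltas = {f"{a}->{b}": vram_mb[b] - vram_mb[a] for a, b in zip(steps, steps[1:])}

for step, extra in deltas.items():
    print(f"-ngl {step:>6}: +{extra} MB")
```

The first and last steps are the largest, which suggests they carry more than just eight layers' worth of weights.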

What we learned

The gap is ~15× — not 2×, not 5×

Full GPU (-ngl 99) runs at 43.18 tok/s. Full CPU (-ngl 0) delivers 2.89 tok/s. That is a 14.9× gap on the exact same model and quant. If you assume partial offload will recover "most of the GPU performance," the data disagrees sharply: at -ngl 40, with 83% of layers on the GPU, you are still at only 12.5 tok/s — less than 30% of full-GPU throughput.

The non-linear last-layers effect

The biggest surprise in the sweep is not the size of the cliff — it is where the cliff is. Going from -ngl 32 to 40 (7.26 → 12.5 tok/s) is already the largest single-step gain in the partial range. But removing the final 8 CPU layers (40 → 99) yields 43.18 tok/s — a 3.5× multiplier on one step. Once the GPU has every layer in fast GDDR7, it stops stalling on PCIe entirely; the decode pipeline becomes fully bandwidth-bound on local memory and throughput jumps dramatically.

Partial offload is a cliff, not a slope

There is no sweet spot in the partial-offload zone that recovers most of the GPU's performance. Even at -ngl 40 — the high end of the partial range, with 40 of 48 layers on the GPU — throughput is 12.5 tok/s against a full-GPU ceiling of 43.18. The curve does not level off; it stays near the bottom until the very last CPU layers are removed. You are either fully on the GPU, or you are not. Tuning -ngl to find a partial compromise does not work.

The model fits if your card has 10 GB+

At -ngl 99, VRAM usage is 9,179 MB. An RTX 5060 Ti (16 GB), RTX 3090 (24 GB), or RTX 4090 (24 GB) all clear this easily. A 10 GB RTX 3080 still fits -ngl 99 with roughly 800 MB of headroom before the KV cache fills it, and an 11 GB RTX 2080 Ti has about 1 GB more than that; on either card, keep context to 2K or use -ctk q8_0 to stay within bounds. An 8 GB card cannot fit every layer of this model and quant.
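The fit check is simple subtraction. The sketch below treats 1 GB as 1,000 MB, which is what the roughly-800 MB figure for a 10 GB card implies; that conversion is our assumption, and real cards expose slightly different usable amounts once the driver and desktop take their share.

```python
MODEL_VRAM_MB = 9179  # measured at -ngl 99 with a 4K context

def headroom_mb(card_gb: float, needed_mb: int = MODEL_VRAM_MB) -> int:
    """Spare VRAM after a full offload; negative means it cannot fit."""
    return round(card_gb * 1000) - needed_mb

for card_gb in (16, 10, 8):
    spare = headroom_mb(card_gb)
    verdict = "fits" if spare > 0 else "does not fit"
    print(f"{card_gb:>2} GB card: {verdict} ({spare:+} MB)")
```

A positive but small result, as on the 10 GB card, is the case where context length and KV cache precision decide whether every layer stays on the GPU.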

How to read your own offload

You do not have to guess where you are on this curve — llama.cpp tells you at startup. When the server loads, it prints a line like offloaded 40/49 layers to GPU. (The count is layers plus the output head, so it can be one higher than the transformer-layer count.) If that number is anything less than "all," you are paying the cliff penalty on every token, and the fix is not to nudge -ngl upward by a few — it is to get all the way up or change the model.

The practical decision tree is short:

  • It already says all layers offloaded. You are on the fast side of the cliff. Leave -ngl high (99 is the standard "everything" value) and move on.
  • It is offloading some but not all. Check VRAM headroom with nvidia-smi. If you have room and it still won't take every layer, your context or KV cache is eating the budget — shrink the context, quantize the KV cache (-ctk q8_0 -ctv q8_0), or drop to a smaller model. Do not settle for a partial split as a "compromise"; it is the worst of both worlds.
  • It physically cannot fit. Move to a more aggressive (lower-bit) quant or a smaller model that runs entirely on the GPU. See below.
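The first branch of that decision tree can be automated by scanning the server's startup output for the offload line. This is a minimal sketch: the regular expression matches only the "offloaded N/M layers to GPU" wording quoted above, and the function name is ours.

```python
import re

OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")

def offload_status(log_text: str):
    """Return (offloaded, total, fully_offloaded), or None if no line matches."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    offloaded, total = int(match.group(1)), int(match.group(2))
    return offloaded, total, offloaded == total

status = offload_status("offloaded 40/49 layers to GPU")
if status is not None and not status[2]:
    print(f"partial offload: {status[0]} of {status[1]} layers on the GPU")
```

Comparing the two numbers against each other, rather than against a hard-coded 48, accounts for the output head that the count includes.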

The practical takeaway

Size your model to fit entirely in VRAM, or accept the cliff. There is no useful partial-offload zone — the curve doesn't gradually recover, it collapses and stays down. The right move is to pick a model that runs at -ngl 99 on your card:

  • 16 GB card (RTX 5060 Ti, RTX 4080): Q4_K_M 14B fits cleanly at ~9.2 GB VRAM. Use -ngl 99 and get 43+ tok/s.
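  • 10–12 GB card (RTX 3080, RTX 2080 Ti): Q4_K_M 7B (~4.4 GB) fits with headroom. Q4_K_M 14B fits fully only with a short context or a quantized KV cache, as noted above. If it will not fit, avoid partial-offloading it: you'll be stuck at 12.5 tok/s or below.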
  • 8 GB card: Q4_K_M 7B fits. A 14B at any partial -ngl setting will live near the bottom of the cliff.

If you need 14B output quality on a card with less VRAM, consider a Q2_K quant (~5.5 GB). A fully-GPU Q2_K will significantly outperform a partial-offload Q4_K_M — the cliff penalty overwhelms the quant penalty every time. And if you have two smaller cards instead of one big one, splitting the model across both GPUs keeps every layer on fast memory and sidesteps the cliff entirely. For a sense of what "good" looks like on your hardware once you're over the edge, see what tokens-per-second to expect from a local LLM.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp CUDA build with the public Qwen2.5-Coder-14B GGUF. Place Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf (or set SPECBENCH_MODEL_BASE) in your models directory (SPECBENCH_MODELS_DIR), then:

python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py && python scripts/inject.py

This is where you come in. This is experiment #5 in our open-source local-LLM benchmark series, and the whole point is to map where the cliff sits across real consumer hardware. A faster PCIe 4.0 or 5.0 link, a different memory configuration, an Apple Silicon Mac with unified memory, an older card — every rig moves the partial-offload numbers, and we want to see how far. The make_submission.py script auto-captures your CPU, RAM, GPU, PCIe link, board, and storage into a single submission file; open a pull request adding it and we'll fold it into a growing community comparison.

Full instructions, the sweep script, and our raw results are in the repo: github.com/InventiveHQ/local-llm-benchmarks/tree/main/experiments/vram-offload-cliff. Submitted data is licensed CC BY 4.0.


New to running models on your own hardware? Start with running local AI: the complete guide, then size your card with how much VRAM you need to run an LLM.

Frequently Asked Questions

What is the -ngl flag in llama.cpp?

The -ngl flag (number of GPU layers) tells llama.cpp how many of a model's transformer layers to load into GPU VRAM. Every layer past that count stays in system RAM and runs on the CPU. -ngl 0 keeps the entire model on the CPU; a high value like -ngl 99 forces every layer onto the GPU. Because the 14B model here has only 48 transformer layers (49 counting the output head that llama.cpp includes in its offload count), anything at or above 49 puts the whole model on the GPU, hence the common "-ngl 99" shorthand for "all of it."

Why is partial GPU offload so much slower than full offload?

On every single decoding step the GPU has to process the layers it holds, then stall and pull the CPU-resident layers across the PCIe bus. That link delivers at best about 8 GB/s one-way on our PCIe 3.0 ×8 test rig (roughly 16 GB/s on a PCIe 4.0 ×8 host), versus about 448 GB/s of on-card GDDR7 bandwidth. The GPU cannot overlap that round trip with useful work, so even a handful of CPU layers forces a stall on every token. In our sweep, keeping just 8 of 48 layers on the CPU (-ngl 40) left throughput at 12.5 tok/s, under 30% of the 43.18 tok/s you get with all 48 layers on the GPU.

How much faster is full-GPU versus full-CPU inference?

On the same Qwen2.5-Coder-14B Q4_K_M model and the same machine, full GPU (-ngl 99) ran at 43.18 tok/s and full CPU (-ngl 0) ran at 2.89 tok/s — a 14.9× gap, effectively 15×. That is the entire spread between a usable, conversational generation speed and a painfully slow crawl, decided purely by whether the model fits in VRAM.

Is there a sweet spot for -ngl when a model doesn't fully fit?

No. The data shows no partial-offload setting that recovers most of the GPU's speed. Throughput stays near the bottom of the curve across the entire partial range and only jumps once the final CPU layers are removed. You are either fully on the GPU or you are not — tuning -ngl to split the model is not a meaningful optimization. If the model won't fit, the better move is a smaller model or a heavier quant that does.

How much VRAM does a 14B model need at full offload?

In this test, Qwen2.5-Coder-14B at Q4_K_M used 9,179 MB (about 9.2 GB) of VRAM at -ngl 99 with a 4K context. That fits a 16 GB card with over 6 GB to spare and clears a 10 GB card with roughly 800 MB of headroom before the KV cache fills. An 8 GB card cannot hold every layer of this model and quant, so it is stuck at the bottom of the cliff.

Should I use a heavier quant or partial offload if VRAM is tight?

Use the heavier (more aggressive, lower-bit) quant. A model that fits entirely in VRAM at a lower precision (for example a Q2_K 14B at around 5.5 GB) will significantly outrun a higher-precision Q4_K_M of the same model that has to offload layers to the CPU. The cliff penalty from spilling to system RAM overwhelms the quality penalty from a more aggressive quant every time.

Why do the last few layers matter more than the rest?

It is less about the layers themselves and more about a threshold effect: as long as any layer lives on the CPU, every decode step still pays a PCIe round trip and the GPU keeps stalling. Only when the final CPU layers come off does the decode pipeline become fully bandwidth-bound on local GDDR7 and stop crossing the bus. In our sweep, the last step (-ngl 40 → 99) added 30.7 tok/s, more than three times the 9.6 tok/s gained across the entire partial-offload range before it.

Can I reproduce these results on my own hardware?

Yes. The sweep runs on a stock llama.cpp CUDA build with the public Qwen2.5-Coder-14B GGUF. The benchmark harness, the -ngl sweep script, and our raw results are all in the open-source repo, and we are collecting community submissions from other GPUs and CPUs. If you have different hardware, a single run is a useful data point.

Tags: Local AI, LLM Inference, VRAM, GPU Offload, llama.cpp, Benchmarks
