Question 1

Why does fine-tuning need so much more memory than inference?

Accepted Answer

Inference needs only the model weights and the KV cache. Training adds three big consumers: **gradients** (same size as the trainable weights), **optimizer states** (AdamW keeps two moment estimates per parameter in fp32: 8 bytes per parameter, 4x the size of bf16 weights), and **activations** (intermediate values saved during the forward pass for backpropagation, which scale with batch size and sequence length). Full fine-tuning of an 8B model needs ~100+ GB; running it needs about 16 GB in bf16, or ~6 GB with 4-bit quantized weights.
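
As a rough check on those numbers, here is a minimal sketch of the per-parameter arithmetic. It assumes bf16 weights and gradients (2 bytes each) and fp32 AdamW states (8 bytes), and it leaves out activations, which depend on batch size and sequence length. The function name is illustrative and is not taken from the calculator.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic easy to follow


def full_finetune_state_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Per-component memory for full fine-tuning, excluding activations.

    Assumed defaults: bf16 weights and gradients, two fp32 AdamW
    moment estimates (2 x 4 bytes) per parameter.
    """
    weights = n_params * weight_bytes / GB
    gradients = n_params * grad_bytes / GB
    optimizer = n_params * optimizer_bytes / GB
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


breakdown = full_finetune_state_gb(8e9)  # an 8B-parameter model
for name, gb in breakdown.items():
    print(f"{name:>10}: {gb:6.1f} GB")
```

For 8B parameters this comes to 16 + 16 + 64 = 96 GB before any activations are counted, which is consistent with the ~100+ GB figure.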

Question 2

What is the difference between LoRA and QLoRA?

Accepted Answer

**LoRA** freezes the base model in bf16 and trains small adapter matrices instead, so you only need gradients and optimizer states for the adapters (typically under 1% of parameters). **QLoRA** goes further: the frozen base model is quantized to 4-bit (NF4), cutting its memory by ~73% relative to bf16. QLoRA is what makes fine-tuning a 70B model possible on a single 48-64 GB GPU. The quality difference between the two is usually negligible.
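
The saving from 4-bit quantization can be sketched with simple arithmetic. The overhead figures below are assumptions based on the quantization scheme described in the QLoRA paper (one scale per 64-weight block, optionally compressed with double quantization); real libraries may differ slightly, and the helper names are illustrative.

```python
BF16_BITS = 16.0

# Assumption: 4-bit values plus one fp32 scale per block of 64 weights.
NF4_BITS_PLAIN = 4 + 32 / 64
# Assumption: double quantization stores 8-bit scales, plus one fp32
# second-level scale per 256 blocks.
NF4_BITS_DOUBLE_QUANT = 4 + 8 / 64 + 32 / (64 * 256)


def base_weights_gb(n_params, bits_per_param):
    """Memory of the frozen base weights in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


def saving_vs_bf16(bits_per_param):
    """Fraction of bf16 weight memory saved at the given bit width."""
    return 1 - bits_per_param / BF16_BITS


for label, bits in [
    ("bf16 (LoRA base)", BF16_BITS),
    ("NF4", NF4_BITS_PLAIN),
    ("NF4 + double quant", NF4_BITS_DOUBLE_QUANT),
]:
    gb = base_weights_gb(70e9, bits)  # a 70B-parameter model
    print(f"{label:>20}: {gb:6.1f} GB  (saves {saving_vs_bf16(bits):.0%})")
```

Under these assumptions the saving lands between roughly 72% and 74%, which brackets the ~73% quoted above, and a 70B base drops from 140 GB to under 40 GB.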

Question 3

How much VRAM do I need to fine-tune Llama 3.1 8B?

Accepted Answer

With **QLoRA**: about 10-14 GB (the low end fits a 12 GB RTX 3060; any 16 GB card covers the full range). With **LoRA**: about 20-24 GB (RTX 3090/4090). With **full fine-tuning**: 100+ GB (multiple A100s or H100s). These numbers assume batch size 1, a 2K sequence length, and gradient checkpointing enabled; the calculator lets you adjust all of these.

Question 4

What is gradient checkpointing and should I use it?

Accepted Answer

Gradient checkpointing trades compute for memory: instead of storing all activations from the forward pass, it stores a fraction and recomputes the rest during backpropagation. It cuts activation memory by ~85% at the cost of ~30% slower training. For consumer GPUs the answer is almost always yes — it is the difference between fitting and not fitting.
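
The store-less, recompute-more idea can be illustrated with a toy chain of scalar layers. This is a sketch of the technique only, not how any training framework implements it: the layer function, the weights, and the segment size are all made up for illustration, and the toy does not reproduce the ~85% and ~30% figures above.

```python
import math


def layer(w, x):
    """One toy layer: a scaled tanh."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def backprop_full(weights, x0):
    """Standard backprop: keep every layer input from the forward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)


def backprop_checkpointed(weights, x0, segment):
    """Keep only each segment's input, then recompute inside the segment."""
    checkpoints, x = [], x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(w, x)
    grad, peak_stored = 1.0, 0
    for s in reversed(range(len(checkpoints))):
        seg_weights = weights[s * segment:(s + 1) * segment]
        # Extra forward pass over this segment: the compute cost of checkpointing.
        seg_inputs, x = [], checkpoints[s]
        for w in seg_weights:
            seg_inputs.append(x)
            x = layer(w, x)
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= layer_grad(w, x_in)
    return grad, peak_stored


weights = [0.5 + 0.01 * i for i in range(32)]  # 32 hypothetical layers
grad_full, stored_full = backprop_full(weights, 0.3)
grad_ckpt, stored_ckpt = backprop_checkpointed(weights, 0.3, segment=4)
print(f"stored values: {stored_full} (full) vs {stored_ckpt} (checkpointed)")
print(f"gradients match: {math.isclose(grad_full, grad_ckpt)}")
```

Both versions produce the same gradient. The checkpointed one holds fewer values at once and pays for that with one extra forward pass per segment.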

Question 5

What LoRA rank should I use?

Accepted Answer

Rank controls adapter capacity. **r=8-16** works for style/format adaptation and most chat fine-tunes. **r=32-64** suits teaching substantial new knowledge or behaviors. **r=128+** approaches full fine-tuning quality, but with diminishing returns. Adapter memory grows linearly with rank, but it is small compared to the base model, so rank rarely determines whether a job fits.
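
To see why rank rarely decides whether a job fits, here is a sketch that counts adapter parameters. For a linear layer of shape d_in x d_out, LoRA adds r * (d_in + d_out) parameters. The layer shapes below are assumptions chosen to match the attention projections of Llama 3.1 8B, and adapting only those four projections is also an assumption; adapting the MLP layers as well would raise the totals.

```python
# Assumed (d_in, d_out) shapes of the adapted projections in one layer.
ATTN_PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
N_LAYERS = 32  # assumed number of transformer layers

# Assumed cost per trainable parameter:
# bf16 weight (2) + bf16 gradient (2) + fp32 AdamW states (8).
BYTES_PER_TRAINABLE_PARAM = 2 + 2 + 8


def lora_params(rank):
    """Trainable parameters added by LoRA at the given rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in ATTN_PROJECTIONS.values())
    return per_layer * N_LAYERS


def adapter_memory_mb(rank):
    """Weights, gradients, and optimizer states for the adapters, in MB."""
    return lora_params(rank) * BYTES_PER_TRAINABLE_PARAM / 1e6


for rank in (8, 16, 64, 128):
    print(f"r={rank:<3} params={lora_params(rank):>11,}  memory={adapter_memory_mb(rank):7.1f} MB")
```

Under these assumptions r=16 adds about 13.6 million parameters and roughly 164 MB of training state. Doubling the rank doubles both, and even r=128 stays far below the size of the base weights.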

Question 6

Does batch size matter for quality?

Accepted Answer

Larger batches give more stable gradients but need proportionally more activation memory. The standard workaround is **gradient accumulation**: run several batch-size-1 steps and sum their gradients before each optimizer update. This gives the same effective batch size at a fraction of the activation memory. If memory is tight, use batch size 1 with 8-32 accumulation steps.
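
The equivalence behind gradient accumulation can be checked on a toy problem. This sketch uses a one-parameter linear model with mean squared error; the data and the starting weight are made up for illustration. Each micro-batch gradient is scaled by the number of accumulation steps so that the sum matches the full-batch average.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for y ~ w * x, averaged over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # toy (x, y) pairs
w = 0.5

# One step on the full batch of 4: needs all 4 samples in memory at once.
full_batch_grad = mse_grad(w, data)

# Four batch-size-1 steps, accumulated before a single update.
accum_steps = len(data)
accumulated = 0.0
for sample in data:
    accumulated += mse_grad(w, [sample]) / accum_steps

print(f"full batch:  {full_batch_grad:.6f}")
print(f"accumulated: {accumulated:.6f}")
```

The two gradients agree up to floating-point rounding, so the update is the same while only one sample's activations are held at a time. The match is exact only when the micro-batches are of equal size.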

Question 7

Can I fine-tune on Apple Silicon?

Accepted Answer

Yes, with MLX (Apple's ML framework). MLX-LM supports LoRA and QLoRA fine-tuning, with memory requirements similar to those on NVIDIA GPUs. A 64 GB Mac can QLoRA-tune models up to ~30B. Training is slower than on NVIDIA GPUs because Apple Silicon has less compute throughput, but for small datasets and adapters it is entirely practical.

Fine-Tuning VRAM Calculator

Estimate the VRAM You Need to Fine-Tune an LLM

Where Training Memory Actually Goes

LoRA and QLoRA Change the Math

When to Use This

Practical Fine-Tuning Hardware Guide

Frequently Asked Questions