Fine-Tuning VRAM Calculator
Calculate VRAM requirements for fine-tuning LLMs with full fine-tuning, LoRA, or QLoRA — accounting for gradients, optimizer states, activations, and gradient checkpointing.
Where Training Memory Actually Goes
Fine-tuning memory has four components, and which one dominates depends on the method:
Weights: the base model itself. Full fine-tuning and LoRA keep it in bf16 (2 bytes/param); QLoRA quantizes it to 4-bit NF4 (~0.55 bytes/param).
Gradients: needed for every trainable parameter. Full fine-tuning trains everything (gradients = weight size). LoRA/QLoRA only train tiny adapters, so gradients are negligible.
Optimizer states: AdamW stores two fp32 moments per trainable parameter, 8 bytes per parameter in total. This is what makes full fine-tuning so expensive: an 8B model needs 64 GB of optimizer states alone. 8-bit optimizers cut this by 4x, to about 2 bytes per parameter.
Activations: intermediate values from the forward pass, scaling with batch size × sequence length × model depth. Gradient checkpointing cuts these by ~85%.
The punchline: for full fine-tuning, optimizer states dominate. For LoRA/QLoRA, the frozen base model dominates — which is why quantizing it (QLoRA) is such a big win.
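The breakdown above can be turned into a rough back-of-the-envelope estimate. The following is a minimal Python sketch, not the calculator's actual implementation: the function name `estimate_static_vram_gb`, the 1% `trainable_fraction` default for adapters, bf16 storage for adapters and gradients, and the use of decimal gigabytes are all assumptions made for illustration. Activations are left out because they depend on batch size, sequence length, and architecture.

```python
# Bytes per parameter, taken from the figures quoted in the text.
BYTES_BF16 = 2.0    # bf16 weights and gradients
BYTES_NF4 = 0.55    # 4-bit NF4 weights, including quantization overhead
BYTES_ADAMW = 8.0   # two fp32 moments per trainable parameter
GB = 1e9            # assumption: decimal gigabytes


def estimate_static_vram_gb(params_billion, method, trainable_fraction=0.01):
    """Rough weights + gradients + optimizer estimate (activations excluded).

    method: "full", "lora", or "qlora".
    trainable_fraction: assumed adapter share for LoRA/QLoRA (hypothetical 1%).
    """
    n = params_billion * 1e9
    if method == "full":
        trainable = n
        weights = n * BYTES_BF16
    elif method == "lora":
        trainable = n * trainable_fraction
        # Frozen bf16 base model plus bf16 adapters (assumed precision).
        weights = n * BYTES_BF16 + trainable * BYTES_BF16
    elif method == "qlora":
        trainable = n * trainable_fraction
        # Frozen NF4 base model plus bf16 adapters (assumed precision).
        weights = n * BYTES_NF4 + trainable * BYTES_BF16
    else:
        raise ValueError(f"unknown method: {method}")
    gradients = trainable * BYTES_BF16
    optimizer = trainable * BYTES_ADAMW
    return {
        "weights": weights / GB,
        "gradients": gradients / GB,
        "optimizer": optimizer / GB,
        "total": (weights + gradients + optimizer) / GB,
    }


if __name__ == "__main__":
    for method in ("full", "lora", "qlora"):
        est = estimate_static_vram_gb(8, method)
        print(method, {k: round(v, 1) for k, v in est.items()})
```

For an 8B model under these assumptions, full fine-tuning comes to 16 GB of weights, 16 GB of gradients, and 64 GB of optimizer states (96 GB before activations), while the LoRA and QLoRA totals are dominated by the frozen base model.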
Practical Fine-Tuning Hardware Guide
What you can realistically train on common hardware (batch 1, 2K sequence, gradient checkpointing):
12-16 GB (RTX 3060/4060 Ti): QLoRA up to 8-9B models. This is the entry point — Llama 3.1 8B, Qwen3 8B, Gemma 4 E4B all work.
24 GB (RTX 3090/4090): QLoRA up to ~14B comfortably, LoRA up to 8B, or QLoRA on 27-32B models with short sequences.
48 GB (2x 3090, RTX 6000 Ada, 64 GB Mac): QLoRA on 70B models — the sweet spot for serious open-model fine-tuning.
80 GB+ (A100/H100): LoRA on 70B, full fine-tuning of 7-8B models, QLoRA on the largest MoE models.
Cloud alternative: renting an A100 80GB at ~$1.40/hr means a typical QLoRA run (3-12 hours) costs roughly $5-20, often cheaper than upgrading hardware for occasional training.
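The rental arithmetic is simple enough to check directly. This small Python sketch uses the hourly rate and run lengths quoted above; the function name `rental_cost` is hypothetical, and real prices vary by provider and over time.

```python
def rental_cost(hours, rate_per_hour=1.40):
    """Cost in dollars of renting a GPU for `hours` at `rate_per_hour`."""
    return hours * rate_per_hour


if __name__ == "__main__":
    # Run lengths taken from the text: a typical QLoRA run of 3 to 12 hours.
    for hours in (3, 12):
        print(f"{hours} h -> ${rental_cost(hours):.2f}")
```

At the quoted rate this works out to about $4.20 for a 3-hour run and about $16.80 for a 12-hour run, before storage or data-transfer charges.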
Frequently Asked Questions
Common questions about the Fine-Tuning VRAM Calculator
Inference only needs the model weights and KV cache. Training adds three big consumers: gradients (same size as the weights), optimizer states (AdamW keeps two moment estimates per parameter in fp32: 8 bytes per parameter, 4x the bf16 weights), and activations (intermediate values saved during the forward pass for backpropagation, which scale with batch size and sequence length). Full fine-tuning of an 8B model needs ~100+ GB; running it with 4-bit quantized weights needs ~6 GB.
LoRA freezes the base model in bf16 and trains small adapter matrices instead, so you only need gradients and optimizer states for the adapters (typically under 1% of parameters). QLoRA goes further: the frozen base model is quantized to 4-bit (NF4), cutting its memory by ~73%. QLoRA is what makes fine-tuning a 70B model possible on a single 48-64 GB GPU. The quality difference between the two is usually negligible.
For an 8B model, QLoRA needs about 10-14 GB (fits an RTX 3060 12GB or any 16 GB card), LoRA about 20-24 GB (RTX 3090/4090), and full fine-tuning 100+ GB (multiple A100s or H100s). These numbers assume batch size 1, 2K sequence length, and gradient checkpointing enabled; the calculator lets you adjust all of these.
Gradient checkpointing trades compute for memory: instead of storing all activations from the forward pass, it stores a fraction and recomputes the rest during backpropagation. It cuts activation memory by ~85% at the cost of ~30% slower training. On consumer GPUs it is almost always worth enabling, because it is often the difference between fitting and not fitting.
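The store-a-fraction-and-recompute idea can be shown on a toy model. The sketch below is pure Python with a hypothetical chain of scalar layers y = tanh(w * x); it is not how any particular framework implements checkpointing, and it will not reproduce the ~85% and ~30% figures, which come from real transformer workloads. In practice a framework facility such as PyTorch's `torch.utils.checkpoint` would be used instead.

```python
import math


def forward_layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(w, x, grad_y):
    """Gradient of the toy layer's output with respect to its input x."""
    y = math.tanh(w * x)
    return grad_y * (1.0 - y * y) * w


def grad_full(weights, x0):
    """Store every layer input during the forward pass, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, segment=4):
    """Store only each segment's first input; recompute the rest on demand."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)
    grad = 1.0
    peak_stored = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute this segment's layer inputs from its checkpoint.
        inputs = []
        x = checkpoints[start]
        for w in weights[start:end]:
            inputs.append(x)
            x = forward_layer(w, x)
        # Slight overcount: the checkpoint itself also appears in `inputs`.
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(weights[start:end]), reversed(inputs)):
            grad = backward_layer(w, x_in, grad)
    return grad, peak_stored


if __name__ == "__main__":
    ws = [0.9 + 0.01 * i for i in range(16)]
    print("full:        ", grad_full(ws, 0.5))
    print("checkpointed:", grad_checkpointed(ws, 0.5, segment=4))
```

Both functions return the same gradient; the checkpointed version holds fewer values at its peak but runs each layer's forward computation a second time, which is where the extra training time comes from.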
LoRA rank controls adapter capacity. r=8-16 works for style/format adaptation and most chat fine-tunes; r=32-64 suits teaching substantial new knowledge or behaviors; r=128+ approaches full fine-tuning quality, but with diminishing returns. Adapter memory grows linearly with rank, but it is small compared to the base model, so rank rarely determines whether a job fits.
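The linear scaling follows from how LoRA is built: for a weight matrix of shape d_out x d_in it adds two factors of shapes d_out x r and r x d_in, so r * (d_in + d_out) trainable parameters. The Python sketch below applies that to a hypothetical model shape (hidden size 4096, 32 layers, four square projections adapted per layer) and assumes 12 bytes per adapter parameter (bf16 weight, bf16 gradient, two fp32 AdamW moments). These shapes and precisions are illustrative assumptions, not values from any specific model.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def adapter_memory_gb(rank, hidden=4096, layers=32, matrices_per_layer=4,
                      bytes_per_param=12.0):
    """Adapter weights + gradients + AdamW states, in decimal GB.

    Assumed (hypothetical) model shape: square hidden x hidden projections.
    bytes_per_param = 2 (bf16 weight) + 2 (bf16 gradient) + 8 (AdamW moments).
    """
    per_matrix = lora_params(hidden, hidden, rank)
    total = per_matrix * matrices_per_layer * layers
    return total * bytes_per_param / 1e9


if __name__ == "__main__":
    for rank in (8, 16, 32, 64, 128):
        print(f"r={rank:<3} adapter memory ~ {adapter_memory_gb(rank):.2f} GB")
```

Under these assumptions the adapter footprint ranges from roughly 0.1 GB at r=8 to roughly 1.6 GB at r=128, small next to the multi-gigabyte frozen base model.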
Larger batches give more stable gradients but need proportionally more activation memory. The standard workaround is gradient accumulation: run multiple batch-size-1 steps and accumulate gradients before updating — same effective batch size, fraction of the memory. If memory is tight, use batch size 1 with accumulation steps of 8-32.
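Why accumulation gives the same effective batch size can be checked on a toy problem. This pure-Python sketch uses a hypothetical one-parameter model y = w * x with mean squared error; the function names are made up for the example, and it assumes the data splits into equal-sized micro-batches.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size=1):
    """Average micro-batch gradients, as gradient accumulation does.

    Assumes len(data) is a multiple of micro_batch_size.
    """
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for mb in micro_batches:
        # Scale each micro-batch so the sum matches the full-batch mean.
        total += grad_mse(w, mb) / len(micro_batches)
    return total


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
    print("full batch: ", grad_mse(0.5, data))
    print("accumulated:", accumulated_grad(0.5, data, micro_batch_size=1))
```

The two gradients agree up to floating-point rounding, but the accumulated version only ever processes one micro-batch at a time, so activation memory stays at the micro-batch level.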
Fine-tuning on a Mac is possible with MLX (Apple's ML framework): MLX-LM supports LoRA and QLoRA fine-tuning with similar memory characteristics. A 64 GB Mac can QLoRA-tune models up to ~30B. Training is slower than on NVIDIA GPUs (less compute), but for small datasets and adapters it is entirely practical.