Why model formats matter
When you download a model from Hugging Face you are not downloading "a model" — you are downloading a specific file format that encodes the weights in a specific way. That format decides three things before you ever run a prompt: which runtime can open the file, how much memory the weights consume, and how fast inference goes on your hardware. Grab a 140 GB FP16 safetensors repo when you meant to pull a 4.9 GB GGUF and you have just wasted an afternoon and your disk. Picking the wrong format is the single most common beginner mistake in local AI.
The good news: there are only a handful of formats that matter, and once you know which runtime you intend to use, the choice is nearly automatic. This guide covers GGUF, safetensors, MLX, GPTQ, AWQ, and EXL2 — what each one is, who loads it, and how to pick. If you want to know how much memory a given file needs before you download it, run the model through the LLM VRAM calculator; if you want a one-click read on what your current GPU can handle, try what LLM can I run.
The one table that answers most questions
Here is the whole landscape on one screen. Everything below is elaboration.
| Format | What it is | Primary runtime | Quantized? | Best for |
|---|---|---|---|---|
| GGUF | Single-file container (weights + tokenizer + metadata), memory-mapped | llama.cpp, Ollama, LM Studio, Jan, GPT4All | Yes (Q2–Q8 K-quants, IQ-quants) | CPU + consumer-GPU inference; the default for desktops and Macs |
| safetensors | Hugging Face full-precision tensor container (no pickle, safe to load) | transformers, vLLM, Diffusers, MLX (as source) | No (usually FP16/BF16) | The high-precision source; GPU serving at full precision; fine-tuning |
| MLX | Apple's safetensors-based format with MLX quantization | mlx-lm, Ollama 0.19+, LM Studio (on Macs) | Yes (MLX 4/8-bit) and FP16 | Apple Silicon — fastest path on M-series, unified memory |
| GPTQ | Post-training 4-bit quant (layer-wise error minimization), in safetensors | vLLM, transformers, TGI, ExLlama | Yes (typically 4-bit) | NVIDIA throughput serving where a GPTQ build exists |
| AWQ | Activation-aware 4-bit quant (protects salient channels), in safetensors | vLLM, transformers, TGI | Yes (typically 4-bit) | NVIDIA throughput serving; often better quality than GPTQ at 4-bit |
| EXL2 | ExLlamaV2's variable-bitrate quant (mix bpw per layer) | ExLlamaV2 / TabbyAPI | Yes (2.0–8.0 bpw, fractional) | Squeezing max quality into a fixed NVIDIA VRAM budget |
Two patterns fall out of this table immediately. First, GGUF and MLX are the "consumer" formats: you point a desktop app at one file and it runs. Second, safetensors, GPTQ, AWQ, and EXL2 are the "server" formats: they live in multi-file Hugging Face repos and are loaded by GPU engines that care about concurrent throughput. The rest of this article walks through each one.
GGUF — the consumer local-AI standard
GGUF ("GPT-Generated Unified Format") was created by the llama.cpp project as the successor to the older GGML format, and it has become the de-facto container for quantized local models. Its design is what makes it so convenient:
- One file. Weights, the tokenizer, and all metadata (architecture, context length, chat template) live in a single .gguf. There is no separate config, no tokenizer download, nothing to assemble.
- Memory-mapped. The runtime mmaps the file, so the OS pages weights in on demand and loading feels instant; the model isn't fully read into RAM up front.
- Quantized by default. GGUF files almost always ship pre-quantized, which is why an 8B model that is 16 GB in FP16 shows up as a ~4.9 GB Q4_K_M.gguf.
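To make the single-file, memory-mapped idea concrete, here is a minimal sketch that maps a file and decodes only its first 24 bytes. It assumes the GGUF v2/v3 prefix layout (magic, version, tensor count, metadata key count); the helper names and the synthetic demo file are hypothetical stand-ins, not part of llama.cpp.

```python
import mmap
import os
import struct
import tempfile

# Assumed fixed-size prefix of a GGUF v2/v3 file, little-endian:
# 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count.
GGUF_PREFIX = struct.Struct("<4sIQQ")


def read_gguf_prefix(path):
    """Memory-map the file and decode only the first 24 bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map touches only the pages that hold the prefix.
            magic, version, n_tensors, n_kv = GGUF_PREFIX.unpack(mm[:GGUF_PREFIX.size])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}


def write_fake_gguf(path, n_tensors=291, n_kv=25, payload_bytes=1 << 20):
    """Hypothetical stand-in: a valid prefix followed by zero padding."""
    with open(path, "wb") as f:
        f.write(GGUF_PREFIX.pack(b"GGUF", 3, n_tensors, n_kv))
        f.write(b"\0" * payload_bytes)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gguf")
    write_fake_gguf(demo_path)
    info = read_gguf_prefix(demo_path)

print(info)
```

Pointing read_gguf_prefix at a real .gguf should work the same way if the layout assumption holds. The rest of the file stays on disk until something reads it.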
That last point is where the cryptic filenames come from. A file like Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf encodes the quantization scheme in the suffix:
- Q<N> is the nominal bits per weight: Q4 is 4-bit, Q8 is 8-bit.
- _K marks a "K-quant": llama.cpp's super-block scheme (256-weight blocks with hierarchical scales) that mixes precision across tensor types for far better quality-per-byte than the legacy Q4_0/Q4_1 quants.
- _S/_M/_L is the tier (Small, Medium, Large), meaning how many sensitive tensors get bumped to higher precision. _M is the sweet spot.
- IQ<N> quants (IQ2/IQ3/IQ4) use an importance matrix and non-linear codebooks for better quality at very low bit counts, at the cost of slower decode.
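The naming scheme is regular enough to parse mechanically. The sketch below is a hypothetical helper (parse_gguf_quant is not part of any runtime) that covers only the suffix patterns described here. Real repositories use additional variants that it will not recognize.

```python
import re

# Matches suffixes such as -Q4_K_M.gguf, -Q8_0.gguf, -Q6_K.gguf, -IQ3_M.gguf.
QUANT_RE = re.compile(
    r"[-.](?P<family>IQ|Q)(?P<bits>\d+)"   # Q4, Q8, IQ3, ...
    r"(?:_(?P<kind>K|0|1))?"               # _K = K-quant, _0/_1 = legacy quants
    r"(?:_(?P<tier>[SML]))?"               # tier: Small, Medium, Large
    r"\.gguf$"
)

TIERS = {"S": "Small", "M": "Medium", "L": "Large"}


def parse_gguf_quant(filename):
    """Return the quantization details encoded in a GGUF filename, or None."""
    m = QUANT_RE.search(filename)
    if m is None:
        return None
    if m.group("family") == "IQ":
        scheme = "IQ-quant"
    elif m.group("kind") == "K":
        scheme = "K-quant"
    else:
        scheme = "legacy"
    return {
        "bits": int(m.group("bits")),
        "scheme": scheme,
        "tier": TIERS.get(m.group("tier")),
    }


print(parse_gguf_quant("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```

A helper like this is handy for filtering a repo's file list down to the quant you want before downloading anything.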
Q4_K_M is the default you want for most models on most hardware: roughly 0.6 effective bytes per weight (about 4.9 GB for an 8B model) and only a few percent perplexity rise versus FP16 (Llama-3.1-8B measures ppl ~7.56 at Q4_K_M vs ~7.32 at FP16, +3.3%). Quality stays close to lossless down to ~4-bit and then falls off a cliff below it. We go deep on every suffix, the quality cliff, and the IQ-quants in LLM quantization explained; read that if you want to understand the trade-offs rather than just copy the default.
Who loads GGUF: Ollama, LM Studio, Jan, and GPT4All all sit on top of llama.cpp (or ship it as a backend), so anything GGUF "just works" in the popular desktop tools. If you are choosing between those runtimes, see Ollama vs LM Studio vs llama.cpp.
safetensors — the Hugging Face standard
safetensors is the format almost every model is first published in. It is a simple, secure container for raw tensors: a small JSON header describing each tensor's name, dtype, and shape, followed by the contiguous tensor bytes. "Safe" is literal — it replaced PyTorch's .bin/pickle files, which could execute arbitrary code on load. safetensors cannot; it only carries data.
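The layout is simple enough to read with the standard library. The sketch below assumes the published safetensors layout: an 8-byte little-endian length prefix, then the JSON header, then the tensor bytes, with each header entry carrying data_offsets relative to the start of the tensor bytes. The blob is built in memory purely for illustration, and the tensor names are made up.

```python
import json
import struct


def build_blob(tensors):
    """Pack {name: (dtype, shape, raw_bytes)} into a safetensors-style blob."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # Assumed layout: u64 header length, JSON header, contiguous tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return (header dict, byte offset where tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


# Two tiny made-up FP16 tensors: 2x2 (8 bytes) and length 2 (4 bytes).
blob = build_blob({
    "embed.weight": ("F16", [2, 2], bytes(range(8))),
    "norm.weight": ("F16", [2], bytes(range(8, 12))),
})
header, data_start = read_header(blob)
begin, end = header["norm.weight"]["data_offsets"]
norm_bytes = blob[data_start + begin:data_start + end]
print(header["norm.weight"], norm_bytes)
```

Nothing in this path evaluates code from the file. The reader only parses JSON and slices bytes, which is the property that makes the format safe to load.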
Two things to know. First, it is usually full precision — a safetensors repo typically holds FP16/BF16 weights at 2.0 bytes per weight, so a 70B model is ~140 GB across several sharded files plus config.json and tokenizer files. It is not quantized unless the repo name says GPTQ/AWQ. Second, it is the source of truth: almost every GGUF, MLX, GPTQ, AWQ, and EXL2 file you download was converted from an original safetensors release, which lands first when a new model drops.
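The size figures follow from one multiplication. As a quick sketch (weight_size_gb is a hypothetical helper, and it counts only the weights, not the config or tokenizer files):

```python
def weight_size_gb(n_params, bytes_per_weight):
    """Approximate on-disk size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_weight / 1e9


FP16_BYTES = 2.0  # FP16 and BF16 both store 2 bytes per weight

size_70b = weight_size_gb(70e9, FP16_BYTES)
size_8b = weight_size_gb(8e9, FP16_BYTES)
print(f"70B at FP16: ~{size_70b:.0f} GB")
print(f"8B at FP16:  ~{size_8b:.0f} GB")
```

Swapping in a smaller bytes-per-weight value gives a first estimate for any quantized build of the same model.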
safetensors is loaded natively by Hugging Face transformers and by vLLM (FP16/BF16), and it is the input to fine-tuning. If you are serving at full precision or training, safetensors is your format. For everything memory-constrained, you convert down to a quantized format.
MLX — Apple Silicon native
MLX is Apple's open-source array framework for Apple Silicon, built around unified memory so the CPU and GPU share one zero-copy address space. The companion library mlx-lm loads, generates, fine-tunes, and quantizes models. The MLX format is safetensors-based but carries MLX-specific quantization (4-bit and 8-bit), distributed through the mlx-community org on Hugging Face.
Why it matters on a Mac: MLX is now the high-performance backend underneath the popular Mac apps. Ollama 0.19 (preview, ~March 2026) added an optional MLX backend that replaces the llama.cpp Metal path on Apple Silicon Macs with 32 GB+ of unified memory, with a reported speedup of roughly 2× over the prior Metal path. LM Studio uses MLX as its default engine when an MLX build of a model exists, with reported speeds ~30–50% faster than llama.cpp/Metal. On the newest M5-series chips, MLX taps the GPU Neural Accelerators for faster time-to-first-token.
The practical rule: on a capable Mac, prefer an MLX build when one is available; fall back to GGUF when it isn't (GGUF still runs everywhere). We cover unified-memory sizing, which quant to pick per Mac, and real throughput in running LLMs on Apple Silicon with MLX.
GPTQ, AWQ, EXL2 — GPU-side quantization
These three are the "server" quantization formats. They all produce safetensors-style files, all target NVIDIA GPUs, and all remain the practical choice because GGUF support in GPU serving engines is experimental and under-optimized (vLLM's own docs recommend llama.cpp instead if you only have GGUF). When you are running a high-throughput, multi-user endpoint, you reach for one of these.
- GPTQ quantizes the model one layer at a time, using second-order information to choose 4-bit values that minimize reconstruction error. It is the long-standing 4-bit standard for GPU inference and is loaded natively by vLLM, transformers, and TGI.
- AWQ (Activation-aware Weight Quantization) identifies the small fraction of weight channels that matter most to activations and protects them, quantizing the rest to 4-bit. It frequently edges out GPTQ on quality at the same bitrate and is likewise first-class in vLLM.
- EXL2 is the ExLlamaV2 format, and its trick is variable bitrate: you can target an average bits-per-weight (say 4.65 bpw) and the converter mixes per-layer precision to hit it. That lets you fill a fixed VRAM budget exactly — handy when you have, say, a 24 GB card and want the best possible quality that fits. EXL2 runs under ExLlamaV2/TabbyAPI rather than vLLM.
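The EXL2 budgeting idea comes down to two small calculations: the average bpw a card can afford, and the average bpw that a per-layer mix produces. The sketch below is arithmetic only. It is not the ExLlamaV2 converter, and the reserve for KV cache and overhead, the model size, and the layer split are all assumed values.

```python
def max_bpw(n_params, vram_gb, reserve_gb, lo=2.0, hi=8.0):
    """Highest average bits-per-weight whose weights fit in the VRAM left over.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    Returns None if the model does not fit even at the lowest bitrate.
    """
    budget_bits = (vram_gb - reserve_gb) * 1e9 * 8
    bpw = budget_bits / n_params
    if bpw < lo:
        return None
    return min(bpw, hi)


def average_bpw(layers):
    """Weighted average over (n_weights, bpw) pairs, one pair per layer group."""
    total = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total


# Hypothetical 32B model on a 24 GB card, keeping 4 GB back for cache and overhead.
target = max_bpw(32e9, vram_gb=24, reserve_gb=4)

# Hypothetical mix: a quarter of the weights at 6.0 bpw, the rest at 4.2 bpw.
mix = average_bpw([(1e9, 6.0), (3e9, 4.2)])
print(f"target: {target:.2f} bpw, mix average: {mix:.2f} bpw")
```

The fractional result is the point: a fixed 4-bit format would leave part of the card unused, while a variable-bitrate target can land on whatever the budget allows.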
Why is the choice between these and not GGUF on a server? Server engines like vLLM lean on continuous batching and PagedAttention to serve many concurrent requests, and the GPTQ/AWQ kernels are written to feed those paths. GGUF's strengths (single file, CPU offload, memory-mapping) are single-user conveniences that don't help a batched GPU server.
A decision table: which format do I download?
Match your runtime and hardware to a format and the choice is made for you:
| Your setup | Download this | Notes |
|---|---|---|
| Ollama / LM Studio / Jan on Windows or Linux (any GPU or CPU) | GGUF Q4_K_M | The universal default; bump to Q5_K_M/Q6_K if it fits |
| Mac (Apple Silicon, 32 GB+ unified) | MLX 4-bit if available, else GGUF | MLX is faster on M-series; GGUF always works |
| vLLM / TGI production endpoint on NVIDIA | AWQ or GPTQ 4-bit | Native, batched, high-throughput; AWQ often higher quality |
| ExLlamaV2 / TabbyAPI on a fixed-VRAM NVIDIA card | EXL2 at the bpw that fits | Variable bitrate fills your VRAM budget precisely |
| Full-precision serving or fine-tuning | safetensors (FP16/BF16) | The unquantized source; also what you convert from |
| Tight VRAM, need sub-4-bit | GGUF IQ2/IQ3 or low-bpw EXL2 | Better quality than legacy quants at the same bits |
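The table above can also be written as a small lookup, which is useful if you script your downloads. This is a sketch with hypothetical names (pick_format and its parameters), and it encodes only the rows shown here.

```python
GGUF_RUNTIMES = {"llama.cpp", "ollama", "lm studio", "jan", "gpt4all"}
MLX_CAPABLE = {"ollama", "lm studio"}
GPU_SERVERS = {"vllm", "tgi"}
EXL2_RUNTIMES = {"exllamav2", "tabbyapi"}


def pick_format(runtime, *, apple_silicon=False, unified_memory_gb=0,
                mlx_build_available=False, full_precision=False, sub_4bit=False):
    """Map a runtime and hardware description to a row of the decision table."""
    runtime = runtime.strip().lower()
    if full_precision:
        return "safetensors (FP16/BF16)"
    if runtime in EXL2_RUNTIMES:
        return "low-bpw EXL2" if sub_4bit else "EXL2 at the bpw that fits"
    if runtime in GPU_SERVERS:
        return "AWQ or GPTQ 4-bit"
    if (runtime in MLX_CAPABLE and apple_silicon
            and unified_memory_gb >= 32 and mlx_build_available):
        return "MLX 4-bit"
    if runtime in GGUF_RUNTIMES:
        return "GGUF IQ2/IQ3" if sub_4bit else "GGUF Q4_K_M"
    raise ValueError(f"no row in the table for runtime {runtime!r}")


print(pick_format("Ollama"))
print(pick_format("LM Studio", apple_silicon=True, unified_memory_gb=64,
                  mlx_build_available=True))
print(pick_format("vLLM"))
```

Note the fallback order: a Mac without an MLX build, or with less than 32 GB, drops through to GGUF, which matches the "GGUF always works" note in the table.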
If you are unsure whether a given file will fit, the LLM VRAM calculator computes weights + KV cache + overhead for any model and quant, and the LLM inference speed calculator estimates tokens/sec from your GPU's memory bandwidth so you know what performance to expect before you download. The size comparison below shows why quant choice dominates the fit question:
The FP16 safetensors source is 16 GB; the same model in Q4_K_M GGUF is under 5 GB with near-identical output quality. That ~3.3× shrink — at almost no quality cost — is the entire reason quantized formats exist.
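For a rough sense of what a fit check involves, here is a back-of-the-envelope sketch. It is not the calculators' actual method. The 1 GB overhead, the 8 GB card, and the 500 GB/s bandwidth are assumed illustration values, and the shape figures (32 layers, 8 KV heads, head dimension 128) are the commonly cited ones for Llama-3.1-8B rather than numbers from this article.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """K and V tensors per layer, one head_dim vector per KV head per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9


def fits(weights_gb, kv_gb, vram_gb, overhead_gb=1.0):
    """True if weights + KV cache + an assumed fixed overhead fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed: each token reads every weight once."""
    return bandwidth_gb_s / weights_gb


kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
fp16_fits = fits(16.0, kv, vram_gb=8)   # FP16 safetensors source
q4_fits = fits(4.9, kv, vram_gb=8)      # Q4_K_M GGUF
shrink = 16.0 / 4.9

print(f"KV cache at 8K context: {kv:.2f} GB")
print(f"FP16 fits in 8 GB: {fp16_fits}, Q4_K_M fits in 8 GB: {q4_fits}")
print(f"shrink: {shrink:.2f}x, decode ceiling at 500 GB/s: "
      f"{decode_ceiling_tok_s(500, 4.9):.0f} tok/s")
```

Real throughput lands below the bandwidth ceiling, and real overhead varies by runtime, so treat these as sanity checks rather than predictions.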
Image and video models use formats too
GGUF is no longer text-only. The diffusion community, led by ComfyUI, now ships GGUF builds of image and video models — Wan (video) and Qwen-Image among them — primarily to cut VRAM so a large model fits on a smaller card. If you have seen wan2.x-Q4_K_M.gguf floating around, that is the same container concept applied to a diffusion transformer.
But do not assume text-GGUF intuition transfers. These are not decoder-only transformers; they are U-Net or DiT diffusion models wrapped in VAEs and text encoders, and that changes the rules:
- The quant trade-offs differ. Diffusion quality degrades differently from language-model perplexity, and which tensors tolerate low bits is model-specific.
- Runtime support is narrower. GGUF diffusion runs through ComfyUI's GGUF nodes, not llama.cpp — a separate code path with its own quirks.
- They don't shard like transformers do. Text transformers tensor-parallelize cleanly across GPUs because they are a uniform stack of identical layers; diffusion models have heterogeneous graphs, skip connections, and a serial denoising loop, so the multi-GPU story is about component offloading, not clean splitting.
That last point is a deep topic on its own — see splitting LLM models across GPUs for why transformers shard for free and diffusion models don't. The takeaway here: treat image/video GGUF as a related-but-separate ecosystem, and read the specific model card before assuming a quant level is safe.
Conclusion
Formats are a runtime decision, not a quality decision. Pick your runtime and hardware first, and the format follows: GGUF Q4_K_M for desktop apps and CPU/consumer-GPU inference, MLX on a capable Apple Silicon Mac, AWQ/GPTQ for vLLM serving on NVIDIA, EXL2 to fill a fixed VRAM budget exactly, and safetensors when you need full precision or are fine-tuning. When in doubt, GGUF Q4_K_M is the answer that runs almost everywhere with almost no quality loss.
Before you commit a download, sanity-check the fit with the LLM VRAM calculator and confirm your hardware with what LLM can I run — five seconds there saves you a 140 GB mistake.