What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)

GGUF, safetensors, MLX, GPTQ, AWQ, EXL2 — what each model format is, which runtime uses which, and how to pick the right file to download for your hardware.

By InventiveHQ Team

Why model formats matter

When you download a model from Hugging Face you are not downloading "a model" — you are downloading a specific file format that encodes the weights in a specific way. That format decides three things before you ever run a prompt: which runtime can open the file, how much memory the weights consume, and how fast inference goes on your hardware. Grab a 140 GB FP16 safetensors repo when you meant to pull a 4.9 GB GGUF and you have just wasted an afternoon and your disk. Picking the wrong format is the single most common beginner mistake in local AI.

The good news: there are only a handful of formats that matter, and once you know which runtime you intend to use, the choice is nearly automatic. This guide covers GGUF, safetensors, MLX, GPTQ, AWQ, and EXL2 — what each one is, who loads it, and how to pick. If you want to know how much memory a given file needs before you download it, run the model through the LLM VRAM calculator; if you want a one-click read on what your current GPU can handle, try what LLM can I run.

The one table that answers most questions

Here is the whole landscape on one screen. Everything below is elaboration.

| Format | What it is | Primary runtime | Quantized? | Best for |
| --- | --- | --- | --- | --- |
| GGUF | Single-file container (weights + tokenizer + metadata), memory-mapped | llama.cpp, Ollama, LM Studio, Jan, GPT4All | Yes (Q2–Q8 K-quants, IQ-quants) | CPU + consumer-GPU inference; the default for desktops and Macs |
| safetensors | Hugging Face full-precision tensor container (no pickle, safe to load) | transformers, vLLM, Diffusers, MLX (as source) | No (usually FP16/BF16) | The high-precision source; GPU serving at full precision; fine-tuning |
| MLX | Apple's safetensors-based format with MLX quantization | mlx-lm, Ollama 0.19+, LM Studio (on Macs) | Yes (MLX 4/8-bit) and FP16 | Apple Silicon: fastest path on M-series, unified memory |
| GPTQ | Post-training 4-bit quant (layer-wise error minimization), in safetensors | vLLM, transformers, TGI, ExLlama | Yes (typically 4-bit) | NVIDIA throughput serving where a GPTQ build exists |
| AWQ | Activation-aware 4-bit quant (protects salient channels), in safetensors | vLLM, transformers, TGI | Yes (typically 4-bit) | NVIDIA throughput serving; often better quality than GPTQ at 4-bit |
| EXL2 | ExLlamaV2's variable-bitrate quant (mix bpw per layer) | ExLlamaV2 / TabbyAPI | Yes (2.0–8.0 bpw, fractional) | Squeezing max quality into a fixed NVIDIA VRAM budget |

Two patterns fall out of this table immediately. First, GGUF and MLX are the "consumer" formats: you point a desktop app at one file and it runs. Second, safetensors, GPTQ, AWQ, and EXL2 are the "server" formats: they live in multi-file Hugging Face repos and are loaded by GPU engines that care about concurrent throughput. The rest of this article walks through each one.

GGUF — the consumer local-AI standard

GGUF ("GPT-Generated Unified Format") was created by the llama.cpp project as the successor to the older GGML format, and it has become the de-facto container for quantized local models. Its design is what makes it so convenient:

  • One file. Weights, the tokenizer, and all metadata (architecture, context length, chat template) live in a single .gguf. There is no separate config, no tokenizer download, nothing to assemble.
  • Memory-mapped. The runtime mmaps the file, so the OS pages weights in on demand and loading feels instant — the model isn't fully read into RAM up front.
  • Quantized by default. GGUF files almost always ship pre-quantized, which is why an 8B model that is 16 GB in FP16 shows up as a ~4.9 GB Q4_K_M.gguf.

That last point is where the cryptic filenames come from. A file like Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf encodes the quantization scheme in the suffix:

  • Q<N> is the nominal bits per weight — Q4 is 4-bit, Q8 is 8-bit.
  • _K marks a "K-quant": llama.cpp's super-block scheme (256-weight blocks with hierarchical scales) that mixes precision across tensor types for far better quality-per-byte than the legacy Q4_0/Q4_1 quants.
  • _S / _M / _L is the tier — Small, Medium, Large — meaning how many sensitive tensors get bumped to higher precision. _M is the sweet spot.
  • IQ<N> quants (IQ2/IQ3/IQ4) use an importance matrix and non-linear codebooks for better quality at very low bit counts, at the cost of slower decode.
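The suffix rules above are regular enough to parse mechanically. As a rough sketch (the helper name `parse_quant` and its return shape are illustrative, not part of llama.cpp or any runtime), a small regex can split a filename's quant tag into the pieces just described:

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; real repos may use naming
# variants this pattern does not cover.
QUANT_RE = re.compile(
    r"(?P<family>IQ|Q)(?P<bits>\d+)"   # Q<N> or IQ<N>: nominal bits per weight
    r"(?:_(?P<k>K))?"                  # optional _K marker for K-quants
    r"(?:_(?P<tier>[A-Z0-9]+))?"       # optional tier/variant: S, M, L, XS, 0, 1, ...
    r"(?=\.gguf$)",                    # the tag must sit right before ".gguf"
    re.IGNORECASE,
)


def parse_quant(filename: str) -> Optional[dict]:
    """Return the quant tag parts of a .gguf filename, or None if absent."""
    m = QUANT_RE.search(filename)
    if m is None:
        return None
    tier = m.group("tier")
    return {
        "family": m.group("family").upper(),
        "bits": int(m.group("bits")),
        "k_quant": m.group("k") is not None,
        "tier": tier.upper() if tier else None,
    }


print(parse_quant("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
# {'family': 'Q', 'bits': 4, 'k_quant': True, 'tier': 'M'}
```

For legacy quants such as Q4_0, the trailing `0` lands in the `tier` field; it is a variant number rather than a Small/Medium/Large tier.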

Q4_K_M is the default you want for most models on most hardware: roughly 0.6 effective bytes per weight (a ~4.9 GB file for an 8B model) and a perplexity rise of only a few percent versus FP16 (Llama-3.1-8B measures ppl ~7.56 at Q4_K_M vs ~7.32 at FP16, +3.3%). Quality stays close to lossless down to ~4-bit and then falls off a cliff below it. We go deep on every suffix, the quality cliff, and the IQ-quants in LLM quantization explained; read that if you want to understand the trade-offs rather than just copy the default.
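The two headline numbers here are simple ratios, and it is worth being able to recompute them for any file you are considering. A minimal sketch using this article's own figures (a ~4.9 GB Q4_K_M file for an 8B model, perplexity 7.56 vs 7.32); the function names are illustrative:

```python
def bytes_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Effective bytes per weight implied by a file size (1 GB = 1e9 bytes)."""
    return file_size_gb / params_billion


def perplexity_rise_pct(quant_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase of a quantized model over its baseline."""
    return (quant_ppl / baseline_ppl - 1.0) * 100.0


print(round(bytes_per_weight(4.9, 8.0), 2))        # 0.61
print(round(bytes_per_weight(16.0, 8.0), 2))       # 2.0 (FP16)
print(round(perplexity_rise_pct(7.56, 7.32), 1))   # 3.3
```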

Who loads GGUF: Ollama, LM Studio, Jan, and GPT4All all sit on top of llama.cpp (or ship it as a backend), so anything GGUF "just works" in the popular desktop tools. If you are choosing between those runtimes, see Ollama vs LM Studio vs llama.cpp.
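If you want to confirm that a download really is a GGUF container before pointing a runtime at it, the fixed-size header is enough. The sketch below assumes the GGUF version 2+ layout as documented in the ggml project (4-byte magic `GGUF`, a little-endian uint32 version, then uint64 tensor and metadata key/value counts); the function name is illustrative, and version 1 files, which used narrower counts, are not handled:

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(path: str) -> dict:
    """Read only the fixed GGUF header; no weights are loaded."""
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<" = little-endian with no padding: uint32, uint64, uint64
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", head[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

Calling `read_gguf_header` on a local `.gguf` path returns the three counts; a safetensors or pickle file raises `ValueError` instead.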

safetensors — the Hugging Face standard

safetensors is the format almost every model is first published in. It is a simple, secure container for raw tensors: a small JSON header describing each tensor's name, dtype, and shape, followed by the contiguous tensor bytes. "Safe" is literal — it replaced PyTorch's .bin/pickle files, which could execute arbitrary code on load. safetensors cannot; it only carries data.
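Because the header is plain JSON, you can inspect a safetensors file with nothing but the standard library. This sketch assumes the published layout (an 8-byte little-endian length, then that many bytes of UTF-8 JSON, then the tensor data); the function name is illustrative, and it deliberately reads metadata only:

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # uint64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return header
```

Nothing here deserializes objects or runs code from the file, which is the property that separates this format from pickle-based checkpoints.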

Two things to know. First, it is usually full precision — a safetensors repo typically holds FP16/BF16 weights at 2.0 bytes per weight, so a 70B model is ~140 GB across several sharded files plus config.json and tokenizer files. It is not quantized unless the repo name says GPTQ/AWQ. Second, it is the source of truth: almost every GGUF, MLX, GPTQ, AWQ, and EXL2 file you download was converted from an original safetensors release, which lands first when a new model drops.

safetensors is loaded natively by Hugging Face transformers and by vLLM (FP16/BF16), and it is the input to fine-tuning. If you are serving at full precision or training, safetensors is your format. For everything memory-constrained, you convert down to a quantized format.

MLX — Apple Silicon native

MLX is Apple's open-source array framework for Apple Silicon, built around unified memory so the CPU and GPU share one zero-copy address space. The companion library mlx-lm loads, generates, fine-tunes, and quantizes models. The MLX format is safetensors-based but carries MLX-specific quantization (4-bit and 8-bit), distributed through the mlx-community org on Hugging Face.

Why it matters on a Mac: MLX is now the high-performance backend underneath the popular Mac apps. Ollama 0.19 (preview, ~March 2026) added an optional MLX backend that replaces the llama.cpp Metal path on Apple Silicon Macs with 32 GB+ of unified memory, reporting roughly 2× the inference speed of the prior Metal path. LM Studio uses MLX as its default engine when an MLX build of a model exists, reporting ~30–50% faster than llama.cpp/Metal. On the newest M5-series chips, MLX taps the GPU Neural Accelerators for faster time-to-first-token.

The practical rule: on a capable Mac, prefer an MLX build when one is available; fall back to GGUF when it isn't (GGUF still runs everywhere). We cover unified-memory sizing, which quant to pick per Mac, and real throughput in running LLMs on Apple Silicon with MLX.

GPTQ, AWQ, EXL2 — GPU-side quantization

These three are the "server" quantization formats. They all produce safetensors-style files, all target NVIDIA GPUs, and all exist because GGUF support in GPU serving engines is experimental and under-optimized (vLLM's own docs recommend llama.cpp instead if you only have GGUF). When you are running a high-throughput, multi-user endpoint, you reach for one of these.

  • GPTQ quantizes the model one layer at a time, using second-order information to choose 4-bit values that minimize reconstruction error. It is the long-standing 4-bit standard for GPU inference and is loaded natively by vLLM, transformers, and TGI.
  • AWQ (Activation-aware Weight Quantization) identifies the small fraction of weight channels that matter most to activations and protects them, quantizing the rest to 4-bit. It frequently edges out GPTQ on quality at the same bitrate and is likewise first-class in vLLM.
  • EXL2 is the ExLlamaV2 format, and its trick is variable bitrate: you can target an average bits-per-weight (say 4.65 bpw) and the converter mixes per-layer precision to hit it. That lets you fill a fixed VRAM budget exactly — handy when you have, say, a 24 GB card and want the best possible quality that fits. EXL2 runs under ExLlamaV2/TabbyAPI rather than vLLM.
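The "fill a fixed VRAM budget" idea behind EXL2 reduces to one division. A rough sketch, where the 34B model size and the 4 GB reserved for KV cache and runtime overhead are hypothetical inputs, and the 2.0–8.0 bpw bounds are the range quoted in the table earlier:

```python
from typing import Optional

EXL2_MIN_BPW = 2.0
EXL2_MAX_BPW = 8.0


def pick_bpw(vram_gb: float, params_billion: float, reserve_gb: float) -> Optional[float]:
    """Largest average bits-per-weight whose weights fit in (vram_gb - reserve_gb).

    Returns None when even the lowest bitrate would not fit.
    """
    usable_bits = (vram_gb - reserve_gb) * 1e9 * 8
    bpw = usable_bits / (params_billion * 1e9)
    if bpw < EXL2_MIN_BPW:
        return None
    return min(bpw, EXL2_MAX_BPW)


# Hypothetical: 24 GB card, 34B-parameter model, 4 GB held back for cache/overhead
print(round(pick_bpw(24, 34, 4), 2))  # 4.71
```

In practice you would round down to a bitrate for which a published build exists, or convert one yourself at that target.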

Why choose among these rather than GGUF on a server? Engines like vLLM lean on continuous batching and PagedAttention to serve many concurrent requests, and the GPTQ/AWQ kernels are written to feed those paths. GGUF's strengths (single file, CPU offload, memory-mapping) are single-user conveniences that don't help a batched GPU server.

A decision table: which format do I download?

Match your runtime and hardware to a format and the choice is made for you:

| Your setup | Download this | Notes |
| --- | --- | --- |
| Ollama / LM Studio / Jan on Windows or Linux (any GPU or CPU) | GGUF Q4_K_M | The universal default; bump to Q5_K_M/Q6_K if it fits |
| Mac (Apple Silicon, 32 GB+ unified) | MLX 4-bit if available, else GGUF | MLX is faster on M-series; GGUF always works |
| vLLM / TGI production endpoint on NVIDIA | AWQ or GPTQ 4-bit | Native, batched, high-throughput; AWQ often higher quality |
| ExLlamaV2 / TabbyAPI on a fixed-VRAM NVIDIA card | EXL2 at the bpw that fits | Variable bitrate fills your VRAM budget precisely |
| Full-precision serving or fine-tuning | safetensors (FP16/BF16) | The unquantized source; also what you convert from |
| Tight VRAM, need sub-4-bit | GGUF IQ2/IQ3 or low-bpw EXL2 | Better quality than legacy quants at the same bits |
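The same table can be read as a short decision procedure. The function below is only an illustrative encoding of those rows; the name, parameters, and runtime strings are made up for this sketch, and it treats Ollama and LM Studio as the MLX-capable desktop apps because those are the ones this article names:

```python
MLX_CAPABLE = {"ollama", "lm studio"}
GGUF_RUNTIMES = {"llama.cpp", "ollama", "lm studio", "jan", "gpt4all"}


def pick_format(
    runtime: str,
    *,
    apple_silicon: bool = False,
    unified_memory_gb: int = 0,
    mlx_build_available: bool = False,
    need_sub_4bit: bool = False,
) -> str:
    """Map a runtime and hardware description to the format rows in the table."""
    rt = runtime.strip().lower()
    if rt in {"vllm", "tgi"}:
        return "AWQ or GPTQ 4-bit"
    if rt in {"exllamav2", "tabbyapi"}:
        return "EXL2 at the bpw that fits"
    if rt in {"full-precision", "fine-tuning"}:
        return "safetensors (FP16/BF16)"
    if rt in MLX_CAPABLE and apple_silicon and unified_memory_gb >= 32 and mlx_build_available:
        return "MLX 4-bit"
    if rt in GGUF_RUNTIMES:
        return "GGUF IQ2/IQ3" if need_sub_4bit else "GGUF Q4_K_M"
    raise ValueError(f"no rule for runtime {runtime!r}")


print(pick_format("Ollama"))  # GGUF Q4_K_M
print(pick_format("LM Studio", apple_silicon=True, unified_memory_gb=64,
                  mlx_build_available=True))  # MLX 4-bit
```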

If you are unsure whether a given file will fit, the LLM VRAM calculator computes weights + KV cache + overhead for any model and quant, and the LLM inference speed calculator estimates tokens/sec from your GPU's memory bandwidth so you know what performance to expect before you download. The chart below shows why quant choice dominates the fit question:

[Chart: Weight-only size of an 8B model by format/quant (GB). Q2_K: 3.3; Q4_K_M: 4.9; Q6_K: 6.6; Q8_0: 8.5; FP16: 16.0]

The FP16 safetensors source is 16 GB; the same model in Q4_K_M GGUF is under 5 GB with near-identical output quality. That ~3.3× shrink — at almost no quality cost — is the entire reason quantized formats exist.
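The "weights + KV cache + overhead" fit check mentioned above can be approximated by hand. This is a back-of-envelope sketch, not how either calculator is implemented. Its assumptions: a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, a flat 1 GB of overhead, a hypothetical 8 GB card, and a hypothetical 1000 GB/s of memory bandwidth. The speed figure is the usual rule-of-thumb ceiling (each decoded token reads every weight once), so real throughput will be lower.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (1e9 bytes); the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9


def fits(weights_gb: float, kv_gb: float, overhead_gb: float, vram_gb: float) -> bool:
    """True when weights, KV cache, and overhead together fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return bandwidth_gb_s / weights_gb


kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(round(kv, 2))                                      # 1.07
print(fits(4.9, kv, 1.0, vram_gb=8.0))                   # True  (Q4_K_M)
print(fits(16.0, kv, 1.0, vram_gb=8.0))                  # False (FP16)
print(round(decode_tokens_per_sec_ceiling(1000, 4.9)))   # 204
```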

Image and video models use formats too

GGUF is no longer text-only. The diffusion community, led by ComfyUI, now ships GGUF builds of image and video models — Wan (video) and Qwen-Image among them — primarily to cut VRAM so a large model fits on a smaller card. If you have seen wan2.x-Q4_K_M.gguf floating around, that is the same container concept applied to a diffusion transformer.

But do not assume text-GGUF intuition transfers. These are not decoder-only transformers; they are U-Net or DiT diffusion models wrapped in VAEs and text encoders, and that changes the rules:

  • The quant trade-offs differ. Diffusion quality degrades differently from language-model perplexity, and which tensors tolerate low bits is model-specific.
  • Runtime support is narrower. GGUF diffusion runs through ComfyUI's GGUF nodes, not llama.cpp — a separate code path with its own quirks.
  • They don't shard like text transformers do. Text transformers tensor-parallelize cleanly across GPUs because they are a uniform stack of identical layers; diffusion models have heterogeneous graphs, skip connections, and a serial denoising loop, so the multi-GPU story is about component offloading, not clean splitting.

That last point is a deep topic on its own; see splitting LLM models across GPUs for why text transformers shard for free and diffusion models don't. The takeaway here: treat image/video GGUF as a related-but-separate ecosystem, and read the specific model card before assuming a quant level is safe.

Conclusion

Formats are a runtime decision, not a quality decision. Pick your runtime and hardware first, and the format follows: GGUF Q4_K_M for desktop apps and CPU/consumer-GPU inference, MLX on a capable Apple Silicon Mac, AWQ/GPTQ for vLLM serving on NVIDIA, EXL2 to fill a fixed VRAM budget exactly, and safetensors when you need full precision or are fine-tuning. When in doubt, GGUF Q4_K_M is the answer that runs almost everywhere with almost no quality loss.

Before you commit a download, sanity-check the fit with the LLM VRAM calculator and confirm your hardware with what LLM can I run — five seconds there saves you a 140 GB mistake.

Frequently Asked Questions

Find answers to common questions

GGUF is the single-file model format created by the llama.cpp project (it replaced the older GGML format). One file holds the quantized weights, the tokenizer, and all the metadata a runtime needs, and it is memory-mapped so loading is near-instant. It is the de-facto standard for consumer local-AI tools — Ollama, LM Studio, Jan, and GPT4All all load GGUF.

safetensors is Hugging Face's container for full-precision (FP16/BF16) weights, loaded by GPU-serving engines like vLLM and the transformers library. GGUF packs quantized weights plus the tokenizer into one memory-mapped file for CPU and consumer-GPU inference via llama.cpp. They do different jobs: safetensors is the high-precision source most GGUFs are converted from; GGUF is the compressed, ready-to-run artifact.

MLX is Apple's array framework and model format, built around unified memory and tuned for Apple Silicon GPUs; on M-series it is often 30-50% faster than the llama.cpp Metal path. GGUF runs everywhere via llama.cpp and is more portable. On a Mac with 32 GB+ of unified memory, prefer an MLX build when one exists; otherwise GGUF is fine. See our Apple Silicon guide for the full picture.

GPTQ and AWQ are both GPU-side, post-training quantization methods that produce safetensors files for NVIDIA serving engines. GPTQ quantizes layer-by-layer to minimize error; AWQ (Activation-aware Weight Quantization) protects the most salient weight channels. vLLM loads both natively at 4-bit. They target high-throughput concurrent serving rather than the single-user convenience GGUF is built for.

Yes, image and video models use GGUF too. Projects like ComfyUI ship GGUF builds of diffusion and video models such as Wan and Qwen-Image, using GGUF mainly to shrink VRAM footprint so big models fit on smaller cards. But diffusion architectures (U-Net/DiT) are not decoder-only language models, so quant trade-offs and runtime support differ from text models. Treat image GGUF as a separate ecosystem from llama.cpp text GGUF.
