A well-specced Mac is one of the best local-LLM inference machines you can buy, and the reason is one architectural decision: unified memory. On a PC, the GPU has its own VRAM, separate from system RAM, and that VRAM caps the model size you can load. On Apple Silicon, the CPU and GPU share a single pool, so every gigabyte of RAM is usable as model memory. A 128GB Mac Studio holds models that would need a rack of discrete GPUs to fit — quietly, on a desk, drawing less power than a single RTX 4090.
This post covers what unified memory actually buys you, how to translate "how much RAM does my Mac have" into "what model can I run," how MLX and GGUF differ on M-series chips, and what token rates to realistically expect.
Why Apple Silicon is good at this
Two properties make M-series chips strong inference boxes.
Unified memory. On a discrete-GPU PC, weights live in VRAM, and if a model is bigger than the card's VRAM you either offload layers to system RAM over the (much slower) PCIe bus or you simply can't run it. Apple Silicon has no such split: the GPU addresses the same physical memory the CPU does, with zero-copy access. A 64GB Mac exposes roughly 56GB of that to a model — no PCIe transfers, no layer offloading, no second GPU.
High memory bandwidth. LLM decode (generating one token at a time) is memory-bandwidth-bound, not compute-bound: for a dense model the engine reads essentially every weight from memory to produce each token. So tokens per second scales with bandwidth, not FLOPs. Apple's Max-tier chips pair their unified memory with wide buses — roughly 300–400 GB/s on the M3 Max and 410–546 GB/s on the M4 Max — far above an ordinary laptop and enough to make 8B-class models feel instant.
The honest framing: Macs win on capacity-per-dollar and capacity-per-watt, not on raw speed. A datacenter GPU still moves data 4–6x faster (see the chart below). What a Mac gives you is the ability to fit a large model at all, on one quiet machine.
The takeaway from the chart: an M4 Max moves about half the data per second of an RTX 4090 and roughly a sixth of an H100. But the 4090 only has 24GB and the M4 Max can be configured to 128GB — so the Mac runs the model the 4090 can't even load. Capacity beats bandwidth when the alternative is "doesn't fit."
Unified memory: rethinking the VRAM math
Because the OS and your apps share the same pool, you never get the full RAM figure for a model. Budget 6–10GB for macOS plus a browser and your editor, more if you keep heavy apps open. Apple also caps how much of the pool the GPU may wire down (raise it with sudo sysctl iogpu.wired_limit_mb if you need to push close to the ceiling, but leave the OS breathing room).
Weight memory follows a simple formula: weights ≈ params × bytes_per_weight. At the recommended Q4_K_M quantization that's roughly 0.55 bytes/weight (so an 8B model is ~4.9GB, a 70B is ~40GB); Q8_0 is ~1.06 bytes/weight; full FP16 is 2.0. Add KV cache and a gigabyte or two of runtime overhead on top. For the full picture on any specific model use the LLM VRAM calculator, or let the what-LLM-can-I-run tool detect your hardware and list what fits.
Here's how Mac RAM tiers map to comfortable model sizes:
| Mac RAM | Usable for model | Largest comfortable model (Q4_K_M) | Notes |
|---|---|---|---|
| 8 GB | ~3–4 GB | 3B class | Tight; small models only, short context |
| 16 GB | ~10 GB | 7–8B Q4 (~5 GB) | Entry tier; below Ollama's MLX threshold |
| 24 GB | ~16 GB | 13–14B Q4 | Comfortable mid-size + headroom |
| 32 GB | ~24 GB | 30B-class Q4, or 8B at Q8 | Ollama MLX backend unlocks here (32GB+) |
| 48 GB | ~40 GB | 34B Q8, or 70B at Q3 | Squeezes a small-quant 70B |
| 64 GB | ~56 GB | 70B Q4 (~40 GB) + context | The sweet spot for 70B work |
| 128 GB | ~120 GB | 70B Q8 (~74 GB), or large MoE | Runs nearly anything on one box |
Remember KV cache grows with context length and eats into that budget — long-context sessions on a 70B can add tens of gigabytes. See how much VRAM you need to run an LLM for the KV-cache math.
MLX vs GGUF on a Mac
GGUF is the de-facto local-model container, originated by the llama.cpp project. It runs everywhere — Mac (via Metal), NVIDIA, CPU — and has the widest model selection by far. MLX is Apple's open-source array framework, built around unified memory with zero-copy CPU/GPU access; mlx-lm is the LLM library on top, and models ship in an MLX-quantized, safetensors-based format via the mlx-community org on Hugging Face. MLX is now the high-performance backend underneath both LM Studio and recent Ollama on Macs.
| Dimension | MLX (mlx-lm / mlx-community) | GGUF (llama.cpp / Metal) |
|---|---|---|
| Maintainer | Apple | ggml-org community |
| Platform | Apple Silicon only | Cross-platform (CUDA, Metal, CPU, ROCm) |
| Speed on M-series | Fastest path; ~30–50% over Metal in LM Studio, ~2x in Ollama | Baseline Metal path |
| Format | MLX-quantized safetensors | Single-file GGUF container |
| Quant options | MLX 4/8-bit, NVFP4 (newer) | Q2–Q8 K-quants, I-quants |
| Model availability | mlx-community org (growing) | Vast — the de-facto standard |
| Portability | Locked to the Mac | Runs anywhere unchanged |
The trade-off is simple: MLX for speed on a Mac, GGUF for portability and breadth. If you live on one Mac, prefer MLX builds when they exist. If you move models between machines or want the newest release the day it drops, GGUF usually gets there first.
The fastest way to run LLMs on M-series
Three practical options, roughly in order of ease:
- LM Studio — a free desktop GUI for discovering, downloading, and running models. It uses MLX as the default engine whenever an MLX build exists, falling back to GGUF/llama.cpp otherwise, with its MLX path reported around 30–50% faster than Metal. It exposes an OpenAI-compatible server on port 1234 (
/v1/chat/completions, etc.), and v0.4.1 added an Anthropic-compatible/v1/messagesendpoint that works with Claude Code. Best for: most people who want a polished, click-to-run experience. - Ollama — the CLI/REST-API workhorse on port 11434, with OpenAI-compatible endpoints under
/v1. Version 0.19 (preview, around March 31 2026) introduced an optional MLX backend that replaces the llama.cpp Metal path on Apple Silicon, reported at roughly 2x the old speed, with a later optimization pass (kernel fusion via MLX JIT, reworked GPU sampling) adding up to ~20% more. Important caveat: the MLX backend targets Macs with 32GB or more of unified memory — on a 16GB machine you stay on the GGUF/Metal path. Also remember Ollama's default context window is just 2048 tokens; setOLLAMA_CONTEXT_LENGTH=8192(or higher) for real coding and long-document work. The Ollama command builder generates the rightrun/pullinvocations and env vars. - mlx-lm (CLI) — Apple's own command-line tool for
load,generate,quantize, and LoRA fine-tuning. No GUI, maximum control. Best for scripting, batch jobs, and converting/quantizing your own MLX builds.
On the newest M5-series chips, the MLX engine also taps the GPU's Neural Accelerators for faster time-to-first-token and higher throughput, and recently gained NVFP4 support.
Performance: what tokens/sec to expect
Because decode is bandwidth-bound, you can estimate the ceiling as tok/s ≈ bandwidth ÷ bytes_read_per_token, where bytes-read is roughly the model's weight size at your quant. Real-world output lands at about 50–80% of that ceiling. The LLM inference-speed calculator does this for any chip and model; the figures below are order-of-magnitude single-stream (batch=1) Q4 estimates synthesized from the bandwidth model and community benchmarks.
| Chip | Unified bandwidth | 8B Q4 (~5 GB) | 70B Q4 (~40 GB) |
|---|---|---|---|
| Apple M3 Max | ~300–400 GB/s | ~30–60 tok/s | ~6–10 tok/s |
| Apple M4 Max | ~410–546 GB/s | ~40–80 tok/s | ~8–13 tok/s |
A few practical reads on these numbers:
- An 8B model feels instant on any Max-tier chip — 40+ tok/s is faster than you read. This is the size most people should run for interactive chat and coding assistance.
- A 70B model is usable but deliberate — 8–13 tok/s on an M4 Max is fine for careful, considered output but slow for rapid-fire iteration. The model fits, which is the whole point, but you trade speed for capacity.
- Longer context is slower. As the KV cache grows, the engine spends more time on attention, so a long session decodes slower than these short-prompt figures suggest.
- Base and Pro chips are slower, with narrower memory buses than the Max tiers above — scale these numbers down proportionally to your chip's bandwidth.
Limitations
Be clear-eyed about what a Mac gives up:
- No CUDA ecosystem. The vast majority of training and serving tooling — bitsandbytes, FlashAttention kernels, vLLM's PagedAttention serving, DeepSpeed — is built for NVIDIA. On a Mac you live in the MLX and llama.cpp worlds, which are excellent for inference but a smaller universe.
- Format and tooling lag. A brand-new model usually ships as a GGUF (or raw safetensors) first; the MLX conversion may land days later, so the fastest path isn't always available on day one.
- Training is limited.
mlx-lmsupports LoRA fine-tuning of smaller models, which is genuinely useful for adapters and experiments, but full fine-tunes and large-scale training still belong on NVIDIA hardware. - Bandwidth caps throughput. As the chart showed, even an M4 Max is well under a single datacenter GPU's bandwidth. For high-concurrency serving (many simultaneous users), a Mac is the wrong tool — that's a vLLM-on-GPU job.
Conclusion
For single-user local inference, a well-specced Mac is hard to beat: unified memory lets it hold models that would otherwise demand multiple discrete GPUs, on one quiet, power-sipping machine. Choose your RAM tier by the model you want to run — 32GB to get serious (and to unlock Ollama's MLX backend), 64GB for comfortable 70B work, 128GB to run almost anything. Reach for MLX when you want maximum speed on a Mac you control, and GGUF when you need portability or the newest model the moment it drops. LM Studio is the easiest on-ramp; Ollama and mlx-lm give you scripting and API control. Run the numbers for your exact hardware with the what-LLM-can-I-run and inference-speed tools before you buy.