Two very different goals: capacity vs throughput
Before you wire three Macs together with a switch, get clear on which problem you actually have. There are two, and clustering only solves one of them.
Capacity is "the model does not fit." DeepSeek V3 at 671B parameters needs hundreds of gigabytes even quantized (roughly 335 GB of weights at 4 bits per parameter), which is more memory than almost any single consumer machine offers. To run it at all you must pool memory across machines. That is clustering, and it splits one model across boxes.
Throughput is "I have plenty of capacity but too many requests." Each machine can already hold the model; you just want to serve more users per second, and you want one node failing not to take you down. That is request routing, and it sends whole requests to whole nodes.
| | Clustering (this article) | Request routing |
|---|---|---|
| Problem it solves | Model won't fit on one machine | Too many requests / need failover |
| Unit of work split | One model, across machines | One request, to one machine |
| Per-machine requirement | None can hold the model alone | Each must hold the whole model |
| Network role | On the critical path — every token | Only at request hand-off |
| Effect on latency | Adds latency per token | No added per-token latency |
| Effect on throughput | Usually lower than a single fast GPU | Scales ~linearly with nodes |
| Tools | exo, llama.cpp RPC, Petals | Load balancer, AI gateway |
Conflating the two is the most common mistake people make here. Clustering will not make your model faster, and routing will not let you run a model that doesn't fit. The rest of this article is about clustering. If your real problem is throughput or reliability, jump to the request-routing section or our guide to standing up an OpenAI-compatible endpoint for a local LLM.
How distributed inference works
A text transformer is a stack of identically structured decoder layers. That uniformity is exactly why it shards cleanly: you can cut the stack into contiguous ranges and put each range on a different machine. This is pipeline parallelism: machine A holds layers 1–40, machine B holds layers 41–80, and a token's activations flow A → B → (output) and, in a ring topology, back around.
Here is the part that determines everything about performance: that hand-off happens on every token. To generate one token, the input runs through layers 1–40 on machine A, which produces an activation tensor. That tensor must be serialized and sent across the network to machine B, which runs layers 41–80 and produces the next token. Then the cycle repeats for the token after that. The network sits squarely on the critical path.
Two consequences fall out of this. First, bandwidth and latency dominate, not GPU FLOPs. Single-stream decode is already memory-bandwidth-bound inside one machine (tok/s ≈ bandwidth ÷ active_bytes_per_token); add a LAN hop and you've inserted a far slower link into the loop. Second, the cluster runs at the speed of its weakest part — the slowest device and the slowest link set the pace, because every token waits on the full ring. Pooling an M4 Max with a Raspberry Pi does not average their speed; it drags toward the Pi.
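The rule of thumb above can be turned into a rough estimator. The following is a minimal sketch, not a benchmark: it assumes a ring in which every token visits each stage once and crosses one network hop per stage, it lumps serialization and transit into a single per-hop latency, and it ignores activation size. The `Stage` class, the function name, and all the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage: a machine holding a contiguous range of layers."""
    name: str
    mem_bandwidth_gb_s: float   # memory bandwidth in GB/s (assumed figure)
    active_gb_per_token: float  # weight data this stage reads per token, in GB


def tokens_per_second(stages: list[Stage], hop_latency_s: float) -> float:
    """Estimate single-stream decode rate for a ring of pipeline stages.

    Per-token time is the sum of every stage's read time plus one network
    hop per stage (the token has to travel all the way around the ring).
    A single machine has no hops.
    """
    read_s = sum(s.active_gb_per_token / s.mem_bandwidth_gb_s for s in stages)
    hops = len(stages) if len(stages) > 1 else 0
    return 1.0 / (read_s + hops * hop_latency_s)


single = [Stage("a", 400.0, 20.0)]
pair = [Stage("a", 400.0, 10.0), Stage("b", 400.0, 10.0)]
mixed = [Stage("a", 400.0, 10.0), Stage("slow", 50.0, 10.0)]

for label, stages in [("single", single), ("pair", pair), ("mixed", mixed)]:
    print(label, round(tokens_per_second(stages, hop_latency_s=0.001), 1))
```

With these made-up numbers, splitting the model across two equal machines is slightly slower than one machine (about 19.2 versus 20 tok/s), and swapping one of them for a device with an eighth of the memory bandwidth drops the estimate to about 4.4 tok/s. The slow stage dominates the sum, which is the "weakest part" effect in numeric form.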
What you gain is the one thing you came for: combined memory. Eight machines with 64 GB each give you ~512 GB of pooled memory, somewhat less in practice once each node's operating system and the KV cache take their share. That is enough to hold a model no single one of those boxes could, and that, not speed, is the whole point.
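A quick way to sanity-check the capacity argument is to compare approximate weight size against pooled memory. This is a back-of-the-envelope sketch: the function names are hypothetical, and the `usable_fraction` of 0.8 is an assumed allowance for the OS, KV cache, and buffers, not a measured value.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         node_mem_gb: list[float], usable_fraction: float = 0.8) -> bool:
    """True if the weights fit in the pooled usable memory of the nodes.

    usable_fraction is an assumption, not a measurement: it reserves part of
    each node's memory for the OS, KV cache, and runtime buffers.
    """
    pooled = sum(node_mem_gb) * usable_fraction
    return weights_gb(params_billion, bits_per_weight) <= pooled


print(weights_gb(671, 4))        # weights of a 671B model at 4 bits per weight
print(fits(671, 4, [64]))        # one 64 GB machine
print(fits(671, 4, [64] * 8))    # eight 64 GB machines pooled
```

Under these assumptions a 671B model at 4 bits per weight is about 335.5 GB, which does not fit on one 64 GB machine but does fit in eight of them pooled.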
The tools: exo, llama.cpp RPC, and friends
exo is the most ergonomic option for mixed hardware. It auto-discovers devices on the LAN, measures each one's memory and network, and partitions the model using a ring memory-weighted strategy: every device runs a number of layers proportional to its RAM, so a 128 GB Mac Studio carries more of the model than a 16 GB laptop. There is no master-worker hierarchy: devices are peers, and any connected device can serve requests. Its canonical demo is exactly the capacity story: DeepSeek V3 671B at 4-bit quantization across 8× M4 Pro Mac minis with 64 GB each, roughly 512 GB pooled. (At 8 bits the weights alone would be about 671 GB and would not fit in that pool.) One caveat worth checking before you build on it: the exo repository has been flagged as archived, so confirm the project's maintenance status first.
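The memory-weighted idea can be illustrated in a few lines. This is a sketch of the general technique (contiguous layer ranges sized by each device's share of total memory), not exo's actual code; the function name and the device sizes are hypothetical.

```python
def partition_layers(num_layers: int, mem_gb: list[float]) -> list[range]:
    """Split layers 0..num_layers-1 into contiguous ranges sized by memory share.

    Boundaries are placed at the rounded cumulative memory fraction, so the
    ranges never overlap and always cover every layer exactly once.
    """
    total = sum(mem_gb)
    bounds = [0]
    running = 0.0
    for m in mem_gb:
        running += m
        bounds.append(round(num_layers * running / total))
    bounds[-1] = num_layers  # guard against floating-point drift on the last cut
    return [range(bounds[i], bounds[i + 1]) for i in range(len(mem_gb))]


# Hypothetical ring: a 128 GB desktop, a 64 GB desktop, and a 16 GB laptop.
for device_mem, layers in zip([128, 64, 16], partition_layers(80, [128, 64, 16])):
    print(f"{device_mem} GB -> layers {layers.start}..{layers.stop - 1} ({len(layers)} layers)")
```

For an 80-layer model this assigns 49, 25, and 6 layers respectively. A very small device can end up with few or even zero layers, which is usually the right outcome given that every device in the ring is on the per-token critical path.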
llama.cpp RPC is the lower-level, more durable path. The engine ships an RPC backend: you run rpc-server on each remote host to expose its ggml devices, then point a head llama-server at them. It pipeline-parallelizes a single GGUF across N machines, including heterogeneous and modest hardware (Jetson boards, consumer desktops). It's less automatic than exo — you assign hosts yourself — but it's part of the mainline llama.cpp project that underpins Ollama, LM Studio, and Jan, so it tracks the ecosystem.
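# On each worker machine, expose its GPU/CPU over RPC.
# llama.cpp must be built with the RPC backend enabled (-DGGML_RPC=ON).
# llama.cpp documents the RPC server as insecure: bind 0.0.0.0 only on a trusted private LAN.
rpc-server --host 0.0.0.0 --port 50052
# On the head node, run the model split across the workers.
# -ngl 99 offloads the layers to the available devices, including the RPC workers.
llama-server -m deepseek-v3-Q8_0.gguf \
  --rpc 10.0.0.11:50052,10.0.0.12:50052 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080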
Note that this is different from llama.cpp's in-box multi-GPU --split-mode, where layer or tensor splitting happens across cards on a fast PCIe/NVLink bus inside one machine — covered in splitting LLM models across GPUs. RPC takes the same idea onto the network, with the network's much higher latency.
Petals-style approaches go one step further, pooling layers across the public internet with strangers' machines. That maximizes the latency problem and adds trust and privacy concerns, so for self-hosting it's a curiosity more than a recommendation — a private LAN cluster with exo or llama.cpp RPC is the practical choice.
When clustering helps — and when it doesn't
Clustering helps when:
- The model genuinely will not fit on any single machine you own (400B-plus models or 671B-class Mixture-of-Experts models like DeepSeek V3) and you have spare boxes to pool. Check the real requirement with the LLM VRAM calculator before assuming you need a cluster.
- The workload is batch or throughput-tolerant: overnight document processing, offline analysis, a low-traffic internal assistant where a few seconds to first token is fine.
- You already own the hardware. Pooling idle Macs costs nothing extra; the alternative (an 80 GB H100 or renting cloud capacity) has a real bill — model that with the self-hosted LLM cost calculator.
Clustering hurts when:
- You expect it to be faster. It won't be. A model that fits one GPU runs faster on that GPU than split over a network. Use what LLM can I run to confirm whether your single box already covers your target model — if it does, stop here.
- The use case is latency-sensitive: interactive chat, autocomplete, anything a human waits on token-by-token. The per-token network round-trip is felt directly.
- Your links are slow or your devices are wildly mismatched. A gigabit switch and similar machines cluster acceptably; Wi-Fi plus a phone and a NUC will crawl, because the ring waits on the worst link and the worst device every token.
The honest summary: clustering trades latency for capacity. You accept slower generation in exchange for the ability to run a model at all. If you don't need that trade, don't make it.
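The criteria above reduce to a crude decision rule based on memory alone. The sketch below is deliberately simplistic and ignores latency tolerance, link speed, and cost; the function name and return labels are hypothetical.

```python
def choose_strategy(model_gb: float, node_mem_gb: list[float]) -> str:
    """Pick an approach from memory figures alone (a deliberately crude rule).

    model_gb should already include headroom for the KV cache and runtime.
    """
    holders = sum(1 for m in node_mem_gb if m >= model_gb)
    if holders > 1:
        return "route"         # several nodes hold the whole model: route requests
    if holders == 1:
        return "single"        # one box covers it: no cluster needed
    if sum(node_mem_gb) >= model_gb:
        return "cluster"       # only the pooled memory is enough
    return "does-not-fit"      # even pooled, the hardware is too small


print(choose_strategy(40, [64, 64]))     # route
print(choose_strategy(40, [64, 16]))     # single
print(choose_strategy(336, [64] * 8))    # cluster
print(choose_strategy(336, [64] * 4))    # does-not-fit
```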
The simpler alternative for most people
Most people who reach for clustering don't actually have a capacity problem — they have a throughput-or-reliability problem, and clustering is the wrong tool for it. If each of your machines can already hold the model, don't split the model — route the requests.
Request routing keeps a whole copy of the model on each node and sends whole requests to whichever node is free. The network only touches the request boundary, never the per-token critical path, so there's no per-token latency tax. Add nodes and throughput scales roughly linearly; lose a node and the others keep serving. That's the opposite latency/throughput profile from clustering, and it's what you want for serving real traffic.
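To make the contrast with model-splitting concrete, here is a minimal in-memory sketch of request-level routing with failover. It illustrates the idea only: `Node`, `Router`, and the `send` callable are hypothetical names, and a real load balancer would add health probes, timeouts, and concurrency control.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A machine that holds a whole copy of the model."""
    name: str
    healthy: bool = True
    in_flight: int = 0  # requests currently being served


class Router:
    """Send each whole request to the least-busy healthy node."""

    def __init__(self, nodes: list[Node]):
        self.nodes = nodes

    def pick(self) -> Node:
        candidates = [n for n in self.nodes if n.healthy]
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        return min(candidates, key=lambda n: n.in_flight)

    def dispatch(self, send, request):
        """Try nodes until one succeeds; mark failing nodes unhealthy."""
        while True:
            node = self.pick()  # raises once every node has failed
            node.in_flight += 1
            try:
                return send(node, request)
            except ConnectionError:
                node.healthy = False  # fail over to the next candidate
            finally:
                node.in_flight -= 1
```

Here `send(node, request)` stands in for whatever actually forwards the request, for example an HTTP call to that node's OpenAI-compatible endpoint. The network is touched once per request, not once per token, which is why this design carries no per-token latency cost.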
This is where an AI gateway earns its keep. WideAreaAI is an edge-first LLM gateway: your apps get one OpenAI-compatible endpoint, and each request is routed through three tiers in order. The edge cache comes first, then your own llama.cpp GPU node (reached over a Cloudflare Tunnel, with no inbound ports), then cloud burst as failover when your hardware is saturated or down. The model is "own your baseline, burst to the cloud": edge-first, cloud when you choose, with a markup-free baseline and no per-token fees on the hardware you already own. Crucially, that's request-level routing and failover across whole nodes. It is not model-splitting or tensor parallelism, and it doesn't try to be. It solves the throughput-and-reliability problem that clustering doesn't, and leaves clustering to the capacity problem it does.
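The tiered order described here (cache, then your own node, then cloud) can be sketched generically. This is an illustration of the pattern under simple assumptions, not WideAreaAI's implementation: the function, the dictionary cache, and the `local_node` and `cloud` callables are all hypothetical.

```python
def answer(prompt: str, cache: dict, local_node, cloud):
    """Return (source, reply), trying the cheapest tier first.

    local_node and cloud are callables that take a prompt and return a reply.
    A failure of the local node triggers the cloud fallback.
    """
    if prompt in cache:
        return "cache", cache[prompt]
    try:
        source, reply = "local", local_node(prompt)
    except (ConnectionError, TimeoutError):
        source, reply = "cloud", cloud(prompt)
    cache[prompt] = reply  # later identical prompts are served from the cache
    return source, reply
```

Each tier handles a whole request. Nothing in this chain splits a model across machines, which is the distinction the paragraph above draws.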
Conclusion
Cluster for capacity; route for throughput. If the model won't fit on any one machine and you can tolerate slower, network-bound generation, exo or llama.cpp RPC will pool your hardware's memory and let you run something no single box could hold. If instead each machine already fits the model and you just need more requests per second or graceful failover, splitting the model over a LAN is a step backward — route whole requests across whole nodes (and burst to the cloud) instead. Decide by your actual constraint, not by how many machines happen to be sitting on the shelf.