Skip to main content
Home/Blog/Clustering Machines for Local AI: Running Big Models Across Your Network
Artificial Intelligence

Clustering Machines for Local AI: Running Big Models Across Your Network

When no single machine can hold the model — or you just have spare hardware lying around — you can cluster. Here's how distributed inference works with tools like exo and llama.cpp RPC, and where it helps versus where it doesn't.

By InventiveHQ Team

Two very different goals: capacity vs throughput

Before you wire three Macs together with a switch, get clear on which problem you actually have. There are two, and clustering only solves one of them.

Capacity is "the model does not fit." DeepSeek V3 at 671B parameters needs hundreds of gigabytes even quantized; no single consumer machine has that much memory. To run it at all you must pool memory across machines — that is clustering, and it splits one model across boxes.

Throughput is "I have plenty of capacity but too many requests." Each machine can already hold the model; you just want to serve more users per second, and you want one node failing not to take you down. That is request routing, and it sends whole requests to whole nodes.

Clustering (this article)Request routing
Problem it solvesModel won't fit on one machineToo many requests / need failover
Unit of work splitOne model, across machinesOne request, to one machine
Per-machine requirementNone can hold the model aloneEach must hold the whole model
Network roleOn the critical path — every tokenOnly at request hand-off
Effect on latencyAdds latency per tokenNo added per-token latency
Effect on throughputUsually lower than single fast GPUScales ~linearly with nodes
Toolsexo, llama.cpp RPC, PetalsLoad balancer, AI gateway

Conflating the two is the most common mistake people make here. Clustering will not make your model faster, and routing will not let you run a model that doesn't fit. The rest of this article is about clustering. If your real problem is throughput or reliability, jump to the request-routing section or our guide to standing up an OpenAI-compatible endpoint for a local LLM.

How distributed inference works

A text transformer is a stack of identical decoder layers. That uniformity is exactly why it shards cleanly: you can cut the stack into contiguous ranges and put each range on a different machine. This is pipeline parallelism — machine A holds layers 1–40, machine B holds layers 41–80, and a token's activations flow A → B → (output) and, in a ring topology, back around.

Here is the part that determines everything about performance: that hand-off happens on every token. To generate one token, the input runs through layer 1 on machine A, produces an activation tensor, which must be serialized and sent across the network to machine B, which runs layers 41–80 and produces the next token. Then the cycle repeats. The network sits squarely on the critical path.

Pipeline-parallel inference: activations cross the network between layer ranges on every token Machine A layers 1–40 Machine B layers 41–80 activations → ← ring back (next token) This round-trip repeats for every single generated token.

Two consequences fall out of this. First, bandwidth and latency dominate, not GPU FLOPs. Single-stream decode is already memory-bandwidth-bound inside one machine (tok/s ≈ bandwidth ÷ active_bytes_per_token); add a LAN hop and you've inserted a far slower link into the loop. Second, the cluster runs at the speed of its weakest part — the slowest device and the slowest link set the pace, because every token waits on the full ring. Pooling an M4 Max with a Raspberry Pi does not average their speed; it drags toward the Pi.

What you gain is the one thing you came for: combined memory. Eight machines with 64 GB each give you ~512 GB of effective model space. That is enough to hold a model no single box could — and that, not speed, is the whole point.

The tools: exo, llama.cpp RPC, and friends

exo is the most ergonomic option for mixed hardware. It auto-discovers devices on the LAN, measures each one's memory and network, and partitions the model using a ring memory-weighted strategy: every device runs a number of layers proportional to its RAM, so a 128 GB Mac Studio carries more of the model than a 16 GB laptop. There is no master-worker hierarchy — devices are peers, and any connected device can serve requests. Its canonical demo is exactly the capacity story: DeepSeek V3 671B at 8-bit across 8× M4 Mac minis, roughly 512 GB pooled. One caveat worth checking before you build on it: the exo repository has been flagged as archived, so confirm the project's maintenance status first.

llama.cpp RPC is the lower-level, more durable path. The engine ships an RPC backend: you run rpc-server on each remote host to expose its ggml devices, then point a head llama-server at them. It pipeline-parallelizes a single GGUF across N machines, including heterogeneous and modest hardware (Jetson boards, consumer desktops). It's less automatic than exo — you assign hosts yourself — but it's part of the mainline llama.cpp project that underpins Ollama, LM Studio, and Jan, so it tracks the ecosystem.

# On each worker machine, expose its GPU/CPU over RPC:
rpc-server --host 0.0.0.0 --port 50052

# On the head node, run the model split across the workers:
llama-server -m deepseek-v3-Q8_0.gguf \
  --rpc 10.0.0.11:50052,10.0.0.12:50052 \
  --host 0.0.0.0 --port 8080

Note that this is different from llama.cpp's in-box multi-GPU --split-mode, where layer or tensor splitting happens across cards on a fast PCIe/NVLink bus inside one machine — covered in splitting LLM models across GPUs. RPC takes the same idea onto the network, with the network's much higher latency.

Petals-style approaches go one step further, pooling layers across the public internet with strangers' machines. That maximizes the latency problem and adds trust and privacy concerns, so for self-hosting it's a curiosity more than a recommendation — a private LAN cluster with exo or llama.cpp RPC is the practical choice.

When clustering helps — and when it doesn't

Clustering helps when:

  • The model genuinely will not fit any single machine you own — 400B-plus models or 671B-class Mixture-of-Experts models like DeepSeek-V3 — and you have spare boxes to pool. Check the real requirement with the LLM VRAM calculator before assuming you need a cluster.
  • The workload is batch or throughput-tolerant: overnight document processing, offline analysis, a low-traffic internal assistant where a few seconds to first token is fine.
  • You already own the hardware. Pooling idle Macs costs nothing extra; the alternative (an 80 GB H100 or renting cloud capacity) has a real bill — model that with the self-hosted LLM cost calculator.

Clustering hurts when:

  • You expect it to be faster. It won't be. A model that fits one GPU runs faster on that GPU than split over a network. Use what LLM can I run to confirm whether your single box already covers your target model — if it does, stop here.
  • The use case is latency-sensitive: interactive chat, autocomplete, anything a human waits on token-by-token. The per-token network round-trip is felt directly.
  • Your links are slow or your devices are wildly mismatched. A gigabit switch and similar machines cluster acceptably; Wi-Fi plus a phone and a NUC will crawl, because the ring waits on the worst link and the worst device every token.

The honest summary: clustering trades latency for capacity. You accept slower generation in exchange for the ability to run a model at all. If you don't need that trade, don't make it.

The simpler alternative for most people

Most people who reach for clustering don't actually have a capacity problem — they have a throughput-or-reliability problem, and clustering is the wrong tool for it. If each of your machines can already hold the model, don't split the model — route the requests.

Request routing keeps a whole copy of the model on each node and sends whole requests to whichever node is free. The network only touches the request boundary, never the per-token critical path, so there's no per-token latency tax. Add nodes and throughput scales roughly linearly; lose a node and the others keep serving. That's the opposite latency/throughput profile from clustering, and it's what you want for serving real traffic.

This is where an AI gateway earns its keep. WideAreaAI is an edge-first LLM gateway: your apps get one OpenAI-compatible endpoint, and each request is routed — edge cache first, then your own llama.cpp GPU node (reached over a Cloudflare Tunnel, no inbound ports), then cloud burst as failover when your hardware is saturated or down. The model is own your baseline, burst to the cloud — edge-first, cloud when you choose, with a markup-free baseline and no per-token fees on the hardware you already own. Crucially, that's request-level routing and failover across whole nodes — it is not model-splitting or tensor parallelism, and it doesn't try to be. It solves the throughput-and-reliability problem that clustering doesn't, and leaves clustering to the capacity problem it does.

Conclusion

Cluster for capacity; route for throughput. If the model won't fit on any one machine and you can tolerate slower, network-bound generation, exo or llama.cpp RPC will pool your hardware's memory and let you run something no single box could hold. If instead each machine already fits the model and you just need more requests per second or graceful failover, splitting the model over a LAN is a step backward — route whole requests across whole nodes (and burst to the cloud) instead. Decide by your actual constraint, not by how many machines happen to be sitting on the shelf.

Frequently Asked Questions

Find answers to common questions

Yes. Tools like exo and llama.cpp's RPC mode shard one model across networked machines so their combined memory holds a model none of them could fit alone — for example, DeepSeek V3 671B at 8-bit spread across 8 Mac minis pooling roughly 512 GB. The catch is that activations now hand off across your LAN between layers, so network bandwidth and latency become the bottleneck, not GPU compute.

exo (from Exo Labs) is an open-source framework that turns a heterogeneous mix of devices — Apple Silicon Macs, Linux PCs, even phones and Raspberry Pis — into a single inference cluster on your local network. It auto-discovers devices, measures their memory and link speed, and partitions the model with a ring memory-weighted strategy so each device runs a share of layers proportional to its RAM. There is no master node; devices connect peer-to-peer. Note: the project's repo has been flagged as archived, so confirm its maintenance status before depending on it.

Almost never. Clustering is about capacity, not speed. With pipeline parallelism the layers run sequentially across machines, and the activation hand-off crosses your network on every token, so throughput is bounded by the slowest device and the slowest link. A model that fits one GPU will always run faster on that one GPU than split across two over Ethernet. You cluster to run a model you otherwise couldn't, not to run a model faster.

Clustering splits ONE model across machines to add capacity — every machine cooperates on every token. Request routing sends WHOLE requests to whichever whole node is free, adding throughput and failover but requiring that each node can already hold the model on its own. They solve opposite problems: clustering for 'the model won't fit,' routing for 'I have too many requests.'

Yes. In pipeline-parallel inference the partial result (the activations) must travel from the machine holding layers 1–N to the machine holding layers N+1–M, and back around the ring, for every single token generated. On a gigabit LAN that per-hop cost is small but non-zero and it compounds across hops, which is why clustered throughput rarely beats a low single-digit to low-double-digit tokens/sec for large models.

You don't necessarily need a cluster at all. A 70B model at Q4_K_M is roughly 38–43 GB of weights plus KV cache and overhead, so a single 80 GB card (A100/H100) or two 24 GB cards (2× RTX 4090) holds it without any network involved — and far faster than clustering. Reach for clustering only when the model exceeds what you can put in one box, like a 400B-plus model or a 671B Mixture-of-Experts model such as DeepSeek-V3.

Let's turn this knowledge into action

Our experts can help you apply these insights to your specific situation. No sales pitch — just a technical conversation.