The question: my model won't fit on one GPU
You sized the model, checked the math, and the number came back ugly. A 70B model at Q4_K_M needs roughly 38–43 GB of weights, and once you add the KV cache and runtime overhead you're closer to ~48 GB total at modest context. A single RTX 4090 has 24 GB. A 5090 has 32 GB. Neither holds it. (If you haven't done this sizing yet, our LLM VRAM Calculator and what-llm-can-i-run will tell you exactly where the wall is.)
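The sizing arithmetic above can be sketched in a few lines. This is a rough estimate, not a measurement: the ~4.8 bits per weight for Q4_K_M, the Llama-style 70B shape (80 layers, 8 KV heads, head dimension 128), the 8K context, and the 2 GB runtime overhead are all assumptions chosen for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes) for a quantized model."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GB: one K and one V vector per layer, per KV head, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed values, not taken from any specific model card:
weights = weights_gb(params_billion=70, bits_per_weight=4.8)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=8192)
overhead = 2.0  # assumed runtime buffers and scratch space, in GB
total = weights + kv + overhead

for name, capacity_gb in [("1x 24 GB", 24), ("1x 32 GB", 32), ("2x 24 GB", 48)]:
    verdict = "fits" if total <= capacity_gb else "does not fit"
    print(f"{name}: need {total:.1f} GB -> {verdict}")
```

Longer contexts grow the KV term linearly, which is why the total creeps toward the 48 GB mark as context increases.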
So you reach for the obvious fix: two GPUs. And for a text model, that works beautifully — two 24 GB cards (48 GB combined) are just enough to hold a 70B Q4 model that no single consumer card can. But the moment someone asks "great, can we do the same thing for our Stable Diffusion or video pipeline?" the answer changes completely. The image model fights you.
This post explains both halves: how transformers split across GPUs so cleanly, and the specific architectural reasons diffusion models don't. The difference isn't tooling laziness — it's the shape of the computation.
How text models split: tensor and pipeline parallelism
There are two fundamentally different axes you can cut a model along, and they have very different interconnect requirements.
Tensor parallelism (intra-layer). Take a layer's weight matrices and split them across GPUs: by columns on one matmul, by rows on the next. Each GPU computes its slice of the output, then an all-reduce after each column/row pair sums the partial results back together (in the usual Megatron-style scheme that is one all-reduce after attention and one after the MLP). Because a transformer layer is just big matrix multiplies (the QKV/output projections and the MLP), and those matmuls have clean split dimensions (attention heads, hidden dimension), the work divides evenly. The catch: those all-reduces happen in every layer, so the GPUs talk constantly. Tensor parallelism wants NVLink.
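To make the column-then-row split concrete, here is a minimal pure-Python sketch of a two-matmul MLP split across two simulated GPUs. The weights are made-up toy integers and the helper names are hypothetical; the final sum plays the role of the all-reduce.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


x = [[1, -2, 3]]                                            # one token, hidden size 3
w_up = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1]]        # 3 x 4 toy weights
w_down = [[1, 2, 0], [0, 1, 1], [-1, 0, 2], [1, 1, 1]]      # 4 x 3 toy weights

# Unsplit reference: relu(x @ w_up) @ w_down on a single device.
reference = matmul(relu(matmul(x, w_up)), w_down)

# Each simulated GPU holds a column block of w_up and the matching row block of w_down.
n_gpus = 2
partials = []
for up_shard, down_shard in zip(split_columns(w_up, n_gpus), split_rows(w_down, n_gpus)):
    hidden = relu(matmul(x, up_shard))           # local work, no communication
    partials.append(matmul(hidden, down_shard))  # partial output, full width

# The all-reduce: sum the partial outputs from every GPU.
combined = partials[0]
for partial in partials[1:]:
    combined = add(combined, partial)

print(reference, combined)
```

The element-wise nonlinearity sits between the column split and the row split, which is why no communication is needed until the final sum.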
Pipeline parallelism (inter-layer). Instead of splitting within a layer, assign whole ranges of layers to each GPU: layers 0–39 on GPU 0, layers 40–79 on GPU 1. Activations flow GPU 0 → GPU 1 once per boundary. The communication is tiny — one activation tensor per handoff — so it runs fine over PCIe. The downside is the "bubble": at batch size 1, while GPU 1 works, GPU 0 sits idle. You hide bubbles by streaming many requests through the pipeline.
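The bubble is easy to see in a toy schedule. The sketch below assumes every stage takes one time unit per request and only models the forward pass; the function names are hypothetical.

```python
def pipeline_schedule(n_stages: int, n_requests: int):
    """Return grid[tick][stage]: the request index being processed, or None if idle.

    Assumes each stage takes exactly one tick per request (forward pass only).
    """
    total_ticks = n_stages + n_requests - 1
    grid = []
    for tick in range(total_ticks):
        row = []
        for stage in range(n_stages):
            request = tick - stage  # request r reaches stage s at tick r + s
            row.append(request if 0 <= request < n_requests else None)
        grid.append(row)
    return grid


def utilization(grid) -> float:
    """Fraction of (tick, stage) slots that are doing useful work."""
    busy = sum(cell is not None for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))


for n_requests in (1, 4, 16):
    share = utilization(pipeline_schedule(n_stages=2, n_requests=n_requests))
    print(f"{n_requests:>2} requests in flight: {share:.0%} busy")
```

Under these assumptions utilization is n_requests / (n_requests + n_stages - 1): half the hardware idles with a single request, and the idle share shrinks as more requests stream through.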
[Figure: data flow for tensor parallelism and pipeline parallelism on a 4-GPU example.]
In practice you combine them: tensor parallelism within a node (where NVLink exists) and pipeline parallelism across nodes (where you only have a network link such as Ethernet or InfiniBand), plus data parallelism to scale request throughput. That TP × PP × DP stacking is exactly how vLLM, TensorRT-LLM, and DeepSpeed scale to dozens of GPUs.
Why it works so well for transformers
The reason this all clicks is the architecture. A decoder-only transformer is a stack of N identical layers. Every layer has the same shape, the same operations, the same tensors — attention projections and a feed-forward MLP, repeated 32 times for an 8B model or 80 times for a 70B. That uniformity gives you three things for free:
- Clean split dimensions. Weight matrices divide along heads or hidden dimension with no leftover, as long as the head count is divisible by the GPU count. There's an obvious place to cut.
- Trivial load balancing. Pipeline parallelism just needs equal-sized layer ranges, and since every layer is identical, "equal" means "same count." No solver required.
- Regular, amortized communication. The all-reduce in tensor parallelism is predictable and happens over large matmuls, so the comms cost is spread across real compute.
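Because every layer is identical, the "solver" for the first two points is a couple of integer divisions. A minimal sketch, with hypothetical helper names and an assumed 64 attention heads for the 70B example:

```python
def layer_ranges(n_layers: int, n_gpus: int):
    """Contiguous, near-equal layer ranges as (first, last) inclusive pairs."""
    base, extra = divmod(n_layers, n_gpus)
    ranges, start = [], 0
    for gpu in range(n_gpus):
        count = base + (1 if gpu < extra else 0)  # spread any remainder over the first GPUs
        ranges.append((start, start + count - 1))
        start += count
    return ranges


def heads_per_gpu(n_heads: int, n_gpus: int) -> int:
    """Attention heads per GPU under a tensor split; requires an even division."""
    if n_heads % n_gpus:
        raise ValueError(f"{n_heads} heads do not divide evenly across {n_gpus} GPUs")
    return n_heads // n_gpus


print(layer_ranges(80, 2))   # pipeline split of an 80-layer model across 2 GPUs
print(layer_ranges(32, 4))   # pipeline split of a 32-layer model across 4 GPUs
print(heads_per_gpu(64, 2))  # tensor split, assuming 64 attention heads
```

A heterogeneous network has no equivalent shortcut: balancing it means weighing each stage's compute and memory, and equal layer counts no longer mean equal work.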
The tooling reflects this. llama.cpp's --split-mode gives you layer (pipeline-style, the default, least inter-GPU traffic), row/tensor (true tensor split, can use NCCL), and none. vLLM exposes tensor parallelism as a single --tensor-parallel-size flag. The hard part — figuring out how to shard — is essentially solved because the model's structure hands you the answer.
| Split method | What it does | Interconnect need |
|---|---|---|
| Pipeline / layer split | Contiguous layer ranges per GPU; one activation handoff per boundary | PCIe Gen4/5 acceptable |
| Tensor / row split | Each weight matrix split across GPUs; all-reduce every layer | NVLink strongly preferred |
| Data parallel | Full model replicated per GPU; independent requests each | No GPU-to-GPU link needed |
| Component offload | Push idle sub-models (or layers) to CPU/second GPU/RAM | PCIe; saves memory, not latency |
| Cross-machine (RPC/cluster) | Pipeline a model across separate hosts over the network | LAN; bounded by slowest link |
Why image and video models are different
Now point the same toolkit at a diffusion model and it stops working. There are three independent reasons, and you only need one of them to break tensor parallelism — diffusion has all three.
1. The graph is heterogeneous. A U-Net is an encoder that downsamples, a bottleneck, and a decoder that upsamples — operating at different spatial resolutions — with skip connections that jump from the encoder straight across to the matching decoder stage. There is no single split dimension that's clean across the whole network, and those skip connections force activation transfers between GPUs at awkward points that wreck tensor-parallel efficiency. DiT (the transformer-based successor used in newer image/video models) is more uniform internally, but it's still wrapped in a VAE and text encoders, and laced with cross-attention.
2. Cross-attention couples everything to the conditioning. Every block cross-attends to the text/image conditioning (CLIP/T5 embeddings). That's extra coupling that doesn't exist in a plain decoder stack, adding communication that a clean tensor split would otherwise avoid.
3. The denoising loop is inherently serial. This is the big one. Generating an image isn't one forward pass — it's 20–50+ iterative denoising steps, and each step's full-image latent depends on the previous step's output. Step t needs step t−1. That's a serial outer loop you cannot parallelize across steps. Pipeline parallelism across layers within one step buys almost nothing, and step-level pipelining just stalls. On top of that the latent for a single image is small relative to the per-step compute, so splitting a tensor across GPUs adds more synchronization overhead than the work it offloads.
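The serial dependency in point 3 is visible in the shape of the sampling loop itself. The sketch below uses a hypothetical stand-in for the real network (it just halves every value) so that only the control flow matters:

```python
def denoise_step(latent, step_index):
    """Hypothetical stand-in for one U-Net/DiT forward pass plus the scheduler update."""
    return [value * 0.5 for value in latent]


def sample(initial_latent, n_steps):
    latent = initial_latent
    for step_index in range(n_steps):
        # Step t consumes the output of step t-1, so the iterations cannot
        # be handed to different GPUs and run at the same time.
        latent = denoise_step(latent, step_index)
    return latent


result = sample([16.0, -8.0], n_steps=4)
print(result)
```

Every iteration reads the `latent` the previous one wrote, so the loop-carried dependency rules out running steps side by side no matter how many devices are available.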
So what can you do with multiple GPUs and diffusion?
- Component/model offloading — keep the text encoder, VAE, and U-Net/DiT on different devices, or stream them in and out of VRAM ("sequential CPU offload," "model offload"). This fits a big pipeline on small cards. It saves memory, not latency.
- Data parallelism — run independent images or batches on each GPU. This is embarrassingly parallel and the best throughput scaling you'll get. Two GPUs ≈ two images at once.
- VAE/CLIP offload to CPU or a second GPU during the denoise loop, freeing the main card.
- Specialized research methods — DistriFusion, xDiT, patch-parallel and sequence-parallel DiT split the latent spatially or pipeline across steps with displaced/async communication. These exist, but they're bespoke techniques, not the free tensor parallelism transformers enjoy. For video, the practical axis is splitting across frames or temporal chunks — data-parallel in spirit — not within a single denoise step.
The summary: with a transformer you split the model; with diffusion you mostly split the workload (different images on different cards) or the components, because you can't cleanly split a single denoise step.
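Splitting the workload rather than the model looks something like this. It is a scheduling sketch only: it assumes each GPU generates one image at a time, and the step count and seconds per step are invented numbers.

```python
import math


def assign_round_robin(prompts, n_gpus):
    """Give each GPU its own list of prompts; no GPU-to-GPU traffic is needed."""
    batches = [[] for _ in range(n_gpus)]
    for index, prompt in enumerate(prompts):
        batches[index % n_gpus].append(prompt)
    return batches


def wall_time_seconds(n_images, n_gpus, n_steps, seconds_per_step):
    """Wall time when each GPU runs its images one after another."""
    per_image = n_steps * seconds_per_step
    return math.ceil(n_images / n_gpus) * per_image


prompts = [f"prompt {i}" for i in range(8)]
print(assign_round_robin(prompts, 2))

# Assumed workload: 30 denoising steps at 0.5 s per step.
print(wall_time_seconds(8, 1, 30, 0.5))  # eight images, one GPU
print(wall_time_seconds(8, 2, 30, 0.5))  # eight images, two GPUs
print(wall_time_seconds(1, 2, 30, 0.5))  # one image, two GPUs: no faster than one GPU
```

Throughput scales with the number of cards, but the latency of a single image stays pinned to the serial step count.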
The practical reality
Multi-GPU for LLMs earns its keep in exactly one situation: the model doesn't fit one card. That's it. A 70B Q4 needs ~48 GB, so 2× 24 GB makes it possible at all — that's a real win. But be honest about what you're buying.
Decode speed is memory-bandwidth-bound, not GPU-count-bound. The rough ceiling is tok/s ≈ bandwidth ÷ active_bytes_per_token. With a layer split the GPUs take turns on each token, so adding a second GPU raises capacity, not per-stream bandwidth. Two cards let the model run; they don't make each token twice as fast. (Estimate this for your hardware, or measure real bandwidth in-browser with the GPU benchmark.)
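As a back-of-envelope check on that ceiling, the sketch below plugs in assumed numbers: roughly 1008 GB/s of memory bandwidth per card (the published RTX 4090 figure) and 42 GB of weights read per token for a dense 70B Q4 model. It ignores KV cache reads, compute, and interconnect cost, so real throughput will land below it.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """Upper bound on tokens/sec when every token must stream the active weights once."""
    return bandwidth_gb_s / active_gb_per_token


def layer_split_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float, n_gpus: int) -> float:
    """Layer split: GPUs run one after another, each reading its share of the weights."""
    seconds_per_gpu = (weights_gb / n_gpus) / bandwidth_gb_s
    return 1 / (n_gpus * seconds_per_gpu)


# Assumed inputs: 1008 GB/s per card, 42 GB of weights touched per token.
one_big_card = decode_ceiling_tok_s(1008, 42)          # hypothetical card that holds it all
two_small_cards = layer_split_ceiling_tok_s(1008, 42, 2)

print(f"single card ceiling: {one_big_card:.1f} tok/s")
print(f"two-card layer split ceiling: {two_small_cards:.1f} tok/s")
```

Both ceilings come out the same: each card reads half the weights, but they do it one after the other.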
Interconnect can quietly eat your gains. Tensor parallelism does an all-reduce every layer; over PCIe instead of NVLink that traffic becomes the bottleneck and can cut your tokens/sec substantially. If you only have PCIe, prefer a layer/pipeline split. And before committing to two small cards, price out the alternative: a single 48 GB or 80 GB card holds a 70B model with no cross-GPU comms at all, and the self-hosted cost calculator will tell you whether owning that hardware beats bursting to a cloud API for your volume.
The other way to scale: don't split, distribute
Splitting one model across GPUs in one box is a capacity trick. There's a second, operationally different way to scale that often matters more once you're past the "does it fit" question: distribute requests, not weights.
- Cluster across machines. llama.cpp's RPC backend pipeline-parallelizes a GGUF model across N hosts; exo clusters heterogeneous devices (Macs, Linux PCs, even a Raspberry Pi) peer-to-peer and ring-partitions the model by each device's memory. This pools memory to fit models too big for any single machine — at the cost of being bounded by the slowest network link. It's clustering, and it's a genuinely different thing from tensor parallelism inside one server. Don't conflate them.
- Route requests across nodes. Instead of making one giant model, run a model per node and put a router in front. Each request goes to a healthy node; overflow fails over elsewhere; repeated prompts get served from cache.
That routing layer is where most production local-AI setups actually live, because it solves the problems splitting doesn't: failover, caching, and a single stable endpoint. If you reach that wall — you need one OpenAI-compatible URL that runs against your own GPU baseline and bursts to the cloud only when you choose — an edge-first AI gateway like WideAreaAI does request-level routing, failover, and edge caching across whole nodes. (To be clear about scope: that's routing and caching across nodes, not model-splitting across machines — own your baseline, burst to the cloud, no per-token fees on hardware you already run.)
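The request-routing idea (pick a healthy node, fail over, serve repeats from cache) fits in a small class. This is a generic illustration with hypothetical names, not the API of any particular gateway; real nodes would be HTTP endpoints, not Python callables.

```python
class Router:
    """Round-robin router with failover and an exact-match prompt cache."""

    def __init__(self, nodes):
        self.nodes = nodes          # name -> callable(prompt) returning a reply string
        self.order = list(nodes)
        self.next_index = 0
        self.cache = {}

    def complete(self, prompt):
        if prompt in self.cache:    # repeated prompt: no node is contacted
            return self.cache[prompt]
        for attempt in range(len(self.order)):
            position = (self.next_index + attempt) % len(self.order)
            name = self.order[position]
            try:
                reply = self.nodes[name](prompt)
            except ConnectionError:
                continue            # node is down: fail over to the next one
            self.next_index = (position + 1) % len(self.order)
            self.cache[prompt] = reply
            return reply
        raise RuntimeError("no healthy node available")


calls = []


def node_a(prompt):
    calls.append("a")
    raise ConnectionError("node a is down")   # simulated outage


def node_b(prompt):
    calls.append("b")
    return "b:" + prompt


router = Router({"a": node_a, "b": node_b})
first = router.complete("hello")    # a fails, b answers
second = router.complete("hello")   # served from cache
print(first, second, calls)
```

Each node here runs a whole model on its own, so nothing in this layer depends on how, or whether, any single node splits weights across its GPUs.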
Conclusion
Text transformers split across GPUs cleanly because they're stacks of identical layers with obvious cut points and regular communication — tensor and pipeline parallelism fall right out of the architecture. Diffusion image and video models resist it because they're heterogeneous, cross-attention-heavy, and gated by a serial denoising loop you can't parallelize across steps; the most you split there is components or whole images, not a single step. So pick the scaling path that matches the model: split transformers when they won't fit, parallelize images (not steps) for diffusion, and reach for clustering or request routing when the real problem is capacity, failover, or a stable endpoint rather than a single oversized model.