
Splitting Models Across Multiple GPUs — and Why Image and Video Models Can't Do It the Same Way

Text models shard across GPUs cleanly; diffusion image and video models fight you every step. Here's how tensor and pipeline parallelism work, why transformers split so well, and why a U-Net/DiT doesn't.

By InventiveHQ Team

The question: my model won't fit on one GPU

You sized the model, checked the math, and the number came back ugly. A 70B model at Q4_K_M needs roughly 38–43 GB of weights, and once you add the KV cache and runtime overhead you're closer to ~48 GB total at modest context. A single RTX 4090 has 24 GB. A 5090 has 32 GB. Neither holds it. (If you haven't done this sizing yet, our LLM VRAM Calculator and what-llm-can-i-run will tell you exactly where the wall is.)
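The sizing arithmetic above can be sketched in a few lines. This is a rough estimate, not a substitute for the calculator: the parameter count, the ~4.85 bits per weight for Q4_K_M, the Llama-3-70B-like layer shapes, and the 8K context are all assumptions chosen for illustration.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one K and one V vector per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed values (not from any spec sheet): ~70.6B parameters, ~4.85 bits per
# weight for Q4_K_M, 80 layers, 8 KV heads of dimension 128, fp16 cache.
weights = weights_gb(n_params=70.6e9, bits_per_weight=4.85)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=8192)
total = weights + kv

print(f"weights ~{weights:.1f} GB, KV cache ~{kv:.1f} GB, subtotal ~{total:.1f} GB")
for card_gb in (24, 32, 48):
    verdict = "fits" if total < card_gb else "does not fit"
    print(f"{card_gb} GB: {verdict} (before runtime overhead)")
```

Runtime overhead (compute buffers, CUDA context) comes on top of this subtotal, which is why the practical figure lands closer to 48 GB.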

So you reach for the obvious fix: two GPUs. And for a text model, that works beautifully — two 24 GB cards (48 GB combined) are just enough to hold a 70B Q4 model that no single consumer card can. But the moment someone asks "great, can we do the same thing for our Stable Diffusion or video pipeline?" the answer changes completely. The image model fights you.

This post explains both halves: how transformers split across GPUs so cleanly, and the specific architectural reasons diffusion models don't. The difference isn't tooling laziness — it's the shape of the computation.

How text models split: tensor and pipeline parallelism

There are two fundamentally different axes you can cut a model along, and they have very different interconnect requirements.

Tensor parallelism (intra-layer). Take a single layer's weight matrices and split them across GPUs: by columns on one matmul, by rows on the next. Each GPU computes its slice of the output, and an all-reduce after each column/row pair stitches the partial results back together. In the usual Megatron-style layout that means two all-reduces per layer, one after the attention block and one after the MLP. Because a transformer layer is just big matrix multiplies (the QKV/output projections and the MLP), and those matmuls have clean split dimensions (attention heads, hidden dimension), the work divides evenly. The catch: those all-reduces happen in every layer, so the GPUs talk constantly. Tensor parallelism wants NVLink.
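To see why the column-then-row split works, here is a toy simulation in plain Python with no GPUs involved. It shards a two-matmul MLP across two pretend devices and checks that summing the partial outputs (the job an all-reduce does) reproduces the unsharded result. The matrix sizes, the ReLU activation, and every helper name are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of every device's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


random.seed(0)
n_gpus, d_model, d_ff = 2, 4, 8
x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]  # 3 tokens
w1 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
w2 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]

# Reference: the unsharded MLP.
reference = matmul(relu(matmul(x, w1)), w2)

# Sharded: w1 by columns, w2 by rows. ReLU is elementwise, so each device
# applies it to its own slice without communicating.
partials = []
for w1_shard, w2_shard in zip(split_cols(w1, n_gpus), split_rows(w2, n_gpus)):
    hidden = relu(matmul(x, w1_shard))
    partials.append(matmul(hidden, w2_shard))

sharded = all_reduce_sum(partials)
max_err = max(abs(a - b) for ra, rb in zip(reference, sharded) for a, b in zip(ra, rb))
print(f"max difference vs unsharded: {max_err:.2e}")
```

The only cross-device step is the final sum. Pairing a column split with a row split is what keeps the communication to one reduction per pair of matmuls.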

Pipeline parallelism (inter-layer). Instead of splitting within a layer, assign whole ranges of layers to each GPU: layers 0–39 on GPU 0, layers 40–79 on GPU 1. Activations flow GPU 0 → GPU 1 once per boundary. The communication is tiny — one activation tensor per handoff — so it runs fine over PCIe. The downside is the "bubble": at batch size 1, while GPU 1 works, GPU 0 sits idle. You hide bubbles by streaming many requests through the pipeline.
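A minimal sketch of both halves of that paragraph: assigning contiguous layer ranges, and estimating how much of the time the pipeline sits idle. The bubble formula is the standard estimate for a simple fill-and-drain schedule with equal-cost stages; treating each in-flight request as one micro-batch is a simplifying assumption.

```python
def layer_ranges(n_layers: int, n_stages: int) -> list[range]:
    """Contiguous, near-equal layer ranges, one per pipeline stage (GPU)."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle share of a simple fill-and-drain pipeline with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for gpu, r in enumerate(layer_ranges(80, 2)):
    print(f"GPU {gpu}: layers {r.start}-{r.stop - 1}")

for in_flight in (1, 4, 16):
    idle = bubble_fraction(2, in_flight)
    print(f"2 stages, {in_flight} requests in flight: {idle:.0%} idle")
```

With a single request in flight, a two-stage pipeline is idle half the time. The idle share shrinks as more requests stream through.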

Here's the data flow for both, on a 4-GPU example:

[Figure: Tensor parallelism splits each layer across GPUs; pipeline parallelism splits the layer stack. Tensor parallelism panel: one layer's weight matrix split 4 ways across GPU0–GPU3, with an all-reduce every layer (needs NVLink). Pipeline parallelism panel: layer ranges on each GPU (GPU0 layers 0–19, GPU1 layers 20–39, GPU2 layers 40–59, GPU3 layers 60–79), with one activation handoff per boundary (PCIe is fine).]

In practice you combine them: tensor parallelism within a node (where NVLink exists) and pipeline parallelism across nodes (where you only have Ethernet/PCIe), plus data parallelism to scale request throughput. That TP × PP × DP stacking is exactly how vLLM, TensorRT-LLM, and DeepSpeed scale to dozens of GPUs.
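One way to picture the TP × PP × DP grid is as a mapping from a flat GPU rank to three coordinates. The sketch below makes the tensor-parallel index vary fastest so that each tensor-parallel group lands on neighboring GPUs. That ordering is an illustrative assumption, not the exact rank layout of vLLM, TensorRT-LLM, or DeepSpeed.

```python
def parallel_coords(rank: int, tp: int, pp: int, dp: int) -> dict[str, int]:
    """Map a flat GPU rank to (tp, pp, dp) coordinates, with TP varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the TP x PP x DP grid")
    return {
        "tp": rank % tp,            # which slice of each weight matrix
        "pp": (rank // tp) % pp,    # which range of layers
        "dp": rank // (tp * pp),    # which full replica of the model
    }


tp, pp, dp = 4, 2, 2  # 16 GPUs in total
for rank in range(tp * pp * dp):
    c = parallel_coords(rank, tp, pp, dp)
    print(f"rank {rank:2d}: replica {c['dp']}, stage {c['pp']}, tensor shard {c['tp']}")
```

Ranks 0 to 3 share a pipeline stage and differ only in tensor shard, so they are the ones that need the fast link.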

Why it works so well for transformers

The reason this all clicks is the architecture. A decoder-only transformer is a stack of N identical layers. Every layer has the same shape, the same operations, the same tensors — attention projections and a feed-forward MLP, repeated 32 times for an 8B model or 80 times for a 70B. That uniformity gives you three things for free:

  • Clean split dimensions. Weight matrices divide along heads or hidden dimension with no leftover. There's an obvious place to cut.
  • Trivial load balancing. Pipeline parallelism just needs equal-sized layer ranges, and since every layer is identical, "equal" means "same count." No solver required.
  • Regular, amortized communication. The all-reduce in tensor parallelism is predictable and happens over large matmuls, so the comms cost is spread across real compute.

The tooling reflects this. llama.cpp's --split-mode gives you layer (pipeline-style, the default, least inter-GPU traffic), row/tensor (true tensor split, can use NCCL), and none. vLLM exposes tensor parallelism as a single --tensor-parallel-size flag. The hard part — figuring out how to shard — is essentially solved because the model's structure hands you the answer.

| Split method | What it does | Interconnect need |
| --- | --- | --- |
| Pipeline / layer split | Contiguous layer ranges per GPU; one activation handoff per boundary | PCIe Gen4/5 acceptable |
| Tensor / row split | Each weight matrix split across GPUs; all-reduce every layer | NVLink strongly preferred |
| Data parallel | Full model replicated per GPU; independent requests each | No GPU-to-GPU link needed |
| Component offload | Push idle sub-models (or layers) to CPU/second GPU/RAM | PCIe; saves memory, not latency |
| Cross-machine (RPC/cluster) | Pipeline a model across separate hosts over the network | LAN; bounded by slowest link |

Why image and video models are different

Now point the same toolkit at a diffusion model and it stops working. There are three independent reasons, and you only need one of them to break tensor parallelism — diffusion has all three.

1. The graph is heterogeneous. A U-Net is an encoder that downsamples, a bottleneck, and a decoder that upsamples — operating at different spatial resolutions — with skip connections that jump from the encoder straight across to the matching decoder stage. There is no single split dimension that's clean across the whole network, and those skip connections force activation transfers between GPUs at awkward points that wreck tensor-parallel efficiency. DiT (the transformer-based successor used in newer image/video models) is more uniform internally, but it's still wrapped in a VAE and text encoders, and laced with cross-attention.

2. Cross-attention couples everything to the conditioning. Every block cross-attends to the text/image conditioning (CLIP/T5 embeddings). That's extra coupling that doesn't exist in a plain decoder stack, adding communication that a clean tensor split would otherwise avoid.

3. The denoising loop is inherently serial. This is the big one. Generating an image isn't one forward pass — it's 20–50+ iterative denoising steps, and each step's full-image latent depends on the previous step's output. Step t needs step t−1. That's a serial outer loop you cannot parallelize across steps. Pipeline parallelism across layers within one step buys almost nothing, and step-level pipelining just stalls. On top of that the latent for a single image is small relative to the per-step compute, so splitting a tensor across GPUs adds more synchronization overhead than the work it offloads.
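The serial dependency is easiest to see as code. In this toy sketch, `denoise_step` is a hypothetical stand-in for one full U-Net/DiT forward pass plus the scheduler update; the 10% shrink per step is made up and only there to make the loop runnable.

```python
def denoise_step(latent: list[float], step: int) -> list[float]:
    """Hypothetical stand-in for one denoiser forward pass plus scheduler update."""
    return [v * 0.9 for v in latent]  # made-up update: shrink the "noise" by 10%


def generate(initial_noise: list[float], total_steps: int = 20) -> list[float]:
    latent = initial_noise
    for step in range(total_steps):
        # This step's input is the previous step's output, so the iterations
        # cannot be handed to different GPUs and run at the same time.
        latent = denoise_step(latent, step)
    return latent


result = generate([0.8, -1.2, 0.3])
print(result)
```

Every iteration reads the `latent` that the previous one wrote. That loop-carried dependency is what rules out running step t and step t−1 concurrently on two cards.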

So what can you do with multiple GPUs and diffusion?

  • Component/model offloading — keep the text encoder, VAE, and U-Net/DiT on different devices, or stream them in and out of VRAM ("sequential CPU offload," "model offload"). This fits a big pipeline on small cards. It saves memory, not latency.
  • Data parallelism — run independent images or batches on each GPU. This is embarrassingly parallel and the best throughput scaling you'll get. Two GPUs ≈ two images at once.
  • VAE/CLIP offload to CPU or a second GPU during the denoise loop, freeing the main card.
  • Specialized research methods — DistriFusion, xDiT, patch-parallel and sequence-parallel DiT split the latent spatially or pipeline across steps with displaced/async communication. These exist, but they're bespoke techniques, not the free tensor parallelism transformers enjoy. For video, the practical axis is splitting across frames or temporal chunks — data-parallel in spirit — not within a single denoise step.

The summary: with a transformer you split the model; with diffusion you mostly split the workload (different images on different cards) or the components, because you can't cleanly split a single denoise step.
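Splitting the workload rather than the model can be sketched as a round-robin assignment of whole prompts to GPUs, where each GPU runs the complete pipeline on its share. The helper names and the 6-second-per-image figure are hypothetical, chosen only to show the shape of the scaling.

```python
def shard_prompts(prompts: list[str], n_gpus: int) -> list[list[str]]:
    """Round-robin whole prompts across GPUs; every GPU runs the full pipeline."""
    return [prompts[gpu::n_gpus] for gpu in range(n_gpus)]


def batch_seconds(n_images: int, n_gpus: int, seconds_per_image: float) -> float:
    """Wall-clock time when each GPU works through its own share one image at a time."""
    images_on_busiest_gpu = -(-n_images // n_gpus)  # ceiling division
    return images_on_busiest_gpu * seconds_per_image


prompts = [f"prompt {i}" for i in range(5)]
for gpu, share in enumerate(shard_prompts(prompts, 2)):
    print(f"GPU {gpu}: {share}")

# seconds_per_image is a made-up figure for illustration.
print("8 images, 1 GPU :", batch_seconds(8, 1, 6.0), "s")
print("8 images, 2 GPUs:", batch_seconds(8, 2, 6.0), "s")
print("1 image,  2 GPUs:", batch_seconds(1, 2, 6.0), "s")
```

Under this model a batch finishes in half the time on two cards, while a single image takes just as long as it did on one. That is the "throughput, not latency" trade.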

The practical reality

Multi-GPU for LLMs earns its keep mainly in one situation: the model doesn't fit on one card. A 70B Q4 needs ~48 GB, so 2× 24 GB makes it possible at all, and that's a real win. But be honest about what you're buying.

Decode speed is memory-bandwidth-bound, not GPU-count-bound. The rough ceiling is tok/s ≈ bandwidth ÷ active_bytes_per_token. With a layer split the cards take turns on each token, so adding a second GPU doesn't raise per-stream bandwidth; it raises capacity. Two cards let the model run; they don't make each token twice as fast. (Estimate this for your hardware, or measure real bandwidth in-browser with the GPU benchmark.)
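Here is that ceiling as a back-of-the-envelope calculation. The 1008 GB/s figure is the RTX 4090's spec-sheet memory bandwidth and the 40 GB of weights is an assumed round number for a 70B Q4 model. Real throughput will come in below this bound.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """Upper bound on single-stream decode speed when weights are read once per token."""
    return bandwidth_gb_s / active_gb_per_token


def layer_split_ceiling_tok_s(bandwidth_gb_s: float, active_gb_per_token: float,
                              n_gpus: int) -> float:
    """Same bound for a layer split: GPUs take turns, so their read times add up."""
    seconds_per_gpu = (active_gb_per_token / n_gpus) / bandwidth_gb_s
    return 1.0 / (n_gpus * seconds_per_gpu)


bandwidth = 1008.0  # GB/s, RTX 4090 spec-sheet figure
weights = 40.0      # GB read per token (assumed, dense 70B at Q4)

# Hypothetical: a single card with this bandwidth that could hold all 40 GB.
one_card = decode_ceiling_tok_s(bandwidth, weights)
# Two real 24 GB cards, each holding half the layers.
two_cards = layer_split_ceiling_tok_s(bandwidth, weights, n_gpus=2)

print(f"one hypothetical big card: ~{one_card:.1f} tok/s ceiling")
print(f"two cards, layer split:    ~{two_cards:.1f} tok/s ceiling")
```

Both come out at about 25 tok/s. Under a layer split the second card adds room for the weights, not speed for the stream.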

Interconnect can quietly eat your gains. Tensor parallelism does an all-reduce every layer; over PCIe instead of NVLink that traffic becomes the bottleneck and can cut your tokens/sec substantially. If you only have PCIe, prefer a layer/pipeline split. And before committing to two small cards, price out the alternative: a single 48 GB or 80 GB card holds a 70B model with no cross-GPU comms at all, and the self-hosted cost calculator will tell you whether owning that hardware beats bursting to a cloud API for your volume.
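A rough model of why the link matters: count the synchronization points per token and charge a fixed latency for each. The two-all-reduces-per-layer count assumes Megatron-style sharding (one after attention, one after the MLP). The latencies and the 50 tok/s compute-only rate are placeholders rather than measurements, and the model ignores any overlap of communication with compute.

```python
def all_reduces_per_token(n_layers: int, per_layer: int = 2) -> int:
    """Sync points per decoded token; per_layer=2 assumes Megatron-style sharding."""
    return n_layers * per_layer


def tok_s_with_comm(compute_tok_s: float, n_layers: int,
                    all_reduce_latency_us: float, per_layer: int = 2) -> float:
    """Tokens/sec once a fixed latency is charged for every all-reduce."""
    comm_seconds = all_reduces_per_token(n_layers, per_layer) * all_reduce_latency_us / 1e6
    return 1.0 / (1.0 / compute_tok_s + comm_seconds)


n_layers = 80         # 70B-class model
compute_tok_s = 50.0  # placeholder compute-only rate for a 2-way tensor split

print(f"{all_reduces_per_token(n_layers)} all-reduces per token")
for latency_us in (10, 50, 100):  # placeholder per-all-reduce latencies
    rate = tok_s_with_comm(compute_tok_s, n_layers, latency_us)
    print(f"{latency_us:>3} us per all-reduce: ~{rate:.1f} tok/s")
```

During decode each all-reduce moves only a small activation vector, so per-call latency tends to matter more than raw link bandwidth. Measure your own link before trusting any of these numbers.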

The other way to scale: don't split, distribute

Splitting one model across GPUs in one box is a capacity trick. There's a second, operationally different way to scale that often matters more once you're past the "does it fit" question: distribute requests, not weights.

  • Cluster across machines. llama.cpp's RPC backend pipeline-parallelizes a GGUF model across N hosts; exo clusters heterogeneous devices (Macs, Linux PCs, even a Raspberry Pi) peer-to-peer and ring-partitions the model by each device's memory. This pools memory to fit models too big for any single machine — at the cost of being bounded by the slowest network link. It's clustering, and it's a genuinely different thing from tensor parallelism inside one server. Don't conflate them.
  • Route requests across nodes. Instead of making one giant model, run a model per node and put a router in front. Each request goes to a healthy node; overflow fails over elsewhere; repeated prompts get served from cache.

That routing layer is where most production local-AI setups actually live, because it solves the problems splitting doesn't: failover, caching, and a single stable endpoint. If you reach that wall — you need one OpenAI-compatible URL that runs against your own GPU baseline and bursts to the cloud only when you choose — an edge-first AI gateway like WideAreaAI does request-level routing, failover, and edge caching across whole nodes. (To be clear about scope: that's routing and caching across nodes, not model-splitting across machines — own your baseline, burst to the cloud, no per-token fees on hardware you already run.)
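The routing idea fits in a short toy sketch: a cache lookup, then nodes tried in priority order with failover. Every class and name here is hypothetical. It illustrates the pattern and is not how WideAreaAI or any other gateway is implemented; `Node.complete` stands in for a real HTTP call to an inference server.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True

    def complete(self, prompt: str) -> str:
        """Hypothetical stand-in for a request to this node's inference server."""
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"[{self.name}] reply to: {prompt}"


@dataclass
class Router:
    nodes: list[Node]
    cache: dict[str, str] = field(default_factory=dict)

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:       # repeated prompt: serve from cache
            return self.cache[prompt]
        for node in self.nodes:        # try nodes in priority order
            try:
                reply = node.complete(prompt)
            except ConnectionError:
                continue               # fail over to the next node
            self.cache[prompt] = reply
            return reply
        raise RuntimeError("no healthy node available")


router = Router([Node("local-gpu", healthy=False), Node("cloud-burst")])
first = router.complete("hello")   # local node is down, so this fails over
second = router.complete("hello")  # identical prompt, answered from the cache
print(first)
print(second == first)
```

No weights move between machines here. Each node holds a complete model, and the router only decides where a request goes.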

Conclusion

Text transformers split across GPUs cleanly because they're stacks of identical layers with obvious cut points and regular communication — tensor and pipeline parallelism fall right out of the architecture. Diffusion image and video models resist it because they're heterogeneous, cross-attention-heavy, and gated by a serial denoising loop you can't parallelize across steps; the most you split there is components or whole images, not a single step. So pick the scaling path that matches the model: split transformers when they won't fit, parallelize images (not steps) for diffusion, and reach for clustering or request routing when the real problem is capacity, failover, or a stable endpoint rather than a single oversized model.

Frequently Asked Questions


Yes. Tensor parallelism splits each layer's weight matrices across GPUs; pipeline parallelism assigns contiguous ranges of layers to different GPUs. Both work well because a transformer is a stack of identical decoder layers with clean split dimensions. Engines like vLLM, TensorRT-LLM, and llama.cpp (--split-mode) do this natively, so a 70B model that needs ~48 GB at modest context just fits across 2× 24 GB cards.

Diffusion models resist it for architectural reasons. A U-Net is heterogeneous: different blocks operate at different spatial resolutions, with skip connections crossing the network, so there's no single clean split dimension. A DiT is more uniform internally but is still wrapped in a VAE and text encoders. Worse, generation is a sequential denoising loop of 20–50+ steps where step t depends on step t−1, so you can't simply pipeline across steps. You can offload the text encoder (CLIP/T5) or the VAE, or run separate images per GPU (data parallel), but splitting one denoise step across cards usually adds more sync overhead than it saves.

Tensor parallelism splits each layer's math across GPUs and recombines with an all-reduce every layer — it needs a fast interconnect (NVLink) because the cross-GPU traffic is constant. Pipeline parallelism puts different layers on different GPUs and passes activations forward once per boundary — it tolerates PCIe but introduces pipeline 'bubbles' (idle time) at batch size 1. Large production setups combine both (TP within a node, PP across nodes).

It's worth it when the model genuinely doesn't fit one card — that's the main reason to do it. But interconnect matters: tensor parallelism over PCIe instead of NVLink can erase most of the gain, and decode speed is bandwidth-bound, not GPU-count-bound. Often a single 48–80 GB card, or routing/bursting to the cloud for overflow, beats the complexity of two small cards glued together.

Not necessarily. Single-stream decode is limited by memory bandwidth, and splitting adds communication overhead. Tensor parallelism over NVLink can improve latency for a model that already fits, but the dominant reason to split is capacity — fitting a model that's too big for one GPU — not raw speed. For throughput, running independent requests on each GPU (data parallel) usually scales better.

Pipeline parallelism runs fine over PCIe because it only transfers activations once per layer-range boundary. Tensor parallelism wants NVLink (or NVSwitch) because it does an all-reduce every single layer; on PCIe Gen4/5 that communication becomes the bottleneck and can cut your tokens/sec substantially. Rule of thumb: PCIe is acceptable for pipeline/layer splits, NVLink is strongly preferred for tensor/row splits.

Yes, but it's a different tool. llama.cpp's RPC backend pipeline-parallelizes a GGUF model across N hosts, and projects like exo cluster heterogeneous devices (Macs, PCs, Pi) peer-to-peer. This is clustering — it pools memory to fit a model too big for any one machine — and throughput is bounded by the slowest network link. It is not the same as tensor parallelism inside a single server. See our clustering guide for the trade-offs.
