Parallelism Strategies
Pipeline Parallelism
Model layers are divided into sequential “stages,” each assigned to a different device. Data flows through the stages like an assembly line, with micro‑batches interleaving forward and backward passes to keep GPUs busy. Diagram 2 illustrates this: multiple GPUs form pipeline stages, with staggered execution across micro‑batches to overlap computation and communication.
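A minimal, forward‑only sketch of one pipeline stage is shown below. It assumes a launch such as `torchrun --nproc_per_node=<num_stages> pipeline.py` (the filename, layer sizes, and micro‑batch counts are illustrative); each rank runs one stage and passes activations to the next rank via point‑to‑point sends. The backward pass and the 1F1B/interleaved schedule that actually keeps all stages busy are omitted.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def run_stage(hidden=256, micro_batches=4, micro_batch_size=8):
    dist.init_process_group(backend="gloo")        # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    # Each rank owns one stage: a single linear layer stands in for a
    # contiguous block of model layers.
    stage = nn.Linear(hidden, hidden)

    for _ in range(micro_batches):
        if rank == 0:
            x = torch.randn(micro_batch_size, hidden)   # first stage reads input
        else:
            x = torch.empty(micro_batch_size, hidden)
            dist.recv(x, src=rank - 1)                  # activations from previous stage

        y = stage(x)

        if rank < world - 1:
            dist.send(y, dst=rank + 1)                  # pass activations downstream
        # The last stage would compute the loss; backward is omitted here.

    dist.destroy_process_group()

if __name__ == "__main__":
    run_stage()
```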
Data Parallelism
Every device holds a complete model replica and processes a distinct portion of the input batch in parallel. After the backward pass, gradients are synchronized across devices—usually via an all‑reduce—to ensure model consistency.
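The sketch below shows one data‑parallel training step with a manual gradient all‑reduce (in practice a wrapper such as `DistributedDataParallel` does this for you). It assumes a launch like `torchrun --nproc_per_node=<replicas> data_parallel.py`; the model, batch shapes, and learning rate are placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def train_step():
    dist.init_process_group(backend="gloo")        # "nccl" on GPUs
    world = dist.get_world_size()

    torch.manual_seed(0)                           # identical init => identical replicas
    model = nn.Linear(32, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each rank processes a different shard of the global batch.
    x = torch.randn(16, 32)
    y = torch.randn(16, 1)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Average gradients across replicas so every rank applies the same update.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world

    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train_step()
```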
Tensor Parallelism
Large tensors (e.g., weight matrices) within a layer are split and distributed across devices. Each device computes part of the operation—like matrix multiplication—and then uses collectives such as all‑gather or reduce‑scatter to assemble outputs. This is well illustrated in diagrams 1 and 3.
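Below is a minimal sketch of a column‑parallel matrix multiply: each rank holds only a column slice of the weight matrix, computes its slice of the output, and an all‑gather reassembles the full result (a row‑parallel variant would use a reduce‑scatter or all‑reduce instead). It assumes a `torchrun` launch across the shard ranks; sizes are illustrative.

```python
import torch
import torch.distributed as dist

def column_parallel_matmul(in_features=64, out_features=64, batch=8):
    dist.init_process_group(backend="gloo")        # "nccl" on GPUs
    world = dist.get_world_size()

    # Every rank holds only its column slice of the full weight matrix.
    shard = out_features // world
    w_shard = torch.randn(in_features, shard)

    # The input is replicated; each rank computes its slice of the output.
    torch.manual_seed(0)
    x = torch.randn(batch, in_features)
    y_local = x @ w_shard                          # (batch, shard)

    # All-gather the partial outputs and concatenate along the feature dim.
    parts = [torch.empty_like(y_local) for _ in range(world)]
    dist.all_gather(parts, y_local)
    y_full = torch.cat(parts, dim=-1)              # (batch, out_features)

    dist.destroy_process_group()
    return y_full

if __name__ == "__main__":
    column_parallel_matmul()
```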
Sequence Parallelism
An extension of tensor parallelism, sequence parallelism shards along the sequence dimension (e.g., time steps) across devices. It’s especially useful for processing long input sequences, splitting activations across devices for operations that aren’t tensor‑friendly (like LayerNorm and non‑linearities).
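The sketch below illustrates the idea under simplifying assumptions: activations stay sharded along the sequence dimension for per‑token work such as LayerNorm (no communication needed), and are gathered back only when a sequence‑wide operation (e.g., attention) requires the full sequence. In practice a reduce‑scatter re‑shards the result afterwards; that step is omitted. Assumes a `torchrun` launch; shapes are illustrative.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def sequence_parallel_layernorm(seq_len=128, hidden=64, batch=2):
    dist.init_process_group(backend="gloo")        # "nccl" on GPUs
    world = dist.get_world_size()

    # Each rank holds only its chunk of the sequence (seq_len / world tokens).
    local_len = seq_len // world
    x_local = torch.randn(batch, local_len, hidden)

    # LayerNorm acts per token, so it runs on the local shard with no
    # communication at all.
    ln = nn.LayerNorm(hidden)
    y_local = ln(x_local)

    # Before a sequence-wide op, gather the shards back into the full sequence.
    parts = [torch.empty_like(y_local) for _ in range(world)]
    dist.all_gather(parts, y_local)
    y_full = torch.cat(parts, dim=1)               # (batch, seq_len, hidden)

    dist.destroy_process_group()
    return y_full

if __name__ == "__main__":
    sequence_parallel_layernorm()
```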
Expert Parallelism
In Mixture-of-Experts (MoE) models, only a subset of expert sub-networks is activated per token. Expert parallelism distributes these experts across devices—each GPU holds a subset of experts and processes tokens routed to them, using all‑to‑all communication to handle routing.
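A heavily simplified sketch of the routing pattern is shown below: each rank owns one expert, tokens are exchanged with an all‑to‑all, processed by the local expert, and returned with a second all‑to‑all. The gating here is a toy placeholder (equal, fixed‑size buckets per expert); real routers use a learned gate, top‑k selection, and capacity limits. Assumes a GPU launch via `torchrun` (the NCCL or MPI backend is needed for all‑to‑all); names and sizes are illustrative.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

@torch.no_grad()
def moe_step(tokens_per_rank=16, hidden=32):
    dist.init_process_group(backend="nccl")        # all_to_all needs NCCL or MPI
    world = dist.get_world_size()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    expert = nn.Linear(hidden, hidden).to(device)              # this rank's expert
    x = torch.randn(tokens_per_rank, hidden, device=device)    # local tokens

    # Toy gating: bucket i of the local tokens is routed to expert/rank i.
    bucket = tokens_per_rank // world
    send = list(x.split(bucket, dim=0))

    # All-to-all: afterwards recv[j] holds the tokens rank j routed here.
    recv = [torch.empty(bucket, hidden, device=device) for _ in range(world)]
    dist.all_to_all(recv, send)

    # Apply the local expert to every token routed to this rank.
    out = [expert(chunk) for chunk in recv]

    # A second all-to-all returns expert outputs to the tokens' home ranks.
    back = [torch.empty(bucket, hidden, device=device) for _ in range(world)]
    dist.all_to_all(back, out)

    dist.destroy_process_group()
    return torch.cat(back, dim=0)

if __name__ == "__main__":
    moe_step()
```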