TL;DR
- Introduced in Megatron-LM v3 (Korthikanti et al., 2022) as a refinement of tensor parallelism.
- Splits the activations of the parts of a transformer block that TP does not naturally shard (LayerNorm, dropout, residual add) along the sequence dimension.
- Cuts activation memory substantially at the same TP group size (roughly 30-50% in long-context runs), with no extra communication volume: each AllReduce is replaced by an AllGather plus a ReduceScatter.
Overview
Plain tensor parallelism shards the MLP and attention weights, but not LayerNorm, residual additions, or dropout, because those are cheap per-token operations on full-size activations. Inside a TP=8 transformer block, the LayerNorm input is the full activation tensor `[batch, seq, hidden]`, replicated on every GPU in the group. For long-context training, that replicated activation can be the largest single memory item.
Sequence parallelism resolves this by sharding those activations along the sequence dimension while keeping the TP sharding on the weight matrices. The communication pattern changes from AllReduce-only to AllGather (before column-parallel matmuls) and ReduceScatter (after row-parallel matmuls). The total bytes moved are the same as in plain TP; they are split across two collectives instead of one.
Mechanism
Inside a TP group of size N, each GPU holds 1/N of the sequence in the SP regions and 1/N of the hidden dimension in the TP regions. Transitions between regions are handled by AllGather (gathering the sequence back into a full activation for the MLP/attention input) and ReduceScatter (summing the partial outputs and scattering the result back along the sequence). A ring AllReduce is itself a ReduceScatter followed by an AllGather, so the total network traffic is unchanged.
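The identity can be checked on toy data. The following is a minimal sketch in plain Python that models each rank's buffer as a list; the helper names (`all_reduce`, `reduce_scatter`, `all_gather`) are illustrative stand-ins, not calls into any communication library.

```python
def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    n = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // n  # assumes the length divides evenly by n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(per_rank_chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]


# Toy partial sums from a row-parallel matmul: 4 ranks, 8 "tokens" each.
partials = [[rank * 10 + tok for tok in range(8)] for rank in range(4)]

scattered = reduce_scatter(partials)  # each rank keeps 2 of 8 tokens (SP region)
regathered = all_gather(scattered)    # full sequence again (entering a TP region)
assert regathered == all_reduce(partials)
```

In a real block the two halves are not adjacent: the LayerNorm, dropout, and residual add run on the scattered shards between the ReduceScatter and the next AllGather.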
Activation memory savings from sequence parallelism are roughly proportional to the LayerNorm + dropout + residual share of the block. For long-context training (32k+ tokens) it can free 30-50% of activation memory, which is sometimes the difference between running out of memory and fitting comfortably.
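To get a feel for the size of the saving, the per-layer activation estimates from Korthikanti et al. (2022) can be coded directly. This is a rough sketch: the formulas assume 16-bit activations, no activation recomputation, and an unfused attention implementation that stores the full score matrix. The function name and the example configuration are illustrative choices, not taken from any library.

```python
def activation_bytes_per_layer(s, b, h, a, t, sequence_parallel):
    """Per-GPU activation bytes for one transformer layer.

    s: sequence length, b: micro-batch size, h: hidden size,
    a: attention heads, t: tensor-parallel group size.
    Estimates follow Korthikanti et al. (2022).
    """
    attention_term = 5 * a * s / h  # attention scores, softmax, dropout mask
    if sequence_parallel:
        # Everything, including LayerNorm/dropout/residual, is divided by t.
        return s * b * h * (34 + attention_term) / t
    # Plain TP: a 10*s*b*h share stays replicated on every GPU.
    return s * b * h * (10 + 24 / t + attention_term / t)


# Example configuration (assumed for illustration): GPT-3-like layer, TP=8.
cfg = dict(s=2048, b=1, h=12288, a=96, t=8)
tp_only = activation_bytes_per_layer(**cfg, sequence_parallel=False)
tp_sp = activation_bytes_per_layer(**cfg, sequence_parallel=True)
saving = 1 - tp_sp / tp_only  # about 0.38 for these inputs
```

The result is sensitive to the attention term: the larger it is relative to the replicated share, the smaller the fraction that sequence parallelism removes, so it is worth re-running the estimate with your own shapes.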
Performance Characteristics
- Memory: activation memory inside SP regions falls to 1/N of its replicated size, where N is the TP group size.
- Communication: same total bytes as plain TP, but AllGather + ReduceScatter instead of AllReduce.
- Compute: identical to plain TP — no extra FLOPs.
- Sweet spot: any long-context (>8k tokens) training run already using TP.
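The "same total bytes" point can be made concrete with the standard cost model for bandwidth-optimal ring collectives. The sketch below assumes ring algorithms and ignores latency; the function name and the example tensor size are illustrative.

```python
def ring_bytes_sent_per_gpu(message_bytes, n, collective):
    """Bytes each GPU sends for a ring collective over n GPUs.

    message_bytes is the size of the full (unsharded) tensor.
    """
    step = message_bytes * (n - 1) / n
    if collective in ("all_gather", "reduce_scatter"):
        return step
    if collective == "all_reduce":
        return 2 * step  # reduce-scatter phase followed by all-gather phase
    raise ValueError(f"unknown collective: {collective}")


# Example: one bf16 activation with batch=1, seq=8192, hidden=4096, TP=8.
msg = 2 * 1 * 8192 * 4096
plain_tp = ring_bytes_sent_per_gpu(msg, 8, "all_reduce")
with_sp = (ring_bytes_sent_per_gpu(msg, 8, "all_gather")
           + ring_bytes_sent_per_gpu(msg, 8, "reduce_scatter"))
assert plain_tp == with_sp
```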
When to Use
Turn on sequence parallelism whenever you are using tensor parallelism on long-context workloads. It is essentially free — Megatron-LM and NeMo enable it with a single flag — and the activation-memory headroom either lets you raise the micro-batch size or train longer contexts at the same memory budget.
Pitfalls
- Sequence length must be divisible by the TP/SP group size.
- Some custom fused kernels assume a contiguous, full-length sequence dimension; SP can break them if they were not written with sharding in mind.
- Activation-checkpointing patterns interact with SP: recomputing a sharded region can repeat its collectives, so check what is being recomputed.
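The divisibility requirement is easy to check up front. A minimal sketch, assuming you are free to pad sequences in the data pipeline; `pad_to_multiple` and `local_seq_len` are hypothetical helper names, not part of Megatron-LM or NeMo.

```python
def pad_to_multiple(seq_len, group_size):
    """Smallest length >= seq_len that is divisible by group_size."""
    if seq_len <= 0 or group_size <= 0:
        raise ValueError("seq_len and group_size must be positive")
    return -(-seq_len // group_size) * group_size  # ceiling division


def local_seq_len(seq_len, group_size):
    """Tokens each GPU holds in an SP region; fails if not evenly divisible."""
    if seq_len % group_size != 0:
        raise ValueError(
            f"sequence length {seq_len} is not divisible by group size {group_size}"
        )
    return seq_len // group_size


padded = pad_to_multiple(4100, 8)   # 4104
shard = local_seq_len(padded, 8)    # 513
```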
Software
- Megatron-LM `--sequence-parallel` flag (since v3).
- NeMo Framework exposes the same option through its model configuration.
- DeepSpeed Ulysses is a sequence-parallelism variant targeted at very long context (see context-parallelism).
References
- Reducing Activation Recomputation in Large Transformer Models · arXiv (Korthikanti et al., 2022)
- Megatron-LM sequence-parallel implementation · GitHub (NVIDIA)