TL;DR
- A family of techniques for training and inference at sequence lengths (100k+, 1M+) that exceed what fits on a single GPU even with sharded weights.
- DeepSpeed Ulysses (2023) shards the sequence and uses AllToAll to switch to a head-sharded layout inside attention; Ring Attention (Liu et al., 2023) streams KV blocks around a ring of GPUs so that attention is computed without any GPU holding the full sequence.
- NVIDIA NeMo and Megatron-LM ship 'context parallelism' (CP) as a first-class parallelism dimension alongside DP/TP/PP/SP.
Overview
Standard tensor and sequence parallelism shard weights and per-layer activations, but attention still has to relate every token to every other token. Done naively, that means materialising an O(L²) score matrix, where L is the sequence length. For L = 1,000,000 tokens, no realistic GPU has the memory for this. FlashAttention reduces attention memory to O(L), but the Q, K, V, and output activations for the full sequence, summed across heads and layers, still outgrow a single GPU.
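To put rough numbers on this, the following sketch estimates per-layer memory for L = 1,000,000. The model shape (32 heads, head dimension 128) and the 2-byte element size are assumptions chosen for illustration, not figures from any particular model.

```python
def attention_scores_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to materialise the full (seq_len x seq_len) score matrix for every head."""
    return seq_len * seq_len * num_heads * bytes_per_elem


def qkv_bytes(seq_len: int, num_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for Q, K and V of one layer: three tensors of shape (seq_len, num_heads, head_dim)."""
    return 3 * seq_len * num_heads * head_dim * bytes_per_elem


L = 1_000_000               # sequence length from the text
HEADS, HEAD_DIM = 32, 128   # assumed model shape (illustrative only)
CP = 8                      # assumed context-parallel group size

naive = attention_scores_bytes(L, HEADS)
linear = qkv_bytes(L, HEADS, HEAD_DIM)

print(f"naive scores, one layer: {naive / 1e12:.0f} TB")       # 64 TB
print(f"Q/K/V, one layer:        {linear / 1e9:.1f} GB")       # 24.6 GB
print(f"Q/K/V per GPU at CP={CP}:  {linear / CP / 1e9:.1f} GB")  # 3.1 GB
```

The O(L) term looks manageable for one layer, but it is paid again for every layer whose activations are kept for the backward pass, which is why dividing it by the CP group size matters.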
Context parallelism distributes the sequence itself across GPUs. Two approaches dominate: Ulysses uses AllToAll communication to transpose between a 'sequence-sharded, all heads' layout and a 'full sequence, head-sharded' layout; Ring Attention computes attention block by block, passing KV blocks around a ring of GPUs in a software-pipelined fashion.
Mechanism: Ulysses
DeepSpeed Ulysses keeps activations sharded along the sequence dimension everywhere except inside attention. Before attention, an AllToAll redistributes Q, K, and V so each GPU holds the full sequence for a subset of heads. After attention, a reverse AllToAll restores the sequence-sharded layout. Per-GPU communication volume is proportional to (L/N) × hidden size for a CP group of N GPUs, so it stays roughly constant when sequence length and GPU count grow together. The constraint is the head count: N cannot exceed the number of attention heads.
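The layout change can be illustrated on a single process with NumPy. This is a minimal sketch, not DeepSpeed's implementation: the "GPUs" are list entries, the AllToAll is simulated with `np.split` and `np.concatenate`, attention is non-causal, and the function names (`seq_to_head_shards`, `head_to_seq_shards`, `ulysses_attention`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Full non-causal attention. q, k, v have shape (seq, heads, head_dim)."""
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hqk,khd->qhd", softmax(scores), v)


def seq_to_head_shards(shards):
    """Simulated AllToAll: N arrays of (L/N, H, D) -> N arrays of (L, H/N, D).

    Every source rank sends head group g to rank g, which concatenates the
    pieces in rank order to rebuild the full sequence for its heads.
    """
    n = len(shards)
    return [np.concatenate([np.split(s, n, axis=1)[g] for s in shards], axis=0)
            for g in range(n)]


def head_to_seq_shards(shards):
    """Reverse AllToAll: N arrays of (L, H/N, D) -> N arrays of (L/N, H, D)."""
    n = len(shards)
    return [np.concatenate([np.split(s, n, axis=0)[r] for s in shards], axis=1)
            for r in range(n)]


def ulysses_attention(q_sh, k_sh, v_sh):
    """Sequence-sharded in, sequence-sharded out; head-sharded only inside."""
    q_h, k_h, v_h = map(seq_to_head_shards, (q_sh, k_sh, v_sh))
    out_h = [attention(q, k, v) for q, k, v in zip(q_h, k_h, v_h)]
    return head_to_seq_shards(out_h)


rng = np.random.default_rng(0)
N, L, H, D = 4, 16, 8, 4   # H must be divisible by N
q, k, v = (rng.standard_normal((L, H, D)) for _ in range(3))
q_sh, k_sh, v_sh = (np.split(t, N, axis=0) for t in (q, k, v))

out = np.concatenate(ulysses_attention(q_sh, k_sh, v_sh), axis=0)
print(np.allclose(out, attention(q, k, v)))
```

Because heads are independent in attention, each simulated GPU can run an ordinary single-device attention kernel on its head subset, which is the main practical appeal of this scheme.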
Mechanism: Ring Attention
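Ring Attention shards the sequence and never reassembles it. Each GPU holds one block of Q, K, and V. Attention is computed incrementally: each GPU computes its local Q against its local K, V, then the K, V blocks rotate one step around the ring and the GPU folds the new partial result into a running output. Because each step sees only part of a row's scores, the accumulation keeps a running row maximum and softmax denominator and rescales earlier partial outputs, as in blockwise (FlashAttention-style) softmax. After N compute steps and N − 1 rotations (N = ring size), every Q has been compared against every K, V, and each GPU holds the exact attention output for its own Q block.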
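The rotation and accumulation can be sketched on one process. This is an illustrative single-head, non-causal simulation, not the reference implementation: the ring is a Python list, rotation is a list shift, and `ring_attention` is a hypothetical name. Real implementations overlap the KV transfer with the block computation; the simulation does them one after the other.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference single-head attention. q, k, v have shape (seq, head_dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each list holds one (L/N, D) block per simulated GPU."""
    n = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # Per-rank running state for the online softmax.
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    denom = [np.zeros(q.shape[0]) for q in q_blocks]
    acc = [np.zeros_like(q) for q in q_blocks]   # unnormalised output
    k_cur, v_cur = list(k_blocks), list(v_blocks)

    for step in range(n):
        for r in range(n):
            s = q_blocks[r] @ k_cur[r].T / np.sqrt(d)
            new_max = np.maximum(row_max[r], s.max(axis=-1))
            p = np.exp(s - new_max[:, None])
            # Rescale what was accumulated under the old maximum.
            corr = np.exp(row_max[r] - new_max)
            denom[r] = denom[r] * corr + p.sum(axis=-1)
            acc[r] = acc[r] * corr[:, None] + p @ v_cur[r]
            row_max[r] = new_max
        if step < n - 1:
            # Rotate one step: rank r receives the block held by rank r - 1.
            k_cur = k_cur[-1:] + k_cur[:-1]
            v_cur = v_cur[-1:] + v_cur[:-1]

    return [a / z[:, None] for a, z in zip(acc, denom)]


rng = np.random.default_rng(0)
N, L, D = 4, 32, 8
q, k, v = (rng.standard_normal((L, D)) for _ in range(3))
blocks = [np.split(t, N, axis=0) for t in (q, k, v)]

out = np.concatenate(ring_attention(*blocks), axis=0)
print(np.allclose(out, full_attention(q, k, v)))
```

The result matches full attention exactly (up to floating-point error), since the running maximum and denominator make the blockwise softmax equivalent to the global one.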
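Ring Attention scales to effectively unlimited sequence length given enough GPUs and bandwidth. Its weakness is that each GPU forwards a KV block on every ring step, so per-GPU traffic grows with L and does not shrink as GPUs are added; it stays cheap only while each transfer can be hidden behind the block's attention compute. Ulysses moves less data per GPU, but its CP degree is capped by head count, so the right choice depends on (L, heads, ring_size).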
Performance Characteristics
- Memory: O(L/N) activation memory per GPU, where N is the CP group size.
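- Communication: per GPU and per layer, Ulysses moves roughly O(L/N) activations but cannot use more GPUs than there are heads; Ring Attention moves roughly O(L) but can overlap the transfers with compute.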
- Sweet spot for Ulysses: long context, many heads, modest CP group (≤16).
- Sweet spot for Ring Attention: very long context (>256k), small head count.
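A back-of-the-envelope comparison of per-GPU traffic for one attention forward pass can be written down directly. This is a first-order estimate under stated assumptions: standard multi-head attention (no grouped-query sharing of K/V), four AllToAlls for Ulysses (Q, K, V, output), N − 1 KV rotations for Ring Attention, and no credit for overlapping communication with compute. The function names are hypothetical.

```python
def ulysses_elems_sent(seq_len: int, hidden: int, cp: int) -> int:
    """Elements one GPU sends: 4 AllToAlls, each keeping 1/cp of the local shard."""
    local = seq_len // cp * hidden
    return 4 * local * (cp - 1) // cp


def ring_elems_sent(seq_len: int, hidden: int, cp: int) -> int:
    """Elements one GPU sends: its current K and V block on each of cp - 1 rotations."""
    local = seq_len // cp * hidden
    return 2 * local * (cp - 1)


def ulysses_cp_ok(num_heads: int, cp: int) -> bool:
    """Ulysses needs the head count to divide evenly across the CP group."""
    return num_heads % cp == 0


SEQ, HIDDEN, HEADS = 262_144, 4096, 32   # assumed shape, for illustration
for cp in (2, 8, 32, 64):
    u = ulysses_elems_sent(SEQ, HIDDEN, cp)
    r = ring_elems_sent(SEQ, HIDDEN, cp)
    print(f"cp={cp:3d}  ulysses={u:>13,}  ring={r:>13,}  "
          f"ulysses_allowed={ulysses_cp_ok(HEADS, cp)}")
```

Under these assumptions the ring moves about N/2 times as much data per GPU as Ulysses, but the last row shows the other side of the trade: at a CP size of 64 the assumed 32-head model can no longer use Ulysses at all, while the ring still works.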
When to Use
Use context parallelism when sequence length is the dominant memory constraint — training Llama-style models on 128k+ contexts, video-token transformers, or long-document fine-tuning. For 32k or 64k contexts, sequence parallelism (the Megatron variant) is usually sufficient and simpler.
Pitfalls
- Ulysses requires head count divisible by CP group size — some 'thin' models (small head_dim, many heads) work well, others do not.
- Ring Attention is bandwidth-sensitive across the ring; cross-node ring rotation needs InfiniBand or NVLink Switch.
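- Causal masking needs care in both. In Ring Attention with a contiguous split, GPUs holding early tokens have far fewer unmasked (query, key) pairs than GPUs holding late tokens, so the ring is load-imbalanced unless the work is explicitly rebalanced.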
- Most fine-tuning frameworks do not yet expose CP — long-context fine-tuning is still a Megatron/NeMo or DeepSpeed exercise.
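The causal load imbalance noted in the pitfalls can be made concrete by counting unmasked (query, key) pairs per GPU. The sketch below compares a contiguous split with a zigzag-style split, in which the sequence is cut into 2N chunks and GPU r takes chunks r and 2N − 1 − r. The zigzag scheme is shown as one possible rebalancing, not as the method any specific framework uses, and the count covers total work only, not the balance within each ring step.

```python
import numpy as np


def causal_work(token_ids) -> int:
    """Unmasked (query, key) pairs for these query positions: query i sees keys 0..i."""
    return int(np.sum(np.asarray(token_ids) + 1))


def contiguous_partition(seq_len: int, cp: int):
    """GPU r holds tokens [r * L/N, (r + 1) * L/N)."""
    return np.split(np.arange(seq_len), cp)


def zigzag_partition(seq_len: int, cp: int):
    """GPU r holds chunk r and chunk 2N - 1 - r of a 2N-way split."""
    chunks = np.split(np.arange(seq_len), 2 * cp)
    return [np.concatenate([chunks[r], chunks[2 * cp - 1 - r]]) for r in range(cp)]


L, N = 16, 4
contig = [causal_work(p) for p in contiguous_partition(L, N)]
zigzag = [causal_work(p) for p in zigzag_partition(L, N)]
print("contiguous:", contig)   # [10, 26, 42, 58]
print("zigzag:    ", zigzag)   # [34, 34, 34, 34]
```

With the contiguous split the last GPU does almost six times the work of the first, and the whole ring waits on it; pairing an early chunk with a late one evens out the totals.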
Software
- DeepSpeed Ulysses — sequence parallelism on top of DeepSpeed.
- NVIDIA NeMo / Megatron-LM `--context-parallel-size` — production Ring Attention variant.
- Ring Attention reference implementation by Hao Liu (Berkeley).
- JAX TransformerEngine implementations exist for TPU-style mesh sharding.
References
- Ring Attention with Blockwise Transformers for Near-Infinite Context · arXiv (Liu et al., 2023)
- DeepSpeed Ulysses · GitHub (Microsoft)
- NeMo Context Parallelism documentation · NVIDIA