TL;DR
- ZeRO (Zero Redundancy Optimizer), introduced by the Microsoft DeepSpeed team in 2019 (Rajbhandari et al., arXiv:1910.02054), removes the memory redundancy of vanilla data parallelism by sharding optimiser state (Stage 1), gradients (Stage 2), and parameters (Stage 3). DeepSpeed is Apache-2.0-licensed and hosted at github.com/microsoft/DeepSpeed.
- Stage 3 reduces per-GPU model-state memory roughly N-fold across N data-parallel ranks, at the cost of two AllGathers (forward and backward) plus one ReduceScatter per layer per step to reconstruct full parameters on demand. ZeRO-Offload extends the sharded state to CPU RAM; ZeRO-Infinity extends it to NVMe.
- Underpins the single-node 70B fine-tuning case (Stage 3 + CPU offload on 4-8x H100) and remains the production memory-saving strategy in Megatron-DeepSpeed and the HuggingFace Trainer / Accelerate stack. Architecturally interchangeable with PyTorch FSDP, with which it has converged functionally.
- Controlled through the memory-related keys of the DeepSpeed config JSON: `zero_optimization.stage`, `offload_optimizer`, `offload_param`, `contiguous_gradients`, `overlap_comm`, `reduce_bucket_size`, `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`. Tuning these is the bulk of DeepSpeed ops work.
Overview
Vanilla data parallelism wastes memory. With P parameters trained in mixed precision, every worker holds the full BF16 parameters (2P bytes), the BF16 gradients (2P bytes), and the Adam optimiser state: FP32 master weights (4P), FP32 first moment (4P), and FP32 second moment (4P). That totals roughly 16 bytes per parameter (2 + 2 + 12). For a 70B model that is 1.12 TB per rank; on a 64-GPU cluster the replicas add up to about 71 TB, almost all of it redundant.
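The arithmetic behind the 16 bytes/param figure can be reproduced in a few lines. This is a minimal sketch using only the byte counts quoted above; the function name is illustrative and not part of DeepSpeed.

```python
# Bytes per parameter held by every rank under plain data parallelism
# with Adam in BF16 mixed precision (figures from the paragraph above).
BF16_WEIGHTS = 2
BF16_GRADS = 2
FP32_OPTIMISER = 4 + 4 + 4  # master weights + first moment + second moment

BYTES_PER_PARAM = BF16_WEIGHTS + BF16_GRADS + FP32_OPTIMISER


def ddp_state_bytes(num_params: float, world_size: int) -> tuple[float, float]:
    """Return (bytes held per rank, bytes held across the whole cluster)."""
    per_rank = num_params * BYTES_PER_PARAM
    return per_rank, per_rank * world_size


per_rank, cluster = ddp_state_bytes(70e9, 64)
print(f"{BYTES_PER_PARAM} bytes/param")
print(f"per rank: {per_rank / 1e12:.2f} TB")
print(f"cluster:  {cluster / 1e12:.1f} TB")
```

Activations, temporary buffers, and fragmentation come on top of this figure, so treat it as a floor for model state only.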
ZeRO observes that this redundancy is unnecessary: only one worker needs to own each piece of state at any given moment, provided we reconstruct what we need when we need it and discard it after the layer's compute is done. The three stages progressively eliminate redundancy at the cost of more frequent collective communication: Stage 1 shards optimiser state (where the most bytes live), Stage 2 also shards gradients, Stage 3 also shards parameters (true model sharding).
DeepSpeed wraps PyTorch with a configuration-driven engine that exposes ZeRO plus mixed-precision, gradient accumulation, gradient checkpointing, fused optimisers, and pipeline parallelism behind a single `deepspeed.initialize()` call. The configuration is a JSON file — the canonical surface that production deployments tune. Yobitel NeoCloud customers training 70B+ models commonly use DeepSpeed ZeRO-3 with CPU offload on single 8x H100 nodes and Megatron-DeepSpeed hybrid configurations on multi-node training pods.
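As a rough illustration of that single entry point, the sketch below wraps a model with `deepspeed.initialize()` and runs the engine's forward, backward, and step calls. It assumes a config file (hypothetically named `ds_config.json`) that defines an `optimizer` block, a `model` whose forward returns a scalar loss, and a dataloader that yields batches already on the right device; none of these come from the original text.

```python
def train_one_epoch(model, dataloader, ds_config_path="ds_config.json"):
    """Minimal DeepSpeed training loop (sketch, not a full trainer).

    Assumptions: `model(batch)` returns a scalar loss, and the JSON config
    at `ds_config_path` contains an `optimizer` section.
    """
    # Imported lazily so the sketch can be read without a GPU stack installed.
    import deepspeed

    engine, _optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config_path,
    )
    for batch in dataloader:
        loss = engine(batch)   # forward pass through the wrapped module
        engine.backward(loss)  # backward pass; DeepSpeed handles gradient reduction
        engine.step()          # optimiser step at gradient-accumulation boundaries
    return engine
```

Such a script is normally started with the `deepspeed` launcher so that every rank runs the same loop.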
This entry documents the production surface: the JSON config schema for ZeRO, the three stages and their communication patterns, ZeRO-Offload and ZeRO-Infinity for NVMe spill, the integration with HuggingFace Trainer and Accelerate, sizing tables, and the migration path to and from FSDP. This entry helps you choose and operate DeepSpeed ZeRO for training pods on Yobitel NeoCloud or your own multi-GPU cluster.
Quick start
The example below fine-tunes Llama-3 8B on a custom instruction dataset using ZeRO-3 on 4x A100 80GB. The first block installs DeepSpeed and writes the ZeRO-3 config. The second block launches the training job via the `deepspeed` launcher (DeepSpeed's own multi-process launcher, comparable to `torchrun`). The third block shows the equivalent HuggingFace Trainer integration that picks up the same config.
```bash
# 1. Install DeepSpeed and write a ZeRO-3 config
pip install "deepspeed>=0.14.0" transformers accelerate bitsandbytes datasets
cat > ds_zero3.json <<'JSON'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
JSON
```

```bash
# 2. Launch fine-tuning on 4x A100 80GB
# (train.py is assumed to be a HuggingFace Trainer script that parses these flags)
deepspeed --num_gpus=4 train.py \
  --deepspeed ds_zero3.json \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset_name yahma/alpaca-cleaned \
  --output_dir ./llama3-8b-sft \
  --bf16 true \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --logging_steps 10 \
  --save_strategy steps --save_steps 500
```

```bash
# 3. The HuggingFace Trainer wires it up via TrainingArguments(deepspeed="ds_zero3.json")
python - <<'PY'
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./llama3-8b-sft",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # must match the value in ds_zero3.json
    learning_rate=2e-5,
    num_train_epochs=3,
    deepspeed="ds_zero3.json",
)
# Trainer(model, args, ...).train()
PY
```

The `auto` values for the bucket, prefetch, and persistence sizes are resolved by the HuggingFace Trainer integration from the model's hidden size before the config reaches DeepSpeed; a script that calls `deepspeed.initialize()` directly needs concrete numbers instead. Leave them as `auto` for the first run, then tune only if profiling shows the AllGather is poorly overlapped with compute.
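To make the `auto` values concrete, the sketch below applies the formulas described in the HuggingFace DeepSpeed integration guide (bucket sizes derived from the model's hidden size). Treat the exact formulas as an assumption and check the version you have installed; the function name is hypothetical.

```python
def resolve_auto_buckets(hidden_size: int) -> dict[str, int]:
    """Approximate what the HF Trainer substitutes for the three `auto` keys.

    Assumption: formulas as documented for the HF integration at the time of
    writing; they are not computed by DeepSpeed itself.
    """
    h2 = hidden_size * hidden_size
    return {
        "reduce_bucket_size": h2,
        "stage3_prefetch_bucket_size": int(0.9 * h2),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


# Llama-3 8B uses a hidden size of 4096.
for key, value in resolve_auto_buckets(4096).items():
    print(f"{key}: {value:,}")
```

These numbers are a starting point for manual tuning when the config is used outside the Trainer.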
How it works
ZeRO partitions training state into three classes (parameters, gradients, optimiser state) and applies progressively more aggressive sharding across the DP group. The compute graph is unchanged from DDP; what changes is which rank owns which bytes at which moment, and the collective operations that move bytes in and out of GPU memory just in time.
Stage 1 (optimiser state sharding): each rank owns 1/N of the FP32 master weights, 1/N of the Adam first moment, and 1/N of the Adam second moment. The forward and backward passes are unchanged. Gradients are still reduced across ranks, but the optimiser step is now local to each rank's slice; a final AllGather of the updated BF16 parameters distributes the new weights. Memory drops from ~16 bytes/param/rank to ~4 + 12/N bytes/param/rank. Communication volume stays close to DDP: the ZeRO paper counts it as equal, because each rank only needs the reduced gradients for its own slice, so the reduction can run as a ReduceScatter and the added AllGather completes the equivalent of one AllReduce.
Stage 2 (Stage 1 + gradient sharding): gradients are also sharded across the DP group. The single AllReduce becomes a ReduceScatter, so each rank ends up with the reduced gradients for its own 1/N slice only. Memory drops to ~2 + 14/N bytes/param/rank; communication volume is identical to DDP (ReduceScatter + AllGather = AllReduce in bytes moved).
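The identity "ReduceScatter + AllGather = AllReduce" can be checked with a toy simulation in plain Python, where each inner list stands for one rank's buffer. This is only a model of the collectives' semantics, not NCCL code; in real ZeRO-2 the AllGather carries updated parameters rather than gradients, but the volume argument is the same.

```python
def all_reduce(buffers):
    """Every rank ends with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def reduce_scatter(buffers):
    """Rank r ends with only the r-th slice of the element-wise sum."""
    n = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    size = len(total) // n  # assumes the buffer length is divisible by n
    return [total[r * size:(r + 1) * size] for r in range(n)]


def all_gather(shards):
    """Every rank ends with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 ranks, 4 elements each
shards = reduce_scatter(grads)
print(shards)
print(all_gather(shards) == all_reduce(grads))
```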
Stage 3 (Stage 2 + parameter sharding): parameters themselves live in 1/N slices on each rank. Before a layer's forward pass, an AllGather reconstructs the full parameters of that layer on every rank from the per-rank slices. The layer executes; then the gathered copies are freed. The backward pass gathers the parameters again and adds a ReduceScatter for the gradients. Per-step communication volume is roughly 1.5x DDP, but per-rank model-state memory falls to ~16/N bytes/param, inversely with N. This is the breakthrough that enables training models much larger than a single GPU's memory.
ZeRO-Offload (Ren et al., 2021, arXiv:2101.06840) moves optimiser state and the optimiser-step computation to the CPU. The gradients ReduceScatter into CPU memory; Adam's update runs on x86 cores using fused AVX kernels; updated BF16 parameters AllGather back to the GPU. CPU-side Adam is cheap enough not to bottleneck for typical Llama-shaped models; the cost is host-device bandwidth (PCIe Gen4 ~32 GB/s, Gen5 ~64 GB/s).
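A back-of-envelope estimate shows why host-device bandwidth is the cost that matters. The sketch assumes each rank sends its 1/N gradient shard to the host and receives its 1/N updated BF16 parameter shard back, with no overlap with compute and with the quoted PCIe figures fully achieved; real runs overlap transfers and rarely reach peak bandwidth.

```python
def offload_transfer_seconds(num_params, world_size, pcie_gb_per_s, bytes_per_elem=2):
    """Rough per-step host<->device transfer time for one rank.

    Assumption: one gradient shard goes down and one BF16 parameter shard
    comes back per step, each of size num_params / world_size elements.
    """
    shard_bytes = num_params * bytes_per_elem / world_size
    return 2 * shard_bytes / (pcie_gb_per_s * 1e9)


gen4 = offload_transfer_seconds(70e9, world_size=8, pcie_gb_per_s=32)
gen5 = offload_transfer_seconds(70e9, world_size=8, pcie_gb_per_s=64)
print(f"70B on 8 ranks, PCIe Gen4: {gen4:.2f} s/step")
print(f"70B on 8 ranks, PCIe Gen5: {gen5:.2f} s/step")
```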
ZeRO-Infinity (Rajbhandari et al., 2021, arXiv:2104.07857) extends Offload to NVMe. Parameter and optimiser state stream from a RAID-0 NVMe array via DMA, prefetched layer-by-layer. With 8-NVMe RAID-0 sustaining 50+ GB/s of read bandwidth, a 175B-class model can fit and fine-tune on a single 8x A100 node — at roughly 30-50 percent of the throughput of an in-memory configuration but at a tiny fraction of the cluster cost.
- ZeRO-1: optimiser state sharded; up to ~4x memory reduction vs DDP as N grows; communication ~same as DDP.
- ZeRO-2: + gradients sharded; up to ~8x memory reduction as N grows; communication ~same as DDP (ReduceScatter + AllGather).
- ZeRO-3: + parameters sharded; ~N-fold memory reduction; communication ~1.5x DDP.
- ZeRO-Offload: CPU spill for optimiser state and gradients; Adam runs on CPU.
- ZeRO-Infinity: NVMe spill for parameters and optimiser state; bandwidth bounded by RAID-0.
- ZeRO++ (2023): hierarchical partitioning + quantised weights/gradients; cuts cross-node traffic 4x.
- Compose with: gradient checkpointing (activations), tensor parallelism (per Megatron-DeepSpeed), pipeline parallelism (per DeepSpeed pipeline module).
Memory savings are quoted relative to vanilla DDP with Adam at mixed precision (16 bytes/param/rank). Real workloads also carry activation memory, which ZeRO does not touch — pair Stage 3 with `gradient_checkpointing=True` for long-context training above 8K.
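The per-stage memory figures follow directly from dividing each class of state by N once its stage is active. A small sketch, assuming the 2 + 2 + 12 bytes/param split used throughout this entry (the function name is illustrative):

```python
def zero_bytes_per_param(stage: int, n: int) -> float:
    """Model-state bytes per parameter per rank for ZeRO stage 0-3 on n ranks."""
    weights, grads, optim = 2.0, 2.0, 12.0  # BF16 weights, BF16 grads, FP32 Adam state
    if stage >= 1:
        optim /= n    # Stage 1 shards optimiser state
    if stage >= 2:
        grads /= n    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= n  # Stage 3 also shards parameters
    return weights + grads + optim


n = 64
baseline = zero_bytes_per_param(0, n)
for stage in range(4):
    b = zero_bytes_per_param(stage, n)
    print(f"stage {stage}: {b:.4f} bytes/param/rank ({baseline / b:.1f}x vs DDP)")
```

At N = 64 the reductions come out slightly below the 4x and 8x asymptotes for Stages 1 and 2, and exactly 64x for Stage 3.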
Reference and specifications
DeepSpeed is configured via a JSON document passed to `deepspeed.initialize(config=...)` or to a training script's `--deepspeed` argument (as HuggingFace Trainer scripts accept). The table below documents the ZeRO-relevant fields as of DeepSpeed 0.14 (June 2026). Fields under `zero_optimization` apply only when ZeRO is active; CPU optimiser offload works with stages 1 to 3, while parameter offload and NVMe offload require stage 3.
| Config key | Type | Default | Description |
|---|---|---|---|
| zero_optimization.stage | int | 0 | 0 = off, 1 = optimiser sharding, 2 = + grads, 3 = + params. |
| zero_optimization.offload_optimizer.device | string | (unset) | cpu or nvme. Offloads optimiser state and step compute. |
| zero_optimization.offload_optimizer.nvme_path | path | /local_nvme | Path on NVMe filesystem for the optimiser-state spill. |
| zero_optimization.offload_optimizer.pin_memory | bool | false | Pin CPU memory for higher H<->D bandwidth. |
| zero_optimization.offload_param.device | string | (unset) | cpu or nvme. Stage 3 only; offloads parameter slices when not in use. |
| zero_optimization.offload_param.nvme_path | path | /local_nvme | Path on NVMe for parameter spill. |
| zero_optimization.overlap_comm | bool | false (true at Stage 3) | Overlap collective comms with backward compute (~10-20 percent uplift). |
| zero_optimization.contiguous_gradients | bool | true | Copy grads into a contiguous buffer as they are produced, before ReduceScatter (recommended). |
| zero_optimization.reduce_bucket_size | int | 5e8 | Elements per reduction bucket; smaller = lower latency and memory, larger = higher bandwidth. |
| zero_optimization.allgather_bucket_size | int | 5e8 | Elements per AllGather bucket (Stage 1/2). |
| zero_optimization.stage3_prefetch_bucket_size | int | 5e7 (`auto` under HF Trainer) | Elements prefetched ahead for the next layer's AllGather. |
| zero_optimization.stage3_param_persistence_threshold | int | 1e5 (`auto` under HF Trainer) | Params smaller than this many elements stay replicated (avoids per-step AllGather). |
| zero_optimization.stage3_max_live_parameters | int | 1e9 | Cap on the number of gathered parameters held in GPU memory at once. |
| zero_optimization.stage3_max_reuse_distance | int | 1e9 | A gathered param is kept if it will be reused within this many parameters; larger values release less often at the cost of memory. |
| zero_optimization.stage3_gather_16bit_weights_on_model_save | bool | false | Materialise full BF16 weights at save; required for HF export. |
| zero_optimization.sub_group_size | int | 1e9 | Stage 3 grouping for partitioned optimiser step; tune for large models. |
| zero_optimization.cpu_offload | bool | (deprecated) | Legacy alias for offload_optimizer; use the explicit form. |
| zero_optimization.zero_quantized_weights | bool | false | ZeRO++ quantised weight communication (INT8). |
| zero_optimization.zero_hpz_partition_size | int | (off) | ZeRO++ hierarchical partitioning; size of per-node group. |
| bf16.enabled | bool | false | Enable BF16 mixed-precision training. |
| fp16.enabled | bool | false | Enable FP16 + loss scaling. BF16 preferred on Ampere+. |
| fp16.loss_scale | float | 0 = dynamic | 0 enables dynamic loss scaling. |
| gradient_accumulation_steps | int | 1 | Effective batch = micro_batch x dp_size x grad_accum. |
| gradient_clipping | float | 0 = off | Global gradient-norm clipping threshold. |
| train_micro_batch_size_per_gpu | int | (required) | Per-rank micro-batch. |
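| optimizer.type | string | (unset) | Adam, AdamW, OneBitAdam, or Lamb; with offload enabled DeepSpeed builds DeepSpeedCPUAdam for Adam/AdamW. Alternatively pass an optimiser to `deepspeed.initialize()`. |
| scheduler.type | string | (unset) | WarmupLR, WarmupDecayLR, WarmupCosineLR, OneCycle, or LRRangeTest. |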
| activation_checkpointing.partition_activations | bool | false | Partition activations across MP group (Megatron-DeepSpeed). |
| wall_clock_breakdown | bool | false | Per-phase timing breakdown; useful for tuning. |
| zero_allow_untested_optimizer | bool | false | Required to use optimisers other than the official list. |
`stage3_gather_16bit_weights_on_model_save: false` (the default) writes Stage 3 sharded checkpoints that HuggingFace `from_pretrained` cannot load directly. Either set this to true (uses 2x model size GPU memory at save) or use DeepSpeed's `zero_to_fp32.py` script to convert offline.
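For the offline route, DeepSpeed also ships the conversion logic as an importable helper. The sketch below assumes a checkpoint directory laid out as DeepSpeed writes it (a `latest` tag file plus `global_step*` folders) and enough host RAM to hold the full FP32 state dict; the wrapper name and paths are hypothetical.

```python
def consolidate_zero_checkpoint(checkpoint_dir: str, output_path: str) -> None:
    """Merge ZeRO shards into one FP32 state dict and save it with torch.save.

    Runs on CPU; needs host RAM for the full FP32 weights (4 bytes/param).
    """
    # Imported lazily so the sketch can be read without the training stack installed.
    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
    torch.save(state_dict, output_path)


# Example call (paths are placeholders):
# consolidate_zero_checkpoint("./llama3-8b-sft/checkpoint-500", "./pytorch_model_fp32.bin")
```

The resulting state dict can then be loaded into a freshly constructed model with `load_state_dict` before exporting in HuggingFace format.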
Workload patterns
Three workload shapes dominate DeepSpeed ZeRO production usage, and each maps to a different config emphasis and a different Yobitel NeoCloud training-pod shape:
- Pattern A: ZeRO-3 + CPU offload to fine-tune Llama-3 70B on a single 8x H100 SXM5 node (the standard Yobitel NeoCloud single-node SFT shape).
- Pattern B: ZeRO-2 pretraining of a 13B model from random weights on a 32-GPU (4-node) pod, prioritising throughput over memory.
- Pattern C: Megatron-DeepSpeed hybrid 3D + ZeRO-1 for a 175B-class frontier pretraining run that combines TP+PP+DP+ZeRO across a 64-GPU (8-node) pod with InfiniBand NDR fabric.
A: ZeRO-3 + CPU offload for single-node 70B fine-tune

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 2e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 2e-5 } }
}
```
B: ZeRO-2 pretrain of 13B on 32 GPUs, throughput-optimised

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 3e-4, "betas": [0.9, 0.95] } },
  "activation_checkpointing": { "partition_activations": false }
}
```
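C: Megatron-DeepSpeed 3D + ZeRO-1 for 175B pretrain at scale

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "tensor_parallel": { "tp_size": 8 },
  "pipeline_parallel": { "pp_size": 8 },
  "gradient_accumulation_steps": 16,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 1.5e-4 } },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  }
}
```

In Megatron-DeepSpeed the TP and PP degrees are normally passed as launcher arguments (`--tensor-model-parallel-size`, `--pipeline-model-parallel-size`); the `tensor_parallel` and `pipeline_parallel` keys above record the intended layout. With TP=8 and PP=8, a 64-GPU pod leaves a data-parallel degree of 1, so ZeRO-1 only starts sharding optimiser state once the job grows beyond 64 GPUs.

Sizing and capacity planning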
The two questions that drive ZeRO sizing are: what is the largest model I can fit on this cluster, and what throughput will I get? The table below gives reference configs and observed throughput for BF16 Adam-trained dense transformers, with `overlap_comm: true`, `contiguous_gradients: true`, and gradient checkpointing on. Throughput is per-GPU sustained tokens-per-second from DeepSpeed published benchmarks and Yobitel-operated fleets at 4K sequence length.
- ZeRO-3 + CPU offload trades 2-3x throughput for 30-50 percent GPU-memory headroom; primarily a fine-tuning lever.
- ZeRO-Infinity (NVMe spill) is 5-10x slower than in-memory; use only when no other capacity is available.
- Above ~70B, prefer Megatron-DeepSpeed 3D + ZeRO-1 over pure ZeRO-3 — the per-layer AllGather amortises poorly past a certain size.
- ZeRO++ (zero_quantized_weights + hierarchical partitioning) cuts cross-node ZeRO-3 traffic 4x on bandwidth-constrained clusters.
- Effective batch size = micro_batch x DP x grad_accumulation; for stable Adam at scale aim for 1-4M tokens per step.
| Model | Cluster | ZeRO stage + offload | Per-rank GPU mem | Per-GPU tok/s |
|---|---|---|---|---|
| 7B | 8x A100 80GB | Stage 2 | ~32 GB | 8,500 |
| 13B | 8x A100 80GB | Stage 2 | ~58 GB | 5,200 |
| 13B | 32x A100 80GB | Stage 2 | ~36 GB | 5,000 |
| 30B | 8x A100 80GB | Stage 3 | ~64 GB | 2,800 |
| 70B | 8x H100 80GB | Stage 3 | ~74 GB | 2,200 |
| 70B | 8x H100 80GB | Stage 3 + CPU offload | ~58 GB | 950 |
| 70B | 32x H100 | Stage 3 | ~48 GB | 2,400 |
| 175B | 8x H100 80GB | Stage 3 + NVMe (Infinity) | ~75 GB | 180 |
| 175B | 64x H100 | Stage 3 | ~70 GB | 1,400 |
| 8x22B MoE | 32x H100 | Stage 2 + EP | ~60 GB | 1,800 |
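The effective-batch rule from the list above (micro_batch x DP x grad_accumulation) is easy to turn into a planning helper. A minimal sketch, assuming the 4K sequence length used for the table and pure data parallelism for Patterns A and B; the function names are illustrative.

```python
def tokens_per_step(micro_batch: int, dp_size: int, grad_accum: int, seq_len: int) -> int:
    """Tokens consumed per optimiser step."""
    return micro_batch * dp_size * grad_accum * seq_len


def grad_accum_for_target(target_tokens: int, micro_batch: int, dp_size: int, seq_len: int) -> int:
    """Smallest gradient-accumulation count that reaches the token target."""
    per_micro_step = micro_batch * dp_size * seq_len
    return -(-target_tokens // per_micro_step)  # ceiling division


pattern_a = tokens_per_step(micro_batch=1, dp_size=8, grad_accum=8, seq_len=4096)
pattern_b = tokens_per_step(micro_batch=4, dp_size=32, grad_accum=1, seq_len=4096)
needed = grad_accum_for_target(2_000_000, micro_batch=4, dp_size=32, seq_len=4096)
print(f"Pattern A: {pattern_a:,} tokens/step")
print(f"Pattern B: {pattern_b:,} tokens/step")
print(f"Pattern B needs grad_accum={needed} for ~2M tokens/step")
```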
Limits and quotas
DeepSpeed itself imposes few hard limits; the constraints come from GPU memory, host RAM (for offload), NVMe bandwidth (for Infinity), and NCCL configuration. The table below summarises the constraints worth knowing before designing a ZeRO config.
| Constraint | Default / ceiling | How to manage |
|---|---|---|
| DP group size | = world_size | ZeRO shards within DP; no hard upper bound. |
| sub_group_size (Stage 3) | 1e9 | Reduce for very large models (saves opt-step memory). |
| stage3_max_live_parameters | 1e9 | Cap on concurrently gathered params; raise to overlap more. |
| CPU RAM (offload) | Host-bounded | Need ~12 bytes/param of CPU RAM for AdamW offload. |
| NVMe bandwidth (Infinity) | Drive-bounded | RAID-0 8-drive NVMe sustains 50+ GB/s; required for usable Infinity. |
| Activation memory | Not addressed by ZeRO | Use activation_checkpointing for long context. |
| AllGather bucket size | 5e8 elements | Larger = better bandwidth utilisation but more buffer memory and higher latency. |
| NCCL communicator count | World-size dependent | Set NCCL_BUFFSIZE, NCCL_NTHREADS; use NCCL >= 2.20. |
| BF16 hardware | Ampere+ | Pre-Ampere needs FP16 + dynamic loss scaling. |
| Optimiser support | Adam/AdamW/Lamb stock | Custom optimisers need zero_allow_untested_optimizer. |
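The CPU RAM row can be turned into a quick node-level estimate. The sketch below uses the ~12 bytes/param figure from the table and, as an assumption, adds 2 bytes/param when BF16 parameters are offloaded as well; pinned-memory overhead and gradient staging buffers are not counted.

```python
def offload_host_ram_gb(num_params: float, offload_params: bool = False) -> float:
    """Approximate host RAM (GB) needed across the node for ZeRO offload."""
    optimiser_bytes = 12 * num_params  # FP32 master weights + two Adam moments
    param_bytes = 2 * num_params if offload_params else 0  # BF16 weights (assumption)
    return (optimiser_bytes + param_bytes) / 1e9


print(f"70B, optimiser offload only: {offload_host_ram_gb(70e9):.0f} GB")
print(f"70B, optimiser + params:     {offload_host_ram_gb(70e9, offload_params=True):.0f} GB")
```

Both figures sit close to the 1 TB host-memory configuration used for the CPU-offload node shape, which leaves little slack for dataloader workers and page cache.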
Observability
DeepSpeed exposes per-step timing via `wall_clock_breakdown: true` in the config: partition_forward, partition_backward, optimizer_step, AllGather, and ReduceScatter durations all log to TensorBoard or stdout. Pair with DCGM exporter for GPU-side metrics and NCCL_DEBUG=INFO for collective-level diagnostics. The signals that catch most DeepSpeed production issues are throughput drift, optimiser-step latency, and offload bandwidth utilisation.
- samples_per_second: holds within +-5 percent in steady state; drops correlate with offload bandwidth saturation.
- optimizer_step (CPU offload): should be ~0.5-2s at 70B; >5s means CPU-bound Adam.
- AllGather/ReduceScatter latency: visible in wall_clock_breakdown; pair with NCCL collective profiling.
- host_to_device / device_to_host bandwidth: PCIe Gen4 ~32 GB/s, Gen5 ~64 GB/s; offload saturates the lower number.
- GPU memory peak per phase: log with `deepspeed.runtime.utils.memory_status()` to validate stage3_max_live_parameters.
- grad_norm: spikes indicate divergence; correlate with LR.
- DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_PCIE_TX_BYTES — distinguish ZeRO collectives from CPU offload traffic.
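The throughput-drift signal in the list above can also be checked offline from logged step times. A minimal sketch that flags a regression when the recent average exceeds a baseline average by 10 percent; the function name and window sizes are hypothetical choices, not DeepSpeed settings.

```python
from statistics import mean


def step_time_regressed(step_times, recent=5, baseline=60, threshold=1.10):
    """Return True if the last `recent` steps are slower than the preceding
    `baseline` steps by more than `threshold` (1.10 = +10 percent)."""
    if len(step_times) < recent + baseline:
        return False  # not enough history to judge
    recent_avg = mean(step_times[-recent:])
    baseline_avg = mean(step_times[-(recent + baseline):-recent])
    return recent_avg > threshold * baseline_avg


steady = [1.0] * 65
slowed = [1.0] * 60 + [1.2] * 5
print(step_time_regressed(steady))
print(step_time_regressed(slowed))
```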
```yaml
# Prometheus rules for a DeepSpeed ZeRO training job
groups:
  - name: deepspeed-training
    interval: 60s
    rules:
      - alert: DeepSpeedStepTimeRegression
        expr: |
          avg_over_time(deepspeed:step_time_seconds[5m]) >
          1.10 * avg_over_time(deepspeed:step_time_seconds[1h] offset 30m)
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "DeepSpeed step time +10% vs 1h baseline on {{ $labels.job_name }}"
      - alert: DeepSpeedOptimizerStepSlow
        expr: avg_over_time(deepspeed:optimizer_step_seconds[5m]) > 5.0
        for: 15m
        annotations:
          summary: "CPU-offload Adam step > 5s: CPU saturated or PCIe contention"
      - alert: DeepSpeedAllGatherP99
        expr: |
          histogram_quantile(0.99,
            rate(deepspeed:allgather_seconds_bucket[5m])) > 1.0
        for: 10m
        annotations:
          summary: "ZeRO-3 AllGather p99 > 1s: likely interconnect issue"
      - alert: DeepSpeedGradNormSpike
        expr: deepspeed:grad_norm > 10 * avg_over_time(deepspeed:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike: investigate LR or data"
```

Cost and FinOps
DeepSpeed cost economics are driven by where state lives (GPU / CPU / NVMe) and the throughput tax for each tier. ZeRO-3 in GPU memory matches DDP throughput within ~5 percent; CPU offload typically costs 2-3x in wall-clock; NVMe offload costs 5-10x. The table below uses Yobitel UK pricing (June 2026) for the canonical fine-tuning shape (Llama-3 70B SFT on 8x H100 SXM5).
- Single-node ZeRO-3 + CPU offload is the cheapest 70B-class fine-tuning option when the team has one 8x H100 box.
- Multi-node Megatron-DS hybrid beats ZeRO-3 alone past ~70B because of the cross-node AllGather cost.
- NVMe-tier (Infinity) is a budget-of-last-resort option for prototyping a model larger than CPU RAM permits.
- BF16 vs FP16: identical cost on Ampere/Hopper, but BF16 saves the engineering time spent debugging loss-scale collapses.
| Config | Node | Hours / 1B tokens | Hourly rate | USD / 1B tokens |
|---|---|---|---|---|
| Stage 2, no offload | 8x H100 80GB | Not feasible (OOM) | $24.80 | — |
| Stage 3, no offload | 8x H100 80GB | 55 | $24.80 | $1,364 |
| Stage 3 + CPU offload | 8x H100 80GB + 1TB DDR5 | 120 | $28.40 | $3,408 |
| Stage 3 + NVMe (Infinity) | 8x H100 + RAID0 NVMe | 320 | $26.20 | $8,384 |
| Stage 1 (Megatron-DS hybrid) | 32x H100 (4 nodes) | 18 | $99.20 | $1,786 |
| FSDP HSDP equivalent | 8x H100 80GB | 55 | $24.80 | $1,364 |
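The last column of the table is simply hours multiplied by the hourly rate. The sketch below recomputes it from the table's own figures so the same arithmetic can be reused with other rates; the names are illustrative.

```python
# (hours per 1B tokens, hourly rate in USD), copied from the table above
CONFIGS = {
    "Stage 3, no offload": (55, 24.80),
    "Stage 3 + CPU offload": (120, 28.40),
    "Stage 3 + NVMe (Infinity)": (320, 26.20),
    "Stage 1 (Megatron-DS hybrid)": (18, 99.20),
}


def usd_per_billion_tokens(hours: float, hourly_rate: float) -> int:
    """Cost of processing 1B tokens, rounded to the nearest dollar."""
    return round(hours * hourly_rate)


costs = {name: usd_per_billion_tokens(h, r) for name, (h, r) in CONFIGS.items()}
for name, usd in costs.items():
    print(f"{name}: ${usd:,}")
```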
Security and compliance
DeepSpeed is a training library with no built-in network surface; security and compliance apply at the cluster level (Slurm/K8s auth, encrypted weights filesystem, isolated training network) the same way they do for Megatron-LM. The DeepSpeed package itself is pip-installable, Apache-2.0-licensed, and has no telemetry call-home. For UK and EU sovereign deployments, DeepSpeed runs inside the same NCSC- and G-Cloud-compliant tenancies as Megatron, NeMo, or FSDP; the framework choice does not change the sovereignty posture.
Two operational notes for compliance audits: (1) DeepSpeed sharded checkpoints (`global_stepX/zero_pp_rank_Y_*.pt`) are not directly portable; use `zero_to_fp32.py` to materialise a single fp32 state dict for audit retention. (2) NVMe offload spills sensitive weights to disk; ensure the spill path is on encrypted ephemeral storage (LUKS or dm-crypt) and is wiped on container teardown.
Migration and alternatives
The closest alternative to DeepSpeed ZeRO is PyTorch FSDP, which is architecturally equivalent at Stage 3 (FULL_SHARD) and Stage 2 (SHARD_GRAD_OP). The two have converged functionally: ZeRO-3 and FSDP FULL_SHARD give comparable memory and throughput on the same workload; the meaningful differences are ergonomic and ecosystem. DeepSpeed has the richer offload story (CPU and NVMe), the Megatron-DeepSpeed hybrid, and the established HuggingFace Trainer integration. FSDP has tighter PyTorch core integration, native DTensor composition with TP, and a cleaner extension story via `fully_shard` (FSDP2), exposed as `torch.distributed.fsdp.fully_shard` in recent PyTorch releases and as `torch.distributed._composable.fsdp.fully_shard` in earlier ones.
| From / To | Effort | When to choose | What you keep / lose |
|---|---|---|---|
| DDP -> DeepSpeed ZeRO-1 | Trivial — drop in config | Always; ZeRO-1 is free. | Keep DDP semantics; gain optimiser sharding. |
| DDP -> DeepSpeed ZeRO-3 | Low — config change | Model > 1 GPU memory. | Lose direct param access; gain N-fold memory. |
| DeepSpeed -> FSDP | Medium — code rewrite | PyTorch-native stack, no NVMe offload needed. | Lose NVMe offload (FSDP offers CPU offload only); gain DTensor / TP composition. |
| FSDP -> DeepSpeed | Medium — config + launcher | Need CPU/NVMe offload or Megatron-DS hybrid. | Gain offload; lose FSDP2 ergonomics. |
| DeepSpeed -> Megatron-LM | High — different paradigm | True 3D parallelism at frontier scale. | Gain TP+PP; lose configuration ergonomics. |
| HuggingFace Trainer + DS -> Accelerate + DS | Low | More flexible training loop. | Same engine; different orchestration. |
If you are starting fresh in 2026 on PyTorch and don't need NVMe offload, prefer FSDP2 over DeepSpeed ZeRO. The two are equivalent at the math layer, but FSDP2 composes more cleanly with the rest of the torch.distributed stack. Keep DeepSpeed when you need ZeRO-Infinity, the Megatron-DeepSpeed hybrid, or HuggingFace Trainer with its existing config surface.
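When porting a config, the stage-to-strategy correspondence described in this section can be captured in a small lookup. The sketch returns the name of the matching `ShardingStrategy` member from `torch.distributed.fsdp`; mapping stage 0 to `NO_SHARD` and treating ZeRO-1 as having no direct FSDP equivalent are assumptions added here, not statements from the text above.

```python
# ZeRO stage -> name of the closest torch.distributed.fsdp.ShardingStrategy member
ZERO_TO_FSDP = {
    0: "NO_SHARD",       # plain replication, DDP-like (assumption)
    2: "SHARD_GRAD_OP",  # optimiser state + gradients sharded
    3: "FULL_SHARD",     # parameters sharded as well
}


def fsdp_strategy_for(zero_stage: int) -> str:
    """Return the FSDP sharding-strategy name closest to a ZeRO stage."""
    if zero_stage == 1:
        # Assumption: no exact FSDP counterpart; SHARD_GRAD_OP is the nearest.
        raise ValueError("ZeRO-1 has no direct FSDP equivalent; consider SHARD_GRAD_OP")
    return ZERO_TO_FSDP[zero_stage]


print(fsdp_strategy_for(3))
# With PyTorch installed: getattr(ShardingStrategy, fsdp_strategy_for(3))
```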
Troubleshooting
The error patterns below cover the common DeepSpeed ZeRO failure modes observed on Yobitel-operated training fleets and the public DeepSpeed issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom | Cause | Fix |
|---|---|---|
| OOM during model init with Stage 3 | Full model materialised before sharding. | Use `with deepspeed.zero.Init(): model = AutoModel.from_pretrained(...)` context manager. |
| from_pretrained fails on Stage 3 checkpoint | Sharded weights cannot load directly into HF. | Set stage3_gather_16bit_weights_on_model_save=true OR run zero_to_fp32.py. |
| Throughput much slower than expected at Stage 3 | overlap_comm=false or contiguous_gradients=false. | Set both to true; verify prefetch bucket sizes are non-trivial. |
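| CPU-offload optimiser step far slower than expected | DeepSpeedCPUAdam not in use; a plain PyTorch Adam is running on the host. | Define the optimiser in the config (`optimizer.type` of Adam or AdamW) so DeepSpeed builds DeepSpeedCPUAdam, or construct `deepspeed.ops.adam.DeepSpeedCPUAdam` yourself. |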
| NCCL hang on Stage 3 launch | AllGather buckets overflow comm buffer. | Lower stage3_prefetch_bucket_size (allgather_bucket_size applies to Stage 1/2 only); raise NCCL_BUFFSIZE. |
| NVMe Infinity throughput < 1 GB/s | Single drive or HDD instead of RAID-0 NVMe. | Provision 4+ NVMe RAID-0; pin offload_param.nvme_path to it. |
| Loss NaN on FP16 | Dynamic loss scale collapsed. | Switch to bf16; or set fp16.initial_scale_power lower. |
| Optimiser step 5x slower than expected | AdamW called on full param replica (not the sharded slice). | Verify zero_optimization.stage > 0; check optimiser is DeepSpeed-wrapped. |
| Gradient overflow warnings every step | FP16 + Adam epsilon too small. | Switch to bf16; or raise eps to 1e-6 from 1e-8. |
| Stage 3 with TP fails on shape mismatch | ZeRO-3 not aware of TP partitioning by default. | Use Megatron-DeepSpeed and set zero_optimization.stage=1 with TP instead. |
| Checkpoint save takes > 10 minutes | Sharded checkpoint write to slow filesystem. | Use parallel writes; place save dir on Lustre or NVMe; consider torch_dist format. |
| zero_to_fp32.py OOMs | Tries to load all shards into one process. | Run on a CPU box with 2x model-size RAM; or use the streaming variant. |
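Several rows in the table above can be caught before launch by inspecting the config. The sketch below is a hypothetical pre-flight check (not a DeepSpeed API) that covers the gather-on-save, overlap, offload-stage, and FP16 rows; it only looks at keys documented in this entry.

```python
def lint_zero_config(cfg: dict) -> list[str]:
    """Return human-readable findings for common ZeRO config pitfalls."""
    zero = cfg.get("zero_optimization", {})
    stage = zero.get("stage", 0)
    findings = []
    if stage == 3 and not zero.get("stage3_gather_16bit_weights_on_model_save", False):
        findings.append("stage 3 without gather-on-save: from_pretrained will need zero_to_fp32.py")
    for key in ("overlap_comm", "contiguous_gradients"):
        if stage == 3 and zero.get(key) is False:
            findings.append(f"{key} is disabled at stage 3: expect lower throughput")
    if "offload_param" in zero and stage < 3:
        findings.append("offload_param is set but only applies at stage 3")
    if cfg.get("fp16", {}).get("enabled"):
        findings.append("fp16 enabled: watch for loss-scale collapse, or switch to bf16")
    return findings


risky = {"fp16": {"enabled": True}, "zero_optimization": {"stage": 3, "overlap_comm": False}}
clean = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
for finding in lint_zero_config(risky):
    print("-", finding)
print(lint_zero_config(clean))
```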
Where this fits in the Yobitel stack
DeepSpeed ZeRO is one of the supported training engines on Yobitel sovereign GPU tenancies. For customers running single-node 70B fine-tuning on 8x H100, the canonical path is Stage 3 + CPU offload with the HuggingFace Trainer integration and a pre-validated Slurm/Pyxis launch script. For multi-node pretraining above 30B, Yobitel recommends Megatron-LM (via NeMo) for true 3D parallelism and uses DeepSpeed primarily when an existing customer codebase is already DeepSpeed-shaped.
Yobitel's sovereign training tenancies (London-1, Frankfurt-1) ship DeepSpeed alongside Megatron, NeMo, FSDP, and torchtune in pre-built NGC-derived containers. NCCL and InfiniBand tuning, NVMe RAID-0 configuration for ZeRO-Infinity, and DCGM observability dashboards are pre-applied. Trained checkpoints flow into Yobibyte's inference path (vLLM, TensorRT-LLM) after the standard HuggingFace conversion.
References
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models · arXiv (Rajbhandari et al., 2019)
- ZeRO-Offload: Democratizing Billion-Scale Model Training · arXiv (Ren et al., 2021)
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning · arXiv (Rajbhandari et al., 2021)
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training · arXiv (Wang et al., 2023)
- DeepSpeed on GitHub · GitHub (Microsoft)
- DeepSpeed Documentation · Microsoft DeepSpeed
- HuggingFace Transformers DeepSpeed integration · HuggingFace