DeepSpeed ZeRO

TL;DR

ZeRO (Zero Redundancy Optimizer), introduced by Microsoft DeepSpeed in 2019 (Rajbhandari et al., arXiv:1910.02054), removes the memory redundancy of vanilla data parallelism by sharding optimiser state (Stage 1), gradients (Stage 2), and parameters (Stage 3). DeepSpeed is open source (Apache-2.0 in current releases; early releases were MIT) and hosted at github.com/microsoft/DeepSpeed.
Stage 3 reduces per-GPU model-state memory roughly N-fold across N data-parallel ranks, at the cost of two AllGathers (forward and backward) and one ReduceScatter per layer per step to reconstruct full parameters on demand. ZeRO-Offload extends sharding to CPU RAM; ZeRO-Infinity extends it to NVMe.
ZeRO underpins the single-node 70B fine-tuning case (Stage 3 + CPU offload on 4-8x H100) and remains the production memory-saving strategy in Megatron-DeepSpeed and the HuggingFace Trainer / Accelerate stack. It is architecturally interchangeable with PyTorch FSDP, with which it has converged functionally.
ZeRO also accounts for most memory-related keys in the DeepSpeed config JSON: `zero_optimization.stage`, `offload_optimizer`, `offload_param`, `contiguous_gradients`, `overlap_comm`, `reduce_bucket_size`, `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`. Tuning these is the bulk of DeepSpeed ops work.

Overview

Vanilla data parallelism wastes memory. With P parameters, every worker holds the full BF16 weights (2P bytes), the BF16 gradients (2P bytes), and the Adam optimiser state: FP32 master weights (4P bytes), FP32 first moment (4P bytes), and FP32 second moment (4P bytes). That totals roughly 16 bytes per parameter at mixed precision. For a 70B model that is 1.12 TB per rank; on a 64-GPU cluster the replicas add up to about 71.7 TB, of which about 70.6 TB (every copy but one) is redundant.
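
The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch that assumes the byte counts stated above and ignores activations; the function and dictionary names are illustrative, not part of DeepSpeed.

```python
# Bytes per parameter for mixed-precision Adam under plain data parallelism.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
}


def ddp_state_bytes(num_params: float) -> float:
    """Model-state bytes held by ONE rank (activations excluded)."""
    return num_params * sum(BYTES_PER_PARAM.values())


per_rank = ddp_state_bytes(70e9)      # 70B-parameter model
cluster_total = 64 * per_rank         # 64 identical replicas
redundant = cluster_total - per_rank  # every copy except one

print(f"per rank : {per_rank / 1e12:.2f} TB")
print(f"64 ranks : {cluster_total / 1e12:.1f} TB total, "
      f"{redundant / 1e12:.1f} TB redundant")
```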

ZeRO observes that this redundancy is unnecessary: only one worker needs to own each piece of state at any given moment, provided we reconstruct what we need when we need it and discard it after the layer's compute is done. The three stages progressively eliminate redundancy at the cost of more frequent collective communication: Stage 1 shards optimiser state (where the most bytes live), Stage 2 also shards gradients, Stage 3 also shards parameters (true model sharding).

DeepSpeed wraps PyTorch with a configuration-driven engine that exposes ZeRO plus mixed-precision, gradient accumulation, gradient checkpointing, fused optimisers, and pipeline parallelism behind a single deepspeed.initialize() call. The configuration is a JSON file — the canonical surface that production deployments tune. Yobitel NeoCloud customers training 70B+ models commonly use DeepSpeed ZeRO-3 with CPU offload on single 8x H100 nodes and Megatron-DeepSpeed hybrid configurations on multi-node training pods.

This entry documents the production surface: the JSON config schema for ZeRO, the three stages and their communication patterns, ZeRO-Offload for CPU spill and ZeRO-Infinity for NVMe spill, the integration with HuggingFace Trainer and Accelerate, sizing tables, and the migration path to and from FSDP. Use it to choose and operate DeepSpeed ZeRO for training pods on Yobitel NeoCloud or your own multi-GPU cluster.

Quick start

The example below fine-tunes Llama-3 8B on an instruction dataset using ZeRO-3 on 4x A100 80GB. The first block installs DeepSpeed and writes the ZeRO-3 config. The second block launches the training job with the deepspeed launcher (DeepSpeed's own multi-process launcher, comparable to torchrun); train.py stands for a HuggingFace Trainer-based script that parses these arguments. The third block shows the equivalent HuggingFace Trainer integration that picks up the same config.

# 1. Install DeepSpeed and write a ZeRO-3 config
pip install "deepspeed>=0.14.0" transformers accelerate bitsandbytes datasets

cat > ds_zero3.json <<'JSON'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
JSON

# 2. Launch fine-tuning on 4x A100 80GB
deepspeed --num_gpus=4 train.py \
    --deepspeed ds_zero3.json \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset_name yahma/alpaca-cleaned \
    --output_dir ./llama3-8b-sft \
    --bf16 true \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_strategy steps --save_steps 500

# 3. The HuggingFace Trainer wires it up via TrainingArguments(deepspeed="ds_zero3.json")
python - <<'PY'
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    output_dir="./llama3-8b-sft",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    deepspeed="ds_zero3.json",
)
# Trainer(model, args, ...).train()
PY

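Tip: The auto values for bucket and prefetch sizes are filled in by the HuggingFace Trainer integration (from the model's hidden size) before the config reaches DeepSpeed; a plain deepspeed.initialize() call expects concrete numbers. Leave them as auto for the first run, then tune only if profiling shows AllGather under-overlapped.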
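
To see what concrete numbers stand behind the auto placeholders, the sketch below fills them in by hand. It assumes the formulas described in the HuggingFace DeepSpeed integration documentation (hidden_size squared, 0.9 times that, and 10 times hidden_size); resolve_auto_zero3 is a hypothetical helper written for this entry, not a DeepSpeed or Transformers API.

```python
import copy


def resolve_auto_zero3(config: dict, hidden_size: int) -> dict:
    """Return a copy of `config` with the three ZeRO-3 "auto" sizes replaced.

    Hypothetical helper: the formulas are an assumption taken from the
    HuggingFace integration docs, and the values are element counts.
    """
    resolved = copy.deepcopy(config)
    zero = resolved.setdefault("zero_optimization", {})
    recommended = {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }
    for key, value in recommended.items():
        if zero.get(key) == "auto":  # leave explicit numbers untouched
            zero[key] = value
    return resolved


cfg = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
    }
}

# Llama-3 8B uses a hidden size of 4096.
resolved = resolve_auto_zero3(cfg, hidden_size=4096)
for key, value in resolved["zero_optimization"].items():
    print(f"{key}: {value}")
```

A resolved copy like this is one way to reuse the quick-start config outside the Trainer, where the placeholders would otherwise have to be replaced manually.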

How it works

ZeRO partitions training state into three classes (parameters, gradients, optimiser state) and applies progressively more aggressive sharding across the DP group. The compute graph is unchanged from DDP; what changes is which rank owns which bytes at which moment, and the collective operations that move bytes in and out of GPU memory just in time.

Stage 1 (optimiser state sharding): each of the N data-parallel ranks owns 1/N of the FP32 master weights, 1/N of the Adam first moment, and 1/N of the Adam second moment. The forward and backward passes are unchanged. Gradients are still reduced across ranks, but the optimiser step is now local to each rank's slice; a final AllGather of updated BF16 parameters distributes the new weights. Memory drops from ~16 bytes/param/rank to ~4 + 12/N bytes/param/rank. When the gradient reduction is performed as a ReduceScatter, as in the ZeRO paper's accounting, total communication volume matches DDP; an implementation that keeps a full AllReduce and adds the AllGather moves about 1.5x DDP's volume.

Stage 2 (Stage 1 + gradient sharding): gradients are also sharded across the DP group. The single AllReduce becomes a ReduceScatter, so each rank ends up with 1/N of the gradients. Memory drops to ~2 + 14/N bytes/param/rank; communication volume is identical to DDP (ReduceScatter + AllGather = AllReduce in bytes moved).

Stage 3 (Stage 2 + parameter sharding): parameters themselves live in 1/N slices on each rank. Before a layer's forward pass, an AllGather reconstructs the full parameters of that layer on every rank from the per-rank slices. The layer executes, then the gathered copies are freed. The backward pass gathers the parameters again and adds a ReduceScatter for the gradients. Per-step communication volume is roughly 1.5x DDP, but model state falls to ~16/N bytes/param/rank, linear in N. This is what enables training models much larger than a single GPU's memory.
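
The per-stage memory figures follow from dividing each class of state by N once its stage is enabled. The sketch below is illustrative only: it assumes the 2 + 2 + 12 bytes/param split used in this entry, counts model state only (no activations, buffers, or fragmentation), and uses a function name invented for the example.

```python
def model_state_bytes_per_param(stage: int, n: int) -> float:
    """Model-state bytes per parameter per rank for mixed-precision Adam.

    2 = BF16 weights, 2 = BF16 grads, 12 = FP32 master weights + two moments.
    `n` is the data-parallel group size.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n    # Stage 1 shards optimiser state
    if stage >= 2:
        grads /= n    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= n  # Stage 3 also shards parameters
    return weights + grads + optim


num_params = 70e9
for n in (8, 64):
    for stage in (0, 1, 2, 3):
        per_param = model_state_bytes_per_param(stage, n)
        gb = num_params * per_param / 1e9
        print(f"N={n:<3} stage={stage}  {per_param:6.3f} B/param  {gb:8.1f} GB/rank")
```

Running it for a 70B model makes the trade-off concrete: at N = 8, Stage 3 still needs about 140 GB of model state per rank, which is why single-node 70B work leans on offload.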

ZeRO-Offload (Ren et al., 2021, arXiv:2101.06840) moves optimiser state and the optimiser-step computation to the CPU. The gradients ReduceScatter into CPU memory; Adam's update runs on x86 cores using fused AVX kernels; updated BF16 parameters AllGather back to the GPU. CPU-side Adam is cheap enough not to bottleneck for typical Llama-shaped models; the cost is host-device bandwidth (PCIe Gen4 ~32 GB/s, Gen5 ~64 GB/s).

ZeRO-Infinity (Rajbhandari et al., 2021, arXiv:2104.07857) extends Offload to NVMe. Parameter and optimiser state stream from a RAID-0 NVMe array via DMA, prefetched layer-by-layer. With 8-NVMe RAID-0 sustaining 50+ GB/s of read bandwidth, a 175B-class model can fit and fine-tune on a single 8x A100 node — at roughly 30-50 percent of the throughput of an in-memory configuration but at a tiny fraction of the cluster cost.
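
A rough feel for the bandwidth cost of both offload tiers comes from dividing the bytes moved by the link bandwidth. The sketch below is a back-of-envelope lower bound under stated assumptions: one PCIe link per GPU at the quoted rates, BF16 tensors at 2 bytes per parameter, and no overlap with compute. The function name is illustrative.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to move `num_bytes` over a link, ignoring latency and overlap."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# ZeRO-Offload: each of 8 ranks sends its BF16 gradient shard of a 70B model
# to host memory (assumed: one PCIe Gen4 x16 link per GPU, ~32 GB/s).
grad_shard_bytes = 70e9 / 8 * 2
print(f"grad shard over PCIe Gen4: {transfer_seconds(grad_shard_bytes, 32):.2f} s")
print(f"grad shard over PCIe Gen5: {transfer_seconds(grad_shard_bytes, 64):.2f} s")

# ZeRO-Infinity: one full read of the BF16 parameters of a 175B model
# from an NVMe array sustaining ~50 GB/s.
param_bytes = 175e9 * 2
print(f"175B params from NVMe:     {transfer_seconds(param_bytes, 50):.1f} s")
```

Real step times depend on how much of this traffic DeepSpeed hides behind compute, so treat the numbers as a floor rather than a prediction.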

ZeRO-1: optimiser state sharded; up to ~4x memory reduction vs DDP (approached as N grows); communication ~same as DDP.
ZeRO-2: + gradients sharded; up to ~8x memory reduction (approached as N grows); communication ~same as DDP (ReduceScatter + AllGather).
ZeRO-3: + parameters sharded; ~N-fold memory reduction; communication ~1.5x DDP.
ZeRO-Offload: CPU spill for optimiser state and gradients; Adam runs on CPU.
ZeRO-Infinity: NVMe spill for parameters and optimiser state; bandwidth bounded by RAID-0.
ZeRO++ (2023): hierarchical partitioning + quantised weights/gradients; cuts cross-node traffic 4x.
Compose with: gradient checkpointing (activations), tensor parallelism (per Megatron-DeepSpeed), pipeline parallelism (per DeepSpeed pipeline module).

Note: Memory savings are quoted relative to vanilla DDP with Adam at mixed precision (16 bytes/param/rank). Real workloads also carry activation memory, which ZeRO does not touch — pair Stage 3 with gradient_checkpointing=True for long-context training above 8K.

Reference and specifications

DeepSpeed is configured via a JSON document passed to deepspeed.initialize(config=...) or to the launcher's --deepspeed argument. The table below documents the ZeRO-relevant fields as of DeepSpeed 0.14 (June 2026). Fields under zero_optimization apply only when ZeRO is active; offload_optimizer with device cpu works at stages 1 to 3, while offload_param and any nvme device require stage 3.

Config key	Type	Default	Description
zero_optimization.stage	int	0	0 = off, 1 = optimiser sharding, 2 = + grads, 3 = + params.
zero_optimization.offload_optimizer.device	string	(unset)	Target for optimiser state: cpu or nvme.
zero_optimization.offload_optimizer.nvme_path	path	(unset)	Path on NVMe filesystem for the optimiser-state spill, e.g. /local_nvme.
zero_optimization.offload_optimizer.pin_memory	bool	false	Pin CPU memory for higher H<->D bandwidth.
zero_optimization.offload_param.device	string	(unset)	Target for parameters: cpu or nvme.
zero_optimization.offload_param.nvme_path	path	(unset)	Path on NVMe for parameter spill, e.g. /local_nvme.
zero_optimization.overlap_comm	bool	false	Overlap collective comms with backward compute (~10-20 percent uplift).
zero_optimization.contiguous_gradients	bool	true	Copy grads into a contiguous buffer as they are produced (recommended; limits fragmentation in backward).
zero_optimization.reduce_bucket_size	int	5e8	Elements per reduction bucket; smaller = lower latency, larger = higher bandwidth.
zero_optimization.allgather_bucket_size	int	5e8	Elements per AllGather bucket (Stage 1/2).
zero_optimization.stage3_prefetch_bucket_size	int	5e7 (HF Trainer accepts auto)	Elements prefetched ahead for upcoming layers' AllGather.
zero_optimization.stage3_param_persistence_threshold	int	1e5 (HF Trainer accepts auto)	Params smaller than this many elements stay replicated (avoids per-step AllGather).
zero_optimization.stage3_max_live_parameters	int	1e9	Cap on the number of gathered parameters resident per GPU at once.
zero_optimization.stage3_max_reuse_distance	int	1e9	A gathered param is kept if it will be reused within this many parameters; larger values trade memory for fewer re-gathers.
zero_optimization.stage3_gather_16bit_weights_on_model_save	bool	false	Materialise full BF16 weights at save; required for HF export.
zero_optimization.sub_group_size	int	1e9	Stage 3 grouping for partitioned optimiser step; tune for large models.
zero_optimization.cpu_offload	bool	(deprecated)	Legacy alias for offload_optimizer; use the explicit form.
zero_optimization.zero_quantized_weights	bool	false	ZeRO++ quantised weight communication (INT8).
zero_optimization.zero_hpz_partition_size	int	(off)	ZeRO++ hierarchical partitioning; size of per-node group.
bf16.enabled	bool	false	Enable BF16 mixed-precision training.
fp16.enabled	bool	false	Enable FP16 + loss scaling. BF16 preferred on Ampere+.
fp16.loss_scale	float	0 = dynamic	0 enables dynamic loss scaling.
gradient_accumulation_steps	int	1	Effective batch = micro_batch x dp_size x grad_accum.
gradient_clipping	float	0 = off	Global gradient-norm clipping threshold.
train_micro_batch_size_per_gpu	int	(required)	Per-rank micro-batch.
optimizer.type	string	(required)	Optimiser name, e.g. Adam, AdamW, Lamb.
scheduler.type	string	(unset)	Scheduler name, e.g. WarmupLR.
activation_checkpointing.partition_activations	bool	false	Partition activations across MP group (Megatron-DeepSpeed).
wall_clock_breakdown	bool	false	Per-phase timing breakdown; useful for tuning.
zero_allow_untested_optimizer	bool	false	Required to use optimisers other than the official list.

Warning: stage3_gather_16bit_weights_on_model_save: false (the default) writes Stage 3 sharded checkpoints that HuggingFace from_pretrained cannot load directly. Either set this to true (uses 2x model size GPU memory at save) or use DeepSpeed's zero_to_fp32.py script to convert offline.

Workload patterns

Three workload shapes dominate DeepSpeed ZeRO production usage. Each maps to a different config emphasis and to a different Yobitel NeoCloud training-pod shape.

Pattern A: ZeRO-3 + CPU offload to fine-tune Llama-3 70B on a single 8x H100 SXM5 node (the standard Yobitel NeoCloud single-node SFT shape).
Pattern B: ZeRO-2 pretraining of a 13B model from random weights on 32 GPUs (a 4-node NeoCloud training pod), prioritising throughput over memory.
Pattern C: Megatron-DeepSpeed hybrid 3D + ZeRO-1 for a 175B-class frontier pretraining run that combines TP+PP+DP+ZeRO across a 64-GPU (8-node) NeoCloud training pod with InfiniBand NDR fabric.

// A — ZeRO-3 + CPU offload for single-node 70B fine-tune
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 2e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 2e-5 } }
}

// B — ZeRO-2 pretrain of 13B on 32 GPUs, throughput-optimised
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 3e-4, "betas": [0.9, 0.95] } },
  "activation_checkpointing": { "partition_activations": false }
}

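// C: Megatron-DeepSpeed 3D + ZeRO-1 for 175B pretrain at scale. The tensor_parallel / pipeline_parallel entries are illustrative: Megatron-DeepSpeed normally takes TP and PP degrees as launcher arguments, and TP x PP must be smaller than the world size so that a DP group remains for ZeRO-1 to shard across.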
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "tensor_parallel": { "tp_size": 8 },
  "pipeline_parallel": { "pp_size": 8 },
  "gradient_accumulation_steps": 16,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 1.5e-4 } },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  }
}
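
When ZeRO is combined with tensor and pipeline parallelism, it shards only across the data-parallel replicas that remain after TP and PP have taken their share of the GPUs. The sketch below checks that remainder for the pod sizes named in this section; the function name is illustrative, and the formula DP = world_size / (TP x PP) is the usual 3D-parallel layout rather than anything DeepSpeed-specific.

```python
def data_parallel_size(world_size: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """Number of data-parallel replicas left for ZeRO to shard across."""
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size*pp_size={model_parallel}"
        )
    return world_size // model_parallel


# Pattern B: pure data parallelism on 32 GPUs.
print("Pattern B DP:", data_parallel_size(32))

# Pattern C as written (TP=8, PP=8) on a 64-GPU pod leaves a single replica,
# so ZeRO-1 would have nothing to shard; the same degrees on 512 GPUs give DP=8.
print("Pattern C DP on 64 GPUs :", data_parallel_size(64, tp_size=8, pp_size=8))
print("Pattern C DP on 512 GPUs:", data_parallel_size(512, tp_size=8, pp_size=8))
```

If the result is 1, lower the TP or PP degree or add nodes before expecting any benefit from the ZeRO stage in the config.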

Sizing and capacity planning

The two questions that drive ZeRO sizing are: what is the largest model I can fit on this cluster, and what throughput will I get? The table below gives reference configs and observed throughput for BF16 Adam-trained dense transformers, with overlap_comm: true, contiguous_gradients: true, and gradient checkpointing on. Throughput is per-GPU sustained tokens-per-second from DeepSpeed published benchmarks and Yobitel-operated fleets at 4K sequence length.

ZeRO-3 + CPU offload trades 2-3x throughput for 30-50 percent GPU-memory headroom; primarily a fine-tuning lever.
ZeRO-Infinity (NVMe spill) is 5-10x slower than in-memory; use only when no other capacity is available.
Above ~70B, prefer Megatron-DeepSpeed 3D + ZeRO-1 over pure ZeRO-3: the per-layer AllGather has to cross nodes and becomes progressively harder to hide behind compute.
ZeRO++ (zero_quantized_weights + hierarchical partitioning) cuts cross-node ZeRO-3 traffic 4x on bandwidth-constrained clusters.
Effective batch size = micro_batch x DP x grad_accumulation; for stable Adam at scale aim for 1-4M tokens per step.
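
The effective-batch rule in the last bullet can be turned into a quick check of tokens per optimiser step. This is a minimal sketch: the helper names are invented for the example, and the 4096-token sequence length is an assumption matching the 4K setting used for the throughput figures.

```python
import math


def tokens_per_step(micro_batch: int, dp_size: int, grad_accum: int, seq_len: int) -> int:
    """Tokens consumed per optimiser step across the whole DP group."""
    return micro_batch * dp_size * grad_accum * seq_len


def grad_accum_for_target(target_tokens: int, micro_batch: int, dp_size: int, seq_len: int) -> int:
    """Smallest gradient_accumulation_steps that reaches `target_tokens`."""
    per_pass = micro_batch * dp_size * seq_len
    return max(1, math.ceil(target_tokens / per_pass))


# Pattern B settings: micro-batch 4, 32 data-parallel ranks, no accumulation.
current = tokens_per_step(micro_batch=4, dp_size=32, grad_accum=1, seq_len=4096)
print(f"Pattern B as configured: {current:,} tokens/step")

# Accumulation needed to reach roughly 2M tokens per step.
needed = grad_accum_for_target(2_000_000, micro_batch=4, dp_size=32, seq_len=4096)
print(f"grad_accum for ~2M tokens: {needed} "
      f"({tokens_per_step(4, 32, needed, 4096):,} tokens/step)")
```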

Model	Cluster	ZeRO stage + offload	Per-rank GPU mem	Per-GPU tok/s
7B	8x A100 80GB	Stage 2	~32 GB	8,500
13B	8x A100 80GB	Stage 2	~58 GB	5,200
13B	32x A100 80GB	Stage 2	~36 GB	5,000
30B	8x A100 80GB	Stage 3	~64 GB	2,800
70B	8x H100 80GB	Stage 3	~74 GB	2,200
70B	8x H100 80GB	Stage 3 + CPU offload	~58 GB	950
70B	32x H100	Stage 3	~48 GB	2,400
175B	8x H100 80GB	Stage 3 + NVMe (Infinity)	~75 GB	180
175B	64x H100	Stage 3	~70 GB	1,400
8x22B MoE	32x H100	Stage 2 + EP	~60 GB	1,800

Note: By the 16 bytes/param accounting used in this entry, full BF16 + Adam model state for 70B is about 1.12 TB, or roughly 140 GB per rank when sharded across 8 GPUs. The 70B Stage 3 row on 8x 80 GB without offload therefore fits only if part of that state is offloaded, quantised, or frozen.

Limits and quotas

DeepSpeed itself imposes few hard limits; the constraints come from GPU memory, host RAM (for offload), NVMe bandwidth (for Infinity), and NCCL configuration. The table below summarises the constraints worth knowing before designing a ZeRO config.

Constraint	Default / ceiling	How to manage
DP group size	= world_size	ZeRO shards within DP; no hard upper bound.
sub_group_size (Stage 3)	1e9	Reduce for very large models (saves opt-step memory).
stage3_max_live_parameters	1e9	Cap on concurrently gathered params; raise to overlap more.
CPU RAM (offload)	Host-bounded	Need ~12 bytes/param of CPU RAM for AdamW offload.
NVMe bandwidth (Infinity)	Drive-bounded	RAID-0 8-drive NVMe sustains 50+ GB/s; required for usable Infinity.
Activation memory	Not addressed by ZeRO	Use activation_checkpointing for long context.
AllGather bucket size	5e8 elements	Larger = better bandwidth but more buffer memory; smaller = lower latency.
NCCL communicator count	World-size dependent	Set NCCL_BUFFSIZE, NCCL_NTHREADS; use NCCL >= 2.20.
BF16 hardware	Ampere+	Pre-Ampere needs FP16 + dynamic loss scaling.
Optimiser support	Adam/AdamW/Lamb stock	Custom optimisers need zero_allow_untested_optimizer.
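
The host-RAM row can be sized with the same bytes-per-parameter reasoning. The sketch below assumes 12 bytes/param for the offloaded FP32 optimiser state, plus 2 bytes/param if BF16 parameters are also offloaded; it does not count pinned staging buffers, gradient buffers, or the dataloader, so real requirements are higher. The function name is illustrative.

```python
def offload_host_ram_gb(num_params: float, offload_params: bool = False) -> float:
    """Approximate node-wide host RAM (GB) consumed by CPU offload."""
    bytes_per_param = 12.0           # FP32 master weights + two Adam moments
    if offload_params:
        bytes_per_param += 2.0       # assumed: BF16 parameter copy on the host
    return num_params * bytes_per_param / 1e9


for label, n in (("13B", 13e9), ("70B", 70e9)):
    opt_only = offload_host_ram_gb(n)
    with_params = offload_host_ram_gb(n, offload_params=True)
    print(f"{label}: optimiser offload ~{opt_only:.0f} GB, "
          f"with parameter offload ~{with_params:.0f} GB")
```

For a 70B model this lands at roughly 840 GB to 980 GB, which is why single-node offload configurations are typically paired with about 1 TB of host memory.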

Observability

DeepSpeed exposes per-step timing via wall_clock_breakdown: true in the config — partition_forward, partition_backward, optimizer_step, AllGather, ReduceScatter durations all log to TensorBoard or stdout. Pair with DCGM exporter for GPU-side metrics and NCCL_DEBUG=INFO for collective-level diagnostics. The signals that detect 90 percent of DeepSpeed production issues are throughput drift, optimiser-step latency, and offload bandwidth utilisation.

samples_per_second: holds within +-5 percent in steady state; drops correlate with offload bandwidth saturation.
optimizer_step (CPU offload): should be ~0.5-2s at 70B; >5s means CPU-bound Adam.
AllGather/ReduceScatter latency: visible in wall_clock_breakdown; pair with NCCL collective profiling.
host_to_device / device_to_host bandwidth: PCIe Gen4 ~32 GB/s, Gen5 ~64 GB/s; offload saturates the lower number.
GPU memory peak per phase: log with deepspeed.runtime.utils.memory_status() to validate stage3_max_live_parameters.
grad_norm: spikes indicate divergence; correlate with LR.
DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_PCIE_TX_BYTES: distinguish ZeRO collectives from CPU offload traffic.

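# Prometheus rules for a DeepSpeed ZeRO training job.
# Assumes a custom exporter publishes the deepspeed:* series; DeepSpeed does not expose them itself.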
groups:
  - name: deepspeed-training
    interval: 60s
    rules:
      - alert: DeepSpeedStepTimeRegression
        expr: |
          avg_over_time(deepspeed:step_time_seconds[5m]) >
          1.10 * avg_over_time(deepspeed:step_time_seconds[1h] offset 30m)
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "DeepSpeed step time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: DeepSpeedOptimizerStepSlow
        expr: avg_over_time(deepspeed:optimizer_step_seconds[5m]) > 5.0
        for: 15m
        annotations:
          summary: "CPU-offload Adam step > 5s — CPU saturated or PCIe contention"

      - alert: DeepSpeedAllGatherP99
        expr: histogram_quantile(0.99,
                rate(deepspeed:allgather_seconds_bucket[5m])) > 1.0
        for: 10m
        annotations:
          summary: "ZeRO-3 AllGather p99 > 1s — likely interconnect issue"

      - alert: DeepSpeedGradNormSpike
        expr: deepspeed:grad_norm > 10 * avg_over_time(deepspeed:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike — investigate LR or data"
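
The step-time rule above compares a recent average against an older baseline. The same comparison can run inside the training loop as an early warning before metrics reach Prometheus. This is a minimal sketch with an invented function name and the same 10 percent tolerance; it is not part of DeepSpeed.

```python
from statistics import mean
from typing import Sequence


def step_time_regressed(
    recent: Sequence[float],
    baseline: Sequence[float],
    tolerance: float = 0.10,
) -> bool:
    """True when the mean recent step time exceeds the baseline mean by more
    than `tolerance` (0.10 = 10 percent). Empty windows never alert."""
    if not recent or not baseline:
        return False
    return mean(recent) > (1.0 + tolerance) * mean(baseline)


baseline_window = [1.00, 1.02, 0.98, 1.00]  # seconds per step, earlier in the run
healthy_window = [1.03, 1.05, 1.04]
slow_window = [1.20, 1.25, 1.22]

print("healthy regressed:", step_time_regressed(healthy_window, baseline_window))
print("slow regressed:   ", step_time_regressed(slow_window, baseline_window))
```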

Cost and FinOps

DeepSpeed cost economics are driven by where state lives (GPU / CPU / NVMe) and the throughput tax for each tier. ZeRO-3 in GPU memory matches DDP throughput within ~5 percent; CPU offload typically costs 2-3x in wall-clock; NVMe offload costs 5-10x. The table below uses Yobitel UK pricing (June 2026) for the canonical fine-tuning shape (Llama-3 70B SFT on 8x H100 SXM5).

Single-node ZeRO-3 + CPU offload is the cheapest 70B-class fine-tuning option when the team has one 8x H100 box.
Multi-node Megatron-DS hybrid beats ZeRO-3 alone past ~70B because of the cross-node AllGather cost.
NVMe-tier (Infinity) is a budget-of-last-resort option for prototyping a model larger than CPU RAM permits.
BF16 vs FP16: identical cost on Ampere/Hopper, but BF16 saves the engineering time spent debugging loss-scale collapses.

Config	Node	Hours / 1B tokens	Hourly rate	USD / 1B tokens
Stage 2, no offload	8x H100 80GB	Not feasible (OOM)	$24.80	n/a
Stage 3, no offload	8x H100 80GB	55	$24.80	$1,364
Stage 3 + CPU offload	8x H100 80GB + 1TB DDR5	120	$28.40	$3,408
Stage 3 + NVMe (Infinity)	8x H100 + RAID0 NVMe	320	$26.20	$8,384
Stage 1 (Megatron-DS hybrid)	32x H100 (4 nodes)	18	$99.20	$1,786
FSDP FULL_SHARD equivalent	8x H100 80GB	55	$24.80	$1,364
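
The last column of the table is simply hours per billion tokens multiplied by the hourly node rate. The sketch below reproduces it from the figures in the table so that other rates or throughputs can be substituted; the function name and dictionary layout are illustrative.

```python
def usd_per_billion_tokens(hours_per_billion: float, hourly_rate_usd: float) -> float:
    """Cost of training on one billion tokens at a given node-hour price."""
    return hours_per_billion * hourly_rate_usd


# (hours per 1B tokens, hourly rate in USD), copied from the table above.
configs = {
    "Stage 3, no offload": (55, 24.80),
    "Stage 3 + CPU offload": (120, 28.40),
    "Stage 3 + NVMe (Infinity)": (320, 26.20),
    "Stage 1 (Megatron-DS hybrid)": (18, 99.20),
}

for name, (hours, rate) in configs.items():
    cost = usd_per_billion_tokens(hours, rate)
    print(f"{name:<30} ${cost:,.0f} per 1B tokens")
```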

Security and compliance

DeepSpeed is a training library with no built-in network surface; security and compliance apply at the cluster level (Slurm/K8s auth, encrypted weights filesystem, isolated training network) the same way they do for Megatron-LM. The DeepSpeed package is pip-installable, open source (Apache-2.0 in current releases), and has no telemetry call-home. For UK and EU sovereign deployments, DeepSpeed runs inside the same NCSC- and G-Cloud-compliant tenancies as Megatron, NeMo, or FSDP; the framework choice does not change the sovereignty posture.

Two operational notes for compliance audits: (1) DeepSpeed sharded checkpoints (global_stepX/zero_pp_rank_Y_*.pt) are not directly portable; use zero_to_fp32.py to materialise a single fp32 state dict for audit retention. (2) NVMe offload spills sensitive weights to disk; ensure the spill path is on encrypted ephemeral storage (LUKS or dm-crypt) and is wiped on container teardown.

Migration and alternatives

The closest alternative to DeepSpeed ZeRO is PyTorch FSDP — architecturally equivalent at Stage 3 (FULL_SHARD) and Stage 2 (SHARD_GRAD_OP). The two have converged functionally: ZeRO-3 and FSDP FULL_SHARD give the same memory and throughput on the same workload; the meaningful differences are ergonomic and ecosystem. DeepSpeed has the richer offload story (CPU and NVMe), the Megatron-DeepSpeed hybrid, and the established HuggingFace Trainer integration. FSDP has tighter PyTorch core integration, native DTensor composition with TP, and a cleaner extension story via torch.distributed._composable.fsdp.fully_shard (FSDP2).

From / To	Effort	When to choose	What you keep / lose
DDP -> DeepSpeed ZeRO-1	Trivial — drop in config	Always; ZeRO-1 is free.	Keep DDP semantics; gain optimiser sharding.
DDP -> DeepSpeed ZeRO-3	Low — config change	Model > 1 GPU memory.	Lose direct param access; gain N-fold memory.
DeepSpeed -> FSDP	Medium — code rewrite	PyTorch-native stack, no NVMe offload needed.	Lose NVMe offload (FSDP offers CPU offload only); gain DTensor / TP composition.
FSDP -> DeepSpeed	Medium — config + launcher	Need NVMe offload or Megatron-DS hybrid.	Gain NVMe offload; lose FSDP2 ergonomics.
DeepSpeed -> Megatron-LM	High — different paradigm	True 3D parallelism at frontier scale.	Gain TP+PP; lose configuration ergonomics.
HuggingFace Trainer + DS -> Accelerate + DS	Low	More flexible training loop.	Same engine; different orchestration.

Tip: If you are starting fresh in 2026 on PyTorch and don't need NVMe offload, prefer FSDP2 over DeepSpeed ZeRO. The two are equivalent at the math layer, but FSDP2 composes more cleanly with the rest of the torch.distributed stack. Keep DeepSpeed when you need ZeRO-Infinity, the Megatron-DeepSpeed hybrid, or HuggingFace Trainer with its existing config surface.

Troubleshooting

The error patterns below cover the common DeepSpeed ZeRO failure modes observed on Yobitel-operated training fleets and the public DeepSpeed issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
OOM during model init with Stage 3	Full model materialised before sharding.	Use `with deepspeed.zero.Init(): model = AutoModel.from_pretrained(...)` context manager.
from_pretrained fails on Stage 3 checkpoint	Sharded weights cannot load directly into HF.	Set stage3_gather_16bit_weights_on_model_save=true OR run zero_to_fp32.py.
Throughput much slower than expected at Stage 3	overlap_comm=false or contiguous_gradients=false.	Set both to true; verify prefetch bucket sizes are non-trivial.
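CPU-offload step time grows linearly with model size	DeepSpeedCPUAdam not selected; falls back to Python Adam.	Set optimizer.type to Adam or AdamW in the DeepSpeed config so DeepSpeed picks DeepSpeedCPUAdam, or pass a deepspeed.ops.adam.DeepSpeedCPUAdam instance as the client optimiser.
NCCL hang on Stage 3 launch	Communication buckets overflow comm buffer.	Lower reduce_bucket_size and stage3_prefetch_bucket_size to 2e8 (allgather_bucket_size applies to Stage 1/2 only); raise NCCL_BUFFSIZE.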
NVMe Infinity throughput < 1 GB/s	Single drive or HDD instead of RAID-0 NVMe.	Provision 4+ NVMe RAID-0; pin offload_param.nvme_path to it.
Loss NaN on FP16	Dynamic loss scale collapsed.	Switch to bf16; or set fp16.initial_scale_power lower.
Optimiser step 5x slower than expected	AdamW called on full param replica (not the sharded slice).	Verify zero_optimization.stage > 0; check optimiser is DeepSpeed-wrapped.
Gradient overflow warnings every step	FP16 + Adam epsilon too small.	Switch to bf16; or raise eps to 1e-6 from 1e-8.
Stage 3 with TP fails on shape mismatch	ZeRO-3 not aware of TP partitioning by default.	Use Megatron-DeepSpeed and set zero_optimization.stage=1 with TP instead.
Checkpoint save takes > 10 minutes	Sharded checkpoint write to slow filesystem.	Use parallel writes; place save dir on Lustre or NVMe; consider torch_dist format.
zero_to_fp32.py OOMs	Tries to load all shards into one process.	Run on a CPU box with 2x model-size RAM; or use the streaming variant.

Where this fits in the Yobitel stack

DeepSpeed ZeRO is one of the supported training engines on Yobitel sovereign GPU tenancies. For customers running single-node 70B fine-tuning on 8x H100, the canonical path is Stage 3 + CPU offload with the HuggingFace Trainer integration and a pre-validated Slurm/Pyxis launch script. For multi-node pretraining above 30B, Yobitel recommends Megatron-LM (via NeMo) for true 3D parallelism and uses DeepSpeed primarily when an existing customer codebase is already DeepSpeed-shaped.

Yobitel's sovereign training tenancies (London-1, Frankfurt-1) ship DeepSpeed alongside Megatron, NeMo, FSDP, and torchtune in pre-built NGC-derived containers. NCCL and InfiniBand tuning, NVMe RAID-0 configuration for ZeRO-Infinity, and DCGM observability dashboards are pre-applied. Trained checkpoints flow into Yobibyte's inference path (vLLM, TensorRT-LLM) after the standard HuggingFace conversion.

References

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models · arXiv (Rajbhandari et al., 2019)
ZeRO-Offload: Democratizing Billion-Scale Model Training · arXiv (Ren et al., 2021)
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning · arXiv (Rajbhandari et al., 2021)
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training · arXiv (Wang et al., 2023)
DeepSpeed on GitHub · GitHub (Microsoft)
DeepSpeed Documentation · Microsoft DeepSpeed
HuggingFace Transformers DeepSpeed integration · HuggingFace

TL;DR

ZeRO (Zero Redundancy Optimizer), introduced by Microsoft DeepSpeed in 2019 (Rajbhandari et al., arXiv:1910.02054), removes the memory redundancy of vanilla data parallelism by sharding optimiser state (Stage 1), gradients (Stage 2), and parameters (Stage 3). MIT-licensed, hosted at github.com/microsoft/DeepSpeed.
Stage 3 reduces per-GPU memory roughly N-fold for the same model, at the cost of two extra AllGather + one ReduceScatter per layer per step to reconstruct full parameters on demand. ZeRO-Offload extends to CPU RAM; ZeRO-Infinity to NVMe.
Drives the single-node 70B fine-tuning case (Stage 3 + CPU offload on 4-8x H100) and remains the production memory-saving strategy in Megatron-DeepSpeed and the HuggingFace Trainer / Accelerate stack. Architecturally interchangeable with PyTorch FSDP, with which it has converged functionally.
Drives every memory line on the DeepSpeed config JSON: `zero_optimization.stage`, `offload_optimizer`, `offload_param`, `contiguous_gradients`, `overlap_comm`, `reduce_bucket_size`, `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`. Tuning these is the bulk of DeepSpeed ops work.

Overview

Quick start

# 1. Install DeepSpeed and write a ZeRO-3 config
pip install "deepspeed>=0.14.0" transformers accelerate bitsandbytes datasets

cat > ds_zero3.json <<'JSON'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
JSON

# 2. Launch fine-tuning on 4x A100 80GB
deepspeed --num_gpus=4 train.py \
    --deepspeed ds_zero3.json \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset_name yahma/alpaca-cleaned \
    --output_dir ./llama3-8b-sft \
    --bf16 true \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_strategy steps --save_steps 500

# 3. The HuggingFace Trainer wires it up via TrainingArguments(deepspeed="ds_zero3.json")
python - <<'PY'
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    output_dir="./llama3-8b-sft",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    deepspeed="ds_zero3.json",
)
# Trainer(model, args, ...).train()
PY

Tip: The auto values for bucket and prefetch sizes are computed by DeepSpeed from the model and world size — leave them as auto for the first run, then tune only if profiling shows AllGather under-overlapped.

How it works

ZeRO-1: optimiser state sharded; ~4x memory reduction vs DDP; communication ~same as DDP.
ZeRO-2: + gradients sharded; ~8x memory reduction; communication ~same as DDP (ReduceScatter + AllGather).
ZeRO-3: + parameters sharded; ~N-fold memory reduction; communication ~1.5x DDP.
ZeRO-Offload: CPU spill for optimiser state and gradients; Adam runs on CPU.
ZeRO-Infinity: NVMe spill for parameters and optimiser state; bandwidth bounded by RAID-0.
ZeRO++ (2023): hierarchical partitioning + quantised weights/gradients; cuts cross-node traffic 4x.
Compose with: gradient checkpointing (activations), tensor parallelism (per Megatron-DeepSpeed), pipeline parallelism (per DeepSpeed pipeline module).

Note: Memory savings are quoted relative to vanilla DDP with Adam at mixed precision (16 bytes/param/rank). Real workloads also carry activation memory, which ZeRO does not touch — pair Stage 3 with gradient_checkpointing=True for long-context training above 8K.

Reference and specifications

Config key	Type	Default	Description
zero_optimization.stage	int	0	0 = off, 1 = optimiser sharding, 2 = + grads, 3 = + params.
zero_optimization.offload_optimizer.device	string	(unset)	Offload target for optimiser state: cpu or nvme.
zero_optimization.offload_optimizer.nvme_path	path	/local_nvme	Path on NVMe filesystem for the optimiser-state spill.
zero_optimization.offload_optimizer.pin_memory	bool	false	Pin CPU memory for higher H<->D bandwidth.
zero_optimization.offload_param.device	string	(unset)	Offload target for parameters (Stage 3 only): cpu or nvme.
zero_optimization.offload_param.nvme_path	path	/local_nvme	Path on NVMe for parameter spill.
zero_optimization.overlap_comm	bool	false	Overlap collective comms with backward compute (~10-20 percent uplift).
zero_optimization.contiguous_gradients	bool	true	Copy grads into a contiguous buffer as they are produced, ahead of ReduceScatter; reduces fragmentation (recommended).
zero_optimization.reduce_bucket_size	int	5e8	Elements (not bytes) per reduce bucket; smaller = lower latency, larger = higher bandwidth.
zero_optimization.allgather_bucket_size	int	5e8	Elements (not bytes) per AllGather bucket (Stage 1/2).
zero_optimization.stage3_prefetch_bucket_size	int	5e7	Elements prefetched ahead for the next layer's AllGather; accepts auto under the HuggingFace integration.
zero_optimization.stage3_param_persistence_threshold	int	1e5	Params smaller than this stay replicated (avoids per-step AllGather); accepts auto under the HuggingFace integration.
zero_optimization.stage3_max_live_parameters	int	1e9	Cap on the number of gathered parameters resident in GPU memory at once.
zero_optimization.stage3_max_reuse_distance	int	1e9	Keep a gathered param if it is reused within this many parameters; larger = less re-gathering, more memory.
zero_optimization.stage3_gather_16bit_weights_on_model_save	bool	false	Materialise full BF16 weights at save; required for HF export.
zero_optimization.sub_group_size	int	1e9	Stage 3 grouping for partitioned optimiser step; tune for large models.
zero_optimization.cpu_offload	bool	(deprecated)	Legacy alias for offload_optimizer; use the explicit form.
zero_optimization.zero_quantized_weights	bool	false	ZeRO++ quantised weight communication (INT8).
zero_optimization.zero_hpz_partition_size	int	(off)	ZeRO++ hierarchical partitioning; size of per-node group.
bf16.enabled	bool	false	Enable BF16 mixed-precision training.
fp16.enabled	bool	false	Enable FP16 + loss scaling. BF16 preferred on Ampere+.
fp16.loss_scale	float	0 = dynamic	0 enables dynamic loss scaling.
gradient_accumulation_steps	int	1	Effective batch = micro_batch x dp_size x grad_accum.
gradient_clipping	float	0 = off	Global gradient-norm clipping threshold.
train_micro_batch_size_per_gpu	int	(required)	Per-rank micro-batch.
optimizer.type	string	(required)	Optimiser name, e.g. Adam, AdamW, Lamb.
scheduler.type	string	(unset)	LR scheduler name, e.g. WarmupLR.
activation_checkpointing.partition_activations	bool	false	Partition activations across MP group (Megatron-DeepSpeed).
wall_clock_breakdown	bool	false	Per-phase timing breakdown; useful for tuning.
zero_allow_untested_optimizer	bool	false	Required to use optimisers other than the official list.

Warning: stage3_gather_16bit_weights_on_model_save: false (the default) writes Stage 3 sharded checkpoints that HuggingFace from_pretrained cannot load directly. Either set this to true (uses 2x model size GPU memory at save) or use DeepSpeed's zero_to_fp32.py script to convert offline.
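
A few of the constraints in this reference can be checked mechanically before launching a job. The sketch below is a hypothetical helper (lint_zero_config is not part of DeepSpeed) that encodes only rules stated in this section, plus the documented DeepSpeed constraint that parameter offload applies to Stage 3.

```python
def lint_zero_config(cfg: dict) -> list:
    """Return human-readable warnings for a DeepSpeed config dict.

    Hypothetical helper, not a DeepSpeed API; it covers a handful of
    common mistakes and is not a full schema validation.
    """
    warnings = []
    zero = cfg.get("zero_optimization", {})
    stage = zero.get("stage", 0)

    # Parameter offload is a Stage 3 feature.
    if "offload_param" in zero and stage != 3:
        warnings.append("offload_param requires stage 3")

    # stage3_* keys have no effect below Stage 3.
    stale = sorted(k for k in zero if k.startswith("stage3_") and stage != 3)
    if stale:
        warnings.append("stage3_* keys ignored below stage 3: " + ", ".join(stale))

    # Legacy alias listed as deprecated in the table above.
    if "cpu_offload" in zero:
        warnings.append("cpu_offload is deprecated; use offload_optimizer")

    # Stage 3 checkpoints stay sharded unless 16-bit weights are gathered at save.
    if stage == 3 and not zero.get("stage3_gather_16bit_weights_on_model_save", False):
        warnings.append(
            "stage 3 save is sharded; convert with zero_to_fp32.py or "
            "enable stage3_gather_16bit_weights_on_model_save"
        )

    # Only one mixed-precision mode should be enabled.
    if cfg.get("bf16", {}).get("enabled") and cfg.get("fp16", {}).get("enabled"):
        warnings.append("bf16 and fp16 are both enabled")
    return warnings


if __name__ == "__main__":
    bad = {"zero_optimization": {"stage": 2, "offload_param": {"device": "cpu"}}}
    for message in lint_zero_config(bad):
        print("WARN:", message)
```

Running a check like this in CI catches configuration drift early; it does not replace DeepSpeed's own validation at engine start-up.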

Workload patterns

// A — ZeRO-3 + CPU offload for single-node 70B fine-tune
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 2e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 2e-5 } }
}

// B — ZeRO-2 pretrain of 13B on 32 GPUs, throughput-optimised
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 3e-4, "betas": [0.9, 0.95] } },
  "activation_checkpointing": { "partition_activations": false }
}

// C — Megatron-DeepSpeed 3D + ZeRO-1 for 175B pretrain at scale
// (TP=8 and PP=8 are passed to the Megatron-DeepSpeed launcher via --tensor-model-parallel-size and --pipeline-model-parallel-size, not set in this JSON)
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 16,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "optimizer": { "type": "AdamW", "params": { "lr": 1.5e-4 } },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  }
}

Sizing and capacity planning

ZeRO-3 + CPU offload trades 2-3x throughput for 30-50 percent GPU-memory headroom; primarily a fine-tuning lever.
ZeRO-Infinity (NVMe spill) is 5-10x slower than in-memory; use only when no other capacity is available.
Above ~70B, prefer Megatron-DeepSpeed 3D + ZeRO-1 over pure ZeRO-3 — the per-layer AllGather amortises poorly past a certain size.
ZeRO++ (zero_quantized_weights + hierarchical partitioning) cuts cross-node ZeRO-3 traffic 4x on bandwidth-constrained clusters.
Effective batch size = micro_batch x DP x grad_accumulation; for stable Adam at scale aim for 1-4M tokens per step.
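
The effective-batch rule can be turned into a quick token-budget check. The sketch below is illustrative: the function name, the 4096-token sequence length and the 1024-GPU world size are assumptions, since the text does not state them.

```python
def tokens_per_step(world_size: int, micro_batch: int, grad_accum: int,
                    seq_len: int, tp: int = 1, pp: int = 1) -> int:
    """Tokens consumed per optimiser step.

    The data-parallel degree is world_size / (tp * pp); ZeRO shards
    within that DP group.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // (tp * pp)
    return micro_batch * dp * grad_accum * seq_len


if __name__ == "__main__":
    # Pattern B style: 32 GPUs, micro-batch 4, no accumulation.
    # The 4096-token sequence length is an assumption.
    print(tokens_per_step(32, 4, 1, 4096))
    # Pattern C style: TP=8, PP=8, micro-batch 1, accumulation 16,
    # on an assumed 1024-GPU cluster.
    print(tokens_per_step(1024, 1, 16, 4096, tp=8, pp=8))
```

Under these assumptions the first case lands at about 0.5M tokens per step, below the 1-4M band, so it would need gradient accumulation or longer sequences; the second lands at about 1M.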

Model	Cluster	ZeRO stage + offload	Per-rank GPU mem	Per-GPU tok/s
7B	8x A100 80GB	Stage 2	~32 GB	8,500
13B	8x A100 80GB	Stage 2	~58 GB	5,200
13B	32x A100 80GB	Stage 2	~36 GB	5,000
30B	8x A100 80GB	Stage 3	~64 GB	2,800
70B	8x H100 80GB	Stage 3	~74 GB	2,200
70B	8x H100 80GB	Stage 3 + CPU offload	~58 GB	950
70B	32x H100	Stage 3	~48 GB	2,400
175B	8x H100 80GB	Stage 3 + NVMe (Infinity)	~75 GB	180
175B	64x H100	Stage 3	~70 GB	1,400
8x22B MoE	32x H100	Stage 2 + EP	~60 GB	1,800

Limits and quotas

Constraint	Default / ceiling	How to manage
DP group size	= world_size	ZeRO shards within DP; no hard upper bound.
sub_group_size (Stage 3)	1e9	Reduce for very large models (saves opt-step memory).
stage3_max_live_parameters	1e9	Cap on concurrently gathered params; raise to overlap more.
CPU RAM (offload)	Host-bounded	Need ~12 bytes/param of CPU RAM for AdamW offload.
NVMe bandwidth (Infinity)	Drive-bounded	RAID-0 8-drive NVMe sustains 50+ GB/s; required for usable Infinity.
Activation memory	Not addressed by ZeRO	Use activation_checkpointing for long context.
AllGather bucket size	5e8 elements	Larger = better bandwidth but higher latency and more buffer memory.
NCCL communicator count	World-size dependent	Set NCCL_BUFFSIZE, NCCL_NTHREADS; use NCCL >= 2.20.
BF16 hardware	Ampere+	Pre-Ampere needs FP16 + dynamic loss scaling.
Optimiser support	Adam/AdamW/Lamb stock	Custom optimisers need zero_allow_untested_optimizer.
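
The CPU RAM row translates into a simple host-memory check. This is a minimal sketch using the ~12 bytes/param planning figure from the table; the function names and the 15 percent headroom are assumptions, and pinned buffers, in-flight gradients and the OS all need space on top.

```python
def offload_host_ram_gb(params: float, nodes: int = 1,
                        bytes_per_param: float = 12.0) -> float:
    """Host RAM per node (decimal GB) to hold offloaded optimiser state.

    Uses the ~12 bytes/param planning figure; treat the result as a
    lower bound rather than an exact requirement.
    """
    return params * bytes_per_param / nodes / 1e9


def fits_in_host_ram(params: float, host_ram_gb: float, nodes: int = 1,
                     headroom: float = 0.15) -> bool:
    """True if the state fits while leaving `headroom` of RAM free (assumed 15%)."""
    return offload_host_ram_gb(params, nodes) <= host_ram_gb * (1.0 - headroom)


if __name__ == "__main__":
    need = offload_host_ram_gb(70e9)  # 70B parameters on a single node
    print(f"70B optimiser state: {need:.0f} GB of host RAM")
    print("fits in 1 TB:", fits_in_host_ram(70e9, 1000))
    print("fits in 768 GB:", fits_in_host_ram(70e9, 768))
```

A 70B model needs about 840 GB by this estimate, which is why single-node offload setups are usually provisioned with 1 TB or more of host memory.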

Observability

samples_per_second: holds within +-5 percent in steady state; drops correlate with offload bandwidth saturation.
optimizer_step (CPU offload): should be ~0.5-2s at 70B; >5s means CPU-bound Adam.
AllGather/ReduceScatter latency: visible in wall_clock_breakdown; pair with NCCL collective profiling.
host_to_device / device_to_host bandwidth: PCIe Gen4 x16 ~32 GB/s, Gen5 x16 ~64 GB/s per direction (theoretical peak); sustained offload traffic near these figures means the link is the bottleneck.
GPU memory peak per phase: log with deepspeed.runtime.utils.memory_status() to validate stage3_max_live_parameters.
grad_norm: spikes indicate divergence; correlate with LR.
DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_PCIE_TX_BYTES: use together to distinguish ZeRO collectives from CPU offload traffic.

# Prometheus rules for a DeepSpeed ZeRO training job.
# The deepspeed:* series are assumed to come from the job's own exporter or recording rules.
groups:
  - name: deepspeed-training
    interval: 60s
    rules:
      - alert: DeepSpeedStepTimeRegression
        expr: |
          avg_over_time(deepspeed:step_time_seconds[5m]) >
          1.10 * avg_over_time(deepspeed:step_time_seconds[1h] offset 30m)
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "DeepSpeed step time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: DeepSpeedOptimizerStepSlow
        expr: avg_over_time(deepspeed:optimizer_step_seconds[5m]) > 5.0
        for: 15m
        annotations:
          summary: "CPU-offload Adam step > 5s — CPU saturated or PCIe contention"

      - alert: DeepSpeedAllGatherP99
        expr: histogram_quantile(0.99,
                rate(deepspeed:allgather_seconds_bucket[5m])) > 1.0
        for: 10m
        annotations:
          summary: "ZeRO-3 AllGather p99 > 1s — likely interconnect issue"

      - alert: DeepSpeedGradNormSpike
        expr: deepspeed:grad_norm > 10 * avg_over_time(deepspeed:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike — investigate LR or data"

Cost and FinOps

Single-node ZeRO-3 + CPU offload is the cheapest 70B-class fine-tuning option when the team has one 8x H100 box.
Multi-node Megatron-DS hybrid beats ZeRO-3 alone past ~70B because of the cross-node AllGather cost.
NVMe-tier (Infinity) is a budget-of-last-resort option for prototyping a model larger than CPU RAM permits.
BF16 vs FP16: identical cost on Ampere/Hopper, but BF16 saves the engineering time spent debugging loss-scale collapses.

Config	Node	Hours / 1B tokens	Hourly rate	USD / 1B tokens
Stage 2, no offload	8x H100 80GB	Not feasible (OOM)	$24.80	—
Stage 3, no offload	8x H100 80GB	55	$24.80	$1,364
Stage 3 + CPU offload	8x H100 80GB + 1TB DDR5	120	$28.40	$3,408
Stage 3 + NVMe (Infinity)	8x H100 + RAID0 NVMe	320	$26.20	$8,384
Stage 1 (Megatron-DS hybrid)	32x H100 (4 nodes)	18	$99.20	$1,786
FSDP HSDP equivalent	8x H100 80GB	55	$24.80	$1,364
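
The last column of the table is simply hours multiplied by the hourly rate. The sketch below recomputes it from the rows above so the same arithmetic can be rerun with local prices; the helper name is illustrative.

```python
def usd_per_billion_tokens(hours_per_billion: float, hourly_rate_usd: float) -> int:
    """Cost to process 1B tokens, rounded to whole dollars."""
    return round(hours_per_billion * hourly_rate_usd)


# (config, hours per 1B tokens, USD per hour) copied from the table above
ROWS = [
    ("Stage 3, no offload", 55, 24.80),
    ("Stage 3 + CPU offload", 120, 28.40),
    ("Stage 3 + NVMe (Infinity)", 320, 26.20),
    ("Stage 1 (Megatron-DS hybrid)", 18, 99.20),
]

if __name__ == "__main__":
    base = usd_per_billion_tokens(ROWS[0][1], ROWS[0][2])
    for name, hours, rate in ROWS:
        cost = usd_per_billion_tokens(hours, rate)
        print(f"{name:30s} ${cost:>6,}  ({cost / base:.2f}x baseline)")
```

On these figures CPU offload costs about 2.5x the no-offload baseline per token and NVMe offload about 6.1x, while the 4-node hybrid is about 1.3x but finishes in roughly a third of the wall-clock time.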

Security and compliance

Migration and alternatives

From / To	Effort	When to choose	What you keep / lose
DDP -> DeepSpeed ZeRO-1	Trivial — drop in config	Almost always; ZeRO-1 adds no communication volume over DDP.	Keep DDP semantics; gain optimiser sharding.
DDP -> DeepSpeed ZeRO-3	Low — config change	Model state exceeds a single GPU's memory.	Lose direct param access; gain N-fold memory.
DeepSpeed -> FSDP	Medium — code rewrite	PyTorch-native stack, no NVMe offload needed.	Lose NVMe offload (FSDP offers CPU offload only); gain DTensor / TP composition.
FSDP -> DeepSpeed	Medium — config + launcher	Need NVMe offload or Megatron-DS hybrid.	Gain NVMe offload; lose FSDP2 ergonomics.
DeepSpeed -> Megatron-LM	High — different paradigm	True 3D parallelism at frontier scale.	Gain TP+PP; lose configuration ergonomics.
HuggingFace Trainer + DS -> Accelerate + DS	Low	More flexible training loop.	Same engine; different orchestration.

Tip: If you are starting fresh in 2026 on PyTorch and don't need NVMe offload, prefer FSDP2 over DeepSpeed ZeRO. The two are equivalent at the math layer, but FSDP2 composes more cleanly with the rest of the torch.distributed stack. Keep DeepSpeed when you need ZeRO-Infinity, the Megatron-DeepSpeed hybrid, or HuggingFace Trainer with its existing config surface.

Troubleshooting

Symptom	Cause	Fix
OOM during model init with Stage 3	Full model materialised before sharding.	Use `with deepspeed.zero.Init(): model = AutoModel.from_pretrained(...)` context manager.
from_pretrained fails on Stage 3 checkpoint	Sharded weights cannot load directly into HF.	Set stage3_gather_16bit_weights_on_model_save=true OR run zero_to_fp32.py.
Throughput much slower than expected at Stage 3	overlap_comm=false or contiguous_gradients=false.	Set both to true; verify prefetch bucket sizes are non-trivial.
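CPU-offload step time grows linearly with model size	DeepSpeedCPUAdam not selected; a plain PyTorch optimiser is stepping on the CPU.	Define the optimiser in the DeepSpeed config (type Adam or AdamW) so DeepSpeed selects DeepSpeedCPUAdam, or construct deepspeed.ops.adam.DeepSpeedCPUAdam yourself when offloading.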
NCCL hang on Stage 3 launch	AllGather buckets overflow comm buffer.	Lower allgather_bucket_size to 2e8; raise NCCL_BUFFSIZE.
NVMe Infinity throughput < 1 GB/s	Single drive or HDD instead of RAID-0 NVMe.	Provision 4+ NVMe RAID-0; pin offload_param.nvme_path to it.
Loss NaN on FP16	Dynamic loss scale collapsed.	Switch to bf16; or set fp16.initial_scale_power lower.
Optimiser step 5x slower than expected	AdamW called on full param replica (not the sharded slice).	Verify zero_optimization.stage > 0; check optimiser is DeepSpeed-wrapped.
Gradient overflow warnings every step	FP16 + Adam epsilon too small.	Switch to bf16; or raise eps to 1e-6 from 1e-8.
Stage 3 with TP fails on shape mismatch	ZeRO-3 not aware of TP partitioning by default.	Use Megatron-DeepSpeed and set zero_optimization.stage=1 with TP instead.
Checkpoint save takes > 10 minutes	Sharded checkpoint write to slow filesystem.	Use parallel writes; place save dir on Lustre or NVMe; consider torch_dist format.
zero_to_fp32.py OOMs	Tries to load all shards into one process.	Run on a CPU box with 2x model-size RAM; or use the streaming variant.

Where this fits in the Yobitel stack

References

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models · arXiv (Rajbhandari et al., 2019)
ZeRO-Offload: Democratizing Billion-Scale Model Training · arXiv (Ren et al., 2021)
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning · arXiv (Rajbhandari et al., 2021)
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training · arXiv (Wang et al., 2023)
DeepSpeed on GitHub · GitHub (Microsoft)
DeepSpeed Documentation · Microsoft DeepSpeed
HuggingFace Transformers DeepSpeed integration · HuggingFace
