TL;DR
- Adam (Kingma & Ba, 2014, arXiv:1412.6980) tracks per-parameter exponential moving averages of the gradient (first moment) and squared gradient (second moment) to set adaptive per-parameter step sizes.
- AdamW (Loshchilov & Hutter, 2017, arXiv:1711.05101) decouples weight decay from the gradient update, fixing a subtle but important bug in how Adam handled L2 regularisation.
- AdamW is the de facto default optimiser for large-scale Transformer pretraining, including publicly documented model families such as Llama, Qwen and DeepSeek.
- Optimiser state adds two FP32 values per parameter (one EMA for m, one for v), which is why state-sharded optimisers such as ZeRO-1 are standard at scale.
The Algorithm#
At each step, given gradient g_t for parameter θ_t, Adam maintains two exponential moving averages: m_t = β_1 · m_{t-1} + (1 − β_1) · g_t (first moment), v_t = β_2 · v_{t-1} + (1 − β_2) · g_t² (second moment, element-wise).
Both moments are bias-corrected because they start at zero: m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t). The parameter update is θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε). Each parameter gets its own effective learning rate scaled by 1/√(second moment) — a per-parameter adaptive step.
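To make the role of bias correction concrete, here is a minimal plain-Python sketch that runs both EMAs on a constant scalar gradient. The helper name `adam_moments` and the gradient value 0.5 are illustrative assumptions, not part of any library. With a constant gradient g, the corrected estimates should stay at g and g² from the very first step, while the raw EMAs start far too small.

```python
import math

def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Yield (m, v, m_hat, v_hat) for a sequence of scalar gradients.

    Hypothetical helper: it follows the EMA and bias-correction
    formulas from the text, with the step t counted from 1.
    """
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        yield m, v, m_hat, v_hat

g = 0.5  # constant gradient, chosen arbitrarily for the demonstration
history = list(adam_moments([g] * 5))

# After one step the raw EMAs are still close to their zero initialisation.
m1, v1, m_hat1, v_hat1 = history[0]
raw_ratio = m1 / math.sqrt(v1)                # step direction without correction
corrected_ratio = m_hat1 / math.sqrt(v_hat1)  # step direction with correction

print(f"raw m/sqrt(v) at t=1:       {raw_ratio:.3f}")
print(f"corrected m/sqrt(v) at t=1: {corrected_ratio:.3f}")
```

Without the correction, the first update would be scaled by (1 − β_1)/√(1 − β_2), which is roughly 3.16 for these defaults, instead of 1.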
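# AdamW update: the standard form used for Transformer pretraining.
import torch

@torch.no_grad()
def adamw_step(param, grad, m, v, step, lr, beta1, beta2, eps, weight_decay):
    # Update the first- and second-moment EMAs in place (step counts from 1).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-correct both moments, which start at zero.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay (the W in AdamW).
    param.mul_(1 - lr * weight_decay)
    # Adaptive gradient step.
    param.addcdiv_(m_hat, v_hat.sqrt().add(eps), value=-lr)
Why Adam Works#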
Adam combines momentum (the first-moment EMA smooths noisy gradients) with RMSProp-style adaptive scaling (the second-moment EMA normalises by historical gradient magnitude). The effect is that parameters with consistently small gradients still take meaningful steps, while parameters with bursty gradients are damped — both desirable behaviours for sparse, high-dimensional problems like deep learning.
Adam also handles the heterogeneous gradient scales typical of neural networks — embedding layers, attention projections and FFN matrices have wildly different gradient magnitudes — without needing per-layer hand-tuned learning rates.
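A small numerical sketch of this insensitivity to gradient scale: two scalar parameters receive constant gradients that differ by six orders of magnitude, and the size of an Adam step is compared with a plain SGD step. The gradient values, learning rate and the helper name `adam_step_size` are assumptions chosen for illustration.

```python
import math

def adam_step_size(g, steps=10, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return |update| at the final step for a constant scalar gradient g."""
    m = v = 0.0
    update = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(update)

# Stand-ins for two layers whose gradients live on very different scales.
small_grad, large_grad = 1e-3, 1e3
lr = 1e-3

adam_small = adam_step_size(small_grad, lr=lr)
adam_large = adam_step_size(large_grad, lr=lr)
sgd_small = lr * small_grad
sgd_large = lr * large_grad

print(f"Adam step sizes: {adam_small:.2e} vs {adam_large:.2e}")
print(f"SGD  step sizes: {sgd_small:.2e} vs {sgd_large:.2e}")
```

Under Adam both parameters move by roughly the learning rate, whereas the SGD steps differ by the same factor as the gradients. Real gradients are noisy rather than constant, so this only illustrates the normalisation, not full training behaviour.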
The Weight Decay Bug AdamW Fixed#
In common Adam implementations, L2 regularisation was applied by adding λ · θ to the gradient: g_t ← g_t + λ · θ_t. With adaptive scaling, the decay term is divided by √v̂ along with the rest of the gradient, so parameters with a large second moment are decayed less than parameters with a small second moment. That is not what weight decay is supposed to do: the regularisation strength should not depend on gradient history.
Loshchilov and Hutter's 2017 fix is simple in retrospect: apply the decay directly to the parameter, outside the adaptive update, as θ ← (1 − η · λ) · θ before the gradient step. This decouples regularisation strength from gradient magnitude. The authors reported better generalisation in their image-classification experiments, and AdamW has since become the standard choice for Transformer training.
Several deep learning frameworks shipped an 'Adam with weight decay' option for years that actually implemented the coupled L2 form. If you are reproducing a paper from before 2018, check carefully which variant was used, because the two are not equivalent.
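The difference between the two forms can be sketched with a toy experiment in plain Python. A single scalar weight starts at 1.0 and receives a zero-mean gradient that alternates in sign, so any systematic shrinkage comes from the decay term alone. The function name `run`, the gradient scales and the step count are assumptions chosen for illustration.

```python
import math

def run(scale, mode, steps=200, lr=1e-3, wd=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one scalar weight on zero-mean gradients of a given scale.

    mode="l2":        decay folded into the gradient (g + wd * theta).
    mode="decoupled": decay applied directly to the weight (AdamW-style).
    """
    if mode not in ("l2", "decoupled"):
        raise ValueError(f"unknown mode: {mode}")
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Alternating sign gives a zero-mean stand-in for the loss gradient.
        g = scale if t % 2 == 0 else -scale
        if mode == "l2":
            g = g + wd * theta
        else:
            theta *= 1 - lr * wd
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

l2_small, l2_large = run(0.01, "l2"), run(10.0, "l2")
dec_small, dec_large = run(0.01, "decoupled"), run(10.0, "decoupled")

print(f"L2 in gradient : small-grad weight {l2_small:.4f}, large-grad weight {l2_large:.4f}")
print(f"Decoupled decay: small-grad weight {dec_small:.4f}, large-grad weight {dec_large:.4f}")
```

In the coupled form, the weight with small gradients shrinks much faster than the weight with large gradients, even though both use the same λ. In the decoupled form, the two weights end up at almost the same value, because the shrinkage no longer passes through the 1/√v̂ scaling.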
Hyperparameters in Practice#
The canonical defaults β_1 = 0.9, β_2 = 0.999, ε = 1e-8 work well for most workloads. For Transformer pretraining the community converged on β_2 = 0.95 (faster adaptation to changing gradient scales), with ε between 1e-8 and 1e-5. Learning rates follow a schedule (typically linear warm-up followed by cosine or linear decay), with peak values usually between 1e-4 and 6e-4 for AdamW on language modelling. Weight decay is typically 0.1, with embedding and norm parameters excluded.
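As a rough illustration of these settings, the sketch below collects the pretraining-style hyperparameters in a dictionary and implements a linear warm-up followed by cosine decay. The names `ADAMW_CONFIG` and `lr_at`, the peak learning rate of 3e-4, the 10% floor and the step counts are assumptions for the example, not values taken from any particular training run.

```python
import math

# Pretraining-style settings in the ranges described above (illustrative).
ADAMW_CONFIG = {
    "betas": (0.9, 0.95),
    "eps": 1e-8,
    "weight_decay": 0.1,   # applied only to parameters that are not excluded
}

def lr_at(step, peak_lr=3e-4, min_lr=3e-5,
          warmup_steps=2000, total_steps=100_000):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; the last warm-up step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1999, 2000, 51_000, 100_000):
    print(f"step {s:>7}: lr = {lr_at(s):.2e}")
```

The schedule is a pure function of the step index, so it can be evaluated ahead of time or plugged into whatever scheduler hook the training framework provides.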
Memory Cost and Sharding#
In mixed-precision training, AdamW keeps m, v and an FP32 master copy of the parameters in optimiser state. With BF16 parameters, FP32 master copies and FP32 moments, the optimiser state alone is 12 bytes per parameter, 6× the 2 bytes per parameter of the BF16 weights. For a 70B model that is 840 GB, far more than a single GPU can hold.
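The arithmetic can be checked in a few lines of Python. This is a back-of-the-envelope sketch that assumes the mixed-precision layout described above (BF16 weights, FP32 master copy, FP32 m and v) and uses decimal gigabytes. The constant and variable names are illustrative.

```python
BYTES_FP32 = 4
BYTES_BF16 = 2

# Optimiser state per parameter: FP32 master weight + FP32 m + FP32 v.
STATE_BYTES_PER_PARAM = BYTES_FP32 * 3

n_params = 70_000_000_000  # a 70B-parameter model

state_gb = n_params * STATE_BYTES_PER_PARAM / 1e9   # optimiser state
weights_gb = n_params * BYTES_BF16 / 1e9            # BF16 weights only
ratio = STATE_BYTES_PER_PARAM / BYTES_BF16          # state relative to weights

print(f"optimiser state: {state_gb:.0f} GB")
print(f"BF16 weights:    {weights_gb:.0f} GB")
print(f"state / weights: {ratio:.0f}x")
```

Gradients and activations come on top of these figures, so the real per-device footprint is larger still.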
ZeRO-1 (DeepSpeed, Microsoft, 2019) shards the optimiser state across data-parallel ranks. ZeRO-2 also shards the gradients, and ZeRO-3 additionally shards the parameters themselves. PyTorch's FSDP, in its full-sharding mode, follows the ZeRO-3 design. Large-scale training runs generally use some form of optimiser-state sharding.
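A rough per-rank memory estimate for each ZeRO stage can be sketched as follows. It assumes 2 bytes per parameter for BF16 weights, 2 bytes for BF16 gradients and 12 bytes for FP32 optimiser state, and it ignores activations, communication buffers and fragmentation. The function name `zero_bytes_per_rank` and the choice of 64 ranks are illustrative.

```python
def zero_bytes_per_rank(n_params, n_ranks, stage):
    """Approximate model-state bytes held by one data-parallel rank.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimiser state sharded
    stage 2: optimiser state and gradients sharded
    stage 3: optimiser state, gradients and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2 * n_params    # BF16 weights
    grads = 2 * n_params     # BF16 gradients (assumed)
    opt = 12 * n_params      # FP32 master copy + m + v
    if stage >= 1:
        opt /= n_ranks
    if stage >= 2:
        grads /= n_ranks
    if stage >= 3:
        params /= n_ranks
    return params + grads + opt

n_params, n_ranks = 70_000_000_000, 64
per_stage_gb = {
    stage: zero_bytes_per_rank(n_params, n_ranks, stage) / 1e9
    for stage in range(4)
}
for stage, gb in per_stage_gb.items():
    print(f"ZeRO stage {stage}: {gb:,.1f} GB per rank")
```

Most of the saving arrives at stage 1, because the optimiser state is the largest of the three components.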
Successors and Challengers#
AdamW's dominance has been challenged but never displaced for general Transformer pretraining. Lion (Chen et al., 2023) cuts memory by keeping a single momentum buffer and applying the sign of the interpolated update, with no second-moment estimate. Sophia (Liu et al., 2023) uses a lightweight diagonal curvature estimate as a preconditioner, aiming for faster convergence. Adafactor (Shazeer & Stern, 2018) factorises v into row and column statistics, dramatically cutting memory, and was used for T5.
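To put the memory differences in numbers, the sketch below counts optimiser-state elements for a single weight matrix. It assumes AdamW stores m and v, Lion stores one momentum buffer, and Adafactor runs without momentum so that only the row and column statistics of the factored second moment are kept. The function name and the 4096 × 4096 shape are illustrative.

```python
def state_elements(rows, cols):
    """Count optimiser-state values kept for one rows x cols weight matrix."""
    n = rows * cols
    return {
        "adamw": 2 * n,            # m and v: two values per weight
        "lion": n,                 # one momentum value per weight
        "adafactor": rows + cols,  # factored second moment, no momentum (assumed)
    }

counts = state_elements(4096, 4096)
for name, count in counts.items():
    print(f"{name:>9}: {count:,} state values")
```

The factored form grows with the sum of the matrix dimensions rather than their product, which is where Adafactor's saving comes from.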
As of 2026, AdamW remains, as far as public reports indicate, the default for Llama, Qwen, Mistral and DeepSeek pretraining. Lion has seen some adoption for fine-tuning. The dethroning never quite happened.