TL;DR
- Order the training distribution to start with 'easier' examples and progress to 'harder' ones; formalised by Bengio et al., 2009.
- In LLM practice it shows up as data-mixing schedules (more web text early, more code/math/long-context late) and length-based curricula (short sequences early, long contexts late).
- Most consistent wins: pretraining throughput (short-context warmup), long-context extension (graduated context length), and multimodal training (vision-language data ordering).
Overview
Curriculum learning is the idea, borrowed from how humans and animals are taught, that models learn faster and reach better solutions when the training data is presented in an easy-to-hard order rather than sampled uniformly at random. It was formalised by Bengio et al., 2009 (ICML).
In modern transformer training the strict interpretation rarely applies, but two pragmatic variants do. First, data-mixing schedules: the proportion of web text, code, math, scientific text, and conversational data is varied across pretraining. Second, sequence-length curricula: for example, train on 2k-token sequences for roughly the first 80 % of pretraining, then graduate to 32k and 128k contexts in the final phase, which is cheaper than training on 128k throughout.
Mechanism
Data-mixing curricula are implemented at the data-loader level: each global batch samples micro-batches from named datasets in proportions that vary across training steps. Common schedules are linear ramps and step changes at fixed milestones; the weights themselves can be hand-tuned or estimated by learned re-weighting methods (DoReMi, RegMix).
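A linear-ramp mixing schedule of the kind described above can be sketched in a few lines. This is a minimal illustration, not any framework's API: the dataset names, the start and end weights, and the helpers `mix_at` and `sample_sources` are all assumptions made for the example.

```python
import random

# Hypothetical start/end mixture weights; the dataset names and numbers are
# illustrative assumptions, not taken from any published recipe.
START_MIX = {"web": 0.80, "code": 0.10, "math": 0.10}
END_MIX = {"web": 0.50, "code": 0.30, "math": 0.20}


def mix_at(step: int, total_steps: int) -> dict[str, float]:
    """Linearly interpolate mixture weights from START_MIX to END_MIX."""
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    return {name: (1 - t) * START_MIX[name] + t * END_MIX[name] for name in START_MIX}


def sample_sources(step: int, total_steps: int, n_micro_batches: int,
                   rng: random.Random) -> list[str]:
    """Pick the source dataset for each micro-batch of one global batch."""
    weights = mix_at(step, total_steps)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=n_micro_batches)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50_000, 100_000):
        print(step, mix_at(step, 100_000), sample_sources(step, 100_000, 8, rng))
```

A step-change schedule would replace the interpolation in `mix_at` with a lookup of the last milestone reached; the sampling code stays the same.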
Length curricula vary the per-step sequence length. Attention cost per sequence grows as O(L²), so the cost per token grows with L; short sequences therefore process more tokens per FLOP, and the early phase trains faster in wall-clock terms. The late phase adapts the model to long context. Variants of this pattern appear in the Llama 3, Nemotron, and DeepSeek-V3 pretraining recipes.
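As a rough sketch of a length curriculum, the schedule can be a list of milestones, with the number of sequences per batch adjusted so that the token count per batch stays constant. The milestone fractions, the token budget, and the function names below are assumptions chosen for illustration, not values from a published recipe.

```python
# Hypothetical milestones: (fraction of training completed, sequence length).
# The values are illustrative assumptions, not a published recipe.
LENGTH_SCHEDULE = [(0.00, 2_048), (0.80, 32_768), (0.90, 131_072)]
TOKENS_PER_BATCH = 4 * 1024 * 1024  # assumed fixed token budget per global batch


def seq_len_at(step: int, total_steps: int) -> int:
    """Return the sequence length of the last milestone already reached."""
    progress = step / total_steps
    current = LENGTH_SCHEDULE[0][1]
    for start, length in LENGTH_SCHEDULE:
        if progress >= start:
            current = length
    return current


def batch_shape_at(step: int, total_steps: int) -> tuple[int, int]:
    """Return (sequences per batch, sequence length) at a constant token budget."""
    length = seq_len_at(step, total_steps)
    return TOKENS_PER_BATCH // length, length
```

Holding tokens per batch fixed is one possible design choice; it keeps the learning-rate schedule comparable across phases, although real recipes may also change the batch size or learning rate when the context length changes.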
Performance Characteristics
- Length curriculum: 1.5-3× wall-clock speedup vs uniform long-context training, typically without quality loss provided the late phase is given enough long-context training tokens.
- Data-mixing schedules: 0.5-2 % benchmark improvement is typical when tuned; gains are noisy and architecture-dependent.
- Multimodal curricula (image-text pretraining first, video later) show clearer wins in published recipes.
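To see where a length-curriculum speedup can come from, a back-of-envelope FLOP estimate helps. The sketch below uses the common scaling-law approximation of about 6N training FLOPs per token for the weights plus an attention term that grows linearly with sequence length. It ignores causal-masking savings, kernel efficiency, and memory effects, so it estimates compute rather than wall-clock time. The model dimensions in the usage line are hypothetical.

```python
def train_flops_per_token(n_params: float, n_layers: int, d_model: int,
                          seq_len: int) -> float:
    """Approximate training FLOPs per token.

    dense:     6 * N              (forward + backward through the weights)
    attention: 6 * layers * d * L (score and value products; grows with L)
    """
    dense = 6.0 * n_params
    attention = 6.0 * n_layers * d_model * seq_len
    return dense + attention


if __name__ == "__main__":
    # Hypothetical 8B-parameter model with 32 layers and d_model = 4096.
    short = train_flops_per_token(8e9, 32, 4096, 2_048)
    long = train_flops_per_token(8e9, 32, 4096, 131_072)
    print(f"cost per token at 128k relative to 2k: {long / short:.2f}x")
```

The ratio depends strongly on model size: for larger models the 6N term dominates for longer, so the relative saving from short sequences is smaller.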
When to Use
Apply a length curriculum to almost every modern LLM pretraining run — it is essentially free wall-clock savings. Apply data-mixing schedules when you have a long-enough run to amortise tuning the schedule. For fine-tuning, curricula are rarely worth the complexity unless the dataset is highly heterogeneous (e.g. mixed-domain SFT).
Llama 3's recipe (context length increased gradually from 8k to 128k in a final long-context stage, reported as six steps over roughly 800B tokens, a small fraction of the full run) is now a near-standard pattern for long-context models. The intuition is that the model has already learned its representations on short sequences and mainly needs to adapt to longer positions in the late phase.
Pitfalls
- Strict 'easy to hard' is hard to define for LLM data — most published curricula are dataset-composition schedules, not difficulty rankings.
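- Length curricula need positional encodings that can be extended to longer contexts (RoPE with appropriate base/theta scaling, ALiBi); learned absolute position embeddings with a fixed maximum length cannot represent positions beyond that length and will not survive the jump.
- Curriculum schedules complicate reproducibility and tuning: the schedule is an extra set of hyperparameters, and every variant is a separate run.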
- Some published gains do not survive larger model scales; budget for an ablation.
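The RoPE point can be made concrete by looking at how the base (theta) sets the longest rotation wavelength. The sketch below uses the standard RoPE frequency formula; the helper names and the head dimension of 128 are assumptions for illustration. A wavelength longer than the target context is only a rough diagnostic, not a guarantee that a model handles that context.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Inverse frequencies base^(-2i/head_dim) for i = 0 .. head_dim/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions spanned by one full rotation of the slowest-rotating pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    # 10,000 is the original RoPE default; 500,000 is the base reported
    # in the Llama 3 paper. head_dim = 128 is an assumed value.
    for base in (10_000.0, 500_000.0):
        print(f"base={base:>9.0f}  longest wavelength ~ "
              f"{longest_wavelength(128, base):,.0f} positions")
```

Raising the base stretches the slow-rotating dimensions so that distant positions remain distinguishable, which is one reason long-context recipes pair the length jump with a larger base or another scaling scheme.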
Software
- Megatron-LM and NeMo both support blending multiple datasets with per-dataset weights; a schedule can be approximated by changing the weights between training phases, typically when resuming from a checkpoint.
- DoReMi (Xie et al., 2023): learned domain reweighting for pretraining data mixes.
- RegMix (Liu et al., 2024): regression-based mix prediction.
- Custom data loaders in any framework can implement length curricula with a few lines of code.
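As an illustration of how small such a loader can be, the sketch below cuts a flat token stream into equal-length sequences whose length is looked up per step. The generator name and the `seq_len_for_step` callable are hypothetical, and the sketch ignores document boundaries, padding, and shuffling, which a real loader would need to handle.

```python
from typing import Callable, Iterable, Iterator


def length_curriculum_batches(
    tokens: Iterable[int],
    seq_len_for_step: Callable[[int], int],
    tokens_per_batch: int,
) -> Iterator[list[list[int]]]:
    """Yield batches of equal-length sequences cut from a flat token stream."""
    step = 0
    buffer: list[int] = []
    for tok in tokens:
        buffer.append(tok)
        seq_len = seq_len_for_step(step)  # scheduled length for this step
        n_seqs = max(tokens_per_batch // seq_len, 1)
        needed = n_seqs * seq_len
        if len(buffer) >= needed:
            # Slice the buffer into n_seqs consecutive sequences of seq_len tokens.
            yield [buffer[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]
            buffer = buffer[needed:]
            step += 1


if __name__ == "__main__":
    # Toy schedule: length 2 for the first step, length 4 afterwards.
    for batch in length_curriculum_batches(range(16), lambda s: 2 if s == 0 else 4, 8):
        print(batch)
```

Tokens left in the buffer when the stream ends are dropped in this sketch; whether to pad or discard a final partial batch is a design choice left open here.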
References
- Curriculum Learning · ICML (Bengio et al., 2009)
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining · arXiv (Xie et al., 2023)
- The Llama 3 Herd of Models · arXiv (Meta, 2024)