TL;DR
- Rejection sampling fine-tuning (RSFT) is an iterative post-training method: generate K candidate responses per prompt, score them, keep the best, train on those examples, repeat.
- Used as a stage of Llama 2's post-training pipeline (Touvron et al., 2023) and a building block of many alignment recipes — sometimes called 'best-of-N SFT' or 'expert iteration'.
- Conceptually a stripped-down RLHF: the reward model picks winners, but instead of running PPO you simply do SFT on the winners. It is cheaper than PPO and often recovers most of the quality.
- Strong baseline for reasoning training (RFT in DeepSeekMath, STaR for chain-of-thought bootstrapping) and a close relative of GRPO-style online learning, which DeepSeekMath analyses in the same framework as RFT.
The Recipe
Rejection sampling fine-tuning has four steps per iteration. Start from an initial policy (typically a model that has already been SFT'd), then:
1. Sample: for each prompt in the training set, draw K responses at non-zero temperature.
2. Score: rate every response with a reward model, a judge LLM, or a programmatic rule (e.g. unit tests pass).
3. Select: keep the highest-scoring response per prompt.
4. Train: run standard SFT on the resulting (prompt, best response) pairs.

Repeat for several iterations, sampling from the updated policy each time.
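The four steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `generate`, `score`, and `train_sft` are hypothetical callables standing in for whatever sampling, reward, and training code a real pipeline uses.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interfaces, assumed for illustration only:
#   generate(prompt, k)     -> list of k sampled responses
#   score(prompt, response) -> scalar reward (higher is better)
#   train_sft(generate, data) -> an updated generate function
Generate = Callable[[str, int], List[str]]
Score = Callable[[str, str], float]


def rsft_select(
    prompts: Sequence[str], generate: Generate, score: Score, k: int = 8
) -> List[Dict[str, str]]:
    """Sample k candidates per prompt and keep the highest-scoring one."""
    sft_examples = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        if not candidates:
            continue  # nothing was sampled for this prompt
        best = max(candidates, key=lambda response: score(prompt, response))
        sft_examples.append({"prompt": prompt, "response": best})
    return sft_examples


def rsft_loop(prompts, generate, score, train_sft, iterations=3, k=8):
    """Alternate selection and SFT, sampling from the updated policy each round."""
    for _ in range(iterations):
        data = rsft_select(prompts, generate, score, k)
        generate = train_sft(generate, data)
    return generate
```

Keeping selection separate from training makes it easy to swap the acceptance rule (for example, a reward threshold instead of top-1) without touching the SFT code.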
The intuition is the same as the cross-entropy method in RL: the policy already produces good outputs sometimes; identify them, treat them as targets, and the policy concentrates probability mass on producing more of them. No reward-model gradient, no on-policy KL constraint, no PPO inner loop — just SFT on filtered samples.
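To see the probability mass concentrate, consider a toy policy over three fixed responses. The sketch below rests on two idealisations that are assumptions of this example, not claims from the text: rewards are distinct, and SFT on the winners is run to convergence on unlimited samples, so the new policy equals the distribution of best-of-K winners.

```python
from typing import List


def winner_distribution(probs: List[float], rewards: List[float], k: int) -> List[float]:
    """Probability that each response is the best of k i.i.d. samples from `probs`.

    Assumes all rewards are distinct, so the best-of-k winner is unique.
    """
    order = sorted(range(len(probs)), key=lambda i: rewards[i])
    winners = [0.0] * len(probs)
    cdf_prev = 0.0
    for i in order:
        cdf = cdf_prev + probs[i]
        # P(best reward is r_i) = P(all k samples <= r_i) - P(all k samples < r_i)
        winners[i] = cdf ** k - cdf_prev ** k
        cdf_prev = cdf
    return winners


def expected_reward(probs: List[float], rewards: List[float]) -> float:
    return sum(p * r for p, r in zip(probs, rewards))


probs = [0.5, 0.3, 0.2]    # toy policy over three responses
rewards = [0.0, 1.0, 2.0]  # toy rewards; the rarest response is the best
for iteration in range(3):
    print(iteration, [round(p, 4) for p in probs], round(expected_reward(probs, rewards), 4))
    probs = winner_distribution(probs, rewards, k=4)  # idealised SFT on winners
```

After one idealised iteration with K = 4, the best response's probability rises from 0.2 to about 0.59 and the expected reward from 0.7 to about 1.53. Real SFT moves the policy only part of the way, but in the same direction.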
Where It Has Been Used
| Model | Year | Role of rejection sampling |
|---|---|---|
| Llama 2 Chat | 2023 | Sole RLHF method for iterations RLHF-V1 to V4; PPO applied on top of the rejection-sampling checkpoint in V5 |
| WizardLM-M | 2023 | Self-improving fine-tune loop |
| DeepSeekMath (RFT) | 2024 | Reasoning bootstrap from reward-verified math solutions |
| STaR | 2022 | Self-taught reasoner — bootstrap CoT from solved problems |
| Self-Rewarding LMs | 2024 | Model judges its own outputs to bootstrap preference data |
| DeepSeek-R1 | 2025 | SFT data for the second fine-tuning stage, collected by rejection sampling from the reasoning-RL checkpoint |
Hyperparameters
| Hyperparameter | Typical value |
|---|---|
| K (candidates per prompt) | 4 - 32 |
| Sampling temperature | 0.7 - 1.0 |
| Top-p | 0.9 - 0.95 |
| Acceptance criterion | Top-1 by reward, or the top few (fewer than K) above a reward threshold |
| SFT epochs per iteration | 1 - 2 |
| Iterations | 2 - 5 |
| Learning rate | Lower than initial SFT (1e-6 - 1e-5 for full FT) |
Diversity at the sampling stage matters. If all K candidates look alike, rejection sampling reduces to ordinary SFT on a slightly cleaner dataset. Increase temperature, vary system prompts, or sample from earlier iterations to keep diversity high.
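One rough way to check whether the K candidates actually differ is a distinct-n ratio: unique n-grams divided by total n-grams across the candidate set. The helper below is a sketch; the whitespace tokenisation and the choice of n = 2 are assumptions of this example, and any cut-off for "too similar" would need tuning per task.

```python
from typing import List


def distinct_n(candidates: List[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all candidates.

    Uses whitespace tokens. Values near 1.0 mean the candidates share
    little text; low values mean they largely repeat each other.
    """
    total = 0
    unique = set()
    for text in candidates:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


identical = ["the answer is 4"] * 4
varied = [
    "the answer is 4",
    "we get 4 here",
    "adding gives four total",
    "two plus two equals",
]
print(distinct_n(identical), distinct_n(varied))
```

Tracking this ratio per prompt across iterations gives an early signal that sampling has collapsed before it shows up in the trained model.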
Compared with DPO and PPO
- vs DPO: rejection sampling uses only the winners, throwing away the losers. DPO uses both — winner and loser of each pair — so it extracts more signal per unit of preference data. RSFT is simpler to implement and harder to break.
- vs PPO: rejection sampling generates candidates from the current policy, but updates with plain SFT on the winners rather than a policy-gradient step. Compute cost is lower, infrastructure is simpler, and there is no KL coefficient to tune. The quality ceiling is generally a notch below well-tuned PPO.
- vs simple SFT: rejection sampling adds the bootstrapping loop and the reward signal. It is materially better than fixed-dataset SFT whenever a usable reward exists.
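The difference in data usage between rejection sampling and DPO is easy to see when both are built from the same scored candidates. In this sketch the best-versus-worst pairing for DPO is one simple choice made for illustration; other pairing schemes exist.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (response, reward) pairs for one prompt


def to_rsft_example(prompt: str, scored: Scored) -> Dict[str, str]:
    """Keep only the top-scoring response; all other candidates are discarded."""
    best, _ = max(scored, key=lambda item: item[1])
    return {"prompt": prompt, "response": best}


def to_dpo_pair(prompt: str, scored: Scored) -> Dict[str, str]:
    """Keep the best and the worst response as a (chosen, rejected) pair."""
    chosen, _ = max(scored, key=lambda item: item[1])
    rejected, _ = min(scored, key=lambda item: item[1])
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


scored = [("2 + 2 = 5", 0.1), ("2 + 2 = 4", 0.9), ("It is four.", 0.7)]
print(to_rsft_example("What is 2 + 2?", scored))
print(to_dpo_pair("What is 2 + 2?", scored))
```

The SFT example carries no information about what a bad answer looks like, which is the signal DPO additionally exploits.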
Failure Modes
- Reward hacking — if the reward model has a loophole, rejection sampling concentrates the policy onto that loophole as fast as any RL method.
- Mode collapse — without diversity in sampling, iterations reinforce a single response style and the model becomes monotonous.
- Reward signal saturation — once most candidates score similarly, the filter does nothing and progress stalls.
- Train/test contamination — if the prompts used for rejection sampling include or paraphrase eval prompts, gains evaporate on a clean test set.
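Two of these failure modes lend themselves to cheap automated checks. The helpers below are a sketch: the `min_spread` threshold and the n-gram length are illustrative assumptions, and the overlap check catches verbatim reuse only, not paraphrases.

```python
from typing import List, Set, Tuple


def is_saturated(rewards: List[float], min_spread: float = 0.05) -> bool:
    """True when best and worst candidate rewards differ by less than min_spread.

    `rewards` must be non-empty. A saturated prompt gives the filter nothing
    to choose between, so it can be skipped or replaced with a harder one.
    """
    return (max(rewards) - min(rewards)) < min_spread


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps_eval(prompt: str, eval_prompts: List[str], n: int = 8) -> bool:
    """True when the prompt shares at least one n-gram with any eval prompt."""
    prompt_grams = ngrams(prompt, n)
    return any(prompt_grams & ngrams(eval_prompt, n) for eval_prompt in eval_prompts)


print(is_saturated([0.90, 0.91, 0.92]))  # rewards nearly identical
print(overlaps_eval("What is the sum of 2 and 3", ["Compute: what is the sum of 2 and 3?"], n=3))
```

Running both checks on the prompt pool before each iteration is inexpensive compared with sampling K responses per prompt.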
When to Reach for Rejection Sampling
Use rejection sampling when (a) you can generate many candidates cheaply, (b) you have a reliable reward signal (programmatic check, strong judge, or trained reward model), and (c) you want most of the gains of preference optimisation without the complexity of PPO. It is a common first step for reasoning fine-tuning where unit tests or automated graders give a hard correctness signal; DeepSeek reports using it to collect SFT data in the DeepSeek-R1 pipeline.
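Condition (a) matters most when the reward is pass/fail. If each sample passes independently with probability p, the chance that a prompt yields at least one usable example from K samples is 1 - (1 - p)^K. The snippet below evaluates this under that independence assumption; the pass rate of 0.1 is an illustrative value, not a figure from any of the cited papers.

```python
def prob_at_least_one_pass(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes, given per-sample pass rate p."""
    return 1.0 - (1.0 - p) ** k


for k in (1, 4, 16, 32):
    print(k, round(prob_at_least_one_pass(0.1, k), 3))
```

With a 10% per-sample pass rate, K = 4 yields a usable example for roughly a third of prompts, while K = 32 covers about 97% of them. Hard prompts therefore need a larger K, or they drop out of the training set entirely.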
References
- Llama 2: Open Foundation and Fine-Tuned Chat Models · arXiv (Touvron et al., 2023)
- STaR: Bootstrapping Reasoning With Reasoning · arXiv (Zelikman et al., 2022)
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning · arXiv (Shao et al., 2024)