Rejection Sampling Fine-Tuning

TL;DR

Rejection sampling fine-tuning (RSFT) is an iterative post-training method: generate K candidate responses per prompt, score them, keep the best, train on those examples, repeat.
Used as a stage of Llama 2's post-training pipeline (Touvron et al., 2023) and a building block of many alignment recipes — sometimes called 'best-of-N SFT' or 'expert iteration'.
Conceptually a stripped-down RLHF: the reward model picks winners, but instead of running PPO you simply do SFT on the winners. It is cheaper than PPO and often recovers most of the quality gain.
Strong baseline for reasoning training (RFT as studied in DeepSeekMath, STaR for chain-of-thought bootstrapping), and closely related to GRPO-style online learning, which likewise samples a group of responses per prompt.

The Recipe

Rejection sampling fine-tuning starts from a base model (typically already SFT'd) and runs four steps per iteration: (1) for each prompt in a training set, sample K responses with non-zero temperature; (2) score every response using a reward model, judge LLM, or programmatic rule (e.g. unit tests pass); (3) keep the highest-scoring response per prompt; (4) train the policy with standard SFT on these (prompt, best response) pairs. Repeat for several iterations, sampling from the updated policy each time.
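
The sampling, scoring, and selection steps can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `collect_best_of_k`, `toy_sample`, and `toy_score` are hypothetical names, and the toy "policy" and "reward" are stand-ins for a real model and a real scorer. The SFT step itself is left to whatever trainer is already in use.

```python
import random
from typing import Callable, List, Tuple

def collect_best_of_k(
    prompts: List[str],
    sample: Callable[[str], str],        # hypothetical: draws one response from the policy
    score: Callable[[str, str], float],  # hypothetical: reward for (prompt, response)
    k: int = 8,
) -> List[Tuple[str, str]]:
    """Return one (prompt, best response) pair per prompt."""
    dataset = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        # Ties are broken in favour of the earliest candidate.
        best = max(candidates, key=lambda resp: score(prompt, resp))
        dataset.append((prompt, best))
    return dataset

# Toy stand-ins: the "policy" guesses a digit, the "reward" checks the sum.
rng = random.Random(0)

def toy_sample(prompt: str) -> str:
    return str(rng.randint(0, 9))

def toy_score(prompt: str, response: str) -> float:
    a, b = prompt.split("+")
    return 1.0 if int(response) == int(a) + int(b) else 0.0

# These pairs would then be passed to an ordinary SFT trainer.
sft_pairs = collect_best_of_k(["2+3", "4+4"], toy_sample, toy_score, k=16)
```

Passing the sampler and scorer as callables keeps the loop independent of the serving stack and of whether the reward is a model, a judge, or a rule.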

The intuition is the same as the cross-entropy method in RL: the policy already produces good outputs sometimes; identify them, treat them as targets, and the policy concentrates probability mass on producing more of them. No reward-model gradient, no on-policy KL constraint, no PPO inner loop — just SFT on filtered samples.
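
A toy example may make the concentration effect concrete. The sketch below assumes a "policy" that is just a categorical distribution over three canned responses, and it replaces SFT with an exact maximum-likelihood refit (normalised counts of the winners). Real SFT takes small gradient steps instead, so the shift per iteration is much gentler; the names `rsft_round`, `policy`, and `reward` are illustrative only.

```python
import random
from collections import Counter

def rsft_round(probs, reward, n_prompts=2000, k=4, rng=None):
    """One round: sample k responses per 'prompt', keep the best, refit by counting."""
    rng = rng or random.Random(0)
    actions = list(probs)
    weights = [probs[a] for a in actions]
    winners = Counter()
    for _ in range(n_prompts):
        candidates = rng.choices(actions, weights=weights, k=k)
        winners[max(candidates, key=reward.get)] += 1
    # Maximum-likelihood fit of a categorical distribution = normalised counts.
    return {a: winners[a] / n_prompts for a in actions}

policy = {"good": 0.2, "ok": 0.3, "bad": 0.5}
reward = {"good": 1.0, "ok": 0.5, "bad": 0.0}
for _ in range(3):
    policy = rsft_round(policy, reward)
```

With K = 4 and an initial 20% chance of the best response, the chance that it appears among the candidates (and is therefore selected) is 1 - 0.8^4 = 0.5904, so a single round moves its probability from 0.2 to roughly 0.59 in this idealised setting.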

Where It Has Been Used

Model	Year	Role of rejection sampling
Llama 2 Chat	2023	RLHF iterations V1-V4 used rejection sampling alone; later iterations applied PPO on top of it
WizardLM-M	2023	Self-improving fine-tune loop
DeepSeekMath (RFT)	2024	Reasoning bootstrap from reward-verified math solutions
STaR	2022	Self-taught reasoner — bootstrap CoT from solved problems
Self-Rewarding LMs	2024	Model judges its own outputs to bootstrap preference data
DeepSeek-R1	2025	SFT data rejection-sampled from an RL checkpoint and filtered by correctness checks

Hyperparameters

Hyperparameter	Typical value
K (candidates per prompt)	4 - 32
Sampling temperature	0.7 - 1.0
Top-p	0.9 - 0.95
Acceptance criterion	Top-1 by reward, or the top few candidates above a reward threshold
SFT epochs per iteration	1 - 2
Iterations	2 - 5
Learning rate	Lower than initial SFT (1e-6 - 1e-5 for full FT)
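
One way to carry these settings around is a small config object plus an acceptance function covering both criteria in the table. This is a sketch under assumptions: `RSFTConfig`, `accept`, and every field name are hypothetical, and the defaults are simply points inside the ranges listed above, not recommendations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class RSFTConfig:
    k: int = 8                    # candidates per prompt
    temperature: float = 0.8
    top_p: float = 0.95
    sft_epochs: int = 1
    iterations: int = 3
    learning_rate: float = 5e-6   # full fine-tuning range from the table
    accept_threshold: Optional[float] = None  # None means top-1 by reward
    max_accepted: int = 2         # cap when a threshold is used

def accept(scored: List[Tuple[str, float]], cfg: RSFTConfig) -> List[str]:
    """Select responses from (response, reward) pairs according to cfg."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if cfg.accept_threshold is None:
        return [ranked[0][0]] if ranked else []
    passing = [resp for resp, s in ranked if s >= cfg.accept_threshold]
    return passing[: cfg.max_accepted]
```

Note that the threshold variant can return an empty list for a prompt, which means hard prompts silently drop out of the training set unless that is tracked separately.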

Diversity at the sampling stage matters. If all K candidates look alike, rejection sampling reduces to ordinary SFT on a slightly cleaner dataset. Increase temperature, vary system prompts, or sample from earlier iterations to keep diversity high.
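
A cheap way to notice that candidates "look alike" is a distinct n-gram ratio over the K samples for a prompt. The function below is an illustrative heuristic, not a metric the source prescribes; it assumes whitespace tokenisation, and `distinct_ngram_ratio` is a hypothetical name.

```python
from typing import List

def distinct_ngram_ratio(candidates: List[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all candidates.

    1.0 means no n-gram is repeated; values near 1/len(candidates)
    mean the candidates are near-copies of each other.
    """
    total = 0
    unique = set()
    for text in candidates:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Tracking this value per iteration gives an early signal that temperature or prompt variation needs to go up before the filter stops adding information.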

Compared with DPO and PPO

vs DPO: rejection sampling trains only on the winners and discards the losers. DPO uses both the winner and the loser of each pair, so it extracts more signal per unit of preference data. RSFT is simpler to implement and has fewer moving parts to misconfigure (no reference model and no β to tune).
vs PPO: rejection sampling generates on-policy (candidates come from the current model) but updates with plain SFT on the winners rather than a policy-gradient step. Compute cost is lower, infrastructure is simpler, and there is no KL penalty coefficient to tune. The quality ceiling is generally a notch below well-tuned PPO.
vs simple SFT: rejection sampling adds the bootstrapping loop and the reward signal. It is materially better than fixed-dataset SFT whenever a usable reward exists.
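
The difference in data usage between RSFT and DPO can be shown on the same set of scored candidates. In this sketch the function names are hypothetical, and the `prompt` / `completion` / `chosen` / `rejected` keys follow a common convention rather than a required schema; pairing best against worst is one simple choice among several.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (response, reward) for one prompt

def to_sft_example(prompt: str, scored: Scored) -> Dict[str, str]:
    """RSFT: keep only the top-scoring response."""
    best, _ = max(scored, key=lambda pair: pair[1])
    return {"prompt": prompt, "completion": best}

def to_preference_pair(prompt: str, scored: Scored) -> Dict[str, str]:
    """DPO-style: keep the best and the worst response as a pair."""
    best, _ = max(scored, key=lambda pair: pair[1])
    worst, _ = min(scored, key=lambda pair: pair[1])
    return {"prompt": prompt, "chosen": best, "rejected": worst}
```

Both functions consume the same sampling and scoring work, so switching from RSFT to a preference method later does not require regenerating candidates.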

Failure Modes

Reward hacking: if the reward model has a loophole, rejection sampling concentrates the policy onto that loophole, just as an RL method would.
Mode collapse: without diversity in sampling, iterations reinforce a single response style and the model becomes monotonous.
Reward signal saturation: once most candidates score similarly, the filter no longer discriminates between them and progress stalls.
Train/test contamination: if the prompts used for rejection sampling include or paraphrase eval prompts, reported gains are inflated and evaporate on a clean test set.
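
Two of these failure modes lend themselves to simple automated checks. The sketch below is illustrative only: the function names and default thresholds are assumptions, each reward list is assumed to be non-empty, and the n-gram overlap test catches verbatim reuse but not paraphrases, which need a semantic comparison.

```python
from typing import Iterable, List, Set, Tuple

def saturated_fraction(rewards_per_prompt: List[List[float]],
                       min_spread: float = 1e-3) -> float:
    """Fraction of prompts whose candidates all score within min_spread."""
    if not rewards_per_prompt:
        return 0.0
    flat = sum(1 for r in rewards_per_prompt if max(r) - min(r) <= min_spread)
    return flat / len(rewards_per_prompt)

def _ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(train_prompt: str, eval_prompts: Iterable[str], n: int = 8) -> bool:
    """True if the training prompt shares any word n-gram with an eval prompt."""
    grams = _ngrams(train_prompt, n)
    return any(grams & _ngrams(e, n) for e in eval_prompts)
```

A rising `saturated_fraction` across iterations suggests the prompt set or the reward needs to get harder before another round is worth its sampling cost.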

When to Reach for Rejection Sampling

Use rejection sampling when (a) you can generate many candidates cheaply, (b) you have a reliable reward signal (programmatic check, strong judge, or trained reward model), and (c) you want most of the gains of preference optimisation without the complexity of PPO. It is a common first step for reasoning fine-tuning where unit tests or automated graders give a hard correctness signal; DeepSeek-R1, for example, reports a rejection-sampling SFT stage in its pipeline.
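
As an example of a hard correctness signal, here is a minimal sketch of a programmatic reward for numeric answers. It assumes the final number in the response is the model's answer and that the reference is a numeric string; both are simplifications, and `extract_final_number` and `correctness_reward` are hypothetical names rather than part of any grader library.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(response: str) -> Optional[str]:
    """Return the last number that appears in the response, if any."""
    matches = _NUMBER.findall(response)
    return matches[-1] if matches else None

def correctness_reward(response: str, reference: str) -> float:
    """1.0 if the final number equals the reference value, else 0.0."""
    answer = extract_final_number(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0
```

A binary reward like this cannot rank two correct answers against each other, so with it the acceptance rule effectively becomes "keep any correct sample" rather than "keep the single best".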

References

Llama 2: Open Foundation and Fine-Tuned Chat Models · arXiv (Touvron et al., 2023)
STaR: Bootstrapping Reasoning With Reasoning · arXiv (Zelikman et al., 2022)
DeepSeekMath: Pushing the Limits of Mathematical Reasoning · arXiv (Shao et al., 2024)
