Alignment Tuning

TL;DR

Alignment tuning is the umbrella term for the post-training stage that takes an instruction-tuned model and makes it produce responses humans rate higher — for helpfulness, harmlessness, and honesty.
Encompasses RLHF with PPO (the original InstructGPT/ChatGPT recipe), direct preference methods (DPO, IPO, ORPO, KTO, SimPO), and online policy methods (GRPO, REINFORCE++).
Operates on preference data — pairs of (prompt, chosen response, rejected response) — rather than instruction-response pairs.
By 2026 the open-model field has largely converged on DPO or one of its variants for cost reasons; closed labs still use PPO and increasingly online preference learning.

Why Alignment Is a Separate Stage

SFT teaches a model what an acceptable response looks like: the format, the style, the rough behaviour. It does not teach the model to distinguish good responses from mediocre ones when both look superficially correct. Two well-formatted responses to the same prompt can differ enormously in helpfulness, factuality, or safety, but SFT's cross-entropy loss only pushes probability towards the one demonstrated response and carries no signal about which of two plausible responses is better.

Alignment tuning fixes this by training on relative preferences. Given two responses to the same prompt, one preferred and one dispreferred, the model is updated to make the preferred response more likely and the dispreferred less likely. The signal is comparative, not absolute, and that distinction is what lets the model learn nuance that SFT cannot.
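One common way to formalise this comparative signal is the Bradley-Terry model, which the source does not name but which underlies both reward-model training in RLHF and the DPO derivation. The symbols below are assumptions for illustration: $r(x, y)$ is a scalar score for response $y$ to prompt $x$, $y_w$ is the preferred response, $y_l$ the dispreferred one, and $\sigma$ is the logistic function.

```latex
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big),
\qquad
\mathcal{L} = -\log \sigma\big(r(x, y_w) - r(x, y_l)\big)
```

Only the difference between the two scores enters the loss, which is why the signal is relative rather than absolute.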

The Family of Methods

Alignment methods divide into two broad camps: those that train an explicit reward model and use reinforcement learning to optimise against it (the RLHF camp), and those that derive a closed-form loss directly from preference pairs (the direct preference camp). A third, increasingly important camp is online preference learning, where the model generates its own samples during training and has them ranked or scored.

Method	Camp	Year	Headline trait
RLHF + PPO	RL	2022	InstructGPT/ChatGPT original
DPO	Direct	2023	Closed-form, no reward model
IPO	Direct	2023	Fixes DPO overfitting
KTO	Direct	2024	Single-response (not pairwise) data
ORPO	Direct	2024	Combines SFT + preference in one loss
SimPO	Direct	2024	No reference model needed
GRPO	Online RL	2024	Group-relative, used by DeepSeek-R1
REINFORCE++	Online RL	2024	Simpler than PPO, competitive
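The "group-relative" trait listed for GRPO can be illustrated with a minimal sketch: several responses are sampled for the same prompt, and each reward is normalised against the group's mean and standard deviation, so no separate value network is needed. The function name, the use of the population standard deviation, and the example rewards are assumptions for illustration, not the DeepSeek implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a hypothetical 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and the rest a negative one; if every sample earns the same reward, all advantages are zero and the prompt contributes no gradient.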

Preference Data

All preference methods consume some variant of (prompt, response_A, response_B, preference) where the preference labels which response is better. Sources include:

Human annotators (the original RLHF approach; expensive but highest fidelity).
Strong judge models (GPT-4-class models scoring response pairs; the dominant open-model approach since UltraFeedback in 2023).
Reward models trained on prior preference data (cheaper than humans, less prone to drift than judge models).
Programmatic rules (length, format compliance, code passing tests — for narrow tasks).
Self-rewarding loops (the model judges its own outputs against a rubric; the Self-Rewarding Language Models recipe).

Diversity in preference data matters as much as in SFT data. A preference set drawn entirely from one judge model encodes that judge's biases. Mix judge models when possible.
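As a rough sketch of the record format and of mixing judges, the snippet below stores one pair per record and builds it from several judges' votes. Majority voting is only one possible way to combine judges and is an assumption here, as are the names `PreferencePair`, `majority_pair`, and the judge identifiers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    source: str  # which judges or annotators produced the label


def majority_pair(prompt, response_a, response_b, votes):
    """Build a pair from judges' votes ('A' or 'B'); return None on a tie."""
    counts = Counter(votes.values())
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a == b:
        return None  # no consensus: drop the example rather than guess
    chosen, rejected = (response_a, response_b) if a > b else (response_b, response_a)
    return PreferencePair(prompt, chosen, rejected, source="+".join(sorted(votes)))


# Hypothetical judge names and responses.
votes = {"judge_x": "A", "judge_y": "A", "judge_z": "B"}
pair = majority_pair("Explain KL divergence.", "Detailed answer", "Vague answer", votes)
```

Keeping the `source` field makes it possible to audit later whether one judge dominates the dataset.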

RLHF vs DPO in Practice

RLHF with PPO requires training a separate reward model, running on-policy rollouts during alignment, and maintaining a KL constraint to the reference model. The infrastructure cost is significant — frontier labs use it because they need the headroom it provides at very large scales.
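The KL constraint is often implemented as a penalty subtracted from the reward-model score. The sketch below shows only that shaping step at the sequence level, under assumed names (`kl_shaped_reward`, `beta`); production implementations typically apply the penalty per token and feed the result into the PPO clipped objective, which is omitted here.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty estimated on the sampled response.

    logp_policy and logp_ref hold per-token log-probabilities of the same
    sampled response under the current policy and the frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


# Hypothetical numbers: the policy assigns the sample more probability
# than the reference does, so part of the reward is taken back.
shaped = kl_shaped_reward(
    rm_score=2.0,
    logp_policy=[-1.0, -0.5],
    logp_ref=[-1.5, -1.0],
)
```

A larger `beta` keeps the policy closer to the reference model at the price of slower reward improvement.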

DPO sidesteps most of that. Its closed-form loss takes preference pairs directly and updates the policy with ordinary gradient steps: no reward model, no rollouts, no PPO inner loop. It still needs log-probabilities from a frozen reference model, so the cost is roughly that of SFT plus an extra forward pass per response. The trade-off is that DPO is more sensitive to data quality and can overfit subtle artefacts in the preference set; IPO, ORPO, and SimPO each address specific DPO failure modes.
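For a single preference pair, the DPO loss from Rafailov et al. can be sketched as below. The inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model; the function name, argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more the policy favours chosen over rejected than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written as softplus(-z) for numerical stability.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
loss_at_init = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Policy has shifted probability towards the chosen response: the loss drops.
loss_after = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

Because the loss depends only on log-probabilities of fixed responses, it can be computed on a static dataset with no sampling from the policy, which is where the cost saving over PPO comes from.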

Trade-offs

RLHF/PPO: highest ceiling, highest cost, hardest to get right; what frontier labs use.
DPO and variants: simple, cheap, close to PPO in quality; the open-model default.
Online RL (GRPO, REINFORCE++): essential for reasoning training (DeepSeek-R1 style) and reward-hackable tasks; more expensive than DPO, less than full PPO.
Constitutional AI / RLAIF: replaces human preference labels with AI labels guided by a constitution; cheap but downstream quality depends on the labelling model.

When to Run Alignment Tuning

Run alignment whenever the model will face open-ended user input. A model that has only been SFT'd is usable but visibly weaker — it will hedge less skilfully, refuse less appropriately, and produce more obviously suboptimal responses among plausible alternatives. For narrow task-specific fine-tunes, alignment is often unnecessary; for any chat-style or agent-style product, it is essential.

References

Direct Preference Optimization (DPO) · arXiv (Rafailov et al., 2023)
Constitutional AI (Anthropic) · arXiv (Bai et al., 2022)
DeepSeek-R1 — GRPO training recipe · arXiv (DeepSeek, 2025)
