TL;DR
- PPO (Schulman et al., 2017, arXiv:1707.06347) is a policy-gradient method that constrains each update to stay close to the previous policy via a clipped surrogate objective.
- It is simpler to implement and tune than its predecessor TRPO and keeps the same trust-region intuition, although the clipped objective is a heuristic and does not carry TRPO's formal monotonic-improvement bound.
- PPO became the de facto algorithm for RLHF on language models: InstructGPT, Llama 2 RLHF, Claude's early RLHF and most public RLHF code use it.
- Newer LLM-RL variants — GRPO, RLOO, REINFORCE++ — drop the value network or use simpler baselines but still inherit PPO's clipped-surrogate scaffolding.
Policy Gradients in One Line#
Policy gradient methods optimise a parameterised policy π_θ by ascending the gradient of expected reward: ∇_θ J(θ) = E[∇_θ log π_θ(a|s) · A(s, a)], where A is the advantage. The vanilla form has high variance and is unstable; nearly every modern method is a variance-reduced, regularised form of this gradient.
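To make the estimator concrete, here is a minimal sketch for a softmax policy over a few discrete actions, written in dependency-free Python. The function names (`softmax`, `score_function_grad`) and the single-sample setup are illustrative assumptions, not part of any library. For softmax logits θ, ∇_θ log π_θ(a) is the one-hot vector for a minus the action probabilities, so one sampled (action, advantage) pair contributes the following gradient estimate:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_function_grad(logits, action, advantage):
    """Single-sample estimate of grad_theta log pi(a) * A for a softmax policy.

    For softmax logits, d log pi(a) / d theta_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    return [
        ((1.0 if i == action else 0.0) - p) * advantage
        for i, p in enumerate(probs)
    ]


# Two equally likely actions; action 0 was taken and had advantage +1.
grad = score_function_grad([0.0, 0.0], action=0, advantage=1.0)
print(grad)  # [0.5, -0.5]
```

Ascending this gradient raises the logit of the chosen action and lowers the other; a negative advantage flips both signs. Averaging over many samples approximates the expectation in the formula.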
From TRPO to PPO#
Trust Region Policy Optimisation (Schulman et al., 2015) constrained each update to stay within a KL-divergence ball around the previous policy. The theoretical guarantees were strong but the constraint required second-order optimisation (conjugate gradient) — expensive and complex to implement.
PPO replaced the explicit KL constraint with a clipped probability-ratio in the loss. The result: same intuition, vastly simpler implementation, fewer hyperparameters, comparable empirical performance. PPO swept reinforcement learning benchmarks in 2017-2018 and became the default robust baseline.
The Clipped Surrogate Objective#
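PPO maximises L^CLIP(θ) = E_t[min(r_t A_t, clip(r_t, 1 − ε, 1 + ε) A_t)], where r_t = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio and A_t is the advantage estimate. While r_t stays inside [1 − ε, 1 + ε], the surrogate is the standard importance-weighted policy-gradient term. Once r_t leaves that band in the direction that would increase the objective (above 1 + ε for a positive advantage, below 1 − ε for a negative one), the clipped term is selected and its gradient is zero, so there is no incentive to push the policy further. Taking the pessimistic min of the clipped and unclipped surrogates makes the objective a lower bound on the unclipped one: gains from large policy changes are capped, while changes that make the objective worse are still counted in full.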
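The four interesting cases can be checked by hand. The sketch below is plain Python on scalars, with a hypothetical helper name `clipped_surrogate`; it evaluates min(r·A, clip(r, 1 − ε, 1 + ε)·A) for a single sample and is meant only as an illustration of the clipping behaviour.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# (ratio, advantage) pairs covering both signs of A and both sides of the band.
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for ratio, advantage in cases:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

With ε = 0.2 the surrogate values are 1.2, 0.5, −1.5 and −0.8. The clip only takes effect in the first and last cases, where the ratio has already moved in the direction the advantage rewards; in the other two the unclipped term is kept, so a move in the wrong direction is never hidden from the optimiser.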
Typical PPO settings: clip ε = 0.2, multiple epochs over each rollout batch, GAE advantage estimation with λ = 0.95, value-loss coefficient 0.5, entropy bonus 0.01.
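As a rough sketch, these settings and the way the coefficients enter the combined loss could be written as follows. The `PPOConfig` dataclass, its field names, and `total_loss` are hypothetical; `n_epochs=4` and `gamma=0.99` are assumed placeholder values that the settings list does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    clip_eps: float = 0.2     # clipping range epsilon
    gae_lambda: float = 0.95  # GAE lambda
    vf_coef: float = 0.5      # weight on the value-function loss
    ent_coef: float = 0.01    # weight on the entropy bonus
    n_epochs: int = 4         # assumption: "multiple epochs" is not pinned down
    gamma: float = 0.99       # assumption: the discount factor is not given


def total_loss(policy_loss, value_loss, entropy, cfg=PPOConfig()):
    """Combine the three terms into one scalar to minimise.

    The entropy term is subtracted because it is a bonus: higher entropy
    lowers the loss and so encourages exploration.
    """
    return policy_loss + cfg.vf_coef * value_loss - cfg.ent_coef * entropy


print(total_loss(policy_loss=-0.10, value_loss=0.40, entropy=1.20))
```

Keeping the coefficients in one immutable object makes it easier to log the exact settings alongside each run.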
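```python
import torch

# r_t is the probability ratio π_new(a|s) / π_old(a|s).
# A_t is the advantage estimate.
def ppo_loss(r_t, A_t, clip_eps=0.2):
    # Unclipped surrogate: the importance-weighted policy-gradient term.
    surrogate_1 = r_t * A_t
    # Clipped surrogate: the ratio is held within [1 - eps, 1 + eps].
    surrogate_2 = torch.clamp(r_t, 1 - clip_eps, 1 + clip_eps) * A_t
    # Element-wise pessimistic min, negated so minimising the loss maximises the objective.
    return -torch.min(surrogate_1, surrogate_2).mean()
```

PPO for RLHF#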
In RLHF, the policy is a language model, the action is a generated response (or, at finer granularity, a single token), and the reward comes from a learned reward model. A value network is trained jointly to predict the expected return. A KL penalty against the reference (SFT) model is added explicitly to keep the policy from drifting away from it.
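One common way to apply the KL penalty is to fold it into the per-token reward: every token is charged β times the log-probability gap between the policy and the reference model, and the reward-model score is added on the final token. The sketch below assumes that scheme; the function name, the β value, and the example numbers are illustrative, and real pipelines differ in where and how the penalty is applied.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token rewards: -beta * (log pi(a_t) - log pi_ref(a_t)), plus the
    reward-model score on the last token of the response."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("log-prob sequences must be non-empty and equal length")
    rewards = [
        -beta * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# Three-token response; the policy is more confident than the reference on
# the first two tokens, so those tokens pay a small penalty.
rewards = kl_shaped_rewards(
    policy_logprobs=[-1.0, -0.5, -2.0],
    ref_logprobs=[-1.5, -1.0, -2.0],
    rm_score=1.0,
)
print(rewards)
```

The first two tokens each receive a reward of about −0.05, and the last token receives the full reward-model score because the two models agree there.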
The RLHF PPO loss combines: the clipped surrogate (policy update), the value-function loss (critic update), and the KL penalty (regularisation). Four models are in memory simultaneously: policy, reference, reward model, value model. Memory is the practical bottleneck for frontier-scale RLHF.
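A back-of-envelope estimate shows why memory dominates. The sketch below assumes all four models have the same parameter count, that weights and gradients are held in 16-bit precision, and that the two trained models (policy and value) use Adam with 32-bit master weights and moments, for roughly 16 bytes per parameter, while the two frozen models need only 2 bytes per parameter. Activations, KV caches, and any sharding or offloading are ignored, and the 7B size is just an example.

```python
BYTES_TRAINED = 2 + 2 + 12  # 16-bit weights + 16-bit grads + fp32 master/m/v
BYTES_FROZEN = 2            # 16-bit weights only (inference)


def rlhf_memory_gb(n_params, n_trained=2, n_frozen=2):
    """Rough weight + optimiser memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = n_params * (n_trained * BYTES_TRAINED + n_frozen * BYTES_FROZEN)
    return total_bytes / 1e9


print(rlhf_memory_gb(7_000_000_000))  # 252.0
```

Under these assumptions a 7B-parameter setup needs about 252 GB before a single activation is stored, and the two trained models account for 224 GB of that.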
RLHF PPO is famously fragile. Reward hacking, value-network drift, KL collapse and gradient instability are all routine failure modes. Heavy logging, frequent checkpoints and conservative hyperparameters are essential.
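Two diagnostics that many open-source PPO implementations log on every update are an approximate KL divergence between the old and new policy and the fraction of samples whose ratio was clipped. The sketch below computes both from per-sample log-probabilities in plain Python; the function name and the choice of the simple mean(log π_old − log π_new) estimator are assumptions for illustration.

```python
import math


def ppo_diagnostics(old_logprobs, new_logprobs, clip_eps=0.2):
    """Return (approx_kl, clip_fraction) for one batch of sampled actions."""
    if len(old_logprobs) != len(new_logprobs) or not old_logprobs:
        raise ValueError("inputs must be non-empty and equal length")
    n = len(old_logprobs)
    # Simple estimator of KL(old || new) from samples drawn under the old policy.
    approx_kl = sum(o - nw for o, nw in zip(old_logprobs, new_logprobs)) / n
    # Share of samples whose probability ratio fell outside [1 - eps, 1 + eps].
    ratios = [math.exp(nw - o) for o, nw in zip(old_logprobs, new_logprobs)]
    clip_fraction = sum(abs(r - 1.0) > clip_eps for r in ratios) / n
    return approx_kl, clip_fraction


old = [-1.0, -2.0, -0.5, -1.5]
new = [-1.0, -1.5, -0.5, -1.6]
approx_kl, clip_fraction = ppo_diagnostics(old, new)
print(approx_kl, clip_fraction)
```

On a batch this small the KL estimate can come out negative, as it does here; it is only meaningful as an average over many samples. A sustained rise in either number is often treated as a cue to end the current epoch early or reduce the learning rate.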
GAE: Generalised Advantage Estimation#
Advantages can be estimated in many ways; the choice trades bias against variance. GAE (Schulman et al., 2016) interpolates between the extremes: A_t^GAE(λ) = Σ_{k≥0} (γλ)^k δ_{t+k}, where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the one-step TD residual and γ is the discount factor. λ = 0 reduces to the one-step TD estimate (low variance, high bias); λ = 1 gives the Monte Carlo return minus the value baseline (low bias, high variance). λ = 0.95 is the usual default in PPO implementations.
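The sum has a convenient backward recursion, A_t = δ_t + γλ·A_{t+1}, with δ_t = r_t + γV(s_{t+1}) − V(s_t). A minimal dependency-free sketch for a single trajectory follows; the function name, the `last_value` bootstrap argument, and γ = 0.99 as a default are assumptions, and episode-boundary masking is left out for brevity.

```python
def compute_gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation for one trajectory.

    rewards[t] and values[t] are r_t and V(s_t); last_value is the bootstrap
    value for the state after the final step (0.0 if that state is terminal).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # A_t recursion
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards, values = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(compute_gae(rewards, values, gamma=1.0, lam=1.0))  # [3.0, 2.0, 1.0]
print(compute_gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 1.0, 1.0]
```

With a zero value function, λ = 1 recovers the Monte Carlo return-to-go and λ = 0 keeps only the immediate one-step residual, which are the two ends of the bias-variance trade-off.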
PPO's Decline in LLM-RL#
By 2024-2025, PPO's dominance in LLM-RL waned for two reasons. First, DPO showed that for preference data, the RL framing was unnecessary. Second, for verifiable-reward training (math, code), GRPO and RLOO showed that the value network was a liability rather than an asset.
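Both GRPO and RLOO replace the learned critic with a baseline computed from several responses sampled for the same prompt. As a rough sketch: GRPO standardises each reward against the group mean and standard deviation, and RLOO subtracts the mean reward of the other samples. The function names, the use of the population standard deviation, and the small `eps` guard are assumptions; published implementations differ in these details.

```python
import statistics


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


def rloo_advantages(group_rewards):
    """Leave-one-out advantages: r_i minus the mean of the other rewards."""
    k = len(group_rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(group_rewards)
    return [r - (total - r) / (k - 1) for r in group_rewards]


# Four sampled answers to one prompt, scored 1 if verified correct, else 0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

Neither function needs a value network: the baseline comes from the extra samples, at the cost of generating several responses per prompt.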
PPO remains the canonical baseline and is still used by closed labs for some pipelines. For new work in 2026, DPO is the default for preference data and GRPO is the default for verifiable rewards. PPO holds the middle ground for mixed-reward settings or research that needs a well-understood algorithm.
References
- Proximal Policy Optimization Algorithms (Schulman et al., 2017) · arXiv
- Trust Region Policy Optimization (Schulman et al., 2015) · arXiv
- High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016) · arXiv
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT, Ouyang et al., 2022) · arXiv