In the era of powerful language models, we’ve entered a new frontier where raw capability is no longer enough. We want AI systems that are not only smart, but aligned with human values, preferences, and safety requirements. That’s where Reinforcement Learning from Human Feedback (RLHF) comes in.
RLHF is the technique that helped train models like ChatGPT, enabling them to follow human instructions better and avoid toxic or unsafe responses. But what exactly happens under the hood? Let’s unpack the models, losses, and intuitions behind RLHF.
Reinforcement Learning from Human Feedback is a three-phase training pipeline that aligns a language model using preferences collected from humans: supervised fine-tuning, reward model training, and reinforcement learning against that reward model. The goal is to reward behaviors humans approve of and penalize those they don’t.
The first phase, supervised fine-tuning, uses a standard cross-entropy loss. You train the model to predict human-written responses token by token:
Loss = -log P(human_response | prompt)
Intuition: Imitate the best human examples first, before learning to generalize via reward signals.
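As a minimal sketch of this loss, assume we already have the probability the model assigns to each token of the human response (the numbers below are made up for illustration). The loss for the whole response is the sum of the per-token negative log-probabilities:

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of a human-written response.

    token_probs[i] is the (hypothetical) probability the model assigns to
    the i-th token of the human response, given the prompt and all earlier
    tokens. Because P(response | prompt) is the product of these values,
    -log P is the sum of their negative logs.
    """
    return -sum(math.log(p) for p in token_probs)

# Made-up per-token probabilities for a four-token response.
confident = [0.9, 0.8, 0.95, 0.85]   # model already imitates the human well
unsure = [0.3, 0.2, 0.4, 0.25]       # model finds the human response unlikely

print(sft_loss(confident))  # smaller loss
print(sft_loss(unsure))     # larger loss
```

The loss shrinks as the model puts more probability on the tokens the human actually wrote, which is what "imitating" means here.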
In the second phase, a reward model is trained on pairwise human preferences. Given two responses A and B to the same prompt, and a human preference for A over B, the reward model learns:
Loss = -log σ(r(A) - r(B))
Where σ is the sigmoid function and r(x) is the reward score for response x.
Intuition: The model should learn to assign higher scores to human-preferred answers.
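The pairwise loss can be sketched in a few lines. The reward scores here are plain floats chosen for illustration; in practice they would come from a neural reward model, and the function names are hypothetical:

```python
import math

def sigmoid(z):
    # Plain logistic function; not hardened against very large negative z.
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_preferred, r_rejected):
    """Loss = -log σ(r(A) - r(B)), where A is the human-preferred response."""
    return -math.log(sigmoid(r_preferred - r_rejected))

# Reward model agrees with the human: preferred answer scores higher.
print(preference_loss(2.0, 0.0))   # small loss

# Reward model disagrees: preferred answer scores lower.
print(preference_loss(0.0, 2.0))   # large loss

# Reward model is indifferent: loss is log(2).
print(preference_loss(1.0, 1.0))
```

Only the difference between the two scores matters, so the reward model is free to pick any absolute scale as long as preferred answers rank higher.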
Once the reward model is trained, the language model is fine-tuned via PPO to maximize expected reward, while staying close to the original model:
Loss = - Advantage * πθ / πθ_old + KL Penalty
Advantage: How much better the sampled output scored than expected, i.e. its reward relative to a baseline estimate.
KL Penalty: Penalizes divergence from the original model, which prevents the policy from drifting too far from its base behavior.
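To make the two terms concrete, here is a rough per-sample sketch of the loss above. It assumes we have log-probabilities of one sampled response under the current policy, the policy before the update, and the original model. The coefficient `kl_coef` and the single-sample KL estimate are illustrative assumptions, not details given in this article:

```python
import math

def ppo_style_loss(advantage, logp_new, logp_old, logp_ref, kl_coef=0.1):
    """Per-sample version of: Loss = -Advantage * πθ / πθ_old + KL Penalty.

    logp_new: log-probability of the response under the current policy πθ
    logp_old: log-probability under the policy before this update, πθ_old
    logp_ref: log-probability under the original (reference) model
    kl_coef:  hypothetical weight on the KL penalty
    """
    ratio = math.exp(logp_new - logp_old)   # πθ / πθ_old
    kl_estimate = logp_new - logp_ref       # one-sample estimate of the KL term
    return -advantage * ratio + kl_coef * kl_estimate

# No change yet: ratio is 1 and the KL term is 0, so the loss is -advantage.
print(ppo_style_loss(advantage=2.0, logp_new=-1.0, logp_old=-1.0, logp_ref=-1.0))

# The policy made a positively rewarded response more likely: the first term
# improves, but the KL term charges for moving away from the original model.
print(ppo_style_loss(advantage=2.0, logp_new=-0.5, logp_old=-1.0, logp_ref=-1.0))
```

Raising `kl_coef` makes the model more conservative; lowering it lets the policy chase reward more aggressively.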
PPO (Proximal Policy Optimization) is a popular reinforcement learning algorithm because it limits how far each update can move the policy, typically by clipping the probability ratio πθ / πθ_old to a narrow range around 1. Keeping each step small makes training more stable, which is crucial when working with powerful language models prone to generating unpredictable outputs.
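One common way PPO keeps updates from changing the policy too drastically is to clip the probability ratio πθ / πθ_old. The sketch below shows that clipped objective for a single sample; the clip range of 0.2 is a typical default used here as an assumption, and the function name is hypothetical:

```python
def clipped_surrogate(advantage, ratio, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized).

    ratio is πθ / πθ_old. Taking the minimum of the clipped and unclipped
    terms removes any extra gain from pushing the ratio outside
    [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good response (positive advantage): gains are capped once ratio exceeds 1.2.
print(clipped_surrogate(advantage=1.0, ratio=1.1))
print(clipped_surrogate(advantage=1.0, ratio=1.5))

# Bad response (negative advantage): no extra credit for ratio below 0.8.
print(clipped_surrogate(advantage=-1.0, ratio=0.5))
```

Because the objective stops improving outside the clip range, the optimizer has no incentive to take a very large step on any single batch.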