🤝 Reinforcement Learning from Human Feedback (RLHF) — Models, Losses, and Their Intuitions

In the era of powerful language models, we’ve entered a new frontier where raw capability is no longer enough. We want AI systems that are not only smart, but aligned with human values, preferences, and safety requirements. That’s where Reinforcement Learning from Human Feedback (RLHF) comes in.

RLHF is the technique that helped train models like ChatGPT, enabling them to follow human instructions better and avoid toxic or unsafe responses. But what exactly happens under the hood? Let’s unpack the models, losses, and intuitions behind RLHF.

📦 What is RLHF?

Reinforcement Learning from Human Feedback is a 3-phase training pipeline that aligns a language model using preferences collected from humans. The goal is to reward behaviors humans approve of and penalize those they don’t.

🛠️ The Three Core Stages of RLHF:

Supervised Fine-Tuning (SFT): Start with a pretrained language model and fine-tune it on high-quality human-written responses.
Reward Model (RM) Training: Train a reward model on pairs of outputs for the same prompt, each labeled with the response humans preferred.
Reinforcement Learning: Fine-tune the SFT model using Proximal Policy Optimization (PPO), with the reward model's scores serving as the reward signal.

🤖 The Models Involved

Base Language Model (LM): A pretrained transformer (like GPT) that generates text based on input prompts.
Reward Model (RM): A model trained to assign scalar reward values to model outputs, based on human preferences.
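Policy Model: The LM being optimized via reinforcement learning. It starts as a copy of the SFT model, and its final version is the one that generates aligned responses. A frozen copy of the SFT model is typically kept as the reference for the KL penalty.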

🧠 The Intuition Behind the Losses

1. Supervised Fine-Tuning Loss

This is a standard cross-entropy loss. You train the model to predict human-written responses token by token, each token conditioned on the prompt and the tokens before it:

Loss = -log P(human_response | prompt)

Intuition: Imitate the best human examples first, before learning to generalize via reward signals.
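To make the SFT loss concrete, here is a minimal sketch in plain Python. It assumes we already have the probability the model assigned to each token of the human-written response; the function name `sft_loss` and the example probabilities are made up for illustration.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of a human-written response.

    token_probs[i] is the (hypothetical) probability the model gave to the
    i-th token of the response, given the prompt and the earlier tokens.
    Since P(response | prompt) is the product of these probabilities,
    -log P(response | prompt) is the sum of the per-token -log terms.
    """
    return -sum(math.log(p) for p in token_probs)

# Made-up per-token probabilities for a 4-token response.
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.3, 0.2, 0.4, 0.25]

print("loss when the model already imitates the human:", sft_loss(confident))
print("loss when the model finds the response unlikely:", sft_loss(unsure))
```

The loss is lower when the model puts high probability on the human's tokens. Training frameworks usually average this over tokens rather than summing, which changes the scale but not the intuition.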

2. Reward Model Loss

Given two responses A and B, and a human preference for A over B, the reward model learns:

Loss = -log σ(r(A) - r(B))

Where σ is the sigmoid function and r(x) is the reward score for response x.

Intuition: The model should learn to assign higher scores to human-preferred answers.
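The pairwise loss can be sketched in a few lines of Python. This is an illustration under simple assumptions: the reward scores are plain floats, and the names `reward_model_loss`, `r_chosen`, and `r_rejected` are hypothetical, not taken from any library.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r(A) - r(B))), where A is the preferred response.

    Written as softplus(-(r(A) - r(B))) in a numerically stable form,
    so large score gaps do not overflow math.exp.
    """
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Preferred answer scored higher: small loss.
print(reward_model_loss(2.0, -1.0))
# Both answers scored the same: the loss equals log(2).
print(reward_model_loss(0.0, 0.0))
# Preferred answer scored lower: large loss.
print(reward_model_loss(-1.0, 2.0))
```

Only the difference between the two scores matters, so the reward model learns a ranking rather than an absolute scale.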

3. PPO Loss (Policy Optimization)

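Once the reward model is trained, the language model is fine-tuned via PPO to maximize expected reward while staying close to the model it started from (the SFT model). In its common clipped form, the loss is:

Loss = -min(ratio * Advantage, clip(ratio, 1 - ε, 1 + ε) * Advantage) + β * KL Penalty

ratio: πθ / πθ_old, the probability of the sampled output under the updated policy divided by its probability under the policy that generated it.
Advantage: How much better the sampled output scored than expected, i.e. its reward minus a baseline such as a learned value estimate.
clip: Keeps the ratio within [1 - ε, 1 + ε], so the policy gains nothing from moving far away from πθ_old in a single update.
KL Penalty: The KL divergence between the policy and the SFT model, weighted by β. It prevents the model from drifting too far from its base behavior.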
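Here is a rough per-token sketch of a PPO-style loss with a KL penalty, in plain Python. Several things are assumptions for illustration: the function name `ppo_token_loss`, the default values of `clip_eps` and `kl_coef`, the use of the clipped form of the PPO surrogate, and the single-sample KL estimate. Real RLHF implementations differ in the details, and some fold the KL term into the reward instead of the loss.

```python
import math

def ppo_token_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO surrogate plus a KL penalty for one sampled token.

    logp_new: log-prob under the policy being updated (pi_theta)
    logp_old: log-prob under the policy that sampled the token (pi_theta_old)
    logp_ref: log-prob under a frozen reference model (assumed: the SFT model)
    advantage: how much better than the baseline this token's outcome was
    """
    # Probability ratio pi_theta / pi_theta_old, computed in log space.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Crude single-sample estimate of the KL from the reference model.
    kl_estimate = logp_new - logp_ref
    return -surrogate + kl_coef * kl_estimate

# The policy doubled the probability of a token with positive advantage,
# but the clipped objective only credits it up to a ratio of 1 + clip_eps.
loss = ppo_token_loss(logp_new=math.log(0.6), logp_old=math.log(0.3),
                      logp_ref=math.log(0.3), advantage=1.0)
print(loss)
```

The first term pushes probability toward outputs with positive advantage, and the second term charges the policy for every bit of probability mass it moves away from the reference model.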

⚖️ Why PPO?

PPO is a popular reinforcement learning algorithm because it keeps training stable while remaining relatively simple to implement. By clipping the probability ratio between the new and old policy, it ensures a single update doesn't change the policy too drastically, which is crucial when working with powerful language models prone to generating unpredictable outputs.
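A tiny numeric sweep shows what "doesn't change too drastically" means in practice. This sketch assumes the clipped variant of PPO with a clip range of 0.2; the helper name `clipped_objective` is made up for illustration.

```python
def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO's clipped objective for a single sample (higher is better)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

# Sweep the probability ratio pi_theta / pi_theta_old for a good action.
for ratio in (0.5, 1.0, 1.2, 1.5, 3.0):
    print(ratio, clipped_objective(ratio, advantage=1.0))
```

With a positive advantage, the objective stops growing once the ratio passes 1 + clip_eps, so the optimizer has no incentive to keep pushing that output's probability higher within the same update. With a negative advantage, the mirror image happens below 1 - clip_eps.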