RL Notes (5): LLM Alignment (Part 1)
2025-12-19 · Qi Lu
This is the fifth article in the reinforcement learning series, moving into the territory where LLMs and RL meet. It introduces the RL formulation of LLM alignment, the classic three-stage RLHF pipeline, and the simpler DPO method.
Introduction: From Pre-training to Alignment
Core Problem
Large Language Models (LLMs) acquire powerful language understanding and generation capabilities through massive text pre-training. However, there is a gap between the pre-training objective (predicting the next token) and human-expected behavior:
Pre-trained LLMs only learn to “speak like humans,” but not to “act according to human expectations.”
How can we make LLMs not only fluent, but also helpful, honest, and harmless?
This is the LLM Alignment problem. And reinforcement learning is the core technology to solve this problem.
Why Do We Need RL?
Supervised fine-tuning (SFT) teaches the model to imitate high-quality responses, but it has limitations:
- Limited distribution: Can only learn response patterns that appear in the training set
- Cannot express preferences: Difficult to distinguish between “good” and “better”
- Cannot explore: Won’t try new answering strategies
Reinforcement learning provides a different perspective:
- Model the LLM generation process as an MDP
- Define the reward function using human preferences
- Optimize the policy by maximizing rewards
RL Modeling of LLM Alignment
State/Action/Reward Definition
Modeling the LLM alignment problem as an RL problem:
RL Modeling of LLM
- State $s_t$: prompt $x$ + generated token sequence $y_{<t} = (y_1, \ldots, y_{t-1})$
- Action $a_t$: next token $y_t$ (vocabulary size $|\mathcal{V}| \sim$ 100k)
- Policy $\pi_\theta(a|s)$: the LLM itself, $\pi_\theta(y_t | x, y_{<t})$
- Trajectory $\tau$: complete generation sequence $y = (y_1, y_2, \ldots, y_T)$
- Reward $r$: usually only given at the end of the sequence
Characteristics of LLM RL:
- Huge action space: Vocabulary typically has 100k+ tokens
- Deterministic state transitions: Next state = current state + new token
- Episode = one complete generation: From prompt to EOS
- Sparse rewards: Reward signal only at the end of the sequence
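As a concrete illustration, here is a minimal sketch of one episode under this modeling (the `policy.sample_next` and `reward_model.score` interfaces are assumptions for illustration, not a specific library API): the state is the prompt plus the tokens generated so far, each action appends one token, and the only nonzero reward arrives at the end of the sequence.

```python
def rollout(policy, reward_model, prompt_ids, eos_id, max_len=512):
    """One episode: state = prompt + generated tokens, action = next token.

    `policy.sample_next(state)` and `reward_model.score(state)` are assumed
    interfaces, used here only to make the MDP structure explicit.
    """
    state = list(prompt_ids)                # s_0 = prompt x
    actions, rewards = [], []
    for t in range(max_len):
        a_t = policy.sample_next(state)     # a_t ~ pi_theta(. | x, y_<t)
        state = state + [a_t]               # deterministic transition: append token
        actions.append(a_t)
        done = (a_t == eos_id) or (t == max_len - 1)
        # sparse reward: zero everywhere except at the end of the sequence
        rewards.append(reward_model.score(state) if done else 0.0)
        if done:
            break
    return state, actions, rewards
```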
Sparse Reward Problem
Typical reward structure for LLM alignment:
\[r_t = \begin{cases} 0 & t < T \\ r_\phi(x, y) & t = T \text{ (end of sequence)} \end{cases}\]
Challenges brought by sparse rewards:
- Credit assignment difficulty: How to attribute the final reward to each token?
- Weak gradient signal: No learning signal at most time steps
- Especially difficult for long sequences: Signal needs to propagate very far (thousands of tokens)
Two approaches to solving sparse rewards:
- Sequence-level methods: Treat the entire sequence as a bandit and update directly with the sequence reward (e.g., REINFORCE; see the sketch after this list)
- Process rewards: Train PRM to provide reward signals for intermediate steps
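For the sequence-level route, a minimal REINFORCE-style sketch (the baseline argument is an assumption; a batch-mean or learned baseline is a common choice):

```python
def reinforce_sequence_loss(logprobs, seq_reward, baseline):
    """Sequence-level REINFORCE: treat the whole generation as a single action.

    logprobs:   (batch, seq_len) per-token log pi_theta(y_t | x, y_<t)
    seq_reward: (batch,) scalar reward for each complete sequence
    baseline:   scalar or (batch,) baseline for variance reduction
                (assumed here; many variants exist)
    """
    seq_logprob = logprobs.sum(dim=-1)            # log pi_theta(y | x)
    advantage = (seq_reward - baseline).detach()  # no credit assignment within the sequence
    return -(advantage * seq_logprob).mean()      # REINFORCE policy-gradient estimator
```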
Three-Stage RLHF
RLHF (Reinforcement Learning from Human Feedback) is the classic approach to LLM alignment, systematized by OpenAI in InstructGPT.
RLHF Overall Architecture
Stage 1: Supervised Fine-Tuning (SFT)
Fine-tune the pre-trained model with high-quality dialogue data:
\[L_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_\theta(y|x) \right] = -\mathbb{E} \left[ \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t}) \right]\]
Role of SFT:
- Make the model learn the basic format of “instruction following”
- Provide the starting point for RL (reference model $\pi_{\text{ref}}$)
- Filter out low-quality patterns from pre-training
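A minimal PyTorch-style sketch of the SFT loss above (masking prompt and padding positions with label $-100$ is a common convention, assumed here):

```python
import torch.nn.functional as F

def sft_loss(logits, labels):
    """Token-level cross-entropy, i.e. -sum_t log pi_theta(y_t | x, y_<t).

    logits: (batch, seq_len, vocab_size) from the model
    labels: (batch, seq_len), with prompt/padding positions set to -100
            so that only response tokens contribute (assumed convention).
    """
    # shift so that position t predicts token t+1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```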
Stage 2: Reward Model Training
Learn the Reward Model from human preference data.
Preference data: For prompt $x$, human annotators compare two responses and give a preference: $y_w \succ y_l$ ($y_w$ is preferred over $y_l$).
Bradley-Terry Model
Assumes human preferences follow the Bradley-Terry model—preference probability is determined by “ability difference”:
\[P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l)) = \frac{1}{1 + e^{-(r(x, y_w) - r(x, y_l))}}\]
where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function, and $r(x, y)$ is the “score” of the response.
Intuition of the Bradley-Terry model:
- When reward difference = 0, preference probability = 0.5 (cannot distinguish)
- The larger the reward difference, the closer the preference probability to 1 (more certain)
- The model assumes preferences are probabilistic comparisons based on “intrinsic quality scores”
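As a quick sanity check: if $r(x, y_w) - r(x, y_l) = 2$, then $P(y_w \succ y_l) = \sigma(2) \approx 0.88$; if the difference is $0$, the probability is exactly $0.5$.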
Reward Model Training
The training objective of the Reward Model is to maximize the likelihood of preference data:
\[L_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]\]
This is a binary classification problem: given $(y_w, y_l)$, predict which is better.
Reward Model architecture choices:
- Usually initialized from the SFT model
- Remove the language model head, add a scalar output head
- Input $(x, y)$, output scalar $r_\phi(x, y) \in \mathbb{R}$
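Putting the loss and the architecture together, a minimal PyTorch-style sketch (the `backbone` interface and reading the score off the final token are simplifying assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """SFT backbone with the LM head replaced by a scalar value head."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                 # e.g. a transformer without LM head (assumed)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)  # (B, T, H), assumed interface
        last_token = hidden[:, -1, :]                       # score read off the final token
        return self.value_head(last_token).squeeze(-1)      # r_phi(x, y), shape (B,)

def rm_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```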
Stage 3: PPO Fine-tuning
Use the Reward Model to provide reward signals, optimize the policy with PPO.
RLHF Optimization Objective
\[\max_\theta \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\]
where $\beta > 0$ is the KL regularization coefficient.
Role of KL Regularization
The KL regularization term $\text{KL}(\pi_\theta \| \pi_{\text{ref}})$ is crucial:
- Prevent Reward Hacking:
  - The Reward Model is an imperfect proxy
  - Unconstrained optimization will find ways to “fool” the RM
  - For example: generating specific patterns that score highly even though the actual quality is poor
- Maintain generation quality:
  - The SFT model already has good language capabilities
  - The KL constraint prevents drifting so far that fluency degrades
- Stabilize training:
  - Constrains the optimization space and avoids policy collapse
  - Acts as a regularizer
PPO Update Process
Specific steps of PPO in RLHF:
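A hedged sketch of one iteration (all model interfaces, as well as the `compute_gae` and `ppo_clip_update` helpers, are hypothetical placeholders; folding the KL penalty into the per-token reward is one common implementation choice, not the only one):

```python
def rlhf_ppo_step(policy, ref_policy, reward_model, critic, prompts, beta=0.1):
    """One RLHF-PPO iteration (sketch; all interfaces are assumed)."""
    # 1. Rollout: sample responses y ~ pi_theta(.|x) from the current policy
    responses, logprobs = policy.generate(prompts)           # logprobs: (B, T)

    # 2. Score the full sequences with the frozen Reward Model
    seq_rewards = reward_model.score(prompts, responses)     # (B,)

    # 3. Per-token KL penalty against the frozen reference policy
    ref_logprobs = ref_policy.logprobs(prompts, responses)   # (B, T)
    kl = logprobs - ref_logprobs                              # naive per-token KL estimate

    # 4. Shape rewards: -beta*KL at every token, sequence reward added at the last token
    #    (assumes fixed-length / right-aligned sequences for simplicity)
    rewards = -beta * kl
    rewards[:, -1] += seq_rewards

    # 5. Critic values -> advantages (e.g. GAE), then a clipped PPO update
    values = critic.values(prompts, responses)                # (B, T)
    advantages, returns = compute_gae(rewards, values)        # hypothetical helper
    ppo_clip_update(policy, critic, logprobs, advantages, returns)  # hypothetical helper
```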
Important Note: Models needed for RLHF:
- $\pi_\theta$: policy being trained (Active Model)
- $\pi_{\text{ref}}$: reference model (frozen)
- $r_\phi$: Reward Model (frozen)
- $V_\psi$: Critic network
That is four large models in total, with a huge memory overhead. This is exactly the problem that methods like DPO and GRPO try to solve.
Direct Preference Optimization (DPO)
DPO, proposed by Rafailov et al. (2023), is a simplified method that bypasses training a separate Reward Model and running PPO.
DPO Motivation
Problems with RLHF + PPO:
- Large model overhead: Need to maintain 4 models
- High sampling cost: Online generation with large models is expensive
- Complex implementation: PPO is hyperparameter-sensitive, needs fine-tuning
- Training instability: RL training is prone to collapse
DPO’s core question: Can we optimize directly on preference data $(x, y_w, y_l)$, as simple as supervised learning?
The answer is yes! Key insight: The RL problem with KL regularization has a closed-form solution.
DPO Loss Formula
\[L_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right) \right]\]
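A minimal PyTorch-style sketch of this loss, taking as input the summed log-probabilities of each full response under the policy and the frozen reference model (how those are computed is omitted):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed sequence log-probs log pi(y|x).

    *_w: log-probs of the preferred response y_w
    *_l: log-probs of the dispreferred response y_l
    """
    # implicit rewards: beta * log(pi_theta / pi_ref), up to the shared log Z(x) term
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```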
Complete DPO Derivation
DPO Equivalence Theorem: Under the Bradley-Terry preference model, optimizing the DPO Loss recovers the same optimal policy as the KL-regularized RLHF objective.
Proof: The derivation has 5 key steps.
Step 1: RLHF Objective Expansion
RLHF optimization objective:
\[\max_\pi \mathbb{E}_{y \sim \pi} \left[ r(x, y) \right] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}})\]
Expand the KL divergence:
\[= \mathbb{E}_{y \sim \pi} \left[ r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]\]
Step 2: Introduce Partition Function $Z(x)$
To make the optimal policy a valid probability distribution, define the partition function:
\[Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x,y)}{\beta} \right)\]
$Z(x)$ is a normalization constant that only depends on $x$ (not on the policy being optimized).
Step 3: Closed-form Solution for Optimal Policy
The KL-regularized RL problem has a closed-form solution:
Optimal Policy Lemma for KL-regularized RL
The optimal solution to the objective $\max_\pi \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}})$ is:
\[\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x,y)}{\beta} \right)\]
This is a constrained optimization problem ($\pi$ must be a valid probability distribution). Intuition: the optimal policy is the reference policy reweighted by $\exp(r/\beta)$; the higher the reward, the larger the boost in probability.
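For completeness, here is the standard argument behind the lemma: rewrite the objective in terms of $\pi^*$,
\[\begin{align} \mathbb{E}_{y \sim \pi}\left[ r(x,y) \right] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}}) &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x) \exp\left( r(x,y)/\beta \right)} \right] \\ &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y|x)}{Z(x) \, \pi^*(y|x)} \right] \\ &= -\beta \cdot \text{KL}(\pi \| \pi^*) + \beta \log Z(x) \end{align}\]
Since $\log Z(x)$ does not depend on $\pi$, the objective is maximized exactly when $\text{KL}(\pi \| \pi^*) = 0$, i.e. $\pi = \pi^*$.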
Step 4: Solve for Reward from Optimal Policy
Key step: Solve for reward from the optimal policy.
Take logarithm:
\[\log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) - \log Z(x) + \frac{r(x,y)}{\beta}\]
Rearrange to get:
\[r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]
Core Insight: Reward can be expressed using the log-ratio of policies! Although there’s a $\log Z(x)$ term, it only depends on $x$ and will cancel out in pairwise comparisons.
Step 5: Substitute into Bradley-Terry Model, $Z(x)$ Cancels
Substitute the reward expression into the Bradley-Terry model:
\[\begin{align} P(y_w \succ y_l) &= \sigma(r(x, y_w) - r(x, y_l)) \\ &= \sigma\left( \beta \left[ \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right) \end{align}\]
The $\beta \log Z(x)$ terms cancel out!
Replacing $\pi^*$ with the parameterized policy $\pi_\theta$ and maximizing the log-likelihood of the preference data then yields the DPO Loss.
DPO’s Core Insights:
- The KL-regularized RL problem has a closed-form solution, the optimal policy is exponential reweighting of the reference policy
- We can solve for the implicit reward from the optimal policy
- The partition function $Z(x)$ cancels out in pairwise comparisons—this is the key to why DPO works
- The final form only needs to compute log-probability, as simple as supervised learning
Intuitive Understanding of DPO
Define implicit reward:
\[\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\]
DPO Loss can be written as:
\[L_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]\]
Intuition:
- $\hat{r}_\theta(x, y_w) > \hat{r}_\theta(x, y_l)$: $y_w$ has higher implicit reward, loss decreases
- Training process increases $y_w$’s probability relative to $\pi_{\text{ref}}$, decreases $y_l$’s probability
- $\beta$ controls the scale of “how much to deviate from the reference policy”
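The gradient of the DPO Loss (as given in the DPO paper) makes this intuition explicit:
\[\nabla_\theta L_{\text{DPO}} = -\beta \, \mathbb{E}_{(x, y_w, y_l)} \left[ \sigma\left( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \right) \left( \nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x) \right) \right]\]
The weight $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ is largest on pairs that the implicit reward currently ranks incorrectly, so those examples dominate the update.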
DPO vs RLHF Comparison
| Feature | RLHF + PPO | DPO |
|---|---|---|
| Need Reward Model | Yes | No |
| Need Critic network | Yes | No |
| Training method | Online sampling | Offline training |
| Number of models | 4 | 2 |
| Implementation complexity | High | Low |
| Hyperparameter sensitivity | High | Low |
| Exploration ability | Yes | No |
| Applicable scenarios | Complex tasks | Simple alignment |
DPO limitations:
- No exploration: Completely offline, can only optimize within the distribution of existing preference data
- Coarse pairwise signal: Only knows which is better, not how much better
- Limited improvement on difficult tasks: Not as effective as RL on tasks like math and code that require exploration
Chapter Summary
- RL Modeling of LLM Alignment: State = prompt + generated tokens, Action = next token, sparse reward only given at end of sequence
- Three-Stage RLHF:
  - Stage 1 (SFT): Supervised fine-tuning, learn instruction following
  - Stage 2 (RM): Train Reward Model from preference data (Bradley-Terry model)
  - Stage 3 (PPO): Use RM to provide rewards, PPO optimization, KL regularization prevents reward hacking
- DPO:
  - Leverages the KL-regularized RL closed-form solution, bypasses RM and PPO
  - Optimizes directly on preference data, as simple as supervised learning
  - Only needs 2 models ($\pi_\theta$ and $\pi_{\text{ref}}$)
  - Limitations: No exploration ability, limited improvement on difficult tasks
The next article will introduce more advanced methods such as GRPO, KL estimators, PRM, and Long CoT RL, which attempt to restore online exploration capabilities while maintaining DPO’s simplicity.