LLM Notes

Notes on LLMs and Reinforcement Learning - In-depth Analysis of Transformer, RLHF, PPO, DPO, and More

RL Notes (5): LLM Alignment (Part 1)

2025-12-19 · Qi Lu

This is the fifth article in the reinforcement learning series, moving into the territory where LLMs and RL meet. It introduces the RL formulation of LLM alignment, the classic three-stage RLHF pipeline, and the more concise DPO method.

Introduction: From Pre-training to Alignment

Core Problem

Large Language Models (LLMs) acquire powerful language understanding and generation capabilities through massive text pre-training. However, there is a gap between the pre-training objective (predicting the next token) and human-expected behavior:

Pre-trained LLMs only learn to “speak like humans,” but not to “act according to human expectations.”

How can we make LLMs not only fluent, but also helpful, honest, and harmless?

This is the LLM Alignment problem, and reinforcement learning is the core technology for solving it.

Why Do We Need RL?

Supervised fine-tuning (SFT) can make the model imitate high-quality responses, but it has limitations:

  1. Limited distribution: Can only learn response patterns that appear in the training set
  2. Cannot express preferences: Difficult to distinguish between “good” and “better”
  3. Cannot explore: Won’t try new answering strategies

Reinforcement learning provides a different perspective: treat the LLM as a policy, define a reward that reflects human preferences, and let the model improve by optimizing that reward instead of only imitating fixed demonstrations.

LLM Training Pipeline

RL Modeling of LLM Alignment

State/Action/Reward Definition

Modeling the LLM alignment problem as an RL problem:

RL Modeling of LLM

  • State $s_t$: prompt $x$ + generated token sequence $y_{<t} = (y_1, \ldots, y_{t-1})$
  • Action $a_t$: next token $y_t$ (vocabulary size $|\mathcal{V}| \sim$ 100k)
  • Policy $\pi_\theta(a|s)$: the LLM itself, $\pi_\theta(y_t | x, y_{<t})$
  • Trajectory $\tau$: complete generation sequence $y = (y_1, y_2, \ldots, y_T)$
  • Reward $r$: usually only given at the end of the sequence

LLM as MDP
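To make this mapping concrete, here is a minimal PyTorch sketch of how per-token actions and their log-probabilities are read off from a causal LM's logits. The tensor shapes and names (`input_ids`, `logits`, `prompt_len`) are illustrative stand-ins, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 1, a 4-token prompt, and a 3-token response.
prompt_len, resp_len, vocab = 4, 3, 50_000
input_ids = torch.randint(vocab, (1, prompt_len + resp_len))   # [x ; y]
logits = torch.randn(1, prompt_len + resp_len, vocab)          # stand-in for the LM's logits on input_ids

# At step t the state is (x, y_<t) and the action is y_t, so the distribution
# over y_t is read from the logits at position prompt_len + t - 1.
log_probs = F.log_softmax(logits, dim=-1)
action_positions = torch.arange(prompt_len - 1, prompt_len + resp_len - 1)
actions = input_ids[:, prompt_len:]                             # y_1, ..., y_T
per_token_logp = log_probs[:, action_positions, :].gather(
    -1, actions.unsqueeze(-1)).squeeze(-1)                      # log pi_theta(y_t | x, y_<t)

print(per_token_logp.shape)  # (1, resp_len): one log-prob per action
```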

Characteristics of LLM RL:

  • The action space is huge but discrete: the entire vocabulary (~100k tokens)
  • Transitions are deterministic and known: the next state is simply the current state with the chosen token appended
  • Episodes have a bounded horizon (generation stops at EOS or a length limit)
  • Reward is usually only available at the end of the sequence

Sparse Reward Problem

Typical reward structure for LLM alignment:

\[r_t = \begin{cases} 0 & t < T \\ r_\phi(x, y) & t = T \text{ (end of sequence)} \end{cases}\]

Challenges brought by sparse rewards:

  • Credit assignment: it is hard to tell which of the thousands of generated tokens were responsible for the final reward
  • High-variance gradients: a single scalar signal has to be spread over every decision in the sequence

Two approaches to solving sparse rewards:

  1. Sequence-level methods: Treat generating the entire sequence as a single bandit action and update directly with the sequence reward (e.g., REINFORCE)
  2. Process rewards: Train PRM to provide reward signals for intermediate steps
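Both views can be written down in a few lines. The sketch below (PyTorch, with made-up tensors and a hypothetical scalar `reward_model_score`) builds the sparse per-token reward vector and, for the bandit view, the plain REINFORCE loss on the sequence log-probability.

```python
import torch

# Continuing the sketch above: per_token_logp has shape (1, resp_len), and
# reward_model_score is a hypothetical scalar r_phi(x, y) from a trained RM.
resp_len = 3
per_token_logp = torch.randn(1, resp_len, requires_grad=True)
reward_model_score = torch.tensor([1.7])

# (1) Token-level view: the reward vector is zero except at the final token.
rewards = torch.zeros(1, resp_len)
rewards[:, -1] = reward_model_score

# (2) Sequence-level (bandit) view: REINFORCE treats the whole sequence as one
#     action, so the loss is -r(x, y) * log pi_theta(y | x), with
#     log pi_theta(y | x) = sum_t log pi_theta(y_t | x, y_<t).
seq_logp = per_token_logp.sum(dim=-1)                 # (1,)
reinforce_loss = -(reward_model_score * seq_logp).mean()
reinforce_loss.backward()
```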

Three-Stage RLHF

RLHF (Reinforcement Learning from Human Feedback) is the classic approach to LLM alignment, systematized by OpenAI in InstructGPT.

RLHF Overall Architecture

RLHF Architecture

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tune the pre-trained model with high-quality dialogue data:

\[L_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_\theta(y|x) \right] = -\mathbb{E} \left[ \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t}) \right]\]
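A minimal PyTorch sketch of this loss, assuming the usual convention that prompt tokens are masked out so only the response tokens contribute. The helper name `sft_loss` and the shared `prompt_len` across the batch are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Negative log-likelihood of the response tokens only (prompt tokens masked).

    logits:     (B, L, V) output of the causal LM on [prompt ; response]
    input_ids:  (B, L) token ids of [prompt ; response]
    prompt_len: int, number of prompt tokens (assumed equal across the batch)
    """
    # Standard causal shift: position t predicts the token at position t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Ignore the loss on prompt tokens: only sum log pi(y_t | x, y_<t).
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```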

Role of SFT:

  • Teaches the model the instruction-following / dialogue format
  • Provides a good initial policy for the RL stage, and later serves as the frozen reference policy $\pi_{\text{ref}}$

Stage 2: Reward Model Training

Learn the Reward Model from human preference data.

Preference data: For prompt $x$, human annotators compare two responses and give a preference: $y_w \succ y_l$ ($y_w$ is preferred over $y_l$).

Bradley-Terry Model

Bradley-Terry Model

Assumes human preferences follow the Bradley-Terry model—preference probability is determined by “ability difference”:

\[P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l)) = \frac{1}{1 + e^{-(r(x, y_w) - r(x, y_l))}}\]

where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function, and $r(x, y)$ is the “score” of the response.

Intuition of the Bradley-Terry model:

  • Each response has a latent “score” $r(x, y)$, and only the score difference matters for the preference
  • A large score gap makes the preference nearly certain; equal scores give probability $1/2$

Reward Model Training

The training objective of the Reward Model is to maximize the likelihood of preference data:

\[L_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]\]

This is a binary classification problem: given $(y_w, y_l)$, predict which is better.
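In code the loss is essentially one line. The sketch below assumes the reward model has already produced scalar scores for the chosen and rejected responses; the function name and toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_phi(x, y_w) - r_phi(x, y_l)).

    r_chosen, r_rejected: (B,) scalar scores from the reward model head.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores.
r_w = torch.tensor([2.0, 0.5])
r_l = torch.tensor([1.0, 1.5])
print(reward_model_loss(r_w, r_l))  # lower when r_w > r_l
```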

Reward Model architecture choices:

  • A common choice is to initialize from the SFT model (or a smaller LLM) and replace the language-modeling head with a linear head that outputs a scalar score for $(x, y)$
  • The score is typically read from the hidden state at the final token of the response

Stage 3: PPO Fine-tuning

Use the Reward Model to provide reward signals, optimize the policy with PPO.

RLHF Optimization Objective

\[\max_\theta \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\]

where $\beta > 0$ is the KL regularization coefficient.
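In many PPO-RLHF implementations this objective is turned into per-token rewards: a KL penalty at every step plus the RM score added at the final token. A sketch under that assumption (the function name and signature are mine, and `logp_policy - logp_ref` is only a sample-based estimate of the KL):

```python
import torch

def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards commonly used by PPO in RLHF: a KL penalty at every
    step plus the scalar RM score added at the final response token.

    rm_score:              (B,)   r_phi(x, y) for each sequence
    logp_policy, logp_ref: (B, T) per-token log-probs of the response under
                           pi_theta and the frozen pi_ref
    """
    kl_per_token = logp_policy - logp_ref            # sample-based KL estimate
    rewards = -beta * kl_per_token
    terminal = torch.zeros_like(rewards)
    terminal[:, -1] = rm_score                       # sparse RM reward at the end
    return rewards + terminal
```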

Role of KL Regularization

The KL regularization term $\text{KL}(\pi_\theta \| \pi_{\text{ref}})$ is crucial:

  1. Prevent Reward Hacking:
    • The Reward Model is an imperfect proxy
    • Unconstrained optimization will find ways to “fool” the RM
    • For example: generating specific patterns to get high scores, but actual quality is poor
  2. Maintain generation quality:
    • The SFT model already has good language capabilities
    • KL constraint prevents drifting too far and causing fluency degradation
  3. Stabilize training:
    • Constrain the optimization space, avoid policy collapse
    • Provide regularization effect

KL vs Reward Tradeoff

PPO Update Process

Specific steps of PPO in RLHF:

PPO-RLHF Algorithm
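As a rough outline of one PPO iteration in RLHF, the sketch below strings the pieces together. Every helper here (`sample_responses`, `compute_logprobs`, `compute_gae`, `ppo_clip_loss`, `value_loss`) is a hypothetical placeholder for the usual component, not a specific library's API.

```python
# Outline of one PPO-RLHF iteration; helper functions are hypothetical placeholders.

def rlhf_ppo_step(policy, ref_policy, reward_model, critic, prompts, beta, optimizer):
    # 1. Rollout: sample responses from the current policy pi_theta.
    responses = sample_responses(policy, prompts)

    # 2. Score: the frozen RM gives one scalar r_phi(x, y) per response.
    rm_scores = reward_model(prompts, responses)

    # 3. Per-token log-probs under the rollout policy and the frozen reference.
    logp_old = compute_logprobs(policy, prompts, responses).detach()
    logp_ref = compute_logprobs(ref_policy, prompts, responses).detach()

    # 4. Shaped rewards (per-token KL penalty + terminal RM score), then
    #    advantages/returns from the critic, e.g. with GAE.
    rewards = shaped_rewards(rm_scores, logp_old, logp_ref, beta)
    values = critic(prompts, responses)
    advantages, returns = compute_gae(rewards, values)

    # 5. PPO clipped policy loss plus a value-regression loss for the critic.
    #    (Real implementations run several epochs over the rollout; one step here.)
    logp_new = compute_logprobs(policy, prompts, responses)
    loss = ppo_clip_loss(logp_new, logp_old, advantages) + value_loss(values, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```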

Important Note: Models needed for RLHF:

  1. $\pi_\theta$: policy being trained (Active Model)
  2. $\pi_{\text{ref}}$: reference model (frozen)
  3. $r_\phi$: Reward Model (frozen)
  4. $V_\psi$: Critic network

Four large models in total, a huge memory overhead! This is exactly the problem that methods like DPO and GRPO try to solve.

Direct Preference Optimization (DPO)

DPO, proposed by Rafailov et al. (2023), is a simplified method that bypasses both the Reward Model and PPO.

DPO Motivation

Problems with RLHF + PPO:

DPO’s core question: can we optimize directly on the preference data $(x, y_w, y_l)$, as simply as in supervised learning?

The answer is yes! Key insight: The RL problem with KL regularization has a closed-form solution.

DPO Loss Formula

DPO Loss

\[L_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right) \right]\]
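A minimal PyTorch sketch of this loss, assuming the per-sequence log-probabilities (summed over response tokens) have already been computed for both the policy and the frozen reference; the function name and default $\beta$ are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss from sequence log-probs; each input has shape (B,) and is the
    sum over response tokens of log pi(y_t | x, y_<t)."""
    # Implicit rewards (up to the constant Z(x)): beta * log(pi_theta / pi_ref).
    chosen_logratio = logp_theta_w - logp_ref_w
    rejected_logratio = logp_theta_l - logp_ref_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```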

Complete DPO Derivation

DPO Equivalence Theorem: DPO Loss is equivalent to the RLHF objective at the optimal solution.

Proof: The derivation has 5 key steps.

Step 1: RLHF Objective Expansion

RLHF optimization objective:

\[\max_\pi \mathbb{E}_{y \sim \pi} \left[ r(x, y) \right] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}})\]

Expand the KL divergence:

\[= \mathbb{E}_{y \sim \pi} \left[ r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]\]

Step 2: Introduce Partition Function $Z(x)$

To make the optimal policy a valid probability distribution, define the partition function:

\[Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x,y)}{\beta} \right)\]

$Z(x)$ is a normalization constant that only depends on $x$ (not on the policy being optimized).

Step 3: Closed-form Solution for Optimal Policy

The KL-regularized RL problem has a closed-form solution:

Optimal Policy Lemma for KL-regularized RL

The optimal solution to the objective $\max_\pi \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}})$ is:

\[\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x,y)}{\beta} \right)\]

This is a constrained optimization problem ($\pi$ needs to be a probability distribution). Intuition: The optimal policy is the reference policy reweighted by $\exp(r/\beta)$. Higher reward, higher probability boost.
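A toy numerical illustration of this reweighting over four made-up candidate responses (all numbers are arbitrary):

```python
import torch

# The optimal policy is pi_ref reweighted by exp(r / beta) and renormalized.
pi_ref = torch.tensor([0.40, 0.30, 0.20, 0.10])
rewards = torch.tensor([0.0, 1.0, 2.0, 3.0])
beta = 1.0

unnormalized = pi_ref * torch.exp(rewards / beta)
Z = unnormalized.sum()                 # partition function Z(x)
pi_star = unnormalized / Z
print(pi_star)                         # probability mass shifts toward high-reward responses
```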

Step 4: Solve for Reward from Optimal Policy

Key step: Solve for reward from the optimal policy.

Take logarithm:

\[\log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) - \log Z(x) + \frac{r(x,y)}{\beta}\]

Rearrange to get:

\[r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]

Core Insight: Reward can be expressed using the log-ratio of policies! Although there’s a $\log Z(x)$ term, it only depends on $x$ and will cancel out in pairwise comparisons.

Step 5: Substitute into Bradley-Terry Model, $Z(x)$ Cancels

Substitute the reward expression into the Bradley-Terry model:

\[\begin{aligned} P(y_w \succ y_l) &= \sigma(r(x, y_w) - r(x, y_l)) \\ &= \sigma\left( \beta \left[ \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right) \end{aligned}\]

The $\beta \log Z(x)$ terms cancel out!
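Writing out the subtraction term by term makes the cancellation explicit:

\[r(x, y_w) - r(x, y_l) = \left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) \right) - \left( \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} + \beta \log Z(x) \right)\]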

Replacing $\pi^*$ with the parameterized policy $\pi_\theta$ and maximizing the log-likelihood of the preference data yields the DPO Loss.

DPO Derivation

DPO’s Core Insights:

  1. The KL-regularized RL problem has a closed-form solution, the optimal policy is exponential reweighting of the reference policy
  2. We can solve for the implicit reward from the optimal policy
  3. The partition function $Z(x)$ cancels out in pairwise comparisons—this is the key to why DPO works
  4. The final form only requires computing log-probabilities, making it as simple to train as supervised learning

Intuitive Understanding of DPO

Define implicit reward:

\[\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\]

DPO Loss can be written as:

\[L_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]\]

Intuition:

  • DPO pushes up the implicit reward of the preferred response $y_w$ and pushes down that of the rejected response $y_l$, i.e., it raises $\pi_\theta(y_w|x)$ and lowers $\pi_\theta(y_l|x)$ relative to the reference
  • $\beta$ controls how strongly the policy is penalized for moving away from $\pi_{\text{ref}}$

DPO vs RLHF Comparison

Feature                        RLHF + PPO          DPO
Need Reward Model              Yes                 No
Need Critic network            Yes                 No
Training method                Online sampling     Offline training
Number of models               4                   2
Implementation complexity      High                Low
Hyperparameter sensitivity     High                Low
Exploration ability            Yes                 No
Applicable scenarios           Complex tasks       Simple alignment

DPO limitations:

  • Purely offline: no online sampling from the current policy, hence no exploration
  • Performance is bounded by the coverage and quality of the preference dataset
  • Improvements on difficult tasks (e.g., multi-step reasoning) are limited

Chapter Summary

  1. RL Modeling of LLM Alignment: State = prompt + generated tokens, Action = next token, sparse reward only given at end of sequence

  2. Three-Stage RLHF:
    • Stage 1 (SFT): Supervised fine-tuning, learn instruction following
    • Stage 2 (RM): Train Reward Model from preference data (Bradley-Terry model)
    • Stage 3 (PPO): Use RM to provide rewards, PPO optimization, KL regularization prevents reward hacking
  3. DPO:
    • Leverages KL-RL closed-form solution, bypasses RM and PPO
    • Optimize directly on preference data, as simple as supervised learning
    • Only needs 2 models ($\pi_\theta$ and $\pi_{\text{ref}}$)
    • Limitations: No exploration ability, limited improvement on difficult tasks

RLHF vs DPO

The next article will introduce more advanced methods such as GRPO, KL estimators, PRM, and Long CoT RL, which attempt to restore online exploration capabilities while maintaining DPO’s simplicity.
