LLM Notes

Notes on LLMs and Reinforcement Learning - in-depth notes on Transformer, RLHF, PPO, DPO, and related techniques

RL Notes (3): Policy-Based RL

2025-12-19 · Qi Lu

In the previous article, we introduced Value-Based methods: first learn $Q^*$, then derive the policy through $\arg\max$. This approach works well in discrete action spaces, but encounters difficulties when facing the following problems:

If the action space is continuous (such as robot joint angles), how do we compute $\arg\max_a Q(s,a)$?

If the optimal policy is stochastic (such as rock-paper-scissors), how can we represent it with a deterministic policy?

Policy-Based methods provide a more direct approach: directly parameterize the policy $\pi_\theta(a|s)$ and maximize expected return through gradient ascent.

1. Why Do We Need Policy-Based Methods?

1.1 Limitations of Value-Based Methods

  1. Continuous action spaces are difficult: $\max_a Q(s,a)$ requires enumerating or optimizing over all actions
  2. Function approximation instability (Deadly Triad): Function approximation + Bootstrapping + Off-policy can diverge
  3. Indirect optimization objective: Minimizing TD error rather than directly optimizing expected return $J(\pi)$
  4. Can only learn deterministic policies: $\arg\max$ outputs deterministic actions, but stochastic policies are better in some environments

1.2 Parameterized Policies

Policy-Based methods directly parameterize the policy $\pi_\theta(a|s)$:

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ G_0 \right] = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]\]

Common forms of policy parameterization: a softmax (categorical) distribution over network logits for discrete action spaces, and a Gaussian distribution whose mean (and optionally variance) is output by the network for continuous action spaces.
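As a concrete sketch of these two parameterizations (added here, not part of the original notes): a categorical head for discrete actions and a diagonal Gaussian head for continuous actions in PyTorch; all class and layer names are illustrative.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Softmax policy for discrete actions: pi_theta(a|s) = softmax(f_theta(s))."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for continuous actions: a ~ N(mu_theta(s), sigma^2)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

# Sampling an action and its log-probability (used later for the policy gradient):
# dist = policy(obs); a = dist.sample(); logp = dist.log_prob(a)
```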

Advantages of Policy-Based Methods:

  1. Handle continuous actions: Directly output action distributions, no need for $\arg\max$
  2. Learn stochastic policies: Can output probability distributions over actions
  3. Direct optimization objective: Gradient ascent directly maximizes $J(\theta)$
  4. Better convergence properties: Small changes in policy parameters lead to small changes in policy (smoothness)

2. Policy Gradient Theorem

The Policy Gradient theorem is the theoretical foundation of Policy-Based RL, providing the gradient expression of the objective function $J(\theta)$ with respect to parameters $\theta$.

2.1 Log-Derivative Trick

The key trick for computing $\nabla_\theta J(\theta)$ is the Log-Derivative Trick:

\[\nabla_\theta p(x|\theta) = p(x|\theta) \nabla_\theta \log p(x|\theta)\]

Proof: By the derivative rule for logarithms, $\nabla_\theta \log p(x|\theta) = \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}$, multiplying both sides by $p(x|\theta)$ gives the result.

The elegance of the Log-Derivative Trick: It converts the derivative of $p(x|\theta)$ into the derivative of $\log p(x|\theta)$, which is often easier to compute, especially when $p$ is in product form.
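As a quick illustration (added here, not part of the original derivation): for a one-dimensional Gaussian policy with fixed variance, $\pi_\theta(a|s) = \mathcal{N}(a;\, \mu_\theta(s), \sigma^2)$, the logarithm removes the exponential and the score function has a simple closed form:

\[\log \pi_\theta(a|s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma^2} + \text{const}, \qquad \nabla_\theta \log \pi_\theta(a|s) = \frac{a - \mu_\theta(s)}{\sigma^2} \, \nabla_\theta \mu_\theta(s)\]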

2.2 Policy Gradient Theorem

Theorem (Policy Gradient Theorem):

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]\]

where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the reward-to-go from time $t$.

Proof Sketch:

Step 1: Apply the Log-Derivative Trick

\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int p(\tau|\theta) R(\tau) d\tau \\ &= \int p(\tau|\theta) \nabla_\theta \log p(\tau|\theta) R(\tau) d\tau \\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p(\tau|\theta) \cdot R(\tau) \right] \end{aligned}\]

Step 2: Expand $\nabla_\theta \log p(\tau|\theta)$

Recall the trajectory probability decomposition: $p(\tau|\theta) = p(s_0) \prod_{t} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$

Taking the logarithm and computing the gradient:

\[\nabla_\theta \log p(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\]

Key Observation: $p(s_0)$ is the environment’s initial state distribution, and $P(s_{t+1}|s_t,a_t)$ is the environment’s dynamics model, both of which are independent of the policy parameters $\theta$, so their gradients are zero!

This means: Even without knowing the environment dynamics $P$, we can still compute the Policy Gradient—this is the fundamental reason why Policy Gradient methods can be Model-Free.

Step 3: Introduce Reward-to-go (Causality)

Action $a_t$ only affects future rewards, not past rewards. Therefore, we can replace the full return $R(\tau)$ with reward-to-go $G_t$.

Intuitive Understanding of the Policy Gradient Theorem:

  • $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is the direction that “increases the probability of action $a_t$”
  • $G_t$ is the cumulative reward obtained after that action
  • If $G_t > 0$: Update in the gradient direction, increase the probability of $a_t$
  • If $G_t < 0$: Update in the opposite direction, decrease the probability of $a_t$

In short: Good actions become more likely to be selected, bad actions become less likely to be selected.

3. REINFORCE Algorithm

REINFORCE is the simplest Policy Gradient algorithm, using Monte Carlo sampling directly to estimate the gradient.

REINFORCE algorithm (per iteration): sample a complete trajectory with $\pi_\theta$, compute the reward-to-go $G_t$ for every time step, then update $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$.

REINFORCE is an unbiased estimator: $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$

But it has high variance: $G_t$ sums the rewards of an entire episode, so the estimate from a single trajectory fluctuates strongly from episode to episode, which makes learning slow and unstable.
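A minimal PyTorch sketch of the REINFORCE update (illustrative names), assuming the per-step log-probabilities and rewards of one complete episode have already been collected:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy gradient: loss = -sum_t log pi(a_t|s_t) * G_t.

    log_probs: list of per-step log pi_theta(a_t|s_t) tensors (with grad)
    rewards:   list of per-step scalar rewards from one full episode
    """
    G, returns = 0.0, []
    for r in reversed(rewards):            # compute reward-to-go G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    log_probs = torch.stack(log_probs)
    # Negative sign: optimizers minimize, but we want gradient *ascent* on J(theta)
    return -(log_probs * returns).sum()

# Usage (per episode): loss = reinforce_loss(log_probs, rewards)
#                      optimizer.zero_grad(); loss.backward(); optimizer.step()
```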

4. Baseline and Variance Reduction

4.1 Baseline Trick

A clever trick is: subtracting a baseline $b(s_t)$ from $G_t$ can reduce variance without introducing bias.

Theorem (Baseline doesn’t change expectation): For any function $b(s)$ that depends only on state $s$ (not on action $a$):

\[\mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = 0\]

Proof: Since $b(s)$ doesn’t depend on $a$, it can be pulled outside the expectation:

\[b(s) \cdot \mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0\]

Therefore, the Policy Gradient can be written as:

\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) \right]\]

4.2 Why Does the Baseline Reduce Variance?

The variance of the gradient estimate is driven by the magnitude of the coefficient ($G_t$ or $G_t - b(s_t)$) multiplying the score $\nabla_\theta \log \pi_\theta(a_t|s_t)$. Raw returns are often all positive, so every sampled action has its probability pushed up, only by differing amounts, and individual estimates swing widely. Subtracting a baseline close to the expected return centers this coefficient around zero, shrinking its typical magnitude and therefore the variance, while the theorem above guarantees the expectation is unchanged.

4.3 Optimal Baseline

Result (near-optimal baseline): Under the constraint of not changing the expectation, the variance-minimizing baseline is approximately the state value function, $b^*(s) \approx V^\pi(s)$ (the exact minimizer is a score-weighted expected return, but $V^\pi$ is the standard practical choice).

When $b(s) = V^\pi(s)$, the expectation of $G_t - V^\pi(s_t)$ is exactly the advantage function!

5. Advantage Function and Actor-Critic

5.1 Definition and Intuition of Advantage

\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]

When using $V^\pi(s)$ as the baseline:

\[\mathbb{E}_\pi \left[ G_t - V^\pi(s_t) \mid s_t, a_t \right] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)\]

Therefore, Policy Gradient with Advantage:

\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t) \right]\]

Intuition of Advantage $A(s,a)$:

  • $V(s)$: “Average” performance in state $s$
  • $Q(s,a)$: Performance when choosing action $a$ in state $s$
  • $A(s,a) = Q(s,a) - V(s)$: How much better is action $a$ than average

  • $A > 0$: the action is better than average, so its probability should be increased
  • $A < 0$: the action is worse than average, so its probability should be decreased

5.2 Methods for Estimating Advantage

  1. Monte Carlo estimate: $\hat{A}_t^{\text{MC}} = G_t - \hat{V}(s_t)$ (unbiased but high variance)

  2. TD estimate (1-step): $\hat{A}_t^{\text{TD}} = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) = \delta_t$ (low variance but biased)

  3. n-step estimate: Between the two

  4. GAE: Flexibly trades off bias and variance through the $\lambda$ parameter
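A minimal sketch (PyTorch, illustrative names) of the first two estimators in the list above; the n-step and GAE variants are covered in Section 6.

```python
import torch

def mc_advantage(rewards, values, gamma=0.99):
    """A_t^MC = G_t - V_hat(s_t): unbiased but high variance."""
    G, returns = 0.0, []
    for r in reversed(rewards.tolist()):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    return returns - values

def td_advantage(rewards, values, next_values, gamma=0.99):
    """A_t^TD = r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t): low variance but biased."""
    return rewards + gamma * next_values - values
```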

5.3 Actor-Critic Architecture

To estimate $\hat{V}(s)$, we introduce a Critic network. Actor-Critic methods simultaneously learn two components: an Actor $\pi_\theta(a|s)$ (the policy network, updated with the policy gradient) and a Critic $\hat{V}_\phi(s)$ (the value network, updated by regression toward a value target).

(Figure: Actor-Critic architecture)

A2C (Advantage Actor-Critic) core update rules:

Actor update (Policy Gradient with Advantage): \(\theta \leftarrow \theta + \alpha_\theta \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t\)

Critic update (Value Function Regression): \(\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \sum_t \left( \hat{V}_\phi(s_t) - \text{target} \right)^2\)
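A minimal sketch of these two A2C losses in PyTorch, assuming the batch of log-probabilities, advantage estimates, Critic outputs, and value targets has already been computed; names are illustrative.

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, advantages, values, value_targets):
    """Actor: gradient ascent on sum_t log pi * A_hat (advantages detached so only
    the actor gradient flows). Critic: regression of V_phi(s_t) onto its target."""
    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = F.mse_loss(values, value_targets)
    return actor_loss, critic_loss

# Typical usage: (actor_loss + 0.5 * critic_loss).backward(); optimizer.step()
```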

Why Do We Need a Critic?

  • Provides $\hat{V}(s)$ to compute advantage $\hat{A}_t$
  • Lower variance compared to pure MC (using $G_t$)
  • Can update at every step, no need to wait for episode termination

6. Generalized Advantage Estimation (GAE)

GAE provides a method for flexibly trading off between bias and variance in advantage estimation, and is a core component of modern Policy Gradient algorithms (such as PPO).

6.1 From n-step Advantage to GAE

Recall n-step advantage estimation:

\[\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \hat{V}(s_{t+n}) - \hat{V}(s_t)\]

A natural question: Can we combine estimates with different $n$ to achieve a better tradeoff?

The answer is GAE—by taking an exponentially weighted average of all n-step advantages, a single parameter $\lambda$ flexibly controls the bias-variance balance point.

6.2 Definition of GAE

Definition (Generalized Advantage Estimation):

\[\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]

where $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$ is the TD residual, and $\lambda \in [0,1]$ is the decay parameter.

Theorem: GAE is equivalent to a weighted sum of n-step Advantages:

\[\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}\]

6.3 Bias-Variance Tradeoff of $\lambda$ Parameter

| $\lambda$ value | Equivalent form | Bias | Variance |
|---|---|---|---|
| $\lambda = 0$ | $\delta_t$ (TD) | High (depends on $\hat{V}$) | Low |
| $\lambda = 1$ | $G_t - \hat{V}(s_t)$ (MC) | Low | High |
| $\lambda \in (0,1)$ | Weighted average | Medium | Medium |

In practice, $\lambda = 0.95$ or $\lambda = 0.97$ are commonly used choices.

Intuitive Understanding of GAE:

  • $\delta_t$ is the advantage of “estimating remaining value with Critic after one step”
  • GAE is a weighted sum of multi-step $\delta$, with $(\gamma\lambda)^l$ making distant $\delta$ decay exponentially
  • Smaller $\lambda$ relies more on Critic estimation (higher bias but lower variance)
  • Larger $\lambda$ relies more on actual returns (lower bias but higher variance)

6.4 Practical Computation of GAE

GAE can be efficiently computed through recursion:

\[\hat{A}_t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}\]

Boundary condition: $\hat{A}_T^{\text{GAE}} = 0$. Compute from back to front with complexity $O(T)$.
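A sketch of this backward recursion in PyTorch (illustrative names; `values` is assumed to hold $\hat{V}$ for all $T$ states plus the bootstrap value of the state after the last step, 0 if terminal):

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lambda*A_{t+1}, with A_T = 0.

    rewards: tensor of shape [T]      values: tensor of shape [T+1]
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages   # lam=0 recovers the 1-step TD estimate, lam=1 recovers MC - V_hat
```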

7. Importance Sampling and Off-Policy Policy Gradient

7.1 The Problem with On-Policy

Policy Gradient is on-policy: after each update of $\theta$, data collected under the old policy no longer follows the distribution of the new policy. This leads to poor sample efficiency: strictly speaking, every batch of trajectories can be used for only one gradient update and must then be discarded.

Importance Sampling (IS) allows us to reuse old data.

7.2 Importance Sampling Principle

Using samples from distribution $q(x)$ to estimate expectation under $p(x)$:

\[\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[ \frac{p(x)}{q(x)} f(x) \right]\]

where $\rho(x) = \frac{p(x)}{q(x)}$ is called the importance weight.

Applying to Policy Gradient, the single-step importance weight:

\[\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}\]

Off-policy Policy Gradient:

\[\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]\]
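In code, the importance weight is usually formed from stored log-probabilities; a minimal sketch (PyTorch, illustrative names), where `logp_old` is assumed to have been saved when the data was collected:

```python
import torch

def surrogate_objective(logp_new, logp_old, advantages):
    """Per-sample surrogate whose gradient is rho_t * grad log pi_theta * A_hat.

    logp_new: log pi_theta(a_t|s_t), requires grad    logp_old: saved at sampling time
    """
    ratio = torch.exp(logp_new - logp_old)     # rho_t(theta), computed in log space
    return (ratio * advantages).mean()         # maximize this (gradient ascent)
```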

7.3 Strict Off-Policy Gradient and State Distribution Correction

The formula above omits an important detail. Strict off-policy policy gradient requires correcting not only action probabilities but also the state distribution:

\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[ \frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \cdot \mathbb{E}_{a \sim \pi_{\text{old}}(\cdot|s)} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a|s) \hat{A}(s,a) \right] \right]\]

where $d_\pi(s)$ is the state distribution induced by policy $\pi$ (also known as the discounted state visitation distribution).

Why is state distribution correction needed?

Intuitive understanding: when sampling with the old policy $\pi_{\text{old}}$, it is not only the action distribution that differs from the new policy; the distribution of visited states changes as well. For example, if the new policy prefers actions the old policy rarely took, it will also be driven into states that the old trajectories barely cover, so the old data under-represents exactly the states the new policy ends up in.

Therefore, strict off-policy gradient needs to correct both biases.

Implicit Approximation in PPO/TRPO

However, computing the state distribution ratio $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)}$ is very difficult—it depends on the cumulative effect of the entire trajectory and cannot be directly computed like action probabilities.

The surrogate objective of PPO/TRPO actually makes a key approximation:

\[\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1\]

That is, assuming the state distributions induced by new and old policies are the same. When is this approximation reasonable?

The Role of Trust Region: When $\pi_\theta$ is sufficiently close to $\pi_{\text{old}}$ (small KL divergence), the difference in state distributions will also be small. TRPO’s KL constraint and PPO’s clip mechanism are precisely to ensure this.

Summary: The surrogate objective used by PPO/TRPO

\[L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right]\]

actually implicitly does two things:

  1. Uses $\rho_t = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct action probability bias
  2. Assumes $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1$, ignoring state distribution bias

This explains why PPO/TRPO can directly use token-level $\rho_t$ without additional state distribution correction—provided the trust region constraint holds.

7.4 Variance Problem

When $\rho_t$ deviates too far from 1, the variance of the importance-sampled estimate increases dramatically. We therefore need to limit the magnitude of each policy update so that $\rho_t$ stays close to 1.

8. Trust Region Methods: TRPO and PPO

8.1 Motivation: Limiting Policy Update Magnitude

Importance sampling allows reusing old data, but if $\pi_\theta$ differs too much from $\pi_{\text{old}}$, the estimate becomes unreliable. Trust Region methods solve this problem by limiting the magnitude of policy updates.

Core idea: Each update is performed only within a “trust region”, ensuring the new policy is sufficiently close to the old policy.

8.2 TRPO: KL-Constrained Optimization

TRPO limits policy updates through KL divergence constraints:

\[\begin{aligned} \max_\theta \quad & L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right] \\ \text{s.t.} \quad & \bar{D}_{\text{KL}}(\pi_{\text{old}} \| \pi_\theta) \leq \delta \end{aligned}\]

TRPO theoretically guarantees monotonic improvement, but requires computing the Hessian of KL divergence, making implementation complex.

8.3 PPO: Simplified Trust Region

PPO approximates TRPO’s effect through simpler means.

PPO-Clip objective:

\[L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min \left( \rho_t \hat{A}_t, \, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]\]

where $\text{clip}(x, a, b) = \max(a, \min(x, b))$, and $\epsilon$ is typically 0.1 or 0.2.
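A minimal PyTorch sketch of the PPO-Clip objective written as a loss to minimize (illustrative names, assuming per-sample log-probabilities and advantages are given):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """L^CLIP = E[min(rho*A, clip(rho, 1-eps, 1+eps)*A)], returned negated as a loss."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()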

Intuition of PPO-Clip: the clip removes any incentive to push $\rho_t$ outside $[1-\epsilon,\, 1+\epsilon]$. When $\hat{A}_t > 0$, the objective no longer grows once $\rho_t$ exceeds $1+\epsilon$, so there is nothing to gain from raising the action's probability further; when $\hat{A}_t < 0$, there is nothing to gain from pushing $\rho_t$ below $1-\epsilon$. Taking the $\min$ with the unclipped term keeps the more pessimistic of the two, making $L^{\text{CLIP}}$ a lower bound on the unclipped surrogate.

PPO-KL (Optional Variant):

An alternative variant of PPO uses a KL penalty term instead of clipping:

\[L^{\text{KL}}(\theta) = \mathbb{E} \left[ \rho_t \hat{A}_t \right] - \beta \cdot \text{KL}(\pi_{\text{old}} \| \pi_\theta)\]

where $\beta$ is an adaptively adjusted coefficient: after each update phase the measured KL divergence is compared with a target value; if it is well above the target (e.g. more than $1.5\times$), $\beta$ is increased (e.g. doubled), and if it is well below, $\beta$ is decreased (e.g. halved).

In practice, PPO-Clip is more commonly used because it’s simpler and performs comparably.

8.4 Entropy Bonus

To encourage exploration, PPO typically also adds an entropy bonus:

\[L^{\text{total}}(\theta) = L^{\text{CLIP}}(\theta) + c_1 \cdot H(\pi_\theta)\]

where $H(\pi_\theta) = -\mathbb{E}[\log \pi_\theta(a|s)]$ is the entropy of the policy, and $c_1$ is a weight coefficient (typically 0.01).
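A minimal sketch of adding the entropy bonus on top of the clipped loss (illustrative names; `clip_loss` is assumed to be the negated $L^{\text{CLIP}}$ from the previous sketch and `dist` a `torch.distributions` object):

```python
import torch

def ppo_total_loss(clip_loss, dist, c1=0.01):
    """Minimizing (clip_loss - c1 * H) is the same as maximizing L^CLIP + c1 * H(pi_theta)."""
    entropy = dist.entropy().mean()   # batch estimate of the policy entropy
    return clip_loss - c1 * entropy
```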

Role of Entropy Bonus:

  • Encourages the policy to maintain some randomness, avoiding premature convergence to a deterministic policy
  • Promotes exploration, preventing getting stuck in local optima
  • Higher entropy means a more “uniform” policy with flatter probability distribution over actions

8.5 Complete PPO Algorithm

PPO algorithm (sketch): collect a batch of trajectories with $\pi_{\text{old}}$; compute advantages $\hat{A}_t$ with GAE and value targets for the Critic; run $K$ epochs of minibatch gradient steps on $L^{\text{CLIP}}$ (plus the value loss and entropy bonus); then set $\pi_{\text{old}} \leftarrow \pi_\theta$ and repeat.
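A condensed PyTorch sketch of the update phase, assuming the rollout has already been collected with $\pi_{\text{old}}$ and the advantages computed (e.g. with the GAE sketch above); the batch layout, names, and loss coefficients are illustrative, and a discrete (Categorical) policy is assumed.

```python
import torch

def ppo_update(policy, value_fn, optimizer, batch, epochs=4, minibatch=64,
               eps=0.2, c1=0.01, c_vf=0.5):
    """One PPO-Clip update phase: reuse a fixed rollout for several epochs.

    batch: dict of tensors collected with pi_old -- "obs", "actions", "logp_old",
           "advantages", "value_targets". The optimizer covers both networks.
    """
    n = batch["obs"].shape[0]
    for _ in range(epochs):                              # K passes over the same data
        for idx in torch.randperm(n).split(minibatch):   # shuffled minibatches
            obs, act = batch["obs"][idx], batch["actions"][idx]
            dist = policy(obs)                           # e.g. Categorical over actions
            logp_new = dist.log_prob(act)
            ratio = torch.exp(logp_new - batch["logp_old"][idx])
            adv = batch["advantages"][idx]
            clip_loss = -torch.min(ratio * adv,
                                   torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
            vf_loss = (value_fn(obs).squeeze(-1) - batch["value_targets"][idx]).pow(2).mean()
            entropy = dist.entropy().mean()
            loss = clip_loss + c_vf * vf_loss - c1 * entropy
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```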

Reasons for PPO’s Success:

  1. Simple and efficient: Only needs first-order optimization, no Hessian computation required
  2. Sample efficiency: Can reuse the same batch of data multiple times ($K$ updates)
  3. Stability: Clip mechanism prevents drastic policy changes
  4. Robustness: Not sensitive to hyperparameters, applicable to various tasks

PPO is currently the most commonly used Policy Gradient algorithm and is the standard choice in RLHF.

Chapter Summary

Core Content:

  1. Policy Gradient Theorem
    • Provides the analytical form of the objective function gradient: $\nabla_\theta J = \mathbb{E}[\sum_t \nabla \log \pi \cdot G_t]$
    • Log-Derivative Trick is the key to the derivation
    • Environment dynamics are independent of $\theta$, enabling Model-Free learning
  2. Variance Reduction Techniques
    • Baseline trick: Subtracting $b(s)$ doesn’t change expectation but reduces variance
    • Optimal baseline is $V^\pi(s)$
    • Use Advantage $A = Q - V$ instead of $G_t$
  3. Actor-Critic Architecture
    • Actor (policy network) + Critic (value network)
    • Critic provides $\hat{V}(s)$ to estimate advantage
  4. GAE
    • $\hat{A}^{\text{GAE}} = \sum_l (\gamma\lambda)^l \delta_{t+l}$
    • $\lambda$ controls bias-variance tradeoff
  5. Trust Region Methods
    • Importance sampling allows reusing old data, but requires limiting policy changes
    • TRPO: KL-constrained optimization, complex implementation
    • PPO: Clip mechanism, simple and efficient, the preferred choice in practice

(Figure: Policy Gradient evolution)

The next article will introduce Model-Based RL and multi-agent learning, including MCTS and AlphaGo/Zero.
