RL Notes (3): Policy-Based RL
2025-12-19 · Qi Lu
In the previous article, we introduced Value-Based methods: first learn $Q^*$, then derive the policy through $\arg\max$. This approach works well in discrete action spaces, but encounters difficulties when facing the following problems:
If the action space is continuous (such as robot joint angles), how do we compute $\arg\max_a Q(s,a)$?
If the optimal policy is stochastic (such as rock-paper-scissors), how can we represent it with a deterministic policy?
Policy-Based methods provide a more direct approach: directly parameterize the policy $\pi_\theta(a|s)$ and maximize expected return through gradient ascent.
1. Why Do We Need Policy-Based Methods?
1.1 Limitations of Value-Based Methods
- Continuous action spaces are difficult: $\max_a Q(s,a)$ requires enumerating or optimizing over all actions
- Function approximation instability (Deadly Triad): Function approximation + Bootstrapping + Off-policy can diverge
- Indirect optimization objective: Minimizing TD error rather than directly optimizing expected return $J(\pi)$
- Can only learn deterministic policies: $\arg\max$ outputs deterministic actions, but stochastic policies are better in some environments
1.2 Parameterized Policies
Policy-Based methods directly parameterize the policy $\pi_\theta(a|s)$:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ G_0 \right] = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]\]Common forms of policy parameterization (a code sketch follows the list):
- Discrete action space: Softmax outputs a Categorical distribution: \(\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}\)
- Continuous action space: Output the parameters of a Gaussian distribution $(\mu_\theta(s), \sigma_\theta(s))$: \(\pi_\theta(a|s) = \mathcal{N}(a \mid \mu_\theta(s), \sigma_\theta^2(s))\)
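To make the two parameterizations concrete, here is a minimal PyTorch sketch; the class names, MLP sizes, and the state-independent $\sigma$ are illustrative choices rather than anything prescribed by the text.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Softmax policy over a finite action set (illustrative sketch)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),        # action preferences f_theta(s, .)
        )

    def forward(self, s):
        return Categorical(logits=self.net(s))   # pi_theta(.|s)

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for continuous actions (illustrative sketch)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),           # mu_theta(s)
        )
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # state-independent sigma (a common simplification)

    def forward(self, s):
        return Normal(self.mu(s), self.log_sigma.exp())       # pi_theta(.|s)

# Sampling and log-probabilities work identically for both:
# dist = policy(state); a = dist.sample(); logp = dist.log_prob(a)
```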
Advantages of Policy-Based Methods:
- Handle continuous actions: Directly output action distributions, no need for $\arg\max$
- Learn stochastic policies: Can output probability distributions over actions
- Direct optimization objective: Gradient ascent directly maximizes $J(\theta)$
- Better convergence properties: Small changes in policy parameters lead to small changes in policy (smoothness)
2. Policy Gradient Theorem
The Policy Gradient theorem is the theoretical foundation of Policy-Based RL, providing the gradient expression of the objective function $J(\theta)$ with respect to parameters $\theta$.
2.1 Log-Derivative Trick
The key trick for computing $\nabla_\theta J(\theta)$ is the Log-Derivative Trick:
\[\nabla_\theta p(x|\theta) = p(x|\theta) \nabla_\theta \log p(x|\theta)\]Proof: By the derivative rule for logarithms, $\nabla_\theta \log p(x|\theta) = \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}$, multiplying both sides by $p(x|\theta)$ gives the result.
The elegance of the Log-Derivative Trick: It converts the derivative of $p(x|\theta)$ into the derivative of $\log p(x|\theta)$, which is often easier to compute, especially when $p$ is in product form.
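As a quick sanity check, the trick can be verified numerically on a toy example. The setup below is an assumption for illustration: $x \sim \text{Bernoulli}(\theta)$ and $f(x)=x$, so $\mathbb{E}[f]=\theta$ and the true gradient is exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                     # Bernoulli parameter
x = rng.binomial(1, theta, size=200_000)        # samples x ~ p(x|theta)
f = x.astype(float)                             # f(x) = x, so d/dtheta E[f] = 1

# Score function: d/dtheta log p(x|theta) = x/theta - (1-x)/(1-theta)
score = x / theta - (1 - x) / (1 - theta)

grad_estimate = np.mean(f * score)              # E[ f(x) * grad log p(x|theta) ]
print(grad_estimate)                            # ~1.0, matching the analytic gradient
```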
2.2 Policy Gradient Theorem
Theorem (Policy Gradient Theorem):
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]\]where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the reward-to-go from time $t$.
Proof Sketch:
Step 1: Apply the Log-Derivative Trick
\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int p(\tau|\theta) R(\tau) d\tau \\ &= \int p(\tau|\theta) \nabla_\theta \log p(\tau|\theta) R(\tau) d\tau \\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p(\tau|\theta) \cdot R(\tau) \right] \end{aligned}\]
Step 2: Expand $\nabla_\theta \log p(\tau|\theta)$
Recall the trajectory probability decomposition: $p(\tau|\theta) = p(s_0) \prod_{t} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$
Taking the logarithm and computing the gradient:
\[\nabla_\theta \log p(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\]Key Observation: $p(s_0)$ is the environment’s initial state distribution, and $P(s_{t+1}|s_t,a_t)$ is the environment’s dynamics model, both of which are independent of the policy parameters $\theta$, so their gradients are zero!
This means: Even without knowing the environment dynamics $P$, we can still compute the Policy Gradient—this is the fundamental reason why Policy Gradient methods can be Model-Free.
Step 3: Introduce Reward-to-go (Causality)
Action $a_t$ only affects future rewards, not past rewards. Therefore, we can replace the full return $R(\tau)$ with reward-to-go $G_t$.
Intuitive Understanding of the Policy Gradient Theorem:
- $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is the direction that “increases the probability of action $a_t$”
- $G_t$ is the cumulative reward obtained after that action
- If $G_t > 0$: Update in the gradient direction, increase the probability of $a_t$
- If $G_t < 0$: Update in the opposite direction, decrease the probability of $a_t$
In short: Good actions become more likely to be selected, bad actions become less likely to be selected.
3. REINFORCE Algorithm
REINFORCE is the simplest Policy Gradient algorithm, using Monte Carlo sampling directly to estimate the gradient.
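A minimal sketch of the resulting update, assuming per-step log-probabilities and rewards have been collected for one episode; the function name and signature are illustrative.

```python
import torch

def reinforce_loss(logps, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss for one episode (illustrative sketch).

    logps   : list of log pi_theta(a_t|s_t) tensors, still attached to the graph
    rewards : list of scalar rewards r_t
    """
    # Reward-to-go: G_t = sum_k gamma^k r_{t+k}, computed back to front.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta)  <=>  gradient descent on -sum_t logp_t * G_t.
    return -(torch.stack(logps) * returns).sum()

# Typical usage: run one episode storing (logp_t, r_t), then
# loss = reinforce_loss(logps, rewards); loss.backward(); optimizer.step()
```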
REINFORCE is an unbiased estimator: $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$
But the estimator has high variance:
- $G_t$ accumulates all randomness from $t$ to termination
- Longer trajectories lead to larger variance
- When rewards are sparse, most trajectories have $G_t \approx 0$
4. Baseline and Variance Reduction
4.1 Baseline Trick
A clever trick: subtracting a baseline $b(s_t)$ from $G_t$ reduces variance without introducing bias.
Theorem (Baseline doesn’t change expectation): For any function $b(s)$ that depends only on state $s$ (not on action $a$):
\[\mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = 0\]Proof: Since $b(s)$ doesn’t depend on $a$, it can be pulled outside the expectation:
\[b(s) \cdot \mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0\]Therefore, the Policy Gradient can be written as:
\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) \right]\]
4.2 Why Does Baseline Reduce Variance?
- $G_t$ might always be positive (e.g., when all rewards are positive), causing all action probabilities to increase
- Subtracting $b(s)$ (such as the average return) makes $G_t - b(s_t)$ both positive and negative
- Actions better than average are reinforced, actions worse than average are weakened
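A toy numerical illustration of this effect, under illustrative assumptions: a one-state "bandit" with Gaussian policy $\mathcal{N}(\theta, 1)$ and reward $r(a) = a$, so the true gradient of $\mathbb{E}[r]$ with respect to $\theta$ is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 5.0, 100_000
a = rng.normal(theta, 1.0, size=n)        # actions a ~ pi_theta = N(theta, 1)
reward = a                                # toy reward r(a) = a, so grad_theta E[r] = 1

score = a - theta                         # grad_theta log N(a | theta, 1)

g_plain    = score * reward               # no baseline
g_baseline = score * (reward - theta)     # baseline b = E[r] = theta

print(g_plain.mean(), g_baseline.mean())  # both ~1.0  -> unbiased either way
print(g_plain.var(),  g_baseline.var())   # ~27 vs ~2  -> baseline sharply reduces variance
```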
4.3 Optimal Baseline
Theorem (practical form): Under the constraint of not changing the expectation, the variance-minimizing baseline is (approximately) the state value function: $b^*(s) \approx V^\pi(s)$. (The exact optimum weights returns by the squared norm of the score, but $V^\pi$ is the standard choice in practice.)
When $b(s) = V^\pi(s)$, the expectation of $G_t - V^\pi(s_t)$ is exactly the advantage function!
5. Advantage Function and Actor-Critic
5.1 Definition and Intuition of Advantage
\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]When using $V^\pi(s)$ as the baseline:
\[\mathbb{E}_\pi \left[ G_t - V^\pi(s_t) \mid s_t, a_t \right] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)\]Therefore, Policy Gradient with Advantage:
\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t) \right]\]Intuition of Advantage $A(s,a)$:
- $V(s)$: “Average” performance in state $s$
- $Q(s,a)$: Performance when choosing action $a$ in state $s$
- $A(s,a) = Q(s,a) - V(s)$: How much better is action $a$ than average
- $A > 0$: This action is better than average, should increase its probability
- $A < 0$: This action is worse than average, should decrease its probability
5.2 Methods for Estimating Advantage
- Monte Carlo estimate: $\hat{A}_t^{\text{MC}} = G_t - \hat{V}(s_t)$ (unbiased but high variance)
- TD estimate (1-step): $\hat{A}_t^{\text{TD}} = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) = \delta_t$ (low variance but biased)
- n-step estimate: between the two
- GAE: flexibly trades off bias and variance through the $\lambda$ parameter
5.3 Actor-Critic Architecture
To estimate $\hat{V}(s)$, we introduce a Critic network. Actor-Critic methods simultaneously learn:
- Actor: Policy network $\pi_\theta(a|s)$, outputs action distribution
- Critic: Value network $\hat{V}_\phi(s)$, estimates state value
A2C (Advantage Actor-Critic) core update rules:
Actor update (Policy Gradient with Advantage): \(\theta \leftarrow \theta + \alpha_\theta \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t\)
Critic update (Value Function Regression): \(\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \sum_t \left( \hat{V}_\phi(s_t) - \text{target} \right)^2\)
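A compact sketch of these two updates, assuming the actor's action distribution, the critic's value predictions, and return targets for a batch have already been computed; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def a2c_losses(dist, actions, values, returns):
    """One A2C update on a batch (illustrative sketch).

    dist    : action distribution pi_theta(.|s_t) from the actor
    actions : actions a_t actually taken
    values  : critic predictions V_phi(s_t)
    returns : regression targets for the critic (e.g. bootstrapped n-step returns)
    """
    adv = (returns - values).detach()                        # advantage estimate; no gradient into the actor term
    actor_loss  = -(dist.log_prob(actions) * adv).mean()     # policy gradient with advantage
    critic_loss = F.mse_loss(values, returns)                # value function regression
    return actor_loss, critic_loss

# Typical usage: loss = actor_loss + 0.5 * critic_loss; loss.backward(); optimizer.step()
```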
Why Do We Need a Critic?
- Provides $\hat{V}(s)$ to compute advantage $\hat{A}_t$
- Lower variance compared to pure MC (using $G_t$)
- Can update at every step, no need to wait for episode termination
6. Generalized Advantage Estimation (GAE)
GAE provides a method for flexibly trading off between bias and variance in advantage estimation, and is a core component of modern Policy Gradient algorithms (such as PPO).
6.1 From n-step Advantage to GAE
Recall n-step advantage estimation:
\[\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \hat{V}(s_{t+n}) - \hat{V}(s_t)\]- $n=1$: TD advantage, $\hat{A}_t^{(1)} = \delta_t$ (low variance, high bias)
- $n=\infty$: MC advantage, $\hat{A}_t^{(\infty)} = G_t - \hat{V}(s_t)$ (high variance, low bias)
A natural question: Can we combine estimates with different $n$ to achieve a better tradeoff?
The answer is GAE—by taking an exponentially weighted average of all n-step advantages, a single parameter $\lambda$ flexibly controls the bias-variance balance point.
6.2 Definition of GAE
Definition (Generalized Advantage Estimation):
\[\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]where $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$ is the TD residual, and $\lambda \in [0,1]$ is the decay parameter.
Theorem: GAE is equivalent to a weighted sum of n-step Advantages:
\[\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}\]
6.3 Bias-Variance Tradeoff of $\lambda$ Parameter
| $\lambda$ value | Equivalent form | Bias | Variance |
|---|---|---|---|
| $\lambda = 0$ | $\delta_t$ (TD) | High (depends on $\hat{V}$) | Low |
| $\lambda = 1$ | $G_t - \hat{V}(s_t)$ (MC) | Low | High |
| $\lambda \in (0,1)$ | Weighted average | Medium | Medium |
In practice, $\lambda = 0.95$ or $\lambda = 0.97$ are commonly used choices.
Intuitive Understanding of GAE:
- $\delta_t$ is the advantage of “estimating remaining value with Critic after one step”
- GAE is a weighted sum of multi-step $\delta$, with $(\gamma\lambda)^l$ making distant $\delta$ decay exponentially
- Smaller $\lambda$ relies more on Critic estimation (higher bias but lower variance)
- Larger $\lambda$ relies more on actual returns (lower bias but higher variance)
6.4 Practical Computation of GAE
GAE can be efficiently computed through recursion:
\[\hat{A}_t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}\]Boundary condition: $\hat{A}_T^{\text{GAE}} = 0$. Compute from back to front with complexity $O(T)$.
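A minimal implementation of this backward recursion, assuming `values` carries one extra bootstrap entry $\hat{V}(s_T)$ at the end:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursive GAE (illustrative sketch).

    rewards : r_0 ... r_{T-1}
    values  : V(s_0) ... V(s_T)   (one extra entry for the bootstrap value)
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0                                    # boundary condition: A_T = 0
    for t in reversed(range(T)):                 # back to front, O(T)
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae          # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = gae
    return adv
```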
7. Importance Sampling and Off-Policy Policy Gradient
7.1 The Problem with On-Policy
Policy Gradient is on-policy: after each update of $\theta$, the data collected under the old policy no longer follows the new policy's distribution. This leads to:
- Data can only be used once, low sample efficiency
- Need to resample after each update
Importance Sampling (IS) allows us to reuse old data.
7.2 Importance Sampling Principle
Using samples from distribution $q(x)$ to estimate expectation under $p(x)$:
\[\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[ \frac{p(x)}{q(x)} f(x) \right]\]where $\rho(x) = \frac{p(x)}{q(x)}$ is called the importance weight.
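A toy numerical example of the principle, under illustrative assumptions: $q = \mathcal{N}(0,1)$, $p = \mathcal{N}(1,1)$, and $f(x) = x$, so the true value of $\mathbb{E}_p[f]$ is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)      # samples from q = N(0, 1)

# Target distribution p = N(1, 1); we want E_p[x], whose true value is 1.0.
# For these two Gaussians the importance weight is rho(x) = p(x)/q(x) = exp(x - 0.5).
rho = np.exp(x - 0.5)
print(np.mean(rho * x))                     # ~1.0, estimated using only q-samples
```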
Applying to Policy Gradient, the single-step importance weight:
\[\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}\]Off-policy Policy Gradient:
\[\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]\]
7.3 Strict Off-Policy Gradient and State Distribution Correction
The formula above omits an important detail. Strict off-policy policy gradient requires correcting not only action probabilities but also the state distribution:
\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[ \frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \cdot \mathbb{E}_{a \sim \pi_{\text{old}}(\cdot|s)} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a|s) \hat{A}(s,a) \right] \right]\]where $d_\pi(s)$ is the state distribution induced by policy $\pi$ (also known as the discounted state visitation distribution).
Why is state distribution correction needed?
Intuitive understanding: when sampling with old policy $\pi_{\text{old}}$, not only does the action distribution change, but even the distribution of visited states changes. For example:
- The new policy may prefer to enter certain states more often
- These states may have fewer samples in the old data
Therefore, strict off-policy gradient needs to correct both biases.
Implicit Approximation in PPO/TRPO
However, computing the state distribution ratio $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)}$ is very difficult—it depends on the cumulative effect of the entire trajectory and cannot be directly computed like action probabilities.
The surrogate objective of PPO/TRPO actually makes a key approximation:
\[\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1\]That is, assuming the state distributions induced by new and old policies are the same. When is this approximation reasonable?
The Role of the Trust Region: When $\pi_\theta$ is sufficiently close to $\pi_{\text{old}}$ (small KL divergence), the difference in state distributions is also small. TRPO’s KL constraint and PPO’s clip mechanism exist precisely to ensure this.
Summary: The surrogate objective used by PPO/TRPO
\[L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right]\]actually implicitly does two things:
- Uses $\rho_t = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct action probability bias
- Assumes $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1$, ignoring state distribution bias
This explains why PPO/TRPO can directly use token-level $\rho_t$ without additional state distribution correction—provided the trust region constraint holds.
7.4 Variance Problem
When $\rho_t$ deviates too far from 1, the variance increases dramatically. We therefore need to limit the magnitude of policy updates to keep $\rho_t \approx 1$.
8. Trust Region Methods: TRPO and PPO
8.1 Motivation: Limiting Policy Update Magnitude
Importance sampling allows reusing old data, but if $\pi_\theta$ differs too much from $\pi_{\text{old}}$, the estimate becomes unreliable. Trust Region methods solve this problem by limiting the magnitude of policy updates.
Core idea: Each update is performed only within a “trust region”, ensuring the new policy is sufficiently close to the old policy.
8.2 TRPO: KL-Constrained Optimization
TRPO limits policy updates through KL divergence constraints:
\[\begin{aligned} \max_\theta \quad & L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right] \\ \text{s.t.} \quad & \bar{D}_{\text{KL}}(\pi_{\text{old}} \| \pi_\theta) \leq \delta \end{aligned}\]TRPO theoretically guarantees monotonic improvement, but requires computing the Hessian of KL divergence, making implementation complex.
8.3 PPO: Simplified Trust Region
PPO approximates TRPO’s effect through simpler means.
PPO-Clip objective:
\[L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min \left( \rho_t \hat{A}_t, \, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]\]where $\text{clip}(x, a, b) = \max(a, \min(x, b))$, and $\epsilon$ is typically 0.1 or 0.2.
Intuition of PPO-Clip:
- When $\hat{A}_t > 0$ (good action): The objective is to increase probability, but clips when $\rho_t > 1+\epsilon$ to prevent excessive increase
- When $\hat{A}_t < 0$ (bad action): The objective is to decrease probability, but clips when $\rho_t < 1-\epsilon$ to prevent excessive decrease
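A tensor-level sketch of the clipped objective; the function name and signature are illustrative.

```python
import torch

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (illustrative sketch), averaged over the batch."""
    ratio = torch.exp(logp_new - logp_old)                  # rho_t = pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()     # maximize this (or minimize its negative)
```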
PPO-KL (Optional Variant):
An alternative variant of PPO uses a KL penalty term instead of clipping:
\[L^{\text{KL}}(\theta) = \mathbb{E} \left[ \rho_t \hat{A}_t \right] - \beta \cdot \text{KL}(\pi_{\text{old}} \| \pi_\theta)\]where $\beta$ is an adaptively adjusted coefficient:
- If KL is too large, increase $\beta$
- If KL is too small, decrease $\beta$
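A sketch of this adaptive rule, roughly following the heuristic of the adaptive-KL PPO variant; the 1.5 threshold and factor-of-2 adjustment are common defaults taken here as assumptions.

```python
def update_beta(beta, kl, kl_target=0.01):
    """Adaptive KL coefficient (illustrative sketch of one common heuristic)."""
    if kl > 1.5 * kl_target:
        beta *= 2.0        # policy moved too far: penalize KL more strongly
    elif kl < kl_target / 1.5:
        beta /= 2.0        # policy barely moved: relax the penalty
    return beta
```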
In practice, PPO-Clip is more commonly used because it’s simpler and performs comparably.
8.4 Entropy Bonus
To encourage exploration, PPO typically also adds an entropy bonus:
\[L^{\text{total}}(\theta) = L^{\text{CLIP}}(\theta) + c_1 \cdot H(\pi_\theta)\]where $H(\pi_\theta) = -\mathbb{E}[\log \pi_\theta(a|s)]$ is the entropy of the policy, and $c_1$ is a weight coefficient (typically 0.01).
Role of Entropy Bonus:
- Encourages the policy to maintain some randomness, avoiding premature convergence to a deterministic policy
- Promotes exploration, preventing getting stuck in local optima
- Higher entropy means a more “uniform” policy with flatter probability distribution over actions
8.5 Complete PPO Algorithm
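As a summary of the pieces above, here is a self-contained sketch of one PPO update phase on a dummy rollout batch. All network sizes, coefficients, and the toy data are illustrative assumptions, not a reference implementation; in a real setup the batch would come from rollouts under $\pi_{\text{old}}$ with GAE-computed advantages.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# --- Tiny actor and critic (sizes are illustrative) ---
obs_dim, n_actions = 4, 2
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optim  = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

# --- Dummy rollout batch standing in for data collected with pi_old ---
N = 256
states     = torch.randn(N, obs_dim)
actions    = torch.randint(n_actions, (N,))
logp_old   = torch.randn(N).clamp(-3, 0)     # log pi_old(a_t|s_t), stored at collection time
advantages = torch.randn(N)                  # e.g. from GAE, typically normalized
returns    = torch.randn(N)                  # critic regression targets

eps, c_v, c_ent, K_epochs, batch = 0.2, 0.5, 0.01, 4, 64

for _ in range(K_epochs):                               # reuse the same data K times
    for idx in torch.randperm(N).split(batch):          # shuffled minibatches
        s, a = states[idx], actions[idx]
        dist  = Categorical(logits=actor(s))            # pi_theta(.|s)
        ratio = torch.exp(dist.log_prob(a) - logp_old[idx])   # rho_t
        adv   = advantages[idx]

        clip_loss  = -torch.min(ratio * adv,
                                torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        value_loss = (critic(s).squeeze(-1) - returns[idx]).pow(2).mean()
        ent_bonus  = dist.entropy().mean()

        loss = clip_loss + c_v * value_loss - c_ent * ent_bonus   # minimize
        optim.zero_grad()
        loss.backward()
        optim.step()
```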
Reasons for PPO’s Success:
- Simple and efficient: Only needs first-order optimization, no Hessian computation required
- Sample efficiency: Can reuse the same batch of data multiple times ($K$ updates)
- Stability: Clip mechanism prevents drastic policy changes
- Robustness: Not sensitive to hyperparameters, applicable to various tasks
PPO is currently the most commonly used Policy Gradient algorithm and is the standard choice in RLHF.
Chapter Summary
Core Content:
- Policy Gradient Theorem
- Provides the analytical form of the objective function gradient: $\nabla_\theta J = \mathbb{E}[\sum_t \nabla \log \pi \cdot G_t]$
- Log-Derivative Trick is the key to the derivation
- Environment dynamics are independent of $\theta$, enabling Model-Free learning
- Variance Reduction Techniques
- Baseline trick: Subtracting $b(s)$ doesn’t change expectation but reduces variance
- Optimal baseline is $V^\pi(s)$
- Use Advantage $A = Q - V$ instead of $G_t$
- Actor-Critic Architecture
- Actor (policy network) + Critic (value network)
- Critic provides $\hat{V}(s)$ to estimate advantage
- GAE
- $\hat{A}^{\text{GAE}} = \sum_l (\gamma\lambda)^l \delta_{t+l}$
- $\lambda$ controls bias-variance tradeoff
- Trust Region Methods
- Importance sampling allows reusing old data, but requires limiting policy changes
- TRPO: KL-constrained optimization, complex implementation
- PPO: Clip mechanism, simple and efficient, the preferred choice in practice
The next article will introduce Model-Based RL and multi-agent learning, including MCTS and AlphaGo/Zero.