RL Notes (3): Policy-Based RL
2025-12-19 · Qi Lu
In the previous article, we introduced Value-Based methods: first learn $Q^*$, then derive the policy through $\arg\max$. This approach works well in discrete action spaces, but encounters difficulties when facing the following problems:
If the action space is continuous (such as robot joint angles), how do we compute $\arg\max_a Q(s,a)$?
If the optimal policy is stochastic (such as rock-paper-scissors), how can we represent it with a deterministic policy?
Policy-Based methods provide a more direct approach: directly parameterize the policy $\pi_\theta(a|s)$ and maximize expected return through gradient ascent.
1. Why Do We Need Policy-Based Methods?
1.1 Limitations of Value-Based Methods
- Continuous action spaces are difficult: $\max_a Q(s,a)$ requires enumerating or optimizing over all actions
- Function approximation instability (Deadly Triad): Function approximation + Bootstrapping + Off-policy can diverge
- Indirect optimization objective: Minimizing TD error rather than directly optimizing expected return $J(\pi)$
- Can only learn deterministic policies: $\arg\max$ outputs deterministic actions, but stochastic policies are better in some environments
1.2 Parameterized Policies
Policy-Based methods directly parameterize the policy $\pi_\theta(a|s)$:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ G_0 \right] = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]\]Common forms of policy parameterization (a code sketch follows the list):
- Discrete action space: Softmax outputs a Categorical distribution: \(\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}\)
- Continuous action space: Output the parameters of a Gaussian distribution $(\mu_\theta(s), \sigma_\theta(s))$: \(\pi_\theta(a|s) = \mathcal{N}(a \mid \mu_\theta(s), \sigma_\theta^2(s))\)
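To make the two parameterizations concrete, here is a minimal PyTorch sketch; the class names, MLP sizes, and the state-independent $\sigma$ are illustrative choices rather than anything prescribed by the text.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Softmax policy over a finite action set (illustrative sketch)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),        # action preferences f_theta(s, .)
        )

    def forward(self, s):
        return Categorical(logits=self.net(s))   # pi_theta(.|s)

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for continuous actions (illustrative sketch)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),           # mu_theta(s)
        )
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # state-independent sigma (a common simplification)

    def forward(self, s):
        return Normal(self.mu(s), self.log_sigma.exp())       # pi_theta(.|s)

# Sampling and log-probabilities work identically for both:
# dist = policy(state); a = dist.sample(); logp = dist.log_prob(a)
```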
Advantages of Policy-Based Methods:
- Handle continuous actions: Directly output action distributions, no need for $\arg\max$
- Learn stochastic policies: Can output probability distributions over actions
- Direct optimization objective: Gradient ascent directly maximizes $J(\theta)$
- Better convergence properties: Small changes in policy parameters lead to small changes in policy (smoothness)
2. Policy Gradient Theorem
The Policy Gradient theorem is the theoretical foundation of Policy-Based RL, providing the gradient expression of the objective function $J(\theta)$ with respect to parameters $\theta$.
2.1 Log-Derivative Trick
The key trick for computing $\nabla_\theta J(\theta)$ is the Log-Derivative Trick:
\[\nabla_\theta p(x|\theta) = p(x|\theta) \nabla_\theta \log p(x|\theta)\]Proof: By the derivative rule for logarithms, $\nabla_\theta \log p(x|\theta) = \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}$, multiplying both sides by $p(x|\theta)$ gives the result.
The elegance of the Log-Derivative Trick: It converts the derivative of $p(x|\theta)$ into the derivative of $\log p(x|\theta)$, which is often easier to compute, especially when $p$ is in product form.
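As a quick sanity check, the trick can be verified numerically on a toy example. The setup below is an assumption for illustration: $x \sim \text{Bernoulli}(\theta)$ and $f(x)=x$, so $\mathbb{E}[f]=\theta$ and the true gradient is exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                     # Bernoulli parameter
x = rng.binomial(1, theta, size=200_000)        # samples x ~ p(x|theta)
f = x.astype(float)                             # f(x) = x, so d/dtheta E[f] = 1

# Score function: d/dtheta log p(x|theta) = x/theta - (1-x)/(1-theta)
score = x / theta - (1 - x) / (1 - theta)

grad_estimate = np.mean(f * score)              # E[ f(x) * grad log p(x|theta) ]
print(grad_estimate)                            # ~1.0, matching the analytic gradient
```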
2.2 Policy Gradient Theorem
Theorem (Policy Gradient Theorem):
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]\]where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the reward-to-go from time $t$.
Proof Sketch:
Step 1: Apply the Log-Derivative Trick
\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int p(\tau|\theta) R(\tau) d\tau \\ &= \int p(\tau|\theta) \nabla_\theta \log p(\tau|\theta) R(\tau) d\tau \\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p(\tau|\theta) \cdot R(\tau) \right] \end{aligned}\]
Step 2: Expand $\nabla_\theta \log p(\tau|\theta)$
Recall the trajectory probability decomposition: $p(\tau|\theta) = p(s_0) \prod_{t} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$
Taking the logarithm and computing the gradient:
\[\nabla_\theta \log p(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\]Key Observation: $p(s_0)$ is the environment’s initial state distribution, and $P(s_{t+1}|s_t,a_t)$ is the environment’s dynamics model, both of which are independent of the policy parameters $\theta$, so their gradients are zero!
This means: Even without knowing the environment dynamics $P$, we can still compute the Policy Gradient—this is the fundamental reason why Policy Gradient methods can be Model-Free.
Step 3: Introduce Reward-to-go (Causality)
Action $a_t$ only affects future rewards, not past rewards. Therefore, we can replace the full return $R(\tau)$ with reward-to-go $G_t$.
Intuitive Understanding of the Policy Gradient Theorem:
- $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is the direction that “increases the probability of action $a_t$”
- $G_t$ is the cumulative reward obtained after that action
- If $G_t > 0$: Update in the gradient direction, increase the probability of $a_t$
- If $G_t < 0$: Update in the opposite direction, decrease the probability of $a_t$
In short: Good actions become more likely to be selected, bad actions become less likely to be selected.
3. REINFORCE Algorithm
REINFORCE is the simplest Policy Gradient algorithm, using Monte Carlo sampling directly to estimate the gradient.
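A minimal sketch of the resulting update, assuming per-step log-probabilities and rewards have been collected for one episode; the function name and signature are illustrative.

```python
import torch

def reinforce_loss(logps, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss for one episode (illustrative sketch).

    logps   : list of log pi_theta(a_t|s_t) tensors, still attached to the graph
    rewards : list of scalar rewards r_t
    """
    # Reward-to-go: G_t = sum_k gamma^k r_{t+k}, computed back to front.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta)  <=>  gradient descent on -sum_t logp_t * G_t.
    return -(torch.stack(logps) * returns).sum()

# Typical usage: run one episode storing (logp_t, r_t), then
# loss = reinforce_loss(logps, rewards); loss.backward(); optimizer.step()
```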
REINFORCE is an unbiased estimator: $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$
But the estimator has high variance:
- $G_t$ accumulates all randomness from $t$ to termination
- Longer trajectories lead to larger variance
- When rewards are sparse, most trajectories have $G_t \approx 0$
4. Baseline and Variance Reduction
4.1 Baseline Trick
A clever trick: subtracting a baseline $b(s_t)$ from $G_t$ reduces variance without introducing bias.
Theorem (Baseline doesn’t change expectation): For any function $b(s)$ that depends only on state $s$ (not on action $a$):
\[\mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = 0\]Proof: Since $b(s)$ doesn’t depend on $a$, it can be pulled outside the expectation:
\[b(s) \cdot \mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0\]Therefore, the Policy Gradient can be written as:
\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) \right]\]
4.2 Why Does Baseline Reduce Variance?
- $G_t$ might always be positive (e.g., when all rewards are positive), causing all action probabilities to increase
- Subtracting $b(s)$ (such as the average return) makes $G_t - b(s_t)$ both positive and negative
- Actions better than average are reinforced, actions worse than average are weakened
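A toy numerical illustration of this effect, under illustrative assumptions: a one-state "bandit" with Gaussian policy $\mathcal{N}(\theta, 1)$ and reward $r(a) = a$, so the true gradient of $\mathbb{E}[r]$ with respect to $\theta$ is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 5.0, 100_000
a = rng.normal(theta, 1.0, size=n)        # actions a ~ pi_theta = N(theta, 1)
reward = a                                # toy reward r(a) = a, so grad_theta E[r] = 1

score = a - theta                         # grad_theta log N(a | theta, 1)

g_plain    = score * reward               # no baseline
g_baseline = score * (reward - theta)     # baseline b = E[r] = theta

print(g_plain.mean(), g_baseline.mean())  # both ~1.0  -> unbiased either way
print(g_plain.var(),  g_baseline.var())   # ~27 vs ~2  -> baseline sharply reduces variance
```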
4.3 Optimal Baseline
Theorem (practical form): Under the constraint of not changing the expectation, the variance-minimizing baseline is (approximately) the state value function: $b^*(s) \approx V^\pi(s)$. (The exact optimum weights returns by the squared norm of the score, but $V^\pi$ is the standard choice in practice.)
When $b(s) = V^\pi(s)$, the expectation of $G_t - V^\pi(s_t)$ is exactly the advantage function!
5. Advantage Function and Actor-Critic
5.1 Definition and Intuition of Advantage
\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]When using $V^\pi(s)$ as the baseline:
\[\mathbb{E}_\pi \left[ G_t - V^\pi(s_t) \mid s_t, a_t \right] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)\]Therefore, Policy Gradient with Advantage:
\[\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t) \right]\]Intuition of Advantage $A(s,a)$:
- $V(s)$: “Average” performance in state $s$
- $Q(s,a)$: Performance when choosing action $a$ in state $s$
- $A(s,a) = Q(s,a) - V(s)$: How much better is action $a$ than average
- $A > 0$: This action is better than average, should increase its probability
- $A < 0$: This action is worse than average, should decrease its probability
5.2 Methods for Estimating Advantage
- Monte Carlo estimate: $\hat{A}_t^{\text{MC}} = G_t - \hat{V}(s_t)$ (unbiased but high variance)
- TD estimate (1-step): $\hat{A}_t^{\text{TD}} = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) = \delta_t$ (low variance but biased)
- n-step estimate: between the two
- GAE: flexibly trades off bias and variance through the $\lambda$ parameter
5.3 Actor-Critic Architecture
To estimate $\hat{V}(s)$, we introduce a Critic network. Actor-Critic methods simultaneously learn:
- Actor: Policy network $\pi_\theta(a|s)$, outputs action distribution
- Critic: Value network $\hat{V}_\phi(s)$, estimates state value
A2C (Advantage Actor-Critic) core update rules:
Actor update (Policy Gradient with Advantage): \(\theta \leftarrow \theta + \alpha_\theta \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t\)
Critic update (Value Function Regression): \(\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \sum_t \left( \hat{V}_\phi(s_t) - \text{target} \right)^2\)
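A compact sketch of these two updates, assuming the actor's action distribution, the critic's value predictions, and return targets for a batch have already been computed; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def a2c_losses(dist, actions, values, returns):
    """One A2C update on a batch (illustrative sketch).

    dist    : action distribution pi_theta(.|s_t) from the actor
    actions : actions a_t actually taken
    values  : critic predictions V_phi(s_t)
    returns : regression targets for the critic (e.g. bootstrapped n-step returns)
    """
    adv = (returns - values).detach()                        # advantage estimate; no gradient into the actor term
    actor_loss  = -(dist.log_prob(actions) * adv).mean()     # policy gradient with advantage
    critic_loss = F.mse_loss(values, returns)                # value function regression
    return actor_loss, critic_loss

# Typical usage: loss = actor_loss + 0.5 * critic_loss; loss.backward(); optimizer.step()
```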
Why Do We Need a Critic?
- Provides $\hat{V}(s)$ to compute advantage $\hat{A}_t$
- Lower variance compared to pure MC (using $G_t$)
- Can update at every step, no need to wait for episode termination
6. Generalized Advantage Estimation (GAE)
GAE provides a method for flexibly trading off between bias and variance in advantage estimation, and is a core component of modern Policy Gradient algorithms (such as PPO).
6.1 From n-step Advantage to GAE
Recall n-step advantage estimation:
\[\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \hat{V}(s_{t+n}) - \hat{V}(s_t)\]- $n=1$: TD advantage, $\hat{A}_t^{(1)} = \delta_t$ (low variance, high bias)
- $n=\infty$: MC advantage, $\hat{A}_t^{(\infty)} = G_t - \hat{V}(s_t)$ (high variance, low bias)
A natural question: Can we combine estimates with different $n$ to achieve a better tradeoff?
The answer is GAE—by taking an exponentially weighted average of all n-step advantages, a single parameter $\lambda$ flexibly controls the bias-variance balance point.
6.2 Definition of GAE
Definition (Generalized Advantage Estimation):
\[\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]where $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$ is the TD residual, and $\lambda \in [0,1]$ is the decay parameter.
Theorem: GAE is equivalent to a weighted sum of n-step Advantages:
\[\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}\]
6.3 Bias-Variance Tradeoff of $\lambda$ Parameter
| $\lambda$ value | Equivalent form | Bias | Variance |
|---|---|---|---|
| $\lambda = 0$ | $\delta_t$ (TD) | High (depends on $\hat{V}$) | Low |
| $\lambda = 1$ | $G_t - \hat{V}(s_t)$ (MC) | Low | High |
| $\lambda \in (0,1)$ | Weighted average | Medium | Medium |
In practice, $\lambda = 0.95$ or $\lambda = 0.97$ are commonly used choices.
Intuitive Understanding of GAE:
- $\delta_t$ is the advantage of “estimating remaining value with Critic after one step”
- GAE is a weighted sum of multi-step $\delta$, with $(\gamma\lambda)^l$ making distant $\delta$ decay exponentially
- Smaller $\lambda$ relies more on Critic estimation (higher bias but lower variance)
- Larger $\lambda$ relies more on actual returns (lower bias but higher variance)
6.4 Practical Computation of GAE
GAE can be efficiently computed through recursion:
\[\hat{A}_t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}\]Boundary condition: $\hat{A}_T^{\text{GAE}} = 0$. Compute from back to front with complexity $O(T)$.
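A minimal implementation of this backward recursion, assuming `values` carries one extra bootstrap entry $\hat{V}(s_T)$ at the end:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursive GAE (illustrative sketch).

    rewards : r_0 ... r_{T-1}
    values  : V(s_0) ... V(s_T)   (one extra entry for the bootstrap value)
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0                                    # boundary condition: A_T = 0
    for t in reversed(range(T)):                 # back to front, O(T)
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae          # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = gae
    return adv
```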
7. Importance Sampling and Off-Policy Policy Gradient
7.1 The Problem with On-Policy
Policy Gradient is on-policy: after each update of $\theta$, the data collected under the old policy no longer follows the new policy's distribution. This leads to:
- Data can only be used once, low sample efficiency
- Need to resample after each update
Importance Sampling (IS) allows us to reuse old data.
7.2 Importance Sampling Principle
Using samples from distribution $q(x)$ to estimate expectation under $p(x)$:
\[\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[ \frac{p(x)}{q(x)} f(x) \right]\]where $\rho(x) = \frac{p(x)}{q(x)}$ is called the importance weight.
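A toy numerical example of the principle, under illustrative assumptions: $q = \mathcal{N}(0,1)$, $p = \mathcal{N}(1,1)$, and $f(x) = x$, so the true value of $\mathbb{E}_p[f]$ is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)      # samples from q = N(0, 1)

# Target distribution p = N(1, 1); we want E_p[x], whose true value is 1.0.
# For these two Gaussians the importance weight is rho(x) = p(x)/q(x) = exp(x - 0.5).
rho = np.exp(x - 0.5)
print(np.mean(rho * x))                     # ~1.0, estimated using only q-samples
```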
Applying to Policy Gradient, the single-step importance weight:
\[\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}\]Off-policy Policy Gradient:
\[\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]\]
7.3 Strict Off-Policy Gradient and State Distribution Correction
The formula above omits an important detail. Strict off-policy policy gradient requires correcting not only action probabilities but also the state distribution:
\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[ \frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \cdot \mathbb{E}_{a \sim \pi_{\text{old}}(\cdot|s)} \left[ \rho_t(\theta) \nabla_\theta \log \pi_\theta(a|s) \hat{A}(s,a) \right] \right]\]where $d_\pi(s)$ is the state distribution induced by policy $\pi$ (also known as the discounted state visitation distribution).
Why is state distribution correction needed?
Intuitive understanding: when sampling with old policy $\pi_{\text{old}}$, not only does the action distribution change, but even the distribution of visited states changes. For example:
- The new policy may prefer to enter certain states more often
- These states may have fewer samples in the old data
Therefore, strict off-policy gradient needs to correct both biases.
Implicit Approximation in PPO/TRPO
However, computing the state distribution ratio $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)}$ is very difficult—it depends on the cumulative effect of the entire trajectory and cannot be directly computed like action probabilities.
The surrogate objective of PPO/TRPO actually makes a key approximation:
\[\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1\]That is, assuming the state distributions induced by new and old policies are the same. When is this approximation reasonable?
The Role of the Trust Region: When $\pi_\theta$ is sufficiently close to $\pi_{\text{old}}$ (small KL divergence), the difference in state distributions is also small. TRPO’s KL constraint and PPO’s clip mechanism exist precisely to ensure this.
Summary: The surrogate objective used by PPO/TRPO
\[L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right]\]actually implicitly does two things:
- Uses $\rho_t = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct action probability bias
- Assumes $\frac{d_{\pi_\theta}(s)}{d_{\pi_{\text{old}}}(s)} \approx 1$, ignoring state distribution bias
This explains why PPO/TRPO can directly use token-level $\rho_t$ without additional state distribution correction—provided the trust region constraint holds.
7.4 Variance Problem
When $\rho_t$ deviates too far from 1, the variance increases dramatically. We therefore need to limit the magnitude of policy updates to keep $\rho_t \approx 1$.
8. Trust Region Methods: TRPO and PPO
8.1 Motivation: Limiting Policy Update Magnitude
Importance sampling allows reusing old data, but if $\pi_\theta$ differs too much from $\pi_{\text{old}}$, the estimate becomes unreliable. Trust Region methods solve this problem by limiting the magnitude of policy updates.
Core idea: Each update is performed only within a “trust region”, ensuring the new policy is sufficiently close to the old policy.
8.2 TRPO: KL-Constrained Optimization
TRPO limits policy updates through KL divergence constraints:
\[\begin{aligned} \max_\theta \quad & L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{old}}} \left[ \rho_t(\theta) \hat{A}_t \right] \\ \text{s.t.} \quad & \bar{D}_{\text{KL}}(\pi_{\text{old}} \| \pi_\theta) \leq \delta \end{aligned}\]TRPO theoretically guarantees monotonic improvement, but requires computing the Hessian of KL divergence, making implementation complex.
8.3 PPO: Simplified Trust Region
PPO approximates TRPO’s effect through simpler means.
PPO-Clip objective:
\[L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min \left( \rho_t \hat{A}_t, \, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]\]where $\text{clip}(x, a, b) = \max(a, \min(x, b))$, and $\epsilon$ is typically 0.1 or 0.2.
Intuition of PPO-Clip:
- When $\hat{A}_t > 0$ (good action): The objective is to increase probability, but clips when $\rho_t > 1+\epsilon$ to prevent excessive increase
- When $\hat{A}_t < 0$ (bad action): The objective is to decrease probability, but clips when $\rho_t < 1-\epsilon$ to prevent excessive decrease
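A tensor-level sketch of the clipped objective; the function name and signature are illustrative.

```python
import torch

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (illustrative sketch), averaged over the batch."""
    ratio = torch.exp(logp_new - logp_old)                  # rho_t = pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()     # maximize this (or minimize its negative)
```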
PPO-KL (Optional Variant):
An alternative variant of PPO uses a KL penalty term instead of clipping:
\[L^{\text{KL}}(\theta) = \mathbb{E} \left[ \rho_t \hat{A}_t \right] - \beta \cdot \text{KL}(\pi_{\text{old}} \| \pi_\theta)\]where $\beta$ is an adaptively adjusted coefficient:
- If KL is too large, increase $\beta$
- If KL is too small, decrease $\beta$
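A sketch of this adaptive rule, roughly following the heuristic of the adaptive-KL PPO variant; the 1.5 threshold and factor-of-2 adjustment are common defaults taken here as assumptions.

```python
def update_beta(beta, kl, kl_target=0.01):
    """Adaptive KL coefficient (illustrative sketch of one common heuristic)."""
    if kl > 1.5 * kl_target:
        beta *= 2.0        # policy moved too far: penalize KL more strongly
    elif kl < kl_target / 1.5:
        beta /= 2.0        # policy barely moved: relax the penalty
    return beta
```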
In practice, PPO-Clip is more commonly used because it’s simpler and performs comparably.
8.4 Entropy Bonus
To encourage exploration, PPO typically also adds an entropy bonus:
\[L^{\text{total}}(\theta) = L^{\text{CLIP}}(\theta) + c_1 \cdot H(\pi_\theta)\]where $H(\pi_\theta) = -\mathbb{E}[\log \pi_\theta(a|s)]$ is the entropy of the policy, and $c_1$ is a weight coefficient (typically 0.01).
Role of Entropy Bonus:
- Encourages the policy to maintain some randomness, avoiding premature convergence to a deterministic policy
- Promotes exploration, preventing getting stuck in local optima
- Higher entropy means a more “uniform” policy with flatter probability distribution over actions
8.5 Complete PPO Algorithm
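As a summary of the pieces above, here is a self-contained sketch of one PPO update phase on a dummy rollout batch. All network sizes, coefficients, and the toy data are illustrative assumptions, not a reference implementation; in a real setup the batch would come from rollouts under $\pi_{\text{old}}$ with GAE-computed advantages.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# --- Tiny actor and critic (sizes are illustrative) ---
obs_dim, n_actions = 4, 2
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optim  = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

# --- Dummy rollout batch standing in for data collected with pi_old ---
N = 256
states     = torch.randn(N, obs_dim)
actions    = torch.randint(n_actions, (N,))
logp_old   = torch.randn(N).clamp(-3, 0)     # log pi_old(a_t|s_t), stored at collection time
advantages = torch.randn(N)                  # e.g. from GAE, typically normalized
returns    = torch.randn(N)                  # critic regression targets

eps, c_v, c_ent, K_epochs, batch = 0.2, 0.5, 0.01, 4, 64

for _ in range(K_epochs):                               # reuse the same data K times
    for idx in torch.randperm(N).split(batch):          # shuffled minibatches
        s, a = states[idx], actions[idx]
        dist  = Categorical(logits=actor(s))            # pi_theta(.|s)
        ratio = torch.exp(dist.log_prob(a) - logp_old[idx])   # rho_t
        adv   = advantages[idx]

        clip_loss  = -torch.min(ratio * adv,
                                torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        value_loss = (critic(s).squeeze(-1) - returns[idx]).pow(2).mean()
        ent_bonus  = dist.entropy().mean()

        loss = clip_loss + c_v * value_loss - c_ent * ent_bonus   # minimize
        optim.zero_grad()
        loss.backward()
        optim.step()
```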
Reasons for PPO’s Success:
- Simple and efficient: Only needs first-order optimization, no Hessian computation required
- Sample efficiency: Can reuse the same batch of data multiple times ($K$ updates)
- Stability: Clip mechanism prevents drastic policy changes
- Robustness: Not sensitive to hyperparameters, applicable to various tasks
PPO is currently the most commonly used Policy Gradient algorithm and is the standard choice in RLHF.
Chapter Summary
Core Content:
- Policy Gradient Theorem
- Provides the analytical form of the objective function gradient: $\nabla_\theta J = \mathbb{E}[\sum_t \nabla \log \pi \cdot G_t]$
- Log-Derivative Trick is the key to the derivation
- Environment dynamics are independent of $\theta$, enabling Model-Free learning
- Variance Reduction Techniques
- Baseline trick: Subtracting $b(s)$ doesn’t change expectation but reduces variance
- Optimal baseline is $V^\pi(s)$
- Use Advantage $A = Q - V$ instead of $G_t$
- Actor-Critic Architecture
- Actor (policy network) + Critic (value network)
- Critic provides $\hat{V}(s)$ to estimate advantage
- GAE
- $\hat{A}^{\text{GAE}} = \sum_l (\gamma\lambda)^l \delta_{t+l}$
- $\lambda$ controls bias-variance tradeoff
- Trust Region Methods
- Importance sampling allows reusing old data, but requires limiting policy changes
- TRPO: KL-constrained optimization, complex implementation
- PPO: Clip mechanism, simple and efficient, the preferred choice in practice
The next article will introduce Model-Based RL and multi-agent learning, including MCTS and AlphaGo/Zero.