
RL Notes (2): Value-Based RL

2025-12-19 · Qi Lu

In the previous article, we defined value functions $V^\pi(s)$ and $Q^\pi(s,a)$, which measure how “good” a state or state-action pair is under policy $\pi$. The natural questions now are:

How do we compute value functions? How do we derive the optimal policy from value functions?

The answers to these questions form the core of Value-Based RL. This article introduces three categories of methods:

  1. Dynamic Programming (DP): Exact solutions when the environment model is known
  2. Model-free Estimation (MC and TD): Learning from samples when the model is unknown
  3. Deep Q-Networks (DQN): Handling large or continuous state spaces

1. Why Learn Value Functions?

The core observation is this: if we know the optimal action value function $Q^*(s,a)$, we can derive the optimal policy directly:

\[\pi^*(s) = \arg\max_a Q^*(s,a)\]

This is the theoretical foundation of Value-Based methods—instead of learning the policy directly, we obtain it indirectly by learning the value function.
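
As a concrete illustration, here is a minimal NumPy sketch: given a (hypothetical, made-up) learned Q-table, extracting the greedy policy is a single `argmax` per state.

```python
import numpy as np

# Hypothetical learned Q-table for a 3-state, 2-action MDP (values are made up)
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 2.0]])

pi_star = Q.argmax(axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(pi_star)              # -> [1 0 0]
```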

[Figure: Value-Based Methods Flow]

2. Bellman Equations

The Bellman equation is the core equation of RL, revealing the recursive structure of value functions. Understanding Bellman equations is the foundation for mastering all Value-Based methods.

2.1 Intuition: One-Step Decomposition

The core form of the Bellman equation is:

\[V^\pi(s) = \mathbb{E} \left[ r + \gamma V^\pi(s') \right]\]

In words: value of current state = immediate reward + discounted value of next state.

This recursive relationship allows us to compute value functions through iteration, without enumerating all possible trajectories.

2.2 Bellman Expectation Equation

Theorem (Bellman Expectation Equation for $V^\pi$): For any policy $\pi$, the state value function satisfies:

\[V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) V^\pi(s') \right]\]

Proof: Starting from the definition of the state value function:

\[\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] \\ &= \mathbb{E}_\pi \left[ r_t + \gamma G_{t+1} \mid S_t = s \right] \\ &= \mathbb{E}_\pi \left[ r_t + \gamma V^\pi(S_{t+1}) \mid S_t = s \right] \end{aligned}\]

Expanding the expectation: given $S_t = s$, action $A_t = a$ occurs with probability $\pi(a|s)$; given $(s, a)$, the next state $S_{t+1} = s'$ occurs with probability $P(s'|s,a)$. Substituting these two distributions into the expectation yields the claimed equation. $\blacksquare$

Similarly, the Bellman Expectation Equation for the action value function:

\[Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) \sum_{a' \in \mathcal{A}} \pi(a'|s') Q^\pi(s',a')\]

The mutual representation of $V^\pi$ and $Q^\pi$ is key to deriving Bellman equations:

\[V^\pi(s) = \sum_{a} \pi(a|s) Q^\pi(s,a)\]

\[Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')\]

2.3 Bellman Optimality Equation

Optimal value functions satisfy the Bellman Optimality Equation. Unlike the Expectation version, here we take the maximum over actions instead of expectation.

Theorem (Bellman Optimality Equation for $V^{*}$):

\[V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) V^*(s') \right]\]

Theorem (Bellman Optimality Equation for $Q^{*}$):

\[Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) \max_{a' \in \mathcal{A}} Q^*(s',a')\]

Key difference between Bellman Expectation and Bellman Optimality:

  • Bellman Expectation: Weighted average over actions (by $\pi(a|s)$), describes value of a given policy
  • Bellman Optimality: Maximum over actions ($\max_a$), describes value of the optimal policy

Q-Learning directly approximates $Q^*$, and hence uses the Bellman Optimality Equation as its update target.

2.4 Bellman Operator and Contraction Property

Definition (Bellman Operator):

\[(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]\]

\[(\mathcal{T}^* V)(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]\]

Theorem (Contraction Property of Bellman Operator): The Bellman operators are $\gamma$-contraction mappings in the sup-norm. For $\mathcal{T}^\pi$ (and identically for $\mathcal{T}^*$):

\[\| \mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2 \|_\infty \leq \gamma \| V_1 - V_2 \|_\infty\]

Important corollaries of contraction property:

  1. Unique fixed point: Bellman operator has a unique fixed point $V^\pi$ (or $V^*$)
  2. Iterative convergence: Starting from any $V_0$, $V_{k+1} = \mathcal{T}V_k$ converges to the fixed point
  3. Convergence rate: The error decays geometrically, $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$ (see the numerical check below)
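
The contraction property is easy to verify numerically. Below is a small sketch (the MDP arrays are made-up random data, not from this article) that applies $\mathcal{T}^\pi$ to two arbitrary value functions and checks that their sup-norm distance shrinks by at least a factor of $\gamma$ each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# A random MDP and a random policy (rows of P[s, a] and pi[s] sum to 1)
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=1, keepdims=True)

def T_pi(V):
    # Bellman expectation operator: (T^pi V)(s) = sum_a pi(a|s)[R + gamma * sum_s' P V]
    return (pi * (R + gamma * P @ V)).sum(axis=1)

V1, V2 = rng.random(n_s), rng.random(n_s)
for _ in range(5):
    ratio = np.abs(T_pi(V1) - T_pi(V2)).max() / np.abs(V1 - V2).max()
    print(f"contraction ratio: {ratio:.3f}  (<= gamma = {gamma})")
    V1, V2 = T_pi(V1), T_pi(V2)
```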

3. Dynamic Programming Methods

When the environment model, i.e. $P(s'|s,a)$ and $R(s,a)$, is fully known, Dynamic Programming (DP) methods can solve for the optimal policy exactly.

3.1 Policy Evaluation

Given policy $\pi$, compute $V^\pi$ by iterating the Bellman Expectation equation:

\[V_{k+1}(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s') \right]\]

Starting from any $V_0$, iterate until convergence $V_k \to V^\pi$.
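
A minimal tabular sketch of this iteration, assuming the model is given as NumPy arrays `P[s, a, s']`, `R[s, a]` and the policy as `pi[s, a]` (these array names are my own convention, not from the article):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup until the sup-norm change is < tol.

    P:  transition tensor, P[s, a, s'] = P(s'|s,a)
    R:  reward matrix,     R[s, a]     = R(s,a)
    pi: policy matrix,     pi[s, a]    = pi(a|s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # V_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s') ]
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```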

3.2 Policy Improvement

Theorem (Policy Improvement Theorem): Let $\pi$ and $\pi’$ be two policies. If for all $s \in \mathcal{S}$:

\[Q^\pi(s, \pi'(s)) \geq V^\pi(s)\]

then $\pi’$ is at least as good as $\pi$: for all $s$, $V^{\pi’}(s) \geq V^\pi(s)$.

Greedy Policy Improvement:

\[\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right]\]

3.3 Policy Iteration

Alternate between Policy Evaluation and Policy Improvement:

[Figure: Policy Iteration]

For finite MDPs, Policy Iteration converges to optimal policy $\pi^*$ in finite steps.
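
A sketch of the loop, reusing the `policy_evaluation` function from the Section 3.1 sketch; stopping when the greedy policy no longer changes is one common convention:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation with greedy improvement until the policy is stable.

    Assumes policy_evaluation from the Section 3.1 sketch is in scope.
    """
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions   # start from the uniform policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)        # evaluate
        greedy = (R + gamma * P @ V).argmax(axis=1)   # improve: pi'(s) = argmax_a Q^pi(s,a)
        pi_new = np.eye(n_actions)[greedy]            # one-hot deterministic policy
        if np.array_equal(pi_new, pi):
            return greedy, V
        pi = pi_new
```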

3.4 Value Iteration

Combine Policy Evaluation and Policy Improvement into one step, directly iterating the Bellman Optimality equation:

\[V_{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s') \right]\]

After convergence, $\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$.
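
The corresponding sketch, with the same hypothetical model arrays as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup; extract the greedy policy at the end."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)   # V_{k+1}(s) = max_a [R + gamma * sum P V]
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi_star = (R + gamma * P @ V).argmax(axis=1)
    return pi_star, V
```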

| | Policy Iteration | Value Iteration |
| --- | --- | --- |
| Policy Evaluation | Full solve for $V^\pi$ | Only one Bellman update |
| Theoretical basis | Bellman Expectation | Bellman Optimality |
| Per-iteration complexity | High (multiple inner loops) | Low (single sweep) |
| Iterations to converge | Few | Many |

3.5 Limitations of DP Methods

  1. Requires complete model: Must know $P(s'|s,a)$ and $R(s,a)$
  2. State space enumeration: Each iteration requires sweeping all states, complexity $O(|\mathcal{S}|^2 |\mathcal{A}|)$
  3. Cannot handle continuous spaces: Tabular methods don’t apply directly

These limitations motivated the development of model-free methods (MC, TD) and function approximation (DQN).

4. Model-free Methods: Monte Carlo vs Temporal Difference

When the environment model is unknown, we need to estimate value functions from samples of interaction.

4.1 Monte Carlo Estimation

The core idea of MC: use actual trajectory returns $G_t$ to estimate the expectation.

Monte Carlo Update:

\[V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)\]

where $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$ is the actual return from time $t$ to episode end.

MC method characteristics:

  • Unbiased: the target $G_t$ is an actual sample of the return
  • High variance: randomness from every step of the trajectory accumulates
  • Requires complete episodes: updates happen only after an episode terminates
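
A minimal sketch of the every-visit MC update for one finished episode (the `episode` format, a list of `(state, reward)` pairs, is my assumption for illustration):

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: move V toward the actual return G_t of each visited state.

    episode: list of (state, reward) pairs in time order, from one finished episode.
    """
    G = 0.0
    # Walk the episode backwards so G accumulates as r_t + gamma * G_{t+1}
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V
```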

4.2 Temporal Difference Estimation

The core idea of TD: use “one-step reward + estimated value of next state” to replace full return.

TD(0) Update:

\[V(S_t) \leftarrow V(S_t) + \alpha \left( \underbrace{r_t + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t) \right)\]

TD error: $\delta_t = r_t + \gamma V(S_{t+1}) - V(S_t)$

TD method characteristics:

  • Biased: the target bootstraps from the current (possibly inaccurate) estimate $V(S_{t+1})$
  • Low variance: only one step of randomness enters the target
  • Online: updates after every step, and works for continuing tasks
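
The corresponding TD(0) sketch; unlike MC, it needs only a single transition, so it can run online:

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + gamma * V[s_next] * (not done)  # no bootstrap past a terminal state
    V[s] += alpha * (target - V[s])
    return V
```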

4.3 Bias-Variance Trade-off

  Monte Carlo TD(0)
Target $G_t$ (actual return) $r_t + \gamma V(S_{t+1})$
Bias Unbiased Biased (depends on $V$ estimate)
Variance High (accumulates trajectory randomness) Low (only single-step randomness)
Bootstrap No Yes
Task type Episodic only Episodic and Continuing

Why does TD have lower variance?

The MC target $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ contains all rewards from $t$ to termination, so their variances add up. The TD target has only $r_t$ and $S_{t+1}$ as random variables; although $V(S_{t+1})$ may be inaccurate (biased), it does not accumulate randomness over many steps.

4.4 n-step TD and TD(λ)

n-step TD is intermediate between MC and TD(0):

\[G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(S_{t+n})\]

The TD(λ) method computes a weighted average of all n-step returns:

\[G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}\]
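
A small helper that computes $G_t^{(n)}$ from the next $n$ rewards and the bootstrap state (a sketch; the argument layout is my own choice):

```python
def n_step_return(rewards, V, s_n, n, gamma=0.99):
    """G_t^{(n)} = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(S_{t+n}).

    rewards: the n rewards r_t, ..., r_{t+n-1}; s_n: the bootstrap state S_{t+n}.
    """
    G = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return G + gamma**n * V[s_n]
```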

5. Q-Learning and SARSA

To find the optimal policy, we need to estimate the action value function $Q$, so that the policy can be derived via $\arg\max$.

5.1 On-policy vs Off-policy

The distinction concerns whether the policy that generates the data (the behavior policy) is the same one being learned (the target policy):

  • On-policy: behavior policy = target policy; the agent evaluates and improves the very policy it uses to act (e.g., SARSA)
  • Off-policy: behavior policy ≠ target policy; the agent can behave $\epsilon$-greedily while learning the greedy optimal policy (e.g., Q-Learning)

Off-policy advantages:

  • Can learn from data generated by other policies, including older versions of itself
  • Enables experience reuse (the Replay Buffer in DQN depends on this)
  • Can keep exploring while still learning the optimal policy

5.2 SARSA (On-policy)

SARSA is an on-policy TD control method, named after the quintuple $(S_t, A_t, R_t, S_{t+1}, A_{t+1})$ needed for updates.

SARSA Update:

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( r_t + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)\]

where $A_{t+1}$ is the action actually sampled according to current policy (e.g., $\epsilon$-greedy).
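
A sketch of one SARSA training episode, assuming a Gymnasium-style discrete environment and a NumPy Q-table (the environment API is an assumption for illustration, not from the article):

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    """One SARSA episode: the bootstrap action A_{t+1} is the one actually executed next."""
    s, _ = env.reset()
    a = epsilon_greedy(Q, s, env.action_space.n, eps)
    done = False
    while not done:
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a2 = epsilon_greedy(Q, s2, env.action_space.n, eps)
        target = r + gamma * Q[s2, a2] * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s2, a2
    return Q
```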

5.3 Q-Learning (Off-policy)

Q-Learning directly approximates optimal $Q^*$, not the current policy’s $Q^\pi$.

Q-Learning Update:

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( r_t + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)\]

Key difference: SARSA uses $Q(S_{t+1}, A_{t+1})$ (the actually sampled action), while Q-Learning uses $\max_{a'} Q(S_{t+1}, a')$ (the greedy action).

Theorem (Q-Learning Convergence): Q-Learning converges to $Q^{*}$ under the following conditions:

  1. All state-action pairs are visited infinitely often
  2. Learning rate satisfies Robbins-Monro conditions: $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$
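
To make the one-line difference from SARSA concrete, here is a minimal sketch of a single tabular Q-Learning update: the target bootstraps with `max` instead of the next action the behavior policy will actually take.

```python
def q_learning_step(Q, s, a, r, s2, terminated, alpha=0.1, gamma=0.99):
    """One Q-Learning update: bootstrap with max_a' Q(s', a'), regardless of the
    action the behavior policy executes next (off-policy)."""
    target = r + gamma * Q[s2].max() * (not terminated)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```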

5.4 Cliff Walking Example

Cliff Walking is a classic Grid World environment that clearly demonstrates the behavioral difference between Q-Learning and SARSA: Q-Learning learns the optimal path that hugs the cliff edge, while SARSA learns a longer but safer path away from the cliff. With $\epsilon$-greedy exploration during training, Q-Learning's agent occasionally steps off the cliff while exploring, so its online return is lower than SARSA's.

Root cause: SARSA's target includes the actually executed (possibly exploratory) next action, so states near the cliff look dangerous; Q-Learning's target always selects the $\max$, ignoring exploration risk.

| | Q-Learning | SARSA |
| --- | --- | --- |
| Type | Off-policy | On-policy |
| TD target | $r + \gamma \max_{a'} Q(s',a')$ | $r + \gamma Q(s', a')$ |
| Learning target | $Q^*$ (optimal policy) | $Q^\pi$ (current policy) |
| Behavior | More aggressive/optimistic | More conservative |

6. Deep Q-Network (DQN)

Tabular Q-Learning cannot handle large state spaces (like image inputs) or continuous state spaces. DQN uses neural networks to approximate $Q^*$, pioneering deep reinforcement learning.

6.1 Motivation for Function Approximation

Tabular method limitations:

  • Storage: a $|\mathcal{S}| \times |\mathcal{A}|$ table is infeasible for huge state spaces (e.g., raw pixels)
  • No generalization: every entry is learned independently, so experience in one state says nothing about similar states
  • Continuous state spaces cannot be enumerated at all

Solution: Use parameterized function $Q(s,a;\theta)$ (e.g., neural network) to approximate $Q^*$.

6.2 DQN Loss Function

Transform Q-Learning update into a regression problem:

\[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( \underbrace{r + \gamma \max_{a'} Q(s',a';\theta^-)}_{\text{TD target } y} - Q(s,a;\theta) \right)^2 \right]\]

where $\mathcal{D}$ is the Replay Buffer, $\theta^-$ is the Target Network parameters.
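
A PyTorch sketch of this loss on one sampled mini-batch (the network and tensor shapes are assumptions: `q_net(s)` is taken to return per-action values of shape `(batch, |A|)`):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """DQN regression loss on a sampled mini-batch; the target uses the frozen
    parameters theta^- and therefore carries no gradient."""
    s, a, r, s2, done = batch  # a: int64 actions; done: float 0/1 flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                   # no gradient through theta^-
        y = r + gamma * target_net(s2).max(dim=1).values * (1 - done)
    return F.mse_loss(q_sa, y)
```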

6.3 Experience Replay

Store transitions $(s_t, a_t, r_t, s_{t+1})$ in Replay Buffer $\mathcal{D}$, randomly sample mini-batches from $\mathcal{D}$ for training.

[Figure: Replay Buffer]

Benefits of Experience Replay:

  1. Break sample correlation: Random sampling provides more independent samples
  2. Improve data efficiency: Each sample can be reused multiple times
  3. Stabilize data distribution: Buffer distribution changes slowly
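
A minimal replay buffer along these lines (uniform sampling; a sketch, not the original DQN implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer: store transitions, sample uniform mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        return tuple(zip(*batch))  # -> (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```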

6.4 Target Network

Use old parameters $\theta^-$ to compute TD target, periodically update $\theta^- \leftarrow \theta$ (e.g., every $C$ steps).

Target Network purpose:

  • Keeps the regression target $y$ fixed between periodic updates of $\theta^-$, so each phase is close to standard supervised regression
  • Without it, the target moves with every gradient step ("chasing a moving target"), which easily causes oscillation or divergence

A variant is Soft Update: $\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$, where $\tau \ll 1$.
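
The soft update is one line per parameter in PyTorch (a sketch assuming `target_net` and `q_net` share the same architecture):

```python
import torch

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.005):
    """Polyak averaging: theta^- <- tau * theta + (1 - tau) * theta^-."""
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```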

6.5 DQN Algorithm

[Figure: DQN Algorithm]

DQN’s two key techniques solve deep RL stability problems:

  1. Experience Replay: Addresses sample correlation, improves data efficiency
  2. Target Network: Addresses target instability, prevents oscillation and divergence

6.6 DQN Variants

Double DQN: Addresses overestimation by decoupling action selection and value evaluation:

\[y = r + \gamma Q\left(s', \arg\max_{a'} Q(s',a';\theta); \theta^-\right)\]
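
The same target in code, reusing the (assumed) `q_net`/`target_net` names from the loss sketch above:

```python
import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma=0.99):
    """Double DQN target: select a* with the online net, evaluate it with the target net."""
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)        # action selection: theta
        q_eval = target_net(s2).gather(1, a_star).squeeze(1)  # action evaluation: theta^-
        return r + gamma * q_eval * (1 - done)
```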

Dueling DQN: Decomposes Q value into state value and action advantage:

\[Q(s,a;\theta) = V(s;\theta_v) + \left( A(s,a;\theta_a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta_a) \right)\]
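
A sketch of a dueling head in PyTorch; the layer sizes are arbitrary placeholders:

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s; theta_v)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a; theta_a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # subtract mean for identifiability
```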

Other improvements:

  • Prioritized Experience Replay: sample transitions with large TD error more often
  • Multi-step learning: use n-step returns as the TD target
  • Distributional RL (C51): model the full return distribution instead of its mean
  • Noisy Networks: learned parameter noise replacing $\epsilon$-greedy exploration

Rainbow combines all of these improvements and achieved state-of-the-art performance on the Atari benchmark at the time of its publication.

Summary

Core content:

  1. Bellman Equations are the theoretical foundation of Value-Based RL
    • Bellman Expectation: Describes value of a given policy
    • Bellman Optimality: Describes value of optimal policy
    • Bellman operator is a $\gamma$-contraction, guaranteeing iterative convergence
  2. Dynamic Programming (when model is known)
    • Policy Evaluation: Iteratively solve for $V^\pi$
    • Policy Iteration: Evaluate → Improve → Evaluate → …
    • Value Iteration: Directly iterate Bellman Optimality equation
  3. MC vs TD (when model is unknown)
    • MC: Unbiased, high variance, requires complete trajectory
    • TD: Biased, low variance, updates every step
    • n-step TD and TD(λ) interpolate between the two
  4. Q-Learning vs SARSA
    • Q-Learning: Off-policy, learns $Q^*$, more aggressive
    • SARSA: On-policy, learns $Q^\pi$, more conservative
  5. DQN: Deep Value-Based RL
    • Experience Replay: Breaks sample correlation
    • Target Network: Stabilizes TD target
    • Double DQN, Dueling DQN, and other improvements

The next article will introduce Policy-Based methods—directly optimizing parameterized policies—and the Actor-Critic architecture.
