RL Notes (4): Model-Based Methods & MARL
2025-12-19 · Qi Lu
This is the fourth article in the reinforcement learning series, introducing core concepts of Model-Based RL and Multi-Agent RL. Starting from the pursuit of sample efficiency, it explains World Models, Dyna architecture, and MCTS, ultimately demonstrating the powerful combination of these techniques through AlphaGo/AlphaZero as examples, and introduces game theory fundamentals and Self-Play methods.
Introduction: The Pursuit of Sample Efficiency
Core Problem
The Model-Free methods introduced in the previous three chapters (Q-Learning, Policy Gradient, PPO) are powerful but share a common flaw:
Extremely low sample efficiency—training an Atari game agent requires hundreds of millions of frames, equivalent to hundreds of hours of human play. Yet humans typically need only a few minutes to learn basic operations. Why is there such a huge gap?
The key difference is: humans have a world model in their minds. When we imagine “what would happen if I do this,” we are using this model for mental simulation, without needing to actually try it.
The core idea of Model-Based RL is precisely: learn or utilize an environment model to improve sample efficiency through planning.
Model-Free vs Model-Based
RL methods are divided into two major categories based on whether they use an environment model:
Model-Free vs Model-Based
- Model-Free: Does not learn or use an environment model, directly learns value functions or policies from real experience
- Model-Based: Learns or utilizes an environment model $\hat{P}(s'|s,a)$, $\hat{R}(s,a)$ and performs planning within the model
| Property | Model-Free | Model-Based |
|---|---|---|
| Environment Model | Not needed | Needed (known or learned) |
| Sample Efficiency | Low | High |
| Computational Cost | Low | High (planning) |
| Model Error | None | May accumulate |
| Use Cases | Model hard to obtain | Model known or easy to learn |
| Typical Algorithms | Q-Learning, PPO | Dyna, MCTS, MuZero |
Model-Based RL Overview
Definition of World Model
World Model is an estimate of environment dynamics, including:
- State transition model: $\hat{P}(s'|s,a) \approx P(s'|s,a)$
- Reward model: $\hat{R}(s,a) \approx R(s,a)$
With a World Model, the agent can simulate action consequences “in its mind” without actually executing them.
There are two sources for World Models:
- Known rules: Such as rules for board games, equations for physics engines
- Advantage: Model is precise, no error
- Disadvantage: Only applicable to domains with fully known rules
- Learning from data: Use neural networks to learn from interaction experience
- Advantage: Applicable to complex environments
- Disadvantage: Model has errors
Learning World Models
Learning a World Model is essentially a supervised learning problem. Given experience data $\{(s_t, a_t, r_t, s_{t+1})\}$:
- Deterministic model: Directly predict next state
- Probabilistic model: Predict state distribution
- Latent space model: Predict in low-dimensional latent space
Note: Modern World Model methods (such as Dreamer, MuZero) typically perform predictions in latent space, avoiding direct prediction of high-dimensional raw observations (like images), greatly reducing learning difficulty.
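As a concrete illustration of the deterministic case, here is a minimal PyTorch-style sketch of fitting a transition and reward model by supervised regression on collected $(s, a, r, s')$ tuples. The network sizes, `state_dim`/`action_dim`, and the training helper are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Deterministic world model: predicts next state and reward from (s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)

def train_step(model, optimizer, batch):
    """One supervised step on a batch of real transitions (s, a, r, s')."""
    s, a, r, s_next = batch
    pred_next, pred_r = model(s, a)
    loss = (nn.functional.mse_loss(pred_next, s_next)
            + nn.functional.mse_loss(pred_r, r))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```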
Model Bias Problem
Model Bias / Model Error: When the learned model $\hat{P}, \hat{R}$ differs from the real environment $P, R$, strategies obtained from planning in the model may perform poorly in the real environment.
The key issue with Model Bias is error compounding:
The error $\epsilon_t$ grows as the rollout gets longer.
Error Compounding Upper Bound Theorem: Let the single-step model error be $\epsilon = \max_{s,a} \|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1$. Then the total variation distance of the $H$-step predictions is bounded by
\[\text{TV}(\hat{P}^H, P^H) \leq H \cdot \epsilon\]
That is, the error accumulates at most linearly with the number of planning steps.
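For intuition, here is a rough sketch of the standard hybrid (telescoping) argument behind this bound; treat it as schematic, since the exact constant depends on whether the per-step error is measured in total variation or $L_1$ distance (the two differ by a factor of 2). Define $D_t$ as the $H$-step trajectory distribution that follows the learned model $\hat{P}$ for the first $t$ steps and the true $P$ afterwards, so that $D_0 = P^H$ and $D_H = \hat{P}^H$. Adjacent hybrids differ only in the kernel used at step $t$, hence
\[\text{TV}(\hat{P}^H, P^H) \leq \sum_{t=1}^{H} \text{TV}(D_t, D_{t-1}) \leq \sum_{t=1}^{H} \max_{s,a} \text{TV}\big(\hat{P}(\cdot \mid s,a),\, P(\cdot \mid s,a)\big) \leq H \cdot \epsilon\]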
Strategies to mitigate Model Bias:
- Short-term planning: Use model only for short-term predictions (such as 1-step in Dyna)
- Ensemble models: Train multiple models, use uncertainty to guide exploration
- Continuous correction: Continuously update model with real data
- Latent space planning: Plan in abstract space (such as MuZero)
Planning Methods
With an environment model, the next step is to utilize the model for planning. Based on planning timing, there are two categories:
Background Planning vs Decision-time Planning
- Background Planning: Outside of interaction with the real environment, use the model to generate simulated experience to train the policy
- Decision-time Planning: When a decision needs to be made, use the model for forward search to select the optimal action
Dyna Architecture
Dyna is a classic framework for Background Planning, proposed by Sutton in 1991. Its core idea is: after each real interaction, use the model to generate multiple simulated experiences to accelerate learning.
Each real interaction generates $n$ simulated steps
Dyna’s Core Advantages:
- Improved sample efficiency: Each real interaction can produce n simulated learning steps
- Flexible computation-sample tradeoff: Increasing n can trade more computation for less real interaction
- Gradual convergence: When the model is accurate, theoretically converges to the same policy as direct learning
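Below is a minimal tabular Dyna-Q sketch. It assumes a hypothetical environment with discrete, hashable states, a Gym-style `reset() -> state` and `step(a) -> (state, reward, done, info)` interface, and an `action_space.n` attribute; the point is the structure: one real Q-learning update followed by `n` planning updates drawn from the learned (deterministic) model.

```python
import random
from collections import defaultdict

def dyna_q(env, n_planning=10, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Dyna-Q: Q-learning on real experience plus n model-based updates."""
    Q = defaultdict(float)        # Q[(s, a)] -> value estimate
    model = {}                    # model[(s, a)] -> (r, s', done) from the last real visit
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done, _ = env.step(a)

            # (a) Direct RL: Q-learning update from the real transition
            target = r + gamma * (0 if done else Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # (b) Model learning: remember the observed outcome
            model[(s, a)] = (r, s_next, done)

            # (c) Planning: n simulated updates from previously visited (s, a) pairs
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr + gamma * (0 if pdone else Q[(ps_next, greedy(ps_next))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s_next
    return Q
```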
Decision-time Planning
Unlike Background Planning, Decision-time Planning performs planning at each decision point:
- Starting from current state, use the model to simulate multiple possible trajectories
- Evaluate the return of each trajectory
- Select the optimal first-step action
- Re-plan after execution (do not save intermediate results)
Characteristics of Decision-time Planning:
- Computation focused: All computation serves the current decision
- Dynamic precision: Can adjust search depth and breadth as needed
- No training needed: Can be directly used at test time
The most famous Decision-time Planning method is Monte Carlo Tree Search (MCTS).
Monte Carlo Tree Search (MCTS)
MCTS is a tree search-based Decision-time Planning method widely used in board games and combinatorial optimization problems.
Core Idea of MCTS
The goal of MCTS is to estimate the value of each action in the current state within a limited computational budget. Its core idea is: selectively expand the search tree, concentrating computational resources on the most promising branches.
How to decide which branch is “most promising”? This requires balancing exploitation (selecting known good branches) and exploration (trying uncertain branches).
MCTS Four-Step Process
Each iteration of MCTS includes four steps:
- Selection: Starting from the root node, recursively select child nodes using a tree policy (such as UCB) until reaching a leaf node (an incompletely expanded node).
- Expansion: If the leaf node is not a terminal state, expand one or more new child nodes based on the feasible actions.
- Evaluation: Evaluate the value of the newly expanded node. Traditional methods use a rollout (random simulation to the end of the game); modern methods use a value network for direct estimation.
- Backup: Backpropagate the evaluation value along the selection path, updating the visit count $N$ and value estimate $Q$ of every node on the path.
UCB Formula
The core of the Selection phase is the UCB (Upper Confidence Bound) formula, which elegantly balances exploitation and exploration:
UCB for Trees (UCT)
In the Selection phase, select the action that maximizes the following value:
\[\text{UCB}(s, a) = \underbrace{Q(s, a)}_{\text{exploitation}} + \underbrace{c \sqrt{\frac{\ln N(s)}{N(s, a)}}}_{\text{exploration}}\]
Where:
- $Q(s, a)$: Average value estimate of action $a$ (statistics from historical simulations)
- $N(s)$: Total visit count of state $s$
- $N(s, a)$: Number of times action $a$ was executed in state $s$
- $c$: Exploration coefficient, controls exploration-exploitation tradeoff
Intuitive understanding of UCB:
- Exploitation term $Q(s,a)$: Select actions with good historical performance
- Exploration term: Select actions with low visit counts (high uncertainty)
- As visit count increases, exploration bonus gradually decreases, eventually dominated by exploitation
- Larger $c$ means more exploration; smaller $c$ means more exploitation
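A direct transcription of the formula into code, as a minimal sketch; the `node.N`, `node.children`, and per-child `Q`, `N` fields are assumed bookkeeping, not a particular library's API:

```python
import math

def ucb_select(node, c=1.4):
    """Pick the child maximizing Q(s,a) + c * sqrt(ln N(s) / N(s,a)).

    Unvisited children score +inf so every action is tried at least once.
    """
    def score(child):
        if child.N == 0:
            return float("inf")
        return child.Q + c * math.sqrt(math.log(node.N) / child.N)
    return max(node.children.values(), key=score)
```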
MCTS Algorithm
Important: MCTS final action selection criterion:
- During training/search: Use UCB (balance exploration-exploitation)
- Final decision: Select action with most visits (more robust)
The most-visited action is preferred over the one with the highest average value because a large visit count means the estimate is backed by many simulations and is therefore more reliable.
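Putting the four steps and the final selection rule together, here is a compact plain-UCT sketch. It assumes a hypothetical deterministic `model` exposing `legal_actions(state)` and `step(state, action) -> (next_state, reward, done)`, board-game-style rewards (mostly at episode end), and it omits details such as the sign flip between alternating players in two-player games.

```python
import math
import random

class Node:
    """Search-tree node holding visit count N and total backed-up value W."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}        # action -> Node
        self.N, self.W = 0, 0.0

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def mcts(root_state, model, n_simulations=800, c=1.4, rollout_depth=50):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend with UCB while the node is fully expanded
        while node.children and len(node.children) == len(model.legal_actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.Q + c * math.sqrt(math.log(node.N) / ch.N))
        # 2. Expansion: add one untried child (skipped at terminal states)
        untried = [a for a in model.legal_actions(node.state) if a not in node.children]
        value = 0.0
        if untried:
            a = random.choice(untried)
            state, value, done = model.step(node.state, a)
            node.children[a] = Node(state, parent=node)
            node = node.children[a]
            # 3. Evaluation: random rollout from the newly expanded node
            for _ in range(rollout_depth):
                if done:
                    break
                actions = model.legal_actions(state)
                if not actions:
                    break
                state, reward, done = model.step(state, random.choice(actions))
                value += reward
        # 4. Backup: propagate the value up the selection/expansion path
        while node is not None:
            node.N += 1
            node.W += value
            node = node.parent
    # Final decision: return the most-visited root action (not the highest Q)
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```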
AlphaGo and AlphaZero
AlphaGo and AlphaZero are milestone achievements of MCTS + deep learning + Self-Play.
The Challenge of Go
Go is considered the most difficult board game for AI to master:
- Enormous search space: An average of $\sim 200$ legal moves per turn and roughly $200$ moves per game, with $\sim 10^{170}$ possible board positions
- Difficult position evaluation: Unlike chess with clear piece values, the merit of a Go position is hard to quantify
- Long-term planning: Need to consider strategic impact dozens of moves ahead
Traditional Go programs relied on brute-force search plus hand-crafted evaluation functions and reached only amateur dan level.
AlphaGo Architecture (2016)
AlphaGo defeated world champion Lee Sedol 4:1 in 2016. Its architecture includes:
- Policy Network $p_\theta(a|s)$:
- Input: Board state (multi-channel features)
- Output: Probability of placing a stone at each position
- Training: First supervised learning on human game records, then further improved with Policy Gradient through self-play
- Value Network $v_\phi(s)$:
- Input: Board state
- Output: Win rate estimate for current position $v_\phi(s) \approx \mathbb{E}[z|s]$, where $z \in \{-1, +1\}$
- Training: Supervised learning on $(s, z)$ data generated from self-play
- Improved MCTS:
- Selection: Guided by Policy Network (PUCT formula)
- Evaluation: Mix Value Network and rollout
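The PUCT rule mentioned above replaces the pure count-based bonus with a prior term supplied by the Policy Network. A minimal sketch (the exact constants and scaling differ between the published AlphaGo/AlphaZero variants):

```python
import math

def puct_score(parent_N, child_N, child_Q, prior_p, c_puct=1.5):
    """PUCT: Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)).

    prior_p is the Policy Network probability p_theta(a|s) for this action.
    """
    return child_Q + c_puct * prior_p * math.sqrt(parent_N) / (1 + child_N)
```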
AlphaZero’s Simplification and Transcendence (2017)
AlphaZero in 2017 significantly simplified AlphaGo’s design while achieving even stronger performance:
| Property | AlphaGo | AlphaZero |
|---|---|---|
| Human game records | Needed (supervised pretraining) | Not needed |
| Network structure | Separate Policy + Value | Unified network |
| Rollout | Needed | Not needed |
| Feature engineering | Hand-crafted features | Raw board input |
| Training time | Months | Hours |
| Applicable games | Only Go | Go, Chess, Shogi |
A single network outputs both policy and value: the shared representation means fewer parameters and more efficient training.
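As an illustration of the two-headed design, here is a minimal PyTorch-style sketch; the board encoding, channel counts, and the tiny convolutional trunk are placeholder assumptions (the real AlphaZero network uses a deep residual tower).

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk with a policy head and a value head (toy-sized sketch)."""
    def __init__(self, board_size: int = 19, in_channels: int = 17, channels: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        n_cells = board_size * board_size
        self.policy_head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * n_cells, n_cells + 1)  # +1 for "pass"
        )
        self.value_head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * n_cells, 1), nn.Tanh()  # value in [-1, 1]
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```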
AlphaZero Training Loop
AlphaZero’s training is a self-reinforcing loop:
Positive loop: Better network → better search → better training data → better network
The network is trained by minimizing the loss
\[L(\theta) = (z - v_\theta(s))^2 - \pi_{\text{MCTS}}^\top \log p_\theta(s) + c\|\theta\|^2\]
Where each loss term means:
- $(z - v_\theta(s))^2$: Value loss, making value prediction approach game outcome
- $-\pi_{\text{MCTS}}^\top \log p_\theta(s)$: Policy loss, making policy approach MCTS search result
- $c\|\theta\|^2$: L2 regularization term
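In code, the three terms combine roughly as follows. This is a hedged sketch: `policy_logits`, `value`, `pi_mcts`, and `z` are assumed batched tensors, `params` would be something like `model.parameters()`, and in practice the L2 term is usually handled via the optimizer's `weight_decay` rather than written out explicitly.

```python
import torch
import torch.nn.functional as F

def alphazero_style_loss(policy_logits, value, pi_mcts, z, params, c=1e-4):
    """Value MSE + policy cross-entropy against the MCTS visit distribution + L2."""
    value_loss = F.mse_loss(value, z)                                   # (z - v_theta)^2
    policy_loss = -(pi_mcts * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    l2 = c * sum((p ** 2).sum() for p in params)                        # c * ||theta||^2
    return value_loss + policy_loss + l2
```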
AlphaZero’s Core Insights:
- MCTS as policy improvement: Search-generated $\pi_{\text{MCTS}}$ is better than raw network $p_\theta$
- Network learns search: Network is trained to imitate MCTS output
- Positive loop: Better network → better search → better training data → better network
This loop requires no human knowledge, learning completely from scratch (tabula rasa).
Multi-Agent RL Basics
When multiple agents exist in the environment, the problem becomes more complex.
From Single-Agent to Multi-Agent
Core problem: When other agents are also learning and changing their strategies, the environment is non-stationary for a single agent. This breaks the basic assumption of MDPs.
In multi-agent environments, state transitions and rewards depend not only on one’s own actions but also on other agents’ actions:
\[P(s'|s, a_1, a_2, \ldots, a_n), \quad R_i(s, a_1, a_2, \ldots, a_n)\]
When other agents' strategies $\pi_{-i}$ change during the learning process, the environment is non-stationary from agent $i$'s perspective.
Impact of non-stationarity:
- Convergence guarantees of single-agent RL no longer apply
- Optimal response strategy varies with opponent’s strategy
- May exhibit policy oscillation, unable to converge
Game Theory Basics
Multi-agent problems can be described using game theory language.
Normal-form Game
An $n$-player game consists of the following elements:
- Player set: $\mathcal{N} = \{1, 2, \ldots, n\}$
- Strategy space: Each player $i$’s strategy set $\mathcal{A}_i$
- Utility function: $u_i: \mathcal{A}_1 \times \cdots \times \mathcal{A}_n \to \mathbb{R}$, representing player $i$’s payoff under each strategy combination
Prisoner’s Dilemma Example:
Two suspects are interrogated separately, each choosing “cooperate” (stay silent) or “defect” (confess):
| | Player 2: Cooperate | Player 2: Defect |
|---|---|---|
| Player 1: Cooperate | (-1, -1) | (-3, 0) |
| Player 1: Defect | (0, -3) | (-2, -2) |
Analysis:
- Regardless of opponent’s choice, “defect” is always more advantageous (dominant strategy)
- Result: Both defect, each gets -2
- But if both cooperate, each gets -1 (Pareto superior)
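The dominant-strategy argument can be checked mechanically. A small sketch using the payoff table above (rows are Player 1's action, columns are Player 2's action, with 0 = cooperate and 1 = defect):

```python
import numpy as np

# Payoffs from the table above: u1[i, j], u2[i, j]; action 0 = cooperate, 1 = defect
u1 = np.array([[-1, -3],
               [ 0, -2]])
u2 = np.array([[-1,  0],
               [-3, -2]])

# "Defect" dominates for Player 1: it is better whatever Player 2 does
for j in range(2):
    assert u1[1, j] > u1[0, j]

# (Defect, Defect) is a Nash equilibrium: no unilateral deviation helps
assert u1[1, 1] >= u1[0, 1] and u2[1, 1] >= u2[1, 0]
print("Mutual defection is an equilibrium; mutual cooperation is Pareto superior:",
      (u1[0, 0], u2[0, 0]), "vs", (u1[1, 1], u2[1, 1]))
```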
Cooperative vs Competitive Settings
Multi-agent scenarios mainly fall into two categories:
- Cooperative: All agents share rewards, maximize team total return
- Examples: Multi-robot collaborative carrying, multi-agent coordinated navigation
- Challenge: Credit Assignment—how much did each agent contribute?
- Method: Centralized Training Decentralized Execution (CTDE)
- Competitive/Zero-sum: One party’s gain equals another’s loss
- Examples: Board games, adversarial games
- Property: $u_1 + u_2 = 0$
- Goal: Find Nash equilibrium
Nash Equilibrium
Nash Equilibrium
A strategy combination $(\pi_1^{*}, \pi_2^{*}, \ldots, \pi_n^{*})$ is a Nash equilibrium if and only if no player has an incentive to unilaterally change their strategy:
\[\forall i, \forall \pi_i: \quad u_i(\pi_i^*, \pi_{-i}^*) \geq u_i(\pi_i, \pi_{-i}^*)\]
Where $\pi_{-i}^*$ represents the strategies of all players except player $i$.
Meaning of Nash equilibrium:
- Each player is making a best response to other players’ strategies
- It is a stable state: no one has an incentive to unilaterally deviate
- Not necessarily globally optimal (as in Prisoner’s Dilemma where mutual defection is Nash equilibrium but not Pareto optimal)
Nash Equilibrium in Rock-Paper-Scissors:
| | Rock | Scissors | Paper |
|---|---|---|---|
| Rock | (0, 0) | (1, -1) | (-1, 1) |
| Scissors | (-1, 1) | (0, 0) | (1, -1) |
| Paper | (1, -1) | (-1, 1) | (0, 0) |
Nash equilibrium: Both players randomly choose with probability $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
Any deterministic strategy can be exploited by the opponent; only a mixed strategy (randomization) can achieve equilibrium.
Nash Equilibrium Existence Theorem: Every finite game (finite players, finite strategies) has at least one Nash equilibrium (possibly a mixed strategy equilibrium).
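For the Rock-Paper-Scissors example above, the equilibrium condition can be verified numerically: against a uniformly random opponent, every pure reply yields the same expected payoff (zero), so no deviation is profitable. A minimal check:

```python
import numpy as np

# Player 1's payoffs, rows/cols ordered Rock, Scissors, Paper (zero-sum: u2 = -u1)
u1 = np.array([[ 0,  1, -1],
               [-1,  0,  1],
               [ 1, -1,  0]])

mixed = np.ones(3) / 3          # the (1/3, 1/3, 1/3) mixed strategy
payoff_vs_mixed = u1 @ mixed    # expected payoff of each pure reply

print(payoff_vs_mixed)          # [0. 0. 0.] -> no pure deviation is profitable
assert np.allclose(payoff_vs_mixed, 0.0)
```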
Self-Play Methods
Self-Play is a powerful method for training game AI, and is one of the keys to AlphaGo/AlphaZero’s success.
Definition of Self-Play
Self-Play: The agent plays against itself (or its historical versions), learning to improve its strategy from game experience.
Advantages of Self-Play
- Unlimited data: Can generate arbitrarily large amounts of game data, not limited by the number of available human games
- Adaptive difficulty: Opponent grows stronger with oneself, always providing appropriate challenge
- Early stage: Weak opponent, easy to learn basic strategies
- Late stage: Strong opponent, drives learning of advanced strategies
- Discover new strategies: Not limited by human prior knowledge, may discover innovative strategies unknown to humans
- AlphaGo’s “shoulder hit” and other new moves shocked professional players
- Approach Nash equilibrium: In two-player zero-sum games, Self-Play can converge toward a Nash equilibrium (under suitable conditions; see the caveat below)
Challenges of Self-Play
- Strategy forgetting:
- When strategy updates, may “forget” how to deal with old strategies
- Solution: Maintain historical opponent pool, randomly sample opponents
- Local optima:
- May fall into local optimum of “only good at playing against itself”
- Example: Two versions counter each other, forming a cycle
- Solution: Add diversity rewards, or sample from opponent pool
- Difficulty in evaluation:
- No fixed baseline to measure progress
- Solution: Use Elo rating system or play against fixed opponents
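A minimal sketch of the opponent-pool idea from the list above, assuming hypothetical `Agent` objects whose parameters can be deep-copied; the snapshot cadence, pool size, and sampling ratio are arbitrary choices.

```python
import copy
import random

class OpponentPool:
    """Keep periodic snapshots of past policies and sample opponents from them."""
    def __init__(self, max_size=20, p_latest=0.5):
        self.snapshots = []
        self.max_size = max_size
        self.p_latest = p_latest     # fraction of games against the current policy

    def add(self, agent):
        """Snapshot the current agent (call every k training iterations)."""
        self.snapshots.append(copy.deepcopy(agent))
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)    # drop the oldest snapshot

    def sample(self, current_agent):
        """Mostly self-play against the latest policy, sometimes a historical one."""
        if not self.snapshots or random.random() < self.p_latest:
            return current_agent
        return random.choice(self.snapshots)
```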
Relationship between Self-Play and Nash Equilibrium:
- In two-player zero-sum games, if Self-Play converges, it converges to Nash equilibrium
- Intuition: Nash equilibrium is a “fixed point of best responses,” Self-Play iteratively seeks best responses
- But convergence is not guaranteed—may have strategy cycles
Chapter Summary
- Model-Based RL uses environment models to improve sample efficiency
- World Model = state transition + reward model
- Model Bias: Model errors accumulate, requiring short-term planning or continuous correction
- Dyna architecture combines direct learning with planning
- After each real interaction, use model to generate n simulated experiences
- Provides flexible computation-sample tradeoff
- MCTS is the representative of Decision-time Planning
- Four-step process: Selection, Expansion, Evaluation, Backup
- UCB formula balances exploration and exploitation
- AlphaGo/AlphaZero demonstrates the powerful combination of MCTS + deep learning + Self-Play
- AlphaZero starts from scratch, requires no human knowledge
- Core loop: MCTS improves policy → network learns search → positive reinforcement
- Multi-Agent RL faces non-stationarity challenges
- Nash equilibrium: Stable strategy combination
- Self-Play: Effective method for training game AI
AlphaZero = MCTS + Policy Network + Value Network + Self-Play
The next blog post will enter the field of combining LLMs with RL, introducing methods like RLHF and DPO used for language model alignment.