LLM-RL Training Stability: Root Cause Analysis and Solutions
2025-12-19 · Qi Lu
Introduction
In large language model reinforcement learning (LLM-RL) training, it’s common to observe training curves that rise steadily for a period before suddenly collapsing. Whether it’s complex reasoning RL or multi-turn tool-calling Agentic RL, many practitioners have encountered this mysterious training collapse.
This blog post synthesizes multiple important works including ByteDance’s When Speed Kills Stability, Qwen team’s Stabilizing RL with LLMs, and vLLM’s Bitwise Consistent Training to systematically analyze the root causes of LLM-RL training instability and summarize practical solutions.
Notation Conventions:
| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy being optimized (computed by training engine) |
| $\pi_{\text{old}}$ | Policy at sampling time (computed by training engine, but with old parameters) |
| $\pi_\text{vllm}$ | Policy computed by inference engine (vLLM/SGLang, numerically different from $\pi$) |
| $\pi_{\text{ref}}$ | Reference policy (anchor for KL regularization, typically the SFT model) |
Core distinction: $\pi$ vs $\pi_\text{vllm}$ represents numerical differences of the same parameters across different engines, while $\pi_\theta$ vs $\pi_{\text{old}}$ represents parameter differences at different time steps within the same engine.
Problem Manifestation: Sudden Collapse
Typical collapse patterns include:
- Training Reward: Steady increase → sudden drop or severe oscillations
- Gradient Norm: Normal range → sudden explosion
- PPL (Perplexity): Stable → sharp spike
- Entropy: Gradual decline → abnormal fluctuations
Most perplexingly, these collapses are often unpredictable—the same code and hyperparameters may behave completely differently across different GPUs.
Root Cause Analysis
LLM-RL training instability has two levels of root causes:
- System Level: Training-Inference Mismatch (numerical differences between inference engine $\pi_\text{vllm}$ and training engine $\pi$)
- Algorithm Level: Token-Sequence Mismatch (token-level optimization objective vs sequence-level reward)
These two issues are independent but compound each other. Let’s analyze them separately.
Root Cause One: Training-Inference Mismatch
The Trade-off Between Speed and Consistency
Modern LLM-RL systems typically use high-speed inference engines (such as vLLM, SGLang) for rollout sampling, while using training frameworks (such as FSDP, Megatron-LM) for parameter updates. These two types of systems have fundamentally different optimization objectives:
| System | Optimization Goal | Typical Techniques |
|---|---|---|
| Inference Engine | Throughput maximization | Speculative Decoding, INT8/FP8, batch-variant CUDA kernels |
| Training Framework | Numerical stability | FP32 Master Weights, deterministic operators |
This divergence in optimization objectives leads to inevitable numerical inconsistencies. Even with identical parameters, the inference engine’s computed $\pi_\text{vllm}(y\mid x)$ and the training engine’s computed $\pi(y\mid x)$ will differ.
Actual Gradient vs Theoretical Gradient
In theory, the on-policy policy gradient should be:
\[\mathbb{E}_{y \sim \pi_\theta} \left[ R(x,y) \nabla_\theta \log \pi_\theta(y|x) \right]\]
But in practice, since samples come from the inference engine $\pi_\text{vllm}$, the gradient actually estimated is:
\[\mathbb{E}_{y \sim \pi_\text{vllm}} \left[ R(x,y) \nabla_\theta \log \pi_\theta(y|x) \right]\]
This means: you think you are doing on-policy training, but you are actually doing off-policy training.
Root Cause Two: Token-Sequence Mismatch
Mainstream RL algorithms (PPO, GRPO) use token-level optimization objectives, but rewards are sequence-level.
First-order Approximation Theoretical Foundation: The sequence-level IS weight can be expanded as:
\[\frac{\pi_\theta(y|x)}{\pi_\text{vllm}(y|x)} = \prod_{t=1}^{|y|}(1+\delta_t) \approx 1 + \sum_{t=1}^{|y|}\delta_t\]
where $\delta_t = \frac{\pi_\theta(y_t \mid s_t)}{\pi_\text{vllm}(y_t \mid s_t)} - 1$. This shows that the token-level objective is a first-order approximation of the sequence-level objective, ignoring higher-order terms of $O(\delta^2)$.
IS Weight Decomposition: The token-level IS weight can be decomposed into two factors:
\[\frac{\pi_\theta(y_t|s_t)}{\pi_\text{vllm}(y_t|s_t)} = \underbrace{\frac{\pi_{\text{old}}(y_t|s_t)}{\pi_\text{vllm}(y_t|s_t)}}_{\text{Training-Inference Discrepancy}} \times \underbrace{\frac{\pi_\theta(y_t|s_t)}{\pi_{\text{old}}(y_t|s_t)}}_{\text{Policy Staleness}}\]
- Training-Inference Discrepancy: Numerical differences between inference engine $\pi_\text{vllm}$ and training engine $\pi$ (Root Cause One)
- Policy Staleness: Drift between $\pi_\theta$ and $\pi_{\text{old}}$ as multiple mini-batch updates are taken on the same rollout batch
This decomposition shows: The two root causes compound multiplicatively—if either deviates significantly, the IS weight will deviate from 1, undermining the validity of the first-order approximation.
Specific Manifestations and Challenges
Challenge One: Low Probability Token Trap
Mismatch is most severe at low probability tokens. When vLLM samples a token with near-zero probability, the probability computed by FSDP may be orders of magnitude lower than vLLM’s, leading to:
- PPL explosion (perplexity is inversely related to the assigned probability, which approaches 0)
- Gradient explosion
- Training collapse
This explains why multi-turn tool-calling scenarios are particularly prone to collapse—tool-returned OOD (Out-of-Distribution) text causes the model to generate more low probability tokens.
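To make the trap concrete, here is a toy calculation; the probabilities below are invented purely for illustration.

```python
import math

# Invented numbers: the inference engine sampled a rare token with p ~ 1e-3,
# while the training engine assigns it p ~ 1e-5 under the same parameters.
logp_vllm = math.log(1e-3)
logp_train = math.log(1e-5)

ratio = math.exp(logp_train - logp_vllm)   # token-level IS weight pi / pi_vllm
ppl_factor = math.exp(-logp_train)         # this token's contribution to training PPL

print(f"IS weight:     {ratio:.3g}")       # 0.01 -> far from 1
print(f"per-token 1/p: {ppl_factor:.3g}")  # 1e5  -> PPL spikes
```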
Challenge Two: Hardware Differences Amplify the Problem
The degree of mismatch varies drastically across different GPU architectures:
\[\text{vllm-kl}: \quad \text{H20} < \text{L20} < \text{A100}\]
- H20: $5 \times 10^{-4}$ ~ $10^{-3}$
- L20: $10^{-3}$ ~ $10^{-2}$
- A100: $10^{-2}$ ~ $1$
The same code may collapse on L20, but resuming from checkpoint on H20 can immediately stabilize training!
Challenge Three: High Variance and Entropy Collapse
High Variance Problem: In long CoT scenarios, the accumulation of token-level IS weights leads to variance explosion:
\[\prod_{t=1}^{T} \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)} = \exp\left(\sum_{t=1}^{T} \log \rho_t\right), \quad \rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}\]
Logprob differences accumulate linearly over long sequences and become exponential differences after exponentiation. As a consequence, gradients are dominated by very few samples and very few tokens, which is especially catastrophic for MoE models: a small number of extreme updates may break expert routing.
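A quick simulation of the effect; the per-token gaps below are synthetic and only meant to show the scaling behavior.

```python
import torch

torch.manual_seed(0)
n_seq, T = 64, 2000   # a batch of long CoT rollouts (synthetic)

# Small, zero-mean per-token gaps log(pi_theta) - log(pi_old); each looks harmless.
delta_logp = 0.03 * torch.randn(n_seq, T)

# Product of token ratios = exp(sum of log ratios): linear accumulation, then exponentiation.
seq_ratio = delta_logp.sum(dim=-1).exp()

print(f"median sequence IS weight: {seq_ratio.median():.2f}")
print(f"max sequence IS weight:    {seq_ratio.max():.2f}")
# The max is typically far above the median, so a handful of sequences dominate the gradient.
```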
Entropy Collapse: RL optimization tends to enhance the probability of high-reward tokens while compressing low-probability tokens, leading to continuous policy entropy decline. When entropy approaches 0:
- Exploration capability is lost
- Diversity collapses
- Unable to discover new solutions
Research shows that policy performance comes at the cost of entropy consumption, with a theoretical upper bound.
Solutions
Solution One: Sequence-Level Importance Sampling
The correct unbiased estimator requires applying the importance ratio over the entire sequence:
\[g_{\text{seq}} = \mathbb{E}_{y \sim \pi_\text{vllm}} \left[ \frac{\pi_\theta(y|x)}{\pi_\text{vllm}(y|x)} \cdot R(x,y) \cdot \nabla_\theta \log \pi_\theta(y|x) \right]\]
In practice, there are two variants:
- Truncated IS (TIS): Truncate the ratio \(\rho(y) \gets \min(\rho(y), C)\)
- Masked IS (MIS): Directly mask sequences exceeding the threshold \(\rho(y) \gets \rho(y) \cdot \mathbb{I}\{\rho(y) \le C\}\)
Experiments show that MIS works better than TIS, not only stabilizing training but also exceeding the peak performance before collapse.
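A minimal sketch of both variants, assuming per-token logprobs from the training and inference engines are already gathered into `[batch, seq_len]` tensors; the tensor names and the threshold value are illustrative.

```python
import torch

def sequence_is_weights(logp_train, logp_vllm, resp_mask, C=2.0, mode="mis"):
    """Sequence-level IS weight rho(y) = pi_theta(y|x) / pi_vllm(y|x).

    logp_train, logp_vllm: [batch, seq_len] per-token logprobs
    resp_mask: [batch, seq_len], 1 on response tokens, 0 on padding
    C: truncation / masking threshold (illustrative value)
    """
    # Sequence log-ratio = sum of token logprob differences over the response.
    log_rho = ((logp_train - logp_vllm) * resp_mask).sum(dim=-1)
    rho = log_rho.exp()

    if mode == "tis":   # Truncated IS: cap the weight at C
        return rho.clamp(max=C)
    if mode == "mis":   # Masked IS: drop sequences whose weight exceeds C
        return rho * (rho <= C).float()
    raise ValueError(f"unknown mode: {mode}")

# The returned [batch] weights multiply the per-sequence policy-gradient loss.
```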
Solution Two: Off-Policy Sequence Masking (DeepSeek-V3.2)
DeepSeek-V3.2 adopts a more refined masking strategy:
\[M_{i,t} = \begin{cases} 0 & \text{if } \hat{A}_{i,t} < 0 \text{ and } \frac{1}{\lvert o_i \rvert}\sum \log \frac{\pi_{\text{old}}}{\pi_\theta} > \delta \\ 1 & \text{otherwise} \end{cases}\]
Core idea: only mask sequences with negative advantage and off-policy degree exceeding the threshold $\delta$.
Here $\frac{1}{\lvert o_i \rvert}\sum \log \frac{\pi_{\text{old}}}{\pi_\theta}$ measures the off-policy degree: it is the per-token average log-ratio, i.e., a sample estimate of the per-token KL divergence (equivalently, the log of the geometric mean of the ratios). The length normalization avoids systematically discarding long sequences.
Why only mask negative advantage? Samples with positive advantage, even if off-policy, still provide useful gradient directions; whereas off-policy samples with negative advantage may introduce harmful gradient noise.
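A sketch of this masking rule under the same tensor conventions as the previous example; the threshold `delta` is an assumption, and the advantage is taken to be one value per sequence, as in GRPO-style training.

```python
import torch

def offpolicy_sequence_mask(logp_old, logp_theta, advantages, resp_mask, delta=0.02):
    """M_i = 0 iff A_i < 0 and the length-normalized off-policy degree exceeds delta.

    logp_old, logp_theta: [batch, seq_len] per-token logprobs
    advantages: [batch] one advantage per sequence (GRPO-style)
    resp_mask: [batch, seq_len], 1 on response tokens
    delta: off-policy threshold (illustrative value)
    """
    lengths = resp_mask.sum(dim=-1).clamp(min=1)
    # (1/|o_i|) * sum_t log(pi_old / pi_theta): per-token average log-ratio
    offpolicy_degree = ((logp_old - logp_theta) * resp_mask).sum(dim=-1) / lengths

    drop = (advantages < 0) & (offpolicy_degree > delta)
    return (~drop).float()   # [batch] mask to multiply into the per-sequence loss
```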
DeepSeek-V3.2 also introduces complementary stabilization techniques:
Keep Routing (MoE-specific): Expert routing in inference and training frameworks may be inconsistent. The solution is to save the routing path during inference and force the same path during training.
Keep Sampling Mask: Top-p/top-k sampling truncates low probability tokens, causing inconsistent action spaces between $\pi_{\text{old}}$ and $\pi_\theta$. The solution is to save the truncation mask during sampling and apply the same mask to $\pi_\theta$ during training.
Unbiased KL Estimation: The standard K3 estimator is biased in off-policy settings. Corrected formula: \(D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \frac{\pi_\theta}{\pi_{\text{old}}} \left[ \frac{\pi_{\text{ref}}}{\pi_\theta} - \log \frac{\pi_{\text{ref}}}{\pi_\theta} - 1 \right]\)
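In code, the corrected estimator looks roughly like this (per token, built from logprobs); how the result is weighted into the loss is left to the training framework.

```python
import torch

def unbiased_kl(logp_theta, logp_old, logp_ref):
    """Off-policy-corrected K3 estimate of KL(pi_theta || pi_ref), per token.

    Plain K3 assumes tokens were sampled from pi_theta; here they came from
    pi_old, so the estimate is reweighted by pi_theta / pi_old.
    """
    is_weight = (logp_theta - logp_old).exp()   # pi_theta / pi_old
    log_r = logp_ref - logp_theta               # log(pi_ref / pi_theta)
    k3 = log_r.exp() - log_r - 1.0              # r - log r - 1
    return is_weight * k3
```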
Solution Three: Bitwise Consistent Training
Another path is to make inference and training completely consistent.
Core approach:
- Audit every kernel call in the forward pass
- Import vLLM’s fused operations (SiLU MLPs, RMSNorm) into the training framework
- Implement corresponding backward passes
Experiments show that enabling bitwise consistency results in:
- Faster convergence
- Higher final rewards
- More stable training
But the cost is an approximately 2.4x slowdown.
Solution Four: IcePop (Token-Level Discrepancy Masking)
IcePop, proposed in the Ring-1T work, handles mismatch at token granularity, complementing the sequence-level methods above.
Core idea: Define token-level ratio $k_{i,t} = \frac{\pi(y_t \mid s_t)}{\pi_\text{vllm}(y_t \mid s_t)}$, and mask tokens exceeding reasonable ranges:
\[M(k) = \begin{cases} k & \text{if } k \in [\alpha, \beta] \\ 0 & \text{otherwise} \end{cases}\]
Typical parameters: $\alpha = 0.5$, $\beta = 5.0$.
Bidirectional Truncation: Unlike sequence-level MIS which only focuses on $k > C$, IcePop handles both directions:
- $k > \beta$: Training probability far exceeds inference probability (may cause gradient explosion)
- $k < \alpha$: Training probability far below inference probability (may cause PPL explosion)
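A sketch of the bidirectional mask, using the typical $\alpha$, $\beta$ quoted above; following the formula, in-range tokens keep their ratio and out-of-range tokens are zeroed.

```python
import torch

def icepop_weight(logp_train, logp_vllm, alpha=0.5, beta=5.0):
    """M(k): keep the token ratio k = pi / pi_vllm when k is in [alpha, beta], else 0.

    logp_train, logp_vllm: [batch, seq_len] per-token logprobs
    Zeroed tokens contribute no gradient; the rest keep their ratio.
    """
    k = (logp_train - logp_vllm).exp()
    in_range = (k >= alpha) & (k <= beta)
    return torch.where(in_range, k, torch.zeros_like(k))
```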
Why is token-level effective? In MoE models, differences in expert routing cause uneven mismatch distribution across token positions. Token-level masking can precisely remove problematic tokens rather than discarding entire sequences.
Comparison with sequence-level:
| Method | Granularity | Advantages | Disadvantages |
|---|---|---|---|
| Seq-MIS | Sequence | Theoretically unbiased | May discard too much data |
| IcePop | Token | Fine-grained control | Doesn’t correct state occupancy |
In practice, they can be combined: first use IcePop to handle extreme tokens, then use sequence-level methods to handle overall drift.
Solution Five: GSPO (Sequence-Level IS)
GSPO (Group Sequence Policy Optimization) elevates the IS operation to the sequence level:
\[s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\right)^{1/\lvert y_i \rvert}\]
Core improvements:
- First apply length normalization to the sequence-level ratio, then clip
- All tokens in the same sequence share the same IS weight
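A sketch of the sequence-level ratio and a PPO-style clipped surrogate built on it; the clip range `eps` is a placeholder, since the length-normalized ratio stays much closer to 1 than token-level ratios and should be tuned accordingly.

```python
import torch

def gspo_sequence_ratio(logp_theta, logp_old, resp_mask):
    """s_i(theta) = (pi_theta(y_i|x) / pi_old(y_i|x)) ** (1/|y_i|), computed in log space."""
    lengths = resp_mask.sum(dim=-1).clamp(min=1)
    log_ratio = ((logp_theta - logp_old) * resp_mask).sum(dim=-1) / lengths
    return log_ratio.exp()   # [batch]

def gspo_loss(logp_theta, logp_old, resp_mask, advantages, eps=0.2):
    """Clipped surrogate with one shared sequence-level ratio per response."""
    s = gspo_sequence_ratio(logp_theta, logp_old, resp_mask)
    unclipped = s * advantages
    clipped = s.clamp(1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```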
Differences from GRPO:
| Dimension | GRPO | GSPO |
|---|---|---|
| IS Granularity | Token-level | Sequence-level |
| Clip Target | Each token’s ratio | Normalized ratio of entire sequence |
| Long Sequence Stability | Poor (variance explosion) | Good (length normalization) |
Advantages:
- Avoids variance explosion from token-level weight multiplication
- More stable for MoE models
- Simplifies RL infrastructure design
Solution Six: Multiple Sampling Estimation (MoE-specific)
The KAT-Coder team proposed a different perspective: for MoE models, sampling noise itself is the dominant factor causing training instability, not training-inference inconsistency.
Noise Source Analysis:
| Model Type | Train-Infer Gap | Inference Noise Variance | Training Noise Variance |
|---|---|---|---|
| Dense | ~0.002 | ~$10^{-5}$ | 0 (Megatron deterministic) |
| MoE | ~0.008 | ~$10^{-3}$ | ~$10^{-7}$ (scatter_add randomness) |
MoE’s inference noise variance is two orders of magnitude higher than Dense—this is the primary cause of instability.
Core Method: When computing $\pi_{\text{old}}$, directly use the inference engine to recompute n times (n=8) and take the average:
\[\hat{\pi}_{\text{old}}(y \mid x) = \frac{1}{n} \sum_{i=1}^{n} \pi_{\text{inference}}^{(i)}(y \mid x)\]
Key Advantages:
- Obtains unbiased estimate with variance reduced by factor of n
- No training engine recompute needed, directly uses inference engine
- In asynchronous frameworks, multiple sampling time can overlap with rollout
- KV cache hit rate approaches 100%
- End-to-end 10-20% training time reduction in practice
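A sketch of the estimator, assuming the inference engine exposes some scoring call that returns per-token probabilities for a fixed completion; `score_probs` is a hypothetical interface, not a real vLLM API.

```python
import torch

def estimate_logp_old(score_probs, prompt, response_token_ids, n=8):
    """Estimate pi_old(y|x) by averaging n independent inference-engine recomputations.

    score_probs(prompt, response_token_ids) -> [seq_len] per-token probabilities
    of the fixed response under the inference engine (hypothetical interface).
    Averaging in probability space is unbiased and cuts the variance by 1/n.
    """
    samples = torch.stack([score_probs(prompt, response_token_ids) for _ in range(n)])
    return samples.mean(dim=0).log()   # log pi_old for use in the RL objective
```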
Comparison with Other Solutions:
| Solution | Issue |
|---|---|
| Routing Replay | Hard to guarantee prefix cache hits in large-scale agentic scenarios |
| Truncated IS (TIS) | Sensitive to truncation boundary, doesn’t address root cause of estimation bias |
| Deterministic Inference | Requires deep inference engine modifications, 40-70% throughput drop |
| Multiple Sampling Estimation | No hyperparameters, engineering-friendly, best performance |
Experiments on Qwen3-235B-A22B show that recompute and rollout_logprob methods crash after 60-80 steps, while this method maintains stable growth and outperforms TIS.
Solution Seven: Engineering Tuning
Some practical engineering approaches:
| Method | Effect | Notes |
|---|---|---|
| Lower top-p | Reduce low probability tokens | Sacrifices exploration |
| Switch GPU | H20 most stable | When hardware is flexible |
| Disable Cascade Attention | Significantly reduces mismatch on A100 | A100 users |
| FP32 LM Head | Slight improvement | Limited effectiveness |
Monitoring Metrics
vllm-kl is an important early warning indicator:
\[\text{vllm-kl} = \mathbb{E}_{s, a \sim \pi_\text{vllm}} \left[ \log \frac{\pi_\text{vllm}(a|s)}{\pi(a|s)} \right]\]
Recommended to monitor simultaneously:
- vllm-kl: Degree of mismatch
- fsdp-ppl: Training engine perplexity
- Gradient norm: Stability indicator
- Entropy: Policy distribution health
When vllm-kl shows a spike, it often foreshadows an impending collapse.
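A minimal way to log vllm-kl from the per-token logprobs most RL frameworks already collect; tensor names are illustrative.

```python
import torch

def vllm_kl(logp_vllm, logp_train, resp_mask):
    """Monte-Carlo estimate of E_{a ~ pi_vllm}[log pi_vllm(a|s) - log pi(a|s)]
    over the response tokens the inference engine actually sampled."""
    diff = (logp_vllm - logp_train) * resp_mask
    return diff.sum() / resp_mask.sum().clamp(min=1)
```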
Practical Recommendations
- Accept that mismatch is inevitable: This is a fundamental trade-off between speed and consistency, not a temporary bug.
- Use sequence-level corrections: Token-level IS is theoretically biased and will fail on complex tasks. MIS or Geo-RS is recommended.
- Monitor vllm-kl: This is the most direct health indicator.
- Verify hardware impact: Test on target hardware; results may not be portable.
- For MoE models: Consider Routing Replay to stabilize expert routing.
- Trade-off in solution selection:
  - Pursue highest fidelity → Bitwise Consistent Training (sacrifices speed)
  - Pursue a practical balance → Sequence-Level MIS + appropriate top-p
Summary
The stability problem in LLM-RL training is essentially a side effect of modern system architecture division of labor. The optimizations made by inference engines and training frameworks for efficiency are amplified into systemic instability through RL’s closed-loop feedback.
Key insights for understanding this problem:
- Seemingly on-policy but actually off-policy
- Token-level correction insufficient, sequence-level needed
- Low probability tokens are the weakest link
- High variance and entropy collapse require special handling
As Reasoning RL and Agentic RL continue to develop, this problem will only become more important. I hope this article helps you avoid some pitfalls in practice.