Entropy Control in LLM-RL: A Systematic Survey from Entropy Collapse to Exploration
2025-12-23 · Qi Lu
Introduction
In 2025, a series of studies emerged in LLM reinforcement learning (especially RLVR: Reinforcement Learning with Verifiable Rewards) addressing Entropy Collapse. The core problem: during RL training, model output diversity gradually decreases, leading to loss of exploration capability and premature convergence to suboptimal solutions.
This article reviews the relevant work in this field, organized by publication timeline, extracting core insights and formulas, followed by a unified analysis and a critical reflection.
Paper Timeline
March 2025: DAPO — Industrial-Scale RL System
Paper: DAPO: An Open-Source LLM Reinforcement Learning System at Scale Institution: ByteDance Seed Date: 2025-03-18
Core Problem
The DAPO team observed in large-scale RL training:
“PPO and GRPO suffer from entropy collapse — entropy of policy decreases quickly with training, causing sampled responses to become identical.”
Core Method: Clip-Higher (Decoupled Clipping)
Standard GRPO uses a single parameter $\epsilon$ for clipping:
\[\text{clip}(\rho, 1-\epsilon, 1+\epsilon)\]
DAPO decouples it into $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$:
\[\text{clip}(\rho, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}})\]
Key configuration: $\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.28$
“By increasing the value of $\epsilon_{\text{high}}$, we leave more room for the increase of low-probability tokens. This adjustment effectively enhances the policy’s entropy.”
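As a minimal sketch (not DAPO's actual code), the decoupled clipped objective can be written as follows; `logprobs`, `old_logprobs`, and `advantages` are per-token tensors, and the function name is illustrative:

```python
import torch

def clip_higher_loss(logprobs, old_logprobs, advantages, eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style clipped objective with a decoupled clip range (sketch).

    A larger eps_high leaves more upside room for low-probability tokens,
    which is what mitigates entropy collapse.
    """
    ratio = torch.exp(logprobs - old_logprobs)              # rho = pi_theta / pi_old
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) objective, negated for gradient descent
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```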
Four Key Techniques
| Technique | Effect |
|---|---|
| Clip-Higher | More room for low-probability tokens, mitigates entropy collapse |
| Dynamic Sampling | Improves training efficiency and stability |
| Token-Level PG Loss | Adapts to long CoT scenarios |
| Overlong Reward Shaping | Reduces reward noise from truncated samples |
Results
Based on Qwen2.5-32B, DAPO achieves 50 points on AIME 2024, surpassing DeepSeek-R1-Zero-Qwen-32B (47 points) with only 50% of the training steps.
April 2025: VAPO — Value-Augmented Policy Optimization
Paper: VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks Institution: ByteDance Date: 2025-04-07
Core Contribution
VAPO introduces seven techniques on top of PPO, significantly improving value learning and balancing exploration against exploitation.
Key Techniques
- Clip-Higher: Adopts DAPO’s asymmetric clipping ($\epsilon_{\text{high}} = 0.28$, $\epsilon_{\text{low}} = 0.2$)
- Value Learning Improvements: More accurate value estimation reduces variance
- Exploration-Exploitation Balance: Maintains stable entropy, neither collapsing nor exploding
Key Results
“VAPO matches DAPO’s performance using only 60% of DAPO’s steps and achieves a new SOTA score of 60.4 within just 5,000 steps.”
“VAPO maintains stable entropy — neither collapsing nor becoming excessively high.”
May 2025: SEED-GRPO — Semantic Entropy-Guided Optimization
Paper: SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization Date: 2025-05-18
Core Problem
“Vanilla GRPO treats all prompts equally during policy updates, ignoring important information about the model’s knowledge boundaries.”
Key Insight: Semantic Entropy vs Token Entropy
- Token Entropy: Measures uncertainty at individual token positions
- Semantic Entropy: Measures semantic diversity across multiple responses, clustering by meaning rather than form
“Semantic entropy clusters responses based on meaning rather than form. This makes semantic entropy a more faithful and robust indicator of a model’s uncertainty.”
Core Formula
Given prompt $q$, sample $G$ responses $\{o_1, \ldots, o_G\}$ and modulate the advantage using semantic entropy:
\[\hat{A}_i = f(\text{SE}(q)) \cdot (r_i - \bar{r})\]
where $\text{SE}(q)$ is the semantic entropy and $f$ is the modulation function.
Strategy: High semantic entropy (model uncertain) → conservative update; Low semantic entropy (model confident) → aggressive update.
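A hedged sketch of this modulation, assuming responses have already been clustered by meaning so that `cluster_ids` is available; the linear down-scaling used for $f$ is one plausible choice, not necessarily the paper's:

```python
import numpy as np

def semantic_entropy(cluster_ids):
    """Entropy over the meaning-clusters of the G sampled responses."""
    _, counts = np.unique(cluster_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def modulated_advantages(rewards, cluster_ids, se_max):
    """Shrink updates on high-semantic-entropy (uncertain) prompts."""
    rewards = np.asarray(rewards, dtype=float)
    scale = 1.0 - semantic_entropy(cluster_ids) / se_max   # one possible f(SE)
    return scale * (rewards - rewards.mean())
```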
May 2025: Unearthing Gems from Stones — Mining Correct Steps from Negative Samples
Paper: Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning Institution: CASIA, StepFun Date: 2025-05-20
Core Problem
“Negative responses contain valuable components such as self-reflection and error-correction steps, yet existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL).”
Core Method: BCPG-NSA
Three-Stage Pipeline:
- Sample Segmentation: Use SAT model to segment long reasoning trajectories into independent steps
- Consensus Assessment: LLM Judge + PRM dual judgment for step correctness
- NSA Optimization: Give positive rewards to correct steps within negative samples
Core Idea
“Mining positive steps within negative samples” — not simply penalizing entire negative samples, but extracting correct reasoning tokens.
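A toy sketch of this idea under simplifying assumptions: step boundaries (`step_spans`) and per-step correctness labels from the judge/PRM consensus are given, and the spans tile the whole trajectory. Names and reward values are illustrative:

```python
def nsa_token_rewards(step_spans, step_correct, neg_reward=-1.0, mined_reward=0.5):
    """Per-token rewards inside a *negative* sample (sketch).

    step_spans:   list of (start, end) token indices for each segmented step
    step_correct: per-step booleans from the LLM-judge + PRM consensus
    """
    total_len = step_spans[-1][1]            # assumes spans tile the trajectory
    rewards = [neg_reward] * total_len       # default: blanket penalty
    for (start, end), ok in zip(step_spans, step_correct):
        if ok:  # "gem" mined from the stone: reward the correct step instead
            rewards[start:end] = [mined_reward] * (end - start)
    return rewards
```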
May 2025: The Entropy Mechanism — Mathematical Law of Entropy-Performance Trade-off
Paper: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Institution: PRIME-RL (Shanghai AI Lab) Date: 2025-05-28
Core Discovery: R = -a·exp(H) + b
The paper proposes an empirical entropy-performance relationship:
\[R = -a \cdot e^H + b\]
where $R$ is downstream performance, $H$ is policy entropy, and $a, b$ are fitted coefficients.
“This empirical law strongly indicates that policy performance is traded from policy entropy, thus bottlenecked by its exhaustion. The ceiling is fully predictable: $H=0 \Rightarrow R = -a + b$.”
Implication: Performance is gained at the cost of entropy; when entropy is exhausted, performance hits ceiling.
Core Theorem: Covariance Drives Entropy Change
For vanilla policy gradient, the logit update is:
\[\Delta z_{s,a} = \eta \cdot \pi_\theta(a|s) \cdot A(s,a)\]
Entropy change formula (Theorem 1):
\[H(\pi^{k+1}_\theta|s) - H(\pi^k_\theta|s) \approx -\eta \cdot \text{Cov}_{a \sim \pi^k_\theta}\left[\log \pi^k_\theta(a|s),\ \pi^k_\theta(a|s) \cdot A(s,a)\right]\]
Key Insight
“A high-probability action with high advantage would reduce policy entropy, while a rare action with high advantage would increase policy entropy.”
| Scenario | Entropy Change |
|---|---|
| High probability + High advantage | Decrease |
| Low probability + High advantage | Increase |
Empirically, the covariance term remains positive → entropy monotonically decreases.
Solutions: Clip-Cov and KL-Cov
- Clip-Cov: randomly select a portion of positive-covariance tokens and detach their gradients (see the sketch after this list)
- KL-Cov: apply a KL penalty to the highest-covariance tokens
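A rough sketch of the Clip-Cov idea; the per-token covariance proxy and the selection heuristic here are illustrative, not the paper's exact implementation:

```python
import torch

def clip_cov_mask(logprobs, probs, advantages, clip_frac=0.002):
    """0/1 mask that detaches gradients on some positive-covariance tokens."""
    # Per-token contribution to Cov[log pi, pi * A] (centered terms)
    term1 = logprobs - logprobs.mean()
    term2 = probs * advantages - (probs * advantages).mean()
    cov = term1 * term2
    mask = torch.ones_like(cov)
    pos = (cov > 0).nonzero(as_tuple=True)[0]       # entropy-reducing tokens
    k = max(1, int(clip_frac * cov.numel()))
    if len(pos) > 0:
        drop = pos[torch.randperm(len(pos))[:k]]    # random subset, per Clip-Cov
        mask[drop] = 0.0                            # these tokens get no gradient
    return mask
```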
Results
| Model | Method | AIME24 | AIME25 |
|---|---|---|---|
| Qwen2.5-32B | GRPO | baseline | baseline |
| Qwen2.5-32B | KL-Cov | +15.0% | +14.6% |
Merged into verl framework.
May 2025: OPO — Advantages of Exact On-Policy Training
Paper: On-Policy RL with Optimal Reward Baseline Institution: Microsoft Research Date: 2025-05-29
Core View
OPO emphasizes strict, exact on-policy training, in contrast to PPO/GRPO's practice of performing multiple gradient updates per rollout batch.
Two Innovations
- Exact On-Policy: a single gradient update per batch (`ppo_mini_batch_size = train_batch_size`)
- Optimal Baseline: a baseline that depends on both the policy and the reward, minimizing gradient variance (sketched below)
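For the baseline, a small sketch of the length-weighted form: assuming, as the paper motivates, that a response's squared gradient norm is approximated by its token length, the variance-minimizing baseline reduces to a length-weighted mean reward:

```python
def optimal_baseline(rewards, lengths):
    """Length-weighted mean reward as the baseline (sketch).

    Approximates b* = E[||grad log pi||^2 * R] / E[||grad log pi||^2]
    using response length as a proxy for the squared gradient norm.
    """
    return sum(l * r for l, r in zip(lengths, rewards)) / sum(lengths)
```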
Key Finding
“Exact on-policy training demonstrates superior pass@1 performance and significantly lower KL divergence and higher output entropy throughout training compared to off-policy variants.”
Configuration: `entropy_coeff: 0`, `use_kl_loss: False` — no explicit entropy regularization needed!
Results
| Benchmark | OPO | GRPO |
|---|---|---|
| MATH-500 | 95.26% | 95.10% |
| AIME 2025 (Pass@16) | 85.33% | 81.33% |
Merged into verl framework.
May 2025: Skywork-OR1 — MAGIC Pipeline with Adaptive Entropy Control
Paper: Skywork Open Reasoner 1 Technical Report Institution: Skywork AI Date: 2025-05-29 Open Source: GitHub
MAGIC Pipeline
MAGIC = Multi-stage Adaptive entropy scheduling for GRPO In Convergence
Core components:
- Rigorous data collection (offline + online filtering)
- Multi-stage training (progressively increasing context length)
- High-temperature sampling to enhance exploration
Adaptive Entropy Control
“By leveraging adaptive entropy control, we maintain the model’s entropy at a reasonable level throughout training and effectively prevent premature collapse.”
Uses an adaptive coefficient $\alpha_k$ to dynamically adjust the weight of the entropy term, as sketched below.
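A hedged sketch of one such adaptive rule; this simple bang-bang update is illustrative, not Skywork-OR1's exact scheme:

```python
def update_entropy_coeff(alpha, current_entropy, target_entropy, step_size=1e-3):
    """Nudge the entropy-term weight alpha_k to keep H near the target."""
    if current_entropy < target_entropy:
        alpha += step_size                     # entropy too low: push exploration
    else:
        alpha = max(0.0, alpha - step_size)    # entropy high enough: back off
    return alpha
```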
Key Finding
“Training approaches that accelerate entropy collapse lead to worse test performance.”
May 2025: ProRL — Stability for Prolonged RL Training
Paper: ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries Date: 2025-05-30
Core Problem
How to maintain stability in prolonged RL training?
Solutions
- KL Divergence Penalty: provides stronger stability than Clip-Higher (see the sketch below)
- Periodic Reference Policy Reset: periodically reset the reference policy to the current policy
“While DAPO and temperature adjustment help slow entropy collapse, explicit regularization via KL divergence penalty provides a stronger and more stable solution.”
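A minimal sketch of the two mechanisms, assuming per-token log-probs are available; the k3-style KL estimator and the reset cadence are illustrative choices:

```python
import torch

def prorl_loss(logprobs, ref_logprobs, advantages, kl_coeff=0.001):
    """Policy-gradient loss with an explicit KL-to-reference penalty (sketch).

    logprobs / ref_logprobs: per-token log-probs under current / reference policy.
    """
    pg_loss = -(logprobs * advantages).mean()
    # k3 estimator of KL(pi || pi_ref): r - log r - 1, with r = pi_ref / pi
    log_ratio = ref_logprobs - logprobs
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()
    return pg_loss + kl_coeff * kl

# Periodic reference reset, e.g. every few hundred steps:
#   if step % reset_interval == 0:
#       ref_model.load_state_dict(model.state_dict())
```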
June 2025: Beyond the 80/20 Rule — High-Entropy Minority Tokens Drive Effective RL
Paper: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Institution: Qwen/Alibaba, Tsinghua University Date: 2025-06-02 Venue: NeurIPS 2025
Core Discovery: 20% High-Entropy Tokens Determine Everything
“Only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways.”
Token Distribution:
- ~80% tokens: Low entropy (sentence completion, deterministic elements)
- ~20% tokens: High entropy (logical connectors like “however”, “because”, “thus”)
Core Experiment
Training with only top 20% high-entropy token gradients:
| Model | Full Gradient | Top 20% Only |
|---|---|---|
| Qwen3-32B (AIME’25) | baseline | +11.04 |
| Qwen3-32B (AIME’24) | baseline | +7.71 |
| Qwen3-14B (AIME’25) | baseline | +4.79 |
“Training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance.”
Why It Works
“RL tends to preserve or increase the entropy of forking tokens, maintaining flexible reasoning paths. In contrast, SFT reduces token entropy, leading to memorization and poor generalization.”
Practical Configuration
Set `top_entropy_quantile = 0.2` in GRPO.
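A sketch of what this setting does, assuming per-token entropies and losses are available as 1-D tensors (names are illustrative):

```python
import torch

def high_entropy_token_loss(per_token_loss, per_token_entropy, quantile=0.2):
    """Keep gradients only for the top-`quantile` highest-entropy tokens."""
    threshold = torch.quantile(per_token_entropy, 1.0 - quantile)
    mask = (per_token_entropy >= threshold).float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```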
June 2025: The Surprising Effectiveness of Negative Reinforcement
Paper: The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning Date: 2025-06-02 Venue: NeurIPS 2025
Core Concepts: PSR vs NSR
Decomposing RLVR’s learning signal:
| Term | Meaning |
|---|---|
| PSR (Positive Sample Reinforcement) | Reinforce correct answers |
| NSR (Negative Sample Reinforcement) | Penalize incorrect answers |
Core Finding
“Training with only negative samples — without reinforcing correct responses — can be highly effective: it consistently improves performance over the base model across the entire Pass@k spectrum.”
How NSR Works
“NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model’s prior beliefs. This effectively refines its existing knowledge without aggressively teaching new behaviors.”
Core Formula: W-REINFORCE
\[\mathcal{L}_{\text{W-REINFORCE}}(\theta) = \lambda \cdot \mathcal{L}_{\text{PSR}}(\theta) + \mathcal{L}_{\text{NSR}}(\theta)\]
Recommended: $\lambda = 0.1$ (significantly down-weighting positive samples).
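A minimal sketch of the weighted objective, assuming sequence-level log-probs and $\pm 1$ rewards (the split by reward sign mirrors the PSR/NSR decomposition; names are illustrative):

```python
import torch

def w_reinforce_loss(seq_logprobs, rewards, lam=0.1):
    """W-REINFORCE: lambda * PSR loss + NSR loss (sketch).

    seq_logprobs: summed log-prob of each sampled response
    rewards:      +1 for correct, -1 for incorrect responses
    """
    pos = (rewards > 0).float()
    psr = -(pos * rewards * seq_logprobs).sum()        # reinforce correct answers
    nsr = -((1 - pos) * rewards * seq_logprobs).sum()  # penalize incorrect answers
    return (lam * psr + nsr) / len(rewards)
```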
Experimental Conclusions
| Method | Pass@1 | Pass@256 |
|---|---|---|
| PSR only | Best | Poor |
| NSR only | Poor | Near best |
| W-REINFORCE | Good | Best |
NSR is crucial for maintaining entropy → diversity at large k.
June 2025: Rewarding the Unlikely — Fixing GRPO’s Rank Bias
Paper: Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening Institution: CMU Date: 2025-06-03 Venue: EMNLP 2025
Core Problem: Rank Bias
“A degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening.”
Consequence: the model learns to solve already-solvable problems with fewer samples, but at large N it underperforms Pass@N of simply sampling more from the original model.
Solution: Unlikeliness Reward
“Explicitly up-weighting rare but correct solutions.”
Give extra reward to low-probability but correct solutions.
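One illustrative way to realize this (the rank-based bonus form is an assumption, not the paper's exact reward): within each sampled group, correct responses receive a bonus that grows with how unlikely they are relative to their peers:

```python
import numpy as np

def unlikeliness_rewards(rewards, seq_logprobs, bonus_scale=0.5):
    """Add a bonus to correct responses that grows with their relative
    unlikeliness within the sampled group (rank-based, illustrative)."""
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(seq_logprobs)               # least likely first
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(len(order))
    rarity = 1.0 - ranks / max(1, len(order) - 1)  # 1.0 = least likely sample
    return rewards + bonus_scale * rarity * (rewards > 0)
```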
June 2025: LUFFY — Off-Policy Guidance Maintains High Entropy
Paper: LUFFY: Learning to reason Under oFF-policY guidance Date: 2025-06-11
Core Problem
Limitation of on-policy RL: model can only learn from its own generations, unable to access superior reasoning patterns.
Solution
Introduce off-policy guidance from stronger policies (e.g., DeepSeek-R1).
Key Finding
“LUFFY consistently sustains higher entropy compared to On-Policy RL throughout training. The generation entropy of On-Policy RL rapidly converges to nearly zero after ~200 steps, while the elevated entropy in LUFFY allows continuous exploration.”
| Method | Entropy after 200 steps |
|---|---|
| On-Policy RL | ~0 |
| LUFFY | Remains high |
June 2025: Dr. GRPO — Fixing GRPO’s Bias
Paper: Dr. GRPO (Getting GRPO Done Right) Date: 2025-06
Core Problem
GRPO’s length normalization and std normalization may cause biased optimization, leading models to generate longer incorrect answers.
Solution
“Removing both the length and std normalization terms in GRPO.”
Simple but effective fix.
October 2025: Rethinking Entropy Interventions — From an Entropy Change Perspective
Paper: Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective Institution: Zhejiang University Date: 2025-10-11
Core Criticism
“Existing methods attempt to control entropy indirectly by only adjusting related factors such as the advantage signal and generation probability. Their effectiveness is inherently limited and prone to failure.”
Existing methods (like Clip-Higher, advantage reweighting) only control entropy indirectly, with limited effectiveness.
Core Method: STEER
STEER = Stabilizing Token-level Entropy-changE via Reweighting
Core idea: Analyze each token’s entropy change, directly control entropy dynamics at token level.
“The overall entropy dynamics during training arises from the accumulation of per-token entropy changes.”
November 2025: Revisiting Entropy in RLVR — Positive Samples Are the Main Cause
Paper: Revisiting Entropy in Reinforcement Learning for Large Reasoning Models Date: 2025-11-08
Core Discovery
“Entropy collapse in RLVR primarily arises from tokens with positive advantages, and regulating their relative loss weights provides an effective means to control entropy.”
Counter-intuitive: It’s not negative samples causing entropy collapse, but positive samples!
Theoretical Explanation
Combined with the Entropy Mechanism paper’s formula:
- High probability + Positive advantage → Large entropy decrease
- Low probability + Positive advantage → Entropy increase (but rare)
Since correct answers are often already high-probability, reinforcing them further sharpens the distribution.
Key Factors
Experiments identify three key factors affecting entropy:
- Number of off-policy updates: More updates → easier collapse
- Training data diversity: Lower diversity → easier collapse
- Clip threshold: Improper settings accelerate collapse
November 2025: EntroPIC — Stabilizing Entropy with Control Theory
Paper: EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control Institution: Tencent AI Lab Date: 2025-11-20
Core Idea
Use PID controller (industrial control theory) to stabilize entropy!
Core Formula
Let the target entropy be $H_{\text{target}}$ and the current entropy be $H(n)$; the error is:
\[e(n) = H_{\text{target}} - H(n)\]
PI control signal:
\[\alpha(n) = K_p \cdot e(n) + K_i \cdot \sum_{t=0}^{n} e(t)\]
Where:
- $K_p$: Proportional coefficient (responds to current error)
- $K_i$: Integral coefficient (responds to accumulated error)
Dynamically adjust positive/negative sample loss weights based on $\alpha(n)$.
Code Implementation
```python
# PI controller for entropy stabilization (following the formulas above)
entropy_error = target_entropy - current_entropy    # e(n) = H_target - H(n)
accumulated_entropy_error += entropy_error          # running sum of e(t)
# Integral term: responds to the accumulated entropy error
control_alpha_i = K_i * accumulated_entropy_error
# Proportional term: responds to the current entropy error
control_alpha_p = K_p * entropy_error
# Total control signal, used to reweight positive/negative sample losses
control_alpha = control_alpha_i + control_alpha_p
```
Results
“Successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs. Validated through successful training on over 1M prompts.”
Works for both on-policy and off-policy training.
December 2025: SENT — Dual-Layer Semantic and Token Entropy Control
Paper: Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning Date: 2025-12-04
Core Framework
SENT = Semantic ENtropy with Token-level entropy optimization
Dual-Layer Design:
| Level | Method | Effect |
|---|---|---|
| Data Level | Semantic entropy-guided curriculum learning | Organize training data from easy to hard |
| Algorithm Level | Token-level entropy optimization | Apply KL regularization to low-entropy tokens |
Semantic Entropy Curriculum Learning
“Organizing training data from low to high semantic entropy guides progressive optimization from easier to more challenging tasks.”
Principle: Build reasoning capabilities on easier problems first, avoid encountering hard problems too early which leads to aggressive updates and entropy collapse.
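A minimal sketch of the data-level step, assuming each prompt already carries a semantic-entropy score computed from clustering its sampled responses:

```python
def semantic_entropy_curriculum(prompts, se_scores):
    """Order training prompts from low to high semantic entropy (easy -> hard)."""
    return [p for _, p in sorted(zip(se_scores, prompts), key=lambda pair: pair[0])]
```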
Unified Analysis
1. Root Cause of Entropy Collapse
Mathematically, entropy change is driven by covariance:
\[\Delta H \propto -\text{Cov}[\log \pi(a|s),\ \pi(a|s) \cdot A(s,a)]\]
| Scenario | Entropy Change | Frequency |
|---|---|---|
| High probability + High advantage | Large decrease | High (correct answers are usually high-probability) |
| Low probability + High advantage | Increase | Low |
| Any + Negative advantage | Opposite effect | - |
Conclusion: Positive samples are the main cause of entropy collapse, because correct answers are often already high-probability.
2. Entropy-Performance Trade-off
\[R = -a \cdot e^H + b\]
This means:
- Entropy is a consumable resource
- Performance gains come at the cost of entropy
- When entropy is exhausted, performance hits ceiling ($H=0 \Rightarrow R_{\max} = -a + b$)
Practical Implication: This formula can predict the training performance ceiling.
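As a diagnostic, one can fit logged $(H, R)$ pairs and read off the predicted ceiling $b - a$; a sketch using scipy (array names are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_entropy_performance(H_log, R_log):
    """Fit R = -a * exp(H) + b and return (a, b, predicted_ceiling)."""
    f = lambda H, a, b: -a * np.exp(H) + b
    (a, b), _ = curve_fit(f, np.asarray(H_log), np.asarray(R_log))
    return a, b, b - a   # ceiling at H = 0: R = -a + b
```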
3. Mitigation Methods Classification
| Category | Method | Representative Paper |
|---|---|---|
| Clipping Strategy | Clip-Higher, decoupled $\epsilon$ | DAPO, VAPO |
| Covariance Control | Clip-Cov, KL-Cov | Entropy Mechanism |
| Token Filtering | Only use high-entropy token gradients | Beyond 80/20 |
| Sample Reweighting | W-REINFORCE, Unlikeliness Reward | NSR, Rewarding Unlikely |
| Direct Entropy Control | PID controller, adaptive coefficient | EntroPIC, Skywork-OR1 |
| Entropy Change Aware | Token-level entropy change reweighting | STEER |
| Curriculum Learning | Semantic entropy-ordered data | SENT, SEED-GRPO |
| Negative Sample Mining | Extract correct steps from incorrect answers | Unearthing Gems |
| On-Policy Optimization | Exact on-policy, optimal baseline | OPO |
| Off-Policy Guidance | External strong policy guidance | LUFFY |
| KL Regularization | KL penalty + periodic reset | ProRL |
| Normalization Fix | Remove length/std normalization | Dr. GRPO |
4. Role of Positive vs Negative Samples
| Sample Type | Effect on Entropy | Effect on Performance |
|---|---|---|
| Positive | Decrease entropy (sharpen distribution) | Improve Pass@1 |
| Negative | Maintain/increase entropy (preserve diversity) | Improve Pass@k (large k) |
Best Practice: W-REINFORCE recommends $\lambda = 0.1$, i.e., significantly down-weight positive samples.
5. Special Status of High-Entropy Tokens
Only ~20% of tokens are high-entropy, but they are:
- Reasoning fork points (like “however”, “because”)
- Key determinants of reasoning path diversity
- What RL should focus on optimizing
“RL preserves entropy of forking tokens → flexible reasoning. SFT reduces all entropy → memorization.”
6. Impact of Data Domain
| Data Type | Pretraining Exposure | Initial Entropy | Entropy Decay Rate |
|---|---|---|---|
| Math/Code | High | Lower | Fast |
| Synthetic Logic-game | Low | Higher | Slow |
Recommendation: Use synthetic data not seen during pretraining (e.g., SynLogic) to mitigate entropy collapse.
Practical Recommendations
Starter Configuration
- Use DAPO’s Clip-Higher ($\epsilon_{\text{high}} = 0.28$)
- Set `top_entropy_quantile = 0.2` to use only high-entropy token gradients
- Use W-REINFORCE to down-weight positive samples ($\lambda = 0.1$)
Advanced Configuration
- Implement Clip-Cov or KL-Cov for covariance-based update control
- Use EntroPIC’s PI controller for dynamic adjustment
- Adopt SENT’s semantic entropy curriculum learning
Monitoring Metrics
- Policy Entropy: Core health indicator
- R vs H Curve: Verify if it follows $R = -a \cdot e^H + b$
- Token Entropy Distribution: Ratio and position of high-entropy tokens
Critical Reflection: Does Entropy Control Really Matter?
After surveying these papers, a question worth considering: Are these entropy control methods necessary in industrial practice?
What Industry Actually Does
DeepSeek V3.2
DeepSeek V3.2’s core techniques are:
1. Off-policy sequence masking (mask samples with advantage < 0 and a high off-policy degree)
2. Keep Routing (MoE-specific)
3. Keep Sampling Mask
4. Unbiased KL Estimation
No explicit entropy control.
Qwen MiniRL / GRPO
Main focus:
- Data filtering (accuracy within certain range)
- Within-group advantage normalization
- Clipping
Also no explicit entropy control.
Entropy May Be Effect, Not Cause
These papers treat entropy as the core problem, but the actual causal chain might be:
Poor data quality / Training instability / Reward hacking
↓
Entropy collapse (symptom)
↓
Performance stagnation
Industry may solve upstream problems directly, and entropy naturally stabilizes.
DeepSeek V3.2’s masking logic:
```python
# DeepSeek V3.2-style masking rule (pseudocode for the description above)
if advantage < 0 and off_policy_degree > threshold:
    mask_this_sample()
```
This rule directly addresses two problems:
- Harmful gradients from off-policy samples
- Over-penalization of negative samples
Research Focus Selection Bias
Entropy as a research subject has practical advantages:
- Clear mathematical definition
- Easy to measure and track
- Convenient for theoretical analysis
This may lead research to focus on entropy itself rather than more fundamental issues.
Re-evaluation: Which Findings Are Actually Valuable?
| Finding | Value | Reason |
|---|---|---|
| R = -ae^H + b | ⭐⭐⭐ | Diagnostic tool, can predict training ceiling |
| Positive samples cause entropy collapse | ⭐⭐⭐ | Explains why down-weighting positive samples works |
| 20% high-entropy tokens are forks | ⭐⭐ | Can reduce computation, but industry may not care |
| Exact on-policy is more stable | ⭐⭐⭐ | Engineering guidance, but sacrifices sample efficiency |
| Various entropy control methods | ⭐ | May be over-engineering |
Experimental Setting Limitations
These papers’ experimental settings:
- Mostly based on Qwen2.5 + AIME/MATH
- Relatively small training scale (thousands to tens of thousands of steps)
- Lack comparison with industrial-scale systems
Yet DeepSeek V3.2 achieved good results with simple masking strategies, suggesting:
With sufficient data quality and proper training setup, explicit entropy control may not be the primary concern.
Summary
- Entropy is a useful monitoring metric, similar to loss curves, but may not need to be an optimization target
- Explicit entropy control may only be necessary in specific scenarios: limited data, smaller models, extended training
- Industry focuses more on upstream problems: data quality, training stability, reward design
- These papers’ primary value lies in theoretical understanding, helping explain the underlying mechanisms
When Should You Care About Entropy?
| Scenario | Need Explicit Entropy Control? |
|---|---|
| Abundant high-quality data + short training | ❌ Probably not |
| Limited data + need long training | ✓ Probably yes |
| Smaller model + prone to overfitting | ✓ Probably yes |
| Already using good masking/filtering | ❌ Entropy will naturally stabilize |
Open Questions
- Entropy control vs data/training optimization: should we prioritize solving upstream problems rather than directly controlling entropy?
- Can we find an optimization objective that does not cause entropy decrease? All current methods are mitigations, not cures.
- High entropy vs exploration efficiency: high entropy helps exploration, but exploration efficiency may drop (more steps are needed to see an effect).
- Cross-domain generalization: most conclusions are based on Qwen2.5 + math/code. Do they apply to other models and domains?
- Why doesn't industry use these methods? Is it because they don't work, or because there are simpler alternatives?
References
Chronologically Ordered
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale (2025.03, ByteDance)
- VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks (2025.04, ByteDance)
- SEED-GRPO: Semantic Entropy Enhanced GRPO (2025.05)
- Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation (2025.05, CASIA/StepFun)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (2025.05, Shanghai AI Lab)
- Skywork Open Reasoner 1 Technical Report (2025.05, Skywork AI)
- On-Policy RL with Optimal Reward Baseline (2025.05, Microsoft Research)
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries (2025.05)
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective RL (2025.06, NeurIPS 2025, Qwen/Alibaba)
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2025.06, NeurIPS 2025)
- Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening (2025.06, EMNLP 2025, CMU)
- LUFFY: Learning to reason Under oFF-policY guidance (2025.06)
- Rethinking Entropy Interventions in RLVR (2025.10, Zhejiang University)
- Revisiting Entropy in Reinforcement Learning for Large Reasoning Models (2025.11)
- EntroPIC: Entropy Stabilization with Proportional-Integral Control (2025.11, Tencent AI Lab)
- SENT: Semantic and Token Entropy for LLM Reasoning (2025.12)