LLM Notes

Notes on LLMs and reinforcement learning - in-depth analysis of Transformer, RLHF, PPO, DPO, and related techniques

Train Long, Think Short: A Survey on LLM Reasoning Length Control

2025-12-31 · Qi Lu

When training RLVR, even for simple problems, the model’s chain of thought often runs thousands or even tens of thousands of tokens, yet mainstream commercial models like ChatGPT and Claude keep their reasoning remarkably concise. What’s the difference?

With this question in mind, I surveyed research on reasoning length control and found quite a bit of work in this area, roughly falling into two categories: training-time optimization and inference-time control.


1. Background

1.1 The Overthinking Phenomenon

In RLVR (Reinforcement Learning with Verifiable Rewards) settings, reasoning models commonly overthink: even for simple problems, the chain of thought can run to thousands or tens of thousands of tokens, inflating compute cost and latency without improving accuracy.

1.2 Optimization Objective

Minimize reasoning tokens without sacrificing accuracy:

\[\min_\pi \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(\cdot\mid x)}[\text{len}(y)] \quad \text{s.t.} \quad \text{Acc}(\pi) \geq \text{Acc}(\pi_0)\]
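
Many of the training-time methods below relax this constraint into a penalized objective that trades task reward against length, e.g.

\[\max_\pi \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\big[r_{\text{task}}(y) - \lambda \cdot \text{len}(y)\big]\]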

Evaluation metrics include task accuracy and average reasoning length (token count), typically reported as an accuracy-length trade-off.


2. Training-Time Methods

2.1 Hard Truncation: ThinkPrune

Paper: ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning · Date: 2025-04 · Institution: UCSB · Code: GitHub

Approach: Set a token limit during training; reasoning that exceeds the limit is truncated and cannot yield a verifiable answer, so it receives zero reward. The limit is iteratively tightened to force the model to learn more concise reasoning.

Method:

  1. Set initial length limit $L_0$
  2. Samples exceeding limit cannot produce valid answers → reward = 0
  3. Iteratively tighten: $L_{t+1} = \alpha \cdot L_t$, where $\alpha < 1$
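
A minimal sketch of this truncation-based reward, assuming a binary verifier (the function names and α value are illustrative, not the authors' code):

```python
# ThinkPrune-style hard truncation: rollouts longer than the current limit are
# cut off, cannot contain a verifiable answer, and therefore earn zero reward.
def truncated_reward(response_tokens, limit, is_correct_fn):
    """Reward = 1 only if the answer is complete and correct within `limit` tokens."""
    if len(response_tokens) > limit:
        return 0.0                      # truncated -> no valid answer -> zero reward
    return 1.0 if is_correct_fn(response_tokens) else 0.0

def tighten_limit(limit, alpha=0.75):
    """Shrink the length limit between training rounds (alpha < 1)."""
    return int(alpha * limit)
```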

Results:

Pros: No complex reward engineering needed.
Risk: Over-tight limits may truncate correct solutions.


2.2 Length Reward: GRPO-LEAD

Paper: GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning · Date: 2025-04 · Code: GitHub

LEAD = Length-dependent rewards + Explicit penalties + Advantage reweighting for Difficulty

This method includes three modifications:

  1. Length-dependent accuracy reward: Rank correct samples by length, encouraging shorter correct solutions

  2. Explicit error penalties: Additional negative constraints for incorrect samples

  3. Difficulty-aware advantage reweighting: Weight by empirical solve rate, amplifying learning signal for harder problems

Notably, length ranking only applies within correct samples; errors are handled separately with penalties.
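
A hedged sketch of how these three ingredients could be combined inside a GRPO group (the exact scaling and coefficients below are illustrative assumptions, not the paper's formulas):

```python
import numpy as np

def lead_rewards(correct, lengths, wrong_penalty=1.0):
    """Per-sample rewards for one GRPO group: rank correct samples by length,
    penalize wrong ones explicitly."""
    correct = np.asarray(correct, dtype=bool)
    lengths = np.asarray(lengths, dtype=float)
    r = np.where(correct, 1.0, -wrong_penalty)        # explicit penalty for errors
    if correct.any():
        order = np.argsort(lengths[correct])           # shortest correct sample first
        bonus = np.linspace(0.5, 0.0, num=order.size)  # larger bonus for shorter solutions
        r_correct = r[correct]
        r_correct[order] += bonus
        r[correct] = r_correct
    return r

def difficulty_weight(solve_rate, eps=1e-3):
    """Advantage reweighting: amplify the learning signal on hard prompts
    (low empirical solve rate)."""
    return 1.0 / (solve_rate + eps)
```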

Results: 14B model achieves SOTA, significantly improving accuracy, conciseness, and efficiency.


2.3 Step Reward Shaping: LASER

Paper: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping · Date: 2025-05 · Code: GitHub

This work proposes a unified framework formalizing efficient reasoning methods as length-based reward shaping. Based on this framework, the authors introduce LASER (Length-bAsed StEp Reward shaping) using step functions:

\[r_{\text{shaped}}(y) = r_{\text{task}}(y) + f(\text{len}(y))\]

LASER-D (Dynamic and Difficulty-aware) extension:

  1. Reward schedule should adapt as model behavior evolves during training
  2. Length rewards should be difficulty-aware—penalize long CoT more on easy problems
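
A minimal sketch of the step-function shaping $f(\text{len}(y))$ above, assuming a binary task reward; the target length and bonus are illustrative, and a difficulty-aware variant would shrink `l_target` on easy prompts:

```python
def laser_shaped_reward(is_correct, length, l_target=4096, bonus=0.5):
    """Step-function length reward: only correct answers within the budget earn the bonus."""
    r_task = 1.0 if is_correct else 0.0
    f_len = bonus if (is_correct and length <= l_target) else 0.0
    return r_task + f_len
```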

Results: LASER-D improves AIME2024 by +6.1 points while reducing token usage by 63%.


2.4 Adaptive Constraint: LEASH

Paper: Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model · Date: 2025-12

LEASH formulates length control as constrained optimization, using Lagrangian Primal-Dual to dynamically adjust penalty coefficients:

\[\max_\pi \mathbb{E}[r_{\text{task}}] \quad \text{s.t.} \quad \mathbb{E}[\text{len}(y)] \leq L_{\text{target}}\]

Dynamic adjustment:

One-sided penalty: Only penalize “too long”, avoiding incentives to become infinitely short.
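
A sketch of the primal-dual mechanics under these assumptions (standard dual ascent; the step size is illustrative, not the paper's setting):

```python
def shaped_reward(r_task, length, lam, l_target):
    """One-sided Lagrangian penalty: only lengths above the target are penalized."""
    return r_task - lam * max(0.0, length - l_target)

def update_dual(lam, mean_length, l_target, lr=1e-4):
    """Dual ascent: raise lambda while the batch still exceeds the budget,
    relax it (never below zero) once the constraint is satisfied."""
    return max(0.0, lam + lr * (mean_length - l_target))
```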

Results: On DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-4B-Thinking-2507, average reasoning length is reduced by 60% across tasks (including in-distribution math and OOD code/instruction-following) while maintaining competitive performance.


2.5 Curriculum Learning: Train Long, Think Short

Paper: Train Long, Think Short: Curriculum Learning for Efficient Reasoning · Date: 2025-08 · Code: GitHub

Uses a curriculum approach—first let the model “learn to solve”, then gradually compress budget:

  1. Phase 1: Generous token budget for exploring effective solution strategies
  2. Phase 2: Gradually tighten budget, encouraging distillation into concise reasoning chains
  3. Combined training signal: Correctness (verifier feedback) + length efficiency + format adherence
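
As an illustration of phase 2, a budget schedule might decay geometrically from the exploration budget to the final budget; the functional form and numbers here are assumptions, not the paper's exact schedule:

```python
def budget_at_step(step, total_steps, b_start=8192, b_final=1024):
    """Token budget that shrinks smoothly from b_start to b_final over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(b_start * (b_final / b_start) ** frac)   # geometric interpolation
```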

Results: On GSM8K, MATH500, SVAMP, College Math, GSM+, curriculum training consistently outperforms fixed-budget baselines at the same final budget.


2.6 Prompt-Controllable: L1 / LCPO

Paper: L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning · Date: 2025-03 · Homepage: CMU L3 Lab

LCPO (Length Controlled Policy Optimization) embeds a target length in the prompt, and the RL objective adds a penalty on the deviation between the generated length and that target, making the reasoning budget controllable at inference time.
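
A hedged sketch of such a reward (it mirrors the deviation-penalty idea; the coefficient is an illustrative assumption):

```python
def lcpo_reward(is_correct, length, target_length, alpha=3e-4):
    """Correctness reward minus a penalty on the gap between the generated
    length and the budget requested in the prompt."""
    r_acc = 1.0 if is_correct else 0.0
    return r_acc - alpha * abs(length - target_length)
```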

Results:


2.7 Difficulty-Adaptive Length Penalty: Just Enough Thinking

Paper: Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning · Date: 2025-06

LRMs often “overthink” simple problems—for instance, DeepSeek-R1 and QwQ-32B generate over 10,000 tokens for “2+3=?”.

This work proposes Adaptive Length Penalty (ALP), which adjusts the length penalty based on each prompt’s online solve rate.

Simply put: save tokens on easy problems and reallocate the budget to hard problems.
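
A minimal sketch of this idea, assuming the solve rate is estimated online from recent rollouts (the coefficient is illustrative):

```python
def alp_reward(is_correct, length, solve_rate, beta=1e-4):
    """Length penalty scaled by the prompt's online solve rate: easy prompts
    (high solve rate) are penalized more for long chains, hard prompts keep
    their budget."""
    r_acc = 1.0 if is_correct else 0.0
    return r_acc - beta * solve_rate * length
```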

Results:


2.8 Long2Short: Kimi k1.5

Paper: Kimi k1.5: Scaling Reinforcement Learning with LLMs · Date: 2025-01 · Institution: Moonshot AI · Code: GitHub

Long CoT reasoning achieves high accuracy but incurs heavy compute costs. Kimi k1.5 introduces Long2Short techniques to compress long CoT strategies into efficient short CoT representations.

Three Long2Short methods:

| Method | Description |
| --- | --- |
| Model Merging | Weight averaging of long-CoT and short-CoT models |
| Shortest Rejection Sampling | Select the shortest correct response for SFT |
| Preference-based RL | Train the model to prefer brevity while maintaining correctness |
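
As an illustration of shortest rejection sampling, one might filter sampled responses like this (function names are hypothetical):

```python
def shortest_correct(samples, is_correct_fn):
    """samples: list of (text, token_count); returns the shortest correct
    response to use as an SFT target, or None if none are correct."""
    correct = [s for s in samples if is_correct_fn(s[0])]
    return min(correct, key=lambda s: s[1]) if correct else None
```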

Results (Short CoT SOTA):


2.9 Length-Harmonizing Fine-Tuning: O1-Pruner

Paper: O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning · Date: 2025-01 · Code: GitHub

O1-like long-thinking models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancy. O1-Pruner proposes Length-Harmonizing Fine-Tuning to address this:

  1. Pre-sampling: Estimate model’s baseline performance across problems
  2. RL-style Fine-tuning: Encourage shorter reasoning under accuracy constraints
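
One way to instantiate this reward, using the pre-sampled baseline length and accuracy of the reference model per problem (the functional form and coefficient are assumptions, not the paper's exact objective):

```python
def length_harmonizing_reward(length, is_correct, ref_length, ref_acc, lam=2.0):
    """Reward shorter-than-baseline reasoning, subject to an accuracy term."""
    length_gain = ref_length / max(length, 1) - 1.0        # positive if shorter than baseline
    acc_gain = (1.0 if is_correct else 0.0) - ref_acc      # accuracy-constraint term
    return length_gain + lam * acc_gain
```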

Results:


2.10 Conciseness-Guided RL: ConciseRL

Paper: ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models · Date: 2025-05

Reasoning traces often extend beyond reaching correct answers, causing wasted computation, reduced readability, and even hallucinations. ConciseRL introduces a hyperparameter-free conciseness score as RL reward signal:

Results:


3. Inference-Time Methods

3.1 Answer Convergence

Paper: Answer Convergence as a Signal for Early Stopping in Reasoning · Date: 2025-06

An interesting finding: on MATH and similar tasks, models typically converge to their final answer after about 60% of the reasoning steps; the remaining content is mostly redundant.

Based on this observation, the authors propose three inference-time strategies:

  1. Answer Consistency early stopping: Stop when consecutive reasoning chunks produce the same answer
  2. Think Token Adjustment: Increase the probability of generating the end-of-thinking signal
  3. Learn-to-Stop: Train a classifier on internal activations to predict when to stop
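
A sketch of the first strategy: probe an intermediate answer after each reasoning chunk and stop once several consecutive probes agree (`probe_answers` and `patience` are illustrative names/values):

```python
def should_stop(probe_answers, patience=2):
    """probe_answers: answers extracted after each reasoning chunk so far.
    Stop once the last `patience + 1` probes are identical."""
    if len(probe_answers) <= patience:
        return False
    last = probe_answers[-1]
    return all(a == last for a in probe_answers[-(patience + 1):])
```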

Results:


3.2 Step Answer Monitoring: ES-CoT

Paper: Early Stopping Chain-of-thoughts in Large Language Models · Date: 2025-09

A few key concepts:

The idea is straightforward: “stop thinking when the answer stabilizes”—no extra model or retraining needed.

Results: Across 5 reasoning datasets and 3 LLMs, ES-CoT reduces generated tokens by 41% on average while maintaining accuracy comparable to original CoT.


3.3 Transition Point Monitoring: DEER

Paper: Dynamic Early Exit in Reasoning Models · Date: 2025-04 · Code: GitHub

DEER’s observation: long CoT contains “pearl reasoning”, critical points at which the reasoning is already sufficient and everything afterwards is largely redundant.

The approach:

  1. Monitor Action Transition Points (ATP): Phrases like “Wait”, “Alternatively” indicating approach switches
  2. Induce trial answers at ATP
  3. Use confidence to decide early exit—incomplete reasoning yields low confidence; sufficient reasoning yields high confidence
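
A hedged sketch of the exit rule (the marker list, helper, and threshold are illustrative assumptions):

```python
ATP_MARKERS = ("Wait", "Alternatively", "Hmm")

def maybe_exit(last_phrase, induce_trial_answer, threshold=0.95):
    """At an action-transition point, induce a trial answer and exit early
    if the model is confident enough in it; otherwise keep reasoning."""
    if not last_phrase.startswith(ATP_MARKERS):
        return None
    answer, confidence = induce_trial_answer()   # e.g. mean token probability of the answer span
    return answer if confidence >= threshold else None
```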

Advantage: No extra training needed, seamlessly integrates with existing o1-like reasoning LLMs.

Results: Across 10 reasoning benchmarks (GSM8K, MATH-500, AMC, GPQA, AIME, LiveCodeBench) and 11 frontier reasoning LLMs:


3.4 Three-Stage Reasoning Theory: Stop Spinning Wheels

Paper: Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit · Date: 2025-08

This work divides the reasoning process into three stages:

  1. Insufficient Exploration Stage: Exploring problem space
  2. Compensatory Reasoning Stage: Where correct answer typically emerges
  3. Reasoning Convergence Stage: Often triggers overthinking

The key is finding the Reasoning Completion Point (RCP)—end of compensatory reasoning stage, typically at the first complete reasoning cycle.

Detection methods include:

Results: On AIME24, AIME25, GPQA-D, reduces token consumption while maintaining or improving reasoning accuracy.


3.5 Budget Forcing: s1

Paper: s1: Simple test-time scaling · Date: 2025-01 · Code: GitHub

s1’s approach is straightforward: enforce a maximum thinking budget by forcing the end-of-thinking delimiter once the budget is reached, and enforce a minimum by suppressing that delimiter and appending “Wait” so the model keeps reasoning.
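
A minimal sketch of this per-token decision rule, assuming a generic decoding loop (the delimiter string and budget handling are illustrative, not s1's exact implementation):

```python
def budget_force(next_token, n_generated, min_tokens, max_tokens,
                 end_think="</think>", extend="Wait"):
    """Decide what to emit given the model's proposed next token and the thinking budget."""
    if n_generated >= max_tokens:
        return end_think   # budget exhausted: force the end-of-thinking delimiter
    if next_token == end_think and n_generated < min_tokens:
        return extend      # thinking too short: suppress the delimiter and append "Wait"
    return next_token
```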

Results:


3.6 Suppressing Reflection Tokens: NoWait

Paper: Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency · Venue: EMNLP 2025 · arXiv: 2506.08343

Budget forcing isn’t always effective across reasoning models. This work observes that explicit self-reflection (“Wait”, “Hmm”, “Alternatively”) may not be necessary.

The method is simple: logit suppression on specific reflection/hesitation tokens at inference:

  1. Identify key reflection words (via 32 independent runs, selecting most frequent single-word tokens)
  2. Suppress these tokens during inference
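
A minimal sketch of this suppression as a Hugging Face logits processor (the banned-token list would come from step 1; the class is an illustration, not the authors' code):

```python
from transformers import LogitsProcessor

class SuppressReflectionTokens(LogitsProcessor):
    """Set the logits of reflection/hesitation tokens to -inf so they are never sampled."""
    def __init__(self, banned_token_ids):
        self.banned = list(banned_token_ids)

    def __call__(self, input_ids, scores):
        scores[:, self.banned] = -float("inf")
        return scores
```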

Results: Across 5 R1-style model families (QwQ, Phi4, Qwen3, Kimi-VL, QvQ):


3.7 Dynamic Budget: ABF

Paper: Reasoning at the Right Length: Adaptive Budget Forcing for Efficient and Accurate LLM Inference · Date: 2025-09

Adaptive Budget Forcing (ABF) dynamically adjusts reasoning length by monitoring real-time certainty signals (token-level confidence, entropy, semantic consistency):

Difference from traditional Budget Forcing: Traditional methods use fixed length constraints or predetermined control tokens; ABF monitors the model’s “thinking trajectory” in real-time for adaptive stopping decisions.
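
As an illustration of one such certainty signal, a simple entropy-based check might look like this (window and threshold are assumptions, not the paper's settings):

```python
import math

def low_entropy(recent_token_dists, threshold=0.5):
    """recent_token_dists: list of probability distributions for recent tokens.
    Returns True when the average entropy has dropped below the threshold,
    i.e. the model has become confident and further thinking can be cut short."""
    entropies = [-sum(p * math.log(p + 1e-12) for p in dist) for dist in recent_token_dists]
    return sum(entropies) / len(entropies) < threshold
```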


4. Method Taxonomy

Training-Time Methods

| Category | Core Idea | Representative Works |
| --- | --- | --- |
| Reward Shaping | Add length penalty to RL reward, encouraging shorter correct reasoning | ThinkPrune, GRPO-LEAD, LASER, LEASH, Just Enough Thinking, ConciseRL |
| Curriculum/Distillation | First let the model learn to solve, then gradually compress reasoning or distill long CoT into short CoT | Train Long Think Short, Kimi k1.5, O1-Pruner |
| Prompt-Controllable | Train the model to control reasoning length based on budget instructions in the prompt | L1/LCPO |

Inference-Time Methods

| Category | Core Idea | Representative Works |
| --- | --- | --- |
| Early Stop Detection | Monitor answer convergence, confidence, or reasoning completion signals for early termination | Answer Convergence, ES-CoT, DEER, Stop Spinning Wheels |
| Token Intervention | Control generation length via forced budgets, reflection word suppression, or dynamic thresholds | s1, NoWait, ABF |

5. Open Problems

  1. Accuracy-Efficiency Trade-off: How to ensure compression doesn’t hurt correctness?
  2. Difficulty Awareness: Compress more on easy problems, preserve long thinking on hard ones
  3. Generalization: Can training-time methods generalize to OOD tasks?
  4. Inference vs Training: Can both approaches be effectively combined?


