LLM Notes

Notes on LLMs and Reinforcement Learning: deep dives into Transformer, RLHF, PPO, DPO, and related techniques

Speculative Decoding: A Complete Guide to Principles, Methods, and Speedup Analysis

2026-01-05 · Qi Lu

Speculative Decoding is one of the most important techniques in LLM inference acceleration. Through a “draft-then-verify” paradigm, it achieves 2-3x speedup without changing the output distribution.

This post provides a comprehensive introduction to Speculative Decoding, covering its principles, implementation methods, draft model acquisition approaches, and an in-depth analysis of why it accelerates inference.


1. Motivation: The Autoregressive Decoding Bottleneck

1.1 Why is LLM Inference Slow?

LLMs generate text autoregressively: each token depends on all previous tokens, requiring sequential generation.

Two decoding modes:

| Mode | Formula | Characteristics |
|---|---|---|
| Greedy | $y_t = \arg\max_v P(v \mid y_{<t}, x)$ | Deterministic, picks the highest-probability token |
| Sampling | $y_t \sim P(\cdot \mid y_{<t}, x)$ | Stochastic, samples from the distribution |

Regardless of mode, generating K tokens requires K sequential forward passes.
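To make the sequential dependency concrete, here is a minimal sketch of both modes in PyTorch, assuming only a hypothetical causal-LM callable that maps token ids of shape [batch, seq] to logits of shape [batch, seq, vocab] (not any specific library's API):

import torch

@torch.no_grad()
def decode(model, input_ids, k, greedy=True, temperature=1.0):
    """Generate k tokens one at a time; each new token requires a full forward pass."""
    ids = input_ids
    for _ in range(k):                                         # k sequential forward passes
        logits = model(ids)[:, -1, :]                          # only the last position matters
        if greedy:
            next_id = logits.argmax(dim=-1, keepdim=True)      # y_t = argmax_v P(v | y_<t, x)
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # y_t ~ P(. | y_<t, x)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids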

1.2 The Memory-Bound Problem

Modern GPUs have abundant compute power, but LLM inference is memory-bound rather than compute-bound:

| Bottleneck | Description |
|---|---|
| Weight Reading | Each forward pass reads nearly all model weights from HBM |
| KV Cache | Must read the KV Cache for all historical tokens |
| Low Parallelism | Each step generates only 1 token, low GPU utilization |

Note: Weights reside in GPU memory (HBM), but each forward pass must read weights from HBM into compute units—this read bandwidth is the bottleneck.

Core contradiction: GPUs have massive compute, but each step only computes one token—most time is spent waiting for memory reads.

1.3 Key Insight

“Hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models.”

Many tokens generated by large models are “easy” (common words, grammatical structures)—small models can predict them correctly. Only a few “hard” tokens truly require the large model’s capability.


2. Core Mechanism

2.1 The Draft-Then-Verify Paradigm

The core idea of Speculative Decoding:

  1. Draft: Use a fast Draft Model to serially generate $\gamma$ candidate tokens
  2. Verify: Use the Target Model to verify these tokens in parallel
  3. Accept/Reject: Use rejection sampling to decide which tokens to accept
Draft Model:  [x] → t1 → t2 → t3 → t4 → t5  (γ=5, serial)
                ↓    ↓    ↓    ↓    ↓
Target Model: [x, t1, t2, t3, t4, t5]        (parallel verify)
                ↓    ↓    ↓    ↓    ↓
Result:       [✓]  [✓]  [✓]  [✗]  [—]       (accept 3 + resample 1)

2.2 The Rejection Sampling Algorithm

Let Draft Model distribution be $q(x)$, Target Model distribution be $p(x)$:

Acceptance probability:

\[P(\text{accept}) = \min\left(1, \frac{p(x)}{q(x)}\right)\]

Two cases:

| Case | Condition | Action |
|---|---|---|
| Draft is conservative | $q(x) \leq p(x)$ | 100% accept |
| Draft is overconfident | $q(x) > p(x)$ | Accept with probability $p(x)/q(x)$ |

Resampling on rejection:

\[x \sim \text{norm}\left(\max(0, p(x) - q(x))\right)\]
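A minimal sketch of one verification round, assuming the parallel verify pass has already produced the Target and Draft probability rows for each drafted position (shapes are illustrative):

import torch

def verify_draft(p, q, draft_tokens):
    """p, q: Target / Draft probabilities, shape [gamma, vocab];
    draft_tokens: the gamma tokens the Draft actually sampled, shape [gamma].
    Returns the accepted tokens, plus one corrective token on rejection."""
    out = []
    for i, x in enumerate(draft_tokens.tolist()):
        # Accept x with probability min(1, p(x) / q(x))
        if torch.rand(()) < torch.clamp(p[i, x] / q[i, x], max=1.0):
            out.append(x)
        else:
            # Rejected: resample from norm(max(0, p - q)) at this position and stop
            residual = torch.clamp(p[i] - q[i], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out
    # All gamma drafts accepted; the same verify pass also yields the Target
    # distribution at position gamma + 1, so one extra "bonus" token can be
    # sampled there (omitted here for brevity).
    return out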

2.3 Output Invariance Guarantee

Speculative Decoding guarantees output invariance in both decoding modes:

Greedy mode: a draft token is accepted only if it matches the Target Model's own argmax at that position, so the final sequence is token-for-token identical to greedy decoding with the Target Model alone.

Sampling mode: the accept/reject rule plus residual resampling guarantees that every emitted token is distributed exactly according to $p$, the Target Model's distribution (see the derivation below).

This means: Speculative Decoding is lossless. It changes only latency, never the output (greedy) or the output distribution (sampling).
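The sampling-mode guarantee follows from a one-line marginalization over "accepted" vs. "rejected then resampled", using $\sum_{x'} \max(0, p(x') - q(x')) = 1 - \sum_{x'} \min(p(x'), q(x'))$:

\[
\begin{aligned}
P(\text{output} = x)
&= q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
 + \left(1 - \sum_{x'} \min\big(p(x'), q(x')\big)\right)
   \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)} \\
&= \min\big(p(x), q(x)\big) + \max\big(0,\, p(x) - q(x)\big) \\
&= p(x).
\end{aligned}
\]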


3. The Essence of Speedup

3.1 Memory-Bound: The Fundamental Bottleneck

To understand why Speculative Decoding works, we must first understand why LLM inference is slow.

What determines inference time?

\[T_{inference} = \max(T_{compute}, T_{memory})\]

The autoregressive decoding problem:

Each token generation requires:

  1. Reading nearly all model weights from HBM
  2. Reading KV Cache
  3. Executing matrix multiplications

Back-of-envelope estimate (order of magnitude only):

Assumptions: 70B model, FP16, A100 80GB (2 TB/s bandwidth, 312 TFLOPS)
Ignoring: KV Cache, activations, communication, kernel scheduling

Single token generation:
├── HBM read: ~140GB weights → O(100ms) order
└── Computation: ~140 GFLOPs → O(1ms) order

Bandwidth vs Compute: Two orders of magnitude difference!

⚠️ Actual latency depends on precision (FP16/INT8/INT4), parallelism (TP/PP), sequence length, kernel fusion, etc. Above is only to illustrate order-of-magnitude difference.

Arithmetic Intensity:

\[\text{AI} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}\]
| Scenario | Arithmetic Intensity | Bottleneck |
|---|---|---|
| Autoregressive (batch=1) | ~1 FLOP/Byte | Bandwidth |
| Batched (batch=128) | ~128 FLOP/Byte | Compute |
| A100 balance point | ~156 FLOP/Byte | - |

Key insight: Autoregressive decoding has arithmetic intensity far below GPU’s balance point—most compute is wasted waiting for HBM reads.
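The same back-of-envelope numbers can be checked in a few lines (all constants are the illustrative assumptions from above, not measured values):

# Roofline-style estimate for the 70B / A100 example above
weights_bytes = 140e9      # 70B params, FP16
bandwidth     = 2e12       # HBM bandwidth, ~2 TB/s
peak_flops    = 312e12     # FP16 tensor-core peak

for n_tokens in (1, 5, 128):
    flops = 2 * 70e9 * n_tokens          # ~2 FLOPs per parameter per token
    t_mem = weights_bytes / bandwidth    # weights are read once regardless of n_tokens
    t_cmp = flops / peak_flops
    ai = flops / weights_bytes           # arithmetic intensity, FLOP/Byte
    print(f"{n_tokens:>3} tokens/pass: mem {t_mem*1e3:.0f} ms, compute {t_cmp*1e3:.1f} ms, AI ~{ai:.0f}")
# Balance point: 312e12 / 2e12 = 156 FLOP/Byte, matching the table above.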

3.2 The Essence of Speculative Decoding

Core idea: Use one HBM read to compute multiple tokens

Traditional autoregressive (5 tokens):
├── Step 1: Read weights(140GB) + compute(t1) → 70ms
├── Step 2: Read weights(140GB) + compute(t2) → 70ms
├── Step 3: Read weights(140GB) + compute(t3) → 70ms
├── Step 4: Read weights(140GB) + compute(t4) → 70ms
└── Step 5: Read weights(140GB) + compute(t5) → 70ms
Total: 5 × 70ms = 350ms, HBM reads 700GB

Speculative Decoding (verify 5 tokens):
├── Draft: 5 × 7ms = 35ms (small model, negligible)
└── Verify: Read weights(140GB) + compute(t1,t2,t3,t4,t5) → ~75ms
Total: ~110ms, HBM reads ~210GB (140GB Target + 5 × 14GB Draft, with Draft at 10% of Target size)

Why does verifying 5 tokens take barely longer than 1?

| Operation | 1 token | 5 tokens | Growth |
|---|---|---|---|
| Read weights | 140GB | 140GB | 1x (unchanged!) |
| KV Cache read | K bytes | K bytes | ~1x |
| Computation | 140 GFLOPs | 700 GFLOPs | 5x |
| Total time | ~70ms | ~75ms | ~1.07x |

The essence: the Target Model's weights are read from HBM once per round but reused to score every drafted position, so each extra verified token costs almost pure compute, and compute is nearly free in a memory-bound regime.

3.3 Speedup Analysis

Theoretical speedup:

Assume the Target Model's forward pass takes $T_t$, the Draft Model's takes $T_d$, each round drafts $\gamma$ tokens, and on average $k$ tokens are emitted per round (accepted drafts plus the final corrective/bonus token). Producing those $k$ tokens autoregressively would cost $k \cdot T_t$, while one speculative round costs $\gamma \cdot T_d + T_t$:

\[\text{Speedup} = \frac{k \cdot T_t}{\gamma \cdot T_d + T_t} \approx \frac{k \cdot T_t}{T_t} = k\]

When $T_d \ll T_t$, speedup approximately equals the average number of accepted tokens.

Expected accepted tokens (assuming independent acceptance probability $\alpha$ per position):

\[\mathbb{E}[\text{accepted}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]
| Acceptance Rate $\alpha$ | Expected tokens at $\gamma=5$ | Actual Speedup |
|---|---|---|
| 0.5 | 1.97 | ~2x |
| 0.7 | 2.94 | ~3x |
| 0.9 | 4.69 | ~4x |
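The table assumes $T_d \approx 0$; the small helper below also lets you include a nonzero Draft cost $c = T_d / T_t$ (the value 0.1 is an illustrative assumption):

def expected_tokens(alpha: float, gamma: int) -> float:
    """E[tokens per round] = (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float = 0.1) -> float:
    """Speedup per round when one Draft pass costs c Target passes."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens(alpha, 5), 2), round(speedup(alpha, 5), 2))
# 0.5 1.97 1.31
# 0.7 2.94 1.96
# 0.9 4.69 3.12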

3.4 When Does It Work Well?

Deriving from the memory-bound nature:

Works well when:

| Condition | Reason |
|---|---|
| Target Model large enough (≥30B) | More memory-bound, more "free" compute |
| Draft-Target distributions similar | High acceptance rate, more tokens confirmed per round |
| Predictable output | Code, formatted text, and translation have high acceptance |

Works poorly when:

| Condition | Reason |
|---|---|
| Target Model small (<7B) | Closer to compute-bound, multi-token verification has overhead |
| Large Draft-Target gap | Low acceptance rate, frequent resampling |
| Creative tasks | Unpredictable output, low acceptance rate |

Intuition: Speculative Decoding trades “multiple HBM reads by small model” for “one HBM read by Target.” If Target isn’t large/memory-bound enough, the trade isn’t worth it.


4. Draft Model Approaches

4.1 Independent Small Models

The most direct approach: use a smaller model from the same family as Draft.

| Target Model | Draft Model | Param Ratio | Source |
|---|---|---|---|
| Llama-70B | Llama-7B | 10:1 | Same family |
| Chinchilla-70B | Chinchilla-1B | 70:1 | DeepMind original |
| T5-XXL (11B) | T5-small (60M) | 183:1 | Google original |

Selection principles: the Draft must share the Target's tokenizer and vocabulary, should ideally come from the same model family (similar training data and output style), and is typically 10-100x smaller so that $T_d \ll T_t$.

Pros: no additional training needed, ready to use.
Cons: the distribution gap with the Target may be large, which limits the acceptance rate.
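With an off-the-shelf pair like this, assisted generation in HuggingFace Transformers is essentially a one-argument change (model names follow the table above and require gated access; argument behavior may vary slightly across versions):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok    = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", device_map="auto")
draft  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

inputs = tok("Speculative decoding works because", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))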


4.2 Knowledge Distillation

Use knowledge distillation to better align Draft Model with Target Model’s output distribution.

4.2.1 DistillSpec (ICLR 2024)

Paper: DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Core problem: Off-the-shelf small models have large distribution gaps with Target, leading to low acceptance rates.

Two key design choices:

  1. On-Policy Data Generation:
    • Use data generated by Draft Model itself for training
    • Rather than using fixed datasets
    • Reason: Draft needs to align on tokens it might actually generate
  2. Task-Specific Divergence Functions:
    • Different tasks/decoding strategies use different KL divergence variants
    • Greedy decoding: Forward KL
    • Sampling: Reverse KL or JSD

Training pipeline:

1. Draft Model generates candidate sequences
2. Target Model computes probability distributions at these positions
3. Minimize divergence between Draft and Target distributions
4. Repeat until convergence
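A heavily simplified sketch of one on-policy distillation step under the forward-KL choice, assuming HF-style draft/target models with .generate and .logits (an illustration, not the paper's code):

import torch
import torch.nn.functional as F

def distill_step(draft, target, prompt_ids, optimizer, max_new_tokens=64):
    # 1. On-policy data: the Draft generates its own continuations
    with torch.no_grad():
        seqs = draft.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)
    # 2. Both models score the same positions
    draft_logits = draft(seqs).logits[:, :-1]
    with torch.no_grad():
        target_logits = target(seqs).logits[:, :-1]
    # 3. Forward KL(target || draft), the variant suited to greedy decoding
    loss = F.kl_div(
        F.log_softmax(draft_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()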

Results: distilled drafts reach consistently higher acceptance rates than off-the-shelf small models, improving end-to-end speedup by roughly 10-45% (see the summary table in Section 6.2).

4.2.2 AdaSPEC (2025)

Core improvement: Selective token filtering

Observation: Some tokens are inherently hard to predict (proper nouns, rare words). Forcing alignment on these can hurt prediction of easy tokens.

Method:

  1. Use reference model to identify “hard” tokens
  2. Filter out these tokens during distillation
  3. Let Draft focus on aligning “easy” tokens

Results: Acceptance rate improves up to 15% over DistillSpec


4.3 Self-Speculative Decoding

No separate Draft Model—derive Draft from Target Model itself, “drafting for yourself.”

4.3.1 LayerSkip (ACL 2024)

Paper: LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Code: GitHub

Core idea: Skip later layers, use output from first E layers to directly predict tokens.

Three-stage approach:

Stage 1: Layer Dropout during Training

# Different dropout rates for different layers during training
# Shallow layers: low dropout (maintain stability)
# Deep layers: high dropout (enhance early exit capability)
for layer_idx, layer in enumerate(layers):
    dropout_rate = layer_idx / num_layers * max_dropout
    x = layer(x, dropout=dropout_rate)

Stage 2: Early Exit Loss

Stage 3: Self-Speculative Decoding

Self-Draft:   First E layers → LM Head → draft tokens
Self-Verify:  Remaining layers verify + full forward pass
Key optimization: Reuse KV Cache from draft stage during verification
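To see what "exit at layer E" means concretely, the early-exit distribution of a Llama-style HF checkpoint can be inspected via output_hidden_states (this still runs the full forward pass, so it only illustrates the draft distribution; the real implementation stops at layer E and reuses its KV Cache):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/layerskip-llama2-7B"      # checkpoint referenced below
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

E = 8
ids = tok("def fibonacci(n):", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)
    h_E = out.hidden_states[E]                            # hidden state after the first E layers
    draft_logits = model.lm_head(model.model.norm(h_E))   # shared LM head -> draft distribution
    full_logits = out.logits                              # full-depth (verify) distribution
print((draft_logits[0, -1].argmax() == full_logits[0, -1].argmax()).item())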

Usage example:

# Exit at layer 8 for drafting; generate 6 draft tokens per round
torchrun generate.py --model facebook/layerskip-llama2-7B \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6

Results:

Advantages: no separate Draft Model to train or deploy, draft and verify share the same weights and KV Cache, and the extra memory footprint is essentially zero.

Integration status: Integrated into HuggingFace Transformers and PyTorch TorchTune.


4.4 Additional Heads

Add lightweight prediction heads to Target Model without modifying the original model.

4.4.1 Medusa (ICML 2024)

Paper: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Code: GitHub

Core idea: Add multiple “Medusa Heads,” each predicting tokens at different future positions.

Architecture:

                    ┌─→ Head 1 → predict t+1
Hidden State (t) ───┼─→ Head 2 → predict t+2
    from LLM        ├─→ Head 3 → predict t+3
                    └─→ Head 4 → predict t+4
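A structural sketch of the heads (each head is a residual SiLU block on the last hidden state followed by a vocabulary projection; dimensions and initialization details are simplified):

import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h):                      # h: [batch, hidden] from the base LLM
        return self.lm(h + self.act(self.proj(h)))

class MedusaHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(MedusaHead(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, h):
        # Head i predicts the token at position t+i+1 from the hidden state at position t
        return [head(h) for head in self.heads]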

Tree Attention Mechanism:

Since each head may have multiple candidates (top-k), combinations form a candidate tree:

Assume Head 1 takes top-2, Head 2 takes top-3:
Candidate tree has 2 × 3 = 6 paths

        t1
       / \
      t1a t1b
     /|\  /|\
    ...  ...

Tree Attention Implementation:
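One common way to implement it: flatten the candidate tree into a single sequence and build a mask in which each node attends only to its ancestors (plus the committed prefix), so every path is verified in one forward pass. A minimal mask-building sketch over explicit paths:

import torch

def tree_attention_mask(paths):
    """paths: candidate paths as tuples of node ids from root to leaf,
    e.g. [(0,), (0, 1), (0, 2), (3, 4)].
    Returns (nodes, mask) with mask[i, j] = True iff node j is an ancestor
    of (or equal to) node i."""
    nodes, index = [], {}
    for path in paths:
        for d in range(1, len(path) + 1):
            prefix = path[:d]
            if prefix not in index:
                index[prefix] = len(nodes)
                nodes.append(prefix)
    mask = torch.zeros(len(nodes), len(nodes), dtype=torch.bool)
    for prefix, i in index.items():
        for d in range(1, len(prefix) + 1):
            mask[i, index[prefix[:d]]] = True   # attend to every ancestor, including self
    return nodes, mask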

Two training modes:

| Mode | Training Method | Effect |
|---|---|---|
| Medusa-1 | Freeze LLM, train only the heads | 2.2x speedup, lossless |
| Medusa-2 | Jointly fine-tune LLM + heads | 2.3-3.6x speedup |

Medusa-2 special training recipe:

Empirical data:

4.4.2 EAGLE / EAGLE-3

EAGLE Paper: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (ICML 2024)
EAGLE-3 Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (NeurIPS 2025)
Code: GitHub

Core insights:

  1. Feature-level autoregression is easier than token-level
    • Token space is discrete and sparse
    • Feature space is continuous and smooth
    • Prediction in feature space is more stable
  2. Feature uncertainty is the bottleneck
    • Token sampling results affect next-step features
    • But draft stage cannot see actual sampling results

EAGLE Architecture:

Target LLM:
Input → [...Layers...] → Top Layer Feature → LM Head → Token
                              ↓
EAGLE Draft Head:        Feature + Token(t-1)
                              ↓
                         Predict Feature(t+1)
                              ↓
                         LM Head → Draft Token
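A simplified structural sketch of the draft head (the real implementation reuses the Target's embedding table and LM Head and runs one Llama-style decoder layer autoregressively in feature space; the encoder layer below is a stand-in and causal masking is omitted):

import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    def __init__(self, hidden_size, embed, lm_head):
        super().__init__()
        self.embed = embed          # shared with the Target Model
        self.lm_head = lm_head      # shared with the Target Model
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)

    def forward(self, feature, prev_token):
        # feature:    Target top-layer features up to step t, [batch, seq, hidden]
        # prev_token: tokens actually sampled at each step,   [batch, seq]
        x = torch.cat([feature, self.embed(prev_token)], dim=-1)
        next_feature = self.layer(self.fuse(x))            # predicted feature for step t+1
        return next_feature, self.lm_head(next_feature)    # -> draft token distribution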

Key design: the draft head conditions on both the Target's top-layer feature and the embedding of the token actually sampled at the previous step (resolving the feature-uncertainty problem above), and it reuses the Target's embedding table and LM Head, so only a single lightweight decoder layer needs to be trained.

Training details: the Target Model stays frozen; the draft head is trained with a feature-regression loss against the Target's next-step features plus a token-level cross-entropy loss, which makes training far cheaper than full-model distillation.

EAGLE Parameter Count (relative to Target):

| Target Size | EAGLE Params | Ratio |
|---|---|---|
| 7B | 0.24B | 3.4% |
| 13B | 0.37B | 2.8% |
| 33B | 0.56B | 1.7% |
| 70B | 0.99B | 1.4% |

EAGLE-3 Improvements (arXiv:2503.01840, NeurIPS 2025):

EAGLE-3 Architecture Changes:

EAGLE-2: Feature(t) + Token(t-1) → Predict Feature(t+1)
EAGLE-3: Multi-layer Features + Token(t-1) → Direct Token Prediction

Key improvements:

  1. TTT (Training-Time Test): Fuses multi-layer features instead of top-layer only
  2. Simplified prediction target: Direct token prediction instead of feature, easier to learn
  3. Better generalization: More stable performance on out-of-distribution data

Results: 2.7-3.5x latency speedup on LLaMA2-Chat 70B.

SpecForge: EAGLE-3 Training Framework

SpecForge is LMSYS’s open-source EAGLE-3 training framework for efficiently training draft models at various scales.

Two Training Modes:

| Mode | Description | Use Case |
|---|---|---|
| Online | Target and Draft run together, features generated on the fly | Ample GPU, best quality |
| Offline | Pre-generate Target features, train Draft offline | Limited GPU, large-scale training |

Online Mode:

# Train with FSDP
python -m specforge.train \
    --target_model meta-llama/Llama-3.1-70B-Instruct \
    --mode online \
    --backend fsdp \
    --data_path train_data.jsonl

Offline Mode:

# Step 1: Pre-generate features
python -m specforge.generate_features \
    --target_model meta-llama/Llama-3.1-70B-Instruct \
    --output_path features/

# Step 2: Offline training
python -m specforge.train \
    --mode offline \
    --feature_path features/

Supported Backends:


4.5 Draft-Free Methods

No Draft Model at all—achieve parallel decoding through algorithmic innovation.

4.5.1 Lookahead Decoding (ICML 2024)

Paper: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Code: GitHub

Core idea: View autoregressive decoding as solving nonlinear equations, use Jacobi iteration for parallel solving.

Problem with Jacobi Decoding:

Traditional Jacobi decoding barely accelerates LLMs (~1.05x): each iteration usually fixes only the first not-yet-correct token, because every later position is conditioned on earlier guesses that are still wrong, so the process effectively degenerates back to one token per step.

Lookahead’s Solution:

Although a single Jacobi iteration typically confirms only one token, the iteration trajectory produces valuable n-gram byproducts.

2D Window Design:

Dimension 1: Window Size W (how far ahead in future positions)
Dimension 2: N-gram Size N (how many steps back in history)

     Time axis (Jacobi iteration steps)
        t-3  t-2  t-1   t
Pos 1    a    b    c    d  ← Can extract 4-gram: abcd
Pos 2    e    f    g    h
Pos 3    i    j    k    l
Pos 4    m    n    o    p
  ↑
Sequence axis

Two Parallel Branches:

  1. Lookahead Branch:
    • Maintains 2D window
    • Updates predictions at all positions each step
    • Collects n-grams from trajectory into candidate pool
  2. Verification Branch:
    • Selects n-grams from pool matching first token
    • Verifies these candidates in parallel
    • Accepts longest valid prefix
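A toy sketch of the n-gram pool that connects the two branches (keyed by first token, as described above; the real implementation does this on-GPU inside one fused attention call):

from collections import defaultdict

N = 4                                  # n-gram size
pool = defaultdict(set)                # first token -> set of n-grams

def collect(position_trajectory):
    """position_trajectory: the guesses at one window position across the last
    Jacobi steps, oldest first (one row of the 2D window figure above)."""
    gram = tuple(position_trajectory[-N:])
    if len(gram) == N:
        pool[gram[0]].add(gram)

def candidates(last_confirmed_token):
    """Verification branch: propose continuations of n-grams whose first token
    matches the last confirmed token."""
    return [gram[1:] for gram in pool.get(last_confirmed_token, ())]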

Algorithm flow:

# Configuration (lookahead-decoding package)
import lade

lade.config_lade(
    LEVEL=5,           # N-gram size N
    WINDOW_SIZE=7,     # Window size W
    GUESS_SET_SIZE=7,  # Candidate pool size G
)

# Each decoding step (conceptually):
# 1. Lookahead branch: generate predictions for W positions in parallel
# 2. Collect newly generated n-grams into the candidate pool
# 3. Verification branch: verify matching n-grams
# 4. Accept the longest valid prefix, update the window

Results: roughly 1.5-2.3x end-to-end speedup without any draft model or extra training (see the summary in Section 6.5).

4.5.2 Prompt Lookup Decoding

Core idea: Find n-grams from prompt that match current generation.

Suitable scenarios: tasks whose output heavily overlaps the prompt, such as summarization, document QA / RAG, and code editing or refactoring.

Implementation:

# vLLM configuration
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",        # Enable n-gram lookup
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,          # Max 4-gram
)
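For intuition, the lookup itself is just an exact n-gram match against the prompt; a hypothetical pure-Python version (not vLLM's internal implementation):

def prompt_lookup(prompt_ids, generated_ids, max_ngram=4, num_draft=5):
    """Return up to num_draft draft tokens copied from the prompt, found by
    matching the last max_ngram..1 tokens of the running context."""
    context = prompt_ids + generated_ids
    for n in range(max_ngram, 0, -1):                  # prefer longer matches
        pattern = context[-n:]
        for start in range(len(prompt_ids) - n):
            if prompt_ids[start:start + n] == pattern:
                return prompt_ids[start + n:start + n + num_draft]
    return []                                          # no match: fall back to normal decoding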

Results: vLLM benchmark shows 2.8x speedup on summarization.

Pros: zero additional overhead, no training needed.
Limitation: only effective when the prompt overlaps with the output.


5. Verification Strategies

5.1 Sequential Verification

Original method: verify left-to-right, stop at first rejection.

Draft:  t1  t2  t3  t4  t5
        ✓   ✓   ✗   -   -
Accept: t1, t2 + resample t3'

Problem: One rejection wastes all subsequent drafts.

5.2 Tree-based Verification

SpecInfer (ASPLOS 2024):

Organize candidates as a Token Tree, not a linear sequence:

        t1
       / | \
      t2 t2' t2''
     /|   |
    t3 t3' t3

Advantages: each position can carry several alternative candidates, so a single rejection no longer discards the entire draft, and the expected number of accepted tokens per verification pass is higher.

Implementation: Use Tree Attention to process all paths in parallel.

5.3 Block Verification

Observation: Independent token-by-token verification is not optimal.

Block Verification: decide acceptance jointly over the whole draft block instead of token by token; the joint rule accepts at least as many tokens in expectation while still preserving the Target Model's output distribution.


6. Method Summary

6.1 Seminal Works

| Work | Date | Institution | Contribution |
|---|---|---|---|
| Fast Inference from Transformers via Speculative Decoding | 2022-11 | Google | First proposed Speculative Decoding |
| Accelerating Large Language Model Decoding with Speculative Sampling | 2023-02 | DeepMind | Independent proposal, Chinchilla 2-2.5x |

6.2 Draft Model Improvements

| Work | Core Idea | Effect |
|---|---|---|
| DistillSpec | Knowledge-distillation alignment | +10-45% |
| Online Speculative Decoding | Online Draft updates | Adapts to distribution shift |
| Draft & Verify | Self-speculative | No extra model |
| LayerSkip | Layer skipping | Computation reuse |

6.3 Additional Heads

| Work | Core Idea | Speedup |
|---|---|---|
| Medusa | Multi-head + tree attention | 2.2-3.6x |
| EAGLE | Feature-level prediction head | 2-3x |
| EAGLE-3 | Training-time test optimization | SOTA |
| Hydra | Multi-head variant | - |

6.4 Verification Optimization

| Work | Core Idea | Effect |
|---|---|---|
| SpecInfer | Token Tree + Tree Attention | 2.6-3.5x |
| Block Verification | Joint block verification | Higher acceptance |
| Staged Speculative Decoding | Multi-stage verification | - |

6.5 Draft-Free

| Work | Core Idea | Speedup |
|---|---|---|
| Lookahead Decoding | Jacobi iteration + n-gram cache | 1.5-2.3x |
| Prompt Lookup | Find n-grams in the prompt | 2.8x (summarization) |
| REST | Retrieval-augmented | - |

7. Practical Deployment

7.1 Framework Support

| Framework | Supported Methods |
|---|---|
| vLLM | Draft model, Prompt lookup, Medusa, EAGLE |
| TensorRT-LLM | Draft model, Medusa |
| SGLang | Draft model, EAGLE |
| HuggingFace | Assisted generation |

7.2 vLLM Usage Example

from vllm import LLM, SamplingParams

# Draft model-based
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)

# Prompt lookup (no draft model needed)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)

7.3 When to Use?

Recommended: large Target Models (roughly ≥30B), latency-sensitive or small-batch serving, and workloads with predictable output such as code, summarization, or structured generation.

Not recommended: small Target Models (<7B), throughput-oriented large-batch serving that is already compute-bound, and highly open-ended creative generation where acceptance rates are low.

7.4 SpecBundle: Production-Grade EAGLE-3 Models

SpecBundle is LMSYS’s collection of production-grade EAGLE-3 draft models trained with SpecForge.

Phase 1 Release (2025-12):

| Target Model | Draft Model | Speedup | Model Link |
|---|---|---|---|
| Llama-3.1-8B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Llama-3.1-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Llama-3.3-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Qwen2.5-7B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Qwen2.5-32B-Instruct | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Qwen2.5-72B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| DeepSeek-V3 | EAGLE-3 | 3.0-3.5x | HuggingFace |
| Gemma-2-9B-it | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Gemma-2-27B-it | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Mistral-7B-Instruct-v0.3 | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Mistral-Large-2 | EAGLE-3 | 3.0-3.5x | HuggingFace |

Key features:

Usage Example (vLLM):

from vllm import LLM, SamplingParams

# Use SpecBundle's EAGLE-3 model
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="lmsys/Llama-3.1-70B-Instruct-EAGLE3",
    num_speculative_tokens=5,
)

# Normal inference
output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0.7, max_tokens=512)
)

vLLM Speculators Training Support:

Starting from vLLM v0.3.0, end-to-end EAGLE-3 training is supported via vllm-speculators:

# Install
pip install vllm-speculators

# Train EAGLE-3
python -m vllm_speculators.train_eagle3 \
    --target_model meta-llama/Llama-3.1-8B-Instruct \
    --output_path ./my-eagle3-model \
    --data_path train_data.jsonl

8. Method Taxonomy

| Category | Core Idea | Representative Works |
|---|---|---|
| Independent Draft | Use a small model for drafts | Google/DeepMind originals, DistillSpec |
| Self-Speculative | Derive the Draft from the Target | LayerSkip, Draft&Verify, SPEED |
| Additional Heads | Add prediction heads to the Target | Medusa, EAGLE, Hydra |
| Tree Verification | Tree candidates + parallel verify | SpecInfer |
| Draft-Free | No Draft Model | Lookahead, Prompt Lookup, REST |

References

Seminal Works

Draft Model Improvements

Additional Heads

Verification Optimization

Draft-Free

Surveys

Training Ecosystem

Resources
