LLM Notes

Notes on LLMs and Reinforcement Learning: deep dives into Transformer, RLHF, PPO, DPO, and related techniques

Speculative Decoding: A Complete Guide to Principles, Methods, and Speedup Analysis

2026-01-05 · Qi Lu

Speculative Decoding is one of the most important techniques in LLM inference acceleration. Through a “draft-then-verify” paradigm, it achieves 2-3x speedup without changing the output distribution.

This post provides a comprehensive introduction to Speculative Decoding, covering its principles, implementation methods, draft model acquisition approaches, and an in-depth analysis of why it accelerates inference.


1. Motivation: The Autoregressive Decoding Bottleneck

1.1 Why is LLM Inference Slow?

LLMs generate text autoregressively: each token depends on all previous tokens, requiring sequential generation.

Two decoding modes:

| Mode | Formula | Characteristics |
|---|---|---|
| Greedy | $y_t = \arg\max_v P(v \mid y_{<t}, x)$ | Deterministic, picks the highest-probability token |
| Sampling | $y_t \sim P(\cdot \mid y_{<t}, x)$ | Stochastic, samples from the distribution |

Regardless of mode, generating K tokens requires K sequential forward passes.
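To make the sequential dependency concrete, here is a minimal sketch of both modes in PyTorch, assuming only a hypothetical causal-LM callable that maps token ids of shape [batch, seq] to logits of shape [batch, seq, vocab] (not any specific library's API):

import torch

@torch.no_grad()
def decode(model, input_ids, k, greedy=True, temperature=1.0):
    """Generate k tokens one at a time; each new token requires a full forward pass."""
    ids = input_ids
    for _ in range(k):                                         # k sequential forward passes
        logits = model(ids)[:, -1, :]                          # only the last position matters
        if greedy:
            next_id = logits.argmax(dim=-1, keepdim=True)      # y_t = argmax_v P(v | y_<t, x)
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # y_t ~ P(. | y_<t, x)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids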

1.2 The Memory-Bound Problem

Modern GPUs have abundant compute power, but LLM inference is memory-bound rather than compute-bound:

| Bottleneck | Description |
|---|---|
| Weight Reading | Each forward pass reads nearly all model weights from HBM |
| KV Cache | Must read the KV Cache for all historical tokens |
| Low Parallelism | Each step generates only 1 token, low GPU utilization |

Note: Weights reside in GPU memory (HBM), but each forward pass must read weights from HBM into compute units—this read bandwidth is the bottleneck.

Core contradiction: GPUs have massive compute, but each step only computes one token—most time is spent waiting for memory reads.

1.3 Key Insight

“Hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models.”

Many tokens generated by large models are “easy” (common words, grammatical structures)—small models can predict them correctly. Only a few “hard” tokens truly require the large model’s capability.


2. Core Mechanism

2.1 The Draft-Then-Verify Paradigm

The core idea of Speculative Decoding:

  1. Draft: Use a fast Draft Model to serially generate $\gamma$ candidate tokens
  2. Verify: Use the Target Model to verify these tokens in parallel
  3. Accept/Reject: Use rejection sampling to decide which tokens to accept
Draft Model:  [x] → t1 → t2 → t3 → t4 → t5  (γ=5, serial)
                ↓    ↓    ↓    ↓    ↓
Target Model: [x, t1, t2, t3, t4, t5]        (parallel verify)
                ↓    ↓    ↓    ↓    ↓
Result:       [✓]  [✓]  [✓]  [✗]  [—]       (accept 3 + resample 1)

2.2 The Rejection Sampling Algorithm

Let Draft Model distribution be $q(x)$, Target Model distribution be $p(x)$:

Acceptance probability:

\[P(\text{accept}) = \min\left(1, \frac{p(x)}{q(x)}\right)\]

Two cases:

| Case | Condition | Action |
|---|---|---|
| Draft is conservative | $q(x) \leq p(x)$ | 100% accept |
| Draft is overconfident | $q(x) > p(x)$ | Accept with probability $p(x)/q(x)$ |

Resampling on rejection:

\[x \sim \text{norm}\left(\max(0, p(x) - q(x))\right)\]
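A minimal sketch of one verification round, assuming the parallel verify pass has already produced the Target and Draft probability rows for each drafted position (shapes are illustrative):

import torch

def verify_draft(p, q, draft_tokens):
    """p, q: Target / Draft probabilities, shape [gamma, vocab];
    draft_tokens: the gamma tokens the Draft actually sampled, shape [gamma].
    Returns the accepted tokens, plus one corrective token on rejection."""
    out = []
    for i, x in enumerate(draft_tokens.tolist()):
        # Accept x with probability min(1, p(x) / q(x))
        if torch.rand(()) < torch.clamp(p[i, x] / q[i, x], max=1.0):
            out.append(x)
        else:
            # Rejected: resample from norm(max(0, p - q)) at this position and stop
            residual = torch.clamp(p[i] - q[i], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out
    # All gamma drafts accepted; the same verify pass also yields the Target
    # distribution at position gamma + 1, so one extra "bonus" token can be
    # sampled there (omitted here for brevity).
    return out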

2.3 Output Invariance Guarantee

Speculative Decoding guarantees output invariance in both decoding modes:

Greedy mode: a draft token is accepted only if it matches the Target Model's own argmax at that position, so the final sequence is token-for-token identical to greedy decoding with the Target Model alone.

Sampling mode: the accept/reject rule plus residual resampling guarantees that every emitted token is distributed exactly according to $p$, the Target Model's distribution (see the derivation below).

This means: Speculative Decoding is lossless. It changes only latency, never the output (greedy) or the output distribution (sampling).
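The sampling-mode guarantee follows from a one-line marginalization over "accepted" vs. "rejected then resampled", using $\sum_{x'} \max(0, p(x') - q(x')) = 1 - \sum_{x'} \min(p(x'), q(x'))$:

\[
\begin{aligned}
P(\text{output} = x)
&= q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
 + \left(1 - \sum_{x'} \min\big(p(x'), q(x')\big)\right)
   \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)} \\
&= \min\big(p(x), q(x)\big) + \max\big(0,\, p(x) - q(x)\big) \\
&= p(x).
\end{aligned}
\]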


3. The Essence of Speedup

3.1 Memory-Bound: The Fundamental Bottleneck

To understand why Speculative Decoding works, we must first understand why LLM inference is slow.

What determines inference time?

\[T_{inference} = \max(T_{compute}, T_{memory})\]

The autoregressive decoding problem:

Each token generation requires:

  1. Reading nearly all model weights from HBM
  2. Reading KV Cache
  3. Executing matrix multiplications

Back-of-envelope estimate (order of magnitude only):

Assumptions: 70B model, FP16, A100 80GB (2 TB/s bandwidth, 312 TFLOPS)
Ignoring: KV Cache, activations, communication, kernel scheduling

Single token generation:
├── HBM read: ~140GB weights → O(100ms) order
└── Computation: ~140 GFLOPs → O(1ms) order

Bandwidth vs Compute: Two orders of magnitude difference!

⚠️ Actual latency depends on precision (FP16/INT8/INT4), parallelism (TP/PP), sequence length, kernel fusion, etc. Above is only to illustrate order-of-magnitude difference.

Arithmetic Intensity:

\[\text{AI} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}\]
| Scenario | Arithmetic Intensity | Bottleneck |
|---|---|---|
| Autoregressive (batch=1) | ~1 FLOP/Byte | Bandwidth |
| Batched (batch=128) | ~128 FLOP/Byte | Compute |
| A100 balance point | ~156 FLOP/Byte | - |

Key insight: Autoregressive decoding has arithmetic intensity far below GPU’s balance point—most compute is wasted waiting for HBM reads.
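The same back-of-envelope numbers can be checked in a few lines (all constants are the illustrative assumptions from above, not measured values):

# Roofline-style estimate for the 70B / A100 example above
weights_bytes = 140e9      # 70B params, FP16
bandwidth     = 2e12       # HBM bandwidth, ~2 TB/s
peak_flops    = 312e12     # FP16 tensor-core peak

for n_tokens in (1, 5, 128):
    flops = 2 * 70e9 * n_tokens          # ~2 FLOPs per parameter per token
    t_mem = weights_bytes / bandwidth    # weights are read once regardless of n_tokens
    t_cmp = flops / peak_flops
    ai = flops / weights_bytes           # arithmetic intensity, FLOP/Byte
    print(f"{n_tokens:>3} tokens/pass: mem {t_mem*1e3:.0f} ms, compute {t_cmp*1e3:.1f} ms, AI ~{ai:.0f}")
# Balance point: 312e12 / 2e12 = 156 FLOP/Byte, matching the table above.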

3.2 The Essence of Speculative Decoding

Core idea: Use one HBM read to compute multiple tokens

Traditional autoregressive (5 tokens):
├── Step 1: Read weights(140GB) + compute(t1) → 70ms
├── Step 2: Read weights(140GB) + compute(t2) → 70ms
├── Step 3: Read weights(140GB) + compute(t3) → 70ms
├── Step 4: Read weights(140GB) + compute(t4) → 70ms
└── Step 5: Read weights(140GB) + compute(t5) → 70ms
Total: 5 × 70ms = 350ms, HBM reads 700GB

Speculative Decoding (verify 5 tokens):
├── Draft: 5 × 7ms = 35ms (small model, negligible)
└── Verify: Read weights(140GB) + compute(t1,t2,t3,t4,t5) → ~75ms
Total: ~110ms, HBM reads ~210GB (140GB Target + 5 × 14GB Draft, with Draft at 10% of Target size)

Why does verifying 5 tokens take barely longer than 1?

| Operation | 1 token | 5 tokens | Growth |
|---|---|---|---|
| Read weights | 140GB | 140GB | 1x (unchanged!) |
| KV Cache read | K bytes | K bytes | ~1x |
| Computation | 140 GFLOPs | 700 GFLOPs | 5x |
| Total time | ~70ms | ~75ms | ~1.07x |

The essence: the Target Model's weights are read from HBM once per round but reused to score every drafted position, so each extra verified token costs almost pure compute, and compute is nearly free in a memory-bound regime.

3.3 Speedup Analysis

Theoretical speedup:

Assume the Target Model's forward pass takes $T_t$, the Draft Model's takes $T_d$, each round drafts $\gamma$ tokens, and on average $k$ tokens are emitted per round (accepted drafts plus the final corrective/bonus token). Producing those $k$ tokens autoregressively would cost $k \cdot T_t$, while one speculative round costs $\gamma \cdot T_d + T_t$:

\[\text{Speedup} = \frac{k \cdot T_t}{\gamma \cdot T_d + T_t} \approx \frac{k \cdot T_t}{T_t} = k\]

When $T_d \ll T_t$, speedup approximately equals the average number of accepted tokens.

Expected accepted tokens (assuming independent acceptance probability $\alpha$ per position):

\[\mathbb{E}[\text{accepted}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]
| Acceptance Rate $\alpha$ | Expected tokens at $\gamma=5$ | Actual Speedup |
|---|---|---|
| 0.5 | 1.97 | ~2x |
| 0.7 | 2.94 | ~3x |
| 0.9 | 4.69 | ~4x |
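The table assumes $T_d \approx 0$; the small helper below also lets you include a nonzero Draft cost $c = T_d / T_t$ (the value 0.1 is an illustrative assumption):

def expected_tokens(alpha: float, gamma: int) -> float:
    """E[tokens per round] = (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float = 0.1) -> float:
    """Speedup per round when one Draft pass costs c Target passes."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens(alpha, 5), 2), round(speedup(alpha, 5), 2))
# 0.5 1.97 1.31
# 0.7 2.94 1.96
# 0.9 4.69 3.12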

3.4 When Does It Work Well?

Deriving from the memory-bound nature:

Works well when:

| Condition | Reason |
|---|---|
| Target Model large enough (≥30B) | More memory-bound, more "free" compute |
| Draft-Target distributions similar | High acceptance rate, more tokens confirmed per round |
| Predictable output | Code, formatted text, and translation have high acceptance |

Works poorly when:

| Condition | Reason |
|---|---|
| Target Model small (<7B) | Closer to compute-bound, multi-token verification has overhead |
| Large Draft-Target gap | Low acceptance rate, frequent resampling |
| Creative tasks | Unpredictable output, low acceptance rate |

Intuition: Speculative Decoding trades “multiple HBM reads by small model” for “one HBM read by Target.” If Target isn’t large/memory-bound enough, the trade isn’t worth it.


4. Draft Model Approaches

4.1 Independent Small Models

The most direct approach: use a smaller model from the same family as Draft.

| Target Model | Draft Model | Param Ratio | Source |
|---|---|---|---|
| Llama-70B | Llama-7B | 10:1 | Same family |
| Chinchilla-70B | Chinchilla-1B | 70:1 | DeepMind original |
| T5-XXL (11B) | T5-small (60M) | 183:1 | Google original |

Selection principles: the Draft must share the Target's tokenizer and vocabulary, should ideally come from the same model family (similar training data and output style), and is typically 10-100x smaller so that $T_d \ll T_t$.

Pros: no additional training needed, ready to use.
Cons: the distribution gap with the Target may be large, which limits the acceptance rate.
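With an off-the-shelf pair like this, assisted generation in HuggingFace Transformers is essentially a one-argument change (model names follow the table above and require gated access; argument behavior may vary slightly across versions):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok    = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", device_map="auto")
draft  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

inputs = tok("Speculative decoding works because", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))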


4.2 Knowledge Distillation

Use knowledge distillation to better align Draft Model with Target Model’s output distribution.

4.2.1 DistillSpec (ICLR 2024)

Paper: DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Core problem: Off-the-shelf small models have large distribution gaps with Target, leading to low acceptance rates.

Two key design choices:

  1. On-Policy Data Generation:
    • Use data generated by Draft Model itself for training
    • Rather than using fixed datasets
    • Reason: Draft needs to align on tokens it might actually generate
  2. Task-Specific Divergence Functions:
    • Different tasks/decoding strategies use different KL divergence variants
    • Greedy decoding: Forward KL
    • Sampling: Reverse KL or JSD

Training pipeline:

1. Draft Model generates candidate sequences
2. Target Model computes probability distributions at these positions
3. Minimize divergence between Draft and Target distributions
4. Repeat until convergence
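A heavily simplified sketch of one on-policy distillation step under the forward-KL choice, assuming HF-style draft/target models with .generate and .logits (an illustration, not the paper's code):

import torch
import torch.nn.functional as F

def distill_step(draft, target, prompt_ids, optimizer, max_new_tokens=64):
    # 1. On-policy data: the Draft generates its own continuations
    with torch.no_grad():
        seqs = draft.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)
    # 2. Both models score the same positions
    draft_logits = draft(seqs).logits[:, :-1]
    with torch.no_grad():
        target_logits = target(seqs).logits[:, :-1]
    # 3. Forward KL(target || draft), the variant suited to greedy decoding
    loss = F.kl_div(
        F.log_softmax(draft_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()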

Results: distilled drafts reach consistently higher acceptance rates than off-the-shelf small models, improving end-to-end speedup by roughly 10-45% (see the summary table in Section 6.2).

4.2.2 AdaSPEC (2025)

Core improvement: Selective token filtering

Observation: Some tokens are inherently hard to predict (proper nouns, rare words). Forcing alignment on these can hurt prediction of easy tokens.

Method:

  1. Use reference model to identify “hard” tokens
  2. Filter out these tokens during distillation
  3. Let Draft focus on aligning “easy” tokens

Results: Acceptance rate improves up to 15% over DistillSpec


4.3 Self-Speculative Decoding

No separate Draft Model—derive Draft from Target Model itself, “drafting for yourself.”

4.3.1 LayerSkip (ACL 2024)

Paper: LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Code: GitHub

Core idea: Skip later layers, use output from first E layers to directly predict tokens.

Three-stage approach:

Stage 1: Layer Dropout during Training

# Different dropout rates for different layers during training
# Shallow layers: low dropout (maintain stability)
# Deep layers: high dropout (enhance early exit capability)
for layer_idx, layer in enumerate(layers):
    dropout_rate = layer_idx / num_layers * max_dropout
    x = layer(x, dropout=dropout_rate)

Stage 2: Early Exit Loss

Stage 3: Self-Speculative Decoding

Self-Draft:   First E layers → LM Head → draft tokens
Self-Verify:  Remaining layers verify + full forward pass
Key optimization: Reuse KV Cache from draft stage during verification
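To see what "exit at layer E" means concretely, the early-exit distribution of a Llama-style HF checkpoint can be inspected via output_hidden_states (this still runs the full forward pass, so it only illustrates the draft distribution; the real implementation stops at layer E and reuses its KV Cache):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/layerskip-llama2-7B"      # checkpoint referenced below
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

E = 8
ids = tok("def fibonacci(n):", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)
    h_E = out.hidden_states[E]                            # hidden state after the first E layers
    draft_logits = model.lm_head(model.model.norm(h_E))   # shared LM head -> draft distribution
    full_logits = out.logits                              # full-depth (verify) distribution
print((draft_logits[0, -1].argmax() == full_logits[0, -1].argmax()).item())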

Usage example:

# Exit at layer 8 for drafting; generate 6 draft tokens per round
torchrun generate.py --model facebook/layerskip-llama2-7B \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6

Results:

Advantages: no separate Draft Model to train or deploy, draft and verify share the same weights and KV Cache, and the extra memory footprint is essentially zero.

Integration status: Integrated into HuggingFace Transformers and PyTorch TorchTune.


4.4 Additional Heads

Add lightweight prediction heads to Target Model without modifying the original model.

4.4.1 Medusa (ICML 2024)

Paper: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Code: GitHub

Core idea: Add multiple “Medusa Heads,” each predicting tokens at different future positions.

Architecture:

                    ┌─→ Head 1 → predict t+1
Hidden State (t) ───┼─→ Head 2 → predict t+2
    from LLM        ├─→ Head 3 → predict t+3
                    └─→ Head 4 → predict t+4
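A structural sketch of the heads (each head is a residual SiLU block on the last hidden state followed by a vocabulary projection; dimensions and initialization details are simplified):

import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h):                      # h: [batch, hidden] from the base LLM
        return self.lm(h + self.act(self.proj(h)))

class MedusaHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(MedusaHead(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, h):
        # Head i predicts the token at position t+i+1 from the hidden state at position t
        return [head(h) for head in self.heads]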

Tree Attention Mechanism:

Since each head may have multiple candidates (top-k), combinations form a candidate tree:

Assume Head 1 takes top-2, Head 2 takes top-3:
Candidate tree has 2 × 3 = 6 paths

        t1
       / \
      t1a t1b
     /|\  /|\
    ...  ...

Tree Attention Implementation:
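One common way to implement it: flatten the candidate tree into a single sequence and build a mask in which each node attends only to its ancestors (plus the committed prefix), so every path is verified in one forward pass. A minimal mask-building sketch over explicit paths:

import torch

def tree_attention_mask(paths):
    """paths: candidate paths as tuples of node ids from root to leaf,
    e.g. [(0,), (0, 1), (0, 2), (3, 4)].
    Returns (nodes, mask) with mask[i, j] = True iff node j is an ancestor
    of (or equal to) node i."""
    nodes, index = [], {}
    for path in paths:
        for d in range(1, len(path) + 1):
            prefix = path[:d]
            if prefix not in index:
                index[prefix] = len(nodes)
                nodes.append(prefix)
    mask = torch.zeros(len(nodes), len(nodes), dtype=torch.bool)
    for prefix, i in index.items():
        for d in range(1, len(prefix) + 1):
            mask[i, index[prefix[:d]]] = True   # attend to every ancestor, including self
    return nodes, mask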

Two training modes:

| Mode | Training Method | Effect |
|---|---|---|
| Medusa-1 | Freeze LLM, train only the heads | 2.2x speedup, lossless |
| Medusa-2 | Jointly fine-tune LLM + heads | 2.3-3.6x speedup |

Medusa-2 special training recipe:

Empirical data:

4.4.2 EAGLE / EAGLE-3

EAGLE Paper: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (ICML 2024)
EAGLE-3 Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (NeurIPS 2025)
Code: GitHub

Core insights:

  1. Feature-level autoregression is easier than token-level
    • Token space is discrete and sparse
    • Feature space is continuous and smooth
    • Prediction in feature space is more stable
  2. Feature uncertainty is the bottleneck
    • Token sampling results affect next-step features
    • But draft stage cannot see actual sampling results

EAGLE Architecture:

Target LLM:
Input → [...Layers...] → Top Layer Feature → LM Head → Token
                              ↓
EAGLE Draft Head:        Feature + Token(t-1)
                              ↓
                         Predict Feature(t+1)
                              ↓
                         LM Head → Draft Token
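A simplified structural sketch of the draft head (the real implementation reuses the Target's embedding table and LM Head and runs one Llama-style decoder layer autoregressively in feature space; the encoder layer below is a stand-in and causal masking is omitted):

import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    def __init__(self, hidden_size, embed, lm_head):
        super().__init__()
        self.embed = embed          # shared with the Target Model
        self.lm_head = lm_head      # shared with the Target Model
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)

    def forward(self, feature, prev_token):
        # feature:    Target top-layer features up to step t, [batch, seq, hidden]
        # prev_token: tokens actually sampled at each step,   [batch, seq]
        x = torch.cat([feature, self.embed(prev_token)], dim=-1)
        next_feature = self.layer(self.fuse(x))            # predicted feature for step t+1
        return next_feature, self.lm_head(next_feature)    # -> draft token distribution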

Key design: the draft head conditions on both the Target's top-layer feature and the embedding of the token actually sampled at the previous step (resolving the feature-uncertainty problem above), and it reuses the Target's embedding table and LM Head, so only a single lightweight decoder layer needs to be trained.

Training details: the Target Model stays frozen; the draft head is trained with a feature-regression loss against the Target's next-step features plus a token-level cross-entropy loss, which makes training far cheaper than full-model distillation.

EAGLE Parameter Count (relative to Target):

| Target Size | EAGLE Params | Ratio |
|---|---|---|
| 7B | 0.24B | 3.4% |
| 13B | 0.37B | 2.8% |
| 33B | 0.56B | 1.7% |
| 70B | 0.99B | 1.4% |

EAGLE-3 Improvements (arXiv:2503.01840, NeurIPS 2025):

EAGLE-3 Architecture Changes:

EAGLE-2: Feature(t) + Token(t-1) → Predict Feature(t+1)
EAGLE-3: Multi-layer Features + Token(t-1) → Direct Token Prediction

Key improvements:

  1. TTT (Training-Time Test): Fuses multi-layer features instead of top-layer only
  2. Simplified prediction target: Direct token prediction instead of feature, easier to learn
  3. Better generalization: More stable performance on out-of-distribution data

Results: 2.7-3.5x latency speedup on LLaMA2-Chat 70B.

SpecForge: EAGLE-3 Training Framework

SpecForge is LMSYS’s open-source EAGLE-3 training framework for efficiently training draft models at various scales.

Two Training Modes:

| Mode | Description | Use Case |
|---|---|---|
| Online | Target and Draft run together, features generated on the fly | Ample GPU, best quality |
| Offline | Pre-generate Target features, train Draft offline | Limited GPU, large-scale training |

Online Mode:

# Train with FSDP
python -m specforge.train \
    --target_model meta-llama/Llama-3.1-70B-Instruct \
    --mode online \
    --backend fsdp \
    --data_path train_data.jsonl

Offline Mode:

# Step 1: Pre-generate features
python -m specforge.generate_features \
    --target_model meta-llama/Llama-3.1-70B-Instruct \
    --output_path features/

# Step 2: Offline training
python -m specforge.train \
    --mode offline \
    --feature_path features/

Supported Backends:


4.5 Draft-Free Methods

No Draft Model at all—achieve parallel decoding through algorithmic innovation.

4.5.1 Lookahead Decoding (ICML 2024)

Paper: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Code: GitHub

Core idea: View autoregressive decoding as solving nonlinear equations, use Jacobi iteration for parallel solving.

Problem with Jacobi Decoding:

Traditional Jacobi decoding barely accelerates LLMs (~1.05x): each iteration usually fixes only the first not-yet-correct token, because every later position is conditioned on earlier guesses that are still wrong, so the process effectively degenerates back to one token per step.

Lookahead’s Solution:

Although a single Jacobi iteration typically confirms only one token, the iteration trajectory produces valuable n-gram byproducts.

2D Window Design:

Dimension 1: Window Size W (how far ahead in future positions)
Dimension 2: N-gram Size N (how many steps back in history)

     Time axis (Jacobi iteration steps)
        t-3  t-2  t-1   t
Pos 1    a    b    c    d  ← Can extract 4-gram: abcd
Pos 2    e    f    g    h
Pos 3    i    j    k    l
Pos 4    m    n    o    p
  ↑
Sequence axis

Two Parallel Branches:

  1. Lookahead Branch:
    • Maintains 2D window
    • Updates predictions at all positions each step
    • Collects n-grams from trajectory into candidate pool
  2. Verification Branch:
    • Selects n-grams from pool matching first token
    • Verifies these candidates in parallel
    • Accepts longest valid prefix
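A toy sketch of the n-gram pool that connects the two branches (keyed by first token, as described above; the real implementation does this on-GPU inside one fused attention call):

from collections import defaultdict

N = 4                                  # n-gram size
pool = defaultdict(set)                # first token -> set of n-grams

def collect(position_trajectory):
    """position_trajectory: the guesses at one window position across the last
    Jacobi steps, oldest first (one row of the 2D window figure above)."""
    gram = tuple(position_trajectory[-N:])
    if len(gram) == N:
        pool[gram[0]].add(gram)

def candidates(last_confirmed_token):
    """Verification branch: propose continuations of n-grams whose first token
    matches the last confirmed token."""
    return [gram[1:] for gram in pool.get(last_confirmed_token, ())]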

Algorithm flow:

# Configuration (lookahead-decoding package)
import lade

lade.config_lade(
    LEVEL=5,           # N-gram size N
    WINDOW_SIZE=7,     # Window size W
    GUESS_SET_SIZE=7,  # Candidate pool size G
)

# Each decoding step (conceptually):
# 1. Lookahead branch: generate predictions for W positions in parallel
# 2. Collect newly generated n-grams into the candidate pool
# 3. Verification branch: verify matching n-grams
# 4. Accept the longest valid prefix, update the window

Results: roughly 1.5-2.3x end-to-end speedup without any draft model or extra training (see the summary in Section 6.5).

4.5.2 Prompt Lookup Decoding

Core idea: Find n-grams from prompt that match current generation.

Suitable scenarios: tasks whose output heavily overlaps the prompt, such as summarization, document QA / RAG, and code editing or refactoring.

Implementation:

# vLLM configuration
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",        # Enable n-gram lookup
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,          # Max 4-gram
)
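For intuition, the lookup itself is just an exact n-gram match against the prompt; a hypothetical pure-Python version (not vLLM's internal implementation):

def prompt_lookup(prompt_ids, generated_ids, max_ngram=4, num_draft=5):
    """Return up to num_draft draft tokens copied from the prompt, found by
    matching the last max_ngram..1 tokens of the running context."""
    context = prompt_ids + generated_ids
    for n in range(max_ngram, 0, -1):                  # prefer longer matches
        pattern = context[-n:]
        for start in range(len(prompt_ids) - n):
            if prompt_ids[start:start + n] == pattern:
                return prompt_ids[start + n:start + n + num_draft]
    return []                                          # no match: fall back to normal decoding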

Results: vLLM benchmark shows 2.8x speedup on summarization.

Pros: zero additional overhead, no training needed.
Limitation: only effective when the prompt overlaps with the output.


5. Verification Strategies

5.1 Sequential Verification

Original method: verify left-to-right, stop at first rejection.

Draft:  t1  t2  t3  t4  t5
        ✓   ✓   ✗   -   -
Accept: t1, t2 + resample t3'

Problem: One rejection wastes all subsequent drafts.

5.2 Tree-based Verification

SpecInfer (ASPLOS 2024):

Organize candidates as a Token Tree, not a linear sequence:

        t1
       / | \
      t2 t2' t2''
     /|   |
    t3 t3' t3

Advantages: each position can carry several alternative candidates, so a single rejection no longer discards the entire draft, and the expected number of accepted tokens per verification pass is higher.

Implementation: Use Tree Attention to process all paths in parallel.

5.3 Block Verification

Observation: Independent token-by-token verification is not optimal.

Block Verification: decide acceptance jointly over the whole draft block instead of token by token; the joint rule accepts at least as many tokens in expectation while still preserving the Target Model's output distribution.


6. Method Summary

6.1 Seminal Works

| Work | Date | Institution | Contribution |
|---|---|---|---|
| Fast Inference from Transformers via Speculative Decoding | 2022-11 | Google | First proposed Speculative Decoding |
| Accelerating Large Language Model Decoding with Speculative Sampling | 2023-02 | DeepMind | Independent proposal, Chinchilla 2-2.5x |

6.2 Draft Model Improvements

| Work | Core Idea | Effect |
|---|---|---|
| DistillSpec | Knowledge-distillation alignment | +10-45% |
| Online Speculative Decoding | Online Draft updates | Adapts to distribution shift |
| Draft & Verify | Self-speculative | No extra model |
| LayerSkip | Layer skipping | Computation reuse |

6.3 Additional Heads

| Work | Core Idea | Speedup |
|---|---|---|
| Medusa | Multi-head + tree attention | 2.2-3.6x |
| EAGLE | Feature-level prediction head | 2-3x |
| EAGLE-3 | Training-time test optimization | SOTA |
| Hydra | Multi-head variant | - |

6.4 Verification Optimization

| Work | Core Idea | Effect |
|---|---|---|
| SpecInfer | Token Tree + Tree Attention | 2.6-3.5x |
| Block Verification | Joint block verification | Higher acceptance |
| Staged Speculative Decoding | Multi-stage verification | - |

6.5 Draft-Free

| Work | Core Idea | Speedup |
|---|---|---|
| Lookahead Decoding | Jacobi iteration + n-gram cache | 1.5-2.3x |
| Prompt Lookup | Find n-grams in the prompt | 2.8x (summarization) |
| REST | Retrieval-augmented | - |

7. Practical Deployment

7.1 Framework Support

| Framework | Supported Methods |
|---|---|
| vLLM | Draft model, Prompt lookup, Medusa, EAGLE |
| TensorRT-LLM | Draft model, Medusa |
| SGLang | Draft model, EAGLE |
| HuggingFace | Assisted generation |

7.2 vLLM Usage Example

from vllm import LLM, SamplingParams

# Draft model-based
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)

# Prompt lookup (no draft model needed)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)

7.3 When to Use?

Recommended: large Target Models (roughly ≥30B), latency-sensitive or small-batch serving, and workloads with predictable output such as code, summarization, or structured generation.

Not recommended: small Target Models (<7B), throughput-oriented large-batch serving that is already compute-bound, and highly open-ended creative generation where acceptance rates are low.

7.4 SpecBundle: Production-Grade EAGLE-3 Models

SpecBundle is LMSYS’s collection of production-grade EAGLE-3 draft models trained with SpecForge.

Phase 1 Release (2025-12):

| Target Model | Draft Model | Speedup | Model Link |
|---|---|---|---|
| Llama-3.1-8B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Llama-3.1-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Llama-3.3-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Qwen2.5-7B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Qwen2.5-32B-Instruct | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Qwen2.5-72B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| DeepSeek-V3 | EAGLE-3 | 3.0-3.5x | HuggingFace |
| Gemma-2-9B-it | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Gemma-2-27B-it | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Mistral-7B-Instruct-v0.3 | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Mistral-Large-2 | EAGLE-3 | 3.0-3.5x | HuggingFace |

Key features:

Usage Example (vLLM):

from vllm import LLM, SamplingParams

# Use SpecBundle's EAGLE-3 model
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="lmsys/Llama-3.1-70B-Instruct-EAGLE3",
    num_speculative_tokens=5,
)

# Normal inference
output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0.7, max_tokens=512)
)

vLLM Speculators Training Support:

Starting from vLLM v0.3.0, end-to-end EAGLE-3 training is supported via vllm-speculators:

# Install
pip install vllm-speculators

# Train EAGLE-3
python -m vllm_speculators.train_eagle3 \
    --target_model meta-llama/Llama-3.1-8B-Instruct \
    --output_path ./my-eagle3-model \
    --data_path train_data.jsonl

8. Method Taxonomy

| Category | Core Idea | Representative Works |
|---|---|---|
| Independent Draft | Use a small model for drafts | Google/DeepMind originals, DistillSpec |
| Self-Speculative | Derive the Draft from the Target | LayerSkip, Draft&Verify, SPEED |
| Additional Heads | Add prediction heads to the Target | Medusa, EAGLE, Hydra |
| Tree Verification | Tree candidates + parallel verify | SpecInfer |
| Draft-Free | No Draft Model | Lookahead, Prompt Lookup, REST |

References

Seminal Works

Draft Model Improvements

Additional Heads

Verification Optimization

Draft-Free

Surveys

Training Ecosystem

Resources
