Speculative Decoding: A Complete Guide to Principles, Methods, and Speedup Analysis
2026-01-05 · Qi Lu
Speculative Decoding is one of the most important techniques in LLM inference acceleration. Through a “draft-then-verify” paradigm, it achieves 2-3x speedup without changing the output distribution.
This post provides a comprehensive introduction to Speculative Decoding, covering its principles, implementation methods, draft model acquisition approaches, and an in-depth analysis of why it accelerates inference.
1. Motivation: The Autoregressive Decoding Bottleneck
1.1 Why is LLM Inference Slow?
LLMs generate text autoregressively: each token depends on all previous tokens, requiring sequential generation.
Two decoding modes:
| Mode | Formula | Characteristics |
|---|---|---|
| Greedy | $y_t = \arg\max_v P(v \mid y_{<t}, x)$ | Deterministic, picks highest probability token |
| Sampling | $y_t \sim P(\cdot \mid y_{<t}, x)$ | Stochastic, samples from distribution |
Regardless of mode, generating K tokens requires K sequential forward passes.
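To make the sequential dependency concrete, here is a minimal sketch of greedy autoregressive generation against a HuggingFace-style causal LM (the `model` object and its `.logits` output are assumptions of that API; KV caching is omitted for clarity). Every one of the K tokens costs a full forward pass:

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, k):
    """Generate k tokens one by one: each token requires a full forward pass."""
    for _ in range(k):
        logits = model(input_ids).logits                         # reads ~all weights from HBM
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)   # sequential dependency
    return input_ids
```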
1.2 The Memory-Bound Problem
Modern GPUs have abundant compute power, but LLM inference is memory-bound rather than compute-bound:
| Bottleneck | Description |
|---|---|
| Weight Reading | Each forward pass reads nearly all model weights from HBM |
| KV Cache | Must read KV Cache for all historical tokens |
| Low Parallelism | Each step generates only 1 token, low GPU utilization |
Note: Weights reside in GPU memory (HBM), but each forward pass must read weights from HBM into compute units—this read bandwidth is the bottleneck.
Core contradiction: GPUs have massive compute, but each step only computes one token—most time is spent waiting for memory reads.
1.3 Key Insight
“Hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models.”
Many tokens generated by large models are “easy” (common words, grammatical structures)—small models can predict them correctly. Only a few “hard” tokens truly require the large model’s capability.
2. Core Mechanism
2.1 The Draft-Then-Verify Paradigm
The core idea of Speculative Decoding:
- Draft: Use a fast Draft Model to serially generate $\gamma$ candidate tokens
- Verify: Use the Target Model to verify these tokens in parallel
- Accept/Reject: Use rejection sampling to decide which tokens to accept
Draft Model:  [x] → t1 → t2 → t3 → t4 → t5    (γ=5, serial)
                    ↓    ↓    ↓    ↓    ↓
Target Model: [x,   t1,  t2,  t3,  t4,  t5]   (parallel verify)
                    ↓    ↓    ↓    ↓    ↓
Result:            [✓]  [✓]  [✓]  [✗]  [—]    (accept 3 + resample 1)
2.2 The Rejection Sampling Algorithm
Let Draft Model distribution be $q(x)$, Target Model distribution be $p(x)$:
Acceptance probability:
\[P(\text{accept}) = \min\left(1, \frac{p(x)}{q(x)}\right)\]
Two cases:
| Case | Condition | Action |
|---|---|---|
| Draft is conservative | $q(x) \leq p(x)$ | 100% accept |
| Draft is overconfident | $q(x) > p(x)$ | Accept with probability $p(x)/q(x)$ |
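A minimal sketch of this accept/reject rule for a single position, assuming `p` and `q` are the target's and draft's probability vectors over the vocabulary (it also performs the residual resampling described next):

```python
import torch

def verify_one_token(p, q, x):
    """Accept draft token x with probability min(1, p[x]/q[x]);
    on rejection, resample from norm(max(0, p - q))."""
    accept_prob = torch.clamp(p[x] / q[x], max=1.0)
    if torch.rand(()) < accept_prob:
        return x, True                                    # draft token accepted
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()                  # normalize the residual distribution
    return torch.multinomial(residual, 1).item(), False   # corrected token from the target side
```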
Resampling on rejection:
\[x \sim \text{norm}\left(\max(0, p(x) - q(x))\right)\]
2.3 Output Invariance Guarantee
Speculative Decoding guarantees output invariance in both decoding modes:
Greedy mode:
- Draft and Target give same argmax token at a position → accept
- Different → reject, use Target’s result
- Guarantee: Output sequence is exactly identical to pure Target decoding (deterministic)
Sampling mode:
- Rejection sampling preserves the sampling distribution
- Theorem: Tokens sampled via Speculative Sampling from $p(x)$ and $q(x)$ are distributed identically to those sampled from $p(x)$ alone
This means:
- Output quality is exactly the same as the original Target Model
- Speedup is lossless
- No additional quality-speed tradeoff needed
3. The Essence of Speedup
3.1 Memory-Bound: The Fundamental Bottleneck
To understand why Speculative Decoding works, we must first understand why LLM inference is slow.
What determines inference time?
\[T_{inference} = \max(T_{compute}, T_{memory})\]
- $T_{compute}$: Time for GPU computation
- $T_{memory}$: Time to read data from HBM
The autoregressive decoding problem:
Each token generation requires:
- Reading nearly all model weights from HBM
- Reading KV Cache
- Executing matrix multiplications
Back-of-envelope estimate (order of magnitude only):
Assumptions: 70B model, FP16, A100 80GB (2 TB/s bandwidth, 312 TFLOPS)
Ignoring: KV Cache, activations, communication, kernel scheduling
Single token generation:
├── HBM read: ~140GB weights → O(100ms) order
└── Computation: ~140 GFLOPs → O(1ms) order
Bandwidth vs Compute: Two orders of magnitude difference!
⚠️ Actual latency depends on precision (FP16/INT8/INT4), parallelism (TP/PP), sequence length, kernel fusion, etc. Above is only to illustrate order-of-magnitude difference.
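The same back-of-envelope arithmetic in code, using only the assumptions stated above (nominal spec-sheet numbers, not measurements):

```python
weight_bytes    = 70e9 * 2       # 70B params × 2 bytes (FP16) ≈ 140 GB
hbm_bandwidth   = 2e12           # A100 80GB: ~2 TB/s
peak_flops      = 312e12         # A100 FP16 tensor cores: ~312 TFLOPS
flops_per_token = 2 * 70e9       # ~2 FLOPs per parameter per token ≈ 140 GFLOPs

t_memory  = weight_bytes / hbm_bandwidth     # ≈ 0.07 s    → O(100 ms) order
t_compute = flops_per_token / peak_flops     # ≈ 0.00045 s → O(1 ms) order
print(f"memory / compute ≈ {t_memory / t_compute:.0f}x")   # ≈ 156x: two orders of magnitude
```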
Arithmetic Intensity:
\[\text{AI} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}\]
| Scenario | Arithmetic Intensity | Bottleneck |
|---|---|---|
| Autoregressive (batch=1) | ~1 FLOP/Byte | Bandwidth |
| Batched (batch=128) | ~128 FLOP/Byte | Compute |
| A100 balance point | ~156 FLOP/Byte | - |
Key insight: Autoregressive decoding has arithmetic intensity far below GPU’s balance point—most compute is wasted waiting for HBM reads.
3.2 The Essence of Speculative Decoding
Core idea: Use one HBM read to compute multiple tokens
Traditional autoregressive (5 tokens):
├── Step 1: Read weights(140GB) + compute(t1) → 70ms
├── Step 2: Read weights(140GB) + compute(t2) → 70ms
├── Step 3: Read weights(140GB) + compute(t3) → 70ms
├── Step 4: Read weights(140GB) + compute(t4) → 70ms
└── Step 5: Read weights(140GB) + compute(t5) → 70ms
Total: 5 × 70ms = 350ms, HBM reads 700GB
Speculative Decoding (verify 5 tokens):
├── Draft: 5 × 7ms = 35ms (small model, cheap but not free)
└── Verify: Read weights(140GB) + compute(t1,t2,t3,t4,t5) → ~75ms
Total: ~110ms, HBM reads ~210GB (140GB Target once + 5 × 14GB Draft, assuming Draft is 10% the size)
Why does verifying 5 tokens take barely longer than 1?
| Operation | 1 token | 5 tokens | Growth |
|---|---|---|---|
| Read weights | 140GB | 140GB | 1x (unchanged!) |
| KV Cache read | K bytes | K bytes | ~1x |
| Computation | 140 GFLOPs | 700 GFLOPs | 5x |
| Total time | ~70ms | ~75ms | 1.07x |
The essence:
- Weights read only once—this is the dominant cost
- Computation increases 5x, but compute time was negligible anyway
- HBM bandwidth is the bottleneck; compute is “free”
3.3 Speedup Analysis
Theoretical speedup:
Assume:
- Target Model time per token: $T_t$
- Draft Model time per token: $T_d$ (typically $T_d \ll T_t$)
- Generate $\gamma$ draft tokens per round
- Accept $k$ tokens on average
When $T_d \ll T_t$, speedup approximately equals the average number of accepted tokens.
Expected accepted tokens (assuming independent acceptance probability $\alpha$ per position):
\[\mathbb{E}[\text{accepted}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]
| Acceptance Rate $\alpha$ | Expected at $\gamma=5$ | Actual Speedup |
|---|---|---|
| 0.5 | 1.97 | ~2x |
| 0.7 | 2.94 | ~3x |
| 0.9 | 4.69 | ~4x |
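A quick way to reproduce the table (and to see what a non-zero draft cost does), assuming an independent per-position acceptance probability α and a draft/target cost ratio c = T_d / T_t:

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per round: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float = 0.0) -> float:
    """Tokens per round divided by the round's relative cost (gamma draft steps + 1 verify)."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens/round, "
          f"speedup ≈ {speedup(alpha, 5, c=0.05):.2f}x at c=0.05")
```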
3.4 When Does It Work Well?
Deriving from the memory-bound nature:
Works well when:
| Condition | Reason |
|---|---|
| Target Model large enough (≥30B) | More memory-bound, more “free” compute |
| Draft-Target distributions similar | High acceptance rate, more tokens confirmed per round |
| Predictable output | Code, formatted text, translation have high acceptance |
Works poorly when:
| Condition | Reason |
|---|---|
| Target Model small (<7B) | Closer to compute-bound, multi-token verification has overhead |
| Large Draft-Target gap | Low acceptance rate, frequent resampling |
| Creative tasks | Unpredictable output, low acceptance rate |
Intuition: Speculative Decoding trades “multiple HBM reads by small model” for “one HBM read by Target.” If Target isn’t large/memory-bound enough, the trade isn’t worth it.
4. Draft Model Approaches
4.1 Independent Small Models
The most direct approach: use a smaller model from the same family as Draft.
| Target Model | Draft Model | Param Ratio | Source |
|---|---|---|---|
| Llama-70B | Llama-7B | 10:1 | Same family |
| Chinchilla-70B | 4B draft (same training setup) | ~18:1 | DeepMind original |
| T5-XXL (11B) | T5-small (60M) | 183:1 | Google original |
Selection principles:
- Draft should be 10-100x faster than Target
- Same-family models have closer distributions, higher acceptance rate
- Too small Draft → low acceptance rate; too large Draft → high overhead
Pros: No additional training needed, ready to use
Cons: Distribution gap may be large, limited acceptance rate
4.2 Knowledge Distillation
Use knowledge distillation to better align Draft Model with Target Model’s output distribution.
4.2.1 DistillSpec (ICLR 2024)
Paper: DistillSpec: Improving Speculative Decoding via Knowledge Distillation
Core problem: Off-the-shelf small models have large distribution gaps with Target, leading to low acceptance rates.
Two key design choices:
- On-Policy Data Generation:
- Use data generated by Draft Model itself for training
- Rather than using fixed datasets
- Reason: Draft needs to align on tokens it might actually generate
- Task-Specific Divergence Functions:
- Different tasks/decoding strategies use different KL divergence variants
- Greedy decoding: Forward KL
- Sampling: Reverse KL or JSD
Training pipeline:
1. Draft Model generates candidate sequences
2. Target Model computes probability distributions at these positions
3. Minimize divergence between Draft and Target distributions
4. Repeat until convergence
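A hedged sketch of one on-policy distillation step with a forward-KL objective, assuming HuggingFace-style `draft`/`target` model objects (the actual recipe varies the divergence by task and decoding strategy, and batches this over many sequences):

```python
import torch
import torch.nn.functional as F

def distill_step(draft, target, prompt_ids, optimizer, max_new_tokens=64):
    """Draft generates on-policy; target scores the same tokens; draft is pulled toward target."""
    with torch.no_grad():
        seq = draft.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)
        teacher_probs = F.softmax(target(seq).logits, dim=-1)    # target distribution per position
    student_log_probs = F.log_softmax(draft(seq).logits, dim=-1)
    # Forward KL: KL(target || draft) over the sequence
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```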
Results:
- 10-45% speedup improvement over standard SD
- XSum task: 6.4x latency reduction
- GSM8K task: 10.7x latency reduction
4.2.2 AdaSPEC (2025)
Core improvement: Selective token filtering
Observation: Some tokens are inherently hard to predict (proper nouns, rare words). Forcing alignment on these can hurt prediction of easy tokens.
Method:
- Use reference model to identify “hard” tokens
- Filter out these tokens during distillation
- Let Draft focus on aligning “easy” tokens
Results: Acceptance rate improves up to 15% over DistillSpec
4.3 Self-Speculative Decoding
No separate Draft Model—derive Draft from Target Model itself, “drafting for yourself.”
4.3.1 LayerSkip (ACL 2024)
Paper: LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Code: GitHub
Core idea: Skip later layers, use output from first E layers to directly predict tokens.
Three-stage approach:
Stage 1: Layer Dropout during Training
# Different dropout rates for different layers during training
# Shallow layers: low dropout (maintain stability)
# Deep layers: high dropout (enhance early exit capability)
for layer_idx, layer in enumerate(layers):
    # Dropout rate grows linearly with depth
    dropout_rate = layer_idx / num_layers * max_dropout
    x = layer(x, dropout=dropout_rate)
Stage 2: Early Exit Loss
- All layers share the same LM Head
- Compute loss at every layer during training
- Enable shallow layers to predict tokens
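A minimal sketch of the shared-head early-exit loss, assuming the per-layer hidden states, the final norm, and the shared `lm_head` are available (LayerSkip additionally weights these losses with a curriculum, omitted here):

```python
import torch.nn.functional as F

def early_exit_loss(hidden_states_per_layer, final_norm, lm_head, labels):
    """Average the next-token LM loss computed from every layer through the shared LM head."""
    total = 0.0
    for h in hidden_states_per_layer:                        # one hidden-state tensor per layer
        logits = lm_head(final_norm(h))                      # shared head turns any layer into token logits
        total = total + F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),     # prediction at position t ...
            labels[:, 1:].reshape(-1),                       # ... scored against token t+1
        )
    return total / len(hidden_states_per_layer)
```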
Stage 3: Self-Speculative Decoding
Self-Draft: First E layers → LM Head → draft tokens
Self-Verify: Remaining layers verify + full forward pass
Key optimization: Reuse KV Cache from draft stage during verification
Usage example:
# --exit_layer 8: exit at layer 8 for drafting
# --num_speculations 6: generate 6 draft tokens per round
torchrun generate.py --model facebook/layerskip-llama2-7B \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6
Results:
- CNN/DM summarization: 2.16x speedup
- Code generation: 1.82x speedup
- Semantic parsing: 2.0x speedup
Advantages:
- Only one model, no additional memory
- Draft and Target naturally aligned (same model)
- Partial computation reuse
Integration status: Integrated into HuggingFace Transformers and PyTorch TorchTune.
4.4 Additional Heads
Add lightweight prediction heads to Target Model without modifying the original model.
4.4.1 Medusa (ICML 2024)
Paper: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Code: GitHub
Core idea: Add multiple “Medusa Heads,” each predicting tokens at different future positions.
Architecture:
                   ┌─→ Head 1 → predict t+1
Hidden State (t) ──┼─→ Head 2 → predict t+2
    from LLM       ├─→ Head 3 → predict t+3
                   └─→ Head 4 → predict t+4
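A sketch of a single Medusa head following the paper's description (a residual SiLU block on the final hidden state feeding a head-specific LM head; exact initialization and sizes are omitted):

```python
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: residual SiLU block + its own LM head for position t+k."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_state):
        h = hidden_state + self.act(self.proj(hidden_state))   # residual block
        return self.lm_head(h)                                  # logits for the k-th future token
```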
Tree Attention Mechanism:
Since each head may have multiple candidates (top-k), combinations form a candidate tree:
Assume Head 1 takes top-2, Head 2 takes top-3:
Candidate tree has 2 × 3 = 6 paths
          t
        /   \
     t1a     t1b          ← Head 1 top-2 candidates for t+1
    / | \   / | \
   …  …  … …  …  …        ← Head 2 top-3 candidates for t+2 under each
Tree Attention Implementation:
- Special attention mask: each token can only see its ancestors
- Single forward pass processes all candidate paths simultaneously
- Pre-processed attention mask improves efficiency
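A minimal sketch of the ancestor-only mask for a flattened candidate tree (the indexing scheme is illustrative; real implementations precompute this once per tree shape):

```python
import torch

def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent in the flattened tree (-1 for the root).
    Node i may attend to itself and its ancestors only."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True       # allow attention from i to ancestor j (and to itself)
            j = parents[j]
    return mask

# Root with two children, each with one child: 5 nodes, 2 candidate paths
print(tree_attention_mask([-1, 0, 0, 1, 2]).int())
```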
Two training modes:
| Mode | Training Method | Effect |
|---|---|---|
| Medusa-1 | Freeze LLM, only train heads | 2.2x speedup, lossless |
| Medusa-2 | Joint fine-tune LLM + heads | 2.3-3.6x speedup |
Medusa-2 special training recipe:
- Balance between preserving original capability and acquiring speculation ability
- Progressive training strategy
Empirical data:
- Medusa heads achieve ~60% top-1 accuracy for next-next token
- But top-5 accuracy exceeds 80%
- Tree structure significantly improves acceptance rate
4.4.2 EAGLE / EAGLE-3
EAGLE Paper: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (ICML 2024)
EAGLE-3 Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (NeurIPS 2025)
Code: GitHub
Core insights:
- Feature-level autoregression is easier than token-level
- Token space is discrete and sparse
- Feature space is continuous and smooth
- Prediction in feature space is more stable
- Feature uncertainty is the bottleneck
- Token sampling results affect next-step features
- But draft stage cannot see actual sampling results
EAGLE Architecture:
Target LLM:
  Input → [...Layers...] → Top Layer Feature → LM Head → Token
                                   ↓
EAGLE Draft Head:          Feature + Token(t-1)
                                   ↓
                          Predict Feature(t+1)
                                   ↓
                            LM Head → Draft Token
Key design:
- Reuse Target Model’s top-layer features
- Draft Head input: current feature + previous sampled token
- Draft Head output: next-step feature prediction
- Use Target’s LM Head to convert feature to token
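A rough sketch of that design (module shapes are assumptions for illustration; in the paper the draft head is essentially a fusion layer plus a single transformer decoder layer, reusing the target's embeddings and LM head):

```python
import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    """Predict the next top-layer feature from (current feature, previously sampled token)."""
    def __init__(self, hidden_size, embed_tokens, decoder_layer):
        super().__init__()
        self.embed_tokens = embed_tokens        # reused from the target model
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.decoder_layer = decoder_layer      # a single transformer decoder layer

    def forward(self, feature, prev_token_ids):
        tok = self.embed_tokens(prev_token_ids)           # inject the sampled token to resolve feature uncertainty
        fused = self.fuse(torch.cat([feature, tok], dim=-1))
        return self.decoder_layer(fused)                  # predicted feature(t+1); the target's LM head maps it to a draft token
```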
Training details:
- Data: ShareGPT, 68,000 dialogues
- Learning rate: 3e-5
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Gradient clipping: 0.5
- Loss: $L = L_{reg} + w_{cls} L_{cls}$
EAGLE Parameter Count (relative to Target):
| Target Size | EAGLE Params | Ratio |
|---|---|---|
| 7B | 0.24B | 3.4% |
| 13B | 0.37B | 2.8% |
| 33B | 0.56B | 1.7% |
| 70B | 0.99B | 1.4% |
EAGLE-3 Improvements (arXiv:2503.01840, NeurIPS 2025):
- Abandons feature prediction for direct token prediction
- Multi-layer feature fusion (Training-Time Test, TTT)
- Supports larger-scale training data
- Significantly improved performance
EAGLE-3 Architecture Changes:
EAGLE-2: Feature(t) + Token(t-1) → Predict Feature(t+1)
EAGLE-3: Multi-layer Features + Token(t-1) → Direct Token Prediction
Key improvements:
- TTT (Training-Time Test): Fuses multi-layer features instead of top-layer only
- Simplified prediction target: Direct token prediction instead of feature, easier to learn
- Better generalization: More stable performance on out-of-distribution data
Results: 2.7-3.5x latency speedup on LLaMA2-Chat 70B.
SpecForge: EAGLE-3 Training Framework
SpecForge is LMSYS’s open-source EAGLE-3 training framework for efficiently training draft models at various scales.
Two Training Modes:
| Mode | Description | Use Case |
|---|---|---|
| Online | Target and Draft run together, real-time feature generation | Ample GPU, best quality |
| Offline | Pre-generate Target features, train Draft offline | Limited GPU, large-scale training |
Online Mode:
# Train with FSDP
python -m specforge.train \
--target_model meta-llama/Llama-3.1-70B-Instruct \
--mode online \
--backend fsdp \
--data_path train_data.jsonl
Offline Mode:
# Step 1: Pre-generate features
python -m specforge.generate_features \
--target_model meta-llama/Llama-3.1-70B-Instruct \
--output_path features/
# Step 2: Offline training
python -m specforge.train \
--mode offline \
--feature_path features/
Supported Backends:
- FSDP: PyTorch native distributed, suitable for multi-GPU
- Tensor Parallel: Model parallelism for very large models
- vLLM: Leverage vLLM’s efficient inference for feature generation
4.5 Draft-Free Methods
No Draft Model at all—achieve parallel decoding through algorithmic innovation.
4.5.1 Lookahead Decoding (ICML 2024)
Paper: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Code: GitHub
Core idea: View autoregressive decoding as solving nonlinear equations, use Jacobi iteration for parallel solving.
Problem with Jacobi Decoding:
Traditional Jacobi decoding barely accelerates LLMs (~1.05x) because:
- LLMs are trained autoregressively
- Given incorrect prefix, nearly impossible to predict subsequent tokens correctly
- Each Jacobi iteration typically only corrects 1 token
Lookahead’s Solution:
While single Jacobi iteration only determines 1 token, the process produces valuable n-gram byproducts.
2D Window Design:
Dimension 1: Window Size W (how far ahead in future positions)
Dimension 2: N-gram Size N (how many steps back in history)
           Time axis (Jacobi iteration steps) →
           t-3   t-2   t-1    t
  Pos 1     a     b     c     d     ← can extract 4-gram: abcd
  Pos 2     e     f     g     h
  Pos 3     i     j     k     l
  Pos 4     m     n     o     p
    ↑
Sequence axis
Two Parallel Branches:
- Lookahead Branch:
- Maintains 2D window
- Updates predictions at all positions each step
- Collects n-grams from trajectory into candidate pool
- Verification Branch:
- Selects n-grams from pool matching first token
- Verifies these candidates in parallel
- Accepts longest valid prefix
Algorithm flow:
import lade

# Configuration
lade.config_lade(
    LEVEL=5,            # N-gram size N
    WINDOW_SIZE=7,      # Window size W
    GUESS_SET_SIZE=7,   # Candidate pool size G
)

# Each decoding step:
# 1. Lookahead branch: generate predictions for W positions in parallel
# 2. Collect newly generated n-grams into pool
# 3. Verification branch: verify matching n-grams
# 4. Accept longest valid prefix, update window
Results:
- MT-bench: 1.8x speedup
- Code generation (multi-GPU): 4x speedup
- No extra models or data needed
4.5.2 Prompt Lookup Decoding
Core idea: Find n-grams from prompt that match current generation.
Suitable scenarios:
- Summarization: output often contains input fragments
- Editing: most content remains unchanged
- Code completion: variable names, function names repeat
Implementation:
# vLLM configuration
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",        # Enable n-gram lookup
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,          # Match at most 4-grams
)
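Under the hood this is plain n-gram matching over the prompt; a minimal sketch of the core idea (not vLLM's actual implementation):

```python
def prompt_lookup(prompt_ids, generated_ids, max_ngram=4, num_draft=5):
    """Match the longest suffix of the generated tokens against the prompt and
    propose the tokens that followed that match as draft tokens."""
    for n in range(min(max_ngram, len(generated_ids)), 0, -1):
        suffix = generated_ids[-n:]
        for start in range(len(prompt_ids) - n):
            if prompt_ids[start:start + n] == suffix:
                follow = prompt_ids[start + n:start + n + num_draft]
                if follow:
                    return follow       # hand these to the target model for parallel verification
    return []                           # no match: fall back to ordinary decoding
```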
Results: vLLM benchmark shows 2.8x speedup on summarization.
Pros: Zero additional overhead, no training needed
Limitation: Only effective when prompt overlaps with output
5. Verification Strategies
5.1 Sequential Verification
Original method: verify left-to-right, stop at first rejection.
Draft:   t1   t2   t3   t4   t5
         ✓    ✓    ✗    -    -
Accept:  t1, t2 + resample t3'
Problem: One rejection wastes all subsequent drafts.
5.2 Tree-based Verification
SpecInfer (ASPLOS 2024):
Organize candidates as a Token Tree, not a linear sequence:
          t1
        /  |  \
      t2  t2'  t2''
     / |         |
   t3  t3'      t3''
Advantages:
- Significantly higher verification success rate (52-57% → 96-97%)
- Single forward pass verifies entire tree
- 1.5-2.8x (distributed) / 2.6-3.5x (offloading) speedup
Implementation: Use Tree Attention to process all paths in parallel.
5.3 Block Verification
Observation: Independent token-by-token verification is not optimal.
Block Verification:
- Jointly verify entire block
- Exploits statistical dependencies between tokens
- Accepts more tokens than independent verification
6. Method Summary
6.1 Seminal Works
| Work | Date | Institution | Contribution |
|---|---|---|---|
| Fast Inference via Speculative Decoding | 2022-11 | Google | First proposed Speculative Decoding |
| Speculative Sampling | 2023-02 | DeepMind | Independent proposal, Chinchilla 2-2.5x |
6.2 Draft Model Improvements
| Work | Core Idea | Effect |
|---|---|---|
| DistillSpec | Knowledge distillation alignment | +10-45% |
| Online Speculative Decoding | Online Draft updates | Adapts to distribution shift |
| Draft & Verify | Self-speculative | No extra model |
| LayerSkip | Layer skipping | Computation reuse |
6.3 Additional Heads
| Work | Core Idea | Speedup |
|---|---|---|
| Medusa | Multi-head + tree attention | 2.2-3.6x |
| EAGLE | Feature-level prediction head | 2-3x |
| EAGLE-3 | Training-time test optimization | SOTA |
| Hydra | Multi-head variant | - |
6.4 Verification Optimization
| Work | Core Idea | Effect |
|---|---|---|
| SpecInfer | Token Tree + Tree Attention | 2.6-3.5x |
| Block Verification | Joint block verification | Higher acceptance |
| Staged Speculative Decoding | Multi-stage verification | - |
6.5 Draft-Free
| Work | Core Idea | Speedup |
|---|---|---|
| Lookahead Decoding | Jacobi iteration + n-gram cache | 1.5-2.3x |
| Prompt Lookup | Find n-grams from prompt | 2.8x (summarization) |
| REST | Retrieval-augmented | - |
7. Practical Deployment
7.1 Framework Support
| Framework | Supported Methods |
|---|---|
| vLLM | Draft model, Prompt lookup, Medusa, EAGLE |
| TensorRT-LLM | Draft model, Medusa |
| SGLang | Draft model, EAGLE |
| HuggingFace | Assisted generation |
7.2 vLLM Usage Example
from vllm import LLM, SamplingParams

# Draft model-based
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)

# Prompt lookup (no draft model needed)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
7.3 When to Use?
Recommended:
- Target Model ≥ 30B parameters
- Highly predictable output (code, formatted, translation)
- Latency-sensitive scenarios
- Suitable Draft Model available
Not recommended:
- Smaller Target Model (< 7B)
- Highly creative tasks
- No suitable Draft Model and draft-free methods inapplicable
7.4 SpecBundle: Production-Grade EAGLE-3 Models
SpecBundle is LMSYS’s collection of production-grade EAGLE-3 draft models trained with SpecForge.
Phase 1 Release (2025-12):
| Target Model | Draft Model | Speedup | Model Link |
|---|---|---|---|
| Llama-3.1-8B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Llama-3.1-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Llama-3.3-70B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| Qwen2.5-7B-Instruct | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Qwen2.5-32B-Instruct | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Qwen2.5-72B-Instruct | EAGLE-3 | 3.0-4.0x | HuggingFace |
| DeepSeek-V3 | EAGLE-3 | 3.0-3.5x | HuggingFace |
| Gemma-2-9B-it | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Gemma-2-27B-it | EAGLE-3 | 2.8-3.5x | HuggingFace |
| Mistral-7B-Instruct-v0.3 | EAGLE-3 | 2.5-3.0x | HuggingFace |
| Mistral-Large-2 | EAGLE-3 | 3.0-3.5x | HuggingFace |
Key features:
- Production-ready: Thoroughly tested, ready for production deployment
- Wide coverage: Supports major open-source model families
- Significant speedup: Large models (70B+) achieve up to 4x speedup
- Plug-and-play: Seamless integration with vLLM and SGLang
Usage Example (vLLM):
from vllm import LLM, SamplingParams

# Use SpecBundle's EAGLE-3 model
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="lmsys/Llama-3.1-70B-Instruct-EAGLE3",
    num_speculative_tokens=5,
)

# Normal inference
output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0.7, max_tokens=512),
)
vLLM Speculators Training Support:
Starting from vLLM v0.3.0, end-to-end EAGLE-3 training is supported via vllm-speculators:
# Install
pip install vllm-speculators
# Train EAGLE-3
python -m vllm_speculators.train_eagle3 \
--target_model meta-llama/Llama-3.1-8B-Instruct \
--output_path ./my-eagle3-model \
--data_path train_data.jsonl
8. Method Taxonomy
| Category | Core Idea | Representative Works |
|---|---|---|
| Independent Draft | Use small model for drafts | Google/DeepMind original, DistillSpec |
| Self-Speculative | Derive Draft from Target | LayerSkip, Draft&Verify, SPEED |
| Additional Heads | Add prediction heads to Target | Medusa, EAGLE, Hydra |
| Tree Verification | Tree candidates + parallel verify | SpecInfer |
| Draft-Free | No Draft Model | Lookahead, Prompt Lookup, REST |
References
Seminal Works
- Fast Inference from Transformers via Speculative Decoding: arXiv:2211.17192
- Accelerating LLM Decoding with Speculative Sampling: arXiv:2302.01318
Draft Model Improvements
- DistillSpec: arXiv:2310.08461
- Online Speculative Decoding: arXiv:2310.07177
- Draft & Verify: arXiv:2309.08168
Additional Heads
- Medusa: arXiv:2401.10774
- EAGLE: arXiv:2401.15077 (ICML 2024)
- EAGLE-3: arXiv:2503.01840 (NeurIPS 2025)
Verification Optimization
- SpecInfer: arXiv:2305.09781
- Block Verification: OpenReview
Draft-Free
- Lookahead Decoding: arXiv:2402.02057
Surveys
- Comprehensive Survey of Speculative Decoding: arXiv:2401.07851
- Decoding Speculative Decoding: arXiv:2402.01528
Training Ecosystem
- SpecForge: GitHub
- SpecBundle: LMSYS Blog
- vLLM Speculators: GitHub
Resources
- Speculative Decoding Papers: GitHub
- vLLM Speculative Decoding: Blog
- Google Research Blog: Looking back at speculative decoding