Transformer Notes (VII): Deployment Optimization
2025-12-20 · Qi Lu
This is the seventh article in the Transformer series, providing a comprehensive analysis of deployment optimization techniques for large language models, including model quantization and inference acceleration. These techniques are key to efficiently deploying hundred-billion-parameter models in practical applications.
1. Model Quantization
Model quantization is a technique that converts floating-point representations in neural networks to low-precision representations, and is one of the core technologies for efficient deployment of large language models.
1.1 Why Quantization is Needed
The parameter scale of modern LLMs brings severe deployment challenges:
- Memory Requirements: A 70B parameter model stored in FP16 requires 140GB of GPU memory
- Bandwidth Bottleneck: Inference is primarily limited by memory bandwidth rather than computation
- Energy Costs: Energy consumed by data movement far exceeds computation itself
Storage characteristics of different precisions:
| Precision | Bit Width | Relative Memory | Typical Use |
|---|---|---|---|
| FP32 | 32 | 1× | Training gradient accumulation |
| FP16/BF16 | 16 | 0.5× | Standard training and inference |
| FP8 | 8 | 0.25× | Efficient training (Hopper+) |
| INT8 | 8 | 0.25× | Quantized inference |
| INT4 | 4 | 0.125× | Aggressive quantized inference |
1.2 Mathematical Definition of Quantization
Uniform Quantization is the most commonly used quantization method. Given a floating-point number $x$, the quantization process is:
\[Q(x) = \text{clamp}\left( \left\lfloor \frac{x}{s} \right\rceil + z, 0, 2^b - 1 \right)\]where $s$ is the scale factor, $z$ is the zero-point, and $b$ is the target bit width.
Dequantization recovers the approximate value:
\[\hat{x} = s \cdot (Q(x) - z)\]Symmetric Quantization vs Asymmetric Quantization:
- Symmetric quantization: $z = 0$, simpler implementation
- Asymmetric quantization: allows $z \neq 0$, more effective for skewed distributions
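To make the mapping concrete, here is a minimal NumPy sketch of the quantize/dequantize pair defined above, using asymmetric INT8 parameters derived from an observed range (the variable names and example data are illustrative):

```python
import numpy as np

def quantize(x, s, z, bits=8):
    # Q(x) = clamp(round(x / s) + z, 0, 2^b - 1)
    return np.clip(np.round(x / s) + z, 0, 2 ** bits - 1).astype(np.uint8)

def dequantize(q, s, z):
    # x_hat = s * (Q(x) - z)
    return s * (q.astype(np.float32) - z)

# asymmetric INT8 parameters from an observed range [x_min, x_max]
x = np.random.randn(1024).astype(np.float32) * 3 + 1.0
x_min, x_max = x.min(), x.max()
s = (x_max - x_min) / 255            # scale covering the full observed range
z = np.round(-x_min / s)             # zero-point so that x_min maps to code 0
x_hat = dequantize(quantize(x, s, z), s, z)
print(np.abs(x - x_hat).max())       # worst-case error is roughly s / 2
```

The worst-case rounding error is about half the scale $s$, which is why keeping the quantization range tight matters so much in the calibration methods discussed below.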
1.3 Quantization Granularity
The computation granularity of quantization parameters $s, z$ affects the trade-off between accuracy and overhead:
- Per-Tensor: The entire tensor shares one set of parameters, minimal overhead but large accuracy loss
- Per-Channel: Each output channel is quantized independently, commonly used for weights
- Per-Token: Each token is quantized independently, commonly used for activations
- Per-Group: Channels are divided into groups, each group quantized independently, a compromise between accuracy and overhead
Group Quantization Example: With a group size of 128, the effective bit width for INT4 quantization is approximately $4 + 32/128 = 4.25$ bits.
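A minimal sketch of per-group symmetric INT4 quantization, including the effective-bit-width accounting from the example above (group size and helper name are illustrative):

```python
import numpy as np

def quantize_groups_int4(w_row, group_size=128):
    """Per-group symmetric INT4 quantization of one weight row (illustrative sketch)."""
    groups = w_row.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 7, 1e-8)  # one FP32 scale per group
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)             # 4-bit codes in [-8, 7]
    return q, scales

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_groups_int4(w)
bits_per_weight = 4 + 32 / 128     # 4-bit code plus one FP32 scale amortized over 128 weights
print(bits_per_weight)             # 4.25
```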
1.4 Post-Training Quantization (PTQ)
Post-training quantization is performed after model training is complete, requires no retraining, and is the mainstream method for LLM quantization.
Basic PTQ Process:
- Calibration: Use a small amount of representative data to collect activation distribution statistics
- Determine quantization parameters: Calculate $s, z$ based on statistical information
- Quantize weights: Convert floating-point weights to low-precision representation
- (Optional) Correction: Reduce quantization error through additional optimization
Calibration Strategies:
- MinMax Calibration: Uses observed maximum and minimum values, sensitive to outliers
- Percentile Calibration: Uses p-th and (100-p)-th percentiles
- MSE Calibration: Minimizes quantization error
- KL Divergence Calibration: Minimizes KL divergence between original and quantized distributions
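The strategies above differ mainly in how they set the quantization range. A small NumPy comparison of MinMax versus percentile calibration on data with a single outlier (values and percentile are illustrative):

```python
import numpy as np

def minmax_scale(x, bits=8):
    # symmetric scale from the absolute maximum: a single outlier dominates it
    return np.abs(x).max() / (2 ** (bits - 1) - 1)

def percentile_scale(x, bits=8, p=99.9):
    # clip the range at the p-th percentile of |x| to ignore rare outliers
    return np.percentile(np.abs(x), p) / (2 ** (bits - 1) - 1)

# calibration activations with one large outlier
acts = np.concatenate([np.random.randn(10_000), [80.0]])
print(minmax_scale(acts), percentile_scale(acts))   # the MinMax scale is ~25x larger
```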
1.5 Challenges in LLM Quantization: Activation Outliers
A key characteristic of LLMs is the presence of outliers in activations: a very small number of channels contain activation values far larger than other channels. These outliers:
- Appear in specific channels, consistent across tokens
- Can be 100 times larger than normal values
- Removing these channels causes model performance to collapse
Standard quantization methods are forced to expand the quantization range to cover outliers, resulting in severe degradation of quantization accuracy for normal values.
1.6 SmoothQuant
SmoothQuant is a breakthrough method for solving the activation outlier problem, achieving W8A8 quantization for LLMs.
Core Idea: Migrate the quantization difficulty from activations to weights:
\[Y = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W) = \hat{X} \hat{W}\]where $s$ is the migration factor. $\hat{X}$ has a more uniform distribution and is easier to quantize; $\hat{W}$ absorbs part of the difficulty, but weights themselves have good distribution, so the impact is limited.
Migration Factor Selection:
\[s_j = \frac{\max(|X_j|)^\alpha}{\max(|W_j|)^{1-\alpha}}\]where $\alpha = 0.5$ works well for most models.
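A minimal sketch of the smoothing step under these formulas, assuming per-channel absmax statistics collected offline (function name, `eps`, and the demo shapes are illustrative, not SmoothQuant's actual API):

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-5):
    """Migrate quantization difficulty from activations to weights for Y = X @ W."""
    x_absmax = np.abs(X).max(axis=0)                       # per input-channel activation range
    w_absmax = np.abs(W).max(axis=1)                       # per input-channel weight range
    s = np.clip(x_absmax ** alpha / (w_absmax ** (1 - alpha) + eps), eps, None)
    X_hat = X / s                                          # activations become easier to quantize
    W_hat = W * s[:, None]                                 # weights absorb part of the difficulty
    assert np.allclose(X @ W, X_hat @ W_hat)               # the product is mathematically unchanged
    return X_hat, W_hat

X = np.random.randn(512, 1024)                             # calibration activations (tokens x channels)
W = np.random.randn(1024, 4096)
X_hat, W_hat = smooth(X, W)
```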
1.7 GPTQ
GPTQ is a weight quantization method based on second-order information that can compress LLMs to 4-bit precision.
Problem Formulation: Optimize layer by layer to minimize output error after quantization:
\[\arg\min_{\hat{W}} \| WX - \hat{W}X \|_2^2\]Performance:
- 175B parameter models can be quantized in 4 hours on a single GPU
- 3-4 bit quantization with minimal accuracy loss (perplexity increase < 0.5)
- Widely used for quantized distribution of open-source LLMs
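For intuition, here is a heavily simplified sketch of GPTQ's column-by-column quantization with error compensation for the objective above; it omits the blocking, lazy batch updates, and grouping used in the real implementation, and the damping value and function name are illustrative:

```python
import numpy as np

def gptq_quantize_layer(W, X, bits=4, damp=0.01):
    """Column-wise quantization with error compensation for min ||WX - W_hat X||^2.

    W: (d_out, d_in) weights; X: (d_in, n_samples) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = 2.0 * X @ X.T                                      # layer-wise Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)         # damping for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T          # upper Cholesky factor of H^{-1}

    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1) / qmax, 1e-8) # per-row symmetric scale
    Q = np.zeros_like(W)
    for j in range(d_in):                                  # quantize columns in a fixed order
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])     # compensate the remaining columns
    return Q
```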
1.8 AWQ
AWQ (Activation-aware Weight Quantization) is based on a key observation: the importance of different weight channels varies greatly.
Core Observation: If certain channels of activation values $X$ have larger numerical values, the quantization error of corresponding weight channels has greater impact.
Method: Scale up the important weight channels before quantization and fold the inverse scale into the preceding operation, so their relative quantization error shrinks. At 4-bit quantization, AWQ's perplexity is typically comparable to or better than GPTQ's.
1.9 GGUF Format
GGUF is a model storage format defined by the llama.cpp project, widely used for local LLM deployment.
| Format | Effective Bit Width | Accuracy Loss |
|---|---|---|
| Q8_0 | 8.5 bits | Minimal |
| Q5_K_M | 5.5 bits | Small |
| Q4_K_M | 4.8 bits | Medium |
| Q4_0 | 4.5 bits | Larger |
| Q2_K | 3.4 bits | Large |
“K” indicates the use of k-quant method, using higher precision for important layers.
1.10 FP8 Quantization
FP8 is an 8-bit floating-point format that preserves dynamic range compared to INT8.
Two Formats:
- E4M3: 4-bit exponent, 3-bit mantissa, higher precision, suitable for weights and activations in the forward pass
- E5M2: 5-bit exponent, 2-bit mantissa, larger dynamic range, suitable for gradients
DeepSeek-V3 demonstrates industrial application of FP8 training:
- Memory bandwidth requirement reduced by half
- FP8 throughput on H100 is 2× that of FP16
- Training requires only 2.788M H800 GPU hours
1.11 KV Cache Quantization
The memory bottleneck for long-context inference comes mainly from KV Cache rather than model weights.
KV Cache Size: Taking a LLaMA-2-70B-style configuration (80 layers, 8 KV heads under GQA, head dimension 128) as an example:
- 100K-token context, batch=1, FP16: approximately 33GB
- Same settings using INT4: approximately 8GB
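A back-of-envelope calculator for the numbers above (the helper name is illustrative; it assumes standard attention with one K and one V vector per KV head per layer):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA-2-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_cache_bytes(80, 8, 128, seq_len=100_000, bytes_per_elem=2)
int4 = fp16 / 4                    # 4 bits vs 16 bits per element
print(f"FP16 KV cache: {fp16 / 1e9:.1f} GB, INT4: {int4 / 1e9:.1f} GB")
# -> FP16 KV cache: 32.8 GB, INT4: 8.2 GB
```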
KIVI Method:
- Key: per-channel quantization
- Value: per-token quantization
- Can compress to 2-bit, reducing memory by 8×
2. Inference Optimization
LLM inference faces unique challenges: an enormous parameter count, token-by-token autoregressive generation, and a KV Cache that grows with sequence length.
2.1 Two-Phase Inference
Prefill Phase: Process all tokens of the input prompt
- Computation characteristics: Parallel processing, compute-bound
- Bottleneck: Computational load of matrix multiplication
- Metric: Time To First Token (TTFT)
Decode Phase: Generate subsequent tokens one by one
- Computation characteristics: Autoregressive generation, memory-bound
- Bottleneck: Memory bandwidth for loading model parameters and KV Cache
- Metric: Tokens Per Second (TPS)
| Characteristic | Prefill | Decode |
|---|---|---|
| Token Count | N (input length) | 1 per step |
| Computation Mode | Parallel | Serial |
| Bottleneck | Computation | Memory Bandwidth |
| GPU Utilization | High | Low |
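The asymmetry in this table has a simple back-of-envelope explanation: at small batch sizes, every decode step must stream the full set of weights (plus the KV Cache) from GPU memory, so memory bandwidth caps the token rate long before compute does. A rough estimate, assuming a 70B FP16 model and roughly H100-class HBM bandwidth (figures are approximate):

```python
def max_decode_tps(model_bytes, kv_bytes_per_step, hbm_bandwidth_bytes_per_s):
    # every decode step must read the weights (and KV cache) from HBM at least once
    return hbm_bandwidth_bytes_per_s / (model_bytes + kv_bytes_per_step)

# 70B model in FP16 (~140 GB) on a GPU with ~3.3 TB/s HBM bandwidth
print(f"~{max_decode_tps(140e9, 0, 3.3e12):.0f} tokens/s upper bound at batch size 1")
# -> ~24 tokens/s: far below what the compute units could sustain, hence memory-bound
```

This is also why quantization (smaller `model_bytes`) and batching (amortizing the weight reads over many requests) translate directly into decode throughput.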
2.2 Continuous Batching
Problem with traditional static batching: requests vary in length, so short requests must wait for the longest request in the batch to finish, and padding wastes compute.
Continuous Batching dynamically manages requests:
- Resources are released immediately after request completion, new requests join immediately
- No need to wait for entire batch to complete
- Iteration-level scheduling, rather than request-level
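A toy sketch of iteration-level scheduling, assuming request objects expose a `finished` flag and `step_fn` runs one decode iteration for the active batch (both are hypothetical stand-ins, not any engine's real API):

```python
from collections import deque

def continuous_batching_loop(waiting: deque, step_fn, max_batch=32):
    """Toy iteration-level scheduler; real engines also handle prefill scheduling and memory limits."""
    active = []
    while waiting or active:
        # admit new requests as soon as there is room, without waiting for the batch to drain
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        step_fn(active)                                   # one decode iteration for the whole batch
        active = [r for r in active if not r.finished]    # finished requests leave immediately
```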
2.3 PagedAttention and vLLM
vLLM introduces PagedAttention, borrowing the idea of virtual memory from operating systems to manage KV Cache.
Problems with Traditional KV Cache:
- Pre-allocate according to maximum sequence length, causing memory waste
- Different requests have varying lengths, creating fragmentation
- Cannot dynamically expand, limiting concurrent request count
PagedAttention Principle:
- Divide KV Cache into fixed-size Pages (blocks)
- Pages can be stored non-contiguously (similar to virtual memory)
- Allocate on demand, release when finished
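A toy block-table allocator illustrating the paging idea above (class and method names are hypothetical, not vLLM's API):

```python
class PagedKVAllocator:
    """Toy block-table allocator sketching the PagedAttention idea."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical KV blocks on the GPU
        self.block_tables = {}                      # request id -> list of physical block ids
        self.num_tokens = {}                        # request id -> tokens written so far

    def append_token(self, req: str) -> int:
        """Reserve cache space for one new token; returns the physical block used."""
        tokens = self.num_tokens.get(req, 0)
        table = self.block_tables.setdefault(req, [])
        if tokens % self.block_size == 0:           # last block is full -> grab a new one
            table.append(self.free_blocks.pop())    # blocks need not be contiguous
        self.num_tokens[req] = tokens + 1
        return table[-1]

    def free(self, req: str) -> None:
        """Return all blocks of a finished request to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(req, []))
        self.num_tokens.pop(req, None)

alloc = PagedKVAllocator(num_blocks=1024, block_size=16)
for _ in range(40):                 # a 40-token request occupies ceil(40/16) = 3 blocks
    alloc.append_token("req-1")
alloc.free("req-1")                 # blocks return to the pool as soon as the request finishes
```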
vLLM Performance:
- Compared to HuggingFace Transformers, throughput improves by up to 24×
- GPU memory utilization close to 100% (no fragmentation)
2.4 Prefix Caching and SGLang
Many application scenarios have shared prefixes (System Prompt, Few-shot examples, etc.).
RadixAttention: SGLang uses a Radix Tree to manage KV Cache:
- Each edge of the tree corresponds to a token sequence segment
- Requests sharing prefixes share KV Cache
- LRU strategy manages cache eviction
SGLang Features:
- RadixAttention: Automatic Prefix Caching
- Structured output: Compressed finite state machine accelerates JSON generation
- On prefix-heavy and structured-output workloads, throughput can reach 5-6× that of vLLM
2.5 Speculative Decoding
Speculative Decoding is an important technique for accelerating autoregressive generation. The core idea is “first use a small model to quickly guess, then use a large model to batch verify”.
Workflow:
- Draft Phase: Use small model to autoregressively generate K candidate tokens
- Verify Phase: Input K tokens in parallel to large model for verification
- Accept/Reject: Draft tokens that the large model agrees with are accepted in order; at the first divergence, the correct token is taken from the large model and the remaining draft tokens are discarded
Key Guarantee: Even if all drafts are wrong, at least 1 correct token can be obtained from the large model.
Verification Mechanism in Sampling Scenarios:
Let the draft model’s distribution for generating token $x$ at position $t$ be $q(x)$, and the target model’s distribution be $p(x)$. The acceptance probability is:
\[a(x) = \min\left(1, \frac{p(x)}{q(x)}\right)\]When rejected, resample from the residual distribution:
\[p'(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))}\]This ensures that the final output distribution is strictly equal to $p(x)$ (unbiased).
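A minimal sketch of the verification step under this sampling rule, assuming the draft and target distributions at each drafted position are available as arrays (names and shapes are illustrative):

```python
import numpy as np

def verify_drafts(draft_tokens, q_probs, p_probs, rng=None):
    """Sketch of speculative verification: q_probs[i] and p_probs[i] are the draft
    and target distributions (shape [vocab]) at drafted position i."""
    rng = rng or np.random.default_rng()
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        if rng.random() < min(1.0, p[x] / q[x]):        # accept with probability min(1, p(x)/q(x))
            accepted.append(x)
        else:
            residual = np.maximum(p - q, 0.0)           # resample from the normalized residual
            accepted.append(rng.choice(len(p), p=residual / residual.sum()))
            break                                        # later draft positions are discarded
    return accepted
```

In the full algorithm, if every draft token is accepted, one extra token is additionally sampled from the target distribution at the next position, which is where the "at least one token per verification step" guarantee comes from.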
EAGLE Series:
| Method | Additional Model | Training Requirement | Speedup |
|---|---|---|---|
| Independent Draft | Yes | None | 2-3× |
| EAGLE-1 | No | Train Head | 2.5-3× |
| EAGLE-2 | No | Train Head | 3-4× |
| EAGLE-3 | No | Train Head | 4-5× |
EAGLE’s core innovation is using the target model’s hidden states to guide draft generation, significantly improving guess accuracy.
2.6 KV Cache Compression
KV Cache Quantization:
- INT8/FP8: 50% memory reduction, minimal accuracy loss
- 2-4 bit (KVQuant, KIVI): 4-8× memory reduction
KV Cache Sparsification:
- H2O: Dynamically identify important tokens, retain “Heavy Hitters”
- SnapKV: Select important KV based on observation window, 16K input can achieve 3.6× speedup
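A generic sketch of heavy-hitter selection by accumulated attention mass, in the spirit of H2O but not its exact scoring or eviction policy (names and the keep ratio are illustrative):

```python
import numpy as np

def heavy_hitter_mask(attn_weights, keep_ratio=0.2):
    """Select which cached tokens to keep based on accumulated attention mass."""
    # attn_weights: (n_queries, n_keys) attention probabilities from recent decoding steps
    scores = attn_weights.sum(axis=0)                    # attention each cached token received
    k = max(1, int(keep_ratio * scores.size))
    keep = np.zeros(scores.size, dtype=bool)
    keep[np.argsort(scores)[-k:]] = True                 # retain the top-k "heavy hitters"
    return keep
```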
2.7 Inference Engine Comparison
| Engine | Core Technology | Prefix Cache | Speculative Decoding | Features |
|---|---|---|---|---|
| vLLM | PagedAttention | Supported | Supported | Most widely used |
| SGLang | RadixAttention | Native | EAGLE etc. | Fast structured output |
| TensorRT-LLM | Deep optimization | Supported | Multiple | NVIDIA official |
| llama.cpp | CPU optimization | Limited | Supported | Local deployment |
3. Best Practices
3.1 Quantization Strategy Selection
| Scenario | Recommended Solution |
|---|---|
| Sufficient memory | FP16/BF16, no accuracy loss |
| General deployment | INT8 or FP8, minimal accuracy loss |
| Edge devices | INT4 (GPTQ/AWQ), acceptable accuracy loss |
| Extreme compression | 2-3 bit, need careful task impact evaluation |
3.2 Inference Optimization Strategies
Latency-Priority Scenarios:
- Use Prefix Caching (SGLang)
- Speculative decoding (EAGLE)
- Small batch size + high parallelism
Throughput-Priority Scenarios:
- Continuous Batching (vLLM)
- Large batch size
- KV Cache quantization
Long-Context Scenarios:
- KV Cache quantization (KVQuant, KIVI)
- KV Cache sparsification (SnapKV, H2O)
- Prefill-Decode separation
4. Summary
This article provides a comprehensive analysis of the two core technologies for large model deployment optimization:
| Domain | Key Technology | Effect |
|---|---|---|
| Model Quantization | GPTQ/AWQ (4-bit) | 4× model size reduction |
| Model Quantization | SmoothQuant (W8A8) | 1.5× inference speedup |
| KV Cache | KIVI (2-bit) | 8× memory reduction |
| Batching | PagedAttention | 24× throughput improvement |
| Decoding Acceleration | EAGLE-3 | 4-5× speedup |
| Prefix Cache | RadixAttention | 5-6× throughput improvement |
These techniques enable hundred-billion-parameter models to run efficiently on limited hardware resources, and are key infrastructure for large model deployment.
The next article, also the last in this series, will discuss Advanced Applications, including multimodal and reasoning-enhanced technologies.