LLM Notes

Notes on LLMs and Reinforcement Learning: deep dives into Transformer, RLHF, PPO, DPO, and more

Transformer Notes (III): Attention Mechanisms

2025-12-20 · Qi Lu

The attention mechanism is the core of Transformers, but the standard $O(N^2)$ complexity becomes a bottleneck for long-context modeling. This post systematically introduces four optimization paths: FlashAttention accelerates computation through IO-aware optimization; MLA reduces KV Cache through low-rank compression; sparse attention only computes important token pairs; linear attention fundamentally changes the computation form.

1. FlashAttention: IO-Aware Efficient Attention

The Roofline model reveals a key insight: modern GPUs have computational power far exceeding memory bandwidth. What determines performance is often not “how fast can we compute” but “how fast can data arrive.” Standard attention is a typical case of this bottleneck—it needs to repeatedly write $O(N^2)$ intermediate results to GPU memory, and these IO operations become the real performance killers.

FlashAttention’s core idea is to reorganize the computation order so that the attention matrix never leaves the GPU’s high-speed cache. This is not a mathematical approximation but an exactly equivalent rearrangement—we trade more computation for less memory access, which suits modern GPUs that have excess compute but scarce bandwidth.

1.1 Motivation: GPU Memory Hierarchy

Modern GPUs have computational capabilities far exceeding memory bandwidth. Taking the NVIDIA A100 as an example: FP16 Tensor Core peak throughput is 312 TFLOPS, while HBM bandwidth is roughly 2 TB/s.

Arithmetic Intensity is defined as FLOPs per byte of memory access:

\[\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Memory Access Bytes}}\]

For A100, reaching peak compute requires arithmetic intensity $\geq 156$ FLOPs/Byte. Standard attention’s arithmetic intensity is far below this, making it memory-bound.

IO Problem of Standard Attention:

Standard attention implementation requires multiple HBM reads and writes:

  1. Read $Q, K$ from HBM, compute $S = QK^\top$, write back to HBM
  2. Read $S$ from HBM, compute $P = \text{softmax}(S)$, write back to HBM
  3. Read $P, V$ from HBM, compute $O = PV$, write back to HBM

Total HBM access: $O(N^2 + Nd)$, where $N$ is sequence length and $d$ is head dimension. For long sequences, the $N^2$ term dominates, causing severe IO bottlenecks.
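In code, the three-step flow looks like this (a minimal PyTorch sketch of the standard implementation; on a real GPU, each commented step is a full HBM round trip):

import torch

def naive_attention(Q, K, V):
    # Step 1: S = QK^T — an (N, N) matrix written back to HBM
    S = Q @ K.T / Q.shape[-1] ** 0.5
    # Step 2: P = softmax(S) — another full (N, N) round trip
    P = torch.softmax(S, dim=-1)
    # Step 3: O = PV — reads the (N, N) matrix one more time
    return P @ V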

1.2 FlashAttention v1: Attention Matrix That Never Materializes

FlashAttention’s core idea: Never write the complete $N \times N$ attention matrix to HBM.

Online Softmax Algorithm

Standard Softmax computation requires two passes:

\[m = \max_j(x_j), \quad \ell = \sum_j e^{x_j - m}, \quad \text{softmax}(x)_i = \frac{e^{x_i - m}}{\ell}\]

Online Softmax enables single-pass, incremental computation. For concatenation of two blocks $x^{(1)}, x^{(2)}$:

\[m^{(1)} = \max_j(x_j^{(1)}), \quad \ell^{(1)} = \sum_j e^{x_j^{(1)} - m^{(1)}}\]

\[m^{(2)} = \max_j(x_j^{(2)}), \quad \ell^{(2)} = \sum_j e^{x_j^{(2)} - m^{(2)}}\]

\[m^{new} = \max(m^{(1)}, m^{(2)}), \quad \ell^{new} = e^{m^{(1)} - m^{new}} \ell^{(1)} + e^{m^{(2)} - m^{new}} \ell^{(2)}\]

Output can also be incrementally updated:

\[O^{new} = \frac{1}{\ell^{new}} \left[ e^{m^{old} - m^{new}} \ell^{old} \cdot O^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)} \cdot O^{(j)} \right]\]
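A quick toy verification of the merge rule: combining two blocks' $(m, \ell)$ statistics reproduces the exact softmax.

import torch

x1, x2 = torch.randn(8), torch.randn(8)

# Per-block statistics
m1, l1 = x1.max(), torch.exp(x1 - x1.max()).sum()
m2, l2 = x2.max(), torch.exp(x2 - x2.max()).sum()

# Merge rule: new max, plus rescaled normalizers from each block
m = torch.maximum(m1, m2)
l = torch.exp(m1 - m) * l1 + torch.exp(m2 - m) * l2

x = torch.cat([x1, x2])
print(torch.allclose(torch.softmax(x, dim=0), torch.exp(x - m) / l))  # True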

Tiling Algorithm

FlashAttention divides $Q, K, V$ into blocks of size $B_r \times d$ and $B_c \times d$:

FlashAttention Forward Pass Algorithm (Algorithm 1 in the paper; a Python sketch of the core loop follows).
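A minimal single-head PyTorch sketch of the tiled forward pass, assuming everything fits in memory (on a real GPU each tile lives in SRAM, and the block sizes $B_r, B_c$ are tuned to the SRAM budget):

import torch

def flash_attention_forward(Q, K, V, Br=64, Bc=64):
    # Tiled attention with online softmax; the full (N, N) matrix is never built
    N, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)
    m = torch.full((N, 1), float("-inf"))  # running row maxima
    l = torch.zeros(N, 1)                  # running row normalizers

    for j in range(0, N, Bc):              # loop over K/V tiles
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):          # loop over Q tiles
            S = Q[i:i + Br] @ Kj.T * scale            # only a (Br, Bc) tile
            m_new = torch.maximum(m[i:i + Br], S.max(-1, keepdim=True).values)
            P = torch.exp(S - m_new)
            alpha = torch.exp(m[i:i + Br] - m_new)    # rescale old statistics
            l[i:i + Br] = alpha * l[i:i + Br] + P.sum(-1, keepdim=True)
            O[i:i + Br] = alpha * O[i:i + Br] + P @ Vj
            m[i:i + Br] = m_new
    return O / l                            # single final normalization

Q, K, V = torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64)
ref = torch.softmax(Q @ K.T * 64 ** -0.5, dim=-1) @ V
print(torch.allclose(flash_attention_forward(Q, K, V), ref, atol=1e-5))  # True

Note that the output stays unnormalized until the final division, the non-matmul saving that v2 later formalized (§1.3).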

Backward Pass and Recomputation

Standard backpropagation requires storing $S, P \in \mathbb{R}^{N \times N}$. FlashAttention instead uses a recomputation strategy: the forward pass saves only the output $O$ and the softmax statistics (the row-wise logsumexp), and the backward pass recomputes $S$ and $P$ block by block in SRAM.

While this increases FLOPs by approximately 30%, it dramatically reduces HBM access, resulting in faster overall speed.

IO Complexity Analysis

Theorem (FlashAttention IO Complexity): Given SRAM size $M$, sequence length $N$, and head dimension $d$, FlashAttention’s HBM access is:

\[O\left( \frac{N^2 d^2}{M} \right)\]

Standard attention's HBM access is $\Theta(N^2 + Nd)$. Since typically $d^2 \ll M$, FlashAttention performs $\Theta(M/d^2)$ times fewer HBM accesses, which in practice means an order of magnitude less memory traffic.

| Scenario | Speedup | Memory Savings |
|---|---|---|
| BERT-large (seq=512) | 1.15× | - |
| GPT-2 (seq=1K) | 3× | 10× |
| Long-range (seq=4K) | 2.4× | 20× |

1.3 FlashAttention-2: Parallelism Strategy Optimization

FlashAttention-2 optimizes parallelism strategies and work distribution, achieving 230 TFLOPS on A100 (approximately 73% peak utilization), about 2× faster than v1.

Parallelism Strategy Improvements

FlashAttention v1: Parallelizes across batch and head dimensions, with each thread block processing one attention head. When $\text{batch} \times \text{heads} < 108$ (number of A100 SMs), GPU utilization is low.

FlashAttention-2: Additionally parallelizes across the sequence length dimension. For long sequences (which typically mean small batch), this significantly improves GPU utilization.

Warp Work Distribution

GPU thread hierarchy: Thread → Warp (32 threads) → Thread Block → Grid.

v1’s Sliced-K Scheme: Partitions $K, V$ among 4 warps, with $Q$ visible to all warps. Problem: requires inter-warp synchronization and shared memory for intermediate results.

v2’s Sliced-Q Scheme: Partitions $Q$ among 4 warps, with $K, V$ visible to all warps. Advantage: eliminates inter-warp communication and reduces shared memory reads/writes.

Reducing Non-Matmul FLOPs

A100's matrix multiplication throughput is 16× higher than its non-matmul throughput (312 vs 19.5 TFLOPS). v2 reduces non-matmul operations in two main ways: the output is kept unnormalized and rescaled by $1/\ell$ only once at the end of each row block, and only the logsumexp $L = m + \log \ell$ is stored for the backward pass instead of both $m$ and $\ell$.

| Metric | v1 | v2 | Improvement |
|---|---|---|---|
| A100 Peak (TFLOPS) | 124 | 230 | 1.85× |
| GPU Utilization | 25-40% | 50-73% | ~2× |
| GPT-3 Training (TFLOPS) | 173 | 225 | 1.3× |

1.4 FlashAttention-3: Hopper Architecture Optimization

FlashAttention-3 is designed for NVIDIA Hopper architecture (H100), fully leveraging new hardware features to achieve 740 TFLOPS (75% peak utilization).

New Hopper Hardware Features

WGMMA (Warpgroup Matrix Multiply-Accumulate): New Tensor Core instruction introduced in Hopper, with significantly higher throughput than Ampere’s mma.sync. A warpgroup (4 warps, 128 threads) can execute large-scale matrix multiplications.

TMA (Tensor Memory Accelerator): Dedicated hardware unit for data transfer between Global Memory and Shared Memory. It performs address generation and bulk asynchronous copies on its own, freeing registers and threads for computation.

Three Major Optimization Techniques

1. Warp Specialization: Divides warps into Producers and Consumers: Producer warps issue TMA loads into a circular buffer of shared-memory stages, while Consumer warps execute WGMMA on the buffered data.

Data transfer and computation are completely asynchronously overlapped.

2. Ping-Pong Scheduling: Alternates execution between two warpgroups: while one warpgroup runs softmax, the other runs its GEMMs, hiding the low-throughput exponential operations behind Tensor Core work.

This scheduling improves the FP16 forward pass from ~570 TFLOPS to 620 TFLOPS.

3. Intra-warpgroup Overlapping: Within a single warpgroup, the softmax of one block is pipelined with the GEMMs of the next, so exponential work and Tensor Core work proceed concurrently.

FP8 Support

FlashAttention-3 supports FP8 low precision, using Block Quantization and Incoherent Processing (based on the Hadamard transform) to reduce quantization error. With FP8, the forward pass reaches about 1.2 PFLOPS on H100 (see the version table in §1.8).

1.5 FlashAttention-4: Blackwell Architecture Optimization

FlashAttention-4 is designed for NVIDIA Blackwell architecture (B200), being the first attention kernel to break the PFLOPS barrier.

Five-Stage Warp Pipeline

Expanded from v3’s 2 stages to a 5-stage pipeline, with each warp type highly specialized:

  1. Load Warp: Loads $Q, K, V$ from Global Memory to Shared Memory via TMA
  2. MMA Warp: Executes matrix multiplication, computes attention scores and output accumulation
  3. Softmax Warps (8 total): Computes normalized attention scores, maintains running statistics
  4. Correction Warps (4 total): Rescales output when scaling factors change
  5. Epilogue Warps: Writes completed output blocks back to Global Memory

Software exp2 Emulation

Traditional implementations rely on Special Function Units (SFU) to compute exponential functions, but SFUs are scarce resources. FlashAttention-4 uses cubic polynomial approximation:

\[2^x \approx a_0 + a_1 x + a_2 x^2 + a_3 x^3, \quad x \in [0, 1)\]

Efficiently computed using Horner’s method with vectorized FMA instructions on CUDA Cores, avoiding SFU bottlenecks.
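The kernel's actual coefficients are not public; the sketch below fits its own cubic to show the idea (NumPy, with assumed coefficients from a least-squares fit):

import numpy as np

# Fit a cubic to 2^x on [0, 1); polyfit returns highest-degree coefficient first
xs = np.linspace(0.0, 1.0, 10001)
a3, a2, a1, a0 = np.polyfit(xs, 2.0 ** xs, 3)

def exp2_frac(x):
    # Horner's rule: three fused multiply-adds on CUDA Cores, no SFU needed
    return ((a3 * x + a2) * x + a1) * x + a0

# For general x, split into integer + fractional parts and scale by 2^int
x = np.linspace(0.0, 1.0, 7, endpoint=False)
print(np.abs(exp2_frac(x) - 2.0 ** x).max())  # ~1e-4 absolute error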

Selective Rescaling

Traditional Online Softmax rescales every time a new maximum is encountered. FlashAttention-4 introduces threshold checking: rescaling is only triggered when the maximum value change is sufficient to affect numerical stability. Reportedly, this reduces rescaling operations by approximately 10×, while maintaining numerical precision.

Performance: On B200, FlashAttention-4 is the first attention kernel to exceed 1 PFLOPS, roughly 15× the v1 peak (see the version table in §1.8).

1.6 Flash Decoding: KV Parallelism During Inference

FlashAttention is optimized for training, but has issues during inference. Flash Decoding specifically addresses inference bottlenecks.

Inference Problem

During autoregressive generation, each step produces a single token, so $Q$ has sequence length 1. FlashAttention then parallelizes only over the batch and head dimensions; with the small batches typical of serving, most SMs sit idle while the long KV dimension is processed sequentially.

KV Sequence Length Parallelism

Flash Decoding’s core idea: Parallelize across the KV sequence length dimension.

  1. Split KV Cache into $S$ blocks: $K = [K_1, …, K_S]$, $V = [V_1, …, V_S]$
  2. Compute partial attention independently for each block: \(O_s = \text{softmax}(Q K_s^\top) V_s, \quad (m_s, \ell_s) = \text{statistics}\)
  3. Merge results using Log-Sum-Exp: \(O = \frac{\sum_s e^{m_s - m_{global}} \ell_s \cdot O_s}{\sum_s e^{m_s - m_{global}} \ell_s}\)
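A single-query toy sketch of split-KV decoding and the log-sum-exp merge (in the real kernel the splits run on different SMs in parallel):

import torch

def flash_decode(q, K, V, n_splits=4):
    # Each split computes a normalized partial output plus its (m_s, l_s)
    d = q.shape[-1]
    ms, ls, Os = [], [], []
    for Ks, Vs in zip(K.chunk(n_splits), V.chunk(n_splits)):
        s = Ks @ q / d ** 0.5
        m = s.max()
        p = torch.exp(s - m)
        ms.append(m); ls.append(p.sum()); Os.append(p @ Vs / p.sum())
    # Log-sum-exp merge across splits
    m_g = torch.stack(ms).max()
    w = [torch.exp(m - m_g) * l for m, l in zip(ms, ls)]
    return sum(wi * Oi for wi, Oi in zip(w, Os)) / sum(w)

q, K, V = torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64)
ref = torch.softmax(K @ q / 64 ** 0.5, dim=0) @ V
print(torch.allclose(flash_decode(q, K, V), ref, atol=1e-4))  # True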

Performance Improvements: For very long sequences, Flash Decoding reports up to 8× faster end-to-end generation, because GPU utilization no longer collapses when the decode batch is small.

1.7 FlashDecoding++

FlashDecoding++, published at MLSys 2024, pushes decoding optimization further in two directions.

Asynchronous Softmax: Flash Decoding's reduction step must synchronously wait for all partial results. FlashDecoding++ introduces a Unified Max Value: a shared maximum chosen in advance from input statistics, so partial softmaxes proceed asynchronously without per-block rescaling.

Flat GEMM Optimization: Inference-time GEMM shapes are "flat" ($1 \times N$), which standard GEMM kernels handle inefficiently; FlashDecoding++ pads to efficient tile sizes and uses double buffering to improve utilization.

1.8 FlashAttention Version Evolution

| Version | GPU | Peak TFLOPS | Utilization | Speedup vs v1 |
|---|---|---|---|---|
| v1 | A100 | 124 | 40% | - |
| v2 | A100 | 230 | 73% | 1.85× |
| v2 | H100 | 335 | 35% | 2.7× |
| v3 | H100 | 740 | 75% | - |
| v3 (FP8) | H100 | 1200 | - | 9.7× |
| v4 | B200 | >1000 (PFLOPS) | - | 15× |

1.9 Engineering Impact of FlashAttention

FlashAttention has become the standard for modern LLM training and inference:

Context Length Revolution: FlashAttention increased practical context lengths from 2-4K to 128K+.

When to Use FlashAttention: FlashAttention provides the most benefit in these scenarios:

  • Long sequences: Sequence length > 512
  • Large batch: Fully utilize GPU parallelism
  • Training: Memory savings allow larger batches

For short sequences and small models, standard attention may be faster (reduced kernel launch overhead).

1.10 FlexAttention: Programmable FlashAttention

FlexAttention is a new API introduced in PyTorch 2.5, providing a flexible programming interface for FlashAttention.

Motivation: While FlashAttention is efficient, each attention variant (Causal, ALiBi, Sliding Window, etc.) requires specialized implementation. When researchers want to experiment with new variants, they often need to hand-write Triton kernels. FlexAttention automatically generates efficient kernels through torch.compile, reducing development time from weeks to minutes.

Core API: FlexAttention provides two functional interfaces:

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, SEQ_LEN, D = 2, 8, 1024, 64
q = torch.randn(B, H, SEQ_LEN, D, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
slope = torch.randn(H, device="cuda")   # per-head ALiBi slopes

# Causal mask: a mask_mod returns True where attention is allowed
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# ALiBi position encoding: a score_mod edits the raw attention score
def alibi(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx) * slope[h]

# Usage: precompute the block mask, then one call handles both modifications
block_mask = create_block_mask(causal, B, H, SEQ_LEN, SEQ_LEN)
out = flex_attention(q, k, v, score_mod=alibi, block_mask=block_mask)

Performance: FlexAttention achieves approximately 85-90% of FlashAttention-2’s performance, but improves development efficiency by 100×. For variants not natively supported by FlashAttention (such as Document Masking), FlexAttention is 5-8× faster than standard SDPA.

2. Multi-head Latent Attention (MLA)

FlashAttention solves the IO bottleneck of attention matrices during training, but during inference there’s another memory challenge: KV Cache. Autoregressive generation requires caching all historical tokens’ Key and Value vectors. As context length increases, this memory usage can exceed the model parameters themselves.

Multi-head Latent Attention (MLA) is DeepSeek-V2’s solution to this problem. Its core insight is: although each attention head needs independent K and V, they may have a low-rank structure—i.e., they can be recovered from a shared low-dimensional “latent vector”. This compression differs from GQA/MQA’s forced sharing; instead, it lets the network learn the optimal compression method.

2.1 KV Cache Challenge

During autoregressive generation, KV Cache becomes the primary memory bottleneck for long-context inference:

\[\text{KV Cache Size} = 2 \times B \times S \times L \times n_h \times d_h \times \text{bytes}\]

where $B$ is the batch size, $S$ the sequence length, $L$ the number of layers, $n_h$ the number of KV heads (for GQA models this is the KV-group count $K$), $d_h$ the head dimension, and the factor 2 accounts for both K and V. For a 70B-class GQA model ($L=80$, $K=8$, $d_h=128$, $S=8192$, FP16), KV Cache reaches roughly 2.7 GB per request, growing linearly with both batch size and context length.
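The arithmetic as a one-liner (FP16 assumed, 2 bytes per element):

def kv_cache_bytes(batch, seq_len, layers, kv_heads, d_head, dtype_bytes=2):
    # Factor 2: one K and one V vector per token, per layer, per KV head
    return 2 * batch * seq_len * layers * kv_heads * d_head * dtype_bytes

# 70B-class GQA model: 80 layers, 8 KV heads, head dim 128, 8K context, FP16
print(kv_cache_bytes(1, 8192, 80, 8, 128) / 2 ** 30)  # ~2.5 GiB (about 2.7 GB)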

2.2 Limitations of Existing Solutions

| Method | KV Cache Size (per token per layer) | Performance | Principle |
|---|---|---|---|
| MHA | $2 n_h d_h$ | Optimal | Independent KV per head |
| GQA | $2 \frac{n_h}{g} d_h$ | Slight degradation | $g$ Q heads share one KV group |
| MQA | $2 d_h$ | Significant degradation | All heads share KV |
| MLA | $d_c + d_h^R$ | Close to MHA | Low-rank compression |

GQA/MQA reduce cache by forcing sharing of KV heads, but this forced sharing often harms model performance. MLA’s core insight: KV can be recovered from a low-dimensional latent vector, without explicit sharing.

2.3 MLA Core Principles

MLA’s core idea is to compress high-dimensional Key and Value into a shared low-dimensional latent vector, then recover K and V from this vector during inference.

Low-rank Compression of KV

For input $\mathbf{h}_t \in \mathbb{R}^d$, first compress to latent vector:

\[\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t\]

where $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix, $d_c \ll n_h d_h$ is the compression dimension.

Recover K and V from latent vector:

\[\mathbf{k}_t^C = W^{UK} \mathbf{c}_t^{KV}, \quad \mathbf{v}_t^C = W^{UV} \mathbf{c}_t^{KV}\]

where $W^{UK}, W^{UV} \in \mathbb{R}^{n_h d_h \times d_c}$ are up-projection matrices.

Low-rank Compression of Query

Similarly, Query can also be low-rank compressed (mainly to reduce activation memory during training):

\[\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t, \quad \mathbf{q}_t^C = W^{UQ} \mathbf{c}_t^Q\]

where $W^{DQ} \in \mathbb{R}^{d_c' \times d}$, $W^{UQ} \in \mathbb{R}^{n_h d_h \times d_c'}$.

Decoupled RoPE

RoPE needs to apply rotation at each position, but if RoPE is directly applied to the compressed $\mathbf{c}_t^{KV}$, it breaks subsequent weight absorption optimization. MLA adopts a Decoupled RoPE strategy:

  1. Split each attention head into two parts:
    • Content part ($d_h^C$ dimensions): Recovered from compressed vector, RoPE not applied
    • Position part ($d_h^R$ dimensions): Additional projection, RoPE applied
  2. Final Q and K are concatenations of both parts:

\[\mathbf{q}_t = [\mathbf{q}_t^C; \text{RoPE}(\mathbf{q}_t^R, t)], \quad \mathbf{k}_t = [\mathbf{k}_t^C; \text{RoPE}(\mathbf{k}_t^R, t)]\]

Where position parts are computed as:

\[\mathbf{q}_t^R = W^{QR} \mathbf{c}_t^Q, \quad \mathbf{k}_t^R = W^{KR} \mathbf{h}_t\]

Why Decoupled RoPE? RoPE is position-dependent: $\text{RoPE}(\mathbf{x}, t)$ depends on position $t$. If RoPE is applied to $\mathbf{c}_t^{KV}$ then K is recovered, then $\mathbf{k}_t = W^{UK} \cdot \text{RoPE}(\mathbf{c}_t^{KV}, t)$, and $W^{UK}$ cannot be absorbed into $W^Q$ (because RoPE is in between). The Decoupled strategy isolates RoPE to separate dimensions, preserving the possibility of weight absorption.

Weight Absorption

A key optimization of MLA is Weight Absorption. Since there’s no nonlinear activation between compression and recovery, matrices can be merged:

Query-Key Absorption: Computing attention scores:

\[\mathbf{q}_t^{C\top} \mathbf{k}_s^C = (\mathbf{c}_t^Q)^\top (W^{UQ})^\top W^{UK} \mathbf{c}_s^{KV} = (\mathbf{c}_t^Q)^\top \underbrace{W^{QK}}_{\text{absorbed}} \mathbf{c}_s^{KV}\]

where $W^{QK} = (W^{UQ})^\top W^{UK} \in \mathbb{R}^{d_c' \times d_c}$.

Output-Value Absorption: Computing output projection:

\[W^O \mathbf{v}_t^C = W^O W^{UV} \mathbf{c}_t^{KV} = \underbrace{W^{OV}}_{\text{absorbed}} \mathbf{c}_t^{KV}\]

where $W^{OV} = W^O W^{UV} \in \mathbb{R}^{d \times d_c}$.

Inference Flow: After weight absorption, during inference:

  1. Cache $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^R$ (position part)
  2. Use $W^{QK}$ to directly compute content part attention scores
  3. Use absorbed $W^{OV}$ to compute output
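A toy check that absorption is exact up to floating-point error (dimensions here are illustrative, much smaller than DeepSeek-V2's):

import torch

d_c, d_c_q, n_h_d_h = 64, 96, 512   # toy dimensions
W_uq = torch.randn(n_h_d_h, d_c_q)
W_uk = torch.randn(n_h_d_h, d_c)
c_q, c_kv = torch.randn(d_c_q), torch.randn(d_c)

# Naive: up-project both sides, then take the dot product
naive = (W_uq @ c_q) @ (W_uk @ c_kv)

# Absorbed: fold both up-projections into one (d_c_q x d_c) matrix offline
W_qk = W_uq.T @ W_uk
absorbed = c_q @ (W_qk @ c_kv)

print(torch.allclose(naive, absorbed, rtol=1e-4))  # True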

2.4 KV Cache Compression Effectiveness

| Method | Cache Elements (per token per layer) | DeepSeek-V2 Values |
|---|---|---|
| MHA | $2 n_h d_h$ | $2 \times 128 \times 128 = 32768$ |
| GQA (8 groups) | $2 \times 8 \times d_h$ | $2 \times 8 \times 128 = 2048$ |
| MLA | $d_c + d_h^R$ | $512 + 64 = 576$ |

DeepSeek-V2 configuration: $d_c = 512$, $d_h^R = 64$, achieving compression ratio of $\frac{32768}{576} \approx \mathbf{56.9×}$.

2.5 PyTorch Implementation

Below is a simplified PyTorch implementation of MLA:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    def __init__(
        self,
        d_model: int,      # Model dimension
        n_heads: int,      # Number of attention heads
        d_c: int,          # KV compression dimension
        d_c_q: int,        # Q compression dimension
        d_head_r: int,     # RoPE dimension per head
    ):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.d_c = d_c
        self.d_head_r = d_head_r

        # KV compression
        self.W_dkv = nn.Linear(d_model, d_c, bias=False)
        self.W_uk = nn.Linear(d_c, d_model, bias=False)
        self.W_uv = nn.Linear(d_c, d_model, bias=False)

        # Q compression
        self.W_dq = nn.Linear(d_model, d_c_q, bias=False)
        self.W_uq = nn.Linear(d_c_q, d_model, bias=False)

        # Decoupled RoPE projections
        self.W_qr = nn.Linear(d_c_q, n_heads * d_head_r, bias=False)
        self.W_kr = nn.Linear(d_model, n_heads * d_head_r, bias=False)

        # Output projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, rope_fn, kv_cache=None):
        B, T, D = x.shape

        # KV compression: only c_kv needs caching
        c_kv = self.W_dkv(x)  # (B, T, d_c)

        # Decoupled RoPE keys
        k_r = self.W_kr(x)    # (B, T, n_heads * d_head_r)
        k_r = rope_fn(k_r)    # Apply RoPE

        # Handle KV cache
        if kv_cache is not None:
            c_kv_cached, k_r_cached = kv_cache
            c_kv = torch.cat([c_kv_cached, c_kv], dim=1)
            k_r = torch.cat([k_r_cached, k_r], dim=1)
        new_cache = (c_kv, k_r)

        # Q compression
        c_q = self.W_dq(x)    # (B, T, d_c_q)
        q_c = self.W_uq(c_q)  # (B, T, D) - content part
        q_r = self.W_qr(c_q)  # (B, T, n_heads * d_head_r) - RoPE part
        q_r = rope_fn(q_r)

        # Reconstruct K, V from compressed cache
        k_c = self.W_uk(c_kv)  # (B, S, D) - content part
        v = self.W_uv(c_kv)    # (B, S, D)

        # Reshape for multi-head attention
        q_c = q_c.view(B, T, self.n_heads, self.d_head)
        k_c = k_c.view(B, -1, self.n_heads, self.d_head)
        v = v.view(B, -1, self.n_heads, self.d_head)
        q_r = q_r.view(B, T, self.n_heads, self.d_head_r)
        k_r = k_r.view(B, -1, self.n_heads, self.d_head_r)

        # Concatenate content and RoPE parts
        q = torch.cat([q_c, q_r], dim=-1)  # (B, T, H, d_head + d_head_r)
        k = torch.cat([k_c, k_r], dim=-1)  # (B, S, H, d_head + d_head_r)

        # Scaled dot-product attention
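        # NOTE: causal masking is omitted for brevity; a real implementation
        # must mask future positions whenever T > 1 (prefill)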
        q, k, v = q.transpose(1,2), k.transpose(1,2), v.transpose(1,2)
        scale = (self.d_head + self.d_head_r) ** -0.5
        attn = F.softmax(q @ k.transpose(-2,-1) * scale, dim=-1)
        out = attn @ v  # (B, H, T, d_head)

        # Output projection
        out = out.transpose(1,2).reshape(B, T, D)
        out = self.W_o(out)

        return out, new_cache

Implementation Optimization: The code above is for educational purposes. In actual deployment:

  • Weight Absorption: Pre-compute $W^{QK} = (W^{UQ})^\top W^{UK}$ and $W^{OV} = W^O W^{UV}$
  • FlashAttention: Use FlashAttention to accelerate attention computation
  • Fused Operators: Fuse multiple small matrix multiplications

2.6 MLA vs Other Methods

| Feature | MHA | GQA | MQA | MLA |
|---|---|---|---|---|
| KV Cache | $2n_h d_h$ | $2\frac{n_h}{g} d_h$ | $2 d_h$ | $d_c + d_h^R$ |
| Parameters | Baseline | Reduced | Minimum | Slightly increased |
| Expressiveness | Strongest | Strong | Weaker | Close to MHA |
| Inference Latency | High | Medium | Low | Medium |
| Long Context | Limited | Good | Good | Best |

MLA Advantages: over 50× smaller KV Cache with quality close to MHA; the compact latent cache also reduces memory-bandwidth pressure during decoding.

MLA Costs: extra up-projection computation, slightly more parameters, and a more complex implementation than GQA (decoupled RoPE, weight absorption).

2.7 Quality-Efficiency Tradeoff

MLA’s 56.9× compression ratio comes from a strong assumption: K and V can be losslessly recovered from a 512-dimensional latent space. Under what conditions does this assumption hold?

Validity of Low-rank Assumption

Consider the diversity requirements of attention patterns. If a task requires $r$ essentially different attention patterns (e.g., local dependencies, long-range dependencies, syntactic structure, semantic associations, etc.), then KV representation needs at least $r$ degrees of freedom.

Lower Bound on Compression Dimension: If $d_c < r$, compression causes information loss. DeepSeek chose $d_c = 512$, implicitly assuming that common NLP task attention patterns can be expressed in a space of no more than 512 dimensions.

Task Dependency: Different tasks have different attention-diversity requirements: open-ended dialogue and summarization tolerate aggressive compression, while code and mathematical reasoning demand many distinct attention patterns and are more sensitive to it (see the selection guide below).

Selection Guide

| Scenario | Recommended Method | Rationale |
|---|---|---|
| Long-context inference (>32K) | MLA | KV Cache is the main bottleneck |
| Short context, high throughput | GQA | Simple implementation, low overhead |
| Quality priority (small batch) | MHA | No compression loss |
| Code/math tasks | GQA or MHA | High attention-diversity requirements |
| Edge device deployment | MLA | Extreme memory compression |

Rule of Thumb: When $\text{context length} \times \text{batch size} > 10^6$, KV Cache becomes the main bottleneck, and MLA’s benefits become apparent.

Compression Ratio vs Quality Pareto Frontier

DeepSeek’s ablation experiments show that $d_c$ selection has a “sweet spot”:

$d_c = 512$ is near Pareto optimal—further compression harms quality, further expansion wastes cache space.

2.8 Applications and Extensions

MLA has been applied in the following models: DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1 (which inherits V3's architecture).

Synergy between MLA and MoE: MLA is particularly suitable for MoE architecture: MoE’s sparse activation already reduces computation, while MLA further addresses memory bottleneck. The combination allows DeepSeek-V2 to maintain high performance while reducing training cost by 42.5%.

3. Sparse Attention

FlashAttention and MLA optimize attention from computation efficiency and memory usage perspectives respectively, but they both preserve the full $O(N^2)$ attention computation—they just make this computation faster and more memory-efficient. This section explores another path: if most attention weights are close to zero anyway, can we skip these meaningless computations?

3.1 Core Idea

Standard Softmax Attention has $O(N^2)$ complexity, but in practice not all token pairs are equally important:

The core idea of sparse attention: Only compute important token pairs, reducing complexity from $O(N^2)$ to $O(Nk)$, where $k \ll N$.

| Feature | Sparse Attention | Linear Attention |
|---|---|---|
| Complexity | $O(Nk)$ | $O(Nd^2)$ |
| Attention Type | Exact softmax | Approximate/alternative |
| Long-range Exact Retrieval | Strong | Weak |
| KV Cache | Full cache required | Compressible |
| Compatibility with Standard Transformer | High | Medium |

3.2 Sliding Window Attention

Sliding Window Attention (SWA) is the most intuitive form of sparse attention: each token only attends to tokens within a fixed window around it.

Mistral’s Sliding Window

Mistral 7B is the first open-source model to deploy sliding window attention at scale, with window size 4096.

Core Mechanism: Each position $t$ token only attends to tokens in range $[t-w, t]$:

\[\text{Attention}_t = \text{softmax}\left(\frac{q_t K_{[t-w:t]}^\top}{\sqrt{d}}\right) V_{[t-w:t]}\]

Cross-layer Information Propagation: The key insight of sliding window is that through stacked Transformer layers, information can propagate “across windows”. At layer $k$, the token at position $t$ can actually access information from range $[t - k \cdot w, t]$. For a 32-layer model with window size 4096, the theoretical receptive field can reach 128K.

Inference Optimization: Because each token attends only to the last $w$ positions, the KV Cache can be a rolling buffer of fixed size $w$: new entries overwrite the oldest ones, bounding cache memory regardless of generation length (see the sketch below).
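A minimal sketch of such a rolling buffer (an illustrative class, not Mistral's actual implementation):

import torch

class RollingKVCache:
    # Fixed-size cache: position t is written to slot t % window,
    # so memory stays O(window) for arbitrarily long generation.
    def __init__(self, window: int, n_heads: int, d_head: int):
        self.window = window
        self.k = torch.zeros(window, n_heads, d_head)
        self.v = torch.zeros(window, n_heads, d_head)
        self.t = 0  # absolute position of the next write

    def append(self, k_t, v_t):
        slot = self.t % self.window
        self.k[slot], self.v[slot] = k_t, v_t
        self.t += 1

    def get(self):
        # Entries are rotated, which is harmless: attention over a set of
        # (already position-encoded) KV pairs is permutation-invariant
        n = min(self.t, self.window)
        return self.k[:n], self.v[:n]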

StreamingLLM: Infinite Length Input

StreamingLLM solves an important problem: how to let LLMs process “infinitely long” streaming input.

Attention Sink Phenomenon: Research found that regardless of input length, models always assign abnormally high attention weight to the first few tokens—even if these tokens are not semantically important. This is called “Attention Sink”.

SinkAttention: Retains two parts of KV Cache:

\[\text{KV Cache} = \text{Sink}_{[1:4]} \cup \text{Window}_{[t-w:t]}\]
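Selecting the retained KV positions takes a few lines (a sketch using StreamingLLM's default of 4 sink tokens; the window size is arbitrary here):

import torch

def sink_window_indices(seq_len: int, n_sink: int = 4, window: int = 1024):
    # Keep the first n_sink tokens (attention sinks) + the most recent window
    sink = torch.arange(min(n_sink, seq_len))
    recent = torch.arange(max(n_sink, seq_len - window), seq_len)
    return torch.cat([sink, recent])

print(sink_window_indices(10, n_sink=4, window=4))  # tensor([0,1,2,3,6,7,8,9])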

Effect: Maintains stable performance in streaming scenarios with 4 million+ tokens, while standard sliding window collapses beyond pre-training length.

3.3 KV Cache Sparsification

KV Cache sparsification is sparse attention during inference: dynamically discarding “unimportant” KV entries.

H2O (Heavy-Hitter Oracle): Based on an observation—a few “heavy-hitter” tokens accumulate most attention weight. Algorithm: maintain cumulative attention score for each token, retain Top-k tokens with highest scores + most recent sliding window tokens.
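An H2O-style selection sketch (a hypothetical helper; the deployed algorithm tracks accumulated scores incrementally at each decoding step rather than recomputing them):

import torch

def h2o_keep_indices(attn, n_heavy=256, n_recent=256):
    # attn: (T_q, T_kv) attention weights observed so far
    acc = attn.sum(dim=0)              # cumulative score per KV position
    acc[-n_recent:] = float("inf")     # the recent window is always kept
    k = min(n_heavy + n_recent, acc.numel())
    return torch.topk(acc, k).indices.sort().values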

SnapKV: Attention patterns are mostly determined during prefill phase, can prune “once and for all”. Analyze attention distribution at the end of prefill, identify important positions, permanently retain KV at these positions.

PyramidKV: Different layers have different attention sparsity—lower layers have more dispersed attention needing more KV, higher layers have concentrated attention allowing aggressive compression. Uses pyramid-shaped KV allocation: more at bottom, less at top.

| Method | Pruning Timing | Dynamicity | Integrated In |
|---|---|---|---|
| H2O | Every step | Dynamic | vLLM |
| SnapKV | After prefill | Static | vLLM |
| StreamingLLM | Continuous | Static | - |
| PyramidKV | Per layer | Static | - |

3.4 MoBA: Block Sparse Attention

MoBA (Mixture of Block Attention) is a representative block-level sparse attention work, deployed in Kimi’s long-context service.

Core Idea: Applying MoE to Attention

MoBA’s core insight: Not all context is equally important to the current token. Rather than computing attention over the entire sequence, let the model learn “which blocks to focus on”.

\[\text{MoBA}(q, K, V) = \text{softmax}(qK_{[\mathcal{I}]}^\top)V_{[\mathcal{I}]}\]

where $\mathcal{I} \subseteq [N]$ is the selected KV subset, determined by the routing mechanism.

Block Partitioning and Routing

Block Partitioning: Evenly divide the context of length $N$ into $n$ blocks, each of size $B = N/n$:

\[\mathcal{I}_i = [(i-1) \cdot B + 1, \, i \cdot B], \quad i = 1, \ldots, n\]

Routing Score Calculation (parameter-free): For each query $q$, compute affinity score with each block:

\[s_i = \langle q, \text{mean\_pool}(K_{[\mathcal{I}_i]}) \rangle\]

i.e., inner product of query with the mean vector of all keys in the block. This is a parameter-free routing mechanism.

Top-k Selection: Select the $k$ blocks with highest scores for attention computation (typical setting $k = 12$, block size $L = 4096$).
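The routing step in code (a parameter-free sketch; the block size and top-k defaults below are toy values, not the production setting):

import torch

def moba_route(q, K, block_size=512, top_k=3):
    # Score each block by <q, mean of its keys>, then keep the top-k blocks
    n_blocks = K.shape[0] // block_size
    centroids = K[:n_blocks * block_size].view(n_blocks, block_size, -1).mean(1)
    scores = centroids @ q             # (n_blocks,) parameter-free affinities
    return torch.topk(scores, min(top_k, n_blocks)).indices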

Causality Guarantee

In autoregressive scenarios:

  1. Future block masking: For query at position $t$, all blocks with $i > \lceil t/B \rceil$ have $s_i = -\infty$
  2. Current block forced selection: The block containing the query is always routed

| Metric (MoBA vs. Full Attention) | Value |
|---|---|
| LM loss difference | $< 10^{-3}$ |
| Sparsity @32K | 95.31% |
| Speedup @1M context | 6.5× |
| Speedup @10M context | 16× |

3.5 NSA: Native Sparse Attention

Native Sparse Attention (NSA) is a hierarchical sparse attention mechanism proposed by DeepSeek.

Three Attention Paths

NSA decomposes attention computation into three parallel paths:

1. Compression Attention: Uses learnable MLP to compress consecutive tokens into block-level representations:

\[\tilde{K}_i = \text{MLP}(K_{[(i-1)l+1:il]}), \quad \tilde{V}_i = \text{MLP}(V_{[(i-1)l+1:il]})\]

where $l$ is compression block size (NSA uses $l=32$). This captures global coarse-grained information.

2. Selection Attention: Retains the most relevant fine-grained blocks at full precision, with block importance scores derived from the compression branch's attention scores.

This preserves fine-grained precise information.

3. Sliding Window Attention: Full attention on the most recent $w$ tokens (NSA uses $w=512$). This guarantees accurate modeling of local context.

Block Selection Mechanism

NSA derives block importance scores from the compression branch's attention scores rather than computing a separate index, which keeps selection cheap and fully differentiable.

Key Design: Scoring operates at block granularity and is decoupled from the main attention computation, so the selection overhead stays negligible.

End-to-end Trainable

Unlike ClusterKV, MagicPIG, and other methods relying on non-differentiable operations, NSA is natively trainable—using sparse attention from the pre-training stage.

| Configuration | Value |
|---|---|
| Compression block size $l$ | 32 |
| Selection block size $l'$ | 64 |
| Number of selected blocks | 16 |
| Sliding window $w$ | 512 |

Training Speedup (64K sequence, A100): forward 9×, backward 6×. Decoding speedup grows with sequence length: 4× at 8K, 6.4× at 16K, 9.1× at 32K, 11.6× at 64K.

3.6 DSA: DeepSeek Sparse Attention

DSA (DeepSeek Sparse Attention, September 2025) is the next-generation sparse attention deployed in DeepSeek-V3.2, fundamentally different from NSA. DSA abandons NSA’s complex three-branch design, adopting simpler fine-grained token-level retrieval.

Algorithm Design

DSA’s core idea: Each query only needs to attend to a fixed number $k$ of most relevant tokens ($k=2048$).

Importance Score Calculation: DSA introduces learnable weight $w$ to compute token importance:

\[\text{score}_i = w \cdot f(q, k_i)\]

This is a compromise—simpler than NSA’s MLP, but more expressive than MoBA’s parameter-free mean-pooling.

Top-k Retrieval: Select Top-$k$ tokens based on importance scores for exact attention computation:

\[\mathcal{I} = \text{Top-}k(\{\text{score}_i\}_{i=1}^N), \quad |\mathcal{I}| = 2048\]

Complexity: Single query accesses fixed $k$ tokens, so overall complexity is $O(Nk)$, true linear complexity.
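A sketch of the retrieval step (the actual indexer is more elaborate and runs in low precision; here $f$ is a plain dot product and $w$ a scalar, both assumptions):

import torch

def dsa_topk(q, K, w, k=2048):
    # score_i = w * f(q, k_i); keep the k highest-scoring tokens
    scores = w * (K @ q)               # (N,) importance per token
    return torch.topk(scores, min(k, K.shape[0])).indices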

Core Differences from NSA

| Feature | NSA | DSA |
|---|---|---|
| Selection Granularity | Block-level | Token-level |
| Number of Branches | 3 (compress + select + window) | 1 (direct selection) |
| Importance Calculation | Learnable MLP | Learnable $w$ weight |
| Attention Variant | GQA | MLA |
| Validation Model Scale | 27B | 671B |

Engineering Implementation

Performance Gains: Long-context API cost reduced by approximately 50%, significant speedup on 64K sequences, quality nearly lossless on 671B model.

3.7 Comparison of Three Methods

| Design Dimension | NSA | MoBA | DSA |
|---|---|---|---|
| Release Date | 2025.02 | 2025.02 | 2025.09 |
| Proposer | DeepSeek | Moonshot (Kimi) | DeepSeek |
| Selection Granularity | Block-level | Block-level | Token-level |
| Routing Mechanism | Learnable MLP | Parameter-free mean-pool | Learnable $w$ |
| Local Window | Yes ($w$=512) | Current block forced | No |
| Complexity | $O(N^2/L)$ | $O(N \cdot kL)$ | $O(Nk)$ |

Hyperparameter Comparison

| Parameter | NSA | MoBA | DSA |
|---|---|---|---|
| Block Size | $l$=32, $l'$=64 | $L$=4096 | - |
| Selection Count | $n$=16 blocks | $k$=12 blocks | $k$=2048 tokens |
| Sliding Window | $w$=512 | - | - |
| Tokens Accessed @32K | ~2560 | 49152 | 2048 |
| Sparsity @32K | 92% | 0% (sees all) | 94% |

Key Observation: At 32K length, MoBA has almost no sparsity (selected 12 blocks × 4096 = 49152 > 32K)! MoBA’s sparsity advantage only appears at longer sequences (e.g., 128K+).

Design Philosophy Differences

NSA bets on hierarchy: three specialized branches preserve information at multiple scales. MoBA bets on simplicity: parameter-free MoE-style routing makes it a near drop-in replacement for standard attention. DSA bets on granularity: it abandons blocks entirely in favor of direct token-level retrieval.

Applicable Scenarios

| Method | Applicable Scenarios |
|---|---|
| NSA | Precise preservation of multi-scale information; tolerance for complex hyperparameter tuning; GQA architectures |
| MoBA | Simple design; drop-in replacement for existing attention; sequence length 128K+ |
| DSA | MLA architectures; extreme sparsity; ultra-large models (100B+) |

3.8 Ring Attention

When sequence length exceeds single GPU memory, Ring Attention splits long sequences across multiple GPUs, implementing distributed attention computation through ring communication.

Algorithm Flow:

  1. Split Query, Key, Value along sequence dimension across $P$ GPUs
  2. Each GPU holds local Query block, computes attention with local KV
  3. KV blocks are passed in ring fashion between GPUs, accumulating global attention
  4. Uses Online Softmax statistics to merge partial results and avoid numerical overflow (see the single-process simulation below)
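A single-process simulation of the ring (non-causal, toy sizes; a real implementation exchanges shards between devices and overlaps that communication with compute):

import torch

def ring_attention_sim(Q, K, V, P=4):
    d = Q.shape[-1]
    Qs, Ks, Vs = Q.chunk(P), K.chunk(P), V.chunk(P)
    outs = []
    for p in range(P):                         # "device" p owns Q shard p
        m = torch.full((Qs[p].shape[0], 1), float("-inf"))
        l = torch.zeros_like(m)
        O = torch.zeros_like(Qs[p])
        for step in range(P):                  # KV shard arriving this step
            j = (p - step) % P
            S = Qs[p] @ Ks[j].T / d ** 0.5
            m_new = torch.maximum(m, S.max(-1, keepdim=True).values)
            Pij = torch.exp(S - m_new)
            alpha = torch.exp(m - m_new)       # online-softmax rescaling
            l = alpha * l + Pij.sum(-1, keepdim=True)
            O = alpha * O + Pij @ Vs[j]
            m = m_new
        outs.append(O / l)
    return torch.cat(outs)

Q, K, V = torch.randn(32, 16), torch.randn(32, 16), torch.randn(32, 16)
ref = torch.softmax(Q @ K.T / 16 ** 0.5, dim=-1) @ V
print(torch.allclose(ring_attention_sim(Q, K, V), ref, atol=1e-5))  # True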

Communication Hiding: Key optimization is compute-communication overlap—asynchronously transmit the next KV block while computing attention with the current block.

LLaMA 3’s Context Parallelism: Uses All-Gather style Context Parallelism—first All-Gather to collect all KV, then compute local Query’s attention. For load balancing, split sequence into $2 \times \text{CP}$ blocks and shuffle, supporting efficient training at 128K context.

3.9 Sparse Attention Method Overview

| Method | Sparsity Strategy | Complexity | Global Information | Deployment |
|---|---|---|---|---|
| Sliding Window | Fixed window | $O(Nw)$ | None | Mistral |
| StreamingLLM | Sink + window | $O(N(s+w))$ | Sink tokens | - |
| MoBA | Block routing | $O(N \cdot kL)$ | Top-k blocks | Kimi |
| NSA | Compress + select + window | $O(N^2/L)$ | Compression + selection | - |
| DSA | Token-level retrieval | $O(Nk)$ | Top-k tokens | DeepSeek-V3.2 |
| Ring Attention | Distributed | $O(N^2/P)$ | Complete | LLaMA 3 |

Evolution of Sparse Attention: From Longformer/BigBird’s “hand-designed patterns” to MoBA/NSA/DSA’s “learned sparsity”, sparse attention is undergoing a paradigm shift. In 2025, sparse attention was first validated on 600B+ scale models (DeepSeek-V3.2), marking the technology’s transition from academic research to industrial mainstream.

4. Linear Attention

Sparse attention reduces $O(N^2)$ to $O(Nk)$ by “only computing important token pairs”, but it still preserves the softmax computation form. This section introduces a more radical path: fundamentally changing the mathematical form of attention, reducing complexity to truly $O(N)$.

4.1 Core Idea

The core insight of linear attention: softmax attention requires $O(N^2)$ because it must first compute the complete $N \times N$ attention matrix before normalization. If we replace softmax with other functions that allow computation to be “rearranged”, we can avoid explicitly constructing this matrix.

From Softmax Attention to Linearization

Standard self-attention (single head) computation:

\[\text{Attention}(Q, K, V) = \underbrace{\text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)}_{n \times n} V\]

This requires explicitly constructing the $n \times n$ attention matrix, with both time and space complexity $O(n^2)$.

The core idea of linear attention: rewrite softmax or $QK^\top$ in a decomposable form, using associativity of multiplication to change the computation order:

\[\text{Attention}(Q, K, V) \approx \phi(Q) \cdot \underbrace{(\phi(K)^\top V)}_{d \times d}\]

where $\phi(\cdot)$ is some feature mapping function. The key is: first compute $\phi(K)^\top V$ (a $d \times d$ matrix linear in length $n$), then left-multiply by $\phi(Q)$. Total complexity reduces to $O(nd^2)$, approximately $O(n)$ when $d \ll n$.

Recurrent Form: Transformer as RNN

In autoregressive (causal) scenarios, linear attention can be written in RNN-like recurrent form. Let $q_t, k_t, v_t$ be the query, key, value vectors at step $t$, define state matrix $S_t \in \mathbb{R}^{d \times d}$:

\[S_t = S_{t-1} + v_t k_t^\top, \quad o_t = S_t \cdot q_t\]

This reveals a deep connection: Linear Attention is essentially an RNN, with hidden state $S_t$ accumulating historical information.

Dual-Mode of Linear Attention: Linear attention supports two equivalent computation modes:

  • Parallel mode: Uses matrix multiplication during training, fully utilizing GPU parallelism
  • Recurrent mode: Uses RNN form during inference, enabling $O(1)$ incremental updates

This dual-mode property allows linear attention to achieve optimal efficiency in both training and inference phases.
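A toy equivalence check of the two modes, using unnormalized linear attention with an identity feature map:

import torch

def parallel_mode(Q, K, V):
    # Causal mask on the (n, n) similarity matrix — O(n^2), GPU-friendly
    return (Q @ K.T).tril() @ V

def recurrent_mode(Q, K, V):
    # RNN form: a d x d state accumulates v_t k_t^T — O(1) per new token
    S = torch.zeros(Q.shape[-1], Q.shape[-1])
    outs = []
    for q, k, v in zip(Q, K, V):
        S = S + torch.outer(v, k)      # S_t = S_{t-1} + v_t k_t^T
        outs.append(S @ q)             # o_t = S_t q_t
    return torch.stack(outs)

Q, K, V = torch.randn(6, 4), torch.randn(6, 4), torch.randn(6, 4)
print(torch.allclose(parallel_mode(Q, K, V), recurrent_mode(Q, K, V), atol=1e-5))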

4.2 Classic Methods

Linear Transformer

The first work to systematically propose “Transformer as RNN”. Core idea is to rewrite softmax attention in kernel form:

\[\text{Attention}(Q, K, V) = \frac{\phi(Q) (\phi(K)^\top V)}{\phi(Q) (\phi(K)^\top \mathbf{1})}\]

where $\phi(x) = \text{elu}(x) + 1$ ensures non-negativity. Experiments show up to 4000× speedup on autoregressive tasks.

Limitation: Simple feature mappings struggle to precisely approximate softmax behavior, with performance gaps on complex language tasks.

Performer

Proposes FAVOR+ (Fast Attention Via positive Orthogonal Random features), using random features to approximate the softmax kernel:

\[\text{softmax}(q^\top k) \approx \phi(q)^\top \phi(k)\]

where $\phi$ is constructed using orthogonal random features, with unbiased or approximately unbiased theoretical guarantees.

Advantage: Fully compatible with original Transformer, can serve as drop-in replacement. Limitation: Random approximation still has precision loss in practice.

cosFormer

Rather than hard-approximating softmax, designs a linear alternative based on two key properties of softmax:

  1. Non-negativity: Attention weights should be non-negative
  2. Concentration: Attention should concentrate at relevant positions

cosFormer uses ReLU to ensure non-negativity, and introduces cosine position reweighting:

\[\text{Attention}_{ij} = \text{ReLU}(q_i)^\top \text{ReLU}(k_j) \cdot \cos\left(\frac{\pi}{2} \cdot \frac{i - j}{n}\right)\]

Achieved state-of-the-art performance on Long-Range Arena and other long-sequence benchmarks, representing “usable linear Attention”.

4.3 Linear Attention with Forgetting Gates

A fundamental problem with original linear attention: state matrix $S_t$ can only accumulate, cannot forget. As sequence grows, historical information “crowds together”, degrading retrieval capability.

RetNet

Microsoft’s Retentive Network introduces exponential decay factor $\gamma \in (0, 1)$:

\[S_t = \gamma S_{t-1} + v_t k_t^\top, \quad o_t = S_t \cdot q_t\]

This applies exponential decay to historical information, emphasizing the importance of recent tokens.

Multi-Scale Retention: Different attention heads use different $\gamma$ values, retaining memory at multiple time scales.

Three Computation Modes:

  1. Parallel mode: Matrix computation during training
  2. Recurrent mode: $O(1)$ inference
  3. Chunk-wise recurrent mode: Efficient processing of long sequences

Performance: 7B model on 8k sequences, inference speed 8.4× faster than Transformer, memory reduced by 70%.

Lightning Attention

Lightning Attention, which MiniMax scaled to production, is currently the first linear attention architecture deployed at commercial grade. Core innovations:

Block Computation Strategy: Divides attention computation into intra-block and inter-block parts: intra-block attention uses the conventional quadratic form, while inter-block contributions use the linear recurrent form.

This decomposition avoids the slow cumsum operation in traditional linear attention.

Hybrid Architecture: Every 8 layers, 7 use Lightning Attention, 1 uses standard Softmax Attention, balancing efficiency and precision.

| Parameter | MiniMax-01 |
|---|---|
| Total Parameters | 456B |
| Active Parameters (MoE) | 45.9B |
| Number of Experts | 32 |
| Training Context Length | 1M tokens |
| Inference Extrapolation Length | 4M tokens |

4.4 DeltaNet: Based on Delta Rule

Motivation: Solving Memory Overload

The core defect of original linear attention is memory overload: can only add new key-value associations, cannot erase existing information. This causes retrieval errors to accumulate as sequence grows.

Delta Rule Update

DeltaNet introduces the “out with the old, in with the new” Delta Rule:

\[S_t = S_{t-1} - \underbrace{(S_{t-1} \cdot k_t - v_t)}_{\text{delta}} \cdot k_t^\top\]

Intuitive understanding:

  1. $S_{t-1} \cdot k_t$: Use current key to retrieve value from memory
  2. $S_{t-1} \cdot k_t - v_t$: Compute difference (delta) between retrieved and true value
  3. Correct memory based on delta, achieving “precise update”
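One update step in code (a sketch; the write-strength $\beta$ follows the learning-rate interpretation and defaults to 1):

import torch

def deltanet_step(S, k, v, beta=1.0):
    # 1) retrieve the value currently stored for key k
    v_hat = S @ k
    # 2) the "delta": prediction error between stored and true value
    delta = v_hat - v
    # 3) rank-1 correction erases the old association, writes the new one
    return S - beta * torch.outer(delta, k)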

Gated DeltaNet

ICLR 2025 work further introduces gating mechanism:

\[S_t = \alpha_t \odot S_{t-1} + \beta_t \odot (v_t - S_{t-1} \cdot k_t) \cdot k_t^\top\]

where $\alpha_t$ controls forgetting, $\beta_t$ controls update strength.

Complementarity: Gating enables fast memory erasure, Delta Rule enables precise memory update. The combination outperforms Mamba2 and original DeltaNet on multiple benchmarks.

Industrial Adoption: Gated DeltaNet has been adopted by Qwen3-Next as the linear attention layer.

4.5 Connection to State Space Models

Mamba is another important efficient sequence modeling approach, based on Selective State Space Models (Selective SSM). The Mamba-2 paper reveals the Structured State Space Duality (SSD):

“Compared to standard self-attention, SSD has only two differences: removing softmax normalization, and applying a separate element-wise mask matrix.”

This indicates linear attention and SSMs can be viewed as different instances of the same framework: a linear recurrence over a matrix-valued state, differing mainly in how decay and gating are parameterized.

Hybrid Architectures: In practice, pure linear models still lag on some tasks, leading to hybrid architectures: MiniMax-01 interleaves Lightning Attention with standard softmax layers (7:1), and Qwen3-Next mixes Gated DeltaNet layers with gated standard attention.

4.6 Test-Time Training Perspective

Su Jianlin pointed out in “A Brief History of Linear Attention” that modern linear attention can be unified under the Test-Time Training (TTT) framework:

“View sequence modeling as an online learning problem, construct RNN using optimizers. Different loss functions correspond to different RNN models.”

| Method | Update Rule | Corresponding Optimizer |
|---|---|---|
| Linear Attention | $S_t = S_{t-1} + v_t k_t^\top$ | Accumulated gradient |
| RetNet | $S_t = \gamma S_{t-1} + v_t k_t^\top$ | Accumulation with decay |
| DeltaNet | $S_t = S_{t-1} - (S_{t-1}k_t - v_t)k_t^\top$ | Delta Rule |
| Gated DeltaNet | Gated Delta Rule | Adaptive learning rate |

This perspective provides principled guidance for designing new linear attention: choose appropriate “optimizers” to update memory state.

4.7 Industrial Deployment Status

| Company/Model | Architecture Type | Context Length | Features |
|---|---|---|---|
| MiniMax-01 | Lightning Attention + MoE | 1M-4M | First commercial-grade linear attention LLM |
| MiniMax-M1 | Lightning Attention | 1M + 80K generation | Open-source reasoning model |
| Qwen3-Next | Gated DeltaNet | - | Linear layers mixed with gated attention |

Key Observation: MiniMax is currently the only vendor to scale linear attention to commercial-grade. Other vendors (like Kimi, DeepSeek) prefer the sparse attention route.

4.8 Limitations and Future Outlook

Current Limitations:

  1. Precision gap: Pure linear attention still lags behind Softmax Attention on complex reasoning tasks
  2. In-context learning capability: Linear models typically have weaker few-shot capability than Transformers
  3. Long-range exact retrieval: Unstable performance on tasks like passkey retrieval

Development Trends:

  1. Hybrid architectures: Combining linear layers with sparse Softmax layers
  2. Gating mechanisms: More fine-grained memory management (like Gated DeltaNet)
  3. Knowledge distillation: Distilling from Softmax models to linear models (like LAWCAT)
  4. TTT principles: Designing new architectures based on optimizer perspective

Historical Assessment: Su Jianlin’s evaluation—”Linear attention has evolved from simply imitating Softmax Attention to ‘feeding back’ to it—applying DeltaNet improvements to Softmax Attention through kernel tricks. This shows the field is thriving with broad exploration space.”

Chapter Summary

This chapter introduced four optimization paths for attention mechanisms:

  1. FlashAttention: IO-aware optimization, doesn’t change computation but dramatically reduces memory access
    • v1→v4 continuous evolution, 15× peak performance improvement
  2. MLA: Low-rank compression of KV Cache
    • DeepSeek-V2 achieves 56.9× compression ratio, approaching MHA performance
  3. Sparse Attention: Only computes important token pairs
    • NSA/MoBA/DSA represent the evolution of “learned sparsity”
    • DSA validated on 671B model, API cost reduced by 50%
  4. Linear Attention: Changes computation form, truly $O(N)$ complexity
    • Lightning Attention first scaled to commercial-grade
    • Gated DeltaNet adopted by Qwen3-Next

Development Trend: Sparse attention and linear attention are moving from academic research to industrial mainstream. In 2025, both approaches have been validated on 600B+ scale models.

The next post will introduce Mixture of Experts (MoE) architecture.
