Transformer Notes (VIII): Frontier Applications
2025-12-20 · Qi Lu
This is the final article in the Transformer series, exploring frontier applications of large language models: multimodal large models and reasoning large models. These two directions represent the cutting edge of current AI research and are profoundly changing our understanding of intelligence.
1. Multimodal Large Models
As large language models achieve breakthroughs in the text domain, researchers have begun exploring how to combine visual, audio, and other multimodal information with language capabilities.
1.1 From Unimodal to Multimodal
Based on the depth and method of modality fusion, multimodal large models can be categorized as follows:
- Cascaded: Multiple independent models connected in series, such as using a vision model to extract descriptions first, then feeding them into a language model
- Adapter-based: Adding visual adapters on top of pre-trained LLMs, such as LLaVA, BLIP-2
- Native: Jointly trained on multimodal data from scratch, such as GPT-4o, Gemini
1.2 Core Challenges
Building multimodal large models faces several key challenges:
Modality Alignment: Images and text exist in different representation spaces, requiring effective cross-modal mapping. Images are continuous pixel values, while text is a discrete sequence of tokens. How to align them in the same semantic space is the core problem.
Information Compression: A 224×224 image contains 50,176 pixels, while typical vision encoders produce 196-576 visual tokens (e.g., 16×16 patches at 224×224 give (224/16)² = 196 tokens; 14×14 patches at 336×336 give (336/14)² = 576). How can we compress visual representations while retaining key information, avoiding an excessive sequence-length burden on the LLM?
Unifying Understanding and Generation: Visual understanding (like VQA) requires high-level semantic abstraction, while image generation requires fine-grained pixel-level information. How can we support both seemingly contradictory requirements in a single model?
1.3 Vision Encoders
Vision encoders are the “eyes” of multimodal large models, responsible for converting images into representations that language models can understand.
Vision Transformer (ViT)
Vision Transformer applies the Transformer architecture to image processing. Its core idea is to divide images into fixed-size patches, then process these patches like text tokens:
\[\mathbf{z}_0 = [\mathbf{x}_\text{class}; \mathbf{E}\mathbf{x}_1; \mathbf{E}\mathbf{x}_2; ...; \mathbf{E}\mathbf{x}_N] + \mathbf{E}_\text{pos}\]where $\mathbf{x}_{i} \in \mathbb{R}^{P^2 \cdot C}$ is the flattened vector of the $i$-th image patch, $\mathbf{E}$ is the patch embedding matrix, and $\mathbf{E}_{\text{pos}}$ is the positional encoding.
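A minimal PyTorch sketch of this patch-embedding step (dimensions follow ViT-Base conventions and are illustrative only):

```python
# Split an image into P x P patches, linearly embed them, prepend a [class]
# token, and add learnable positional embeddings -- the z_0 above.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A stride-P convolution is equivalent to flattening each patch and
        # multiplying by the embedding matrix E.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, images):                      # (B, C, H, W)
        x = self.proj(images)                       # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)            # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # prepend x_class
        return x + self.pos_embed                   # z_0

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -- 196 patches + 1 class token
```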
CLIP and Contrastive Learning
CLIP (Contrastive Language-Image Pre-training) trains vision encoders on 400 million image-text pairs through contrastive learning, aligning image representations with corresponding text descriptions in semantic space:
\[\mathcal{L}_\text{CLIP} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j)/\tau)}\right]\]
CLIP’s vision encoder (typically ViT-L/14) became the standard choice for early multimodal large models due to its powerful cross-modal alignment capability.
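A rough sketch of this contrastive objective in PyTorch; note that the original CLIP loss also includes the symmetric text-to-image term, averaged with the image-to-text term shown above:

```python
# Contrastive image-text loss over a batch of N pairs: matching pairs lie on
# the diagonal of the N x N similarity matrix.
import torch
import torch.nn.functional as F

def clip_loss(v, t, tau=0.07):
    v = F.normalize(v, dim=-1)                 # image embeddings
    t = F.normalize(t, dim=-1)                 # text embeddings
    logits = v @ t.T / tau                     # (N, N) cosine similarities / temperature
    targets = torch.arange(v.size(0))          # i-th image matches i-th text
    # image-to-text and text-to-image cross-entropy, averaged
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```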
SigLIP and Improvements
SigLIP improves CLIP’s training objective by using sigmoid loss instead of softmax:
\[\mathcal{L}_\text{SigLIP} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\sigma(y_{ij} \cdot \text{sim}(\mathbf{v}_i, \mathbf{t}_j) \cdot \tau)\]where $y_{ij} = 1$ when $i=j$, otherwise $y_{ij} = -1$. This design allows training with larger batch sizes without requiring global negative sample synchronization, making training more efficient. SigLIP is widely used in new-generation models like InternVL and Qwen2-VL.
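A corresponding sketch of the sigmoid loss; every (image, text) pair becomes an independent binary classification, so no batch-wide softmax normalization is needed (the published SigLIP loss also learns a bias term, omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, t, temperature=10.0):
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T * temperature                 # (N, N), scaled as in the formula above
    labels = 2 * torch.eye(v.size(0)) - 1          # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()   # per-pair binary loss, averaged

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```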
1.4 Modality Fusion Mechanisms
The way visual features are injected into language models determines the architectural design of multimodal models. Current mainstream fusion mechanisms include:
Linear/MLP Projection
The simplest approach is to use linear layers or MLP to map visual features to the language model’s embedding space:
\[\mathbf{H}_v = \mathbf{W}_\text{proj} \cdot \mathbf{Z}_\text{vision} + \mathbf{b}\]
LLaVA initially adopted this approach, connecting CLIP ViT-L/14 and Vicuna through a simple linear projection matrix:
- Keeps vision encoder and LLM parameters frozen
- Only trains the projection matrix (~2M parameters)
- Two-stage training: pre-training alignment + instruction tuning
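A minimal sketch of this kind of connector, with illustrative dimensions (CLIP ViT-L/14 features are 1024-d; a 7B LLM embedding is 4096-d); the MLP variant corresponds to the LLaVA-1.5 upgrade described next:

```python
# Map frozen vision-encoder features into the LLM embedding space.
import torch
import torch.nn as nn

def build_projector(vision_dim=1024, llm_dim=4096, kind="linear"):
    if kind == "linear":                       # LLaVA-1.0: a single matrix W_proj
        return nn.Linear(vision_dim, llm_dim)
    return nn.Sequential(                      # LLaVA-1.5: two-layer MLP with GELU
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

vision_tokens = torch.randn(1, 576, 1024)                  # 576 patch features from the ViT
llm_inputs = build_projector(kind="mlp")(vision_tokens)    # (1, 576, 4096)
```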
LLaVA-1.5 upgraded linear projection to a two-layer MLP, significantly improving multimodal capabilities:
\[\mathbf{H}_v = \mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \cdot \mathbf{Z}_\text{vision})\]
Q-Former (Querying Transformer)
BLIP-2 proposed the Q-Former architecture, using learnable query tokens to extract information from visual features through cross-attention:
\[\mathbf{Q}_\text{out} = \text{CrossAttn}(\mathbf{Q}_\text{learnable}, \mathbf{K}_\text{vision}, \mathbf{V}_\text{vision})\]
Q-Former’s core design:
- 32 learnable query embeddings (dimension 768)
- Transformer blocks initialized from BERT
- Cross-attention and self-attention layers alternately stacked
- Outputs fixed number of visual tokens (32), regardless of input image resolution
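A rough sketch of the querying idea, reduced to a single block; the real Q-Former stacks many BERT-initialized blocks with alternating self- and cross-attention:

```python
# A fixed set of learnable queries reads a variable number of vision features
# through cross-attention, so the LLM always receives the same number of
# visual tokens regardless of input resolution.
import torch
import torch.nn as nn

class QueryingBlock(nn.Module):
    def __init__(self, num_queries=32, dim=768, heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_feats):                      # (B, N_vis, dim), N_vis arbitrary
        q = self.queries.expand(vision_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]                # queries interact with each other
        q = q + self.cross_attn(q, vision_feats, vision_feats)[0]   # queries read the image
        return q                                          # always (B, 32, dim)

out = QueryingBlock()(torch.randn(2, 257, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```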
Two-stage Pre-training:
- Vision-language representation learning: Train Q-Former with frozen vision encoder using ITC, ITM, ITG losses
- Vision-language generative learning: Connect Q-Former output to frozen LLM, train generation capability
BLIP-2 surpasses Flamingo-80B by 8.7% on zero-shot VQAv2 while using roughly 54× fewer trainable parameters.
Cross-Attention Adapter
Flamingo and LLaMA 3.2 Vision adopt the approach of inserting cross-attention layers inside the LLM:
\[\mathbf{h}_l' = \mathbf{h}_l + \text{CrossAttn}(\mathbf{h}_l, \mathbf{K}_\text{vision}, \mathbf{V}_\text{vision})\]
LLaMA 3.2 Vision is built on LLaMA 3.1:
- Adds visual adapters on frozen LLaMA 3.1 text model
- Adapters include multi-layer cross-attention to inject image encoder representations into LLM
- Updates vision encoder and adapters during training, but freezes LLM parameters
- Maintains text capabilities unchanged, achieving “plug-and-play” replacement of LLaMA 3.1
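An illustrative gated cross-attention adapter in this spirit; the tanh gate initialized at zero (as in Flamingo) keeps the frozen LLM's behavior unchanged at the start of training. Dimensions are hypothetical:

```python
# Inject visual keys/values into a frozen decoder's hidden states.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=4096, heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # gate starts closed: output == input

    def forward(self, hidden, vision_feats):
        attended, _ = self.attn(hidden, vision_feats, vision_feats)
        return hidden + torch.tanh(self.gate) * attended

layer = GatedCrossAttention(dim=512, heads=8)
out = layer(torch.randn(1, 10, 512), torch.randn(1, 196, 512))
```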
Fusion Mechanism Comparison
| Method | Representative Model | Additional Parameters | Visual Tokens | Characteristics |
|---|---|---|---|---|
| Linear Projection | LLaVA | ~2M | 576 | Simple and efficient |
| MLP Projection | LLaVA-1.5 | ~20M | 576 | More expressive |
| Q-Former | BLIP-2 | ~107M | 32 | Compresses visual information |
| Cross-Attention | LLaMA 3.2 | ~1B | Variable | Deep fusion |
1.5 Representative Multimodal Models
LLaVA Series
LLaVA (Large Language and Vision Assistant) is one of the most influential open-source multimodal large models.
LLaVA-1.0:
- Vision encoder: CLIP ViT-L/14 (frozen)
- Language model: Vicuna-7B/13B (frozen)
- Connection: Linear projection layer
- Training data: 595K image-text pairs (pre-training) + 158K visual instruction data (fine-tuning)
LLaVA-1.5 Improvements:
- MLP replaces linear projection
- Input resolution increased from 224 to 336
- Added academic VQA data
- Larger language model (Vicuna-13B)
LLaVA-NeXT further supports dynamic resolution, dividing images into multiple sub-images for separate encoding.
Qwen-VL Series
Qwen-VL uses a larger vision encoder and higher resolution:
- Vision encoder: OpenCLIP ViT-bigG (448×448)
- Language model: Qwen-7B
- Connection: Single-layer cross-attention
Qwen2-VL Innovations:
- Dynamic resolution: Removes ViT’s absolute position encoding, introduces 2D-RoPE, supports arbitrary resolution input
- M-RoPE: Multimodal Rotary Position Embedding, decomposing rotary position encoding into temporal and spatial (height, width) components
- Token compression: MLP layer compresses adjacent 2×2 tokens into 1, a 224×224 image produces only 66 visual tokens
InternVL Series
InternVL’s unique design lies in scaling up the vision encoder:
- Vision encoder expanded to 6 billion parameters (InternViT-6B)
- Introduces QLLaMA as a “glue layer” (8B parameters) connecting vision and language
- Three-stage training: contrastive learning → generative learning → instruction tuning
InternVL 2.5 is the first open-source model to break 70% on the MMMU benchmark, reaching GPT-4o level.
1.6 Native Multimodal Models
“Native multimodal” refers to models designed with multimodal processing capabilities from the start, rather than “grafting” them onto unimodal models.
Non-native multimodal (e.g., ChatGPT with GPT-4V):
- Text generation: GPT-4
- Image understanding: GPT-4V (separate vision module)
- Speech recognition: Whisper
- Image generation: DALL-E 3
- Modules are independent, connected via API or text intermediary
Native multimodal (e.g., GPT-4o, Gemini):
- Single neural network processes all modalities end-to-end
- Jointly trained on multimodal data from scratch
- Shared representation space across modalities for deep fusion
- No text intermediary between modalities, reducing information loss
GPT-4o
GPT-4o (“o” stands for “omni”) was released in May 2024 as OpenAI’s first native multimodal flagship model.
Core Features:
- Single model processes text, audio, and visual inputs end-to-end
- Can directly generate text, audio, and image outputs
- Real-time voice conversation latency as low as 232 ms (about 320 ms on average), approaching human conversational response times
- Audio input preserves non-semantic information like tone and emotion
Difference from GPT-4V:
- GPT-4V: Upload image → Vision model recognition → Convert to text description → GPT-4 processing → Generate response
- GPT-4o: Upload image → Direct understanding and response generation (no intermediate conversion)
Google Gemini
Gemini is Google’s native multimodal model series.
Technical Report Statement:
“Gemini models are natively multimodal, as they are trained jointly across text, image, audio, and video.”
Architecture Features:
- Early Fusion architecture
- Jointly trained on multimodal data from the pre-training stage
- Supports 32K (Gemini 1.0) to 1M (Gemini 1.5/2.5) token context
Model Series:
- Gemini Ultra: Largest scale, first to surpass human expert level on MMLU
- Gemini Pro: Balanced performance and efficiency
- Gemini Nano: Optimized for on-device deployment
- Gemini 2.5 Pro: Released in 2025, adds “thinking model” capability
Meta Chameleon
Chameleon is Meta’s open-source native multimodal model, adopting a thorough early fusion architecture.
Core Design:
- Represents all modalities (images, text, code) as discrete tokens
- Unified vocabulary includes text, code, and image tokens
- Uses standard Transformer architecture to process mixed-modality sequences
- End-to-end training from scratch, no separate image encoder/decoder needed
Image Discretization: Uses improved VQ-VAE to encode images as discrete tokens:
- Images encoded as 1024 discrete tokens (32×32 latent grid)
- Codebook size 8192
- Shares unified embedding space with text tokens
Training Scale:
- 7B and 34B parameter versions
- Approximately 4.4 trillion tokens training data (text, image-text pairs, interleaved sequences)
- Over 5 million A100 GPU hours
1.7 Unifying Understanding and Generation
Traditional multimodal models either focus on understanding (like VQA) or generation (like text-to-image). Recent research has begun exploring unifying both capabilities in a single model.
Challenges and Contradictions
Understanding and generation have different requirements for visual representations:
- Understanding: Needs high-level semantic abstraction, focusing on “what is it”
- Generation: Needs fine-grained details, focusing on “how to draw it”
Using the same visual encoder for both tasks creates conflicts—semantic encoders (like CLIP) excel at understanding but generate images lacking details; pixel encoders (like VQ-GAN) can reconstruct details but have weak semantic understanding.
Show-o
Show-o proposes using a single Transformer to unify understanding and generation:
Core Design:
- Omni-Attention: Causal attention for text tokens, full attention for image tokens
- Mixed modeling: Autoregressive generation for text, discrete diffusion model for images
- Unified vocabulary: Text tokens and image tokens (VQ-GAN encoded) share the vocabulary
Task Capabilities:
- Image Captioning
- Visual Question Answering (VQA)
- Text-to-Image Generation
- Image Editing (Inpainting/Outpainting)
- Mixed-modality generation
Show-o outperforms larger models like NExT-GPT and Chameleon on VQAv2, while achieving FID 9.24 (MSCOCO 30K) on image generation.
Janus
DeepSeek’s Janus adopts a “decoupled encoding, unified processing” strategy:
Core Insight: Understanding and generation need different visual encodings, but can share language model processing.
Dual Encoder Design:
- Understanding encoder: SigLIP, extracts high-level semantic features
- Generation encoder: VQ tokenizer, produces discrete visual representations
- Shared Transformer: Unified processing of token sequences from both encoders
Janus-Pro (January 2025) further improves:
- Based on DeepSeek-LLM-7B
- MMBench reaches 79.2 (surpassing LLaVA-v1.5)
- Image generation FID 8.53 (MSCOCO 30K)
JanusFlow
JanusFlow changes the generation end from discrete tokens to continuous flow (Rectified Flow):
- Understanding end remains unchanged (SigLIP encoder)
- Generation end uses Rectified Flow instead of VQ tokenizer
- Image generation quality further improved
1.8 Visual Tokenizer
Visual tokenizers are key components of native multimodal and unified models, responsible for converting continuous images into discrete tokens.
VQ-VAE and VQ-GAN
VQ-VAE first proposed mapping continuous representations to a learnable discrete codebook:
\[z_q = \arg\min_{e_k \in \mathcal{C}} \|z_e - e_k\|_2\]where $z_e$ is the encoder output, and $\mathcal{C}$ is the codebook.
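A minimal sketch of this quantization step, including the straight-through estimator that lets gradients bypass the non-differentiable argmin:

```python
# Snap each continuous encoder vector z_e to its nearest codebook entry.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z_e):                                   # (B, N, dim)
        w = self.codebook.weight                              # (K, dim)
        # squared L2 distances via ||a||^2 + ||b||^2 - 2 a.b, shape (B, N, K)
        dists = (z_e.pow(2).sum(-1, keepdim=True)
                 + w.pow(2).sum(-1)
                 - 2 * z_e @ w.T)
        indices = dists.argmin(dim=-1)                        # discrete token ids
        z_q = self.codebook(indices)                          # nearest codebook vectors
        z_q = z_e + (z_q - z_e).detach()                      # straight-through estimator
        return z_q, indices

z_q, tokens = VectorQuantizer()(torch.randn(2, 1024, 256))
print(tokens.shape)  # torch.Size([2, 1024]) -- e.g. a 32x32 latent grid of image tokens
```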
VQ-GAN introduces adversarial loss on top of VQ-VAE:
\[\mathcal{L}_\text{VQ-GAN} = \mathcal{L}_\text{rec} + \mathcal{L}_\text{commit} + \mathcal{L}_\text{GAN} + \mathcal{L}_\text{perceptual}\]
VQ-GAN can encode a 256×256 image into 16×16 = 256 discrete tokens, each token drawn from a codebook of size 1024-16384.
Tokenizer Type Comparison
| Type | Representative | Codebook | Characteristics |
|---|---|---|---|
| Pixel-level | VQ-GAN | 8K-16K | High reconstruction quality, weak semantics |
| Semantic-level | CLIP-ViT | - | Strong semantics, cannot reconstruct |
| Hybrid | SEED | 8K | Balances semantics and reconstruction |
| Unified | TokenFlow | 16K | Dual encoder + shared mapping |
1.9 Multimodal Post-Training
Multimodal post-training is crucial for aligning with human preferences and improving instruction-following capabilities.
Visual Instruction Tuning
Training models to follow vision-related instructions using high-quality multimodal instruction data. LLaVA pioneered multimodal instruction tuning by using GPT-4 to generate multimodal instruction data:
- Using COCO dataset image annotations (bounding boxes, captions)
- Inputting visual information as prompts to GPT-4
- Generating 158K high-quality multimodal conversations, complex reasoning, and detailed descriptions
Multimodal RLHF
LLaVA-RLHF addresses multimodal hallucination issues:
- Training a reward model using 10K human preference data
- Optimizing the policy model through PPO (Proximal Policy Optimization)
- Significantly reducing hallucination rate and improving factual accuracy
mDPO (Multimodal DPO)
Standard DPO has issues in multimodal scenarios: when the preferred and rejected responses share the same image, the image condition effectively cancels out in DPO’s optimization objective, so the optimization can end up ignoring visual information.
mDPO introduces anchor samples to explicitly optimize image preferences:
\[\mathcal{L}_\text{mDPO} = \mathcal{L}_\text{DPO}(y_w, y_l | x, v) + \lambda \cdot \mathcal{L}_\text{anchor}(y_w | v, v')\]where $v'$ is a reference image different from $v$, and $\mathcal{L}_\text{anchor}$ ensures the model attends to image differences.
Multimodal Hallucinations
Model-generated content that doesn’t match the input image is a major problem for multimodal models:
| Hallucination Type | Description | Example |
|---|---|---|
| Object hallucination | Describing objects not in the image | Saying “there’s a cat” when there isn’t |
| Attribute hallucination | Incorrectly describing object attributes | Calling a red car blue |
| Relationship hallucination | Incorrectly describing relationships | “Person riding horse” but actually standing beside it |
| Quantity hallucination | Incorrectly counting objects | Saying 5 apples when there are 3 |
Causes of Hallucination:
- LLM’s language prior: Tendency to generate descriptions following language statistics
- Insufficient visual information utilization: Model may over-rely on text context
- Training data bias: Certain object-attribute combinations are more common in training data
LLaVA-Critic
LLaVA-Critic is the first open-source multimodal general evaluation model, capable of evaluating output quality from other multimodal models.
Core Capabilities:
- Reference-free evaluation: Directly evaluates generation quality without ground truth
- Pairwise comparison: Judges which of two responses is better
- Multi-dimensional scoring: Accuracy, relevance, detail level, hallucination degree
Self-improvement Path: LLaVA-Critic implements a “Self-Reward” closed loop:
- Generative model produces multiple candidate responses
- LLaVA-Critic evaluates and ranks
- Uses preference data for DPO training
- Model capabilities continuously improve
2. Reasoning Large Models
A series of breakthroughs in 2024 revealed a new dimension: sometimes, letting the model answer more slowly can lead to better results.
2.1 From Fast Thinking to Slow Thinking
Traditional LLMs adopt autoregressive generation, directly predicting the next token given input. This “System 1”-style fast response excels on many tasks but has limitations on tasks requiring complex reasoning:
- Limited reasoning depth: Each token’s generation only depends on previous context, lacking the ability to “go back and check”
- Error accumulation: Early errors in reasoning chains propagate to subsequent steps
- Lack of planning: Cannot plan solution paths in advance, can only “walk and see”
Test-Time Compute Scaling
The core idea of reasoning large models is Test-Time Compute Scaling: investing more computational resources during inference to achieve better output quality.
Key Findings from Snell et al. (2024):
- Test-time compute scaling can be more effective than scaling model parameters
- Using “compute-optimal” strategies, test-time compute efficiency can improve by more than 4x
- In FLOPs-matched evaluation, small model + test-time compute can surpass models 14x larger
Main approaches for test-time compute:
| Approach | Description | Representative Methods |
|---|---|---|
| Search | Generate multiple candidate answers, use verifier to select best | Best-of-N, MCTS |
| Thinking | Let model “think” longer, generate detailed reasoning process | CoT, o1, R1 |
| Iteration | Multiple rounds of self-correction and optimization | Self-Refine, Reflexion |
2.2 Chain-of-Thought and Self-Consistency
Chain-of-Thought (CoT)
Chain-of-Thought prompting is a foundational technique for reasoning large models, improving complex task performance by guiding the model to generate intermediate reasoning steps.
Basic Form:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, each containing 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans have 2*3=6 balls. 5+6=11. The answer is 11.
Zero-shot CoT: Simply adding “Let’s think step by step” can activate the model’s reasoning capabilities without providing examples.
Self-Consistency
Self-Consistency is an important improvement to Chain-of-Thought. The core idea is:
- Generate multiple reasoning paths for the same problem (by sampling different CoTs)
- Select the most consistent answer through majority voting
- Leverage the intuition that correct answers should be reachable through multiple approaches
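A minimal sketch of self-consistency; `generate` and `extract_answer` are hypothetical stand-ins for the model call and the answer parser:

```python
# Sample several chain-of-thought completions at non-zero temperature,
# extract the final answer from each, and return the majority answer.
from collections import Counter

def self_consistency(prompt, generate, extract_answer, n_samples=8):
    answers = []
    for _ in range(n_samples):
        cot = generate(prompt, temperature=0.7)   # sample a diverse reasoning path
        answers.append(extract_answer(cot))       # e.g. the number after "The answer is"
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples              # majority answer + agreement rate
```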
Performance Improvements:
| Dataset | Improvement |
|---|---|
| GSM8K | +17.9% |
| SVAMP | +11.0% |
| AQuA | +12.2% |
| StrategyQA | +6.4% |
Self-Consistency Improvements:
- CISC (Confidence-Informed SC): Confidence-weighted voting, reducing sampling requirements by over 40%
- RASC (Reasoning-Aware SC): Dynamically adjusts sampling count—fewer samples for easy problems, more for difficult ones
- LSC (Latent SC): Selection based on semantic consistency, suitable for long-form open-ended answers
2.3 Reward Models and Verifiers
Verifiers are used to evaluate the quality of model-generated reasoning processes and answers, serving as core components of search strategies.
Outcome Reward Model (ORM)
Outcome Reward Models only provide reward signals for the final answer:
\[r_\text{ORM}(x, y) = \begin{cases} 1 & \text{if } y \text{ is correct} \\ 0 & \text{otherwise} \end{cases}\]
Advantages: Low annotation cost, since only the final answer needs to be judged for correctness
Disadvantages:
- Difficult credit assignment: Cannot identify which step went wrong
- Delayed feedback: Reward only available after completing entire reasoning
Process Reward Model (PRM)
Process Reward Models provide reward signals for each reasoning step:
\[r_\text{PRM}(x, y_{1:t}) = \text{score}(y_t | x, y_{1:t-1})\]where $y_t$ is the $t$-th reasoning step, and the score is typically in $\{-1, 0, +1\}$, representing {wrong, neutral, correct}.
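An illustrative way to turn per-step PRM scores into a single solution score; both the product and the minimum of per-step correctness probabilities are common aggregation choices, and `prm_step_scores` is a hypothetical model call:

```python
import math

def score_solution(problem, steps, prm_step_scores, how="product"):
    p = prm_step_scores(problem, steps)           # list of per-step P(step is correct)
    if how == "product":
        return math.prod(p)                       # any weak step drags the score down
    return min(p)                                 # score of the worst step

# Best-of-N search then reduces to an argmax over these solution scores.
```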
OpenAI Experimental Results: Using pre-RLHF GPT-4 as base model, PRM achieves 78.2% accuracy on MATH test set, significantly outperforming ORM.
PRM vs ORM Comparison:
| Feature | ORM | PRM |
|---|---|---|
| Feedback Granularity | Overall result | Each step |
| Annotation Cost | Low | High |
| Credit Assignment | Difficult | Precise |
| Reward Hacking Risk | Low | Higher |
| Search Efficiency | Lower | Higher |
Implicit PRM: Recent research found that training an ORM and then using it as a PRM can obtain “free” process rewards without expensive step-level annotation.
Process Advantage Verifier (PAV)
PAV combines process supervision with advantage estimation:
- Search accuracy improved by over 8% compared to ORM
- Computational efficiency improved 1.5-5x
- Online RL sample efficiency improved 5-6x
2.4 Search and Planning
Best-of-N Sampling
The simplest search strategy is to generate N candidate answers and use a verifier to select the best:
\[y^* = \arg\max_{y \in \{y_1, ..., y_N\}} r(x, y)\]
OpenAI o1’s performance on AIME 2024:
- Single sampling (pass@1): 74%
- 64 samples + consensus (consensus@64): 83%
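Best-of-N itself reduces to a few lines; `generate` and `verifier` are hypothetical hooks (the verifier could be an ORM, an aggregated PRM score, or majority voting):

```python
# Sample N full solutions and keep the one the verifier scores highest.
def best_of_n(prompt, generate, verifier, n=64):
    candidates = [generate(prompt, temperature=0.8) for _ in range(n)]
    return max(candidates, key=lambda y: verifier(prompt, y))
```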
Monte Carlo Tree Search (MCTS)
MCTS models the reasoning process as a tree search problem, where each node is a reasoning state and edges are reasoning steps.
Basic Process:
- Selection: Use UCB formula to select promising nodes
- Expansion: Generate new reasoning steps
- Simulation: Complete reasoning and obtain results
- Backpropagation: Update values of all nodes on the path
UCB Formula:
\[\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}\]where $Q(s, a)$ is the action value estimate, $N(s)$ is the node visit count, and $c$ is the exploration coefficient.
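A sketch of UCB-based child selection over reasoning steps (the node fields are hypothetical):

```python
# Balance the current value estimate Q(s, a) against an exploration bonus
# that favors rarely visited reasoning steps.
import math

def select_child(node, c=1.4):
    def ucb(child):
        if child.visits == 0:
            return float("inf")                   # always try unvisited steps first
        exploit = child.value / child.visits      # Q(s, a)
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=ucb)
```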
MCTSr (MCT Self-Refine): Combines LLM self-improvement with MCTS, achieving excellent results on Olympiad-level math problems.
SC-MCTS*: Uses contrastive decoding to design interpretable reward models, combined with speculative decoding for acceleration, improving average per-node speed by 51.9%. Surpasses o1-mini by 17.4% on Blocksworld dataset.
2.5 OpenAI o1
OpenAI o1 (released September 2024) is the first large-scale commercial reasoning large model. Its core innovation is internalizing chain-of-thought as model capability.
Core Design
Key Features:
- Reasoning tokens: Model generates internal reasoning process before answering
- Hidden thinking: Reasoning tokens are invisible to users (but are billed)
- Reinforcement learning training: Learns “how to think” through large-scale RL
OpenAI Official Description:
“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”
Performance
| Benchmark | GPT-4o | o1-preview | o1 |
|---|---|---|---|
| AIME 2024 | 12% | 44% | 74% |
| Codeforces Rating | 808 | 1673 | 1891 |
| MATH-500 | 60.3% | 85.5% | 94.8% |
| GPQA Diamond | 50.6% | 73.3% | 78.0% |
Scaling Laws
o1 demonstrates two dimensions of scaling:
- Training-time compute: More RL training brings stronger reasoning capabilities
- Test-time compute: Longer thinking time brings better answer quality
This opens a new scaling path: performance can be improved not only by increasing parameters and training data, but also by increasing computation during inference.
2.6 DeepSeek-R1
DeepSeek-R1 (January 2025) is the first open-source model to prove that pure reinforcement learning can activate reasoning capabilities.
Breakthrough of Pure RL Training
Key Findings from DeepSeek-R1-Zero:
- No SFT needed: strong reasoning capabilities can be obtained through RL alone
- Advanced reasoning patterns such as self-reflection, verification, and dynamic strategy adjustment emerge
- AIME 2024: improved from 15.6% to 71.0% (pass@1), with majority voting reaching 86.7%
GRPO Algorithm
DeepSeek uses Group Relative Policy Optimization (GRPO) for reinforcement learning training:
Core Ideas:
- Eliminates the need for a Critic model of the same scale as the policy model in traditional RLHF
- Uses within-group relative scores as baseline estimates
- Significantly reduces training costs
GRPO Optimization Objective:
\[\mathcal{L}_\text{GRPO} = -\mathbb{E}_{x, \{y_i\}}\left[\sum_i \frac{r(x, y_i) - \bar{r}}{\sigma_r} \log \pi_\theta(y_i|x)\right]\]where $\bar{r}$ is the within-group average reward, and $\sigma_r$ is the within-group reward standard deviation.
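A sketch of the corresponding update for a single prompt: sample a group of responses, normalize their rewards within the group to obtain advantages (no critic needed), and weight the log-likelihoods accordingly. The full GRPO objective additionally uses PPO-style clipping and a KL penalty, omitted here:

```python
import torch

def grpo_loss(logprobs, rewards):
    # logprobs: (G,) summed log pi_theta(y_i | x) for each of G sampled responses
    # rewards:  (G,) scalar reward for each response
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return -(advantages.detach() * logprobs).mean()

loss = grpo_loss(torch.randn(8, requires_grad=True),
                 torch.tensor([1., 0., 0., 1., 0., 1., 0., 0.]))
loss.backward()
```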
Complete Training Pipeline
DeepSeek-R1’s training includes four stages:
- Cold Start Data: Small amount of high-quality reasoning data, solving R1-Zero’s readability issues
- Reasoning RL: Large-scale RL training, discovering better reasoning patterns
- Rejection Sampling SFT: Collecting high-quality outputs from RL model for SFT
- Preference RL: Alignment with human preferences
Emergent Capabilities
R1-Zero exhibited multiple advanced reasoning behaviors during training:
| Emergent Behavior | Description | Example Expression |
|---|---|---|
| Self-reflection | Re-examining reasoning process | “Wait, let me reconsider…” |
| Verification | Checking correctness of intermediate steps | “Let me verify this step…” |
| Backtracking | Going back and retrying after discovering errors | “That’s wrong, going back…” |
| Strategy switching | Trying another approach when one doesn’t work | “Let me try a different approach…” |
2.7 Knowledge Distillation
DeepSeek showed that reasoning capabilities can be transferred to smaller models through distillation.
Distillation Method
- Using DeepSeek-R1 to generate 800K reasoning samples
- Performing SFT on smaller models (no additional RL needed)
- Smaller models acquire similar reasoning capabilities
Distillation Model Performance
| Model | Base | AIME 2024 | MATH-500 |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 28.9% | 83.9% |
| R1-Distill-Qwen-7B | Qwen2.5-7B | 55.5% | 92.8% |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 69.7% | 93.9% |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 72.6% | 94.3% |
| R1-Distill-Llama-8B | Llama3.1-8B | 50.4% | 89.1% |
| R1-Distill-Llama-70B | Llama3.3-70B | 70.0% | 94.5% |
Key Findings:
- R1-Distill-Qwen-32B surpasses o1-mini
- Distillation outperforms direct RL training on same-size models
- Distillation is an efficient path to acquiring reasoning capabilities
2.8 Open-Source Reasoning Models
QwQ (Qwen with Questions)
QwQ is an open-source reasoning model released by Alibaba’s Qwen team (November 2024).
Design Philosophy:
“QwQ approaches every problem with genuine wonder and doubt. It knows that it knows nothing, and that’s precisely what drives its curiosity.”
Technical Features:
- 32B parameters, 32K context length
- Uses regularized reinforcement learning to embed reasoning capabilities
- Generates long thinking chains during inference
Performance:
- GPQA: 65.2% (graduate-level scientific reasoning)
- AIME 2024: 50.0%
- MATH-500: 90.6%
- LiveCodeBench: 50.0%
Known Limitations:
- May mix languages or unexpectedly switch languages
- May get stuck in circular reasoning, producing overly long outputs
Marco-o1
Another reasoning model from Alibaba, Marco-o1 uses MCTS algorithm to generate synthetic training data, combined with CoT samples for training.
Mainstream Reasoning Model Comparison
| Model | Parameters | Open-source | Training Method | AIME | MATH | Release |
|---|---|---|---|---|---|---|
| GPT-4o | - | No | SFT | 12% | 60.3% | 2024.05 |
| o1-preview | - | No | RL | 44% | 85.5% | 2024.09 |
| o1 | - | No | RL | 74% | 94.8% | 2024.12 |
| QwQ-32B | 32B | Yes | RL | 50% | 90.6% | 2024.11 |
| DeepSeek-R1 | 671B | Yes | RL | 79.8% | 97.3% | 2025.01 |
| R1-Distill-32B | 32B | Yes | Distillation | 72.6% | 94.3% | 2025.01 |
Training Paradigm Comparison
| Paradigm | Representative Model | Characteristics |
|---|---|---|
| Large-scale RL + Hidden Reasoning | o1 | Closed-source, reasoning process invisible |
| GRPO + Multi-stage Training | DeepSeek-R1 | Fully open-source, four-stage training |
| Regularized RL | QwQ | Open weights, long thinking chains |
| SFT Distillation | R1-Distill series | Efficient path to reasoning capabilities |
2.9 Applications and Limitations
Applicable Scenarios
Reasoning large models are particularly suitable for:
- Mathematical problems: Competition mathematics, theorem proving
- Code generation: Complex algorithms, debugging
- Scientific reasoning: Physics, chemistry problems
- Logical reasoning: Planning, constraint satisfaction
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| High latency | Long thinking time | Not suitable for real-time interaction |
| High cost | Reasoning tokens consume significant computation | Increased API call costs |
| Over-thinking | Even simple problems may produce lengthy reasoning | Resource waste |
| Circular reasoning | May get stuck in meaningless thought loops | Cannot converge |
| Language mixing | May mix multiple languages during thinking | Reduced readability |
Open Questions
- Optimal thinking length: How to determine when to stop thinking?
- Thinking interpretability: Can hidden reasoning processes be trusted?
- General reasoning: Currently mainly in math/code domains, how to extend to more domains?
- Efficiency optimization: How to reduce computational cost while maintaining reasoning quality?
3. Future Directions
3.1 Multimodal Reasoning
Extending reasoning capabilities to multimodal is an important research direction:
| Direction | Capability | Application Scenarios |
|---|---|---|
| Visual reasoning | Logical relationship inference in images | Math geometry problems, chart understanding |
| Video understanding | Temporal reasoning, event causal analysis | Video Q&A, action prediction |
| Embodied intelligence | Planning and interaction in physical world | Robot manipulation, autonomous driving |
3.2 Unifying All Modalities
Most current models primarily handle images and text. The future will extend to more modalities:
- Audio/Speech: Native speech understanding and generation (like GPT-4o)
- Video: Long video understanding and generation
- 3D: 3D scene understanding, spatial reasoning
- Tactile/Force feedback: Perception capabilities for embodied AI
3.3 Reasoning and Agents
Reasoning large models provide stronger planning capabilities for AI Agents:
| Capability | Description | Value |
|---|---|---|
| Task decomposition | Breaking complex tasks into subtasks | Reduces execution difficulty |
| Planning | Pre-planning execution paths | Improves success rate |
| Tool usage | Deciding when to call which tools | Expands capability boundaries |
| Long-term goals | Tracking and progressing toward long-term goals | Complex task completion |
3.4 Efficiency Improvements
Research directions for improving reasoning efficiency:
- Compute-optimal strategies: Dynamically adjust test-time compute based on task difficulty
- Simple problems: Fast response
- Difficult problems: Deep thinking
- Automatically predict difficulty and select strategy
- Early stopping strategies: Stop early when answer convergence is detected
- Speculative decoding: Accelerate reasoning token generation
- Sparse activation: Only activate parameters related to reasoning
- Lightweight models: Distill smaller reasoning models
4. Series Summary
This 8-article series provides a comprehensive analysis of the Transformer architecture and its applications in large language models:
| Article | Topic | Core Content |
|---|---|---|
| I | Fundamentals | Hardware background, Transformer computation, Scaling Law |
| II | Core Components | Tokenizer, positional encoding (RoPE), gating mechanisms |
| III | Attention Mechanisms | FlashAttention, MLA, sparse/linear attention |
| IV | Model Architecture | MoE sparse architecture, load balancing |
| V | Training Techniques | Data engineering, distributed training, Muon optimizer |
| VI | Evaluation Systems | MMLU, LiveCodeBench, Chatbot Arena |
| VII | Deployment Optimization | Quantization, inference engines, speculative decoding |
| VIII | Frontier Applications | Multimodal, reasoning large models |
Since the Transformer paper was published in 2017, this architecture has completely transformed the field of artificial intelligence. Looking ahead:
- Larger scale: Trillion-parameter models will become standard
- Longer context: Million-token-level processing capabilities
- Stronger reasoning: Paradigm shift from “fast thinking” to “slow thinking”
- More modalities: Truly “omnipotent” artificial intelligence
We are in the golden age of artificial intelligence development. I hope this series has helped you deeply understand the core of this technological revolution.
Series completed. Thank you for reading!