LLM Notes

Notes on LLMs and Reinforcement Learning - In-depth analysis of Transformer, RLHF, PPO, DPO, and related techniques

Transformer Notes (VIII): Frontier Applications

2025-12-20 · Qi Lu

This is the final article in the Transformer series, exploring frontier applications of large language models: multimodal large models and reasoning large models. These two directions represent the cutting edge of current AI research and are profoundly changing our understanding of intelligence.

1. Multimodal Large Models

As large language models achieve breakthroughs in the text domain, researchers have begun exploring how to combine visual, audio, and other multimodal information with language capabilities.

1.1 From Unimodal to Multimodal

Multimodal large models can be categorized by the depth and method of modality fusion.

1.2 Core Challenges

Building multimodal large models faces several key challenges:

Modality Alignment: Images and text exist in different representation spaces, requiring effective cross-modal mapping. Images are continuous pixel values, while text is a discrete sequence of tokens. How to align them in the same semantic space is the core problem.

Information Compression: A 224×224 image contains 50,176 pixels, while typical vision encoders produce 196-576 visual tokens. How can we compress visual representations while retaining key information, without imposing an excessive sequence-length burden on the LLM?

Unifying Understanding and Generation: Visual understanding (like VQA) requires high-level semantic abstraction, while image generation requires fine-grained pixel-level information. How can we support both seemingly contradictory requirements in a single model?

1.3 Vision Encoders

Vision encoders are the “eyes” of multimodal large models, responsible for converting images into representations that language models can understand.

Vision Transformer (ViT)

Vision Transformer applies the Transformer architecture to image processing. Its core idea is to divide images into fixed-size patches, then process these patches like text tokens:

\[\mathbf{z}_0 = [\mathbf{x}_\text{class}; \mathbf{E}\mathbf{x}_1; \mathbf{E}\mathbf{x}_2; ...; \mathbf{E}\mathbf{x}_N] + \mathbf{E}_\text{pos}\]

where $\mathbf{x}_{i} \in \mathbb{R}^{P^2 \cdot C}$ is the flattened vector of the $i$-th image patch, $\mathbf{E}$ is the patch embedding matrix, and $\mathbf{E}_{\text{pos}}$ is the positional encoding.
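As a concrete illustration, here is a minimal PyTorch sketch of this patch-embedding step; the hyperparameters are illustrative rather than those of any specific ViT checkpoint:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-P convolution is equivalent to flattening each patch
        # and multiplying by the embedding matrix E.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                            # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # z_0

z0 = PatchEmbed()(torch.randn(2, 3, 224, 224))  # -> (2, 197, 768)
```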

CLIP and Contrastive Learning

CLIP (Contrastive Language-Image Pre-training) trains vision encoders on 400 million image-text pairs through contrastive learning, aligning image representations with corresponding text descriptions in semantic space:

\[\mathcal{L}_\text{CLIP} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j)/\tau)}\right]\]

CLIP’s vision encoder (typically ViT-L/14) became the standard choice for early multimodal large models due to its powerful cross-modal alignment capability.
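The formula above shows the image-to-text direction; the full CLIP objective symmetrizes over both directions. A minimal PyTorch sketch, assuming L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def clip_loss(v, t, tau=0.07):
    """v, t: (N, d) L2-normalized image / text embeddings of matched pairs."""
    logits = v @ t.T / tau              # (N, N) pairwise similarities
    labels = torch.arange(v.shape[0])   # the i-th image matches the i-th text
    # CLIP averages the image->text and text->image cross-entropy losses.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```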

SigLIP and Improvements

SigLIP improves CLIP’s training objective by using sigmoid loss instead of softmax:

\[\mathcal{L}_\text{SigLIP} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\sigma(y_{ij} \cdot \text{sim}(\mathbf{v}_i, \mathbf{t}_j) \cdot \tau)\]

where $y_{ij} = 1$ when $i=j$, otherwise $y_{ij} = -1$. This design allows training with larger batch sizes without requiring global negative sample synchronization, making training more efficient. SigLIP is widely used in new-generation models like InternVL and Qwen2-VL.
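A corresponding sketch of the pairwise sigmoid objective. In the SigLIP paper the temperature and bias are learnable parameters; they are shown as constants here for brevity:

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, t, t_scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss; every (i, j) pair is an independent binary problem."""
    logits = v @ t.T * t_scale + bias              # (N, N)
    labels = 2 * torch.eye(v.shape[0]) - 1         # +1 on the diagonal, -1 elsewhere
    # log(sigmoid(y_ij * logit_ij)), summed over all pairs, averaged over the batch
    return -F.logsigmoid(labels * logits).sum() / v.shape[0]
```

Because each pair is scored independently, no global softmax normalization over the batch is needed, which is what enables the large-batch efficiency described above.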

1.4 Modality Fusion Mechanisms

The way visual features are injected into language models determines the architectural design of multimodal models. Current mainstream fusion mechanisms include:

Linear/MLP Projection

The simplest approach is to use linear layers or MLP to map visual features to the language model’s embedding space:

\[\mathbf{H}_v = \mathbf{W}_\text{proj} \cdot \mathbf{Z}_\text{vision} + \mathbf{b}\]

LLaVA initially adopted this approach, connecting CLIP ViT-L/14 and Vicuna through a simple linear projection matrix.

LLaVA-1.5 upgraded linear projection to a two-layer MLP, significantly improving multimodal capabilities:

\[\mathbf{H}_v = \mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \cdot \mathbf{Z}_\text{vision})\]
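A minimal sketch of such a projector in PyTorch; the dimensions (1024-d vision features into a 4096-d LLM embedding space) are illustrative:

```python
import torch.nn as nn

# LLaVA-1.5-style projector: maps vision encoder outputs (e.g., CLIP ViT-L/14
# patch features) into the LLM embedding space. Dimensions are illustrative.
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
# H_v = projector(Z_vision); H_v is then concatenated with the text embeddings.
```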

Q-Former (Querying Transformer)

BLIP-2 proposed the Q-Former architecture, using learnable query tokens to extract information from visual features through cross-attention:

\[\mathbf{Q}_\text{out} = \text{CrossAttn}(\mathbf{Q}_\text{learnable}, \mathbf{K}_\text{vision}, \mathbf{V}_\text{vision})\]

Q-Former’s core design centers on two-stage pre-training:

  1. Vision-language representation learning: Train Q-Former with a frozen vision encoder using ITC (image-text contrastive), ITM (image-text matching), and ITG (image-grounded text generation) losses
  2. Vision-language generative learning: Connect Q-Former output to frozen LLM, train generation capability

BLIP-2 surpasses Flamingo-80B by 8.7% on zero-shot VQAv2, with only 1/54 as many trainable parameters.
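A minimal sketch of the query-based extraction idea. The real Q-Former is a BERT-initialized stack that interleaves self-attention with cross-attention; only the cross-attention pooling is shown here:

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Minimal Q-Former-style block: K learnable queries attend to vision features."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

    def forward(self, vision_feats):          # (B, N_patches, dim)
        q = self.queries.expand(vision_feats.shape[0], -1, -1)
        out, _ = self.cross_attn(q, vision_feats, vision_feats)
        return out                            # (B, 32, dim): fixed-size summary

tokens = QueryPooler()(torch.randn(2, 257, 768))  # 257 patch features -> 32 tokens
```

Whatever the input resolution, the output is always a fixed number of query tokens, which is how Q-Former compresses visual information before the LLM.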

Cross-Attention Adapter

Flamingo and LLaMA 3.2 Vision adopt the approach of inserting cross-attention layers inside the LLM:

\[\mathbf{h}_l' = \mathbf{h}_l + \text{CrossAttn}(\mathbf{h}_l, \mathbf{K}_\text{vision}, \mathbf{V}_\text{vision})\]

LLaMA 3.2 Vision is built on LLaMA 3.1.
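Flamingo inserts these layers with a zero-initialized tanh gate, so at the start of training the pretrained LLM behaves exactly as before. A minimal sketch of one such adapter layer:

```python
import torch
import torch.nn as nn

class GatedCrossAttnLayer(nn.Module):
    """Flamingo-style adapter: residual cross-attention with a zero-init tanh gate."""
    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: starts as identity

    def forward(self, h, vision_kv):          # h: (B, T, dim) LLM hidden states
        attn_out, _ = self.attn(h, vision_kv, vision_kv)
        return h + torch.tanh(self.gate) * attn_out
```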

Fusion Mechanism Comparison

| Method | Representative Model | Additional Parameters | Visual Tokens | Characteristics |
|---|---|---|---|---|
| Linear Projection | LLaVA | ~2M | 576 | Simple and efficient |
| MLP Projection | LLaVA-1.5 | ~20M | 576 | More expressive |
| Q-Former | BLIP-2 | ~107M | 32 | Compresses visual information |
| Cross-Attention | LLaMA 3.2 | ~1B | Variable | Deep fusion |

1.5 Representative Multimodal Models

LLaVA Series

LLaVA (Large Language and Vision Assistant) is one of the most influential open-source multimodal large models.

LLaVA-1.0:

LLaVA-1.5 Improvements:

LLaVA-NeXT further supports dynamic resolution, dividing images into multiple sub-images for separate encoding.

Qwen-VL Series

Qwen-VL uses a larger vision encoder and higher resolution.

Qwen2-VL Innovations:

InternVL Series

InternVL’s unique design lies in scaling up the vision encoder.

InternVL 2.5 is the first open-source model to break 70% on the MMMU benchmark, reaching GPT-4o level.

1.6 Native Multimodal Models

“Native multimodal” refers to models designed with multimodal processing capabilities from the start, rather than “grafting” them onto unimodal models.

Non-native multimodal (e.g., ChatGPT with GPT-4V):

Native multimodal (e.g., GPT-4o, Gemini):

GPT-4o

GPT-4o (“o” stands for “omni”) was released in May 2024 as OpenAI’s first native multimodal flagship model.

Core Features:

Difference from GPT-4V:

Google Gemini

Gemini is Google’s native multimodal model series.

Technical Report Statement:

“Gemini models are natively multimodal, as they are trained jointly across text, image, audio, and video.”

Architecture Features:

Model Series:

Meta Chameleon

Chameleon is Meta’s open-source native multimodal model, adopting a thorough early fusion architecture.

Core Design:

Image Discretization: Uses an improved VQ-VAE to encode images as discrete tokens.

Training Scale:

1.7 Unifying Understanding and Generation

Traditional multimodal models either focus on understanding (like VQA) or generation (like text-to-image). Recent research has begun exploring unifying both capabilities in a single model.

Challenges and Contradictions

Understanding and generation have different requirements for visual representations:

Using the same visual encoder for both tasks creates conflicts: semantic encoders (like CLIP) excel at understanding but generate images lacking detail, while pixel encoders (like VQ-GAN) can reconstruct details but have weak semantic understanding.

Show-o

Show-o proposes using a single Transformer to unify understanding and generation.

Core Design:

Task Capabilities:

Show-o outperforms larger models like NExT-GPT and Chameleon on VQAv2, while achieving FID 9.24 (MSCOCO 30K) on image generation.

Janus

DeepSeek’s Janus adopts a “decoupled encoding, unified processing” strategy:

Core Insight: Understanding and generation need different visual encodings, but can share language model processing.

Dual Encoder Design:

Janus-Pro (January 2025) further improves on this design.

JanusFlow

JanusFlow changes the generation end from discrete tokens to a continuous flow model (Rectified Flow).

1.8 Visual Tokenizer

Visual tokenizers are key components of native multimodal and unified models, responsible for converting continuous images into discrete tokens.

VQ-VAE and VQ-GAN

VQ-VAE first proposed mapping continuous representations to a learnable discrete codebook:

\[z_q = \arg\min_{e_k \in \mathcal{C}} \|z_e - e_k\|_2\]

where $z_e$ is the encoder output, and $\mathcal{C}$ is the codebook.
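A minimal sketch of this codebook lookup with the straight-through gradient estimator; the commitment and reconstruction losses of the full VQ-VAE objective are omitted:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z_e):                    # z_e: (B, N, dim) encoder output
        # Squared L2 distance from every vector to every codebook entry: (B, N, K)
        d = (z_e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                 # (B, N) discrete token ids
        z_q = self.codebook(idx)               # nearest codebook vectors
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx
```

The `idx` tensor is exactly the sequence of discrete image tokens that an autoregressive model like Chameleon consumes.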

VQ-GAN introduces adversarial loss on top of VQ-VAE:

\[\mathcal{L}_\text{VQ-GAN} = \mathcal{L}_\text{rec} + \mathcal{L}_\text{commit} + \mathcal{L}_\text{GAN} + \mathcal{L}_\text{perceptual}\]

VQ-GAN can encode a 256×256 image into 16×16=256 discrete tokens, each token from a codebook of size 1024-16384.

Tokenizer Type Comparison

| Type | Representative | Codebook | Characteristics |
|---|---|---|---|
| Pixel-level | VQ-GAN | 8K-16K | High reconstruction quality, weak semantics |
| Semantic-level | CLIP-ViT | - | Strong semantics, cannot reconstruct |
| Hybrid | SEED | 8K | Balances semantics and reconstruction |
| Unified | TokenFlow | 16K | Dual encoder + shared mapping |

1.9 Multimodal Post-Training

Multimodal post-training is crucial for aligning with human preferences and improving instruction-following capabilities.

Visual Instruction Tuning

Training models to follow vision-related instructions using high-quality multimodal instruction data. LLaVA pioneered multimodal instruction tuning, using GPT-4 to generate such data.

Multimodal RLHF

LLaVA-RLHF addresses multimodal hallucination issues.

mDPO (Multimodal DPO)

Standard DPO has issues in multimodal scenarios: when images are the same condition in preferred and rejected samples, the image condition cancels out in DPO’s optimization objective, causing the optimization process to ignore visual information.

mDPO introduces anchor samples to explicitly optimize image preferences:

\[\mathcal{L}_\text{mDPO} = \mathcal{L}_\text{DPO}(y_w, y_l | x, v) + \lambda \cdot \mathcal{L}_\text{anchor}(y_w | v, v')\]

where $v'$ is a reference image different from $v$, and $\mathcal{L}_\text{anchor}$ ensures the model attends to image differences.
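As a rough sketch of how such an objective could be computed. The helper structure and the use of a mismatched image $v'$ as the anchor condition are illustrative readings of the formula above, not the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: negative log-sigmoid of the implicit reward margin."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

def mdpo_loss(lp, ref_lp, lam=1.0, beta=0.1):
    """lp / ref_lp: dicts of sequence log-probs from the policy / reference model.
    Keys (hypothetical): 'w_v'  = log p(y_w | x, v),  'l_v' = log p(y_l | x, v),
                         'w_v2' = log p(y_w | x, v') with v' a mismatched image."""
    text_pref = dpo_term(lp['w_v'], lp['l_v'], ref_lp['w_v'], ref_lp['l_v'], beta)
    # Anchor: the same answer should be more likely under the true image v than
    # under the mismatched image v', forcing the model to use the visual input.
    image_pref = dpo_term(lp['w_v'], lp['w_v2'], ref_lp['w_v'], ref_lp['w_v2'], beta)
    return text_pref + lam * image_pref
```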

Multimodal Hallucinations

Model-generated content that doesn’t match the input image is a major problem for multimodal models:

| Hallucination Type | Description | Example |
|---|---|---|
| Object hallucination | Describing objects not in the image | Saying “there’s a cat” when there isn’t |
| Attribute hallucination | Incorrectly describing object attributes | Calling a red car blue |
| Relationship hallucination | Incorrectly describing relationships between objects | “Person riding horse” when the person is standing beside it |
| Quantity hallucination | Incorrectly counting objects | Saying 5 apples when there are 3 |

Causes of Hallucination:

LLaVA-Critic

LLaVA-Critic is the first open-source multimodal general evaluation model, capable of evaluating output quality from other multimodal models.

Core Capabilities:

Self-improvement Path: LLaVA-Critic implements a “Self-Reward” closed loop:

  1. Generative model produces multiple candidate responses
  2. LLaVA-Critic evaluates and ranks
  3. Uses preference data for DPO training
  4. Model capabilities continuously improve

2. Reasoning Large Models

A series of breakthroughs in 2024 revealed a new dimension: sometimes, letting the model answer more slowly can lead to better results.

2.1 From Fast Thinking to Slow Thinking

Traditional LLMs adopt autoregressive generation, directly predicting the next token given the input. This “System 1”-style fast response excels on many tasks but has limitations on tasks requiring complex reasoning.

Test-Time Compute Scaling

The core idea of reasoning large models is Test-Time Compute Scaling: investing more computational resources during inference to achieve better output quality.

Key Findings from Snell et al. (2024):

Main approaches for test-time compute:

| Approach | Description | Representative Methods |
|---|---|---|
| Search | Generate multiple candidate answers, use a verifier to select the best | Best-of-N, MCTS |
| Thinking | Let the model “think” longer, generating a detailed reasoning process | CoT, o1, R1 |
| Iteration | Multiple rounds of self-correction and refinement | Self-Refine, Reflexion |

2.2 Chain-of-Thought and Self-Consistency

Chain-of-Thought (CoT)

Chain-of-Thought prompting is a foundational technique for reasoning large models, improving complex task performance by guiding the model to generate intermediate reasoning steps.

Basic Form:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, each containing 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans have 2*3=6 balls. 5+6=11. The answer is 11.

Zero-shot CoT: Simply adding “Let’s think step by step” can activate the model’s reasoning capabilities without providing examples.

Self-Consistency

Self-Consistency is an important improvement to Chain-of-Thought. The core idea is to sample multiple diverse reasoning paths at non-zero temperature and select the final answer by majority vote over their conclusions.
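A minimal sketch of the procedure; `generate` is a hypothetical stand-in for any sampling LLM call that returns a reasoning chain and a parsed final answer:

```python
from collections import Counter

def self_consistency(generate, prompt, n=20, temperature=0.7):
    """Sample n reasoning chains and majority-vote over their final answers.

    `generate` is a hypothetical callable:
      (prompt, temperature) -> (reasoning_chain, final_answer)
    """
    answers = [generate(prompt, temperature)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```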

Performance Improvements:

| Dataset | Improvement |
|---|---|
| GSM8K | +17.9% |
| SVAMP | +11.0% |
| AQuA | +12.2% |
| StrategyQA | +6.4% |

Self-Consistency Improvements:

2.3 Reward Models and Verifiers

Verifiers are used to evaluate the quality of model-generated reasoning processes and answers, serving as core components of search strategies.

Outcome Reward Model (ORM)

Outcome Reward Models only provide reward signals for the final answer:

\[r_\text{ORM}(x, y) = \begin{cases} 1 & \text{if } y \text{ is correct} \\ 0 & \text{otherwise} \end{cases}\]

Advantages: Low annotation cost; only the final answer needs to be judged for correctness

Disadvantages:

Process Reward Model (PRM)

Process Reward Models provide reward signals for each reasoning step:

\[r_\text{PRM}(x, y_{1:t}) = \text{score}(y_t | x, y_{1:t-1})\]

where $y_t$ is the $t$-th reasoning step, and the score is typically in $\{-1, 0, +1\}$, representing wrong, neutral, and correct respectively.

OpenAI Experimental Results: Using pre-RLHF GPT-4 as base model, PRM achieves 78.2% accuracy on MATH test set, significantly outperforming ORM.
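To rank complete solutions with a PRM, the per-step scores must be reduced to a single number; the product and the minimum of per-step correctness probabilities are both common aggregation choices in the PRM literature. A minimal sketch:

```python
import math

def solution_score(step_probs, how="prod"):
    """Aggregate per-step correctness probabilities into one solution score.

    step_probs: list of P(step is correct) from the PRM, one per reasoning step.
    'prod' scores the chance that every step is correct; 'min' scores the
    weakest step in the chain.
    """
    if how == "prod":
        return math.prod(step_probs)
    return min(step_probs)

# Ranking candidates (hypothetical objects carrying PRM step scores):
# best = max(candidates, key=lambda c: solution_score(c.step_probs))
```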

PRM vs ORM Comparison:

| Feature | ORM | PRM |
|---|---|---|
| Feedback granularity | Overall result | Each step |
| Annotation cost | Low | High |
| Credit assignment | Difficult | Precise |
| Reward hacking risk | Low | Higher |
| Search efficiency | Lower | Higher |

Implicit PRM: Recent research found that training an ORM and then using it as a PRM can obtain “free” process rewards without expensive step-level annotation.

Process Advantage Verifier (PAV)

PAV combines process supervision with advantage estimation.

2.4 Search and Planning

Best-of-N Sampling

The simplest search strategy is to generate N candidate answers and use a verifier to select the best:

\[y^* = \arg\max_{y \in \{y_1, ..., y_N\}} r(x, y)\]

OpenAI reports that o1’s AIME 2024 accuracy rises further when many candidates are sampled and re-ranked, compared with a single response.
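A minimal sketch of Best-of-N with a verifier; `generate` and `verifier` are hypothetical stand-ins for an LLM sampling call and a reward model:

```python
def best_of_n(generate, verifier, prompt, n=16):
    """Sample n candidate answers and return the one the verifier scores highest.

    Hypothetical callables:
      generate(prompt) -> answer string, sampled with temperature > 0
      verifier(prompt, answer) -> scalar score r(x, y)
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: verifier(prompt, y))
```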

Monte Carlo Tree Search (MCTS)

MCTS models the reasoning process as a tree search problem, where each node is a reasoning state and edges are reasoning steps.

Basic Process:

  1. Selection: Use UCB formula to select promising nodes
  2. Expansion: Generate new reasoning steps
  3. Simulation: Complete reasoning and obtain results
  4. Backpropagation: Update values of all nodes on the path

UCB Formula:

\[\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}\]

where $Q(s, a)$ is the action-value estimate, $N(s)$ is the visit count of node $s$, $N(s, a)$ is the visit count of action $a$, and $c$ is the exploration coefficient.
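A direct transcription of the formula, with unvisited actions given an infinite score so that each child is tried at least once:

```python
import math

def ucb(q, n_parent, n_action, c=1.414):
    """UCB score for choosing which child to expand next.

    q: current value estimate Q(s, a); n_parent: visits N(s);
    n_action: visits N(s, a).
    """
    if n_action == 0:
        return float("inf")   # force each action to be explored once
    return q + c * math.sqrt(math.log(n_parent) / n_action)
```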

MCTSr (MCT Self-Refine): Combines LLM self-improvement with MCTS, achieving excellent results on Olympiad-level math problems.

SC-MCTS*: Uses contrastive decoding to design an interpretable reward model, combined with speculative decoding for acceleration, improving average per-node speed by 51.9%; it surpasses o1-mini by 17.4% on the Blocksworld dataset.

2.5 OpenAI o1

OpenAI o1 (released September 2024) is the first large-scale commercial reasoning large model. Its core innovation is internalizing chain-of-thought as model capability.

Core Design

Key Features:

OpenAI Official Description:

“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”

Performance

| Benchmark | GPT-4o | o1-preview | o1 |
|---|---|---|---|
| AIME 2024 | 12% | 44% | 74% |
| Codeforces Rating | 808 | 1673 | 1891 |
| MATH-500 | 60.3% | 85.5% | 94.8% |
| GPQA Diamond | 50.6% | 73.3% | 78.0% |

Scaling Laws

o1 demonstrates two dimensions of scaling:

  1. Training-time compute: More RL training brings stronger reasoning capabilities
  2. Test-time compute: Longer thinking time brings better answer quality

This opens a new scaling path: performance can be improved not only by increasing parameters and training data, but also by increasing computation during inference.

2.6 DeepSeek-R1

DeepSeek-R1 (January 2025) is the first open-source model to prove that pure reinforcement learning can activate reasoning capabilities.

Breakthrough of Pure RL Training

Key Findings from DeepSeek-R1-Zero:

GRPO Algorithm

DeepSeek uses Group Relative Policy Optimization (GRPO) for reinforcement learning training:

Core Ideas:

GRPO Optimization Objective:

\[\mathcal{L}_\text{GRPO} = -\mathbb{E}_{x, \{y_i\}}\left[\sum_i \frac{r(x, y_i) - \bar{r}}{\sigma_r} \log \pi_\theta(y_i|x)\right]\]

where $\bar{r}$ is the within-group average reward, and $\sigma_r$ is the within-group reward standard deviation.
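A minimal sketch of this simplified objective. The full GRPO used in practice adds PPO-style ratio clipping and a KL penalty against a reference model, both omitted here:

```python
import torch

def grpo_loss(logp, rewards, eps=1e-6):
    """Simplified GRPO: REINFORCE with group-normalized advantages.

    logp:    (G,) sum of token log-probs of each of the G sampled answers
    rewards: (G,) scalar reward per answer (e.g., 1 if the final answer is correct)
    """
    # Advantage = reward standardized within the group; no value network needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return -(adv.detach() * logp).mean()
```

Because the baseline is the group mean rather than a learned critic, GRPO avoids training a separate value model, which is a major memory saving at the 671B scale of DeepSeek-R1.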

Complete Training Pipeline

DeepSeek-R1’s training includes four stages:

  1. Cold Start Data: Small amount of high-quality reasoning data, solving R1-Zero’s readability issues
  2. Reasoning RL: Large-scale RL training, discovering better reasoning patterns
  3. Rejection Sampling SFT: Collecting high-quality outputs from RL model for SFT
  4. Preference RL: Alignment with human preferences

Emergent Capabilities

R1-Zero exhibited multiple advanced reasoning behaviors during training:

| Emergent Behavior | Description | Example Expression |
|---|---|---|
| Self-reflection | Re-examining the reasoning process | “Wait, let me reconsider…” |
| Verification | Checking the correctness of intermediate steps | “Let me verify this step…” |
| Backtracking | Going back and retrying after discovering an error | “That’s wrong, going back…” |
| Strategy switching | Trying another approach when one doesn’t work | “Let me try a different approach…” |

2.7 Knowledge Distillation

DeepSeek demonstrated that reasoning capabilities can be transferred to smaller models through distillation.

Distillation Method

Distillation Model Performance

| Model | Base | AIME 2024 | MATH-500 |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 28.9% | 83.9% |
| R1-Distill-Qwen-7B | Qwen2.5-7B | 55.5% | 92.8% |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 69.7% | 93.9% |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 72.6% | 94.3% |
| R1-Distill-Llama-8B | Llama3.1-8B | 50.4% | 89.1% |
| R1-Distill-Llama-70B | Llama3.3-70B | 70.0% | 94.5% |

Key Findings:

2.8 Open-Source Reasoning Models

QwQ (Qwen with Questions)

QwQ is an open-source reasoning model released by Alibaba’s Qwen team (November 2024).

Design Philosophy:

“QwQ approaches every problem with genuine wonder and doubt. It knows that it knows nothing, and that’s precisely what drives its curiosity.”

Technical Features:

Performance:

Known Limitations:

Marco-o1

Another reasoning model from Alibaba, Marco-o1 uses MCTS to generate synthetic training data, which is combined with CoT samples for training.

Mainstream Reasoning Model Comparison

| Model | Parameters | Open-source | Training Method | AIME | MATH | Release |
|---|---|---|---|---|---|---|
| GPT-4o | - | No | SFT | 12% | 60.3% | 2024.05 |
| o1-preview | - | No | RL | 44% | 85.5% | 2024.09 |
| o1 | - | No | RL | 74% | 94.8% | 2024.12 |
| QwQ-32B | 32B | Yes | RL | 50% | 90.6% | 2024.11 |
| DeepSeek-R1 | 671B | Yes | RL | 79.8% | 97.3% | 2025.01 |
| R1-Distill-32B | 32B | Yes | Distillation | 72.6% | 94.3% | 2025.01 |

Training Paradigm Comparison

| Paradigm | Representative Model | Characteristics |
|---|---|---|
| Large-scale RL + hidden reasoning | o1 | Closed-source, reasoning process invisible |
| GRPO + multi-stage training | DeepSeek-R1 | Fully open-source, four-stage training |
| Regularized RL | QwQ | Open weights, long thinking chains |
| SFT distillation | R1-Distill series | Efficient path to reasoning capabilities |

2.9 Applications and Limitations

Applicable Scenarios

Reasoning large models are particularly suitable for:

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| High latency | Long thinking time | Not suitable for real-time interaction |
| High cost | Reasoning tokens consume significant compute | Increased API call costs |
| Over-thinking | Even simple problems may produce lengthy reasoning | Wasted resources |
| Circular reasoning | May get stuck in meaningless thought loops | Fails to converge |
| Language mixing | May mix multiple languages while thinking | Reduced readability |

Open Questions

3. Future Directions

3.1 Multimodal Reasoning

Extending reasoning capabilities to multimodal is an important research direction:

| Direction | Capability | Application Scenarios |
|---|---|---|
| Visual reasoning | Inferring logical relationships in images | Geometry problems, chart understanding |
| Video understanding | Temporal reasoning, causal analysis of events | Video Q&A, action prediction |
| Embodied intelligence | Planning and interaction in the physical world | Robot manipulation, autonomous driving |

3.2 Unifying All Modalities

Most current models primarily handle images and text; future models will extend to more modalities.

3.3 Reasoning and Agents

Reasoning large models provide stronger planning capabilities for AI Agents:

| Capability | Description | Value |
|---|---|---|
| Task decomposition | Breaking complex tasks into subtasks | Reduces execution difficulty |
| Planning | Pre-planning execution paths | Improves success rate |
| Tool usage | Deciding when to call which tools | Expands capability boundaries |
| Long-term goals | Tracking and progressing toward long-term goals | Enables complex task completion |

3.4 Efficiency Improvements

Several research directions aim to improve reasoning efficiency.

4. Series Summary

This 8-article series provides a comprehensive analysis of the Transformer architecture and its applications in large language models:

| Article | Topic | Core Content |
|---|---|---|
| I | Fundamentals | Hardware background, Transformer computation, Scaling Law |
| II | Core Components | Tokenizer, positional encoding (RoPE), gating mechanisms |
| III | Attention Mechanisms | FlashAttention, MLA, sparse/linear attention |
| IV | Model Architecture | MoE sparse architecture, load balancing |
| V | Training Techniques | Data engineering, distributed training, Muon optimizer |
| VI | Evaluation Systems | MMLU, LiveCodeBench, Chatbot Arena |
| VII | Deployment Optimization | Quantization, inference engines, speculative decoding |
| VIII | Frontier Applications | Multimodal and reasoning large models |

Since the Transformer paper was published in 2017, this architecture has completely transformed the field of artificial intelligence.

We are in the golden age of artificial intelligence development. I hope this series has helped you deeply understand the core of this technological revolution.


Series completed. Thank you for reading!

