
Transformer Notes (VI): Evaluation and Benchmarks

2025-12-20 · Qi Lu

This is the sixth article in the Transformer series, systematically introducing evaluation and benchmarks for large language models. Evaluation is a complex and rapidly evolving field. This article focuses on evaluation benchmarks widely adopted by top-tier models since 2024.

1. Evaluation System Overview

1.1 Why Multi-Dimensional Evaluation is Needed

A single benchmark cannot comprehensively reflect model capabilities: a model can lead on knowledge tests yet lag on math, code, or instruction following, so results must be read across multiple dimensions.

1.2 Modern Evaluation Framework

Mainstream models typically report benchmarks in the following categories when released:

| Dimension | Core Benchmarks | Notes |
|---|---|---|
| Knowledge & Understanding | MMLU, MMLU-Pro, C-Eval | Multidisciplinary knowledge |
| Reasoning | GPQA, ARC-C, BBH | Complex reasoning |
| Mathematics | GSM8K, MATH-500, AIME | Elementary to competition level |
| Code | HumanEval, LiveCodeBench | Code generation and execution |
| Instruction Following | IFEval, MT-Bench | Instruction understanding and execution |
| Long Context | RULER, LongBench | Long text processing |
| Multilingual | MGSM, C-Eval | Non-English capabilities |
| Safety & Alignment | TruthfulQA, BBQ | Truthfulness and bias |

2. Knowledge and Understanding

2.1 MMLU (Massive Multitask Language Understanding)

MMLU is the most widely used knowledge evaluation benchmark, covering 57 subjects from elementary to professional level across STEM, the humanities, social sciences, and more.

Evaluation Method: four-option multiple choice; the model is prompted (typically with 5 in-context examples) and asked to output the letter of the correct answer, scored by accuracy.
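
A minimal sketch of this protocol, assuming a `generate(prompt) -> str` callable for the model under test; the prompt formatting here is illustrative rather than the exact harness used by any particular leaderboard:

```python
from typing import Callable, Dict, List

CHOICES = ["A", "B", "C", "D"]

def format_question(item: Dict, with_answer: bool) -> str:
    """Render one MMLU-style item as question + lettered options (+ answer)."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, item["options"])]
    lines.append(f"Answer: {item['answer'] if with_answer else ''}".rstrip())
    return "\n".join(lines)

def mmlu_accuracy(test_items: List[Dict],
                  few_shot_items: List[Dict],
                  generate: Callable[[str], str]) -> float:
    """5-shot accuracy: prepend 5 solved examples, then ask for the answer letter."""
    shots = "\n\n".join(format_question(x, with_answer=True) for x in few_shot_items[:5])
    correct = 0
    for item in test_items:
        prompt = shots + "\n\n" + format_question(item, with_answer=False)
        prediction = generate(prompt).strip()[:1].upper()  # first character ~ chosen letter
        correct += prediction == item["answer"]
    return correct / len(test_items)
```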

Current Performance (5-shot):

| Model | MMLU | Release Date |
|---|---|---|
| GPT-4o | 88.7% | 2024.05 |
| Claude 3.5 Sonnet | 88.7% | 2024.06 |
| DeepSeek-V3 | 88.5% | 2024.12 |
| Qwen2.5-72B | 86.1% | 2024.09 |
| LLaMA 3.1-405B | 88.6% | 2024.07 |

2.2 MMLU-Pro

An upgraded version of MMLU that addresses issues with the original: answer choices are expanded from 4 to 10 options, more reasoning-heavy questions are added, and noisy or trivial items are filtered out.

Better Discrimination: The gap between GPT-4 and Claude on MMLU is about 1%, but on MMLU-Pro it expands to 5-10%, better reflecting true capability differences.

2.3 GPQA (Graduate-Level Google-Proof QA)

Graduate-level questions in biology, physics, and chemistry, written by domain experts and designed to be "Google-proof": hard to answer correctly even with unrestricted web search.

GPQA-Diamond is the most difficult subset, serving as a key benchmark for distinguishing top-tier models:

| Model | GPQA-Diamond |
|---|---|
| DeepSeek-R1 | 71.5% |
| o1-preview | 73.3% |
| DeepSeek-V3 | 59.1% |
| Claude 3.5 Sonnet | 59.4% |
| GPT-4o | 53.6% |

3. Reasoning Capabilities

3.1 BBH (BIG-Bench Hard)

The 23 most challenging tasks from BIG-Bench, selected because earlier models failed to beat the average human rater on them; results are usually reported with chain-of-thought prompting.

3.2 ARC (AI2 Reasoning Challenge)

Grade-school science exam questions in multiple-choice form; the Challenge split (ARC-C) keeps only the questions that simple retrieval and word co-occurrence baselines answer incorrectly.

3.3 HellaSwag

Commonsense sentence completion: the model must pick the most plausible continuation of an everyday scenario from four adversarially filtered endings.

4. Mathematical Capabilities

4.1 GSM8K

About 8,500 elementary-school math word problems requiring two to eight steps of basic arithmetic; answers are scored by exact match on the final number.

Current top models achieve >95% accuracy, approaching saturation.
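
For reference, a minimal sketch of GSM8K-style exact-match scoring. The regex-based extraction of the model's final number is one common convention (official references end with a `#### <answer>` line), and `generate` is again a placeholder for the model call:

```python
import re
from typing import Callable, Dict, List

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, ignoring commas and dollar signs."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
    return matches[-1].replace(",", "") if matches else None

def gsm8k_accuracy(items: List[Dict], generate: Callable[[str], str]) -> float:
    """Exact-match accuracy against the '#### <answer>' line in GSM8K references."""
    correct = 0
    for item in items:
        gold = item["answer"].split("####")[-1].strip().replace(",", "")
        pred = extract_final_number(generate(item["question"]))
        correct += pred == gold
    return correct / len(items)
```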

4.2 MATH

12,500 competition-level problems (AMC, AIME, and similar sources) spanning seven subjects with difficulty levels 1–5; answers are checked against reference solutions.

MATH-500: A curated 500-problem subset of the MATH test set, currently the mainstream evaluation standard.

4.3 AIME (American Invitational Mathematics Examination)

A 15-question invitational competition for top American high-school students; every answer is an integer from 0 to 999, so grading is unambiguous, and the small dated problem set makes contamination easy to rule out.

Math Benchmark Performance Comparison:

| Model | GSM8K | MATH-500 | AIME 2024 |
|---|---|---|---|
| o1 | 96.4% | 96.4% | 74% |
| DeepSeek-R1 | 97.3% | 97.3% | 79.8% |
| DeepSeek-V3 | 91.1% | 90.2% | 39.2% |
| Claude 3.5 Sonnet | 96.4% | 78.3% | - |
| GPT-4o | 95.8% | 76.6% | - |

5. Code Capabilities

5.1 HumanEval

164 hand-written Python problems: the model completes a function from its signature and docstring, and is scored by pass@1 against hidden unit tests (see the pass@k sketch below).

HumanEval+: Adds more test cases, reducing false positives.
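
HumanEval results are reported with the unbiased pass@k estimator from the original paper, computed from n samples per problem of which c pass all tests. A small implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of them pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185, i.e. c / n
print(pass_at_k(n=200, c=37, k=10))  # much higher: ten tries at ~18.5% each
```

With greedy decoding, n = 1 and pass@1 reduces to plain accuracy.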

5.2 LiveCodeBench

The most important code-evaluation innovation of 2024, designed to solve the data-contamination problem: problems are continuously collected from LeetCode, AtCoder, and Codeforces together with their publication dates.

Why LiveCodeBench is Important: since every problem carries a timestamp, a model can be scored only on problems published after its training cutoff, so memorized solutions cannot inflate the result.

5.3 SWE-bench

Real-world software engineering tasks: the model receives a GitHub issue plus the repository snapshot and must produce a patch that makes the issue's tests pass; SWE-bench Verified is a human-validated 500-task subset. A simplified sketch of this loop follows.
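
A heavily simplified sketch of the apply-patch-then-run-tests loop behind this kind of evaluation; the real SWE-bench harness builds a pinned environment per task and distinguishes fail-to-pass from pass-to-pass tests, and `repo_dir` / `test_cmd` here are placeholders:

```python
import subprocess

def resolves_issue(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated diff and check whether the designated tests pass."""
    apply = subprocess.run(["git", "apply"], input=model_patch,
                           cwd=repo_dir, text=True, capture_output=True)
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```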

Code Benchmark Performance Comparison:

| Model | HumanEval | LiveCodeBench | SWE-bench Verified |
|---|---|---|---|
| Claude 3.5 Sonnet | 92.0% | 41.4% | 50.8% |
| DeepSeek-V3 | 82.6% | 40.5% | 42.0% |
| GPT-4o | 90.2% | 34.2% | 38.4% |

6. Instruction Following

6.1 IFEval (Instruction Following Evaluation)

Tests the model's ability to strictly follow instructions: roughly 500 prompts containing verifiable constraints such as "respond in at least 400 words" or "mention the keyword three times", all of which can be checked programmatically without a judge model.

Two Metrics: prompt-level accuracy (every instruction in a prompt is satisfied) and instruction-level accuracy (fraction of individual instructions satisfied), each reported in strict and loose variants; see the sketch below.

IFEval is one of the core benchmarks in the Open LLM Leaderboard.
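
A minimal sketch of how such verifiable constraints can be checked and aggregated into the two metrics above; the individual checkers are hypothetical simplifications of the instruction types IFEval actually ships:

```python
from typing import Dict, List

# Hypothetical checkers for two IFEval-style verifiable instructions.
def check_min_words(response: str, n: int) -> bool:
    return len(response.split()) >= n

def check_keyword(response: str, word: str, times: int) -> bool:
    return response.lower().count(word.lower()) >= times

def ifeval_scores(records: List[Dict]) -> Dict[str, float]:
    """Each record: {'response': str, 'checks': [callable(response) -> bool, ...]}.
    Returns prompt-level (all checks pass) and instruction-level (per-check) accuracy."""
    prompt_hits, inst_hits, inst_total = 0, 0, 0
    for r in records:
        results = [check(r["response"]) for check in r["checks"]]
        prompt_hits += all(results)
        inst_hits += sum(results)
        inst_total += len(results)
    return {"prompt_level": prompt_hits / len(records),
            "instruction_level": inst_hits / inst_total}

# Usage: one response with two constraints attached as closures.
records = [{"response": "Paris is the capital of France. " * 30,
            "checks": [lambda s: check_min_words(s, 100),
                       lambda s: check_keyword(s, "paris", 2)]}]
print(ifeval_scores(records))
```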

6.2 MT-Bench

Multi-turn conversation evaluation: 80 two-turn questions across eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities), scored 1–10 by a strong LLM judge such as GPT-4.
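
A sketch of the single-answer grading flow, with `call_judge` standing in for whatever judge-model API is used and a paraphrased (not official) judging prompt:

```python
import re
from typing import Callable

JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's answer to the
user question below on a scale of 1 to 10 and end with "Rating: [[score]]".

[Question]
{question}

[Assistant's Answer]
{answer}"""

def judge_single_answer(question: str, answer: str,
                        call_judge: Callable[[str], str]) -> int | None:
    """Send an MT-Bench-style grading prompt to a judge model and parse the score."""
    verdict = call_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return int(float(match.group(1))) if match else None
```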

6.3 Arena-Hard

A difficult subset distilled from Chatbot Arena: 500 challenging user prompts on which model answers are compared against a strong baseline by an LLM judge; win rates track the live Arena rankings closely.

7. Long Context Evaluation

7.1 RULER

Systematic evaluation of long context capabilities:

Task Types: retrieval (needle-in-a-haystack variants with multiple keys and values), multi-hop tracing, aggregation, and question answering over long inputs

Length Range: 4K to 128K+

Evaluation: Accuracy degradation curves at different lengths, revealing a model's effective context window as opposed to its claimed one

7.2 LongBench

A bilingual (English and Chinese) multi-task suite covering long-document QA, multi-document QA, summarization, few-shot learning, and code tasks.

7.3 Needle-in-a-Haystack

The simplest but most intuitive long context test: hide one short fact (the "needle") at a varying depth inside a long distractor document (the "haystack") and ask the model to retrieve it; results are usually drawn as a heatmap over context length and insertion depth.
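
A minimal sketch of the test, assuming a hypothetical passphrase as the needle and a `generate` placeholder for the model call:

```python
from typing import Callable, Dict, List, Tuple

NEEDLE = "The secret passphrase is 'blue-orchid-42'."       # hypothetical fact to retrieve
FILLER = "The quick brown fox jumps over the lazy dog. "     # stands in for distractor text

def build_haystack(total_chars: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    filler = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]

def needle_grid(generate: Callable[[str], str],
                lengths: List[int], depths: List[float]) -> Dict[Tuple[int, float], bool]:
    """Score retrieval at every (context length, insertion depth) cell."""
    question = "\n\nWhat is the secret passphrase? Answer with the passphrase only."
    return {(n, d): "blue-orchid-42" in generate(build_haystack(n, d) + question)
            for n in lengths for d in depths}
```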

8. Multilingual Evaluation

8.1 C-Eval / CMMLU

Chinese knowledge evaluation in the style of MMLU: multiple-choice questions drawn from Chinese school and professional examinations (C-Eval covers 52 disciplines, CMMLU 67 subjects).

8.2 MGSM (Multilingual GSM)

Multilingual mathematical reasoning: GSM8K problems manually translated into ten typologically diverse languages, testing whether chain-of-thought reasoning transfers beyond English.

9. Safety and Alignment

9.1 TruthfulQA

Tests whether models reproduce common human misconceptions: 817 questions across 38 categories (health myths, superstitions, conspiracy theories) where the most familiar answer is false.

9.2 SimpleQA

Factual accuracy evaluation released by OpenAI in 2024: short fact-seeking questions with a single verifiable answer, graded as correct, incorrect, or not attempted, which makes it a direct measure of hallucination rate.

10. Comprehensive Evaluation Platforms

10.1 Open LLM Leaderboard

Open evaluation platform maintained by Hugging Face:

Current version (v2) includes: MMLU-Pro, GPQA, MuSR, MATH (Level 5), IFEval, and BBH.

Features: Anyone can submit models for evaluation, transparent and reproducible.

10.2 Chatbot Arena

Evaluation based on real user voting: users chat with two anonymous models side by side and vote for the better answer; the votes are aggregated into Elo-style ratings, as sketched below.
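
A sketch of turning pairwise votes into ratings with online Elo updates; note that Chatbot Arena's published rankings are fit with a Bradley-Terry model, but the intuition is the same:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def elo_from_votes(votes: Iterable[Tuple[str, str, float]],
                   k: float = 4.0, base: float = 1000.0) -> Dict[str, float]:
    """Online Elo over (model_a, model_b, score) votes, where score is
    1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie."""
    rating: Dict[str, float] = defaultdict(lambda: base)
    for a, b, score in votes:
        expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400))
        rating[a] += k * (score - expected_a)
        rating[b] += k * (expected_a - score)
    return dict(rating)

# Hypothetical votes: model_x beats model_y twice, ties once.
print(elo_from_votes([("model_x", "model_y", 1.0),
                      ("model_x", "model_y", 1.0),
                      ("model_x", "model_y", 0.5)]))
```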

10.3 LiveBench

Dynamic evaluation resistant to contamination: new questions are released monthly and every answer is graded against objective ground truth rather than an LLM judge.

11. Evaluation Best Practices

11.1 Avoiding Data Contamination

Benchmark questions that leak into training data inflate scores without reflecting capability. Prefer dynamically updated benchmarks (LiveCodeBench, LiveBench), check problem release dates against the model's training cutoff, and screen test items for n-gram overlap with the training corpus, as in the sketch below.
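
A simplified n-gram overlap check in this spirit (13-gram matching on whitespace tokens; real pipelines use proper tokenizers and fuzzier matching):

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lower-cased whitespace-token n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items: List[str],
                      training_docs: Iterable[str], n: int = 13) -> List[int]:
    """Return indices of benchmark items sharing any n-gram with the training corpus."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_grams]
```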

11.2 Standardized Evaluation Configuration

Report the exact prompt template, number of shots, decoding parameters, and answer-extraction rules; without these, scores from different reports are not comparable even on the same benchmark.

11.3 Choosing Appropriate Benchmarks

| Evaluation Goal | Recommended Benchmarks |
|---|---|
| Quick general capability assessment | MMLU-Pro, GPQA-Diamond |
| Mathematical reasoning | MATH-500, AIME |
| Code generation | LiveCodeBench, SWE-bench |
| Instruction following | IFEval |
| Long context | RULER, Needle-in-a-Haystack |
| Chinese capabilities | C-Eval, CMMLU |
| Real user preferences | Chatbot Arena, Arena-Hard |

12. Summary

This article systematically introduces the evaluation system for large language models:

| Dimension | Key Benchmarks | Current Trends |
|---|---|---|
| Knowledge | MMLU-Pro, GPQA | Toward harder, more specialized |
| Math | MATH-500, AIME | Competition-level problems becoming standard |
| Code | LiveCodeBench | Dynamic updates to prevent contamination |
| Instruction | IFEval | Programmatically verifiable constraints |
| Comprehensive | Chatbot Arena | Real user preferences |

Limitations of Evaluation: benchmark scores can be inflated by contamination or benchmark-specific tuning, static test sets saturate quickly, and no single number fully captures usefulness in real applications.

In the next article, we will discuss deployment optimization, including model quantization and inference acceleration techniques.
