
Why is LoRA Effective in RL Fine-tuning? An Information Bandwidth Perspective

2025-12-19 · Qi Lu

Introduction

In the post-training phase of Large Language Models (LLMs), Low-Rank Adaptation (LoRA) has become the most popular Parameter-Efficient Fine-Tuning (PEFT) method. A surprising finding is that in Reinforcement Learning (RL) fine-tuning scenarios, even when using very small ranks, LoRA’s performance can match that of Full Fine-Tuning.

This blog post will synthesize two excellent articles—Thinking Machines Lab’s LoRA Without Regret and Yingru Li’s Information Bandwidth in Reinforcement Learning—to explore the information-theoretic explanation behind this phenomenon.

The Basic Principles of LoRA

The core idea of LoRA is to approximate weight updates using low-rank matrices. Specifically, for an original weight matrix $W$, LoRA replaces it with:

\[W' = W + \gamma BA\]

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are two low-rank matrices (with rank $r$ much smaller than $d$ or $k$) and $\gamma$ is a scaling factor. Together, $B$ and $A$ contain far fewer parameters than the original $W \in \mathbb{R}^{d \times k}$, which significantly reduces the memory and compute required for training.
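
To make the update concrete, here is a minimal LoRA-style linear layer in PyTorch. This is my own sketch rather than code from either referenced article; the rank `r`, the `alpha / r` scaling, and the zero initialization of `B` follow common LoRA conventions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear weight W with a trainable low-rank update gamma * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze the original W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # A: r x d_in, B: d_out x r  ->  B @ A has the same shape as W,
        # but only r * (d_in + d_out) trainable parameters.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init => W' == W at step 0
        self.scale = alpha / r                            # the gamma factor in W' = W + gamma * B A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to using W' = W + gamma * B A, computed without materializing B @ A.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# Usage: wrap an existing projection and train only A and B.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
y = layer(torch.randn(2, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # torch.Size([2, 4096]) 65536
```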

Key Findings from LoRA Without Regret

Research from Thinking Machines Lab, led by John Schulman, reveals several important experimental findings:

1. Equivalent Performance in RL Scenarios

In reinforcement learning fine-tuning, even when using very small ranks, LoRA’s performance is nearly identical to full fine-tuning. This stands in stark contrast to supervised learning—in SL tasks with large datasets, LoRA often suffers performance degradation due to insufficient capacity.

2. The Importance of Learning Rate

LoRA requires a much larger learning rate than full fine-tuning—typically 20-100 times larger. After selecting the optimal learning rate, the training curves of different-sized LoRAs and full fine-tuning almost completely overlap.
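
As a sketch of what this means in code (hypothetical learning-rate values, and a plain `nn.Linear` standing in for a LoRA-wrapped model), the LoRA parameters simply get their own, much larger learning rate:

```python
import torch
import torch.nn as nn

# Illustrative numbers only: the point is the relative scale of the two learning rates.
full_ft_lr = 1e-5                 # a typical full fine-tuning learning rate
lora_lr = 20 * full_ft_lr         # sweep roughly 20-100x that value for LoRA, per the post

model = nn.Linear(4096, 4096)     # stand-in for a LoRA-wrapped model
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=lora_lr)
```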

3. Practical Recommendations

In short: budget for a learning-rate sweep well above the full fine-tuning value, and do not over-provision rank for RL workloads. The Practical Implications section below expands on when LoRA's capacity does and does not matter.

Information Bandwidth: A Theoretical Explanation

Why does RL fine-tuning need so little model capacity? Yingru Li's article provides an elegant explanation from an information-theoretic perspective.

Core Insight: Each Episode Learns Only About 1 Bit of Information

Policy gradient algorithms face a fundamental information bottleneck: each episode yields only about 1 bit of usable information for the learner.

This limitation stems from the structure of the gradient itself. With a single scalar advantage $A$ for the whole episode, the policy gradient is:

\[g = \sum_{t=0}^{T-1} \nabla \log \pi_\theta(a_t|s_t) \cdot A\]

All per-timestep rewards are aggregated into the single scalar $A$, which caps the information at $\leq \log_2(B)$ bits, where $B$ is the batch size.
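
The following toy sketch (my own illustration, not code from Li's article) makes the collapse explicit: however rich the per-timestep rewards are, they are summed into one scalar before they ever influence the gradient.

```python
import torch

def scalar_advantage_loss(logps: torch.Tensor, rewards: torch.Tensor, baseline: float = 0.0):
    """REINFORCE-style surrogate for one episode with a single scalar advantage.

    logps:   log pi_theta(a_t | s_t) for t = 0..T-1 (must carry gradients)
    rewards: per-timestep rewards r_t

    The per-timestep detail in `rewards` is collapsed into one number A before
    it ever touches the gradient, so g = sum_t grad log pi(a_t|s_t) * A says
    nothing about *which* timesteps were good or bad.
    """
    A = rewards.sum() - baseline       # the whole episode reduced to one scalar
    return -(logps.sum() * A)          # minimizing this yields the gradient above

# Toy episode: 5 steps, 3 actions; the policy took action 0 at every step.
logits = torch.randn(5, 3, requires_grad=True)
logps = torch.log_softmax(logits, dim=-1)[:, 0]
loss = scalar_advantage_loss(logps, torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0]))
loss.backward()                        # logits.grad is now populated
```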

Structural Bottleneck

This is a structural bottleneck that cannot be overcome by adding more parameters or computational resources. No matter how large your model is, the amount of information that can be learned per episode is limited by this theoretical upper bound.

The Alternative of Per-Timestep Advantages

Using per-timestep advantages can increase the information upper bound to $\leq H(r)$ bits:

\[g = \sum_{t=0}^{T-1} \nabla \log \pi_\theta(a_t|s_t) \cdot A_t\]

However, in practice, this requires more complex credit assignment.
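
For contrast, here is the same toy setup with per-timestep advantages, using the simplest credit assignment I could write down (discounted reward-to-go, with no learned baseline or critic); this choice is an assumption for illustration, not the scheme proposed in either referenced article.

```python
import torch

def per_timestep_advantage_loss(logps: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0):
    """Surrogate for g = sum_t grad log pi(a_t|s_t) * A_t.

    Here A_t is a naive discounted reward-to-go; a real system would subtract
    a learned baseline or use GAE, which is exactly the extra credit-assignment
    machinery the text refers to.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):              # A_t = sum_{k >= t} gamma^(k-t) * r_k
        running = rewards[t] + gamma * running
        advantages[t] = running
    return -(logps * advantages).sum()        # each timestep keeps its own weight

# Same toy episode as above, but now steps are weighted individually.
logits = torch.randn(5, 3, requires_grad=True)
logps = torch.log_softmax(logits, dim=-1)[:, 0]
loss = per_timestep_advantage_loss(logps, torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0]))
loss.backward()
```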

Unifying Theory and Practice

Now we can understand why LoRA is so effective in RL:

  1. Information bottleneck determines required capacity: Since each episode can only learn about 1 bit of information, the “capacity” required for model updates is very limited.

  2. LoRA’s capacity is sufficient: Even a very small-rank LoRA has enough trainable parameters to accommodate these sparse information updates (see the parameter-count sketch after this list).

  3. Extra parameters are wasteful: In RL scenarios, full fine-tuning has no essential advantage over LoRA, because the bottleneck is not in model capacity, but in information acquisition.
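
To put numbers on points 2 and 3, here is a quick back-of-the-envelope count; the specific values (d = 4096, rank 8, 100,000 episodes) are my own illustrative choices, not figures from the articles.

```python
# Back-of-the-envelope capacity comparison for a single d x d weight matrix.
d, r = 4096, 8
full_params = d * d                     # parameters updated by full fine-tuning
lora_params = 2 * d * r                 # parameters in B (d x r) and A (r x d)

episodes = 100_000
bits_learned = episodes * 1             # ~1 bit per episode under the scalar-advantage bound

print(f"full fine-tuning: {full_params:,} params")    # 16,777,216
print(f"LoRA rank {r}: {lora_params:,} params")       # 65,536
print(f"~{bits_learned:,} bits after {episodes:,} episodes")
```

Even at this small rank, the adapter's parameter count comfortably exceeds the amount of information the scalar-advantage bound says those episodes can deliver, which is the sense in which extra capacity is wasted.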

This also explains why LoRA lags behind in large-dataset SL tasks—SL tasks don’t have this 1 bit/episode limitation. The larger the dataset, the more information can be learned, and at that point LoRA’s capacity limitation becomes the bottleneck.

Practical Implications

Based on these theoretical and experimental findings, we can draw the following practical recommendations:

For RL Fine-tuning (such as RLHF)

Small ranks are enough: the information bottleneck, not parameter count, limits how fast the policy can improve. Prefer LoRA, and remember to raise the learning rate well above the full fine-tuning value.

For SL Tasks

Large datasets have no 1 bit/episode ceiling, so LoRA's capacity can become the bottleneck; use larger ranks or full fine-tuning when the training set is big enough for this to matter.

For System Design

The leverage is in information bandwidth, not parameter count: per-timestep advantages and better credit assignment raise the ceiling on what each episode can teach, whereas adding trainable parameters does not.

Conclusion

LoRA’s success in RL fine-tuning is not accidental—it has a profound information-theoretic foundation. The inherent 1 bit/episode information bottleneck in policy gradient algorithms means we simply don’t need many parameters to capture these updates. This insight not only explains existing experimental phenomena but also points the way for future algorithm design: rather than increasing model capacity, we should think about how to increase information bandwidth.

References

  1. LoRA Without Regret - Thinking Machines Lab
  2. Information Bandwidth in Reinforcement Learning - Yingru Li
  3. LoRA Primer - Tinker API