Nondeterminism in LLM Inference: Root Cause Analysis and Batch Invariance Solutions

2025-12-24 · Qi Lu

Problem Statement

In LLM inference services, identical inputs should produce identical outputs. However, empirical observations show that even with greedy decoding (temperature=0), outputs remain nondeterministic. The following experimental data comes from Thinking Machines Lab:

| Model | Samples | Distinct Outputs | Most Frequent Output Count |
|---|---|---|---|
| Qwen3-235B-A22B | 1000 | 80 | 78 |

This phenomenon contradicts the mathematical definition of greedy decoding: $\hat{y}_t = \arg\max_v p(v \mid y_{<t}, x)$ should be deterministic.

The objectives of this article are: (1) identify the root cause of nondeterminism; (2) design engineering-deployable solutions.


Hypothesis Testing: Concurrent Floating-Point vs. Batch Size Variation

Hypothesis 1: Floating-Point Non-Associativity from GPU Concurrency

The mainstream hypothesis suggests that GPU parallel computation introduces nondeterminism. The reasoning chain is:

  1. Floating-point addition is non-associative: $(a + b) + c \neq a + (b + c)$
  2. GPU parallel reduction execution order depends on thread scheduling
  3. Different execution orders produce different accumulation results
  4. Small differences are amplified by argmax, changing the output token

The typical supporting evidence for this hypothesis is the nondeterminism of CUDA’s atomicAdd operation.
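The non-associativity in step 1 is easy to reproduce without a GPU; the following minimal sketch (sizes chosen arbitrarily) sums the same float32 values in two different orders:

import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)

# Sequential left-to-right accumulation.
seq = torch.zeros((), dtype=torch.float32)
for v in x:
    seq = seq + v

# Two-level (tree-like) reduction: sum 64 chunks of 64, then sum the partial sums.
tree = x.view(64, 64).sum(dim=1).sum()

# The two orderings usually differ in the last few bits of the float32 result.
print(seq.item(), tree.item(), (seq - tree).item())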

Problems with Hypothesis 1

Thinking Machines Lab’s analysis points out that the core operations of modern Transformer inference (GEMM, normalization, attention) do not rely on atomic operations in the forward pass.

Experimental verification: With a single GPU and fixed batch size, multiple inference runs on the same input produce completely identical outputs. This rules out thread-level nondeterminism as the primary source.

Hypothesis 2: Batch Size Variation Causes Numerical Path Divergence

The real cause is that a kernel’s numerical output is a function of batch size.


When inference services use dynamic batching, the same request may be assigned to batches of different sizes at different times. Each kernel’s tiling strategy and reduction partitioning may change with batch size, leading to different numerical results.
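Whether a concrete difference shows up depends on the GPU, library version, and dtype, but the effect can be probed with a short sketch like the following (assumes a CUDA device; the shapes and the filler rows are arbitrary):

import torch

torch.manual_seed(0)
D = 4096
W = torch.randn(D, D, device="cuda", dtype=torch.float16)
x = torch.randn(D, device="cuda", dtype=torch.float16)

# The same vector, processed alone vs. as row 0 of a larger batch.
out_alone = (x[None, :] @ W)[0]
filler = torch.randn(127, D, device="cuda", dtype=torch.float16)
out_in_batch = (torch.cat([x[None, :], filler]) @ W)[0]

# A nonzero maximum difference here means the kernel's result for x depends on
# which batch the vector happened to be placed in.
print((out_alone.float() - out_in_batch.float()).abs().max().item())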


Mathematical Definition of Batch Invariance

Let kernel $K$ operate on input tensor $X \in \mathbb{R}^{B \times N \times D}$, producing output $Y = K(X)$.

Batch Invariance requires: for any sample $x_i$ in the batch, its output $y_i$ depends only on $x_i$ itself, independent of other samples in the batch:

\[K(X)[i] = K'(x_i), \quad \forall i \in [1, B]\]

where $K'$ is an equivalent single-sample kernel.

In practice, this property does not hold in standard kernel implementations. The reason lies in the batch dependency of tiling strategies.
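A direct way to test the property for a concrete kernel is to compare a sample processed alone against the same sample embedded in a larger batch. Below is a minimal checker, assuming the kernel takes a batched tensor as its only argument (the function name and padding size are arbitrary):

import torch

def check_batch_invariance(kernel, x, pad_to=32):
    # Compare kernel(x) for a single sample against the same sample as row 0 of a
    # larger batch; bitwise equality is required for strict batch invariance.
    alone = kernel(x[None, ...])[0]
    filler = torch.randn(pad_to - 1, *x.shape, dtype=x.dtype, device=x.device)
    in_batch = kernel(torch.cat([x[None, ...], filler], dim=0))[0]
    return bool(torch.equal(alone, in_batch))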


Batch Variance Analysis of Key Kernels

RMSNorm

RMSNorm is computed as:

\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2 + \epsilon}\]

The key step is reduction: $\sum_{i=1}^{D} x_i^2$.

Standard CUDA implementations typically use tree reduction:

[Figure: Tree reduction diagram]

The problem: When batch size changes, the CUDA kernel may choose different block size and grid configurations. Different configurations result in different reduction partitioning:

Different partitioning produces different intermediate rounding errors, with final results differing at the ULP (Unit in the Last Place) level.
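The effect of the partitioning alone can be simulated by accumulating the squared sum with different block sizes (the block sizes here are arbitrary):

import torch

torch.manual_seed(0)
x = torch.randn(8192, dtype=torch.float32)

def sum_of_squares(x, block_size):
    # Accumulate block partial sums in a fixed left-to-right order.
    total = torch.zeros((), dtype=torch.float32)
    for start in range(0, x.numel(), block_size):
        total = total + (x[start:start + block_size] ** 2).sum()
    return total

# Different block sizes imply different accumulation orders; the two results may
# differ in the last bits even though both are correct to within rounding.
print(sum_of_squares(x, 128).item(), sum_of_squares(x, 1024).item())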

Matrix Multiplication

Standard tiled GEMM implementation: $C = AB$, where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$.

cuBLAS selects optimal tiling configuration based on matrix dimensions and GPU architecture. For example:

| Configuration | Tile Size | Reduction Partitioning |
|---|---|---|
| Small matrix | 64×64 | K dimension split into 2 blocks |
| Large matrix | 128×128 | K dimension split into 4 blocks |

When batch size changes, the effective matrix dimension $M' = B \times M$ changes, triggering different tiling choices, resulting in different K-dimension reduction partitioning.

Mathematical expression: Let the K dimension be split into $P$ blocks, each of size $k_p$:

\[C_{ij} = \sum_{p=1}^{P} \sum_{l \in \text{block}_p} A_{il} B_{lj}\]

The order of floating-point additions depends on $P$ and block boundaries. Different $P$ values produce different results.
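The split-K effect can be reproduced directly with a simplified sketch (real GEMM kernels accumulate in registers and shared memory, but the ordering argument is the same; shapes are arbitrary and K is assumed divisible by the split count):

import torch

torch.manual_seed(0)
A = torch.randn(64, 4096, dtype=torch.float32)
B = torch.randn(4096, 64, dtype=torch.float32)

def gemm_split_k(A, B, num_splits):
    # Partition the K dimension into num_splits blocks and accumulate the partial
    # products in block order, mimicking a split-K GEMM schedule.
    K = A.shape[1]
    block = K // num_splits
    C = torch.zeros(A.shape[0], B.shape[1], dtype=torch.float32)
    for p in range(num_splits):
        C = C + A[:, p * block:(p + 1) * block] @ B[p * block:(p + 1) * block, :]
    return C

# Changing the number of K splits changes the floating-point accumulation order.
print((gemm_split_k(A, B, 2) - gemm_split_k(A, B, 4)).abs().max().item())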

Attention

The core of FlashAttention is tiled computation + Online Softmax. The incremental update formulas for Softmax:

\[m^{new} = \max(m^{old}, m^{(j)})\]

\[\ell^{new} = e^{m^{old} - m^{new}} \ell^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)}\]

\[O^{new} = \frac{e^{m^{old} - m^{new}} \ell^{old} \cdot O^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)} \cdot O^{(j)}}{\ell^{new}}\]

The key parameter is KV split size: how many blocks to split the KV cache into for incremental computation.

[Figure: Attention split size impact]

During decoding, FlashAttention dynamically selects split size based on KV cache length and GPU configuration. Different split sizes lead to different numbers of rescaling operations, accumulating different rounding errors.
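The dependence on split size can be reproduced with a toy single-query decode implementation of the update rules above (pure PyTorch, no FlashAttention kernels; this sketch defers the division by $\ell$ to the end, as FlashAttention-2 does, but the ordering argument is identical):

import torch

def decode_attention(q, K, V, split_size):
    # Single-query attention with online-softmax updates, processing the KV cache
    # in blocks of `split_size`.
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    o = torch.zeros(V.shape[-1])
    for s in range(0, K.shape[0], split_size):
        scores = K[s:s + split_size] @ q
        m_new = torch.maximum(m, scores.max())
        p = torch.exp(scores - m_new)
        l = torch.exp(m - m_new) * l + p.sum()
        o = torch.exp(m - m_new) * o + p @ V[s:s + split_size]
        m = m_new
    return o / l

torch.manual_seed(0)
q = torch.randn(64)
K = torch.randn(4096, 64)
V = torch.randn(4096, 64)

# Different KV split sizes change the number and order of rescaling steps.
print((decode_attention(q, K, V, 256) - decode_attention(q, K, V, 1024)).abs().max().item())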


Error Propagation: From ULP to Token Divergence

Individual kernel numerical differences are typically on the order of $10^{-6}$ to $10^{-8}$. How does this lead to token-level divergence?

Error Accumulation Model

Let a Transformer have $L$ layers, with relative error $\epsilon$ per layer. The final logits relative error is approximately:

\[\epsilon_{total} \approx L \cdot \epsilon\]

For $L = 80$ (like Llama-70B) and $\epsilon = 10^{-7}$:

\[\epsilon_{total} \approx 8 \times 10^{-6}\]

Argmax Fragility

When two tokens have close probabilities, small logits differences can flip the argmax:

\[\Delta \ell = \ell_1 - \ell_2\]
If $|\Delta \ell| < \epsilon_{total}$, the argmax result is unstable.
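A tiny constructed example of such a flip:

import torch

# Two near-tied logits: a perturbation of 8e-6, on the order of the accumulated
# error estimated above, flips the greedy choice from token 0 to token 1.
logits = torch.tensor([2.000003, 2.000000, -1.0])
perturbed = logits + torch.tensor([0.0, 8e-6, 0.0])

print(torch.argmax(logits).item(), torch.argmax(perturbed).item())  # 0, then 1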

Empirical observations show that in greedy decoding, approximately 1-5% of token positions are in this “fragile” state. Once divergence occurs, subsequent generation is completely different, exhibiting a butterfly effect.

[Figure: Butterfly effect of token divergence]


Solution: Batch-Invariant Kernel Design

Design Principles

Core principles for achieving batch invariance:

  1. Fixed tiling configuration: Use fixed block size and grid configuration regardless of input dimensions
  2. Fixed reduction order: Ensure reduction tree structure is independent of batch size
  3. Fixed split size: Use fixed KV split in Attention, not dynamically adjusted by sequence length

Batch-Invariant RMSNorm Implementation

Key modification: Enforce fixed reduction partitioning.

import torch

# Standard implementation (batch-variant)
def rmsnorm_standard(x, gamma, eps):
    # Reduction partitioning decided by CUDA runtime
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + eps)
    return x / rms * gamma

# Batch-invariant implementation
def rmsnorm_batch_invariant(x, gamma, eps, fixed_block_size=256):
    # Enforce fixed partitioning
    D = x.shape[-1]
    num_blocks = (D + fixed_block_size - 1) // fixed_block_size

    # Block accumulation in fixed order
    sq_sum = 0.0
    for i in range(num_blocks):
        start = i * fixed_block_size
        end = min(start + fixed_block_size, D)
        sq_sum = sq_sum + torch.sum(x[..., start:end] ** 2, dim=-1, keepdim=True)

    rms = torch.sqrt(sq_sum / D + eps)
    return x / rms * gamma

Actual CUDA implementations need to ensure fixed warp-level reduction order, typically through explicit __shfl_down_sync sequences.

Batch-Invariant MatMul Implementation

cuBLAS auto-tuning selects different kernels based on matrix dimensions. Solutions:

  1. Disable auto-tuning: Force use of fixed GEMM kernel
  2. Pad to fixed dimensions: Pad matrices to $2^n$ sizes to ensure consistent tiling
# Use torch.Library to replace default kernel
import batch_invariant_ops

# Automatically replaces all matmul with batch-invariant versions
# Internally uses fixed tile size = 128x128
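A minimal sketch of option 2, padding the row dimension to a fixed multiple before the GEMM (the multiple and the helper name are arbitrary choices, not part of any library API):

import torch

def matmul_fixed_shape(x, W, row_multiple=128):
    # Pad the row (batch*sequence) dimension up to a fixed multiple so the GEMM
    # always sees the same shape class, then slice the padding off again.
    rows = x.shape[0]
    padded = ((rows + row_multiple - 1) // row_multiple) * row_multiple
    if padded != rows:
        pad = torch.zeros(padded - rows, x.shape[1], dtype=x.dtype, device=x.device)
        x = torch.cat([x, pad], dim=0)
    return (x @ W)[:rows]

Padding narrows the range of shapes the GEMM sees, but it does not by itself guarantee a single tiling choice; full invariance still requires fixing the kernel configuration as in option 1.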

Batch-Invariant Attention Implementation

Key: Fixed KV split size.

# FlashAttention batch-invariant configuration (illustrative; the exact argument
# names depend on the attention backend and library version)
flash_attn_func(
    q, k, v,
    deterministic=True,      # enable deterministic mode
    fixed_split_kv=64        # fix the KV split size to 64
)

SGLang’s implementation supports batch-invariant mode for multiple attention backends:

| Backend | Fixed Split Size | Performance Overhead |
|---|---|---|
| FlashInfer | 64 | ~30% |
| FlashAttention 3 | 128 | ~25% |
| Triton | 64 | ~40% |

Multi-GPU Scenarios: Deterministic AllReduce

In Tensor Parallelism, AllReduce operations also introduce nondeterminism. NCCL’s default implementation uses ring-based or tree-based algorithms, where reduction order depends on inter-GPU communication latency.

Deterministic AllReduce

Solution: Use fixed-order reduce-scatter + all-gather:

[Figure: Deterministic AllReduce]

SGLang implements deterministic tensor parallelism, ensuring complete reproducibility in multi-GPU scenarios.
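One simple, bandwidth-heavier way to obtain a fixed reduction order with stock torch.distributed is to all-gather the per-rank tensors and sum them locally in rank order. This is a sketch of the idea, not SGLang's actual implementation:

import torch
import torch.distributed as dist

def deterministic_all_reduce(t):
    # Gather every rank's tensor, then accumulate in fixed rank order so that all
    # ranks perform the identical sequence of floating-point additions.
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)
    out = torch.zeros_like(t)
    for r in range(world_size):
        out = out + gathered[r]
    return out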


Additional Challenges for MoE Models

Mixture-of-Experts (MoE) architectures introduce sources of nondeterminism that do not exist in dense models. Sparse-activation models such as Qwen3-235B-A22B and DeepSeek-V3 therefore face additional challenges for deterministic inference.

Nondeterminism in Token Routing

The core of MoE is the gating network, which decides which experts each token is routed to:

\[G(x) = \text{TopK}(\text{softmax}(W_g \cdot x))\]

The problem: When multiple experts have close gating scores, small numerical perturbations can change the TopK selection result.

[Figure: MoE token routing diagram]

Expert Capacity and Token Dropping

To ensure computational efficiency, MoE typically sets an Expert Capacity: the maximum number of tokens each expert can process. When an expert receives more tokens than its capacity, the excess tokens are dropped and reach the next layer only through the residual connection.

\[\text{Expert Capacity} = \frac{\text{Batch Tokens} \times \text{Capacity Factor}}{\text{Num Experts}}\]
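For example, 8192 batch tokens with a capacity factor of 1.25 and 64 experts give each expert 160 slots; as a one-line helper (names arbitrary):

def expert_capacity(batch_tokens, capacity_factor, num_experts):
    # expert_capacity(8192, 1.25, 64) == 160
    return int(batch_tokens * capacity_factor / num_experts)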

Sources of nondeterminism from token dropping:

  1. Batch composition changes: Different batches have different token distributions; the same token may be processed in some batches and dropped in others
  2. Routing order dependency: When capacity is near saturation, tokens arriving first are processed, later ones are dropped
  3. Load imbalance: Popular experts are more likely to trigger dropping

Training-Inference Inconsistency

Research shows significant differences in routing behavior between training and inference phases of MoE models:

“Even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes.”

This inconsistency is particularly severe in RL training.

Solutions for Deterministic MoE Inference

1. Fixed Routing Threshold

Avoid unstable decisions when gating scores are close:

import torch

def deterministic_topk(scores, k, margin=1e-5):
    # Round scores onto a grid of size `margin` so near-ties collapse to the same
    # value, then break ties by expert index (lower index wins).
    num_experts = scores.shape[-1]
    quantized = torch.round(scores / margin)
    expert_ids = torch.arange(num_experts, device=scores.device)
    # Composite key: the quantized score dominates; subtracting the index breaks
    # exact ties deterministically without reordering clearly separated scores.
    keys = quantized * num_experts - expert_ids
    _, indices = torch.topk(keys, k, dim=-1)
    return indices

2. Disable Token Dropping

Trade computational efficiency for determinism:

# Set large enough capacity factor to prevent dropping
config.moe_capacity_factor = 2.0  # or higher
config.moe_drop_tokens = False

3. Soft MoE

Use soft routing instead of hard routing, where each token is assigned to all experts with weights:

\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]

Soft MoE eliminates discrete routing decisions, but has higher computational overhead.
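A minimal dense-gating sketch of the weighted-sum formula above (the published Soft MoE additionally uses slot-based token mixing; this sketch only illustrates the removal of the discrete TopK decision):

import torch
import torch.nn as nn

class SoftGatedMoE(nn.Module):
    # Every token is processed by every expert; the gate only weights the outputs,
    # so there is no discrete routing decision left to become nondeterministic.
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)        # (tokens, num_experts)
        outputs = torch.stack([e(x) for e in self.experts])  # (num_experts, tokens, d_model)
        return torch.einsum("te,etd->td", weights, outputs)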

Special Impact on RL Training

MoE models face dual challenges in RL training:

  1. Numerical nondeterminism: The batch variance problem described above
  2. Structural nondeterminism: Routing decision instability

Research shows that MoE RL training instability partly stems from routing distribution drift:

“The routing distribution has been identified as a pivotal factor contributing to the instability of MoE RL.”

Experimental data shows that approximately 10% of routers select different experts between training and inference, and 94% of tokens select a different expert in at least one layer.

Routing Replay: R2 and R3

To address training-inference routing inconsistency, researchers proposed the Routing Replay mechanism, with the core idea of replaying inference-phase routing decisions during training.

R2: Vanilla Routing Replay

Definition: Reuse experts selected by the training system during rollout.

Rollout (Training System) → Record routing decisions → Replay during training

R3: Rollout Routing Replay

Definition: Reuse experts selected by the inference system during rollout.

Rollout (Inference System) → Record routing decisions → Replay during training

R3 enforces the constraint: specific experts activated during rollout must be strictly reused during training backpropagation.

[Figure: R2 vs R3 comparison]

Choosing Between R2 and R3

| Scenario | Recommended | Reason |
|---|---|---|
| Small off-policy degree | R2 | R3's cost of modifying the target policy outweighs the benefit |
| Large off-policy degree | R3 | Maintaining the validity of the first-order approximation matters more |

Implementation Details

R3 can be integrated with KV Cache:

# Store the routing mask for each (layer, token-prefix) pair alongside the KV cache.
# `compute_routing` and `apply_routing` are placeholders for the model's gating and
# expert-dispatch functions.
routing_cache = {}

def forward_with_replay(x, layer_idx, prefix_hash):
    key = (layer_idx, prefix_hash)
    if key in routing_cache:
        # Cache hit: reuse the routing decision recorded during rollout.
        mask = routing_cache[key]
    else:
        # Cache miss: compute a fresh routing decision and record it.
        mask = compute_routing(x)
        routing_cache[key] = mask

    # The replayed mask selects the experts, but the softmax gating weights are still
    # computed from the live router so gradients continue to flow to the router.
    return apply_routing(x, mask, allow_grad=True)

Key point: The replayed mask is used for softmax gating computation but does not block gradient flow through standard router weights, ensuring the router remains trainable.

Other Solutions

RSPO (Router-Shift Policy Optimization)

Unlike R2/R3’s hard constraints, RSPO uses a soft adjustment mechanism:

\[w_{rspo} = w_{is} \cdot \text{clip}(\text{router\_shift\_ratio}, 1-\epsilon, 1+\epsilon)\]
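In code, the adjusted weight is just the clipped product; the clipping range below ($\epsilon = 0.2$) is an assumed hyperparameter value:

import torch

def rspo_weight(w_is, router_shift_ratio, eps=0.2):
    # Importance-sampling weight softly rescaled by the clipped router-shift ratio.
    return w_is * torch.clamp(router_shift_ratio, 1.0 - eps, 1.0 + eps)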

Freeze Router

The simplest approach is freezing router parameters, but this harms model adaptability and is generally not recommended.


Performance Analysis

Main overhead sources for batch-invariant kernels:

  1. Abandoning dynamic optimization: Cannot use optimal kernels for specific sizes
  2. Fixed tiling inefficiency: Small matrices using large tiles waste resources
  3. Additional synchronization overhead: Ensuring fixed execution order requires more synchronization points

Benchmarks

| Approach | Throughput (tokens/s) | Relative Overhead |
|---|---|---|
| Standard vLLM | 1000 | baseline |
| TML batch_invariant_ops | 385 | 61.5% |
| SGLang + CUDA Graphs | 657 | 34.3% |

CUDA Graphs eliminate kernel launch overhead by capturing the kernel sequence once and replaying it, which significantly reduces the performance loss of deterministic mode.
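A minimal capture/replay sketch using PyTorch's CUDA Graphs API, where decode_step stands in for a fixed-shape decode function and is hypothetical:

import torch

# Static input buffer; CUDA Graphs require fixed shapes and memory addresses.
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream, then capture the kernel sequence once.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    static_output = decode_step(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input)

# Replay executes exactly the captured kernel sequence with no per-kernel launch cost.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()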

Latency Distribution

Another advantage of deterministic mode is reduced latency variance:

Standard mode:      P50=45ms, P99=120ms, stddev=25ms
Deterministic mode: P50=52ms, P99=58ms,  stddev=3ms

For latency-sensitive applications, lower variance may be more important than lower mean.


Use Cases

Reproducibility in RL Training

In RLHF/RLVR training, policy rollouts need to correspond precisely with training steps. Nondeterministic inference causes:

  1. Debugging difficulty: Cannot reproduce specific failure cases
  2. Training instability: Same checkpoint behaves inconsistently across different machines
  3. Incomparable experiments: Cannot distinguish algorithmic improvements from random fluctuations

SGLang’s collaboration with slime achieved 100% reproducible RL training:

“Taking this deterministic inference capability further, SGLang collaborated with the slime team to unlock 100% reproducible RL training.”

Safety and Compliance

For AI systems requiring auditing, deterministic behavior is a fundamental requirement:

Nondeterminism makes it impossible to trace and explain how a specific output was produced.

Model Debugging

When models produce anomalous outputs, deterministic inference allows:

  1. Precisely reproduce the problem
  2. Layer-by-layer inspection of intermediate states
  3. Binary search to locate the source of issues

LayerCast (NeurIPS 2025)

Yuan et al. proposed an alternative approach: reduce error accumulation by increasing numerical precision.

Experiments show that LayerCast reduced the accuracy fluctuation of DeepSeek-R1-Distill-Qwen-7B from 9% to 2%, though the results are still not fully deterministic.

OpenAI seed Parameter

OpenAI API provides a seed parameter to improve reproducibility:

“If specified, our system will make a best effort to sample deterministically…”

But the official documentation explicitly states determinism is not guaranteed, due to:

  1. Backend model updates
  2. Load balancing to different hardware
  3. System configuration changes

The system_fingerprint field is used to track configuration changes, but cannot guarantee complete determinism with the same fingerprint.
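A typical call with the official Python SDK (model name illustrative):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",          # illustrative model name
    messages=[{"role": "user", "content": "Summarize batch invariance in one sentence."}],
    temperature=0,
    seed=12345,                   # best-effort determinism only
)
# The fingerprint changes when the backend configuration changes; identical
# fingerprints still do not guarantee identical outputs.
print(response.system_fingerprint)
print(response.choices[0].message.content)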


Conclusion

Nondeterminism in LLM inference stems from kernel implementations’ sensitivity to batch size, not GPU concurrency characteristics. By designing batch-invariant kernels, fully deterministic inference can be achieved at the engineering level.

Key techniques for dense models:

  1. Fix RMSNorm reduction partitioning
  2. Fix MatMul tiling configuration
  3. Fix Attention KV split size
  4. Use deterministic AllReduce in multi-GPU scenarios

Additional challenges for MoE models:

  1. Discrete routing decisions in token routing are unstable when gating scores are close
  2. Expert capacity and token dropping introduce batch-dependent nondeterminism
  3. Training-inference routing inconsistency requires Routing Replay (R2/R3) mechanisms

The performance cost is approximately 30-60%, but is a necessary investment for RL training, model debugging, and safety auditing scenarios.


References

  1. Thinking Machines Lab. Defeating Nondeterminism in LLM Inference. 2025.

  2. Yuan, J. et al. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. NeurIPS 2025 (Oral).

  3. LMSYS Org. Towards Deterministic Inference in SGLang and Reproducible RL Training. 2025.

  4. Thinking Machines Lab. batch_invariant_ops. GitHub.

  5. vLLM Documentation. Batch Invariance.

  6. Dao, T. et al. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.

  7. Ma, W. et al. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. 2025.

  8. Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts. arXiv 2025.
