Nondeterminism in LLM Inference: Root Cause Analysis and Batch Invariance Solutions

2025-12-24 · Qi Lu

Problem Statement

In LLM inference services, identical inputs should produce identical outputs. However, empirical observations show that even with greedy decoding (temperature=0), outputs remain nondeterministic. The following experimental data comes from Thinking Machines Lab:

| Model | Samples | Distinct Outputs | Most Frequent Output Count |
|---|---|---|---|
| Qwen3-235B-A22B | 1000 | 80 | 78 |

This phenomenon contradicts the mathematical definition of greedy decoding: $\hat{y}_t = \arg\max_v p(v \mid y_{<t}, x)$ should be deterministic.

The objectives of this article are: (1) identify the root cause of nondeterminism; (2) design engineering-deployable solutions.


Hypothesis Testing: Concurrent Floating-Point vs. Batch Size Variation

Hypothesis 1: Floating-Point Non-Associativity from GPU Concurrency

The mainstream hypothesis suggests that GPU parallel computation introduces nondeterminism. The reasoning chain is:

  1. Floating-point addition is non-associative: $(a + b) + c \neq a + (b + c)$
  2. GPU parallel reduction execution order depends on thread scheduling
  3. Different execution orders produce different accumulation results
  4. Small differences are amplified by argmax, changing the output token

The typical supporting evidence for this hypothesis is the nondeterminism of CUDA’s atomicAdd operation.
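The non-associativity in step 1 is easy to reproduce without a GPU; the following minimal sketch (sizes chosen arbitrarily) sums the same float32 values in two different orders:

import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)

# Sequential left-to-right accumulation.
seq = torch.zeros((), dtype=torch.float32)
for v in x:
    seq = seq + v

# Two-level (tree-like) reduction: sum 64 chunks of 64, then sum the partial sums.
tree = x.view(64, 64).sum(dim=1).sum()

# The two orderings usually differ in the last few bits of the float32 result.
print(seq.item(), tree.item(), (seq - tree).item())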

Problems with Hypothesis 1

Thinking Machines Lab’s analysis points out that the core operations of modern Transformer inference (GEMM, normalization, attention) do not rely on atomic operations in the forward pass.

Experimental verification: With a single GPU and fixed batch size, multiple inference runs on the same input produce completely identical outputs. This rules out thread-level nondeterminism as the primary source.

Hypothesis 2: Batch Size Variation Causes Numerical Path Divergence

The real cause is that a kernel’s numerical output is a function of batch size.


When inference services use dynamic batching, the same request may be assigned to batches of different sizes at different times. Each kernel’s tiling strategy and reduction partitioning may change with batch size, leading to different numerical results.
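Whether a concrete difference shows up depends on the GPU, library version, and dtype, but the effect can be probed with a short sketch like the following (assumes a CUDA device; the shapes and the filler rows are arbitrary):

import torch

torch.manual_seed(0)
D = 4096
W = torch.randn(D, D, device="cuda", dtype=torch.float16)
x = torch.randn(D, device="cuda", dtype=torch.float16)

# The same vector, processed alone vs. as row 0 of a larger batch.
out_alone = (x[None, :] @ W)[0]
filler = torch.randn(127, D, device="cuda", dtype=torch.float16)
out_in_batch = (torch.cat([x[None, :], filler]) @ W)[0]

# A nonzero maximum difference here means the kernel's result for x depends on
# which batch the vector happened to be placed in.
print((out_alone.float() - out_in_batch.float()).abs().max().item())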


Mathematical Definition of Batch Invariance

Let kernel $K$ operate on input tensor $X \in \mathbb{R}^{B \times N \times D}$, producing output $Y = K(X)$.

Batch Invariance requires: for any sample $x_i$ in the batch, its output $y_i$ depends only on $x_i$ itself, independent of other samples in the batch:

\[K(X)[i] = K'(x_i), \quad \forall i \in [1, B]\]

where $K'$ is an equivalent single-sample kernel.

In practice, this property does not hold in standard kernel implementations. The reason lies in the batch dependency of tiling strategies.
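A direct way to test the property for a concrete kernel is to compare a sample processed alone against the same sample embedded in a larger batch. Below is a minimal checker, assuming the kernel takes a batched tensor as its only argument (the function name and padding size are arbitrary):

import torch

def check_batch_invariance(kernel, x, pad_to=32):
    # Compare kernel(x) for a single sample against the same sample as row 0 of a
    # larger batch; bitwise equality is required for strict batch invariance.
    alone = kernel(x[None, ...])[0]
    filler = torch.randn(pad_to - 1, *x.shape, dtype=x.dtype, device=x.device)
    in_batch = kernel(torch.cat([x[None, ...], filler], dim=0))[0]
    return bool(torch.equal(alone, in_batch))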


Batch Variance Analysis of Key Kernels

RMSNorm

RMSNorm is computed as:

\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2 + \epsilon}\]

The key step is reduction: $\sum_{i=1}^{D} x_i^2$.

Standard CUDA implementations typically use tree reduction:

[Figure: Tree reduction diagram]

The problem: When batch size changes, the CUDA kernel may choose different block size and grid configurations. Different configurations result in different reduction partitioning:

Different partitioning produces different intermediate rounding errors, with final results differing at the ULP (Unit in the Last Place) level.
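The effect of the partitioning alone can be simulated by accumulating the squared sum with different block sizes (the block sizes here are arbitrary):

import torch

torch.manual_seed(0)
x = torch.randn(8192, dtype=torch.float32)

def sum_of_squares(x, block_size):
    # Accumulate block partial sums in a fixed left-to-right order.
    total = torch.zeros((), dtype=torch.float32)
    for start in range(0, x.numel(), block_size):
        total = total + (x[start:start + block_size] ** 2).sum()
    return total

# Different block sizes imply different accumulation orders; the two results may
# differ in the last bits even though both are correct to within rounding.
print(sum_of_squares(x, 128).item(), sum_of_squares(x, 1024).item())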

Matrix Multiplication

Standard tiled GEMM implementation: $C = AB$, where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$.

cuBLAS selects optimal tiling configuration based on matrix dimensions and GPU architecture. For example:

| Configuration | Tile Size | Reduction Partitioning |
|---|---|---|
| Small matrix | 64×64 | K dimension split into 2 blocks |
| Large matrix | 128×128 | K dimension split into 4 blocks |

When batch size changes, the effective matrix dimension $M' = B \times M$ changes, triggering different tiling choices, resulting in different K-dimension reduction partitioning.

Mathematical expression: Let the K dimension be split into $P$ blocks, each of size $k_p$:

\[C_{ij} = \sum_{p=1}^{P} \sum_{l \in \text{block}_p} A_{il} B_{lj}\]

The order of floating-point additions depends on $P$ and block boundaries. Different $P$ values produce different results.
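The split-K effect can be reproduced directly with a simplified sketch (real GEMM kernels accumulate in registers and shared memory, but the ordering argument is the same; shapes are arbitrary and K is assumed divisible by the split count):

import torch

torch.manual_seed(0)
A = torch.randn(64, 4096, dtype=torch.float32)
B = torch.randn(4096, 64, dtype=torch.float32)

def gemm_split_k(A, B, num_splits):
    # Partition the K dimension into num_splits blocks and accumulate the partial
    # products in block order, mimicking a split-K GEMM schedule.
    K = A.shape[1]
    block = K // num_splits
    C = torch.zeros(A.shape[0], B.shape[1], dtype=torch.float32)
    for p in range(num_splits):
        C = C + A[:, p * block:(p + 1) * block] @ B[p * block:(p + 1) * block, :]
    return C

# Changing the number of K splits changes the floating-point accumulation order.
print((gemm_split_k(A, B, 2) - gemm_split_k(A, B, 4)).abs().max().item())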

Attention

The core of FlashAttention is tiled computation + Online Softmax. The incremental update formulas for Softmax:

\[m^{new} = \max(m^{old}, m^{(j)})\]

\[\ell^{new} = e^{m^{old} - m^{new}} \ell^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)}\]

\[O^{new} = \frac{e^{m^{old} - m^{new}} \ell^{old} \cdot O^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)} \cdot O^{(j)}}{\ell^{new}}\]

The key parameter is KV split size: how many blocks to split the KV cache into for incremental computation.

[Figure: Attention split size impact]

During decoding, FlashAttention dynamically selects split size based on KV cache length and GPU configuration. Different split sizes lead to different numbers of rescaling operations, accumulating different rounding errors.
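The dependence on split size can be reproduced with a toy single-query decode implementation of the update rules above (pure PyTorch, no FlashAttention kernels; this sketch defers the division by $\ell$ to the end, as FlashAttention-2 does, but the ordering argument is identical):

import torch

def decode_attention(q, K, V, split_size):
    # Single-query attention with online-softmax updates, processing the KV cache
    # in blocks of `split_size`.
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    o = torch.zeros(V.shape[-1])
    for s in range(0, K.shape[0], split_size):
        scores = K[s:s + split_size] @ q
        m_new = torch.maximum(m, scores.max())
        p = torch.exp(scores - m_new)
        l = torch.exp(m - m_new) * l + p.sum()
        o = torch.exp(m - m_new) * o + p @ V[s:s + split_size]
        m = m_new
    return o / l

torch.manual_seed(0)
q = torch.randn(64)
K = torch.randn(4096, 64)
V = torch.randn(4096, 64)

# Different KV split sizes change the number and order of rescaling steps.
print((decode_attention(q, K, V, 256) - decode_attention(q, K, V, 1024)).abs().max().item())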


Error Propagation: From ULP to Token Divergence

Individual kernel numerical differences are typically on the order of $10^{-6}$ to $10^{-8}$. How does this lead to token-level divergence?

Error Accumulation Model

Let a Transformer have $L$ layers, with relative error $\epsilon$ per layer. The final logits relative error is approximately:

\[\epsilon_{total} \approx L \cdot \epsilon\]

For $L = 80$ (like Llama-70B) and $\epsilon = 10^{-7}$:

\[\epsilon_{total} \approx 8 \times 10^{-6}\]

Argmax Fragility

When two tokens have close probabilities, small logits differences can flip the argmax:

\[\Delta \ell = \ell_1 - \ell_2\]
If $|\Delta \ell| < \epsilon_{total}$, the argmax result is unstable.
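A tiny constructed example of such a flip:

import torch

# Two near-tied logits: a perturbation of 8e-6, on the order of the accumulated
# error estimated above, flips the greedy choice from token 0 to token 1.
logits = torch.tensor([2.000003, 2.000000, -1.0])
perturbed = logits + torch.tensor([0.0, 8e-6, 0.0])

print(torch.argmax(logits).item(), torch.argmax(perturbed).item())  # 0, then 1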

Empirical observations show that in greedy decoding, approximately 1-5% of token positions are in this “fragile” state. Once divergence occurs, subsequent generation is completely different, exhibiting a butterfly effect.

[Figure: Butterfly effect of token divergence]


Solution: Batch-Invariant Kernel Design

Design Principles

Core principles for achieving batch invariance:

  1. Fixed tiling configuration: Use fixed block size and grid configuration regardless of input dimensions
  2. Fixed reduction order: Ensure reduction tree structure is independent of batch size
  3. Fixed split size: Use fixed KV split in Attention, not dynamically adjusted by sequence length

Batch-Invariant RMSNorm Implementation

Key modification: Enforce fixed reduction partitioning.

import torch

# Standard implementation (batch-variant)
def rmsnorm_standard(x, gamma, eps):
    # Reduction partitioning decided by CUDA runtime
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + eps)
    return x / rms * gamma

# Batch-invariant implementation
def rmsnorm_batch_invariant(x, gamma, eps, fixed_block_size=256):
    # Enforce fixed partitioning
    D = x.shape[-1]
    num_blocks = (D + fixed_block_size - 1) // fixed_block_size

    # Block accumulation in fixed order
    sq_sum = 0.0
    for i in range(num_blocks):
        start = i * fixed_block_size
        end = min(start + fixed_block_size, D)
        sq_sum = sq_sum + torch.sum(x[..., start:end] ** 2, dim=-1, keepdim=True)

    rms = torch.sqrt(sq_sum / D + eps)
    return x / rms * gamma

Actual CUDA implementations need to ensure fixed warp-level reduction order, typically through explicit __shfl_down_sync sequences.

Batch-Invariant MatMul Implementation

cuBLAS auto-tuning selects different kernels based on matrix dimensions. Solutions:

  1. Disable auto-tuning: Force use of fixed GEMM kernel
  2. Pad to fixed dimensions: Pad matrices to $2^n$ sizes to ensure consistent tiling
# Use torch.Library to replace default kernel
import batch_invariant_ops

# Automatically replaces all matmul with batch-invariant versions
# Internally uses fixed tile size = 128x128
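A minimal sketch of option 2, padding the row dimension to a fixed multiple before the GEMM (the multiple and the helper name are arbitrary choices, not part of any library API):

import torch

def matmul_fixed_shape(x, W, row_multiple=128):
    # Pad the row (batch*sequence) dimension up to a fixed multiple so the GEMM
    # always sees the same shape class, then slice the padding off again.
    rows = x.shape[0]
    padded = ((rows + row_multiple - 1) // row_multiple) * row_multiple
    if padded != rows:
        pad = torch.zeros(padded - rows, x.shape[1], dtype=x.dtype, device=x.device)
        x = torch.cat([x, pad], dim=0)
    return (x @ W)[:rows]

Padding narrows the range of shapes the GEMM sees, but it does not by itself guarantee a single tiling choice; full invariance still requires fixing the kernel configuration as in option 1.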

Batch-Invariant Attention Implementation

Key: Fixed KV split size.

# FlashAttention batch-invariant configuration (illustrative; the exact argument
# names depend on the attention backend and library version)
flash_attn_func(
    q, k, v,
    deterministic=True,      # enable deterministic mode
    fixed_split_kv=64        # fix the KV split size to 64
)

SGLang’s implementation supports batch-invariant mode for multiple attention backends:

| Backend | Fixed Split Size | Performance Overhead |
|---|---|---|
| FlashInfer | 64 | ~30% |
| FlashAttention 3 | 128 | ~25% |
| Triton | 64 | ~40% |

Multi-GPU Scenarios: Deterministic AllReduce

In Tensor Parallelism, AllReduce operations also introduce nondeterminism. NCCL’s default implementation uses ring-based or tree-based algorithms, where reduction order depends on inter-GPU communication latency.

Deterministic AllReduce

Solution: Use fixed-order reduce-scatter + all-gather:

[Figure: Deterministic AllReduce]

SGLang implements deterministic tensor parallelism, ensuring complete reproducibility in multi-GPU scenarios.
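One simple, bandwidth-heavier way to obtain a fixed reduction order with stock torch.distributed is to all-gather the per-rank tensors and sum them locally in rank order. This is a sketch of the idea, not SGLang's actual implementation:

import torch
import torch.distributed as dist

def deterministic_all_reduce(t):
    # Gather every rank's tensor, then accumulate in fixed rank order so that all
    # ranks perform the identical sequence of floating-point additions.
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)
    out = torch.zeros_like(t)
    for r in range(world_size):
        out = out + gathered[r]
    return out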


Additional Challenges for MoE Models

Mixture-of-Experts (MoE) architectures introduce sources of nondeterminism that do not exist in dense models. Sparse-activation models such as Qwen3-235B-A22B and DeepSeek-V3 therefore face additional challenges for deterministic inference.

Nondeterminism in Token Routing

The core of MoE is the gating network, which decides which experts each token is routed to:

\[G(x) = \text{TopK}(\text{softmax}(W_g \cdot x))\]

The problem: When multiple experts have close gating scores, small numerical perturbations can change the TopK selection result.

[Figure: MoE token routing diagram]

Expert Capacity and Token Dropping

To ensure computational efficiency, MoE typically sets an Expert Capacity: the maximum number of tokens each expert can process. When an expert receives more tokens than its capacity, the excess tokens are dropped and reach the next layer only through the residual connection.

\[\text{Expert Capacity} = \frac{\text{Batch Tokens} \times \text{Capacity Factor}}{\text{Num Experts}}\]
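For example, 8192 batch tokens with a capacity factor of 1.25 and 64 experts give each expert 160 slots; as a one-line helper (names arbitrary):

def expert_capacity(batch_tokens, capacity_factor, num_experts):
    # expert_capacity(8192, 1.25, 64) == 160
    return int(batch_tokens * capacity_factor / num_experts)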

Sources of nondeterminism from token dropping:

  1. Batch composition changes: Different batches have different token distributions; the same token may be processed in some batches and dropped in others
  2. Routing order dependency: When capacity is near saturation, tokens arriving first are processed, later ones are dropped
  3. Load imbalance: Popular experts are more likely to trigger dropping

Training-Inference Inconsistency

Research shows significant differences in routing behavior between training and inference phases of MoE models:

“Even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes.”

This inconsistency is particularly severe in RL training.

Solutions for Deterministic MoE Inference

1. Fixed Routing Threshold

Avoid unstable decisions when gating scores are close:

import torch

def deterministic_topk(scores, k, margin=1e-5):
    # Round scores onto a grid of size `margin` so near-ties collapse to the same
    # value, then break ties by expert index (lower index wins).
    num_experts = scores.shape[-1]
    quantized = torch.round(scores / margin)
    expert_ids = torch.arange(num_experts, device=scores.device)
    # Composite key: the quantized score dominates; subtracting the index breaks
    # exact ties deterministically without reordering clearly separated scores.
    keys = quantized * num_experts - expert_ids
    _, indices = torch.topk(keys, k, dim=-1)
    return indices

2. Disable Token Dropping

Trade computational efficiency for determinism:

# Set large enough capacity factor to prevent dropping
config.moe_capacity_factor = 2.0  # or higher
config.moe_drop_tokens = False

3. Soft MoE

Use soft routing instead of hard routing, where each token is assigned to all experts with weights:

\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]

Soft MoE eliminates discrete routing decisions, but has higher computational overhead.
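A minimal dense-gating sketch of the weighted-sum formula above (the published Soft MoE additionally uses slot-based token mixing; this sketch only illustrates the removal of the discrete TopK decision):

import torch
import torch.nn as nn

class SoftGatedMoE(nn.Module):
    # Every token is processed by every expert; the gate only weights the outputs,
    # so there is no discrete routing decision left to become nondeterministic.
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)        # (tokens, num_experts)
        outputs = torch.stack([e(x) for e in self.experts])  # (num_experts, tokens, d_model)
        return torch.einsum("te,etd->td", weights, outputs)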

Special Impact on RL Training

MoE models face dual challenges in RL training:

  1. Numerical nondeterminism: The batch variance problem described above
  2. Structural nondeterminism: Routing decision instability

Research shows that MoE RL training instability partly stems from routing distribution drift:

“The routing distribution has been identified as a pivotal factor contributing to the instability of MoE RL.”

Experimental data shows that approximately 10% of routers select different experts between training and inference, and 94% of tokens select a different expert in at least one layer.

Routing Replay: R2 and R3

To address training-inference routing inconsistency, researchers proposed the Routing Replay mechanism, with the core idea of replaying inference-phase routing decisions during training.

R2: Vanilla Routing Replay

Definition: Reuse experts selected by the training system during rollout.

Rollout (Training System) → Record routing decisions → Replay during training

R3: Rollout Routing Replay

Definition: Reuse experts selected by the inference system during rollout.

Rollout (Inference System) → Record routing decisions → Replay during training

R3 enforces the constraint: specific experts activated during rollout must be strictly reused during training backpropagation.

[Figure: R2 vs R3 comparison]

Choosing Between R2 and R3

| Scenario | Recommended | Reason |
|---|---|---|
| Small off-policy degree | R2 | R3's cost of modifying the target policy outweighs the benefit |
| Large off-policy degree | R3 | Maintaining the validity of the first-order approximation matters more |

Implementation Details

R3 can be integrated with KV Cache:

# Store the routing mask for each (layer, token-prefix) pair alongside the KV cache.
# `compute_routing` and `apply_routing` are placeholders for the model's gating and
# expert-dispatch functions.
routing_cache = {}

def forward_with_replay(x, layer_idx, prefix_hash):
    key = (layer_idx, prefix_hash)
    if key in routing_cache:
        # Cache hit: reuse the routing decision recorded during rollout.
        mask = routing_cache[key]
    else:
        # Cache miss: compute a fresh routing decision and record it.
        mask = compute_routing(x)
        routing_cache[key] = mask

    # The replayed mask selects the experts, but the softmax gating weights are still
    # computed from the live router so gradients continue to flow to the router.
    return apply_routing(x, mask, allow_grad=True)

Key point: The replayed mask is used for softmax gating computation but does not block gradient flow through standard router weights, ensuring the router remains trainable.

Other Solutions

RSPO (Router-Shift Policy Optimization)

Unlike R2/R3’s hard constraints, RSPO uses a soft adjustment mechanism:

\[w_{rspo} = w_{is} \cdot \text{clip}(\text{router\_shift\_ratio}, 1-\epsilon, 1+\epsilon)\]
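In code, the adjusted weight is just the clipped product; the clipping range below ($\epsilon = 0.2$) is an assumed hyperparameter value:

import torch

def rspo_weight(w_is, router_shift_ratio, eps=0.2):
    # Importance-sampling weight softly rescaled by the clipped router-shift ratio.
    return w_is * torch.clamp(router_shift_ratio, 1.0 - eps, 1.0 + eps)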

Freeze Router

The simplest approach is freezing router parameters, but this harms model adaptability and is generally not recommended.


Performance Analysis

Main overhead sources for batch-invariant kernels:

  1. Abandoning dynamic optimization: Cannot use optimal kernels for specific sizes
  2. Fixed tiling inefficiency: Small matrices using large tiles waste resources
  3. Additional synchronization overhead: Ensuring fixed execution order requires more synchronization points

Benchmarks

| Approach | Throughput (tokens/s) | Relative Overhead |
|---|---|---|
| Standard vLLM | 1000 | baseline |
| TML batch_invariant_ops | 385 | 61.5% |
| SGLang + CUDA Graphs | 657 | 34.3% |

CUDA Graphs eliminate kernel launch overhead by capturing the kernel sequence once and replaying it, which significantly reduces the performance loss of deterministic mode.
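A minimal capture/replay sketch using PyTorch's CUDA Graphs API, where decode_step stands in for a fixed-shape decode function and is hypothetical:

import torch

# Static input buffer; CUDA Graphs require fixed shapes and memory addresses.
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream, then capture the kernel sequence once.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    static_output = decode_step(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input)

# Replay executes exactly the captured kernel sequence with no per-kernel launch cost.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()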

Latency Distribution

Another advantage of deterministic mode is reduced latency variance:

Standard mode:      P50=45ms, P99=120ms, stddev=25ms
Deterministic mode: P50=52ms, P99=58ms,  stddev=3ms

For latency-sensitive applications, lower variance may be more important than lower mean.


Use Cases

Reproducibility in RL Training

In RLHF/RLVR training, policy rollouts need to correspond precisely with training steps. Nondeterministic inference causes:

  1. Debugging difficulty: Cannot reproduce specific failure cases
  2. Training instability: Same checkpoint behaves inconsistently across different machines
  3. Incomparable experiments: Cannot distinguish algorithmic improvements from random fluctuations

SGLang’s collaboration with slime achieved 100% reproducible RL training:

“Taking this deterministic inference capability further, SGLang collaborated with the slime team to unlock 100% reproducible RL training.”

Safety and Compliance

For AI systems requiring auditing, deterministic behavior is a fundamental requirement:

Nondeterminism makes it impossible to trace and explain how a specific output was produced.

Model Debugging

When models produce anomalous outputs, deterministic inference allows:

  1. Precisely reproduce the problem
  2. Layer-by-layer inspection of intermediate states
  3. Binary search to locate the source of issues

LayerCast (NeurIPS 2025)

Yuan et al. proposed an alternative approach: reduce error accumulation by increasing numerical precision.

Experiments show that LayerCast reduced the accuracy fluctuation of DeepSeek-R1-Distill-Qwen-7B from 9% to 2%, though the results are still not fully deterministic.

OpenAI seed Parameter

OpenAI API provides a seed parameter to improve reproducibility:

“If specified, our system will make a best effort to sample deterministically…”

But the official documentation explicitly states determinism is not guaranteed, due to:

  1. Backend model updates
  2. Load balancing to different hardware
  3. System configuration changes

The system_fingerprint field is used to track configuration changes, but cannot guarantee complete determinism with the same fingerprint.
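A typical call with the official Python SDK (model name illustrative):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",          # illustrative model name
    messages=[{"role": "user", "content": "Summarize batch invariance in one sentence."}],
    temperature=0,
    seed=12345,                   # best-effort determinism only
)
# The fingerprint changes when the backend configuration changes; identical
# fingerprints still do not guarantee identical outputs.
print(response.system_fingerprint)
print(response.choices[0].message.content)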


Conclusion

Nondeterminism in LLM inference stems from kernel implementations’ sensitivity to batch size, not GPU concurrency characteristics. By designing batch-invariant kernels, fully deterministic inference can be achieved at the engineering level.

Key techniques for dense models:

  1. Fix RMSNorm reduction partitioning
  2. Fix MatMul tiling configuration
  3. Fix Attention KV split size
  4. Use deterministic AllReduce in multi-GPU scenarios

Additional challenges for MoE models:

  1. Discrete routing decisions in token routing are unstable when gating scores are close
  2. Expert capacity and token dropping introduce batch-dependent nondeterminism
  3. Training-inference routing inconsistency requires Routing Replay (R2/R3) mechanisms

The performance cost is approximately 30-60%, but is a necessary investment for RL training, model debugging, and safety auditing scenarios.


References

  1. Thinking Machines Lab. Defeating Nondeterminism in LLM Inference. 2025.

  2. Yuan, J. et al. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. NeurIPS 2025 (Oral).

  3. LMSYS Org. Towards Deterministic Inference in SGLang and Reproducible RL Training. 2025.

  4. Thinking Machines Lab. batch_invariant_ops. GitHub.

  5. vLLM Documentation. Batch Invariance.

  6. Dao, T. et al. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.

  7. Ma, W. et al. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. 2025.

  8. Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts. arXiv 2025.
