Nondeterminism in LLM Inference: Root Cause Analysis and Batch Invariance Solutions
2025-12-24 · Qi Lu
Problem Statement
In LLM inference services, identical inputs should produce identical outputs. However, empirical observations show that even with greedy decoding (temperature=0), outputs remain nondeterministic. The following experimental data comes from Thinking Machines Lab:
| Model | Samples | Distinct Outputs | Most Frequent Output Count |
|---|---|---|---|
| Qwen3-235B-A22B | 1000 | 80 | 78 |
This phenomenon contradicts the mathematical definition of greedy decoding: $\hat{y}_t = \arg\max_v p(v \mid y_{<t}, x)$ should be deterministic.
The objectives of this article are: (1) identify the root cause of nondeterminism; (2) design engineering-deployable solutions.
Hypothesis Testing: Concurrent Floating-Point vs. Batch Size Variation
Hypothesis 1: Floating-Point Non-Associativity from GPU Concurrency
The mainstream hypothesis suggests that GPU parallel computation introduces nondeterminism. The reasoning chain is:
- Floating-point addition is non-associative: $(a + b) + c \neq a + (b + c)$
- GPU parallel reduction execution order depends on thread scheduling
- Different execution orders produce different accumulation results
- Small differences are amplified by argmax, changing the output token
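The first link in this chain is easy to verify directly; a minimal PyTorch sketch:

```python
import torch

# Floating-point addition is not associative: the same three numbers
# summed in a different order give a different fp32 result.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)
print(((a + b) + c).item())  # 1.0
print((a + (b + c)).item())  # 0.0: c is absorbed by 1e8 before a cancels b
```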
The typical supporting evidence for this hypothesis is the nondeterminism of CUDA’s atomicAdd operation.
Problems with Hypothesis 1
Thinking Machines Lab’s analysis points out that core operations in modern Transformer inference do not rely on atomic operations:
- GEMM: Uses cuBLAS tiled matrix multiplication with fixed reduction tree structure
- LayerNorm/RMSNorm: Standard implementations use deterministic warp-level reduction
- Attention: FlashAttention’s tiled implementation also uses fixed reduction order
Experimental verification: With a single GPU and fixed batch size, multiple inference runs on the same input produce completely identical outputs. This rules out thread-level nondeterminism as the primary source.
Hypothesis 2: Batch Size Variation Causes Numerical Path Divergence
The real problem is: kernel numerical output is a function of batch size.
When inference services use dynamic batching, the same request may be assigned to batches of different sizes at different times. Each kernel’s tiling strategy and reduction partitioning may change with batch size, leading to different numerical results.
Mathematical Definition of Batch Invariance
Let kernel $K$ operate on input tensor $X \in \mathbb{R}^{B \times N \times D}$, producing output $Y = K(X)$.
Batch Invariance requires: for any sample $x_i$ in the batch, its output $y_i$ depends only on $x_i$ itself, independent of other samples in the batch:
\[K(X)[i] = K'(x_i), \quad \forall i \in [1, B]\]
where $K'$ is an equivalent single-sample kernel.
In practice, this property does not hold in standard kernel implementations. The reason lies in the batch dependency of tiling strategies.
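The property can be probed empirically. The following sketch compares a kernel applied to a full batch against the same kernel applied to each sample alone; a nonzero gap means the definition above is violated (whether a gap actually appears depends on hardware, backend, and shapes):

```python
import torch

def batch_invariance_gap(kernel, X):
    """Max |K(X)[i] - K(x_i)| over the batch; a nonzero value means the
    kernel is not batch-invariant on this input."""
    full = kernel(X)
    per_sample = torch.stack([kernel(x.unsqueeze(0)).squeeze(0) for x in X])
    return (full - per_sample).abs().max().item()

# Example: a plain matmul as the kernel; any gap is typically ULP-level.
W = torch.randn(4096, 4096)
X = torch.randn(32, 4096)
print(batch_invariance_gap(lambda t: t @ W, X))
```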
Batch Variance Analysis of Key Kernels
RMSNorm
RMSNorm is computed as:
\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2 + \epsilon}\]
The key step is the reduction $\sum_{i=1}^{D} x_i^2$.
Standard CUDA implementations typically compute this sum with a tree reduction.
The problem: When batch size changes, the CUDA kernel may choose different block size and grid configurations. Different configurations result in different reduction partitioning:
- Batch size = 1: May use 256 threads/block, reduction in 4 rounds
- Batch size = 8: May use 128 threads/block, reduction in 8 rounds
Different partitioning produces different intermediate rounding errors, with final results differing at the ULP (Unit in the Last Place) level.
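The effect can be mimicked on the host side. The sketch below (not a CUDA kernel, just an illustration) accumulates the same sum of squares under two block partitionings and typically shows a ULP-level discrepancy:

```python
import torch

def sum_squares_blocked(x, block_size):
    """Sum of squares accumulated block by block in a fixed order; changing
    block_size changes the addition order and can shift the result by a few ULPs."""
    acc = torch.zeros((), dtype=x.dtype)
    for start in range(0, x.numel(), block_size):
        acc = acc + (x[start:start + block_size] ** 2).sum()
    return acc

x = torch.randn(4096, dtype=torch.float32)
print(sum_squares_blocked(x, 256).item())
print(sum_squares_blocked(x, 128).item())  # often differs in the last bits
```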
Matrix Multiplication
Standard tiled GEMM implementation: $C = AB$, where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$.
cuBLAS selects optimal tiling configuration based on matrix dimensions and GPU architecture. For example:
| Configuration | Tile Size | Reduction Partitioning |
|---|---|---|
| Small matrix | 64×64 | K dimension split into 2 blocks |
| Large matrix | 128×128 | K dimension split into 4 blocks |
When batch size changes, the effective matrix dimension $M’ = B \times M$ changes, triggering different tiling choices, resulting in different K-dimension reduction partitioning.
Mathematical expression: Let the K dimension be split into $P$ blocks, each of size $k_p$:
\[C_{ij} = \sum_{p=1}^{P} \sum_{l \in \text{block}_p} A_{il} B_{lj}\]
The order of floating-point additions depends on $P$ and the block boundaries. Different $P$ values produce different results.
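A host-side sketch of the same effect (a stand-in for split-K scheduling, not cuBLAS itself): the identical product computed with the K dimension reduced in a different number of chunks generally differs at the ULP level:

```python
import torch

def matmul_split_k(A, B, num_splits):
    """C = A @ B with the K dimension reduced in num_splits chunks that are
    combined at the end; the split count changes the order of the additions."""
    K = A.shape[1]
    bounds = [round(i * K / num_splits) for i in range(num_splits + 1)]
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return sum(partials)

A = torch.randn(64, 1024)
B = torch.randn(1024, 64)
print((matmul_split_k(A, B, 2) - matmul_split_k(A, B, 4)).abs().max().item())
```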
Attention
The core of FlashAttention is tiled computation + Online Softmax. The incremental update formulas for Softmax:
\[m^{new} = \max(m^{old}, m^{(j)})\]
\[\ell^{new} = e^{m^{old} - m^{new}} \ell^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)}\]
\[O^{new} = \frac{e^{m^{old} - m^{new}} \ell^{old} \cdot O^{old} + e^{m^{(j)} - m^{new}} \ell^{(j)} \cdot O^{(j)}}{\ell^{new}}\]
The key parameter is KV split size: how many blocks to split the KV cache into for incremental computation.
During decoding, FlashAttention dynamically selects split size based on KV cache length and GPU configuration. Different split sizes lead to different numbers of rescaling operations, accumulating different rounding errors.
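A single-query sketch of block-wise attention with online softmax (using the accumulate-then-normalize variant) shows that the split size alone changes the rounding behaviour:

```python
import math
import torch

def attention_online_softmax(q, K, V, block_size):
    """Single-query attention computed over KV blocks with online softmax;
    the number of KV blocks determines the rescaling sequence."""
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    o = torch.zeros(V.shape[-1])
    scale = 1.0 / math.sqrt(q.shape[-1])
    for start in range(0, K.shape[0], block_size):
        s = (q @ K[start:start + block_size].T) * scale   # block scores
        m_new = torch.maximum(m, s.max())
        p = torch.exp(s - m_new)
        l = torch.exp(m - m_new) * l + p.sum()
        o = torch.exp(m - m_new) * o + p @ V[start:start + block_size]
        m = m_new
    return o / l

q, K, V = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
out_a = attention_online_softmax(q, K, V, block_size=128)
out_b = attention_online_softmax(q, K, V, block_size=256)
print((out_a - out_b).abs().max().item())  # typically nonzero at ULP level
```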
Error Propagation: From ULP to Token Divergence
Individual kernel numerical differences are typically on the order of $10^{-6}$ to $10^{-8}$. How does this lead to token-level divergence?
Error Accumulation Model
Let a Transformer have $L$ layers, with relative error $\epsilon$ per layer. The final logits relative error is approximately:
\[\epsilon_{total} \approx L \cdot \epsilon\]
For $L = 80$ (like Llama-70B) and $\epsilon = 10^{-7}$:
\[\epsilon_{total} \approx 8 \times 10^{-6}\]
Argmax Fragility
When two tokens have close probabilities, small logits differences can flip the argmax:
\[\Delta \ell = \ell_1 - \ell_2\]
If $|\Delta \ell| < \epsilon_{total}$, the result is unstable.
Empirical observations show that in greedy decoding, approximately 1-5% of token positions are in this “fragile” state. Once divergence occurs, subsequent generation is completely different, exhibiting a butterfly effect.
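A concrete illustration of this fragility: two logits separated by roughly the accumulated-error scale, where a perturbation of the same magnitude flips the greedy choice:

```python
import torch

logits = torch.tensor([10.000000, 10.000002])                   # gap ~ 2e-6
print(torch.argmax(logits).item())                               # 1
print(torch.argmax(logits + torch.tensor([1e-5, 0.0])).item())   # 0: flipped
```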
Solution: Batch-Invariant Kernel Design
Design Principles
Core principles for achieving batch invariance:
- Fixed tiling configuration: Use fixed block size and grid configuration regardless of input dimensions
- Fixed reduction order: Ensure reduction tree structure is independent of batch size
- Fixed split size: Use fixed KV split in Attention, not dynamically adjusted by sequence length
Batch-Invariant RMSNorm Implementation
Key modification: Enforce fixed reduction partitioning.
```python
import torch

# Standard implementation (batch-variant):
# the reduction partitioning is decided by the CUDA runtime.
def rmsnorm_standard(x, gamma, eps):
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + eps)
    return x / rms * gamma

# Batch-invariant implementation:
# enforce a fixed block partitioning and a fixed accumulation order.
def rmsnorm_batch_invariant(x, gamma, eps, fixed_block_size=256):
    D = x.shape[-1]
    num_blocks = (D + fixed_block_size - 1) // fixed_block_size
    # Accumulate block sums in a fixed order
    sq_sum = torch.zeros_like(x[..., :1])
    for i in range(num_blocks):
        start = i * fixed_block_size
        end = min(start + fixed_block_size, D)
        sq_sum = sq_sum + torch.sum(x[..., start:end] ** 2, dim=-1, keepdim=True)
    rms = torch.sqrt(sq_sum / D + eps)
    return x / rms * gamma
```
Actual CUDA implementations need to ensure fixed warp-level reduction order, typically through explicit __shfl_down_sync sequences.
Batch-Invariant MatMul Implementation
cuBLAS auto-tuning selects different kernels based on matrix dimensions. Solutions:
- Disable auto-tuning: Force use of fixed GEMM kernel
- Pad to fixed dimensions: Pad matrices to $2^n$ sizes to ensure consistent tiling
```python
# Use torch.Library to replace the default kernels
import batch_invariant_ops
# Automatically replaces all matmuls with batch-invariant versions
# (internally uses a fixed tile size of 128x128)
```
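One way to realize the padding idea from the list above is to pad the row (token) dimension up to a fixed upper bound so the GEMM always sees the same problem shape; the helper name and the bound below are illustrative assumptions, not part of any library:

```python
import torch
import torch.nn.functional as F

def matmul_fixed_shape(A, B, m_max=4096):
    """Pad the batch/token dimension of A up to a fixed m_max so the GEMM
    always sees the same shape (and thus the same tiling), then slice the
    padding back off. Assumes A.shape[0] <= m_max."""
    M = A.shape[0]
    A_padded = F.pad(A, (0, 0, 0, m_max - M))  # zero-pad rows
    return (A_padded @ B)[:M]
```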
Batch-Invariant Attention Implementation
Key: Fixed KV split size.
```python
# FlashAttention batch-invariant configuration
flash_attn_func(
    q, k, v,
    deterministic=True,   # enable deterministic mode
    fixed_split_kv=64,    # fix the KV split to 64
)
```
SGLang’s implementation supports batch-invariant mode for multiple attention backends:
| Backend | Fixed Split Size | Performance Overhead |
|---|---|---|
| FlashInfer | 64 | ~30% |
| FlashAttention 3 | 128 | ~25% |
| Triton | 64 | ~40% |
Multi-GPU Scenarios: Deterministic AllReduce
In Tensor Parallelism, AllReduce operations also introduce nondeterminism. NCCL’s default implementation uses ring-based or tree-based algorithms, where reduction order depends on inter-GPU communication latency.
Deterministic AllReduce
Solution: Use fixed-order reduce-scatter + all-gather:
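A single-process simulation of the idea (not an NCCL or torch.distributed call): every chunk is reduced in ascending rank order, so the summation order never depends on communication timing:

```python
import torch

def deterministic_allreduce_sim(shards):
    """Simulated fixed-order reduce-scatter + all-gather over per-rank
    partial results; every chunk is accumulated in ascending rank order."""
    world = len(shards)
    chunks = [s.chunk(world) for s in shards]        # chunks[rank][c]
    reduced = []
    for c in range(world):                           # reduce-scatter phase
        acc = torch.zeros_like(chunks[0][c])
        for rank in range(world):                    # fixed reduction order
            acc = acc + chunks[rank][c]
        reduced.append(acc)
    full = torch.cat(reduced)                        # all-gather phase
    return [full.clone() for _ in range(world)]
```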
SGLang implements deterministic tensor parallelism, ensuring complete reproducibility in multi-GPU scenarios.
Additional Challenges for MoE Models
Mixture-of-Experts (MoE) architectures introduce sources of nondeterminism that do not exist in dense models. Models such as Qwen3-235B-A22B and DeepSeek-V3, with their sparse activation, face additional challenges for deterministic inference.
Nondeterminism in Token Routing
The core of MoE is the gating network, which decides which experts each token is routed to:
\[G(x) = \text{TopK}(\text{softmax}(W_g \cdot x))\]
The problem: When multiple experts have close gating scores, small numerical perturbations can change the TopK selection result.
Expert Capacity and Token Dropping
To ensure computational efficiency, MoE typically sets Expert Capacity: the maximum number of tokens each expert can process. When an expert receives more tokens than its capacity, excess tokens are dropped, passing directly to the next layer via residual connections.
\[\text{Expert Capacity} = \frac{\text{Batch Tokens} \times \text{Capacity Factor}}{\text{Num Experts}}\]
Sources of nondeterminism from token dropping:
- Batch composition changes: Different batches have different token distributions; the same token may be processed in some batches and dropped in others
- Routing order dependency: When capacity is near saturation, tokens arriving first are processed, later ones are dropped
- Load imbalance: Popular experts are more likely to trigger dropping
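The first two points can be seen in a minimal sketch of greedy, capacity-limited assignment in arrival order (the helper is illustrative, not a specific framework's router):

```python
import torch

def assign_with_capacity(token_expert, num_experts, capacity):
    """Tokens are handled in arrival order and dropped once their chosen
    expert is full, so a token's fate depends on how many earlier tokens
    in the same batch picked the same expert."""
    load = torch.zeros(num_experts, dtype=torch.long)
    kept = []
    for t, e in enumerate(token_expert.tolist()):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
    return kept  # indices of tokens that are actually processed

# The same token (the last one, routed to expert 0) is kept in batch_a
# but dropped in batch_b, purely because of batch composition.
batch_a = torch.tensor([0, 1, 0])
batch_b = torch.tensor([0, 0, 0])
print(assign_with_capacity(batch_a, num_experts=2, capacity=2))  # [0, 1, 2]
print(assign_with_capacity(batch_b, num_experts=2, capacity=2))  # [0, 1]
```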
Training-Inference Inconsistency
Research shows significant differences in routing behavior between training and inference phases of MoE models:
“Even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes.”
This inconsistency is particularly severe in RL training:
- On-policy requirement: RL training requires rollouts to use the same policy as training
- Routing drift: Inference routing decisions may deviate from training distribution
- Reward noise: Nondeterministic routing introduces hard-to-track reward fluctuations
Solutions for Deterministic MoE Inference
1. Fixed Routing Threshold
Avoid unstable decisions when gating scores are close:
```python
import torch

def deterministic_topk(scores, k, margin=1e-5):
    """Top-k with a fixed tie-breaking rule: scores within `margin` are
    treated as tied and ties are resolved toward the smaller expert index,
    so ULP-level perturbations cannot flip the selection."""
    num_experts = scores.shape[-1]
    # Collapse near-ties onto the same bucket, then prefer lower expert indices
    buckets = torch.floor(scores / margin).long()
    idx = torch.arange(num_experts, device=scores.device)
    key = buckets * num_experts - idx          # higher key = higher priority
    _, order = torch.sort(key, dim=-1, descending=True)
    return order[..., :k]
```
2. Disable Token Dropping
Trade computational efficiency for determinism:
```python
# Set a large enough capacity factor to prevent token dropping
config.moe_capacity_factor = 2.0  # or higher
config.moe_drop_tokens = False
```
3. Soft MoE
Use soft routing instead of hard routing, where each token is assigned to all experts with weights:
\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]
Soft MoE eliminates discrete routing decisions, but has higher computational overhead.
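A minimal sketch of this formula (the expert modules and the gate are placeholders, not a specific library's API):

```python
import torch

def soft_moe_forward(x, experts, gate):
    """Soft routing: every token is processed by every expert and the outputs
    are mixed by the gate weights, so there is no discrete TopK decision."""
    weights = torch.softmax(gate(x), dim=-1)                          # (..., E)
    outputs = torch.stack([expert(x) for expert in experts], dim=-1)  # (..., D, E)
    return (outputs * weights.unsqueeze(-2)).sum(dim=-1)              # (..., D)
```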
Special Impact on RL Training
MoE models face dual challenges in RL training:
- Numerical nondeterminism: The batch variance problem described above
- Structural nondeterminism: Routing decision instability
Research shows that MoE RL training instability partly stems from routing distribution drift:
“The routing distribution has been identified as a pivotal factor contributing to the instability of MoE RL.”
Experimental data shows that approximately 10% of routers select different experts between training and inference, and 94% of tokens select a different expert in at least one layer.
Routing Replay: R2 and R3
To address training-inference routing inconsistency, researchers proposed the Routing Replay mechanism, with the core idea of replaying inference-phase routing decisions during training.
R2: Vanilla Routing Replay
Definition: Reuse experts selected by the training system during rollout.
Rollout (Training System) → Record routing decisions → Replay during training
- Pros: Simple implementation, low overhead
- Cons: Effectiveness decreases when off-policy degree is large
R3: Rollout Routing Replay
Definition: Reuse experts selected by the inference system during rollout.
Rollout (Inference System) → Record routing decisions → Replay during training
R3 enforces the constraint: specific experts activated during rollout must be strictly reused during training backpropagation.
Choosing Between R2 and R3
| Scenario | Recommended | Reason |
|---|---|---|
| Small off-policy degree | R2 | R3’s cost of modifying target policy outweighs benefits |
| Large off-policy degree | R3 | Maintaining validity of first-order approximation is more important |
Implementation Details
R3 can be integrated with KV Cache:
```python
# Store the routing mask for each layer and token prefix
routing_cache = {}

def forward_with_replay(x, layer_idx, prefix_hash):
    key = (layer_idx, prefix_hash)
    if key in routing_cache:
        # Cache hit: reuse the routing decision recorded during rollout
        mask = routing_cache[key]
    else:
        # Cache miss: compute a fresh routing decision and record it
        mask = compute_routing(x)
        routing_cache[key] = mask
    # Use the mask for softmax gating, but don't block gradient flow
    return apply_routing(x, mask, allow_grad=True)
```
Key point: The replayed mask is used for softmax gating computation but does not block gradient flow through standard router weights, ensuring the router remains trainable.
Other Solutions
RSPO (Router-Shift Policy Optimization)
Unlike R2/R3’s hard constraints, RSPO uses a soft adjustment mechanism:
\[w_{rspo} = w_{is} \cdot \text{clip}(\text{router\_shift\_ratio}, 1-\epsilon, 1+\epsilon)\]
- Compute router shift ratio between current and old policies
- Quantify routing deviation for each token
- Adaptively reweight updates, limiting excessive updates while maintaining router flexibility
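A sketch of this reweighting under assumed inputs, token log-probabilities and per-token routing probabilities under the new and old policies (names are illustrative, not the paper's API):

```python
import torch

def rspo_weight(logp_new, logp_old, route_p_new, route_p_old, eps=0.2):
    """Importance-sampling ratio scaled by a clipped router-shift ratio,
    limiting updates for tokens whose routing drifted too far while leaving
    the router itself trainable."""
    w_is = torch.exp(logp_new - logp_old)                 # standard IS ratio
    shift = route_p_new / route_p_old.clamp_min(1e-8)     # router shift ratio
    return w_is * shift.clamp(1.0 - eps, 1.0 + eps)
```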
Freeze Router
The simplest approach is freezing router parameters, but this harms model adaptability and is generally not recommended.
Performance Analysis
Main overhead sources for batch-invariant kernels:
- Abandoning dynamic optimization: Cannot use optimal kernels for specific sizes
- Fixed tiling inefficiency: Small matrices using large tiles waste resources
- Additional synchronization overhead: Ensuring fixed execution order requires more synchronization points
Benchmarks
| Approach | Throughput (tokens/s) | Relative Overhead |
|---|---|---|
| Standard vLLM | 1000 | baseline |
| TML batch_invariant_ops | 385 | 61.5% |
| SGLang + CUDA Graphs | 657 | 34.3% |
CUDA Graphs eliminates kernel launch overhead by precompiling execution graphs, significantly reducing performance loss in deterministic mode.
Latency Distribution
Another advantage of deterministic mode is reduced latency variance:
Standard mode: P50=45ms, P99=120ms, stddev=25ms
Deterministic mode: P50=52ms, P99=58ms, stddev=3ms
For latency-sensitive applications, lower variance may be more important than lower mean.
Use Cases
Reproducibility in RL Training
In RLHF/RLVR training, policy rollouts need to correspond precisely with training steps. Nondeterministic inference causes:
- Debugging difficulty: Cannot reproduce specific failure cases
- Training instability: Same checkpoint behaves inconsistently across different machines
- Incomparable experiments: Cannot distinguish algorithmic improvements from random fluctuations
SGLang’s collaboration with slime achieved 100% reproducible RL training:
“Taking this deterministic inference capability further, SGLang collaborated with the slime team to unlock 100% reproducible RL training.”
Safety and Compliance
For AI systems requiring auditing, deterministic behavior is a fundamental requirement:
- Medical diagnosis
- Financial decisions
- Legal analysis
Nondeterminism means inability to trace and explain the origin of specific outputs.
Model Debugging
When models produce anomalous outputs, deterministic inference allows:
- Precisely reproduce the problem
- Layer-by-layer inspection of intermediate states
- Binary search to locate the source of issues
Related Work
LayerCast (NeurIPS 2025)
Yuan et al. proposed an alternative approach: reduce error accumulation by increasing numerical precision.
- Method: Store weights in FP16/BF16, compute in FP32
- Advantage: No kernel modification required
- Limitation: Cannot completely eliminate batch variance, only reduces its impact
Experiments show LayerCast reduced accuracy fluctuation of DeepSeek-R1-Distill-Qwen-7B from 9% to 2%, but still not fully deterministic.
OpenAI seed Parameter
OpenAI API provides a seed parameter to improve reproducibility:
“If specified, our system will make a best effort to sample deterministically…”
But the official documentation explicitly states determinism is not guaranteed, due to:
- Backend model updates
- Load balancing to different hardware
- System configuration changes
The system_fingerprint field is used to track configuration changes, but cannot guarantee complete determinism with the same fingerprint.
Conclusion
Nondeterminism in LLM inference stems from kernel implementations’ sensitivity to batch size, not GPU concurrency characteristics. By designing batch-invariant kernels, fully deterministic inference can be achieved at the engineering level.
Key techniques for dense models:
- Fix RMSNorm reduction partitioning
- Fix MatMul tiling configuration
- Fix Attention KV split size
- Use deterministic AllReduce in multi-GPU scenarios
Additional challenges for MoE models:
- Discrete routing decisions in token routing are unstable when gating scores are close
- Expert capacity and token dropping introduce batch-dependent nondeterminism
- Training-inference routing inconsistency requires Routing Replay (R2/R3) mechanisms
The performance cost is approximately 30-60%, but it is a necessary investment for RL training, model debugging, and safety-auditing scenarios.
References
- Thinking Machines Lab. Defeating Nondeterminism in LLM Inference. 2025.
- Yuan, J. et al. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. NeurIPS 2025 (Oral).
- LMSYS Org. Towards Deterministic Inference in SGLang and Reproducible RL Training. 2025.
- Thinking Machines Lab. batch_invariant_ops. GitHub.
- vLLM Documentation. Batch Invariance.
- Dao, T. et al. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.
- Ma, W. et al. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. 2025.
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts. arXiv 2025.