Over the past few years, large language models (LLMs) have emerged as the foundation of modern AI systems, powering chatbots, assistants, copilots, and content generators. But while these models appear increasingly intelligent, a foundational flaw continues to undermine their reliability: nondeterminism during inference.
Even when configured for maximum determinism (e.g., temperature set to 0), LLMs often return different outputs for the same prompt when run repeatedly. In scientific contexts, this kind of behavioral instability would be unacceptable. Yet the field has long tolerated it as an inevitable byproduct of randomness or floating-point precision.
Enter Thinking Machines Lab, the new research organization founded by former OpenAI CTO Mira Murati. In a recent technical post titled “Defeating Nondeterminism in LLM Inference”, the lab lays out a groundbreaking explanation for this phenomenon: batch sensitivity in kernel-level inference pipelines. Their findings offer not only a diagnosis but also a remedy: an inference engine that returns the same output for the same prompt no matter how requests happen to be batched together on the server.
This work has far-reaching implications across research reproducibility, AI safety, RL training pipelines, and enterprise deployments where consistency is paramount.
Cracking the Core Question: Why Don’t LLMs Always Say the Same Thing?
To understand the significance of Thinking Machines’ work, consider the typical expectation: given a prompt and deterministic settings (like temperature = 0), the model should always produce the same output. After all, it’s a deterministic neural network inference pass.
But that doesn’t happen. AI developers regularly observe this inconsistency across platforms, APIs, and hardware configurations, even when prompts are fixed, context windows are static, and random seeds are pinned.
Until now, the blame was scattered across a range of suspected culprits:
- Floating-point rounding: floating-point arithmetic on GPUs has finite precision, and addition is not associative.
- Concurrent kernel execution: parallel work on CUDA (for example, atomic adds) can complete in a different order on every run.
- Sampling randomness: even when supposedly “turned off” at zero temperature.
These explanations were incomplete. Murati’s team proposed a more cohesive theory: outputs depend on the batch size and grouping used during inference, because the underlying kernels are not batch-invariant; and since batch composition shifts with server load, the “same” request rarely runs under identical conditions twice.
What Is Batch Invariance, and Why Does It Matter?
In machine learning, especially during inference, multiple prompts are often grouped into batches for efficiency. Those batches then flow together through matrix multiplications, attention, normalization layers, and more. Batching keeps the GPU’s vectorized hardware busy and yields significant speedups.
However, as Thinking Machines uncovered, the way kernels handle batched data isn’t always neutral. Even in deterministic settings, altering the composition of the batch (for instance, running a prompt alongside another rather than on its own) can subtly shift several things, as the short check after this list demonstrates:
- Memory access patterns
- Thread scheduling
- Execution order of reductions
- Floating-point summation orders
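A quick way to see this non-invariance for yourself is to multiply one row by a matrix twice: once on its own, and once as part of a larger batch, then compare. The sketch below is plain PyTorch, not the lab’s code; on many GPU stacks the difference is tiny but nonzero, whereas a batch-invariant kernel would make it exactly zero.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2048, 2048, device=device)
B = torch.randn(2048, 2048, device=device)

row_alone    = torch.mm(A[:1], B)   # the first row, processed as a batch of 1
row_in_batch = torch.mm(A, B)[:1]   # the same row, processed inside a batch of 2048

# With batch-invariant kernels this is exactly 0.0; with stock kernels it is
# often a small nonzero value (results vary by backend and hardware).
print((row_alone - row_in_batch).abs().max().item())
```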
The result? Even a supposedly “isolated” prompt can experience slight numerical changes when processed alongside others. These differences are minute at first—affecting perhaps the 5th or 6th decimal place in a floating-point vector—but propagate through deep transformer layers, ultimately changing the final output token selection.
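Why does drift that small matter? Because greedy (temperature-0) decoding picks the single highest-scoring token, and a near-tie between two candidates can be flipped by a perturbation far below anything a human would notice. The numbers below are invented purely for illustration:

```python
import torch

# Two near-tied candidate logits, as can happen at any step of generation.
logits = torch.tensor([12.000001, 12.000000, 3.5], dtype=torch.float64)
drift  = torch.tensor([0.0, 2e-6, 0.0], dtype=torch.float64)  # ~6th-decimal wobble

print(torch.argmax(logits).item())          # -> 0
print(torch.argmax(logits + drift).item())  # -> 1: a different token is emitted,
                                            #    and every later token diverges too
```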
Murati’s lab refers to this phenomenon as batch-sensitive nondeterminism.
The Empirical Test: 1 Prompt, 1,000 Runs, 80 Distinct Outputs
To quantify the impact, Thinking Machines ran a rigorous experiment using the Qwen-3-8B model, an open-source LLM with strong performance. They submitted the same prompt 1,000 times under identical inference settings (temperature = 0, deterministic decoding, no sampling).
The only thing that varied was the batch context: sometimes the prompt was run alone, sometimes with filler inputs.
In total, they recorded 80 distinct outputs: not merely rephrasings, but structurally different completions with different reasoning paths, token sequences, and factual content. That is a remarkable amount of divergence for a decoding mode that is, on paper, fully deterministic.
The culprit? The way CUDA kernels, attention layers, and normalization functions executed under different batch sizes.
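The post reports the experiment rather than shipping a test harness, but the measurement itself is easy to reproduce against any endpoint you control. A minimal sketch, assuming a hypothetical generate(prompt) function that wraps whatever greedy-decoding client you use:

```python
from collections import Counter

def count_unique_completions(generate, prompt, runs=1000):
    """Send the same prompt `runs` times under greedy (temperature-0) decoding
    and count the distinct completions. A fully deterministic stack returns 1."""
    completions = Counter(generate(prompt) for _ in range(runs))
    return len(completions), completions.most_common(3)

# Example usage (generate is assumed, not defined here):
# unique, top3 = count_unique_completions(generate, "Explain batch invariance.")
# print(unique)
```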
The Engineering Solution: Batch-Invariant Kernels
To address this, Thinking Machines engineered a modified inference engine with batch-invariant behavior at every critical step:
1. Deterministic Matrix Multiplication
They rewrote matrix multiplication primitives so that reduction operations (e.g., summing along the inner dimension) are performed in a fixed order, regardless of how many prompts are being processed. This sidesteps the classic floating-point trap: addition is not associative in finite precision, so (a + b) + c and a + (b + c) can differ, and a kernel that reorders its sums as the batch grows will change its answer.
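The post does not reproduce these kernels inline, but the principle can be sketched in a few lines: choose one reduction order for the inner dimension and use it no matter how many rows arrive. The function below is a readable stand-in for what a real GPU kernel does with tiling, not the lab’s implementation:

```python
import torch

def matmul_fixed_reduction(A, B, block_k=256):
    """Multiply A (M x K) by B (K x N), accumulating over K in fixed-size blocks,
    left to right. The reduction order never depends on M (the number of rows
    in the batch), which is the batch-invariance property described above."""
    M, K = A.shape
    acc = torch.zeros(M, B.shape[1], dtype=torch.float32)
    for k0 in range(0, K, block_k):  # one fixed, data-independent order of partial sums
        acc += A[:, k0:k0 + block_k].float() @ B[k0:k0 + block_k, :].float()
    return acc
```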
2. Stable RMS Normalization
RMSNorm is widely used in modern transformer architectures. Their implementation guaranteed that scaling and normalization steps would produce the same output whether the prompt was processed alone or in a batch.
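As a rough sketch of the idea (not the lab’s kernel), a batch-invariant RMSNorm reduces over each row independently and keeps a single reduction strategy whether one row or thousands are present:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over the last dimension: each row is scaled by the root mean square
    of its own elements only, so adding neighbors to the batch cannot change it,
    provided the underlying reduction uses one fixed strategy for every batch size."""
    inv_rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return (x.float() * inv_rms * weight.float()).to(x.dtype)
```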
3. Softmax with Fixed Pathways
Softmax is notoriously sensitive to numerical drift. Thinking Machines implemented a version that performs the same max-subtraction, exponentiation, and normalization steps for a given row regardless of the surrounding inputs.
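A minimal sketch of that stabilization pattern (again, not the lab’s kernel) looks like this; the batch-invariant part comes down to computing each row’s max and sum the same way no matter how many rows share the launch:

```python
import torch

def stable_softmax(logits):
    """Row-wise softmax with max-subtraction so exp() cannot overflow. Each row's
    result depends only on that row's values; batch invariance additionally
    requires the per-row reductions to follow one fixed path at any batch size."""
    shifted = logits.float() - logits.float().max(dim=-1, keepdim=True).values
    exp = shifted.exp()
    return (exp / exp.sum(dim=-1, keepdim=True)).to(logits.dtype)
```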
4. Prompt Isolation Within Kernel Execution
They restructured the computation graph to isolate prompt executions, even when batched, ensuring no cross-contamination or variation due to neighboring inputs.
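Taken literally, prompt isolation means every sequence follows the same per-sequence code path and reads only its own tensors, even when it shares a batch. The toy sketch below spells that out with a Python loop; a production engine preserves the same invariant inside fused kernels without giving up the speed of batching. Here step_fn is a hypothetical per-sequence forward pass, not an API from the post:

```python
import torch

def run_isolated(step_fn, batch):
    """Run each sequence through an identical single-sequence path, then restack.
    Illustrative only: it shows the invariant (no cross-talk between prompts),
    not how a fused GPU kernel keeps that invariant while staying batched."""
    outputs = [step_fn(seq.unsqueeze(0)) for seq in batch]  # same path for every prompt
    return torch.cat(outputs, dim=0)
```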
The Results: Absolute Inference Reproducibility
After implementing their batch-invariant pipeline, Thinking Machines re-ran the same 1,000-prompt experiment. The outcome was striking: all 1,000 outputs were identical.
In effect, they had achieved complete reproducibility under conditions previously thought to be non-deterministic.
This is not just a theoretical improvement. For developers working in safety-critical environments—such as medical diagnosis tools, legal AI systems, or autonomous control—this means inference is now testable and auditable.
Trade-offs and Performance Impacts
Of course, batch invariance doesn’t come free.
Thinking Machines reported a modest performance penalty when using their deterministic kernels—particularly on GPU-optimized matrix operations where unordered reductions are faster. However, in most cases, this overhead was deemed acceptable for applications requiring trust and reproducibility.
They also acknowledged limitations:
- Some third-party libraries still inject nondeterminism at lower levels.
- Complete cross-platform determinism (e.g., CPU vs. GPU) remains difficult.
- Vendor-level kernel implementations (e.g., cuBLAS) still limit full control.
Nevertheless, the architectural changes provide a blueprint for deterministic inference that is practical today.
Why This Research Matters Now
The implications of this work are both deep and wide-ranging.
Scientific Reproducibility
As LLMs are used in computational science, hypothesis generation, and simulation, reproducibility is non-negotiable. Thinking Machines’ approach ensures that models used in research contexts behave like any other testable software.
Reinforcement Learning Stability
In reinforcement learning from human feedback (RLHF), small numerical mismatches between the policy that samples responses and the policy being trained make nominally on-policy training subtly off-policy, which destabilizes reward signals. A deterministic inference engine removes this source of variance and allows more stable training.
Legal and Regulatory Compliance
As regulatory scrutiny around AI increases, deterministic behavior may soon become a compliance requirement for systems making decisions in finance, healthcare, and law. Determinism supports audit trails, traceability, and robust version control.
Debugging and Testing
Most importantly for engineers, reproducible outputs are the key to debugging. If your AI behaves differently every time, you cannot isolate bugs, regressions, or failures. Thinking Machines offers a solution to this fundamental engineering challenge.
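Once outputs are reproducible, ordinary software practice applies: a regression test can simply assert that repeated greedy runs match. A sketch, with generate standing in for whatever inference client you deploy:

```python
def test_inference_is_reproducible(generate, prompt="What is 2 + 2?", runs=25):
    """Fails loudly if any greedy (temperature-0) run of the same prompt differs
    from the first. Only meaningful once the serving stack is deterministic."""
    reference = generate(prompt)
    for _ in range(runs - 1):
        assert generate(prompt) == reference, "non-deterministic inference detected"
```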
A Paradigm Shift in LLM Infrastructure
What Mira Murati and her team have proposed is not just a patch—it is a shift in the underlying expectations of inference. Until now, LLMs have been treated more like stochastic black boxes. With batch-invariant inference, they can now behave like reliable software components.
This reframing forces the AI industry to reconsider where determinism belongs. In creative applications, randomness may be desirable. But in critical systems, predictability is not optional.
Final Thoughts
Mira Murati’s return to research with Thinking Machines Lab is already producing tangible breakthroughs. By exposing how deeply batch size affects LLM inference, and offering a concrete solution to eliminate that instability, her team has illuminated one of the most underappreciated flaws in current AI infrastructure.
For developers, scientists, regulators, and users alike, this work lays the foundation for a future where AI systems are not just powerful—but also predictable, repeatable, and trustworthy.
Thinking Machines isn’t just pushing the state of the art forward. They’re making it stable.