The Illusion of Stochasticity: Why LLMs Can’t Actually "Pick a Random Number"
Abstract

This paper investigates the "illusion of stochasticity" in Large Language Models (LLMs), specifically their inability to reliably sample from probability distributions (e.g., Uniform, Gaussian) required for agentic tasks. Using frontier models like Gemini and Qwen3, the authors reveal that LLMs suffer from deep-seated distributional and positional biases when prompted to act randomly.

Executive Summary

TL;DR: New research from Google DeepMind and NUS reveals that frontier LLMs (Gemini, Qwen3) are mathematically "stochastic-blind." Even when a model perfectly describes a Gaussian distribution in its reasoning trace, it fails to sample from it, instead falling back on training data biases or positional preferences.

Background: This work positions itself as a critical diagnostic of the Agentic LLM paradigm. It argues that the "knowing-doing gap"—where models understand instructions but fail to execute them—is partly caused by a fundamental inability to simulate randomness.


1. The "C" Bias and the Knowing-Doing Gap

When you ask an LLM to generate a randomized multiple-choice test, it doesn't flip a coin. It has a "favorite" answer. As shown in the study, models like Gemini exhibit a massive bias toward placing the correct answer in position "C".

Figure 1: LLMs are heavily biased towards “C” rather than sampling uniformly

This isn't just a quirk; it’s a failure of reliable sampling. In an adversarial environment, an agent that is predictable is exploitable.
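Detecting this kind of positional bias is a standard chi-square goodness-of-fit test against a uniform null. A minimal sketch, where the answer letters are hypothetical stand-ins for model output:

```python
from collections import Counter

from scipy.stats import chisquare

# Hypothetical sample of where a model placed the "correct" answer when
# asked to randomize a 4-option multiple-choice test.
choices = list("CCACCBCDCCCACCBC")

counts = Counter(choices)
observed = [counts.get(letter, 0) for letter in "ABCD"]

# Chi-square goodness-of-fit against the uniform null (expected 25% each).
stat, p_value = chisquare(observed)
print(f"observed counts: {observed}, p = {p_value:.4f}")

# A tiny p-value rejects uniformity: the "random" placement is biased.
biased = p_value < 0.05
```

With a heavy "C" preference like the one sketched here, the test rejects uniformity even on a handful of samples.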


2. Why Can't LLMs Just Sample?

The study tested models on simple distributions: Uniform (Discrete/Continuous) and Gaussian.

The Methodology

The authors used Goodness-of-Fit (GoF) tests (Chi-Square and Kolmogorov-Smirnov). A p-value above 0.05 would mean the samples are statistically consistent with the target distribution, i.e., "real" randomness. The result? Almost every independent sampling test returned a p-value of 0.00, decisively rejecting the target distribution.

Figure 2: Empirical distribution estimates showing failure to follow target distributions
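The GoF protocol itself is easy to reproduce with scipy. In this sketch a genuinely random NumPy sample stands in for model output, so both tests will typically return large p-values; biased model samples would drive them toward 0.00:

```python
import numpy as np
from scipy.stats import chisquare, kstest

rng = np.random.default_rng(0)

# Stand-ins for model outputs drawn from the two target distributions.
discrete = rng.integers(0, 10, size=1000)           # target: Uniform{0..9}
continuous = rng.normal(loc=0.0, scale=1.0, size=1000)  # target: N(0, 1)

# Chi-square GoF for the discrete case (expected: 100 per bucket).
observed = np.bincount(discrete, minlength=10)
chi_stat, chi_p = chisquare(observed)

# Kolmogorov-Smirnov GoF for the continuous case.
ks_stat, ks_p = kstest(continuous, "norm")

# p > 0.05: fail to reject the target distribution ("looks random").
print(f"chi2 p = {chi_p:.3f}, KS p = {ks_p:.3f}")
```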

The Insights:

  1. Distributional Bias: Models love the numbers 7 and 42.
  2. Positional Bias: The order of options in the prompt changes the "random" choice.
  3. Temperature is not a Cure: Even at high temperatures (T=2.0), the fundamental bias toward specific values remains; the model just gets worse at following formatting instructions.

3. The Sequential Sampling Trap

Can we fix this by letting the model see what it picked before?

  • Sequential with All History: Improves uniformity but creates "Repulsive Bias"—the model tries too hard not to repeat itself, leading to negative auto-correlation.
  • Batched Sampling: Leads to "Periodic Patterns". The model might generate "0, 1, 2... 9" and then repeat that sequence, which is the opposite of true randomness.

Figure 5: Auto-correlation showing repulsive and periodic biases in sequential/batched sampling
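Both failure modes show up directly in lag autocorrelation. The sequences below are synthetic illustrations, not model output: one mimics a repulsive sampler, the other a batched sampler repeating a block verbatim:

```python
import numpy as np

def autocorr(xs, lag):
    """Pearson correlation between the sequence and a lagged copy of itself."""
    xs = np.asarray(xs, dtype=float)
    return np.corrcoef(xs[:-lag], xs[lag:])[0, 1]

# Repulsive sampler: each draw swings away from the previous one.
repulsive = [0, 9, 1, 8, 2, 7, 3, 6, 4, 5] * 20
# Batched sampler: the model emits 0..9 and then repeats the block verbatim.
periodic = list(range(10)) * 20

print(f"repulsive, lag 1:  {autocorr(repulsive, 1):+.2f}")   # strongly negative
print(f"periodic,  lag 10: {autocorr(periodic, 10):+.2f}")   # exactly +1.00
```

A truly i.i.d. sampler would show autocorrelation near zero at every lag; a negative lag-1 value is the repulsive bias, and a lag equal to the period with correlation 1.0 is the periodic pattern.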


4. The Silver Lining: Deterministic Conversion

The most fascinating finding is that LLMs can do the math; they just can't "feel" randomness. When provided with an external uniform random seed (e.g., "Here is a number from [0, 1]: 0.639"), frontier models like Qwen3-32B and Gemini-3.0-Pro can execute Inverse Transform Sampling to convert that seed into a perfect Gaussian draw.

Figure 9: LLMs successfully converting uniform seeds to various complex distributions

Why does this work? Conversion is a deterministic process. LLMs are excellent at following algorithms (bucketization, probit functions) but terrible at originating stochastic states.
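Inverse transform sampling is itself a one-line deterministic map, which is why models can execute it. A sketch of what the model is effectively computing, reusing the 0.639 seed from the example above:

```python
import numpy as np
from scipy.stats import norm

def uniform_to_gaussian(u, mu=0.0, sigma=1.0):
    """Inverse transform sampling: map u ~ Uniform(0, 1) through the
    Gaussian inverse CDF (the probit function)."""
    return norm.ppf(u, loc=mu, scale=sigma)

# The seed from the example prompt above.
z = uniform_to_gaussian(0.639)  # a mildly positive z-score (~0.36)

# Pushing many external uniform seeds through the same map yields N(0, 1).
seeds = np.random.default_rng(0).uniform(size=2000)
samples = uniform_to_gaussian(seeds)
print(f"z = {z:.3f}, mean = {samples.mean():.3f}, std = {samples.std():.3f}")
```

All the randomness lives in the seed; the probit step is pure function evaluation, exactly the kind of algorithm-following LLMs are good at.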


5. Critical Analysis & Future Directions

Limitations

  • Computational Cost: Simulating a PRNG via Chain-of-Thought is slow and expensive for high-frequency agents.
  • State Management: PRNGs are stateful; stateless API calls destroy the "randomness" chain unless the state is manually passed back and forth.
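The state-management problem can be handled on the caller's side by serializing the PRNG state and passing it along with every call. A minimal sketch, with hypothetical function names, using Python's stdlib `random`:

```python
import random

# Sketch: carrying PRNG state across stateless calls. The caller keeps the
# serialized state and sends it back with each request.
def sample_with_state(state):
    rng = random.Random()
    rng.setstate(state)
    value = rng.random()
    return value, rng.getstate()  # return the new state for the next call

state = random.Random(42).getstate()   # initial state, held by the caller
v1, state = sample_with_state(state)   # call 1
v2, state = sample_with_state(state)   # call 2

# v1, v2 reproduce exactly what a single stateful Random(42) would emit.
print(v1, v2)
```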

Final Takeaway

If you are building an AI agent that needs to explore an environment (like an RL agent) or randomize choices, do not ask the LLM to be random. Instead:

  1. Generate a seed using an external tool (Python random or time).
  2. Pass that seed to the LLM.
  3. Let the LLM use its emergent deterministic reasoning to map that seed to the desired action space.
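The three-step recipe can be sketched end to end. The action names and the `seed_to_action` helper below are hypothetical, standing in for whatever deterministic mapping the LLM executes in step 3:

```python
import random
import time

# Step 1: generate the seed outside the model (an external tool call).
rng = random.Random(time.time_ns())
seed = rng.random()  # u ~ Uniform(0, 1)

# Step 2 (omitted here) would place the seed into the LLM's prompt.

# Step 3: a deterministic map the LLM can execute in its reasoning trace:
# bucketize the uniform seed into the action space (hypothetical actions).
actions = ["explore_north", "explore_south", "explore_east", "explore_west"]

def seed_to_action(u, actions):
    """Deterministically bucketize u in [0, 1) into one of len(actions) bins."""
    index = min(int(u * len(actions)), len(actions) - 1)
    return actions[index]

chosen = seed_to_action(seed, actions)
print(f"seed={seed:.3f} -> {chosen}")
```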

The future of agentic AI isn't in teaching models to "roll dice" internally, but in providing them with high-quality, stateful external samplers.
