Are LLM emergent abilities just a measurement trick?
A major clue comes from how LLM performance is measured. Standard evaluations often score a single generated answer per test item and count the task as "solved" only if that answer is exactly right. Under this binary pass/fail scoring, a small model that succeeds only once in thousands of attempts scores zero, so it appears to have no ability at all until it suddenly crosses a threshold. Researchers introduced PassUntil, which keeps sampling answers from the model until a correct one appears and uses the number of samples needed to estimate the success rate, giving effectively infinite measurement resolution [4][6]. Using this method, they found that even small models show consistent, gradual improvements that were previously invisible. For instance, they could predict the code-generation performance of a 2.4 billion parameter model with only 0.05% error before training even started [4][6]. This suggests that much of the "sudden emergence" people observed was an artifact of coarse measurement tools.
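The difference in resolution can be sketched with a small simulation. This is a simplified illustration of the sample-until-success idea, not the authors' implementation: the model is replaced by a random draw with a hypothetical true success rate `p_true`, and the function names, the `successes` parameter, and the sampling budget are all assumptions made for the example.

```python
import random


def single_attempt_score(p_true, n_tasks, rng):
    """Pass/fail scoring: one sampled answer per task, counted correct or not."""
    solved = sum(rng.random() < p_true for _ in range(n_tasks))
    return solved / n_tasks


def sample_until_pass(p_true, rng, successes=1, max_samples=1_000_000):
    """Keep sampling until `successes` correct answers appear.

    Returns successes / samples_drawn as the estimated pass probability.
    """
    passed = 0
    for drawn in range(1, max_samples + 1):
        if rng.random() < p_true:  # stand-in for "generate and check an answer"
            passed += 1
            if passed == successes:
                return successes / drawn
    return passed / max_samples  # budget exhausted; estimate from what was seen


if __name__ == "__main__":
    rng = random.Random(0)
    p_true = 0.001  # hypothetical per-sample success rate of a small model
    print("single attempt, 100 tasks:", single_attempt_score(p_true, 100, rng))
    print("sample until 20 passes:   ", sample_until_pass(p_true, rng, successes=20))
```

With a success rate this low, the single-attempt score over 100 tasks will usually be exactly zero, while the sample-until-pass estimate lands near the true rate. The cost is compute: the lower the success rate, the more samples are needed before a pass appears.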
However, not all emergence disappears under closer inspection. The same work identified a type of "accelerated emergence" in which performance improves at an increasing rate as model size grows, a pattern that cannot be fit by standard smooth scaling curves [4][6]. The authors suggest this might be due to multiple neural circuits in the model activating together at a certain scale, creating a genuine qualitative leap. So while some apparent emergence is a measurement artifact, some appears to be real.
Does the way we test LLMs create fake emergence?
Yes, the testing method itself can create the illusion of emergent abilities. Researchers compared two ways of measuring LLMs' linguistic knowledge: asking the model a direct question (prompting) versus reading out the probabilities it assigns to different word sequences [3][5]. They found that prompting consistently underestimates what the model actually knows. For example, a model might give a wrong answer when asked "Is this sentence grammatical?" even though its probability scores correctly assign higher likelihood to the grammatical sentence than to the ungrammatical one [3][5]. This means that when a small model fails a prompted test, it might still have the underlying ability; it just cannot express it through a metalinguistic judgment. The more a prompt diverges from the model's natural next-word prediction task, the less consistent the prompted answers are with the probability measurements [3][5]. So some reported "emergent abilities" may simply be cases where larger models become better at understanding the test format, not at the underlying skill.
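The two measurement styles can be contrasted in a short sketch. It assumes per-token log-probabilities have already been extracted from a model for each sentence of a minimal pair. The field names, helper functions, and all numbers below are invented for illustration and are not taken from the cited studies.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a whole sentence: the sum of its token log-probs."""
    return sum(token_logprobs)


def direct_accuracy(pairs):
    """Share of pairs where the grammatical sentence gets the higher log-prob."""
    hits = sum(
        sequence_logprob(p["good_logprobs"]) > sequence_logprob(p["bad_logprobs"])
        for p in pairs
    )
    return hits / len(pairs)


def prompted_accuracy(pairs):
    """Share of pairs where the model's prompted answer picked the grammatical one."""
    return sum(p["prompted_choice"] == "good" for p in pairs) / len(pairs)


# Made-up example data: three minimal pairs with invented token log-probs
# and invented prompted answers ("good" means the grammatical sentence was chosen).
pairs = [
    {"good_logprobs": [-2.1, -0.4, -1.0], "bad_logprobs": [-2.1, -0.4, -3.2],
     "prompted_choice": "good"},
    {"good_logprobs": [-1.5, -0.9, -0.7], "bad_logprobs": [-1.5, -2.6, -0.8],
     "prompted_choice": "bad"},
    {"good_logprobs": [-3.0, -1.2], "bad_logprobs": [-3.0, -1.1],
     "prompted_choice": "bad"},
]

print("direct probability accuracy:", direct_accuracy(pairs))
print("prompted accuracy:          ", prompted_accuracy(pairs))
```

In this toy data the second pair is the case described above: the probabilities favor the grammatical sentence, yet the prompted answer is wrong. Summed log-probabilities are only comparable when the two sentences have similar token counts, which minimal pairs are usually built to ensure.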
What emergent abilities are genuinely real?
Despite measurement artifacts, some abilities do appear to emerge in LLMs without explicit training. In a direct comparison with humans, GPT-3 solved analogical reasoning problems (like Raven's matrices) at a level matching or surpassing human performance, even though it had no direct training on such tasks [2]. This is a clear case of zero-shot reasoning: the model worked out the abstract pattern without being shown any solved examples. Similarly, when populations of LLM agents interact, they spontaneously develop social conventions and collective biases without being programmed to do so [1]. These conventions emerged universally across decentralized groups, and even a committed minority of adversarial agents could drive the whole population to adopt new norms [1]. These findings show that LLMs can bootstrap complex social and reasoning abilities from their training alone, which is a genuinely emergent phenomenon not reducible to measurement tricks.
Sources used in this answer
Emergent social conventions and collective bias in LLM populations
LLM populations spontaneously develop universal social conventions and collective biases without explicit programming, and a committed minority can drive social change [1].
Emergent analogical reasoning in large language models
GPT-3 matched or surpassed humans on zero-shot analogical reasoning tasks, including Raven's matrices, showing genuine emergent reasoning ability [2].
Prompting is not a substitute for probability measurements in large language models
LLMs' metalinguistic judgments from prompting are inferior to direct probability measurements, and consistency worsens as prompts diverge from next-word prediction [3].
Predicting Emergent Abilities with Infinite Resolution Evaluation
Using PassUntil (infinite-resolution evaluation), small models show predictable task scaling; a 2.4B model's code generation performance was predicted with 0.05% error [4].
Prompt-based methods may underestimate large language models' linguistic generalizations
Prompting underestimates LLMs' linguistic knowledge compared to direct probability measurements, so negative prompt-based results are not conclusive [5].
Unlock Predictable Scaling from Emergent Abilities
PassUntil reveals both predictable scaling and accelerated emergence that cannot be fit by standard scaling laws, suggesting multiple circuits may cause real qualitative leaps [6].
