Is there a hard limit on how many steps an LLM can reason through internally?
The evidence so far points to one: latent (internal) planning depth appears to be capped, and scaling model size alone does not remove the cap. In a 2026 study, researchers tested whether LLMs could discover and execute multi-step planning strategies within a single forward pass, without being explicitly taught the intermediate steps [1]. Tiny transformers trained from scratch managed at most 3 latent steps, fine-tuned GPT-4o and Qwen3-32B reached 5 steps, and the most advanced model tested, GPT-5.4, reached 7 steps under few-shot prompting [1]. In other words, even the largest models tested hit a wall at around 5-7 internal reasoning steps when they had to work out the strategy on their own.
Crucially, this ceiling is not about execution: once a model has discovered a strategy, it can generalize to as many as 8 latent steps at test time [1]. The bottleneck lies in discovering the strategy under final-answer supervision alone. This dissociation suggests that for tasks requiring many coordinated internal steps, the strategy may need to be explicitly taught or externalized (e.g., via chain-of-thought prompting). That matters for AI safety monitoring, because externalized reasoning appears in the output, whereas latent reasoning does not [1].
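To make "latent steps" and "final-answer supervision" concrete, here is a minimal toy sketch in Python. It is not the task used in [1]; the pointer-chasing setup and the names `make_example`, `hidden_trace`, `prompt`, and `label` are all hypothetical, chosen only for illustration. Each example requires following a chain of lookups `num_steps` times, but the training pair contains only the starting point and the final answer.

```python
import random


def make_example(num_steps, num_nodes=16, seed=0):
    """Build one toy pointer-chasing example (hypothetical task, not from [1])."""
    rng = random.Random(seed)

    # Random permutation: every node points to exactly one next node.
    nodes = list(range(num_nodes))
    targets = nodes[:]
    rng.shuffle(targets)
    mapping = dict(zip(nodes, targets))

    # Follow the pointers num_steps times. These hops are the "latent" steps:
    # a model would have to carry them out internally.
    start = rng.choice(nodes)
    trace = [start]
    for _ in range(num_steps):
        trace.append(mapping[trace[-1]])

    return {
        "mapping": mapping,
        "start": start,
        "answer": trace[-1],
        # Final-answer supervision: only prompt and label are used for training.
        "prompt": (mapping, start, num_steps),
        "label": trace[-1],
        # The intermediate hops are never shown to the model.
        "hidden_trace": trace,
    }


example = make_example(num_steps=5)
print("start:", example["start"], "answer:", example["answer"])
```

Under this framing, "discovering the strategy" would mean learning to chain the lookups from `label` alone, while chain-of-thought would correspond to writing out `hidden_trace` explicitly.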
Does the way we test LLMs create an artificial ceiling?
Yes, the evaluation format can determine whether a ceiling appears at all. A 2026 benchmark called CAKE tested 22 model configurations (0.5B to 70B parameters) on cloud architecture knowledge using two formats: multiple-choice questions (MCQs) and free-response questions [4]. The MCQ results showed a clear ceiling effect: accuracy plateaued above 3 billion parameters, with the best model reaching 99.2% [4]. For simple recognition tasks, then, scaling beyond a modest size yields no measurable benefit, because the test has become too easy.
Free-response scores, however, continued to rise steadily with model size at every cognitive level (recall, analyze, design, implement) [4]. The ceiling therefore sits in the measurement tool, not in the model's capability. The two formats capture different facets of knowledge: MCQs measure recognition, while free-response questions measure deeper understanding and generation. So when people claim that scaling has hit a ceiling, they may simply be using the wrong yardstick.
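The distinction between a saturated metric and a still-improving one can be checked mechanically. The sketch below is a simple plateau test in Python. The function name `first_plateau`, the `min_gain` threshold, and all score values are invented placeholders for illustration; they are not CAKE results.

```python
def first_plateau(sizes, scores, min_gain=1.0):
    """Return the first size after which no step gains more than min_gain.

    Returns None if the score is still improving at the final step.
    """
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    for i in range(len(gains)):
        # A plateau starts at sizes[i] if every remaining step is flat.
        if all(gain <= min_gain for gain in gains[i:]):
            return sizes[i]
    return None


# Invented placeholder numbers, NOT benchmark data.
sizes_b = [0.5, 3, 7, 70]             # model sizes in billions of parameters
mcq_scores = [70.0, 95.0, 99.0, 99.0]  # bounded metric that saturates
free_scores = [30.0, 45.0, 60.0, 75.0]  # metric that keeps improving

print("MCQ plateau starts at:", first_plateau(sizes_b, mcq_scores))
print("Free-response plateau starts at:", first_plateau(sizes_b, free_scores))
```

A check like this only says that a given metric has stopped moving; it cannot say whether the model or the test is the limiting factor, which is why comparing two formats on the same models is informative.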
Can we keep scaling without hitting a cost ceiling?
Parameter-efficient fine-tuning (PEFT) methods show that scaling does not have to mean training every parameter, which offers a practical workaround to the cost ceiling. A 2023 study spanning over 100 NLP tasks found that large-scale models can be adapted effectively by optimizing just a small fraction of their parameters, often less than 1%, while keeping the rest fixed [2]. This drastically cuts computation and storage costs, making it feasible to adapt ever-larger models without prohibitive expense [2].
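A quick calculation shows how a trainable fraction under 1% can arise. The sketch below assumes a low-rank adapter in the style of LoRA, which is one well-known PEFT method and is used here only as an example; the source does not say which methods produce which fractions. The function name and the layer dimensions are hypothetical.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable share of parameters for one low-rank-adapted weight matrix.

    Assumes a frozen d_out x d_in weight plus a trainable update B @ A,
    where B is d_out x rank and A is rank x d_in.
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
print(f"trainable fraction: {fraction:.4%}")
```

With these assumed dimensions the adapter adds 65,536 trainable parameters next to 16,777,216 frozen ones, so the trainable share stays below 1%.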
Open-source projects like DeepSeek LLM also demonstrate that parameter count alone does not determine performance. DeepSeek's 67B model surpassed LLaMA-2 70B on code, math, and reasoning benchmarks, and its chat version outperformed GPT-3.5 in open-ended evaluations [3]. This shows that scaling can continue to yield gains, especially when guided by careful data curation (a dataset of 2 trillion tokens that continues to grow) and alignment techniques such as supervised fine-tuning and direct preference optimization [3]. The practical ceiling is not a fixed size but a moving target that depends on data quality, training methodology, and the specific capability being measured.
Sources used in this answer
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
Latent planning depth in LLMs is capped at 5-7 steps even for the largest models, with a dissociation between strategy discovery and execution [1].
Parameter-efficient fine-tuning of large-scale pre-trained language models
Parameter-efficient fine-tuning (delta-tuning) can effectively stimulate large models by optimizing a tiny fraction of parameters, cutting costs drastically [2].
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B surpassed LLaMA-2 70B on code, math, and reasoning, and its chat version outperformed GPT-3.5, showing scaling gains with quality data [3].
CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
Multiple-choice accuracy plateaus above 3B parameters (best 99.2%), but free-response scores continue to scale, showing evaluation format creates an artificial ceiling [4].
Scaling laws for neural language models
Scaling laws for neural language models show that performance improves predictably with model size, data, and compute, but overfitting becomes a concern near the infinite data limit [5].
