Is there a practical ceiling to LLM scaling laws?

Is there a hard limit on how many steps an LLM can reason through internally?

Yes, there appears to be a ceiling on latent (internal) planning depth, and scaling model size alone does not break through it. In a 2026 study, researchers tested whether LLMs could discover and execute multi-step planning strategies within a single forward pass, without being explicitly taught the intermediate steps [1]. Tiny transformers trained from scratch managed at most 3 latent steps, fine-tuned GPT-4o and Qwen3-32B reached 5 steps, and the most advanced model tested, GPT-5.4, achieved 7 steps under few-shot prompting [1]. In other words, even the largest models tested hit a wall at around 5-7 internal reasoning steps when they have to work out the strategy on their own.

Crucially, this ceiling is not about execution: once a model has discovered a strategy, it can generalize to up to 8 latent steps at test time [1]. The bottleneck lies in discovering the strategy under final-answer supervision alone. This dissociation suggests that for tasks requiring many coordinated internal steps, the strategy may need to be explicitly taught or externalized (e.g., via chain-of-thought prompting), which has implications for AI safety monitoring [1].

Does the way we test LLMs create an artificial ceiling?

Yes, the evaluation format strongly shapes whether a ceiling appears. A 2026 benchmark called CAKE tested 22 model configurations (0.5B to 70B parameters) on cloud architecture knowledge using two formats: multiple-choice questions (MCQs) and free-response questions [4]. The results showed a clear ceiling effect for MCQs: accuracy plateaued above 3 billion parameters, with the best model reaching 99.2% [4]. For simple recognition tasks, then, scaling beyond a modest size yields no measurable benefit, because the test becomes too easy.

However, free-response scores continued to scale steadily across all cognitive levels (recall, analyze, design, implement) and model sizes [4]. This shows that the ceiling lies in the measurement tool rather than in the model's capability. The two formats capture different facets of knowledge: MCQs measure recognition, while free-response questions measure deeper understanding and generation. So when people claim that scaling has hit a ceiling, they may simply be using the wrong yardstick.
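One way to check for this kind of format-dependent plateau is to estimate how many score points each doubling of model size buys, using only the models above a size threshold. The sketch below is a minimal illustration under stated assumptions: the function name `gain_per_doubling` is hypothetical, and the score lists are invented placeholders, not results from CAKE [4].

```python
import math


def gain_per_doubling(points, min_params):
    """Least-squares slope of score against log2(parameter count).

    Only models with at least `min_params` parameters are used, so the
    result reads as "score points gained per doubling of model size".
    """
    pts = [(math.log2(n), s) for n, s in points if n >= min_params]
    if len(pts) < 2:
        raise ValueError("need at least two models above the threshold")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Placeholder (parameter count, score) pairs, invented for illustration only.
# These are NOT CAKE results.
mcq = [(0.5e9, 80.0), (3e9, 97.0), (8e9, 98.0), (32e9, 98.5), (70e9, 99.0)]
free_response = [(0.5e9, 30.0), (3e9, 45.0), (8e9, 55.0), (32e9, 68.0), (70e9, 75.0)]

mcq_slope = gain_per_doubling(mcq, min_params=3e9)
fr_slope = gain_per_doubling(free_response, min_params=3e9)
print(f"MCQ: {mcq_slope:.2f} points per doubling")
print(f"Free response: {fr_slope:.2f} points per doubling")
```

A slope near zero for one format alongside a clearly positive slope for the other, on the same set of models, points to a saturated test rather than a saturated model.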

Can we keep scaling without hitting a cost ceiling?

Parameter-efficient fine-tuning (PEFT) methods show that scaling does not have to mean training all parameters, which offers a practical workaround to the cost ceiling. A 2023 study covering over 100 NLP tasks found that large-scale models can be effectively adapted by optimizing just a small fraction of parameters, often less than 1%, while keeping the rest fixed [2]. This drastically cuts computation and storage costs, making it feasible to adapt ever-larger models without prohibitive expense [2].
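To make the "less than 1%" figure concrete, the arithmetic can be sketched for one PEFT method, low-rank adaptation (LoRA), chosen here purely as an illustration. The assumptions are loud ones: the base model size uses the rough 12 * n_layers * d_model^2 estimate for non-embedding transformer parameters, two square matrices per layer are adapted, and the dimensions and function names below are hypothetical rather than taken from [2].

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one low-rank pair: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_model, n_layers, rank, matrices_per_layer=2, base_params=None):
    """Share of all parameters that are trainable when only LoRA adapters are updated.

    If `base_params` is not given, it falls back to the rough
    12 * n_layers * d_model**2 estimate (attention plus MLP weights,
    embeddings ignored). This is an approximation, not an exact count.
    """
    if base_params is None:
        base_params = 12 * n_layers * d_model ** 2
    added = n_layers * matrices_per_layer * lora_params(d_model, d_model, rank)
    return added / (base_params + added)


# Hypothetical configuration, assumed for illustration only.
frac = trainable_fraction(d_model=4096, n_layers=32, rank=8)
print(f"Trainable share: {frac:.4%}")
```

Under these assumptions the trainable share comes out well below 1%, and it shrinks further as `d_model` grows, because the adapter size scales linearly with width while the frozen weights scale quadratically.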

Open-source projects like DeepSeek LLM also demonstrate that scaling laws are not monolithic. DeepSeek's 67B model surpassed LLaMA-2 70B on code, math, and reasoning benchmarks, and its chat version outperformed GPT-3.5 in open-ended evaluations [3]. This shows that scaling can continue to yield gains, especially when guided by careful data curation (2 trillion tokens and growing) and alignment techniques like supervised fine-tuning and direct preference optimization [3]. The practical ceiling is not a fixed size but a moving target that depends on data quality, training methodology, and the specific capability being measured.

Sources used in this answer

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Latent planning depth in LLMs is capped at 5-7 steps even for the largest models, with a dissociation between strategy discovery and execution [1].

2026 · Yi Xu, Philipp Jettkant, Laura Ruis · arXiv (Cornell University)

Parameter-efficient fine-tuning of large-scale pre-trained language models

Parameter-efficient fine-tuning (delta-tuning) can effectively stimulate large models by optimizing a tiny fraction of parameters, cutting costs drastically [2].

2023 · Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang, Juanzi Li, Maosong Sun · Nat. Mach. Intell.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek LLM 67B surpassed LLaMA-2 70B on code, math, and reasoning, and its chat version outperformed GPT-3.5, showing scaling gains with quality data [3].

2024 · Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, C. Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wen-Hui Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, W. Liang, Fangyun Lin, A. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, X. Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Z. Ren, C. Ruan, Zhangli Sha, Zhihong Shao, Jun-Mei Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, M. Tang, Bing-Li Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Yu Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yi Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yu-mei You, Shuiping Yu, Xin-yuan Yu, Bo Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghu Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou · arXiv.org

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Multiple-choice accuracy plateaus above 3B parameters (best 99.2%), but free-response scores continue to scale, showing evaluation format creates an artificial ceiling [4].

2026 · Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi, Florian Girardo Lukas, Rahime Yilmaz, Krzysztof Sierszecki, Serkan Ayvaz · arXiv (Cornell University)

Scaling laws for neural language models

Performance improves predictably, as a power law, with model size, dataset size, and compute, but returns diminish when either model size or dataset size is held fixed while the other grows [5].

2020 · J Kaplan, S McCandlish, T Henighan, TB Brown
