Can LLMs reliably plan and execute multi-step reasoning tasks?

How well do LLMs actually perform on multi-step reasoning?

LLMs can perform multi-step reasoning, but their accuracy varies widely by task and model. On radiology questions, a multi-step retrieval framework (RaR) boosted mean diagnostic accuracy from 67% to 75% across 25 models [2]. On medical board exams, an ensemble reasoning approach improved accuracy by up to 4% on GPT-3.5 and Med42-70B, and by 1.15% on GPT-4 [1]. However, on complex clinical cases, even state-of-the-art reasoning models exceeded 85% accuracy only on simple diagnostic tasks when given sufficient test results, and performance dropped sharply on treatment planning and exam recommendation [3]. This means that while LLMs can reason step-by-step, they are most reliable on well-defined, data-rich problems and struggle with open-ended planning.
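The gains above mix two ways of reading a percentage. As a minimal sketch, assuming the reported figures are accuracies in percent, the helper below (the name `gains` is hypothetical) separates an absolute gain in percentage points from a relative gain, using the radiology numbers from [2].

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Radiology accuracy reported in [2]: 67% before, 75% after.
absolute, relative = gains(67.0, 75.0)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 8.0 points, relative: 11.9%"
```

The same 8-point improvement reads as roughly 12% when expressed relative to the baseline, so it helps to check which convention a paper uses before comparing results across studies.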

On visual reasoning tasks, the LlamaV-o1 work introduced a benchmark covering over 4,000 reasoning steps together with a multimodal model trained to reason step by step. That model, the best open-source multimodal model in the comparison, achieved an average score of 67.3% across six benchmarks, outperforming prior models by 3.8% while being 5x faster [4]. This shows that structured, step-by-step training can improve both accuracy and efficiency, but the absolute performance still leaves room for improvement.

What prompting strategies make LLMs reason better?

Several prompting strategies significantly boost multi-step reasoning. Chain-of-thought (CoT) prompting, where the model is asked to 'think step by step,' helps smaller models break down complex medical queries into sequential steps, improving accuracy and interpretability on the PubMedQA dataset [5]. A more advanced approach, Plan-and-Solve (PS) prompting, first devises a plan to divide the task into subtasks and then executes them, consistently outperforming standard zero-shot CoT across ten math and reasoning datasets [10]. For example, PS prompting performed comparably to 8-shot CoT on math problems, so it removed the need for hand-written examples while achieving similar accuracy.
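The difference between the two zero-shot strategies comes down to the trigger sentence appended to the question. The sketch below builds both prompts; the function name `build_prompt`, the `Q:`/`A:` layout, and the exact trigger wording are assumptions for illustration and should be checked against the cited papers before reuse. No model is called here.

```python
# Trigger sentences for the two zero-shot strategies (wording is an assumption).
COT_TRIGGER = "Let's think step by step."
PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)


def build_prompt(question: str, trigger: str) -> str:
    """Wrap a question in a Q/A template ending with the chosen trigger."""
    return f"Q: {question}\nA: {trigger}"


question = "A clinic sees 12 patients per hour for 4 hours, then 7 more. How many in total?"
cot_prompt = build_prompt(question, COT_TRIGGER)
ps_prompt = build_prompt(question, PS_TRIGGER)
print(ps_prompt)
```

Neither prompt contains worked examples, which is what makes both approaches zero-shot; the PS variant only asks the model to plan before it solves.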

Another effective method is evidence chaining, where related facts are grouped into 'evidence chains' to avoid missing important information. This approach (MindMap) significantly improved CoT and Selection-Inference frameworks on multi-step reasoning benchmarks like bAbI and ProofWriterOWA [6]. Similarly, integrating logic programming (ChatLogic) enhanced LLMs' multi-step deductive reasoning by converting problems into symbolic form and using an inference engine [7]. These results show that the right prompting or reasoning structure can make even smaller or less powerful models reason more reliably.
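To make the idea of "symbolic form plus an inference engine" concrete, here is a minimal forward-chaining sketch. It is not ChatLogic's implementation: the `Fact` and `Rule` representations and the `forward_chain` function are hypothetical, and the step where an LLM translates text into facts and rules is assumed to have already happened.

```python
Fact = tuple[str, str]        # (predicate, entity), e.g. ("cat", "tom")
Rule = tuple[list[str], str]  # (premise predicates, conclusion predicate)


def forward_chain(facts: set[Fact], rules: list[Rule]) -> set[Fact]:
    """Apply rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        entities = {entity for _, entity in derived}
        for premises, conclusion in rules:
            for entity in entities:
                new_fact = (conclusion, entity)
                if new_fact not in derived and all(
                    (p, entity) in derived for p in premises
                ):
                    derived.add(new_fact)
                    changed = True
    return derived


# Symbolic form of: "Tom is a cat. Cats are mammals. Mammals are animals."
facts = {("cat", "tom")}
rules = [(["cat"], "mammal"), (["mammal"], "animal")]
closure = forward_chain(facts, rules)
print(("animal", "tom") in closure)  # prints True
```

The two-step deduction (cat to mammal to animal) is carried out by the engine rather than by the language model, which is why this kind of pipeline can be more dependable on long deductive chains.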

When do LLMs still fail at multi-step reasoning?

LLMs still fail in several key ways. On clinical cases, reasoning is generally factual, but critical steps are often missing, especially in examination recommendation and treatment planning [3]. This means the model might give a correct final answer but skip important intermediate logic, making its reasoning unreliable for high-stakes decisions. On software engineering tasks, complex autonomous agents that plan and execute multi-step fixes actually underperformed a simpler agentless approach, which achieved 32% correct fixes on SWE-bench Lite at low cost ($0.70 per fix) [8]. This suggests that current LLMs' planning abilities are not yet robust enough to justify complex agent architectures.

The Plan-and-Solve study also attributes failures of zero-shot CoT to three specific types of errors: calculation errors, missing-step errors, and semantic misunderstanding errors [10]. Even with advanced prompting, these errors persist. For example, on medical reasoning, smaller models still struggle with highly specialized content, and retrieval-augmented generation is needed to close the gap with larger models [5]. Additionally, LLMs often fail to use necessary and sufficient knowledge, leading to incorrect conclusions from missing evidence or false reasoning paths [9]. These limitations mean that for critical applications, human-AI teaming is still essential [1].
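Of these error types, calculation errors are the easiest to check mechanically. The sketch below is an illustrative assumption, not a method from the cited papers: it scans a reasoning trace for simple `a op b = c` statements and recomputes each one. The names `STEP`, `OPS`, and `find_calculation_errors` are hypothetical, and missing-step or semantic errors are out of its reach.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"
# Matches simple binary arithmetic statements such as "48 + 7 = 56".
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_calculation_errors(reasoning: str, tol: float = 1e-6) -> list[str]:
    """Return the arithmetic statements whose stated result is wrong."""
    errors = []
    for match in STEP.finditer(reasoning):
        a, op, b, claimed = match.groups()
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "/" and b == 0:
            continue  # skip division by zero instead of raising
        actual = OPS[op](a, b)
        if abs(actual - claimed) > tol:
            errors.append(f"{match.group(0)} (expected {actual:g})")
    return errors


trace = "Step 1: 12 * 4 = 48. Step 2: 48 + 7 = 56."
errors = find_calculation_errors(trace)
print(errors)  # prints ['48 + 7 = 56 (expected 55)']
```

A check like this only catches arithmetic slips in explicitly written steps, so it complements rather than replaces human review of the reasoning.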

Sources used in this answer

[1] Reasoning with large language models for medical question answering

Ensemble reasoning improved accuracy by up to 4% on GPT-3.5 and Med42-70B, and by 1.15% on GPT-4, for medical question answering.

2024 · Mary M Lucas, Justin Yang, Jon K Pomeroy, Christopher C Yang · Journal of the American Medical Informatics Association : JAMIA

[2] Multi-step retrieval and reasoning improves radiology question answering with large language models

A multi-step retrieval framework (RaR) improved mean diagnostic accuracy from 67% to 75% on radiology questions across 25 LLMs.

2025 · Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh · NPJ digital medicine

[3] Quantifying the reasoning abilities of LLMs on clinical cases

Current reasoning LLMs exceed 85% accuracy on simple diagnostic tasks but drop on treatment planning and exam recommendation, often missing critical reasoning steps.

2025 · Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Yanjie Fan, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, Weidi Xie · Nature communications

[4] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

LlamaV-o1 achieved 67.3% average score on six visual reasoning benchmarks, outperforming prior models by 3.8% while being 5x faster.

2025 · Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman H. Khan · Findings of the Association for Computational Linguistics: ACL 2025

[5] Chain of Thought Strategy for Smaller LLMs for Medical Reasoning

Chain-of-thought prompting helped smaller LLMs break down complex medical queries, improving accuracy and interpretability on PubMedQA.

2025 · Hurmat Ali Shah, Mowafa Househ · Studies in health technology and informatics

[6] MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models

MindMap, using evidence chains, significantly improved CoT and Selection-Inference on multi-step reasoning benchmarks like bAbI and ProofWriterOWA.

2024 · Yangyu Wu, Xu Han, Wei Song, Miaomiao Cheng, Fei Li · AAAI

[7] ChatLogic: Integrating Logic Programming with Large Language Models for Multi-Step Reasoning

ChatLogic, integrating logic programming, significantly improved LLMs' multi-step deductive reasoning by converting problems into symbolic form.

2024 · Zhongsheng Wang, Jiamou Liu, Qiming Bao, Hongfei Rong, Jingfeng Zhang · IJCNN

[8] Demystifying LLM-Based Software Engineering Agents

A simple agentless approach achieved 32% correct fixes on SWE-bench Lite at $0.70 per fix, outperforming complex autonomous software agents.

2025 · Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang · Proc. ACM Softw. Eng.

[9] Necessary and sufficient knowledge enhanced collaborative logical reasoning in LLMs

A collaborative logical reasoning framework (CLR) outperformed baselines on multiple datasets by using deductive, abductive, and inductive reasoning together.

2025 · Peng Wang, Xiao Ding, Kai Xiong, Bing Qin, Ting Liu · Neural networks : the official journal of the International Neural Network Society

[10] Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Plan-and-Solve prompting consistently outperformed zero-shot chain-of-thought across ten datasets, matching 8-shot CoT on math reasoning.

2023 · Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, Ee-Peng Lim · Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
