How well do LLMs actually perform on multi-step reasoning?
LLMs can perform multi-step reasoning, but accuracy varies widely by task and model. On radiology questions, a multi-step retrieval framework (RaR) raised mean diagnostic accuracy from 67% to 75% across 25 models [2]. On medical board exams, an ensemble reasoning approach improved accuracy by up to 4% on GPT-3.5 and Med42-70B, and by 1.15% on GPT-4 [1]. On complex clinical cases, however, even state-of-the-art reasoning models exceeded 85% accuracy only on simple diagnostic tasks with sufficient test results, and performance dropped on treatment planning and examination recommendation [3]. Taken together, these results suggest that LLMs can reason step by step but are most reliable on well-defined, data-rich problems and struggle with open-ended planning.
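A gain such as "from 67% to 75%" can be restated either as an absolute change in percentage points or as a relative change, and the two numbers differ noticeably. The sketch below is only an illustration of that arithmetic; the helper name accuracy_gain is hypothetical and does not come from any of the cited papers.

```python
def accuracy_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100
    return absolute, relative


if __name__ == "__main__":
    # Figures reported for the RaR framework on radiology questions [2].
    absolute, relative = accuracy_gain(67, 75)
    print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
    # prints "absolute: 8.0 points, relative: 11.9%"
```

When a result is reported only as "improved by 4%", it is worth checking which of the two readings the source intends before comparing it with other figures.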
On visual reasoning tasks, the LlamaV-o1 work introduced a benchmark covering over 4,000 reasoning steps and reported that its open-source multimodal model, LlamaV-o1, achieved an average score of 67.3% across six benchmarks, outperforming prior models by 3.8% while being 5x faster [4]. This suggests that structured, step-by-step training can improve both accuracy and efficiency, although absolute performance still leaves room for improvement.
What prompting strategies make LLMs reason better?
Several prompting strategies improve multi-step reasoning. Chain-of-thought (CoT) prompting, where the model is asked to 'think step by step,' helps smaller models break down complex medical queries into sequential steps, improving accuracy and interpretability on the PubMedQA dataset [5]. A more advanced approach, Plan-and-Solve (PS) prompting, first devises a plan that divides the task into subtasks and then executes them, consistently outperforming standard zero-shot CoT across ten math and reasoning datasets [10]. For example, PS prompting matched the performance of 8-shot CoT on math problems, so it reached similar accuracy without any hand-written examples.
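The difference between the two zero-shot strategies comes down to the trigger sentence appended after the question. The sketch below builds both prompts as plain strings and calls no model. The function names and the "Q: ... A: ..." layout are assumptions made for illustration, and the Plan-and-Solve wording is a paraphrase of the plan-then-execute idea rather than a verified quotation from [10].

```python
# Trigger sentences appended after the question.
# ZERO_SHOT_COT_TRIGGER uses the familiar "think step by step" phrasing.
# PLAN_AND_SOLVE_TRIGGER is an assumed paraphrase, not a quotation from [10].
ZERO_SHOT_COT_TRIGGER = "Let's think step by step."
PLAN_AND_SOLVE_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)


def build_prompt(question: str, trigger: str) -> str:
    """Format a zero-shot prompt: the question, then the reasoning trigger."""
    return f"Q: {question.strip()}\nA: {trigger}"


def build_zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: ask the model to reason step by step."""
    return build_prompt(question, ZERO_SHOT_COT_TRIGGER)


def build_plan_and_solve_prompt(question: str) -> str:
    """Plan-and-Solve: ask for an explicit plan first, then its execution."""
    return build_prompt(question, PLAN_AND_SOLVE_TRIGGER)


if __name__ == "__main__":
    q = "A clinic sees 12 patients per hour for 7.5 hours. How many patients is that?"
    print(build_zero_shot_cot_prompt(q))
    print()
    print(build_plan_and_solve_prompt(q))
```

Neither prompt includes worked examples, which is what distinguishes both from few-shot CoT; the resulting string would be sent to whatever model client is in use.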
Another effective method is evidence chaining, where related facts are grouped into 'evidence chains' to avoid missing important information. This approach (MindMap) significantly improved CoT and Selection-Inference frameworks on multi-step reasoning benchmarks like bAbI and ProofWriterOWA [6]. Similarly, integrating logic programming (ChatLogic) enhanced LLMs' multi-step deductive reasoning by converting problems into symbolic form and passing them to an inference engine [7]. These results suggest that the right reasoning structure, whether imposed through the prompt or through an external symbolic component, can make even smaller or less powerful models reason more reliably.
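To make the "symbolic form plus inference engine" idea concrete, here is a toy forward-chaining engine over string facts. It is a generic illustration of multi-step deduction, not ChatLogic's implementation; the fact and rule encoding, and the names forward_chain and entails, are assumptions chosen for brevity. In a real pipeline the LLM would be responsible for translating the natural-language problem into such facts and rules.

```python
from typing import Iterable

# A rule is (premises, conclusion): if every premise is known, add the conclusion.
Rule = tuple[frozenset[str], str]


def forward_chain(facts: Iterable[str], rules: list[Rule]) -> tuple[set[str], list[str]]:
    """Apply rules repeatedly until no new fact can be derived.

    Returns the closed set of facts and the order in which new facts appeared.
    """
    known = set(facts)
    derived_order: list[str] = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                derived_order.append(conclusion)
                changed = True
    return known, derived_order


def entails(facts: Iterable[str], rules: list[Rule], query: str) -> bool:
    """Check whether the query follows from the facts under the rules."""
    known, _ = forward_chain(facts, rules)
    return query in known


if __name__ == "__main__":
    facts = ["cat(tom)"]
    # Rules are listed out of order on purpose: deriving animal(tom) takes two passes.
    rules: list[Rule] = [
        (frozenset({"mammal(tom)"}), "animal(tom)"),
        (frozenset({"cat(tom)"}), "mammal(tom)"),
    ]
    _, order = forward_chain(facts, rules)
    print(order)
    print(entails(facts, rules, "animal(tom)"))
```

Because the engine, rather than the model, carries out each deduction step, a skipped or hallucinated intermediate step shows up as a query that simply fails to be derived.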
When do LLMs still fail at multi-step reasoning?
LLMs still fail in several key ways. On clinical cases, reasoning is generally factual, but critical steps are often missing, especially in examination recommendation and treatment planning [3]. This means the model might give a correct final answer but skip important intermediate logic, making its reasoning unreliable for high-stakes decisions. On software engineering tasks, complex autonomous agents that plan and execute multi-step fixes actually underperformed a simpler agentless approach, which achieved 32% correct fixes on SWE-bench Lite at low cost ($0.70 per fix) [8]. This suggests that current LLMs' planning abilities are not yet robust enough to justify complex agent architectures.
LLMs also make three specific types of errors in multi-step reasoning: calculation errors, missing-step errors, and semantic misunderstanding errors [10]. Even with advanced prompting, these errors persist. For example, on medical reasoning, smaller models still struggle with highly specialized content, and retrieval-augmented generation is needed to close the gap with larger models [5]. Additionally, LLMs often fail to use necessary and sufficient knowledge, leading to incorrect conclusions from missing evidence or false reasoning paths [9]. These limitations mean that for critical applications, human-AI teaming is still essential [1].
Sources used in this answer
[1] Reasoning with large language models for medical question answering.
Ensemble reasoning improved accuracy by up to 4% on GPT-3.5 and Med42-70B, and by 1.15% on GPT-4, for medical question answering.
[2] Multi-step retrieval and reasoning improves radiology question answering with large language models.
A multi-step retrieval framework (RaR) improved mean diagnostic accuracy from 67% to 75% on radiology questions across 25 LLMs.
[3] Quantifying the reasoning abilities of LLMs on clinical cases.
Current reasoning LLMs exceed 85% accuracy on simple diagnostic tasks but drop on treatment planning and examination recommendation, often missing critical reasoning steps.
[4] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs.
LlamaV-o1 achieved a 67.3% average score on six visual reasoning benchmarks, outperforming prior models by 3.8% while being 5x faster.
[5] Chain of Thought Strategy for Smaller LLMs for Medical Reasoning.
Chain-of-thought prompting helped smaller LLMs break down complex medical queries, improving accuracy and interpretability on PubMedQA.
[6] MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models.
MindMap, using evidence chains, significantly improved CoT and Selection-Inference on multi-step reasoning benchmarks like bAbI and ProofWriterOWA.
[7] ChatLogic: Integrating Logic Programming with Large Language Models for Multi-Step Reasoning.
ChatLogic, integrating logic programming, significantly improved LLMs' multi-step deductive reasoning by converting problems into symbolic form.
[8] Demystifying LLM-Based Software Engineering Agents.
A simple agentless approach achieved 32% correct fixes on SWE-bench Lite at $0.70 per fix, outperforming complex autonomous software agents.
[9] Necessary and sufficient knowledge enhanced collaborative logical reasoning in LLMs.
A collaborative logical reasoning framework (CLR) outperformed baselines on multiple datasets by using deductive, abductive, and inductive reasoning together.
[10] Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.
Plan-and-Solve prompting consistently outperformed zero-shot chain-of-thought across ten datasets, matching 8-shot CoT on math reasoning.
