What the Research Says
Kojima et al. [1] demonstrated that simply adding the prompt 'Let's think step by step' before each answer significantly improves LLM performance on reasoning benchmarks, increasing accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7% with the large InstructGPT model. The technique, known as zero-shot chain-of-thought prompting, suggests that LLMs have untapped reasoning capabilities that can be elicited without task-specific examples [1].
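To make the technique concrete, the sketch below builds a zero-shot chain-of-thought prompt by placing the trigger phrase at the start of the answer slot, next to a plain zero-shot prompt for comparison. The helper names, the Q/A template, and the sample question are illustrative assumptions rather than code from [1], and no model is called. The sketch also computes the percentage-point gains implied by the accuracies quoted above.

```python
# Illustrative sketch only: helper names and the Q/A template are assumptions,
# not code from Kojima et al. [1]. No model is called.

TRIGGER = "Let's think step by step."


def zero_shot_prompt(question: str) -> str:
    """Plain zero-shot prompt: the model is asked to answer directly."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought prompt: the trigger phrase opens the answer."""
    return f"Q: {question}\nA: {TRIGGER}"


def gain_in_points(before: float, after: float) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


# Accuracies quoted in the paragraph above (percent, before and after).
REPORTED = {"MultiArith": (17.7, 78.7), "GSM8K": (10.4, 40.7)}

if __name__ == "__main__":
    # Hypothetical sample question, not taken from either benchmark.
    print(zero_shot_cot_prompt("A shelf holds 3 boxes of 12 pens each. How many pens are there?"))
    for name, (before, after) in REPORTED.items():
        print(f"{name}: +{gain_in_points(before, after)} points")
```

The only difference between the two prompts is the trigger phrase, which is the point of the result: the reported gains of 61.0 points on MultiArith and 30.3 points on GSM8K come from a change to the prompt, not from added examples.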
Webb et al. [2] directly compared GPT-3 (text-davinci-003) with human reasoners on analogical reasoning tasks, including a non-visual matrix reasoning task based on Raven's Standard Progressive Matrices. They found that GPT-3 matched or even surpassed human performance in most settings, with preliminary tests of GPT-4 showing even better results, indicating an emergent ability for abstract pattern induction [2].
Zhang et al. [3] introduced the NLGift benchmark to test whether LLMs generalize beyond pattern memorization in graph reasoning. Their experiments across four graph reasoning tasks showed that while LLMs generalize well on simple semantic and numeric patterns, they struggle to generalize across reasoning patterns and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks [3].
Caveats and Mechanisms: When Reasoning Fails
Kanduri [4] compared LLMs and pattern-matching functions in health-coaching dialogue systems, finding that LLMs handle diverse and complex inputs better, while pattern-matching functions offer faster response times and stricter script adherence. Because the study compares the two as system components rather than probing how LLMs reach their outputs, it speaks mainly to engineering trade-offs and offers at most indirect evidence on whether LLMs rely on pattern matching when they appear to reason [4].
Taub-Tabib et al. [5] compared a syntactic pattern-based NLP method with GPT-4 for mining symptom etiologies from scientific literature. GPT-4 was highly precise but offered lower coverage than the syntactic approach, and combining the two methods yielded synergistic outcomes. On this extraction task, then, the LLM traded coverage for precision relative to the rule-based system [5].
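The precision and coverage trade-off can be shown with a small sketch. The sets below are invented toy data, not results from [5], and coverage is interpreted here as recall, which is an assumption about how the term is used in that study. The sketch shows how taking the union of a precise, narrow extractor and a broader, noisier one can raise coverage above either method alone.

```python
# Toy illustration: the sets below are invented for this sketch and are not
# data from Taub-Tabib et al. [5]. "Coverage" is treated as recall (assumption).

def precision(predicted: set, gold: set) -> float:
    """Share of predicted items that are correct."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0


def coverage(predicted: set, gold: set) -> float:
    """Share of gold items that were found (recall)."""
    return len(predicted & gold) / len(gold) if gold else 0.0


# Hypothetical gold-standard etiologies for one symptom.
gold = {"e1", "e2", "e3", "e4", "e5", "e6"}
# A precise but narrow extractor (stand-in for the LLM).
llm_found = {"e1", "e2", "e6"}
# A broader but noisier extractor (stand-in for the syntactic patterns).
pattern_found = {"e1", "e2", "e3", "e4", "x1", "x2"}
# Combining the two methods: union of their outputs.
combined = llm_found | pattern_found

for name, found in [("llm", llm_found), ("patterns", pattern_found), ("combined", combined)]:
    print(name, round(precision(found, gold), 2), round(coverage(found, gold), 2))
```

In this toy setup the combined output covers five of the six gold items, more than either extractor on its own, at the cost of inheriting the noisier method's false positives.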
The NLGift results [3] reinforce this caution. Because LLMs failed to generalize across reasoning patterns and real-world patterns, they may be memorizing patterns in synthetic training data rather than learning transferable graph reasoning skills, which points to a key limitation in their reasoning capabilities [3].
Sources used in this answer
Large Language Models are Zero-Shot Reasoners
Zero-shot chain-of-thought prompting ('Let's think step by step') improved LLM accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7% with the large InstructGPT model, suggesting untapped zero-shot reasoning capabilities [1].
Emergent analogical reasoning in large language models
GPT-3 matched or surpassed human performance on analogical reasoning tasks including a non-visual matrix reasoning task based on Raven's Standard Progressive Matrices, with GPT-4 showing even better results [2].
Can LLM Graph Reasoning Generalize beyond Pattern Memorization?
LLMs generalize well on simple semantic and numeric patterns in graph reasoning but struggle to generalize across reasoning patterns and real-world patterns, indicating limited generalization beyond pattern memorization [3].
Enhancing Scripted Dialogue Systems for Health-Coach Applications: A Comparative Study of Large Language Models and Pattern-Matching Functions
In health-coaching dialogue systems, LLMs handle diverse inputs better than pattern-matching functions, but pattern-matching offers faster response times and stricter script adherence [4].
Identifying symptom etiologies using syntactic patterns and large language models
GPT-4 was highly precise in mining symptom etiologies but had lower coverage than a syntactic pattern-based approach; combining both methods yielded synergistic outcomes [5].
