Can large language models reason or just pattern-match?

What the Research Says

Kojima et al. [1] demonstrated that simply adding the prompt 'Let's think step by step' before each answer significantly improves LLM performance on reasoning benchmarks: with the large InstructGPT model, accuracy rose from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K. This zero-shot chain-of-thought prompting suggests that LLMs have untapped reasoning capabilities that can be elicited without task-specific examples [1].
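As a rough illustration of the prompting technique described above, the following Python sketch builds a zero-shot chain-of-thought prompt by placing the trigger phrase at the start of the answer slot. The function names, the "Q:/A:" layout, and the second-stage answer cue are assumptions made for illustration rather than details taken from the text above, and no model is called. The last helper simply restates the reported accuracies as percentage-point gains.

```python
# Trigger phrase quoted in the text above.
COT_TRIGGER = "Let's think step by step."

# Assumed wording for a follow-up cue that asks for a final answer.
# This is illustrative only; the text above does not describe it.
ANSWER_CUE = "Therefore, the answer is"


def build_reasoning_prompt(question: str) -> str:
    """Ask the question and start the answer with the trigger phrase."""
    return f"Q: {question.strip()}\nA: {COT_TRIGGER}"


def build_answer_prompt(reasoning_prompt: str, reasoning: str) -> str:
    """Append the model's reasoning text, then cue a final answer (assumed step)."""
    return f"{reasoning_prompt} {reasoning.strip()}\n{ANSWER_CUE}"


def percentage_point_gain(before: float, after: float) -> float:
    """Difference between two accuracies given in percent."""
    return round(after - before, 1)


if __name__ == "__main__":
    # Hypothetical question, not drawn from MultiArith or GSM8K.
    prompt = build_reasoning_prompt("A box holds 4 pens. How many pens are in 3 boxes?")
    print(prompt)
    print(percentage_point_gain(17.7, 78.7))  # MultiArith figures from the text
    print(percentage_point_gain(10.4, 40.7))  # GSM8K figures from the text
```

In percentage points, the reported gains are 61.0 on MultiArith and 30.3 on GSM8K.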

Webb et al. [2] directly compared GPT-3 (text-davinci-003) with human reasoners on analogical reasoning tasks, including a non-visual matrix reasoning task based on Raven's Standard Progressive Matrices. They found that GPT-3 matched or even surpassed human performance in most settings, with preliminary tests of GPT-4 showing even better results, indicating an emergent ability for abstract pattern induction [2].

Zhang et al. [3] introduced the NLGift benchmark to test whether LLMs generalize beyond pattern memorization in graph reasoning. Their experiments across four graph reasoning tasks showed that while LLMs generalize well on simple semantic and numeric patterns, they struggle with reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks [3].
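To make the idea of a pattern shift concrete, the sketch below generates a synthetic graph question in two different wordings while computing the ground truth once with breadth-first search. Connectivity is used only as an assumed example task, and the templates, style names, and function names are hypothetical; the text above does not specify NLGift's tasks or prompt formats. A generalization test in this spirit would tune on one wording and evaluate on the other.

```python
from collections import deque

# Two hypothetical surface forms for the same underlying graph problem.
EDGE_TEMPLATES = {
    "plain": "Node {u} is connected to node {v}.",
    "social": "Person {u} is friends with person {v}.",
}
QUESTION_TEMPLATES = {
    "plain": "Is there a path between node {s} and node {t}?",
    "social": "Can a message pass from person {s} to person {t} through friends?",
}


def is_connected(edges, source, target):
    """Ground truth via breadth-first search over an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def make_item(edges, source, target, style="plain"):
    """Build one question/answer pair in the requested wording."""
    description = " ".join(EDGE_TEMPLATES[style].format(u=u, v=v) for u, v in edges)
    question = QUESTION_TEMPLATES[style].format(s=source, t=target)
    answer = "yes" if is_connected(edges, source, target) else "no"
    return {"question": f"{description} {question}", "answer": answer}


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (3, 4)]
    print(make_item(edges, 0, 2, style="plain"))
    print(make_item(edges, 0, 2, style="social"))
```

Because the answer depends only on the graph, any accuracy gap between the two wordings reflects sensitivity to surface pattern rather than to the underlying problem.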

Caveats and Mechanisms: When Reasoning Fails

Kanduri [4] compared LLMs and pattern-matching functions in health-coaching dialogue systems, finding that LLMs handle diverse and complex inputs better, while pattern-matching functions offer faster response times and stricter script adherence. This is a trade-off between flexibility and control in system design: it shows where hand-written patterns remain competitive, but it is only indirect evidence on whether LLMs themselves reason or pattern-match [4].

Taub-Tabib et al. [5] compared a syntactic pattern-based NLP method with GPT-4 for mining symptom etiologies from scientific literature. GPT-4 was highly precise but offered less coverage than the syntactic approach, and combining the two methods yielded synergistic outcomes. This indicates that an LLM's extractions can be accurate yet incomplete, and that rule-based patterns can recover cases the model misses [5].
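The precision and coverage trade-off above can be illustrated with a small Python sketch. All of the data below is made up for illustration and uses placeholder labels; none of it comes from the study. Combining the two extractors by set union is also an assumption, since the text above does not say how the methods were combined.

```python
def precision(predicted: set, gold: set) -> float:
    """Share of predicted pairs that are correct."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0


def coverage(predicted: set, gold: set) -> float:
    """Share of correct pairs that were found (recall)."""
    return len(predicted & gold) / len(gold) if gold else 0.0


# Toy (symptom, etiology) pairs with placeholder names; purely illustrative.
GOLD = {
    ("symptom_a", "cause_1"),
    ("symptom_a", "cause_2"),
    ("symptom_b", "cause_3"),
    ("symptom_b", "cause_4"),
}
# A precise extractor that misses half of the gold pairs.
LLM_PAIRS = {("symptom_a", "cause_1"), ("symptom_a", "cause_2")}
# A broader extractor that finds more pairs but adds one wrong one.
PATTERN_PAIRS = {
    ("symptom_a", "cause_1"),
    ("symptom_b", "cause_3"),
    ("symptom_b", "cause_4"),
    ("symptom_b", "cause_9"),
}
# Assumed combination strategy: take the union of both outputs.
COMBINED = LLM_PAIRS | PATTERN_PAIRS

if __name__ == "__main__":
    for name, pairs in [("llm", LLM_PAIRS), ("pattern", PATTERN_PAIRS), ("combined", COMBINED)]:
        print(name, precision(pairs, GOLD), coverage(pairs, GOLD))
```

In this toy setup the union reaches full coverage while keeping precision above that of the pattern-based extractor alone, which is one way two methods can complement each other.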

On the NLGift benchmark [3], the shortfall is concentrated in two settings: LLMs struggle to generalize across reasoning patterns and to real-world patterns. This suggests that LLMs may be memorizing patterns in synthetic training data rather than learning transferable graph reasoning skills, a key limitation of their reasoning capabilities [3].

References

Large Language Models are Zero-Shot Reasoners

Zero-shot chain-of-thought prompting ('Let's think step by step') improved LLM accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7% with the large InstructGPT model, suggesting untapped zero-shot reasoning capabilities [1].

2022 · Takeshi Kojima, S. Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa · NeurIPS

Emergent analogical reasoning in large language models

GPT-3 matched or surpassed human performance on analogical reasoning tasks including a non-visual matrix reasoning task based on Raven's Standard Progressive Matrices, with GPT-4 showing even better results [2].

2023 · Taylor Webb, Keith J Holyoak, Hongjing Lu · Nature Human Behaviour

Can LLM Graph Reasoning Generalize beyond Pattern Memorization?

LLMs generalize well on simple semantic and numeric patterns in graph reasoning but struggle with reasoning and real-world patterns, indicating limited generalization beyond pattern memorization [3].

2024 · Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, Yulia Tsvetkov · Findings of the Association for Computational Linguistics: EMNLP 2024

Enhancing Scripted Dialogue Systems for Health-Coach Applications: A Comparative Study of Large Language Models and Pattern-Matching Functions

In health-coaching dialogue systems, LLMs handle diverse inputs better than pattern-matching functions, but pattern-matching offers faster response times and stricter script adherence [4].

2024 · Sai Sangameswara Aadithya Kanduri · UWM Digital Commons (University of Wisconsin–Milwaukee)

Identifying symptom etiologies using syntactic patterns and large language models

GPT-4 was highly precise in mining symptom etiologies but had lesser coverage than a syntactic pattern-based approach; combining both methods yielded synergistic outcomes [5].

2024 · Hillel Taub-Tabib, Yosi Shamay, Micah Shlain, Menny Pinhasov, Mark Polak, Aryeh Tiktinsky, Sigal Rahamimov, Dan Bareket, Ben Eyal, Moriya Kassis, Yoav Goldberg, Tal Kaminski Rosenberg, Simon Vulfsons, Maayan Ben Sasson · Scientific Reports
