Do large language models exhibit theory of mind capabilities?

How well do LLMs actually perform on theory of mind tests?

On many standard theory of mind tasks, the best LLMs perform at or above human levels. In a comprehensive battery comparing GPT-4, LLaMA2, and 1,907 humans, GPT-4 matched or exceeded humans on identifying indirect requests, false beliefs, and misdirection [1]. GPT-4o also performed comparably to humans on the Strange Stories paradigm, even in the most challenging conditions [5]. These results show that LLMs can produce correct answers on tests designed to require reasoning about others' mental states.

However, performance is uneven and sometimes deceptive. The same study found that GPT-4 struggled specifically with detecting faux pas, while LLaMA2 appeared to outperform humans on that test. Follow-up analysis showed that LLaMA2's advantage was an artifact of a bias toward attributing ignorance, not genuine understanding [1]. GPT-4's poor faux pas performance stemmed from a hyperconservative approach: it refused to commit to conclusions that humans found self-evident [7]. High accuracy on some tasks can therefore mask fundamentally different underlying processes.
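A toy sketch may help show how a response bias can look like skill. It assumes a test in which the scored question has the same correct answer for every item, so a responder that always gives that answer scores perfectly without reading the story. The items, the `always_ignorant` responder, and the control variants below are hypothetical and are not data from the cited study.

```python
# Toy illustration only: the items and the "always ignorant" responder are
# hypothetical and are not data from the cited study.

def accuracy(responder, items):
    """Fraction of items where the responder's answer matches the answer key."""
    correct = sum(responder(question) == answer for question, answer in items)
    return correct / len(items)


def always_ignorant(question):
    """A biased responder: it says the speaker did not know, whatever is asked."""
    return "did not know"


# Original-style test: the correct answer happens to be the same for every item.
original_items = [
    ("Did the speaker know the vase was a gift?", "did not know"),
    ("Did the speaker know the host baked the cake?", "did not know"),
    ("Did the speaker know the curtains were new?", "did not know"),
]

# Control variants: some stories are rewritten so that the speaker did know.
control_items = [
    ("Did the speaker know the vase was a gift?", "did not know"),
    ("Did the speaker know the host baked the cake?", "knew"),
    ("Did the speaker know the curtains were new?", "did not know"),
    ("Did the speaker know the report was late?", "knew"),
]

original_score = accuracy(always_ignorant, original_items)
control_score = accuracy(always_ignorant, control_items)
print(f"original: {original_score:.2f}, control: {control_score:.2f}")
```

The biased responder is perfect on the original-style items and drops to chance-like accuracy once the answer key varies, which is why control variants are needed before reading a high score as understanding.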

Why is LLM theory of mind different from human theory of mind?

The core difference is that LLMs lack the developmental, embodied, and cognitive mechanisms that give rise to genuine theory of mind in humans. A systematic review concluded that LLMs produce an 'illusion of understanding' because they have no real-world experience, no developmental trajectory, and no multimodal sensory input — all of which are crucial for human social cognition [6]. Without embodiment in an action-oriented environment, their mentalistic inference is qualitatively different from human cognition [7].

The difference shows up as measurable brittleness. When researchers applied minimal adversarial transformations to theory of mind scenarios, answer consistency dropped by 18–34% for every LLM tested [2]. Small changes that would not fool a human were enough to disrupt the models' reasoning. Robustness also varies by model: in the Strange Stories evaluation, earlier and smaller models were strongly affected by the number of inferential cues and were vulnerable to distracting information, whereas GPT-4o showed high robustness [5]. This variability across models and conditions suggests that LLMs are not reliably deploying a stable, human-like reasoning ability.
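One way to make "answer consistency" concrete is to pair each scenario with a minimally transformed version and count how often the model's answer stays the same. The sketch below assumes that definition; the cited study's exact metric may differ, and the answer pairs are hypothetical placeholders, not real model outputs.

```python
# Minimal sketch, assuming consistency = share of scenario pairs whose answers
# match after light normalization. The pairs below are hypothetical.

def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def consistency(pairs):
    """Fraction of (original, transformed) answer pairs that agree."""
    same = sum(normalize(a) == normalize(b) for a, b in pairs)
    return same / len(pairs)


# Each tuple: (answer to the original scenario, answer to the transformed one).
answer_pairs = [
    ("In the basket.", "in the basket"),
    ("In the box", "In the basket"),
    ("She thinks it is in the drawer", "She thinks it is in the drawer."),
    ("In the cupboard", "in the  cupboard"),
]

rate = consistency(answer_pairs)
print(f"consistency: {rate:.2f}")
```

Exact string matching after normalization is a deliberately crude choice; a real evaluation would more likely map free-text answers to a fixed label set before comparing them.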

What does this mean for trusting LLMs in social roles?

Users already attribute mental states to LLMs, but these attributions affect trust in nuanced ways. In a study of 410 participants, attributing intelligence (reasoning, planning) to an LLM strongly predicted how much people trusted its advice, while attributing consciousness or emotions actually predicted less trust [4]. This suggests users have sophisticated intuitions: they trust LLMs for cognitive tasks but are wary of attributing subjective experience to them.

For practical applications like social skills training, LLMs show promise but require caution. GPT-4o matched human experts in evaluating theory of mind tasks in a gamified environment for autistic users, with no statistically significant differences in accuracy [3]. However, the same study noted that LLMs' 'black box' nature raises concerns about explainability and transparency, especially when used by vulnerable populations. The evidence overall suggests that LLMs can be useful tools for social reasoning tasks, but their outputs should not be mistaken for genuine understanding, and their brittleness means they can fail unpredictably.

Sources used in this answer

1

Testing theory of mind in large language models and humans

GPT-4 matched or exceeded humans on false beliefs, indirect requests, and misdirection but struggled with faux pas; LLaMA2's apparent superiority on faux pas was a bias toward attributing ignorance [1].

2024 · James W A Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S A Graziano, Cristina Becchio · Nature Human Behaviour

2

Functional Theory of Mind Evaluation in Large Language Models: A Behavioral and Causal Stability Framework

LLMs showed 18–34% drops in answer consistency under minimal scenario transformations, and later transformer layers (65–80) encoded perspective-taking with measurable causal effects [2].

2026 · Prashanta Kumar Mohanty, Anupam Prasad, Abhisek Soy, Gaurav Kumar, Akanksha Shukla · International Scientific Journal of Engineering and Management

3

Large language models for autism: evaluating theory of mind tasks in a gamified environment

GPT-4o matched human experts in evaluating theory of mind tasks in a gamified environment for autistic users, with no statistically significant differences [3].

2025 · Christian Poglitsch, Anna Reiss, Selina C Wriessnegger, Johanna Pirker · Scientific Reports

4

The influence of mental state attributions on trust in large language models

Attributions of intelligence to an LLM strongly predicted trust, while attributions of consciousness predicted less trust, in a study of 410 participants [4].

2025 · Clara Colombatto, Jonathan Birch, Stephen M Fleming · Communications Psychology

5

Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

GPT-4o performed comparably to humans on the Strange Stories paradigm even in challenging conditions, while smaller models were vulnerable to distracting information [5].

2026 · Anna Babarczy, Andras Lukacs, Péter Vedres, Zeteny Bujka · arXiv (Cornell University)

6

Artificial Intelligence and the Illusion of Understanding: A Systematic Review of Theory of Mind and Large Language Models

LLMs produce an 'illusion of understanding' because they lack developmental, embodied, and multimodal mechanisms essential for genuine theory of mind [6].

2025 · Antonella Marchetti, Federico Manzi, Giuseppe Riva, Andrea Gaggioli, Davide Massaro · Cyberpsychology, Behavior, and Social Networking

7

Testing Theory of Mind in GPT Models and Humans

GPT models showed human-level performance on false beliefs and misdirection but were impaired at faux pas due to hyperconservatism in drawing conclusions [7].

2023 · James Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Alessandro Rufo, Guido Manzi, Michael Graziano, Cristina Becchio
