Do existing LLM benchmark scores reflect real-world performance?

Why do benchmark scores overestimate real-world ability?

Standard benchmarks typically test LLMs on static, independent questions, whereas real-world tasks often require iterative learning, tool use, and handling of ambiguous or sensitive data. A 2025 survey of LLM capabilities noted that benchmark-based assessments often fail to capture real-world performance because the capabilities a task requires differ from those the benchmarks measure [3]. For instance, on clinical decision tasks, agentic AI systems (which can browse the web, run code, and edit files) achieved only 30.3% accuracy on MedAgentsBench, a medical QA dataset, despite access to advanced tools [1]. That is well below the high scores that frontier models commonly report on popular benchmarks such as MMLU.

Another study introduced a framework called LLM-Evolve, which tests models over multiple rounds with feedback, mimicking real-world learning. The authors found that models improved by up to 17% by drawing on past interactions, a dynamic capability that standard i.i.d. (independent and identically distributed) benchmarks do not measure [4]. As a result, a model's static benchmark score can be misleadingly high or low relative to how it performs when it must adapt over time.
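The difference between a single i.i.d. pass and a multi-round evaluation with feedback can be sketched in a few lines of Python. This is only an illustration of the general idea, not the LLM-Evolve protocol from [4]: the function names, the memory format, and the toy model (which simply avoids answers it was previously told were wrong) are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, reference answer)
Memory = List[Tuple[str, str, bool]]  # (question, prediction, was_correct)
Model = Callable[[str, Memory], str]


def evaluate_iid(model: Model, examples: List[Example]) -> float:
    """Single pass: every question is answered with an empty memory."""
    correct = sum(model(q, []) == a for q, a in examples)
    return correct / len(examples)


def evaluate_with_feedback(model: Model, examples: List[Example],
                           rounds: int = 3) -> List[float]:
    """Repeated passes: after each answer, correctness feedback is stored
    in a memory that the model can consult in later rounds."""
    memory: Memory = []
    accuracies = []
    for _ in range(rounds):
        correct = 0
        for q, a in examples:
            pred = model(q, memory)
            ok = pred == a
            correct += ok
            memory.append((q, pred, ok))  # feedback signal
        accuracies.append(correct / len(examples))
    return accuracies


def toy_model(question: str, memory: Memory) -> str:
    """Hypothetical stand-in for an LLM: reuses an answer that was
    confirmed correct, otherwise picks the first option not yet marked wrong."""
    options = ["A", "B", "C"]
    wrong = set()
    for q, pred, ok in memory:
        if q != question:
            continue
        if ok:
            return pred
        wrong.add(pred)
    for option in options:
        if option not in wrong:
            return option
    return options[0]


EXAMPLES = [("q1", "A"), ("q2", "B"), ("q3", "C")]

if __name__ == "__main__":
    print("i.i.d. accuracy:", evaluate_iid(toy_model, EXAMPLES))
    print("per-round accuracy:", evaluate_with_feedback(toy_model, EXAMPLES))
```

In this toy setup the single-pass score stays fixed, while the per-round accuracy rises as feedback accumulates, which is the kind of change a static benchmark score cannot show.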

How do LLMs perform in high-stakes fields like medicine and finance?

In medicine, the gap between benchmark scores and real-world reliability is stark. A randomized, single-blind evaluation by 27 experienced clinicians (averaging 25.9 years of practice) tested seven top LLMs on 685 real and simulated clinical cases. The top models scored only around 6.0 out of 10 on medical strength (equivalent to a physician with 6 years of experience), and 40 instances of hallucination were documented, including fabricated conditions and medications [2]. Another study, on precision oncology, found that medium-scale LLMs frequently gave outdated or incorrect information, and that expert evaluators often disagreed about whether an answer was correct [6]. Its authors concluded that there is a clear gap between benchmark performance and real-world performance.

In finance, a 2025 review highlighted that while LLMs like GPT-4 and Claude can extract structured knowledge from earnings calls and reports, they still suffer from hallucination, bias, and difficulty explaining their reasoning, problems that standard benchmarks do not adequately capture [5]. The review emphasized that human oversight remains essential for high-stakes financial decisions, which runs counter to the impression of competence that high benchmark scores might give.

What are the hidden costs of using LLMs in practice?

Even when LLMs show modest real-world gains, those gains come with steep computational and safety costs. In the clinical agent study, the best agent systems used more than 10 times as many tokens and had more than twice the latency of baseline LLMs, yet achieved only modest accuracy improvements (for example, 60.3% on AgentClinic MedQA, up from lower baseline scores) [1]. Built-in safeguards filtered 89.9% of hallucinations, but the remaining roughly 10% still posed risks. Deploying LLMs in real-world settings is therefore not just a question of accuracy: it also involves significant resource demands and persistent safety concerns that benchmarks rarely reflect.
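The trade-off above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 10x token multiplier is the lower bound reported in [1], and the 50% baseline accuracy is an assumed placeholder, because the source says only that baseline scores were lower than 60.3%.

```python
def cost_ratio_per_correct(token_multiplier: float,
                           baseline_accuracy: float,
                           agent_accuracy: float) -> float:
    """Tokens spent per correct answer by the agent, relative to the baseline.

    Tokens per correct answer = tokens per query / accuracy, so the ratio is
    token_multiplier * baseline_accuracy / agent_accuracy.
    """
    if not (0 < baseline_accuracy <= 1 and 0 < agent_accuracy <= 1):
        raise ValueError("accuracies must be in (0, 1]")
    return token_multiplier * baseline_accuracy / agent_accuracy


def residual_hallucinations(total: int, filter_rate: float) -> float:
    """Expected number of hallucinations that pass a filter with the given rate."""
    return total * (1 - filter_rate)


if __name__ == "__main__":
    # 10x tokens and 60.3% agent accuracy come from [1];
    # the 0.50 baseline accuracy is an ASSUMED value for illustration.
    ratio = cost_ratio_per_correct(10, 0.50, 0.603)
    print(f"tokens per correct answer vs. baseline: {ratio:.1f}x")

    # 89.9% filter rate from [1]; 1000 hallucinations is a hypothetical volume.
    leaked = residual_hallucinations(1000, 0.899)
    print(f"hallucinations passing the filter: {leaked:.0f} of 1000")
```

Under these assumptions, a tenfold increase in tokens pays for itself per correct answer only if accuracy also rises about tenfold, and a filter that catches 89.9% of hallucinations still lets about one in ten through.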

Sources used in this answer

[1] Benchmarking large language model-based agent systems for clinical decision tasks

Agentic AI systems for clinical tasks achieved only 30.3% accuracy on MedAgentsBench and 8.6% on Humanity's Last Exam, despite using advanced tools, and consumed >10× tokens and >2× latency compared to baseline LLMs.

2026 · Yunsong Liu, Zunamys I. Carrero, Xiaofeng Jiang, Dyke Ferber, Georg Wölflein, Li Zhang, Sanddhya Jayabalan, Tim Lenz, Zhouguang Hui, J. Kather · npj Digital Medicine

[2] An interdisciplinary, randomized, single-blind evaluation of state-of-the-art large language models for their implications and risks in medical diagnosis and management

In a single-blind evaluation by 27 expert clinicians, top LLMs scored ~6.0/10 on medical strength (equivalent to a physician with 6 years of experience) and produced 40 instances of hallucinations including fabricated conditions and medications.

2025 · Peikai Chen, Jifu Cai, Jiaying Zhou, Shaoxi Chen, Chenguang Xu, Lihua Yuan, Xiaoying Dai, Xiaowei Chen, Yanzhe Wei, Xia Li, Shaofeng Gong, Xiaolong Liang, Jiancheng Yang, Jun Jin, Kanglin Dai, Yuzhen Cui, Guan-Ming Kuang, Jianshen Xie, Libing Luo, Haibing Xiao, Shijie Yin, Jun Yang, Yulan Yan, Jianliang Chen, Yihua Chen, Qianshen Zhang, Qingshan Zhou, Lina Zhao, Min Wu, Xin Tang, Lei Rong, Zanxin Wang, Weifu Qiu, Yanli Wang, Liwen Cui, Xiangyang Li, Yong Hu, Huiren Tao, Nan Wu, Pearl Pai, Minxin Wei, Michael Kai-tsun To, Kenneth M.C. Cheung

[3] Fundamental Capabilities and Applications of Large Language Models: A Survey

A survey of LLM capabilities concluded that existing benchmark-based assessments often fail to capture real-world performance because the required capabilities differ from those measured in benchmarks.

2025 · Jiawei Li, Yang Gao, Yizhe Yang, Yu Bai, Xiaofeng Zhou, Yinghao Li, Huashan Sun, Yuhang Liu, Xingpeng Si, Yuhao Ye, Yixiao Wu, Yiguan Lin, Bin Xu, Ren Bowen, Chong Feng, Heyan Huang · ACM Comput. Surv.

[4] LLM-Evolve: Evaluation for LLM's Evolving Capability on Benchmarks

The LLM-Evolve framework showed that models can improve up to 17% from past interactions with feedback, a dynamic capability that standard i.i.d. benchmarks completely overlook.

2024 · Jiaxuan You, Mingjie Liu, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro · Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

[5] Large Language Models for Financial Knowledge Extraction Analytical Insights and Corporate Planning Support

A review of financial LLM applications identified persistent challenges including hallucination, bias, and difficulty explaining reasoning, which standard benchmarks do not adequately capture.

2025 · Xuguang Zhang, Mengdie Wang · Mathematical Modeling and Algorithm Application

[6] Evaluating Medium Scale, Open-Source Large Language Models: Towards Decision Support in a Precision Oncology Care Delivery Context

In precision oncology, medium-scale LLMs frequently gave outdated or incorrect information, with high disagreement among expert evaluators, indicating a clear gap between benchmark and real-world performance.

2025 · Kevin Kaufmes, Georg Mathes, Dilyana Vladimirova, Stephanie Berger, Christian Fegeler, Stefan Sigle · Studies in health technology and informatics