Can LLMs be used as reliable evaluators of other AI systems?

When do LLMs actually work well as evaluators?

LLMs shine as evaluators in structured, domain-specific tasks where the criteria are clear and the context is narrow. In a 2025 study of medical history-taking, an LLM-based system achieved over 97.9% dialog accuracy across simple, moderate, and complex cases, and its automated assessments showed over 95% item-level consistency with human experts [1]. For scoring a trainee's medical interview skills, then, the LLM's assessments came close to those of a human supervisor.
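To make "item-level consistency" concrete, here is a minimal sketch of one way to compute it: plain percent agreement between an LLM's checklist marks and a human expert's. The study's exact metric is not described here, so the function name, the binary checklist, and the sample marks are all hypothetical.

```python
def item_level_agreement(llm_items, human_items):
    """Fraction of checklist items on which both raters gave the same mark."""
    if len(llm_items) != len(human_items):
        raise ValueError("both raters must score the same items")
    if not llm_items:
        raise ValueError("need at least one item")
    matches = sum(a == b for a, b in zip(llm_items, human_items))
    return matches / len(llm_items)


# Hypothetical checklist: 1 = trainee covered the item, 0 = did not.
llm_marks = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
human_marks = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]

agreement = item_level_agreement(llm_marks, human_marks)
print(f"Item-level agreement: {agreement:.0%}")
```

Percent agreement is easy to read but does not correct for agreement that would occur by chance, so a chance-corrected statistic may be preferable when most items receive the same mark.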

Similarly, in emergency triage, ChatGPT and Copilot matched the accuracy of trained triage nurses (around 65% overall) and outperformed nurses in identifying high-acuity patients: 87.8% for ChatGPT versus 32.7% for nurses [2]. The LLMs were more consistent across patient age and gender, while nurses were more likely to mistriage younger patients. So on specific parts of a high-stakes, rule-based screening task, such as flagging high-acuity patients, LLMs can outperform humans even when overall accuracy is similar.

Where do LLMs fall short as evaluators?

LLMs struggle badly in tasks that require nuanced judgment or detection of subtle patterns. A 2025 study tested GPT-4o and Llama 3.2 Vision on detecting publication bias from funnel plots, a standard meta-analysis task. Neither model consistently detected bias, and even when given both visual and numerical data, performance did not improve [3]. In plain terms, these models could not be relied on to spot a well-known statistical distortion.

In strategic business evaluation, a 2024 study found that single evaluations from LLMs were 'inconsistent and biased': if you asked the same model to rank 60 business models twice, you'd get different rankings [4]. However, when the researchers averaged many evaluations across different models, prompts, or roles, the aggregated rankings began to resemble those of human experts. The takeaway: one LLM judgment is unreliable, but a crowd of LLM judgments can be useful.
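The "crowd of judgments" idea can be sketched in a few lines. The study's actual aggregation procedure is not described here, so this example assumes a simple mean-rank scheme and uses invented rankings of four hypothetical items; `aggregate_rankings` and `spearman` are illustrative names, not functions from the paper.

```python
from statistics import mean


def aggregate_rankings(rankings):
    """Average each item's position across runs; return items best-first."""
    items = rankings[0]
    for run in rankings:
        if sorted(run) != sorted(items):
            raise ValueError("every run must rank the same items")
    mean_rank = {item: mean(run.index(item) for run in rankings) for item in items}
    # Ties in mean rank are broken alphabetically to keep the output stable.
    return sorted(items, key=lambda item: (mean_rank[item], item))


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Hypothetical data: a human expert ranking and three noisy LLM runs.
human = ["A", "B", "C", "D"]
runs = [
    ["B", "A", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "B", "D", "C"],
]

for i, run in enumerate(runs, start=1):
    print(f"run {i} vs human: {spearman(run, human):.2f}")

combined = aggregate_rankings(runs)
print(f"aggregate vs human: {spearman(combined, human):.2f}")
```

In this toy data each run contains one swap relative to the human ranking, and the swaps cancel out once positions are averaged. Real evaluations will not cancel this neatly, particularly when every run shares the same bias.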

What is the reliability gap, and how can you bridge it?

The core problem is that LLMs are not inherently stable evaluators. A 2024 oncology study tested five LLMs on over 2,000 questions and found significant performance differences between models—GPT-4 scored above the 50th percentile of human doctors, but all models had 'clinically significant error rates' and examples of overconfidence [5]. Even the best model got things wrong consistently in some areas, like female-predominant cancers.

A 2024 survey on 'LLM-as-a-judge' systems concluded that ensuring reliability requires deliberate strategies: improving consistency, mitigating biases, and adapting to specific scenarios [6]. Practical steps include repeating prompts multiple times (a strategy the oncology study used to identify subgroups of questions on which the models reached 81% accuracy [5]), using multiple LLMs, and combining AI evaluations with human oversight. For text summarization, a 2024 field guide warns that LLM-based evaluation is 'powerful but lacking in reliability' and recommends using multiple methods together [7].
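As a rough illustration of repeating prompts and keeping a human in the loop, the sketch below asks a judge the same question several times, takes the majority verdict, and flags the item for human review when agreement is low. The `judge` callable, the repeat count, and the 0.8 threshold are all assumptions for illustration; none of them come from the cited studies, and the judge here is a canned stand-in rather than a real model call.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return (most common verdict, share of runs that agreed with it)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    verdict, votes = Counter(verdicts).most_common(1)[0]
    return verdict, votes / len(verdicts)


def judge_with_repeats(judge, item, repeats=5, min_agreement=0.8):
    """Query the judge `repeats` times and flag low-agreement items."""
    verdicts = [judge(item) for _ in range(repeats)]
    verdict, agreement = majority_verdict(verdicts)
    needs_human = agreement < min_agreement
    return verdict, agreement, needs_human


# Stand-in for a real LLM call: replays canned verdicts in order.
canned = iter(["pass", "pass", "fail", "pass", "pass"])


def fake_judge(item):
    return next(canned)


result = judge_with_repeats(fake_judge, "summary #1")
print(result)
```

The same wrapper extends to several models by passing a `judge` that rotates through them, which combines the repeat-prompt and multi-model strategies in one agreement score.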

The bottom line: LLMs can be reliable evaluators, but only when you design the evaluation carefully: use clear criteria, aggregate multiple judgments, and never trust a single output. In narrow, structured tasks they can match or beat humans; in open-ended or subtle tasks, their individual judgments are currently unreliable.

Sources used in this answer

[1] Development and Validation of a Large Language Model–Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency

An LLM-based medical history-taking system achieved over 97.9% dialog accuracy and over 95% item-level consistency with human experts across simple, moderate, and complex cases.

2025 · Yang Liu, Chujun Shi, Liping Wu, Xiule Lin, Xiaoqin Chen, Yiying Zhu, Haizhu Tan, Weishan Zhang · JMIR Medical Education

[2] Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses

ChatGPT and Copilot matched nurse triage accuracy (around 65%) but outperformed nurses in identifying high-acuity patients (87.8% vs 32.7%).

2025 · B Arslan, C Nuhoglu, M O Satici, E Altinbilek · The American Journal of Emergency Medicine

[3] Leveraging AI for Meta-Analysis: Evaluating LLMs in Detecting Publication Bias for Next-Generation Evidence Synthesis

GPT-4o and Llama 3.2 Vision failed to consistently detect publication bias from funnel plots, even with additional quantitative data.

2025 · Xing Xing, Lifeng Lin, Mohammad Hassan Murad, Jiayi Tong · Cochrane Evidence Synthesis and Methods

[4] Generative artificial intelligence and evaluating strategic decisions

Single LLM evaluations of business models were inconsistent and biased, but aggregated rankings across models and prompts resembled human expert rankings.

2024 · Anil R. Doshi, J. Jason Bell, Emil Mirzayev, Bart S. Vanneste · Strategic Management Journal

[5] Comparative Evaluation of LLMs in Clinical Oncology

GPT-4 outperformed other LLMs on 2,044 oncology questions but all models had clinically significant error rates and overconfidence issues.

2024 · Nicholas R Rydzewski, Deepak Dinakaran, Shuang G Zhao, Eytan Ruppin, Baris Turkbey, Deborah E Citrin, Krishnan R Patel · NEJM AI

[6] A survey on LLM-as-a-Judge

A comprehensive survey concluded that building reliable LLM-as-a-judge systems requires deliberate strategies to improve consistency, mitigate biases, and adapt to scenarios.

2026 · Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Zhouchi Lin, Bowen Zhang, Lionel Ni, Wen Gao, Yuanzhuo Wang, Jian Guo · The Innovation

[7] A Field Guide to Automatic Evaluation of LLM-Generated Summaries

A field guide on evaluating LLM-generated summaries warns that LLM-based evaluation methods are powerful but lack reliability, recommending multiple methods together.

2024 · Tempest A. van Schaik, Brittany Pugh · SIGIR