Do multi-agent systems actually perform better?
Yes, but the size of the advantage depends on how complex the task is. The strongest evidence comes from a 2025 study that tested GPT-4o on a 1,062-question Spanish medical licensing exam (EUNACOM) across 21 specialties [1]. The best multi-agent strategy, MDAGENTS, scored 89.97% accuracy, while the best single-agent method (Chain-of-Thought with Few-Shot) scored 87.67%. That gap of 2.3 percentage points was statistically significant, meaning it is unlikely to be the result of chance alone. However, the same study found that many exam questions were answered correctly by simple single-agent strategies, suggesting that multi-agent collaboration mainly helps on the hardest questions, those that require complex reasoning or coordination across domains.
This pattern holds in other domains. In machine learning automation, a multi-agent system built on a mix of free and low-cost models (Gemini plus occasional GPT-4 calls) achieved a 32.95% success rate on the MLAgentBench benchmark, compared to 22.72% for a single GPT-4 agent, a relative improvement of about 45% [2]. The multi-agent system also cut costs by 94%, from $0.93 to $0.05 per run. In legal translation, a pilot study found that a multi-agent system with four specialized agents (translation, adequacy review, fluency review, final editing) produced higher-quality translations than single-agent systems or traditional machine translation, especially for domain-specific and context-heavy texts [4].
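The headline numbers mix two kinds of comparison: an absolute gap in percentage points (the exam result) and relative changes (the benchmark gain and the cost reduction). The short sketch below recomputes them from the figures quoted above. The function names are illustrative and do not come from the cited studies.

```python
def percentage_point_gap(a: float, b: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return a - b


def relative_change(new: float, old: float) -> float:
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old


# Figures as quoted in the text from [1] and [2].
exam_gap = percentage_point_gap(89.97, 87.67)   # MDAGENTS vs best single agent
ml_gain = relative_change(32.95, 22.72)         # MLAgentBench success rate
cost_change = relative_change(0.05, 0.93)       # cost per run in dollars

print(f"Exam gap: {exam_gap:.2f} percentage points")
print(f"MLAgentBench relative gain: {ml_gain:.1%}")
print(f"Cost change per run: {cost_change:.1%}")
```

With the rounded dollar amounts the cost reduction comes out slightly above the quoted 94%. That small difference is presumably a rounding effect in the per-run costs, not a disagreement with the source.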
When is a single agent good enough?
Single-agent systems are adequate, and often preferable, for simpler tasks. In the medical exam study, many questions were answered correctly by basic single-agent methods such as Zero-Shot (just asking the model) or Few-Shot (giving a few examples), without any complex reasoning or collaboration [1]. The researchers noted that only a fraction of standardized exam questions required sophisticated multi-agent interaction. For routine, well-defined tasks, such as answering straightforward factual questions or generating simple code snippets, a single capable LLM is therefore sufficient and faster.
Similarly, in machine translation, single-agent systems are well-suited for simpler translation tasks where domain-specific knowledge and high contextual awareness aren't critical [4]. The key takeaway: multi-agent systems add complexity, cost, and latency. If your task is simple, you're better off with a single agent. The extra overhead only pays off when the task involves multiple subtasks, conflicting requirements, or the need for specialized expertise.
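One way to turn this takeaway into practice is a simple routing rule that defaults to a single agent and escalates only when one of the three signals named above is present. The sketch below is an illustration only: the `Task` fields and the `choose_architecture` function are hypothetical names, and how the signals would be detected for a real task is left open.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    subtasks: int = 1                       # number of distinct steps involved
    conflicting_requirements: bool = False  # e.g. goals that must be traded off
    needs_specialists: bool = False         # e.g. legal or medical expertise


def choose_architecture(task: Task) -> str:
    """Default to a single agent; escalate only when a signal is present."""
    if task.subtasks > 1 or task.conflicting_requirements or task.needs_specialists:
        return "multi-agent"
    return "single-agent"


simple = Task("Answer a straightforward factual question")
complex_task = Task("Translate a legal contract", subtasks=4, needs_specialists=True)

print(choose_architecture(simple))
print(choose_architecture(complex_task))
```

Keeping the single agent as the default reflects the cost and latency argument: the more expensive path is taken only when there is a stated reason for it.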
What are the limitations and open challenges?
Multi-agent systems are not a magic bullet. Current architectures often rely on predefined, static agent designs, which limits their adaptability in dynamic real-world environments [5]. For example, if a task changes mid-execution, a fixed set of agents may not be able to adjust. Researchers are working on solutions such as Dynamic Real-Time Agent Generation (DRTAG), which automatically creates new agents on the fly based on the conversation or task context. This approach has shown improved adaptability and performance compared to static multi-agent systems [5].
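The difference between a static and a dynamic design can be shown with a toy example. The sketch below is not DRTAG's actual method. It only illustrates the general idea of creating an agent for a role the first time that role is needed, where a fixed roster would have no one to handle the request. The names `make_agent` and `DynamicAgentPool` are hypothetical, and the agents are plain functions standing in for LLM calls.

```python
from typing import Callable, Dict, Iterable

Agent = Callable[[str], str]


def make_agent(role: str) -> Agent:
    """Hypothetical factory; a real system would wrap an LLM call with a role prompt."""
    def agent(message: str) -> str:
        return f"[{role}] handled: {message}"
    return agent


class DynamicAgentPool:
    """Starts with a fixed roster but adds agents when a new role is requested."""

    def __init__(self, initial_roles: Iterable[str]) -> None:
        self.agents: Dict[str, Agent] = {role: make_agent(role) for role in initial_roles}

    def dispatch(self, role: str, message: str) -> str:
        # A static design would fail here for an unknown role;
        # the dynamic pool creates the missing agent instead.
        if role not in self.agents:
            self.agents[role] = make_agent(role)
        return self.agents[role](message)


pool = DynamicAgentPool(["planner", "writer"])
print(pool.dispatch("planner", "Draft an itinerary"))
print(pool.dispatch("budget_checker", "Verify the total cost"))  # created on demand
print(sorted(pool.agents))
```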
Another challenge is coordination overhead. In decentralized task allocation (e.g., robots deciding who does what), multi-agent systems can struggle with communication delays and conflicts. A 2022 study on consensus-based algorithms found that such systems can achieve near-optimal task start times, although results vary with the communication network topology [3]. Additionally, multi-agent systems can suffer from error propagation: if one agent makes a mistake, it can cascade through the pipeline. A 2025 framework called TDAG addresses this by dynamically decomposing complex tasks into smaller subtasks and generating specialized subagents for each, which improved adaptability and context awareness on travel planning benchmarks [6]. Still, these are early-stage solutions, and robust, production-ready multi-agent systems remain an active research area.
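A back-of-the-envelope calculation shows why error propagation matters in a pipeline. The sketch below assumes that every stage succeeds independently with the same probability and that no later stage corrects an earlier mistake. Both are simplifying assumptions made here for illustration, and the 95% figure is a made-up example, not a number from the cited studies.

```python
def pipeline_success(stage_accuracy: float, stages: int) -> float:
    """Probability that every stage succeeds, assuming independent stages."""
    return stage_accuracy ** stages


# Hypothetical per-stage accuracy of 95% across pipelines of growing length.
for stages in (1, 2, 4, 8):
    print(f"{stages} stage(s): {pipeline_success(0.95, stages):.1%}")
```

Under these assumptions, end-to-end reliability drops quickly as stages are added. Review or verification agents, like the adequacy and fluency reviewers in the translation pipeline, break the "no correction" assumption and are one way to limit the cascade.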
Sources used in this answer
[1] Performance of single-agent and multi-agent language models in Spanish language medical competency exams.
On a 1,062-question Spanish medical exam, the multi-agent MDAGENTS system achieved 89.97% accuracy, significantly outperforming the best single-agent method (87.67%), though many questions were answerable by simpler single-agent strategies.
[2] BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks.
A cost-efficient multi-agent system using cheap models achieved a 32.95% success rate on ML tasks, beating a single GPT-4 agent (22.72%) while reducing costs by 94% (from $0.93 to $0.05 per run).
[3] Consensus-Based Decentralized Task Allocation for Multi-Agent Systems and Simultaneous Multi-Agent Tasks.
A consensus-based decentralized algorithm (CBTA) for multi-agent task allocation achieved near-optimal start times for single-agent tasks and outperformed existing methods for simultaneous multi-agent tasks across various network topologies.
[4] Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication.
A pilot study in legal translation found that a multi-agent system with four specialized agents produced superior translation quality compared to single-agent or traditional machine translation, especially for domain-specific texts.
[5] Auto-scaling LLM-based multi-agent systems through dynamic integration of agents.
Dynamic Real-Time Agent Generation (DRTAG) significantly improved adaptability and task performance over static multi-agent architectures by automatically creating new agents based on evolving contexts.
[6] TDAG: A multi-agent framework based on dynamic Task Decomposition and Agent Generation.
The TDAG multi-agent framework, which dynamically decomposes tasks into subtasks and generates specialized subagents, significantly outperformed established baselines on complex travel planning benchmarks.
