Can AI agents autonomously complete complex real-world tasks?

Where do AI agents reliably complete complex tasks autonomously?

AI agents excel at tasks that involve analyzing large, structured datasets or following well-defined procedures in controlled settings. For example, the ClockBase Agent autonomously reanalyzed 43,602 intervention-control comparisons from millions of molecular profiles and identified over 500 interventions that significantly reduce biological age, findings that the original human researchers had missed [1]. One of these, ouabain, was experimentally validated in aged mice, showing reduced frailty and improved heart function [1]. This suggests that when the task is data-intensive and the goal is clear (e.g., find age-slowing compounds), AI can outperform humans in scale and thoroughness.

Similarly, in IT operations, agentic AI systems can autonomously predict and resolve system issues, reducing downtime and human intervention [2]. In scientific research, AI agents have autonomously performed peer review, generated hypotheses, and conducted systematic reviews with higher accuracy and efficiency than humans [8]. These successes share a common thread: the task involves processing large amounts of structured information within a well-defined framework.

Where do AI agents fail at complex real-world tasks?

AI agents struggle significantly in unpredictable physical environments where they must adapt to real-world noise, human behavior, or unexpected events. A striking example comes from autonomous surface grading: heuristic bulldozer controllers that performed perfectly in clean simulation failed catastrophically when tested on a real-world scaled prototype with sand piles, because the simulation did not capture real-world dynamics such as sensor noise and irregular terrain [9]. The authors note that the heuristics that worked in simulation were useless in reality, whereas a learning agent trained in simulation generalized better to the prototype [9].

Autonomous vehicles also reveal clear limits. In a study that reconstructed 453 real-world crashes, an autonomous driving system (Baidu Apollo) avoided the crash for only about 61% of the vehicles involved (363 of 596) [4]. The unavoidable crashes shared a pattern: the AV was not at fault, and the crash was caused by unpredictable human driver behavior that the AI could not anticipate or react to in time [4]. This highlights a fundamental limitation: AI agents cannot yet handle the full complexity of human-driven environments.
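The 61% figure is a share of vehicles, not of crashes, which is easy to misread. A minimal Python check of the arithmetic, using only the two counts quoted above (the variable names are illustrative, not taken from the study):

```python
# Counts quoted from the crash-reconstruction study [4].
vehicles_total = 596    # vehicles involved in the 453 reconstructed crashes
vehicles_avoided = 363  # vehicles for which the driving system avoided the crash

avoided_share = vehicles_avoided / vehicles_total
vehicles_not_avoided = vehicles_total - vehicles_avoided

print(f"avoided: {avoided_share:.1%}")            # avoided: 60.9%
print(f"not avoided: {vehicles_not_avoided}")     # not avoided: 233
```

In other words, roughly two in five of the vehicles still ended up in a crash.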

Even in digital tasks, current AI agents are limited. A 2023 evaluation of language model agents on 12 tasks related to autonomous replication and adaptation found that they could complete only the easiest tasks, though the authors warn that these evaluations do not rule out near-future agents being capable of more [3]. The key takeaway is that AI agents are brittle: they work well in controlled conditions but fail when faced with the unpredictability of the real world.

Is full autonomy possible, or do humans still need to be in the loop?

Full autonomy is possible in some narrow domains, but most complex tasks still benefit from human oversight. The LLMAgentNet framework explicitly supports three modes: Human-in-the-Loop, Human-on-the-Loop, and Human-out-of-the-Loop, allowing for both supervised and fully autonomous operation [6]. In practice, even advanced systems like DeviceAgent, which autonomously designs bioelectronic devices and generates fabrication protocols, maintain human oversight at critical decision points [7]. This suggests that while AI can handle many steps autonomously, humans are still needed for high-stakes judgments.
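The difference between the three oversight modes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LLMAgentNet's actual API: the names `OversightMode` and `may_execute` are hypothetical, and the semantics follow the common usage of the terms (in-the-loop means an action needs explicit approval, on-the-loop means an action proceeds unless a human vetoes it), which the framework may define differently.

```python
from enum import Enum
from typing import Optional


class OversightMode(Enum):
    """Hypothetical labels for the three modes named in the text."""
    HUMAN_IN_THE_LOOP = "in"
    HUMAN_ON_THE_LOOP = "on"
    HUMAN_OUT_OF_THE_LOOP = "out"


def may_execute(mode: OversightMode, human_response: Optional[bool]) -> bool:
    """Decide whether an agent action may run.

    human_response: True = approved, False = vetoed, None = no response.
    The mapping below is an assumption based on common usage of the terms.
    """
    if mode is OversightMode.HUMAN_IN_THE_LOOP:
        # Explicit approval is required; silence blocks the action.
        return human_response is True
    if mode is OversightMode.HUMAN_ON_THE_LOOP:
        # The action proceeds by default; only an explicit veto blocks it.
        return human_response is not False
    # Out of the loop: the human is never consulted.
    return True


if __name__ == "__main__":
    for mode in OversightMode:
        print(mode.name, may_execute(mode, None))
```

With no human response, only the in-the-loop mode blocks the action, which is why that mode suits the high-stakes decision points described above.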

The need for human oversight is also driven by safety and ethics. CertAI, a certification framework developed specifically to evaluate autonomous AI agents across dimensions such as security, privacy, ethics, robustness, transparency, and fairness, found that fairness and transparency remain the weakest dimensions even in larger models [5]. This implies that until these issues are resolved, full autonomy in sensitive domains (e.g., healthcare, finance) would be irresponsible without human checks. As one review notes, agentic AI represents a paradigm shift but requires careful governance frameworks and validation standards [8].

Sources used in this answer

[1] Autonomous AI Agents Discover Aging Interventions from Millions of Molecular Profiles

ClockBase Agent autonomously reanalyzed 43,602 intervention-control comparisons, identifying over 500 interventions that reduce biological age, with one (ouabain) validated in live mice showing reduced frailty and improved heart function.

2025 · Kejun Ying, Alexander Tyshkovskiy, Alibek Moldakozhayev, Hanchen Wang, Cecília G De Magalhães, Sharif Iqbal, Amanda E Garza, Albina Tskhay, Jesse R Poganik, Kexin Huang, Yuanhao Qu, Dmitrii Glubokov, Cheng Jin, Donghyun Lee, Hanna Liu, Carolina Leote, Alexandre Trapp, Lucas Paulo de Lima Camillo, Csaba Kerepesi, Mahdi Moqri, Odin Zhang, Kaiyi Jiang, Fedor Galkin, Alex Zhavoronkov, Jeremy M Van Raamsdonk, Mengdi Wang, Le Cong, Aviv Regev, Jure Leskovec, Tony Wyss-Coray, Vadim N Gladyshev · bioRxiv : the preprint server for biology

[2] Agentic AI in Predictive AIOps: Enhancing IT Autonomy and Performance

Agentic AI in Predictive AIOps enhances IT autonomy by proactively predicting and resolving system issues, reducing downtime and human intervention.

2024 · Shanmugasundaram Sivakumar · International Journal of Scientific Research and Management (IJSRM)

[3] Evaluating Language-Model Agents on Realistic Autonomous Tasks

Language model agents could only complete the easiest of 12 autonomous replication and adaptation tasks, but evaluations don't rule out near-future agents being capable of more.

2023 · Megan Kinniment, L. Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, T. Lin, H. Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, P. Christiano · arXiv.org

[4] How would autonomous vehicles behave in real-world crash scenarios?

An autonomous driving system (Baidu Apollo) avoided the crash for about 61% of the 596 vehicles involved in 453 reconstructed real-world crashes, but failed when crashes were caused by unpredictable human driver behavior.

2024 · Rui Zhou, Guoqing Zhang, Helai Huang, Zhiyuan Wei, Hanchu Zhou, Jieling Jin, Fangrong Chang, Jiguang Chen · Accident; analysis and prevention

[5] CertAI: A Certification Framework for Trustworthy and Secure Autonomous AI Agents

CertAI framework evaluates autonomous AI agents on security, privacy, ethics, robustness, transparency, and fairness, finding fairness and transparency are the weakest dimensions.

2026 · Faisal Anwer, Mohammad Nadeem, Mohammed Abdullah Tahir, Jaafar Gaber, Salman Ali · ICAART (1)

[6] LLMAgentNet: A Collaborative Network of Autonomous AI Agents for Complex Task Execution

LLMAgentNet framework enables collaborative multi-agent systems with three operational modes (Human-in-the-Loop, Human-on-the-Loop, Human-out-of-the-Loop) and demonstrated improvements in efficiency over single-agent approaches.

2025 · А. Р. Бідочко, Я. І. Виклюк · Scientific Bulletin of UNFU

[7] DeviceAgent: An autonomous multimodal AI agent for flexible bioelectronics

DeviceAgent autonomously generates bioelectronic layouts, creates fabrication protocols, identifies microscopic defects, and analyzes cardiac signals, but maintains human oversight at critical decision points.

2025 · Jaeyong Lee, Zuwan Lin, Wenbo Wang, Jongmin Baek, Ariel J Lee, Almir Aljović, Arnau Marin-Llobet, Xinhe Zhang, Ren Liu, Na Li, Jia Liu · bioRxiv : the preprint server for biology

[8] Toward autonomous discovery: agentic AI and the future of ophthalmic research

Agentic AI systems can autonomously perform peer review, hypothesis generation, systematic reviews, and experimental design, but require governance frameworks for ethics and accountability.

2025 · Brian T Soetikno, Christopher S Nielsen, Andreas Pollreisz, Daniel S W Ting · Current opinion in ophthalmology

[9] Towards Autonomous Grading In The Real World

Heuristic bulldozer controllers for surface grading succeeded in simulation but failed catastrophically in real-world tests, though simulation-trained learning agents could generalize to a scaled prototype.

2022 · Yakov Miron, Chana Ross, Yuval Goldfracht, Chen Tessler, Dotan Di Castro · IROS
