Can reinforcement learning enable LLMs to use external tools effectively?

How does reinforcement learning actually help LLMs use tools better?

Reinforcement learning (RL) turns tool use into a learnable skill. Instead of just prompting an LLM to 'use a calculator,' RL lets the model try different strategies, get feedback (like whether the answer was correct), and improve over time. This is especially powerful because LLMs often struggle to know when and how to invoke external tools—RL provides a structured way to learn that judgment.
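To make the try, get feedback, improve loop concrete, here is a minimal sketch. It is a toy, not any cited paper's method: a two-action softmax policy trained with plain REINFORCE decides whether to answer directly or call a calculator. The environment, the 30% unaided accuracy, and all names (`simulated_reward`, `train`) are assumptions made up for illustration.

```python
import math
import random

ACTIONS = ["answer_directly", "call_calculator"]


def softmax(logits):
    """Turn raw preferences into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_reward(action, rng):
    """Hypothetical environment: 1.0 if the final answer is correct, else 0.0."""
    if action == "call_calculator":
        return 1.0  # assumption: the tool is always right
    return 1.0 if rng.random() < 0.3 else 0.0  # assumed unaided accuracy


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]   # start with no preference
    baseline = 0.0        # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        r = simulated_reward(ACTIONS[a], rng)
        advantage = r - baseline
        baseline += 0.05 * (r - baseline)
        # REINFORCE: d log pi(a) / d logit_i = 1[i == a] - probs[i]
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


probs = train()
print(dict(zip(ACTIONS, probs)))
```

In this toy setting the policy ends up strongly preferring the calculator, because only that action is rewarded reliably. Real systems replace the two logits with an LLM and the simulated reward with task feedback, but the shape of the loop is the same.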

The AGILE framework [11] is a clear example: it treats the entire LLM agent (with memory, tools, and the ability to consult experts) as a policy in an RL problem and fine-tunes it with the PPO algorithm. On the ProductQA dataset, a 7B-parameter AGILE agent outperformed GPT-4 agents, showing that RL can let a smaller model surpass much larger ones on a benchmark where tool use is part of the training. The ablation study found that removing RL caused a significant drop in performance, which indicates RL was a material contributor to the result rather than a nice add-on.
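For readers who have not met PPO, the standard clipped surrogate objective is shown below in its general textbook form. This answer does not describe AGILE's exact loss, advantage estimator, or hyperparameters, so treat this as background rather than as AGILE's implementation.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here $\pi_\theta$ is the policy (for an LLM agent, the model's distribution over its next action or token), $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is a small clipping range that keeps each update close to the previous policy.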

Similarly, ToolBox-RL [4] uses RL to unify query rewriting, intent understanding, and tool retrieval into one end-to-end optimization. It achieved the best tool call accuracy on both white-box and black-box tools and, crucially, generalized well to out-of-domain datasets, which suggests the RL-trained agent can handle queries and tools outside its training distribution. That points to RL helping LLMs learn general tool-use strategies rather than memorizing specific tool calls.

What's the gap between the best results and what you can typically expect?

The best-case evidence is striking: the Athena framework [1] achieved 83% accuracy on math reasoning and 88% on scientific reasoning, beating GPT-4o by more than 15 percentage points. Reflexion [2] reached 91% pass@1 on HumanEval, 11 points above GPT-4's 80%, using verbal reinforcement learning (feedback expressed in language, with no weight updates). These are large jumps, but they come from carefully designed systems with specific feedback loops and often multiple rounds of refinement.

However, typical results are more modest. The ChatAssert framework [3] improved test oracle generation by 15% in relative terms over the prior state-of-the-art (from 27.5% to about 31.6% Acc@1, roughly 4 percentage points). That's a meaningful gain, but far from the dramatic leaps seen in the best cases. The TCP-TRL robot system [5] achieved an 81.86% success rate on long-horizon tasks, but that only matched, rather than exceeded, a state-of-the-art model trained with human demonstrations. So RL can close the gap with demonstration-trained systems, but doesn't always surpass them.
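The ChatAssert figures are easy to misread next to the Athena numbers, because one is a relative improvement and the other is stated in percentage points. A quick check of the arithmetic, using only the values quoted above (the variable names are mine):

```python
baseline_acc = 27.5    # prior state-of-the-art Acc@1 in percent, as quoted above
relative_gain = 0.15   # the 15% improvement, read as a relative gain

new_acc = baseline_acc * (1 + relative_gain)
absolute_gain = new_acc - baseline_acc

print(f"new Acc@1: {new_acc:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} percentage points")
```

So the 15% figure corresponds to a gain of about 4 percentage points, which is the number to set against the 15-plus-point gap reported for Athena.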

The Apple 'thinking illusion' study [8] adds an important caveat: even with RL-trained reasoning, LLMs' performance collapses to zero on problems beyond a certain complexity threshold. However, when external tools (like a Python interpreter) were added, the collapse was largely overcome. This means RL + tools can push the boundary, but there's still a ceiling—and the ceiling depends on the tool's capabilities, not just the RL training.

What are the catches—when does RL for tool use fall short?

First, RL requires a good reward signal. In the radiotherapy beam angle optimization study [6], an off-the-shelf GPT-4 guided by an RL-inspired iterative refinement strategy (with no domain-specific fine-tuning) outperformed random baselines, but still needed carefully designed reward functions to produce clinically meaningful plans. Without a clear, verifiable reward (like 'is the answer correct?'), RL can reinforce bad habits or hallucinated tool calls.
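As a rough illustration of what a verifiable reward might look like, the sketch below scores an episode by exact-match correctness and subtracts a penalty for each call to a tool that does not exist. Everything here is hypothetical: the `Episode` structure, the `KNOWN_TOOLS` registry, and the penalty value are assumptions for the example, not taken from any of the cited studies.

```python
from dataclasses import dataclass, field

KNOWN_TOOLS = {"calculator", "search"}  # hypothetical tool registry


@dataclass
class Episode:
    tool_calls: list = field(default_factory=list)  # names of tools the agent invoked
    final_answer: str = ""


def reward(episode: Episode, reference: str, bad_call_penalty: float = 0.5) -> float:
    """Verifiable reward: exact-match correctness minus a penalty per unknown tool."""
    correct = 1.0 if episode.final_answer.strip() == reference.strip() else 0.0
    hallucinated = sum(1 for name in episode.tool_calls if name not in KNOWN_TOOLS)
    return correct - bad_call_penalty * hallucinated


good = Episode(tool_calls=["calculator"], final_answer="42")
bad = Episode(tool_calls=["quantum_solver"], final_answer="41")
lucky = Episode(tool_calls=["quantum_solver"], final_answer="42")

print(reward(good, "42"), reward(bad, "42"), reward(lucky, "42"))
```

The `lucky` case shows why correctness alone can be a poor signal: without the penalty term, an episode that reaches the right answer through a hallucinated tool call would earn full reward and that behavior would be reinforced.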

Second, scale matters. The DeepSeek-R1 paper [10] showed that pure RL can incentivize reasoning in LLMs without human demonstrations, but this emerged only at large scale (hundreds of billions of parameters). Smaller models may not develop the same self-reflection and verification behaviors. The TCM prescription study [7] used a 7B model and got only a 2.01% improvement from RL-based preference optimization, a small gain next to the jumps of more than 15 percentage points reported for the best-case systems, although the tasks and metrics differ.

Third, tool internalization is tricky. The TInR framework [9] found that internalizing tool knowledge into the LLM (rather than relying on external documentation) improved efficiency but required a three-phase training pipeline including RL with specialized rewards. It outperformed baselines both in-domain and out-of-domain, though the out-of-domain gains were less dramatic. So RL doesn't automatically make tool use robust; it needs to be paired with the right training data and reward structure.

Finally, safety is an open issue. The OpenAI GPT-5 safety post [8] highlights that tool-augmented LLMs can be misused (e.g., generating detailed instructions for harmful tasks). RL can help align tool use with safety boundaries, but it's not a silver bullet—the reward function must encode safety, which is hard to define precisely.

Sources used in this answer

[1] Integrating External Tools with Large Language Models (LLMs) to Improve Accuracy

The Athena framework, which integrates external tools via APIs, achieved 83% accuracy on math reasoning and 88% on scientific reasoning, outperforming GPT-4o by over 15 percentage points.

2025 · Nripesh Niketan, Arunima Santhoshkumar, Hadj Batatia · Lecture notes in networks and systems

[2] Reflexion: language agents with verbal reinforcement learning

Reflexion uses verbal reinforcement learning (no weight updates) to achieve 91% pass@1 on the HumanEval coding benchmark, surpassing GPT-4's 80%.

2023 · Noah Shinn, Federico Cassano, A. Gopinath, Karthik Narasimhan, Shunyu Yao · NeurIPS

[3] ChatAssert: LLM-Based Test Oracle Generation With External Tools Assistance

ChatAssert improved test oracle generation accuracy (Acc@1) by 15% over the prior state-of-the-art, TECO, using dynamic and static information to refine LLM prompts.

2024 · Ishrak Hayet, Adam Scott, Marcelo d'Amorim · IEEE Trans. Software Eng.

[4] ToolBox-RL: Learning to Generalize Tool Use Across Massive Repositories

ToolBox-RL uses reinforcement learning to unify query rewriting and tool retrieval, achieving best tool call accuracy on both white-box and black-box tools with strong out-of-domain generalization.

2026 · Xinyan Shi, Renzhi Wang, Haodong Liu, Piji Li · WWW

[5] Bimanual Long-Horizon Lifecare Robotics with Temporal Context LLM Planner and Transformer Reinforcement Learning

TCP-TRL combines an LLM planner with transformer reinforcement learning to achieve 81.86% success rate on bimanual lifecare tasks, matching performance of models trained with human demonstrations.

2025 · Ji-Heon Oh, Ismael Espinoza, Danbi Jung, Yong-Hyeok Choi, Ki Joo Pahk, Won Hee Lee, Wonha Kim, Tae-Seong Kim · Annual International Conference of the IEEE Engineering in Medicine and Biology Society

[6] Beam angle optimization for radiotherapy using LLMs via reinforcement-learning inspired iterative refinement

An off-the-shelf GPT-4 model, guided by an RL-inspired iterative strategy, outperformed random baselines in radiotherapy beam angle optimization without any domain-specific fine-tuning.

2026 · Sara Cammarota, Matteo Ferrante, Alessandra Carosi, Rolando Maria D'Angelillo, Nicola Toschi · Medical physics

[7] Reinforcement learning for LLM-based explainable TCM prescription recommendation with implicit preferences from small language models

A two-stage framework using knowledge distillation and RL-based preference optimization achieved P@30 of 35.62% and F1@30 of 37.36% for TCM prescription recommendations, with RL adding only 2.01% improvement.

2025 · Xinyu Wang, Xiaohe Sun, Lei Yang, Yitong Zhang, Tao Yang, Jiadong Xie, Kongfa Hu · Chinese medicine

[8] Highlights of the Issue - Large Language Models III

The Apple 'thinking illusion' study found that LLM reasoning performance collapses to zero beyond a certain complexity threshold, but tool augmentation (Python interpreter, scratchpad) largely overcame this limitation.

2025 · Kris Carlson · SuperIntelligence - Robotics - Safety & Alignment

[9] TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR-U, a tool-internalized reasoning framework trained with RL, achieved superior performance on in-domain and out-of-domain settings, but required a three-phase pipeline with specialized rewards.

2026 · Qiancheng Xu, Yongqing Li, Fangcheng Liu, Hongru Wang, Min Yang, Wenjie Li

[10] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

DeepSeek-R1 showed that pure reinforcement learning can incentivize reasoning in LLMs without human demonstrations, leading to superior performance on math, coding, and STEM tasks at large scale.

2025 · Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Jiashi Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shunfeng Zhou, Shaoqing Wu, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xiaodong Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Xingchao Liu, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Y. K. Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuyang Zhou, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang · Nature

[11] AGILE: A Novel Reinforcement Learning Framework of LLM Agents

The AGILE framework fine-tuned a 7B LLM with PPO to create an agent using memory, tools, and expert consultation, outperforming GPT-4 agents on the ProductQA dataset.

2024 · Peiyuan Feng, Yichen He, Guanhua Huang, Hang Li, Yuan Lin, Hanchong Zhang, Yuchen Zhang · Advances in Neural Information Processing Systems 37
