[arXiv 2025] RETROAGENT: Moving LLM Agents from Static Solving to Continuous Evolution
Abstract

RETROAGENT is an online reinforcement learning (RL) framework designed for LLM-based agents that shifts the focus from static problem-solving to continuous adaptation. It introduces a hindsight self-reflection mechanism producing dual intrinsic feedback, achieving state-of-the-art results on ALFWorld (+18.3%), WebShop (+15.4%), and Sokoban (+27.1%) relative to the GRPO baseline.

TL;DR

RETROAGENT is an online RL framework for LLM agents built around a hindsight self-reflection process. After each episode it generates both numerical rewards for subtask progress and linguistic "lessons" stored in a memory buffer, so agents learn explicitly from their mistakes and successes. It outperforms the GRPO baseline on hard tasks, by 27.1 percentage points on Sokoban (38.3% vs. 11.2% success rate) and by a reported +15.4% on WebShop.

Problem & Motivation: The "Solving" vs. "Evolving" Gap

Most current Reinforcement Learning (RL) approaches for LLM agents suffer from an "exploitation bias": once an agent finds a way to complete a task, it stops looking for better ways. This leads to two major issues:

  1. Suboptimal Convergence: Agents settle for "good enough" strategies because standard rewards are often binary (Success/Failure).
  2. Implicit Knowledge: What the agent learns is buried in the model's weights. There is no explicit "memory" of why a previous attempt failed, which makes performance brittle in new or out-of-distribution (OOD) scenarios.

The authors argue that true intelligence requires Metacognition—the ability to monitor oneself and adapt behavior based on specific past experiences.

Methodology: The Dual Feedback Engine

The core of RETROAGENT is its Hindsight Self-Reflection Mechanism. After an episode, the agent looks back at its own trajectory and produces two specific types of feedback:

1. Intrinsic Numerical Feedback (The "Carrot")

Instead of just a 0 or 1 at the end of a game, the agent calculates a Capability-Evolution Reward. It looks at subtasks (e.g., "Did I at least find the object even if I didn't wash it?") and rewards itself if it performs better than its historical average for that specific subtask. This encourages the agent to explore "promising" paths even if they don't lead to a final win yet.
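The idea of rewarding progress above a historical average can be sketched as follows. This is a minimal illustration and not the paper's exact formula: the class name `SubtaskProgressTracker`, the assumption that subtask scores lie in [0, 1], and the choice to clip the reward at zero are all hypothetical.

```python
from collections import defaultdict


class SubtaskProgressTracker:
    """Keeps a running mean of past scores per subtask (hypothetical helper)."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def historical_mean(self, subtask):
        """Average score seen so far for this subtask, or 0.0 if unseen."""
        n = self._count[subtask]
        return self._sum[subtask] / n if n else 0.0

    def intrinsic_reward(self, subtask, score):
        """Return the positive margin over the historical mean, then record the score.

        Assumption: only improvement is rewarded, so the margin is clipped at zero.
        """
        baseline = self.historical_mean(subtask)
        reward = max(0.0, score - baseline)
        self._sum[subtask] += score
        self._count[subtask] += 1
        return reward


tracker = SubtaskProgressTracker()
first = tracker.intrinsic_reward("find_object", 1.0)   # nothing to compare against yet
second = tracker.intrinsic_reward("find_object", 1.0)  # merely matches the average
```

Under these assumptions, an agent that only matches its past performance on a subtask earns no extra reward, which is what pushes it toward doing better than before rather than repeating a known strategy.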

2. Intrinsic Language Feedback (The "Lessons")

The agent distills its experience into natural language lessons (e.g., "Searching for 'youth' shirts is more effective than generic 'men's shirts' for this query"). These are stored in a memory buffer.
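One way to picture the memory buffer is as a bounded list of lesson records. The sketch below is an assumption about structure only: the field names, the fixed capacity, and the drop-oldest eviction rule are hypothetical and are not described in the summary above.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    task: str                # description of the task the lesson came from
    text: str                # the natural-language lesson itself
    episode_succeeded: bool  # outcome of the episode that produced it


@dataclass
class LessonBuffer:
    capacity: int = 1000                         # assumed upper bound on stored lessons
    lessons: list = field(default_factory=list)

    def add(self, lesson):
        """Append a lesson, evicting the oldest one once capacity is exceeded."""
        self.lessons.append(lesson)
        if len(self.lessons) > self.capacity:
            self.lessons.pop(0)

    def __len__(self):
        return len(self.lessons)


buffer = LessonBuffer(capacity=2)
buffer.add(Lesson("buy a youth shirt", "Search for 'youth' shirts first.", True))
```

Keeping the task description next to each lesson is what makes later similarity-based retrieval possible.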

3. SimUtil-UCB Retrieval

To use these lessons during the next training step, RETROAGENT uses a clever retrieval strategy called SimUtil-UCB. It doesn't just pick the most "similar" lesson; it uses a Multi-Armed Bandit (UCB) approach to balance:

  • Relevance: Is the lesson related to the current task?
  • Utility: Has this lesson actually helped me win in the past?
  • Exploration: Should I try a less-used lesson to see if it holds hidden value?
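The three criteria above can be combined into a single UCB-style score. The sketch below is illustrative and not the paper's definition of SimUtil-UCB: the additive form, the weights `alpha` and `c`, the smoothed exploration bonus, and the token-overlap similarity (a stand-in for embedding similarity) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class LessonStats:
    text: str      # the natural-language lesson
    task: str      # description of the task it came from
    uses: int = 0  # how often the lesson has been retrieved
    wins: int = 0  # how often an episode that used it succeeded


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a placeholder for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def score(lesson, query, total_uses, alpha=1.0, c=1.0):
    """Relevance + utility + exploration bonus (assumed additive combination)."""
    relevance = jaccard(lesson.task, query)
    utility = lesson.wins / lesson.uses if lesson.uses else 0.0
    # Smoothed UCB1-style bonus: shrinks as a lesson is used more often.
    bonus = c * math.sqrt(math.log(total_uses + 1) / (lesson.uses + 1))
    return alpha * relevance + utility + bonus


def retrieve(lessons, query, k=1):
    """Return the k highest-scoring lessons for the query."""
    total_uses = sum(lesson.uses for lesson in lessons)
    ranked = sorted(lessons, key=lambda l: score(l, query, total_uses), reverse=True)
    return ranked[:k]


memory = [
    LessonStats("Search 'youth' sizes first.", "buy youth shirt", uses=10, wins=8),
    LessonStats("Sort by price before filtering.", "buy youth shirt", uses=10, wins=2),
    LessonStats("Never push a box into a corner.", "push box to goal", uses=10, wins=8),
]
best = retrieve(memory, "buy youth shirt blue", k=1)[0]
```

With equal usage counts the exploration bonus is identical for all three lessons, so the relevant lesson with the better track record ranks first. A rarely used lesson gets a larger bonus and can outrank a well-worn one with middling utility.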

[Figure: Framework Overview]

Experiments & Results

The researchers tested RETROAGENT on four punishing environments: ALFWorld (embodied AI), WebShop (web navigation), Sokoban (planning puzzles), and MineSweeper (logic).

Key Breakthroughs:

  • Performance Leap: On the Sokoban puzzle, RETROAGENT achieved a 38.3% success rate, whereas standard GRPO only managed 11.2%.
  • Efficiency: The framework reached the peak performance levels of previous SOTA methods up to 46% faster.
  • Robustness: Even when the tasks were made harder at test time (e.g., more mines in MineSweeper), RETROAGENT maintained significantly higher performance than its peers.

[Figure: Performance Scenarios]

Deep Insight: Reflection Accuracy

One of the most fascinating findings is the comparison between "In-Context" reflection (using an LLM with a fixed prompt) and "RL-Trained" reflection (where the agent is also trained to be a better self-critic). The RL-Trained variant (Blue line in Figure 8b) maintained high reflection accuracy even as its task performance improved, suggesting that self-criticism is a skill that can—and should—be optimized alongside action.

Critical Analysis & Conclusion

While RETROAGENT is a massive step forward, it isn't a free lunch. The total wall-clock training time is higher because of the secondary "reflection" step. However, the trade-off is worth it: the resulting agents are more capable, more diverse in their strategies, and better at adapting to new environments.

The Takeaway: For the next generation of AI agents, we must stop training them to just "solve" and start training them to "reflect." By externalizing lessons as explicit text that bridges parameter-based memory and symbolic logic, we create agents that improve with every failure.
