Strategy-Guided Exploration (SGE) is a reinforcement learning framework designed to improve LLM agents' performance on complex tasks like coding and UI control. It shifts exploration from low-level action sampling to high-level natural language strategies, achieving new state-of-the-art results across AndroidWorld, AppWorld, LangR, and coding benchmarks.
TL;DR
Reinforcement Learning (RL) for LLMs often suffers from a "glass ceiling"—it is great at refining what the model already knows but struggles to learn truly new behaviors. Apple researchers have introduced Strategy-Guided Exploration (SGE), a method that leverages the LLM's own reasoning to explore the environment via high-level language strategies. By decoupling "intent" (strategy) from "execution" (actions), SGE allows agents to solve tasks that were previously impossible for the base model, even with thousands of random attempts.
The Exploration Wall: Why Standard RL Fails Agents
In traditional RL (like Atari games), an agent can eventually "stumble" upon a reward through random movement. However, for an LLM agent trying to solve a complex coding problem or navigate a smartphone UI, the "action space" is nearly infinite.
Prior research (e.g., Yue et al., 2025) suggests that current RL post-training acts more like a "filter" than a "teacher"—it picks the best of what the model can already do. If the base model has a 0% success rate on a hard task, standard RL has no "signal" to learn from. This is the Exploration Challenge: how do we get a model to try something fundamentally different?
Methodology: Thinking Before Acting
SGE's core insight is that language is the best space for exploration. Instead of sampling different code snippets (which might just be syntax variations), SGE samples different plans.
1. The Strategy-Action Loop
At every step, the agent is prompted in two stages (sketched in code after this list):
- Strategy: "First, give a strategy of what to do to make progress..."
- Action: "Then, generate environment actions conditioned on that strategy."
2. Mixed-Temperature Sampling
This is the "secret sauce" of the paper. Typically, high temperature makes a model "creative" but "unreliable." SGE applies:
- High Temperature for Strategies: Forces the model to brainstorm radically different approaches (e.g., "Try to find a dropdown menu" vs. "Try to type the extension manually").
- Low Temperature for Actions: Ensures that once a strategy is chosen, the execution is precise and follows the plan.
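Reusing `sge_step` from the sketch above, mixed-temperature sampling might look like this; the exact temperature values here are assumptions, not the paper's reported settings:

```python
# Illustrative values; the paper's exact temperatures may differ.
STRATEGY_TEMP = 1.2  # high: brainstorm radically different plans
ACTION_TEMP = 0.2    # low: execute the chosen plan precisely

def sample_rollouts(llm, observation, history, k=8):
    """Sample k (strategy, action) pairs for one exploration batch.

    Diversity comes from the high-temperature strategy stage; the
    low-temperature action stage keeps each execution on-plan.
    """
    return [
        sge_step(llm, observation, history,
                 t_strategy=STRATEGY_TEMP, t_action=ACTION_TEMP)
        for _ in range(k)
    ]
```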
3. Strategy Reflection
SGE maintains a buffer of "Failed" and "Successful" strategies. During training, the agent is shown a failed attempt and told: "This strategy failed. Critique it and try something new." This forces the model to move away from known failure modes.
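A rough sketch of what such a buffer could look like; the data structure and reflection prompt below are paraphrased reconstructions, not the paper's code:

```python
class StrategyBuffer:
    """Tracks failed and successful strategies so the agent can pivot
    away from known failure modes during exploration."""

    def __init__(self):
        self.failed, self.succeeded = [], []

    def record(self, strategy, success):
        (self.succeeded if success else self.failed).append(strategy)

    def reflection_prompt(self, observation, history):
        # Show recent failures and ask the model to critique and pivot.
        failures = "\n".join(f"- {s}" for s in self.failed[-5:])
        return (
            f"Observation:\n{observation}\n\nHistory:\n{history}\n\n"
            f"These strategies failed:\n{failures}\n"
            "Critique them and try something new."
        )
```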

Experiments: Surpassing the "Pass@K" Limit
The researchers tested SGE across four distinct domains: AndroidWorld (UI), LangR (Embodied AI), Coding, and AppWorld (Tools).
The most striking result came from the Coding environment. The base model had a pass@2048 of 0.69: even with 2048 sampled attempts per problem, it solved only 69% of the problems at least once. Standard GRPO RL couldn't break this 69% ceiling. SGE reached 73%.
This suggests that SGE isn't just filtering the model's best existing guesses; it is actively exploring and learning genuinely new ways to solve problems.
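For context, pass@k is typically computed with the standard unbiased estimator from Chen et al. (2021); this is the conventional metric, not code from the SGE paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A pass@2048 of 0.69 therefore means 31% of problems were never solved in any of the 2048 attempts, leaving standard RL with no reward signal to amplify on those problems.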

Visual Evidence: UI Interaction
In AndroidWorld, SGE demonstrated superior exploration in visual tasks. While standard models kept clicking the same area with slight variations (the "Action Overlap" problem), SGE's strategy-driven approach prompted the agent to try entirely different UI elements, eventually finding the correct dropdown menu to change a file extension.

Critical Analysis & Future Outlook
Strengths
- Plug-and-Play: SGE works with any RL algorithm (GRPO, PPO, etc.) because it only changes the prompts and sampling distribution used to generate rollouts, not the policy-update rule itself (see the sketch after this list).
- Interpretability: Because strategies are in natural language, we can actually read what the agent is trying to do during its exploration phase.
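To make the plug-and-play claim concrete, here is a loose sketch of how SGE could slot into an existing training loop, reusing `sample_rollouts` and `StrategyBuffer` from above; `env.run`, `env.reset`, and `grpo_update` are placeholders for whatever RL stack is already in place:

```python
def grpo_update(policy, rollouts, rewards):
    """Placeholder for the existing RL update (GRPO, PPO, ...)."""
    ...

def train_step(policy, env, buffer, k=8):
    """One training iteration: SGE changes rollout generation only;
    the downstream RL update is whatever the stack already uses."""
    observation, history = env.reset(), []
    rollouts = sample_rollouts(policy, observation, history, k=k)
    rewards = [env.run(action) for _, action in rollouts]
    for (strategy, _), reward in zip(rollouts, rewards):
        buffer.record(strategy, success=reward > 0)  # feed reflection
    grpo_update(policy, rollouts, rewards)  # the update rule is untouched
```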
Limitations
- The Reasoning Floor: Smaller models (around 600M parameters) don't benefit much because they lack the baseline reasoning needed to "leverage" a strategy. SGE is a "rich get richer" tool for capable models (roughly 4B parameters and up).
- Latency: Generating a strategy before every action increases the number of tokens produced per step, and thus both inference cost and response time.
Conclusion
Strategy-Guided Exploration represents a shift in how we think about LLM "post-training." Instead of just rewarding the right answer, we are teaching the model a structured way to search for the right answer. For the next generation of autonomous agents, the ability to "reflect and pivot" in language may be more important than the ability to predict the next token.
