The Hidden Wall: Why Horizon Length—Not Reasoning—Is Killing Your LLM Agent
Abstract

This empirical study identifies "horizon length" as a fundamental bottleneck in training LLM agents for long-horizon tasks, independent of reasoning complexity. The authors present a systematic analysis using controlled environments like Sudoku and Rush Hour, demonstrating that reducing the effective horizon through macro actions or subgoal decomposition significantly stabilizes Reinforcement Learning (RL) and enables "horizon generalization" to unseen task lengths.

TL;DR

Recent work has often blamed LLM agent failures on "reasoning gaps," but a new empirical study points to a more systemic culprit: horizon length. As the number of interaction steps grows, Reinforcement Learning (RL) training becomes unstable, because the chance of sampling a successful trajectory shrinks exponentially and credit assignment becomes noisy. By applying Horizon Reduction (macro actions and subgoal decomposition), developers can stabilize training and unlock "Horizon Generalization," where models solve longer tasks than any they were trained on.

The Problem: The "Catastrophic Collapse" of Long Interactions

When we train an LLM to solve a complex puzzle or browse the web, we expect it to get better with more RL steps. However, as the task horizon (the number of steps to reach a goal) increases, the training dynamics shift from steady improvement to sudden, catastrophic collapse.

The authors identify two fatal flaws in standard RL for long-horizon tasks:

  1. Exponential Exploration Difficulty: The probability of sampling a successful sequence of independent atomic actions decays exponentially as the chain grows.
  2. Noisy Credit Assignment: When an agent fails at step 50, the "negative reward" signal is propagated back to all 50 steps. This diffuses the gradient, inadvertently penalizing perfectly "correct" intermediate actions and injecting massive noise into the model's vocabulary distribution.
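Both failure modes can be illustrated with a few lines of arithmetic. The sketch below is not from the paper: it assumes every atomic step succeeds independently with the same probability, and it models outcome-only credit as copying the final reward onto every step. All function names are hypothetical.

```python
def success_probability(step_accuracy: float, horizon: int) -> float:
    """Chance that all `horizon` steps succeed, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    return step_accuracy ** horizon


def trajectory_level_credit(step_correct: list[bool], final_reward: float) -> list[float]:
    """Outcome-only credit (assumed model): every step receives the final reward."""
    return [final_reward] * len(step_correct)


def mispenalized_fraction(step_correct: list[bool], final_reward: float) -> float:
    """Fraction of steps that were correct but still received a negative credit."""
    credits = trajectory_level_credit(step_correct, final_reward)
    wrongly = sum(1 for ok, credit in zip(step_correct, credits) if ok and credit < 0)
    return wrongly / len(step_correct)


if __name__ == "__main__":
    # Even a 95%-accurate step policy rarely finishes a 50-step task.
    print(success_probability(0.95, 50))
    # 49 correct steps followed by one failure: almost all penalized steps were fine.
    print(mispenalized_fraction([True] * 49 + [False], final_reward=-1.0))
```

Under these assumptions, a single late mistake causes nearly every correct step in the trajectory to receive a negative signal, which is the "noise" the authors describe.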

Figure 1. Training instability vs. horizon length: training on longer horizons (L3-L4) leads to performance collapse compared to shorter tasks (L1-L2).


Methodology: Horizon Reduction as a First Principle

Instead of trying to fix the RL optimizer, the authors argue for changing the structure of the task. The goal is to reduce the Effective Horizon through two key strategies:

1. Macro Actions (Action Abstraction)

By allowing the agent to generate multiple actions in a single turn (e.g., "Fill these three cells" instead of "Fill one cell"), the decision count is slashed. This reduces the opportunities for error accumulation.

  • Insight: Flexible macro actions (where the model decides how many steps to take) outperform fixed-length ones, offering robustness against "overshooting."
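A flexible macro action can be sketched as an environment method that accepts a variable-length list of atomic actions in one agent turn. This is an illustrative toy, not the authors' implementation: `ToyGridEnv`, its methods, and the rule of stopping at the first wrong action are all assumptions made for the example.

```python
from dataclasses import dataclass, field

Action = tuple[int, int, int]  # (row, col, value), an assumed encoding


@dataclass
class ToyGridEnv:
    """Hypothetical fill-the-cells environment with a known solution."""
    solution: dict[tuple[int, int], int]
    filled: dict[tuple[int, int], int] = field(default_factory=dict)
    turns: int = 0  # number of agent decisions, i.e. the effective horizon so far

    def step_atomic(self, action: Action) -> bool:
        """Apply one cell fill; reject it if it contradicts the solution."""
        row, col, value = action
        if self.solution.get((row, col)) != value:
            return False
        self.filled[(row, col)] = value
        return True

    def step_macro(self, actions: list[Action]) -> int:
        """One agent turn carrying any number of atomic actions.

        Stops at the first wrong action (an assumption) and returns
        how many actions were applied.
        """
        self.turns += 1
        applied = 0
        for action in actions:
            if not self.step_atomic(action):
                break
            applied += 1
        return applied

    def solved(self) -> bool:
        return self.filled == self.solution


if __name__ == "__main__":
    env = ToyGridEnv(solution={(0, 0): 1, (0, 1): 2, (0, 2): 3})
    env.step_macro([(0, 0, 1), (0, 1, 2), (0, 2, 3)])
    print(env.turns, env.solved())  # three fills, one decision
```

Because the list length is chosen by the caller, a policy that is unsure can submit a shorter macro action instead of committing to a fixed number of steps.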

2. Subgoal Decomposition

Breaking a global goal into verifiable segments (e.g., completing one 3x3 subgrid in Sudoku) allows for dense, localized rewards. This prevents a single failure at the end of a long chain from ruining the learning signal for the successful beginning of the chain.
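As a rough illustration of a verifiable subgoal, the sketch below pays a bonus whenever a turn newly completes a 3x3 Sudoku subgrid. It assumes a reference solution is available for checking and that empty cells are encoded as 0; the function names and the bonus value are hypothetical, and the paper's actual reward design may differ.

```python
Grid = list[list[int]]  # 9x9 grid, 0 marks an empty cell (assumed encoding)


def box_cells(box_index: int) -> list[tuple[int, int]]:
    """Coordinates of the nine cells in subgrid `box_index` (0..8, row-major)."""
    top, left = 3 * (box_index // 3), 3 * (box_index % 3)
    return [(top + dr, left + dc) for dr in range(3) for dc in range(3)]


def completed_boxes(grid: Grid, solution: Grid) -> set[int]:
    """Indices of subgrids whose cells all match the reference solution."""
    return {
        box
        for box in range(9)
        if all(grid[r][c] == solution[r][c] for r, c in box_cells(box))
    }


def subgoal_reward(before: Grid, after: Grid, solution: Grid, bonus: float = 1.0) -> float:
    """Dense reward: `bonus` for each subgrid completed during this turn."""
    newly_done = completed_boxes(after, solution) - completed_boxes(before, solution)
    return bonus * len(newly_done)
```

With a reward like this, a correctly finished subgrid early in the episode is credited immediately, even if the episode later fails.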

Figure 2. Macro actions vs. atomic actions: horizon reduction (blue) consistently outperforms atomic execution (orange) and prevents collapse in long-horizon regimes.


Key Discovery: Horizon Generalization

Perhaps the most striking finding is Horizon Generalization. Models trained on short-to-medium horizons using reduction techniques actually learn the "logic" of the task so well that they can generalize to much longer, unseen horizons during inference.

The paper shows that a model trained on tasks with 20–30 empty cells (level L3-L4) could still solve tasks with 45 empty cells (L7), despite never seeing such long sequences during training. This suggests that step accuracy is the driver: by stabilizing training on shorter horizons, we produce a model with higher per-step precision that can survive the "compounding error" problem of longer distances.
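The step-accuracy argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes independent steps with a constant per-step accuracy, which is a simplification of ours rather than the paper's model; the accuracies and horizons in the demo loop are illustrative values, not reported results.

```python
def expected_success(step_accuracy: float, horizon: int) -> float:
    """Success rate over `horizon` steps if each step succeeds independently."""
    return step_accuracy ** horizon


def required_step_accuracy(target_success: float, horizon: int) -> float:
    """Per-step accuracy needed to reach `target_success` over `horizon` steps."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return target_success ** (1.0 / horizon)


if __name__ == "__main__":
    # Compare two hypothetical policies at a training-like and a longer horizon.
    for accuracy in (0.95, 0.99):
        for horizon in (25, 45):
            rate = expected_success(accuracy, horizon)
            print(f"accuracy={accuracy} horizon={horizon} success={rate:.3f}")
    # How precise must each step be to solve half of all 45-step tasks?
    print(f"{required_step_accuracy(0.5, 45):.4f}")
```

Under this simplification, a small gain in per-step accuracy translates into a large gain in success rate at longer horizons, which is consistent with the generalization the authors report.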

Figure 3. Horizon generalization results: models trained on limited horizons (RL-short/RL-long) maintain high success rates even as the goal distance increases significantly beyond their training data.


Critical Analysis & Professional Insight

This work marks a shift from algorithmic obsession to system-centric design. While the industry often chases more complex RL optimizers (such as PPO variants or GRPO), this study provides evidence that, at least in the environments tested, Action Space Design and Reward Topology can matter more.

Limitations to Consider:

  • Technique vs. Horizon: The study admits that while agents generalize across lengths, they struggle to generalize across reasoning techniques. If a Sudoku puzzle requires a new logic (e.g., "X-Wing") not seen in training, horizon reduction won't help.
  • Environment Specifics: The findings are strongest in deterministic, verifiable games (Sudoku, Rush Hour). In stochastic real-world environments (like live web browsing), the "noise" from the environment may overlap with the "noise" from the horizon, requiring even more aggressive subgoal verification.

Conclusion: Lessons for AI Engineers

If you are building an agentic system (e.g., a coding assistant or a web navigator) and seeing performance plateaus:

  1. Don't just add more data. Check your average interaction horizon.
  2. Abstract your API. Move from low-level "clicks" to high-level "intent actions" (Macro Actions).
  3. Validate intermediate steps. Use a Process Reward Model (PRM) or verifiable subgoals to keep the credit assignment signal "clean."
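For the first check in the list above, one simple diagnostic is to bucket logged episodes by their number of interaction steps and compare success rates across buckets. This is a minimal sketch; the `(steps, succeeded)` log format and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_horizon(
    episodes: list[tuple[int, bool]], bucket_size: int = 10
) -> dict[int, float]:
    """Map each horizon bucket (keyed by its lower bound) to its success rate.

    `episodes` holds (number_of_steps, succeeded) pairs, an assumed log format.
    """
    totals: dict[int, int] = defaultdict(int)
    wins: dict[int, int] = defaultdict(int)
    for steps, succeeded in episodes:
        bucket = (steps // bucket_size) * bucket_size
        totals[bucket] += 1
        wins[bucket] += int(succeeded)
    return {bucket: wins[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    logged = [(5, True), (8, True), (12, False), (15, True), (31, False)]
    print(success_rate_by_horizon(logged))
```

A success rate that drops sharply in the higher buckets is a hint that horizon length, rather than data volume, is the limiting factor.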

As the authors aptly quote: "The best way to escape from a problem is to solve it." In the world of LLM agents, that means making the problem shorter.
