The paper introduces Hierarchy-of-Groups Policy Optimization (HGPO), a novel group-based reinforcement learning algorithm designed for long-horizon agentic tasks. By implementing context-aware hierarchical grouping and adaptive weighting, HGPO achieves state-of-the-art results on ALFWorld and WebShop benchmarks using Qwen2.5 models.
TL;DR
Reinforcement Learning for LLM agents often fails in long-horizon tasks due to poor credit assignment. Hierarchy-of-Groups Policy Optimization (HGPO) fixes this by recognizing that "same state" doesn't mean "same situation" if the history differs. By creating a hierarchy of groups based on historical depth and adaptively weighting their advantages, HGPO achieves SOTA performance on ALFWorld and WebShop without needing extra models or rollouts.
Context Inconsistency: The Hidden Bias in Agentic RL
In complex environments like WebShop, an agent often revisits the same state (e.g., a search results page). Previous methods like GRPO or GiGPO group these steps together to calculate "relative advantages."
However, the authors identify a critical flaw: Historical Context Inconsistency. If Agent A reached the search page by clicking a specific filter and Agent B reached it through a direct search, their "prompts" are actually different. Treating them as a single group introduces massive bias in advantage estimation, as shown in the authors' empirical study.
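To make the bias concrete, here is a toy numeric illustration (the rewards are hypothetical, not from the paper): grouping by state alone pools returns from two different contexts, distorting the baseline every step is compared against.

```python
# Hypothetical returns for steps that all land on the same search-results page.
# Agent A arrived via a filter click; Agent B arrived via a direct search.
returns_context_a = [0.9, 0.8]  # returns observed under history A
returns_context_b = [0.2, 0.1]  # returns observed under history B

# Flat, state-only grouping (GRPO/GiGPO-style): one shared baseline.
flat_group = returns_context_a + returns_context_b
flat_baseline = sum(flat_group) / len(flat_group)         # 0.5
flat_adv = [r - flat_baseline for r in flat_group]        # A looks great, B terrible

# Context-aware grouping: a separate baseline per history.
baseline_a = sum(returns_context_a) / len(returns_context_a)  # 0.85
baseline_b = sum(returns_context_b) / len(returns_context_b)  # 0.15
adv_a = [r - baseline_a for r in returns_context_a]  # small, context-honest signals
adv_b = [r - baseline_b for r in returns_context_b]
```

Under the flat grouping, the advantage mostly reflects which history the agent happened to have, not how good its action was.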

Methodology: The Hierarchy of Groups
The core innovation of HGPO is its context-aware hierarchical grouping. Instead of a flat group based solely on the current state, HGPO defines an $h$-step context operator that captures the history leading into that state.
1. Hierarchical Grouping
For any given step, HGPO identifies multiple groups:
- Level 0: steps sharing only the same current state (largest groups: low variance, but high bias from inconsistent histories).
- Level $h$: steps sharing the same current state AND the last $h$ steps of history.
- Level $H$: steps sharing the exact same history (lowest bias, but very small group sizes, hence high variance; see the sketch below).
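A minimal sketch of how such groups can be built with hashmap lookups (the function name, data layout, and the use of hashable (state, action) pairs as keys are our assumptions, not the paper's exact implementation):

```python
from collections import defaultdict

def build_hierarchical_groups(trajectories, max_depth):
    """Group step indices by (current state, last-h history) for h = 0..max_depth.

    trajectories: list of episodes, each a list of hashable (state, action) pairs.
    Returns groups, where groups[h] maps a context key to the steps sharing it.
    """
    groups = [defaultdict(list) for _ in range(max_depth + 1)]
    for ep_id, episode in enumerate(trajectories):
        for t, (state, _action) in enumerate(episode):
            for h in range(max_depth + 1):
                # Key = current state plus the last h (state, action) pairs;
                # h = 0 reduces to flat, state-only grouping.
                history = tuple(episode[max(0, t - h):t])
                groups[h][(state, history)].append((ep_id, t))
    return groups
```

Each lookup is a constant-time hash operation, which is why the grouping overhead reported in the experiments below is negligible.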
2. Adaptive Weighting
Instead of picking a single level, HGPO aggregates the advantages from all of them. Using a weighting coefficient, it puts more weight on higher-level groups (those with more consistent history), where the bias is lower, while still drawing on lower-level groups to keep the variance under control.
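Continuing the sketch above, the per-level advantages can be blended as follows; the geometric weighting with a coefficient `alpha` is our illustrative way of putting "more weight on higher levels", not necessarily the paper's exact scheme:

```python
def aggregate_advantages(level_advantages, alpha=0.7):
    """Blend per-level advantage estimates for a single step.

    level_advantages: [A_0, A_1, ..., A_H], where A_h is the group-relative
    advantage computed within that step's level-h group. Deeper levels have
    more consistent history (lower bias), so they receive larger weights.
    """
    depth = len(level_advantages) - 1
    weights = [alpha ** (depth - h) for h in range(depth + 1)]  # grows with h
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, level_advantages)) / total
```

With `alpha = 1` all levels count equally; as `alpha` shrinks, the estimate leans increasingly on the most context-consistent groups.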

Experimental Performance
The researchers tested HGPO against strong baselines such as PPO, RLOO, GRPO, and GiGPO using Qwen2.5 (1.5B and 7B) models.
- Generalization: HGPO showed significantly less performance degradation on Out-of-Distribution (OOD) tasks in ALFWorld compared to GiGPO.
- Efficiency: Despite the added hierarchical machinery, the grouping is computed entirely after rollouts via hashmap lookups, requiring no extra models or rollouts and adding less than 0.001% to total execution time.
- Model Size Scaling: Interestingly, HGPO provides larger gains for smaller models (1.5B). Smaller models tend to produce more redundant/lengthy trajectories, making biased advantage estimation a more severe bottleneck that HGPO effectively resolves.

Critical Insight: The Bias-Variance Trade-off
The mathematical elegance of HGPO lies in its Proposition 4.1. It proves that the HGPO estimator interpolates between the high-variance/low-bias Oracle estimator (full-history groups) and the low-variance/high-bias step-level estimator (state-only groups). By tuning the hierarchy depth and the weighting coefficient, researchers can find the "sweet spot" for policy stability.
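In symbols, the interpolation reads roughly as follows (a schematic rendering using the notation from the sketches above; the paper's own symbols may differ):

```latex
% Schematic form of the blended HGPO estimator (notation ours, not the paper's).
% \hat{A}_t^{(h)} denotes the group-relative advantage within the level-h group.
\[
  \hat{A}_t^{\mathrm{HGPO}}
    = \sum_{h=0}^{H} w_h \, \hat{A}_t^{(h)},
  \qquad
  w_h = \frac{\alpha^{H-h}}{\sum_{j=0}^{H} \alpha^{H-j}},
  \quad \alpha \in (0, 1].
\]
% As \alpha shrinks, weight concentrates on the full-history (Oracle) level;
% with H = 0 the estimator reduces to the state-only, step-level baseline.
```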
Conclusion & Future Work
HGPO demonstrates that for agentic tasks, History Matters. Simply treating an LLM agent as a Markovian policy ignores the fact that the LLM's "state" is its entire context window.
Future directions suggested by the authors include applying this hierarchical logic to agents with "summarized memory" (where contexts aren't easily divisible) and exploring more advanced uncertainty-based weighting schemes.
Author Affiliations: Nanyang Technological University, Southeast University. Code: GitHub Repository
