The paper introduces Hierarchy-of-Groups Policy Optimization (HGPO), a novel group-based reinforcement learning algorithm designed for long-horizon agentic tasks. By implementing context-aware hierarchical grouping and adaptive weighting, HGPO achieves state-of-the-art results on ALFWorld and WebShop benchmarks using Qwen2.5 models.
TL;DR
Reinforcement Learning for LLM agents often fails in long-horizon tasks due to poor credit assignment. Hierarchy-of-Groups Policy Optimization (HGPO) fixes this by recognizing that "same state" doesn't mean "same situation" if the history differs. By creating a hierarchy of groups based on historical depth and adaptively weighting their advantages, HGPO achieves SOTA performance on ALFWorld and WebShop without needing extra models or rollouts.
Context Inconsistency: The Hidden Bias in Agentic RL
In complex environments like WebShop, an agent often revisits the same state (e.g., a search results page). Previous methods like GRPO or GiGPO group these steps together to calculate "relative advantages."
However, the authors identify a critical flaw: Historical Context Inconsistency. If Agent A reached the search page by clicking a specific filter and Agent B reached it through a direct search, their "prompts" are actually different. Treating them as a single group introduces massive bias in advantage estimation, as shown in the authors' empirical study.
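To make the bias concrete, here is a toy numeric illustration (the rewards are hypothetical, not from the paper): grouping by state alone pools returns from two different contexts, distorting the baseline every step is compared against.

```python
# Hypothetical returns for steps that all land on the same search-results page.
# Agent A arrived via a filter click; Agent B arrived via a direct search.
returns_context_a = [0.9, 0.8]  # returns observed under history A
returns_context_b = [0.2, 0.1]  # returns observed under history B

# Flat, state-only grouping (GRPO/GiGPO-style): one shared baseline.
flat_group = returns_context_a + returns_context_b
flat_baseline = sum(flat_group) / len(flat_group)         # 0.5
flat_adv = [r - flat_baseline for r in flat_group]        # A looks great, B terrible

# Context-aware grouping: a separate baseline per history.
baseline_a = sum(returns_context_a) / len(returns_context_a)  # 0.85
baseline_b = sum(returns_context_b) / len(returns_context_b)  # 0.15
adv_a = [r - baseline_a for r in returns_context_a]  # small, context-honest signals
adv_b = [r - baseline_b for r in returns_context_b]
```

Under the flat grouping, the advantage mostly reflects which history the agent happened to have, not how good its action was.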

Methodology: The Hierarchy of Groups
The core innovation of HGPO is its context-aware hierarchical grouping. Instead of a flat group based solely on the current state, HGPO defines an $h$-step context operator that captures the history leading into that state.
1. Hierarchical Grouping
For any given step, HGPO identifies multiple groups:
- Level 0: steps sharing only the same current state (largest groups: low variance, but high bias from inconsistent histories).
- Level $h$: steps sharing the same current state AND the last $h$ steps of history.
- Level $H$: steps sharing the exact same history (lowest bias, but very small group sizes, hence high variance; see the sketch below).
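A minimal sketch of how such groups can be built with hashmap lookups (the function name, data layout, and the use of hashable (state, action) pairs as keys are our assumptions, not the paper's exact implementation):

```python
from collections import defaultdict

def build_hierarchical_groups(trajectories, max_depth):
    """Group step indices by (current state, last-h history) for h = 0..max_depth.

    trajectories: list of episodes, each a list of hashable (state, action) pairs.
    Returns groups, where groups[h] maps a context key to the steps sharing it.
    """
    groups = [defaultdict(list) for _ in range(max_depth + 1)]
    for ep_id, episode in enumerate(trajectories):
        for t, (state, _action) in enumerate(episode):
            for h in range(max_depth + 1):
                # Key = current state plus the last h (state, action) pairs;
                # h = 0 reduces to flat, state-only grouping.
                history = tuple(episode[max(0, t - h):t])
                groups[h][(state, history)].append((ep_id, t))
    return groups
```

Each lookup is a constant-time hash operation, which is why the grouping overhead reported in the experiments below is negligible.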
2. Adaptive Weighting
Instead of picking a single level, HGPO aggregates the advantages from all of them. Using a weighting coefficient, it puts more weight on higher-level groups (those with more consistent history), where the bias is lower, while still drawing on lower-level groups to keep the variance under control.
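Continuing the sketch above, the per-level advantages can be blended as follows; the geometric weighting with a coefficient `alpha` is our illustrative way of putting "more weight on higher levels", not necessarily the paper's exact scheme:

```python
def aggregate_advantages(level_advantages, alpha=0.7):
    """Blend per-level advantage estimates for a single step.

    level_advantages: [A_0, A_1, ..., A_H], where A_h is the group-relative
    advantage computed within that step's level-h group. Deeper levels have
    more consistent history (lower bias), so they receive larger weights.
    """
    depth = len(level_advantages) - 1
    weights = [alpha ** (depth - h) for h in range(depth + 1)]  # grows with h
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, level_advantages)) / total
```

With `alpha = 1` all levels count equally; as `alpha` shrinks, the estimate leans increasingly on the most context-consistent groups.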

Experimental Performance
The researchers tested HGPO against strong baselines such as PPO, RLOO, GRPO, and GiGPO using Qwen2.5 (1.5B and 7B) models.
- Generalization: HGPO showed significantly less performance degradation on Out-of-Distribution (OOD) tasks in ALFWorld compared to GiGPO.
- Efficiency: Despite the added hierarchical machinery, the grouping is computed entirely after rollouts via hashmap lookups, requiring no extra models or rollouts and adding less than 0.001% to total execution time.
- Model Size Scaling: Interestingly, HGPO provides larger gains for smaller models (1.5B). Smaller models tend to produce more redundant/lengthy trajectories, making biased advantage estimation a more severe bottleneck that HGPO effectively resolves.

Critical Insight: The Bias-Variance Trade-off
The mathematical elegance of HGPO lies in its Proposition 4.1. It proves that the HGPO estimator interpolates between the high-variance/low-bias Oracle estimator (full-history groups) and the low-variance/high-bias step-level estimator (state-only groups). By tuning the hierarchy depth and the weighting coefficient, researchers can find the "sweet spot" for policy stability.
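In symbols, the interpolation reads roughly as follows (a schematic rendering using the notation from the sketches above; the paper's own symbols may differ):

```latex
% Schematic form of the blended HGPO estimator (notation ours, not the paper's).
% \hat{A}_t^{(h)} denotes the group-relative advantage within the level-h group.
\[
  \hat{A}_t^{\mathrm{HGPO}}
    = \sum_{h=0}^{H} w_h \, \hat{A}_t^{(h)},
  \qquad
  w_h = \frac{\alpha^{H-h}}{\sum_{j=0}^{H} \alpha^{H-j}},
  \quad \alpha \in (0, 1].
\]
% As \alpha shrinks, weight concentrates on the full-history (Oracle) level;
% with H = 0 the estimator reduces to the state-only, step-level baseline.
```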
Conclusion & Future Work
HGPO demonstrates that for agentic tasks, History Matters. Simply treating an LLM agent as a Markovian policy ignores the fact that the LLM's "state" is its entire context window.
Future directions suggested by the authors include applying this hierarchical logic to agents with "summarized memory" (where contexts aren't easily divisible) and exploring more advanced uncertainty-based weighting schemes.
Author Affiliations: Nanyang Technological University, Southeast University. Code: GitHub Repository
