T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

WisPaper

Scholar Search

Scholar QA

Pricing

TrueCite

Workspace

Home

Blog

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

T2PO: Taming the Chaos of Multi-Turn Agentic RL via Uncertainty Control

Summary

Problem

Method

Results

Takeaways

Abstract

The paper introduces Token- and Turn-level Policy Optimization (T2PO), a framework for stabilizing multi-turn reinforcement learning in LLM agents. By explicitly controlling exploration through a self-calibrated uncertainty signal, T2PO achieves state-of-the-art results on WebShop, ALFWorld, and Search QA benchmarks.

TL;DR

Training LLM agents to solve multi-step tasks (like online shopping or embodied navigation) via Reinforcement Learning (RL) is notoriously unstable. This paper identifies "Hesitation"—the generation of redundant tokens and repetitive, unproductive turns—as the primary driver of training collapse. The authors propose T2PO, which uses intrinsic uncertainty signals to adaptively truncate over-thinking at the token level and resample redundant moves at the turn level, leading to unprecedented stability and efficiency.

The "Hesitation" Problem: Why Agents Fail

In the world of multi-turn RL, traditional stabilization focuses on better reward functions or trajectory filtering. However, T2PO’s authors argue the problem is more fundamental: Inefficient Exploration.

They find that LLMs often enter two types of "Hesitation" loops:

Token-level over-thinking: The model generates an "aha moment" and makes a decision, but then continues to ramble, adding noise to the credit assignment.
Turn-level repetition: The agent gets stuck in a loop (e.g., searching for the same item twice without finding it) because its internal state hasn't meaningfully shifted.

Training Instability Figure 1: SOTA baselines frequently suffer from gradient explosions and performance collapse due to these inefficiencies.

Methodology: Hierarchical Exploration Control

1. The Self-Calibrated Uncertainty Signal

T2PO doesn't just use Entropy (too smooth) or Confidence (too localized). It fuses them into a combined signal $M_{t}$ .

Entropy ( $H_{t}$ ) captures the overall distribution spread.
Confidence ( $C_{t}$ ) captures the peak probability. The resulting $M_{t}$ creates a non-degenerate geometry that distinguishes between meaningful reasoning and "rambling."

2. Token-Level Thinking Intervention (TTI)

Instead of a hard budget, TTI monitors the velocity of uncertainty. When the marginal change in uncertainty falls below a threshold ( $ε$ ), the system triggers an intervention. It forcibly injects a </think> tag and resumes with the structured <action> phase.

3. Turn-Level Dynamical Sampling (TDS)

If the average uncertainty of a turn is too similar to the previous turn, the agent is likely "stuck." T2PO identifies these low-information turns and forces a dynamic resample (regeneration), ensuring that every rollout sent to the optimizer is informative.

T2PO Architecture Figure 2: Overview of the hierarchical control at both token and turn levels.

Experiments: SOTA Achievement

The authors tested T2PO on WebShop, ALFWorld, and Search QA. The results were striking:

Performance: On ALFWorld, T2PO scored 92.4% success, roughly 10 points higher than the previous SOTA (GiGPO).
Efficiency: It completed tasks with 25% fewer turns and significantly fewer tokens.
Stability: Unlike baselines that collapsed after a few hundred steps, T2PO showed monotonic improvement across multiple random seeds.

Results Comparison Table 1: Quantifying the massive lead in success rates over PPO, GRPO, and GiGPO.

Critical Insight: Efficiency is Effectiveness

The most profound takeaway from T2PO is that training effectiveness and efficiency are not at odds. By pruning the noise (the "hesitation"), the learning signal becomes cleaner. This allows the agent to learn the "causal" relationship between its reasoning and the outcome much faster.

Limitations & Future Work

While T2PO excels in structured environments like WebShop, its reliance on specific tags (<think>, <action>) assumes a "Reasoning LLM" backbone (like Kimi or DeepSeek-style models). Future work could explore how to adapt this to models that do not use explicit CoT tags.

Conclusion

T2PO represents a shift from implicit control (reward shaping) to explicit control (uncertainty-guided intervention). For developers building agents, the message is clear: if you want a stable RL agent, don't just wait for it to fail; monitor its uncertainty and stop it once it stops learning.

Find Similar Papers

Try Our Examples

Search for recent studies that mitigate training collapse in multi-turn reinforcement learning for large language models beyond credit assignment techniques.
Examine the origins of combining entropy and confidence for uncertainty estimation in LLMs and how this paper's self-calibration method improves upon those foundations.
Explore how uncertainty-guided reasoning termination (like TTI) could be applied to multi-modal reasoning agents or autonomous robotic planning tasks.

Contents

T2PO: Taming the Chaos of Multi-Turn Agentic RL via Uncertainty Control

1. TL;DR

2. The "Hesitation" Problem: Why Agents Fail

3. Methodology: Hierarchical Exploration Control

3.1. 1. The Self-Calibrated Uncertainty Signal

3.2. 2. Token-Level Thinking Intervention (TTI)

3.3. 3. Turn-Level Dynamical Sampling (TDS)

4. Experiments: SOTA Achievement

5. Critical Insight: Efficiency is Effectiveness

5.1. Limitations & Future Work

6. Conclusion