[NVIDIA 2026] PivotRL: High-Accuracy Agentic Post-Training at Low Compute Cost
Abstract

PivotRL is a novel post-training framework for long-horizon agentic tasks that combines supervised fine-tuning (SFT) efficiency with reinforcement learning (RL) generalization. It utilizes local on-policy rollouts at "pivots"—critical intermediate states with high outcome variance—and employs functional-equivalent verifiers instead of strict string matching, achieving SOTA agentic performance with 4x fewer rollout turns than end-to-end RL.

TL;DR

NVIDIA's PivotRL solves the fundamental tension in AI agent training: how to get the generalization of Reinforcement Learning (RL) at the cost-efficiency of Supervised Fine-Tuning (SFT). By focusing training on "pivots"—highly informative intermediate steps with high outcome variance—and using functional verifiers, PivotRL outperforms SFT in-domain (+4.17%), saves 4x compute compared to E2E RL, and crucially eliminates the catastrophic forgetting typical of SFT.

Problem & Motivation: The "Agentic Tension"

Building long-horizon agents (coding, browsing, tool use) usually forces a choice between two suboptimal paths:

  1. SFT (Supervised Fine-Tuning): Fast and data-efficient, but brittle. It memorizes sequences but fails if the environment shifts slightly. Even worse, it often causes the model to "forget" how to do math or general chat (OOD regression).
  2. E2E RL (End-to-End RL): Highly robust and generalizes well, but incredibly slow. You have to run full trajectories for hours just to get a single reward signal.

The Insight: Not all turns in a trajectory are equal. Many turns are "easy" (always succeed) or "impossible" (always fail), yielding a zero gradient in group-normalized RL. PivotRL targets only the "pivots"—the crucial moments where the model's choices actually change the outcome.
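
To see why such turns are wasted compute, consider the standard group-normalized (GRPO-style) advantage; this is a generic sketch of the mechanism the paper builds on, not its exact objective:

$$A_i = \frac{r_i - \bar{r}}{\sigma_r}, \qquad \nabla_\theta J \approx \frac{1}{G} \sum_{i=1}^{G} A_i \, \nabla_\theta \log \pi_\theta(\tau_i \mid s)$$

If all $G$ rollouts from a state earn the same reward, then $r_i = \bar{r}$ for every $i$, every advantage $A_i$ is zero, and the update at that state vanishes; only states with dispersed outcomes ($\sigma_r > 0$) contribute gradient signal.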

Methodology: The Core of PivotRL

PivotRL optimizes for efficiency through two primary innovations:

1. Pivot Filtering (Informativeness)

Instead of running full rollouts, PivotRL extracts intermediate turns from SFT data and profiles them, training only on states where rollout outcomes actually vary (reward variance $\sigma^2 > 0$); see the sketch after the bullet below.

  • The Physics: The paper proves that the natural gradient norm is proportional to the reward standard deviation. If all rollouts in a group yield the same reward, the model learns nothing. By targeting "pivots," it maximizes the signal-to-noise ratio.
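
A minimal profiling-and-filtering sketch in Python follows; `rollout_from_state` and `reward` are hypothetical stand-ins for a local on-policy rollout and its outcome verifier, and the probe count and variance threshold are illustrative rather than the paper's settings.

```python
import statistics
from typing import Callable, List, Sequence

def find_pivots(states: Sequence[object],
                rollout_from_state: Callable[[object], object],  # short on-policy rollout from a state
                reward: Callable[[object], float],               # outcome score for the finished rollout
                num_probes: int = 8,
                min_variance: float = 1e-6) -> List[object]:
    """Keep only 'pivot' states: those whose sampled outcomes disagree.

    States where every probe succeeds or every probe fails have zero reward
    variance, contribute no group-normalized gradient, and are skipped.
    """
    pivots = []
    for state in states:
        rewards = [reward(rollout_from_state(state)) for _ in range(num_probes)]
        if statistics.pvariance(rewards) > min_variance:
            pivots.append(state)
    return pivots
```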

2. Functional Verifiers (Flexibility)

Standard SFT-to-RL bridges often fail because they require a "strict match" with expert data. If the expert used `ls` and the model used `ls -a`, a strict matcher gives zero reward even though both accomplish the task. PivotRL instead uses domain-specific verifiers to reward "functionally equivalent" actions.
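
As an illustration of the idea (a hypothetical toy verifier, not the paper's implementation), a terminal-domain check can score the model's command by whether its effect satisfies the task goal rather than by comparing it to the expert's string:

```python
import subprocess
from typing import Callable

def functional_verifier(model_cmd: str,
                        goal_satisfied: Callable[[str], bool],
                        cwd: str,
                        timeout: int = 30) -> float:
    """Score an agent action by its observable effect, not by string equality.

    goal_satisfied is a task-specific predicate over the command's output
    (e.g. "the target file appears in the listing"), so `ls` and `ls -a`
    both earn reward as long as they surface the required information,
    while a strict string match against the expert's `ls` would not.
    """
    proc = subprocess.run(model_cmd, shell=True, cwd=cwd,
                          capture_output=True, text=True, timeout=timeout)
    return 1.0 if proc.returncode == 0 and goal_satisfied(proc.stdout) else 0.0

# Example: both `ls` and `ls -a` pass if the goal only requires seeing config.yaml.
# reward = functional_verifier("ls -a", lambda out: "config.yaml" in out, cwd=".")
```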

Figure 1: Comparison between SFT, E2E RL, and the PivotRL framework.

Experiments & SOTA Results

PivotRL was benchmarked across coding (SWE-Bench), terminal control, browsing, and tool use.

Performance vs. Efficiency

On SWE-Bench, PivotRL matched the accuracy of full E2E RL while using 4x fewer rollout turns and completing training 5.5x faster in wall-clock time.

Figure 2: PivotRL reaches SOTA territory significantly faster than E2E RL in both turn count and wall-clock time.

Mitigating Catastrophic Forgetting

One of the most striking results is OOD (Out-of-Domain) retention. In traditional SFT, training a model to use a terminal often "breaks" its ability to do math (e.g., AIME25 scores dropping by -64%). PivotRL maintains a near-zero change (+0.21%) across general benchmarks while still improving agentic skills.

| Domain | Base Accuracy | SFT Δ | PivotRL Δ |
| :--- | :--- | :--- | :--- |
| Agentic Average | - | +9.94% | +14.11% |
| OOD Average | 66.62 | -9.83% | +0.21% |

Critical Analysis & Conclusion

PivotRL represents a shift from "brute-force" RL to "surgical" RL. Its success in NVIDIA’s Nemotron-3-Super-120B proves its production-scale viability.

  • Takeaway: The "Superficial Alignment Hypothesis" suggests SFT only teaches style. PivotRL shows that on-policy RL at critical decision points is what actually teaches reasoning and robustness.
  • Limitations: Currently, it relies on having programmatic verifiers (like code testers). Scaling this to areas without clear "pass/fail" signals (like creative writing) will require LLM-as-a-judge or process reward models.

Future Outlook: We expect "Pivot-based training" to become the standard for training expensive long-context agents, moving away from wasteful full-trajectory rollouts toward highly targeted local updates.
