ReVal is a novel off-policy value-based reinforcement learning framework for LLM post-training that interprets model logits as Q-values to enable efficient experience replay. It combines stepwise internal consistency signals with trajectory-level outcome rewards, achieving state-of-the-art results on mathematical reasoning benchmarks like AIME and GPQA while being significantly more sample-efficient than on-policy methods like GRPO.
TL;DR
Reinforcement Learning (RL) has become the cornerstone of "O1-style" reasoning models. However, mainstream methods like GRPO are fundamentally on-policy, meaning they are "data hungry"—they generate expensive reasoning traces, use them once, and throw them away. ReVal changes this by treating LLM logits as Q-values, allowing the model to "re-read" its past experiences through a replay buffer. This approach achieves comparable or superior performance to SOTA baselines while being up to 4.3x faster in convergence.
Problem & Motivation: The "Expensive Rollout" Trap
The current paradigm of Reinforcement Learning with Verifiable Rewards (RLVR) faces a massive efficiency wall. In complex reasoning or agentic tasks, generating a single trajectory (a "rollout") is incredibly slow and computationally expensive.
Standard Actor-Only methods like GRPO or ReMax have lowered memory overhead but remain on-policy. They suffer from two major flaws:
- Low Sample Efficiency: Trajectories are discarded immediately after one gradient step.
- Coupled Generation-Update: Total training time is dominated by autoregressive generation ($T_{generation} \gg T_{update}$).
To solve this, we need Off-Policy RL, which allows the model to learn from historical data stored in a Replay Buffer.
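To make the idea concrete, here is a minimal FIFO replay buffer of the kind this setup relies on. It is a sketch under assumptions: the `Trajectory` fields and uniform sampling are illustrative choices, not the paper's exact data layout.

```python
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One stored rollout: prompt, generated response, and its verifiable reward."""
    prompt_ids: list[int]
    response_ids: list[int]
    reward: float  # rule-based outcome reward, e.g. 1.0 if the final answer verifies

class ReplayBuffer:
    """Fixed-capacity FIFO buffer: the oldest rollouts are evicted as new ones arrive."""
    def __init__(self, capacity: int = 10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self.storage.append(traj)

    def sample(self, batch_size: int) -> list[Trajectory]:
        # Uniform sampling over stored trajectories enables off-policy reuse:
        # each expensive rollout can contribute to many gradient updates.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```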
Methodology: Unifying Policy and Value
The researchers introduce ReVal, a value-based framework built on two clever insights:
1. Logit-as-Q Parameterization
Instead of training a separate, memory-heavy Value Network (Critic), ReVal treats the LLM's own logits as soft Q-values. This means the same weights that predict the next token also estimate the long-term "value" of that token.
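To see how one set of logits can play both roles, here is a minimal sketch of the general soft-Q reading of an LLM head. The scaling convention ($Q = \beta \cdot \text{logits}$, $V = \beta \cdot \text{logsumexp}$) is an assumption for illustration; it is the standard soft-Q identity, not necessarily ReVal's exact parameterization.

```python
import torch

def soft_q_view(logits: torch.Tensor, beta: float = 0.01):
    """Read next-token logits as scaled soft Q-values (illustrative convention).

    With Q(s, a) = beta * logits[a], the standard soft-Q identities give
      V(s)      = beta * logsumexp(logits)   # soft value of the current prefix
      pi(a | s) = softmax(logits)[a]         # the usual LLM sampling policy
    so a single forward pass yields both the policy and a value estimate,
    with no separate critic network.
    """
    q_values = beta * logits
    soft_value = beta * torch.logsumexp(logits, dim=-1)
    policy = torch.softmax(logits, dim=-1)
    return q_values, soft_value, policy

# Example: logits over a toy vocabulary of 8 tokens at one decoding step.
q, v, pi = soft_q_view(torch.randn(8))
```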
2. Calibrated Initialization & Reward Shaping
Previous attempts at this (like TBRM) failed a crucial test: if the reward is zero, the model should stay the same as the starting (reference) model. ReVal fixes this by introducing a specific Bellman residual loss with reward shaping:
$$L_{ReVal}(\theta) = \sum_{\tau \in \mathcal{D}} \left(V_{\theta}(s_1) - V_{ref}(s_1) + \log \pi_{\theta}(\tau) - \frac{r_{rule}(\tau)}{\beta} - \log \pi_{ref}(\tau)\right)^2$$
This formula ensures that at the start of training (when $r=0$), the gradient is zero, preventing the "spurious policy drift" that plagues other value-based methods.
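Translated into code, the objective is a direct squared residual over per-trajectory scalars. This is a sketch: how the log-probabilities and prompt values are extracted from logits, and the use of a batch mean rather than a sum, are assumptions for illustration.

```python
import torch

def reval_loss(v_theta_s1, v_ref_s1, logp_theta, logp_ref, reward, beta: float = 0.002):
    """Squared Bellman residual from the displayed ReVal objective.

    Per-trajectory inputs:
      v_theta_s1 - V_theta(s_1), prompt value under the current model
      v_ref_s1   - V_ref(s_1), prompt value under the frozen reference model
      logp_theta - sum of log pi_theta over the generated tokens of tau
      logp_ref   - sum of log pi_ref over the same tokens
      reward     - rule-based outcome reward r_rule(tau)
    """
    residual = (v_theta_s1 - v_ref_s1) + (logp_theta - logp_ref) - reward / beta
    return (residual ** 2).mean()  # batch mean instead of the paper's sum (a scaling choice)

# Sanity check of the calibration property: at initialization theta == ref and
# the reward is zero, so the residual, the loss, and its gradient are all zero.
loss = reval_loss(torch.tensor(1.3), torch.tensor(1.3),
                  torch.tensor(-42.0), torch.tensor(-42.0),
                  torch.tensor(0.0))
assert torch.isclose(loss, torch.tensor(0.0))
```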
Figure 1: ReVal Framework. By unifying policy and value via logits, it enables the use of an off-policy Replay Buffer.
Experiments: Speed Meets Stability
ReVal was tested against GRPO and TBRM on various reasoning benchmarks using DeepSeek-R1-Distill-1.5B and Qwen2.5-Math-7B.
Dramatic Convergence Speedup
Across easy, medium, and hard tasks, ReVal reaches peak performance significantly faster than GRPO, with reported convergence speedups ranging from 3.6x to 5.2x. On "Hard" tasks, for instance, it reaches the same accuracy threshold in roughly 10 updates versus GRPO's 33.
Superior Reasoning Performance
In final accuracy, ReVal consistently beats on-policy baselines:
- AIME24: +2.7% improvement over GRPO.
- GPQA (Out-of-Domain): +4.5% improvement, showcasing better generalization.
- Performance @ N=1: When rollouts are strictly limited (one sample per prompt), ReVal's lead grows even further (up to +4.8% on AIME), showing that the ability to reuse data is a "superpower" in data-constrained environments.
Figure 2: Performance curves on DPSK-R1-Distill-1.5B. ReVal (purple) maintains a clear lead in accuracy and stability throughout training.
Critical Analysis & Insights
The success of ReVal hinges on three "Tuning Secrets" revealed in the paper:
- Periodic Reference Reset: The KL penalty $\log \pi_{\theta}/\pi_{ref}$ grows over time, which eventually "muffles" the gradient signal. Periodically updating $\pi_{ref}$ to the current $\pi_{\theta}$ "refreshes" the learning signal.
- The Right $\beta$: Longer responses (like those from DeepSeek-R1) need a smaller $\beta$ (0.002) because log-ratios accumulate over more tokens.
- Negative Samples: Using Normalized Advantage (standardizing rewards across a batch) is significantly more effective than simple +1/-1 binary rewards for value learning.
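A minimal sketch of the batch-standardized advantage from the last bullet. Whether normalization is applied per prompt group or per global batch is an assumption here; the point is only that the shaped reward reflects how each rollout compares to its peers rather than a raw +1/-1 label.

```python
import torch

def normalized_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize outcome rewards across a batch of rollouts.

    Subtracting the batch mean and dividing by the batch standard deviation
    turns binary correctness labels into a signed, scaled learning signal.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for the same prompt, two verified correct.
adv = normalized_advantage(torch.tensor([1.0, 0.0, 1.0, 0.0]))
```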
Limitations
While ReVal is powerful, it currently uses a simple FIFO (First-In-First-Out) buffer. Future iterations could likely extract even more performance by using Prioritized Experience Replay (PER)—focusing the model on the most "surprising" or educational failures.
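For a sense of what that direction could look like, here is a hypothetical prioritized-sampling sketch layered on top of the buffer above. This is not part of ReVal; the priority (absolute residual raised to a temperature $\alpha$) is one common PER choice, and a full implementation would also add importance-sampling corrections.

```python
import random

def prioritized_sample(trajectories: list, residuals: list[float],
                       batch_size: int, alpha: float = 0.6) -> list:
    """Hypothetical PER-style sampling: draw rollouts with probability roughly
    proportional to |Bellman residual|**alpha, so the most 'surprising' or
    educational failures are replayed more often than routine successes."""
    priorities = [abs(r) ** alpha + 1e-6 for r in residuals]
    return random.choices(trajectories, weights=priorities, k=batch_size)
```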
Conclusion
ReVal proves that we don't need to choose between memory efficiency and data efficiency. By treating an LLM as its own value model, we can finally stop the wasteful "sample-and-discard" cycle of on-policy RL, paving the way for more complex, long-horizon AI agents.
