[UC Berkeley] V1: Unifying Generation and Pairwise Self-Verification for Parallel Reasoners
Abstract

The paper introduces V1, a framework that unifies generation and self-verification for parallel reasoners. It features V1-Infer, an uncertainty-guided tournament ranking algorithm for test-time scaling, and V1-PairRL, an RL training paradigm that co-evolves a single model as both a generator and a pairwise verifier, achieving up to 10% Pass@1 gains over pointwise methods.

TL;DR

The "System 2" paradigm in LLMs—scaling compute at inference time—usually relies on sampling multiple solutions. However, selecting the right one is the bottleneck. V1 solves this by replacing unreliable absolute scoring (pointwise) with relative comparisons (pairwise). Through a smart tournament-based inference algorithm (V1-Infer) and a co-evolving RL training framework (V1-PairRL), V1 pushes the boundaries of parallel reasoning, achieving up to a 10% boost in accuracy on complex coding and math tasks while remaining compute-efficient.

Problem: The Calibration and Diversity Collapse

Current test-time scaling methods usually fall into two traps:

  1. Pointwise Calibration Collapse: When you ask a model to score a solution from 1 to 10, it lacks a reference. Often, it gives its own (even incorrect) answers high scores.
  2. Diversity Collapse (RSA): Recursive aggregation methods try to merge answers, but often end up "averaging out" the correct outlier, causing the overall success probability to drop.

V1's core insight is that pairwise comparison is a more natural and robust primitive. Models find it much easier to say "A is better than B" than to say "A is exactly an 8/10."
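The pairwise primitive can be turned into a full ranking by counting wins. The sketch below illustrates the idea with a brute-force all-pairs tournament; `compare(a, b)` is a hypothetical stand-in for an LLM judge call, and the candidate names and hidden quality scores are invented for the toy example (this is not the paper's API, just the primitive it builds on):

```python
def pairwise_rank(candidates, compare):
    """Rank candidates by win rate from all-pairs comparisons.

    compare(a, b) returns True if a beats b — a stand-in for a
    pairwise LLM judge call ("is A's solution better than B's?").
    """
    wins = {c: 0 for c in candidates}
    games = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            winner = a if compare(a, b) else b
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    # Sort by empirical win rate, best first.
    return sorted(candidates, key=lambda c: wins[c] / games[c], reverse=True)

# Toy judge: "better" means a higher hidden quality score.
quality = {"sol_A": 0.9, "sol_B": 0.4, "sol_C": 0.7}
ranking = pairwise_rank(list(quality), lambda a, b: quality[a] > quality[b])
print(ranking)  # ['sol_A', 'sol_C', 'sol_B']
```

Note that the all-pairs loop costs O(N²) judge calls — exactly the overhead V1-Infer's tournament structure is designed to avoid.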

Methodology: V1-Infer & V1-PairRL

1. V1-Infer: The Swiss Tournament for LLMs

Instead of exhaustively comparing every pair of candidates, V1-Infer uses a two-phase strategy:

  • Phase 1 (Topology Coverage): Anchors comparisons to ensure all solutions are linked.
  • Phase 2 (Swiss Refinement): Like a chess tournament, it pairs solutions with similar "win rates." This focuses the verification budget on the most uncertain decision boundaries, maximizing information gain per LLM call.
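The Phase 2 pairing rule can be sketched in a few lines: sort candidates by their current win rate and pair adjacent ones, so each comparison falls near a decision boundary. This is an illustrative simplification, not the paper's exact algorithm (the real V1-Infer also handles the Phase 1 anchoring and uncertainty weighting):

```python
def swiss_pairings(win_rates):
    """Pair candidates with the closest current win rates (Phase 2 sketch).

    win_rates: dict mapping candidate -> win rate in [0, 1].
    Returns a list of pairs to compare next; with an odd count,
    the last (lowest-rated) candidate sits out this round.
    """
    ordered = sorted(win_rates, key=win_rates.get, reverse=True)
    # Adjacent candidates in the sorted order are the most uncertain
    # matchups, so each judge call carries maximal information.
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]

rates = {"A": 0.90, "B": 0.85, "C": 0.50, "D": 0.45, "E": 0.10}
print(swiss_pairings(rates))  # [('A', 'B'), ('C', 'D')]
```

Pairing near-equals is what distinguishes this from random matching: comparing a 0.90 candidate against a 0.10 candidate is almost always uninformative, while 0.90 vs. 0.85 resolves the ranking at the top.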

Model Architecture Figure: The V1-Infer workflow—efficiently ranking N candidates through uncertainty-guided pairings.

2. V1-PairRL: Co-evolving Training

V1 doesn't just use a pre-trained model; it trains the model to be a better verifier via Group-Relative Policy Optimization (GRPO).

  • Joint Objective: The model is penalized if it generates the wrong code and rewarded if its pairwise "judge" ratings align with the ground truth.
  • Anti-Hacking: To prevent the model from giving "safe/middle" scores, V1 uses a sparsity threshold, rewarding only confident, correct judgments.
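The joint objective and the sparsity threshold can be combined into a single scalar reward per rollout. The shaping below is a hypothetical sketch under assumed coefficients (the paper's exact reward weights and threshold are not specified here); `tau` is an invented name for the confidence cutoff:

```python
def joint_reward(gen_correct, judge_conf, judge_correct, tau=0.7):
    """Combine generation and pairwise-verification rewards (sketch).

    gen_correct:   did the sampled solution pass its tests?
    judge_conf:    model's confidence that A beats B, in [0, 1].
    judge_correct: does the judged direction match ground truth?
    tau:           sparsity threshold — hedged judgments (confidence
                   near 0.5) earn nothing, so "safe middle" scores
                   are not a viable strategy.
    """
    r_gen = 1.0 if gen_correct else -1.0
    confident = max(judge_conf, 1.0 - judge_conf) >= tau
    r_judge = (1.0 if judge_correct else -1.0) if confident else 0.0
    return r_gen + r_judge

print(joint_reward(True, 0.9, True))   # 2.0 — correct code, confident correct judgment
print(joint_reward(True, 0.55, True))  # 1.0 — hedged judgment earns no verifier reward
```

In a GRPO-style setup, this scalar would be normalized within each sampled group before computing the policy gradient, so generation and verification quality co-evolve from the same updates.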

Training Results Figure: The V1-PairRL loop where generation and verification improve together.

Experiments & Results

The authors tested V1 across CodeContests, LiveCodeBench, AIME, and even real-world software engineering (SWE-bench).

  • Significant Gains: V1-Infer improved Pass@1 by up to 10% over pointwise baselines.
  • Efficiency: Compared to Recursive Self-Aggregation (RSA), V1-Infer achieved higher accuracy with far fewer model calls.
  • Foundational Improvement: Even without test-time scaling, the co-trained V1-PairRL base model outperformed standard RL baselines by 8.7% on CodeContests, proving that learning to verify actually makes the model a better "thinker."

Performance Comparison Figure: V1-PairRL consistently outperforms standard RL baselines in both base accuracy and test-time scaling.

Depth Insight: Why Pairwise Wins on Code

One striking qualitative result: pointwise verification often suffers from score saturation. In a coding problem where one solution is asymptotically efficient and the other is a brute-force equivalent, a pointwise judge often gives both 10/10 because both look "correct." However, when forced to examine them side by side (pairwise), the model immediately notices the difference in algorithmic efficiency and picks the superior patch.

Conclusion

V1 demonstrates that the future of "System 2" LLMs lies in unified training. By forcing models to argue between their own candidate outputs during training and using tournament-style verification at inference, we can overcome the calibration limits of the current generation of reasoners.

Limitations: Pairwise verification currently requires multiple LLM calls per problem, which can increase latency. Future work should look into distilling these pairwise insights back into a faster, pointwise-style reward head.
