The paper introduces V1, a framework that unifies generation and self-verification for parallel reasoners. It features V1-Infer, an uncertainty-guided tournament ranking algorithm for test-time scaling, and V1-PairRL, an RL training paradigm that co-evolves a single model as both a generator and a pairwise verifier, achieving up to 10% Pass@1 gains over pointwise methods.
TL;DR
The "System 2" paradigm in LLMs—scaling compute at inference time—usually relies on sampling multiple solutions. However, selecting the right one is the bottleneck. V1 solves this by replacing unreliable absolute scoring (pointwise) with relative comparisons (pairwise). Through a smart tournament-based inference algorithm (V1-Infer) and a co-evolving RL training framework (V1-PairRL), V1 pushes the boundaries of parallel reasoning, achieving a 10% boost in accuracy on complex coding and math tasks while remaining compute-efficient.
Problem: The Calibration and Diversity Collapse
Current test-time scaling methods usually fall into two traps:
- Pointwise Calibration Collapse: When you ask a model to score a solution from 1 to 10, it has no reference point, and it often gives its own (even incorrect) answers high scores.
- Diversity Collapse (RSA): Recursive aggregation methods try to merge answers, but often end up "averaging out" the correct outlier, causing the overall success probability to drop.
V1's core insight is that pairwise comparison is a more natural and robust primitive. Models find it much easier to say "A is better than B" than to say "A is exactly an 8/10."
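To make the contrast concrete, here is a minimal sketch of the two primitives. The prompt wording and the `llm` callable are illustrative assumptions, not taken from the paper.

```python
# Hypothetical prompt templates contrasting the two verification primitives.
# `llm` is any callable that maps a prompt string to a completion string.

POINTWISE_PROMPT = """Rate the following solution from 1 to 10.
Problem: {problem}
Solution: {solution}
Score:"""

PAIRWISE_PROMPT = """Which solution better solves the problem? Answer A or B.
Problem: {problem}
Solution A: {a}
Solution B: {b}
Answer:"""

def pointwise_score(llm, problem: str, solution: str) -> int:
    """Absolute scoring: no reference point, prone to calibration collapse."""
    reply = llm(POINTWISE_PROMPT.format(problem=problem, solution=solution))
    return int(reply.strip())

def pairwise_prefer(llm, problem: str, a: str, b: str) -> str:
    """Relative judgment: the model only decides which of two is better."""
    return llm(PAIRWISE_PROMPT.format(problem=problem, a=a, b=b)).strip()
```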
Methodology: V1-Infer & V1-PairRL
1. V1-Infer: The Swiss Tournament for LLMs
Instead of exhaustively comparing all O(N²) pairs, V1-Infer uses a two-phase strategy (sketched in code below):
- Phase 1 (Topology Coverage): Anchors comparisons to ensure all solutions are linked.
- Phase 2 (Swiss Refinement): Like a chess tournament, it pairs solutions with similar "win rates." This focuses the verification budget on the most uncertain decision boundaries, maximizing information gain per LLM call.
Figure: The V1-Infer workflow—efficiently ranking N candidates through uncertainty-guided pairings.
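The following is a minimal sketch of that two-phase loop, assuming a `judge(a, b)` callable that returns the preferred candidate; the paper's exact anchoring and pairing rules may differ.

```python
import random
from collections import defaultdict

def v1_infer(candidates: list[str], judge, budget: int) -> str:
    """Sketch of an uncertainty-guided tournament in the spirit of V1-Infer.
    `judge(a, b)` is assumed to return whichever candidate it prefers."""
    wins = defaultdict(int)
    games = defaultdict(int)

    def play(a: str, b: str) -> None:
        winner = judge(a, b)
        wins[winner] += 1
        games[a] += 1
        games[b] += 1

    # Phase 1 (Topology Coverage): one pass of anchored comparisons that
    # links every candidate into a single connected comparison graph.
    for a, b in zip(candidates, candidates[1:]):
        play(a, b)

    # Phase 2 (Swiss Refinement): repeatedly pair candidates with similar
    # win rates, spending the remaining budget on uncertain boundaries.
    remaining = budget - (len(candidates) - 1)
    for _ in range(remaining):
        ranked = sorted(candidates,
                        key=lambda c: wins[c] / max(games[c], 1),
                        reverse=True)
        i = random.randrange(len(ranked) - 1)  # adjacent pairs have the
        play(ranked[i], ranked[i + 1])         # closest win rates

    # Final answer: the candidate with the best empirical win rate.
    return max(candidates, key=lambda c: wins[c] / max(games[c], 1))
```

The key design choice is that Swiss pairing matches candidates with similar win rates, so each extra judge call is spent near the decision boundary where the ranking is least certain.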
2. V1-PairRL: Co-evolving Training
V1 doesn't just use a pre-trained model; it trains the model to be a better verifier via Group-Relative Policy Optimization (GRPO).
- Joint Objective: The model is penalized if it generates the wrong code and rewarded if its pairwise "judge" ratings align with the ground truth.
- Anti-Hacking: To prevent the model from giving "safe/middle" scores, V1 uses a sparsity threshold, rewarding only confident, correct judgments.
Figure: The V1-PairRL loop where generation and verification improve together.
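As a rough illustration, a GRPO-style joint reward with such a sparsity threshold might look like the sketch below; the weighting, the threshold value, and the function signature are assumptions, not the paper's specification.

```python
def joint_reward(gen_passed: bool,
                 judge_pref: float,   # model's P(A is better than B)
                 true_pref: bool,     # ground truth: is A actually better?
                 tau: float = 0.8) -> float:
    """Hypothetical joint reward for V1-PairRL-style co-training.
    The weighting and threshold are illustrative, not the paper's values."""
    # Generation term: did the sampled solution pass the unit tests?
    r_gen = 1.0 if gen_passed else 0.0

    # Verification term with a sparsity threshold: the pairwise judgment
    # earns reward only when it is both confident (far from 0.5) and
    # correct, which discourages hedged "safe/middle" scores.
    confident = abs(judge_pref - 0.5) >= (tau - 0.5)
    correct = (judge_pref > 0.5) == true_pref
    r_ver = 1.0 if (confident and correct) else 0.0

    return r_gen + r_ver
```

Under GRPO, this scalar reward would then be normalized into group-relative advantages across the sampled rollouts.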
Experiments & Results
The authors tested V1 across CodeContests, LiveCodeBench, AIME, and even real-world software engineering (SWE-bench).
- Significant Gains: V1-Infer improved Pass@1 by up to 10% over pointwise baselines.
- Efficiency: Compared to Recursive Self-Aggregation (RSA), V1-Infer achieved higher accuracy with far fewer model calls.
- Foundational Improvement: Even without test-time scaling, the co-trained V1-PairRL base model outperformed standard RL baselines by 8.7% on CodeContests, proving that learning to verify actually makes the model a better "thinker."
Figure: V1-PairRL consistently outperforms standard RL baselines in both base accuracy and test-time scaling.
Depth Insight: Why Pairwise Wins on Code
One striking qualitative result: Pointwise verification often suffers from score saturation. In a coding problem where one solution runs in O(n²) and another in O(n log n), a pointwise judge often gives both 10/10 because they both look "correct." However, when forced to look at them side-by-side (Pairwise), the model immediately notices the algorithmic efficiency difference and picks the superior patch.
Conclusion
V1 demonstrates that the future of "System 2" LLMs lies in unified training. By forcing models to adjudicate between their own candidate outputs during training and using tournament-style verification at inference, we can overcome the calibration limits of the current generation of reasoners.
Limitations: Pairwise verification currently requires multiple LLM calls per problem, which can increase latency. Future work should look into distilling these pairwise insights back into a faster, pointwise-style reward head.
