Recursive Think-Answer Process for LLMs and VLMs


[ArXiv 2025] R-TAP: Beyond the Single-Pass Limit—Recursive Self-Correction for LLMs and VLMs

Abstract

The paper introduces Recursive Think-Answer Process (R-TAP), a confidence-guided iterative reasoning framework that allows Large Language Models (LLMs) and Vision-Language Models (VLMs) to refine their internal reasoning cycles. By integrating a Confidence Generator and a recursive reward structure during training, the authors achieve State-of-the-Art (SOTA) performance across math, code, and multimodal benchmarks while reducing inference-time "Oops"-style self-corrections.

TL;DR

Recursive Think-Answer Process (R-TAP) is a training framework that transforms "Single-Pass" reasoners into "Recursive" reasoners. By using a specialized Confidence Generator and a recursive reward system, models learn to internally assess their logic and iterate until certain. The result? Higher accuracy, fewer "Oops" moments, and surprisingly, faster inference due to more stable reasoning trajectories.

Background Positioning: This is a major "System 2" enhancement for reasoning models, moving from DeepSeek-style single-shot RL to a more sophisticated, iterative self-correction paradigm that works for both text and vision.

The "Oops" Problem: Why Single-Pass is Not Enough

Current SOTA reasoners like DeepSeek-R1 have an "Aha!" moment, but they also have plenty of "Oops!" moments. Have you ever noticed a model saying "Wait, let me re-calculate that..." only to fail anyway?

The problem is Inference Rigidity. Current GRPO-style sampling optimizes for a single trajectory. If the model starts down a wrong path, it has no structural incentive to stop, go back, and fix it before emitting the EOS (end-of-sequence) token. R-TAP identifies confidence as the missing signal: if a model doesn't know how sure it is, it can't know when to try again.

Methodology: The Confidence-Guided Loop

R-TAP breaks the mold by introducing a two-stage training process:

Confidence Generator ($C_{\phi}$): A specialized head is trained to predict the probability that a given (Question, Reason-Answer) pair is correct.
Recursive RL (GRPO): The model is trained to maximize a "Recursive Reward." Unlike standard RL which only cares if the final answer is right, R-TAP rewards:
- Improvement: Is the confidence of step $t + 1$ higher than step $t$ ?
- Certainty: Is the final answer's confidence above a threshold $\tau$?

Figure 1: The R-TAP loop. The model recursively generates $o^{(t)}$ until its internal confidence is high enough.
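To make the loop in Figure 1 concrete, here is a minimal Python sketch. It is not the authors' implementation: `generate` and `confidence` are hypothetical callables standing in for the policy and for $C_{\phi}$, and the threshold `tau`, the cycle cap `max_cycles`, and the fallback to the most confident attempt are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def recursive_think_answer(
    question: str,
    generate: Callable[[str, List[str]], str],  # hypothetical policy: returns o^(t)
    confidence: Callable[[str, str], float],    # hypothetical stand-in for C_phi
    tau: float = 0.8,                           # assumed confidence threshold
    max_cycles: int = 4,                        # assumed cap on recursion depth
) -> Tuple[str, float, int]:
    """Return (output, confidence, cycles_used) for one question."""
    history: List[str] = []
    best_output, best_conf = "", -1.0
    for t in range(1, max_cycles + 1):
        # Produce the t-th think-answer attempt, conditioned on earlier attempts.
        output = generate(question, history)
        conf = confidence(question, output)
        history.append(output)
        if conf > best_conf:
            best_output, best_conf = output, conf
        # Stop as soon as the attempt is confident enough.
        if conf >= tau:
            return output, conf, t
    # Assumption: if no attempt clears tau, fall back to the most confident one.
    return best_output, best_conf, max_cycles
```

With a toy generator that first proposes a low-confidence draft and then a high-confidence one, the loop stops on the second cycle. With a generator that never clears `tau`, it runs for `max_cycles` and returns its best attempt.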

The Mathematical Intuition

Instead of a simple binary reward $r \in \{0, 1\}$, the model sees: $R = R_{\text{Increase}} + R_{\text{Final}} + R_{\text{Answer}}$. This forces the policy $\pi_{\theta}$ to learn a self-corrective trajectory. If the first "thought" is low-confidence, the model is incentivized to refine it in the next cycle to capture the $R_{\text{Increase}}$ reward.
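One way to read the three-term reward is sketched below. The post gives only the sum, not the exact form of each term, so the definitions here are assumptions: the increase term is the fraction of consecutive cycles whose confidence rose, the final term is 1 when the last confidence reaches `tau`, the answer term is 1 when the final answer is correct, and the three are weighted equally.

```python
from typing import Sequence


def recursive_reward(
    confidences: Sequence[float],  # confidence of each cycle o^(1), ..., o^(T)
    answer_correct: bool,          # whether the final answer matches the reference
    tau: float = 0.8,              # assumed confidence threshold
) -> float:
    """Illustrative R = R_increase + R_final + R_answer (term definitions are assumed)."""
    if not confidences:
        raise ValueError("need at least one reasoning cycle")
    # R_increase: share of steps t -> t+1 where confidence went up.
    pairs = list(zip(confidences, confidences[1:]))
    r_increase = sum(b > a for a, b in pairs) / len(pairs) if pairs else 0.0
    # R_final: did the last cycle end above the threshold?
    r_final = 1.0 if confidences[-1] >= tau else 0.0
    # R_answer: the usual binary correctness reward.
    r_answer = 1.0 if answer_correct else 0.0
    return r_increase + r_final + r_answer
```

Under these assumptions, a trajectory with confidences 0.3, 0.6, 0.9 and a correct answer earns the full reward of 3.0, while a confident single-pass answer earns 2.0 because it has no improvement step to reward.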

Experimental Results: Scaling Performance & Efficiency

The researchers tested R-TAP on a wide range of model families (Qwen, Llama, Phi-4, Skywork). The results are strikingly consistent.

1. The Performance Leap

On the AIME 2024 (American Invitational Mathematics Examination) benchmark, R1-Distill-Qwen-7B saw its score rise from 33.3% to 39.7%. In the multimodal domain, MM-Eureka-32B reached 80.2% on MathVista, surpassing many larger closed-source models.

Table 1: Performance comparison. Dramatic improvements across LLM reasoning benchmarks using R-TAP.

2. The Efficiency Paradox

One might assume recursive reasoning takes longer. However, the data shows that R-TAP models actually have shorter inference times.

Why? Because R-TAP trains the model to be more "decisive." By reducing circular, erroneous reasoning (the "Oops" tokens), the total number of tokens required to reach a correct answer actually decreases.

Figure 2: Inference efficiency. R-TAP reduces "Oops"-style tokens, leading to a significant drop in total inference time.

Critical Analysis & Takeaways

The "Oops" Metric

The authors use a fascinating proxy for reasoning quality: the frequency of self-reflective cues like "Oops." Such cues are usually read as a sign of "intelligence" (self-reflection), but R-TAP shows that excessive self-correction is actually a failure of policy stability. By internalizing the correction process during training, the model arrives at the answer more directly.
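A cue-frequency proxy of this kind is easy to approximate on your own reasoning traces. The sketch below is illustrative only: the cue list and the per-100-tokens normalization are assumptions, not the authors' exact metric, and tokens are counted by whitespace rather than with a model tokenizer.

```python
import re
from typing import Sequence

# Assumed cue list; the post mentions "Oops" and "Wait, let me re-calculate that...".
REFLECTION_CUES = ("oops", "wait", "let me re-calculate", "let me recalculate")


def reflection_cue_rate(trace: str, cues: Sequence[str] = REFLECTION_CUES) -> float:
    """Self-reflective cue occurrences per 100 whitespace-separated tokens."""
    tokens = trace.split()
    if not tokens:
        return 0.0
    lowered = trace.lower()
    # Word boundaries keep "wait" from matching inside words such as "await".
    hits = sum(
        len(re.findall(r"\b" + re.escape(cue) + r"\b", lowered)) for cue in cues
    )
    return 100.0 * hits / len(tokens)
```

Comparing this rate before and after fine-tuning, on the same prompts, gives a rough sense of whether a model's traces have become more direct.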

Limitations

Training Overhead: Generating $T$ recursive paths during GRPO sampling is computationally expensive.
Confidence Calibration: The system relies heavily on the Confidence Generator being accurate. If $C_{\phi}$ is overconfident, the recursion fails.
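Since the second limitation hinges on calibration, it can help to measure it directly. The sketch below computes expected calibration error (ECE), a standard calibration measure that the post itself does not mention. It assumes you have held-out confidence scores in [0, 1] and matching correctness labels; the bin count is an arbitrary choice.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],  # predicted probabilities of correctness, in [0, 1]
    correct: Sequence[bool],       # whether each prediction was actually correct
    n_bins: int = 10,              # assumed number of equal-width bins
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # place c == 1.0 in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means confidence tracks accuracy. A large value driven by high-confidence bins with low accuracy is the overconfident case, where a loop gated on a confidence threshold would stop too early.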

The Future of Reasoning

R-TAP moves us closer to Adaptive Inference. Imagine a model that spends 10 seconds on a Grade 1 math problem but "thinks" for 2 minutes on a complex physics proof—not because of instructions, but because it internally senses its own uncertainty. R-TAP provides the foundational training signals to make that autonomy possible.

Summary for the Practitioner: If you are fine-tuning specialized reasoning models, adding a confidence-based recursive reward can yield better performance and faster production inference than standard single-pass SFT/RL.
