[MIT Research] Beyond the Mode: Training LLMs for Distributional Reasoning and Multi-Answer Generation
Abstract

The paper introduces Multi-Answer Reinforcement Learning (Multi-Answer RL), a post-training framework that enables Large Language Models (LLMs) to generate a calibrated distribution of multiple plausible answers in a single forward pass. By modifying the standard RL objective with set-level rewards and proper scoring rules (the Multi-Brier score), the authors achieve state-of-the-art answer coverage and diversity on datasets such as DDXPlus (medical) and MBPP (coding).

TL;DR

Standard RL often forces language models into a "single-track mind," causing them to ignore valid alternative answers in favor of a dominant mode. MIT researchers have introduced Multi-Answer RL, a training paradigm in which models learn to generate a set of $K$ diverse, calibrated answers in a single response. This approach doesn't just improve diversity: it cuts inference token usage (by 44% on medical tasks) and nearly doubles the number of unique correct coding solutions by eliminating redundant reasoning.

Problem & Motivation: The "Invisible Leash" of Single-Answer RL

Most current LLMs are post-trained with Reinforcement Learning with Verifiable Rewards (RLVR) to maximize the probability of a single correct answer. While great for benchmarks, this objective produces a "mode collapse" failure in the real world.

  • In Medicine: A patient with abdominal pain might have appendicitis or a kidney stone. A model trained for a single answer will "pick one" and stay silent about the other.
  • In Code: There are multiple ways to implement an algorithm. Standard RL models often repeat the same logic even when asked to try again.
  • Efficiency Gap: Current solutions rely on "Best-of-N" sampling—running the model 10 times to get 10 answers. This is incredibly wasteful because the model repeats the same "scaffolding" reasoning over and over.
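
To make the efficiency gap concrete, here is a back-of-the-envelope sketch in Python. The token counts (`scaffold_tokens`, `answer_tokens`) are illustrative assumptions, not figures from the paper:

```python
# Illustrative cost model: Best-of-N re-pays the shared "scaffolding"
# reasoning on every sample, while one multi-answer pass pays it once.
scaffold_tokens = 800  # assumed tokens of shared chain-of-thought per generation
answer_tokens = 50     # assumed marginal tokens per distinct final answer
k = 10                 # number of answers wanted

best_of_n = k * (scaffold_tokens + answer_tokens)   # k independent samples
multi_answer = scaffold_tokens + k * answer_tokens  # one pass, k answers

print(f"Best-of-N:    {best_of_n} tokens")    # 8500
print(f"Multi-answer: {multi_answer} tokens")  # 1300
```

Under these assumptions the single-pass regime spends a fraction of the tokens; the 44% reduction the paper reports on medical tasks is consistent with this shared-scaffolding effect.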

Methodology: Internalizing the Distribution

The core innovation is moving the set-generation logic from the inference strategy into the model weights.

1. Multi-Answer RLVR (Coverage)

Instead of rewarding a single string, the reward function checks a generated set $A = \{a_1, \dots, a_K\}$ against a ground-truth set $\mathcal{Y}^*$:

$$R_{\mathrm{RLVR}}^{\mathrm{multi}}(A, \mathcal{Y}^*) = \sum_{i=1}^{K} \mathbf{1}\left[\, a_i \in \mathcal{Y}^* \,\right]$$

This forces the model to explore the breadth of the solution space within its internal Chain-of-Thought (CoT).
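
A minimal Python sketch of this coverage reward, assuming exact-match verification against the ground-truth set (for coding tasks, a unit-test verifier would replace the membership check):

```python
# Set-level RLVR reward: count how many of the K answers are valid.
# Note: as written, duplicate correct answers each score; deduplicating
# first is a natural variant (an assumption, not specified here).
def multi_answer_rlvr_reward(answers: list[str], ground_truth: set[str]) -> int:
    """R = sum over i of 1[a_i in Y*]."""
    return sum(1 for a in answers if a in ground_truth)

# A differential-diagnosis example with two valid diagnoses:
candidates = ["appendicitis", "kidney stone", "gastritis"]
print(multi_answer_rlvr_reward(candidates, {"appendicitis", "kidney stone"}))  # -> 2
```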

2. Multi-Answer RLCR (Calibration)

To make the model "know what it knows," the authors add a calibration reward using the Multi-Brier Score. The model must output both an answer and its confidence. If the model says 70% confidence for $a_1$ and $a_1$ is wrong, it gets heavily penalized.
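
A hedged sketch of what such a calibration term could look like, using negated per-answer Brier terms (the paper's exact Multi-Brier aggregation and weighting may differ):

```python
# Each answer a_i carries a stated confidence c_i in [0, 1]; each
# (confidence - outcome)^2 term is a proper scoring rule, so honest
# confidences maximize expected reward.
def multi_brier_reward(answers, confidences, ground_truth) -> float:
    reward = 0.0
    for a, c in zip(answers, confidences):
        outcome = 1.0 if a in ground_truth else 0.0
        reward -= (c - outcome) ** 2  # 0 is best per answer, -1 is worst
    return reward

# Claiming 70% confidence on a wrong answer costs (0.7 - 0.0)^2 = 0.49:
print(multi_brier_reward(["influenza"], [0.7], {"appendicitis"}))  # -> -0.49
```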

Figure 1: Comparison between standard RL (mode-seeking) and Multi-Answer RL (distribution-seeking).

Experimental Battleground: Medical, QA, and Code

The researchers tested the method on three distinct regimes:

  1. DDXPlus (Medical): Intrinsically multi-answer (Differential Diagnosis).
  2. HotPotQA (Ambiguous): Single answer but incomplete info (Epistemic Uncertainty).
  3. MBPP (Coding): Unambiguous but allows multiple algorithmic paths.

Key Findings:

  • Diversity Explosion: In coding (MBPP), Multi-Answer RLVR produced nearly twice as many unique functional implementations compared to the single-answer baseline under the same sampling budget.
  • The Efficiency Dividend: Because the model generates multiple answers in one go (sharing the initial "thinking" phase), token usage dropped by 44% in medical tasks.

Figure 2: Multi-Answer RL achieves significantly higher coverage and more unique correct answers than single-answer RL across DDXPlus and MBPP.

Critical Insight: Why it Works

The "magic" happens in the Chain-of-Thought. By being forced to provide $K$ answers, the model's reasoning trace becomes a comparison-fest. It internally weighs "Hypothesis A vs. Hypothesis B." This structure prevents the model from "committing" too early to a single path—a common cause of hallucinations in standard models.

Limitations & Future Work

Despite these gains, the authors note a potential "Sum-to-One" bias: on extremely difficult tasks, models tend to force their total confidence across the $K$ answers to sum to 1.0 (100%), even when they are actually clueless about all of them. Future work should focus on better "abstention" mechanisms, where a model can admit it has near-zero confidence in every candidate it generated.
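
A toy numerical illustration of this bias, with made-up confidence values (not from the paper): renormalizing three weak candidates to sum to 1 manufactures apparent certainty, whereas an explicit abstention slot preserves the missing probability mass:

```python
# Sum-to-One bias: weak raw confidences, renormalized, look confident.
raw = [0.05, 0.04, 0.03]                  # nearly clueless about all three
normalized = [c / sum(raw) for c in raw]  # -> [0.417, 0.333, 0.250]

# An abstention-aware report keeps the leftover mass as "none of the above":
abstain_mass = 1.0 - sum(raw)             # 0.88 reserved for "I don't know"
print([round(c, 3) for c in normalized], round(abstain_mass, 2))
```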

Conclusion

Multi-Answer RL shifts the paradigm of LLM alignment. Instead of teaching a model to be a "confident expert" that gives one answer, it teaches the model to be a "risk-aware analyst" that maps out the entire probability landscape. For high-stakes AI applications in medicine, law, and engineering, this isn't just an optimization—it's a requirement.
