The paper introduces Multi-Answer Reinforcement Learning (Multi-Answer RL), a post-training framework that enables Large Language Models (LLMs) to generate a calibrated distribution of multiple plausible answers in a single forward pass. By modifying the standard RL objective with set-level rewards and proper scoring rules (Multi-Brier score), the authors achieve SOTA performance in answer coverage and diversity on datasets like DDXPlus (medical) and MBPP (coding).
TL;DR
Standard RL often forces Language Models into a "single-track mind," causing them to ignore valid alternative answers in favor of a dominant mode. MIT researchers have introduced Multi-Answer RL, a training paradigm in which models learn to generate a set of $K$ diverse, calibrated answers in a single response. This approach doesn't just improve diversity: by sharing one reasoning trace across all $K$ answers it cuts inference token usage (44% fewer tokens on medical tasks) and nearly doubles the number of unique correct coding solutions under the same sampling budget.
Problem & Motivation: The "Invisible Leash" of Single-Answer RL
Most current LLMs are post-trained using Reinforcement Learning with Verifiable Rewards (RLVR) to maximize the probability of a single correct answer. While great for benchmarks, this creates a "Mode Collapse" failure in the real world.
- In Medicine: A patient with abdominal pain might have appendicitis or a kidney stone. A model trained for a single answer will "pick one" and stay silent about the other.
- In Code: There are multiple ways to implement an algorithm. Standard RL models often repeat the same logic even when asked to try again.
- Efficiency Gap: Current solutions rely on "Best-of-N" sampling—running the model 10 times to get 10 answers. This is incredibly wasteful because the model repeats the same "scaffolding" reasoning over and over.
Methodology: Internalizing the Distribution
The core innovation is moving the set-generation logic from the inference strategy into the model weights.
1. Multi-Answer RLVR (Coverage)
Instead of rewarding a single string, the reward function checks a generated set $A = \{a_1, \dots, a_K\}$ against a ground-truth set $\mathcal{Y}^*$: $$R_{\mathrm{RLVR}}^{\mathrm{multi}}(A, \mathcal{Y}^*) = \sum_{i=1}^{K} \mathbf{1}[ a_i \in \mathcal{Y}^* ]$$ This forces the model to explore the breadth of the solution space within its internal Chain-of-Thought (CoT).
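The coverage reward above is a simple set-level count. A minimal sketch in Python, assuming answers arrive as a list of strings parsed from one response (the function and variable names are illustrative, not the paper's interface):

```python
def multi_answer_rlvr_reward(answers, ground_truth):
    """Set-level coverage reward: +1 for each generated answer a_i
    that appears in the ground-truth set Y*.

    answers      -- list of K answer strings parsed from one response
    ground_truth -- set of acceptable answers Y* for this prompt
    """
    return sum(1 for a in answers if a in ground_truth)


# Differential-diagnosis style example: two of the three
# candidates are in the ground-truth set, so the reward is 2.
reward = multi_answer_rlvr_reward(
    ["appendicitis", "kidney stone", "gastritis"],
    {"appendicitis", "kidney stone"},
)
```

Note that the sum runs over all $K$ slots, so a model that repeats the same correct answer could be rewarded multiple times; deduplicating `answers` before scoring is one way to close that loophole.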
2. Multi-Answer RLCR (Calibration)
To make the model "know what it knows," the authors add a calibration reward using the Multi-Brier Score. The model must output both an answer and its confidence. If the model says 70% confidence for $a_1$ and $a_1$ is wrong, it gets heavily penalized.
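The calibration penalty can be sketched as a standard per-candidate Brier score: each stated confidence is scored against whether that answer is actually in the ground-truth set. The exact formulation in the paper may differ; this is the textbook squared-error form:

```python
def multi_brier_score(candidates, ground_truth):
    """Sum of squared calibration errors over (answer, confidence) pairs.
    Lower is better; a perfectly calibrated set scores 0.

    candidates   -- list of (answer, confidence) pairs from one response
    ground_truth -- set of acceptable answers Y*
    """
    return sum(
        (p - (1.0 if a in ground_truth else 0.0)) ** 2
        for a, p in candidates
    )


# The example from the text: 70% confidence on a wrong answer
# incurs a (0.7 - 0)^2 = 0.49 penalty.
penalty = multi_brier_score([("a1", 0.7)], {"correct_answer"})
```

Because the Brier score is a proper scoring rule, the model minimizes its expected penalty only by reporting its true belief, which is exactly the "know what it knows" property the authors are after.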
Figure 1: Comparison between standard RL (mode-seeking) and Multi-Answer RL (distribution-seeking).
Experimental Battleground: Medical, QA, and Code
The researchers tested the method on three distinct regimes:
- DDXPlus (Medical): Intrinsically multi-answer (Differential Diagnosis).
- HotPotQA (Ambiguous): Single answer but incomplete info (Epistemic Uncertainty).
- MBPP (Coding): Unambiguous but allows multiple algorithmic paths.
Key Findings:
- Diversity Explosion: In coding (MBPP), Multi-Answer RLVR produced nearly twice as many unique functional implementations compared to the single-answer baseline under the same sampling budget.
- The Efficiency Dividend: Because the model generates multiple answers in one go (sharing the initial "thinking" phase), token usage dropped by 44% in medical tasks.
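The "unique functional implementations" metric can be approximated by running each candidate against the task's unit tests and deduplicating. A minimal sketch, assuming candidates are source strings and deduplication is by normalized source text (a simple proxy; the paper's uniqueness criterion may be behavioral instead):

```python
def unique_correct(candidates, tests):
    """Count textually distinct candidate programs that pass all tests.

    candidates -- list of Python source strings, each defining the target function
    tests      -- list of callables taking the exec namespace, returning bool
    """
    seen, count = set(), 0
    for src in candidates:
        normalized = " ".join(src.split())  # collapse whitespace before dedup
        if normalized in seen:
            continue
        seen.add(normalized)
        try:
            env = {}
            exec(src, env)  # sandboxing omitted for brevity; unsafe on untrusted code
            if all(t(env) for t in tests):
                count += 1
        except Exception:
            pass  # broken candidates simply don't count
    return count


# Four candidates: two distinct correct adds, one duplicate, one buggy.
n = unique_correct(
    [
        "def add(a,b):\n    return a+b",
        "def add(a, b):\n    return b + a",
        "def add(a,b):\n    return a+b",
        "def add(a,b):\n    return a-b",
    ],
    [lambda env: env["add"](2, 3) == 5],
)
```

Under this metric, a mode-collapsed model that emits the same implementation $N$ times scores 1, while a Multi-Answer RL model that spreads its budget over distinct algorithms scores higher at the same cost.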
Figure 2: Multi-Answer RL shows significantly higher coverage and unique correct answers than Single-Answer RL across DDXPlus and MBPP.
Critical Insight: Why it Works
The "magic" happens in the Chain-of-Thought. Forced to produce $K$ answers, the model's reasoning trace becomes an explicit comparison: it internally weighs "Hypothesis A vs. Hypothesis B." This structure prevents the model from committing too early to a single path, a common cause of hallucinations in standard models.
Limitations & Future Work
While promising, the approach has a limitation the authors call the "Sum-to-One" bias: on extremely difficult tasks, models feel pressured to make their total confidence across the $K$ answers sum to 1.0 (100%), even when they are actually clueless about all of them. Future work should focus on better "abstention" mechanisms, where a model can admit it has near-zero confidence in every candidate it generated.
Conclusion
Multi-Answer RL shifts the paradigm of LLM alignment. Instead of teaching a model to be a "confident expert" that gives one answer, it teaches the model to be a "risk-aware analyst" that maps out the entire probability landscape. For high-stakes AI applications in medicine, law, and engineering, this isn't just an optimization—it's a requirement.
