[Survey 2026] Beyond SFT vs. RLHF: A Unified Behavioral Theory of LLM Post-Training
Abstract

This survey proposes a unified framework for Large Language Model (LLM) post-training, categorizing methods by trajectory provenance into off-policy and on-policy learning. It introduces the functional roles of support expansion, policy reshaping, and behavioral consolidation to explain how diverse techniques like SFT, DPO, and RLHF work in concert within modern multi-stage pipelines.

TL;DR

Post-training has evolved from simple fine-tuning into complex, multi-stage "behavioral engineering." This survey moves beyond surface labels like "Instruction Tuning" and categorizes methods by Trajectory Provenance (where the training trajectories come from) and Functional Roles (what the method does to the model's behavior). It aims to provide a rigorous vocabulary for explaining why hybrid pipelines that combine SFT, RL, and distillation have become the standard for state-of-the-art models such as DeepSeek-R1 and Llama-3.

The Motivation: Why Objective Functions Aren't Enough

For years, the community has debated whether DPO is "better" than PPO, or if SFT is "sufficient." However, these comparisons often miss the point: different methods address different behavioral bottlenecks.

The authors argue that we must look at the Occupancy Measure of the model: essentially, which "states" (prefixes) the model visits and which "actions" (tokens) it selects. A model might fail because it never "reaches" the correct reasoning path, or it might reach the path but "branch off" incorrectly. On-policy RL is poorly suited to the first problem, because the model almost never samples the behavior that would be rewarded. Off-policy SFT is poorly suited to the second, because it never observes the model's own mistakes.
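
One common way to write the occupancy measure down is sketched below. The notation is an assumption for illustration (the survey's own symbols may differ): x is a prompt drawn from a distribution D, s_t is the prompt plus the first t generated tokens, and pi is the model's policy.

```latex
% Occupancy of state-action pair (s, a) under policy \pi, summed over
% generation steps t = 0, ..., T-1 (assumed notation, not the survey's).
d_\pi(s, a) \;=\; \mathbb{E}_{x \sim \mathcal{D}}
  \left[ \sum_{t=0}^{T-1} \Pr\left(s_t = s \mid x, \pi\right) \, \pi(a \mid s) \right],
\qquad s_t = (x, y_{<t}).
```

In these terms, the first failure mode is a state-visitation problem (the good prefixes s are visited with probability near zero), while the second is an action-selection problem (pi(a | s) puts too little mass on the right token at prefixes the model does visit).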

The Framework: Provenance, Interface, and Role

The survey introduces a three-layered lens to analyze any post-training intervention:

  1. Trajectory Provenance (The "Where"):
    • Off-Policy: Learning from external traces (Demonstrations, GPT-4 outputs).
    • On-Policy: Learning from the model's own fresh rollouts, including both its errors and its successes.
  2. Supervision Interface (The "How"): The form of the signal—Token targets, pairwise preferences, or verifier rewards.
  3. Functional Roles (The "What"):
    • Support Expansion: Opening up new behavioral doors that were previously locked.
    • Policy Reshaping: Teaching the model to pick the "best" door among those it already knows how to open.
    • Behavioral Consolidation: Ensuring the model doesn't "forget" how to open the door when it's shrunk or moved to a new stage.
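
The three layers above can be read as a small labeling scheme. The sketch below is only an illustration: the class names, the `Intervention` record, and the specific labels given to each stage are assumptions made here, not definitions taken from the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    OFF_POLICY = "off-policy"   # external traces (demonstrations, teacher outputs)
    ON_POLICY = "on-policy"     # fresh rollouts from the model being trained


class Interface(Enum):
    TOKEN_TARGETS = "token targets"
    PAIRWISE_PREFERENCES = "pairwise preferences"
    VERIFIER_REWARDS = "verifier rewards"


class Role(Enum):
    SUPPORT_EXPANSION = "support expansion"
    POLICY_RESHAPING = "policy reshaping"
    BEHAVIORAL_CONSOLIDATION = "behavioral consolidation"


@dataclass(frozen=True)
class Intervention:
    """One post-training stage, described along the three layers."""
    name: str
    provenance: Provenance
    interface: Interface
    role: Role


# Illustrative labels for a three-stage pipeline (assumed, not from the survey).
PIPELINE = [
    Intervention("SFT on demonstrations", Provenance.OFF_POLICY,
                 Interface.TOKEN_TARGETS, Role.SUPPORT_EXPANSION),
    Intervention("RL with a verifier", Provenance.ON_POLICY,
                 Interface.VERIFIER_REWARDS, Role.POLICY_RESHAPING),
    Intervention("Distillation from teacher traces", Provenance.OFF_POLICY,
                 Interface.TOKEN_TARGETS, Role.BEHAVIORAL_CONSOLIDATION),
]

for stage in PIPELINE:
    print(f"{stage.name}: {stage.provenance.value}, "
          f"{stage.interface.value}, {stage.role.value}")
```

Note that the same method can carry different labels depending on how it is used; distillation, for example, can also be run on the student's own rollouts, which would make it on-policy.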

[Figure: Unified overview of the post-training framework]

Methodology: The Strategic Hand-off

The survey demystifies why the most powerful models use a specific sequence:

1. Support Expansion (The SFT Phase)

SFT is the "import" phase. If a model doesn't know how to solve a differential equation, on-policy RL will likely fail because the model will never randomly sample the correct answer (the sparse reward problem). SFT expands the Effective Support, making these behaviors reachable.
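
A quick calculation makes the sparse reward problem concrete. The sketch assumes independent rollouts that each succeed with the same probability; the function name and the example probabilities are hypothetical, chosen only for illustration.

```python
def prob_any_success(p_correct: float, num_rollouts: int) -> float:
    """Probability that at least one of `num_rollouts` independent samples
    is correct, given per-sample success probability `p_correct`."""
    return 1.0 - (1.0 - p_correct) ** num_rollouts


# Hypothetical per-sample success rates before and after SFT.
for label, p in [("before SFT", 1e-6), ("after SFT", 0.05)]:
    chance = prob_any_success(p, num_rollouts=64)
    print(f"{label}: p={p:g}, P(at least one reward in 64 rollouts)={chance:.6f}")
```

When the per-sample success rate is tiny, nearly every batch of rollouts carries zero reward, so on-policy RL has nothing to reinforce. Raising that rate even to a few percent makes a rewarded sample in each batch very likely.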

2. Policy Reshaping (The RL/Preference Phase)

Once the behavior is reachable, the model often struggles with "rollout-dependent failures" (errors that compound over long sequences). On-policy methods (RLHF/RLVR) observe these specific failures in the model's own distribution and correct them.
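
The on-policy loop can be sketched with a toy example: a softmax policy over three actions, a hypothetical 0/1 verifier, and a plain REINFORCE update. This is a minimal illustration under those assumptions, not the RLHF or RLVR procedure of any particular system.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier(action: int, correct_action: int) -> float:
    """Hypothetical verifier: reward 1.0 for the correct action, else 0.0."""
    return 1.0 if action == correct_action else 0.0


def train_on_policy(num_actions=3, correct_action=2, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_actions  # the correct action starts out reachable (p = 1/3)
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the training sample comes from the current policy itself.
        action = rng.choices(range(num_actions), weights=probs)[0]
        reward = verifier(action, correct_action)
        # REINFORCE: reward times the gradient of log pi(action) w.r.t. each logit.
        for k in range(num_actions):
            indicator = 1.0 if k == action else 0.0
            logits[k] += lr * reward * (indicator - probs[k])
    return softmax(logits)


final_probs = train_on_policy()
print([round(p, 3) for p in final_probs])
```

With a 0/1 reward and no baseline, only successful rollouts move the logits; practical methods usually subtract a baseline so that failed rollouts also contribute to the update.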

3. Behavioral Consolidation (The Distillation Phase)

High-performance behaviors are often "fragile" or "expensive" (e.g., requiring o1-style long chains of thought). Consolidation (Distillation) "amortizes" these behaviors, baking them into the model weights so they are robust and efficient during deployment.
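
One common distillation objective is the KL divergence between the teacher's and the student's next-token distributions. The summary does not say which objective the survey has in mind, so the sketch below is an assumption; the logits are made-up numbers over a three-token vocabulary.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up logits for illustration only.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.0]
print(f"KL before matching: {distillation_kl(teacher, student):.4f}")
print(f"KL when student equals teacher: {distillation_kl(teacher, teacher):.4f}")
```

Minimizing this quantity over many positions pushes the student to reproduce the teacher's token-level behavior without paying the teacher's inference cost at deployment.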

[Figure: Comparison of post-training subfamilies]

Critical Insights: The "Consolidation" Bottleneck

The most profound contribution of this paper is the emphasis on Consolidation. In many modern pipelines, we see amazing performance in specialized "experts" (e.g., a math-specialized RL model) that disappears when merged into a generalist model. This "Support Attrition" is the next great frontier. We need better ways to ensure that when we move from one stage to another, we aren't just "reshaping" the model into a narrower and narrower corner until it becomes brittle.

Future Outlook & Open Problems

The survey concludes with a call to action for the research community:

  • Diagnosis: We need "bottleneck-sensitive" metrics to tell us if we should be doing more SFT or more RL.
  • Transferability: Why do some behaviors (like "style") distill easily, while others (like "logical verification") require massive on-policy correction?
  • Interleaved Training: The shift away from rigid SFT → RL sequences toward unified objectives where off-policy guidance and on-policy exploration happen simultaneously.
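
As a rough illustration of what a "bottleneck-sensitive" diagnostic might look like, the sketch below compares pass@1 with pass@k using the standard unbiased pass@k estimator. The decision rule, the thresholds, and the `diagnose` function are assumptions introduced here, not a metric proposed by the survey.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def diagnose(n: int, c: int, k: int, low: float = 0.1, high: float = 0.5) -> str:
    """Hypothetical rule of thumb; the thresholds are arbitrary assumptions."""
    if pass_at_k(n, c, k) < low:
        return "support"    # correct behavior is almost never sampled
    if pass_at_k(n, c, 1) < high:
        return "selection"  # reachable with many samples, rarely picked first
    return "neither"


print(diagnose(n=64, c=0, k=16))
print(diagnose(n=64, c=4, k=16))
print(diagnose(n=64, c=60, k=16))
```

Under this reading, a "support" result would argue for more off-policy data, and a "selection" result for more on-policy correction. A retention check would need a separate comparison of the same metric before and after a training stage.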

Conclusion

This survey is a masterclass in shifting the perspective from "algorithms" to "systems." By viewing post-training as a coordinated effort to expand, reshape, and consolidate behavior, practitioners can move away from "vibe coding" (trial and error with hyperparameters) toward principled behavioral design.

Final Takeaway: Don't just ask if your model is "aligned"; ask if its current bottleneck is Support, Selection, or Retention.
