[ArXiv 2025] DGO: Bridging the Gap Between Experience Utilization and Internalization in LLM Reasoning
Abstract

This paper introduces Dual Guidance Optimization (DGO), a unified reinforcement learning (RL) framework for Large Language Models (LLMs). DGO improves reasoning by integrating external experience (from a dynamic experience bank) and internal knowledge (parametric learning), achieving state-of-the-art performance on reasoning benchmarks like AIME25 and MATH500.

TL;DR

While Reinforcement Learning from Verifiable Rewards (RLVR) has pushed LLM reasoning to new heights, it remains a "brute-force" approximation of human learning. Human learners don't just solve problems; they accumulate external strategies and internalize them into intuition. Dual Guidance Optimization (DGO) formalizes this by creating a closed-loop system where a model explores under the guidance of an "Experience Bank" and then distills those successful paths into its own weights, effectively turning transient exploration into permanent capability.

Problem & Motivation: The Limitations of "Experience-Blind" RL

Standard RL approaches (e.g., GRPO) treat training trajectories as one-off gradient updates. Once a batch is processed, the specific "how-to" of that reasoning path is often lost. This leads to two major issues:

  1. Limited Exploration: Without external guidance, models are stuck in their own "semantic bubbles," struggling to find complex reasoning paths.
  2. Poor Retention: Models may "forget" specialized strategies once the RL distribution shifts, as the knowledge isn't deeply internalized.

The authors observe that humans use Dual Guidance: External (books, mentors) and Internal (memory/intuition). DGO aims to replicate this by coupling utilization (using a memory bank to guide RL) with internalization (baking that memory into parameters).

Methodology: The DGO Closed-Loop

DGO operates in an iterative three-stage cycle:

1. Experience Construction

Instead of storing raw solution text, DGO uses an Experience Generator to extract transferable triplets: (context, condition, action). This prevents the model from simply "memorizing answers" and encourages it to learn structural strategies (e.g., "WHEN dealing with roots of unity, THEN factorize the polynomial...").
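
To make the triplet format concrete, here is a minimal Python sketch of one experience entry and a toy bank. The class names, the prompt rendering, and the keyword-overlap retriever are illustrative assumptions, since the paper does not publish its data structures.

```python
from dataclasses import dataclass

# A minimal sketch of one experience entry, assuming the paper's
# (context, condition, action) triplet format. Field names and the
# retrieval scheme are illustrative, not the authors' implementation.
@dataclass(frozen=True)
class Experience:
    context: str    # problem family, e.g. "polynomial root-finding"
    condition: str  # trigger, e.g. "WHEN the roots are roots of unity"
    action: str     # strategy, e.g. "THEN factorize the polynomial"

    def as_prompt(self) -> str:
        # Render the triplet as one guidance line for the RL prompt.
        return f"[{self.context}] {self.condition}, {self.action}."


class ExperienceBank:
    """Toy store of deduplicated experience triplets."""

    def __init__(self) -> None:
        self._entries: list[Experience] = []

    def add(self, exp: Experience) -> None:
        if exp not in self._entries:  # skip exact duplicates
            self._entries.append(exp)

    def retrieve(self, query: str, k: int = 3) -> list[Experience]:
        # Naive keyword overlap as a stand-in for a real retriever.
        words = set(query.lower().split())
        scored = sorted(
            self._entries,
            key=lambda e: -len(words & set(e.context.lower().split())),
        )
        return scored[:k]
```

Storing conditions and actions separately (rather than whole solutions) is what keeps the bank transferable across problems that share a trigger.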

2. Joint Trajectory-Policy Refinement

The model explores new solutions using a "Mixed Distribution." Early in training, it relies heavily on the Experience Bank. Through Experience Annealing, the reliance on external prompts is gradually reduced, forcing the model to lean on its evolving internal logic.
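
A minimal sketch of how such an annealing schedule might look, assuming a linear decay; the endpoints, schedule shape, and prompt format are assumptions, as the paper only states that reliance on the bank is gradually reduced.

```python
import random

def guidance_probability(step: int, total_steps: int,
                         p_start: float = 0.9, p_end: float = 0.0) -> float:
    # Linear decay from p_start to p_end; the schedule shape and endpoints
    # are assumptions -- the paper only says guidance is gradually reduced.
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def build_rollout_prompt(question: str, hints: list[str],
                         step: int, total_steps: int) -> str:
    # Sample from the "Mixed Distribution": with probability p the rollout
    # is conditioned on retrieved experience, otherwise on the bare problem.
    if hints and random.random() < guidance_probability(step, total_steps):
        return ("Relevant experience:\n" + "\n".join(hints)
                + f"\n\nProblem: {question}")
    return f"Problem: {question}"
```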

Fig. 1: Overall architecture of the DGO framework: Construction -> Refinement -> Internalization.

3. Experience Internalization

This is the "magic" step. To prevent the model from becoming a "prompt-slave" (only working when experience is provided), DGO rewrites successful trajectories to remove phrases like "referring to experience." These cleaned, "internalized" trajectories are then used for supervised distillation.
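
A rough sketch of this rewriting step, assuming a simple regex-based cleanup; the paper does not spell out the rewriting mechanism, so the patterns below are hypothetical stand-ins.

```python
import re

# Hypothetical cleanup patterns; the paper rewrites successful trajectories
# to drop explicit experience references, but the exact rules are not given.
_EXPERIENCE_REFS = [
    r"referring to (the )?(provided )?experience[,.]?\s*",
    r"as (the )?experience suggests[,.]?\s*",
    r"based on (the )?retrieved experience[,.]?\s*",
]

def internalize(trajectory: str) -> str:
    """Strip experience-referencing phrases so the trajectory reads as if
    the model reasoned unaided; the result becomes distillation data."""
    cleaned = trajectory
    for pattern in _EXPERIENCE_REFS:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

def build_sft_pairs(successes: list[tuple[str, str]]) -> list[dict]:
    # (question, successful guided trajectory) -> plain-prompt SFT examples.
    return [{"prompt": f"Problem: {q}", "completion": internalize(t)}
            for q, t in successes]
```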

Experiments: Proving the Synergy

DGO was tested on Qwen3 backbones across challenging math (AIME, MATH500) and science (GPQA) benchmarks.

Key Findings:

  • Intrinsic Boost: Even without any experience prompts at test time, DGO outperforms GRPO (e.g., +2.27% on the 8B model).
  • Test-Time Scaling (TTS): When given access to the experience bank at inference, DGO models show large additional gains, reaching an average of 44.78% with the 14B model.
  • Robustness: Unlike standard models, DGO is surprisingly "noise-resistant." When fed irrelevant history, its performance degrades far less than GRPO-trained models, showing it has learned to selectively utilize information.

Fig. 2: Performance comparison of DGO vs. baselines across model scales. DGO (Zero) and DGO (TTS) represent the intrinsic and guided modes.

Deep Insight: Expanding the Semantic Space

One of the most striking visualizations in the paper is the t-SNE plot of reasoning trajectories. DGO-trained models produce "rare trajectories" that are semantically distinct from those generated by standard RL.

Fig. 3: t-SNE visualization showing DGO discovering reasoning "islands" that standard RL (GRPO) never visits.

Essentially, the external experience bank acts as a "scaffold" that allows the model to reach far-away clusters of correct logic. Once the path is found, the internalization stage "builds a bridge," allowing the model to return to that logic cluster later without the scaffold.
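
For readers who want to probe this themselves, here is a minimal sketch of producing a Fig. 3-style plot from two sets of trajectories; the embedding model and t-SNE settings are assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

def plot_trajectory_space(grpo_trajs: list[str], dgo_trajs: list[str]) -> None:
    # Embed every trajectory, project to 2-D, and color by training method.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
    embs = np.asarray(encoder.encode(grpo_trajs + dgo_trajs))
    tsne = TSNE(n_components=2, perplexity=min(30, len(embs) - 1),
                random_state=0)
    pts = tsne.fit_transform(embs)
    n = len(grpo_trajs)
    plt.scatter(pts[:n, 0], pts[:n, 1], s=8, label="GRPO")
    plt.scatter(pts[n:, 0], pts[n:, 1], s=8, label="DGO")
    plt.legend()
    plt.title("Reasoning-trajectory space (t-SNE)")
    plt.show()
```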

Takeaway & Conclusion

DGO represents a shift from "Training as Optimization" to "Training as Experiential Learning." By formalizing how models use and absorb external prompts, DGO creates LLMs that are not just smarter at inference, but more efficient at learning.

Limitations: Currently, the Experience Generator is a separate fine-tuned model (Qwen3-8B). Future iterations could see models managing and updating their own experience banks fully autonomously, leading to truly self-evolving agents.
