[ICML 2025] MPCwDWM: Elevating Offline RL via Differentiable Diffusion "Thought" at Inference Time
Abstract

This paper introduces MPCwDWM (Model Predictive Control with Differentiable World Models), an inference-time adaptation framework for offline Reinforcement Learning. It utilizes a differentiable diffusion-based world model and a reward predictor to refine pre-trained policy parameters on-the-fly via gradient-based optimization through imagined rollouts, achieving new SOTA results on D4RL benchmarks.

TL;DR

One of the greatest limitations of Offline Reinforcement Learning is the "train-then-freeze" paradigm. Once a policy is learned from static data, it is deployed as a static artifact. MPCwDWM breaks this mold by introducing inference-time adaptation. By unrolling a differentiable diffusion-based world model at test time, the agent "imagines" a short-term future and uses backpropagation to optimize its own policy parameters before taking a single real-world action.

The "Static Policy" Bottleneck

In offline RL, we often struggle with two ghosts: Distribution Shift and Long-Horizon Credit Assignment.

  1. Distribution Shift: The policy might encounter states at test time that were sparse in the training data, leading to erratic behavior.
  2. Long-Horizon Difficulty: Q-values (critics) are notoriously hard to estimate because they compress an entire discounted sum of future rewards into a single scalar.

The authors' insight is brilliant: while long-term value is hard to estimate, local dynamics (transitions and rewards) are easier to learn and policy-independent. By using Model Predictive Control (MPC), we can use these "local" models to build a "global" understanding on-the-fly.

Methodology: The Differentiable World Model (DWM)

The heart of the paper lies in making the entire world model—specifically a Diffusion Model—fully differentiable.

1. Reverse Diffusion as a Computation Graph

Standard diffusion models generate samples. Here, the authors treat the reverse diffusion step $s_{t+1} = f_\theta(s_t, a_t, \epsilon_t)$ as a deterministic function for a fixed noise $\epsilon_t$. This allows them to compute the Jacobian $\nabla_a f_\theta$, meaning we can ask: "How would the next state change if I tweaked the current action slightly?"
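The fixed-noise trick can be illustrated with a minimal sketch. The denoiser below is a toy stand-in (the paper's actual diffusion architecture is not shown here); the point is only that once $\epsilon$ is frozen, the step is an ordinary differentiable function and autograd can answer the "tweak the action" question directly.

```python
import torch

# Toy stand-in for one reverse-diffusion denoising step (illustrative only).
# With the noise eps held fixed, the step is a deterministic, differentiable
# map (s_t, a_t, eps) -> s_{t+1}, so autograd can give d s_{t+1} / d a_t.
class ReverseDiffusionStep(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim + action_dim + state_dim, hidden),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a, eps):
        # Deterministic given fixed eps: no sampling inside the forward pass.
        return self.net(torch.cat([s, a, eps], dim=-1))

state_dim, action_dim = 4, 2
f = ReverseDiffusionStep(state_dim, action_dim)
s = torch.randn(1, state_dim)
a = torch.randn(1, action_dim, requires_grad=True)
eps = torch.randn(1, state_dim)  # frozen noise (reparameterization)

s_next = f(s, a, eps)
# Jacobian of the next state w.r.t. the action:
# "how would s_{t+1} change if I tweaked a_t slightly?"
jac = torch.autograd.functional.jacobian(lambda a_: f(s, a_, eps), a)
print(jac.shape)  # (1, state_dim, 1, action_dim)
```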

2. The MPC Objective

At each step $t$ in the real environment, the agent doesn't just act. It runs $E$ steps of internal "budgeted thinking":

  • Rollout: Generates $M$ imagined trajectories of length $H$.
  • Evaluate: Calculates a surrogate return using a reward model $r_\xi$ for immediate steps and a terminal critic $Q_\phi$ for the remainder.
  • Adapt: Backpropagates through the entire chain of imagined states to update the policy parameters $\psi$.
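The three steps above can be sketched as a single inner loop. Everything here is an illustrative stand-in (linear networks, made-up dimensions, a plain Adam update), not the paper's implementation; it only shows the shape of the computation: roll out $H$ imagined steps through a frozen world model, score them with $r_\xi$ plus a terminal $Q_\phi$, and backpropagate into the policy parameters $\psi$.

```python
import torch

# Illustrative dimensions and budget (not from the paper).
state_dim, action_dim, H, M, E, gamma = 4, 2, 5, 8, 3, 0.99

policy = torch.nn.Linear(state_dim, action_dim)                    # pi_psi (adapted)
world_model = torch.nn.Linear(state_dim + action_dim, state_dim)   # f_theta (frozen)
reward_model = torch.nn.Linear(state_dim + action_dim, 1)          # r_xi (frozen)
critic = torch.nn.Linear(state_dim + action_dim, 1)                # Q_phi (frozen)
for m in (world_model, reward_model, critic):
    m.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
s0 = torch.randn(M, state_dim)  # M imagined copies of the current state

for _ in range(E):  # E steps of "budgeted thinking"
    s = s0
    ret = torch.zeros(M, 1)
    for h in range(H):  # Rollout: H imagined steps
        a = torch.tanh(policy(s))
        ret = ret + (gamma ** h) * reward_model(torch.cat([s, a], -1))
        s = world_model(torch.cat([s, a], -1))  # gradient flows through here
    a_H = torch.tanh(policy(s))
    # Evaluate: terminal critic covers the remainder of the horizon
    ret = ret + (gamma ** H) * critic(torch.cat([s, a_H], -1))
    loss = -ret.mean()  # maximize the surrogate return
    opt.zero_grad()
    loss.backward()     # Adapt: backprop through the chain of imagined states
    opt.step()
```

Only after this inner loop finishes does the agent commit to a real action with the freshly adapted policy.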

Figure 1: The MPCwDWM pipeline. Red dashed arrows represent the gradient flow back through time, allowing the policy $\pi_\psi$ to optimize its parameters $\psi$ based on imagined outcomes.

Quantitative Dominance

The results on the D4RL benchmark demonstrate that this "test-time thinking" pays off significantly.

  • SOTA Achievement: MPCwDWM reaches an average of 85.33 on MuJoCo, consistently outperforming strong baselines like ReBRAC and IQL.
  • The AntMaze Breakthrough: In sparse-reward AntMaze environments, the method shows massive jumps (e.g., +15 points in large-diverse), where precise planning is more critical than raw imitation.

Performance Comparison Table

The following table highlights how MPCwDWM (rightmost column) consistently edges out previous SOTA methods across diverse locomotion tasks.

Table 1: D4RL Normalized Scores. Note the consistent gains over ReBRAC, which serves as the pre-trained base policy.

Critical Insight: Why Does This Work?

Why is optimizing the policy at inference time better than just using a diffusion planner (like Diffuser) to pick an action?

The authors argue that backpropagating into policy parameters provides a smoother optimization landscape and utilizes the structural inductive bias of the policy network, whereas pure action-sequence optimization can be noisy. Furthermore, by using a terminal critic $Q_\phi$, they combine the precision of short-horizon planning with the broad "intuition" of the value function.
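Putting the pieces together, the surrogate objective descended at test time plausibly takes the following form (our paraphrase of the rollout-plus-terminal-critic recipe; symbols as defined earlier, not the paper's exact equation):

```latex
J(\psi) = \mathbb{E}_{\epsilon}\!\left[\sum_{h=0}^{H-1} \gamma^{h}\, r_\xi\big(\hat{s}_h, \hat{a}_h\big) + \gamma^{H}\, Q_\phi\big(\hat{s}_H, \hat{a}_H\big)\right],
\qquad \hat{a}_h = \pi_\psi(\hat{s}_h), \quad \hat{s}_{h+1} = f_\theta(\hat{s}_h, \hat{a}_h, \epsilon_h)
```

The first term is the "precise" short-horizon part grounded in the learned reward model; the second is the critic's "intuition" about everything beyond the planning horizon $H$.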

Future Outlook & Limitations

  • Computational Overhead: Calculating gradients through diffusion rollouts at every step is expensive. The authors suggest "multi-step" generative models or "flow matching" as potential speed-ups.
  • Model Accuracy: If the world model $f_\theta$ is biased, the policy might "optimize" into hallucinations. Robustness to model error remains a frontier.

Conclusion

MPCwDWM represents a shift from "reactive" RL to "deliberative" RL. By merging the generative prowess of Diffusion Models with the classical rigors of MPC, it provides a robust recipe for high-performance offline agents that can adapt to the nuances of the environment they are actually experiencing.
