DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

[CVPR/CoRL 2024] DreamPlan: Teaching VLMs Physical Common Sense via Video "Imagination"

Abstract

DreamPlan is a framework for efficiently fine-tuning Vision-Language Model (VLM) planners on complex robotic tasks such as deformable object manipulation. It uses an action-conditioned video world model to simulate physical dynamics, so that reinforcement fine-tuning via Odds Ratio Policy Optimization (ORPO) runs entirely within a "virtual imagination" rather than through costly real-world trials.

TL;DR

Large Vision-Language Models (VLMs) are great at reasoning but terrible at physics. They might know that a cloth needs folding, but they don't know how the fabric will bunch up when pulled. DreamPlan addresses this by training a video world model to act as a "mental simulator." By practicing in this virtual imagination using Odds Ratio Policy Optimization (ORPO), a VLM planner can learn complex deformable object manipulation without the risk or cost of real-world trial and error during policy training.

Background: The Gap Between Semantics and Physics

Zero-shot VLMs like GPT-4o or Qwen-VL can generate high-level plans, but they often lack physical grounding. In tasks involving deformable objects (rope, cloth, soft toys), the dynamics are highly non-linear: a millimeter of difference in a "grasp-and-pull" action can produce a completely different topological state.

Current solutions usually rely on physics simulators (like SoftGym), but policies trained in them often fail to transfer from simulation to the real world. DreamPlan takes a different route: learning the simulator itself from real-world video data.

Methodology: Dreaming of Success

The DreamPlan pipeline consists of three distinct stages:

1. Exploratory Data Collection

The system uses a zero-shot VLM to perform random or sub-optimal interactions in the real world. Even if the robot fails to fold the cloth, the resulting video data is a goldmine for learning causality, because it shows the world model exactly how the object reacts to specific actions.

2. The Action-Conditioned World Model

To make the world model "controllable," the authors render the kinematic configuration of the robot arms as visual cues. A ControlNet-style branch integrated with the CogVideoX-5B diffusion backbone consumes these rendered trajectories, and the model learns to predict how the object deforms in response to them.
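
As a rough illustration of what "rendering actions as visual cues" can mean, the sketch below rasterizes per-frame 2D gripper positions into binary control frames. This is an assumption-laden simplification, not the authors' renderer: the source does not specify the rendering format, and `render_control_frames`, the disk-shaped markers, and the `(x, y)` pixel-coordinate input are all hypothetical choices made for this example.

```python
def render_control_frames(trajectory, height=64, width=64, radius=2):
    """Rasterize gripper keypoints into one binary control frame per time step.

    trajectory: list of frames; each frame is a list of integer (x, y) pixel
    coordinates, one per gripper (hypothetical input format).
    Returns a list of height x width grids with 1.0 inside each marker disk.
    """
    frames = []
    for keypoints in trajectory:
        frame = [[0.0] * width for _ in range(height)]
        for x, y in keypoints:
            # Only visit the bounding box of the disk, clipped to the image.
            for row in range(max(0, y - radius), min(height, y + radius + 1)):
                for col in range(max(0, x - radius), min(width, x + radius + 1)):
                    if (col - x) ** 2 + (row - y) ** 2 <= radius ** 2:
                        frame[row][col] = 1.0
        frames.append(frame)
    return frames


# Toy trajectory: two grippers moving toward each other over four frames.
trajectory = [[(20 + 2 * t, 32), (44 - 2 * t, 32)] for t in range(4)]
control_video = render_control_frames(trajectory)
```

In a setup like the one described, a stack of such frames would play the role of the conditioning signal for the video model; a real renderer would presumably draw the full arm geometry rather than disks.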

Fig 1: The DreamPlan framework, from zero-shot proposals to world model training and preference-based alignment.

3. Policy Alignment via Imagination

Instead of slow, online RL, DreamPlan uses a Best-of-K strategy:

The VLM proposes $K$ possible actions.
The World Model "dreams" the outcome for each.
A high-level evaluator (like GPT-4o) picks the best visual outcome.
The VLM is fine-tuned using ORPO to prefer the successful action over the failures.
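
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `best_of_k_pairs`, `imagine`, and `score` are hypothetical names, the world model and evaluator are replaced by toy callables, and pairing the single best action against every other candidate is one plausible reading of "prefer the successful action over the failures." The loss follows the published ORPO formulation (a supervised term on the chosen response plus a weighted odds-ratio term over length-normalized log-probabilities); the weight `lam=0.1` is an assumed value.

```python
import math


def best_of_k_pairs(candidates, imagine, score):
    """Build (chosen, rejected) preference pairs from K candidate actions.

    imagine: stand-in for the world model, maps an action to a predicted outcome.
    score:   stand-in for the evaluator, maps an outcome to a number (higher is better).
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    scores = [score(imagine(action)) for action in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return [(candidates[best], candidates[i])
            for i in range(len(candidates)) if i != best]


def log_odds(avg_logprob):
    """log(p / (1 - p)) for p = exp(avg_logprob); requires avg_logprob < 0."""
    return avg_logprob - math.log1p(-math.exp(avg_logprob))


def orpo_loss(chosen_token_logprobs, rejected_token_logprobs, lam=0.1):
    """ORPO objective for one pair: SFT loss plus lam times the odds-ratio loss."""
    # Length-normalized sequence log-probabilities.
    lp_chosen = sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    lp_rejected = sum(rejected_token_logprobs) / len(rejected_token_logprobs)
    log_odds_ratio = log_odds(lp_chosen) - log_odds(lp_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_ratio_loss = math.log1p(math.exp(-log_odds_ratio))
    sft_loss = -lp_chosen
    return sft_loss + lam * odds_ratio_loss


# Toy usage: actions are pull distances, and the "dream" is the remaining fold error.
pairs = best_of_k_pairs(
    candidates=[0.05, 0.12, 0.20],
    imagine=lambda pull: abs(0.10 - pull),
    score=lambda error: -error,
)
loss = orpo_loss([-0.1, -0.2], [-1.0, -2.0])
```

Because ORPO needs no separate reference model, each preference pair contributes a single forward pass of the policy for the chosen and rejected actions, which fits the offline, imagination-only training loop described here.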

Experimental Results: Small Models, Big Performance

The most striking result is the efficiency gain. By "internalizing" the physics during fine-tuning, the VLM no longer needs to query the world model at inference time.

Performance Boost: DreamPlan improved success rates by up to 40% over zero-shot baselines.
Model Scaling: A fine-tuned 8B model (Qwen3-VL-8B) consistently beat the 32B version of the same model that hadn't undergone "imagination training."
Speed: Inference time dropped from ~900 seconds (using explicit verification) to 1.12 seconds, roughly an 800-fold reduction.

Fig 2: Qualitative comparisons show that DreamPlan's world model (top) generates much more physically plausible deformations than standard video generation baselines.

Critical Analysis: Why This Matters

The core insight of DreamPlan is that sub-optimal data is sufficient for world modeling. You don't need expert demonstrations to learn how physics works; you just need to see enough examples of "if I do X, Y happens."

Limitations:

The current framework still requires a few hundred real-world trajectories (approx. 4 hours) to train the initial world model.
As a "discrete" keypoint-based planner, it may struggle with tasks requiring continuous, high-frequency reactive control (like catching a falling object).

Conclusion

DreamPlan provides a blueprint for the next generation of "Physically Intelligent" agents. By decoupling the learning of physics (via video models) from the learning of policy (via ORPO), it achieves a level of sample efficiency and grounded reasoning that zero-shot foundation models currently lack.

Fig 3: Success rate improvements across various deformable manipulation tasks.
