This paper introduces Drifting Field Policy (DFP), a one-step generative policy for RL finetuning that bypasses ODE-based trajectories entirely. By framing policy updates as a Wasserstein-2 gradient flow, DFP achieves state-of-the-art performance on Robomimic and OGBench, outperforming diffusion- and flow-based policies in both efficiency and success rate.
TL;DR
While Diffusion and Flow-matching have dominated the generative policy landscape, they suffer from a "structural burden": they model actions as the end-point of an ODE trajectory. Drifting Field Policy (DFP) breaks this by treating policy updates as a direct Wasserstein-2 Gradient Flow in probability space. This eliminates the need for time-indexed velocity fields, allowing 1-step policies to match or beat multi-step SOTA on complex manipulation tasks with significantly faster finetuning.
Problem & Motivation: The ODE Tax
Most modern robotic policies are "trapped" in time. Whether built on Diffusion or Flow-matching, the model essentially predicts a velocity that must be integrated into an action. When we try to finetune these models with Reinforcement Learning (RL), we face a credit assignment nightmare: the reward is received at the level of the final action, but the gradient must flow back through every "virtual" timestep of the ODE.
Even "1-step" distillation models (like Consistency Models or MeanFlow) carry this baggage because their training objective is still defined along that trajectory. The authors identify this as the reason online adaptation is often slow and unstable for generative agents.
Methodology: The Power of the Drifting Field
DFP replaces the ODE with a Drifting Model. In this paradigm, a single-pass network maps samples from a noise prior directly to actions. The "training magic" happens through a Drifting Field.
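For concreteness, here is a minimal sketch of what such a single-pass policy could look like. The architecture, layer sizes, and names are illustrative assumptions, not the paper's actual model:

```python
import torch
import torch.nn as nn

class OneStepPolicy(nn.Module):
    """Single forward pass from (observation, noise) to action.

    No time conditioning and no ODE integration, in contrast to
    diffusion / flow-matching policies."""

    def __init__(self, obs_dim: int, act_dim: int, noise_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        z = torch.randn(obs.shape[0], self.noise_dim, device=obs.device)  # sample noise prior
        return self.net(torch.cat([obs, z], dim=-1))                      # action in one pass
```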
1. The Physical Intuition
Think of the policy distribution as a cloud of particles acted on by two forces (a toy code sketch follows the list).
- Attraction: Pulls particles toward "good" high-reward actions.
- Repulsion: Pushes particles away from the current distribution to prevent mode collapse.
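Below is a toy rendering of these two forces in code. The RBF kernel, the equal weighting, and the SVGD-style form are illustrative assumptions; the paper's exact field is not reproduced here.

```python
import torch

def drifting_field(actions: torch.Tensor, targets: torch.Tensor, bandwidth: float = 0.5) -> torch.Tensor:
    """Toy attraction/repulsion field over a batch of action 'particles'.

    actions: (N, D) current samples from the policy.
    targets: (K, D) high-reward anchors (e.g., Top-K critic picks).
    """
    def rbf(x, y):
        # Squared pairwise distances -> RBF kernel weights.
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * bandwidth ** 2))

    # Attraction: each particle is pulled toward the high-reward anchors.
    k_at = rbf(actions, targets)                                   # (N, K)
    attract = (k_at.unsqueeze(-1) * (targets.unsqueeze(0) - actions.unsqueeze(1))).mean(dim=1)

    # Repulsion: particles push each other apart, preserving multimodality.
    k_re = rbf(actions, actions)                                   # (N, N)
    repel = (k_re.unsqueeze(-1) * (actions.unsqueeze(1) - actions.unsqueeze(0))).mean(dim=1)

    return attract + repel                                         # (N, D) net drift
```

Taking a small step `actions + step_size * drifting_field(actions, targets)` moves the particle cloud toward the anchors while the repulsion term keeps it spread out.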
2. Wasserstein Gradient Flow
The authors prove that this drifting field is equivalent to the steepest-descent direction toward an optimal policy (the "Soft Target") in the space of probability measures. Mathematically, they decompose the update into two components (a sketch of where they come from follows the figure below):
- A Q-Ascent: Moving toward higher value.
- A Score Matching Regularizer: Keeping the update within a trust region.
*Figure: The structural decomposition of the DFP update, showing the Q-ascent and trust-region components.*
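To see why these two terms appear, one can sketch the generic Wasserstein-2 gradient flow of a KL divergence. Assuming a soft target of the common form $\pi^\star(a \mid s) \propto \pi_\theta(a \mid s)\,\exp(Q(s,a)/\alpha)$ (this exact form and the temperature $\alpha$ are our notation, not necessarily the paper's):

```latex
% W2 gradient flow of KL(\rho \| \pi^\star): particles move with velocity
%   v(a) = -\nabla_a \log\big( \rho(a) / \pi^\star(a \mid s) \big).
% Substituting \pi^\star(a \mid s) \propto \pi_\theta(a \mid s)\exp(Q(s,a)/\alpha):
v(a) = \underbrace{\tfrac{1}{\alpha}\,\nabla_a Q(s,a)}_{\text{Q-ascent}}
     + \underbrace{\nabla_a \log \pi_\theta(a \mid s) - \nabla_a \log \rho(a)}_{\text{score-matching / trust-region term}}
```

The first term climbs the critic; the second vanishes only when the sample distribution $\rho$ matches the current policy, so it acts as a trust region.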
3. The Top-K Surrogate
Since the "ideal" target policy is mathematically intractable, DFP uses a clever trick: it samples a batch of candidate actions, keeps the Top-K according to the critic's Q-values, and uses those as the "attractive" targets for the drifting field. This makes the implementation remarkably simple, almost like Behavior Cloning on self-generated "best" actions.
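A hypothetical training step under this reading (the function names, candidate counts, and the nearest-anchor regression loss are illustrative assumptions; the repulsion term is omitted for brevity):

```python
import torch

def topk_drifting_step(policy, critic, obs, optimizer, n_candidates=32, k=4):
    """One illustrative update: sample candidates, keep the K best under the
    critic, and regress fresh policy samples toward those anchors."""
    B = obs.shape[0]
    with torch.no_grad():
        obs_rep = obs.repeat_interleave(n_candidates, dim=0)       # (B*N, obs_dim)
        cands = policy(obs_rep)                                    # (B*N, act_dim)
        q = critic(obs_rep, cands).view(B, n_candidates)           # critic scores
        cands = cands.view(B, n_candidates, -1)
        idx = q.topk(k, dim=1).indices                             # (B, K) best candidates
        targets = torch.gather(
            cands, 1, idx.unsqueeze(-1).expand(-1, -1, cands.shape[-1])
        )                                                          # (B, K, act_dim)

    actions = policy(obs)                                          # fresh samples, (B, act_dim)
    d2 = ((actions.unsqueeze(1) - targets) ** 2).sum(dim=-1)       # (B, K) distances
    loss = d2.min(dim=1).values.mean()                             # pull toward nearest anchor

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```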
Experiments & Results
The authors tested DFP on 12 demanding tasks across Robomimic and OGBench.
SOTA Performance
DFP achieved an average success rate of 95.8%, dominating baselines like QC-FQL and MVP (Mean Velocity Policy). In complex tasks like cube-triple-task4 and cube-quadruple-task3, DFP showed a massive jump in performance (e.g., from ~47% to ~96%).

Why Drifting Beats MeanFlow
An ablation study revealed a key insight: when applying the same "Top-K" supervision to an ODE-based backbone (MeanFlow), the gains were marginal (+2.4 percentage points). On the Drifting backbone, however, the gain was substantial (+7.4 points). This confirms that probability-space descent is fundamentally more compatible with RL rewards than velocity-field re-fitting.
*Figure: DFP (purple) converges faster and to higher success rates than the MeanFlow-based MVP (red) across tasks.*
Critical Analysis & Conclusion
Takeaways
- Direct is Better: For 1-step policies, eliminating the ODE simplifies the optimization landscape significantly.
- Innate Diversity: The built-in repulsion mechanism of Drifting models naturally handles multimodality without the overhead of diffusion.
Limitations
- Critic Dependence: Like all actor-critic methods, DFP's performance is capped by the quality of the learned Q-function.
- Sim-to-Real: While the results in simulation are stunning, the impact of real-world noise on the kernel-based drifting field remains to be seen.
In conclusion, DFP represents a significant pivot in generative robot control. By moving from "simulating a process" to "flowing a distribution," it provides a more robust and efficient foundation for agents that must learn and adapt in the real world.
