Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

FAN: Breaking the Speed-Expressivity Tradeoff in Offline RL


Abstract

This paper introduces Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient offline reinforcement learning algorithm. It combines expressive flow policies with distributional critics while achieving state-of-the-art results on D4RL and OGBench benchmarks with significantly reduced computational overhead.

TL;DR

Expressive models (like Flow Matching) and Distributional Critics are the "gold standard" for high-performance Offline RL, but they are notoriously slow. Flow-Anchored Noise-conditioned Q-Learning (FAN) changes the game by achieving SOTA performance while being 5-14x faster than existing distributional methods. It achieves this through two breakthroughs: Flow Anchoring (no more ODE solving for regularization) and Noise-conditioned Critics (captures the return distribution with a single noise sample).

Background: The Cost of Expressivity

In Offline RL, we don't have the luxury of environment interaction. We must learn from a fixed dataset, which often contains multi-modal behavior (multiple ways to solve a task). This has led to the rise of:

Flow Matching/Diffusion Policies: To model complex action distributions.
Distributional Critics: To understand the uncertainty and range of future returns.

However, these are "compute hogs." A Flow policy usually requires 10+ steps of an ODE solver to produce an action, and a distributional critic needs to calculate dozens of quantiles. This makes training slow and real-time inference nearly impossible on limited hardware.
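To make the cost concrete, here is a minimal sketch of multi-step flow sampling. It assumes a toy one-dimensional, straight-line velocity field; `toy_velocity`, `TARGET_ACTION`, and `euler_sample` are hypothetical names for illustration, not part of the paper. Each Euler step costs one evaluation of the velocity network, so a 10-step sampler pays 10 forward passes for every action.

```python
import random

TARGET_ACTION = 0.7  # hypothetical dataset action for this toy example


def toy_velocity(x, t):
    """Toy straight-line flow that carries any start point to TARGET_ACTION at t=1."""
    return (TARGET_ACTION - x) / (1.0 - t)


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    evals = 0  # number of velocity-network evaluations
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
        evals += 1
    return x, evals


rng = random.Random(0)
noise = rng.gauss(0.0, 1.0)           # start of the flow: Gaussian noise
action, evals = euler_sample(toy_velocity, noise, num_steps=10)
print(f"action={action:.4f} after {evals} velocity evaluations")
```

A one-step policy replaces the loop with a single forward pass, which is the saving FAN is after.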

The Problem: Why is Everyone So Slow?

The bottleneck is iterative sampling. Previous methods like Value Flows or CODAC require multiple samples to either regularize the policy or estimate the value. The central question of FAN is: Can we keep the mathematical expressivity of these models while using only a single sample?

Methodology: The FAN Architecture

FAN introduces a "lean but powerful" architecture that targets both the Actor and the Critic.

1. Flow Anchoring (The Actor Efficiency)

Instead of solving the full ODE to find where an action "lands" in the distribution, FAN regularizes the one-step policy $\pi_{\omega}$ by comparing its trajectory velocity directly against the behavior flow's velocity field $v_{\theta}$.

This is called Flow Anchoring. It effectively "anchors" the learned policy to the dataset's flow without ever needing to run the flow to completion.
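The idea can be sketched in a few lines under one loudly stated assumption: that the "trajectory velocity" of a one-step policy is the constant velocity of the straight line from the noise to the produced action. The paper's exact loss may differ. The functions `behavior_velocity`, `one_step_policy`, and `anchoring_loss` are hypothetical toy stand-ins with scalar states and actions, not the authors' implementation.

```python
def behavior_velocity(state, x, t):
    """Hypothetical stand-in for a pretrained behavior field v_theta(s, x, t).

    In this toy, dataset actions equal 0.5 * state, and the field is the
    straight-line flow toward that action.
    """
    target = 0.5 * state
    return (target - x) / (1.0 - t)


def one_step_policy(state, noise, w):
    """Hypothetical one-step policy: a = w[0] * state + w[1] * noise."""
    return w[0] * state + w[1] * noise


def anchoring_loss(state, noise, t, w):
    """Squared gap between the policy's line velocity and the behavior velocity.

    Assumption: the policy's trajectory is the straight line noise -> action,
    so its velocity is (action - noise) at every time t in [0, 1).
    """
    action = one_step_policy(state, noise, w)
    policy_velocity = action - noise
    x_t = (1.0 - t) * noise + t * action  # point on the line at time t
    return (policy_velocity - behavior_velocity(state, x_t, t)) ** 2


state, noise, t = 1.0, -0.4, 0.3
loss_matched = anchoring_loss(state, noise, t, w=(0.5, 0.0))   # policy outputs the dataset action
loss_mismatch = anchoring_loss(state, noise, t, w=(0.0, 1.0))  # policy just echoes the noise
print(loss_matched, loss_mismatch)
```

The loss needs only one evaluation of the behavior field per sample, with no ODE loop, which is the property the section describes.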

Figure: Overall architecture of FAN.

2. Noise-conditioned Critic & $T_{n}^{\pi}$ (The Critic Efficiency)

Standard distributional RL models quantiles ($Q(s, a, \tau)$). FAN models the value as $Q(s, a, \epsilon)$, where $\epsilon$ is simple Gaussian noise.

To train this efficiently, the authors propose a new operator, $T_{n}^{\pi}$: $T_{n}^{\pi} Q(s, a, \epsilon') \approx r + \gamma \operatorname{ess\,sup}_{\epsilon \sim \mathcal{N}(0, I_{d})} Q(s', \pi(s', \epsilon'), \epsilon)$

By using Upper Expectile Regression (with $\kappa = 0.9$), FAN can estimate the "best possible" return (the essential supremum over the critic's noise input) without needing a large ensemble or dozens of quantile samples.
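A small sketch shows why upper expectile regression can stand in for a supremum while using one sample per update. It fits the $\kappa$-expectile of a toy return distribution, Uniform(0, 1), by stochastic gradient descent on the asymmetric squared loss $|\kappa - \mathbb{1}[u < 0]|\,u^2$. The toy distribution, step size, and function names are assumptions for illustration, not the paper's training setup.

```python
import random


def expectile_loss_grad(estimate, sample, kappa):
    """Gradient w.r.t. estimate of |kappa - 1[u < 0]| * u^2, with u = sample - estimate."""
    diff = sample - estimate
    weight = kappa if diff > 0 else 1.0 - kappa
    return -2.0 * weight * diff


def fit_expectile(sampler, kappa, steps=20000, lr=0.005, seed=0):
    """Fit the kappa-expectile with SGD, drawing a single sample per step."""
    rng = random.Random(seed)
    estimate = 0.0
    for _ in range(steps):
        estimate -= lr * expectile_loss_grad(estimate, sampler(rng), kappa)
    return estimate


def uniform_return(rng):
    """Toy return distribution on [0, 1]; its essential supremum is 1."""
    return rng.random()


e50 = fit_expectile(uniform_return, kappa=0.5)   # ordinary mean regression
e90 = fit_expectile(uniform_return, kappa=0.9)   # upper expectile
e99 = fit_expectile(uniform_return, kappa=0.99)  # moves closer to the supremum
print(e50, e90, e99)
```

For Uniform(0, 1) the exact $\kappa$-expectile is $1 / (1 + \sqrt{(1-\kappa)/\kappa})$, which gives 0.5 at $\kappa = 0.5$ and 0.75 at $\kappa = 0.9$, and approaches 1 as $\kappa \to 1$. A value like $\kappa = 0.9$ therefore yields a soft upper estimate rather than the exact supremum.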

Experiments & Results: SOTA Performance at Warp Speed

The most striking result of FAN is the Runtime vs. Success plot. FAN sits at the top-left corner: the highest success rate with the lowest training time.

Figure: Performance vs. efficiency (success rate against training runtime).

Key Findings:

Robotic Mastery: On OGBench, FAN achieved 100% success on Puzzle-3x3, whereas most non-distributional methods hovered around 20-30%.
Wall-clock Speed: Training is 5-14x faster than other distributional baselines. Inference (action sampling) is nearly instantaneous because it's a one-step calculation.
Offline-to-Online: FAN also performs well when fine-tuned with online environment interaction after its offline pre-training phase, which suggests the learned policy is robust.

Deep Insight: Why Does It Work?

The "physics intuition" here is that we don't need to know the entire path to know the direction of the flow. By aligning gradients and velocities (Flow Anchoring) and using noise as a latent variable for returns (Noise-conditioned Critic), FAN bypasses the "sampling tax" that has plagued generative RL for the last few years.

Conclusion & Limitations

The Takeaway: FAN proves that you don't need expensive iterative sampling to benefit from expressive generative models. It makes high-end Offline RL practical for real-world robotics.

Limitations: While FAN is fast, it still relies on a shared noise input between the actor and critic, which might limit its expressivity in extremely high-dimensional or discontinuous reward landscapes. Future work could explore more complex noise structures or apply Flow Anchoring to multi-agent settings.

For more details, check out the official implementation.
