TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

[TDM-R1] Reinforcing Few-Step Diffusion: Breaking the Differentiability Barrier

Abstract

TDM-R1 is a novel reinforcement learning (RL) paradigm for few-step diffusion models that enables post-training using non-differentiable reward signals. Built upon Trajectory Distribution Matching (TDM), it achieves state-of-the-art results (e.g., 92% on GenEval) using only 4 sampling steps, significantly outperforming 80-step base models and commercial engines like GPT-4o.

TL;DR

TDM-R1 introduces a breakthrough RL paradigm for few-step diffusion models, allowing them to learn from complex, non-differentiable rewards (like "does this image have exactly 5 dogs?" or human binary "likes"). By decoupling reward learning from generator optimization and leveraging deterministic sampling paths, it enables a 4-step model to outperform 80-step giants and GPT-4o in complex instruction following.

Context: The Efficiency vs. Alignment Paradox

In the race for real-time AIGC, "few-step" models (distilled via TDM, ADD, or LCM) have become the industry standard for production. However, these models often trade off "intelligence"—specifically the ability to follow complex spatial or numerical instructions—for speed.

While Large Language Models (LLMs) have mastered alignment via RLHF, diffusion models have lagged behind because many RL methods for few-step generators (such as direct reward backpropagation) require differentiable reward functions. This excludes the most valuable feedback: human intuition and discrete logical checks.

The Problem with "Direct" RL

If you try to apply standard RL losses to a 4-step distilled model, the result is usually a blurry mess. Why?

Gradient Mismatch: Standard denoising losses (used in many RL methods) contradict the distribution-matching goals of distillation.
Sparse Feedback: The reward is only defined on the final image, so assigning it to an intermediate noisy latent gives a high-variance estimate.
Differentiability: You can't "backprop" through a human's "No" or an OCR model's discrete character count.
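The differentiability point can be illustrated with a toy check. The sketch below is not from the paper: `count_reward` is a hypothetical stand-in for a discrete checker (for example "exactly 5 objects detected"), and the nine-value list is a made-up "image". It estimates the gradient by finite differences and finds that it is zero everywhere, so there is nothing to backpropagate.

```python
def count_reward(pixels, target_count=5, threshold=0.5):
    """Hypothetical black-box reward: 1.0 if exactly `target_count`
    values exceed `threshold`, otherwise 0.0."""
    detected = sum(1 for p in pixels if p > threshold)
    return 1.0 if detected == target_count else 0.0


def finite_difference_grad(reward_fn, pixels, eps=1e-4):
    """Estimate d(reward)/d(pixel) by nudging one value at a time."""
    base = reward_fn(pixels)
    grad = []
    for i in range(len(pixels)):
        bumped = list(pixels)
        bumped[i] += eps
        grad.append((reward_fn(bumped) - base) / eps)
    return grad


# Made-up "image": five bright values and four dark ones.
pixels = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9]
reward = count_reward(pixels)
grad = finite_difference_grad(count_reward, pixels)

print("reward:", reward)
print("all gradients zero:", all(g == 0.0 for g in grad))
```

The reward only changes when a value crosses the threshold, so a small perturbation carries no learning signal. This is the gap that a learned, differentiable surrogate is meant to fill.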

Methodology: The TDM-R1 Architecture

TDM-R1 solves this through a sophisticated "Surrogate" approach. Instead of forcing the reward to be differentiable, it trains a Surrogate Reward Model to learn the non-differentiable reward signal.

1. Deterministic Trajectories

The authors leverage Trajectory Distribution Matching (TDM). Because TDM uses deterministic ODE paths, the model can precisely estimate the reward for any intermediate step $x_{t}$ based on the final $x_{0}$. This significantly reduces the variance that plagues stochastic sampling.
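A minimal sketch of why determinism helps, under stated assumptions: `toy_velocity` is a hypothetical stand-in for a learned network, the Euler integrator is a generic ODE solver rather than TDM's actual sampler, and `black_box_reward` is an invented discrete check. Because the same noise always yields the same path, every intermediate state maps to exactly one final sample, so the final reward can be attached to each state without averaging over random branches.

```python
import random


def toy_velocity(x, t):
    """Hypothetical velocity field; pulls each coordinate toward 0.5."""
    return [0.5 - xi for xi in x]


def sample_trajectory(noise, num_steps=4):
    """Deterministic Euler integration from t=1 down to t=0.
    Returns the initial noise followed by the state after each step."""
    x = list(noise)
    states = [x]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        v = toy_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        states.append(x)
    return states


def black_box_reward(x0):
    """Invented discrete check on the final sample only."""
    return 1.0 if sum(1 for v in x0 if v > 0.5) >= 2 else 0.0


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]

traj_a = sample_trajectory(noise)
traj_b = sample_trajectory(noise)  # same noise, so the same path

reward = black_box_reward(traj_a[-1])
# Attach the final reward to every earlier state on the path.
labeled = [(state, reward) for state in traj_a[:-1]]

print("paths identical:", traj_a == traj_b)
print("labeled intermediate states:", len(labeled))
```

With a stochastic sampler, the same intermediate state could lead to many different final images, and the reward for that state would have to be estimated as an average over them.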

2. The Dynamic Surrogate Reward

Inspired by the success of GRPO in LLMs, TDM-R1 uses a Group Preference Optimization strategy. It generates a group of images, ranks them using the "black-box" non-differentiable reward, and then trains a differentiable surrogate ($p_{\phi}$) to mimic these preferences.

Model Architecture Figure: The TDM-R1 workflow showing the decoupling of surrogate reward learning and generator optimization.
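The group-ranking step described in this section can be sketched as follows. Everything here is an illustrative assumption rather than the paper's implementation: the group is four hand-written feature vectors instead of generated images, `black_box_reward` is an invented discrete check, the surrogate is a linear scorer, and the pairwise logistic (Bradley-Terry style) loss is one common way to fit preferences, not necessarily the loss TDM-R1 uses.

```python
import math


def black_box_reward(sample):
    """Invented non-differentiable check: exactly two 'objects' present."""
    objects = sum(1 for v in sample if v > 0.5)
    return 1.0 if objects == 2 else 0.0


def preference_pairs(group, rewards):
    """(winner, loser) for every pair whose black-box rewards differ."""
    return [(group[i], group[j])
            for i in range(len(group))
            for j in range(len(group))
            if rewards[i] > rewards[j]]


def surrogate_score(weights, sample):
    """Hypothetical differentiable surrogate: a linear scorer."""
    return sum(w * v for w, v in zip(weights, sample))


def pairwise_loss(weights, pairs):
    """Mean logistic loss on score margins (assumes `pairs` is non-empty)."""
    total = 0.0
    for win, lose in pairs:
        margin = surrogate_score(weights, win) - surrogate_score(weights, lose)
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def gradient_step(weights, pairs, lr=0.5):
    """One step of plain gradient descent on `pairwise_loss`."""
    grad = [0.0] * len(weights)
    for win, lose in pairs:
        margin = surrogate_score(weights, win) - surrogate_score(weights, lose)
        coeff = -1.0 / (1.0 + math.exp(margin))  # d loss / d margin
        for k in range(len(weights)):
            grad[k] += coeff * (win[k] - lose[k]) / len(pairs)
    return [w - lr * g for w, g in zip(weights, grad)]


# Hand-written stand-ins for a group of generated samples.
group = [[0.9, 0.8, 0.1], [0.9, 0.1, 0.1], [0.2, 0.1, 0.1], [0.9, 0.9, 0.9]]
rewards = [black_box_reward(s) for s in group]
pairs = preference_pairs(group, rewards)

weights = [0.0, 0.0, 0.0]
loss_before = pairwise_loss(weights, pairs)
for _ in range(100):
    weights = gradient_step(weights, pairs)
loss_after = pairwise_loss(weights, pairs)
scores = [surrogate_score(weights, s) for s in group]

print("loss decreased:", loss_after < loss_before)
print("top-ranked sample index:", scores.index(max(scores)))
```

The black-box reward is only ever called to label the group; gradients flow through the surrogate alone. A group in which every sample receives the same reward produces no pairs and would need to be skipped.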

3. Generator Optimization

The few-step generator is then updated to maximize this surrogate reward while staying pinned to the original distribution via a marginal-level reverse KL regularization. This ensures the model doesn't "break" its natural image-generating priors while chasing the reward.
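As a rough guide to the shape of such an objective, a generic KL-regularized reward-maximization problem can be written as below. This is a textbook form given under assumptions, not the paper's exact loss: $G_{\theta}$ is the few-step generator, $r_{\phi}$ is a scalar score derived from the surrogate $p_{\phi}$, $p_{\theta}$ and $p_{\mathrm{ref}}$ are the marginal sample distributions of the current and original generators, and $\beta$ is a hypothetical regularization weight.

$$
\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{N}(0, I)} \big[ r_{\phi}(G_{\theta}(z)) \big] \; - \; \beta \, D_{\mathrm{KL}}\big( p_{\theta} \,\|\, p_{\mathrm{ref}} \big)
$$

The first term pushes samples toward higher surrogate reward. The second is a reverse KL, taken from the generator's distribution to the reference, which penalizes the generator for placing mass where the original model places little. A larger $\beta$ keeps the generator closer to its original behavior at the cost of slower reward gains.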

Experimental Triumphs: 4 Steps > 80 Steps

The most striking result is the performance on GenEval, a benchmark for complex composition (counting, position, etc.).

Base SD3.5 (80 Steps): 63%
GPT-4o: 84%
TDM-R1 (4 Steps): 92%

Performance Comparison Figure: TDM-R1 rapidly boosts GenEval scores, proving that RL can unlock hidden reasoning potential in few-step models.

Beyond reasoning, the model also shows superior Visual Text Rendering. By using OCR accuracy as a non-differentiable reward, TDM-R1 learns to spell complex words correctly where previous few-step models would produce "alphabet soup."

Critical Insights & Takeaways

Adaptive Surrogate Wins: The paper reports that a dynamic surrogate reward outperforms a frozen one. The reward model needs to evolve alongside the generator to identify increasingly subtle flaws.
Scale Matters: The method scales from Stable Diffusion 3.5 to the 6B-parameter Z-Image model, suggesting the recipe may generalize across model families.
Efficiency: We can finally have our cake and eat it too—the speed of few-step generation with the logical alignment of many-step RL models.

Conclusion

TDM-R1 represents a pivotal shift in how we think about "Post-Training" for Diffusion. It moves the field away from simple distillation (copying a teacher) toward active reinforcement (learning from feedback). For developers building real-world image generation products, this framework provides the first scalable way to integrate specific, discrete business logic—like "always put the logo in the top right"—into a high-speed inference pipeline.
