[CVPR 2026] ImageEdit-R1: Orchestrating Multi-Agent Collaboration via RL for Complex Image Editing

Abstract

ImageEdit-R1 is a multi-agent framework that treats image editing as a sequential decision-making task, combining Qwen2.5-VL with diffusion models. It applies Group Relative Policy Optimization (GRPO) to train a decomposition agent, achieving state-of-the-art performance with gains of up to +1.02 on the FLUX.1 backbone across complex editing benchmarks.

TL;DR

ImageEdit-R1 is a multi-agent framework that targets the "complex instruction" bottleneck in image editing. By using Reinforcement Learning (GRPO) to train a specialized decomposition agent, it breaks down ambiguous user prompts into structured, executable steps. The result? Paired with Qwen-Image-Edit, it outperforms the proprietary GPT-4o, and it boosts average editing quality by up to 14% across multiple benchmarks without ever touching the weights of the underlying diffusion model.

The "Monolithic Model" Limitation

Despite the brilliance of models like FLUX or DALL-E 3, they often suffer from "instruction blindness" when a user asks for something multi-faceted. If you ask a model to "Change the color of the coat, enhance the lighting, and remove the person in the background," a single-pass model might hallucinate or ignore one of the commands.

The root cause is a lack of structural reasoning. Most models condition on the prompt as a single undifferentiated signal rather than as a set of separate edits. ImageEdit-R1's authors argue that editing should be treated as a sequential decision-making problem, requiring a "Manager" to plan before the "Artist" paints.

Methodology: The R1 Pipeline

The framework is composed of three distinct agents working in harmony:

Decomposition Agent ($A_{\text{decom}}$): The brain. It takes the image and instruction, then outputs a structured tuple of (Actions, Subjects, Goals).
Sequencing Agent ($A_{\text{order}}$): The planner. It takes the decomposition and generates a logical sequence of sub-requests.
Editing Agent ($A_{\text{edit}}$): The muscle. A diffusion-based model (like FLUX or Qwen-Image-Edit) that executes the plan.
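
The hand-off between the three agents can be sketched in a few lines of Python. This is only an illustration of the data flow described above, not the authors' code: the `Decomposition` class, the `run_pipeline` function, and the toy agents are hypothetical names, and pairing actions, subjects, and goals by index is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decomposition:
    """Structured output of the decomposition agent (hypothetical container)."""
    actions: List[str]
    subjects: List[str]
    goals: List[str]


def run_pipeline(
    image: object,
    instruction: str,
    decompose: Callable[[object, str], Decomposition],
    order: Callable[[Decomposition], List[str]],
    edit: Callable[[object, List[str]], object],
) -> object:
    """Chain the three agents: decompose, then sequence, then edit."""
    decomposition = decompose(image, instruction)  # role of A_decom
    sub_requests = order(decomposition)            # role of A_order
    return edit(image, sub_requests)               # role of A_edit


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_decompose(image, instruction):
    return Decomposition(
        actions=["change color", "remove"],
        subjects=["coat", "person in the background"],
        goals=["red coat", "empty background"],
    )


def toy_order(d):
    # Assumption: the i-th action, subject, and goal belong together.
    return [f"{a} {s} -> {g}" for a, s, g in zip(d.actions, d.subjects, d.goals)]


def toy_edit(image, sub_requests):
    # A real editing agent would call a diffusion model here.
    return {"source": image, "applied": list(sub_requests)}


result = run_pipeline(
    "photo.png",
    "Recolor the coat and remove the person behind",
    toy_decompose,
    toy_order,
    toy_edit,
)
print(result["applied"])
```

Passing the agents in as callables mirrors the modularity the article emphasizes: the editing backbone can be swapped without changing the planning stages.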

The RL Edge: GRPO Training

The "secret sauce" is the use of Group Relative Policy Optimization (GRPO) to train the Decomposition Agent. Instead of just fine-tuning on text, the agent is rewarded for:

Format Accuracy: Correctly using <think>, <action>, and <goal> tags.
Semantic Precision: Achieving a high F1-score relative to ground truth actions and subjects.
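
A rough sketch of how these two reward terms and GRPO's group-relative advantage might fit together is given below. The tag names come from the list above; everything else is an assumption for illustration: the equal weighting of the two terms, the set-based F1, and the function names `format_reward`, `f1_score`, `total_reward`, and `group_relative_advantages` are not taken from the paper. The advantage follows the usual GRPO recipe of standardizing each sampled output's reward against the mean and standard deviation of its group.

```python
import re
from statistics import mean, pstdev
from typing import List, Set

REQUIRED_TAGS = ("think", "action", "goal")


def format_reward(output: str) -> float:
    """Return 1.0 if every required tag appears as a non-empty <tag>...</tag> pair."""
    for tag in REQUIRED_TAGS:
        if not re.search(rf"<{tag}>\s*\S.*?</{tag}>", output, flags=re.DOTALL):
            return 0.0
    return 1.0


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """Set-level F1 between predicted and ground-truth items."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def total_reward(output: str, predicted: Set[str], reference: Set[str]) -> float:
    # Assumption: the two terms are simply added with equal weight.
    return format_reward(output) + f1_score(predicted, reference)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize each reward within its sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


reference = {"recolor", "coat"}
good = "<think>two edits</think><action>recolor</action><goal>red coat</goal>"
bad = "recolor the coat"
rewards = [
    total_reward(good, {"recolor", "coat"}, reference),
    total_reward(bad, {"recolor"}, reference),
]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the advantage is computed relative to the group, no separate value network is needed: an output is pushed up only if it beats the other samples drawn for the same prompt.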

Figure 1: The three-stage collaborative pipeline of ImageEdit-R1.

Experimental Breakthroughs

The team tested ImageEdit-R1 across three major benchmarks: PSR, RealEdit, and UltraEdit.

SOTA Dominance: While FLUX.1 (Original) scored 7.21, the ImageEdit-R1 version soared to 8.23.
Proprietary vs. Open: ImageEdit-R1 (using Qwen-Image-Edit) achieved a score of 8.85, surpassing the closed-source GPT-4o (8.47).
The RL Necessity: A key takeaway is that the multi-agent framework without RL (ImageEdit-R1 w/o RL) actually performed worse than the base model in some cases. RL is not an "extra"; it is the engine that makes the decomposition reliable.
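
As a quick sanity check on the headline numbers, the absolute and relative gains can be recomputed from the scores quoted above. The helper `gain` is a hypothetical name, and reading the "up to 14%" figure as the FLUX.1 relative gain is an interpretation that happens to match the arithmetic, not something the article states outright.

```python
def gain(base: float, improved: float) -> tuple:
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = improved - base
    return round(absolute, 2), round(100 * absolute / base, 1)


flux_gain = gain(7.21, 8.23)      # FLUX.1 original vs. ImageEdit-R1 on FLUX.1
qwen_vs_gpt4o = gain(8.47, 8.85)  # GPT-4o vs. ImageEdit-R1 on Qwen-Image-Edit
print(flux_gain)      # (1.02, 14.1)
print(qwen_vs_gpt4o)  # (0.38, 4.5)
```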

Table 1: Quantitative results showing consistent improvements across different editing backbones.

Deep Insights: Single-Turn vs. Multi-Turn

An interesting finding in the ablation studies was the failure of the "Multi-turn" strategy. One might assume that applying edits one by one over multiple steps would be better. However, the authors found that Single-turn delivery, where all sub-requests are fed into the model simultaneously, yielded better results.

Why? Multi-turn editing suffers from compounding errors. Small artifacts in step one are amplified in step two, whereas single-turn execution lets the diffusion model see the full context at once and maintain spatial consistency.
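
The difference between the two delivery strategies is easiest to see as code. The sketch below is illustrative only: `multi_turn`, `single_turn`, and `toy_edit` are hypothetical names, the editor is a stub rather than a diffusion model, and joining sub-requests into one numbered prompt is an assumed format, since the article does not specify how they are combined.

```python
from typing import Callable, List

EditFn = Callable[[str, str], str]


def multi_turn(image: str, sub_requests: List[str], edit: EditFn) -> str:
    """Apply sub-requests one at a time, feeding each output into the next call."""
    current = image
    for request in sub_requests:
        current = edit(current, request)  # artifacts from earlier passes carry forward
    return current


def single_turn(image: str, sub_requests: List[str], edit: EditFn) -> str:
    """Send all sub-requests to the editor in a single call."""
    # Assumption: sub-requests are merged into one numbered prompt.
    combined = " ".join(f"{i}. {r}" for i, r in enumerate(sub_requests, start=1))
    return edit(image, combined)


calls: List[str] = []


def toy_edit(image: str, prompt: str) -> str:
    # Stub editor that records each prompt and wraps the image name.
    calls.append(prompt)
    return f"edited({image})"


steps = ["recolor the coat", "enhance the lighting", "remove the background person"]
multi_result = multi_turn("img", steps, toy_edit)
multi_calls = len(calls)
calls.clear()
single_result = single_turn("img", steps, toy_edit)
single_calls = len(calls)
print(multi_result, multi_calls)
print(single_result, single_calls)
```

In the multi-turn path the image passes through the editor three times, so any degradation is re-encoded at every step; in the single-turn path the original image is edited once with the whole plan in view.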

Figure 3: Rapid convergence of the decomposition agent's reward signals during RL training.

Conclusion & Future Outlook

ImageEdit-R1 proves that we don't necessarily need "bigger" diffusion models to get better edits; we need smarter controllers. By framing image editing as a policy-driven coordination task, the authors have provided a roadmap for building modular, interpretable, and highly capable AI creative tools.

Limitations: The system still relies on the base editing model's inherent ability to handle the final sub-requests. If the diffusion model itself cannot render a specific "Goal" (e.g., a highly specific texture), no amount of decomposition will fix the final pixels. Future work likely lies in "Joint RL," where the manager and the artist are trained in a single feedback loop.
