Group Editing: Edit Multiple Images in One Go

[CVPR 2026] GroupEditing: Harmonizing Multi-Image Edits via Pseudo-Video Priors and Geometric RoPE

Abstract

GroupEditing is a training-based framework for applying consistent, unified edits across multiple related images by reformulating the image group as a pseudo-video. It builds on a pre-trained video diffusion model (WAN-2.1) and adds explicit geometric correspondences from VGGT; the authors report state-of-the-art performance in local and global editing and in identity preservation.

TL;DR

Editing a single image is easy; editing a set of images (say, photos of the same product from different angles) while keeping the changes consistent is much harder. GroupEditing tackles this by treating an image group as a "pseudo-video" and injecting explicit geometric cues into a video diffusion transformer. Two new flavors of Rotary Positional Embeddings (Ge-RoPE and Identity-RoPE) help an edit applied to one view propagate correctly to the other views, despite changes in rotation or pose.

Problem & Motivation: The Consistency Gap

Current SOTA editing tools (like InstructPix2Pix or MasaCtrl) typically struggle with "Group-Image Editing." If you ask them to put "robotic armor" on a fox across five different photos, the armor's design will shift slightly from photo to photo.

The root of the problem is twofold:

Lack of Dense Correspondence: Standard attention mechanisms understand that a "face is a face," but they lose track of specific pixels when an object rotates or deforms.
Data Scarcity: Most datasets focus on single-image editing or video sequences, not diverse static views of the same object.

The authors' insight? Video models already know how things move. By feeding an image group into a video model, we can exploit pre-learned "temporal coherence" to bridge the gap between static views.
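
To make the pseudo-video idea concrete, here is a minimal sketch of the reformulation step only: a group of related images is stacked along a new frame axis so that a video model can consume it as a short clip. The function name `to_pseudo_video` and the requirement that all images share one resolution are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def to_pseudo_video(images):
    """Stack N same-sized RGB images of shape (H, W, 3) into one (N, H, W, 3) clip.

    Hypothetical helper: the frame order carries no real time information,
    it only gives the video model a "temporal" axis to attend across.
    """
    if not images:
        raise ValueError("need at least one image")
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"images must share one shape, got {shapes}")
    return np.stack(images, axis=0)

# Five dummy "views" of the same subject, each filled with its own index.
group = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(5)]
clip = to_pseudo_video(group)
print(clip.shape)
```

In a real pipeline the images would first be resized or cropped to a common resolution and encoded into latents; both steps are omitted here.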

Methodology: Fusing Implicit Priors with Explicit Geometry

The GroupEditing framework operates on a WAN-2.1 video diffusion backbone. To turn this video model into a group editor, the authors introduce a dual-path correspondence system.

1. Geometry-enhanced RoPE (Ge-RoPE)

While video models have "implicit" consistency, they need "explicit" help for complex geometry. The authors use VGGT (Visual Geometry Grounded Transformer) to extract dense matching features. They then warp the spatial grid based on these features and inject them into the transformer using Ge-RoPE. This aligns the latent tokens with the actual geometric structure of the scene.
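
The effect of warping positions before applying RoPE can be sketched in a few lines. This is not the paper's implementation: it assumes a matcher (VGGT in the paper) has already produced, for a token in a secondary view, the coordinates of its match in the reference view, and that those warped coordinates replace the token's regular grid position. The names `rope_1d` and `rope_2d` and the channel split between the y and x axes are illustrative choices.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (n, d) by angle pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) one frequency per pair
    ang = pos[:, None] * freqs[None, :]         # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, coords):
    """First half of the channels encodes y, second half encodes x."""
    h = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :h], coords[:, 0]), rope_1d(x[:, h:], coords[:, 1])],
        axis=-1,
    )

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))           # query token in the reference view
k = rng.normal(size=(1, 8))           # matching token in another view

ref_pos = np.array([[2.0, 3.0]])      # (y, x) of the query on the reference grid
grid_pos = np.array([[5.0, 1.0]])     # where the match sits on its own regular grid
warped_pos = np.array([[2.0, 3.0]])   # assumed output of the matcher: reference coords

raw = q @ k.T
plain = rope_2d(q, ref_pos) @ rope_2d(k, grid_pos).T
aligned = rope_2d(q, ref_pos) @ rope_2d(k, warped_pos).T
print(raw, plain, aligned)
```

Because RoPE attention scores depend only on the relative position between query and key, giving matched tokens the same coordinates makes their score equal to the unrotated dot product, so corresponding points are compared as if they sat at the same location. Warped coordinates need not be integers, since the rotation angle is a continuous function of position.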

2. Identity-RoPE

To keep the "identity" of an object (like a specific character's face) stable, the authors propose Identity-RoPE. It calculates a bounding rectangle for the target object and normalizes the positional encodings within that box. This forces the model to treat the object the same way, regardless of where it appears in the frame.
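
A small sketch of the box normalization described above, under stated assumptions: the object is given as a binary mask, the bounding rectangle is the tight box around that mask, and positions are rescaled so the box always spans the same range `[0, canon]`. The helper names and the `canon` value are hypothetical, and how the paper treats tokens outside the box is not covered here.

```python
import numpy as np

def bounding_box(mask):
    """Tight (y0, y1, x0, x1) box around the True pixels of a 2D mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    return ys.min(), ys.max(), xs.min(), xs.max()

def identity_coords(mask, canon=16.0):
    """Per-token (y, x) positions rescaled so the object's box spans [0, canon]."""
    y0, y1, x0, x1 = bounding_box(mask)
    h, w = mask.shape
    yy, xx = np.meshgrid(
        np.arange(h, dtype=float), np.arange(w, dtype=float), indexing="ij"
    )
    ny = (yy - y0) / max(y1 - y0, 1) * canon
    nx = (xx - x0) / max(x1 - x0, 1) * canon
    return np.stack([ny, nx], axis=-1)          # (H, W, 2)

# Same object, small and near the corner in one image ...
mask_a = np.zeros((8, 8), dtype=bool)
mask_a[2:6, 1:5] = True
# ... larger and somewhere else in another.
mask_b = np.zeros((32, 32), dtype=bool)
mask_b[10:18, 20:28] = True

coords_a = identity_coords(mask_a)
coords_b = identity_coords(mask_b)
print(coords_a[2, 1], coords_a[5, 4])   # box corners in image A
print(coords_b[10, 20], coords_b[17, 27])  # box corners in image B
```

Both objects end up with the same corner coordinates, so positional encodings computed from `coords_a` and `coords_b` no longer depend on where, or how large, the object appears in the frame.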

Model Architecture Figure: The GroupEditing pipeline showing the fusion of VGGT tokens and the two RoPE variants.

GroupEditData: Scaling Up

A significant contribution of this paper is GroupEditData, a dataset of 7,500+ high-quality image groups. The authors used Gemini 2.5 to generate related image sets and a rigorous filtering pipeline (segmentation + aesthetic scoring) to ensure the data was clean enough for training a robust group-editing model.
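
The filtering stage can be pictured as a per-group gate. The sketch below is an illustration only: it assumes each image already carries a subject-mask area fraction (from a segmentation model) and an aesthetic score, and it keeps a group only if every image passes both checks. The field names and all thresholds are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    mask_area_frac: float  # fraction of pixels covered by the subject mask
    aesthetic: float       # score from some aesthetic predictor (assumed 0-10 scale)

def keep_group(records, min_area=0.05, min_aesthetic=5.0, min_images=2):
    """Keep a group only if it is large enough and every image passes both checks."""
    if len(records) < min_images:
        return False
    return all(
        r.mask_area_frac >= min_area and r.aesthetic >= min_aesthetic
        for r in records
    )

clean = [ImageRecord(0.30, 6.2), ImageRecord(0.25, 5.8), ImageRecord(0.40, 7.1)]
noisy = [ImageRecord(0.30, 6.2), ImageRecord(0.01, 6.0)]  # subject barely visible

print(keep_group(clean), keep_group(noisy))
```

Rejecting a whole group when one image fails is a deliberate simplification; dropping only the failing image is an equally plausible design.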

Experiments: Superior Quality and Consistency

GroupEditing was compared against strong baselines such as Edicho, AnyDoor, and OminiControl.

Qualitative Mastery: Whether it's changing the style of a jeep across multiple landscapes or adding armor to a character, GroupEditing maintains a level of structural and textural fidelity that previous methods lack.
Downstream Power: Because the edits are consistent across views, the outputs can be used directly for 3D reconstruction (using MUSt3R). Inconsistent edits would give the reconstruction conflicting evidence about the same surface, and points would fail to align across views.

Experimental Results Figure: Comparison of GroupEditing against SOTA methods in local and global editing tasks.

Critical Analysis & Conclusion

Takeaway

The core value of GroupEditing lies in its representation choice. By moving from "independent images" to "pseudo-video," the authors successfully unlocked the latent geometric reasoning capabilities of modern video transformers.

Limitations & Future Work

The method relies heavily on the quality of the masks and of the initial geometric correspondences from VGGT. Under extreme occlusion, or when backgrounds are completely unrelated, consistency may still degrade. Future research could explore iterative refinement, in which the model "self-corrects" its geometric understanding during the denoising process.

GroupEditing sets a new bar for digital commerce and virtual content creation, where maintaining a "single source of truth" across multiple images is paramount.
