The paper introduces Kiwi-Edit, a versatile video editing framework that supports both natural-language instructions and visual reference images. It leverages a newly curated large-scale dataset, RefVIE (477K quadruplets), and achieves state-of-the-art (SOTA) performance on benchmarks such as OpenVE-Bench and the newly proposed RefVIE-Bench.
TL;DR
Video editing is evolving from "tell me what to do" (text-only) to "show me what you want" (reference-guided). Kiwi-Edit introduces a unified framework and the first large-scale open-source dataset, RefVIE (477K high-quality samples), to enable precise video modifications using both text instructions and reference images. By combining a frozen Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT), it sets a new SOTA for controllable video editing.
Problem & Motivation: The Ambiguity of Text
While models like Llama or Sora have shown that text can drive generation, language is inherently limited. Can you describe the exact texture of a specific vintage sports car or a unique painting style in just a few words? Probably not.
Prior works in video editing (like InsViE or OpenVE) suffer from two main issues:
- Semantic Drift: Text prompts are too vague for high-fidelity identity transfer.
- Data Scarcity: There were no large-scale, open-source datasets that provided the (Source Video, Instruction, Reference Image, Target Video) quadruplets needed to train reference-guided models.
Methodology: Data Synthesis and Unified Architecture
1. The RefVIE Pipeline: Making Data Out of Thin Air
The authors didn't just collect data; they synthesized it. Starting from 3.7M raw video pairs, they applied a multi-stage pipeline (see the sketch after this list):
- Grounding: Qwen3-VL identifies the region to be edited.
- Synthesis: Qwen-Image-Edit creates a "reference scaffold"—a clean image of the target object or background based on the grounding.
- Filtering: A strict quality control loop results in 477K high-fidelity quadruplets.
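To make the pipeline concrete, here is a minimal Python sketch of its control flow. The stage functions are injected as callables because none of this tooling is public in this exact form; `Quadruplet`, `build_refvie`, and the stage signatures are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Quadruplet:
    source_video: str     # path to the source clip
    instruction: str      # natural-language edit instruction
    reference_image: str  # synthesized "reference scaffold"
    target_video: str     # path to the edited clip

def build_refvie(
    pairs: Iterable[tuple],  # (source, target, instruction) triples
    ground: Callable,        # grounding stage (Qwen3-VL in the paper)
    synthesize: Callable,    # synthesis stage (Qwen-Image-Edit)
    keep: Callable,          # strict quality filter
) -> list[Quadruplet]:
    dataset = []
    for src, tgt, instr in pairs:
        # 1. Grounding: locate the region the edit touches.
        region = ground(src, tgt, instr)
        if region is None:
            continue  # no clear edit region -> drop the pair
        # 2. Synthesis: render a clean image of the target object or
        #    background, conditioned on the grounded region.
        ref = synthesize(tgt, region, instr)
        quad = Quadruplet(src, instr, ref, tgt)
        # 3. Filtering: only high-fidelity samples survive
        #    (477K quadruplets out of 3.7M raw pairs).
        if keep(quad):
            dataset.append(quad)
    return dataset
```

The point of the sketch is the funnel shape: each stage can reject a pair, which is how 3.7M raw pairs become 477K curated quadruplets.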
2. Kiwi-Edit Architecture
The model uses a frozen MLLM (Qwen2.5-VL) to act as the "brain," encoding instructions and references into semantic tokens. These are fed into a Diffusion Transformer (DiT) via the following mechanisms (a code sketch follows the list):
- Dual-Connector: A Query Connector distills text intent, while a Latent Connector extracts dense visual features from the reference image.
- Hybrid Latent Injection: To keep the video's original structure (like motion and layout), the source video features are added element-wise to the noisy latents.
- Timestep Scaling: They found that a learnable scalar is crucial. It tells the model when to focus on the original structure (early timesteps) and when to allow the new reference features to take over.
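A minimal PyTorch sketch of this conditioning path is below. It is one reading of the bullets above, not the released implementation: the dimensions, the cross-attention query design, the two-layer MLP in the Latent Connector, and the sigmoid gate on the timestep embedding are all assumptions.

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Distills the frozen MLLM's instruction tokens into a small set of
    intent tokens via learned queries and cross-attention (assumed design)."""

    def __init__(self, dim: int = 1024, n_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, mllm_tokens: torch.Tensor) -> torch.Tensor:
        # mllm_tokens: (B, L, dim) semantic tokens from the frozen MLLM
        q = self.queries.unsqueeze(0).expand(mllm_tokens.size(0), -1, -1)
        intent, _ = self.attn(q, mllm_tokens, mllm_tokens)
        return intent  # (B, n_queries, dim) distilled text intent

class LatentConnector(nn.Module):
    """Projects dense visual features of the reference image into the
    DiT token space (sketched here as a two-layer MLP)."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.SiLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, ref_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(ref_feats)  # (B, N, out_dim) dense reference tokens

class HybridLatentInjection(nn.Module):
    """Adds source-video features element-wise to the noisy latents,
    gated by a learnable, timestep-dependent scalar."""

    def __init__(self, t_embed_dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(t_embed_dim, 1)  # learnable timestep scaling

    def forward(self, noisy: torch.Tensor, src: torch.Tensor,
                t_embed: torch.Tensor) -> torch.Tensor:
        # noisy, src: (B, T, C, H, W); t_embed: (B, t_embed_dim).
        # The gate can stay high at early timesteps (preserve motion and
        # layout) and relax later so reference features take over.
        s = torch.sigmoid(self.gate(t_embed)).view(-1, 1, 1, 1, 1)
        return noisy + s * src  # element-wise injection of source structure
```

A plain per-timestep nn.Parameter would also satisfy "learnable scalar"; conditioning the gate on the timestep embedding is just one natural way to realize the early-structure, late-reference behavior described above.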

Experiments & Results: SOTA Performance
Kiwi-Edit was tested against both open-source and heavyweight proprietary models (Runway, Kling).
- Instruction Following: On OpenVE-Bench, Kiwi-Edit's 3.02 score significantly outperformed OpenVE-Edit's 2.50.
- Reference Adherence: In tasks requiring a specific subject reference, Kiwi-Edit achieved high Identity Consistency (3.98), proving it can "copy-paste" visual concepts into a video while maintaining temporal stability.

Qualitative Brilliance
As the examples show, whether it's swapping bread for a hamburger made of modeling clay or turning a green valley into a golden wheat field, Kiwi-Edit keeps the foreground's interaction with the new background seamless.

Critical Analysis & Conclusion
Takeaway: Kiwi-Edit proves that the bottleneck for reference-guided editing wasn't necessarily the architecture, but the training data. By building RefVIE, they've provided a blueprint for future multimodal generative research.
Limitations: The model still shows a trade-off between "Background Change" and "Local Change" edits, since training on more local changes slightly degraded background performance. This suggests a need for even more balanced task distributions in future datasets.
Future Outlook: The integration of MLLMs as "pluggable brains" for DiTs represents a shift toward more modular, multimodal AI, where one model understands the world and the other draws it.
