[CVPR 2024] Kiwi-Edit: Bridging the Gap Between Text and Vision in Video Editing
Abstract

The paper introduces Kiwi-Edit, a versatile video editing framework that supports both natural language instructions and visual reference images. It leverages a newly curated large-scale dataset, RefVIE (477K quadruplets), and achieves State-of-the-Art (SOTA) performance on benchmarks like OpenVE-Bench and the newly proposed RefVIE-Bench.

TL;DR

Video editing is evolving from "tell me what to do" (text-only) to "show me what you want" (reference-guided). Kiwi-Edit introduces a unified framework and the first large-scale open-source dataset, RefVIE (477K high-quality samples), to enable precise video modifications using both text instructions and reference images. By combining a frozen Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT), it sets a new SOTA for controllable video editing.

Problem & Motivation: The Ambiguity of Text

While models like Llama or Sora have shown that text can drive generation, language is inherently limited. Can you describe the exact texture of a specific vintage sports car or a unique painting style in just a few words? Probably not.

Prior works in video editing (like InsViE or OpenVE) suffer from two main issues:

  1. Semantic Drift: Text prompts are too vague for high-fidelity identity transfer.
  2. Data Scarcity: There were no large-scale, open-source datasets that provided the (Source Video, Instruction, Reference Image, Target Video) quadruplets needed to train reference-guided models.

Methodology: Data Synthesis and Unified Architecture

1. The RefVIE Pipeline: Making Data Out of Thin Air

The authors didn't just collect data; they synthesized it. Starting from 3.7M raw video pairs, they used a multi-stage pipeline (sketched in code after the list):

  • Grounding: Qwen3-VL identifies the region to be edited.
  • Synthesis: Qwen-Image-Edit creates a "reference scaffold"—a clean image of the target object or background based on the grounding.
  • Filtering: A strict quality control loop results in 477K high-fidelity quadruplets.
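
The loop below is a minimal Python sketch of that pipeline, not the authors' released code. `ground_fn`, `synthesize_fn`, `quality_fn`, and the `threshold` value are hypothetical stand-ins for the Qwen3-VL grounding, Qwen-Image-Edit synthesis, and quality-control stages described above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Quadruplet:
    """One RefVIE sample: (source video, instruction, reference image, target video)."""
    source_video: str
    instruction: str
    reference_image: str
    target_video: str


def build_refvie(
    raw_pairs: Iterable[Tuple[str, str, str]],  # (source video, instruction, target video)
    ground_fn: Callable,      # hypothetical wrapper around the grounding MLLM (Qwen3-VL in the paper)
    synthesize_fn: Callable,  # hypothetical wrapper around the image editor (Qwen-Image-Edit in the paper)
    quality_fn: Callable,     # hypothetical scorer standing in for the paper's quality-control loop
    threshold: float = 0.8,   # illustrative cutoff; the paper does not report a single scalar threshold
) -> List[Quadruplet]:
    """Sketch of the multi-stage RefVIE construction loop described above."""
    dataset: List[Quadruplet] = []
    for source_video, instruction, target_video in raw_pairs:
        # 1. Grounding: localize the region the instruction refers to.
        region = ground_fn(source_video, instruction)
        # 2. Synthesis: render a clean "reference scaffold" image of the target object or background.
        reference_image = synthesize_fn(target_video, region)
        # 3. Filtering: keep only high-fidelity quadruplets.
        sample = Quadruplet(source_video, instruction, reference_image, target_video)
        if quality_fn(sample) >= threshold:
            dataset.append(sample)
    return dataset
```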

2. Kiwi-Edit Architecture

The model uses a frozen MLLM (Qwen2.5-VL) to act as the "brain," encoding instructions and references into semantic tokens. These are fed into a Diffusion Transformer (DiT) via three mechanisms (sketched in code after the list):

  • Dual-Connector: A Query Connector distills text intent, while a Latent Connector extracts dense visual features from the reference image.
  • Hybrid Latent Injection: To keep the video's original structure (like motion and layout), the source video features are added element-wise to the noisy latents.
  • Timestep Scaling: They found that a learnable scalar is crucial. It tells the model when to focus on the original structure (early timesteps) and when to allow the new reference features to take over.
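
The snippet below is a minimal PyTorch sketch of that conditioning path, assuming illustrative dimensions and treating the frozen Qwen2.5-VL encoder and the DiT backbone as black boxes. The module shapes, the attention-based query connector, and the (1 - t) timestep schedule are assumptions made for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class KiwiEditConditioningSketch(nn.Module):
    """Illustrative conditioning path: dual-connector plus hybrid latent injection."""

    def __init__(self, mllm_dim: int = 3584, dit_dim: int = 1536, n_queries: int = 64):
        super().__init__()
        # Dual-Connector.
        # Query Connector: learnable queries attend over the frozen MLLM's
        # instruction tokens to distill the editing intent.
        self.query_tokens = nn.Parameter(torch.randn(n_queries, mllm_dim))
        self.query_attn = nn.MultiheadAttention(mllm_dim, num_heads=8, batch_first=True)
        self.query_proj = nn.Linear(mllm_dim, dit_dim)
        # Latent Connector: projects dense reference-image features to the DiT width.
        self.latent_proj = nn.Linear(mllm_dim, dit_dim)
        # Timestep Scaling: a learnable scalar gating the source-video injection.
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, text_feats, ref_feats, noisy_latents, source_latents, t):
        # text_feats, ref_feats: (B, L, mllm_dim) from the frozen MLLM (Qwen2.5-VL in the paper).
        # noisy_latents, source_latents: (B, N, dit_dim); t: (B,) normalized diffusion timestep.
        b = text_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(b, -1, -1)
        intent, _ = self.query_attn(queries, text_feats, text_feats)
        # Conditioning context for the DiT's cross-attention: distilled text intent
        # concatenated with dense reference features.
        context = torch.cat([self.query_proj(intent), self.latent_proj(ref_feats)], dim=1)
        # Hybrid Latent Injection: source-video latents are added element-wise to the
        # noisy latents, scaled by a timestep-dependent learnable factor. The (1 - t)
        # schedule is an assumption; the paper only states that the scale lets early
        # timesteps favor the original structure before reference features take over.
        scale = self.alpha * (1.0 - t).view(-1, 1, 1)
        dit_input = noisy_latents + scale * source_latents
        return dit_input, context  # both are consumed by the DiT backbone
```

In this reading, the additive injection is what carries over the source video's motion and layout, while the connector-produced context only has to describe what should change.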

[Figure: Model Architecture]

Experiments & Results: SOTA Performance

Kiwi-Edit was tested against both open-source and heavyweight proprietary models (Runway, Kling).

  • Instruction Following: On OpenVE-Bench, Kiwi-Edit's 3.02 score significantly outperformed OpenVE-Edit's 2.50.
  • Reference Adherence: In tasks requiring a specific subject reference, Kiwi-Edit achieved high Identity Consistency (3.98), proving it can "copy-paste" visual concepts into a video while maintaining temporal stability.

[Figure: Experimental Results]

Qualitative Brilliance

As shown in the examples, whether it's replacing bread with a "hamburger made of modeling clay" or changing a green valley into a "golden wheat field," Kiwi-Edit seamlessly maintains the foreground's interaction with the new background.

[Figure: Qualitative Result]

Critical Analysis & Conclusion

Takeaway: Kiwi-Edit proves that the bottleneck for reference-guided editing wasn't necessarily the architecture, but the training data. By building RefVIE, they've provided a blueprint for future multimodal generative research.

Limitations: The model still occasionally struggles with "Background Change" vs "Local Change" bias—training on more local changes slightly degraded background performance. This suggests a need for even more balanced task distributions in future datasets.

Future Outlook: The integration of MLLMs as "pluggable brains" for DiTs represents a shift toward more modular, multi-modal AI where one model understands the world and the other draws it.
