The paper introduces Kiwi-Edit, a versatile video editing framework that supports both natural-language instructions and visual reference images. It leverages a newly curated large-scale dataset, RefVIE (477K quadruplets), and achieves state-of-the-art (SOTA) performance on benchmarks such as OpenVE-Bench and the newly proposed RefVIE-Bench.
TL;DR
Video editing is evolving from "tell me what to do" (text-only) to "show me what you want" (reference-guided). Kiwi-Edit introduces a unified framework and the first large-scale open-source dataset, RefVIE (477K high-quality samples), to enable precise video modifications using both text instructions and reference images. By combining a frozen Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT), it sets a new SOTA for controllable video editing.
Problem & Motivation: The Ambiguity of Text
While models like Llama or Sora have shown that text can drive generation, language is inherently limited. Can you describe the exact texture of a specific vintage sports car or a unique painting style in just a few words? Probably not.
Prior works in video editing (like InsViE or OpenVE) suffer from two main issues:
- Semantic Drift: Text prompts are too vague for high-fidelity identity transfer.
- Data Scarcity: There were no large-scale, open-source datasets that provided the (Source Video, Instruction, Reference Image, Target Video) quadruplets needed to train reference-guided models.
Methodology: Data Synthesis and Unified Architecture
1. The RefVIE Pipeline: Making Data Out of Thin Air
The authors didn't just collect data; they synthesized it. Starting from 3.7M raw video pairs, they applied a multi-stage pipeline (see the sketch after this list):
- Grounding: Qwen3-VL identifies the region to be edited.
- Synthesis: Qwen-Image-Edit creates a "reference scaffold"—a clean image of the target object or background based on the grounding.
- Filtering: A strict quality control loop results in 477K high-fidelity quadruplets.
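To make the pipeline concrete, here is a minimal Python sketch of its control flow. The stage functions are injected as callables because none of this tooling is public in this exact form; `Quadruplet`, `build_refvie`, and the stage signatures are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Quadruplet:
    source_video: str     # path to the source clip
    instruction: str      # natural-language edit instruction
    reference_image: str  # synthesized "reference scaffold"
    target_video: str     # path to the edited clip

def build_refvie(
    pairs: Iterable[tuple],  # (source, target, instruction) triples
    ground: Callable,        # grounding stage (Qwen3-VL in the paper)
    synthesize: Callable,    # synthesis stage (Qwen-Image-Edit)
    keep: Callable,          # strict quality filter
) -> list[Quadruplet]:
    dataset = []
    for src, tgt, instr in pairs:
        # 1. Grounding: locate the region the edit touches.
        region = ground(src, tgt, instr)
        if region is None:
            continue  # no clear edit region -> drop the pair
        # 2. Synthesis: render a clean image of the target object or
        #    background, conditioned on the grounded region.
        ref = synthesize(tgt, region, instr)
        quad = Quadruplet(src, instr, ref, tgt)
        # 3. Filtering: only high-fidelity samples survive
        #    (477K quadruplets out of 3.7M raw pairs).
        if keep(quad):
            dataset.append(quad)
    return dataset
```

The point of the sketch is the funnel shape: each stage can reject a pair, which is how 3.7M raw pairs become 477K curated quadruplets.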
2. Kiwi-Edit Architecture
The model uses a frozen MLLM (Qwen2.5-VL) to act as the "brain," encoding instructions and references into semantic tokens. These are fed into a Diffusion Transformer (DiT) via the following mechanisms (a code sketch follows the list):
- Dual-Connector: A Query Connector distills text intent, while a Latent Connector extracts dense visual features from the reference image.
- Hybrid Latent Injection: To keep the video's original structure (like motion and layout), the source video features are added element-wise to the noisy latents.
- Timestep Scaling: They found that a learnable scalar is crucial. It tells the model when to focus on the original structure (early timesteps) and when to allow the new reference features to take over.
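A minimal PyTorch sketch of this conditioning path is below. It is one reading of the bullets above, not the released implementation: the dimensions, the cross-attention query design, the two-layer MLP in the Latent Connector, and the sigmoid gate on the timestep embedding are all assumptions.

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Distills the frozen MLLM's instruction tokens into a small set of
    intent tokens via learned queries and cross-attention (assumed design)."""

    def __init__(self, dim: int = 1024, n_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, mllm_tokens: torch.Tensor) -> torch.Tensor:
        # mllm_tokens: (B, L, dim) semantic tokens from the frozen MLLM
        q = self.queries.unsqueeze(0).expand(mllm_tokens.size(0), -1, -1)
        intent, _ = self.attn(q, mllm_tokens, mllm_tokens)
        return intent  # (B, n_queries, dim) distilled text intent

class LatentConnector(nn.Module):
    """Projects dense visual features of the reference image into the
    DiT token space (sketched here as a two-layer MLP)."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.SiLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, ref_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(ref_feats)  # (B, N, out_dim) dense reference tokens

class HybridLatentInjection(nn.Module):
    """Adds source-video features element-wise to the noisy latents,
    gated by a learnable, timestep-dependent scalar."""

    def __init__(self, t_embed_dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(t_embed_dim, 1)  # learnable timestep scaling

    def forward(self, noisy: torch.Tensor, src: torch.Tensor,
                t_embed: torch.Tensor) -> torch.Tensor:
        # noisy, src: (B, T, C, H, W); t_embed: (B, t_embed_dim).
        # The gate can stay high at early timesteps (preserve motion and
        # layout) and relax later so reference features take over.
        s = torch.sigmoid(self.gate(t_embed)).view(-1, 1, 1, 1, 1)
        return noisy + s * src  # element-wise injection of source structure
```

A plain per-timestep nn.Parameter would also satisfy "learnable scalar"; conditioning the gate on the timestep embedding is just one natural way to realize the early-structure, late-reference behavior described above.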

Experiments & Results: SOTA Performance
Kiwi-Edit was tested against both open-source and heavyweight proprietary models (Runway, Kling).
- Instruction Following: On OpenVE-Bench, Kiwi-Edit's 3.02 score significantly outperformed OpenVE-Edit's 2.50.
- Reference Adherence: In tasks requiring a specific subject reference, Kiwi-Edit achieved high Identity Consistency (3.98), proving it can "copy-paste" visual concepts into a video while maintaining temporal stability.

Qualitative Brilliance
As the examples show, whether it's swapping bread for a hamburger made of modeling clay or turning a green valley into a golden wheat field, Kiwi-Edit keeps the foreground's interaction with the new background seamless.

Critical Analysis & Conclusion
Takeaway: Kiwi-Edit proves that the bottleneck for reference-guided editing wasn't necessarily the architecture, but the training data. By building RefVIE, they've provided a blueprint for future multimodal generative research.
Limitations: The model still shows a trade-off between "Background Change" and "Local Change" edits, since training on more local changes slightly degraded background performance. This suggests a need for even more balanced task distributions in future datasets.
Future Outlook: The integration of MLLMs as "pluggable brains" for DiTs represents a shift toward more modular, multimodal AI, where one model understands the world and the other draws it.
