NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

[CVPR 2025] NOVA: Revolutionizing Video Editing with Sparse Control and Dense Synthesis

Abstract

NOVA is a novel pair-free video editing framework that introduces the "Sparse Control, Dense Synthesis" paradigm. By decoupling semantic guidance (via sparse keyframes) from structural fidelity (via dense original video features), it achieves SOTA performance in local video editing without requiring large-scale paired datasets or per-video fine-tuning.

TL;DR

Video editing is moving beyond the "one-frame-fits-all" approach. NOVA (Sparse Control, Dense Synthesis) allows for precise local video editing (adding/removing objects) without any paired training data. By using a few edited keyframes as "sparse" guides and the original video as "dense" structural support, it achieves superior coherence and fidelity compared to existing SOTA methods like VACE or AnyV2V.

Problem & Motivation: The Local Editing Crisis

While global style transfer (e.g., turning a video into an oil painting) has seen massive success, local editing, such as adding a window to a specific wall or removing a moving car, remains difficult for current models.

The root cause is two-fold:

Data Scarcity: We don't have millions of "before and after" videos of someone adding a specific window to a specific house.
Structural Drift: Most models use the first edited frame as a guide. As the camera moves, the model "forgets" what the background looked like, leading to hallucinations and flickering.

Existing methods try to solve this by fine-tuning a lightweight adapter (LoRA) for every single video, which is slow and computationally expensive. NOVA asks: Can we learn to edit by simulating how videos degrade?

Methodology: High-Level Decoupling

The core philosophy of NOVA is the decoupling of control and synthesis. Instead of forcing one neural network to balance "changing the object" and "keeping the background," NOVA uses two specialized branches.

The Dual-Branch Architecture

Sparse Branch: Takes user-edited keyframes (spaced out, e.g., every 10 frames). These act as semantic anchors that tell the model what the new content should look like.
Dense Branch: Feeds in the original unedited video. This preserves the high-frequency textures and motion dynamics of the background.
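
To make the split between the two inputs concrete, here is a minimal sketch of how the data for each branch could be assembled. It is an illustration under assumptions, not NOVA's actual code: the function names `keyframe_indices` and `build_branch_inputs`, the binary mask, and the choice to start keyframes at frame 0 are all hypothetical.

```python
from typing import Any, List, Optional, Sequence, Tuple


def keyframe_indices(num_frames: int, interval: int = 10) -> List[int]:
    """Hypothetical helper: positions of the edited keyframes (0, interval, ...)."""
    if num_frames <= 0 or interval <= 0:
        raise ValueError("num_frames and interval must be positive")
    return list(range(0, num_frames, interval))


def build_branch_inputs(
    source_frames: Sequence[Any],
    edited_keyframes: Sequence[Any],
    interval: int = 10,
) -> Tuple[List[Optional[Any]], List[int], List[Any]]:
    """Assemble illustrative inputs for the two branches.

    sparse_input: edited keyframes at anchor positions, None everywhere else.
    sparse_mask:  1 where a keyframe is present, 0 otherwise (assumed encoding).
    dense_input:  the full, unedited source video.
    """
    anchors = keyframe_indices(len(source_frames), interval)
    if len(edited_keyframes) != len(anchors):
        raise ValueError("expected one edited keyframe per anchor position")

    sparse_input: List[Optional[Any]] = [None] * len(source_frames)
    sparse_mask = [0] * len(source_frames)
    for index, frame in zip(anchors, edited_keyframes):
        sparse_input[index] = frame
        sparse_mask[index] = 1
    return sparse_input, sparse_mask, list(source_frames)


# Usage: a 25-frame clip with keyframes every 10 frames.
source = [f"src_{t}" for t in range(25)]
edits = ["edit_0", "edit_10", "edit_20"]
sparse, mask, dense = build_branch_inputs(source, edits, interval=10)
```

The point of the sketch is only that the sparse input carries semantic information at a handful of time steps, while the dense input carries every original frame.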

[Figure: Model Architecture]

Learning Without Pairs: The Degradation Simulation

Since NOVA has no "edited" ground truth, it trains via self-supervision.

Anchored Control Pipe: It takes a video, corrupts random frames (blurring, warping), and asks the model to reconstruct the original.
Source Fidelity Pipe: It uses a "cut-and-paste" method to create synthetic edits, forcing the Dense Branch to learn how to recover the original background despite the new "pasted" distractors.
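
The two pipes above can be sketched as functions that turn a single unpaired video into (input, target) training pairs. This is a toy illustration, assuming grayscale frames stored as nested lists; the names `corrupt_random_frames` and `paste_patch`, the 3x3 box blur, and the corruption ratio are hypothetical stand-ins, and how each pair is routed to the branches in the real model is not specified here.

```python
import random
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame, H x W (toy representation)


def box_blur(frame: Frame) -> Frame:
    """3x3 mean filter that averages over whichever neighbours lie inside the frame."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += frame[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def corrupt_random_frames(
    video: List[Frame], ratio: float, rng: random.Random
) -> Tuple[List[Frame], List[Frame], List[int]]:
    """Anchored-control-style pair (assumed form): blur a random subset of frames.

    Returns (corrupted_video, target_video, corrupted_indices); the target is
    the untouched original, so no edited ground truth is needed.
    """
    n = len(video)
    k = min(n, max(1, round(n * ratio)))
    chosen = sorted(rng.sample(range(n), k))
    corrupted = [
        box_blur(f) if i in chosen else [row[:] for row in f]
        for i, f in enumerate(video)
    ]
    return corrupted, video, chosen


def paste_patch(
    video: List[Frame], patch: Frame, top: int, left: int
) -> Tuple[List[Frame], List[Frame]]:
    """Source-fidelity-style pair (assumed form): paste a distractor into every frame.

    Returns (pasted_video, target_video); the target is again the original.
    """
    pasted = []
    for frame in video:
        copy = [row[:] for row in frame]
        for dy, patch_row in enumerate(patch):
            for dx, value in enumerate(patch_row):
                copy[top + dy][left + dx] = value
        pasted.append(copy)
    return pasted, video
```

In both cases the supervision signal is the original video itself, which is what removes the need for paired "before and after" data.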

Experiments & Results

In the quantitative comparison, NOVA scores 0.917 on Background SSIM (BG-SSIM), a metric that measures how well the unedited background is preserved, outperforming both VACE and AnyV2V.
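
For intuition about what a background-restricted SSIM measures, here is a simplified sketch. It assumes a per-frame binary background mask and computes a single-window SSIM over the masked pixels, then averages across frames. Standard SSIM uses local sliding windows, and the exact BG-SSIM protocol behind the reported number may differ, so the functions `masked_ssim` and `bg_ssim` are illustrative only.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame with values in [0, data_range]
Mask = List[List[int]]     # 1 = background pixel, 0 = edited region


def masked_ssim(x: Frame, y: Frame, bg_mask: Mask, data_range: float = 1.0) -> float:
    """Single-window SSIM computed only over pixels where bg_mask is 1."""
    xs = [p for row, mrow in zip(x, bg_mask) for p, m in zip(row, mrow) if m]
    ys = [p for row, mrow in zip(y, bg_mask) for p, m in zip(row, mrow) if m]
    n = len(xs)
    if n == 0:
        raise ValueError("background mask selects no pixels")

    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((a - mu_x) ** 2 for a in xs) / n
    var_y = sum((b - mu_y) ** 2 for b in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n

    # Usual SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def bg_ssim(source: List[Frame], edited: List[Frame], masks: List[Mask]) -> float:
    """Average the masked SSIM over all frames of a clip."""
    scores = [masked_ssim(s, e, m) for s, e, m in zip(source, edited, masks)]
    return sum(scores) / len(scores)
```

Under this definition, changes confined to the edited region leave the score at 1, while any drift in the background pulls it down.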

SOTA Comparison

As seen in the qualitative results below, NOVA maintains a sharp, consistent environment even when objects are removed or added, whereas baselines often generate blurry or flickering textures.

[Figure: Qualitative Comparison]

Key Insights from Ablation

The Dense Branch is Non-Negotiable: Without it, the model hallucinates non-existent details in the background. With it, even a blurred source video can be used to reconstruct a sharp, consistent output.
Interval Robustness: Although trained on 10-frame intervals, NOVA remains robust when users provide keyframes only every 20 frames, which suggests it generalizes beyond its training interval.

[Figure: Experiment Results]

Conclusion & Future Outlook

NOVA moves local video editing away from paired-data dependency and per-video fine-tuning. That shift could open the door to real-time, high-fidelity video manipulation tools.

Takeaway: The key to stable video generation isn't just "more data"—it's an architectural design that understands which signals are sparse (user intent) and which are dense (physical reality).

Limitations: The model still depends on the quality of the initial image-based keyframe edits. If the edited keyframes are poor, video coherence will suffer. Future work may integrate keyframe creation into the same unified framework.
