WisPaper
VEFX-Bench: Decoupling Semantic Faithfulness from Visual Quality in Video Editing
Abstract

This paper introduces VEFX-Bench, a holistic framework for video editing evaluation consisting of a human-annotated dataset (VEFX-Dataset), a multi-dimensional reward model (VEFX-Reward), and a standardized benchmark. It achieves state-of-the-art alignment with human judgment in video editing quality assessment, outperforming generic VLM judges and prior unified reward models.

Executive Summary

TL;DR: Researchers have introduced VEFX-Bench, a comprehensive framework for evaluating video editing. It includes a dataset of 5,049 human-annotated examples and VEFX-Reward, a specialized reward model that outperforms GPT-4 and other VLM judges at detecting whether an AI-edited video actually follows the instruction while keeping the rest of the scene intact.

Background Positioning: This work moves beyond "visual plausibility" (does it look real?) to "functional accuracy" (did it do what I asked?). It addresses a critical gap in professional AI-assisted filmmaking by providing a standardized yardstick to compare commercial giants like Kling, Runway, and Luma.

Problem & Motivation: The "Black Box" of Video Quality

Why is evaluating video editing so hard? Unlike text-to-video generation, editing involves a "triplet" of constraints:

  1. Instruction Following (IF): Did the apple turn into a banana?
  2. Rendering Quality (RQ): Does the banana look realistic and stable?
  3. Edit Exclusivity (EE): Did the background stay the same, or did the whole world melt?

Previous benchmarks like VE-Bench or OpenVE-3M either lacked human labels, lacked the actual edited outputs for training, or collapsed all these factors into a single, uninformative "quality score." The authors identify that a model can be a "creative failure" (looks great but ignored the prompt) or a "VFX disaster" (followed the prompt but ruined the resolution).
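To see why a single collapsed score is uninformative, consider the following minimal sketch. The `EditScores` class, the unweighted mean, and the example scores are illustrative assumptions rather than anything taken from the paper; the sketch only shows that two very different failure modes can produce the same aggregate number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditScores:
    """Hypothetical container for per-dimension scores on a 1-4 rubric."""
    instruction_following: int  # IF
    rendering_quality: int      # RQ
    edit_exclusivity: int       # EE

    def collapsed(self) -> float:
        """Unweighted mean, standing in for a single 'quality score'."""
        return (self.instruction_following
                + self.rendering_quality
                + self.edit_exclusivity) / 3

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension (ties resolve in IF, RQ, EE order)."""
        named = {
            "IF": self.instruction_following,
            "RQ": self.rendering_quality,
            "EE": self.edit_exclusivity,
        }
        return min(named, key=named.get)


# Made-up scores for illustration only.
creative_failure = EditScores(instruction_following=1, rendering_quality=4, edit_exclusivity=4)
vfx_disaster = EditScores(instruction_following=4, rendering_quality=1, edit_exclusivity=4)

# Both clips collapse to the same number...
print(creative_failure.collapsed(), vfx_disaster.collapsed())
# ...but the per-dimension view exposes different weaknesses.
print(creative_failure.weakest(), vfx_disaster.weakest())
```

Keeping the three dimensions separate is what lets a benchmark tell these two cases apart.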

Methodology: The Core of VEFX-Reward

The authors built VEFX-Reward on top of the Qwen3-VL architecture. Its secret sauce lies in its input structure and loss function.

1. Joint Triplet Reasoning

Instead of just looking at the final clip, the model ingests the Source Video, the Text Instruction, and the Edited Video together. This allows the transformer to perform "differential reasoning": working out what changed and what should have remained static.

2. Ordinal Regression

Human annotators score each dimension on a 1-4 rubric. Standard regression (L2 loss) treats these scores as evenly spaced points on a continuous scale and often fails to capture their "threshold" nature. VEFX-Reward instead uses Ordinal Regression, modeling the score as a sequence of ordered binary decisions (e.g., Is this better than a 2? Is it better than a 3?), which matches more closely how people apply an ordered rubric.
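The "ordered binary decisions" idea can be sketched in a few lines. This is a generic threshold-style ordinal formulation, written under the assumption of one logit per threshold and a plain sum of binary cross-entropies; the summary does not spell out the exact loss used in VEFX-Reward, and all function names here are hypothetical.

```python
import math

NUM_LEVELS = 4  # rubric scores 1..4, so there are 3 thresholds


def ordinal_targets(score: int, num_levels: int = NUM_LEVELS) -> list[int]:
    """Encode a score as answers to 'is it above 1?', 'above 2?', 'above 3?'."""
    if not 1 <= score <= num_levels:
        raise ValueError(f"score must be in 1..{num_levels}")
    return [1 if score > k else 0 for k in range(1, num_levels)]


def ordinal_loss(logits: list[float], score: int) -> float:
    """Sum of binary cross-entropies with logits, one term per threshold."""
    targets = ordinal_targets(score, len(logits) + 1)
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of BCE-with-logits.
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total


def decode(logits: list[float]) -> int:
    """Predicted score = 1 + number of thresholds the model says are exceeded."""
    return 1 + sum(1 for z in logits if z > 0.0)


logits = [2.0, 1.0, -3.0]        # made-up model outputs for one dimension
print(ordinal_targets(3))        # [1, 1, 0]
print(decode(logits))            # 3
print(ordinal_loss(logits, 3) < ordinal_loss(logits, 1))  # True
```

Unlike an L2 loss on the raw score, each threshold gets its own decision boundary, so the gap between a 1 and a 2 need not be treated as equal to the gap between a 3 and a 4.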

Figure 1: The VEFX framework, transitioning from dataset annotation to reward model training and final system benchmarking.

Experiments: Ranking the Giants

The researchers put 10 major models to the test. The results revealed a significant "Locality Gap."

SOTA Comparison

Kling o3 omni and Kling o1 emerged as the leaders, showing the best balance across all three dimensions. However, a fascinating trend emerged: most models are much better at Rendering Quality (RQ) than Instruction Following (IF).

The Locality Crisis

The most difficult dimension for current AI is Edit Exclusivity (EE). Open-source models like VACE and older commercial versions often "over-edit," changing the lighting or texture of the entire video just to modify one small object.

Figure 2: Dataset statistics showing the distribution of IF, RQ, and EE scores. Note the high polarization in the IF (Instruction Following) dimension.

Critical Analysis & Conclusion

Takeaway

The release of VEFX-Bench provides the community with a "North Star" for video editing. By decoupling IF, RQ, and EE, researchers can now target their model's specific weaknesses. For instance, they can focus on 3D consistency to improve camera-angle edits, which were found to be the most difficult task.

Limitations

Despite these advances, VEFX-Reward still samples videos at 4 FPS for evaluation due to compute constraints. High-frequency artifacts (like micro-flickering) can therefore fall between sampled frames and be missed by the reward model even if a human spots them.
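A small worked example makes the sampling limitation concrete. The 24 FPS source rate, the clip length, and the uniform nearest-frame sampling scheme are assumptions chosen for illustration; the summary only states the 4 FPS evaluation rate.

```python
def sampled_frame_indices(num_frames: int, source_fps: float, sample_fps: float) -> list[int]:
    """Indices of source frames nearest to evenly spaced sampling times."""
    if source_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration_s = num_frames / source_fps
    num_samples = int(duration_s * sample_fps)
    indices = []
    for i in range(num_samples):
        idx = round(i * source_fps / sample_fps)
        indices.append(min(idx, num_frames - 1))  # clamp to the last frame
    return indices


# A hypothetical 2-second clip at 24 FPS, evaluated at 4 FPS.
seen = sampled_frame_indices(num_frames=48, source_fps=24, sample_fps=4)
print(seen)  # [0, 6, 12, 18, 24, 30, 36, 42]

# A flicker confined to frames 7 and 8 never reaches the reward model.
flicker_frames = {7, 8}
print(flicker_frames.isdisjoint(seen))  # True
```

Under these assumptions only one source frame in six is inspected, so any artifact shorter than about five consecutive frames can go unseen.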

Future Work

The authors suggest that the next step is using VEFX-Reward directly in the training loop (e.g., via DPO or PPO) to optimize models not just to produce "cool videos," but to become precise, reliable tools for directors and editors.
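As a rough sketch of what the DPO route could look like: score several candidate edits for the same instruction, keep the best and worst as a preference pair, and apply the standard DPO objective. The scalar reward per candidate, the `min_margin` filter, and both function names are assumptions for illustration; the summary does not describe how the three reward dimensions would be combined or how pairs would be chosen.

```python
import math
from typing import Optional


def preference_pair(rewards: dict[str, float],
                    min_margin: float = 0.5) -> Optional[tuple[str, str]]:
    """Return (chosen, rejected) candidate ids, or None if rewards are too close."""
    if len(rewards) < 2:
        return None
    ranked = sorted(rewards, key=rewards.get, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if rewards[chosen] - rewards[rejected] < min_margin:
        return None
    return chosen, rejected


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio difference))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up reward scores for three candidate edits of one instruction.
pair = preference_pair({"edit_a": 3.5, "edit_b": 1.5, "edit_c": 2.0})
print(pair)  # ('edit_a', 'edit_b')

# The loss drops below log(2) once the policy favors the chosen edit
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < math.log(2))  # True
```

Filtering out pairs with a small reward gap is one simple way to avoid training on preferences the reward model is unsure about.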
