This paper introduces VEFX-Bench, a holistic framework for video editing evaluation consisting of a human-annotated dataset (VEFX-Dataset), a multi-dimensional reward model (VEFX-Reward), and a standardized benchmark. VEFX-Reward achieves state-of-the-art alignment with human judgment in video editing quality assessment, outperforming generic VLM judges and prior unified reward models.
Executive Summary
TL;DR: VEFX-Bench is presented as the first comprehensive ecosystem for evaluating video editing. It includes a dataset of 5,049 human-annotated examples and VEFX-Reward, a specialized reward model that outperforms GPT-4 and other VLM judges at detecting whether an AI-edited video actually follows the instruction while keeping the rest of the scene intact.
Background Positioning: This work moves beyond "visual plausibility" (does it look real?) to "functional accuracy" (did it do what I asked?). It addresses a critical gap in professional AI-assisted filmmaking by providing a standardized yardstick to compare commercial giants like Kling, Runway, and Luma.
Problem & Motivation: The "Black Box" of Video Quality
Why is evaluating video editing so hard? Unlike text-to-video generation, editing involves a "triplet" of constraints:
- Instruction Following (IF): Did the apple turn into a banana?
- Rendering Quality (RQ): Does the banana look realistic and stable?
- Edit Exclusivity (EE): Did the background stay the same, or did the whole world melt?
Previous benchmarks like VE-Bench or OpenVE-3M either lacked human labels, lacked the actual edited outputs for training, or collapsed all these factors into a single, uninformative "quality score." The authors identify that a model can be a "creative failure" (looks great but ignored the prompt) or a "VFX disaster" (followed the prompt but ruined the resolution).
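To make the point about collapsed scores concrete, here is a minimal sketch. It assumes each dimension is scored on a 1-4 scale and that the single "quality score" is a plain average; the `EditScores` class and both of its methods are hypothetical names used only for illustration, not part of VEFX-Bench.

```python
from dataclasses import dataclass


@dataclass
class EditScores:
    """Hypothetical container for the three per-dimension scores (assumed 1-4)."""
    instruction_following: int  # IF
    rendering_quality: int      # RQ
    edit_exclusivity: int       # EE

    def collapsed(self) -> float:
        # Assumption: the single "quality score" is an unweighted mean.
        return (self.instruction_following
                + self.rendering_quality
                + self.edit_exclusivity) / 3

    def weakest_dimension(self) -> str:
        # Return the label of the lowest-scoring dimension.
        scores = {
            "IF": self.instruction_following,
            "RQ": self.rendering_quality,
            "EE": self.edit_exclusivity,
        }
        return min(scores, key=scores.get)


# Looks great but ignored the prompt.
creative_failure = EditScores(instruction_following=1, rendering_quality=4, edit_exclusivity=4)
# Followed the prompt but ruined the rendering.
vfx_disaster = EditScores(instruction_following=4, rendering_quality=1, edit_exclusivity=4)

# Both collapse to the same average, yet they fail for different reasons.
print(creative_failure.collapsed(), creative_failure.weakest_dimension())
print(vfx_disaster.collapsed(), vfx_disaster.weakest_dimension())
```

The two edits are indistinguishable under the averaged score, while the per-dimension view shows which axis needs work.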
Methodology: The Core of VEFX-Reward
The authors built VEFX-Reward on top of the Qwen3-VL architecture. Its secret sauce lies in its input structure and loss function.
1. Joint Triplet Reasoning
Instead of just looking at the final clip, the model ingests the Source Video, the Text Instruction, and the Edited Video simultaneously. This allows the transformer to perform "differential reasoning": working out what changed and what should have remained static.
2. Ordinal Regression
Human annotators score on a 1-4 rubric. Standard regression (L2 loss) treats these labels as points on a continuous scale and often fails to capture their "threshold" nature. VEFX-Reward uses Ordinal Regression, modeling the score as a sequence of ordered binary decisions (e.g., Is this better than a 2? Is it better than a 3?), which better matches how a rater chooses between adjacent rubric levels.
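The "ordered binary decisions" idea can be sketched as follows. This is one common cumulative-threshold formulation of ordinal regression, not necessarily the exact loss used by VEFX-Reward; the function names and the assumption of three threshold logits per score are illustrative.

```python
import math

NUM_LEVELS = 4  # rubric scores 1..4, as described in the text


def ordinal_targets(score: int) -> list:
    """Encode a 1-4 score as K-1 binary answers to 'is it better than k?'."""
    return [1.0 if score > k else 0.0 for k in range(1, NUM_LEVELS)]


def bce_with_logit(z: float, t: float) -> float:
    """Numerically stable binary cross-entropy for a single logit z and target t."""
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def ordinal_loss(logits: list, score: int) -> float:
    """Sum of binary cross-entropies over the K-1 threshold decisions."""
    return sum(bce_with_logit(z, t) for z, t in zip(logits, ordinal_targets(score)))


def predict_score(logits: list) -> int:
    """Decode: 1 plus the number of thresholds the model judges as exceeded."""
    return 1 + sum(1 for z in logits if z > 0.0)


# Hypothetical logits for "better than 1?", "better than 2?", "better than 3?".
logits = [2.0, 1.0, -3.0]
print(ordinal_targets(3))        # [1.0, 1.0, 0.0]
print(predict_score(logits))     # 3
print(ordinal_loss(logits, 3) < ordinal_loss(logits, 1))  # True
```

Unlike an L2 loss, this penalizes each threshold crossing separately, so a prediction that is off by two levels gets two of the three binary decisions wrong rather than incurring a squared distance on an arbitrary numeric scale.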
Figure 1: The VEFX framework, transitioning from dataset annotation to reward model training and final system benchmarking.
Experiments: Ranking the Giants
The researchers put 10 major models to the test. The results revealed a significant "Locality Gap."
SOTA Comparison
Kling o3 omni and Kling o1 emerged as the leaders, showing the best balance across all three dimensions. However, a fascinating trend emerged: most models are much better at Rendering Quality (RQ) than Instruction Following (IF).
The Locality Crisis
The most difficult dimension for current AI is Edit Exclusivity (EE). Open-source models like VACE and older commercial versions often "over-edit," changing the lighting or texture of the entire video just to modify one small object.
Figure 2: Dataset statistics showing the distribution of IF, RQ, and EE scores. Note the high polarization in the IF (Instruction Following) dimension.
Critical Analysis & Conclusion
Takeaway
The release of VEFX-Bench provides the community with a "North Star" for video editing. By decoupling IF, RQ, and EE, researchers can now specifically target their model's weaknesses, for instance by focusing on 3D consistency to improve camera-angle edits, which were found to be the most difficult task.
Limitations
Despite the breakthrough, the reward model still samples videos at only 4 FPS for evaluation due to compute constraints. This means high-frequency artifacts (like micro-flickering) that fall between sampled frames might be missed by the reward model even if a human spots them.
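A rough sketch of why 4 FPS can miss short artifacts, assuming a 24 FPS source clip and uniform frame sampling. The actual sampling strategy is not described here, so both function names and the sampling rule are assumptions.

```python
def sampled_frame_indices(num_frames: int, source_fps: float, sample_fps: float = 4.0) -> list:
    """Indices of source frames kept when sampling uniformly at sample_fps."""
    step = source_fps / sample_fps  # source frames between two kept frames
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def artifact_is_seen(artifact_frames: list, kept: list) -> bool:
    """True if at least one frame containing the artifact was sampled."""
    kept_set = set(kept)
    return any(f in kept_set for f in artifact_frames)


# One second of a 24 FPS clip: only every sixth frame reaches the evaluator.
kept = sampled_frame_indices(num_frames=24, source_fps=24.0)
print(kept)                                # [0, 6, 12, 18]
print(artifact_is_seen([7], kept))         # False: a one-frame flicker at frame 7 is skipped
print(artifact_is_seen([5, 6, 7], kept))   # True: a three-frame glitch overlaps frame 6
```

Under these assumptions, a single-frame flicker has only a 4 in 24 chance of landing on a sampled frame, which is the blind spot described above.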
Future Work
The authors suggest that the next step is using VEFX-Reward directly in the training loop (e.g., via DPO or PPO) to optimize models not just to produce "cool videos," but to become precise, reliable tools for directors and editors.
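As a hedged illustration of how reward scores could feed preference-based training such as DPO, the sketch below turns per-dimension scores for several candidate edits of the same source video and instruction into (chosen, rejected) pairs. The unweighted mean, the `margin` threshold, and the function name are assumptions made for this example; no such recipe is specified in the text above.

```python
def build_preference_pairs(candidates: list, margin: float = 0.5) -> list:
    """Build (chosen_id, rejected_id) pairs from reward scores.

    candidates: list of (edit_id, (if_score, rq_score, ee_score)) tuples,
    all for ONE source video and instruction.
    Assumption: dimensions are combined with an unweighted mean, and a pair
    is kept only when the gap is at least `margin`.
    """
    scored = sorted(
        ((sum(scores) / len(scores), edit_id) for edit_id, scores in candidates),
        reverse=True,
    )
    pairs = []
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            if scored[i][0] - scored[j][0] >= margin:
                pairs.append((scored[i][1], scored[j][1]))
    return pairs


# Hypothetical reward outputs for three candidate edits.
candidates = [
    ("edit_a", (4, 4, 4)),
    ("edit_b", (2, 3, 4)),
    ("edit_c", (4, 4, 3)),
]
print(build_preference_pairs(candidates))
# [('edit_a', 'edit_b'), ('edit_c', 'edit_b')]
```

The margin filters out near-ties such as edit_a versus edit_c, where the reward model's preference is too weak to be a reliable training signal.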
