[CVPR 2026] SlotVTG: Breaking the Shortcut Trap in Video Grounding with Object-Centric Adapters
Abstract

The paper introduces SlotVTG, an object-centric adapter framework for Generalizable Video Temporal Grounding (VTG) using Multimodal Large Language Models (MLLMs). It utilizes a lightweight Slot Adapter and a Slot Alignment Loss to decompose video tokens into entity-level representations, achieving state-of-the-art Out-of-Domain (OOD) generalization.

TL;DR

Current Multimodal Large Language Models (MLLMs) are surprisingly bad at generalizing in Video Temporal Grounding (VTG). They tend to "cheat" by memorizing dataset biases rather than looking at the video. SlotVTG fixes this by introducing a lightweight Slot Adapter that forces the model to perceive videos as a collection of semantic objects (slots), rather than an undifferentiated blob of pixels. This simple change leads to massive improvements in zero-shot transfer across different video domains.

The "Shortcut" Problem: Why MLLMs Fail at Generalization

In the world of Video Temporal Grounding, the goal is simple: find the start and end times of an event described in text. However, the authors discovered a disturbing trend: fine-tuned MLLMs often ignore the actual visual content on Out-of-Domain (OOD) data.

Through a "Noise Perturbation" experiment, they found that blurring the ground-truth segment of a video degrades OOD accuracy barely more than blurring a random segment. In other words, the model isn't truly "grounding" the query; it's guessing from statistical shortcuts learned during training.
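The diagnostic can be sketched in a few lines. This is an illustrative NumPy version, not the paper's exact protocol: `blur_segment` and `shortcut_gap` are hypothetical names, and the paper may use a different perturbation and metric.

```python
import numpy as np

def blur_segment(frames: np.ndarray, start: int, end: int, k: int = 7) -> np.ndarray:
    """Return a copy of `frames` with a k-by-k box blur applied to frames[start:end].

    frames: (T, H, W) grayscale video as a float array.
    """
    out = frames.copy()
    pad = k // 2
    for t in range(start, end):
        f = np.pad(frames[t], pad, mode="edge")
        acc = np.zeros_like(frames[t])
        # Naive box blur: average over the k*k neighborhood (clarity over speed).
        for dy in range(-pad, pad + 1):
            for dx in range(-pad, pad + 1):
                acc += f[pad + dy: pad + dy + frames.shape[1],
                         pad + dx: pad + dx + frames.shape[2]]
        out[t] = acc / (k * k)
    return out

def shortcut_gap(acc_clean: float, acc_gt_blur: float, acc_rand_blur: float) -> float:
    """A model that truly grounds the query should lose far more accuracy when the
    ground-truth segment is blurred than when a random segment is; a gap near zero
    (or negative) indicates shortcut reliance."""
    return (acc_clean - acc_gt_blur) - (acc_clean - acc_rand_blur)
```

A near-zero `shortcut_gap` on OOD data is exactly the failure mode the authors report.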

Methodology: Perception through Decomposition

To force the MLLM to be "honest," the authors introduce SlotVTG. The architecture relies on two key components:

1. The Slot Adapter

Instead of passing visual tokens directly into the LLM, they pass them through a bottleneck. Inside this bottleneck, a set of learnable Slots compete via Slot Attention to explain the visual features. This forces the model to disentangle the scene into discrete entities (e.g., "person," "cup," "background").
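The competition step can be sketched in NumPy. This is a simplified, Locatello-style Slot Attention readout; the paper's adapter presumably adds learned query/key/value projections and a GRU-based slot update, which are omitted here for brevity.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens: np.ndarray, num_slots: int = 4,
                   iters: int = 3, seed: int = 0):
    """Minimal Slot Attention sketch.

    tokens: (N, D) visual tokens for one frame/clip.
    Returns slots (num_slots, D) and the assignment map (num_slots, N).
    The softmax is over the SLOT axis, so slots compete to explain each token
    and each token's attention mass sums to 1 across slots.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))
    assign = None
    for _ in range(iters):
        logits = slots @ tokens.T / np.sqrt(d)        # (S, N)
        assign = softmax(logits, axis=0)               # compete over slots
        weights = assign / (assign.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ tokens                       # weighted-mean slot update
    return slots, assign
```

The key design choice is the softmax axis: normalizing over slots (not tokens) is what creates the competitive, entity-discovering bottleneck.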

*Figure: SlotVTG model architecture.*

2. Slot Alignment Loss (SA Loss)

How do we ensure these slots actually correspond to objects? The authors use DINOv2—a self-supervised model known for its emergent object-aware features. By aligning the slot attention maps with DINOv2's feature similarity, the model is guided to group semantically coherent pixels into the same slot without needing explicit object bounding boxes.
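One plausible realization of this idea is below. The paper's exact loss formulation may differ; this NumPy sketch simply pushes tokens with similar (frozen) DINOv2 features toward the same slot via a cross-entropy between a DINO-derived token affinity and the slot co-assignment distribution.

```python
import numpy as np

def slot_alignment_loss(assign: np.ndarray, dino_feats: np.ndarray,
                        tau: float = 0.1) -> float:
    """Hedged sketch of a Slot Alignment (SA) loss.

    assign:     (S, N) soft slot-assignment map (columns sum to 1).
    dino_feats: (N, D) frozen DINOv2 patch features for the same tokens.
    """
    # Target: token-token affinity from DINOv2 (row-softmaxed cosine similarity).
    f = dino_feats / (np.linalg.norm(dino_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T / tau
    sim = sim - sim.max(axis=1, keepdims=True)
    p = np.exp(sim)
    p = p / p.sum(axis=1, keepdims=True)               # (N, N)

    # Prediction: probability two tokens share a slot, from the slot map.
    q = assign.T @ assign                               # (N, N)
    q = q / (q.sum(axis=1, keepdims=True) + 1e-8)

    # Cross-entropy: low when DINO-similar tokens co-occupy the same slot.
    return float(-(p * np.log(q + 1e-8)).sum(axis=1).mean())
```

Because DINOv2 stays frozen, the supervision is free: no bounding boxes or segmentation masks are required.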

Experimental Results: Generalization is Real

The results across three major benchmarks (Charades-STA, QVHighlights, ActivityNet) show that SlotVTG consistently outperforms standard fine-tuning (Chrono-Qwen) in OOD settings.

| Source $\rightarrow$ Target | Method | R1@0.5 (OOD) | Improvement |
| :--- | :--- | :--- | :--- |
| Charades $\rightarrow$ QVH | Baseline | 43.6 | - |
| Charades $\rightarrow$ QVH | SlotVTG | 47.9 | +4.3 pts |
| Charades $\rightarrow$ ANet | Baseline | 26.3 | - |
| Charades $\rightarrow$ ANet | SlotVTG | 28.7 | +2.4 pts |
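For readers unfamiliar with the metric: R1@0.5 is the percentage of queries whose top-1 predicted segment overlaps the ground truth with temporal IoU of at least 0.5. A standard NumPy-free computation:

```python
def temporal_iou(pred, gt) -> float:
    """IoU between two [start, end] time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresh: float = 0.5) -> float:
    """R1@thresh: percent of queries whose top-1 segment has IoU >= thresh."""
    hits = [temporal_iou(p, g) >= thresh for p, g in zip(preds, gts)]
    return 100.0 * sum(hits) / len(hits)
```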

These gains are achieved using only ~7.6M to 23.3M trainable parameters, representing a tiny fraction of the base 3B or 7B models.

*Figure: Slot attention visualization.* The model successfully segments "people" and "objects" even in unseen video environments, providing a domain-invariant "vocabulary" for the LLM.

Critical Insight & Conclusion

The core takeaway is that Object-Centricity is a robust prior for generalization. By compressing high-dimensional visual tokens into a few semantic slots, we create a bottleneck that filters out domain-specific "noise" (like lighting or resolution) and keeps the "entities" that are essential for reasoning.

While SlotVTG shows impressive OOD gains, the challenge remains for very complex scenes where more than 4-8 slots might be needed. Future work will likely explore dynamic slot allocation to handle increasingly cluttered video environments.

For practitioners looking to improve MLLM robustness, SlotVTG provides a clear blueprint: Don't just fine-tune—decompose.
