Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

[CVPR 2026] ESTF-SSM: Mastering Temporal Action Detection with Asymmetric State Space Models

总结

问题

方法

结果

要点

摘要

The paper introduces ESTF-SSM, an end-to-end Temporal Action Detection (TAD) framework that integrates an Efficient Spatial-Temporal Focal (ESTF) Adapter into a frozen video backbone. By leveraging a novel Temporal Boundary-aware State Space Model (TB-SSM), the method achieves SOTA performance on THUMOS14 (75.3% Avg. mAP) and ActivityNet-1.3 with linear computational complexity.

TL;DR

Temporal Action Detection (TAD) scales poorly with video length due to the quadratic complexity of Transformers and the limited reach of CNNs. ESTF-SSM tackles this by introducing a "Temporal Boundary-aware SSM" (TB-SSM) within a lightweight adapter. By modeling the forward and backward temporal flows with independent state transitions, it achieves superior boundary precision on THUMOS14 and ActivityNet while maintaining linear efficiency.

The Scaling Wall: Why Attention Isn't All You Need for Video

In the realm of long-form video understanding, the industry faces a dilemma:

CNNs are efficient but "blind" to long-range dependencies, often missing the broader context of an action.
Transformers provide global context but suffer from Quadratic Complexity $O (N^{2})$ , making them memory-hungry for high-resolution, untrimmed videos.
Existing SSMs (like Mamba) offer $O (N)$ efficiency but struggle with Oversmoothing. In TAD, the exact "flicker" of a movement starting or ending is vital; standard SSMs tend to blend these fine-grained details into the global feature representation.

Methodology: Decoupling and Specializing

The authors propose the Efficient Spatial-Temporal Focal (ESTF) Adapter. Instead of full-parameter fine-tuning, they insert these adapters into a frozen backbone (like VideoMAEv2).

1. Spatial-Temporal Decoupling

The adapter splits the feature flow into two branches:

Spatial Branch: Uses Depthwise Convolutions to maintain local structural cues.
Temporal Branch: Uses the new TB-SSM to reason across time steps.

This decoupling ensures that the model doesn't waste computational cycles trying to attend to every pixel-timestamp pair simultaneously.

Model Architecture

2. TB-SSM: The Power of Asymmetry

The "secret sauce" is the Temporal Boundary-aware SSM. Unlike standard bidirectional Mamba that might share parameters, TB-SSM uses independent transition matrices ( $A^{f}$ and $A^{b}$ ) for forward and backward scans.

Why? The visual cues for an action starting (onset) are often physically and visually different from an action ending (offset). By parameterizing these directions separately, the model captures the unique "energy signatures" of action boundaries.

Experimental Results: Precision and Efficiency

The model was tested against heavyweights like ActionFormer and MambaTAD.

THUMOS14 Performance: Reached 75.3% Average mAP, outperforming the previous best (MambaTAD) by 1.1%.
Boundary Precision: The gain is most significant at high tIoU (0.7), proving that TB-SSM is significantly better at "snapping" to the correct start/end frames.
Efficiency: Despite the performance boost, ESTF-SSM consumes less memory (28.6G) than its end-to-end competitors (typically >30G), thanks to its adapter-based design and linear SSM core.

Performance Comparison

Visualizing the Difference

Qualitative results show that while previous SOTA models might detect the general vicinity of an action, ESTF-SSM aligns its predictions almost perfectly with the Ground Truth, especially in complex indoor scenes (Charades dataset) where background clutter is high.

Qualitative Results

Conclusion & Insights

The ESTF-SSM framework proves that we don't need to choose between efficiency and precision. The key takeaway for the research community is that structural inductive biases (like asymmetric temporal scanning) still matter, even in the age of massive foundation models.

Limitations: While the model is highly efficient for offline detection, real-time "online" action detection (where future frames aren't available for backward scans) remains a direction for future work.

发现相似论文

试试这些示例

Search for recent papers using State Space Models (SSMs) specifically for dense prediction tasks in video like temporal action localization or video object segmentation.
Identify the origin of "asymmetric state dynamics" in sequence modeling and how it compares to the bi-directional SSM approach used in VisionMamba.
Explore how parameter-efficient adapters like ESTF are being applied to large-scale video backbones (e.g., VideoMAEv2 or InternVideo) for downstream tasks.

[CVPR 2026] ESTF-SSM: Mastering Temporal Action Detection with Asymmetric State Space Models

1. TL;DR

2. The Scaling Wall: Why Attention Isn't All You Need for Video

3. Methodology: Decoupling and Specializing

3.1. 1. Spatial-Temporal Decoupling

3.2. 2. TB-SSM: The Power of Asymmetry

4. Experimental Results: Precision and Efficiency

5. Visualizing the Difference

6. Conclusion & Insights