UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

[CVPR 2025] UTPTrack: Simple and Unified Token Pruning for Visual Tracking

Abstract

UTPTrack is a simple and unified token pruning framework for one-stream Transformer-based visual trackers that jointly compresses the search region, dynamic template, and static template. It achieves state-of-the-art accuracy-efficiency trade-offs, pruning over 65% of tokens while maintaining up to 100.5% of the baseline performance across RGB and multimodal benchmarks.

TL;DR

UTPTrack is the first framework to implement joint token pruning across the three token groups of modern one-stream trackers: the Search Region (SR), the Dynamic Template (DT), and the Static Template (ST). By treating token redundancy as a single problem rather than pruning each component in isolation, UTPTrack cuts MACs by about 31% and vision tokens by more than 65% while maintaining, or even slightly improving, the original tracking accuracy.

Background: The Cost of Global Interaction

The shift from two-stream (Siamese) trackers to one-stream Transformer trackers (like OSTrack) brought a large gain in accuracy by allowing early, dense interaction between the template and the search area. That gain has a cost: self-attention scales quadratically with the number of tokens, which makes real-time deployment on mobile or CPU-only devices difficult.
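
To make the quadratic term concrete, here is a small back-of-the-envelope sketch. The token count, embedding width, and keep fraction are hypothetical values chosen for illustration, not figures from the paper, and the cost model counts only the dominant matrix products.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate MACs for the token-mixing part of one self-attention layer.

    Counts only the two products that grow quadratically with the token
    count: Q @ K^T and softmax(Q K^T) @ V. Linear projections are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def mlp_macs(num_tokens: int, dim: int, expansion: int = 4) -> int:
    """Approximate MACs for a two-layer MLP block, which is linear in tokens."""
    return 2 * num_tokens * dim * (dim * expansion)


# Hypothetical sizes, for illustration only.
full_tokens, dim = 400, 768
kept_tokens = full_tokens * 35 // 100  # keep 35% of the tokens -> 140

attn_ratio = attention_macs(kept_tokens, dim) / attention_macs(full_tokens, dim)
mlp_ratio = mlp_macs(kept_tokens, dim) / mlp_macs(full_tokens, dim)

print(f"attention cost ratio: {attn_ratio:.4f}")  # 0.35 squared
print(f"MLP cost ratio:       {mlp_ratio:.4f}")   # 0.35
```

In this toy model the attention term shrinks with the square of the keep fraction, while the MLP term shrinks only linearly. A whole-model saving will generally be smaller than the per-layer attention figure, presumably because pruning is applied progressively and much of the compute is linear in the token count.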

Previous token pruning work for trackers usually targeted only the Search Region. UTPTrack argues that the search region and both templates interact through attention, so their redundancy should be handled together. Pruning the templates alongside the search region isn't just about speed; it's about removing noise and ensuring the model focuses only on the most discriminative features.

Methodology: Unified Redundancy Modeling

The core innovation of UTPTrack lies in its Candidate or Template Elimination Module (CTEM). Instead of using complex auxiliary networks to predict which tokens are important, it reuses the model's own attention maps.

1. Unified Pruning Logic

The framework calculates the relevance of all tokens (SR, DT, and ST) relative to the center token of the Static Template. This center token acts as the "gold standard" of what the target looks like.
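
A minimal sketch of this kind of scoring, assuming the relevance of a token is the head-averaged attention that the Static Template center token pays to it, followed by a plain top-k selection. The function names, the nested-list layout, and the single global top-k are illustrative assumptions, not the paper's implementation.

```python
def center_token_scores(attn, center_idx):
    """Average over heads the attention row of the reference token.

    attn: nested list shaped [heads][tokens][tokens]; rows are queries.
    center_idx: index of the static-template center token (assumed known).
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    return [
        sum(attn[h][center_idx][j] for h in range(num_heads)) / num_heads
        for j in range(num_tokens)
    ]


def keep_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy attention: 2 heads, 4 tokens, token 0 plays the role of the center token.
uniform = [0.25, 0.25, 0.25, 0.25]
toy_attn = [
    [[0.4, 0.1, 0.3, 0.2], uniform, uniform, uniform],
    [[0.2, 0.1, 0.5, 0.2], uniform, uniform, uniform],
]
toy_scores = center_token_scores(toy_attn, center_idx=0)
kept = keep_top_k(toy_scores, k=2)
print(kept)  # tokens 0 and 2 receive the most attention from the center token
```

Because the scores come from attention weights the model already computes, this style of selection adds no learned parameters.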

2. Token Type-Aware (TTA) Strategy

Pruning the Static Template is risky: if the target features there are lost, the whole track fails. UTPTrack introduces a spatial prior that uses the initial bounding box to create a "Soft Bonus" for tokens inside the target area, so that foreground information is preserved even under aggressive pruning.
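
One way to picture the soft bonus is as an additive boost to the relevance scores of template patches that lie inside the initial box. The sketch below assumes a fixed additive bonus, a box given in patch-grid units, and a patch-center membership test; all three are hypothetical choices for illustration, and the paper may define the prior differently.

```python
def box_mask(grid_h, grid_w, box):
    """Flag template patches whose centers fall inside the box.

    box: (x0, y0, x1, y1) in patch-grid units (hypothetical convention).
    Returns a flat, row-major list of booleans.
    """
    x0, y0, x1, y1 = box
    return [
        x0 <= col + 0.5 <= x1 and y0 <= row + 0.5 <= y1
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def add_soft_bonus(scores, inside, bonus=0.1):
    """Raise in-box scores by a fixed amount instead of forcing them to be kept."""
    return [s + bonus if flag else s for s, flag in zip(scores, inside)]


# Toy 3x3 template; the box covers only the central patch (index 4).
inside = box_mask(3, 3, box=(1, 1, 2, 2))
raw = [0.20, 0.05, 0.05, 0.05, 0.15, 0.05, 0.05, 0.05, 0.05]
boosted = add_soft_bonus(raw, inside, bonus=0.1)

best_before = max(range(len(raw)), key=raw.__getitem__)
best_after = max(range(len(boosted)), key=boosted.__getitem__)
print(best_before, best_after)  # the central patch overtakes the corner patch
```

The bonus is "soft" in the sense that a background token with a high enough score can still outrank a foreground token; nothing is kept unconditionally.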

3. Text-Guided Pruning

For unified trackers that handle natural language, UTPTrack integrates language cues. The importance of a visual token is determined not just by visual similarity, but also by its alignment with the CLIP-encoded text description.
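
The text above does not give the exact fusion rule, so the following is only a sketch under an assumed form: a convex combination of the visual relevance score and the cosine similarity between each token feature and a text embedding. The weight `alpha` and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_guided_scores(visual_scores, token_feats, text_feat, alpha=0.5):
    """Blend visual relevance with text alignment; alpha is an assumed weight."""
    return [
        (1 - alpha) * s + alpha * cosine(f, text_feat)
        for s, f in zip(visual_scores, token_feats)
    ]


# Two tokens with equal visual scores; only the first aligns with the text.
visual = [0.5, 0.5]
feats = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
fused = text_guided_scores(visual, feats, text, alpha=0.5)
print(fused)  # the text-aligned token ends up with the higher score
```

With equal visual scores, the language cue alone decides which token survives, which is the behaviour the paragraph describes.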

Figure 1 (Overall Architecture): The UTPTrack architecture showing the joint processing of SR, DT, and ST tokens through the CTEM modules.

Performance: More Speed, More Accuracy?

The results from 10 benchmarks (including LaSOT, TrackingNet, and multimodal sets like RGBT) are striking.

Efficiency: On OSTrack384, MACs are reduced by 31.3% and vision tokens by 65.4%.
Accuracy Paradox: In several cases (like SUTrack384), the pruned model actually outperformed the full baseline (100.5% relative accuracy).

This suggests that UTPTrack isn't just making the model smaller; it's making it smarter by filtering out background clutter that might otherwise distract the attention mechanism.

Table 1 (Performance Comparison): Controlled-budget comparison showing UTPTrack leading against other SOTA pruning methods like ToMe and EViT.

Visualization: Seeing the Pruning in Action

The progressive pruning schedule (Stages 1 through 6) shows how the model systematically discards the background. By the final stage, only a concentrated cluster of tokens on the target and its most relevant template features remain.

Figure 2: Visualization of the pruning process across different tracking stages.
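
A progressive schedule of this kind can be sketched as repeated application of a keep ratio at each pruning stage. The six stages follow the description above; the starting token count and the per-stage keep percentage are hypothetical, and the actual schedule may use different ratios per stage or per token type.

```python
def progressive_token_counts(initial_tokens, keep_percent, num_stages=6):
    """Token count after each pruning stage, rounding up at every stage.

    keep_percent is an integer percentage so the arithmetic stays exact.
    """
    counts = [initial_tokens]
    for _ in range(num_stages):
        kept = -(-counts[-1] * keep_percent // 100)  # ceiling division
        counts.append(max(1, kept))
    return counts


# Hypothetical setting: 256 search tokens, keep 80% at each of 6 stages.
schedule = progressive_token_counts(256, keep_percent=80)
print(schedule)  # [256, 205, 164, 132, 106, 85, 68]
```

In this toy setting roughly a quarter of the tokens remain after the last stage, which mirrors the gradual concentration on the target seen in the visualization.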

Conclusion & Insights

UTPTrack shows that the "One-Stream" paradigm in tracking still has room for optimization. Key takeaways for researchers include:

Holistic Pruning: Don't treat inputs as islands. Pruning the template and search region jointly preserves systemic alignment.
Attention as a Zero-Cost Metric: The inherent attention weights of a pre-trained Transformer are often sufficient for identifying redundancy without needing extra MLP "predictor" layers.
Regularization via Pruning: Aggressive token removal can act as a noise filter, potentially solving some distraction issues in complex backgrounds.

UTPTrack offers a strong foundation for high-performance, real-time visual tracking across RGB and multimodal settings.
