WisPaper
MegaFlow: Bridging Vision Foundation Models and Large Displacement Optical Flow
Abstract

MegaFlow is a unified architecture for zero-shot large displacement optical flow and point tracking that achieves SOTA performance by adapting pre-trained vision priors (DINOv2) to dynamic motion. It leverages a global matching formulation followed by lightweight iterative refinement, outperforming previous multi-frame methods on benchmarks like Sintel (Final EPE 1.83) and KITTI.

TL;DR

MegaFlow is a powerful new framework that tackles the "Achilles' heel" of optical flow: large displacement motion. By repurposing pre-trained vision foundation models (DINOv2) and replacing local search with global matching, it achieves state-of-the-art zero-shot performance across Sintel, KITTI, and Spring benchmarks. Remarkably, it also functions as a highly capable point tracker without any architectural changes.

Problem & Motivation: The Failure of Local Search

Standard flow estimators (like RAFT) rely on an iterative local search. While excellent for sub-pixel precision, they are "nearsighted." When an object moves 100+ pixels between frames, the local correlation window finds nothing but noise. Previous attempts to fix this involved complex image pyramids or task-specific fine-tuning, which often broke down when presented with "out-of-distribution" real-world data.
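This "nearsightedness" can be demonstrated with a toy 1-D matching experiment (a numpy sketch; `best_match` and its signature are our illustration, not RAFT's actual code). A feature that jumps 100 positions between frames is trivially recovered by a global argmax, but is invisible to a small local window:

```python
import numpy as np

def best_match(f1, f2, pos, radius=None):
    """Find where the feature f1[pos] reappears in f2 by correlation.
    radius=None searches every position (global matching); a finite
    radius mimics a RAFT-style local lookup window."""
    scores = f2 @ f1[pos]                  # correlation with every candidate
    if radius is not None:                 # local search: mask distant candidates
        idx = np.arange(len(f2))
        scores = np.where(np.abs(idx - pos) <= radius, scores, -np.inf)
    return int(np.argmax(scores))

# A feature field where every position shifts by 100 between "frames".
rng = np.random.default_rng(1)
f1 = rng.standard_normal((256, 64))
f2 = np.roll(f1, 100, axis=0)

global_hit = best_match(f1, f2, pos=50)           # finds position 150
local_hit = best_match(f1, f2, pos=50, radius=4)  # stuck within +/-4 of 50
```

Global search recovers the true 100-position jump; the radius-4 window can only return a nearby (wrong) candidate, which is exactly the failure mode described above.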

The authors' insight is simple: Static geometric priors are enough. If a model (like DINOv2) already understands the semantic and geometric layout of a scene, we should use that global understanding to "initialize" the motion before letting a local refiner polish the details.

Methodology: Global Matching Meets Temporal Attention

The MegaFlow pipeline consists of three distinct phases:

  1. Feature Extraction: A shared backbone uses a frozen DINOv2 encoder paired with a 24-layer Transformer. It fuses these semantic tokens with features from a lightweight CNN encoder to preserve high-resolution spatial detail.
  2. Global Matching: Instead of looking in a small neighborhood, MegaFlow computes an all-pairs correlation between adjacent frames. This allows the model to "jump" across the entire image to find the best match, effectively handling "teleporting" objects.
  3. Local Recurrent Refinement: The initial global flow is "polished" using a hybrid module. It uses ConvNeXt blocks to handle spatial details and a Temporal Attention branch to ensure that the motion of a pixel is consistent across multiple frames (T=4 or more).
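Phase 2 can be sketched as a soft-argmax over an all-pairs correlation volume (a GMFlow-style formulation; `global_match` and all shapes here are illustrative assumptions, not MegaFlow's actual code):

```python
import numpy as np

def global_match(feat1, feat2, temperature=0.1):
    """Initial flow via global matching: each pixel of frame 1 is compared
    against *all* pixels of frame 2, and the matched position is the
    softmax-weighted (soft-argmax) average of every candidate coordinate."""
    H, W, C = feat1.shape
    f1 = feat1.reshape(H * W, C)
    f2 = feat2.reshape(H * W, C)
    corr = f1 @ f2.T / np.sqrt(C)          # (H*W, H*W) all-pairs correlation
    corr = (corr - corr.max(axis=1, keepdims=True)) / temperature
    prob = np.exp(corr)
    prob /= prob.sum(axis=1, keepdims=True)
    # Candidate target coordinates for every source pixel, as (x, y).
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    flow = prob @ coords - coords          # (x, y) displacement per pixel
    return flow.reshape(H, W, 2)
```

Because every pixel sees every candidate, a displacement spanning the whole image costs no more than a one-pixel one; the quadratic (H·W)² correlation is also why the refinement stage that follows stays local and lightweight.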

Figure 1: The MegaFlow architecture, highlighting the fusion of frozen ViT features with flexible multi-frame refinement.

Experiments: Flattening the Error Curve

The most striking result is found in the "Large Displacement" analysis. In the s40+ category (motions larger than 40 pixels), standard SOTA models like SEA-RAFT see their error explode. MegaFlow "flattens" this curve, maintaining accuracy where others fail.
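For reference, the bucketed metric behind this analysis is plain end-point error restricted by ground-truth motion magnitude, following the Sintel s0-10 / s10-40 / s40+ convention (the helper below is our sketch, not the official evaluation script):

```python
import numpy as np

def bucketed_epe(flow_pred, flow_gt):
    """End-point error (EPE) split by ground-truth motion magnitude.
    Both arrays are (H, W, 2) displacement fields in pixels."""
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel error
    mag = np.linalg.norm(flow_gt, axis=-1)              # true motion magnitude
    buckets = {
        "s0-10": mag < 10,
        "s10-40": (mag >= 10) & (mag < 40),
        "s40+": mag >= 40,
    }
    return {name: float(epe[mask].mean()) if mask.any() else float("nan")
            for name, mask in buckets.items()}
```

A model whose s40+ entry stays close to its s0-10 entry has the "flattened" error curve described above.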

Table 1: MegaFlow significantly reduces EPE in the s40+ regime compared to previous methods.

Zero-Shot Point Tracking

Because MegaFlow computes continuous displacement fields, it can track any point across a video sequence. In zero-shot tests on TAP-Vid, MegaFlow outperformed dedicated trackers that were specifically trained for long-term point tracking. This suggests that a good optical flow model is a good tracker if it is robust enough.
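Mechanically, "tracking by displacement" just chains per-frame flow fields, sampling each field at the point's current sub-pixel position (a minimal sketch under that assumption; `track_point` is our name, and MegaFlow's actual tracking head may differ, e.g. in how it handles occlusion):

```python
import numpy as np

def track_point(flows, x0, y0):
    """Track one point by chaining flow fields. `flows` is a list of
    (H, W, 2) displacement fields mapping frame t -> t+1; bilinear
    interpolation keeps sub-pixel positions."""
    x, y = float(x0), float(y0)
    track = [(x, y)]
    for flow in flows:
        H, W, _ = flow.shape
        # Bilinear interpolation of the flow field at (x, y).
        xi = int(np.clip(x, 0, W - 2))
        yi = int(np.clip(y, 0, H - 2))
        ax, ay = x - xi, y - yi
        f = ((1 - ax) * (1 - ay) * flow[yi, xi]
             + ax * (1 - ay) * flow[yi, xi + 1]
             + (1 - ax) * ay * flow[yi + 1, xi]
             + ax * ay * flow[yi + 1, xi + 1])
        x, y = x + f[0], y + f[1]
        track.append((x, y))
    return track
```

Chaining accumulates drift, which is why the multi-frame temporal consistency from the refinement stage matters for long sequences.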

Figure 2: MegaFlow maintains stable, coherent tracks over 90+ frames on the DAVIS dataset.

Critical Analysis & Conclusion

MegaFlow effectively bridges the gap between static vision foundation models and dynamic motion estimation. Its core value lies in generalization: it works expertly on synthetic Sintel data and real-world KITTI/Spring data without being tuned for either.

Limitations:

  • Computational Cost: Processing 24 Transformer layers for every frame pair is significantly heavier than lightweight CNN-based models.
  • Backbone Sensitivity: It is somewhat sensitive to input aspect ratios due to the fixed-patch nature of ViTs.

Future Work: The success of MegaFlow points toward a future where "Motion Foundation Models" are trained once on massive video datasets, providing a universal backbone for everything from autonomous driving to video editing.

Summary of Key Contributions:

  • Foundation Priors: Successfully adapted DINOv2 for dense motion.
  • Global-Local Hybrid: Combined all-pairs matching with iterative refinement.
  • Unified Tasks: Showed that a single model can achieve state-of-the-art results on both Optical Flow and Point Tracking.
