[arXiv 2026] FOSSA & ZEDD: Modernizing Depth from Defocus for Zero-Shot Generalization
Abstract

This paper introduces FOSSA, a Transformer-based architecture for Zero-Shot Depth from Defocus (DfD) that estimates metric depth from a focus stack without scene-specific fine-tuning. Alongside the model, the authors release ZEDD, a large-scale real-world benchmark featuring 100 diverse scenes with high-quality LiDAR ground truth and 4K DSLR imagery.

TL;DR

Estimating "Metric Depth" (actual physical distance) has long been a challenge for AI. While monocular models struggle with scale ambiguity, Depth from Defocus (DfD) offers a physical solution by analyzing how blur changes across a focus stack. This paper introduces FOSSA, a Transformer-based model that achieves state-of-the-art zero-shot performance, and ZEDD, a high-fidelity real-world benchmark that finally provides the data density needed to advance this field.

The "Scale Ambiguity" Wall and the DfD Insight

In monocular depth estimation, a model might correctly identify that a chair is in front of a table, but it often struggles to tell if the chair is 2 meters or 2.5 meters away—this is scale ambiguity.

Depth from Defocus (DfD) circumvents this by using the laws of optics. When a camera's focal plane moves, objects at different depths go in and out of focus. By analyzing a "focus stack" (a series of images at different focus distances), we can mathematically derive the exact metric depth. However, previous DfD models were "brittle"—they worked on specific lab datasets but failed in the real world.
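The optical relationship DfD exploits can be sketched with the standard thin-lens circle-of-confusion (CoC) model: an object at depth $d$, imaged with focal length $f$, f-number $N$, and focus distance $d_f$, produces a blur disk whose diameter grows with its distance from the focal plane. The parameter values below are illustrative defaults, not from the paper.

```python
def coc_diameter(d, d_f, f=0.05, N=2.8):
    """Circle-of-confusion diameter (meters) for an object at depth d,
    given focus distance d_f, focal length f, and f-number N
    (thin-lens model; all distances in meters)."""
    A = f / N  # aperture diameter
    return A * f * abs(d - d_f) / (d * (d_f - f))
```

Because blur diameter is zero at the focal plane and increases monotonically away from it on each side, sweeping `d_f` across a focus stack lets a model invert this relationship and read off metric depth.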

Methodology: The FOSSA Architecture

The core innovation of FOSSA (FOcuS Stack Attention) is how it treats the focus stack. Instead of processing images in isolation, it treats the stack as a temporal-like sequence.

  1. Shared ViT Encoder: Each image in the stack passes through a Vision Transformer (ViT) to extract features.
  2. Stack Attention Layer: This is the "secret sauce." Between ViT blocks, the model performs attention across the images in the stack at each pixel location.
  3. Focus Distance Embedding: Much like positional encoding in NLP, FOSSA injects the physical focus distance ($d$) into the features, allowing the model to correlate visual blur with specific metric distances.
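The steps above can be sketched in a minimal numpy form. The embedding layout, projection matrices, and single-head attention are assumptions for illustration; the paper's actual layer shapes are not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stack_attention(feats, focus_dists, Wq, Wk, Wv):
    """Attention across the S slices of a focus stack, computed
    independently at each pixel location.
    feats: (S, P, C) features for S slices and P pixel positions;
    focus_dists: (S,) physical focus distance per slice.
    A sinusoidal-style embedding of focus distance is added before
    attention, analogous to positional encoding in NLP."""
    S, P, C = feats.shape
    freqs = np.arange(1, C + 1)
    embed = np.sin(np.outer(focus_dists, freqs))           # (S, C)
    x = feats + embed[:, None, :]                          # broadcast over pixels
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # (S, P, C) each
    # attend over the stack axis at every pixel
    scores = np.einsum('spc,tpc->pst', q, k) / np.sqrt(C)  # (P, S, S)
    attn = softmax(scores, axis=-1)
    return np.einsum('pst,tpc->spc', attn, v)              # (S, P, C)
```

The key design point is the axis of attention: rather than attending spatially within one image, each pixel compares its appearance across all focus settings, which is exactly where the depth signal lives.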

Figure: The FOSSA pipeline. Feature extraction with cross-stack attention is followed by global feature refinement.

Bridging the Gap: Synthetic Training & ZEDD Benchmark

Since there are no massive datasets of real-world focus stacks, the authors built a sophisticated Data Pipeline. They took existing RGBD datasets and "simulated" focus stacks by applying a Generalized Point Spread Function (PSF). By randomizing the lens aperture ($N$) and the blur shape, they forced the model to learn robust optical features rather than overfitting to one specific lens.
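A minimal version of this simulation can be written as a layered defocus render: quantize the RGBD depth map into iso-blur layers, blur each layer with a disk PSF sized by the thin-lens CoC, and composite. The focal length, f-number, and pixel-scale defaults below are illustrative assumptions; the paper's pipeline additionally randomizes the aperture and the PSF shape per sample.

```python
import numpy as np

def disk_kernel(r):
    """Normalized binary disk kernel of radius r pixels."""
    if r < 1:
        return np.ones((1, 1))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x * x + y * y <= r * r).astype(float)
    return k / k.sum()

def conv_same(a, k):
    """Naive 'same'-size 2-D convolution (adequate for a small sketch)."""
    pad = k.shape[0] // 2
    ap = np.pad(a, pad)
    return np.array([[(ap[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
                      for j in range(a.shape[1])] for i in range(a.shape[0])])

def simulate_defocus(img, depth, d_f, f=0.05, N=2.8, px_per_m=1e4):
    """Render one slice of a synthetic focus stack from an RGBD pair.
    img: (H, W) grayscale; depth: (H, W) in meters; d_f: focus distance.
    f, N, and px_per_m (CoC-to-pixel scale) are illustrative defaults."""
    A = f / N
    coc = A * f * np.abs(depth - d_f) / (depth * (d_f - f))  # CoC diameter, m
    radii = np.round(coc * px_per_m / 2).astype(int)          # blur radius, px
    out = np.zeros_like(img, dtype=float)
    weight = np.zeros_like(img, dtype=float)
    for r in np.unique(radii):            # composite iso-blur layers
        mask = (radii == r).astype(float)
        k = disk_kernel(int(r))
        out += conv_same(img * mask, k)
        weight += conv_same(mask, k)
    return out / np.maximum(weight, 1e-8)
```

Repeating this render at several focus distances, with aperture and kernel shape randomized each time, yields a training stack whose blur statistics do not commit the model to any single lens.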

To prove this works, they released ZEDD (ZEro-shot Depth from Defocus):

  • 100 real-world scenes (8.3x more than previous benchmarks).
  • 4K Resolution DSLR imagery.
  • High-end Lidar Ground Truth with sub-centimeter accuracy.

Experimental Performance

FOSSA's zero-shot capability is its most impressive feat. Without seeing a single image from the ZEDD dataset during training, it outperformed models specifically designed for those environments.

  • ZEDD Benchmark: FOSSA reduced the Absolute Relative error by 55.7% compared to DepthPro (a top-tier monocular baseline).
  • Traditional DfD: On the classic DDFF dataset, FOSSA (after fine-tuning) crushed previous SOTA models, reducing MSE by 40.4%.
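For reference, the two metrics quoted above have standard definitions; the relative-reduction claims compare these values between methods.

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute Relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def mse(pred, gt):
    """Mean Squared Error, in squared metric units."""
    return float(np.mean((pred - gt) ** 2))
```

A "55.7% reduction in AbsRel" therefore means FOSSA's `abs_rel` score is 0.443 times the baseline's on the same ground truth.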

Figure: Qualitative comparison. FOSSA (far right) recovers sharp, metrically accurate boundaries compared to competing methods.

Deep Insight & Conclusion

The success of FOSSA reveals a major trend in computer vision: foundation models benefit from physical constraints. By wrapping a powerful, pre-trained ViT (Depth Anything v2) in an architecture that understands optical blur, the authors created a tool that is both "smart" (semantic understanding) and "precise" (optical measurement).

Future Outlook: While FOSSA currently targets static scenes, the logical next step is Dynamic DfD—using these same focus cues to estimate depth from handheld video or moving subjects, potentially replacing expensive LiDAR sensors in consumer devices.
