[CVPR 2026] MinerU-Diffusion: Rethinking OCR as Inverse Rendering via Parallel Diffusion
Abstract

MinerU-Diffusion is a 2.5B-parameter unified document OCR framework that redefines OCR as an inverse rendering task. By replacing traditional autoregressive decoding with a block-wise diffusion mechanism, it achieves state-of-the-art structured document parsing with up to 3.2x faster inference.

TL;DR

MinerU-Diffusion breaks the "sequential bottleneck" of modern document OCR. By treating OCR as an inverse rendering problem solved via block-wise diffusion, it achieves accuracy competitive with state-of-the-art autoregressive systems while decoding up to 3.2x faster. It also effectively eliminates the cumulative error propagation and semantic hallucination common in standard autoregressive models.

Motivation: The Flaw in Sequential Reading

Current state-of-the-art OCR systems (like Qwen2-VL or MinerU2.5) operate like a human reading a book: left-to-right, word-by-word. Technically, this is Autoregressive (AR) decoding. While successful, it has two fatal flaws:

  1. Efficiency: For a 2,000-token document, the model must run 2,000 sequential forward passes.
  2. Linguistic Bias: AR models are trained to predict the "most likely next word." When the visual signal is blurry, the model often "guesses" based on grammar rather than sight, leading to hallucinations.

The authors argue that OCR isn't a language modeling task; it's inverse rendering. The text is already there on the page; we just need to reconstruct the underlying structure from the pixels.
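To see the arithmetic behind the efficiency claim, here is a toy Python sketch (my own illustration, not the paper's code; the block size and refinement-step count are assumed values) counting sequential forward passes under each decoding scheme:

```python
# Illustrative step-count comparison (assumed numbers, not from the paper).

def ar_decode_steps(num_tokens: int) -> int:
    """Autoregressive decoding: one sequential forward pass per token."""
    return num_tokens

def blockwise_diffusion_steps(num_tokens: int, block_size: int, refine_steps: int) -> int:
    """Block-wise diffusion: each block is denoised with a fixed number of
    parallel refinement passes, independent of how many tokens it holds."""
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    return num_blocks * refine_steps

L = 2000  # the 2,000-token document from the example above
print(ar_decode_steps(L))                                           # 2000 passes
print(blockwise_diffusion_steps(L, block_size=64, refine_steps=8))  # 256 passes
```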

[Figure: Concept of inverse rendering vs. AR decoding]

Methodology: Block-Wise Diffusion

MinerU-Diffusion employs Masked Diffusion Language Models (MDLM). Instead of generating $y_1, y_2, \ldots, y_n$ one token at a time, it starts from a sequence of [MASK] tokens and refines them all at once (or in large batches), conditioned on the visual input.
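A minimal sketch of such a masked-diffusion decoding loop (a toy illustration of the idea; the model interface, `MASK_ID`, and the linear unmasking schedule are my assumptions, not the paper's implementation):

```python
import torch

VOCAB_SIZE = 100      # toy vocabulary
MASK_ID = VOCAB_SIZE  # hypothetical [MASK] id, outside the text vocabulary

@torch.no_grad()
def mdlm_decode(model, visual_feats, seq_len: int, steps: int = 8) -> torch.Tensor:
    """Start fully masked, then repeatedly predict all positions in parallel
    and commit the most confident masked positions (confidence-ordered unmasking)."""
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, visual_feats)       # (seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == MASK_ID
        remaining = int(masked.sum())
        if remaining == 0:
            break
        k = max(1, remaining // (steps - step))    # linear unmasking schedule
        conf = conf.masked_fill(~masked, -1.0)     # never re-commit fixed tokens
        top = conf.topk(k).indices
        tokens[top] = pred[top]
    return tokens

# Toy usage with a random "model" standing in for the vision-conditioned decoder:
toy_model = lambda toks, feats: torch.randn(toks.shape[0], VOCAB_SIZE)
print(mdlm_decode(toy_model, visual_feats=None, seq_len=16))
```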

1. Unified Diffusion Architecture

A key innovation is the Block-Attention Mask. While full-sequence diffusion is computationally expensive ($O(L^2)$), MinerU-Diffusion partitions the sequence into contiguous blocks.

  • Within a block: Tokens attend bidirectionally (Parallel Diffusion).
  • Across blocks: Tokens attend causally to previous blocks (Autoregressive structure).

This "Hybrid" approach maintains structural anchors (preventing long-range drift) while allowing massive parallelism.

[Figure: MinerU-Diffusion architecture]

2. Two-Stage Curriculum Learning

Diffusion language models are notoriously hard to train because they must learn any-order dependencies among output tokens. The authors use a two-stage strategy:

  • Stage I (Foundational): Training on massive, diverse data (7.5M samples) to learn general visual-semantic alignment.
  • Stage II (Refinement): Using "Hard Case Mining." The model identifies samples where multiple inference passes yield inconsistent results (high uncertainty) and refines its boundary precision on those specific cases (see the sketch after this list).
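A sketch of what such an uncertainty-driven mining loop could look like (an assumed protocol; `model.decode`, its seed argument, and the agreement threshold are hypothetical placeholders, not the paper's API):

```python
import random

class ToyModel:
    """Stand-in decoder that is deliberately inconsistent on some inputs."""
    def decode(self, sample: str, seed: int) -> str:
        rng = random.Random(seed + len(sample))
        return sample if rng.random() > 0.3 else sample.upper()

def mine_hard_cases(model, samples, passes: int = 4, agreement: float = 0.9):
    """Flag samples whose repeated stochastic decodes disagree, i.e. samples
    the model is uncertain about, for targeted Stage-II refinement."""
    hard = []
    pairs = passes * (passes - 1) // 2
    for sample in samples:
        outputs = [model.decode(sample, seed=s) for s in range(passes)]
        # fraction of decode pairs that agree exactly
        matches = sum(a == b for i, a in enumerate(outputs) for b in outputs[i + 1:])
        if matches / pairs < agreement:
            hard.append(sample)
    return hard

print(mine_hard_cases(ToyModel(), ["invoice page", "blurry receipt"]))
```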

Experimental Battleground

MinerU-Diffusion was tested against the heavyweights: Qwen2.5-VL, GPT-4o, and the original MinerU2.5.

Speed vs. Accuracy

The "Confidence Threshold" acts as a throttle. By lowering the threshold to 0.6, the model achieves a 3.2x speedup (164.8 Tokens Per Second) while maintaining over 90% of its peak accuracy. At a toggle of 0.95, it matches AR accuracy with a 2.1x speedup.

[Figure: Speed performance comparison]

Robustness: The Semantic Shuffle Test

To prove that MinerU-Diffusion relies on vision rather than language guessing, the authors created "Semantic Shuffle"—a benchmark where words are randomly shuffled on the page.

  • AR Models: Accuracy plummeted as the text became nonsensical.
  • MinerU-Diffusion: Performance remained flat, proving it actually "reads" the pixels regardless of whether the sentence makes sense.
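A toy version of the shuffle construction (my illustration of the idea; the actual benchmark renders the shuffled words back onto page images):

```python
import random

def semantic_shuffle(words: list[str], seed: int = 0) -> list[str]:
    """Randomly reorder the words: the linguistic prior is destroyed, but
    every glyph a vision-grounded OCR model must read is still present."""
    rng = random.Random(seed)
    shuffled = list(words)  # copy so the original page is untouched
    rng.shuffle(shuffled)
    return shuffled

print(semantic_shuffle("the quick brown fox jumps over the lazy dog".split()))
# e.g. ['lazy', 'the', 'over', ...] -- nonsense to a language prior,
# unchanged difficulty for a model that truly reads pixels.
```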

[Figure: Semantic Shuffle results]

Critical Insight & Conclusion

MinerU-Diffusion demonstrates that the future of document AI may not be purely autoregressive. For tasks where the output sequence is "fixed" by an image (like OCR, table parsing, or chart-to-code), diffusion offers a posterior approximation better matched to the task: the prediction is conditioned jointly on the whole image rather than factored into a left-to-right chain of guesses.

Takeaway: By treating the page as a spatially coupled random field rather than a causal string, we gain both the speed of parallel processing and the reliability of visual-first grounding.

Limitations

Despite the breakthrough, the model still shows a performance gap in Layout Analysis compared to the most massive 70B+ parameter models. While it excels at reading what is in a box, finding the boxes on complex, overlapping layouts remains the next frontier for diffusion-based document models.
