MinerU-Diffusion is a 2.5B-parameter unified document OCR framework that reframes OCR as an inverse rendering task. By replacing traditional autoregressive decoding with a block-wise diffusion mechanism, it achieves state-of-the-art structured document parsing with up to 3.2x faster inference.
TL;DR
MinerU-Diffusion breaks the "sequential bottleneck" of modern document OCR. By treating OCR as an inverse rendering problem solved via block-wise diffusion, it achieves accuracy competitive with state-of-the-art systems at a staggering 3.2x increase in decoding speed. It effectively eliminates the cumulative error propagation and semantic hallucination common in standard autoregressive models.
Motivation: The Flaw in Sequential Reading
Current state-of-the-art OCR systems (like Qwen2-VL or MinerU2.5) operate like a human reading a book: left-to-right, word-by-word. Technically, this is Autoregressive (AR) decoding. While successful, it has two fatal flaws:
- Efficiency: For a 2,000-token document, the model must run 2,000 sequential forward passes.
- Linguistic Bias: AR models are trained to predict the "most likely next word." When the visual signal is blurry, the model often "guesses" based on grammar rather than sight, leading to hallucinations.
The authors argue that OCR isn't a language modeling task; it's inverse rendering. The text is already there on the page; we just need to reconstruct the underlying structure from the pixels.

Methodology: Block-Wise Diffusion
MinerU-Diffusion employs a Masked Diffusion Language Model (MDLM). Instead of generating $y_1, y_2, \dots, y_n$ one token at a time, it starts with a sequence of [MASK] tokens and refines them all at once (or in large batches) based on visual conditioning.
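To make the refinement loop concrete, here is a toy sketch of masked-diffusion decoding. The `predict_fn` callable, the `MASK` sentinel, and the commit schedule are all hypothetical stand-ins for the visually conditioned denoiser; the point is only that every position starts masked and the sequence is resolved in a handful of passes rather than one pass per token.

```python
import numpy as np

MASK = -1  # hypothetical id for the [MASK] token

def mdlm_decode(predict_fn, seq_len, num_steps=4):
    """Toy masked-diffusion decoding loop (illustrative, not the paper's code).

    predict_fn(tokens) -> (token_ids, confidences) for every position,
    standing in for the visually conditioned denoiser. Each step commits
    the most confident still-masked positions, so the whole sequence is
    resolved in num_steps passes instead of seq_len sequential ones.
    """
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(num_steps):
        preds, conf = predict_fn(tokens)
        masked = tokens == MASK
        if not masked.any():
            break
        # commit roughly 1/(remaining steps) of the masked positions per pass
        k = max(1, int(np.ceil(masked.sum() / (num_steps - step))))
        order = np.argsort(np.where(masked, conf, -np.inf))[::-1]
        commit = order[:k]
        tokens[commit] = preds[commit]
    return tokens
```

With `num_steps=4` and a 2,000-token page, this schedule replaces 2,000 sequential forward passes with 4, which is where the parallelism dividend comes from.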
1. Unified Diffusion Architecture
A key innovation is the Block-Attention Mask. While full-sequence diffusion is computationally expensive ($O(L^2)$), MinerU-Diffusion partitions the sequence into contiguous blocks.
- Within a block: Tokens attend bidirectionally (Parallel Diffusion).
- Across blocks: Tokens attend causally to previous blocks (Autoregressive structure).
This "Hybrid" approach maintains structural anchors (preventing long-range drift) while allowing massive parallelism.

2. Two-Stage Curriculum Learning
Diffusion models are notoriously hard to train due to any-order dependencies. The authors use a two-stage strategy:
- Stage I (Foundational): Training on massive, diverse data (7.5M samples) to learn general visual-semantic alignment.
- Stage II (Refinement): Using "Hard Case Mining." The model identifies samples where multiple inference passes yield inconsistent results (high uncertainty) and refines its boundary precision on those specific cases.
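The Stage II selection step can be sketched as follows. Everything here is an assumption about the mining procedure: `infer_fn` is a hypothetical stochastic decoding pass, and disagreement across repeated passes is used as the uncertainty score that flags hard cases.

```python
def mine_hard_cases(samples, infer_fn, passes=3, max_fraction=0.1):
    """Select samples whose repeated inference passes disagree most.

    infer_fn(sample, seed) -> decoded string. The disagreement rate
    (fraction of passes that differ from the majority output) serves as
    an uncertainty score; the top fraction is kept for refinement.
    """
    scored = []
    for s in samples:
        outs = [infer_fn(s, seed) for seed in range(passes)]
        majority_count = outs.count(max(set(outs), key=outs.count))
        disagreement = 1.0 - majority_count / passes
        scored.append((disagreement, s))
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, int(len(samples) * max_fraction))
    return [s for _, s in scored[:keep]]
```

Samples where every pass agrees score 0 and are skipped, so Stage II compute concentrates on exactly the pages the model is unsure about.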
Experimental Battleground
MinerU-Diffusion was tested against the heavyweights: Qwen2.5-VL, GPT-4o, and the original MinerU2.5.
Speed vs. Accuracy
The "Confidence Threshold" acts as a throttle. By lowering the threshold to 0.6, the model achieves a 3.2x speedup (164.8 Tokens Per Second) while maintaining over 90% of its peak accuracy. At a toggle of 0.95, it matches AR accuracy with a 2.1x speedup.

Robustness: The Semantic Shuffle Test
To prove that MinerU-Diffusion relies on vision rather than language guessing, the authors created "Semantic Shuffle"—a benchmark where words are randomly shuffled on the page.
- AR Models: Accuracy plummeted as the text became nonsensical.
- MinerU-Diffusion: Performance remained flat, proving it actually "reads" the pixels regardless of whether the sentence makes sense.
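Constructing a Semantic Shuffle sample is conceptually simple; a minimal sketch (the function name and seeding are my own, and a real benchmark would re-render the shuffled words onto the page image):

```python
import random

def semantic_shuffle(text, seed=0):
    """Scramble word order while keeping every glyph intact.

    The resulting line is visually normal but linguistically nonsensical,
    which penalizes models that lean on language priors instead of pixels.
    """
    rng = random.Random(seed)  # seeded so the benchmark is reproducible
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)
```

A vision-grounded reader should transcribe the shuffled line verbatim; a language-prior-driven reader will tend to "repair" it back into fluent text, which is exactly the failure mode this test exposes.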

Critical Insight & Conclusion
MinerU-Diffusion demonstrates that the future of document AI may not be purely autoregressive. For tasks where the sequence is "fixed" by an image (like OCR, table parsing, or chart-to-code), Diffusion offers a more mathematically sound posterior approximation.
Takeaway: By treating the page as a spatially coupled random field rather than a causal string, we gain both the speed of parallel processing and the reliability of visual-first grounding.
Limitations
Despite the breakthrough, the model still shows a performance gap in Layout Analysis compared to the largest 70B+-parameter models. While it excels at reading what is inside a box, finding the boxes on complex, overlapping layouts remains the next frontier for diffusion-based document models.
