This paper introduces NAP (Non-Autoregressive Parallel DLMs), a data-centric framework designed to enable genuine parallel token generation in Diffusion Language Models (DLMs). By restructuring training data into multiple independent reasoning trajectories and employing a parallel-forced decoding strategy, NAP breaks the "autoregressive collapse" common in existing DLMs, achieving significant performance gains (e.g., +14.4% accuracy on GSM8K) in high-parallelism regimes.
TL;DR
Diffusion Language Models (DLMs) are supposed to be the "parallel" alternative to sequential Autoregressive (AR) models. In practice, the paper finds that most DLMs "cheat" by mimicking AR behavior because they are trained on sequential data. It introduces NAP (Non-Autoregressive Parallel DLMs), which co-designs data and decoding to force models to think in parallel, improving accuracy by up to 14.4 points on GSM8K in high-speed, parallel decoding settings.
The Illusion of Parallelism: Why DLMs Struggle
The mathematical promise of DLMs is "random-order" decoding: the ability to generate any token at any time. Yet, in practice, models like LLaDA and Dream-7B exhibit high ARness, meaning their decoding order stays close to left-to-right. When forced to generate tokens in a truly parallel (random) order, their reasoning capabilities fall off a cliff.
The authors identify the root cause: Data Mismatch. Standard pre-training data and Chain-of-Thought (CoT) rationales are strictly sequential. The model learns that each token must wait for the token before it. Even with a bidirectional architecture, the model internalizes a "privileged order," creating a sequential critical path that destroys the efficiency gains of distributed hardware.
Figure 1: Comparison of decoding order. Standard DLMs (left/middle) naturally drift toward a diagonal/AR-like pattern, while NAP (right) shows horizontal bands indicative of true parallel generation.
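To make "ARness" concrete, here is a minimal sketch of one possible proxy. It is an assumption for illustration, not the paper's definition: the function name `left_to_right_score` and the scoring rule (the fraction of commits that fill the leftmost still-masked position) are hypothetical.

```python
from typing import Sequence


def left_to_right_score(commit_order: Sequence[int]) -> float:
    """Hypothetical ARness proxy (not the paper's metric).

    commit_order[k] is the sequence position committed at the k-th commit,
    and is assumed to be a permutation of the positions (no repeats).
    Returns the fraction of commits that hit the leftmost masked position.
    """
    remaining = set(commit_order)
    if not remaining:
        return 0.0
    hits = 0
    for pos in commit_order:
        # A purely left-to-right decoder always fills the smallest open slot.
        if pos == min(remaining):
            hits += 1
        remaining.remove(pos)
    return hits / len(commit_order)


if __name__ == "__main__":
    print(left_to_right_score([0, 1, 2, 3]))  # strictly left-to-right order
    print(left_to_right_score([3, 2, 1, 0]))  # reversed order
```

Under this proxy, a diagonal pattern like the one described for standard DLMs would score near 1.0, while an order that jumps across the canvas would score lower.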
Methodology: The NAP Framework
To fix this, NAP (Non-Autoregressive Parallel) intervenes at two levels:
1. Parallel Data Curation
Instead of a single long reasoning chain, NAP uses a teacher model to generate multiple independent reasoning trajectories for the same prompt. This teaches the model that there is no "correct" sequence to follow. These paths are then aggregated into a final summary block, forcing the model to evaluate multiple streams of evidence simultaneously.
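The curation step above can be sketched roughly as follows. This is only an illustration of the idea of "several independent paths, then one summary block": the `ParallelExample` class, the `format_example` function, and the `<path i>` / `<summary>` delimiters are all hypothetical, since the exact training format is not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelExample:
    prompt: str
    paths: List[str]   # independent reasoning trajectories from a teacher model
    summary: str       # final block that aggregates the paths


def format_example(ex: ParallelExample) -> str:
    """Lay out one training target: prompt, then each path, then the summary.

    The tag names are placeholders chosen for this sketch.
    """
    blocks = [
        f"<path {i}>\n{path.strip()}\n</path {i}>"
        for i, path in enumerate(ex.paths, start=1)
    ]
    summary_block = f"<summary>\n{ex.summary.strip()}\n</summary>"
    return "\n".join([ex.prompt.strip(), *blocks, summary_block])


if __name__ == "__main__":
    example = ParallelExample(
        prompt="Q: What is 2 + 3?",
        paths=["2 + 3 = 5", "3 + 2 = 5"],
        summary="Both paths agree. Answer: 5",
    )
    print(format_example(example))
```

Because no path refers to another, the order of the path blocks carries no information, which is the property the curation step is meant to teach.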
2. Parallel-Forced Decoding
At inference time, NAP uses a structured canvas. It divides the output into independent blocks.
- Macro-Parallel: The model is forced to distribute its token budget across all blocks at every step. It cannot finish block 1 before starting block 3.
- Micro-Confidence: Within each block, it commits tokens based on confidence, not position.
Figure 2: The NAP system concurrently generates independent paths, then synthesizes them in a final summary block.
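The two decoding rules can be sketched as a single selection step. This is a simplified illustration under stated assumptions, not the authors' implementation: `select_commits` is a hypothetical name, confidences are plain floats supplied by the caller, the budget is split evenly across unfinished blocks, and every unfinished block gets at least one token even if that exceeds the budget.

```python
from typing import Dict, List


def select_commits(
    confidence: List[List[float]],
    masked: List[List[bool]],
    budget: int,
) -> Dict[int, List[int]]:
    """Choose which positions to commit in one decoding step.

    confidence[b][i]: model confidence for position i of block b.
    masked[b][i]:     True if that position is still undecided.
    Returns {block index: positions to commit}, most confident first.
    """
    open_blocks = [b for b, block in enumerate(masked) if any(block)]
    if not open_blocks or budget <= 0:
        return {}

    # Macro-parallel: spread the budget over every unfinished block.
    base, extra = divmod(budget, len(open_blocks))
    picks: Dict[int, List[int]] = {}
    for rank, b in enumerate(open_blocks):
        # Assumption: each open block advances by at least one token per step.
        quota = max(1, base + (1 if rank < extra else 0))
        # Micro-confidence: rank masked positions by confidence, not by index.
        candidates = [i for i, is_masked in enumerate(masked[b]) if is_masked]
        candidates.sort(key=lambda i: confidence[b][i], reverse=True)
        picks[b] = candidates[:quota]
    return picks


if __name__ == "__main__":
    conf = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
    mask = [[True, True, True], [True, True, True]]
    print(select_commits(conf, mask, budget=2))
```

In a real decoder, the confidences would come from the model's per-position token distributions and the masks would be updated after each step; both are left out here to keep the selection logic visible.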
Experimental Results: Speed Without Sacrifice
The real test for NAP is the "low-step" regime, where we force the model to generate many tokens per step (high parallelism).
| Metric | Decoding Steps | Long-CoT (Baseline) | NAP (Ours) | Improvement |
| :--- | :--- | :--- | :--- | :--- |
| GSM8K Acc | 256 (Fast) | 46.5% | 60.9% | +14.4% |
| MATH-500 Acc | 256 (Fast) | 16.2% | 23.8% | +7.6% |
In high-parallelism scenarios (256 steps), the standard Long-CoT model's reasoning collapses because it hasn't seen enough sequential context to "stabilize" its early tokens. NAP, however, thrives by utilizing the internal ensemble effect of multiple parallel paths.
Table 1: Accuracy comparison across benchmarks highlights the widening gap as parallelism increases.
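The improvement column reports absolute differences in accuracy (percentage points), not relative gains. The short check below recomputes both from the four accuracy figures in the table; the helper names `absolute_gain` and `relative_gain` are introduced here for illustration only.

```python
# Accuracies (in %) at 256 decoding steps: (Long-CoT baseline, NAP).
RESULTS = {
    "GSM8K": (46.5, 60.9),
    "MATH-500": (16.2, 23.8),
}


def absolute_gain(baseline: float, nap: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(nap - baseline, 1)


def relative_gain(baseline: float, nap: float) -> float:
    """Gain as a percentage of the baseline accuracy, rounded to one decimal."""
    return round(100 * (nap - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, (baseline, nap) in RESULTS.items():
        print(name, absolute_gain(baseline, nap), relative_gain(baseline, nap))
```

Read this way, the reported gains correspond to roughly 31% (GSM8K) and 47% (MATH-500) relative improvement over the baseline.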
Deep Insights: The "Flexibility Trap"
A critical takeaway is the paper's analysis of current "Fast DLMs" (like Fast-dLLM). The paper argues that these methods gain speed not by being parallel, but by accelerating the AR critical path. They identify the tokens the model is already likely to predict in a left-to-right fashion and optimize that specific sequential execution.
NAP suggests a different future: Intrinsic Parallelism. By changing the supervision, we can reduce the global dependence of tokens on their immediate predecessors.
Conclusion & Limitations
NAP is a proof-of-concept showing that the bottleneck in DLMs is often the data, not the architecture.
- The Good: It achieves superior reasoning performance under high parallelism and decouples capability from order.
- The Caveat: This study was conducted on a post-training scale (100K samples). To truly kill the "AR-ghost" in DLMs, we likely need to rethink the pre-training datasets (like FineWeb) to include more non-sequential, structured data formats.
For researchers building the next generation of efficient LLMs, the message is clear: If you want parallel hardware efficiency, you must provide parallel supervision.
