X-Cache is a training-free acceleration framework designed for few-step autoregressive (AR) video diffusion world models. By exploiting temporal redundancy across consecutive generation chunks rather than denoising steps, it achieves a 71% block skip rate and a 2.6x wall-clock speedup on the X-World autonomous driving simulator.
TL;DR
X-Cache is a training-free acceleration method for few-step autoregressive video diffusion models used in autonomous driving simulators. While traditional caching methods try to skip redundant denoising steps, X-Cache looks "sideways" and reuses computation across consecutive generation chunks. It achieves a 2.6x speedup on the production-grade X-World model with virtually no loss in visual quality.
The Bottleneck: Why "Cross-Step" Caching Fails
In the quest for real-time autonomous driving simulation, researchers have moved toward few-step distillation (e.g., using only 4 denoising steps instead of 50). Existing acceleration techniques like DeepCache or FlowCache rely on the similarity between adjacent denoising steps.
However, in a 4-step regime, every step is packed with crucial information; skipping one leads to immediate "hallucinations" or structural collapses. Furthermore, interactive simulators receive new action inputs (steering, braking) every chunk. These discrete control signals break the smoothness assumptions required by prior extrapolation-based methods.
The Insight: Physical Continuity as a Redundancy Axis
The authors observe that while denoising steps in a few-step model are not redundant, the physical world is. Between two consecutive 0.5-second video chunks, the environment changes smoothly. This creates Cross-Chunk Redundancy.
X-Cache caches the "residual" (the output of a DiT block), keyed by block index and denoising step index, and reuses it at the same position in the next temporal chunk. When the scene evolves predictably, the model can then bypass the expensive Transformer block computation.
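A minimal sketch of this reuse pattern, assuming the cached residual is the block's output minus its input (the text above describes it only as the block's output) and using plain Python lists in place of tensors. The names `ChunkResidualCache`, `run_block`, and `can_reuse` are hypothetical and are not taken from X-Cache itself.

```python
class ChunkResidualCache:
    """Stores one residual per (block index, denoising step index)."""

    def __init__(self):
        self._store = {}

    def get(self, block_idx, step_idx):
        return self._store.get((block_idx, step_idx))

    def put(self, block_idx, step_idx, residual):
        self._store[(block_idx, step_idx)] = residual


def run_block(block_fn, x, block_idx, step_idx, cache, can_reuse):
    """Run one block, or reuse the residual cached by the previous chunk.

    Returns a pair (output, skipped).
    """
    cached = cache.get(block_idx, step_idx)
    if cached is not None and can_reuse(x, block_idx, step_idx):
        # Skip the block: add the previous chunk's residual to the new input.
        return [xi + ri for xi, ri in zip(x, cached)], True
    out = block_fn(x)
    # Assumption: the residual is defined as output minus input.
    cache.put(block_idx, step_idx, [oi - xi for oi, xi in zip(out, x)])
    return out, False


def toy_block(x):
    """Stand-in for a Transformer block: doubles its input."""
    return [2.0 * v for v in x]


def always_reuse(x, block_idx, step_idx):
    """Placeholder gate that always allows reuse."""
    return True


cache = ChunkResidualCache()
out_t, skipped_t = run_block(toy_block, [1.0, 2.0], 0, 0, cache, always_reuse)
out_t1, skipped_t1 = run_block(toy_block, [1.5, 2.5], 0, 0, cache, always_reuse)
```

The first call has nothing cached, so it computes the block and stores the residual; the second call, standing in for the next chunk, reuses it. In a real system the `can_reuse` decision would come from the gating logic rather than a constant.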
Methodology: The X-Cache Architecture
1. Structure and Action-Aware Fingerprinting
To decide whether to skip a block, X-Cache compares the current input to the cached version using a compact "fingerprint."
- 3D Grid Sampling: Instead of flat 1D sampling, it samples across the (Frames, Height, Width) grid to ensure geometrically balanced coverage.
- Action Channel: Crucially, it attaches the ego-vehicle's action vector to the fingerprint. If the driver suddenly steers, the fingerprint changes drastically, forcing the model to recompute rather than reuse a "smooth" straight-driving cache.
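The two ingredients above can be sketched as follows. This is an illustration only: it assumes tokens are stored in row-major (frame, height, width) order, uses plain lists instead of tensors, and the helper names `axis_samples` and `build_fingerprint` are hypothetical. The source does not state how many points are sampled or how the action vector is scaled.

```python
def axis_samples(size, n):
    """Return up to n evenly spaced indices along an axis of length size."""
    if n >= size:
        return list(range(size))
    if n == 1:
        return [size // 2]
    return [round(i * (size - 1) / (n - 1)) for i in range(n)]


def build_fingerprint(tokens, grid_shape, samples_per_axis, action):
    """Concatenate features of grid-sampled tokens with the action vector.

    tokens: list of per-token feature lists, flattened in (F, H, W) order.
    grid_shape: (frames, height, width) of the token grid.
    samples_per_axis: number of samples taken along each of the three axes.
    action: ego-vehicle action vector (e.g. steering, braking).
    """
    frames, height, width = grid_shape
    n_f, n_h, n_w = samples_per_axis
    fingerprint = []
    for f in axis_samples(frames, n_f):
        for h in axis_samples(height, n_h):
            for w in axis_samples(width, n_w):
                token_id = (f * height + h) * width + w
                fingerprint.extend(tokens[token_id])
    # Action channel: a change in control input changes the fingerprint
    # even when the sampled visual features are identical.
    fingerprint.extend(action)
    return fingerprint


# Toy usage: a 4x4x4 token grid with one feature per token.
tokens = [[float(i)] for i in range(64)]
fp_straight = build_fingerprint(tokens, (4, 4, 4), (2, 2, 2), [0.0, 0.0])
fp_steer = build_fingerprint(tokens, (4, 4, 4), (2, 2, 2), [0.8, 0.0])
```

Sampling along each axis separately spreads the samples over the corners and interior of the spatiotemporal volume, whereas a fixed stride over the flattened token list can cluster samples in a few rows or frames.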

2. Dual-Metric Gating & Adaptive Thresholds
The gate uses two tests:
- Cosine Similarity: Measures the global direction of the latent features.
- Maximum Token Deviation: Detects local outliers (e.g., a pedestrian suddenly appearing).
Instead of a fixed threshold, the gate uses an Exponential Moving Average (EMA) to learn the "normal" similarity for each block, allowing the system to be aggressive in static scenes and conservative in dynamic ones.
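One plausible reading of this gate is sketched below. The source does not give the exact update rule or constants, so everything numeric here is an assumption: the threshold is taken as the running EMA of similarity minus a margin, the deviation limit is fixed, and the class name `AdaptiveGate` and its parameters are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_token_deviation(current, cached):
    """Largest L2 distance between corresponding tokens."""
    return max(math.dist(c, r) for c, r in zip(current, cached))


class AdaptiveGate:
    """Per-block gate: skip only if both the global and the local test pass."""

    def __init__(self, init_similarity=0.99, margin=0.01,
                 max_deviation=0.5, momentum=0.9):
        self.ema_similarity = init_similarity  # running "normal" similarity
        self.margin = margin                   # assumed slack below the EMA
        self.max_deviation = max_deviation     # assumed fixed local limit
        self.momentum = momentum

    def should_skip(self, current, cached):
        flat_cur = [v for token in current for v in token]
        flat_ref = [v for token in cached for v in token]
        similarity = cosine_similarity(flat_cur, flat_ref)
        deviation = max_token_deviation(current, cached)
        skip = (similarity >= self.ema_similarity - self.margin
                and deviation <= self.max_deviation)
        # Update the running estimate after every decision.
        self.ema_similarity = (self.momentum * self.ema_similarity
                               + (1.0 - self.momentum) * similarity)
        return skip
```

The second test matters because cosine similarity is a global average: a single token that changes sharply among many unchanged ones barely moves the cosine score, yet it is exactly the kind of change (a new object entering the scene) that should force recomputation.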
3. The Safety Valve: KV Update Protection
In autoregressive models, errors can compound indefinitely. X-Cache identifies the specific forward pass that updates the persistent Key-Value (KV) cache. It unconditionally forces full computation during this pass to ensure "clean" data is written to memory, effectively resetting any accumulated approximation errors.
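The rule itself is simple to express. The sketch below assumes each forward pass carries a flag marking whether it writes to the persistent KV cache; the `ForwardPass` type, the `may_skip_block` helper, and the five-pass schedule are hypothetical, since the source does not say which pass performs the update.

```python
from dataclasses import dataclass


@dataclass
class ForwardPass:
    step_idx: int
    writes_kv_cache: bool  # True for the pass whose keys/values are persisted


def may_skip_block(fwd, gate_allows_skip):
    """Apply KV update protection on top of the gate's decision."""
    if fwd.writes_kv_cache:
        # Never approximate the pass that feeds long-term memory.
        return False
    return gate_allows_skip


# Hypothetical schedule: four denoising passes, then one KV-writing pass.
schedule = [ForwardPass(s, False) for s in range(4)] + [ForwardPass(4, True)]
decisions = [may_skip_block(p, gate_allows_skip=True) for p in schedule]
```

Even with a gate that would allow every skip, the KV-writing pass is always computed in full, so approximation error from reused residuals is not written into the context that later chunks attend to.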
Experiments and Results
The researchers validated X-Cache on X-World, a 7-camera simulator.
- Efficiency: The block skip rate reached 71%, cutting wall-clock time from ~3.7s to ~1.4s per chunk (roughly a 2.6x speedup).
- Fidelity: In benchmarks across Urban, Highway, and U-Turn scenarios, PSNR remained above 51 dB. Visually, the difference between the full-compute baseline and X-Cache is nearly invisible even at 20x amplification.
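For readers less familiar with the metric, PSNR is derived from the mean squared error between the baseline and the accelerated output. The sketch below uses the standard definition and assumes an 8-bit pixel range (peak value 255), which the source does not state; the function names are illustrative.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error that corresponds to a given PSNR."""
    return max_value / (10.0 ** (psnr_db / 20.0))


# Every pixel off by one gray level gives an MSE of 1.
one_level_off = psnr([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
```

Under the 8-bit assumption, a PSNR of 51 dB corresponds to an RMS error of roughly 0.7 gray levels, which is consistent with differences that are hard to see without amplification.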

Ablation Insights
The most critical finding in the ablation study was the role of KV-Update Protection. Without it, PSNR plummeted to 21.46 dB (unusable noise), indicating that while caching residuals is safe for the "current" view, the long-term memory must remain pristine.
Summary and Future Outlook
X-Cache shifts the focus of diffusion acceleration from the denoising axis to the temporal axis. For the autonomous driving industry, this moves closed-loop testing with video world models closer to real-time operation.
Limitations: The current thresholds were tuned on internal datasets. Although they are adaptive, extreme edge cases (like transitioning from a bright tunnel to dark night) might require more conservative "warmup" periods to allow the EMA thresholds to reset.
Takeaway: If your model is autoregressive and few-step, stop looking at step-redundancy and start looking at chunk-redundancy.
