WorldCache: Breaking the Speed-Consistency Trade-off in Video World Models
Abstract

WorldCache is a training-free caching framework designed to accelerate Diffusion Transformer (DiT) based video world models. By replacing the traditional static "Zero-Order Hold" feature reuse with perception-constrained dynamical approximation, it achieves a 2.3× inference speedup on Cosmos-Predict2.5 while preserving 99.4% of baseline quality.

TL;DR

Generating high-fidelity, physically consistent videos with Diffusion Transformers (DiTs) is notoriously slow. WorldCache is a new training-free framework that accelerates these "World Models" by up to 2.3× while retaining 99.4% of baseline quality. It moves beyond simple "skip-and-copy" caching by using motion-aware logic and saliency-weighted probes, so that the speedup is not bought at the cost of physical plausibility.

The "Zero-Order Hold" Trap

In the world of video diffusion, consecutive denoising steps are highly redundant. Prior works like DiCache and FasterCache exploit this by "caching" deep transformer layers and skipping them if the change (drift) between steps is small.
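
The skip-and-copy pattern described above can be sketched roughly as follows. This is a generic illustration, not the actual DiCache or FasterCache code: `relative_drift`, `run_with_cache`, `deep_layers`, and the threshold value are hypothetical names chosen for the example, and features are plain Python lists to keep it dependency-free.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def relative_drift(current: Vector, previous: Vector) -> float:
    """Summed absolute change between two probe vectors, normalised by magnitude."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) + 1e-8
    return num / den


def run_with_cache(
    probes: List[Vector],
    deep_layers: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[List[Vector], int]:
    """Run all denoising steps, reusing the cached deep output when drift is small."""
    cached_probe: Optional[Vector] = None
    cached_output: Optional[Vector] = None
    outputs: List[Vector] = []
    skipped = 0
    for probe in probes:
        if cached_probe is not None and relative_drift(probe, cached_probe) < threshold:
            # Zero-order hold: the stale output is copied unchanged.
            outputs.append(cached_output)
            skipped += 1
        else:
            # Drift too large (or nothing cached yet): run the expensive layers.
            cached_output = deep_layers(probe)
            cached_probe = probe
            outputs.append(cached_output)
    return outputs, skipped
```

In this sketch, drift is always measured against the last step that was actually recomputed, so a long run of skips cannot silently accumulate change without eventually triggering a refresh.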

However, these methods suffer from a Zero-Order Hold (ZOH) problem: they treat cached features as static snapshots. In a video of a car driving past a forest, the large area of "static" trees masks the "dynamic" movement of the car. If the cache decision is based on the average global drift, the skip is triggered even though the car is moving, and the car becomes a "ghost": a blurred, incoherent mess. For World Models used in robotics or autonomous driving, these artifacts aren't just ugly; they are functional failures.
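
A toy numerical illustration of the masking effect, assuming NumPy is available. The 16×16 grid, the 2×2 "car" patch, and the threshold of 0.05 are made-up values chosen only to show the arithmetic; they do not come from the paper.

```python
import numpy as np

# Toy 16x16 feature map: a static "forest" everywhere, plus a 2x2 "car" patch.
prev = np.ones((16, 16))
curr = prev.copy()
curr[4:6, 4:6] += 1.0  # only the car's 4 cells change, and they change a lot

abs_change = np.abs(curr - prev)
global_drift = abs_change.mean()  # averaged over all 256 cells: 4 / 256
local_peak = abs_change.max()     # change inside the moving patch: 1.0

threshold = 0.05  # hypothetical skip threshold
skip = global_drift < threshold   # True: the step is skipped despite the motion
```

The global average sits well below the threshold even though the moving patch has changed by 100%, which is the failure mode the paragraph above describes.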

Methodology: Perception-Constrained Dynamical Caching

WorldCache replaces the brittle ZOH assumption with a pipeline designed to understand what is moving and how it's moving.

1. Causal Feature Caching (CFC)

Instead of a fixed threshold, CFC uses a "velocity" proxy derived from the raw latent input. If the scene has fast dynamics, the threshold tightens, forcing the model to recompute rather than skip.
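
One way to picture a velocity-adaptive threshold is the sketch below. The summary does not give the exact formula, so both the velocity proxy (relative L1 change of the latent) and the shrinkage rule `base / (1 + alpha * velocity)` are assumptions made for illustration; `latent_velocity`, `adaptive_threshold`, `base`, and `alpha` are hypothetical names.

```python
import numpy as np


def latent_velocity(z_curr: np.ndarray, z_prev: np.ndarray) -> float:
    """Relative L1 change between consecutive raw latent inputs (assumed proxy)."""
    return float(np.abs(z_curr - z_prev).mean() / (np.abs(z_prev).mean() + 1e-8))


def adaptive_threshold(base: float, velocity: float, alpha: float = 2.0) -> float:
    """Shrink the skip threshold as the velocity proxy grows (assumed rule)."""
    return base / (1.0 + alpha * velocity)
```

With this rule, a slowly changing latent leaves the threshold close to `base`, while a fast-changing one cuts it sharply, so fewer steps qualify for skipping.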

2. Saliency-Weighted Drift (SWD)

Not all pixels are created equal. SWD calculates a saliency map based on the channel-wise variance of probe features. High variance usually highlights edges and objects (foreground). By weighting the drift signal toward these regions, WorldCache ensures that even a tiny movement in a salient object triggers a recomputation, while background noise is ignored.
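
A minimal sketch of the idea, assuming probe features shaped `(tokens, channels)`. The normalisation of the saliency map and the per-token relative drift are illustrative choices, not details confirmed by the source; `saliency_map` and `saliency_weighted_drift` are hypothetical names.

```python
import numpy as np


def saliency_map(probe: np.ndarray) -> np.ndarray:
    """Per-token saliency from the variance across channels, normalised to sum to ~1."""
    var = probe.var(axis=1)
    return var / (var.sum() + 1e-8)


def saliency_weighted_drift(curr: np.ndarray, prev: np.ndarray) -> float:
    """Per-token relative drift, averaged with saliency weights instead of uniformly."""
    per_token = np.abs(curr - prev).mean(axis=1) / (np.abs(prev).mean(axis=1) + 1e-8)
    return float((saliency_map(prev) * per_token).sum())
```

If a single high-variance token moves while many flat background tokens stay still, the weighted drift stays large, whereas a uniform average over all tokens would dilute it toward zero.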

3. Optimal Feature Approximation (OFA)

When the model does skip, it doesn't just copy. OFA uses Optimal State Interpolation (OSI), a least-squares vector projection, to align the cached history with the current trajectory. It optionally uses Motion-Compensated Warping to spatially align features, limiting the accumulated error that usually degrades long-horizon video generation.
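
The least-squares projection at the heart of this step has a simple closed form: the scalar g that minimises ||target − g·basis||² is ⟨target, basis⟩ / ⟨basis, basis⟩. The sketch below shows that projection and one plausible way to use it, rescaling a cached deep residual by how the current probe-level change projects onto the cached one. How WorldCache actually applies the coefficient is not specified in this summary, so `approximate_deep_output` and all of its argument names are assumptions.

```python
import numpy as np


def projection_coefficient(target: np.ndarray, basis: np.ndarray) -> float:
    """Scalar g minimising ||target - g * basis||^2 (closed-form least squares)."""
    t, b = target.ravel(), basis.ravel()
    return float(t @ b / (b @ b + 1e-8))


def approximate_deep_output(
    probe_curr: np.ndarray,
    probe_delta_curr: np.ndarray,
    probe_delta_cached: np.ndarray,
    deep_residual_cached: np.ndarray,
) -> np.ndarray:
    """Rescale the cached deep residual instead of copying it unchanged (assumed usage)."""
    g = projection_coefficient(probe_delta_curr, probe_delta_cached)
    return probe_curr + g * deep_residual_cached
```

When the current change points in the same direction as the cached one but is half as large, g comes out near 0.5 and the cached residual is halved; a zero-order hold would correspond to fixing g at 1.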

Figure 1: The WorldCache pipeline. Note the interplay between the Probe, the CFC/SWD decision logic, and the OFA approximation.

4. Adaptive Threshold Scheduling (ATS)

Denoising typically follows a "structure-then-detail" path. ATS exploits this by keeping the cache threshold strict during early steps (where the physical layout is decided) and relaxing it aggressively during the final refinement steps.
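
A schedule of this shape could look like the sketch below. The polynomial ramp and the default values for `base`, `max_scale`, and `power` are illustrative assumptions; the summary only states that the threshold is strict early and relaxed late.

```python
def scheduled_threshold(
    step: int,
    num_steps: int,
    base: float = 0.05,
    max_scale: float = 4.0,
    power: float = 2.0,
) -> float:
    """Skip threshold that stays near `base` early and grows to `base * max_scale`."""
    progress = step / max(num_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return base * (1.0 + (max_scale - 1.0) * progress ** power)
```

With `power` above 1 the threshold stays close to `base` for most of the early, structure-defining steps and rises steeply only near the end, so most skips land in the refinement phase.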

Experimental Results: SOTA Efficiency

The authors tested WorldCache on the Cosmos-Predict2.5 (2B & 14B) and WAN2.1 models using the PAI-Bench (Physical AI Benchmark).

  • Speedup: Reaches 2.3× on Cosmos-Predict2.5-2B.
  • Fidelity: Maintained an overall score of 0.745 (vs. 0.748 baseline) on Text-to-World tasks.
  • Robotics: On the EgoDex-Eval benchmark, WorldCache maintained higher PSNR and SSIM than previous SOTA caching methods, which suggests it handles hand-object interactions better.

Table 1: Comparison on PAI-Bench. WorldCache (WC) consistently occupies the "sweet spot" of high speed and high domain/quality scores.

Visual Evidence: No More Ghosts

The qualitative results highlight the stark difference in "cleanliness." While previous methods like DiCache produce ghosting on moving vehicles or pedestrians, WorldCache maintains object persistence.

Figure 2: WorldCache (c) vs. DiCache (b). Notice the red boxes in (b) showing ghosting and deformation, which are corrected in (c).

Conclusion & Insight

WorldCache represents a shift from "heuristic-based skipping" to "principled approximation." By treating DiT caching as a dynamical system identification problem, it targets the primary bottleneck of video world models: the high cost of autoregressive sampling.

The key takeaway for the industry is the "Invest & Spend" strategy: spend compute on high-quality saliency-aware decisions and structural denoising steps, then harvest massive speedups by relaxing the constraints once the physical backbone of the video is secure.
