MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

[CVPR 2026] MosaicMem: Redefining Spatial Memory for Controllable Video World Models

总结

问题

方法

结果

要点

摘要

MosaicMem is a hybrid spatial memory framework for video diffusion models that combines explicit 3D geometric lifting with implicit attention-based conditioning. It achieves SOTA consistency in long-horizon video generation and precise camera control by treating video patches as the fundamental unit for 3D localization and retrieval.

TL;DR

MosaicMem introduces a hybrid "Patch-and-Compose" spatial memory that bridges the gap between explicit 3D geometry and implicit latent representations. By lifting video patches into a persistent 3D space and re-aligning them using Warped RoPE, it enables minute-level consistent video navigation, complex scene editing (like "Inception-style" world stitching), and real-time interaction at 16 FPS.

Background: The Persistence Paradox

In the quest to create "World Simulators" (like OpenAI's Sora or Google's Genie), researchers face a fundamental trade-off:

Explicit Memory: Use 3D structures (point clouds/Gaussians). Pros: Perfect geometry. Cons: Static; can't handle a cat running across the room.
Implicit Memory: Store past frames as context. Pros: Handles dynamics and "vibes." Cons: The world "drifts" and melts as you move the camera.

MosaicMem breaks this paradox by treating the Patch as the atomic unit of memory—lifting it into 3D for localization, but retrieving it via the model's native attention to stay flexible.

Methodology: The Patch-and-Compose Interface

MosaicMem's pipeline follows an "Explicit-Lifting, Implicit-Conditioning" logic.

1. Geometric Lifting & Alignment

When a frame is generated, its patches are associated with depth and camera poses, effectively "pinning" them in 3D space. When the camera revisits this area, the model retrieves these patches. However, simple retrieval isn't enough due to VAE compression. MosaicMem introduces:

Warped RoPE: Instead of standard 2D positions, it re-projects coordinates (u, v) through the camera transform to provide the transformer with pixel-accurate correspondence.
Warped Latents: Differentiable bilinear sampling of features to physically align the "memory" with the "query" view.

2. PRoPE Camera Control

The authors incorporate Projective Positional Encoding (PRoPE). Unlike simple MLP-based camera injections, PRoPE injects relative frustum geometry directly into the self-attention blocks, allowing the model to "understand" the 3D relationship between views.

MosaicMem Architecture Fig 1: The MosaicMem hybrid pipeline—lifting patches for localization and aligning them for generation.

Experiments & SOTA Results

The model was stress-tested on a new benchmark, MosaicMem-World, featuring long trajectories and frequent revisits in complex environments like Cyberpunk 2077 and Unreal Engine 5.

Geometric Fidelity: MosaicMem achieved a Rotation Error of 0.51°, compared to ~4.6° for purely implicit models.
Dynamic Modeling: Unlike explicit models (GEN3C) which produce static "cutouts," MosaicMem successfully generates new dynamics (e.g., a knight riding a horse into a scene) while keeping the background anchored.
Real-time Interaction: The "Mosaic Forcing" variant enables interactive world exploration at 16 FPS.

Performance Comparison Table 1: MosaicMem vs. SOTA. Note the massive jump in Consistency Scores (SSIM/PSNR) over implicit methods.

Revolutionary Capabilities: Scene Editing & Navigation

Because the memory is "deletable and manipulable," users can:

Delete/Duplicate Objects: Remove a building or triple the number of trees by modifying the patch-space.
Scene Stitching: Connect a "Medieval World" to a "Cyberpunk World" horizontally. By walking through the transition, the model generates a geometrically continuous bridge between disparate styles.
Inception Effects: Register a street memory "in the sky" to create surreal, vertically connected worlds.

Scene Editing Example Fig 2: Vertical and horizontal scene stitching via MosaicMem manipulation.

Critical Insight & Conclusion

MosaicMem proves that for a video model to act as a true simulator, it must "understand" that the world is 3D even if it lives in a 2D latent space. The transition from Frame-level memory (redundant and vague) to Patch-level 3D-aligned memory is a significant step toward solving the "drifting world" problem in generative AI.

Future Outlook: While MosaicMem currently relies on off-the-shelf depth estimators, the next logical step is an end-to-end framework where the 3D structure is learned jointly with the generation, further reducing reliance on external geometric scaffolds.

发现相似论文

试试这些示例

Search for recent papers that utilize hybrid 3D representations (like 3D Gaussians or Surfels) as memory modules for video diffusion transformers to solve spatial drift.
Which study first introduced Projective Positional Encoding (PRoPE), and how does MosaicMem adapt it for temporally compressed video latents?
Explore how MosaicMem's patch-and-compose interface could be extended to multi-agent robotics or reinforcement learning for consistent world simulation during exploration.

[CVPR 2026] MosaicMem: Redefining Spatial Memory for Controllable Video World Models

1. TL;DR

2. Background: The Persistence Paradox

3. Methodology: The Patch-and-Compose Interface

3.1. 1. Geometric Lifting & Alignment

3.2. 2. PRoPE Camera Control

4. Experiments & SOTA Results

5. Revolutionary Capabilities: Scene Editing & Navigation

6. Critical Insight & Conclusion