WisPaper
[CVPR 2026] ScrollScape: Transforming Video Priors into 32K Panoramic Powerhouses
Abstract

ScrollScape is a novel framework that reformulates Ultra-High-Resolution (UHR) image generation at Extreme Aspect Ratios (EAR) as a sequential video panning process. By adapting the Wan2.1 video diffusion model, it achieves structurally coherent 32K resolution imagery, effectively eliminating common artifacts like object repetition.

TL;DR

Generating ultra-wide panoramas (like 8:1 aspect ratios) at massive scales has long been a "boss level" challenge for AI. While models like Stable Diffusion excel at squares, they lose their minds on long canvases, repeating trees and people like a glitched video game. ScrollScape solves this by flipping the script: it treats generating a long image as filming a video with a panning camera. By harnessing the inherent "temporal consistency" of video models, it unlocks seamless, non-repetitive 32K imagery.

The "Object Repetition" Curse

Why can't DALL-E or Midjourney just "keep painting" to the right? The problem lies in the Stationary Bias: traditional image diffusion models are trained on standard, roughly square crops, so when the canvas is stretched far beyond that training distribution, they lose global context.

  • Tiling methods (SyncDiffusion, MultiDiffusion) try to stitch patches together, but without a global "map," they end up repeating content because each patch thinks it's the center of the world.
  • Position Interpolation (DyPE) helps but often results in "hallucinated" textures when pushed 4x or 8x beyond training limits.
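To see why tiling repeats content, here is a minimal 1-D sketch of the MultiDiffusion-style fusion idea: each tile is denoised in isolation and overlapping regions are averaged. `denoise_tile` is a hypothetical stand-in for a diffusion denoising pass; note that nothing inside the loop ever sees the canvas as a whole, so every tile "thinks it's the center of the world."

```python
import numpy as np

def tile_and_average(width, tile_w, stride, denoise_tile):
    """MultiDiffusion-style fusion sketch (1-D for clarity): denoise
    overlapping tiles independently, then average the overlaps.
    Each tile only sees its own window -- there is no global map,
    which is why content tends to repeat across a long canvas."""
    canvas = np.zeros(width)
    weight = np.zeros(width)
    for start in range(0, width - tile_w + 1, stride):
        canvas[start:start + tile_w] += denoise_tile(start, tile_w)
        weight[start:start + tile_w] += 1.0
    return canvas / np.maximum(weight, 1.0)
```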

Methodology: The Architecture of a Digital Scroll

ScrollScape introduces three core pillars to bridge the gap between video and static panoramas:

1. Spatial-Temporal Mapping

The central insight is that a long scroll is just a video where time equals horizontal distance ($x = v \cdot t$). By partitioning a massive canvas into overlapping chunks and treating them as video frames, the model uses its "temporal" knowledge to ensure that Frame 2 flows logically from Frame 1.
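As a rough sketch of this chunking (not the paper's actual code), slicing a wide canvas into overlapping "video frames" might look like the following, where the stride plays the role of the panning speed $v$:

```python
import numpy as np

def canvas_to_frames(canvas, frame_w, overlap):
    """Slice a wide canvas (H x W x C) into overlapping 'video frames'.
    Horizontal position plays the role of time (x = v * t), so a video
    model's temporal consistency enforces left-to-right coherence."""
    stride = frame_w - overlap  # the camera's 'panning speed' v
    h, w, c = canvas.shape
    frames = [canvas[:, x:x + frame_w]
              for x in range(0, w - frame_w + 1, stride)]
    return np.stack(frames)  # shape: (T, H, frame_w, C)
```

Because consecutive frames share `overlap` columns, the video prior is free to treat them as one continuous camera move rather than independent crops.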

2. Scanning Positional Encoding (ScanPE)

Standard video models assume the camera is static. ScrollScape re-engineers the Rotary Positional Embeddings (RoPE). Instead of fixed coordinates, it uses ScanPE to distribute global coordinates across frames.
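A hypothetical sketch of the idea (the paper's exact formulation may differ): instead of every frame reusing the same local x-coordinates, each frame's RoPE indices get a global offset, so the token at frame `f`, column `x` encodes the absolute canvas coordinate `f * stride + x`.

```python
import numpy as np

def scanpe_positions(num_frames, frame_w, stride):
    """Hypothetical ScanPE sketch: offset each frame's x-positions by
    its global scroll position, so positions are absolute canvas
    coordinates rather than per-frame local ones."""
    local = np.arange(frame_w)
    return np.stack([f * stride + local for f in range(num_frames)])

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for the given position indices."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return positions[..., None] * inv_freq  # shape: (..., dim // 2)
```

With this mapping, two tokens that overlap spatially in adjacent frames receive identical rotary angles, which is what lets attention treat them as the same place in the world.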

Fig 1: ScrollScape reformulates image synthesis as a sequential video panning task, acting like a moving camera.

3. ScrollSR & TAP: The Path to 32K

Directly generating 32,000 pixels in one go would incinerate most GPUs. ScrollScape uses ScrollSR (Scrolling Super-Resolution), which applies a video super-resolution prior frame-by-frame. To prevent "flickering" or seams between these high-res chunks, they use Trajectory Anchored Partitioning (TAP). This ensures the 3D VAE decoder stays synchronized across the entire length of the 32K canvas.
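A toy sketch of the sequential stitching idea, assuming each chunk is conditioned on the overlapping tail of the previous upscaled chunk. `sr_step` is a hypothetical stand-in for one video-SR diffusion pass, and the real TAP mechanism works inside the 3D VAE rather than on raw pixels:

```python
import numpy as np

def scroll_sr(frames, overlap, scale, sr_step):
    """Scrolling super-resolution sketch: upscale chunks left to right,
    carrying the overlapping tail of each upscaled chunk forward as
    context so consecutive chunks stay anchored to the same trajectory
    and no seams or flicker appear at chunk boundaries."""
    out, context = [], None
    hi_ov = overlap * scale  # overlap width in high-res pixels
    for f in frames:
        hi = sr_step(f, context)  # one (hypothetical) video-SR pass
        # Drop the re-generated overlap region except on the first chunk.
        out.append(hi if context is None else hi[:, hi_ov:])
        context = hi[:, -hi_ov:]  # tail carried into the next chunk
    return np.concatenate(out, axis=1)
```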

Experiments: Killing the "Hall of Mirrors" Effect

The authors compared ScrollScape against heavyweights like FLUX and specialized tiling methods.

  • Qualitative: While FLUX-Krea and DyPE showed "catastrophic structural failure" beyond 4:1 ratios, ScrollScape stayed coherent even at 8:1.
  • Quantitative: The team introduced a new metric, Global Structural Diversity (GSD), using DINOv2 and LPIPS to check if distant parts of the image were just clones of each other. ScrollScape scored significantly lower on redundancy.
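A GSD-style redundancy check could be sketched as follows, with `embed` standing in for a DINOv2 feature extractor (the paper's exact metric likely differs in the details):

```python
import numpy as np

def structural_redundancy(crops, embed):
    """Sketch of a GSD-style check: embed non-adjacent crops of the
    panorama and measure their mean pairwise cosine similarity.
    High similarity between distant crops signals 'hall of mirrors'
    cloning; lower values indicate more structural diversity."""
    feats = np.stack([embed(c) for c in crops])
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    iu = np.triu_indices(len(crops), k=1)  # upper triangle, no diagonal
    return sim[iu].mean()
```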

Fig 2: Ablation study showing how Wan2.1 (top) repeats content, while ScrollScape (bottom) creates a continuous, varied narrative.

Deep Insight: Why This Matters

The genius of ScrollScape isn't just "making big images." It’s the realization that Video Diffusion Models are better world-simulators than Image Models.

A video model understands that if a mountain starts on the left, it must realistically conclude on the right. By "tricking" a video model into thinking a spatial layout is a temporal sequence, we gain access to a much higher level of structural logic. This moves us away from "stitching" and toward "authoring" continuous digital worlds.

Limitations & Future Work

While ScrollScape is a giant leap for panoramas:

  • It still requires a "lightweight alignment" (fine-tuning) on roughly 3,000 images to bridge the domain gap.
  • The inference speed, while optimized via ScrollSR, is still bound by the sequential nature of video generation.

Future iterations might explore multi-directional scanning (Snake Fusion Mode) to generate massive 2D grids (gigapixel square photos) or immersive 360-degree VR environments using the same video-panning logic.

Final Takeaway: To solve 2D spatial problems, sometimes you need to add a 1D temporal dimension.
