ScrollScape is a novel framework that reformulates Ultra-High-Resolution (UHR) image generation at Extreme Aspect Ratios (EAR) as a sequential video panning process. By adapting the Wan2.1 video diffusion model, it achieves structurally coherent 32K resolution imagery, effectively eliminating common artifacts like object repetition.
TL;DR
Generating ultra-wide panoramas (like 8:1 aspect ratios) at massive scales has long been a "boss level" challenge for AI. While models like Stable Diffusion excel at squares, they lose their minds on long canvases, repeating trees and people like a glitched video game. ScrollScape solves this by flipping the script: it treats generating a long image as filming a video with a panning camera. By harnessing the inherent "temporal consistency" of video models, it unlocks seamless, non-repetitive 32K imagery.
The "Object Repetition" Curse
Why can't DALL-E or Midjourney just "keep painting" to the right? The problem lies in Stationary Bias: traditional image diffusion models are trained on standard, near-square crops, so when you stretch the canvas far beyond those shapes, they lose the global context needed to keep distant regions consistent.
- Tiling methods (SyncDiffusion, MultiDiffusion) try to stitch patches together, but without a global "map," they end up repeating content because each patch thinks it's the center of the world.
- Position Interpolation (DyPE) helps, but often produces "hallucinated" textures when pushed 4x or 8x beyond training limits (the underlying trick is sketched below).
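To see why pure interpolation gets brittle, here is a minimal, generic sketch of position interpolation for RoPE. This is not DyPE's actual algorithm; the function name, lengths, and base are illustrative assumptions. Positions are squeezed back into the trained range, which keeps attention in-distribution but leaves neighboring tokens at fractional offsets.

```python
# Generic position interpolation for RoPE -- an illustrative sketch, not DyPE.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary embedding angles for a batch of (possibly fractional) positions."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None] * freqs[None, :]          # (seq_len, dim / 2)

train_len, target_len = 1024, 4096                      # assumed lengths: 4x extrapolation
positions = torch.arange(target_len, dtype=torch.float32)
scaled = positions * (train_len / target_len)           # max index stays at 1023.75
angles = rope_angles(scaled, dim=64)
# Attention stays in-distribution, but adjacent tokens now sit 0.25 positions
# apart -- this finer-than-trained granularity is where hallucinated texture creeps in.
```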
Methodology: The Architecture of a Digital Scroll
ScrollScape introduces three core pillars to bridge the gap between video and static panoramas:
1. Spatial-Temporal Mapping
The central insight is that a long scroll is just a video where time equals horizontal distance ($x = v \cdot t$). By partitioning a massive canvas into overlapping chunks and treating them as video frames, the model uses its "temporal" knowledge to ensure that Frame 2 flows logically from Frame 1.
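As a concrete picture of the mapping, here is a minimal sketch that slices a wide latent canvas into overlapping windows and stacks them as video frames. The sizes, channel count, and uniform stride are illustrative assumptions, not ScrollScape's actual configuration.

```python
# Spatial-to-temporal mapping: a wide canvas becomes a stack of "video frames".
import torch

def canvas_to_frames(canvas: torch.Tensor, frame_w: int, stride: int) -> torch.Tensor:
    """canvas: (C, H, W) latent -> frames: (T, C, H, frame_w), with x = stride * t."""
    C, H, W = canvas.shape
    starts = range(0, W - frame_w + 1, stride)
    return torch.stack([canvas[:, :, s:s + frame_w] for s in starts])

canvas = torch.randn(16, 60, 960)                 # a 16-channel latent "scroll"
frames = canvas_to_frames(canvas, frame_w=104, stride=8)
print(frames.shape)                               # (108, 16, 60, 104): one pan step per frame
```

The overlap between consecutive windows is what lets the video prior carry structure from one chunk into the next.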
2. Scanning Positional Encoding (ScanPE)
Standard video models assume a static camera: every frame reuses the same spatial coordinate grid. ScrollScape re-engineers the Rotary Positional Embeddings (RoPE) with ScanPE, which, instead of fixed per-frame coordinates, distributes global canvas coordinates across frames so the whole sequence shares one continuous coordinate system.
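In code terms, the coordinate bookkeeping might look like the hedged sketch below; the paper's exact formulation may differ, and the sizes are carried over from the chunking sketch above.

```python
# ScanPE-style coordinates: offset each frame's grid by its pan distance so
# RoPE sees one continuous global axis (a sketch, not the paper's code).
import torch

def scan_positions(num_frames: int, frame_w: int, stride: int) -> torch.Tensor:
    """Per-frame x positions under a panning camera: shape (T, frame_w)."""
    local = torch.arange(frame_w)                  # 0..frame_w-1 within a frame
    offsets = torch.arange(num_frames) * stride    # camera displacement per frame
    return offsets[:, None] + local[None, :]       # global x = offset + local x

scanned = scan_positions(num_frames=108, frame_w=104, stride=8)
print((scanned[1, :96] == scanned[0, 8:]).all())   # True: overlapping pixels in
                                                   # consecutive frames share positions
```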
Fig 1: ScrollScape reformulates image synthesis as a sequential video panning task, acting like a moving camera.
3. ScrollSR & TAP: The Path to 32K
Directly generating a 32,000-pixel-wide image in one go would incinerate most GPUs. ScrollScape instead uses ScrollSR (Scrolling Super-Resolution), which applies a video super-resolution prior frame by frame. To prevent flickering or seams between these high-res chunks, it uses Trajectory Anchored Partitioning (TAP), which keeps the 3D VAE decoder synchronized across the entire length of the 32K canvas.
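A minimal sketch of what trajectory-anchored chunking can look like is below. The chunk sizes, the averaging blend, and the `sr_model` stand-in are assumptions, not the paper's implementation; the key property is that every pass cuts the frame sequence at the same deterministic anchors, so overlaps resolve identically each time.

```python
# Trajectory-anchored chunking for a frame-by-frame SR / decode pass (a sketch).
import torch
import torch.nn.functional as F

def anchored_chunks(num_frames: int, chunk: int, overlap: int) -> list[tuple[int, int]]:
    """Deterministic (start, end) frame anchors shared by every processing pass."""
    step = chunk - overlap
    return [(s, min(s + chunk, num_frames)) for s in range(0, num_frames - overlap, step)]

def scroll_sr(frames: torch.Tensor, sr_model, scale: int, chunk: int = 16, overlap: int = 4):
    T, C, H, W = frames.shape
    out = torch.zeros(T, C, H * scale, W * scale)
    hits = torch.zeros(T, 1, 1, 1)
    for s, e in anchored_chunks(T, chunk, overlap):
        out[s:e] += sr_model(frames[s:e])          # same anchors every pass -> no seam drift
        hits[s:e] += 1
    return out / hits                              # average frames covered by two chunks

upscale = lambda x: F.interpolate(x, scale_factor=4, mode="bicubic", align_corners=False)
frames_hr = scroll_sr(torch.randn(108, 3, 60, 104), upscale, scale=4)  # stand-in prior
print(frames_hr.shape)                             # (108, 3, 240, 416)
```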
Experiments: Killing the "Hall of Mirrors" Effect
The authors compared ScrollScape against heavyweights like FLUX and specialized tiling methods.
- Qualitative: While FLUX-Krea and DyPE showed "catastrophic structural failure" beyond 4:1 ratios, ScrollScape stayed coherent even at 8:1.
- Quantitative: The team introduced a new metric, Global Structural Diversity (GSD), which uses DINOv2 and LPIPS features to check whether distant parts of the image are just clones of each other. ScrollScape scored significantly lower on redundancy; a rough sketch of the idea follows below.
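For intuition, here is a DINOv2-only sketch of such a redundancy check. The crop size, stride, and backbone choice are assumptions, and the paper's GSD also involves LPIPS, so take this as the flavor of the metric rather than its definition.

```python
# Do distant crops of the panorama look like clones? (a sketch of the idea)
import torch
import torch.nn.functional as F

dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def crop_similarity(panorama: torch.Tensor, crop: int = 224, stride: int = 448) -> float:
    """panorama: (3, H, W), ImageNet-normalized, H >= crop. Mean pairwise cosine sim."""
    _, H, W = panorama.shape
    crops = torch.stack([panorama[:, :crop, x:x + crop]
                         for x in range(0, W - crop + 1, stride)])
    feats = F.normalize(dino(crops), dim=-1)       # (N, 384) global embeddings
    sims = feats @ feats.T
    off_diag = sims[~torch.eye(len(feats), dtype=torch.bool)]
    return off_diag.mean().item()                  # high value -> "hall of mirrors"
```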
Fig 2: Ablation study showing how Wan2.1 (top) repeats content, while ScrollScape (bottom) creates a continuous, varied narrative.
Deep Insight: Why This Matters
The genius of ScrollScape isn't just "making big images." It’s the realization that Video Diffusion Models are better world-simulators than Image Models.
A video model understands that if a mountain starts on the left, it must realistically conclude on the right. By "tricking" a video model into thinking a spatial layout is a temporal sequence, we gain access to a much higher level of structural logic. This moves us away from "stitching" and toward "authoring" continuous digital worlds.
Limitations & Future Work
While ScrollScape is a giant leap for panoramas:
- It still requires a "lightweight alignment" (fine-tuning) on roughly 3,000 images to bridge the domain gap.
- The inference speed, while optimized via ScrollSR, is still bound by the sequential nature of video generation.
Future iterations might explore multi-directional scanning (Snake Fusion Mode) to generate massive 2D grids (gigapixel square photos) or immersive 360-degree VR environments using the same video-panning logic.
Final Takeaway: To solve a 2D spatial problem, sometimes you need to recast one of its dimensions as time.
