InfinityStory is a novel framework for generating long-form storytelling videos with high visual consistency and smooth cinematographic transitions. It pairs a hierarchical multi-agent planning system with a specialized First-Last-Frame-to-Video (FLF2V) model, achieving state-of-the-art rankings on VBench, including first place in Background Consistency (88.94) and Subject Consistency (82.11).
TL;DR
InfinityStory is a breakthrough framework designed to move generative AI from short video clips to full-scale cinematic storytelling. By solving the dual challenges of background drift and abrupt character transitions, it enables the synthesis of hour-long narratives that look and feel like coherent movies.
The Core Conflict: Why Current AI Can't "Tell" a Story
While models like Sora or HunyuanVideo generate stunning short clips, they fail at long-form narrative for two technical reasons:
- Background Drift: Between two shots in the same room, the walls, lighting, or furniture often shift because the model "re-imagines" the scene from scratch based on a prompt.
- Transition Discontinuity: When a new character enters a scene in a movie, they walk into the frame. In AI videos, they usually just "pop" into existence at the start of a new clip, breaking the viewer's immersion.
Methodology: The "Director, Architect, and Editor" Approach
InfinityStory addresses both problems with a hierarchical multi-agent system and a novel training strategy.
1. Location Injection & Background Stability
Instead of generating shots in isolation, the framework first builds a Location Library.
- The Insight: By generating a "canonical" background image for a location first, the model can inject this fixed environment into every shot.
- The Tech: Using Image-to-Image (I2I) and Image-to-Video (I2V) pipelines, the system fuses persistent backgrounds with character reference images, ensuring that even in a 100-shot sequence, the "living room" always looks like the same living room.
Figure 1: The InfinityStory pipeline uses I2V for narrative content and FLF2V for smooth transitions.
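The Location Library idea can be sketched as a simple cache: the canonical background for each location is generated exactly once, and every subsequent shot in that location reuses the same asset rather than re-imagining the scene. This is an illustrative sketch, not the paper's code; the names `LocationLibrary`, `get_or_create`, and `compose_shot` are assumptions, and the actual system calls text-to-image, I2I, and I2V models where the stand-in strings appear below.

```python
from dataclasses import dataclass, field

@dataclass
class LocationLibrary:
    """Hypothetical cache of canonical background images, keyed by location."""
    backgrounds: dict = field(default_factory=dict)

    def get_or_create(self, location: str, prompt: str) -> str:
        # Generate the canonical background only once per location;
        # all later shots reuse the exact same asset, so it cannot drift.
        if location not in self.backgrounds:
            self.backgrounds[location] = f"bg({prompt})"  # stand-in for a T2I call
        return self.backgrounds[location]

def compose_shot(library: LocationLibrary, location: str,
                 prompt: str, characters: list) -> dict:
    """Stand-in for the I2I fusion (background + character refs) and I2V step."""
    background = library.get_or_create(location, prompt)
    return {"background": background, "characters": characters}

library = LocationLibrary()
shot_1 = compose_shot(library, "living_room", "cozy living room, warm light", ["Alice"])
shot_50 = compose_shot(library, "living_room", "cozy living room, warm light", ["Alice", "Bob"])
assert shot_1["background"] == shot_50["background"]  # same canonical asset, no drift
```

The design choice this illustrates: consistency comes from asset reuse at the workflow level, not from asking the generative model to remember the room.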
2. CMTS: Smooth Multi-Character Transitions
To stop characters from "teleporting" into frames, the authors introduce Cinematic Multi-Subject Transition Synthesis (CMTS).
- The Dataset: They generated 10,000 synthetic videos specifically for "boring" but necessary transitions (Entering, Exiting, Swapping).
- The FLF2V Model: They fine-tuned a model that takes the last frame of Shot A and the first frame of Shot B to generate a bridge clip. This ensures that if Character X is leaving, the model actually renders them walking out of the frame.
Figure 2: The novel framework for generating and filtering the multi-subject transition dataset.
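The bridging logic above can be sketched as a stitching pass over the shot list: whenever the character set changes between consecutive shots, an FLF2V bridge clip is inserted, conditioned on the outgoing shot's last frame and the incoming shot's first frame. This is a minimal sketch under assumed names (`needs_transition`, `flf2v_bridge`, `stitch`); the real FLF2V model is a fine-tuned video generator, stubbed here with a placeholder dict.

```python
def needs_transition(shot_a: dict, shot_b: dict) -> bool:
    # A character entering, exiting, or swapping triggers a bridge clip.
    return set(shot_a["characters"]) != set(shot_b["characters"])

def flf2v_bridge(last_frame: str, first_frame: str) -> dict:
    # Stand-in for the fine-tuned FLF2V model: given the last frame of
    # Shot A and the first frame of Shot B, render the motion between them.
    return {"type": "bridge", "from": last_frame, "to": first_frame}

def stitch(shots: list) -> list:
    """Interleave bridge clips wherever the on-screen cast changes."""
    timeline = [shots[0]]
    for prev, nxt in zip(shots, shots[1:]):
        if needs_transition(prev, nxt):
            timeline.append(flf2v_bridge(prev["last_frame"], nxt["first_frame"]))
        timeline.append(nxt)
    return timeline

shots = [
    {"characters": ["X"],      "last_frame": "A_end", "first_frame": "A_start"},
    {"characters": ["X", "Y"], "last_frame": "B_end", "first_frame": "B_start"},  # Y enters
    {"characters": ["Y"],      "last_frame": "C_end", "first_frame": "C_start"},  # X exits
]
timeline = stitch(shots)
# Two cast changes -> two bridge clips between the three narrative shots.
assert sum(1 for clip in timeline if clip.get("type") == "bridge") == 2
```

Because the bridge is conditioned on both boundary frames, an exiting character is actually rendered walking out rather than vanishing at the cut.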
Experiments: Proving the Narrative Logic
The model was tested using the VBench suite and human evaluations. The results were clear: InfinityStory dominates in consistency.
Key Results:
- Background Consistency: 88.94 (Rank 1)
- Subject Consistency: 82.11 (Rank 1)
- Human Preference: In blind tests, users preferred InfinityStory's transitions and scene coherence by a significant margin over previous SOTA models like MovieAgent.
Table 1: Quantitative comparison showing InfinityStory's superiority in consistency metrics.
Critical Analysis & Conclusion
Takeaway: The real breakthrough here isn't a bigger neural network, but a better workflow. By mimicking the actual film production process (reusing sets, planning character exits), InfinityStory brings us closer to a "Studio in a Box."
Limitations: While consistency is greatly improved, the authors acknowledge that image quality takes a slight hit (480p output versus 720p). The model also needs further work to generalize to crowded multi-character scenes and highly abstract storylines.
The Future: InfinityStory sets the stage for AI-generated series and films where the AI doesn't just create "cool clips" but maintains a world with "spatial memory."
