[CVPR 2025] InfinityStory: Conquering the "Consistency Curse" in Hour-Long Video Generation
Abstract

InfinityStory is a novel framework for generating long-form storytelling videos with high visual consistency and smooth cinematographic transitions. It combines a multi-agent hierarchical planning system with a specialized First-Last-Frame-to-Video (FLF2V) model to achieve state-of-the-art rankings on VBench, leading in Background Consistency (88.94) and Subject Consistency (82.11).

TL;DR

InfinityStory is a breakthrough framework designed to move generative AI from short video clips to full-scale cinematic storytelling. By solving the dual challenges of background drift and abrupt character transitions, it enables the synthesis of hour-long narratives that look and feel like coherent movies.

The Core Conflict: Why Current AI Can't "Tell" a Story

While models like Sora or HunyuanVideo generate stunning short clips, they fail at long-form narrative for two technical reasons:

  1. Background Drift: Between two shots in the same room, the walls, lighting, or furniture often shift because the model "re-imagines" the scene from scratch based on a prompt.
  2. Transition Discontinuity: When a new character enters a scene in a movie, they walk into the frame. In AI videos, they usually just "pop" into existence at the start of a new clip, breaking the viewer's immersion.

Methodology: The "Director, Architect, and Editor" Approach

InfinityStory solves this using a sophisticated hierarchical multi-agent system and a novel training strategy.
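As a rough illustration of the division of labor, the planning layers can be sketched as follows. All names and data structures here are hypothetical stand-ins, not the paper's actual agent interfaces:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    location: str
    characters: tuple
    action: str

def plan_shots(beats):
    """'Director' level: expand high-level scene beats into an ordered shot list."""
    return [Shot(b["location"], tuple(b["characters"]), a)
            for b in beats for a in b["actions"]]

def transition_points(shots):
    """'Editor' level: find boundaries where the character set changes --
    exactly the cuts that need a synthesized transition clip."""
    return [i for i in range(1, len(shots))
            if shots[i].characters != shots[i - 1].characters]

# Toy two-beat script: Bob joins Alice in the second beat
beats = [
    {"location": "living_room", "characters": ["Alice"], "actions": ["reads"]},
    {"location": "living_room", "characters": ["Alice", "Bob"], "actions": ["talk"]},
]
shots = plan_shots(beats)
cuts = transition_points(shots)  # boundaries that require a bridge clip
```

The key design point the paper exploits is that transitions become an explicit, plannable event in the shot list rather than an accident of prompt boundaries.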

1. Location Injection & Background Stability

Instead of generating shots in isolation, the framework first builds a Location Library.

  • The Insight: By generating a "canonical" background image for a location first, the model can inject this fixed environment into every shot.
  • The Tech: Using Image-to-Image (I2I) and Image-to-Video (I2V) pipelines, the system fuses persistent backgrounds with character reference images, ensuring that even in a 100-shot sequence, the "living room" always looks like the same living room.
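A minimal sketch of the location-library idea, assuming a simple cache keyed by location name. The string tags and `render_shot` fusion step are placeholders for the paper's I2I/I2V pipelines, which operate on actual images:

```python
class LocationLibrary:
    """Generate each location's canonical background once, then reuse it.
    A string tag stands in for a real reference image."""

    def __init__(self):
        self._cache = {}

    def background_for(self, location: str) -> str:
        if location not in self._cache:
            # Placeholder for the one-time I2I "canonical background" generation
            self._cache[location] = f"bg::{location}"
        return self._cache[location]

def render_shot(library, location: str, character_ref: str) -> str:
    # Placeholder for the I2V step that fuses the fixed background
    # with a character reference image
    return f"{library.background_for(location)} | {character_ref}"

lib = LocationLibrary()
frames = [render_shot(lib, "living_room", c) for c in ("Alice", "Bob", "Alice")]
# All three shots share the identical canonical background reference
```

Because the background is generated once and injected everywhere, no shot can "re-imagine" the room from scratch, which is exactly what suppresses background drift.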

Figure 1: The InfinityStory pipeline uses I2V for narrative content and FLF2V for smooth transitions.

2. CMTS: Smooth Multi-Character Transitions

To stop characters from "teleporting" into frames, the authors introduce Cinematic Multi-Subject Transition Synthesis (CMTS).

  • The Dataset: They generated 10,000 synthetic videos specifically for "boring" but necessary transitions (Entering, Exiting, Swapping).
  • The FLF2V Model: They fine-tuned a model that takes the last frame of Shot A and the first frame of Shot B to generate a bridge clip. This ensures that if Character X is leaving, the model actually renders them walking out of the frame.
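The bridging step can be sketched as below. Note the linear cross-fade here is only a toy stand-in for the learned FLF2V model, which would synthesize actual motion (e.g. a character physically walking out of frame) conditioned on both boundary frames:

```python
import numpy as np

def flf2v_bridge(last_frame: np.ndarray, first_frame: np.ndarray,
                 num_frames: int = 8) -> list:
    """Generate a bridge clip between the last frame of Shot A and the
    first frame of Shot B. Stand-in implementation: linear cross-fade."""
    frames = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)  # interpolation weight in (0, 1)
        frames.append((1 - t) * last_frame + t * first_frame)
    return frames

# Usage: bridge two tiny 4x4 grayscale "frames" (black shot -> white shot)
a = np.zeros((4, 4))
b = np.ones((4, 4))
bridge = flf2v_bridge(a, b, num_frames=8)
```

The interface is the important part: conditioning on both boundary frames constrains the generated clip to land exactly on Shot B's first frame, so the cut point is seamless by construction.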

Figure 2: The framework for generating and filtering the multi-subject transition dataset.

Experiments: Proving the Narrative Logic

The model was tested using the VBench suite and human evaluations. The results were clear: InfinityStory dominates in consistency.

Key Results:

  • Background Consistency: 88.94 (Rank 1)
  • Subject Consistency: 82.11 (Rank 1)
  • Human Preference: In blind tests, users preferred InfinityStory's transitions and scene coherence by a significant margin over previous SOTA models like MovieAgent.

Table 1: Quantitative comparison showing InfinityStory's superiority in consistency metrics.

Critical Analysis & Conclusion

Takeaway: The real breakthrough here isn't a bigger neural network, but a better workflow. By mimicking the actual film production process (reusing sets, planning character exits), InfinityStory brings us closer to a "Studio in a Box."

Limitations: While consistency improves dramatically, the authors acknowledge that image quality takes a slight hit (480p vs. 720p). The model also needs further work to generalize to complex multi-character crowds or highly abstract storylines.

The Future: InfinityStory sets the stage for AI-generated series and films where the AI doesn't just create "cool clips" but maintains a world with "spatial memory."
