Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Rays as Pixels: Unifying Video Generation and Pose Estimation via Joint Distribution Learning

Abstract

The paper introduces Rays as Pixels (Raxels), a unified Video Diffusion Model (VDM) that learns a joint distribution over video frames and camera trajectories. By representing camera parameters as dense 3-channel "raxel" images, the model achieves SOTA performance in camera-controlled video generation and competitive results in camera pose estimation within a single 20B parameter framework.

Executive Summary

TL;DR: Researchers have traditionally treated "where the camera is" and "what the camera sees" as two separate problems. Rays as Pixels (Raxels) breaks this wall by training a massive 20B parameter Video Diffusion Model to learn the joint distribution of both. By turning abstract camera matrices into 3-channel "raxel" images, the model can predict motion from video, generate video from motion, or do both simultaneously.

Context: This isn't just another video generator. It sits at the intersection of Generative AI and 3D Vision (SfM/SLAM), proving that a single generative backbone can handle both forward synthesis and inverse geometric inference.

The Core Insight: Why Decoupling Fails

In a standard pipeline (e.g., NeRF or Gaussian Splatting), you first run a tool like COLMAP to recover camera poses, then train your model on those poses. If the input images are sparse, COLMAP often fails to register them, and everything downstream collapses with it.

The authors argue that Forward (Generation) and Inverse (Pose Estimation) processes are two sides of the same coin. A model that understands the underlying 3D geometry of a scene should be able to hallucinate the camera path just as easily as it hallucinates the pixels.
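Stated in probability terms, this is the usual factorization of a joint distribution, and it needs nothing beyond the product rule. Writing $z$ for the video and $r$ for the camera trajectory (the symbols used in the experiments section of this post):

$$
p(z, r) = p(z \mid r)\, p(r) = p(r \mid z)\, p(z)
$$

Sampling from $p(z \mid r)$ is camera-controlled generation, sampling from $p(r \mid z)$ is pose estimation, and a model of $p(z, r)$ gives access to both conditionals.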

Methodology: Cameras as Image Tensors

The technical hurdle was making camera parameters (usually 4x4 matrices) "digestible" for a Video Diffusion Transformer (Wan 2.1).

1. The Raxel Representation

Instead of feeding matrices into an MLP, the authors create Raxels (Ray-Pixels). For every pixel $(u, v)$, they calculate the ray's origin $o$ and direction $d$ in a canonical coordinate system. The raxel value is simply the sum $d + o$.

Compatibility: Because it's a 3-channel map, it can be compressed by the exact same VAE used for video frames.
Alignment: It preserves spatial correspondence. A "pixel" in the camera map corresponds to the same "pixel" in the video frame.
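The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes a pinhole camera with intrinsics `K`, a camera-to-world pose already expressed in the canonical frame, pixel centers at half-integer coordinates, and unit-normalized ray directions. The function name `raxel_map` is hypothetical.

```python
import numpy as np

def raxel_map(K, cam_to_world, height, width):
    """Return an (H, W, 3) map whose value at each pixel is d + o.

    Assumptions (not confirmed by the source): pinhole intrinsics K (3x3),
    a 4x4 camera-to-world pose in the canonical frame, and unit-length d.
    """
    # Pixel-center coordinates (u, v) in homogeneous form.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)

    # Back-project pixels to ray directions in the camera frame.
    dirs_cam = pix @ np.linalg.inv(K).T

    # Rotate directions into the canonical frame and normalize them.
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)

    # Every pixel of one frame shares the same ray origin o.
    return d + o

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
raxels = raxel_map(K, pose, height=48, width=64)
print(raxels.shape)
```

Because the result has the same height, width, and channel count as an RGB frame, a sequence of such maps can be stacked like a video, which is what makes the shared-VAE compatibility point above possible.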

2. Decoupled Self-Cross Attention

To prevent the high-frequency texture of video from "polluting" the smooth geometric signals of camera rays, the model uses a dual-branch attention mechanism:

Self-Attention: Stabilizes the video sequence and camera trajectory independently.
Cross-Attention: Allows the video to follow the camera rays and the rays to refine based on visual landmarks.
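One way to picture the two branches is the toy block below. It is a sketch under strong assumptions: single-head attention with no learned projections, residual connections, and a self-then-cross ordering, none of which are specified in this post. The names `attention` and `decoupled_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for 2-D token arrays (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_block(video_tokens, ray_tokens):
    # Self-attention: each stream attends only to itself.
    video = video_tokens + attention(video_tokens, video_tokens, video_tokens)
    rays = ray_tokens + attention(ray_tokens, ray_tokens, ray_tokens)

    # Cross-attention: each stream queries the other one.
    video_out = video + attention(video, rays, rays)
    rays_out = rays + attention(rays, video, video)
    return video_out, rays_out

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(10, 8))   # 10 video tokens, dim 8
ray_tokens = rng.normal(size=(6, 8))      # 6 raxel tokens, dim 8
video_out, rays_out = decoupled_block(video_tokens, ray_tokens)
print(video_out.shape, rays_out.shape)
```

The point of the split is that the two modalities only exchange information through the explicit cross-attention step, rather than being mixed in one shared attention over all tokens.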

[Figure: Model Architecture]

Experiments: The Self-Consistency Test

The most impressive part of the paper is the Closed-Loop Self-Consistency Test. Use the model to:

Predict trajectories $r'$ from a video $z$.
Re-generate video $z'$ using those predicted trajectories $r'$.

If the model truly learns the joint distribution $p(z, r)$, then $z'$ should closely match $z$. According to the paper, Raxels pass this test, whereas older representations such as Plücker embeddings do not stay consistent across the loop.
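The two-step loop can be written as a small harness. This is a sketch only: `predict_trajectory` and `generate_video` are hypothetical callables standing in for the model's two inference modes (any extra conditioning, such as a reference frame, would be wrapped inside them), and mean squared error is an assumed stand-in for whatever metric the paper reports.

```python
import numpy as np
from typing import Callable

def closed_loop_error(
    video: np.ndarray,
    predict_trajectory: Callable[[np.ndarray], np.ndarray],
    generate_video: Callable[[np.ndarray], np.ndarray],
) -> float:
    """Run z -> r' -> z' and return the mean squared error between z' and z."""
    r_pred = predict_trajectory(video)        # step 1: video to trajectory
    video_regen = generate_video(r_pred)      # step 2: trajectory to video
    return float(np.mean((video_regen - video) ** 2))

# Stub models for illustration: a perfectly consistent pair gives zero error.
video = np.ones((2, 4, 4, 3))                 # (frames, H, W, channels)
fake_predict = lambda v: np.zeros((2, 4, 4, 3))
fake_generate = lambda r: np.ones((2, 4, 4, 3))
print(closed_loop_error(video, fake_predict, fake_generate))
```

A low error here only shows that the two modes agree with each other. It complements, rather than replaces, a comparison against ground-truth poses.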

[Figure: Qualitative Comparison]

Key Results

Visual Quality: Achieved an FID of 9.73 on the DL3DV-140 benchmark (lower is better), nearly halving the score of the previous SOTA, Kaleido (18.04).
Efficiency: Trajectory prediction (pose estimation) requires only 2 denoising steps, which avoids the iterative optimization used in traditional SfM pipelines.

Critical Analysis & Future Outlook

Rays as Pixels is a sophisticated "repurposing" of large-scale video models. It suggests that the next generation of "World Simulators" won't just output pixels; they will output a coherent 4D understanding of space and movement.

Limitations:

Static Bias: The model is trained on Re10K and DL3DV, which consist mostly of static scenes. It might struggle with dynamic content, such as a person running through a shot.
Scale: At 20B parameters, it is a heavyweight model, which makes training and inference expensive.

Takeaway: The "Raxel" concept is a blueprint for integrating any meta-data (depth, semantics, poses) into diffusion models. By treating everything as a spatially-aligned pixel latent, we unlock the full power of pretrained visual backbones for geometric reasoning.

Technical interpretation by Senior Academic Tech Editor.
