X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering

Zhitong Huang\(^1\), Mohan Zhang\(^2\), Renhan Wang\(^3\), Rui Tang\(^3\), Hao Zhu\(^3\), Jing Liao\(^{1*}\)

\(^1\): City University of Hong Kong, Hong Kong SAR, China.   \(^2\): WeChat, Tencent Inc., Shenzhen, China.   \(^3\): Manycore Tech Inc., Hangzhou, China.
\(^*\) : Corresponding author

Paper | Supplementary (87 demo videos)

Abstract

We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multimodal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively to their respective global and local regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories.
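To make the Hybrid Self-Attention idea concrete, the sketch below shows one plausible way such a layer could be wired in PyTorch: each frame's queries attend to its own tokens plus keys/values taken from the reference frame, which anchors every frame to the same appearance without attending to the full video. The class name `HybridSelfAttention`, the tensor layout, and the choice to share only the reference frame's tokens are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class HybridSelfAttention(nn.Module):
    """Self-attention whose keys/values include tokens from a reference frame.

    Hypothetical tensor layout (not taken from the paper):
        x   -- (B, T, N, C) video tokens: batch, frames, spatial tokens, channels
        ref -- (B, N, C)    tokens of the reference frame
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        h, d = self.num_heads, c // self.num_heads

        q = self.to_q(x)                                       # (B, T, N, C)
        # Each frame attends to its own tokens plus the reference frame's tokens,
        # keeping the cost close to plain per-frame self-attention.
        kv_src = torch.cat([x, ref.unsqueeze(1).expand(b, t, n, c)], dim=2)
        k, v = self.to_kv(kv_src).chunk(2, dim=-1)             # each (B, T, 2N, C)

        def heads(z: torch.Tensor) -> torch.Tensor:
            # (B, T, M, C) -> (B*T, H, M, d)
            return z.reshape(b * t, z.shape[2], h, d).transpose(1, 2)

        out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        out = out.transpose(1, 2).reshape(b, t, n, c)
        return self.proj(out)
```

Whether the reference tokens are concatenated per frame as above or shared across the whole clip is an implementation detail we are only guessing at here; the paper's layer is also responsible for cross-frame temporal consistency.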

Methodology

Model Architecture

Overall structure of our framework. Given a sequence of intrinsic channels (with optional multimodal conditions including a reference frame, a global text prompt, and locally masked text prompts), our model generates a temporally consistent video. Hybrid Self-Attention is proposed to enhance temporal consistency and fidelity to the reference. Masked Cross-Attention is proposed to effectively apply global and local text prompts to the corresponding global and local regions.
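As a rough illustration of the Masked Cross-Attention described above, the sketch below restricts local text tokens to the image tokens inside their region masks through a boolean attention mask, while global text tokens remain visible everywhere. The shapes, the `token_mask` construction, and the module name are hypothetical; the paper's actual layer may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MaskedCrossAttention(nn.Module):
    """Cross-attention from image tokens to [global text ; local text] tokens.

    Hypothetical shapes (not taken from the paper):
        x          -- (B, N, C)  image tokens of one frame
        text       -- (B, L, C)  concatenated global + local text tokens
        token_mask -- (B, N, L)  True where an image token may attend to a text token
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, text, token_mask):
        b, n, c = x.shape
        h, d = self.num_heads, c // self.num_heads
        q = self.to_q(x).reshape(b, n, h, d).transpose(1, 2)      # (B, H, N, d)
        k, v = self.to_kv(text).chunk(2, dim=-1)
        k = k.reshape(b, -1, h, d).transpose(1, 2)                # (B, H, L, d)
        v = v.reshape(b, -1, h, d).transpose(1, 2)
        # Broadcast the visibility mask over heads. Every image token should be
        # allowed to see at least the global text tokens, otherwise its row of
        # the attention matrix would be fully masked.
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=token_mask.unsqueeze(1))
        return self.proj(out.transpose(1, 2).reshape(b, n, c))


# Building the visibility mask (illustrative): global text tokens are visible to
# all image tokens, local text tokens only to image tokens inside their region mask.
#   token_mask[:, :, :n_global_tokens] = True
#   token_mask[:, region_indices, n_global_tokens:] = True
```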
Recursive Sampling

Recursive Sampling scheme. We sample long videos with a keyframe prediction stage followed by successive stages of frame interpolation.
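The pseudocode below sketches one way this coarse-to-fine schedule can be organized: keyframes are sampled first at a large stride, and each gap is then filled by recursively interpolating midpoints between already-generated frames, so no frame sits at the end of a long autoregressive chain. `predict_keyframes` and `interpolate` are hypothetical placeholders for calls into the diffusion sampler; the paper's interpolation stage likely generates several in-between frames per pass rather than one.

```python
from typing import Callable, Dict, List, TypeVar

Frame = TypeVar("Frame")  # stands in for a latent or decoded frame


def recursive_sampling(
    num_frames: int,
    keyframe_stride: int,
    predict_keyframes: Callable[[List[int]], Dict[int, Frame]],
    interpolate: Callable[[Frame, Frame, int], Frame],
) -> Dict[int, Frame]:
    """Generate frame indices [0, num_frames) coarse-to-fine."""
    # Stage 1: keyframe prediction at a coarse stride gives long-range consistency.
    key_idx = list(range(0, num_frames, keyframe_stride))
    if key_idx[-1] != num_frames - 1:
        key_idx.append(num_frames - 1)
    frames = predict_keyframes(key_idx)

    # Stage 2+: recursively interpolate midpoints between frames that already
    # exist, so errors cannot accumulate across the whole video.
    def fill(lo: int, hi: int) -> None:
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        frames[mid] = interpolate(frames[lo], frames[hi], mid)
        fill(lo, mid)
        fill(mid, hi)

    for lo, hi in zip(key_idx[:-1], key_idx[1:]):
        fill(lo, hi)
    return frames
```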

Comparisons

Qualitative comparisons with XRGB and SVD+CNet, where cyan frames highlight the temporal inconsistencies observed in XRGB. Purple frames indicate incorrect colors or materials produced by SVD+CNet due to the lack of intrinsic knowledge. Red frames illustrate our model's ability to infer reflective surfaces with mirrored objects. Zooming in is recommended for better visualization.

Qualitative comparison 1

Qualitative comparison 2

Qualitative comparison 3

Qualitative comparison 4

Multimodal Controls

Multimodal controls with intrinsic channels, reference images, and text prompts on both global and local regions.

Example 1: Edit intrinsics

Example 2: Edit intrinsics

Example 3: Reference image control

Example 4: Reference image control

Example 5: Text control on global region

Example 6: Text control on global region

Example 7: Text control on local regions

Example 8: Text control on local regions

Generalization to Other Scenes

Generalization to a dynamic scene

Generalization to an outdoor scene

Generalization to an outdoor scene

Generalization to another PBR-rendered scene

Generalization to another PBR-rendered scene

Generalization to another PBR-rendered scene

Generalization to another PBR-rendered scene

Generalization to another PBR-rendered scene