VDT: General-purpose Video Diffusion Transformers via Mask Modeling


A diagram of our unified video diffusion transformer (VDT) with spatial-temporal mask modeling. VDT is a versatile framework built on pure transformer architectures.


Qualitative results (64x256x256) on Sky Time-Lapse.


Qualitative results (64x256x256) on TaiChi.


Qualitative results (32x128x128) on Physion.

Abstract

This work introduces the Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherent in transformers. Additionally, we propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
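To make the block structure concrete, here is a minimal sketch (not the released implementation) of a transformer block with separate temporal and spatial attention over latent patch tokens of shape (batch, frames, patches, dim); the class name, hidden size, and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VDTBlock(nn.Module):
    """Hypothetical VDT block: temporal attention, spatial attention, then an MLP."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # across frames
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)  # across patches
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):
        # x: (batch, frames, patches, dim) latent patch tokens
        b, f, p, d = x.shape

        # Temporal attention: attend over frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, f, d)
        q = self.norm_t(xt)
        xt = xt + self.attn_t(q, q, q)[0]
        x = xt.reshape(b, p, f, d).permute(0, 2, 1, 3)

        # Spatial attention: attend over patches within each frame.
        xs = x.reshape(b * f, p, d)
        q = self.norm_s(xs)
        xs = xs + self.attn_s(q, q, q)[0]
        x = xs.reshape(b, f, p, d)

        # Position-wise feed-forward.
        return x + self.mlp(self.norm_m(x))
```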

VDT offers several appealing benefits. (1) It excels at capturing temporal dependencies to produce temporally consistent video frames, and can even simulate the physics and dynamics of 3D objects over time. (2) It enables flexible conditioning, e.g., simple concatenation in the token space, effectively unifying inputs of different token lengths and modalities. (3) Paired with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser that handles a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion. Extensive experiments on these tasks, spanning scenarios such as autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field.
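As a hedged illustration of benefit (2), conditioning by concatenation in the token space could look like the sketch below; the function name and shapes are assumptions for illustration, not the released API.

```python
import torch


def with_condition(noisy_tokens, cond_tokens):
    """Join conditioning tokens and noisy latent tokens into one sequence.

    noisy_tokens: (batch, n_noisy, dim)  tokens of the frames being denoised
    cond_tokens:  (batch, n_cond, dim)   tokens of the conditioning input; these may
                                         have a different length or come from another
                                         modality, as long as they are projected to dim
    """
    return torch.cat([cond_tokens, noisy_tokens], dim=1)


# After the transformer runs on the joint sequence, only the noisy part would be
# read out for the denoising objective, e.g.:
#   out = transformer(with_condition(noisy, cond))[:, cond.shape[1]:]
```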


Illustration of our video diffusion transformer (VDT). Left: VDT block with temporal and spatial attention. Middle: the diffusion pipeline of our VDT. Right: we uniformly sample frames and then project them into the latent space using a pre-trained VAE tokenizer.
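The right-hand side of the figure (frames to latent tokens) could be sketched as follows, assuming a Stable-Diffusion-style image VAE from diffusers and a 2x2 latent patch size; the checkpoint name and patch size are assumptions, not necessarily the paper's exact configuration.

```python
import torch
from diffusers import AutoencoderKL

# Example checkpoint; which VAE the paper uses is not specified here.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()


@torch.no_grad()
def frames_to_tokens(frames, patch=2):
    """frames: (batch, num_frames, 3, H, W) in [-1, 1] -> (batch, num_frames, patches, dim)."""
    b, f, c, h, w = frames.shape
    # Encode every frame independently with the pre-trained image VAE.
    latents = vae.encode(frames.reshape(b * f, c, h, w)).latent_dist.sample()
    latents = latents * 0.18215                      # standard SD-VAE scaling factor
    lc, lh, lw = latents.shape[1:]
    # Split each latent map into non-overlapping patches and flatten each into a token.
    tokens = latents.reshape(b * f, lc, lh // patch, patch, lw // patch, patch)
    tokens = tokens.permute(0, 2, 4, 1, 3, 5)
    return tokens.reshape(b, f, (lh // patch) * (lw // patch), lc * patch * patch)
```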


Illustration of our unified spatial-temporal mask modeling mechanism. Under this unified framework, we can modulate the spatial-temporal mask M to incorporate additional video generation tasks into the VDT training process. This ensures that a well-trained VDT can be effortlessly applied to various video generation tasks.
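As a hedged illustration of how the mask M might be modulated per task, the sketch below builds binary spatial-temporal masks in which 1 marks observed (conditioning) latent positions and 0 marks positions the model must generate; the function name, mask resolution, and specific frame indices are assumptions, not the paper's exact settings.

```python
import torch


def make_mask(task, num_frames=16, h=16, w=16):
    """Return a (num_frames, h, w) mask: 1 = observed, 0 = to be generated."""
    m = torch.zeros(num_frames, h, w)                   # default: generate everything
    if task == "unconditional":
        pass                                            # no observed content
    elif task == "prediction":
        m[:2] = 1                                       # first frames observed, predict the rest
    elif task == "bidirectional_prediction":
        m[0] = 1
        m[-1] = 1                                       # first and last frames observed
    elif task == "interpolation":
        m[::4] = 1                                      # sparse key frames observed
    elif task == "image_to_video":
        m[0] = 1                                        # a single observed frame
    elif task == "completion":
        m[:] = 1
        m[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0  # hide a spatial region in all frames
    return m
```

One natural reading of the figure is that, during training, observed positions keep clean latent tokens while masked positions receive noised ones (e.g. x = M * clean + (1 - M) * noisy), so a single VDT learns all of these tasks jointly.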


Unconditional Generation, Sky Time-Lapse.


Spatial-Temporal Video Completion, Sky Time-Lapse.


Bi-directional Video Prediction, Sky Time-Lapse.


Arbitrary Video Interpolation, Sky Time-Lapse.


Image-to-video Generation, Sky Time-Lapse.


Qualitative results (16x256x256) on TaiChi-HD.


Qualitative results (30x128x128) on Cityscapes.