REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

¹ Adobe Research ² Northeastern University

Reconstruction Comparison (4× Temporal Compression, z_dim=8)

We provide visual examples of reconstruction at 4× temporal compression (8×8 spatial, 4× temporal) with 8 latent channels, comparing against WF-VAE, VidTok, and MAGVIT-v2 at a resolution of 512×512. Our implementation of MAGVIT-v2 performs competitively with WF-VAE and VidTok, but all three lag behind our method in challenging scenarios such as text, faces, and temporal consistency.
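To make the compression setting concrete, the sketch below works out the latent grid size implied by 8×8×4 compression with 8 latent channels at 512×512. It is illustrative arithmetic only, not the authors' code: the function names, the 16-frame clip length, and the assumption that every dimension divides evenly by its compression factor are all hypothetical (tokenizers that treat the first frame separately would give a different frame count).

```python
def latent_shape(frames, height, width, spatial=8, temporal=4, z_dim=8):
    """Return (channels, frames, height, width) of the latent grid.

    Assumes plain division by the compression factors (hypothetical
    simplification; no special handling of the first frame).
    """
    if frames % temporal or height % spatial or width % spatial:
        raise ValueError("input dimensions must be divisible by the compression factors")
    return (z_dim, frames // temporal, height // spatial, width // spatial)


def compression_ratio(frames, height, width, rgb_channels=3, **kwargs):
    """Ratio of input values (RGB video) to latent values."""
    c, t, h, w = latent_shape(frames, height, width, **kwargs)
    return (rgb_channels * frames * height * width) / (c * t * h * w)


# Hypothetical 16-frame clip at the 512x512 resolution used on this page.
shape = latent_shape(16, 512, 512)
ratio = compression_ratio(16, 512, 512)
print(shape, ratio)
```

Under these assumptions the ratio depends only on the compression factors and channel counts, not on the clip length or resolution.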

[Video comparison panels: Reference, WF-VAE, VidTok, MAGVIT-v2, REGEN]

Yitian Zhang¹,²*

Long Mai¹

Aniruddha Mahapatra¹

David Bourgin¹

Yicong Hong¹

Jonah Casebeer¹

Feng Liu¹

Yun Fu²
