We provide visual reconstruction examples at 4x temporal compression (8x8x4) with 8 latent channels, comparing against WF-VAE, VidTok, and MAGVIT-v2 at a resolution of 512x512. Our implementation of MAGVIT-v2 performs competitively with WF-VAE and VidTok, but all three lag behind our method in challenging scenarios, e.g., text, faces, and temporal consistency.
[Video comparison panels: Reference | WF-VAE | VidTok | MAGVIT-v2 | REGEN]