We provide visual reconstruction examples at 16x temporal compression (8x8x16, i.e. 8x along each spatial axis and 16x along time) with 8 latent channels, and compare against MAGVIT-v2 at a resolution of 512x512. As the temporal compression rate increases, MAGVIT-v2 exhibits markedly more severe visual artifacts, whereas our method preserves spatiotemporal structure much better.
Panels: Reference, MAGVIT-v2, REGEN.
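To make the 8x8x16 setting concrete, the following is a minimal sketch of the arithmetic behind the latent size. It assumes plain integer division along each axis, RGB input, and a 16-frame clip; the clip length, the helper names `CompressionConfig`, `latent_shape`, and `compression_ratio`, and the divisibility requirement are illustrative assumptions, not details taken from either tokenizer (causal tokenizers, for instance, may treat the first frame separately).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionConfig:
    """Hypothetical container for the 8x8x16 setting described above."""
    spatial_h: int = 8        # downsampling factor along height
    spatial_w: int = 8        # downsampling factor along width
    temporal: int = 16        # downsampling factor along time
    latent_channels: int = 8  # channels per latent position


def latent_shape(frames, height, width, cfg=CompressionConfig()):
    """Return (T', H', W', C) of the latent, assuming exact divisibility."""
    if frames % cfg.temporal or height % cfg.spatial_h or width % cfg.spatial_w:
        raise ValueError("input dimensions must be divisible by the compression factors")
    return (
        frames // cfg.temporal,
        height // cfg.spatial_h,
        width // cfg.spatial_w,
        cfg.latent_channels,
    )


def compression_ratio(frames, height, width, rgb_channels=3, cfg=CompressionConfig()):
    """Ratio of input values to latent values (element count, not bits)."""
    t, h, w, c = latent_shape(frames, height, width, cfg)
    return (frames * height * width * rgb_channels) / (t * h * w * c)


if __name__ == "__main__":
    # Assumed clip: 16 frames at 512x512.
    print(latent_shape(16, 512, 512))
    print(compression_ratio(16, 512, 512))
```

Under these assumptions, each latent position with 8 channels stands in for an 8x8x16 block of RGB pixels, so the element-count ratio depends only on the compression factors and channel counts, not on the clip size.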