We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Within this framework, we develop a dedicated latent conditioning module that conditions the DiT decoder on the encoded video latent. Our experiments show that our approach achieves superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we train video embedders with a temporal compression ratio of up to 32× (8× higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost for latent diffusion model training and inference.
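Schematically (our shorthand below, not notation taken from the paper), the shift is from a decoder that must reproduce the input to a conditional generator that only needs to produce a plausible video consistent with the latent:

\[
\text{conventional: } \hat{x} = D(E(x)), \qquad \text{encoder-generator: } z = E(x), \;\; \hat{x} \sim p_{\theta}(\cdot \mid z),
\]

where \(p_{\theta}\) is realized by the conditional DiT decoder.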
We provide visual examples of reconstruction at 32× temporal compression (8×8×32) with 8 latent channels. We compare against MAGVIT-v2 at a resolution of 256×256, since MAGVIT-v2 runs out of memory at 512×512.
We push the temporal compression rate to the extreme case of 32×, where our method preserves reconstruction quality to a much greater extent than MAGVIT-v2.
For more 32× temporal compression comparison results, please refer to our Gallery > Reconstruction Comparison 32× page.
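As a back-of-the-envelope illustration of what 8×8×32 compression with 8 latent channels means (our own arithmetic; the exact first-frame and temporal padding handling in REGEN may differ):

    # Rough shape arithmetic for 8x8x32 (H x W x T) downsampling with 8 latent channels.
    # Illustrative only: exact first-frame / temporal padding handling may differ.
    def latent_shape(c, t, h, w, s_hw=8, s_t=32, latent_channels=8):
        return (latent_channels, t // s_t, h // s_hw, w // s_hw)

    video = (3, 32, 256, 256)                       # RGB clip: C x T x H x W
    latent = latent_shape(*video)                   # -> (8, 1, 32, 32)
    reduction = (3 * 32 * 256 * 256) / (8 * 1 * 32 * 32)
    print(latent, reduction)                        # (8, 1, 32, 32) 768.0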
Video comparison: Reference | MAGVIT-v2 | REGEN.
We provide additional visual examples of text-to-video generation in our ultra-compact latent space with 32× temporal compression; the videos are generated at a resolution of 192×320.
These early-stage results suggest that we can generate videos with plausible content using a 5× reduction in the number of latent frames compared to the current state-of-the-art MAGVIT-v2.
For more 32× temporal latent space video generation results, please refer to our Gallery > Text-to-Video Generation 32× Latent page.
Note: Hover over the text to see the full prompt.
A young woman with vibrant red hair, adorned with a whimsical leafy crown...
A cinematic documentary hand held close up of a woman standing...
Close up shot of a woman, police lights flashing in background, cinematic...
A dramatic close-up shows an elderly man in harsh red light...
A cat wearing sunglasses and working as a lifeguard at a pool.
A side view of an owl sitting in a field.
A slow cinematic push in on an ostrich standing in a 1980s kitchen.
A sheep behind a fence looking at the camera.
A stunning aerial drone footage time lapse of El Capitan in Yosemite...
Time lapse at the snow land with aurora in the sky, 4k, high resolution.
Milk dripping into a cup of coffee, high definition, 4k.
Beer pouring into glass, low angle video shot.
Although REGEN, like most diffusion models, inherits a costly iterative inference process, it is tasked with an easier generation problem than pure or weakly-conditioned generation. We observe that the number of sampling steps can be reduced significantly (even down to 1-step sampling) without noticeable degradation in visual quality.
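To make the step-count knob concrete, below is a textbook DDIM-style sampling loop in which num_steps simply selects a subset of the training timesteps. This is generic background, not REGEN's actual sampler or noise parameterization, and the conditioning z stands in for the latent produced by the encoder:

    # Generic DDIM-style sampling with a configurable number of steps, conditioned on a
    # latent z. Textbook sketch for illustration, NOT REGEN's exact sampler.
    import numpy as np

    def ddim_sample(eps_model, z, shape, num_steps, T=1000, seed=0):
        betas = np.linspace(1e-4, 0.02, T)
        alpha_bar = np.cumprod(1.0 - betas)
        # Evenly spaced subset of the T training timesteps (e.g. 1, 10, 50, or 100 steps).
        timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
        x = np.random.default_rng(seed).standard_normal(shape)
        for i, t in enumerate(timesteps):
            eps = eps_model(x, t, z)                                    # predicted noise
            x0 = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
            ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
            x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps   # deterministic DDIM step
        return x

    # Dummy noise predictor that ignores the conditioning, just to show the call pattern:
    out = ddim_sample(lambda x, t, z: np.zeros_like(x), z=None, shape=(8, 1, 32, 32), num_steps=10)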
Video comparison: Reference | 1 step | 10 steps | 50 steps | 100 steps.
Conventional DiT architectures often struggle to generalize to unseen input sizes, but REGEN is flexible with respect to aspect ratios and resolutions thanks to its content-aware positional encoding scheme. In-context conditioning produces gridding artifacts at larger resolutions, while REGEN generalizes well due to the proposed content-aware PE.
Video comparison: Reference | In-context | Ours.
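The content-aware PE itself is not detailed on this page; as generic background only, the snippet below shows one standard way to make positional encodings resolution-agnostic, by evaluating sinusoidal features on normalized coordinates so that the same module covers any token grid. REGEN's content-aware scheme additionally depends on the conditioning content and may differ substantially:

    # Generic background (not REGEN's content-aware PE): a sinusoidal positional encoding
    # evaluated on NORMALIZED coordinates, so the same module handles any H x W token grid.
    import torch

    def norm_coord_pe(h, w, dim=64):
        ys = torch.linspace(0.0, 1.0, h)                 # normalized row coordinates
        xs = torch.linspace(0.0, 1.0, w)                 # normalized column coordinates
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([yy, xx], dim=-1)           # (h, w, 2)
        freqs = 2.0 ** torch.arange(dim // 4)            # geometric frequency ladder
        ang = coords[..., None] * freqs * torch.pi       # (h, w, 2, dim//4)
        pe = torch.cat([ang.sin(), ang.cos()], dim=-1)   # (h, w, 2, dim//2)
        return pe.flatten(-2)                            # (h, w, dim)

    # The same module evaluates on a 16x16 or a 48x80 grid without a fixed learned table.
    print(norm_coord_pe(16, 16).shape, norm_coord_pe(48, 80).shape)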
The overall framework of REGEN. Our spatiotemporal video encoder \(E(\cdot)\) encodes the input video \(x_{input}\) into two latent frames \((z_{c}, z_{m})\), which are processed by our latent conditioning module \(C_{e}\) and serve as the conditioning for the generative decoder.
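As a rough picture of this data flow, here is a PyTorch-style sketch with placeholder module bodies and assumed shapes (the real encoder, conditioning module, and DiT decoder are substantially more involved):

    # Rough data-flow sketch of an encoder-generator embedder (placeholder modules, not
    # REGEN's actual architecture). E(.) maps a clip to two latent frames (z_c, z_m); the
    # conditioning module C_e turns them into per-token conditioning for a generative
    # (diffusion) decoder that synthesizes the reconstruction.
    import torch
    import torch.nn as nn

    class ToyEncoder(nn.Module):
        """Stand-in for the spatiotemporal encoder E: (B, 3, T, H, W) -> z_c, z_m."""
        def __init__(self, latent_channels=8, s=8):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool3d((2, None, None))   # collapse time to 2 "frames"
            self.proj = nn.Conv3d(3, latent_channels, kernel_size=(1, s, s), stride=(1, s, s))
        def forward(self, x):
            z = self.proj(self.pool(x))                         # (B, C, 2, H/s, W/s)
            z_c, z_m = z[:, :, 0], z[:, :, 1]                   # content / motion latent frames
            return z_c, z_m

    class ToyConditioning(nn.Module):
        """Stand-in for C_e: map (z_c, z_m) to per-token conditioning features."""
        def __init__(self, latent_channels=8, hidden=256):
            super().__init__()
            self.proj = nn.Linear(2 * latent_channels, hidden)
        def forward(self, z_c, z_m):
            feat = torch.cat([z_c, z_m], dim=1).flatten(2).transpose(1, 2)   # (B, H*W, 2C)
            return self.proj(feat)                                           # (B, H*W, hidden)

    # One decoding call: the generative decoder denoises toward a plausible clip given cond.
    x = torch.randn(1, 3, 32, 256, 256)
    enc, cond_mod = ToyEncoder(), ToyConditioning()
    z_c, z_m = enc(x)
    cond = cond_mod(z_c, z_m)              # fed to the DiT decoder at every sampling step
    print(z_c.shape, z_m.shape, cond.shape)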
Latent conditioning module \(C_{e}\). The SIREN network \(M_{t}\) maps the time coordinate \(t_{f}\) to a feature vector, which is modulated by the motion latent \(z_{m}\). The resulting feature is concatenated with the feature value of \(z_{c}\) at the corresponding spatial coordinate \((x, y)\), and the concatenated feature is mapped into the DiT hidden dimension by the projector \(M_{s}\). We use the SIREN prediction for the first frame to replace the first frame of the expanded \(z_{c}\), ensuring a consistent representation for both image and video inputs.
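A minimal sketch of the flow described in this caption, under our own assumptions (SIREN width/depth, the FiLM-style modulation by \(z_{m}\), and all shapes are illustrative; the first-frame replacement step is omitted for brevity):

    # Minimal sketch of the latent conditioning flow described above. Shapes, widths, and the
    # FiLM-style modulation are our assumptions for illustration, not REGEN's exact design.
    import torch
    import torch.nn as nn

    class SirenLayer(nn.Module):
        """Linear layer followed by a sine activation (the defining SIREN ingredient)."""
        def __init__(self, in_dim, out_dim, w0=30.0):
            super().__init__()
            self.linear, self.w0 = nn.Linear(in_dim, out_dim), w0
        def forward(self, x):
            return torch.sin(self.w0 * self.linear(x))

    class ToyLatentConditioning(nn.Module):
        def __init__(self, latent_channels=8, feat_dim=64, hidden=256):
            super().__init__()
            self.m_t = nn.Sequential(SirenLayer(1, feat_dim), SirenLayer(feat_dim, feat_dim))
            # Assumed FiLM-style modulation of the time feature by the motion latent z_m.
            self.film = nn.Linear(latent_channels, 2 * feat_dim)
            self.m_s = nn.Linear(feat_dim + latent_channels, hidden)   # projector M_s

        def forward(self, z_c, z_m, num_frames):
            t_f = torch.linspace(0.0, 1.0, num_frames).view(1, num_frames, 1)  # time coords
            time_feat = self.m_t(t_f).unsqueeze(2).unsqueeze(2)                 # (1, T, 1, 1, F)
            scale, shift = self.film(z_m.permute(0, 2, 3, 1)).chunk(2, dim=-1)  # (B, H, W, F) each
            mod = time_feat * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)     # (B, T, H, W, F)
            content = z_c.permute(0, 2, 3, 1).unsqueeze(1).expand(-1, num_frames, -1, -1, -1)
            return self.m_s(torch.cat([mod, content], dim=-1))                  # (B, T, H, W, hidden)

    cond = ToyLatentConditioning()(torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32), num_frames=33)
    print(cond.shape)   # torch.Size([1, 33, 32, 32, 256])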
@misc{zhang2025regenlearningcompactvideo,
title={REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder},
author={Yitian Zhang and Long Mai and Aniruddha Mahapatra and David Bourgin and Yicong Hong and Jonah Casebeer and Feng Liu and Yun Fu},
year={2025},
eprint={2503.08665},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.08665},
}
The website template is taken from ProMAG (which was built on DreamFusion's project page).
* This work was done while Yitian Zhang was an intern at Adobe Research.