PhD Student @NYU_Courant; Representation Learning; Generative Models; Multimodal Learning; BS @sjtu1896, ACM class

New York, NY
Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
We challenge the assumption that pretrained encoders like DINOv2 or SigLIP2 can’t reconstruct because they “only capture semantics.” With a simple ViT decoder, frozen semantic encoders can achieve good reconstruction—matching or surpassing SD-VAE. (2/n)
Across all scales, DiT^DH converges faster and achieves lower FID than standard DiT (left). Using DiT^DH-XL on DINOv2-B yields much faster convergence than VAE-based diffusion models (right).
Now we turn to generation. RAE latents are high-dimensional (the DINOv2-B latent is 48× larger than SD-VAE's), but this adds no extra diffusion compute: both yield 256 tokens on a 256×256 image; only the token dimension increases.
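A quick arithmetic check of the token counts. The thread does not spell out the configuration, so the numbers below are assumptions: SD-VAE with 8× downsampling, 4 channels, and a DiT patch size of 2; DINOv2-B with 14-pixel patches on an input resized to 224 and 768-dimensional tokens. `latent_shape` is a hypothetical helper.

```python
def latent_shape(image_size, downsample, channels, patch=1):
    """Return (num_tokens, token_dim) for a square image."""
    grid = image_size // (downsample * patch)
    return grid * grid, channels * patch * patch

# Assumed SD-VAE setup: 8x downsampling, 4 channels, DiT patch size 2.
vae_tokens, vae_dim = latent_shape(256, downsample=8, channels=4, patch=2)
# Assumed DINOv2-B setup: 14-pixel patches on an input resized to 224, 768-dim tokens.
rae_tokens, rae_dim = latent_shape(224, downsample=14, channels=768)

print(vae_tokens, vae_dim)   # 256 16
print(rae_tokens, rae_dim)   # 256 768
print(rae_dim // vae_dim)    # 48
```

Under these assumptions the sequence length is the same for both, and the 48× factor comes entirely from the per-token dimension.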
We further introduce DiT^DH: DiT with a wide Diffusion Head. Inspired by DDT, it adds a shallow, wide head that expands effective width without quadratic FLOPs. Because the head takes the noisy latent as its input, the DiT backbone no longer needs to be wider than the token dimension.
We find that three key components make RAE diffusion stable and effective. 1: match DiT width to token dimension. When the model width is smaller than the token dimension, DiT cannot even overfit a single sample. Width ≥ token dim is essential in high-dimensional latent spaces.
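A rough intuition for the width requirement, using a linear stand-in rather than an actual DiT (this is an illustrative assumption, not the paper's argument). When overfitting a single sample, the model still has to recover full-rank Gaussian noise. Any prediction that passes through `width` channels has rank at most `width`, so by the Eckart-Young theorem its error is bounded below by the energy in the discarded singular values. The constant offset from the fixed sample is dropped here, since a bias term could absorb it.

```python
import numpy as np

def rank_limited_loss_bound(targets, width):
    """Lower bound on the per-sample squared error of any rank-`width` prediction.

    By the Eckart-Young theorem, the best rank-`width` approximation of
    `targets` discards the smallest singular values; their energy remains.
    """
    singular_values = np.linalg.svd(targets, compute_uv=False)
    residual_energy = np.sum(singular_values[width:] ** 2)
    return residual_energy / targets.shape[0]

rng = np.random.default_rng(0)
token_dim = 64  # small illustrative value, not a real encoder dimension
noise = rng.standard_normal((4096, token_dim))  # noise targets the model must recover

losses = {w: rank_limited_loss_bound(noise, w) for w in (16, 32, 64)}
for width, loss in losses.items():
    print(width, loss)
```

The bound only reaches zero once `width` equals the token dimension; narrower widths leave an irreducible residual in this toy setting.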
3: train decoders with noisy latents. RAE decoders see only clean features during training, but diffusion outputs are inherently noisy. Adding Gaussian noise during decoder training smooths the latent distribution and improves robustness at sampling time.
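A minimal sketch of the noise augmentation step, assuming latents shaped `(batch, tokens, token_dim)`. The function name, the uniform sampling of the noise level, and `max_sigma=0.5` are illustrative assumptions; the thread only says Gaussian noise is added during decoder training.

```python
import numpy as np

def add_decoder_training_noise(latents, max_sigma, rng):
    """Perturb clean encoder latents with Gaussian noise of random strength.

    One sigma is drawn per sample, uniformly in [0, max_sigma] (an assumed
    choice), so the decoder sees a range of corruption levels.
    """
    batch = latents.shape[0]
    sigma_shape = (batch,) + (1,) * (latents.ndim - 1)
    sigma = rng.uniform(0.0, max_sigma, size=sigma_shape)
    return latents + sigma * rng.standard_normal(latents.shape)

rng = np.random.default_rng(0)
clean = rng.standard_normal((8, 256, 768))  # stand-in for frozen encoder features
noisy = add_decoder_training_noise(clean, max_sigma=0.5, rng=rng)
print(noisy.shape)
```

The decoder would then be trained to reconstruct the image from `noisy` instead of `clean`.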
2: apply a dimension-dependent schedule shift. Commonly used noise schedules, designed for VAE latents, corrupt RAE latents too slowly. We adopt the SNR schedule shift from SD3 and shift according to latent dimension rather than resolution, which yields a significant improvement.
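A sketch of what such a shift could look like, using the SD3 form t' = αt / (1 + (α − 1)t) with α = sqrt(m / n). Here m is taken to be the total latent dimension (tokens × token dimension) and n a base dimension; `base_dim=4096` and the 768-dimensional token size are assumptions for illustration, not values given in the thread.

```python
import math

def shift_timestep(t, latent_dim, base_dim=4096):
    """SD3-style timestep shift with alpha set from latent dimensionality.

    t lies in [0, 1] with t = 1 meaning pure noise; alpha > 1 pushes t toward 1,
    so more of the schedule is spent at high noise levels.
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

rae_dim = 256 * 768  # 256 tokens x an assumed 768-dim token
for t in (0.25, 0.5, 0.75):
    print(t, round(shift_timestep(t, rae_dim), 4))
```

The endpoints t = 0 and t = 1 are left fixed, and a latent whose dimension equals `base_dim` is left unshifted.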
Together, these enable fast-converging and performant high-dimensional diffusion. DiT-XL on DINOv2-B converges 47× and 16× faster than SiT-XL and REPA-XL, respectively.
❗️We found an inconsistency in ImageNet FID-50K evaluation: some use 50 images per class (balanced), others sample class labels uniformly at random. Balanced sampling consistently gives ~0.1 lower FID, so we re-evaluate prior models under this protocol; see our paper for updated results.
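To make the two protocols concrete, here is a small sketch of how the class labels for 50K generated samples could be drawn in each case. The function names are hypothetical and this is not any particular evaluation codebase.

```python
import numpy as np

NUM_CLASSES = 1000
NUM_SAMPLES = 50_000

def balanced_labels():
    """Exactly NUM_SAMPLES / NUM_CLASSES labels per class."""
    return np.repeat(np.arange(NUM_CLASSES), NUM_SAMPLES // NUM_CLASSES)

def uniform_random_labels(rng):
    """Each label is drawn independently, so per-class counts fluctuate."""
    return rng.integers(0, NUM_CLASSES, size=NUM_SAMPLES)

rng = np.random.default_rng(0)
balanced_counts = np.bincount(balanced_labels(), minlength=NUM_CLASSES)
random_counts = np.bincount(uniform_random_labels(rng), minlength=NUM_CLASSES)

print(balanced_counts.min(), balanced_counts.max())  # 50 50
print(random_counts.min(), random_counts.max())
```

With random labels some classes are over- or under-represented relative to the reference statistics, which is one plausible source of the small FID gap.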
RAE(DINOv2-B) + DiT^DH-XL sets a new state of the art on ImageNet: it reaches an FID of 1.15 (no guidance) on ImageNet 256 and 1.13 (with AutoGuidance) on both ImageNet 256 and 512.
Replying to @JiaweiYang118
Thanks for your interest! I think it's provable (using techniques similar to those in the paper) that AR models also have a loss lower bound. But instead of an integral of the last (n-d) eigenvalues of x_t, it's just the eigenvalues of x itself.
Replying to @WhyNotBerker
Thanks for the catch🙌! We'll fix it in the next version.