Boyang Zheng · Oct 14, 2025 · 2:55 AM UTC

@boyangzheng_

Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)

We challenge the assumption that pretrained encoders like DINOv2 or SigLIP2 can’t reconstruct because they “only capture semantics.” With a simple ViT decoder, frozen semantic encoders can achieve good reconstruction—matching or surpassing SD-VAE. (2/n)

For more technical details, please check out our paper and website! If you have any questions about our work, feel free to reach out! Paper: arxiv.org/abs/2510.11690 Page: rae-dit.github.io/ Thanks to my collaborators @ma_nanye @TongPetersb and adviser @sainingxie! (n/n)

Diffusion Transformers with Representation Autoencoders

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however,...

Boyang Zheng · Oct 14, 2025 · 3:28 AM UTC

Hugging Face paper page: huggingface.co/papers/2510.1…

Across all scales, DiT^DH converges faster and achieves lower FID than standard DiT (left). Using DiT^DH-XL on DINOv2-B yields much faster convergence than VAE-based diffusion models (right).

Now we proceed to examine generation. RAE latents are high-dimensional—DINOv2-B is 48× larger than SD-VAE—but this adds no extra diffusion compute: both yield 256 tokens on a 256x256 image, only the token dimension increases.
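The 48× figure and the matching token count can be checked with a little arithmetic. The sketch below assumes shapes that the thread does not state: a 32×32×4 SD-VAE latent for a 256×256 image, patchified by DiT with patch size 2, and a 16×16 grid of 768-dimensional DINOv2-B tokens.

```python
# Assumed shapes (not stated in the thread).
sd_vae_grid, sd_vae_channels, dit_patch = 32, 4, 2   # SD-VAE latent for a 256x256 image
rae_grid, rae_channels = 16, 768                      # DINOv2-B patch tokens

sd_vae_tokens = (sd_vae_grid // dit_patch) ** 2       # tokens after patchifying
sd_vae_token_dim = sd_vae_channels * dit_patch ** 2   # values per token
rae_tokens = rae_grid ** 2
rae_token_dim = rae_channels

# Ratio of total latent size (tokens x token dimension).
size_ratio = (rae_tokens * rae_token_dim) / (sd_vae_tokens * sd_vae_token_dim)
print(sd_vae_tokens, rae_tokens, size_ratio)  # 256 256 48.0
```

Under these assumptions the sequence length, which drives attention cost, is identical; only the per-token dimension grows from 16 to 768.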

We further introduce DiT^DH: DiT with a wide Diffusion Head. Inspired by DDT, it adds a shallow, wide head that expands effective width without quadratic FLOPs. By reusing the noisy latent as the head input, DiT no longer needs backbone width ≥ token dimension.

We find that three key components make RAE diffusion stable and effective. 1: match DiT width to token dimension. When the model width is smaller than the token dimension, DiT cannot even overfit a single sample. Width ≥ token dim is essential in high-dimensional latent spaces.
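A rough linear analogy, not the paper's experiment, gives some intuition for the width requirement: if full-rank, noise-like data must pass through a bottleneck of width d smaller than the token dimension, the best possible linear reconstruction still leaves an error equal to the energy in the discarded singular directions (the Eckart-Young theorem). The data, sizes, and function name below are illustrative assumptions.

```python
import numpy as np

def best_rank_d_error(x: np.ndarray, d: int) -> float:
    """Mean squared error of the best rank-d linear reconstruction of x."""
    s = np.linalg.svd(x, compute_uv=False)   # singular values, descending
    return float(np.sum(s[d:] ** 2) / x.size)

rng = np.random.default_rng(0)
token_dim = 64
x = rng.standard_normal((4096, token_dim))   # stand-in for high-dimensional noisy latents

for width in (16, 32, 64):
    # The error reaches zero only when width equals the token dimension.
    print(width, round(best_rank_d_error(x, width), 3))
```

A real DiT is nonlinear, so this only illustrates why a narrow model can hit an error floor on noise-like targets.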

3: train decoders with noisy latents. RAE decoders see only clean features during training, but diffusion outputs are inherently noisy. Adding Gaussian noise during decoder training smooths the latent distribution and improves robustness at sampling time.
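A minimal sketch of the noise augmentation, assuming a per-sample standard deviation drawn uniformly from [0, max_sigma]. The sampling distribution, the value 0.8, and the function name are assumptions for illustration; the thread only says that Gaussian noise is added.

```python
import numpy as np

def noisy_latents(z: np.ndarray, max_sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add Gaussian noise with a per-sample std drawn uniformly from [0, max_sigma]."""
    sigma_shape = (z.shape[0],) + (1,) * (z.ndim - 1)   # one sigma per sample, broadcastable
    sigma = rng.uniform(0.0, max_sigma, size=sigma_shape)
    return z + sigma * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 256, 768))               # stand-in for frozen-encoder features
z_train = noisy_latents(z, max_sigma=0.8, rng=rng)   # hypothetical max_sigma
```

The decoder would then be trained to reconstruct the clean image from `z_train` instead of `z`.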

2: apply a dimension-dependent schedule shift. Commonly used noise schedules, designed for VAE latents, corrupt RAE latents too slowly. We adopt the SNR schedule shift from SD3 and shift according to latent dimension rather than resolution, which yields a significant improvement.
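The SD3 shift maps a noise level t to αt / (1 + (α − 1)t). The sketch below assumes the convention that t = 1 is pure noise, that α = sqrt(m / n) for latent dimension m and base dimension n, and that m = 256 × 768 and n = 4096; the thread states none of these values, so treat them as illustrative.

```python
import math

def shift_timestep(t: float, alpha: float) -> float:
    """SD3-style shift of a noise level t in [0, 1]; alpha > 1 pushes t toward more noise."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)

def dimension_alpha(latent_dim: int, base_dim: int) -> float:
    """Shift factor from the ratio of latent dimensions (assumed form: sqrt(m / n))."""
    return math.sqrt(latent_dim / base_dim)

# Assumed dimensions: 256 tokens x 768 channels versus a 4096-dimensional baseline.
alpha = dimension_alpha(latent_dim=256 * 768, base_dim=4096)
print(round(alpha, 3), round(shift_timestep(0.5, alpha), 3))
```

The endpoints are fixed (0 maps to 0 and 1 maps to 1), so the shift only redistributes sampling toward noisier timesteps.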

Together, these enable fast-converging and performant high-dimensional diffusion. DiT-XL on DINOv2-B converges 47x and 16x faster than SiT-XL and REPA-XL, respectively.

❗️We found an inconsistency in ImageNet FID-50K evaluation: some use 50 images per class (balanced), others sample classes uniformly at random. Balanced sampling consistently gives ~0.1 lower FID, so we re-evaluate prior models under this protocol; see our paper for updated results.
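The two protocols differ only in how the 50,000 class labels are chosen. A small sketch of both, with hypothetical function names and an arbitrary seed:

```python
import random
from collections import Counter

NUM_CLASSES, NUM_SAMPLES = 1000, 50_000

def balanced_labels() -> list[int]:
    """Exactly NUM_SAMPLES / NUM_CLASSES labels per class."""
    per_class = NUM_SAMPLES // NUM_CLASSES
    return [c for c in range(NUM_CLASSES) for _ in range(per_class)]

def random_labels(seed: int = 0) -> list[int]:
    """Labels drawn uniformly at random, so per-class counts fluctuate."""
    rng = random.Random(seed)
    return [rng.randrange(NUM_CLASSES) for _ in range(NUM_SAMPLES)]

balanced_counts = Counter(balanced_labels())
random_counts = Counter(random_labels())
print(min(balanced_counts.values()), max(balanced_counts.values()))  # 50 50
print(min(random_counts.values()), max(random_counts.values()))      # spread around 50
```

With random sampling some classes are over- or under-represented in the generated set, which is one plausible reason the two protocols report slightly different FID.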

RAE (DINOv2-B) + DiT^DH-XL sets a new state of the art on ImageNet: it reaches an FID of 1.51 (no guidance) on ImageNet 256 and 1.13 (with AutoGuidance) on both ImageNet 256 and 512.

Boyang Zheng · Oct 14, 2025 · 12:36 PM UTC

Replying to @JiaweiYang118

Thanks for your interest! I think it's provable (using techniques similar to those in the paper) that AR models also have a loss lower bound. But instead of an integral of the last (n-d) eigenvalues of x_t, it's just the eigenvalues of x itself.

Boyang Zheng · Oct 14, 2025 · 12:37 PM UTC

Replying to @WhyNotBerker

Thanks for the catch🙌! We'll fix it in the next version.
