using this DiT variant with a shallow/wide DDT head, we achieve strong image generation results on ImageNet.
some highlights:
> 1.51 FID at 256×256 (without any guidance)
> 1.13 FID at both 256×256 and 512×512 (with auto-guidance)
personally, I don’t think these absolute SOTA FIDs tell the whole story anymore.
what could matter more: how quickly a diffusion model can be trained, because that reflects the quality of its underlying representation.
in this sense, the RAE-based DiT also stands out: it converges extremely fast, reaching 3.71 FID after only 20 epochs of training.
and the samples are very fun to look at: the generated images are remarkably *diverse* and high quality! (6/n)
Oct 14, 2025 · 3:17 AM UTC

