using this DiT variant with a shallow/wide DDT head, we achieve strong image generation results on ImageNet. some highlights:

> 1.51 FID at 256×256 (without any guidance)
> 1.13 FID at both 256×256 and 512×512 (with auto-guidance)

personally, I don’t think these absolute SOTA FIDs tell the whole story anymore. what could matter more: how quickly a diffusion model can be trained, because that reflects the quality of its underlying representation. in this sense, the RAE-based DiT also stands out: it converges extremely fast, reaching 3.71 FID after only 20 epochs of training.

and the samples are very fun to look at: the generated images are remarkably *diverse* and high quality! (6/n)

Oct 14, 2025 · 3:17 AM UTC
