I know op is click-baiting, but let me bite...
fwiw every researcher’s DREAM is to find out their architecture is wrong. If it’s never wrong, that’s a bigger problem. we try to break DiT every day w/ SiT, REPA, REPA-E etc. but you gotta form hypotheses, run experiments, and test, not LARP science in your head...otherwise, your conclusion is not just “wrong,” it’s not even wrong.
okay - more technical take on "what's wrong with DiT" (as of today):
- TREAD is more like stochastic depth; i think the convergence gain comes from a regularization effect that makes the representation stronger (note inference is all standard: all blocks process all tokens). very interesting work, but has nothing to do with whatever OP is saying...
- lightning DiT is a proven, robust upgrade (w/ swiglu, rmsnorm, rope, ps=1), always use that when possible
- no evidence that post-norm is hurting anything
- the biggest fix from the past year is on internal rep learning: repa was first, but now tons of ways to do it (tokenizer-level fix like va-vae/repa-e, concat semantic tokens to noise latents, decoupled arch like ddt, regularizers like dispersive loss or self-representation alignment, etc.)
- always go with stochastic interpolants/flow matching (SiT should be the baseline here)
- use adaln-zero for time embedding, but use cross attn for more complex distributions like text embedding
- but do it right -- use pixart-style shared adaln, otherwise you waste 30% of params for nothing
- sd-vae is the real wrong thing in DiT. it's the elephant in the room: bloated (445.87 GFLOPs for 256^2 images??) and not end to end. again, approaches like va-vae and repa-e are partial fixes, but more progress is coming.
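back-of-envelope sketch for the shared-adaln point above. everything here is an assumption for illustration: a DiT-XL-like config (hidden size 1152, 28 blocks, mlp ratio 4), a per-block adaLN-Zero modulation modeled as one linear layer mapping hidden -> 6*hidden, and a pixart-style variant modeled as one shared modulation layer plus a small learned 6*hidden table per block. the helper names are made up, and patch embed, conditioning embedders and the final layer are ignored, so treat the ratios as rough:

```python
# Rough parameter count: per-block adaLN-Zero vs. a shared (PixArt-style) modulation.
# Assumed config (DiT-XL-like): hidden=1152, depth=28, mlp_ratio=4. Illustrative only.

def linear_params(d_in, d_out):
    """Weights plus bias of a dense layer."""
    return d_in * d_out + d_out

def block_core_params(h, mlp_ratio=4):
    """Attention (qkv + output proj) and MLP (two linears) of one transformer block.
    The norms are assumed to have no learned affine, since adaLN supplies scale/shift."""
    attn = linear_params(h, 3 * h) + linear_params(h, h)
    mlp = linear_params(h, mlp_ratio * h) + linear_params(mlp_ratio * h, h)
    return attn + mlp

def adaln_modulation_params(h):
    """One modulation layer producing 6 vectors (shift/scale/gate for attn and mlp)."""
    return linear_params(h, 6 * h)

def count(h=1152, depth=28):
    core = depth * block_core_params(h)
    per_block = depth * adaln_modulation_params(h)         # one modulation layer per block
    shared = adaln_modulation_params(h) + depth * 6 * h    # one shared layer + per-block table
    return core, per_block, shared

core, per_block, shared = count()
frac_per_block = per_block / (core + per_block)
frac_shared = shared / (core + shared)
print(f"per-block adaLN share of block params: {frac_per_block:.1%}")
print(f"shared adaLN share of block params:    {frac_shared:.1%}")
```

under these assumptions the per-block modulation layers take up roughly a third of the transformer-block parameters, and the shared variant takes up a small fraction of that.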
bros, DiT is wrong.
it's mathematically wrong.
it's formally wrong. there is something wrong with it