Sander Dieleman · May 6, 2026 · 9:29 AM UTC

Sander Dieleman

Pinned Tweet

Sander Dieleman

@sedielem

May 6

My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks. It's the longest one yet👀 Let me know what you think! sander.ai/2026/05/06/flow-ma…

Learning the integral of a diffusion model

A deep dive on flow maps.

sander.ai

177

762

99,077

Sander Dieleman · Jan 24, 2022 · 2:49 PM UTC

Sander Dieleman

@sedielem

24 Jan 2022

A very common trick in neural net training, often omitted in papers: add a tiny number ε (e.g. 1e-10) to any quantity in a denominator or square root, so you don't divide by 0. My advice: always add ε! If it doesn't help, it won't hurt, and you might avoid a few NaN encounters👀

166

1,636
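
A minimal sketch of the ε trick described above, in plain Python. The helper names (safe_divide, safe_norm, normalise) are made up for illustration, and the sketch assumes the denominator is non-negative (a norm, a variance, a sum of squares), so that adding ε can only move it away from zero.

```python
import math

EPS = 1e-10  # the example value from the tweet; the right size depends on dtype and scale


def safe_divide(numerator, denominator, eps=EPS):
    """Divide by a non-negative quantity, padded with eps so it is never exactly 0."""
    return numerator / (denominator + eps)


def safe_norm(vector, eps=EPS):
    """L2 norm with eps added inside the square root."""
    return math.sqrt(sum(v * v for v in vector) + eps)


def normalise(vector, eps=EPS):
    """Scale a vector to (approximately) unit length without dividing by 0."""
    norm = safe_norm(vector, eps)
    return [v / norm for v in vector]


# An all-zero vector would raise ZeroDivisionError without eps.
print(normalise([0.0, 0.0, 0.0]))
print(normalise([3.0, 4.0]))
```

In an autodiff framework the ε inside the square root matters for a second reason: the derivative of sqrt is unbounded at 0, so the gradient can turn into inf or NaN even when the forward value is fine.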

Sander Dieleman · Aug 18, 2025 · 10:40 PM UTC

Sander Dieleman

@sedielem

18 Aug 2025

Measuring information content in bits is very useful. Information theory made digital communication, cryptography and machine learning possible. But information is not just a quantity: it also has a shape. (1/6)

117

1,247

139,916

Sander Dieleman · Dec 15, 2022 · 3:31 PM UTC

Sander Dieleman

@sedielem

15 Dec 2022

Me: "NOOO, you can't just treat spectrograms as images, the frequency and time axes have completely different semantics, there is no locality in frequency and ..." These guys: "Stable diffusion go brrr"

@_akhaliq

15 Dec 2022

Riffusion, real-time music generation with stable diffusion @huggingface model: huggingface.co/riffusion/rif… project page: riffusion.com/about

131

1,167

Sander Dieleman · Sep 2, 2024 · 10:21 AM UTC

Sander Dieleman

@sedielem

2 Sep 2024

Diffusion is the rising tide that eventually submerges all frequencies, high and low 🌊 Diffusion is the gradual decomposition into feature scales, fine and coarse 🗼 Diffusion is just spectral autoregression 🤷🌈

151

1,082

277,987
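
A toy sketch of the "rising tide" picture, under stated assumptions: the signal's power spectrum is taken to follow a power law f^(-alpha) (alpha = 2.0 is an illustrative value, not taken from the tweet), and Gaussian white noise is taken to have a flat spectrum with power noise_std^2. A frequency counts as submerged once the noise power exceeds the signal power there. The function name is hypothetical.

```python
def visible_frequencies(freqs, noise_std, alpha=2.0):
    """Frequencies whose signal power f**-alpha still exceeds the flat noise power."""
    noise_power = noise_std ** 2
    return [f for f in freqs if f ** -alpha > noise_power]


freqs = [1, 2, 4, 8, 16, 32, 64]

# As the noise level rises, high frequencies are submerged first, then low ones.
for noise_std in (0.01, 0.05, 0.2, 1.0):
    print(noise_std, visible_frequencies(freqs, noise_std))
```

Read in reverse (from high noise to low noise), the set of visible frequencies grows from coarse to fine, which is the sense in which sampling resembles autoregression over frequency bands.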

Sander Dieleman · Apr 15, 2025 · 9:39 AM UTC

Sander Dieleman

@sedielem

15 Apr 2025

New blog post: let's talk about latents! sander.ai/2025/04/15/latents…

Generative modelling in latent space

Latent representations for generative models.

sander.ai

193

1,044

175,503

Sander Dieleman · Jul 20, 2023 · 7:24 PM UTC

Sander Dieleman

@sedielem

20 Jul 2023

New blog post: perspectives on diffusion, or how diffusion models are autoencoders, deep latent variable models, score function predictors, reverse SDE solvers, flow-based models, RNNs, and autoregressive models, all at once! sander.ai/2023/07/20/perspec…

Perspectives on diffusion

Perspectives on diffusion, or how diffusion models are autoencoders, deep latent variable models, score function predictors, reverse SDE solvers, flow-based models, RNNs, and autoregressive models,...

sander.ai

186

828

171,520

Sander Dieleman · Jan 9, 2023 · 2:40 PM UTC

Sander Dieleman

@sedielem

9 Jan 2023

New blog post about diffusion language models: benanne.github.io/2023/01/09… Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? And can we do anything about that?

Diffusion language models

Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? Can we do anything about that?

sander.ai

167

831

474,412

Sander Dieleman · Jun 28, 2018 · 12:11 AM UTC

Sander Dieleman

@sedielem

28 Jun 2018

Stacking WaveNet autoencoders on top of each other leads to raw audio models that can capture long-range structure in music. Check out our new paper: arxiv.org/abs/1806.10474 Listen to some minute-long piano music samples: goo.gl/A9nTZa

231

758

Sander Dieleman · May 20, 2025 · 6:07 PM UTC

Sander Dieleman

@sedielem

20 May 2025

As I was saying: it's happening

Google DeepMind

@GoogleDeepMind

20 May 2025

We’ve developed Gemini Diffusion: our state-of-the-art text diffusion model. Instead of predicting text directly, it learns to generate outputs by refining noise, step-by-step. This helps it excel at coding and math, where it can iterate over solutions quickly. #GoogleIO

695

50,654

Sander Dieleman · Jul 5, 2025 · 4:01 PM UTC

Sander Dieleman

@sedielem

5 Jul 2025

Diffusion models have analytical solutions, but they involve sums over the entire training set, and they don't generalise at all. They are mainly useful to help us understand how practical diffusion models generalise. Nice blog + code by Raymond Fan: rfangit.github.io/blog/2025/…

658

41,629
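
To make the "sum over the entire training set" concrete, here is a small sketch of the ideal denoiser for an empirical data distribution. It assumes the noising process is noisy = clean + sigma * gaussian_noise; under that assumption the posterior mean is a softmax-weighted average of all training points. The function name and the tiny dataset are made up for illustration and are not taken from the linked blog post.

```python
import math


def ideal_denoiser(noisy, training_set, sigma):
    """Posterior mean E[clean | noisy] when the data distribution is the training set itself."""
    # One log-weight per training point: negative squared distance over 2 * sigma^2.
    logits = [
        -sum((n - x) ** 2 for n, x in zip(noisy, point)) / (2.0 * sigma ** 2)
        for point in training_set
    ]
    # Softmax over training points, subtracting the max for numerical stability.
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    # Weighted average of the training points, one coordinate at a time.
    return [
        sum(w * point[d] for w, point in zip(weights, training_set)) / total
        for d in range(len(noisy))
    ]


training_set = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

# Low noise: the output collapses onto the nearest training point.
print(ideal_denoiser([0.9, 1.2], training_set, sigma=0.1))
# High noise: the output approaches the mean of the training set.
print(ideal_denoiser([0.9, 1.2], training_set, sigma=100.0))
```

The low-noise behaviour is the reason this solution does not generalise: it can only ever return blends of training examples, and at small sigma it returns one of them almost exactly.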

Sander Dieleman · Dec 16, 2022 · 9:42 PM UTC

Sander Dieleman

@sedielem

16 Dec 2022

First Riffusion, now this. Perhaps pixels are all you need🤔

@_akhaliq

16 Dec 2022

Image-and-Language Understanding from Pixels Only abs: arxiv.org/abs/2212.08045

571

151,440

Sander Dieleman · Jun 2, 2022 · 8:56 PM UTC

Sander Dieleman

@sedielem

2 Jun 2022

This paper is a goldmine for anyone training diffusion models, carefully picking apart theory and practice and showing which choices really matter. I was quite excited to see the authors of the StyleGAN series of papers tackle this topic, and boy do they deliver!

@_akhaliq

2 Jun 2022

Elucidating the Design Space of Diffusion-Based Generative Models abs: arxiv.org/abs/2206.00364 improve efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55

105

582

Sander Dieleman · Aug 19, 2025 · 8:44 PM UTC

Sander Dieleman

@sedielem

19 Aug 2025

New survey on diffusion language models: arxiv.org/abs/2508.10875 (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲

585

105,865

Sander Dieleman · Nov 20, 2024 · 10:55 PM UTC

Sander Dieleman

@sedielem

20 Nov 2024

Story time! Some of the most productive time I spent during my PhD was, ironically, when @kaggle competitions distracted me from my research😁 I learnt to assess and reimplement ideas from papers, and how to properly train neural nets. (1/6)

540

67,246

Sander Dieleman · Feb 11, 2024 · 3:28 PM UTC

Sander Dieleman

@sedielem

11 Feb 2024

This one's easy! That honour goes to "the diffusion bible", as I like to call it. It's been well over a year and I still refer to it several times a week. Very few papers I've read come close, in terms of signal-to-noise ratio. arxiv.org/abs/2206.00364

538

71,620

Sander Dieleman · Mar 6, 2018 · 12:02 PM UTC

Sander Dieleman

@sedielem

6 Mar 2018

"We conclude that the common association between sequence modeling and recurrent nets should be reconsidered, and convolutional nets should be regarded as a natural starting point for sequence modeling tasks." Great to see more work in this direction! arxiv.org/abs/1803.01271

188

510

Sander Dieleman · Oct 5, 2020 · 5:19 PM UTC

Sander Dieleman

@sedielem

5 Oct 2020

Very excited about the renewed focus on iterative refinement as a powerful tool for generative modelling! Here are a few relevant ICLR 2021 submissions: (image credit: github.com/ermongroup/ncsn) (1/3)

502

Sander Dieleman · Aug 4, 2024 · 11:02 PM UTC

Sander Dieleman

@sedielem

4 Aug 2024

The interpretation of diffusion as autoregression in the frequency domain seems to be stirring up a lot of thought! (I may or may not have a new blog post in the works 🧐)

493

47,994

Sander Dieleman · Nov 16, 2023 · 9:02 AM UTC

Sander Dieleman

@sedielem

16 Nov 2023

5-6 years ago I was working on music generation at DeepMind, but let me tell you, this is... something else. Incredibly excited to be able to finally share what our team has been working on!

Demis Hassabis

@demishassabis

16 Nov 2023

Thrilled to share #Lyria, the world's most sophisticated AI music generation system. From just a text prompt Lyria produces compelling music & vocals. Also: building new Music AI tools for artists to amplify creativity in partnership w/YT & music industry deepmind.google/discover/blo…

469

106,230

Sander Dieleman · Jan 31, 2022 · 4:06 PM UTC

Sander Dieleman

@sedielem

31 Jan 2022

New blog post: diffusion models are autoencoders! benanne.github.io/2022/01/31… I take a closer look at this connection, inspired by the work of @YSongStanford, @StefanoErmon, @kjgeras, @RandomlyWalking, @gyomalin_ML and @jaschasd, among others!

Diffusion models are autoencoders

Diffusion models have become very popular over the last two years. There is an underappreciated link between diffusion models and autoencoders.

sander.ai

106

450

Sander Dieleman · Dec 2, 2024 · 7:52 PM UTC

Sander Dieleman

@sedielem

2 Dec 2024

Better VQ-VAEs with this one weird rotation trick! I love papers like this: a simple change to an already powerful technique, that significantly improves results without introducing complexity or hyperparameters. arxiv.org/abs/2410.06424 (h/t lucidrains)

469

28,256

Sander Dieleman · May 26, 2022 · 12:51 PM UTC

Sander Dieleman

@sedielem

26 May 2022

New blog post about the magic of diffusion guidance! benanne.github.io/2022/05/26… Guidance powers the recent spectacular results in text-conditioned image generation (DALL·E 2, Imagen), so the time is right for a closer look at this simple, yet extremely effective technique.

Guidance: a cheat code for diffusion models

A quick post with some thoughts on diffusion guidance

sander.ai

442

Sander Dieleman · Feb 28, 2024 · 11:03 PM UTC

Sander Dieleman

@sedielem

28 Feb 2024

New blog post! Some thoughts about diffusion distillation. Actually, quite a lot of thoughts 🤭 Please share your thoughts as well! sander.ai/2024/02/28/paradox…

The paradox of diffusion distillation

Thoughts on the tension between iterative refinement as the thing that makes diffusion models work, and our continual attempts to make it _less_ iterative.

sander.ai

423

72,583

Sander Dieleman · Oct 10, 2025 · 7:59 PM UTC

Sander Dieleman

@sedielem

10 Oct 2025

In diffusion LMs, discrete methods have all but displaced continuous ones (🥲). Interesting new trend: why not both? Use continuous methods to make discrete diffusion better. Diffusion duality: arxiv.org/abs/2506.10892 CADD: arxiv.org/abs/2510.01329 CCDD: arxiv.org/abs/2510.03206

The Diffusion Duality

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models...

arxiv.org

Sander Dieleman

@sedielem

19 Aug 2025

420

97,369

Sander Dieleman · Nov 30, 2024 · 12:56 PM UTC

Sander Dieleman

@sedielem

30 Nov 2024

The link between diffusion models and optimal transport is still a bit of an enigma to me. One thing that's clear: different diffusion models trained on similar datasets tend to recover similar mappings. If these are generally not OT, in what sense are they optimal instead?

Gabriel Peyré

@gabrielpeyre

30 Nov 2024

I wrote a summary of the main ingredients of the neat proof by Hugo Lavenant that diffusion models do not generally define optimal transport. github.com/mathematical-tour…

381

40,289

Sander Dieleman · Dec 20, 2016 · 9:07 PM UTC

Sander Dieleman

@sedielem

20 Dec 2016

Harmonic networks (@deworrall92 et al.) are fully rotation equivariant convnets. Very cool! arxiv.org/abs/1612.04642 piped.video/watch?v=qoWAFBYO…

162

375

Sander Dieleman · Dec 18, 2020 · 1:46 PM UTC

Sander Dieleman

@sedielem

18 Dec 2020

To synthesise realistic megapixel images, learn a high-level discrete representation with a conditional GAN, then train a transformer on top. Beautiful synergy between adversarial and likelihood-based learning! 🧵 (1/8)

@_akhaliq

18 Dec 2020

Taming Transformers for High-Resolution Image Synthesis pdf: arxiv.org/pdf/2012.09841.pdf abs: arxiv.org/abs/2012.09841 project page: compvis.github.io/taming-tra…

368

Sander Dieleman · Jul 26, 2025 · 9:53 PM UTC

Sander Dieleman

@sedielem

26 Jul 2025

I blog and give talks to help build people's intuition for diffusion models. YouTubers like @3blue1brown and @welchlabs have been a huge inspiration: their ability to make complex ideas in maths and physics approachable is unmatched. Really great to see them tackle this topic!

Grant Sanderson

@3blue1brown

25 Jul 2025

New video on the details of diffusion models: piped.video/iv-5mZ_9CPY Produced by @welchlabs, this is the first in a small series of 3b1b this summer. I enjoyed providing editorial feedback throughout the last several months, and couldn't be happier with the result.

373

26,267

Sander Dieleman · Aug 2, 2024 · 4:54 PM UTC

Sander Dieleman

@sedielem

2 Aug 2024

I gave a 1-hour talk about generative modelling at the EEML 2024 summer school last month. It's mostly an intuitive look at how and why diffusion models actually work -- not unlike the content of my recent blog posts. All summer school talks will be freely available online!🙏

Petar Veličković

@PetarV_93

2 Aug 2024

EEML'24 Day 1 videos are out! 🇷🇸 * Intro to DL (@alfcnz): piped.video/1bBOneUMu3Y?si=KiSk… * Generative modelling + iterative refinement (@sedielem): piped.video/9BHQvQlsVdE?si=x9ir… * AI for Good (@weballergy): piped.video/tJSicw7DPVU?si=RPcp… * Reasoning (@backprop2seed & I): piped.video/CyIuM5eQZ5A?si=90pl…

361

93,815

Sander Dieleman · Aug 28, 2023 · 4:01 PM UTC

Sander Dieleman

@sedielem

28 Aug 2023

New blog post about the geometry of diffusion guidance: sander.ai/2023/08/28/geometr… This complements my previous blog post on the topic of guidance, but it has a lot of diagrams which I was too lazy to draw back then! Guest-starring Bundle, the cutest bunny in ML 🐇

ALT Diagram showing the operations involved in a single step of diffusion sampling (DDPM) with classifier-free guidance.

343

51,235

Sander Dieleman · Nov 28, 2024 · 12:09 AM UTC

Sander Dieleman

@sedielem

28 Nov 2024

IMO VQGAN is why GANs deserve the NeurIPS test of time award. Suddenly our image representations were an order of magnitude more compact. Absolute game changer for generative modelling at scale, and the basis for latent diffusion models. arxiv.org/abs/2012.09841

Taming Transformers for High-Resolution Image Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias...

arxiv.org

354

19,058

Sander Dieleman · Oct 14, 2025 · 4:59 PM UTC

Sander Dieleman

@sedielem

14 Oct 2025

In my blog post on latents for generative modelling, I pointed out that representation learning and reconstruction are two separate tasks (§6.3), which autoencoders try to solve simultaneously. Separating them makes sense. It opens up a lot of possibilities, as this work shows!

Saining Xie

@sainingxie

14 Oct 2025

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

338

33,823

Sander Dieleman · Nov 29, 2022 · 1:26 AM UTC

Sander Dieleman

@sedielem

29 Nov 2022

New paper: continuous diffusion for categorical data We train diffusion language models with cross-entropy, using score interpolation instead of score matching. The training distribution of noise levels is adapted on the fly with time warping. arxiv.org/abs/2211.15089 (1/3)

329

Sander Dieleman · Mar 13, 2019 · 1:00 AM UTC

Sander Dieleman

@sedielem

13 Mar 2019

Likelihood is a great loss fn, it's all about the space you measure it in! Our latest work on hierarchical AR image models (w/ @JeffreyDeFauw, Karen Simonyan): arxiv.org/abs/1903.04933 We generated 128x128 & 256x256 samples for all ImageNet classes: bit.ly/2FJkvhJ (1/2)

324

Sander Dieleman · Aug 22, 2024 · 9:08 PM UTC

Sander Dieleman

@sedielem

22 Aug 2024

It's so much easier to tweet low-effort memes which assert that diffusion is just autoregression in frequency space, than it is to write a blog post about it 🤷 (but I'm doing both!)

310

23,335

Sander Dieleman · Apr 25, 2025 · 1:03 PM UTC

Sander Dieleman

@sedielem

25 Apr 2025

One weird trick for better diffusion models: concatenate some DINOv2 features to your latent channels! Combining latents with PCA components extracted from DINOv2 features yields faster training and better samples. Also enables a new guidance strategy. Simple and effective!

Thodoris Kouzelis @ThKouz

25 Apr 2025

1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture – Low-level image details (via VAE latents) – High-level semantic features (via DINOv2)🧵

295

26,244

Sander Dieleman · Dec 1, 2018 · 4:14 PM UTC

Sander Dieleman

@sedielem

1 Dec 2018

I will be at #NeurIPS2018 to present our work on music generation in the raw audio domain, using a stack of WaveNet autoencoders. Poster #87 on Tuesday Dec 4th, 5PM-7PM! Paper: papers.nips.cc/paper/8023-th… Samples: goo.gl/A9nTZa

287

Sander Dieleman · Oct 29, 2025 · 9:20 AM UTC

Sander Dieleman

@sedielem

29 Oct 2025

Some prefer math and rigour, personally I like intuitive explanations. This monograph has plenty of both! I love how much time is spent linking different perspectives (variational, score-based, flow-based) together. Chapter 6 in particular is really great. Amazing effort! 👏

Chieh-Hsin (Jesse) Lai

@JCJesseLai

29 Oct 2025

Tired to go back to the original papers again and again? Our monograph: a systematic and fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core ideas that shaped diffusion modeling and explains how today’s models work, why they work, and where they’re heading. 🧵You’ll find the link and a few highlights in the thread. We’d love to hear your thoughts and join some discussions! ⚡ Stay tuned for our markdown version, where you can drop your comments!

294

22,109

Sander Dieleman · Mar 24, 2020 · 1:24 PM UTC

Sander Dieleman

@sedielem

24 Mar 2020

New blog post: 'Generating music in the waveform domain' benanne.github.io/2020/03/24… A comprehensive overview of the field and some personal thoughts, based on a tutorial I gave at @ismir2019 with @jordiponsdotme and Jongpil Lee back in November. Comments / feedback welcome!

109

279

Sander Dieleman · Mar 13, 2025 · 11:58 PM UTC

Sander Dieleman

@sedielem

13 Mar 2025

"signal processing meets neural nets" is probably my favourite genre of paper, two great examples: Making Convolutional Networks Shift-Invariant Again by @rzhang88 arxiv.org/abs/1904.11486 Alias-Free Generative Adversarial Networks by Karras et al. arxiv.org/abs/2106.12423

Making Convolutional Networks Shift-Invariant Again

Modern convolutional networks are not shift-invariant, as small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling,...

arxiv.org

Jack Morris

@jxmnop

12 Mar 2025

i wish i knew more about signal processing, because it seems like neural networks often choose to learn things in frequency space: - the grokking algorithm for modular addition (the one discovered by Neel Nanda) did all its math in frequency space - lots of evidence (DeepDream/Chris Olah's work) that CNNs recognize images by composing patterns of high and low frequencies - spectral analysis might be the key to "decoding" what's going on in the residual stream inside transformers (just guessing)

288

25,402

Sander Dieleman · Apr 6, 2017 · 4:48 PM UTC

Sander Dieleman

@sedielem

6 Apr 2017

I've been working on WaveNet autoencoders with @GoogleBrain Magenta. blog post: magenta.tensorflow.org/nsynt… paper: arxiv.org/abs/1704.01279

273

Sander Dieleman · Jan 1, 2025 · 12:24 PM UTC

Sander Dieleman

@sedielem

1 Jan 2025

Why do diffusion models generalise at all? It's not obvious that they would. It turns out underfitting plays an important role, as well as the architectural inductive biases of locality and translation equivariance. What other kinds of symmetry and structure could we hardcode? 🤔

Mason Kamb @MasonKamb

31 Dec 2024

Excited to finally share this work w/ @SuryaGanguli. Tl;dr: we find the first closed-form analytical theory that replicates the outputs of the very simplest diffusion models, with median pixel wise r^2 values of 90%+. arxiv.org/abs/2412.20292

266

24,203

Sander Dieleman · Aug 22, 2024 · 6:07 PM UTC

Sander Dieleman

@sedielem

22 Aug 2024

Think you understand classifier-free diffusion guidance? Think again! These two papers beg to differ😁 arxiv.org/abs/2406.02507 arxiv.org/abs/2408.09000 Both full of really great insights that question prevailing assumptions. cc @jaakkolehtinen @ArwenBradley @PreetumNakkiran

267

17,269

Sander Dieleman · May 14, 2024 · 10:10 PM UTC

Sander Dieleman

@sedielem

14 May 2024

When I started working on generative modelling at @GoogleDeepMind in 2016, it was not a very popular research topic. I remember how we had to emphasise in our papers that these models would learn useful representations, in order to convince reviewers of their utility 😁 (1/6)

263

48,462

Sander Dieleman · Nov 15, 2024 · 11:20 PM UTC

Sander Dieleman

@sedielem

15 Nov 2024

Big fan of straightforward ideas that help to free us from the tyranny of the grid! Off-the-grid ideas often end up being too complicated to catch on, but at a glance, this looks simple enough.

Rohan Choudhury

@rchoudhury997

15 Nov 2024

Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!

260

30,111

Sander Dieleman · Oct 31, 2025 · 4:35 PM UTC

Sander Dieleman

@sedielem

31 Oct 2025

Generative modelling used to be about capturing the training data distribution. Interestingly, this stopped being the case when we started actually using them🤔 We tweak temps, use classifier-free guidance and post-train to get a distribution better than the training data.

267

44,317
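
Two of the knobs mentioned above can be sketched in a few lines. Both functions are generic illustrations with hypothetical names: temperature scaling divides logits before the softmax, and classifier-free guidance extrapolates from the unconditional prediction towards the conditional one. The convention used here is that scale = 1 recovers the conditional prediction; some papers parameterise the scale differently.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature < 1 sharpens the distribution."""
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_free_guidance(unconditional, conditional, scale):
    """Move from the unconditional prediction towards (and past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]


logits = [2.0, 1.0, 0.0]
plain = apply_temperature(logits, 1.0)
sharp = apply_temperature(logits, 0.5)
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=3.0)
print(plain, sharp, guided)
```

Either knob changes the sampled distribution away from what the model was trained to match, which is the point the tweet makes.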

Sander Dieleman · Jun 14, 2024 · 12:07 PM UTC

Sander Dieleman

@sedielem

14 Jun 2024

The noise schedule seems like a pretty important design choice for any diffusion model, but I have sometimes found this concept to be a greater source of confusion than insight😵‍💫 In this blog post, I try to explain why. sander.ai/2024/06/14/noise-s…

Noise schedules considered harmful

The noise schedule is a key design parameter for diffusion models. Unfortunately it is a superfluous abstraction that entangles several different model aspects. Do we really need it?

sander.ai

255

47,043

Sander Dieleman · Feb 21, 2025 · 11:51 AM UTC

Sander Dieleman

@sedielem

21 Feb 2025

If you want to diffuse stuff, its frequency behaviour is important🌊 (sander.ai/2024/09/02/spectra…). For latents, you can shape the spectrum! Like EQ-VAE, they find: equivariance ⇒ better latents. Loving all the recent work on tweaking latents, might be time for another blog post✍️

Diffusion is spectral autoregression

A deep dive into spectral analysis of diffusion models of images, revealing how they implicitly perform a form of autoregression in the frequency domain.

sander.ai

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

21 Feb 2025

Improving the Diffusability of Autoencoders "In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256² and FVD by at least 44% for video generation on Kinetics-700 17 × 256²."

261

25,476

Sander Dieleman · Apr 30, 2025 · 7:24 PM UTC

Sander Dieleman

@sedielem

30 Apr 2025

I'm always quite skeptical of work that addresses a long-standing problem with a relatively simple tweak, but this looks promising: wrap the softmax numerator in ReLU(x - 1), and the denom terms in abs(x - 1) to get rid of attention sinks. Would be nice if it holds up at scale!

zed @zmkzmkz

30 Apr 2025

EARLY PREPRINT: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Why do we use softmax in attention, even though we don’t really need non-zero probabilities that sum to one, causing attention sink and large hidden state activations? Let that sink in.

257

25,103
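
A sketch of the rectified softmax as described in the tweet above, reading "x" as the exponentiated score: the numerator becomes ReLU(exp(s) - 1) and each denominator term becomes |exp(s) - 1|. This is one reading of that one-line description, not the preprint's reference implementation; the function name, the eps value and the rescaling used to avoid overflow are assumptions.

```python
import math


def rectified_softmax(scores, eps=1e-6):
    """ReLU(exp(s) - 1) / (sum of |exp(s) - 1| + eps), computed without overflow."""
    # Multiply numerator and denominator by exp(-shift) so no exp() argument is positive.
    # Note that eps is then applied on the rescaled denominator.
    shift = max(0.0, max(scores))
    terms = [math.exp(s - shift) - math.exp(-shift) for s in scores]
    denominator = sum(abs(t) for t in terms) + eps
    return [max(t, 0.0) / denominator for t in terms]


weights = rectified_softmax([2.0, 0.0, -1.0])
print(weights)       # scores <= 0 receive exactly zero weight
print(sum(weights))  # the weights are not forced to sum to one

print(rectified_softmax([-1.0, -2.0]))  # every score negative: all weights are zero
```

Because the weights may sum to less than one (or to zero), a query is allowed to attend to nothing at all, which is the mechanism the tweet credits with removing attention sinks.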

Sander Dieleman · Sep 25, 2024 · 5:07 PM UTC

Sander Dieleman

@sedielem

25 Sep 2024

We are hiring on the Generative Media team in London: boards.greenhouse.io/deepmin… We work on Imagen, Veo, Lyria and all that good stuff. Come work with us! If you're interested, don't delay -- apply before 5PM tomorrow (UK time).

DeepMind

job-boards.greenhouse.io

258

52,489

Sander Dieleman · Jun 16, 2025 · 9:50 PM UTC

Sander Dieleman

@sedielem

16 Jun 2025

This work uncovers a profound connection between continuous and discrete (non-absorbing) diffusion models, allowing transfer of advanced techniques such as consistency distillation to the discrete setting! Also: amazing title, no notes! 🧑‍🍳😙🤌

Subham Sahoo

@ssahoo_

13 Jun 2025

🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)

250

24,041

Sander Dieleman · Dec 31, 2022 · 2:12 PM UTC

Sander Dieleman

@sedielem

31 Dec 2022

End of year shower thought: Before AlexNet, we used layer-wise pre-training to train neural nets with >2 layers -- backprop just couldn't hack it. Diffusion and autoregression are the new layer-wise pre-training: decompose generation into many steps, train one step at a time!

232

70,920

Sander Dieleman · Jan 25, 2023 · 3:49 PM UTC

Sander Dieleman

@sedielem

25 Jan 2023

Batch normalisation appears to be falling out of favour (probably for the best IMO, so many bugs end up being batchnorm bugs😬). One area where it persists is GAN discriminators (e.g. in StyleGAN-T and VQGAN). Are there any other settings where batchnorm is still hard to avoid?

229

177,072

Sander Dieleman · Dec 16, 2024 · 5:26 PM UTC

Sander Dieleman

@sedielem

16 Dec 2024

Here's Veo 2, the latest version of our video generation model, as well as a substantial upgrade for Imagen 3 🧑‍🍳🚢 (Did I mention we are hiring on the Generative Media team, btw 👀)

Google DeepMind

@GoogleDeepMind

16 Dec 2024

Today, we’re announcing Veo 2: our state-of-the-art video generation model which produces realistic, high-quality clips from text or image prompts. 🎥 We’re also releasing an improved version of our text-to-image model, Imagen 3 - available to use in ImageFX through @LabsDotGoogle. → goo.gle/veo-2-imagen-3

Prompt: An extreme close-up of a craftsperson's hands shaping a glowing piece of pottery on a wheel. Threads of golden, luminous energy connect the potter’s hands to the clay, swirling dynamically with their movements.

ALT Prompt: A portrait of an Asian woman with neon green lights in the background, shallow depth of field.

236

35,328

Sander Dieleman · Mar 11, 2021 · 1:25 AM UTC

Sander Dieleman

@sedielem

11 Mar 2021

🆕Variable-rate discrete representation learning🆕 We learn slowly-varying discrete representations of speech signals, compress them with run-length encoding, and train transformers to model language in the speech domain 🗣️ 📜arxiv.org/abs/2103.06089 🔊vdrl.github.io/

222
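
A small sketch of why slowly-varying discrete codes pair well with run-length encoding: long runs of the same code collapse into (value, length) pairs, so the sequence a transformer has to model gets shorter. The code sequence below is invented, and the paper's actual encoding may differ in its details.

```python
from itertools import groupby


def run_length_encode(tokens):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(tokens)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]


codes = [7, 7, 7, 2, 2, 9, 9, 9, 9, 7]
encoded = run_length_encode(codes)
print(encoded)
print(run_length_decode(encoded) == codes)
```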

Sander Dieleman · Dec 5, 2023 · 3:57 PM UTC

Sander Dieleman

@sedielem

5 Dec 2023

With all the recent work on distilling diffusion models into single-pass models, I've been thinking a lot about diffusion model training as solving a kind of optimal transport problem🚐 (1/6)

226

66,951

Sander Dieleman · May 21, 2025 · 11:16 AM UTC

Sander Dieleman

@sedielem

21 May 2025

In 2022, I worked on text diffusion for a bit and wrote a blog post. Since then, people have regularly asked me about scaling diffusion LLMs. All the while, I was on the first row watching Brendan assemble a cracked team and make it a reality. Now I can stop being coy about it😁

Brendan O'Donoghue

@bodonoghue85

20 May 2025

Excited to share what my team has been working on lately - Gemini diffusion! We bring diffusion to language modeling, yielding more power and blazing speeds! 🚀🚀🚀 Gemini diffusion is especially strong at coding. In this example the model generates at 2000 tokens/sec, including overheads like tokenization, prefill, safety filters etc.

227

17,860

Sander Dieleman · Jan 22, 2025 · 7:15 PM UTC

Sander Dieleman

@sedielem

22 Jan 2025

📢PSA: #NeurIPS2024 recordings are now publicly available! The workshops always have tons of interesting things on at once, so the FOMO is real😵‍💫 Luckily it's all recorded, so I've been catching up on what I missed. Thread below with some personal highlights🧵

226

29,226

Sander Dieleman · Sep 11, 2025 · 12:34 PM UTC

Sander Dieleman

@sedielem

11 Sep 2025

The effective context length of Transformers with local (sliding window) attention layers is usually much shorter than the theoretical maximum. This blog post explains why. Back in 2017 the visualisations in arxiv.org/abs/1701.04128 really changed my perspective on this for CNNs!

Guangxuan Xiao @Guangxuan_Xiao

25 Aug 2025

It's a common belief that L SWA layers (size W) yield an L×W receptive field. My post shows why the effective range is limited to O(W), regardless of depth. The reasons are information dilution and the exponential barrier from residual connections: guangxuanx.com/blog/stacking…

221

30,974
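
A deliberately crude toy for the dilution part of this argument. It does not reproduce the analysis in the linked post: it assumes every layer averages uniformly over its window and has no residual connections. Under those assumptions, the influence of each past position after several layers is a repeated convolution of a uniform kernel. The theoretical receptive field grows with depth, but the weight carried by its far edge shrinks as (1 / window) ** layers. All names are hypothetical.

```python
def convolve(a, b):
    """Full discrete convolution of two lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def influence_profile(window, layers):
    """Influence of each look-back distance after stacking uniform causal windows."""
    kernel = [1.0 / window] * window
    profile = [1.0]
    for _ in range(layers):
        profile = convolve(profile, kernel)
    return profile


def effective_range(profile, mass=0.99):
    """Smallest look-back distance that accounts for `mass` of the total influence."""
    total = 0.0
    for distance, weight in enumerate(profile):
        total += weight
        if total >= mass:
            return distance
    return len(profile) - 1


profile = influence_profile(window=8, layers=6)
print(len(profile))              # number of positions in the theoretical receptive field
print(profile[-1])               # influence of the farthest position
print(effective_range(profile))  # distance that already covers 99% of the influence
```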

Sander Dieleman · Dec 3, 2023 · 4:00 PM UTC

Sander Dieleman

@sedielem

3 Dec 2023

Parameterising neural nets to predict logits and training them using the cross-entropy loss function is an extremely effective combination. This setup works for diffusion models as well, by using score interpolation instead of score matching! See arxiv.org/abs/2211.15089 (§3.1)

Continuous diffusion for categorical data

Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact...

arxiv.org

Fern @hi_tysam

2 Dec 2023

The more I work in ML the more I feel like nearly any loss objective can, and should, be rephrased as its cross-entropy-based analog.

211

75,480
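
A rough sketch of how logits, cross-entropy and a score estimate can fit together, based on my reading of score interpolation: the network outputs logits over tokens, training uses the ordinary cross-entropy loss, and at sampling time the predicted probabilities are used to interpolate between token embeddings, giving an expected clean embedding from which a score estimate follows. The function names, the toy embeddings and the exact form (expected_clean - noisy) / sigma^2 are assumptions for illustration; see §3.1 of the paper for the actual formulation.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Training loss: negative log-probability of the true token."""
    return -math.log(softmax(logits)[target_index])


def interpolated_score(noisy_embedding, logits, embeddings, sigma):
    """Score estimate from the probability-weighted average of token embeddings."""
    probs = softmax(logits)
    expected_clean = [
        sum(p * embedding[d] for p, embedding in zip(probs, embeddings))
        for d in range(len(noisy_embedding))
    ]
    return [(c - n) / sigma ** 2 for c, n in zip(expected_clean, noisy_embedding)]


embeddings = [[1.0, 0.0], [0.0, 1.0]]  # two tokens with 2-d embeddings (toy values)
logits = [0.0, 0.0]                    # the model is undecided between the two tokens

print(cross_entropy(logits, target_index=0))
print(interpolated_score([0.5, 0.5], logits, embeddings, sigma=2.0))
```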

Sander Dieleman · Dec 5, 2024 · 12:05 AM UTC

Sander Dieleman

@sedielem

5 Dec 2024

Diffusion models learn useful internal representations of images, but it's somewhat impractical to use them for feature extraction, because they expect noisy input images. This work suggests a cheap and straightforward way to distill them into clean-input models. Neat!

Nick Stracke

@rmsnorm

4 Dec 2024

🤔 Why do we extract diffusion features from noisy images? Isn’t that destroying information? Yes, it is - but we found a way to do better. 🚀 Here’s how we unlock better features, no noise, no hassle 🧵👇

213

18,621

Sander Dieleman · Jul 28, 2025 · 2:37 PM UTC

Sander Dieleman

@sedielem

28 Jul 2025

Great blog post by @jerryx314 on rotary position embeddings (RoPE) in more than one dimension, with interactive visualisations, a bunch of experimental results, and code! jerryxio.ng/posts/nd-rope/

Simo Ryu

@cloneofsimo

27 Jul 2025

Very nice blogpost on RoPE variants by @jerryx314

213

14,969

Sander Dieleman · Sep 2, 2024 · 10:21 AM UTC

Sander Dieleman

@sedielem

2 Sep 2024

New blog post: sander.ai/2024/09/02/spectra…

Diffusion is spectral autoregression

A deep dive into spectral analysis of diffusion models of images, revealing how they implicitly perform a form of autoregression in the frequency domain.

sander.ai

203

15,759

Sander Dieleman · Apr 3, 2024 · 2:37 PM UTC

Sander Dieleman

@sedielem

3 Apr 2024

10 years ago to the day, I published my first ML-related blog post: sander.ai/posts/ My blogging has been very sporadic over the years, but sharing what I've learnt has been very rewarding, and probably a pretty good career move as well😁 I highly recommend it!

All Posts

An archive of posts.

sander.ai

200

46,565

Sander Dieleman · May 29, 2022 · 12:50 PM UTC

Sander Dieleman

@sedielem

29 May 2022

Replying to @A_K_Nain

I think this is exacerbated by the fact that there are multiple formalisms (e.g. VAE-style, score-based, SDE, ...) and everything has 2-3 different names, depending on who you ask! I strongly recommend @YSongStanford's compendium (with Python notebooks!): yang-song.github.io/blog/202…

210

Sander Dieleman · Apr 14, 2025 · 4:37 PM UTC

Sander Dieleman

@sedielem

14 Apr 2025

Amazing interview with @DrYangSong, one of the key researchers we have to thank for diffusion models. The most important lesson IMO: be fearless! The community's view on score matching was quite pessimistic at the time -- he went against the grain and got it to work at scale!

Slater Stich

@slaterstich

14 Apr 2025

Very excited to share our interview with @DrYangSong. This is Part 2 of our history of diffusion series — score matching, the SDE/ODE interpretation, consistency models, and more. Enjoy!

209

24,297

Sander Dieleman · Jan 27, 2023 · 11:50 AM UTC

Sander Dieleman

@sedielem

27 Jan 2023

Two neat papers about diffusion for high-res images without cascading. Similar observations:
- tuning the noise schedule is really important
- the bulk of computation can be done on a significantly more compact representation
arxiv.org/abs/2301.11093 arxiv.org/abs/2301.10972

209

41,475

Sander Dieleman · Sep 3, 2020 · 1:16 AM UTC

Sander Dieleman

@sedielem

3 Sep 2020

WaveGrad generates waveforms from spectrograms by iteratively following the log-likelihood gradient. The surprising thing is that it needs as few as 6 steps to produce good quality audio! arxiv.org/abs/2009.00713 Seems like the resurgence of score matching is in full swing :)

Heiga Zen (全炳河)

@heiga_zen

3 Sep 2020

Yet another neural vocoder from my team mates in Google Brain is out! The new model, "WaveGrad", is not autoregressive/Flow/GAN. It is based on score matching / diffusion probabilistic models. Check it please!!

205

Sander Dieleman · Jun 7, 2025 · 9:06 AM UTC

Sander Dieleman

@sedielem

7 Jun 2025

If you've read my latest blog post on generative modelling in latent space, this one is a great follow-up about putting things into practice. openworldlabs.ai/blog/traini…

Overworld

@overworld_ai

7 Jun 2025

In this blog post we will summarize some of our findings with training autoencoders for diffusion! We also share some null results we had with a slightly unconventional approach we tried. 1/2

206

21,228

Sander Dieleman · May 1, 2025 · 1:03 AM UTC

Sander Dieleman

@sedielem

1 May 2025

Cool result: the entropy decrease resulting from a diffusion model prediction is equal to a scaled version of the loss🤯 In CDCD (arxiv.org/abs/2211.15089), we linearised the categorical cross-entropy to warp time. This finding makes that possible for Gaussian diffusion models!

Luca Ambrogioni

@LucaAmb

29 Apr 2025

1/4) I am very happy to share our latest work on the information theory of generative diffusion: "Entropic Time Schedulers for Generative Diffusion Models" We find that the conditional entropy offers a natural data-dependent notion of time.

205

19,009

Sander Dieleman · Dec 28, 2016 · 1:00 PM UTC

Sander Dieleman

@sedielem

28 Dec 2016

Lots of interesting work on "fixing" GANs right now: arxiv.org/abs/1612.02780 arxiv.org/abs/1611.04273 arxiv.org/abs/1611.02163 [1/3]

204

Sander Dieleman · May 26, 2025 · 11:22 AM UTC

Sander Dieleman

@sedielem

26 May 2025

I once discovered I'd been training networks without any biases for 3 months, because I forgot y += b in my conv layer implementation 🙃 Turns out it didn't really matter 🤷 although that wasn't quite as well-established at the time, so it was a bit of a shock to find out!

Andrej Karpathy

@karpathy

26 May 2025

Replying to @kalomaze

Deep Learning horror genre 🫣 That fear of a kwarg that isn’t set right, not erroring, only silently making your results slightly worse.

195

19,567

Sander Dieleman · Jan 25, 2025 · 2:54 PM UTC

Sander Dieleman

@sedielem

25 Jan 2025

Nice paper on the trade-off between decoding quality and modelability in 2-stage generative models. I disagree with this framing though: the trade-off is quite clear from an information-theoretic perspective. Do most people really believe this? Maybe it's time for a blog post🤔

Vivek Ramanujan @RamanujanVivek

24 Jan 2025

Happy to (belatedly) share our recent work introducing Causally Regularized Tokenization 📺, matching LlamaGen-3B generation performance with 0.5x the number of tokens/image (256 vs 576) and 0.25x the number of params (770M vs 3B) on ImageNet. arxiv.org/pdf/2412.16326 1/n

190

25,518

Sander Dieleman · Sep 28, 2023 · 1:07 PM UTC

Sander Dieleman

@sedielem

28 Sep 2023

This work shows scalar quantisation is competitive with VQ across a range of tasks, but simplifies things a lot: no codebook collapse, no EMA updates, ... because no codebook! I've been a fan of scalar quantisation for a while, see arxiv.org/abs/1810.05246 arxiv.org/abs/2103.06089

Piano Genie

We present Piano Genie, an intelligent controller which allows non-musicians to improvise on the piano. With Piano Genie, a user performs on a simple interface with eight buttons, and their...

arxiv.org

@_akhaliq

28 Sep 2023

Finite Scalar Quantization: VQ-VAE Made Simple paper page: huggingface.co/papers/2309.1… propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.

187

57,568
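The quantisation scheme described in the quoted abstract can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's code: it assumes each dimension is bounded with tanh and rounded to the nearest integer, and it only handles odd numbers of levels. The function names and the example level configuration are hypothetical.

```python
import math

def fsq_quantize(z, levels):
    # Quantise each dimension of z to one of levels[i] integer values.
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd numbers of levels")
        half = num_levels // 2
        bounded = half * math.tanh(value)  # squashed into (-half, half)
        codes.append(round(bounded))       # integer in [-half, half]
    return codes

def implicit_codebook_size(levels):
    # The codebook is the product of the per-dimension value sets.
    return math.prod(levels)

def code_index(codes, levels):
    # Mixed-radix index of a code vector in the implicit codebook.
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + num_levels // 2)
    return index

levels = [5, 5, 5, 3]  # hypothetical configuration
codes = fsq_quantize([10.0, -10.0, 0.0, 0.3], levels)
index = code_index(codes, levels)
```

Nothing is learned or stored for the codebook itself, which is why there is nothing to collapse. Rounding has no useful gradient, so training would need something like a straight-through estimator, which this sketch leaves out.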

Sander Dieleman · May 20, 2025 · 6:40 PM UTC

Sander Dieleman

@sedielem

20 May 2025

Video isn't really video without audio!

Google DeepMind

@GoogleDeepMind

20 May 2025

Video, meet audio. 🎥🤝🔊 With Veo 3, our new state-of-the-art generative video model, you can add soundtracks to clips you make. Create talking characters, include sound effects, and more while developing videos in a range of cinematic styles. 🧵

185

8,608

Sander Dieleman · Oct 2, 2020 · 12:09 PM UTC

Sander Dieleman

@sedielem

2 Oct 2020

Good advice! For classification models, a scatter plot of the cross-entropy loss vs. prediction entropy (~confidence) for individual examples can be very revealing. More generally: study model behaviour for individual data points, don't look at aggregate statistics exclusively.

Andrej Karpathy

@karpathy

2 Oct 2020

When you sort your dataset descending by loss you are guaranteed to find something unexpected, strange and helpful.

182
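A small sketch of the per-example diagnostic suggested above: compute the cross-entropy loss and the prediction entropy for each example, then sort by loss. The logits and labels are made-up toy values and the function names are hypothetical. Plotting is left out; the `points` list could be handed to any scatter plot routine.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_entropy(logits, label):
    # Returns (cross-entropy loss, entropy of the predicted distribution).
    probs = softmax(logits)
    loss = -math.log(probs[label])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return loss, entropy

# Toy (logits, label) pairs standing in for model outputs on a dataset.
examples = [
    ([4.0, 0.0, 0.0], 0),  # confident and correct
    ([4.0, 0.0, 0.0], 1),  # confident and wrong
    ([0.1, 0.0, 0.1], 2),  # unsure
]

# One (loss, entropy) point per example, ready for a scatter plot.
points = [loss_and_entropy(logits, label) for logits, label in examples]

# Sort descending by loss to surface the most surprising examples first.
ranked = sorted(zip(points, examples), key=lambda pair: pair[0][0], reverse=True)
```

High loss combined with low entropy marks confidently wrong predictions, which are often the first place to look for label noise or preprocessing bugs.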

Sander Dieleman · Aug 15, 2018 · 10:43 PM UTC

Sander Dieleman

@sedielem

15 Aug 2018

Invertible neural networks are really cool! Check out this excellent blog post about a new paper where they are used to analyse inverse problems: hci.iwr.uni-heidelberg.de/vi… paper: arxiv.org/abs/1808.04730 (1/4)

178

Sander Dieleman · Jan 30, 2024 · 11:45 PM UTC

Sander Dieleman

@sedielem

30 Jan 2024

I've got a blog post brewing... maybe even two blog posts! They are about diffusion models🙃

177

27,448

Sander Dieleman · Nov 17, 2024 · 3:31 PM UTC

Sander Dieleman

@sedielem

17 Nov 2024

I was working on autoregressive models around that time, but instead of RNNs and language modelling, we were trying to make convolutional nets generate music by producing 16000+ audio waveform amplitude values per second. No regrets😎 (Never got tempted by RL🤭)

Andrej Karpathy

@karpathy

17 Nov 2024

Replying to @BldrInvstTech

I don’t know why I didn’t work on this at early OpenAI, despite going around everywhere giving talks about the magic of autoregressive language models around that time. I went deep into RL like everyone else that time. Biggest, most confusing research career mistake ever

179

26,641

Sander Dieleman · Apr 9, 2016 · 7:05 PM UTC

Sander Dieleman

@sedielem

9 Apr 2016

I've uploaded my PhD thesis "Learning feature hierarchies for musical audio signals", which I defended in January: dropbox.com/s/22bqmco45179t7…

thesis-FINAL.pdf

Shared with Dropbox

dropbox.com

171

Sander Dieleman · Nov 13, 2023 · 4:54 PM UTC

Sander Dieleman

@sedielem

13 Nov 2023

At the end of the summer, I gave an invited talk at the @M2lSchool in Thessaloniki about training neural networks. It's a bit of a jumble of ideas, suggestions and best practices I've amassed over the years, interspersed with concrete examples. piped.video/watch?v=wO4quYlQ…

2023 5.3 How to train Neural Networks effectively - Sander Dieleman

youtube.com

173

39,221

Sander Dieleman · Aug 26, 2016 · 10:47 PM UTC

Sander Dieleman

@sedielem

26 Aug 2016

tl;dr: connect every CNN layer to every other layer. Simple but effective idea, well-written paper. Worth a read!

Brandon Amos @brandondamos

26 Aug 2016

Densely Connected Convolutional Networks arxiv.org/abs/1608.06993

172

Sander Dieleman · Mar 3, 2023 · 2:23 AM UTC

Sander Dieleman

@sedielem

3 Mar 2023

If diffusion model sampling tries your patience, check out consistency models: single-step sampling! No adversarial loss! In addition to being a very cool idea, this paper significantly leans on the formalism from Karras et al. 2022 AKA my favourite diffusion paper😁 Neat!

@_akhaliq

3 Mar 2023

Consistency Models achieve the new state-of-the-art FID of 3.55 on CIFAR10 and 6.20 on ImageNet 64 × 64 for one-step generation abs: arxiv.org/abs/2303.01469

169

44,556

Sander Dieleman · Dec 2, 2024 · 6:34 PM UTC

Sander Dieleman

@sedielem

2 Dec 2024

In arxiv.org/abs/2303.00848, @dpkingma and @RuiqiGao had suggested that noise augmentation could be used to make other likelihood-based models optimise perceptually weighted losses, like diffusion models do. So cool to see this working well in practice!

Understanding Diffusion Objectives as the ELBO with Simple Data...

To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized with objectives that typically look very different from the maximum likelihood and the Evidence Lower...

arxiv.org

Michael Tschannen @mtschannen

2 Dec 2024

Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/

176

19,613

Sander Dieleman · Oct 24, 2020 · 4:13 PM UTC

Sander Dieleman

@sedielem

24 Oct 2020

Neat idea: if you fit augmentation params with gradient descent (jointly with model params) using a prior that gently encourages more augmentation, they will naturally drift towards the maximal sensible values, which correspond to the degree of invariance exhibited by the data.

Andrew Gordon Wilson

@andrewgwils

23 Oct 2020

Translation equivariance has imbued CNNs with powerful generalization abilities. Our #NeurIPS2020 paper shows how to *learn* symmetries -- rotations, translations, scalings, shears -- from training data alone! arxiv.org/abs/2010.11882 w/ @g_benton_, @Pavel_Izmailov, @m_finzi. 1/9

171

Sander Dieleman · Feb 17, 2025 · 1:02 PM UTC

Sander Dieleman

@sedielem

17 Feb 2025

There's some interesting ongoing discussion about how much of a "diffusion model" this really is, but anything that even slightly challenges the autoregressive hegemony is a refreshing change!

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

17 Feb 2025

Large Language Diffusion Models Introduces LLaDA-8B, a large language diffusion model that pretrained on 2.3 trillion tokens using 0.13 million H800 GPU hours, followed by SFT on 4.5 million pairs. LLaDA 8B surpasses Llama-2 7B on nearly all 15 standard zero/few-shot learning tasks while performing on par with Llama-3 8B.

165

14,224

Sander Dieleman · May 21, 2025 · 12:58 AM UTC

Sander Dieleman

@sedielem

21 May 2025

Instead of modelling the velocity at each time step t in the process, model the mean velocity across any time interval (r, t). Not the first work to try this, but using the gradient of the mean velocity to define the target is an interesting approach that I haven't seen before.

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

20 May 2025

Mean Flows for One-step Generative Modeling "We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models."

161

22,192
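The relation between mean and instantaneous velocity mentioned above follows from a one-line calculation. The sketch below assumes the mean velocity u over (r, t) is defined as the displacement along the trajectory divided by the interval length. The symbols z_t (the state at time t), v (instantaneous velocity) and u (mean velocity) are notation chosen here for illustration.

```latex
% Definition (assumed): mean velocity over the interval (r, t)
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
\quad\Longleftrightarrow\quad
(t - r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiate both sides with respect to t, holding r fixed
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)
\quad\Longrightarrow\quad
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Total derivative along the trajectory, using dz_t/dt = v(z_t, t)
\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u(z_t, r, t) + \partial_t u(z_t, r, t)
```

The right-hand side of the second line contains only the instantaneous velocity and derivatives of u, so it can serve as a regression target without evaluating the integral. As r approaches t, the correction term vanishes and u reduces to v.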

Sander Dieleman · Sep 1, 2020 · 4:11 PM UTC

Sander Dieleman

@sedielem

1 Sep 2020

New blog post, in which I wax lyrical about typicality and the curse of dimensionality: benanne.github.io/2020/09/01… I tweeted about this concept a while back, but it turns out I have more to say on the topic. It's a bit more speculative than what I usually write, hope you like it!

Musings on typicality

A summary of my current thoughts on typicality, and its relevance to likelihood-based generative models.

sander.ai

159

Sander Dieleman · Apr 14, 2025 · 2:12 PM UTC

Sander Dieleman

@sedielem

14 Apr 2025

New blog post tomorrow... probably 👀

159

10,054

Sander Dieleman · Aug 18, 2025 · 10:40 PM UTC

Sander Dieleman

@sedielem

18 Aug 2025

The original promise of deep learning was to make feature engineering obsolete, and this was relatively successful. But when the inductive biases of neural networks work against us, it can still be useful to shape information, e.g. Fourier features arxiv.org/abs/2006.10739 (5/6)

Fourier Features Let Networks Learn High Frequency Functions in...

We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron (MLP) to learn high-frequency functions in low-dimensional problem domains. These results...

arxiv.org

160

9,277
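A minimal sketch of a Fourier feature mapping of the kind referenced above, in plain Python. It assumes the random Gaussian variant, where an input v is mapped to the cosines and sines of 2πBv for a fixed random matrix B. The function names, the feature count and the scale value are hypothetical choices for illustration.

```python
import math
import random

def make_frequencies(num_features, input_dim, scale, seed=0):
    # Fixed random frequency matrix B with entries drawn from N(0, scale^2).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(input_dim)]
            for _ in range(num_features)]

def fourier_features(v, frequencies):
    # Map v to [cos(2*pi*B v), sin(2*pi*B v)].
    projections = [2.0 * math.pi * sum(b * x for b, x in zip(row, v))
                   for row in frequencies]
    return [math.cos(p) for p in projections] + [math.sin(p) for p in projections]

# Example: embed a 2D coordinate before feeding it to an MLP.
B = make_frequencies(num_features=8, input_dim=2, scale=10.0)
features = fourier_features([0.25, 0.75], B)
```

The scale controls which frequencies the downstream network can represent easily: larger values favour fine detail, smaller values favour smooth functions. B is sampled once and kept fixed rather than trained.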

Sander Dieleman · May 14, 2025 · 4:08 PM UTC

Sander Dieleman

@sedielem

14 May 2025

Here's the third and final part of @slaterstich's "History of diffusion" interview series! The other two interviewees' research played a pivotal role in the rise of diffusion models, whereas I just like to yap about them 😬 this was a wonderful opportunity to do exactly that!

Slater Stich

@slaterstich

14 May 2025

Excited to share our interview with @sedielem! This is Part 3 in our History of Diffusion series. We talk about diffusion as spectral autoregression, diffusion language models, flow matching, and much more. Enjoy!

162

13,266

Sander Dieleman · Jun 1, 2024 · 1:42 PM UTC

Sander Dieleman

@sedielem

1 Jun 2024

This paper does a great job explaining why 2-rectified flow could be a serious contender among the evergrowing abundance of diffusion distillation methods. The resulting model produces good samples in few steps, but retains the full flexibility of an undistilled diffusion model.

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

31 May 2024

Improving the Training of Rectified Flows
abs: arxiv.org/abs/2405.20320
code: github.com/sangyun884/rfpp
Three training tricks:
1. u-shaped timestep distribution
2. LPIPS-Huber loss
3. initializing with pretrained diffusion model (EDM)
Now only requires reflow (distillation) once while comparable to consistency training/distillation at 1-2 inference steps.

161

72,229

Sander Dieleman · Dec 9, 2023 · 4:22 PM UTC

Sander Dieleman

@sedielem

9 Dec 2023

10 years ago today: @avdnoord and I presenting our audio-based music recommendation demo at @NeurIPSConf 2013! We went on to intern at Spotify & Google Play Music the next summer (blog post: sander.ai/2014/08/05/spotify…), and by summer 2015, we had both joined @GoogleDeepMind.

159

22,315

Sander Dieleman · Apr 7, 2017 · 1:11 PM UTC

Sander Dieleman

@sedielem

7 Apr 2017

The TF wrapper we use internally at DeepMind has been open sourced. Lasagne users might like this one, it shares a lot of design principles.

Google DeepMind

@GoogleDeepMind

7 Apr 2017

Excited to release #Sonnet - a library for constructing complex Neural Network models in TensorFlow. Get started: github.com/deepmind/sonnet

155

Sander Dieleman · Jul 25, 2025 · 8:06 PM UTC

Sander Dieleman

@sedielem

25 Jul 2025

Transformers haven't changed much since 2017, but there have been some innovations over the years. This is an excellent summary of architectural differences in recent LLMs. Nice diagrams too! 👏 It would be great to see something like this for diffusion Transformers as well 🤔

Sebastian Raschka

@rasbt

19 Jul 2025

From GPT to MoE: I reviewed & compared the main LLMs of 2025 in terms of their architectural design from DeepSeek-V3 to Kimi 2. Multi-head Latent Attention, sliding window attention, new Post- & Pre-Norm placements, NoPE, shared-expert MoEs, and more... magazine.sebastianraschka.co…

158

12,838

Sander Dieleman · Jul 5, 2025 · 11:16 AM UTC

Sander Dieleman

@sedielem

5 Jul 2025

This looks like a great deep dive on neural network architectures for diffusion models. tl;dr use a Transformer, but there's quite a bit more to it, and as always in this field, the devil is in the details!

Sayak Paul

@RisingSayak

4 Jul 2025

Had the honor to present diffusion transformers at CS25, Stanford. The place is truly magical. Slides: bit.ly/dit-cs25 Recording: piped.video/vXtapCFctTI?si=dlcE… Thanks to @stevenyfeng for making it happen!

158

18,992

Sander Dieleman · Oct 4, 2017 · 6:50 PM UTC

Sander Dieleman

@sedielem

4 Oct 2017

Google Assistant is now powered by WaveNet! deepmind.com/blog/wavenet-la…

149

Sander Dieleman · Sep 28, 2024 · 11:24 PM UTC

Sander Dieleman

@sedielem

28 Sep 2024

Replying to @gowthami_s

Sure! Another way to think of it is texture vs. structure, or sometimes people call this "stuff vs. things". In an image of a dog in a field, the grass texture (stuff) is high-entropy, but we do not perceive individual realisations of this texture, we just perceive it as "grass". If the realisation of this texture is subtly different, we often cannot tell, unless the images are layered directly on top of each other.

This is a fun experiment to try with an adversarial autoencoder: when comparing an original image and its reconstruction side by side, they often look identical. But layering them on top of each other and flipping back and forth often reveals just how different the images are, especially in areas with a lot of texture. For objects (things) on the other hand, like the dog's eyes for example, differences of a similar magnitude would be immediately obvious.

A good adversarial autoencoder will make abstraction of texture, but try to preserve structure. That way, the realisation of the grassy texture in the reconstruction can be different than the original, without it noticeably affecting the fidelity of the reconstruction. This enables the autoencoder to drop a lot of modes (i.e. other realisations of the same texture) and represent the presence of this texture more compactly in its latent space. This in turn should make generative modelling in the latent space easier as well, because it can now model the absence/presence of a texture, rather than having to capture all the entropy associated with that texture.

This is a bit of a caricature, and what happens in reality is probably a bit more complicated, but this is roughly my intuition for why two-stage training is actually preferable over end-to-end, at least in the visual domain.

156

31,660

Sander Dieleman · May 20, 2025 · 6:34 PM UTC

Sander Dieleman

@sedielem

20 May 2025

We can diffuse text now, but we can still diffuse pixels as well!

Google DeepMind

@GoogleDeepMind

20 May 2025

Get ready for Imagen 4 🎨 capable of creating richer images, with more nuanced colors, intricate details and superior typography. Tap each photo below to see more. 👀

153

10,154

Sander Dieleman · Jun 8, 2020 · 12:11 PM UTC

Sander Dieleman

@sedielem

8 Jun 2020

Our latest work on GANs for text-to-speech, from characters/phonemes to waveforms with a single model. Learning varying alignment without teacher forcing is tricky, but we found dynamic time warping (DTW) to be very effective.

Google DeepMind

@GoogleDeepMind

8 Jun 2020

In our new paper [arxiv.org/abs/2006.03575] we propose EATS: End-to-End Adversarial Text-to-Speech, which allows for speech synthesis directly from text or phonemes without the need for multi-stage training pipelines or additional supervision. Audio: bit.ly/2Ya9rRK

144
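For readers unfamiliar with dynamic time warping, here is a sketch of the textbook algorithm on two scalar sequences, in plain Python. It is not the variant used in the paper above: a training loss would need a differentiable relaxation and would operate on feature vectors such as spectrogram frames, neither of which is shown here. The function name and the toy sequences are hypothetical.

```python
def dtw_distance(a, b):
    # Classic dynamic time warping cost between two sequences of numbers.
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# The same contour at different speeds aligns with zero cost.
same_shape = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
# A constant offset cannot be warped away.
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

Because the alignment is found by the recursion rather than fixed in advance, two sequences can be compared even when their timing differs, which is the property that matters when alignment has to be learned without teacher forcing.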