Research Scientist at Google DeepMind (WaveNet, Imagen, Veo). I tweet about deep learning (research + software), music, generative models (personal account).

London, England
My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks. It's the longest one yet👀 Let me know what you think! sander.ai/2026/05/06/flow-ma…
8
177
762
99,077
A very common trick in neural net training, often omitted in papers: add a tiny number ε (e.g. 1e-10) to any quantity in a denominator or square root, so you don't divide by 0. My advice: always add ε! If it doesn't help, it won't hurt, and you might avoid a few NaN encounters👀
30
166
1,636
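A minimal sketch of the ε trick from the post above, in plain Python. The helper names `safe_normalise` and `safe_ratio` are made up for illustration, and 1e-10 is just the example value from the post, not a recommendation for every setting.

```python
import math

EPS = 1e-10  # example value from the post; in low precision (e.g. float16) this underflows to 0


def safe_normalise(values, eps=EPS):
    """Scale a vector to (approximately) unit L2 norm without dividing by zero."""
    norm = math.sqrt(sum(v * v for v in values) + eps)  # eps goes inside the square root
    return [v / norm for v in values]


def safe_ratio(numerator, denominator, eps=EPS):
    """Divide by a non-negative quantity, padded with eps."""
    return numerator / (denominator + eps)


print(safe_normalise([0.0, 0.0, 0.0]))  # stays finite for an all-zero input
print(safe_normalise([3.0, 4.0]))       # essentially unchanged for ordinary inputs
```

When the denominator is far from zero, the result barely moves, which is the "if it doesn't help, it won't hurt" part.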
Measuring information content in bits is very useful. Information theory made digital communication, cryptography and machine learning possible. But information is not just a quantity: it also has a shape. (1/6)
20
117
1,247
139,916
Me: "NOOO, you can't just treat spectrograms as images, the frequency and time axes have completely different semantics, there is no locality in frequency and ..." These guys: "Stable diffusion go brrr"
Riffusion, real-time music generation with stable diffusion @huggingface model: huggingface.co/riffusion/rif… project page: riffusion.com/about
16
131
1,167
Diffusion is the rising tide that eventually submerges all frequencies, high and low 🌊 Diffusion is the gradual decomposition into feature scales, fine and coarse 🗼 Diffusion is just spectral autoregression 🤷🌈
32
151
1,082
277,987
New blog post: perspectives on diffusion, or how diffusion models are autoencoders, deep latent variable models, score function predictors, reverse SDE solvers, flow-based models, RNNs, and autoregressive models, all at once! sander.ai/2023/07/20/perspec…
15
186
828
171,520
New blog post about diffusion language models: benanne.github.io/2023/01/09… Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? And can we do anything about that?
21
167
831
474,412
Stacking WaveNet autoencoders on top of each other leads to raw audio models that can capture long-range structure in music. Check out our new paper: arxiv.org/abs/1806.10474 Listen to some minute-long piano music samples: goo.gl/A9nTZa
6
231
758
As I was saying: it's happening
We’ve developed Gemini Diffusion: our state-of-the-art text diffusion model. Instead of predicting text directly, it learns to generate outputs by refining noise, step-by-step. This helps it excel at coding and math, where it can iterate over solutions quickly. #GoogleIO
7
45
695
50,654
Diffusion models have analytical solutions, but they involve sums over the entire training set, and they don't generalise at all. They are mainly useful to help us understand how practical diffusion models generalise. Nice blog + code by Raymond Fan: rfangit.github.io/blog/2025/…
8
84
658
41,629
First Riffusion, now this. Perhaps pixels are all you need🤔
Image-and-Language Understanding from Pixels Only abs: arxiv.org/abs/2212.08045
13
79
571
151,440
This paper is a goldmine for anyone training diffusion models, carefully picking apart theory and practice and showing which choices really matter. I was quite excited to see the authors of the StyleGAN series of papers tackle this topic, and boy do they deliver!
Elucidating the Design Space of Diffusion-Based Generative Models abs: arxiv.org/abs/2206.00364 improve efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55
1
105
582
New survey on diffusion language models: arxiv.org/abs/2508.10875 (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲
7
91
585
105,865
Story time! Some of the most productive time I spent during my PhD was, ironically, when @kaggle competitions distracted me from my research😁 I learnt to assess and reimplement ideas from papers, and how to properly train neural nets. (1/6)
6
31
540
67,246
This one's easy! That honour goes to "the diffusion bible", as I like to call it. It's been well over a year and I still refer to it several times a week. Very few papers I've read come close, in terms of signal-to-noise ratio. arxiv.org/abs/2206.00364
6
67
538
71,620
"We conclude that the common association between sequence modeling and recurrent nets should be reconsidered, and convolutional nets should be regarded as a natural starting point for sequence modeling tasks." Great to see more work in this direction! arxiv.org/abs/1803.01271
7
188
510
Very excited about the renewed focus on iterative refinement as a powerful tool for generative modelling! Here are a few relevant ICLR 2021 submissions: (image credit: github.com/ermongroup/ncsn) (1/3)
5
99
502
The interpretation of diffusion as autoregression in the frequency domain seems to be stirring up a lot of thought! (I may or may not have a new blog post in the works 🧐)
9
29
493
47,994
5-6 years ago I was working on music generation at DeepMind, but let me tell you, this is... something else. Incredibly excited to be able to finally share what our team has been working on!
Thrilled to share #Lyria, the world's most sophisticated AI music generation system. From just a text prompt Lyria produces compelling music & vocals. Also: building new Music AI tools for artists to amplify creativity in partnership w/YT & music industry deepmind.google/discover/blo…
16
35
469
106,230
Better VQ-VAEs with this one weird rotation trick! I love papers like this: a simple change to an already powerful technique, that significantly improves results without introducing complexity or hyperparameters. arxiv.org/abs/2410.06424 (h/t lucidrains)
3
51
469
28,256
New blog post about the magic of diffusion guidance! benanne.github.io/2022/05/26… Guidance powers the recent spectacular results in text-conditioned image generation (DALL·E 2, Imagen), so the time is right for a closer look at this simple, yet extremely effective technique.
7
96
442
In diffusion LMs, discrete methods have all but displaced continuous ones (🥲). Interesting new trend: why not both? Use continuous methods to make discrete diffusion better. Diffusion duality: arxiv.org/abs/2506.10892 CADD: arxiv.org/abs/2510.01329 CCDD: arxiv.org/abs/2510.03206
New survey on diffusion language models: arxiv.org/abs/2508.10875 (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲
9
69
420
97,369
The link between diffusion models and optimal transport is still a bit of an enigma to me. One thing that's clear: different diffusion models trained on similar datasets tend to recover similar mappings. If these are generally not OT, in what sense are they optimal instead?
I wrote a summary of the main ingredients of the neat proof by Hugo Lavenant that diffusion models do not generally define optimal transport. github.com/mathematical-tour…
5
39
381
40,289
Harmonic networks (@deworrall92 et al.) are fully rotation equivariant convnets. Very cool! arxiv.org/abs/1612.04642 piped.video/watch?v=qoWAFBYO…
4
162
375
To synthesise realistic megapixel images, learn a high-level discrete representation with a conditional GAN, then train a transformer on top. Beautiful synergy between adversarial and likelihood-based learning! 🧵 (1/8)
Taming Transformers for High-Resolution Image Synthesis pdf: arxiv.org/pdf/2012.09841.pdf abs: arxiv.org/abs/2012.09841 project page: compvis.github.io/taming-tra…
4
80
368
I blog and give talks to help build people's intuition for diffusion models. YouTubers like @3blue1brown and @welchlabs have been a huge inspiration: their ability to make complex ideas in maths and physics approachable is unmatched. Really great to see them tackle this topic!
New video on the details of diffusion models: piped.video/iv-5mZ_9CPY Produced by @welchlabs, this is the first in a small series of 3b1b this summer. I enjoyed providing editorial feedback throughout the last several months, and couldn't be happier with the result.
5
16
373
26,267
I gave a 1-hour talk about generative modelling at the EEML 2024 summer school last month. It's mostly an intuitive look at how and why diffusion models actually work -- not unlike the content of my recent blog posts. All summer school talks will be freely available online!🙏
EEML'24 Day 1 videos are out! 🇷🇸 * Intro to DL (@alfcnz): piped.video/1bBOneUMu3Y?si=KiSk… * Generative modelling + iterative refinement (@sedielem): piped.video/9BHQvQlsVdE?si=x9ir… * AI for Good (@weballergy): piped.video/tJSicw7DPVU?si=RPcp… * Reasoning (@backprop2seed & I): piped.video/CyIuM5eQZ5A?si=90pl…
5
37
361
93,815
New blog post about the geometry of diffusion guidance: sander.ai/2023/08/28/geometr… This complements my previous blog post on the topic of guidance, but it has a lot of diagrams which I was too lazy to draw back then! Guest-starring Bundle, the cutest bunny in ML 🐇

[Image: Diagram showing the operations involved in a single step of diffusion sampling (DDPM) with classifier-free guidance.]

9
72
343
51,235
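For readers who want the guidance step from the post above in code form, here is a rough sketch of how classifier-free guidance combines two model predictions. The function name and the scale convention (1.0 means "purely conditional") are assumptions; other write-ups parameterise the scale differently.

```python
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Extrapolate from the unconditional prediction towards the conditional one.

    guidance_scale = 0.0 -> unconditional prediction
    guidance_scale = 1.0 -> conditional prediction
    guidance_scale > 1.0 -> move further along the same direction
    """
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy two-dimensional "predictions", invented for illustration.
uncond = [0.0, 1.0]
cond = [1.0, 1.0]
print(guided_prediction(uncond, cond, 3.0))
```

Geometrically, the difference `cond - uncond` is the direction that conditioning adds, and the scale controls how far along it each sampling step moves.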
IMO VQGAN is why GANs deserve the NeurIPS test of time award. Suddenly our image representations were an order of magnitude more compact. Absolute game changer for generative modelling at scale, and the basis for latent diffusion models. arxiv.org/abs/2012.09841
13
26
354
19,058
In my blog post on latents for generative modelling, I pointed out that representation learning and reconstruction are two separate tasks (§6.3), which autoencoders try to solve simultaneously. Separating them makes sense. It opens up a lot of possibilities, as this work shows!
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
9
20
338
33,823
New paper: continuous diffusion for categorical data We train diffusion language models with cross-entropy, using score interpolation instead of score matching. The training distribution of noise levels is adapted on the fly with time warping. arxiv.org/abs/2211.15089 (1/3)
5
71
329
Likelihood is a great loss fn, it's all about the space you measure it in! Our latest work on hierarchical AR image models (w/ @JeffreyDeFauw, Karen Simonyan): arxiv.org/abs/1903.04933 We generated 128x128 & 256x256 samples for all ImageNet classes: bit.ly/2FJkvhJ (1/2)
1
95
324
It's so much easier to tweet low-effort memes which assert that diffusion is just autoregression in frequency space, than it is to write a blog post about it 🤷 (but I'm doing both!)
9
25
310
23,335
One weird trick for better diffusion models: concatenate some DINOv2 features to your latent channels! Combining latents with PCA components extracted from DINOv2 features yields faster training and better samples. Also enables a new guidance strategy. Simple and effective!
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture – Low-level image details (via VAE latents) – High-level semantic features (via DINOv2)🧵
3
34
295
26,244
I will be at #NeurIPS2018 to present our work on music generation in the raw audio domain, using a stack of WaveNet autoencoders. Poster #87 on Tuesday Dec 4th, 5PM-7PM! Paper: papers.nips.cc/paper/8023-th… Samples: goo.gl/A9nTZa
2
71
287
Some prefer math and rigour, personally I like intuitive explanations. This monograph has plenty of both! I love how much time is spent linking different perspectives (variational, score-based, flow-based) together. Chapter 6 in particular is really great. Amazing effort! 👏
Tired to go back to the original papers again and again? Our monograph: a systematic and fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core ideas that shaped diffusion modeling and explains how today’s models work, why they work, and where they’re heading. 🧵You’ll find the link and a few highlights in the thread. We’d love to hear your thoughts and join some discussions! ⚡ Stay tuned for our markdown version, where you can drop your comments!
5
17
294
22,109
New blog post: 'Generating music in the waveform domain' benanne.github.io/2020/03/24… A comprehensive overview of the field and some personal thoughts, based on a tutorial I gave at @ismir2019 with @jordiponsdotme and Jongpil Lee back in November. Comments / feedback welcome!
8
109
279
"signal processing meets neural nets" is probably my favourite genre of paper, two great examples: Making Convolutional Networks Shift-Invariant Again by @rzhang88 arxiv.org/abs/1904.11486 Alias-Free Generative Adversarial Networks by Karras et al. arxiv.org/abs/2106.12423
i wish i knew more about signal processing, because it seems like neural networks often choose to learn things in frequency space: - the grokking algorithm for modular addition (the one discovered by Neel Nanda) did all its math in frequency space - lots of evidence (DeepDream/Chris Olah's work) that CNNs recognize images by composing patterns of high and low frequencies - spectral analysis might be the key to "decoding" what's going on in the residual stream inside transformers (just guessing)
9
49
288
25,402
I've been working on WaveNet autoencoders with @GoogleBrain Magenta. blog post: magenta.tensorflow.org/nsynt… paper: arxiv.org/abs/1704.01279
5
95
273
Why do diffusion models generalise at all? It's not obvious that they would. It turns out underfitting plays an important role, as well as the architectural inductive biases of locality and translation equivariance. What other kinds of symmetry and structure could we hardcode? 🤔
Excited to finally share this work w/ @SuryaGanguli. Tl;dr: we find the first closed-form analytical theory that replicates the outputs of the very simplest diffusion models, with median pixel wise r^2 values of 90%+. arxiv.org/abs/2412.20292
5
32
266
24,203
Think you understand classifier-free diffusion guidance? Think again! These two papers beg to differ😁 arxiv.org/abs/2406.02507 arxiv.org/abs/2408.09000 Both full of really great insights that question prevailing assumptions. cc @jaakkolehtinen @ArwenBradley @PreetumNakkiran
2
47
267
17,269
When I started working on generative modelling at @GoogleDeepMind in 2016, it was not a very popular research topic. I remember how we had to emphasise in our papers that these models would learn useful representations, in order to convince reviewers of their utility 😁 (1/6)
1
21
263
48,462
Big fan of straightforward ideas that help to free us from the tyranny of the grid! Off-the-grid ideas often end up being too complicated to catch on, but at a glance, this looks simple enough.
Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!
3
25
260
30,111
Generative modelling used to be about capturing the training data distribution. Interestingly, this stopped being the case when we started actually using them🤔 We tweak temps, use classifier-free guidance and post-train to get a distribution better than the training data.
17
13
267
44,317
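As a small illustration of the "tweak temps" remark above, this sketch shows how a sampling temperature reshapes a categorical distribution away from what the model was trained to match. The function name and the toy logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy values
for temperature in (0.5, 1.0, 2.0):
    print(temperature, softmax_with_temperature(logits, temperature))
```

Only a temperature of 1.0 samples from the distribution the model actually learnt; anything else is a deliberate departure from it.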
The noise schedule seems like a pretty important design choice for any diffusion model, but I have sometimes found this concept to be a greater source of confusion than insight😵‍💫 In this blog post, I try to explain why. sander.ai/2024/06/14/noise-s…
6
37
255
47,043
If you want to diffuse stuff, its frequency behaviour is important🌊 (sander.ai/2024/09/02/spectra…). For latents, you can shape the spectrum! Like EQ-VAE, they find: equivariance ⇒ better latents. Loving all the recent work on tweaking latents, might be time for another blog post✍️
Improving the Diffusability of Autoencoders "In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256² and FVD by at least 44% for video generation on Kinetics-700 17 × 256²."
3
35
261
25,476
I'm always quite skeptical of work that addresses a long-standing problem with a relatively simple tweak, but this looks promising: wrap the softmax numerator in ReLU(x - 1), and the denom terms in abs(x - 1) to get rid of attention sinks. Would be nice if it holds up at scale!
EARLY PREPRINT: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Why do we use softmax in attention, even though we don’t really need non-zero probabilities that sum to one, causing attention sink and large hidden state activations? Let that sink in.
3
24
257
25,103
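A naive sketch of the rectified softmax described in the post above, applied to one row of attention scores. The function name, the `eps` guard and the toy scores are assumptions; this is not the preprint's reference implementation, and a practical version would need a numerically stable formulation for large scores.

```python
import math


def softpick(scores, eps=1e-8):
    """ReLU(exp(x) - 1) in the numerator, |exp(x) - 1| summed in the denominator."""
    shifted = [math.exp(s) - 1.0 for s in scores]
    denom = sum(abs(v) for v in shifted) + eps  # eps avoids 0 / 0 when every score is 0
    return [max(v, 0.0) / denom for v in shifted]


print(softpick([2.0, -1.0, 0.5]))  # negative scores get exactly zero weight
print(softpick([0.0, 0.0, 0.0]))   # a row can attend to nothing at all
```

Unlike softmax, the weights need not sum to one, so no position has to absorb leftover probability mass.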
We are hiring on the Generative Media team in London: boards.greenhouse.io/deepmin… We work on Imagen, Veo, Lyria and all that good stuff. Come work with us! If you're interested, don't delay -- apply before 5PM tomorrow (UK time).
4
40
258
52,489
This work uncovers a profound connection between continuous and discrete (non-absorbing) diffusion models, allowing transfer of advanced techniques such as consistency distillation to the discrete setting! Also: amazing title, no notes! 🧑‍🍳😙🤌
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
5
29
250
24,041
End of year shower thought: Before AlexNet, we used layer-wise pre-training to train neural nets with >2 layers -- backprop just couldn't hack it. Diffusion and autoregression are the new layer-wise pre-training: decompose generation into many steps, train one step at a time!
8
19
232
70,920
Batch normalisation appears to be falling out of favour (probably for the best IMO, so many bugs end up being batchnorm bugs😬). One area where it persists is GAN discriminators (e.g. in StyleGAN-T and VQGAN). Are there any other settings where batchnorm is still hard to avoid?
15
18
229
177,072
Here's Veo 2, the latest version of our video generation model, as well as a substantial upgrade for Imagen 3 🧑‍🍳🚢 (Did I mention we are hiring on the Generative Media team, btw 👀)
Today, we’re announcing Veo 2: our state-of-the-art video generation model which produces realistic, high-quality clips from text or image prompts. 🎥 We’re also releasing an improved version of our text-to-image model, Imagen 3 - available to use in ImageFX through @LabsDotGoogle. → goo.gle/veo-2-imagen-3
4
34
236
35,328
🆕Variable-rate discrete representation learning🆕 We learn slowly-varying discrete representations of speech signals, compress them with run-length encoding, and train transformers to model language in the speech domain 🗣️ 📜arxiv.org/abs/2103.06089 🔊vdrl.github.io/
1
50
222
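The run-length encoding step mentioned in the post above can be sketched in a few lines. This only illustrates generic RLE on a made-up token sequence; how the paper learns slowly-varying codes in the first place is not shown.

```python
from itertools import groupby


def run_length_encode(tokens):
    """Collapse each run of repeated tokens into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(tokens)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]


codes = [3, 3, 3, 7, 7, 1, 1, 1, 1]  # hypothetical slowly-varying discrete codes
pairs = run_length_encode(codes)
print(pairs)
print(run_length_decode(pairs) == codes)
```

The slower the codes vary, the longer the runs and the shorter the encoded sequence, which is what makes the rate variable.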
With all the recent work on distilling diffusion models into single-pass models, I've been thinking a lot about diffusion model training as solving a kind of optimal transport problem🚐 (1/6)
2
26
226
66,951
In 2022, I worked on text diffusion for a bit and wrote a blog post. Since then, people have regularly asked me about scaling diffusion LLMs. All the while, I was on the first row watching Brendan assemble a cracked team and make it a reality. Now I can stop being coy about it😁
Excited to share what my team has been working on lately - Gemini diffusion! We bring diffusion to language modeling, yielding more power and blazing speeds! 🚀🚀🚀 Gemini diffusion is especially strong at coding. In this example the model generates at 2000 tokens/sec, including overheads like tokenization, prefill, safety filters etc.
8
23
227
17,860
📢PSA: #NeurIPS2024 recordings are now publicly available! The workshops always have tons of interesting things on at once, so the FOMO is real😵‍💫 Luckily it's all recorded, so I've been catching up on what I missed. Thread below with some personal highlights🧵
2
40
226
29,226
The effective context length of Transformers with local (sliding window) attention layers is usually much shorter than the theoretical maximum. This blog post explains why. Back in 2017 the visualisations in arxiv.org/abs/1701.04128 really changed my perspective on this for CNNs!
It's a common belief that L SWA layers (size W) yield an L×W receptive field. My post shows why the effective range is limited to O(W), regardless of depth. The reasons are information dilution and the exponential barrier from residual connections: guangxuanx.com/blog/stacking…
2
34
221
30,974
Parameterising neural nets to predict logits and training them using the cross-entropy loss function is an extremely effective combination. This setup works for diffusion models as well, by using score interpolation instead of score matching! See arxiv.org/abs/2211.15089 (§3.1)
The more I work in ML the more I feel like nearly any loss objective can, and should, be rephrased as its cross-entropy-based analog.
2
14
211
75,480
Diffusion models learn useful internal representations of images, but it's somewhat impractical to use them for feature extraction, because they expect noisy input images. This work suggests a cheap and straightforward way to distill them into clean-input models. Neat!
🤔 Why do we extract diffusion features from noisy images? Isn’t that destroying information? Yes, it is - but we found a way to do better. 🚀 Here’s how we unlock better features, no noise, no hassle 🧵👇
3
31
213
18,621
Great blog post by @jerryx314 on rotary position embeddings (RoPE) in more than one dimension, with interactive visualisations, a bunch of experimental results, and code! jerryxio.ng/posts/nd-rope/
Very nice blogpost on RoPE variants by @jerryx314
2
28
213
14,969
10 years ago to the day, I published my first ML-related blog post: sander.ai/posts/ My blogging has been very sporadic over the years, but sharing what I've learnt has been very rewarding, and probably a pretty good career move as well😁 I highly recommend it!
2
15
200
46,565
Replying to @A_K_Nain
I think this is exacerbated by the fact that there are multiple formalisms (e.g. VAE-style, score-based, SDE, ...) and everything has 2-3 different names, depending on who you ask! I strongly recommend @YSongStanford's compendium (with Python notebooks!): yang-song.github.io/blog/202…
2
28
210
Amazing interview with @DrYangSong, one of the key researchers we have to thank for diffusion models. The most important lesson IMO: be fearless! The community's view on score matching was quite pessimistic at the time -- he went against the grain and got it to work at scale!
Very excited to share our interview with @DrYangSong. This is Part 2 of our history of diffusion series — score matching, the SDE/ODE interpretation, consistency models, and more. Enjoy!
1
38
209
24,297
Two neat papers about diffusion for high-res images without cascading. Similar observations: - tuning the noise schedule is really important - the bulk of computation can be done on a significantly more compact representation arxiv.org/abs/2301.11093 arxiv.org/abs/2301.10972
2
24
209
41,475
WaveGrad generates waveforms from spectrograms by iteratively following the log-likelihood gradient. The surprising thing is that it needs as little as 6 steps to produce good quality audio! arxiv.org/abs/2009.00713 Seems like the resurgence of score matching is in full swing :)
Yet another neural vocoder from my team mates in Google Brain is out! The new model, "WaveGrad", is not autoregressive/Flow/GAN. It is based on score matching / diffusion probabilistic models. Check it please!!
40
205
If you've read my latest blog post on generative modelling in latent space, this one is a great follow-up about putting things into practice. openworldlabs.ai/blog/traini…
In this blog post we will summarize some of our findings with training autoencoders for diffusion! We also share some null results we had with a slightly unconventional approach we tried. 1/2
1
27
206
21,228
Cool result: the entropy decrease resulting from a diffusion model prediction is equal to a scaled version of the loss🤯 In CDCD (arxiv.org/abs/2211.15089), we linearised the categorical cross-entropy to warp time. This finding makes that possible for Gaussian diffusion models!
1/4) I am very happy to share our latest work on the information theory of generative diffusion: "Entropic Time Schedulers for Generative Diffusion Models" We find that the conditional entropy offers a natural data-dependent notion of time.
5
26
205
19,009
Lots of interesting work on "fixing" GANs right now: arxiv.org/abs/1612.02780 arxiv.org/abs/1611.04273 arxiv.org/abs/1611.02163 [1/3]
3
81
204
I once discovered I'd been training networks without any biases for 3 months, because I forgot y += b in my conv layer implementation 🙃 Turns out it didn't really matter 🤷 although that wasn't quite as well-established at the time, so it was a bit of a shock to find out!
Replying to @kalomaze
Deep Learning horror genre 🫣 That fear of a kwarg that isn’t set right, not erroring, only silently making your results slightly worse.
10
10
195
19,567
Nice paper on the trade-off between decoding quality and modelability in 2-stage generative models. I disagree with this framing though: the trade-off is quite clear from an information-theoretic perspective. Do most people really believe this? Maybe it's time for a blog post🤔
Happy to (belatedly) share our recent work introducing Causally Regularized Tokenization 📺, matching LlamaGen-3B generation performance with 0.5x the number of tokens/image (256 vs 576) and 0.25x the number of params (770M vs 3B) on ImageNet. arxiv.org/pdf/2412.16326 1/n
16
15
190
25,518
This work shows scalar quantisation is competitive with VQ across a range of tasks, but simplifies things a lot: no codebook collapse, no EMA updates, ... because no codebook! I've been a fan of scalar quantisation for a while, see arxiv.org/abs/1810.05246 arxiv.org/abs/2103.06089
Finite Scalar Quantization: VQ-VAE Made Simple paper page: huggingface.co/papers/2309.1… propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
6
29
187
57,568
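To make the "implicit codebook" idea from the quoted abstract concrete, here is a rough sketch of scalar quantisation over a few bounded dimensions. The `LEVELS` setting, the function names and the tanh bounding with odd level counts are illustrative assumptions; the gradient estimator needed to train through the rounding step is left out.

```python
import math

LEVELS = [5, 5, 5]  # hypothetical: 3 dimensions x 5 values each = 125 implicit codes


def quantise(z, levels=LEVELS):
    """Bound each dimension with tanh, then round to the nearest integer level."""
    codes = []
    for value, n in zip(z, levels):
        half = (n - 1) / 2  # odd n assumed, so levels are -half, ..., 0, ..., +half
        codes.append(round(half * math.tanh(value)))
    return codes


def code_index(codes, levels=LEVELS):
    """Map per-dimension levels to a single index into the product codebook."""
    index = 0
    for code, n in zip(codes, levels):
        index = index * n + (code + (n - 1) // 2)  # shift each level into 0..n-1
    return index


codes = quantise([0.0, 10.0, -10.0])
print(codes, code_index(codes))
```

There is no stored codebook to collapse or update: the set of codes is fixed by the choice of levels.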
Video isn't really video without audio!
Video, meet audio. 🎥🤝🔊 With Veo 3, our new state-of-the-art generative video model, you can add soundtracks to clips you make. Create talking characters, include sound effects, and more while developing videos in a range of cinematic styles. 🧵
6
14
185
8,608
Good advice! For classification models, a scatter plot of the cross-entropy loss vs. prediction entropy (~confidence) for individual examples can be very revealing. More generally: study model behaviour for individual data points, don't look at aggregate statistics exclusively.
When you sort your dataset descending by loss you are guaranteed to find something unexpected, strange and helpful.
1
36
182
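A minimal sketch of the per-example quantities suggested above: cross-entropy loss and prediction entropy for each data point, ready to be fed to any scatter-plotting tool. The logits and labels are invented, and both quantities are in nats.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_entropy(logits, label):
    """Return (cross-entropy loss, prediction entropy) for one example."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return cross_entropy, entropy


# Made-up examples: confident and right, confident and wrong, unsure.
examples = [([4.0, 0.0, 0.0], 0), ([4.0, 0.0, 0.0], 1), ([0.1, 0.0, 0.0], 2)]
points = [loss_and_entropy(logits, label) for logits, label in examples]
for cross_entropy, entropy in points:
    print(f"loss={cross_entropy:.3f}  entropy={entropy:.3f}")
```

Points with low entropy but high loss are the confidently wrong predictions, which are often the most informative ones to inspect individually.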
Invertible neural networks are really cool! Check out this excellent blog post about a new paper where they are used to analyse inverse problems: hci.iwr.uni-heidelberg.de/vi… paper: arxiv.org/abs/1808.04730 (1/4)
6
58
178
I've got a blog post brewing... maybe even two blog posts! They are about diffusion models🙃
5
5
177
27,448
I was working on autoregressive models around that time, but instead of RNNs and language modelling, we were trying to make convolutional nets generate music by producing 16000+ audio waveform amplitude values per second. No regrets😎 (Never got tempted by RL🤭)
Replying to @BldrInvstTech
I don’t know why I didn’t work on this at early OpenAI, despite going around everywhere giving talks about the magic of autoregressive language models around that time. I went deep into RL like everyone else that time. Biggest, most confusing research career mistake ever
4
8
179
26,641
I've uploaded my PhD thesis "Learning feature hierarchies for musical audio signals", which I defended in January: dropbox.com/s/22bqmco45179t7…
5
48
171
At the end of the summer, I gave an invited talk at the @M2lSchool in Thessaloniki about training neural networks. It's a bit of a jumble of ideas, suggestions and best practices I've amassed over the years, interspersed with concrete examples. piped.video/watch?v=wO4quYlQ…
4
27
173
39,221
tl;dr: connect every CNN layer to every other layer. Simple but effective idea, well-written paper. Worth a read!
Densely Connected Convolutional Networks arxiv.org/abs/1608.06993
3
60
172
If diffusion model sampling tries your patience, check out consistency models: single-step sampling! No adversarial loss! In addition to being a very cool idea, this paper significantly leans on the formalism from Karras et al. 2022 AKA my favourite diffusion paper😁 Neat!
Consistency Models achieve the new state-of-the-art FID of 3.55 on CIFAR10 and 6.20 on ImageNet 64 × 64 for one-step generation abs: arxiv.org/abs/2303.01469
31
169
44,556
In arxiv.org/abs/2303.00848, @dpkingma and @RuiqiGao had suggested that noise augmentation could be used to make other likelihood-based models optimise perceptually weighted losses, like diffusion models do. So cool to see this working well in practice!
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/
2
25
176
19,613
Neat idea: if you fit augmentation params with gradient descent (jointly with model params) using a prior that gently encourages more augmentation, they will naturally drift towards the maximal sensible values, which correspond to the degree of invariance exhibited by the data.
Translation equivariance has imbued CNNs with powerful generalization abilities. Our #NeurIPS2020 paper shows how to *learn* symmetries -- rotations, translations, scalings, shears -- from training data alone! arxiv.org/abs/2010.11882 w/ @g_benton_, @Pavel_Izmailov, @m_finzi. 1/9
31
171
There's some interesting ongoing discussion about how much of a "diffusion model" this really is, but anything that even slightly challenges the autoregressive hegemony is a refreshing change!
Large Language Diffusion Models Introduces LLaDA-8B, a large language diffusion model pretrained on 2.3 trillion tokens using 0.13 million H800 GPU hours, followed by SFT on 4.5 million pairs. LLaDA 8B surpasses Llama-2 7B on nearly all 15 standard zero/few-shot learning tasks while performing on par with Llama-3 8B.
6
14
165
14,224
Instead of modelling the velocity at each time step t in the process, model the mean velocity across any time interval (r, t). Not the first work to try this, but using the gradient of the mean velocity to define the target is an interesting approach that I haven't seen before.
Mean Flows for One-step Generative Modeling "We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models."
2
18
161
22,192
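For reference, the relation between mean and instantaneous velocity mentioned above can be written out as follows. This is a sketch of the derivation, with z_t denoting the sample at time t, v the instantaneous velocity and u the mean velocity over (r, t); the notation is an assumption and may differ from the paper's.

```latex
% Mean velocity over the interval (r, t):
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Multiply both sides by (t - r) and differentiate with respect to t:
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)

% Rearranged, this gives a target for u in terms of v and the gradient of u:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u
```

Setting r = t recovers the instantaneous velocity, so the usual flow matching target is a special case.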
New blog post, in which I wax lyrical about typicality and the curse of dimensionality: benanne.github.io/2020/09/01… I tweeted about this concept a while back, but it turns out I have more to say on the topic. It's a bit more speculative than what I usually write, hope you like it!
3
45
159
New blog post tomorrow... probably 👀
10
3
159
10,054
The original promise of deep learning was to make feature engineering obsolete, and this was relatively successful. But when the inductive biases of neural networks work against us, it can still be useful to shape information, e.g. Fourier features arxiv.org/abs/2006.10739 (5/6)
2
14
160
9,277
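A minimal sketch of the random Fourier feature mapping from the linked paper, assuming a Gaussian frequency matrix `B`. The `scale` value and the array sizes are arbitrary choices for illustration: low-dimensional coordinates are projected onto random frequencies and passed through cos and sin before being fed to a network.

```python
import numpy as np

def fourier_features(v, B):
    """Map coordinates v of shape (n, d) to features of shape (n, 2 * m).

    B has shape (m, d) and holds the (random) frequencies.
    """
    proj = 2.0 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
scale = 10.0  # Assumed bandwidth; larger values emphasise higher frequencies.
B = scale * rng.normal(size=(64, 2))
coords = rng.uniform(size=(5, 2))  # e.g. normalised (x, y) pixel coordinates
feats = fourier_features(coords, B)
print(feats.shape)  # (5, 128)
```

The bandwidth of `B` is the main knob here: it controls which frequencies the downstream network finds easy to fit.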
Here's the third and final part of @slaterstich's "History of diffusion" interview series! The other two interviewees' research played a pivotal role in the rise of diffusion models, whereas I just like to yap about them 😬 this was a wonderful opportunity to do exactly that!
Excited to share our interview with @sedielem! This is Part 3 in our History of Diffusion series. We talk about diffusion as spectral autoregression, diffusion language models, flow matching, and much more. Enjoy!
2
30
162
13,266
This paper does a great job explaining why 2-rectified flow could be a serious contender among the ever-growing abundance of diffusion distillation methods. The resulting model produces good samples in few steps, but retains the full flexibility of an undistilled diffusion model.
Improving the Training of Rectified Flows abs: arxiv.org/abs/2405.20320 code: github.com/sangyun884/rfpp Three training tricks: 1. u-shaped timestep distribution 2. LPIPS-Huber loss 3. initializing with pretrained diffusion model (EDM) Now only requires reflow (distillation) once while comparable to consistency training/distillation at 1-2 inference steps.
2
22
161
72,229
10 years ago today: @avdnoord and I presenting our audio-based music recommendation demo at @NeurIPSConf 2013! We went on to intern at Spotify & Google Play Music the next summer (blog post: sander.ai/2014/08/05/spotify…), and by summer 2015, we had both joined @GoogleDeepMind.
5
3
159
22,315
The TF wrapper we use internally at DeepMind has been open sourced. Lasagne users might like this one, it shares a lot of design principles.
Excited to release #Sonnet - a library for constructing complex Neural Network models in TensorFlow. Get started: github.com/deepmind/sonnet
1
76
155
Transformers haven't changed much since 2017, but there have been some innovations over the years. This is an excellent summary of architectural differences in recent LLMs. Nice diagrams too! 👏 It would be great to see something like this for diffusion Transformers as well 🤔
From GPT to MoE: I reviewed & compared the main LLMs of 2025 in terms of their architectural design from DeepSeek-V3 to Kimi 2. Multi-head Latent Attention, sliding window attention, new Post- & Pre-Norm placements, NoPE, shared-expert MoEs, and more... magazine.sebastianraschka.co…
2
17
158
12,838
This looks like a great deep dive on neural network architectures for diffusion models. tl;dr use a Transformer, but there's quite a bit more to it, and as always in this field, the devil is in the details!
Had the honor to present diffusion transformers at CS25, Stanford. The place is truly magical. Slides: bit.ly/dit-cs25 Recording: piped.video/vXtapCFctTI?si=dlcE… Thanks to @stevenyfeng for making it happen!
1
10
158
18,992
Google Assistant is now powered by WaveNet! deepmind.com/blog/wavenet-la…
3
39
149
Replying to @gowthami_s
Sure! Another way to think of it is texture vs. structure, or sometimes people call this "stuff vs. things". In an image of a dog in a field, the grass texture (stuff) is high-entropy, but we do not perceive individual realisations of this texture, we just perceive it as "grass". If the realisation of this texture is subtly different, we often cannot tell, unless the images are layered directly on top of each other. This is a fun experiment to try with an adversarial autoencoder: when comparing an original image and its reconstruction side by side, they often look identical. But layering them on top of each other and flipping back and forth often reveals just how different the images are, especially in areas with a lot of texture. For objects (things) on the other hand, like the dog's eyes for example, differences of a similar magnitude would be immediately obvious. A good adversarial autoencoder will abstract away texture, but try to preserve structure. That way, the realisation of the grassy texture in the reconstruction can differ from the original without noticeably affecting the fidelity of the reconstruction. This enables the autoencoder to drop a lot of modes (i.e. other realisations of the same texture) and represent the presence of this texture more compactly in its latent space. This in turn should make generative modelling in the latent space easier as well, because the generative model can now capture the absence/presence of a texture, rather than having to capture all the entropy associated with that texture. This is a bit of a caricature, and what happens in reality is probably a bit more complicated, but this is roughly my intuition for why two-stage training is actually preferable over end-to-end, at least in the visual domain.
5
21
156
31,660
We can diffuse text now, but we can still diffuse pixels as well!
Get ready for Imagen 4 🎨 capable of creating richer images, with more nuanced colors, intricate details and superior typography. Tap each photo below to see more. 👀
8
6
153
10,154
Our latest work on GANs for text-to-speech, from characters/phonemes to waveforms with a single model. Learning varying alignment without teacher forcing is tricky, but we found dynamic time warping (DTW) to be very effective.
In our new paper [arxiv.org/abs/2006.03575] we propose EATS: End-to-End Adversarial Text-to-Speech, which allows for speech synthesis directly from text or phonemes without the need for multi-stage training pipelines or additional supervision. Audio: bit.ly/2Ya9rRK
2
35
144
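For readers unfamiliar with dynamic time warping, here is a minimal sketch of the classic dynamic programming recursion on 1-D sequences. This is the textbook hard-minimum version with an absolute-difference cost, given as an illustration only; it should not be read as the exact variant used in the paper, and `dtw_cost` is a hypothetical name.

```python
import numpy as np

def dtw_cost(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    # acc[i, j] holds the best cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

# The second sequence repeats one element; warping absorbs the repeat.
print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The alignment is monotonic by construction, which is what makes it a natural fit for matching sequences that differ in length and timing.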