Pioneering Diffusion LLMs | Team Lead @mbzuai - IFM | PhD @cornell

San Francisco, CA
🎓 Officially a doctor now 😊!!! As a first-gen college kid, this moment means the world to me. Grateful beyond words to all my mentors who’ve guided me along the way — from @GMartius who first introduced me to research back in 2017, to @volokuleshov who sparked my love for generative modeling, and finally to @jwthickstun and @Jimantha for their incredible mentorship through the final stretch of my PhD. ❤️
81
48
1,652
102,614
For a PhD, you need to be a romantic at some level. Your papers will get rejected. Your ideas will get scooped. All while your peers flourish. And yes, it will sting. 2023 was one such year for me. Yet I call it my golden year, because that’s when I truly fell in love with my research. I found my escape in it.
Replying to @ssahoo_
What do you think would make a good PhD candidate? What specific traits do you see in a smart/talented PhD? Would love to hear some feedback from your side.
13
36
811
91,369
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
16
100
551
144,093
As I wrap up my thesis, I can’t help but look back on the past year of working on Diffusion LLMs. People often ask me: why and how I got into this strange little world of discrete diffusion. I usually give the textbook answer: the kind you’d find in any random paper and make myself sound like some visionary who saw this whole field explode the way it did. Lie. A blatant lie. The truth is simpler: the day I first stumbled upon this topic, I felt like a kid again. Remember that toy you loved so much as a child that you couldn’t stop obsessing over? That’s what discrete diffusion felt like to me. My head exploded with possibilities. ⚡️2023 was my golden year. No. of papers published: 0. BUT I was working on things that made me genuinely happy-- no expectations, no thought of return. That was the year I learned almost everything I know about diffusion. The seeds of MDLM and Duo were planted in late Nov / Dec. I didn’t start formally working on discrete diffusion until early February'24, when SEDD came out and showed a lot of promise in this area. P.S. I’ll miss my desk, which had a gorgeous view of Manhattan.
24
14
408
37,853
Overwhelmed by the number of Diffusion LLM papers? 🌊 Same here 😭 So I’m starting a Discrete Diffusion Reading Group (@diffusion_llms) with my favorite disciples @jdeschena and @zhihanyang_ ✨ We’ll cover everything—from theory to empirics, from language to molecules. Join us 👉 Google Group:  groups.google.com/g/diffusio… webpage: d-llms.io Follow us @diffusion_llms
20
39
314
30,478
✨New beginnings: I’ve joined the Institute of Foundation Models @llm360, where I’ll be leading research on diffusion-LLMs. 🚨Goals > Design frontier diffusion-LLMs > Advance these algorithms through fundamental research ✌️About to go on a hiring frenzy, so stay tuned.
27
10
274
20,897
🚨 [New paper alert] Esoteric Language Models (Eso-LMs) First Diffusion LM to support KV caching w/o compromising parallel generation. 🔥 Sets new SOTA on the sampling speed–quality Pareto frontier 🔥 🚀 65× faster than MDLM ⚡ 4× faster than Block Diffusion 📜 Paper: arxiv.org/abs/2506.01928 📘 Blog: s-sahoo.com/Eso-LMs/ 💻 Code: github.com/s-sahoo/Eso-LMs 🔬 Colab: colab.research.google.com/dr… 🤗 Hugging Face: huggingface.co/collections/s… Project co-led with @zhihanyang_ (1/9)
10
35
256
92,527
🔥 Rethinking Reasoning (with Diffusion LLMs) This work changes how you think about reasoning in LLMs. 🤯 Turns out: you don’t need the full chain-of-thought — only a small subset of CoT tokens actually matter for the final answer. ❌ Autoregressive LLMs can’t exploit this since they must generate the entire CoT. ✨But MDLMs enable early-exit reasoning — predicting the answer without materializing the whole CoT. Huge congrats to the team behind this breakthrough: @zachary_horvitz @_rk_singhal @cdomingoenrich @Zhou_Yu_AI
10
34
228
15,485
Pre-training for Diffusion LLMs will be solved in the next 6 months. ^That’s underestimating both myself and the community.
6
10
201
31,455
Attending ICML ✈️Tues-Fri to present "The Diffusion Duality" 🗓️Wed, July 16 @ 4:30pm 📍East Exhibition Hall A-B (E-3003) DM if you want to chat about diffusion LMs, or my current work on Duality or Esoteric LMs!
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
1
16
156
10,302
We’re building a space that connects researchers, students, and practitioners working on discrete diffusion. Join the Discord — collaborate, learn, and share! Whether you’re 💼hiring or showcasing your work, this is the place 👇 Discord: discord.gg/JxSCwpNb
The Discrete Diffusion Reading Group is growing — 400+ members strong! We’ve launched a Discord for discussions, research ideas, help, and job opportunities. Join the conversation 👇 💬 discord.gg/JxSCwpNb 📧 groups.google.com/g/diffusio…
7
107
15,721
We’re dropping “The Diffusion Duality, Chapter 2” soon! So, stay tuned 🤗
In diffusion LMs, discrete methods have all but displaced continuous ones (🥲). Interesting new trend: why not both? Use continuous methods to make discrete diffusion better. Diffusion duality: arxiv.org/abs/2506.10892 CADD: arxiv.org/abs/2510.01329 CCDD: arxiv.org/abs/2510.03206
7
82
10,917
🚨“We have only one internet” (@ilyasut) — and that’s exactly why diffusion is the future of LLMs. 🔥Come for the hot takes, stay for @mihirp98’s deep dive at Monday’s @diffusion_llms reading group. ⏲️10 am ET (4pm CET)
5
5
64
5,225
Impressive work by @jdeschena! They propose replacing the encoder-only denoising transformer with an encoder-decoder architecture, which leads to faster training and inference for MDLM.
📢 « Partition Generative Modeling (PGM): Masked Modeling without Masks » is out! 🚯 Masked diffusion models waste FLOPs processing countless mask tokens that carry no real information. ⚡We show how partitioning can replace masking, boosting throughput by >5.3x on text and up to 7.5x on VQ-ImageNet! 📄 paper: arxiv.org/abs/2505.18883 💻 Code: github.com/jdeschena/pgm 🤗 Models: huggingface.co/jdeschena/pgm 1/9 🧵
1
4
52
7,532
We've finalized the schedule for our weekly reading group starting this Monday, Nov 17th. Do join us and sign up if you haven't already.
📅 Weekly meetings on Mondays starting November 17, 10–11 AM ET (4–5 PM CET). Details about our first session are coming soon! 🚀
3
4
48
5,382
Now @elonmusk has joined the chat!
Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and generation, so good chance diffusion is the biggest winner overall. Also means that the ratio of compute to memory bandwidth will increase.
1
1
39
7,903
📢 @BytedanceTalk just dropped their diffusion LLM!!! And boy it's fast 💨 From their technical report, it seems like they are using MDLM (my research) 😊 lf3-static.bytednsdoc.com/ob…
30
1,882
Honored to have been invited to the Research/PostTraining Round Table at @NVIDIAGTC! Thrilled that Diffusion-LMs are going mainstream. Here’s hoping the next generation of GPUs will supercharge both training and inference of these models.
6
3
26
16,815
🔥Diffusion Models 𝐚𝐥𝐦𝐨𝐬𝐭 beat AR models on text generation. Presenting MDLM, a Masked discrete Diffusion Language Model featuring a Rao-Blackwellized ELBO, which is a mixture of classical Masked Language Modeling losses and achieves SOTA results among all DMs. (1/10)
1
3
27
4,303
Duo has 56.3K 🔥downloads on HF already, Jesus! Please find the colab notebook below to play around with the HF model. 🤗HuggingFace model: huggingface.co/s-sahoo/duo-d… 🤗HuggingFace paper: huggingface.co/papers/2506.1… 🖥️Colab: colab.research.google.com/dr…
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
1
25
1,583
📢 Excited to defend my PhD thesis: "Foundations of Diffusion Language Models" 🎓✨ 📅 October 3 | 11:30 am PT / 2:30 pm ET 🔗Zoom: cornell.zoom.us/j/9586300292… Topics covered: 1⃣ MDLM 2⃣The Diffusion Duality 3⃣Esoteric Language Models
1
2
27
5,421
WHY DID I NOT KNOW ABOUT THIS!!!
Just learned that tacking a * to the \operatorname latex tag causes it to underset its subscript, very handy.
1
22
2,320
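A minimal sketch of the `\operatorname*` tip quoted above, assuming the `amsmath` package is loaded (the operator name `argmax` and the function `f` are only illustrative). In display math the starred form sets the subscript underneath the operator, while the unstarred form keeps it to the side:

```latex
\documentclass{article}
\usepackage{amsmath} % provides \operatorname and \operatorname*
\begin{document}
\[
  \operatorname{argmax}_{x} f(x)    % subscript set to the side
  \qquad\text{vs.}\qquad
  \operatorname*{argmax}_{x} f(x)   % subscript set underneath
\]
\end{document}
```

In inline math the starred form still places the subscript to the side, as `\lim` does, unless `\limits` is added.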
🎥 Sampling Viz: Duo vs MDM vs AR 🔥Notice how Duo self-corrects, unlike masked diffusion or AR. (7/8)
2
3
21
5,425
Happening tomorrow at 2:30pm ET / 11:30 am PT
📢 Excited to defend my PhD thesis: "Foundations of Diffusion Language Models" 🎓✨ 📅 October 3 | 11:30 am PT / 2:30 pm ET 🔗Zoom: cornell.zoom.us/j/9586300292… Topics covered: 1⃣ MDLM 2⃣The Diffusion Duality 3⃣Esoteric Language Models
2
20
3,657
📢 Duo and Eso-LMs at 2B scale on Slim Pajama These models will finish training in a few days. While HF release may take time due to corporate red tape, we'll try providing early access case-by-case. Email susahoo@nvidia.com with the subject “Early access”. Duo: s-sahoo.com/duo/ Eso-LMs: s-sahoo.com/Eso-LMs/
1
21
1,222
😱 Discrete diffusion emerges from Gaussian diffusion. 🧮 The argmax operator maps a Gaussian latent to a discrete one. 🔄 The resulting transition dynamics match Uniform-state discrete diffusion. 📉 We prove the discrete ELBO is tighter — making discrete space preferable. Details in the paper 😊 (3/8)
1
2
20
4,135
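A rough numerical sketch of the argmax claim in (3/8), not the paper's implementation. It assumes a variance-preserving corruption of a one-hot vector (signal weight `alpha`, noise scale `sqrt(1 - alpha**2)`), which is an assumption made here for illustration; the function names are hypothetical. If the claim holds, the argmax of the noisy latent should land on the clean token with some probability and spread the remaining mass evenly over the other tokens, which is the shape of a uniform-state marginal.

```python
import random


def gaussian_to_discrete(token, vocab_size, alpha, rng):
    """Add Gaussian noise to a one-hot vector and return the argmax index.

    Assumed (illustrative) parametrization: alpha * one_hot + sigma * noise,
    with sigma = sqrt(1 - alpha^2).
    """
    sigma = (1.0 - alpha ** 2) ** 0.5
    latent = [
        alpha * (1.0 if i == token else 0.0) + sigma * rng.gauss(0.0, 1.0)
        for i in range(vocab_size)
    ]
    # argmax maps the continuous latent back to a discrete token index
    return max(range(vocab_size), key=latent.__getitem__)


def empirical_marginal(token, vocab_size, alpha, n_samples=20000, seed=0):
    """Estimate the distribution of the argmax token by Monte Carlo."""
    rng = random.Random(seed)
    counts = [0] * vocab_size
    for _ in range(n_samples):
        counts[gaussian_to_discrete(token, vocab_size, alpha, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    # Clean token is index 2 in a toy vocabulary of size 4.
    print(empirical_marginal(token=2, vocab_size=4, alpha=0.8))
```

Lowering `alpha` toward 0 pushes the estimate toward the uniform distribution over the vocabulary; raising it toward 1 concentrates the mass on the clean token.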
Honored to see MDLM featured in the tutorial 😊
Diffusion LLMs are promising ways to overcome the limitations of autoregressive LLMs. Less error propagation, easier to control, and faster to sample! But how do Diffusion LLMs actually work? 🤔 Let's explore some ideas on this fascinating topic! piped.video/8BTOoc0yDVA
2
18
1,968
Funny enough, after we released MDLM last year, @srush_nlp came up with the exact same idea!
(1/5) Beyond Next-Token Prediction, introducing Next Semantic Scale Prediction! Our @NeurIPSConf NeurIPS 2025 paper HDLM is out! Check out the new language modeling paradigm: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models. It largely generalizes Masked Diffusion Models (MDM), and provides the progressively denoising capability for each token in the semantic level. Minimal computation overheads, much better results! arxiv: arxiv.org/abs/2510.08632 code: github.com/zhouc20/HDLM
1
18
2,511
coming soon.
2
1
16
806
🧩 Background Uniform-state discrete diffusion models can self-correct—unlike AR or masked diffusion (MDMs). ❌But they trail MDMs and AR in terms of perplexity. (2/8)
1
17
2,506
And that’s where diffusion shines!!
-2016 (classic era): focus on data efficiency 2017-2025 (pretraining era): focus on compute efficiency 2026-: focus on data efficiency (again) The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design decisions, which will be exciting!
16
2,037
The term AGI gives me the same ick that “AI” did back in 2015. If it takes hundreds of billions of tokens just to get a respectable score on grade school math (GSM8K), that says everything about where we actually are.
1
16
991
Please fill out your availability for the reading group
As we get started with our discrete diffusion reading group, we’d like to schedule a recurring one-hour meeting time that works for everyone. Form: forms.gle/Xtogq4T7xuKBfFjr7 > Please fill out your availability in the Google form, and be sure to select your local timezone when setting your availability. > This will help us find a time that accommodates everyone across time zones. Once the responses are in, I’ll follow up with the finalized meeting time and our first reading.
1
15
3,321
🎯 100× reduction in sampling steps via Discrete Consistency Distillation. ✨1024 → 16 steps: no quality/diversity loss ✨1024 → 8 steps: same quality, slight drop in diversity (6/8)
2
1
16
1,256
Masked Diffusion Models (MDMs) are a strong alternative to autoregressive (AR) LMs—but they have two fatal flaws: 🐌 Slow: No KV caching = much slower than AR in practice 📉 Quality gap: Struggle on complex tasks, lower likelihood than AR (2/9)
1
15
1,679
🤝 With amazing collaborators: @jdeschena , @SkyLi0n , @Guanghan__Wang , @justintchiu , @volokuleshov (8/8)
14
1,132
🚀 Application 1: Faster Training + Better Likelihood We use this duality to create a curriculum: we start off with a Gaussian diffusion model and anneal it to a discrete diffusion model! 🤯 Our method: Duo ➡️ 2× faster training ➡️ Outperforms AR on 3/7 zero-shot likelihood benchmarks (4/8)
1
15
1,593
⚡ Application 2: Few-Step Generation Few-step sampling in Gaussian Diffusion relies on PF-ODEs + consistency distillation. 🦾Discrete models lack such tools—until now. 🕺Our duality allows us to port consistency distillation to the discrete domain. cc:@DrYangSong (5/8)
1
1
15
1,406
How do you even compute such probabilities?
My estimate of the probability of Grok 5 achieving AGI is now at 10% and rising
1
12
1,815
I tagged the wrong ICML FML 🤦🏽
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
1
12
1,201
And in 2025 we unify discrete-space and Gaussian-space diffusion 😊
How Diffusion unification went: > score based model > then DDPM came along > we have two formalisms, DDPM & SBM > SDE came to unify them > now we have Score, DDPM & SDE > Then came flow matching to unify them > now we have Score, DDPM, SDE & Flow Models > Then consistency models came > now we have Score, DDPM, SDE, Flow & Consistency Models
12
960
BD3-LMs improve speed via blockwise semi-autoregressive diffusion. They partially support KV caching and outperform MDMs in speed. But… ⚠️ Mode collapse at low steps = bad samples 🧠 Partial caching only = intra-block KV still missing Speed & quality still compromised. (3/9)
1
11
1,284
Excited to present our #NeurIPS2024 paper: "Simple and Effective Masked Diffusion Language Models" on Thurs at 11:30 a.m. in Hall A-C (#2505) 🔥Our method almost surpasses AR models in text generation 📜arxiv.org/abs/2406.07524 🔖s-sahoo.com/mdlm/ 💻github.com/kuleshov-group/md…
1
2
11
1,192
Happy Diwali — from mine to yours ✨
11
882
Is it just me, or is OpenReview down for everyone?
10
1,645
Training Innovations in Eso-LMs: 🔀 Half the batch → AR-style: Mask tokens see clean context + prior clean tokens → AR loss 📥 Other half → MDLM-style: Shuffled inputs, left = clean, right = masked + causal attention → MDLM loss 🏁 Outcome: ✅ Unified Denoising Model for AR and diffusion ✅ KV caching during diffusion (yes, really!) (5/9)
1
9
1,238
We introduce a new LM paradigm that fuses AR and MDMs. 💡 Trained with a hybrid loss (AR + MDM), our model interpolates smoothly between both styles—balancing: ✅ Perplexity ✅ Sample quality ✅ Inference speed (4/9)
1
9
987
🔥 Not sure if 2025 is the year of AGI, but it definitely belongs to Diffusion LMs. Dropping another banger next week — stay tuned. 👀💥 #DiffusionLMs #NLP #GenerativeAI #LLMs
1
9
654
Inference time Innovations: ⏩ Reformulated ancestral sampling: Only do forward pass on scheduled MASK + clean tokens—not the whole sequence = Massive FLOP savings 💾 🧠 Thanks to any-order autoregressive training, KV of clean tokens is cacheable! 🏁 Outcome: 🚀 65× faster than MDLM ⚡ 4× faster than Block Diffusion (6/9)
1
9
777
Replying to @jaschasd
Just reached out! Would love to chat about diffusion-LLMs with you 😊
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
8
1,530
Replying to @iScienceLuvr
And theirs will be Esoteric
🚨 [New paper alert] Esoteric Language Models (Eso-LMs) First Diffusion LM to support KV caching w/o compromising parallel generation. 🔥 Sets new SOTA on the sampling speed–quality Pareto frontier 🔥 🚀 65× faster than MDLM ⚡ 4× faster than Block Diffusion 📜 Paper: arxiv.org/abs/2506.01928 📘 Blog: s-sahoo.com/Eso-LMs/ 💻 Code: github.com/s-sahoo/Eso-LMs 🔬 Colab: colab.research.google.com/dr… 🤗 Hugging Face: huggingface.co/collections/s… Project co-led with @zhihanyang_ (1/9)
7
484
Replying to @RickyTQChen
Looking forward to it! I have an oral presentation there on our work “The Diffusion Duality”, in which we unlock few-step generation in diffusion language models. Hopefully you’ll like it 😊 s-sahoo.com/duo/
1
7
435
Replying to @geoffreyhinton
We published a paper at #ICML where we used periodic functions along with 1 / x as activations to perform symbolic regression. Doing this helped the NN to generalize to unseen domains (and it outperformed Eureqa). @GMartius arxiv.org/abs/1806.07259
7
🥇 New SOTA on the Speed–Quality Pareto Frontier Eso-LMs redefine what’s possible: 🔁 MDLM-level perplexity at high speed ✍️ AR-level perplexity when needed ❌ No mode collapse at low steps — unlike Block Diffusion One model. Full control. P.S. Low Gen PPL = High sample quality (8/9)
1
7
755
In collaboration with: @zhihanyang_ @yashakha Johnna Deepansha @ChengZhoujun Hector Liu @ericxing @jwthickstun @ArashVahdat (9/9)
1
7
676
Our work uses gradient based methods to scale SMT solvers (#Microsoft z3) to analyze deep networks, such as inception, for model explanation. (1 / 3) @rishabhs #NeurIPS2020 #Google #GoogleAI
Scaling Symbolic Methods using Gradients for Neural Model Explanation. arxiv.org/abs/2006.16322
1
7
Excited to present our #NeurIPS2024 🌟spotlight🌟paper: "MuLAN: Diffusion Models with Learned Adaptive Noise" on Fri at 4:30 p.m. in Hall A-C (#2604) 📜arxiv.org/abs/2312.13236 💻github.com/s-sahoo/MuLAN 🔖s-sahoo.com/MuLAN/ w/ @SkyLi0n Chris @volokuleshov
6
762
@sedielem Very kind of you to share our work; it's such an honor 😊 Not sure if you recall, but during my NeurIPS poster session, I briefly mentioned an idea about why adding Gaussian noise to one-hot vectors might be better than adding it to embeddings. It was because of this connection to Uniform-state diffusion.
6
741
.@keenanisalive From quantum mechanics, where the quantized energy states of electrons arise as solutions to continuous wave equations, to the binary logic of digital circuits, fundamentally driven by smooth analog currents, discreteness has repeatedly and naturally emerged from an underlying continuum. In the following work, we show that a discrete diffusion process is, in fact, an emergent phenomenon of an underlying continuous Gaussian diffusion process. nitter.app/ssahoo_/status/1933675…
We often use discretization to approximate continuous laws of physics, but it also goes the other way: You can use continuous equations to approximate the behavior of discrete systems! Here we'll see how electrical circuits can be modeled using the Laplace equation Δφ=0. [1/n]
5
825
Replying to @_akhaliq
Thanks for the shoutout 😊For details see this thread:
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
1
4
464
Replying to @kfountou
Indeed. Very fortunate to be in this place and I couldn't have asked for more.
5
2,799
Replying to @huybery
Diffusion LLMs. 🔥Few-step generation in LLMs (ICML 25): s-sahoo.com/duo/ ✨Used by ByteDance's Seed Diffusion (NeurIPS 24): s-sahoo.com/mdlm/
1
5
937
Eso-LMs seamlessly interpolate between MDLM and AR perplexities on OWT and LM1B (7/9)
1
4
595
Replying to @jarridrb
We’re working on a benchmarking paper that evaluates existing diffusion LMs, highlighting where they excel and where they fall short. I’ll share it with you as soon as it’s ready :)
1
4
93
Replying to @kohjingyu
Yeah, they are adorable
3
2,060
Replying to @tw_killian
Congrats to the team!
1
4
274
no rest for the wicked 😊
3
467
In this work we aim to find a minimal subset of input features relevant for the model’s prediction. Earlier approaches, which used these solvers for interpretability, were limited to small networks featuring a few thousand neurons. (2 / 3)
1
3
It’s remote for now, but we’ll consider in-person sessions depending on the number of participants.
3
575
Replying to @LucaAmb @Cornell
Thank you so much for having me 😊
3
784
Replying to @yingheng_wang
Omg, yesss!! Feels like it was yesterday. This was an unplanned PhD, ngl. I wanted to drop out of the program for the longest time haha. And thank you so much for your kind words 😊😊
1
3
565
The unique problem formulation leverages gradient information to partially encode the network which helps these solvers scale to networks with millions of parameters. (3/ 3)
3
Ouch, my ego took a hit. Chemistry is a subject that can be gamed with rote learning, yet surprisingly, Gemini performs worse in it than in physics and math.
AI now beats every single human in the hardest college entrance exam in India, the IIT JEE. Bytedance silently published this result this week. The top scorer was Rajit Gupta with 332/360, but Google's Gemini 2.5 Pro was at rank 1 with 336/360.
3
757
Replying to @NiJinjie
And I've learned from yours! Both you and @mihirp98 have done an amazing job with the scaling laws for MDMs in the data constrained regime.
1
3
74
Replying to @vincesitzmann
You might find our work interesting (published at #ICML) where we used periodic functions along with 1 / x as activations to perform symbolic regression. Doing this helped the NN to generalize to unseen domains (and it outperformed Eureqa). arxiv.org/abs/1806.07259
3
This work was done in collaboration with @mariannearr @SchiffYair @SkyLi0n Edgar @justintchiu @srush_nlp @volokuleshov. (10/10)
3
281
Lmao 😂😂
1
114
Thank you for your kind words! We look forward to having you 😊
2
236
Would love to have you! Please join the mailing list: groups.google.com/g/diffusio…
1
2
334
We are in the process of figuring it out. Meanwhile, subscribe to the mailing list and we’ll keep you posted:)
1
2
372
Replying to @sovon_haidar
Thank you so much!! I'll release a tutorial on that paper soon. Stay tuned 😊😊
1
2
263
Replying to @arankomatsuzaki
Thank you @arankomatsuzaki for sharing our work. Here's a more detailed thread if you're interested😄
🔥Diffusion Models 𝐚𝐥𝐦𝐨𝐬𝐭 beat AR models on text generation. Presenting MDLM, a Masked discrete Diffusion Language Model featuring a Rao-Blackwellized ELBO, which is a mixture of classical Masked Language Modeling losses and achieves SOTA results among all DMs. (1/10)
2
44
Replying to @FoldMani
That’s the unfortunate reality
2
300
Replying to @jdeschena
Looking forward to it :)
1
101
Replying to @jbhuang0604
Thank you so much for covering MDLM 😊 In our ICML paper, we show that uniform state diffusion, a type of discrete diffusion, emerges from Gaussian diffusion—enabling few-step generation in diffusion language models. paper: s-sahoo.com/duo/ tweet:
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
2
264
Thanks for your interest😊More details here:
🚨 “The Diffusion Duality” is out! @ICML2025 ⚡️ Few-step generation in discrete diffusion language models by exploiting the underlying Gaussian diffusion. 🦾Beats AR on 3/7 zero-shot likelihood benchmarks. 📄 Paper: arxiv.org/abs/2506.10892 💻 Code: github.com/s-sahoo/duo 🧠 Blog: s-sahoo.com/duo/ (1/8)
2
63
Replying to @pranamanam
Many congratulations! You just couldn’t resist talking about work, could you?
2
517