VP, Kernels @togethercompute Assistant Professor @ucsd_cse Looking for talented kernel engineers and performance engineers!

Excited to share that I will be joining UCSD CSE as an assistant professor in January 2026! I'll be recruiting PhD students from the 2024 application pool - if you're interested in anything ML Sys/efficiency/etc please reach out & put my name on your application! Until then I'll be finishing up some requirements at Stanford (long story...) and hanging out at @togethercompute. Stay tuned for more!
47
40
578
116,061
Attention is all you need... but how much of it do you need? Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 arxiv.org/abs/2212.14052 1/n
25
268
1,666
373,416
We spent a couple days this week speeding up Stable Diffusion in @huggingface Diffusers using FlashAttention. 3-4x faster than the original version, 33% faster than the super optimized v0.4.1 - and >1 image/s throughput on A100. w/ @tri_dao A short thread on how we did it👇
6
62
533
Announcing FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores! We speed up exact FFT convolutions by up to 7.93x over PyTorch, reduce memory footprint, and get 4.4x speedup end-to-end. Read on for more details: Thanks @arankomatsuzaki and @_akhaliq for tweeting it out yesterday :) The key idea: map the FFT onto tensor cores using a Monarch decomposition -- which allows kernel fusion for long sequences, and uses fast matmul units to compute the FFT (pic 2). The FFT convolution allows us to compute the convolution in O(N log N) time, instead of O(N^2) from computing it directly (as in PyTorch nn.Conv1d). With advances in gated convolutions & gated SSMs, this means that we have a sub-quadratic alternative to attention that scales well! But the FFT convolution has a critical flaw for ML - it has low hardware utilization on GPUs. We're talking several times slower than FlashAttention/FlashAttention-v2 until you get to very long sequences. There are two critical bottlenecks: the FFT convolution incurs a lot of expensive I/O to store intermediate outputs, and it doesn't use tensor cores (16x faster than general arithmetic on A100/H100)! Kernel fusion can partially address the I/O problems (32K & shorter), but you run into SRAM limits at long sequences. FlashFFTConv addresses these drawbacks, and achieves speedup over FlashAttention-v2 at sequences as short as 2K. The core idea is that a Monarch decomposition allows us to break up the FFT into smaller parts that can be computed using matrix-matrix multiply (pic 3). This decomposition brings another benefit: Since the FFT is being split into smaller parts, you can avoid SRAM limitations even for long sequences -- we support sequences up to length 4 million. Now, there's a good old-fashioned systems tradeoff with this Monarch decomposition. You can recurse on the decomposition to break the FFT down into more parts -- compute the FFT in 2 parts, vs 3 parts or 4 parts. 
Higher-order decompositions reduce your FLOPs, but require more I/O between intermediate steps. This introduces a natural tradeoff, which we model using a cost model that takes both compute time and I/O time into account (pic 4). (There's another tradeoff for higher-order -- if your sequence is too short, you actually get to matmuls that are too small to fill up your tensor cores). Our cost model gives us a natural guide -- we can use an order-2 decomposition for sequences up to 4K, then order-3 for sequences up to 32K, and then order-4 for sequences up to 4M. Even longer sequences may require even higher-order breakdowns! The upshot of all of this is end-to-end speedup and higher utilization. We see up to 4.4x speedup for long sequence models (where a bunch of the gain also comes from memory reduction). With these gains, long convolutional models are now competitive with FlashAttention-v2 at sequences as short as 2K -- with more speedup for longer sequences. FlashFFTConv is already being used to support several internal research projects, and we'll be releasing new long-sequence models trained with FlashFFTConv this week -- stay tuned for more! With great collaborators @KumbongHermann, @exnx, and @HazyResearch. Hermann is applying to PhD programs this year -- he's a beast, you should hire him! Shoutouts to @tri_dao, @BeidiChen for discussions on early versions of this work and always pushing the boundaries of ML & systems :) Supported by resources from @StanfordAILab, @StanfordCRFM, @StanfordHAI. Developed in collaboration with the great folks at @togethercompute, and soon available for select models in their new inference API. Check out our paper, blog, and code for more details: Paper: arxiv.org/abs/2311.05908 Blog: hazyresearch.stanford.edu/bl… Code: github.com/HazyResearch/flas…
3
71
369
69,877
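A small sketch of the "FFT as matrix multiplies" idea from the FlashFFTConv thread above. It assumes the order-2 decomposition has the shape of the standard two-factor Cooley-Tukey factorization (N = N1 x N2): two small DFT matrix multiplies with an elementwise twiddle correction in between. The function names are made up for illustration, and this is plain Python rather than a tensor-core kernel, so it is not the FlashFFTConv implementation itself.

```python
import cmath


def dft_matrix(n):
    """Dense n x n DFT matrix: F[r][c] = exp(-2*pi*i*r*c / n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (r * c) for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Plain matrix-matrix product on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def naive_dft(x):
    """Reference O(N^2) DFT used to check the decomposition."""
    f = dft_matrix(len(x))
    return [sum(f[k][t] * x[t] for t in range(len(x))) for k in range(len(x))]


def two_step_dft(x, n1, n2):
    """Length n1*n2 DFT computed as two small matmuls plus twiddle factors."""
    n = n1 * n2
    assert len(x) == n
    # Reshape the input row-major into an n1 x n2 matrix.
    a = [[x[n2 * r + c] for c in range(n2)] for r in range(n1)]
    # Step 1: length-n1 DFTs down the columns, as one matmul.
    b = matmul(dft_matrix(n1), a)
    # Step 2: elementwise twiddle factors exp(-2*pi*i*k1*c / n).
    w = cmath.exp(-2j * cmath.pi / n)
    c = [[b[k1][col] * w ** (k1 * col) for col in range(n2)] for k1 in range(n1)]
    # Step 3: length-n2 DFTs along the rows, as a second matmul.
    d = matmul(c, dft_matrix(n2))
    # Output index is k1 + n1*k2, i.e. a transposed read-out.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = d[k1][k2]
    return out
```

Applying the same split again to one of the two factors gives a three-step version, which is one way to read the "order-3" and "order-4" variants in the thread: smaller matmuls, but more intermediate results to move around.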
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023! Let's dive into what's new with the paper and the new goodies from this release: Monarch matrices are an expressive and hardware-efficient set of matrices that generalize the FFT -- and can be used to represent all sorts of fun linear transforms, from Hadamard transforms to Toeplitz matrices and more. Monarch Mixer (M2) uses Monarch matrices to mix information both along the sequence (replacing attention) and along the model dimension. M2 replaces attention in Transformers with gated convolutions, and replaces the linear layers in MLPs with sparse block-diagonal matrices. The result is architectures that scale sub-quadratically in both sequence length and model dimension! Back in July, we released a short blog post (hazyresearch.stanford.edu/bl…) with @togethercompute about using Monarch matrices to train some more efficient BERT models -- matching BERT-base in quality with 27% fewer parameters, and with long-context inference throughput. With this release, we're excited to announce two new M2-BERT-large models -- the 260M version matches BERT-large in downstream GLUE score with 24% fewer parameters (and also has much faster long-context throughput). Our paper also has a whole set of theoretical goodness that we didn't get to in our blog post. For causal language modeling -- e.g. GPT-style or decoder-only language modeling -- we need to parameterize the Monarch matrices to make sure that the sequence mixing is causal. This ensures that you can train with next token prediction, GPT-style. We use a mix of polynomial theory to interpret Monarch matrices as bivariate polynomial evaluation, and then causality is just a matter of keeping the degrees in check. 
(If you're familiar with the FFT convolution theorem, this is equivalent to the padding trick to turn the circular convolution into a causal convolution). Using this theory, we can train M2-GPT models -- fully sub-quadratic in the sequence length. In a weird twist, we found that we can get rid of the MLP layers entirely, and still match GPT performance... wild! Check out our paper, code, and blog post for more details: Paper: arxiv.org/abs/2310.12109 Code: github.com/HazyResearch/m2 Blog: hazyresearch.stanford.edu/bl… With @simran_s_arora, @Jessica_Grogan_, Isys Johnson, @EyubogluSabri, @ai_with_brains, @bfspector, @MichaelPoli6, Atri Rudra, and @HazyResearch Building on a lot of great work from great folks, including @tri_dao @_albertgu @davidwromero @srush_nlp @BeidiChen @exnx @BlinkDL_AI @MaxMa1987 @ramin_m_h and many many more! And of course, couldn't have done this work without support from @StanfordHAI @StanfordAILab @StanfordCRFM. In collaboration with @togethercompute. Check out our paper for more, and please reach out if you have ideas about usage or questions! arxiv.org/abs/2310.12109 And look forward to more soon ;)
5
59
278
81,037
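The padding trick mentioned in the Monarch Mixer thread above can be sketched in a few lines. This is a minimal illustration, not code from the M2 repo: the helper names are made up, and it uses a naive O(N^2) DFT so that it stays dependency-free (a real implementation would call an FFT). Multiplying spectra gives a circular convolution, where late inputs wrap around into early outputs; zero-padding both signals to length 2N and keeping the first N outputs removes the wraparound, which makes the result causal.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT / inverse DFT; stands in for an FFT in this sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def circular_conv(u, k):
    """Convolution theorem: pointwise product of spectra = circular convolution."""
    spectrum = [a * b for a, b in zip(dft(u), dft(k))]
    return [v.real for v in dft(spectrum, inverse=True)]


def causal_conv_direct(u, k):
    """Reference causal convolution: y[t] depends only on u[0..t]."""
    return [sum(k[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]


def causal_conv_fft(u, k):
    """Zero-pad to length 2N, convolve circularly, keep the first N outputs."""
    n = len(u)
    pad = [0.0] * n
    return circular_conv(list(u) + pad, list(k) + pad)[:n]
```

Without the padding, the first output of `circular_conv` already mixes in the last inputs, which would leak future tokens into a next-token prediction objective.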
This sentiment is exactly right - and why we've been working to increase sequence length in our lab for the past two years! From FlashAttention, to S4, H3, Hyena, and more - check out our blog post putting this line of work into context: hazyresearch.stanford.edu/bl… More below: 1/n
we thought we wanted flying cars and not 140/280 characters, but really we wanted 32000 tokens
3
40
234
92,156
New year, new model drop! w/ @JonSaadFalcon, @simran_s_arora, excited to release new long-context retrieval models with Monarch Mixer, up to 32K sequence length! First step 2 long-context retrieval, outperforming Mistral, BGE, OpenAI on long-context document retrieval. 1/
4
40
226
53,786
S4 is an amazing sequence model - but has seemed mysterious. It doesn't have to be! In this blog (originally an internal explainer for our group), @HazyResearch looks at S4 from first principles that are familiar to most sophomore engineering students. hazyresearch.stanford.edu/bl…
2
40
187
You've heard of models that are sub-quadratic in sequence length, but what if they were sub-quadratic in model *dimension* too? Announcing a preview of Monarch Mixer - a fully sub-quadratic & hardware-efficient architecture that matches BERT in quality! w @simran_s_arora 1/
4
41
155
61,452
Super excited to announce ThunderMLA: fast MLA decode in ThunderKittens ⚡️🐱! Up to 35% faster than FlashMLA. Where does that speedup come from? It's all in the scheduling! 1/
(1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA—a fused megakernel optimized for variable-prompt decoding! ⚡️🐱ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC. Blog: bit.ly/4kubAAK With @AaryanSinghal4, @realDanFu, and @hazyresearch!
2
22
133
25,921
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone! We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters! github.com/HazyResearch/H3 2/n
3
10
126
14,769
The Stanford MLSys Seminar is now available in podcast form on Apple Podcasts, Spotify, Google, and more! We release new podcasts every Monday and Friday (new episodes on Fridays, old episodes from the backlog on Mondays). Check us out on your favorite platform below! (1/n)
3
20
126
Blog alert! 📣 How does contrastive learning work? How can we apply it effectively? New *3-part series* covering *2 new papers* on getting better transfer & robustness, and how to apply contrastive w types to improve entity retrieval. Part 1: hazyresearch.stanford.edu/bl… 👇 (1/n)
1
37
112
I’ll be at #ICLR2025! 🛫🇸🇬 - ThunderKittens (spotlight) w @bfspector Thu 3pm - I’ll be at the @togethercompute booth Fri afternoon - we’re hiring aggressively for kernels! Please reach out if you’d like to chat kernels🌽, TK⚡️🐱, Chipmunk🐿️, or anything performance! DMs open!
2
7
114
5,793
And we're not done - excited to announce ThunderGQA ⚡️🐱! Fast fused decode, applied to GQA for Llama & QWEN family models, and 20+% faster than FA3! We'll be shipping more updates to ThunderMLA in the coming days, watch this space! w/ @bfspector @AaryanSinghal4 @HazyResearch
Super excited to announce ThunderMLA: fast MLA decode in ThunderKittens ⚡️🐱! Up to 35% faster than FlashMLA. Where does that speedup come from? It's all in the scheduling! 1/
2
27
105
10,294
Thrilled that FlashAttention won the best paper award at the Hardware Aware Efficient Training workshop at ICML - really excited to meet so many like-minded folks at the workshop. Thanks to the organizers (and NVIDIA) for the GPU!
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
5
5
102
New preprint alert! 📣 How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own. 📜 arxiv.org/abs/2203.13270 More below 👇 (1/n)
2
27
95
Super excited to release the RedPajama dataset - a new, fully open *1.2 trillion token* dataset following the LLaMA recipe. A first step towards creating leading, fully open-source large language models. together.xyz/blog/redpajama
Announcing RedPajama — a project to create leading, fully open-source large language models, beginning with the release of a 1.2 trillion token dataset that follows the LLaMA recipe, available today! together.xyz/blog/redpajama More in 🧵 …
2
16
89
32,706
A little pre-GTC present for everyone... new Blackwell kernels, all written in ThunderKittens! ⚡️🐱 BF16 & FP8 GEMMs, attention forwards & backwards - fast (competitive with cuDNN and cuBLAS) and open-source! w/ @bfspector @AaryanSinghal4 @HazyResearch @togethercompute 1/
2
13
89
6,754
An entire model... in a single kernel! The H100 number is crazy - at 1000 toks/s on 1xH100, the Llama-1B is running at 72% memory bandwidth util for the entire model. ⚡️
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint with @jordanjuravsky, @stuart_sul, @OwenDugan, @dylan__lim, @realDanFu, @simran_s_arora, and @HazyResearch)
1
7
89
7,571
See everyone at MLSys 2025 in Santa Clara next week! Super excited to organize the Young Professional Symposium program on day one (May 12) and our invited speakers @soumithchintala (keynote), @Tim_Dettmers, @infwinston, @simran_s_arora, @BeidiChen!
3
16
82
9,857
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop! Monarch Mixer: arxiv.org/abs/2310.12109 FlashFFTConv: arxiv.org/abs/2311.05908
2
13
77
23,301
A little taste of what we've been working on... super excited to launch support for FLUX models on @togethercompute! Some highlights: * With @tri_dao and the TKC team, we built the fastest FLUX engine anywhere - 315ms inference time for FLUX.1 [schnell] on our turbo engine * We're running a promotion with three months free support for FLUX.1 [schnell] on our free endpoint - have fun prototyping! * We're one of the exclusive launch partners for @bfl_ml's new FLUX1.1 [pro] model - the new state-of-the-art diffusion model by ELO score If you're excited about faster and better diffusion models (or you have a diffusion model you want to speed up, video anyone? 👀), please reach out - let's make diffusion faster for everyone!
FLUX has arrived on Together AI, and it's faster and more powerful than ever. We’re one of the exclusive launch partners for FLUX1.1 [pro], @bfl_ml latest premium high-performance model. Plus we’re giving developers 3 months free access to FLUX.1 [schnell] via our FLUX-schnell-Free endpoint. Start building with state-of-the-art image generation today. Read more: together.ai/blog/flux-api-is…
3
15
70
15,403
This is one of the coolest papers I've read this year. Efficient attention-free LLMs (SSMs, Mamba, etc) are cool - but you lose something from going to a fixed state. E.g., much harder to use long-context docs in QA. The problem is that you don't know what to put in the state.
Excited to share Just read twice: going beyond causal language modeling to close quality gaps between efficient recurrent models and attention-based models!! There’s so much recent progress on recurrent architectures, which are dramatically more memory efficient and asymptotically faster than attention 💨 But there’s no free lunch 🥪 these models can’t fit all the information from long contexts into the limited memory, degrading in-context learning quality. Is all lost?
1
10
67
10,184
Super excited for this model to see the light of day! 7B model, hybrid gated conv/SSM + attention architecture, trained for long context and running FlashFFTConv everywhere. You can chat with it now on the Together API!
Announcing StripedHyena 7B - an open source model using an architecture that goes beyond Transformers, achieving faster performance and longer context. It builds on the lessons learned in the past year designing efficient sequence modeling architectures. together.ai/blog/stripedhyen…
4
9
67
18,733
After a short hiatus, the Stanford MLSys Seminar is coming back this quarter with a special series of episodes on foundation models! Our first talk (ep 67!!) will be @tri_dao, who'll be talking about FlashAttention. Catch us *TOMORROW* at 3:30 PT: piped.video/watch?v=gMOAud7h…
1
20
61
10,814
ChatGPT's 1700-token system prompt got you down? Led by @jordanjuravsky, @brad19brown, introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA! A few things I love in this project: 1/
Excited to share my first PhD project! TLDR: Hydragen is an exact, simple (no custom CUDA) implementation of attention for large batches with shared prefixes. We can improve LLM throughput by over 30x for CodeLlama-13b. Also, adding lots more shared context becomes cheap: growing a prefix from 1k to 16k tokens causes less than a 15% drop in throughput. Details: Large-batch inference with shared prefixes is a common use case for LLMs. Chatbots can have long system instructions that are shared across users, few-shot examples can be reused across multiple problems, or many candidate solutions can be sampled from a single prompt (e.g. self-consistency, AlphaCode). Shared prefixes enable special optimizations because the attention keys and values for the prefix tokens are identical across sequences. Libraries like vLLM are great at avoiding redundant storage of the prefix, enabling a much larger batch size. We show that in addition to saving memory, shared prefixes can also be used to significantly improve the speed of computing attention. Existing attention implementations operate independently on every sequence in the batch without considering sharing. When sequences do in fact share a prefix, this means that the same prefix keys and values are read from GPU memory many times, regardless of whether they are redundantly stored or not. Moreover, during decoding these approaches involve computing many matrix-vector products, preventing the use of fast tensor cores. Overall, this leads to attention having a low hardware utilization that can bottleneck end-to-end decoding with big batches or long sequences. With Hydragen, we can improve the utilization and speed of attention by taking advantage of shared prefixes. Hydragen is a combination of two techniques: 1. Attention Decomposition: We split attention over the full sequence (which has partial KV overlap across the batch) into prefix attention (which has full overlap) and suffix attention (which has no overlap). 
As long as we store the softmax denominators from each sub-computation, we can cheaply combine them to obtain the full attention result. 2. Inter-Sequence Batching: Now that the shared prefix has been split into its own attention op, we can compute it efficiently by batching attention queries together across sequences. This converts many matrix-vector products into fewer matrix-matrix products, reducing redundant reads and leveraging tensor cores. Both of these techniques can be easily implemented in PyTorch, as long as you have access to a fast attention primitive that returns softmax denominators. Hydragen can dramatically improve end-to-end LLM throughput over baselines that only avoid redundant prefix storage. The speedups are biggest when attention is expensive relative to the rest of the model (e.g. large batch sizes, long sequence lengths, smaller models, no MQA/GQA), and when the ratio of prefix length to suffix length is high. A key takeaway for LLM users is that prefix attention is so fast that adding more shared tokens is cheap. With a large batch size, expanding the prefix length from 1k to 16k tokens for Hydragen only results in a 15% drop in throughput, while for vLLM throughput drops by over 90%.
1
8
56
15,541
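The attention decomposition step in the Hydragen thread above can be illustrated with a short sketch. This is a toy single-query version in plain Python, not the Hydragen code: the function names are hypothetical, and it tracks the log of each softmax denominator (log-sum-exp) rather than the raw denominator for numerical stability. Attention over the prefix and the suffix is computed separately, then the two partial outputs are merged using their denominators.

```python
import math


def attention_with_lse(q, keys, values):
    """Softmax attention for one query vector.

    Returns the output vector and the log of the softmax denominator.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results over disjoint key/value sets.

    Each partial output is re-weighted by its share of the total denominator.
    """
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


def decomposed_attention(q, prefix_kv, suffix_kv):
    """Prefix attention + suffix attention, merged into full attention."""
    p_out, p_lse = attention_with_lse(q, *prefix_kv)
    s_out, s_lse = attention_with_lse(q, *suffix_kv)
    return combine(p_out, p_lse, s_out, s_lse)
```

Once the prefix has its own attention call, the queries from every sequence in the batch can be stacked against the same prefix keys and values, which is the point where the thread's inter-sequence batching turns many matrix-vector products into a matrix-matrix product.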
Thrilled to win the Best Poster award at the ENLSP workshop!
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
1
2
56
31,990
Super excited to share Chipmunk 🐿️- training-free acceleration of diffusion transformers (video, image generation) with dynamic attention & MLP sparsity! Led by @austinsilveria, @SohamGovande - 3.7x faster video gen, 1.6x faster image gen. Kernels written in TK ⚡️🐱 1/
Training-free acceleration of Diffusion Transformers with dynamic sparsity and cross-step attention/MLP deltas--collaboration with @SohamGovande and @realDanFu! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!
4
16
55
8,350
Super excited to share some thoughts with @laurel_orr1 on lessons learned from the past four years with @HazyResearch and @SnorkelML, and what's next for the ways that machine learning is changing how we build software: hazyresearch.stanford.edu/so…
1
20
45
This Thursday, @srush_nlp from @cornell_tech will be talking to us about going beyond softmax in NLP. As always, 30 minute talk + 30 minute podcast with live audience questions, be sure to tune in! Livestream link: piped.video/watch?v=8nx4KfK3… #Stanford #MachineLearning
6
43
Our paper got accepted to #ICML2022 - excited to talk about this work in Baltimore!
New preprint alert! 📣 How do we produce transferable and robust representations with supervised contrastive learning? We need *geometric spread* and an inductive bias towards *latent subclass clustering* in representation space. 📜 arxiv.org/abs/2204.07596 👇 (1/n)
3
42
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers. Two key ideas: * Adapting SSMs to be able to do *comparison* * Making SSMs as hardware-efficient as attention 3/n
1
2
40
8,230
We built off the super-optimized version of Diffusers that @Nouamanetazi / @huggingface released last week, and we're 33% faster than it - the diff is pretty small, 68 LOC: github.com/HazyResearch/diff…
1
5
39
We’ve been hard at work training RedPajama 7B! GPUs go brrr :)
Training our first RedPajama 7B model is going well! Less than half way through training (after 440 billion tokens) the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile. Details at together.xyz/blog/redpajama-…
2
2
36
5,889
A bit late, but super honored to receive the best student paper runner up at @UncertaintyInAI #UAI2022! This project has been 2+ years in the making (we started *before COVID*), so super grateful to see it recognized! w @MayeeChen, @dyhadila, @fredsala, @kayvonf, @HazyResearch!
New preprint alert! 📣 How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own. 📜 arxiv.org/abs/2203.13270 More below 👇 (1/n)
1
8
38
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator? No more fixed context windows, long context for everyone! 16/n


1
1
37
3,631
Super excited by this work. Making attention IO-aware makes it run way faster - and enables much longer sequences, since memory footprint becomes linear in sequence length. Really excited to see how this gets used, and where it goes next - IO-aware transformers?
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
3
3
33
FlashAttention speeds up attention and reduces its memory footprint - without any approximation. Our key insight is that attention is bottlenecked by GPU memory *reads/writes*. FlashAttention speeds up attention by reducing the R/W. Same FLOPs, 3-4x faster!
1
6
35
Friends don't let friends run XGBoost on tabular data without trying foundation models first. Great work by some awesome labmates!
Can Foundation Models (FMs) clean and integrate your data? We explore the efficacy of FMs on these hard classical data tasks (1/7)
14
35
Build your own ChatGPT! Super excited by this open-source release - even more exciting that it was trained 100% carbon-negative. Happy to play a (minuscule) part in putting it together and helping serve it faster. Looking forward to seeing what folks build on top of this!
Introducing OpenChatKit. A powerful, open-source base to create chatbots for various applications. Details in 🧵 together.xyz/blog/openchatki…
7
35
10,183
Why Train What You Can Code? Excited to share Rekall - using programmatic composition to find new events in video! Paper on arXiv, and code available on Github! Blog: dawn.cs.stanford.edu/2019/10…
1
20
32
I’m building up the kernels team @togethercompute! If you’d like to contribute to kernels like ThunderMLA for production workloads, please reach out!
Replying to @togethercompute
At Together AI, we are thrilled to be building a world-class kernels team. If you’d like to come build with us, please reach out!
5
34
5,060
Meta uses FlashAttention to speed up inference in AITemplate - really cool work, super excited to see folks pick it up!
Get faster, more flexible inference on GPUs using our newly open-sourced AITemplate, a revolutionary new inference engine that delivers up to 12X performance improvements on NVIDIA GPUs & 4X on AMD GPUs compared to eager-mode within PyTorch. Learn more: bit.ly/3rl8F3b nitter.app/MetaAI/status/15769745…
1
31
Absolutely thrilled to receive the best paper award w @MayeeChen for our work on supervised contrastive learning at the AI with Biased/Scarce Data Workshop at @RealAAAI today! Check out the paper on the workshop website: drive.google.com/file/d/1LX7… Short 🧵👇 - more soon! (1/n)
1
4
31
Happy Sunday! ThunderMLA -> ThunderGQA -> ThunderMHA! What’s next? 👀
Wrapping up our trio of decode kernels, we’re excited to announce ThunderMHA! Our fused decode kernels pack now supports Multi-Head Attention (MHA), powering even faster inference for day-1 architectures like Transformers, GPT, and BERT. 10%+ faster than FA3 on H100s, we’re excited to keep on pushing perf 🚀. Play with ThunderMHA here: github.com/HazyResearch/Thun… w/ @bfspector @realDanFu @HazyResearch
1
1
32
2,480
Join us today for our workshop on efficient systems for foundation models - we’ve got a killer lineup of speakers and posters!
Attending #ICML2023? Join us Saturday at our workshop on Efficient Systems for Foundation Models! 🔥 Large-Scale Distributed Training 🚀 Efficient Inference ⚙️ Deep Optimization 📈 Over 50 posters and 4 orals spanning from RL to efficient finetuning! gpusgobrrr.com
2
7
30
38,105
I'm heading to #NeurIPS next week Wed-Fri! I'll be at a couple things: - Wed 1-2pm: talking Transformer killers with @picocreator at @swyx @latentspacepod live! - Wed 11am: RedPajama poster (spotlight) with @mauriceweberq I'm also recruiting for my lab at UCSD this cycle and for @togethercompute! Please reach out if you're interested in: - CUDA kernels/ThunderKittens - Faster diffusion models - SSMs/architectures DM me if you'd like to meet up! 👋
1
5
28
5,020
Then we'll have an invited talk from @Tim_Dettmers (10:45-11:05) on "Lessons Learned from Successful PhD Students" - where Tim will tell us a bit about his PhD journey and how to have a satisfying and successful PhD. I'm sure it will be great advice for all of us!
1
5
29
15,107
Ce Zhang (@DS3Lab and @togethercompute) has done some crazy stuff in distributed training. In this talk, he goes over the magic behind distributed training and inference on a GLOBAL scale over slow networks! Tune in tomorrow at 3:30 pm Pacific! piped.video/watch?v=e7o2C0lP…
2
10
29
5,327
The deadline for our #ICML2023 workshop Efficient Systems for Foundation Models is tomorrow, May 31 AOE! Submit your best papers on training, inference or anything FM systems and efficiency - then join us for a great day of speakers & panel in Hawaii! es-fomo.com
3
10
28
13,320
Check out our fork of @huggingface Diffusers on GitHub and our blog post to try it out yourselves and read more! Code: github.com/HazyResearch/diff… Blog: hazyresearch.stanford.edu/bl…
2
4
27
If you're at ICLR, catch my talk on our paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models today at 10 AM in room AD12! Featuring photos of actual Rwandan hippos :) (+poster from 11:30-1:30 at board 80!)
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday. Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
1
3
26
4,539
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers. Up to 1,980 tokens/second! 12/n
2
1
28
4,204
Replying to @typedfemale
Thanks for bringing this to our attention. We've updated the blog in light of this new and important information: 🙏🙏🙏 hazyresearch.stanford.edu/bl…
2
3
28
2,467
This is cool - a generalization of attention, SSMs, RNNs through the view of associative recall and what is solvable by each class. Nice work @heyyalexwang!
did you know you've been doing test-time learning this whole time? transformers, SSMs, RNNs, are all test-time regressors but with different design choices we present a unifying framework that derives sequence layers (and higher-order attention👀) from a *single* equation 🧵
1
4
25
3,507
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling. We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the PILE. Our model *outperforms* GPT-Neo in PPL! 7/n
1
1
27
5,339
What a throwback to weak supervision! Great work @JonSaadFalcon @ekellbuch @MayeeChen!
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)
1
7
24
5,374
And to close out a trio of diffusion papers… Super excited to announce Grafting - a method for distilling pretrained diffusion transformers into *new architectures*, led by @keshigeyan! Swap attention for new primitives for 2% pretraining cost, exciting for modeling research!
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu/ Co-led with @MichaelPoli6
2
5
23
6,283
One final plug: Oral 2A Efficient Learning tomorrow is absolutely **packed** with great work from @Tim_Dettmers and @srush_nlp - super excited to hear what they have to say!
1
5
23
3,924
Mambas go brr with tensor cores!
With @_albertgu, we’ve built a rich theoretical framework of state-space duality, showing that many linear attn variants and SSMs are equivalent! The resulting model, Mamba-2 is better & faster than Mamba-1, and still matching strong Transformer arch on language modeling. 1/
1
22
3,607
With FlashConv, we can make SSMs outperform attention for almost all sequence lengths -- up to 35x faster than FlashAttention for long sequences! 11/n
1
1
22
4,386
(1/n) This week we have @fredsala on the Stanford MLSys Seminar, live on Thursday at 1:30 PM! Fred was a postdoc at @StanfordAILab, and is now a professor at @WisconsinCS and a research scientist at @SnorkelAI -- so he knows a thing or two about MLSys. piped.video/watch?v=XbnAYeSJ…
1
6
22
RedPajama-v2 - 30 trillion tokens, 84 CC dumps, 5 languages! Excited to see what people do with it :)
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama-d…
1
1
23
3,159
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop! Monarch Mixer: arxiv.org/abs/2310.12109 FlashFFTConv: arxiv.org/abs/2311.05908
1
1
22
14,960
I’m off to #ICML2025 in Vancouver! (After an unusually eventful first flight - our plane had a wing problem, so we had to make an emergency landing back at SFO & switch planes) Reach out if you’d like to chat about (mega)kernels, @togethercompute, or anything MLSys! 1/
2
22
1,000
The MLSys Seminar is back this week with our very own @BeidiChen! Tune in Thursday, 1:30 PM on YouTube to hear about her great work on sparsity in deep learning. Livestream link: piped.video/watch?v=aGPzuwox… #Stanford #MachineLearning
2
6
21
These synthetic languages (inspired by great work like transformer-circuits.pub/202…) test how well SSMs can do in-context learning compared to attention. We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
2
1
22
6,173
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday. Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
Attention is all you need... but how much of it do you need? Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 arxiv.org/abs/2212.14052 1/n
3
22
7,122
We sped up stable diffusion by replacing the self-attention/cross-attention blocks in the unet with FlashAttention. FlashAttention doesn't do any approximation, so you get the *exact same image* at the end.
1
1
22
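Why swapping in an exact attention kernel leaves the output unchanged can be illustrated with a small sketch. This is not the FlashAttention CUDA kernel; it is a pure-Python toy, with hypothetical function names, that processes keys and values one block at a time while keeping a running max and running sum, and checks that the result matches ordinary softmax attention up to floating-point rounding.

```python
import math


def scores_for(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, K, V):
    """Reference: softmax over all scores at once, then a weighted sum of V."""
    scores = scores_for(q, K)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(dim)]


def blockwise_attention(q, K, V, block=2):
    """Same computation, visiting K and V one block at a time (toy sketch)."""
    dim = len(V[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(K), block):
        k_blk = K[start:start + block]
        v_blk = V[start:start + block]
        scores = scores_for(q, k_blk)
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [
            a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
            for j, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up toy inputs: one query, five keys/values.
q = [0.1, 0.2]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, K, V)
tiled = blockwise_attention(q, K, V, block=2)
```

The rescaling step is what makes the blockwise result exact rather than approximate: no block ever needs the full row of scores in memory at once.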
I'm at #ICML2022 this week! Let's chat if you're also in person! I'm presenting two papers: - Improving Transfer, Robustness of Supervised Contrastive Learning arxiv.org/abs/2204.07596 - FlashAttention: Fast & Memory-Efficient Exact Attention arxiv.org/abs/2205.14135 ⏱below!
1
1
20
The power of data - RedPajama-2.8B matches Pythia-7B in HELM score after being trained on 2x the tokens! Excited to see these models continue to improve as they see more tokens :)
In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens it is exciting to see the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B. In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B!
2
20
2,698
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap. The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
1
1
20
6,248
Wow, great to see FlashAttention being adopted by folks in industry - excited to see where else it can make training faster!
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9) mosaicml.com/blog/gpt-3-qual…
2
19
Super excited for our new seminar series on ML and systems -- how does ML change the modern programming stack, and what does it mean for how people will build and deploy applications in the future? Live on YouTube every Thursday, 3-4 PM PT. Check out links below for more!
Announcing the new live-streamed Stanford MLSys Seminar Series, in which we will explore the frontier of machine learning and systems. Read the full announcement: hazyresearch.stanford.edu/ml… Schedule: mlsys.stanford.edu Intro video: piped.video/OEiNnfdxBRE
1
5
19
(1/n) This week @dorisjlee from @ucbrise and @BerkeleyISchool will be joining us on the Stanford MLSys Seminar to talk about her fantastic work on @lux_api. You can catch us live on YouTube this Thursday at 1:30 PT! Deets in 🧵👇: piped.video/watch?v=yrmSoU8j…
1
9
19
Announcing HMAR - Efficient Hierarchical Masked Auto-Regressive Image Generation, led by @KumbongHermann! HMAR is hardware-efficient, reformulates autoregressive image generation in a way that can take advantage of tensor cores. Hermann is presenting it at CVPR this week!
Excited to be presenting our new work, HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation, at #CVPR2025 this week. VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from low-res to high-res) as opposed to next-token prediction. HMAR builds on VAR to make it even better. We make changes that: ✅ Improve quality: FID, Inception Score (up to 30 pts), and qualitatively ✅ Speed up training by up to 2.5x, inference by up to 1.75x, and reduce inference memory footprint by up to 3x. ✅ Enable adjustable sampling schedules to trade off quality/speed without retraining from scratch.
5
21
5,005
One thing I've been wondering about as the next generation of GPUs comes online is how much further we can push quantization and scale precision down... this paper takes some of the first steps towards answering this question!
[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre and post-training arxiv.org/pdf/2411.04330. TLDR; - Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training! - The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices! Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.
2
2
19
3,123
A few fun bits I couldn't fit into the original tweet: 1. We also have the fastest implementation of a short depthwise 1D convolution, which doesn't use the FFT but is up to 7x faster than PyTorch Conv1D, check out our repo to try it out: github.com/HazyResearch/flas… 2. During development of this project, we found a bug in the backward pass of the FFT in PyTorch... that was a fun one to debug :) github.com/pytorch/pytorch/i… 3. End-to-end speedup numbers and comparison against FA-v2 (Twitter only allows 4 images per post now?) 62% MFU end-to-end!
4
19
1,889
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes. We develop FlashConv to address this problem. FlashConv uses a block FFT algorithm to increase FLOP util, and uses state passing to scale to long sequences. 10/n
2
1
19
4,772
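To make the "multiple FFT calls" concrete, here is a minimal pure-Python sketch of a long convolution computed through the FFT. It is the textbook approach, not FlashConv itself: there is no block FFT or state passing here, and the function names are hypothetical. Each convolution needs two forward FFTs and one inverse FFT, and in a naive GPU implementation each of those reads and writes intermediate results to memory.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_conv(u, k):
    """Causal convolution of input u with kernel k (len(k) <= len(u)) via FFT."""
    n = len(u)
    size = 1
    while size < 2 * n:  # zero-pad so the circular convolution does not wrap
        size *= 2
    u_f = fft([complex(v) for v in u] + [0j] * (size - n))       # FFT call 1
    k_f = fft([complex(v) for v in k] + [0j] * (size - len(k)))  # FFT call 2
    y = fft([a * b for a, b in zip(u_f, k_f)], inverse=True)     # FFT call 3
    return [v.real / size for v in y[:n]]


def direct_conv(u, k):
    """O(N^2) reference: y[i] = sum_j k[j] * u[i - j]."""
    return [
        sum(k[j] * u[i - j] for j in range(min(i + 1, len(k))))
        for i in range(len(u))
    ]


# Made-up toy inputs: the kernel is as long as the sequence.
u = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 0.5, 0.25, 0.125]
via_fft = fft_conv(u, k)
via_direct = direct_conv(u, k)
```

Both routes give the same output up to rounding; the FFT route does it in O(N log N) work instead of O(N^2), at the cost of the three transforms above.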
Announcing FlyingSquid - fast weak supervision with triplet methods. We speed up weak supervision by orders of magnitude, allowing weakly-supervised video analysis and online learning! Blog: hazyresearch.stanford.edu/fl… w/ @MayeeChen, @fredsala, Sarah Hooper, @kayvonf, @HazyResearch
1
12
18
This is really cool! There’s a ton of places where a dynamic differentiable hierarchy makes a ton of sense. Awesome to see progress here!
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
19
1,137
Part 1: the quality gap. SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling. In our paper, we use *synthetic languages* to probe this gap 4/n
1
1
19
6,847
This week we're excited to have @kexinrong (@Stanford, @VMware, and @gtcomputing) on the MLSys Seminar. Kexin will talk about improving query performance on big-data analytics. Be there or be square! Watch us live on YouTube this Thursday at 1:30 PT: piped.video/watch?v=sHmpMoao…
1
3
18
Thanks Tri! And yes, I'm on the academic job market this year :)
As much as I like attention, I'm also fond of attention-free architectures for long context. @realDanFu and others have been pushing in this direction, with deep theory and compelling empirical results! And @realDanFu is on the academic job market this year!
18
5,192
Thanks for having me on! It was really fun, a really great event, and really well-run!
and then there’s @realDanFu presenting all the frontier architecture work, starting with the adorably named ThunderKittens!!
1
2
17
5,335
Check out the blog for more details on the technical bits, and check out our GitHub for instructions on how to play with the model! Blog: hazyresearch.stanford.edu/bl… Github: github.com/HazyResearch/m2/b… 7/
1
1
18
1,243
With @tri_dao (co-first), @KhaledSaab11, @ai_with_brains, Atri Rudra, and @HazyResearch! Thanks to @StanfordAILab, @StanfordHAI, @StanfordCRFM, and @togethercompute for helping provide us with the compute necessary to train these models! 17/17
1
18
3,567
Model weights available on HuggingFace, and AutoModel compatible. Download them with just two lines of code! 32k model: huggingface.co/togethercompu… 8k model: huggingface.co/togethercompu… 2k model: huggingface.co/togethercompu… 4/
1
2
17
1,431
ES-FoMo is back at #ICML2025 this year! Submissions open until May 26, come join us for a great day of talks and posters in Vancouver!
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
4
2
17
1,481
This project has been a great collaboration with @togethercompute. Thanks to them, these models are already integrated into @MongoDB Atlas, @langchain, and @llama_index. Check out their tweet thread for more details! 9/
We are thrilled to announce the Together Embeddings endpoint! 🚀 Higher quality than OpenAI or Cohere in the MTEB benchmark. ✅ State of the art M2-Retrieval models with up to 32k context length. ✅ Up to 4x lower price. ✅ together.ai/blog/embeddings-… Details👇
1
3
14
4,448
(1/n) This week we're delighted to have @faitpoms (@Stanford, @SnorkelAI) on the MLSys Seminar Series! Fait will be talking about a vision for interactive model development, so you won't want to miss it. Catch us live on YouTube Thursday at 1:30 PM! piped.video/watch?v=-9LbJBzK… 🧵👇
1
5
14
On LoCo, M2-BERT-32k outperforms the state-of-the-art embedding models! It even outperforms Mistral-7B, even though M2-BERT models have only 80M parameters (85x more parameter efficient)! 3/
1
2
15
1,912