(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide. We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization -- on all kinds of kernels! Code: github.com/HazyResearch/Thun… Writeups: (short) hazyresearch.stanford.edu/bl…, (long) hazyresearch.stanford.edu/bl…. Joint with @AaryanSinghal4, @simran_s_arora, @HazyResearch and team!
19
151
878
301,953
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint with @jordanjuravsky, @stuart_sul, @OwenDugan, @dylan__lim, @realDanFu, @simran_s_arora, and @HazyResearch)
33
142
877
384,018
(1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA—a fused megakernel optimized for variable-prompt decoding! ⚡️🐱ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC. Blog: bit.ly/4kubAAK With @AaryanSinghal4, @realDanFu, and @hazyresearch!
7
69
370
60,492
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
7
48
321
84,808
(5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: bit.ly/451G881, Blog: bit.ly/3HcImHG
3
16
192
10,724
(1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With @realDanFu, @AaryanSinghal4, and @hazyresearch!
4
28
190
20,918
We got early access to some of the very first Nvidia B200’s. We share initial benchmark results and wrote the fastest (public) attention kernel with 925+ BF16 TFLOPs: Since the PTX instruction set released yesterday, @aaryan04 and I have been hard at work at @HazyResearch learning the new instructions and putting them to the silicon.
3
25
168
18,451
(1/7) In celebration of National Cat Day, we’re excited to release our first major batch of updates to ThunderKittens! ThunderKittens is now easier, better, faster, and cuter than ever before! In addition to massive speed boosts, we’re releasing a broad swath of kernels, new features, and tons of usability improvements. (Blog post: hazyresearch.stanford.edu/bl…, Paper: arxiv.org/abs/2410.20399) As always, joint with @AaryanSinghal4, @simran_s_arora, @HazyResearch, and team!
8
41
123
36,468
(2/5) Our Llama megakernel is built around an on-GPU interpreter. Each SM fetches and executes huge, custom instructions from a special instruction tensor, so the GPU can be doing many different things. Without kernel boundaries, each SM can go from one instruction to the next.
1
3
81
10,834
(4/5) A big problem is synchronization. Normally, kernel boundaries synchronize for you. But we got rid of them all, so we have to do it ourselves. Fortunately, we found fine-grained synchronization enabled other optimizations, too -- like starting some attention heads early!
1
2
77
10,602
(3/5) To run Llama-1B fast, we need to hide latencies like loading weights. So, we divide each SM’s shared memory into 16KiB pages, and specialize threads by role. So, loader threads can start loading future weights while worker threads work on the current ones.
1
2
74
9,891
Hello friends! I'm going live now on YT to teach CUDA + ThunderKittens, in particular to @qamcintyre but perhaps to you too if you're interested. Come hang out! piped.video/watch?v=xcpEl0cG…
3
52
4,039
(6/7) We're sharing TK as an art project to make key ideas clear and accessible. To help with that, we've integrated TK with @karpathy's awesome NanoGPT project, forked as github.com/HazyResearch/nano…. Also as a heads up: we are not going to be responding to Github issues. Check out our blog post for more details and happy coding! 🌩️🐱
1
8
45
5,702
(7/8) Code is at bit.ly/tplcode; it is (emphasis) research code. You can also play with our custom profiler at bit.ly/4mDJ0wG! We’ve written up both a brief, introductory post at bit.ly/tplintro and a longer, more technical one: bit.ly/tplmain
1
4
42
4,348
(2/7) TK's core primitive is a 16x16 tile that fits into the tensor core, which makes up 94% of the compute on an H100. By focusing on these tiles, we ensure that our kernels keep the tensor cores busy and achieve high performance. 💪 Basically, we think you should be both thinking and writing in big instructions on small tiles.
3
2
36
8,861
(2/8) Our megakernel is built on the same on-GPU interpreter used in our low-latency megakernel, and extended with TK’s new PGL primitives to scale across GPUs. We’ve written a larger, richer instruction set to run these more complex workloads without coarsening synchronization.
1
34
2,646
(5/7) We've been using TK in our lab to write lots of different kinds of kernels for projects like Based and Hedgehog. Some of these kernels substantially outperform what we could achieve with Triton. Plus, AI folks who attended a two-hour CUDA session have been able to write code using TK. 📈
1
2
33
5,485
We've integrated lightweight support for the new Blackwell (tcgen05) instructions into ThunderKittens if you want to play! We don't think that this is near the limit of what 's possible -- we’re still way off from CuBLAS (in progress), but already way above Hopper. More kernels soon to come: attention backwards, linear attention variants, long convolutions, and more! Give us another day or two... 🙂
1
30
5,649
We're releasing something soon. In anticipation, and in belated celebration of @katyperry 's birthday, here's Kitty Perry.
2
2
32
5,403
(4/7) The results speak for themselves: On 4090s and A100s, TK matches Flash Attention 2 (FA2) performance in just a few lines of code. On H100s, TK is substantially faster than FA2 for both forward and backward passes. Clean code and high performance? Yes, please! 🎉
3
2
30
5,739
(3/8) There are three key kernel optimizations that help us achieve this performance. The first is inter-instruction pipelining. We specialize our threads to follow a dataflow pattern, so that we can overlap instructions and keep the matrix multiplies rolling!
1
30
2,448
(8/8) This is joint work with my amazing collaborators @jordanjuravsky, @stuart_sul, @dylan__lim, @OwenDugan, @simran_s_arora, and @HazyResearch. And special thanks to @togethercompute for providing the GPUs to make this work possible!
1
31
3,046
(6/7) Profiling these megakernels isn't straightforward, so we built custom tools that generate detailed Gantt charts for visualizing kernel behavior. This allowed us to rapidly identify and eliminate hidden latencies and bottlenecks.
2
3
30
2,553
(7/7) ThunderMLA is just the beginning—we see fused megakernels and dataflow paradigms as a key part of the future of efficient GPU programming for AI. We think these techniques can enable efficient variable-length training (no more frankensteining documents together), higher utilizations, and new architectures!
2
2
30
2,236
(3/7) We designed TK to be familiar to AI folks, with a PyTorch-like API -- you can do things like take a row sum of a matrix. The inner compute loop of our H100 forward attention kernel is just 21 LOC. At the same time, TK doesn't hide how accelerators work, allowing users to take full advantage of the latest hardware. If you know CUDA, you can probably "compile" TK in your head! 🧠
2
1
28
7,787
(4/8) Second, we dynamically schedule work across each GPU's SM's at runtime: SM’s just pull new work from a global work queue as needed. This simplifies scheduling on the CPU, and prevents jitter on the GPU from leading to stalls.
1
26
2,045
(7/7) And, many additional thanks are in order. We’ve been supported by resources from @StanfordAILab, @StanfordCRFM, @StanfordHAI, among others. NVIDIA’s Cutlass team gave us invaluable advice on wgmma and TMA descriptors, and the @colfaxintl kernels were an important reference. Finally, extra special thanks to @togethercompute for compute, advice, and friendship, and also to @HertzFoundation for funding my PhD.
2
2
24
6,229
Special thanks to @TogetherCompute for their support and early access to Blackwell, to @__tensorcore__ for making the PTX docs a real pleasure this time around (excited for the new Cutlass!), and to @nvidia more broadly for... unusually coherent architectural decisions!
2
23
1,956
(3/7) How does ThunderMLA achieve this boost? By applying classical computing concepts (job scheduling, interpreters) to modern AI inference. Our megakernel is essentially an interpreter running a tiny virtual instruction set directly on the GPU. 🚀
1
1
23
2,324
(5/8) Third, we interleave communication-intensive instructions with compute-intensive instructions, which lets us reduce peak networking bandwidth. In large batches, this can make a substantial difference.
1
24
1,847
(6/8) We pipeline scheduling work on the CPU while the GPU runs the last batch, so that new instructions are ready by the time they’re needed. In practice, this usually leaves the CPU idle for ~90% of runtime.
1
23
1,785
(2/7) Variable-prompt decoding is notoriously tricky—especially with small, imbalanced batches, where kernel launches and tail effects become big headaches. ThunderMLA tackles this head-on, reducing overhead from multiple kernel launches into a single, streamlined megakernel.
1
23
1,659
(5/6) We also discovered the B200 tensor cores behave... similarly to 128×128 systolics, meaning you want M and N >= 128 for full FLOP utilization. Smaller values run at corresponding fractions—a bit different from Hopper where much smaller shapes could max out the GPU.
1
1
17
1,880
(2/7) Regarding kernels: we’re releasing a whole bunch -- a fused Mamba-2, H100-optimized FlashFFTConv kernels, RoPE, Layernorm, Hedgehog, and more! These kernels range from “a bit faster” to “much, much faster”.
1
15
983
(4/7) On a realistic production workload (see the blog post), ThunderMLA achieves 183 TFLOPS & 1520 GB/s on an H100 GPU, vs. FlashMLA's 144 TFLOPS & 1199 GB/s, a 20% boost. More performance, fewer kernel launches, and dramatically reduced tail effects.
1
17
1,277
(5/7) Scheduling matters—a lot! We've explored simple static scheduling and more advanced makespan-based scheduling techniques to shave crucial microseconds off execution times. These tweaks alone can yield an additional 10% speedup!
1
16
1,138
(2/6) Writing performant kernels on B200 GPUs feels a lot more like programming a dataflow-machine than traditional CUDA coding. It's all about loading enough data at high throughput to keep the tensor cores hot—and the B200's tensor cores have 2-2.5x the power of H100!
1
15
649
(3/7) We’ve also optimized our attention kernels to broadly match FA-3 on performance -- a bit slower on the forwards, and correspondingly faster on the backwards, while being simpler (no ping-ponging) and more customizable. We’ve trained models, and provided more robust integrations to help you do so, too.
1
14
816
Replying to @EitanTurok
More models are in the works -- and we do think that we can come up with a set of megakernel instructions to run a wide range of models. I think it's unlikely DL compilers could generate the instructions themselves, but perhaps the scheduler could be the new compiler!
16
2,141
(5/7) In terms of library improvements, new TK has: - No more explicit shared layouts (selected internally), just specify type and size. - Global layouts, to automatically manage TMA descriptors (H100) and strides (otherwise) - More & complex-valued type support (we swear we’ll get around to FP8 one of these days, it’s not even hard) Tens of thousands of robust unit tests for just about everything. - Templates to manage pipelining and latency-hiding. Get FA-3-approaching performance in just 42 lines of device code! - Lots and lots of performance optimizations: reducing bank conflicts, managing address spaces, issuing better instructions, removing synchronizations, and more.
1
13
682
(6/7) We’ll also be livestreaming this Thursday and teaching TK and some of the tricks we’ve learned about writing fast kernels -- come hang out with us!
2
13
818
(7/7) (In no particular order) Huge thanks to @togethercompute for plying us with H100s, and to @StanfordAILab, @StanfordCRFM, @StanfordHAI, among others, for their support of our work. Cutlass has remained a fount of good ideas for squeezing more out of GPUs. Special thanks to @realDanFu for his good sense of direction. And last but not least: happy National Cat Day!
1
12
1,444
(3/6) Happily, the new hardware features fit perfectly into TK's tile-based abstractions. We're taking full advantage of 5th-generation tensor cores, Tensor Memory, and CTA pairs to write simple kernels that match or exceed NVIDIA’s handcrafted libraries.
1
13
519
(6/6) Check out our kernels and learn more on our blog. And a huge thanks to @togethercompute for GPUs and their broader, continuing support.
1
13
1,569
(4/7) We’re also releasing demo integrations with Llama, Qwen, LoLCATs, and Based models. The fastest, easiest way to get your kittens talking to you!
1
12
716
(4/6) The secret? We pipeline everything—from HBM to shared memory, from cluster shared memory to tensor cores, from tensor memory into registers, from registers into shared memory, and then finally out into HBM. No bubbles!
1
2
13
916
Replying to @ryu0000000001
The latency benefits of megakernels are definitely more pronounced on smaller models, although we think it probably still does make sense even with 10B+ models, especially once one introduces multi-GPU inference. We've also focused on the latency benefits, but it turns out there are throughput benefits, too!
13
2,072
adding a bit more flavor A100 -- 312 TFLOPs, 2TB/s, ~$.001/sec (depending on provider.) Llama-70B (int8): decoding takes ~140GFLOP per token and ~70GB memory bandwidth per batch.
1
11
609
Replying to @qamcintyre
figure it out lmao
1
11
839
Replying to @typedfemale
i feel this is why perplexity was invented, just put it in the exponent
12
1,039
Replying to @iScienceLuvr
GRPO3 (they skipped GRPO2)
9
2,269
Replying to @zackangelo
This megakernel is just running one forward pass at a time in a batch size of one -- though it would be pretty easy to stick 16:1 speculation on top and get another 2-3x in latency. We do think there are throughput reasons to do megakernels too -- coming soon, stay tuned!
10
2,758
Replying to @charles_irl
Took a look through the PTX, it's disallowed for kernels that use either dynamic shared memory (e.g. all kernels that use >48kB) and setmaxnreg (e.g. most critical-path post-Hopper). Also curious if it decreases occupancy. Cool idea but I doubt it's much use in practice. :/
1
10
600
Replying to @SohamGovande
There is indeed a bit of competition for registers, it's a good point. Right now the virtualization doesn't use many registers, and we could probably reduce it further. A related concern is that it also means all of your instructions need to run with the same number of threads. Fortunately most ops run pretty well with 8-16 consumer warps. Of course, if it's a huge problem, you can also always just reintroduce a couple of kernel boundaries -- to us, Megakernels are a tool in a toolbox! But so far we've never needed to.
9
2,285
FWIW we could probably take advantage of this better than we do via cluster dsmem. But we've broadly found it too much of a hassle for the gains -- excepting very specific cases. There's probably juice there, too!
1
9
676
Replying to @PandaAshwinee
We actually do both of these things. Regarding the attention head schedule, the basic idea is that the different Q, K, V heads are each associated with separate counters in global memory. So, you don't have to wait for consistency on all of them in order to start attention. Rather, each attention instruction just waits for its specific inputs to be ready, whenever that happens. In this low-latency application it's probably a pretty minor optimization relative to the overall pipelining, but we think it'll also help in future, throughput-oriented scenarios, too.
9
3,145
Replying to @appliedcompute
congrats!!
1
7
1,684
Replying to @Tgale96
As time went on we did notice our abstractions looked more and more TPU-like! There are still some differences, of course, but I think it goes to show -- Google did get their infra right!
2
5
360
Replying to @tri_dao
Thanks Tri! It's a delight to be able to build on your work :)
5
528
In short, Llama 7B on a 4090 with int4, speculative decoding, and a persistent KV cache should end up at least 20x cheaper than ChatGPT even in single batch inference. Replacing speculative decoding with large batch should bring the factor to 200+ at the cost of more latency.
1
5
315
4. Not exactly a per-token thing but self-hosting and keeping the KV cache around saves massive costs for long dialogues relative to APIs. Was just chatting with Together on this. (I want.)
1
5
390
1. Switch to a 4090 -- cheaper per memory bandwidth. (3 year average cost incl electricity@$0.2/kWh is $0.00004/sec)
1
4
304
Replying to @saranormous
now :)
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
5
243
What the theory then says is to achieve cost parity, you need to run a batch size of at least 18. There are inefficiencies so in practice this is more like 50. But, there are some tricks you can use to get this a lot lower.
1
4
344
FA3 actually does not port trivially to Blackwell because the wgmma instructions from Hopper are no longer supported, in favor of the newer tcgen05 instructions. Unusual of Nvidia to break backwards compatibility, but they have!
4
369
So you can get ~28 batches per second. In practice a bit less since KV cache requires some reshuffling too. But on the flipside you can get 2,200 tokens worth of compute. In practice much less since attention has substantial non-matmul ops.
1
4
450
3. Speculative decoding can save memory bandwidth when in low-utilization situations, so that can get you another factor of 2-3. Also has the benefit of improving latency to end user.
1
3
308
Yep, link to code is in the post!
1
3
419
Replying to @hsu_byron
@simran_s_arora is the expert, but I'm happy to send you the materials I wrote if you like.
1
3
1,084
Replying to @paradigmai
The logo I didn't know I needed!
3
307
2. Use a smaller model -- 7B gets you a factor of 10 over 70B (duh.) Also quantize down to int4 and take the extra factor of 2.
1
3
286
Will have to take a look in the morning -- was nice for the 20 minutes it lasted :)
3
101
My two cents is that CUDA is absolutely a relevant moat, but it is not nearly as deep of a moat as, say, x86 was for Intel. With x86 you at least had billions of lines of code that had been written / optimized with x86 in mind. But the workload for an LLM inference cluster looks *more* like a few thousand lines of kernels run on loop on ~$1E9 of GPUs for 6 months on end. At that scale you can actually afford to pay people to rewrite a few kernels for new hardware, if the cost savings + performance are really there.
2
3
310
I didn't have a twitter but Roger mentioned this thread and now I do!
1
3
Just to follow up on this -- decided to rerun some micro benchmarks en route to paper, and have some additional clarity. 1. FP32 and BF16 both run at 60 TFLOPs, even if BF16 is packed. This is what I had been noticing before. 2. packed FP16 *does* indeed run at 120 TFLOPs. I think this actually gives a modest case to run FP8 forward pass with FP16 accumulate instead of FP32. The main issue that remains is that h2exp2 appears to be equally slow in half2, and that's the single biggest bottleneck in the kernel. But the rest of the normalizations would benefit from FP16, and I think it would be a noticeable perf improvement.
343
We haven't written a specialized kernel for it but I think it would be pretty easy to adapt our existing attention kernels in TK. The only complexity I can think of is starting the causal mask in the middle of a chunk. I don't think there's anything actually hard there, just would need a custom masking function to do it. (Probably like 30 LoC outside of TK.)
2
86
plt.xkcd and xkcd script font!
1
210
Replying to @terafoenix
This is a good question. NVIDIA indeed claims double rate there (with a *), but on our internal micros , we've actually found it running at same rate. (Maybe we missed something!) Blog uses 60 TFLOPs since non-matmul FLOPs usually should be done in FP32 for numerics, anyways.
1
1
991
Roughly speaking, those results are real, but they're trained at very low data/parameter ratios. Per the paper, what we found is that as you raise the amount if "overtraining" of the model, the gap really starts to open up between a 1.58-bit model and an 8 or 16-bit model.
1
111
Replying to @main_horse
I think we tried both <whichever version came with CUDA 12.6> as well as within Pytorch, they were similar. TBH, the point is actually less that we really think we can beat cublas in ways that are intrinsically significant, they have a ton of internal heuristics for blocking and padding weird matrix sizes that we're not super focused on. But we do think it's cool that you can get comparable performance on a pretty wide range of matrix sizes with just a few dozen lines of code. Together's flux release a few weeks ago used some custom TK kernels to fuse operations, and they were a good deal faster than cublas + unfused op -- the real utility is not in beating the pure matmul, but rather in fusing it with other operations, too!
262
Replying to @Yuchenj_UW
We haven't (and don't really intend) to write a complete set of inference kernels -- TK is more meant as a resource and a statement of philosophy than a productionizable LLM inference system. But if TK turns out to be useful for decoding kernels and whatnot, great!
1
1
1,160
Replying to @nc_znc
Glad you liked the talk! I'll DM; let's find a time after ICLR :)
1
1
46
Replying to @lukeigel
Happy birthday!!!
1
1
922
Replying to @YouJiacheng
Approved and merged, thanks very much!
1
70
Replying to @amanrsanger
I don't really disagree with this, but I want to put the con case quickly, which is that if you really think that models are going to get a lot bigger and sparser, the balance may shift back towards CPUs with low arithmetic intensities.
1
1
124
Replying to @divchenko
FP16 is sort of already there, we'd need to add a couple of typedefs but there's no real barriers to it. FP8 looks like more of a hassle to me, unfortunately, as a bunch of transpose intrinsics we rely on don't seem to be defined, and doing everything with shuffles will be quite painful. So... Q@K.T should be fine but Q@K may suffer in FP8. TBH I am not at all sure why NVIDIA did not define transpose intrinsics for other types. Seems like a weird thing to leave out?
254
We're mainly looking at CDNA2 and CDNA3 in the short term -- RDNA3 is not immediately in focus. Unfortunately, the gap between RDNA3 and CDNA3 is larger than the gap between Hopper/Ampere on one important attribute: wavefront size. It's not really *hard* to do RDNA3, but it requires parallel code for all of the register primitives.
2
1
48