Benjamin F Spector · May 12, 2024 · 8:06 PM UTC

Benjamin F Spector

Benjamin F Spector

@bfspector

12 May 2024

(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide. We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization -- on all kinds of kernels! Code: github.com/HazyResearch/Thun… Writeups: (short) hazyresearch.stanford.edu/bl…, (long) hazyresearch.stanford.edu/bl…. Joint with @AaryanSinghal4, @simran_s_arora, @HazyResearch and team!

151

878

301,953

Benjamin F Spector · May 27, 2025 · 6:43 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint with @jordanjuravsky, @stuart_sul, @OwenDugan, @dylan__lim, @realDanFu, @simran_s_arora, and @HazyResearch)

142

877

384,018

Benjamin F Spector · Mar 5, 2025 · 9:01 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

(1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA—a fused megakernel optimized for variable-prompt decoding! ⚡️🐱ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC. Blog: bit.ly/4kubAAK With @AaryanSinghal4, @realDanFu, and @hazyresearch!

370

60,492

Benjamin F Spector · Sep 28, 2025 · 7:34 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.

321

84,808

Benjamin F Spector · May 27, 2025 · 6:44 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

(5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: bit.ly/451G881, Blog: bit.ly/3HcImHG

192

10,724

Benjamin F Spector · Mar 15, 2025 · 3:42 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With @realDanFu, @AaryanSinghal4, and @hazyresearch!

ThunderKittens Now on Blackwells!

hazyresearch.stanford.edu

190

20,918

Benjamin F Spector · Jan 25, 2025 · 7:17 AM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

We got early access to some of the very first Nvidia B200’s. We share initial benchmark results and wrote the fastest (public) attention kernel with 925+ BF16 TFLOPs: Since the PTX instruction set released yesterday, @aaryan04 and I have been hard at work at @HazyResearch learning the new instructions and putting them to the silicon.

168

18,451

Benjamin F Spector · Oct 29, 2024 · 5:50 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(1/7) In celebration of National Cat Day, we’re excited to release our first major batch of updates to ThunderKittens! ThunderKittens is now easier, better, faster, and cuter than ever before! In addition to massive speed boosts, we’re releasing a broad swath of kernels, new features, and tons of usability improvements. (Blog post: hazyresearch.stanford.edu/bl…, Paper: arxiv.org/abs/2410.20399) As always, joint with @AaryanSinghal4, @simran_s_arora, @HazyResearch, and team!

123

36,468

Benjamin F Spector · May 27, 2025 · 6:43 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

(2/5) Our Llama megakernel is built around an on-GPU interpreter. Each SM fetches and executes huge, custom instructions from a special instruction tensor, so the GPU can be doing many different things. Without kernel boundaries, each SM can go from one instruction to the next.

10,834

Benjamin F Spector · May 27, 2025 · 6:44 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

(4/5) A big problem is synchronization. Normally, kernel boundaries synchronize for you. But we got rid of them all, so we have to do it ourselves. Fortunately, we found fine-grained synchronization enabled other optimizations, too -- like starting some attention heads early!

10,602

Benjamin F Spector · May 27, 2025 · 6:43 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

(3/5) To run Llama-1B fast, we need to hide latencies like loading weights. So, we divide each SM’s shared memory into 16KiB pages, and specialize threads by role. So, loader threads can start loading future weights while worker threads work on the current ones.

9,891

Benjamin F Spector · Oct 27, 2024 · 6:06 PM UTC

Benjamin F Spector

@bfspector

27 Oct 2024

Hello friends! I'm going live now on YT to teach CUDA + ThunderKittens, in particular to @qamcintyre but perhaps to you too if you're interested. Come hang out! piped.video/watch?v=xcpEl0cG…

CUDA + ThunderKittens, but increasingly drunk.

My friend Quinn (x.com/qamcintyre) asked me to teach him CUDA and T...

youtube.com

4,039

Benjamin F Spector · May 12, 2024 · 8:10 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(6/7) We're sharing TK as an art project to make key ideas clear and accessible. To help with that, we've integrated TK with @karpathy's awesome NanoGPT project, forked as github.com/HazyResearch/nano…. Also as a heads up: we are not going to be responding to Github issues. Check out our blog post for more details and happy coding! 🌩️🐱

GitHub - HazyResearch/train-tk: train with kittens!

train with kittens! Contribute to HazyResearch/train-tk development by creating an account on GitHub.

github.com

5,702

Benjamin F Spector · Sep 28, 2025 · 7:38 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(7/8) Code is at bit.ly/tplcode; it is (emphasis) research code. You can also play with our custom profiler at bit.ly/4mDJ0wG! We’ve written up both a brief, introductory post at bit.ly/tplintro and a longer, more technical one: bit.ly/tplmain

4,348

Benjamin F Spector · May 12, 2024 · 8:07 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(2/7) TK's core primitive is a 16x16 tile that fits into the tensor core, which makes up 94% of the compute on an H100. By focusing on these tiles, we ensure that our kernels keep the tensor cores busy and achieve high performance. 💪 Basically, we think you should be both thinking and writing in big instructions on small tiles.

8,861

Benjamin F Spector · Sep 28, 2025 · 7:35 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(2/8) Our megakernel is built on the same on-GPU interpreter used in our low-latency megakernel, and extended with TK’s new PGL primitives to scale across GPUs. We’ve written a larger, richer instruction set to run these more complex workloads without coarsening synchronization.

2,646

Benjamin F Spector · May 12, 2024 · 8:09 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(5/7) We've been using TK in our lab to write lots of different kinds of kernels for projects like Based and Hedgehog. Some of these kernels substantially outperform what we could achieve with Triton. Plus, AI folks who attended a two-hour CUDA session have been able to write code using TK. 📈

5,485

Benjamin F Spector · Jan 25, 2025 · 7:19 AM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

Replying to @bfspector @aaryan04 @HazyResearch

We've integrated lightweight support for the new Blackwell (tcgen05) instructions into ThunderKittens if you want to play! We don't think that this is near the limit of what 's possible -- we’re still way off from CuBLAS (in progress), but already way above Hopper. More kernels soon to come: attention backwards, linear attention variants, long convolutions, and more! Give us another day or two... 🙂

5,649

Benjamin F Spector · Oct 29, 2024 · 4:54 AM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

We're releasing something soon. In anticipation, and in belated celebration of @katyperry 's birthday, here's Kitty Perry.

5,403

Benjamin F Spector · May 12, 2024 · 8:08 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(4/7) The results speak for themselves: On 4090s and A100s, TK matches Flash Attention 2 (FA2) performance in just a few lines of code. On H100s, TK is substantially faster than FA2 for both forward and backward passes. Clean code and high performance? Yes, please! 🎉

5,739

Benjamin F Spector · Sep 28, 2025 · 7:35 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(3/8) There are three key kernel optimizations that help us achieve this performance. The first is inter-instruction pipelining. We specialize our threads to follow a dataflow pattern, so that we can overlap instructions and keep the matrix multiplies rolling!

2,448

Benjamin F Spector · Sep 28, 2025 · 7:39 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(8/8) This is joint work with my amazing collaborators @jordanjuravsky, @stuart_sul, @dylan__lim, @OwenDugan, @simran_s_arora, and @HazyResearch. And special thanks to @togethercompute for providing the GPUs to make this work possible!

3,046

Benjamin F Spector · Mar 5, 2025 · 9:04 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(6/7) Profiling these megakernels isn't straightforward, so we built custom tools that generate detailed Gantt charts for visualizing kernel behavior. This allowed us to rapidly identify and eliminate hidden latencies and bottlenecks.

2,553

Benjamin F Spector · Mar 5, 2025 · 9:04 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(7/7) ThunderMLA is just the beginning—we see fused megakernels and dataflow paradigms as a key part of the future of efficient GPU programming for AI. We think these techniques can enable efficient variable-length training (no more frankensteining documents together), higher utilizations, and new architectures!

2,236

Benjamin F Spector · May 12, 2024 · 8:07 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(3/7) We designed TK to be familiar to AI folks, with a PyTorch-like API -- you can do things like take a row sum of a matrix. The inner compute loop of our H100 forward attention kernel is just 21 LOC. At the same time, TK doesn't hide how accelerators work, allowing users to take full advantage of the latest hardware. If you know CUDA, you can probably "compile" TK in your head! 🧠

7,787

Benjamin F Spector · Sep 28, 2025 · 7:36 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(4/8) Second, we dynamically schedule work across each GPU's SM's at runtime: SM’s just pull new work from a global work queue as needed. This simplifies scheduling on the CPU, and prevents jitter on the GPU from leading to stalls.

2,045

Benjamin F Spector · May 12, 2024 · 8:12 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

(7/7) And, many additional thanks are in order. We’ve been supported by resources from @StanfordAILab, @StanfordCRFM, @StanfordHAI, among others. NVIDIA’s Cutlass team gave us invaluable advice on wgmma and TMA descriptors, and the @colfaxintl kernels were an important reference. Finally, extra special thanks to @togethercompute for compute, advice, and friendship, and also to @HertzFoundation for funding my PhD.

6,229

Benjamin F Spector · Jan 25, 2025 · 7:20 AM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

Replying to @bfspector @aaryan04 @HazyResearch

Special thanks to @TogetherCompute for their support and early access to Blackwell, to @__tensorcore__ for making the PTX docs a real pleasure this time around (excited for the new Cutlass!), and to @nvidia more broadly for... unusually coherent architectural decisions!

1,956

Benjamin F Spector · Mar 5, 2025 · 9:02 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(3/7) How does ThunderMLA achieve this boost? By applying classical computing concepts (job scheduling, interpreters) to modern AI inference. Our megakernel is essentially an interpreter running a tiny virtual instruction set directly on the GPU. 🚀

2,324

Benjamin F Spector · Sep 28, 2025 · 7:36 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(5/8) Third, we interleave communication-intensive instructions with compute-intensive instructions, which lets us reduce peak networking bandwidth. In large batches, this can make a substantial difference.

1,847

Benjamin F Spector · Sep 28, 2025 · 7:37 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

(6/8) We pipeline scheduling work on the CPU while the GPU runs the last batch, so that new instructions are ready by the time they’re needed. In practice, this usually leaves the CPU idle for ~90% of runtime.

1,785

Benjamin F Spector · Mar 5, 2025 · 9:02 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(2/7) Variable-prompt decoding is notoriously tricky—especially with small, imbalanced batches, where kernel launches and tail effects become big headaches. ThunderMLA tackles this head-on, reducing overhead from multiple kernel launches into a single, streamlined megakernel.

1,659

Benjamin F Spector · May 12, 2024 · 8:13 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

Replying to @bfspector @StanfordAILab @StanfordCRFM @StanfordHAI @colfaxintl

Tagging a few people who may be interested @srush_nlp @JiaZhihao @SonglinYang4

5,260

Benjamin F Spector · Mar 15, 2025 · 3:44 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(5/6) We also discovered the B200 tensor cores behave... similarly to 128×128 systolics, meaning you want M and N >= 128 for full FLOP utilization. Smaller values run at corresponding fractions—a bit different from Hopper where much smaller shapes could max out the GPU.

1,880

Benjamin F Spector · Oct 29, 2024 · 5:52 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(2/7) Regarding kernels: we’re releasing a whole bunch -- a fused Mamba-2, H100-optimized FlashFFTConv kernels, RoPE, Layernorm, Hedgehog, and more! These kernels range from “a bit faster” to “much, much faster”.

983

Benjamin F Spector · Mar 5, 2025 · 9:03 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(4/7) On a realistic production workload (see the blog post), ThunderMLA achieves 183 TFLOPS & 1520 GB/s on an H100 GPU, vs. FlashMLA's 144 TFLOPS & 1199 GB/s, a 20% boost. More performance, fewer kernel launches, and dramatically reduced tail effects.

1,277

Benjamin F Spector · Mar 5, 2025 · 9:03 PM UTC

Benjamin F Spector

@bfspector

5 Mar 2025

Replying to @bfspector @AaryanSinghal4 @realDanFu @HazyResearch

(5/7) Scheduling matters—a lot! We've explored simple static scheduling and more advanced makespan-based scheduling techniques to shave crucial microseconds off execution times. These tweaks alone can yield an additional 10% speedup!

1,138

Benjamin F Spector · Mar 15, 2025 · 3:42 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(2/6) Writing performant kernels on B200 GPUs feels a lot more like programming a dataflow-machine than traditional CUDA coding. It's all about loading enough data at high throughput to keep the tensor cores hot—and the B200's tensor cores have 2-2.5x the power of H100!

649

Benjamin F Spector · Oct 29, 2024 · 5:54 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(3/7) We’ve also optimized our attention kernels to broadly match FA-3 on performance -- a bit slower on the forwards, and correspondingly faster on the backwards, while being simpler (no ping-ponging) and more customizable. We’ve trained models, and provided more robust integrations to help you do so, too.

816

Benjamin F Spector · May 28, 2025 · 3:54 AM UTC

Benjamin F Spector

@bfspector

28 May 2025

Replying to @EitanTurok

More models are in the works -- and we do think that we can come up with a set of megakernel instructions to run a wide range of models. I think it's unlikely DL compilers could generate the instructions themselves, but perhaps the scheduler could be the new compiler!

2,141

Benjamin F Spector · Oct 29, 2024 · 5:57 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(5/7) In terms of library improvements, new TK has: - No more explicit shared layouts (selected internally), just specify type and size. - Global layouts, to automatically manage TMA descriptors (H100) and strides (otherwise) - More & complex-valued type support (we swear we’ll get around to FP8 one of these days, it’s not even hard) Tens of thousands of robust unit tests for just about everything. - Templates to manage pipelining and latency-hiding. Get FA-3-approaching performance in just 42 lines of device code! - Lots and lots of performance optimizations: reducing bank conflicts, managing address spaces, issuing better instructions, removing synchronizations, and more.

682

Benjamin F Spector · Oct 29, 2024 · 5:58 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(6/7) We’ll also be livestreaming this Thursday and teaching TK and some of the tricks we’ve learned about writing fast kernels -- come hang out with us!

818

Benjamin F Spector · Oct 29, 2024 · 6:00 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(7/7) (In no particular order) Huge thanks to @togethercompute for plying us with H100s, and to @StanfordAILab, @StanfordCRFM, @StanfordHAI, among others, for their support of our work. Cutlass has remained a fount of good ideas for squeezing more out of GPUs. Special thanks to @realDanFu for his good sense of direction. And last but not least: happy National Cat Day!

1,444

Benjamin F Spector · Mar 15, 2025 · 3:43 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(3/6) Happily, the new hardware features fit perfectly into TK's tile-based abstractions. We're taking full advantage of 5th-generation tensor cores, Tensor Memory, and CTA pairs to write simple kernels that match or exceed NVIDIA’s handcrafted libraries.

519

Benjamin F Spector · Mar 15, 2025 · 4:02 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(6/6) Check out our kernels and learn more on our blog. And a huge thanks to @togethercompute for GPUs and their broader, continuing support.

1,569

Benjamin F Spector · Oct 29, 2024 · 5:55 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2024

(4/7) We’re also releasing demo integrations with Llama, Qwen, LoLCATs, and Based models. The fastest, easiest way to get your kittens talking to you!

716

Benjamin F Spector · Mar 15, 2025 · 3:43 PM UTC

Benjamin F Spector

@bfspector

15 Mar 2025

(4/6) The secret? We pipeline everything—from HBM to shared memory, from cluster shared memory to tensor cores, from tensor memory into registers, from registers into shared memory, and then finally out into HBM. No bubbles!

916

Benjamin F Spector · May 27, 2025 · 8:21 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

Replying to @ryu0000000001

The latency benefits of megakernels are definitely more pronounced on smaller models, although we think it probably still does make sense even with 10B+ models, especially once one introduces multi-GPU inference. We've also focused on the latency benefits, but it turns out there are throughput benefits, too!

2,072

Benjamin F Spector · Jul 19, 2023 · 9:10 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @silasalberti @SilasAlberti @danielgross

adding a bit more flavor A100 -- 312 TFLOPs, 2TB/s, ~$.001/sec (depending on provider.) Llama-70B (int8): decoding takes ~140GFLOP per token and ~70GB memory bandwidth per batch.

609

Benjamin F Spector · Jan 11, 2025 · 7:20 AM UTC

Benjamin F Spector

@bfspector

11 Jan 2025

Replying to @qamcintyre

figure it out lmao

839

Benjamin F Spector · Sep 3, 2025 · 9:01 PM UTC

Benjamin F Spector

@bfspector

3 Sep 2025

Replying to @typedfemale

i feel this is why perplexity was invented, just put it in the exponent

1,039

Benjamin F Spector · Sep 28, 2025 · 3:11 AM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

Replying to @iScienceLuvr

GRPO3 (they skipped GRPO2)

2,269

Benjamin F Spector · May 28, 2025 · 3:56 AM UTC

Benjamin F Spector

@bfspector

28 May 2025

Replying to @zackangelo

This megakernel is just running one forward pass at a time in a batch size of one -- though it would be pretty easy to stick 16:1 speculation on top and get another 2-3x in latency. We do think there are throughput reasons to do megakernels too -- coming soon, stay tuned!

2,758

Benjamin F Spector · Sep 5, 2025 · 4:04 AM UTC

Benjamin F Spector

@bfspector

5 Sep 2025

Replying to @charles_irl

Took a look through the PTX, it's disallowed for kernels that use either dynamic shared memory (e.g. all kernels that use >48kB) and setmaxnreg (e.g. most critical-path post-Hopper). Also curious if it decreases occupancy. Cool idea but I doubt it's much use in practice. :/

600

Benjamin F Spector · May 27, 2025 · 8:20 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

Replying to @SohamGovande

There is indeed a bit of competition for registers, it's a good point. Right now the virtualization doesn't use many registers, and we could probably reduce it further. A related concern is that it also means all of your instructions need to run with the same number of threads. Fortunately most ops run pretty well with 8-16 consumer warps. Of course, if it's a huge problem, you can also always just reintroduce a couple of kernel boundaries -- to us, Megakernels are a tool in a toolbox! But so far we've never needed to.

2,285

Benjamin F Spector · Oct 9, 2025 · 4:22 AM UTC

Benjamin F Spector

@bfspector

9 Oct 2025

Replying to @__tinygrad__ @HazyResearch @simran_s_arora @stuart_sul @AaryanSinghal4 @tenstorrent

FWIW we could probably take advantage of this better than we do via cluster dsmem. But we've broadly found it too much of a hassle for the gains -- excepting very specific cases. There's probably juice there, too!

676

Benjamin F Spector · May 27, 2025 · 7:25 PM UTC

Benjamin F Spector

@bfspector

27 May 2025

Replying to @PandaAshwinee

We actually do both of these things. Regarding the attention head schedule, the basic idea is that the different Q, K, V heads are each associated with separate counters in global memory. So, you don't have to wait for consistency on all of them in order to start attention. Rather, each attention instruction just waits for its specific inputs to be ready, whenever that happens. In this low-latency application it's probably a pretty minor optimization relative to the overall pipelining, but we think it'll also help in future, throughput-oriented scenarios, too.

3,145

Benjamin F Spector · Oct 29, 2025 · 7:07 PM UTC

Benjamin F Spector

@bfspector

29 Oct 2025

Replying to @appliedcompute

congrats!!

1,684

Benjamin F Spector · May 13, 2024 · 11:58 PM UTC

Benjamin F Spector

@bfspector

13 May 2024

Replying to @Tgale96

As time went on we did notice our abstractions looked more and more TPU-like! There are still some differences, of course, but I think it goes to show -- Google did get their infra right!

360

Benjamin F Spector · May 12, 2024 · 8:55 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

Replying to @tri_dao

Thanks Tri! It's a delight to be able to build on your work :)

528

Benjamin F Spector · Jul 19, 2023 · 9:13 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

In short, Llama 7B on a 4090 with int4, speculative decoding, and a persistent KV cache should end up at least 20x cheaper than ChatGPT even in single batch inference. Replacing speculative decoding with large batch should bring the factor to 200+ at the cost of more latency.

315

Benjamin F Spector · Jul 19, 2023 · 9:13 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

4. Not exactly a per-token thing but self-hosting and keeping the KV cache around saves massive costs for long dialogues relative to APIs. Was just chatting with Together on this. (I want.)

390

Benjamin F Spector · Jul 19, 2023 · 9:12 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

1. Switch to a 4090 -- cheaper per memory bandwidth. (3 year average cost incl electricity@$0.2/kWh is $0.00004/sec)

304

Benjamin F Spector · Oct 31, 2024 · 8:13 PM UTC

Benjamin F Spector

@bfspector

31 Oct 2024

(8/7) Stream is here and now: piped.video/watch?v=IAwLzkld…

980

Benjamin F Spector · Sep 28, 2025 · 7:44 PM UTC

Benjamin F Spector

@bfspector

28 Sep 2025

Replying to @saranormous

now :)

Benjamin F Spector

@bfspector

28 Sep 2025

243

Benjamin F Spector · Jul 19, 2023 · 9:11 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

What the theory then says is to achieve cost parity, you need to run a batch size of at least 18. There are inefficiencies so in practice this is more like 50. But, there are some tricks you can use to get this a lot lower.

344

Benjamin F Spector · Jan 25, 2025 · 8:24 PM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

Replying to @drisspg @aaryan04 @HazyResearch

FA3 actually does not port trivially to Blackwell because the wgmma instructions from Hopper are no longer supported, in favor of the newer tcgen05 instructions. Unusual of Nvidia to break backwards compatibility, but they have!

369

Benjamin F Spector · Jul 19, 2023 · 9:11 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

So you can get ~28 batches per second. In practice a bit less since KV cache requires some reshuffling too. But on the flipside you can get 2,200 tokens worth of compute. In practice much less since attention has substantial non-matmul ops.

450

Benjamin F Spector · Jul 19, 2023 · 9:12 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

3. Speculative decoding can save memory bandwidth when in low-utilization situations, so that can get you another factor of 2-3. Also has the benefit of improving latency to end user.

308

Benjamin F Spector · Mar 6, 2025 · 12:44 AM UTC

Benjamin F Spector

@bfspector

6 Mar 2025

Replying to @McQueenFu @AaryanSinghal4 @realDanFu @HazyResearch

Yep, link to code is in the post!

419

Benjamin F Spector · May 12, 2024 · 11:09 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

Replying to @hsu_byron

@simran_s_arora is the expert, but I'm happy to send you the materials I wrote if you like.

1,084

Benjamin F Spector · Aug 28, 2025 · 12:25 AM UTC

Benjamin F Spector

@bfspector

28 Aug 2025

Replying to @paradigmai

The logo I didn't know I needed!

307

Benjamin F Spector · Jul 19, 2023 · 9:12 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2023

Replying to @bfspector @SilasAlberti @danielgross

2. Use a smaller model -- 7B gets you a factor of 10 over 70B (duh.) Also quantize down to int4 and take the extra factor of 2.

286

Benjamin F Spector · Jan 25, 2025 · 10:15 AM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

Replying to @__tensorcore__ @aaryan04 @HazyResearch @togethercompute @nvidia

Will have to take a look in the morning -- was nice for the 20 minutes it lasted :)

101

Benjamin F Spector · Jun 25, 2024 · 8:57 PM UTC

Benjamin F Spector

@bfspector

25 Jun 2024

Replying to @mark_woodworth @cHHillee

My two cents is that CUDA is absolutely a relevant moat, but it is not nearly as deep of a moat as, say, x86 was for Intel. With x86 you at least had billions of lines of code that had been written / optimized with x86 in mind. But the workload for an LLM inference cluster looks *more* like a few thousand lines of kernels run on loop on ~$1E9 of GPUs for 6 months on end. At that scale you can actually afford to pay people to rewrite a few kernels for new hardware, if the cost savings + performance are really there.

310

Benjamin F Spector · Sep 29, 2025 · 4:35 AM UTC

Benjamin F Spector

@bfspector

29 Sep 2025

Replying to @djsaunde

168

Benjamin F Spector · Oct 7, 2020 · 10:20 PM UTC

Benjamin F Spector

@bfspector

7 Oct 2020

Replying to @melissaekline @roger_p_levy

I didn't have a twitter but Roger mentioned this thread and now I do!

Benjamin F Spector · Oct 9, 2025 · 4:21 AM UTC

Benjamin F Spector

@bfspector

9 Oct 2025

Replying to @__tinygrad__ @HazyResearch @simran_s_arora @stuart_sul @AaryanSinghal4

240

Benjamin F Spector · Jan 23, 2025 · 10:57 PM UTC

Benjamin F Spector

@bfspector

23 Jan 2025

Replying to @__tensorcore__

ily

Benjamin F Spector · May 17, 2024 · 8:42 PM UTC

Benjamin F Spector

@bfspector

17 May 2024

Replying to @bfspector @terafoenix

Just to follow up on this -- decided to rerun some micro benchmarks en route to paper, and have some additional clarity. 1. FP32 and BF16 both run at 60 TFLOPs, even if BF16 is packed. This is what I had been noticing before. 2. packed FP16 *does* indeed run at 120 TFLOPs. I think this actually gives a modest case to run FP8 forward pass with FP16 accumulate instead of FP32. The main issue that remains is that h2exp2 appears to be equally slow in half2, and that's the single biggest bottleneck in the kernel. But the rest of the normalizations would benefit from FP16, and I think it would be a noticeable perf improvement.

343

Benjamin F Spector · Jun 12, 2024 · 3:19 PM UTC

Benjamin F Spector

@bfspector

12 Jun 2024

Replying to @andersonbcdefg @simran_s_arora

We haven't written a specialized kernel for it but I think it would be pretty easy to adapt our existing attention kernels in TK. The only complexity I can think of is starting the causal mask in the middle of a chunk. I don't think there's anything actually hard there, just would need a custom masking function to do it. (Probably like 30 LoC outside of TK.)

Benjamin F Spector · May 14, 2024 · 4:04 AM UTC

Benjamin F Spector

@bfspector

14 May 2024

plt.xkcd and xkcd script font!

210

Benjamin F Spector · May 13, 2024 · 12:38 AM UTC

Benjamin F Spector

@bfspector

13 May 2024

Replying to @terafoenix

This is a good question. NVIDIA indeed claims double rate there (with a *), but on our internal micros , we've actually found it running at same rate. (Maybe we missed something!) Blog uses 60 TFLOPs since non-matmul FLOPs usually should be done in FP32 for numerics, anyways.

991

Benjamin F Spector · Jul 19, 2024 · 8:39 PM UTC

Benjamin F Spector

@bfspector

19 Jul 2024

Replying to @MilksandMatcha @SarahChieng

DM please!

384

Benjamin F Spector · Nov 13, 2024 · 10:01 PM UTC

Benjamin F Spector

@bfspector

13 Nov 2024

Replying to @paul_cal @Tim_Dettmers

Roughly speaking, those results are real, but they're trained at very low data/parameter ratios. Per the paper, what we found is that as you raise the amount if "overtraining" of the model, the gap really starts to open up between a 1.58-bit model and an 8 or 16-bit model.

111

Benjamin F Spector · Oct 30, 2024 · 5:20 AM UTC

Benjamin F Spector

@bfspector

30 Oct 2024

Replying to @main_horse

I think we tried both <whichever version came with CUDA 12.6> as well as within Pytorch, they were similar. TBH, the point is actually less that we really think we can beat cublas in ways that are intrinsically significant, they have a ton of internal heuristics for blocking and padding weird matrix sizes that we're not super focused on. But we do think it's cool that you can get comparable performance on a pretty wide range of matrix sizes with just a few dozen lines of code. Together's flux release a few weeks ago used some custom TK kernels to fuse operations, and they were a good deal faster than cublas + unfused op -- the real utility is not in beating the pure matmul, but rather in fusing it with other operations, too!

262

Benjamin F Spector · May 12, 2024 · 11:13 PM UTC

Benjamin F Spector

@bfspector

12 May 2024

Replying to @Yuchenj_UW

We haven't (and don't really intend) to write a complete set of inference kernels -- TK is more meant as a resource and a statement of philosophy than a productionizable LLM inference system. But if TK turns out to be useful for decoding kernels and whatnot, great!

1,160

Benjamin F Spector · Sep 14, 2024 · 4:10 AM UTC

Benjamin F Spector

@bfspector

14 Sep 2024

Replying to @nc_znc

Glad you liked the talk! I'll DM; let's find a time after ICLR :)

Benjamin F Spector · Sep 3, 2025 · 12:43 AM UTC

Benjamin F Spector

@bfspector

3 Sep 2025

Replying to @lukeigel

Happy birthday!!!

922

Benjamin F Spector · Jan 2, 2025 · 1:30 AM UTC

Benjamin F Spector

@bfspector

2 Jan 2025

Replying to @YouJiacheng

Approved and merged, thanks very much!

Benjamin F Spector · Aug 12, 2023 · 7:49 PM UTC

Benjamin F Spector

@bfspector

12 Aug 2023

Replying to @amanrsanger

I don't really disagree with this, but I want to put the con case quickly, which is that if you really think that models are going to get a lot bigger and sparser, the balance may shift back towards CPUs with low arithmetic intensities.

124

Benjamin F Spector · May 14, 2024 · 12:01 AM UTC

Benjamin F Spector

@bfspector

14 May 2024

Replying to @divchenko

FP16 is sort of already there, we'd need to add a couple of typedefs but there's no real barriers to it. FP8 looks like more of a hassle to me, unfortunately, as a bunch of transpose intrinsics we rely on don't seem to be defined, and doing everything with shuffles will be quite painful. So... Q@K.T should be fine but Q@K may suffer in FP8. TBH I am not at all sure why NVIDIA did not define transpose intrinsics for other types. Seems like a weird thing to leave out?

254

Benjamin F Spector · Aug 12, 2023 · 7:47 PM UTC

Benjamin F Spector

@bfspector

12 Aug 2023

Replying to @amanrsanger

@bryanhpchiang

628

Benjamin F Spector · May 21, 2024 · 6:16 AM UTC

Benjamin F Spector

@bfspector

21 May 2024

Replying to @Nuliayuk @StanfordAILab @StanfordCRFM @StanfordHAI @colfaxintl

We're mainly looking at CDNA2 and CDNA3 in the short term -- RDNA3 is not immediately in focus. Unfortunately, the gap between RDNA3 and CDNA3 is larger than the gap between Hopper/Ampere on one important attribute: wavefront size. It's not really *hard* to do RDNA3, but it requires parallel code for all of the register primitives.

Benjamin F Spector · Jan 25, 2025 · 8:25 PM UTC

Benjamin F Spector

@bfspector

25 Jan 2025

Replying to @RaghuGanti @aaryan04 @HazyResearch

github.com/HazyResearch/Thun… -- Prelim Blackwell support is on the Blackwell channel, will probably merge to main within the week.

GitHub - HazyResearch/ThunderKittens: Tile primitives for speedy kernels

Tile primitives for speedy kernels. Contribute to HazyResearch/ThunderKittens development by creating an account on GitHub.

github.com

309