Systems and GPU Performance Mechanic - TBD Ex. CUTLASS 3.x / 4.x etc

As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚
16
18
370
27,555
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/medi…
16
85
422
78,966
CUDA 12.8 just dropped with Blackwell support. TensorCore 5th Generation Family Instructions: docs.nvidia.com/cuda/paralle…
12
51
307
33,848
FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with @tri_dao, @colfaxintl, @Meta, and my @_prrama. We can get up to 760 TFLOP/s on head dim 128 forward pass!
5
51
285
33,461
pip install nvidia-cutlass-dsl 👀 CUTLASS 4.0 is on the horizon, and the future is all pythonic! Come talk to us at GTC to learn more and attend our two talks
Cutlass 4.0 Python DSL @__tensorcore__ Fully python! Performance parity! Join these two sessions at @NVIDIAGTC
3
18
148
15,156
🔥 NVLink Distributed GEMMs natively in CUTLASS for all your tensor parallelism needs 🔥 blog.shi-labs.com/distribute… Work done by @AliHassaniJr over his 2024 summer internship :)
3
17
137
6,666
🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few moths and years 💚
6
28
121
12,512
CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/medi…
4
11
119
7,657
Every GPU kernel writer in shambles
4
8
119
12,268
This was an extremely rewarding collaboration CUTLASS (4.0) team had over the last few years. Happy to have been involved in a small capacity with the fundamental rearchitecture of CUDA programming model
Almost 2 years at Nvidia, and the Tile IR project has been a very large part of my time here! So happy to see it finally coming to light. The CUDA GPU driver will now include a #MLIR-based JIT compiler! :) More MLIR-based announcement at GTC tomorrow in the Cutlass 4.0 Session!
4
87
6,504
On March 16th, I will be attending two GPU events in the bay area! First is the SemiAnalysis Blackwell hackathon, and the second is the GPU MODE mixer event. At both, there will be a sneak peek of the future of CUTLASS and tensor core programming 👀 : github.com/NVIDIA/cutlass/di…
SemiAnalysis is hosting an Nvidia Blackwell GPU Hackathon on Sunday March 16th. It is the ultimate playground for Blackwell PTX tech enthusiasts, offering hands-on exploration of Blackwell & PTX infrastructure while collaborating on open-source projects.
1
8
69
8,199
It is so cool to see CUTLASS 3.x and CuTe being used for new breakthroughs in ML compute word! Thanks a lot for taking our stuff for a spin. Can't wait to see more :D
Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4 faster and is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
6
55
11,876
We gave our first in depth publicly available talk on CUTLASS 3.x and it’s up on YouTube now!
The CUTLASS/TensorCores/Hopper lecture covered quite advanced cuda programming. I guess we need further ramp-up lectures to make these topics more accessible. Recoding: piped.video/hQ9GPnV0-50?si=r10V… Slides: drive.google.com/file/d/18st…
6
52
7,344
Sorry to break it to you but PTX is a SASS wrapper. Sorry to break it to you but SASS is a uArch wrapper. Sorry to break it to you but uArch is a transistor wrapper. Sorry to break it to you but transistors are atom wrappers. Sorry to break it to you but atoms are quark wrappers
1
2
45
1,667
Microbenchmarking the store and load buffer sizes on my zen2 machine as a follow up to the thread over at nitter.app/trav_downs/status/1187…. Tests written by @trav_downs show that the SB size seems to be 48, confirming what is already on his blog. LB size is 116, ~60% increase over zen1.
How many load buffers in #AMD Zen2?
5
7
34
I've been meaning tow rite Manim animations myself for how CUTLASS works, but seems like I don't have to. Thanks @KuterDinel. Btw, if someone wants to collaborate more closely on making documentation animations from CUTLASS, please DM me. I have tons of ideas of what we can do.
Considering to make a @3blue1brown style video explaining CuTe/Cutlass layout algebra. Let me know what you think of the small demo clip I made.
2
2
39
5,024
“NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.” 🥵❤️👀
🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) github.com/karpathy/llm.c/bl… On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0). - It is a direct implementation of the training loop and backpropagation in C/CUDA. - It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe. - It deletes the need for the Python interpreter and a deep learning library. - It allocates all the memory a single time at the start. - It's pretty cool. How: Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult to, which is awesome. Examples of optimizations that went in over the last few days: - we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity. - one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass. - many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention - cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out of Github and Discords of CUDAMODE and my NN Zero to Hero. Next steps: - more optimizing of our (fp32) kernels, and especially switch to flash attention. - mixed precision training (fp16 to start). - multi-gpu training (DDP to start). - data & evals to set up a proper GPT-2 training runs - 🚀 repro GPT-2 (1.6B) training run. - more modern architectures etc. (Llama 3?) - writing, videos, exercises on building all of this from scratch. Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.
1
5
36
3,584
Replying to @_xjdr
sure! colfax just dropped a banger today: research.colfax-intl.com/cut… We have a series of really nice incremental tutorials: github.com/NVIDIA/cutlass/tr… The two GTC talks as I mentioned are the richest source out there on Blackwell MMAs. TK blog is nice too.
5
36
3,396
Woot! My patches have been merged in for Zen2 PMU counters. A twinkle in the eye of my undergrad self just a couple years ago :3 *insert pun about not holding vendor events at times likes these*
Add AMD Zen2 vendor events, see how to use one of them with perf stat, record, report, annotate: git.kernel.org/acme/c/5ed096… #perf_events #linux
4
2
31
I rarely talk about my employer, or embellish their products. I am going to break that rule for once, this is too big to ignore. Public CUDA compiler with 11.3 release last week has some incredibly huge performance improvements for anyone writing #CUDA kernels.
2
1
31
CUTLASS 3.0 GEMMs are actually GETTs disguise! github.com/NVIDIA/cutlass/pu… Hopper GEMMs that use TMA are capable of computing any tensor contraction :D Now that is what we mean when we say "next-gen linear algebra library". Read more in the example and let linear algebra go brrrrrrrr
1
6
32
4,850
🔋 Batteries included in the form of many new examples across multiple architectures 🔌 Of note are Ampere FlashAattention-2 and Blackwell FlashAttention-3 implementations that are at parity with C++ implementations in terms of performance: github.com/NVIDIA/cutlass/bl…
1
2
31
1,581
We believe low level access to hardware is extremely important. High level generators rob away the freedom of programmers to experiment with new ideas and kernel designs while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡
2
2
30
2,534
Folks like @bfspector complained in the past that Hopper documentation for WGMMA and TMA layouts was inaccurate. Well, we have completely overhauled how we write those docs, and CUTLASS team has worked closely with PTX ISA folks to use CuTe to generate swizzle layout diagrams
2
31
1,138
This initial release ships with CuTe DSL, a programming language that is fully consistent with CuTe C++ in its programming model, APIs, abstraction level, and performance. Kernels in CuTe DSL look and feel like CuTe C++ but compile 100x faster without compromise in performance.
1
1
30
2,755
How to separate the wheat from the chaff 101 🤦🏽‍♂️
4
26
3,459
Even the text descriptions are using CuTe now to describe these layouts :D
1
27
1,130
Today is my (second) first day at @Nvidia as an intern. For the next six months, I will be on the programming systems and applications research group to work on GPU compute acceleration and linear algebra library design. Pretty pog tbh. Was always my dream to work at a ...
3
24
Been cooking this PTX design for nearly 2 years, hope y'all cook something good with this like @tri_dao always does. Fun is only beginning ...
2
11
24
2,923
We even have a series of Jupyter notebooks to get you started. My favorite is the one that teaches how to print types and values at both compile time and runtime. One of the best and most frequently used ways to debug kernels 😊 github.com/NVIDIA/cutlass/bl…
2
1
24
1,281
"The 5th generation TensorCore has dedicated on-chip memory that is specialized for use by TensorCore operations. This Tensor Memory is organized as a two-dimensional matrix where the horizontal rows are called lanes and the vertical columns are called columns."
1
1
21
3,563
Lastly, I want to say a massive thank you to the work (and sacrifices) of everyone who has worked on project. This is the first time CUTLASS has done a compiler, and it required a lot of collaboration across our CUDA and driver, compiler, Python, frameworks, DevTech
2
22
919
When FA2 was released, I wrote this tweet: nitter.app/DROP_ALL_TABLES/status… Today, I feel compelled to say the same thing again. It is so so cool to see DL community embrace and use CUTLASS and CuTe to develop novel algorithms that make Tensor Cores sing <3
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
2
4
19
3,089
“This is our last Hopper focused release. More exciting releases will come in 2025.” 👀👀👀👀👀👀
CUTLASS 3.7 is tagged. The highlights are from our long term contributors and friends: @BigManniM9 's block scaling FP8 GEMM and @AliHassaniJr 's distributed GEMM. This is our last Hopper focused release. More exciting releases will come in 2025. github.com/NVIDIA/cutlass/di…
1
18
1,452
We have also heard the your feedback asking for better documentation. CuTe DSL documentation and all existing CUTLASS C++ documentation is now homed at docs.nvidia.com/cutlass with a fresh coat of paint!
1
20
1,152
Winter break just started and ALL the HotChips31 sessions are up on YouTube!!! Time to get some popcorn...
2
2
16
On average, the CUTLASS GitHub repo gets downloaded once per second.
On average, the Keras pip package gets downloaded about 7 times per second.
3
17
1,728
CUTLASS 3.0 started almost exactly 2 years ago to the day. CuTe is truly groundbreaking, and I am lucky to have been a part of 3.0 since its inception. Tons to talk about our research innovations for a unified micro-kernel abstraction for GPUs at conferences throughout 2023! ^_^
CUTLASS 3.0 has just been released, offering optimal performance on NVIDIA’s H100 and a new approach to template metaprogramming in CUDA C++. github.com/NVIDIA/cutlass
2
15
4,260
AgileX FPGA, I Love It
4
14
I need help! I just got done implementing PMC counter support in perf profiler for zen2. There are a lot of changes, and I wanted help from people who own a Zen1 or Zen2 based system to test if it actually works before I upstream it. Please clone github.com/thakkarv/linux
2
7
16
If you missed it, you can also watch our 40 min GTC talk that dives deep into CuTe DSL nvidia.com/en-us/on-demand/s…
1
1
16
879
CuTe DSL will be in beta for the next few months. We would love to hear your feedback and suggestions especially during this period. Check us out on GitHub and file issues or contribute examples: github.com/NVIDIA/cutlass
1
18
1,052
I really should write a blog about this when I get the time but maybe a thread about the CUDA kernels I implemented is in order. Many thanks to @hpcgarage and collaborators @OLCFGOV for inviting me to work on this project, I am very grateful :) This is a wild to a newbie like me
2020 Gordon Bell finalists, @DROP_ALL_TABLES & @richitobonito describe their work w/ @ORNL that uses #HPC to connect the dots across academic bodies of work. 📝:cc.gatech.edu/news/641385/go… The winning paper will be announced at @Supercomputing Nov. 19 @ 2 PM EST. Good luck, team!
1
15
New username, who dis?
2
15
1,732
HotChips 2019 Videos have started trickling in 🙃
1
15
💚
Among many great things about the Blackwell chip, having CUTLASS as its central programming model is my favorite. That framework is so good it makes programming low-level kernels enjoyable.
14
1,050
Built on CUTLASS! Congrats @creed_humphrey and thanks for using our stuff :)
Check out Faster Neighborhood Attention: great effort led by @AliHassaniJr to optimize multi-dimensional local, causal, sparse global attention at Threadblock level, paving the way for building next-generation efficient multimodal AI systems. code/paper: github.com/SHI-Labs/NATTEN
1
1
14
2,857
It was unheard of to approach cuBLAS level perf before this release, but as the cutlass benchmarks show in the chart, cuBLAS is only a stone's throw away. This is a huge win for open source and anyone who obsesses over Speed of Light kernels
1
13
OMG OMG OMG “Instead of pragma, C++ attribute syntax can be used to specify OpenMP directives, which simplifies their integration with templates. Full support for C11, C18, C++11, C++14, C++17 and C++20”

ALT Jonah Hill Yay GIF

.@OpenMP_ARB Releases #OpenMP 5.1 With Vital Usability Enhancements Bronis de Supinski: OpenMP 5.1 represents the culmination of the past two years of work within the OpenMP Language Committee openmp.org/press-release/ope… #HPC #AI #SC20
1
13
Plan for next week is to retire because I think my life has peaked 🤣
Not a bad week for our @DROP_ALL_TABLES (@GTCSE): he's a @gtcomputing Donald V. Jackson Fellowship winner *and* wrote code that ran on 24,000 or so GPUs (with @piyusch et al. @ORNL). Rich's logical follow-up question: what's the plan for *next* week? 🤔 cc.gatech.edu/content/2020-g…
3
12
Replying to @ebetica
If you want low level control, the freedom to design your own kernel algorithms, triton isn’t meeting your perf needs, or just love low level hacking the yes :)
13
483
I’ve been trying to compete at @Supercomputing’s @SCCompSC for 3 years now ever since I started in sophomore year of my undergrad. This is the first time I get to participate as a mentor for the Georgia Tech team. Follow our journey as we release daily coverage of the team! #SC20
Ever heard of a nonstop 72-hour #HPC focused hackathon that requires 6+ months of training? Beginning tomorrow @ 4:30 PM: @gtcomputing's Team Phoenix will discuss how they prepped & more in their daily @Supercomputing video check-ins from Nov 7 – Nov 11 sites.gatech.edu/gtsc20/team…
1
12
Replying to @_xjdr
the accurate stuff is just regurgitation of the couple of blogs that are out there. There is nothing of actual substance in here that someone could not find with a single google search. The real alpha is in watching the GTC talks that is not captured here. :yawn:
1
12
992
Replying to @PytorchToAtoms
No, Blackwell does not support WGMMA. It has its own shiny new tensor core (TCGen05 MMA)
2
1
11
1,971
Replying to @marksaroufim
I’m only a few hours away in Boston! Can I come hang out and possible talk about CuTe DSL stuff?
2
12
513
This is what the internet was made for 🥹
presenting: big jeff's trainium hell
11
1,134
Replying to @typedfemale
Skill issue
1
11
267
Hear me talk about speed of light CUDA SemiRing kernels, GraphBLAS, shortest path applications and future HPC hardware today at 3:35 Central time. A spicy serving of dank OC memes included because I just can’t help myself. #SIAM #CSE21
If you're at @TheSIAMNews CSE'21, check out the fun GraphBLAS minisymposia happening today, which includes @GTCSE's @DROP_ALL_TABLES on his new CUDA dense semiring-GEMM library, cuASR ("quasar", cuasr.io) — MS11/MS41 meetings.siam.org/sess/dsp_p… #siamCSE21
1
10
Check out some of our recent work 👀
ORNL and @GeorgiaTech scientists are tapping the immense power of the Summit #supercomputer to comb through millions of journal articles to find potential vaccines, drugs and effective measures that could stop the spread of #COVID19. ▶️ bit.ly/2WHKbCB #NatLabsInTheFight
10
Replying to @_xjdr
> In CUTLASS, you might call something like cute::copy_tma(output_desc, tmem_ptr_output, gmem_ptr_output, barrier) to lolwat
1
9
993
Replying to @apaszke
By that metric, latest FA3 forward kernel is only 143 lines too! github.com/Dao-AILab/flash-a…
1
1
10
1,365
At much higher target frequencies. IPC doesn’t live in a vacuum
1
10
Oh no what will @dcominottim do without fixed boosts?
1
8
My definition of celebrity status is being subtweeted by typedfemale
9
152
Replying to @apaszke
I was mildly trolling but my point is shall we include the lowering and compiler lines of code for mosaic as well? CUTLASS (and C++ metaporg) is a kernel programming language and a compiler written in-band. Lines of code is not apples to apples
1
8
451
The ArXiv paper talks about the totally novel hardware aware algorithms that were required to achieve this level of optimization: tridao.me/publications/flash…
1
1
9
882
Replying to @_xjdr
💚 but we gotta solve this cockiness in LLMs for us to really get anywhere and use them as true copilots. Being confidently wrong is orders of magnitude worse than not knowing something (for existing knowledge, not for frontiers)
1
8
729
Hopper uArch whitepaper is out: nvidia.com/en-us/technologie… To my HPC nerds friends, I would read more about - Threadblock Clusters - Tensor memory accelerator - Distributed shared memory - Transaction barriers - DPX TMA and clusters are particularly cool
1
2
8
I don't think I have ever seen such an incredible boost to performance from a totally user invisible compiler change before. Let that sink in, no change to the user code is needed! That is insane! From the PoV of Proebsting's law, this is years worth of improvement in one go.
1
1
8
Address sanitizer is a wonderful, wonderful tool. That is all.
1
6
Replying to @typedfemale
My name is mentioned in a tweet without me being tagged in it 😳 Have I finally made it 🥹
1
8
286
it is not just SMT. "We then disabled four cores on the AMD Ryzen 9 3900X to bring the core count down from 12 to 8" -- if productivity tests are being run, why disable cores?
4
7
You don't have to wait long.
🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few moths and years 💚
1
7
355
Not quite. The hosts are still different processes. It’s sort of like an MPI program. @AliHassaniJr paging in
1
7
443
3rd PLACE AS A FIRST TIME TEAM LETS GOOOOOOOOOO!! Hey @danolds it might be time to revise that victory odds calculator ;)
❤️❤️❤️ these 💣🍑 humans, mostly sophomores, who went from "is that like the opposite of 'low-performance computing?'" to a top-3 VSCC crew by overall score in less than a year. — ❤️🤗bbs #sc20 #morethanhpc +1 to @PenguinHPC for tees and @greenhpc @nvidia for encouragement
8
The plot thickens. Here are the results for mov reg, $0 and mov reg, $1. Moving constant 0 into a register has the same graph as zeroing and LB size test. This is different from moving 1 into a reg. Loop latency jumps for const 1 move at around 130-138 instruction depth.
2
2
8
Skill issue
5
153
PS, may wanna benchmark against example 77 now ;) I hear there might be a new fastest public FA in town
1
6
197
Replying to @cHHillee
Fair point, but the draft for this started >2 years ago 🥲
7
545
there is a sizable delta between 450/500W SXM A100 and PCIe A100 too
6
1,219
PyTorch blog post summarizes the paper well and provides a lot more data on the performance of the forward pass on fp8 and fp16 data types: nitter.app/PyTorch/status/1811445… On FP8, we can hit 1.2 PFLOP/s
Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to @colfaxintl, @AIatMeta, @NVIDIAAI and @togethercompute for the collaboration here 🙌 Read more in our blog: hubs.la/Q02Gf4hf0
1
1
6
1,274
VPRF size is measured similarly. Tests seem to indicate OoO window usable VPR count to be 144. Published number is 160, so this fits perfectly with the 16 that would be needed to maintain the YMM register state.
1
6
Here we go! @SCCompSC kicks off today, and here is a little about how the Georgia Tech Team Phoenix was put together, mostly through reddit :P
Team Phoenix discusses supercomputers, 6 months of prep & discovering the VIP class that started it all via @reddit. It's been an interesting journey to get to this point in time & now we're ready to start the @Supercomputing Student Cluster Competition! piped.video/iHjCJPsr8hI
1
6
Please don’t bring business decisions into technical ones. The fact that Nv prints money off of GEMM FF is a orthogonal to the discussion
1
6
Replying to @dcominottim
well it is in somewhat more of generic way, in that texture pipelines are shared with BVH traversal. paging @GPUsAreMagic . That said I have no clear idea about how NV engine is built with respect to other FF bits.
1
5
Replying to @MalekiSaeed
It should be very hackable. Code quality is really nice :) but I’ll let you be the judge. Really pushed to get this out day zero so folks like you can build on top of it rather than start from scratch. Contributions to add any and all new features are welcome
6
191
Any benchmark that’s SIMT heavy and not GEMM go brrr. MLPerf numbers become domain specialized to the point where it’s hard to extrapolate to other applications. Would you conclude that TPU is a solid graphics chip based on Google’s MLPerf results?
2
6
Could not be more proud of where @buhpc is and where it is going! It has been a great time building up from scratch with @iCountFromZero and @irlNnContainer.
New Semester means new eboard! @DROP_ALL_TABLES and @irlNnContainer have graduated and @iCountFromZero is our new president. We also have a new website! Check it out at buhpc.club! Lots of exciting workshops and hack sessions in store like
1
6
+1. You can get 95% peak HFU or higher from just PTX these days. We’ve (CUTLASS team and the various compiler teams) worked really really hard to ensure democratic access to tensor cores :)
6
164
Replying to @jonmasters
great examples of HW/SW codesign too. if SIMD for cobol is not codesign then I don't know what is.
1
4
Roofline is not helpful in the post Dennard world. Even if you theoretically should be in the compute bound domain, wasting b/w will push you lower below the roofline. DVFS is a B!&#
6
684
I had to make sure I read that right in the morning more than just a couple of times.
5
The motherboard folding over was the best part, it’s so creative! It seems you folks are having fun :) dank Kontent
3
1
6