Vijay (@__tensorcore__) | nitter

Pinned Tweet

Vijay @__tensorcore__

Jan 23

As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚

16

18

370

27,555

Vijay @__tensorcore__

13 May 2025

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/medi…

16

85

422

78,966

Vijay @__tensorcore__

23 Jan 2025

CUDA 12.8 just dropped with Blackwell support. TensorCore 5th Generation Family Instructions: docs.nvidia.com/cuda/paralle…

12

51

307

33,848

Vijay @__tensorcore__

20 Jul 2025

developer.nvidia.com/blog/cu… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!

CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial...

In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models have layers that cannot be expressed as…

developer.nvidia.com

5

51

293

45,297

Vijay @__tensorcore__

11 Jul 2024

FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with @tri_dao, @colfaxintl, @Meta, and my @_prrama. We can get up to 760 TFLOP/s on head dim 128 forward pass!

5

51

285

33,461

Vijay @__tensorcore__

8 Jun 2025

Another 🔥 blog about CUTLASS from @colfaxintl, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cut…

CUTLASS Tutorial: Sub-byte GEMM on NVIDIA® Blackwell GPUs - Colfax Research

Welcome to part 3 of our series investigating GEMM on the NVIDIA Blackwell architecture. In parts 1 and 2, we looked at the Tensory Memory and 2 SM capabilities of the new Blackwell Tensor Core UMMA...

research.colfax-intl.com

35

154

12,620

Vijay @__tensorcore__

17 Mar 2025

pip install nvidia-cutlass-dsl 👀 CUTLASS 4.0 is on the horizon, and the future is all pythonic! Come talk to us at GTC to learn more and attend our two talks

Vikram @msharmavikram

17 Mar 2025

Cutlass 4.0 Python DSL @__tensorcore__ Fully python! Performance parity! Join these two sessions at @NVIDIAGTC

3

18

148

15,156

Vijay @__tensorcore__

2 Dec 2024

🔥 NVLink Distributed GEMMs natively in CUTLASS for all your tensor parallelism needs 🔥 blog.shi-labs.com/distribute… Work done by @AliHassaniJr over his 2024 summer internship :)

Distributed GEMM

A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems

blog.shi-labs.com

3

17

137

6,666

Vijay @__tensorcore__

25 Jan 2025

🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few moths and years 💚

Blackwell is here

ALT Blackwell is here

6

28

121

12,512

Vijay @__tensorcore__

21 Jul 2025

CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs

Vijay @__tensorcore__

13 May 2025

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/medi…

4

11

119

7,657

Vijay @__tensorcore__

23 May 2025

Every GPU kernel writer in shambles

4

8

119

12,268

Vijay @__tensorcore__

20 Mar 2025

This was an extremely rewarding collaboration CUTLASS (4.0) team had over the last few years. Happy to have been involved in a small capacity with the fundamental rearchitecture of CUDA programming model

Mehdi Amini @JokerEph

20 Mar 2025

Almost 2 years at Nvidia, and the Tile IR project has been a very large part of my time here! So happy to see it finally coming to light. The CUDA GPU driver will now include a #MLIR-based JIT compiler! :) More MLIR-based announcement at GTC tomorrow in the Cutlass 4.0 Session!

4

87

6,504

Vijay @__tensorcore__

11 Mar 2025

On March 16th, I will be attending two GPU events in the bay area! First is the SemiAnalysis Blackwell hackathon, and the second is the GPU MODE mixer event. At both, there will be a sneak peek of the future of CUTLASS and tensor core programming 👀 : github.com/NVIDIA/cutlass/di…

GTC 2025: Come see the latest and greatest on CUTLASS!! · NVIDIA cutlass · Discussion #2131

CUTLASS will have three events Programming Blackwell Tensor Cores with CUTLASS [S72720] Friday, Mar 21 10:00 AM - 10:40 AM PDT Enable Tensor Core Programming in Python With CuTe [S74639] Friday, Ma...

SemiAnalysis

@SemiAnalysis_

11 Mar 2025

SemiAnalysis is hosting an Nvidia Blackwell GPU Hackathon on Sunday March 16th. It is the ultimate playground for Blackwell PTX tech enthusiasts, offering hands-on exploration of Blackwell & PTX infrastructure while collaborating on open-source projects.

1

8

69

8,199

Vijay @__tensorcore__

17 Jul 2023

It is so cool to see CUTLASS 3.x and CuTe being used for new breakthroughs in ML compute word! Thanks a lot for taking our stuff for a spin. Can't wait to see more :D

Tri Dao

@tri_dao

17 Jul 2023

Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4 faster and is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/

6

55

11,876

Vijay @__tensorcore__

9 Jun 2024

We gave our first in depth publicly available talk on CUTLASS 3.x and it’s up on YouTube now!

Andreas Köpf

@neurosp1ke

8 Jun 2024

The CUTLASS/TensorCores/Hopper lecture covered quite advanced cuda programming. I guess we need further ramp-up lectures to make these topics more accessible. Recoding: piped.video/hQ9GPnV0-50?si=r10V… Slides: drive.google.com/file/d/18st…

6

52

7,344

Vijay @__tensorcore__

21 Mar 2025

Replying to @apaszke @hyhieu226 @nvidia

Sorry to break it to you but PTX is a SASS wrapper. Sorry to break it to you but SASS is a uArch wrapper. Sorry to break it to you but uArch is a transistor wrapper. Sorry to break it to you but transistors are atom wrappers. Sorry to break it to you but atoms are quark wrappers

1

2

45

1,667

Vijay @__tensorcore__

19 May 2025

4

47

5,889

Vijay @__tensorcore__

28 Oct 2019

Microbenchmarking the store and load buffer sizes on my zen2 machine as a follow up to the thread over at nitter.app/trav_downs/status/1187…. Tests written by @trav_downs show that the SB size seems to be 48, confirming what is already on his blog. LB size is 116, ~60% increase over zen1.

Travis Downs @trav_downs

26 Oct 2019

How many load buffers in #AMD Zen2?

5

7

34

Vijay @__tensorcore__

29 Sep 2024

I've been meaning tow rite Manim animations myself for how CUTLASS works, but seems like I don't have to. Thanks @KuterDinel. Btw, if someone wants to collaborate more closely on making documentation animations from CUTLASS, please DM me. I have tons of ideas of what we can do.

Kuter Dinel

@KuterDinel

29 Sep 2024

Considering to make a @3blue1brown style video explaining CuTe/Cutlass layout algebra. Let me know what you think of the small demo clip I made.

2

2

39

5,024

Vijay @__tensorcore__

19 Apr 2024

“NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.” 🥵❤️👀

Andrej Karpathy

@karpathy

19 Apr 2024

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) github.com/karpathy/llm.c/bl… On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0). - It is a direct implementation of the training loop and backpropagation in C/CUDA. - It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe. - It deletes the need for the Python interpreter and a deep learning library. - It allocates all the memory a single time at the start. - It's pretty cool. How: Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult to, which is awesome. Examples of optimizations that went in over the last few days: - we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity. - one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass. - many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention - cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out of Github and Discords of CUDAMODE and my NN Zero to Hero. Next steps: - more optimizing of our (fp32) kernels, and especially switch to flash attention. - mixed precision training (fp16 to start). - multi-gpu training (DDP to start). - data & evals to set up a proper GPT-2 training runs - 🚀 repro GPT-2 (1.6B) training run. - more modern architectures etc. (Llama 3?) - writing, videos, exercises on building all of this from scratch. Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.

1

5

36

3,584

Vijay @__tensorcore__

19 Apr 2025

Replying to @_xjdr

sure! colfax just dropped a banger today: research.colfax-intl.com/cut… We have a series of really nice incremental tutorials: github.com/NVIDIA/cutlass/tr… The two GTC talks as I mentioned are the richest source out there on Blackwell MMAs. TK blog is nice too.

5

36

3,396

Vijay @__tensorcore__

14 Nov 2023

CUTLASSS is now available as a pip wheel! pypi.org/project/nvidia-cutl…

nvidia-cutlass

CUTLASS

1

3

32

2,675

Vijay @__tensorcore__

19 Mar 2020

Woot! My patches have been merged in for Zen2 PMU counters. A twinkle in the eye of my undergrad self just a couple years ago :3 *insert pun about not holding vendor events at times likes these*

Arnaldo Carvalho de Melo @acmel

19 Mar 2020

Add AMD Zen2 vendor events, see how to use one of them with perf stat, record, report, annotate: git.kernel.org/acme/c/5ed096… #perf_events #linux

4

2

31

Vijay @__tensorcore__

23 Apr 2021

I rarely talk about my employer, or embellish their products. I am going to break that rule for once, this is too big to ignore. Public CUDA compiler with 11.3 release last week has some incredibly huge performance improvements for anyone writing #CUDA kernels.

Big Boi Speedups :D

ALT Big Boi Speedups :D

2

1

31

Vijay @__tensorcore__

29 Mar 2023

CUTLASS 3.0 GEMMs are actually GETTs disguise! github.com/NVIDIA/cutlass/pu… Hopper GEMMs that use TMA are capable of computing any tensor contraction :D Now that is what we mean when we say "next-gen linear algebra library". Read more in the example and let linear algebra go brrrrrrrr

CUTLASS 3.0 Hopper GEMMs are GETTs in disguise by thakkarV · Pull Request #897 · NVIDIA/cutlass

1

6

32

4,850

Vijay @__tensorcore__

13 May 2025

🔋 Batteries included in the form of many new examples across multiple architectures 🔌 Of note are Ampere FlashAattention-2 and Blackwell FlashAttention-3 implementations that are at parity with C++ implementations in terms of performance: github.com/NVIDIA/cutlass/bl…

1

2

31

1,581

Vijay @__tensorcore__

13 May 2025

We believe low level access to hardware is extremely important. High level generators rob away the freedom of programmers to experiment with new ideas and kernel designs while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡

2

2

30

2,534

Vijay @__tensorcore__

23 Jan 2025

Folks like @bfspector complained in the past that Hopper documentation for WGMMA and TMA layouts was inaccurate. Well, we have completely overhauled how we write those docs, and CUTLASS team has worked closely with PTX ISA folks to use CuTe to generate swizzle layout diagrams

2

31

1,138

Vijay @__tensorcore__

13 May 2025

This initial release ships with CuTe DSL, a programming language that is fully consistent with CuTe C++ in its programming model, APIs, abstraction level, and performance. Kernels in CuTe DSL look and feel like CuTe C++ but compile 100x faster without compromise in performance.

1

1

30

2,755

Vijay @__tensorcore__

30 Jan 2025

How to separate the wheat from the chaff 101 🤦🏽‍♂️

This tweet is unavailable

4

26

3,459

Vijay @__tensorcore__

23 Jan 2025

Replying to @__tensorcore__ @bfspector

Even the text descriptions are using CuTe now to describe these layouts :D

1

27

1,130

Vijay @__tensorcore__

31 Jan 2022

Today is my (second) first day at @Nvidia as an intern. For the next six months, I will be on the programming systems and applications research group to work on GPU compute acceleration and linear algebra library design. Pretty pog tbh. Was always my dream to work at a ...

3

24

Vijay @__tensorcore__

23 Jan 2025

Been cooking this PTX design for nearly 2 years, hope y'all cook something good with this like @tri_dao always does. Fun is only beginning ...

2

11

24

2,923

Vijay @__tensorcore__

13 May 2025

We even have a series of Jupyter notebooks to get you started. My favorite is the one that teaches how to print types and values at both compile time and runtime. One of the best and most frequently used ways to debug kernels 😊 github.com/NVIDIA/cutlass/bl…

2

1

24

1,281

Vijay @__tensorcore__

23 Jan 2025

Replying to @__tensorcore__ @tri_dao

"The 5th generation TensorCore has dedicated on-chip memory that is specialized for use by TensorCore operations. This Tensor Memory is organized as a two-dimensional matrix where the horizontal rows are called lanes and the vertical columns are called columns."

1

1

21

3,563

Vijay @__tensorcore__

13 May 2025

Lastly, I want to say a massive thank you to the work (and sacrifices) of everyone who has worked on project. This is the first time CUTLASS has done a compiler, and it required a lot of collaboration across our CUDA and driver, compiler, Python, frameworks, DevTech

2

22

919

Vijay @__tensorcore__

11 Jul 2024

When FA2 was released, I wrote this tweet: nitter.app/DROP_ALL_TABLES/status… Today, I feel compelled to say the same thing again. It is so so cool to see DL community embrace and use CUTLASS and CuTe to develop novel algorithms that make Tensor Cores sing <3

Tri Dao

@tri_dao

11 Jul 2024

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/

2

4

19

3,089

Vijay @__tensorcore__

19 Jan 2025

“This is our last Hopper focused release. More exciting releases will come in 2025.” 👀👀👀👀👀👀

Haicheng Wu @asdf1234_0

18 Jan 2025

CUTLASS 3.7 is tagged. The highlights are from our long term contributors and friends: @BigManniM9 's block scaling FP8 GEMM and @AliHassaniJr 's distributed GEMM. This is our last Hopper focused release. More exciting releases will come in 2025. github.com/NVIDIA/cutlass/di…

1

18

1,452

Vijay @__tensorcore__

13 May 2025

We have also heard the your feedback asking for better documentation. CuTe DSL documentation and all existing CUTLASS C++ documentation is now homed at docs.nvidia.com/cutlass with a fresh coat of paint!

1

20

1,152

Vijay @__tensorcore__

17 Dec 2019

Winter break just started and ALL the HotChips31 sessions are up on YouTube!!! Time to get some popcorn...

2

2

16

Vijay @__tensorcore__

21 Jul 2025

Part 2: developer.nvidia.com/blog/cu… Covers the design of CUTLASS 3.x itself and how it builds a 2 layer GPU microkernel abstraction using CuTe as the foundation.

CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design | NVIDIA...

GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and warp-specialization schemes.

developer.nvidia.com

4

19

1,292

Vijay @__tensorcore__

23 Apr 2024

On average, the CUTLASS GitHub repo gets downloaded once per second.

François Chollet

@fchollet

23 Apr 2024

On average, the Keras pip package gets downloaded about 7 times per second.

3

17

1,728

Vijay @__tensorcore__

25 Jan 2023

CUTLASS 3.0 started almost exactly 2 years ago to the day. CuTe is truly groundbreaking, and I am lucky to have been a part of 3.0 since its inception. Tons to talk about our research innovations for a unified micro-kernel abstraction for GPUs at conferences throughout 2023! ^_^

Andrew Kerr @arkerr

24 Jan 2023

CUTLASS 3.0 has just been released, offering optimal performance on NVIDIA’s H100 and a new approach to template metaprogramming in CUDA C++. github.com/NVIDIA/cutlass

2

15

4,260

Vijay @__tensorcore__

28 Apr 2019

AgileX FPGA, I Love It

4

14

Vijay @__tensorcore__

22 Dec 2019

I need help! I just got done implementing PMC counter support in perf profiler for zen2. There are a lot of changes, and I wanted help from people who own a Zen1 or Zen2 based system to test if it actually works before I upstream it. Please clone github.com/thakkarv/linux

GitHub - thakkarV/linux: Linux kernel source tree

Linux kernel source tree. Contribute to thakkarV/linux development by creating an account on GitHub.

2

7

16

Vijay @__tensorcore__

13 May 2025

If you missed it, you can also watch our 40 min GTC talk that dives deep into CuTe DSL nvidia.com/en-us/on-demand/s…

1

1

16

879

Vijay @__tensorcore__

13 May 2025

CuTe DSL will be in beta for the next few months. We would love to hear your feedback and suggestions especially during this period. Check us out on GitHub and file issues or contribute examples: github.com/NVIDIA/cutlass

GitHub - NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra

CUDA Templates and Python DSLs for High-Performance Linear Algebra - NVIDIA/cutlass

1

18

1,052

Vijay @__tensorcore__

17 Nov 2020

I really should write a blog about this when I get the time but maybe a thread about the CUDA kernels I implemented is in order. Many thanks to @hpcgarage and collaborators @OLCFGOV for inviting me to work on this project, I am very grateful :) This is a wild to a newbie like me

GaTech CSE @GTCSE

17 Nov 2020

2020 Gordon Bell finalists, @DROP_ALL_TABLES & @richitobonito describe their work w/ @ORNL that uses #HPC to connect the dots across academic bodies of work. 📝:cc.gatech.edu/news/641385/go… The winning paper will be announced at @Supercomputing Nov. 19 @ 2 PM EST. Good luck, team!

1

15

Vijay @__tensorcore__

9 Sep 2024

New username, who dis?

2

15

1,732

Vijay @__tensorcore__

5 Sep 2019

HotChips 2019 Videos have started trickling in 🙃

1

15

Vijay @__tensorcore__

1 Feb 2025

💚

Hieu Pham

@hyhieu226

1 Feb 2025

Among many great things about the Blackwell chip, having CUTLASS as its central programming model is my favorite. That framework is so good it makes programming low-level kernels enjoyable.

14

1,050

Vijay @__tensorcore__

20 Mar 2024

Built on CUTLASS! Congrats @creed_humphrey and thanks for using our stuff :)

Humphrey Shi

@humphrey_shi

9 Mar 2024

Check out Faster Neighborhood Attention: great effort led by @AliHassaniJr to optimize multi-dimensional local, causal, sparse global attention at Threadblock level, paving the way for building next-generation efficient multimodal AI systems. code/paper: github.com/SHI-Labs/NATTEN

1

1

14

2,857

Vijay @__tensorcore__

23 Apr 2021

It was unheard of to approach cuBLAS level perf before this release, but as the cutlass benchmarks show in the chart, cuBLAS is only a stone's throw away. This is a huge win for open source and anyone who obsesses over Speed of Light kernels

1

13

Vijay @__tensorcore__

13 Nov 2020

OMG OMG OMG “Instead of pragma, C++ attribute syntax can be used to specify OpenMP directives, which simplifies their integration with templates. Full support for C11, C18, C++11, C++14, C++17 and C++20”

ALT Jonah Hill Yay GIF

HPC Guru @HPC_Guru

13 Nov 2020

.@OpenMP_ARB Releases #OpenMP 5.1 With Vital Usability Enhancements Bronis de Supinski: OpenMP 5.1 represents the culmination of the past two years of work within the OpenMP Language Committee openmp.org/press-release/ope… #HPC #AI #SC20

1

13

Vijay @__tensorcore__

30 Apr 2020

Plan for next week is to retire because I think my life has peaked 🤣

The HPC Garage @hpcgarage

30 Apr 2020

Not a bad week for our @DROP_ALL_TABLES (@GTCSE): he's a @gtcomputing Donald V. Jackson Fellowship winner *and* wrote code that ran on 24,000 or so GPUs (with @piyusch et al. @ORNL). Rich's logical follow-up question: what's the plan for *next* week? 🤔 cc.gatech.edu/content/2020-g…

3

12

Vijay @__tensorcore__

13 May 2025

Replying to @ebetica

If you want low level control, the freedom to design your own kernel algorithms, triton isn’t meeting your perf needs, or just love low level hacking the yes :)

13

483

Vijay @__tensorcore__

6 Nov 2020

I’ve been trying to compete at @Supercomputing’s @SCCompSC for 3 years now ever since I started in sophomore year of my undergrad. This is the first time I get to participate as a mentor for the Georgia Tech team. Follow our journey as we release daily coverage of the team! #SC20

GaTech CSE @GTCSE

6 Nov 2020

Ever heard of a nonstop 72-hour #HPC focused hackathon that requires 6+ months of training? Beginning tomorrow @ 4:30 PM: @gtcomputing's Team Phoenix will discuss how they prepped & more in their daily @Supercomputing video check-ins from Nov 7 – Nov 11 sites.gatech.edu/gtsc20/team…

ALT Team Phoenix Logo

1

12

Vijay @__tensorcore__

19 Apr 2025

Replying to @_xjdr

the accurate stuff is just regurgitation of the couple of blogs that are out there. There is nothing of actual substance in here that someone could not find with a single google search. The real alpha is in watching the GTC talks that is not captured here. :yawn:

1

12

992

Vijay @__tensorcore__

23 Jan 2025

Replying to @PytorchToAtoms

No, Blackwell does not support WGMMA. It has its own shiny new tensor core (TCGen05 MMA)

2

1

11

1,971

Vijay @__tensorcore__

25 Jul 2025

Replying to @marksaroufim

I’m only a few hours away in Boston! Can I come hang out and possible talk about CuTe DSL stuff?

2

12

513

Vijay @__tensorcore__

17 Jul 2025

This is what the internet was made for 🥹

typedfemale

@typedfemale

17 Jul 2025

presenting: big jeff's trainium hell

11

1,134

Vijay @__tensorcore__

22 Aug 2024

Replying to @typedfemale

Skill issue

1

11

267

Vijay @__tensorcore__

1 Mar 2021

Hear me talk about speed of light CUDA SemiRing kernels, GraphBLAS, shortest path applications and future HPC hardware today at 3:35 Central time. A spicy serving of dank OC memes included because I just can’t help myself. #SIAM #CSE21

The HPC Garage @hpcgarage

1 Mar 2021

If you're at @TheSIAMNews CSE'21, check out the fun GraphBLAS minisymposia happening today, which includes @GTCSE's @DROP_ALL_TABLES on his new CUDA dense semiring-GEMM library, cuASR ("quasar", cuasr.io) — MS11/MS41 meetings.siam.org/sess/dsp_p… #siamCSE21

1

10

Vijay @__tensorcore__

19 May 2020

Check out some of our recent work 👀

Oak Ridge Lab @ORNL

18 May 2020

ORNL and @GeorgiaTech scientists are tapping the immense power of the Summit #supercomputer to comb through millions of journal articles to find potential vaccines, drugs and effective measures that could stop the spread of #COVID19. ▶️ bit.ly/2WHKbCB #NatLabsInTheFight

10

Vijay @__tensorcore__

19 Apr 2025

Replying to @_xjdr

> In CUTLASS, you might call something like cute::copy_tma(output_desc, tmem_ptr_output, gmem_ptr_output, barrier) to lolwat

1

9

993

Vijay @__tensorcore__

15 Jul 2024

Replying to @apaszke

By that metric, latest FA3 forward kernel is only 143 lines too! github.com/Dao-AILab/flash-a…

1

1

10

1,365

Vijay @__tensorcore__

8 Oct 2020

Replying to @dylan522p @david_schor

At much higher target frequencies. IPC doesn’t live in a vacuum

1

10

Vijay @__tensorcore__

28 Sep 2019

Replying to @peresuslog @Cat_Merc

Oh no what will @dcominottim do without fixed boosts?

1

8

Vijay @__tensorcore__

17 Sep 2024

Replying to @__tensorcore__ @typedfemale

My definition of celebrity status is being subtweeted by typedfemale

9

152

Vijay @__tensorcore__

15 Jul 2024

Replying to @apaszke

I was mildly trolling but my point is shall we include the lowering and compiler lines of code for mosaic as well? CUTLASS (and C++ metaporg) is a kernel programming language and a compiler written in-band. Lines of code is not apples to apples

1

8

451

Vijay @__tensorcore__

11 Jul 2024

The ArXiv paper talks about the totally novel hardware aware algorithms that were required to achieve this level of optimization: tridao.me/publications/flash…

1

1

9

882

Vijay @__tensorcore__

19 Apr 2025

Replying to @_xjdr

💚 but we gotta solve this cockiness in LLMs for us to really get anywhere and use them as true copilots. Being confidently wrong is orders of magnitude worse than not knowing something (for existing knowledge, not for frontiers)

1

8

729

Vijay @__tensorcore__

22 Mar 2022

Hopper uArch whitepaper is out: nvidia.com/en-us/technologie… To my HPC nerds friends, I would read more about - Threadblock Clusters - Tensor memory accelerator - Distributed shared memory - Transaction barriers - DPX TMA and clusters are particularly cool

1

2

8

Vijay @__tensorcore__

23 Apr 2021

I don't think I have ever seen such an incredible boost to performance from a totally user invisible compiler change before. Let that sink in, no change to the user code is needed! That is insane! From the PoV of Proebsting's law, this is years worth of improvement in one go.

1

1

8

Vijay @__tensorcore__

19 Apr 2020

Address sanitizer is a wonderful, wonderful tool. That is all.

1

6

Vijay @__tensorcore__

17 Sep 2024

Replying to @typedfemale

My name is mentioned in a tweet without me being tagged in it 😳 Have I finally made it 🥹

1

8

286

Vijay @__tensorcore__

21 Aug 2019

Replying to @dcominottim @ryanshrout @Jauffins @aeassa1 @john_bonini @Po_st

it is not just SMT. "We then disabled four cores on the AMD Ryzen 9 3900X to bring the core count down from 12 to 8" -- if productivity tests are being run, why disable cores?

4

7

Vijay @__tensorcore__

25 Jan 2025

Replying to @bfspector @aaryan04 @HazyResearch @togethercompute @nvidia

You don't have to wait long.

Vijay @__tensorcore__

25 Jan 2025

🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few moths and years 💚

Blackwell is here

ALT Blackwell is here

1

7

355

Vijay @__tensorcore__

24 Aug 2022

When people ask me what I’m going to do after my PhD open.spotify.com/track/27Nov…

INDUSTRY BABY (feat. Jack Harlow)

Lil Nas X, Jack Harlow · INDUSTRY BABY (feat. Jack Harlow) · Song · 2021

open.spotify.com

1

7

Vijay @__tensorcore__

13 Mar 2025

Replying to @dylan522p @ID_AA_Carmack

Not quite. The hosts are still different processes. It’s sort of like an MPI program. @AliHassaniJr paging in

1

7

443

Vijay @__tensorcore__

20 Nov 2020

3rd PLACE AS A FIRST TIME TEAM LETS GOOOOOOOOOO!! Hey @danolds it might be time to revise that victory odds calculator ;)

The HPC Garage @hpcgarage

20 Nov 2020

❤️❤️❤️ these 💣🍑 humans, mostly sophomores, who went from "is that like the opposite of 'low-performance computing?'" to a top-3 VSCC crew by overall score in less than a year. — ❤️🤗bbs #sc20 #morethanhpc +1 to @PenguinHPC for tees and @greenhpc @nvidia for encouragement

8

Vijay @__tensorcore__

23 Apr 2021

btw, if you have not heard about cutlass yet, you should totally check it out! github.com/NVIDIA/cutlass/di…

CUDA 11.3 significantly improved the performance of CUTLASS · NVIDIA cutlass · Discussion #241

CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels. 298TFLOPS was recorded when benchmarking CUTLASS FP16 GEMM on A100. This is 14% higher than CUDA 11.2. F...

7

Vijay @__tensorcore__

28 Oct 2019

Replying to @__tensorcore__ @trav_downs @WikiChip

The plot thickens. Here are the results for mov reg, $0 and mov reg, $1. Moving constant 0 into a register has the same graph as zeroing and LB size test. This is different from moving 1 into a reg. Loop latency jumps for const 1 move at around 130-138 instruction depth.

2

2

8

Vijay @__tensorcore__

29 Sep 2024

Replying to @dylan522p @KuterDinel

Skill issue

5

153

Vijay @__tensorcore__

25 Jan 2025

Replying to @__tensorcore__ @bfspector @aaryan04 @HazyResearch @togethercompute @nvidia

PS, may wanna benchmark against example 77 now ;) I hear there might be a new fastest public FA in town

1

6

197

Vijay @__tensorcore__

21 Jul 2025

Replying to @cHHillee

Fair point, but the draft for this started >2 years ago 🥲

7

545

Vijay @__tensorcore__

21 Jun 2023

Replying to @cHHillee @ID_AA_Carmack

there is a sizable delta between 450/500W SXM A100 and PCIe A100 too

6

1,219

Vijay @__tensorcore__

11 Jul 2024

PyTorch blog post summarizes the paper well and provides a lot more data on the performance of the forward pass on fp8 and fp16 data types: nitter.app/PyTorch/status/1811445… On FP8, we can hit 1.2 PFLOP/s

PyTorch

@PyTorch

11 Jul 2024

Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to @colfaxintl, @AIatMeta, @NVIDIAAI and @togethercompute for the collaboration here 🙌 Read more in our blog: hubs.la/Q02Gf4hf0

1

1

6

1,274

Vijay @__tensorcore__

28 Oct 2019

Replying to @__tensorcore__ @trav_downs @WikiChip

VPRF size is measured similarly. Tests seem to indicate OoO window usable VPR count to be 144. Published number is 160, so this fits perfectly with the 16 that would be needed to maintain the YMM register state.

1

6

Vijay @__tensorcore__

8 Nov 2020

Here we go! @SCCompSC kicks off today, and here is a little about how the Georgia Tech Team Phoenix was put together, mostly through reddit :P

GaTech CSE @GTCSE

8 Nov 2020

Team Phoenix discusses supercomputers, 6 months of prep & discovering the VIP class that started it all via @reddit. It's been an interesting journey to get to this point in time & now we're ready to start the @Supercomputing Student Cluster Competition! piped.video/iHjCJPsr8hI

1

6

Vijay @__tensorcore__

4 May 2019

Replying to @IanCutress @Jauffins @intel @IntelGraphics @david_schor

Nooo! That would mean no more @WikiChip :(

6

Vijay @__tensorcore__

6 Aug 2020

Replying to @dylan522p @dcominottim @SWenerski @Cat_Merc @Nekrosmas @hms1193

Please don’t bring business decisions into technical ones. The fact that Nv prints money off of GEMM FF is a orthogonal to the discussion

1

6

Vijay @__tensorcore__

17 Aug 2020

Replying to @dcominottim

well it is in somewhat more of generic way, in that texture pipelines are shared with BVH traversal. paging @GPUsAreMagic . That said I have no clear idea about how NV engine is built with respect to other FF bits.

1

5

Vijay @__tensorcore__

25 Jan 2025

Replying to @MalekiSaeed

It should be very hackable. Code quality is really nice :) but I’ll let you be the judge. Really pushed to get this out day zero so folks like you can build on top of it rather than start from scratch. Contributions to add any and all new features are welcome

6

191

Vijay @__tensorcore__

6 Aug 2020

Replying to @dcominottim @SWenerski @Cat_Merc @Nekrosmas @hms1193

Any benchmark that’s SIMT heavy and not GEMM go brrr. MLPerf numbers become domain specialized to the point where it’s hard to extrapolate to other applications. Would you conclude that TPU is a solid graphics chip based on Google’s MLPerf results?

2

6

Vijay @__tensorcore__

13 Sep 2019

Could not be more proud of where @buhpc is and where it is going! It has been a great time building up from scratch with @iCountFromZero and @irlNnContainer.

13 Sep 2019

New Semester means new eboard! @DROP_ALL_TABLES and @irlNnContainer have graduated and @iCountFromZero is our new president. We also have a new website! Check it out at buhpc.club! Lots of exciting workshops and hack sessions in store like

1

6

Vijay @__tensorcore__

5 May 2024

Replying to @cHHillee @johannes_hage

+1. You can get 95% peak HFU or higher from just PTX these days. We’ve (CUTLASS team and the various compiler teams) worked really really hard to ensure democratic access to tensor cores :)

6

164

Vijay @__tensorcore__

17 Aug 2020

Replying to @jonmasters

great examples of HW/SW codesign too. if SIMD for cobol is not codesign then I don't know what is.

1

4

Vijay @__tensorcore__

1 Aug 2024

Roofline is not helpful in the post Dennard world. Even if you theoretically should be in the compute bound domain, wasting b/w will push you lower below the roofline. DVFS is a B!&#

6

684

Vijay @__tensorcore__

19 Aug 2019

Replying to @Cat_Merc @david_schor

I had to make sure I read that right in the morning more than just a couple of times.

5

Vijay @__tensorcore__

10 Mar 2020

Replying to @lavados @fork_while_1 @jovanbulck @mlqxyz @misc0110 @IEEESSP

The motherboard folding over was the best part, it’s so creative! It seems you folks are having fun :) dank Kontent

3

1

6