(5/7) In terms of library improvements, new TK has:
- No more explicit shared layouts (selected internally), just specify type and size.
- Global layouts, to automatically manage TMA descriptors (on H100) and strides (on everything else).
- Support for more types, including complex-valued ones (we swear we’ll get around to FP8 one of these days, it’s not even hard).
- Tens of thousands of robust unit tests for just about everything.
- Templates to manage pipelining and latency-hiding. Get FA-3-approaching performance in just 42 lines of device code!
- Lots and lots of performance optimizations: reducing bank conflicts, managing address spaces, issuing better instructions, removing synchronizations, and more.
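
For anyone wondering what "reducing bank conflicts" means in practice, here is a tiny host-side sketch. It is not TK code and not necessarily how TK lays out shared memory; it just models the textbook padding trick under an assumed setup of 32 banks, 4-byte words, and a 32-thread warp. The function name `max_bank_conflict` and the constants are made up for illustration.

```cpp
#include <algorithm>
#include <array>

// Assumed hardware model (not taken from the post): shared memory has
// 32 banks, and consecutive 4-byte words map to consecutive banks.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: thread t of a warp reads the word at (row = t, col)
// of a row-major tile whose rows are `row_stride` words apart. Returns the
// largest number of threads that land on the same bank, i.e. how many
// serialized passes the access would need under this model.
int max_bank_conflict(int row_stride, int col) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    for (int t = 0; t < kWarpSize; ++t) {
        int word = t * row_stride + col;
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, a column read from a tile with a row stride of 32 words puts every thread on the same bank, while padding each row by one word (stride 33) spreads the threads across all 32 banks.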