Scott Gray · Nov 20, 2023 · 10:16 AM UTC

Scott Gray @scottgray76

20 Nov 2023

OpenAI is nothing without its people

717

55,096

Scott Gray · Nov 19, 2023 · 5:09 AM UTC

Scott Gray @scottgray76

19 Nov 2023

❤️

Sam Altman

@sama

19 Nov 2023

i love the openai team so much

548

138,785

Scott Gray · Mar 7, 2021 · 2:45 AM UTC

Scott Gray @scottgray76

7 Mar 2021

In case you missed it, here's @model_mechanic @jcjohnss and @karpathy getting into the gritty details of the DALL-E paper: piped.video/PtdpWC7Sr98 . I hope this kind of interaction becomes more of the norm in the future.

Deep Learning Deep Dive Episode #3: DALL-E in depth

The actual paper for DALL-E was released only a few days after we p...

youtube.com

105

Scott Gray · Dec 14, 2017 · 8:03 PM UTC

Scott Gray @scottgray76

14 Dec 2017

You can find the slides to my NIPS talk on Small World Network Architectures here: supercomputersfordl2017.gith… There were lots of other great talks in that workshop. Thanks to Google (mainly @erich_elsen) for organizing.

Scott Gray · Apr 19, 2024 · 9:10 PM UTC

Scott Gray @scottgray76

19 Apr 2024

Replying to @karpathy

Couple tips on layernorm_fwd: use var(x) == mean(x**2) - mean(x)**2; use vector loads and/or loop rolling and/or more threads to hold input in registers for a single pass over global mem.

16,119
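
The variance identity in the tip above can be sketched in plain Python. This is a minimal illustration, not the kernel under discussion: the function name `layernorm_fwd_single_pass` and the `eps` default are assumptions, and the Python list stands in for values a GPU kernel would keep in registers after one read from global memory.

```python
import math

def layernorm_fwd_single_pass(x, eps=1e-5):
    """Normalize one row, accumulating both statistics in a single read pass."""
    n = len(x)
    s = 0.0    # running sum of x
    ss = 0.0   # running sum of x*x
    for v in x:            # the only pass that would touch global memory
        s += v
        ss += v * v
    mean = s / n
    var = ss / n - mean * mean     # var(x) == mean(x**2) - mean(x)**2
    var = max(var, 0.0)            # guard against a tiny negative from rounding
    rstd = 1.0 / math.sqrt(var + eps)
    # In a kernel the values reused here would come from registers, not memory.
    out = [(v - mean) * rstd for v in x]
    return out, mean, var

row = [1.0, 2.0, 3.0, 4.0]
out, mean, var = layernorm_fwd_single_pass(row)
print(mean, var)
```

Python floats are double precision, so this sketch does not reproduce fp32 rounding behavior; it only shows the data flow of accumulating the sum and the sum of squares together.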

Scott Gray · May 18, 2023 · 6:53 PM UTC

Scott Gray @scottgray76

18 May 2023

Replying to @soumithchintala @NumFOCUS

I'll preemptively match Soumith's match. I've gotten way more value than this out of those tools over the years.

6,438

Scott Gray · Jan 31, 2016 · 7:04 AM UTC

Scott Gray @scottgray76

31 Jan 2016

My F(4x4,3x3) kernel just peaked 18 eTflops on a TitanX. More work to do but the wait will soon be over. Again, many thanks to @ajlavin.

Scott Gray · Apr 23, 2019 · 4:57 PM UTC

Scott Gray @scottgray76

23 Apr 2019

Lots more work on sparsity in various forms forthcoming...

rewon @rewonfc

23 Apr 2019

Releasing some work today with @scottgray76 @AlecRad and @ilyasut. Contains some simple adaptations for Transformers that extend them to long sequences.

Scott Gray · Dec 18, 2022 · 7:57 PM UTC

Scott Gray @scottgray76

18 Dec 2022

I have no intention of changing my profile link. help.twitter.com/en/rules-an…

27,874

Scott Gray · Apr 20, 2024 · 12:20 AM UTC

Scott Gray @scottgray76

20 Apr 2024

Replying to @karpathy

Don't forget registers.. you have 64k*4*108 = 28M which is more than shared. And it's the fastest local state to leverage (followed by shared then L2)

3,326
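
The register arithmetic in the tweet above can be checked directly. The register count per SM, bytes per register, and SM count are the tweet's own figures; the 164 KiB of shared memory per SM used for comparison is an assumption added here (it matches a 108-SM A100-class part, which the tweet does not name).

```python
REGS_PER_SM = 64 * 1024   # 32-bit registers per SM (the tweet's "64k")
BYTES_PER_REG = 4
NUM_SMS = 108             # SM count from the tweet

register_bytes = REGS_PER_SM * BYTES_PER_REG * NUM_SMS

# Assumption (not in the tweet): up to 164 KiB of shared memory per SM.
SHARED_BYTES_PER_SM = 164 * 1024
shared_bytes = SHARED_BYTES_PER_SM * NUM_SMS

print(register_bytes)                  # 28311552, i.e. about 28M bytes
print(register_bytes / shared_bytes)   # registers hold roughly 1.56x more than shared
```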

Scott Gray · Nov 11, 2015 · 1:26 AM UTC

Scott Gray @scottgray76

11 Nov 2015

Winograd is coming: arxiv.org/abs/1509.09308 I'll try to have a blog post later this week going over this and upcoming work.

Fast Algorithms for Convolutional Neural Networks

Deep convolutional neural networks take GPU days of compute time to train on large data sets. Pedestrian detection for self driving cars requires very low latency. Image recognition for mobile...

arxiv.org

Scott Gray · Apr 20, 2024 · 12:01 AM UTC

Scott Gray @scottgray76

20 Apr 2024

Replying to @karpathy

Speedup will depend on dims and if the original multiple passes were able to be served out of L2. For things like ln, ln_grad, softmax, softmax_grad where you have multiple passes over inputs with reductions in between always do the math to see what you can fit in local state.

3,957

Scott Gray · Dec 18, 2016 · 5:14 PM UTC

Scott Gray @scottgray76

18 Dec 2016

Replying to @soumithchintala

I built these to serve as foundation for other work. @erich_elsen is doing c-api and tensorflow integration.

Scott Gray · May 26, 2019 · 12:48 AM UTC

Scott Gray @scottgray76

26 May 2019

Replying to @ID_AA_Carmack

This isn't as well commented as it could be, but you might glean a bit more from this asm source. It's since been incorporated into cublas for maxwell/pascal: github.com/openai/openai-gem…

Scott Gray · Apr 14, 2016 · 12:45 AM UTC

Scott Gray @scottgray76

14 Apr 2016

My GTC talk is now online: on-demand.gputechconf.com/gt… Probably could have used more diagrams and practice but that takes time away from coding

Scott Gray · Feb 2, 2022 · 2:26 AM UTC

Scott Gray @scottgray76

2 Feb 2022

Replying to @ID_AA_Carmack

Just got my first batch today too. I exercised no such restraint and have more on the way :)

Scott Gray · Aug 22, 2017 · 6:44 PM UTC

Scott Gray @scottgray76

22 Aug 2017

Replying to @tqchenml @guestrin

Working on some kernels now that are 3x faster than the TVM results. I also have some small block grouped conv kernels that are full util.

Scott Gray · Sep 8, 2019 · 12:38 AM UTC

Scott Gray @scottgray76

8 Sep 2019

Replying to @nikiparmar09

I have this implemented already (internally) as a mode of blocksparse transformer, but will likely soon work on an even more efficient version based on this separable conv code: github.com/openai/blockspars…

Scott Gray · Apr 25, 2019 · 5:27 PM UTC

Scott Gray @scottgray76

25 Apr 2019

This is a really cool project I'm happy to have had the chance to help out with.

OpenAI

@OpenAI

25 Apr 2019

Introducing MuseNet, a neural network which discovered how to generate music using many different instruments and styles. Listen & interact: openai.com/blog/musenet/ MuseNet will play an experimental concert today from 12–3pmPT on livestream: twitch.tv/openai

Scott Gray · Apr 20, 2024 · 9:08 PM UTC

Scott Gray @scottgray76

20 Apr 2024

Replying to @cHHillee @karpathy

This is what I was getting at. Perhaps I wasn’t explicit enough. But yah, with the input packed in fp16x2 registers you can go as high as 32k on the channel dim. Though the backward pass loads two tensors and can only fit 16k.

1,442

Scott Gray · Feb 25, 2021 · 4:21 PM UTC

Scott Gray @scottgray76

25 Feb 2021

Replying to @jekbradbury

I think knowing that gradient compression works well at scale is useful information for designing interconnects in the future. Also learning how to train within limited dynamic range can enable lower precision training (we already do a lot with fp8 internally).

Scott Gray · Apr 17, 2022 · 5:15 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

#dalle variation (same energy)

Scott Gray · Oct 9, 2017 · 8:01 PM UTC

Scott Gray @scottgray76

9 Oct 2017

Replying to @kyrpov @Smerity @PyTorch @RichardSocher @jekbradbury @StrongDuality @CaimingXiong

A simple tausworthe generator internally seeded with blockIds, gridIds and clock works just as well for dropout as other implementations.

Scott Gray · May 3, 2019 · 7:35 PM UTC

Scott Gray @scottgray76

3 May 2019

Replying to @soumithchintala @deliprao @Thom_Wolf

Happy to assist if you have any questions. I'm currently adding in support for relative attention. Also thinking about more directly accelerating banded diagonal / convolutional patterns.

Scott Gray · Apr 17, 2022 · 5:14 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

#dalle variation (same energy)

Scott Gray · Mar 9, 2016 · 7:58 AM UTC

Scott Gray @scottgray76

9 Mar 2016

Phenomenal achievement #AlphaGo team. Congratulations!

Scott Gray · Mar 20, 2020 · 7:43 PM UTC

Scott Gray @scottgray76

20 Mar 2020

I like using 1-6-9 and 0-6-10 with 2^-60 - 2^3 exp ranges for running means and variances for Adam. And we're using floating/learned biases to train networks in fp8 to great effect.

Scott Gray · Sep 27, 2020 · 5:35 PM UTC

Scott Gray @scottgray76

27 Sep 2020

Replying to @ID_AA_Carmack

As I understand it the dominant change in memory with aging comes from dentate gyrus volume loss. A reduced capacity to store episodic context likely doesn't help the cortex learn new and better generalizations. nature.com/articles/s41598-0…

Scott Gray · Apr 21, 2024 · 5:01 PM UTC

Scott Gray @scottgray76

21 Apr 2024

Replying to @scottgray76 @cHHillee @karpathy

You'd also want a custom cg::reduce that can operate on float2. Anyway this math has been used to successfully train and inference models at scale. Though nowadays I think rmsnorm is preferred and all this is a moot point.

1,295

Scott Gray · Mar 4, 2016 · 3:58 AM UTC

Scott Gray @scottgray76

4 Mar 2016

Replying to @soumithchintala

@amiconfusediam It's nice to be able to contribute back to the community from which we derive so much benefit.

Scott Gray · Dec 19, 2016 · 8:00 PM UTC

Scott Gray @scottgray76

19 Dec 2016

.@dominikgrewe @soumithchintala I added P100 benchmarks. Efficient small tile implementations are even more key to perf on 56 SM P100.

Scott Gray · May 25, 2023 · 10:13 PM UTC

Scott Gray @scottgray76

25 May 2023

Replying to @Tim_Dettmers

That's pretty expensive.. keep in mind SFU ops are much lower throughput (maybe 4x?). I've been pondering an in register lookup table using prmt and/or lop3. Ideally you generate 2 fp16 outputs in pairs. An int8 mapping might be useful as well.

453

Scott Gray · Jun 24, 2020 · 3:29 AM UTC

Scott Gray @scottgray76

24 Jun 2020

Replying to @vinodg

Don't think your paper talks about this, but I've seen significant gains over cuBlas with JIT compiled and autotuned gemm kernels (independent of fusion opportunities). Key tuning params are tile size, splitK dims, and tile scheduling strategies. Constant folding helps too.

Scott Gray · May 25, 2023 · 8:57 PM UTC

Scott Gray @scottgray76

25 May 2023

Replying to @Tim_Dettmers

How many instructions per element do you think you can get it down to? That is can you keep it fast for large batches, and not just small batch / bandwidth bound?

441

Scott Gray · Jan 16, 2023 · 6:01 PM UTC

Scott Gray @scottgray76

16 Jan 2023

Replying to @Tim_Dettmers

Mostly yah, though the attention op cannot be statically transposed. QK is fine I think.. WV less so. Though I guess you can transpose the V output from the previous projection. Of course by now we'd rather have some fp4 support. Inline conversion for that will likely be possible

513

Scott Gray · Aug 18, 2020 · 4:34 PM UTC

Scott Gray @scottgray76

18 Aug 2020

Replying to @ID_AA_Carmack

memset() on the gpu generally runs in under 1us, independent of the buffer size. Not sure why pytorch doesn't leverage this and instead calls a custom kernel to fill tensors.

Scott Gray · Feb 25, 2021 · 9:48 PM UTC

Scott Gray @scottgray76

25 Feb 2021

Replying to @jekbradbury

I'm a little wary of hardware with 2D tori topology strongly baked in as a prior. It might overly constrain our thinking on the kinds of networks we build. This is probably especially true as we progress towards models with more modalities and sub-modalities (like in the brain)

Scott Gray · Apr 3, 2016 · 4:40 AM UTC

Scott Gray @scottgray76

3 Apr 2016

I'll swap these in as soon as I get back from GTC. nitter.app/ajlavin/status/7164790…

Scott Gray · Apr 21, 2024 · 4:56 PM UTC

Scott Gray @scottgray76

21 Apr 2024

Replying to @cHHillee @karpathy

I was suggesting doing both the numerical shortcut and loading acts into registers for reuse. The shortcut is not unstable given you can do accumulations in close to log(n) serial steps, cancelation is not an issue and you really only need ~3 bits of accurate mantissa at output.

1,225

Scott Gray · Apr 20, 2024 · 12:06 AM UTC

Scott Gray @scottgray76

20 Apr 2024

Replying to @rzidane360 @karpathy

In practice this is never observed with training distributions.. and if you're paranoid you can convert to double precision and do subtraction there (this has zero overhead in this bandwidth bound op)

624

Scott Gray · Jun 9, 2020 · 2:05 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @proteneer

When I'm coding cuda-c I almost exclusively refer to the ptx ISA documentation. When compiling I always disassemble and make sure the sass looks like I think it should. Then I might use inline ptx to patch things up here and there.

Scott Gray · May 7, 2016 · 11:48 PM UTC

Scott Gray @scottgray76

7 May 2016

Replying to @sedielem

@sedielem @karpathy I'll have filter dilation and reflection supported in neon kernels in the next few days.

Scott Gray · May 25, 2023 · 10:48 PM UTC

Scott Gray @scottgray76

25 May 2023

Replying to @Tim_Dettmers

Great, I'll let you work on it a bit :) Without the normal remapping 4b=>fp16 is pretty trivial. Just mask and shift the bits to the fp16 denorm position and apply scale/bias (works for sym and asym schemes). 1.5 instructions per element.
and.b32 a0, b4x8, 0xf000f000;
and.b32 a1, b4x8, 0x0f000f00;
and.b32 a2, b4x8, 0x00f000f0;
and.b32 a3, b4x8, 0x000f000f;
shr.b32 a0, a0, 6;
shr.b32 a1, a1, 2;
shl.b32 a2, a2, 2;
shl.b32 a3, a3, 6;
fma.rn.f16x2 a0, a0, scale, bias;
fma.rn.f16x2 a1, a1, scale, bias;
fma.rn.f16x2 a2, a2, scale, bias;
fma.rn.f16x2 a3, a3, scale, bias;

491
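
The mask/shift/fma sequence above can be emulated on the CPU to see why it works. The sketch below is an illustration under assumptions, not the author's kernel: it uses one mask per nibble position (0xf000f000, 0x0f000f00, 0x00f000f0, 0x000f000f), reinterprets each 16-bit half as fp16 through Python's `struct` format `"e"`, and performs the multiply-add in Python floats rather than in fp16. The function names and the example `scale` and `bias` values are hypothetical.

```python
import struct

def half_bits_to_float(bits):
    """Reinterpret a 16-bit pattern as an IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def unpack_b4x8(word, scale, bias):
    """Decode eight 4-bit codes from a 32-bit word, two per masked register."""
    # (mask, shift): negative shift means shift right, positive means shift left.
    steps = [
        (0xF000F000, -6),
        (0x0F000F00, -2),
        (0x00F000F0, +2),
        (0x000F000F, +6),
    ]
    out = []
    for mask, shift in steps:
        a = word & mask
        a = (a >> -shift) if shift < 0 else ((a << shift) & 0xFFFFFFFF)
        # Each code now sits in mantissa bits 6..9 of an fp16 denormal,
        # so a code n reads back as n * 2**-18.
        lo = half_bits_to_float(a & 0xFFFF)
        hi = half_bits_to_float(a >> 16)
        out.append((lo * scale + bias, hi * scale + bias))
    return out

# Nibbles of 0x76543210 are 7,6,5,4 (upper half) and 3,2,1,0 (lower half).
decoded = unpack_b4x8(0x76543210, scale=2.0 ** 14, bias=0.0)
print(decoded)
```

With `scale = 2**14` each code `n` decodes to `n / 16`; a nonzero `bias` would shift the range for an asymmetric scheme. Twelve instructions for eight elements gives the 1.5 instructions per element quoted in the tweet.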

Scott Gray · Apr 17, 2022 · 7:09 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Scott Gray · Oct 6, 2015 · 5:04 AM UTC

Scott Gray @scottgray76

6 Oct 2015

Replying to @soumithchintala

@amiconfusediam Working with Andrew on a GPU implementation right now.

Scott Gray · Mar 19, 2022 · 4:54 PM UTC

Scott Gray @scottgray76

19 Mar 2022

Replying to @karpathy

A look at the state of classic NES Tetris today: piped.video/29rj4UKVqxk

Best Roll vs. Best Tap | Cheez vs. Alex T | Finals | Classic Tetris...

Featuring the innovator of rolling Cheez against the fastest tapper...

youtube.com

Scott Gray · Oct 6, 2015 · 10:39 PM UTC

Scott Gray @scottgray76

6 Oct 2015

Replying to @karpathy

@karpathy That's just for 4x4 blocking. With 6x6 blocking the speedup can be as much as 4X. I'm working on this now and expect full utilization

Scott Gray · Jan 16, 2023 · 5:50 PM UTC

Scott Gray @scottgray76

16 Jan 2023

Replying to @Tim_Dettmers

From the ptx docs: "The transpose operation is only supported for the wgmma.mma_async variants with .f16/ .bf16 types on matrices accessed from shared memory using matrix descriptors." So getting fp8 transposed is likely going to be tricky and inefficient.

261

Scott Gray · May 24, 2015 · 1:15 AM UTC

Scott Gray @scottgray76

24 May 2015

Replying to @petewarden

@petewarden @karpathy And there is also this one: arxiv.org/abs/1502.02551 We've confirmed that stochastic rounding is very helpful w/ low prec.

Scott Gray · Oct 9, 2017 · 9:10 PM UTC

Scott Gray @scottgray76

9 Oct 2017

Replying to @Smerity @kyrpov @PyTorch @RichardSocher @jekbradbury @StrongDuality @CaimingXiong

You could do that, I was just doing this: (float)(lfsr0 ^ lfsr1 ^ lfsr2) * 2.0f**-32 > keep_prob ? 0.0f : 1.0f
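
The expression above can be sketched end to end in Python. The tweets do not say which Tausworthe variant was used, so this assumes the three-component taus88 recurrence from L'Ecuyer; the fixed seeds stand in for the blockId/gridId/clock seeding mentioned in the thread, and the names `Taus88` and `dropout_mask` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def taus_step(s, a, b, c, d):
    """One Tausworthe (LFSR) update of a 32-bit state word."""
    t = (((s << a) & MASK32) ^ s) >> b
    return (((s & c) << d) & MASK32) ^ t

class Taus88:
    """Three combined LFSRs; seeds must exceed 1, 7 and 15 respectively."""
    def __init__(self, s0, s1, s2):
        assert s0 > 1 and s1 > 7 and s2 > 15
        self.s0, self.s1, self.s2 = s0 & MASK32, s1 & MASK32, s2 & MASK32

    def next_u32(self):
        self.s0 = taus_step(self.s0, 13, 19, 0xFFFFFFFE, 12)
        self.s1 = taus_step(self.s1, 2, 25, 0xFFFFFFF8, 4)
        self.s2 = taus_step(self.s2, 3, 11, 0xFFFFFFF0, 17)
        return self.s0 ^ self.s1 ^ self.s2   # lfsr0 ^ lfsr1 ^ lfsr2

def dropout_mask(rng, n, keep_prob):
    """Return n mask values: 1.0 keeps an element, 0.0 drops it."""
    mask = []
    for _ in range(n):
        u = rng.next_u32() * 2.0 ** -32     # uniform in [0, 1)
        mask.append(0.0 if u > keep_prob else 1.0)
    return mask

# Hypothetical seeds; a kernel would derive them from block/grid ids and the clock.
mask = dropout_mask(Taus88(12345, 67890, 13579), n=20000, keep_prob=0.8)
kept_fraction = sum(mask) / len(mask)
print(kept_fraction)
```

The kept fraction should land close to `keep_prob`; the exact value depends on the seeds.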

Scott Gray · Apr 20, 2015 · 10:16 PM UTC

Scott Gray @scottgray76

20 Apr 2015

Replying to @sedielem

@sedielem @coffeephoenix @petewarden @karpathy My fprop does P*Q small MMs of dim KxNxCRS indexed by blockIdx. Gets full utilization.

Scott Gray · Jun 9, 2020 · 3:42 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @scottgray76 @proteneer @ajlavin

In my opinion you should be able to mostly ignore your lawyers and rely on your engineering to evolve the design faster than the copycats can catch up. There's a huge amount of value in allowing the wider community to leverage your hardware to the maximal extent.

Scott Gray · May 7, 2016 · 8:24 AM UTC

Scott Gray @scottgray76

7 May 2016

@sedielem @amiconfusediam I think Nvidia is going to find it difficult to segment the graphics and d-learning markets joelonsoftware.com/articles/…

Scott Gray · Jul 28, 2016 · 7:00 PM UTC

Scott Gray @scottgray76

28 Jul 2016

I'll carry on as before but with more focus on supporting cutting edge research at OpenAI. This should generally benefit all.

Scott Gray · Jun 20, 2020 · 12:21 AM UTC

Scott Gray @scottgray76

20 Jun 2020

Replying to @rianflo

Nowadays, with tensorcore code, I find I'm frequently trying to max out register usage (at least in kernels designed to be compute bound). Also, there is 2.67x more SRAM in registers than there is in shared on V100. Don't be afraid to use it if it avoids extra trips to DRAM.
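
The 2.67x figure above follows from per-SM capacities. The two sizes below are assumptions supplied here rather than numbers stated in the tweet: a 256 KiB register file per SM (64K 32-bit registers) and a maximum of 96 KiB of configurable shared memory per SM on V100.

```python
REGISTER_FILE_KIB_PER_SM = 256   # assumed: 64K registers x 4 bytes
SHARED_KIB_PER_SM = 96           # assumed: maximum configurable shared memory

ratio = REGISTER_FILE_KIB_PER_SM / SHARED_KIB_PER_SM
print(round(ratio, 2))  # 2.67
```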

Scott Gray · Apr 17, 2022 · 5:39 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Scott Gray · Oct 13, 2023 · 11:37 PM UTC

Scott Gray @scottgray76

13 Oct 2023

Replying to @unixpickle

For TitanX (or pre-tensorcore gpus) you can check out my old blog about the assembler and matmul I wrote. Some of it is no longer relevant but a fair amount still is. github.com/NervanaSystems/ma…

SGEMM

Assembler for NVIDIA Maxwell architecture. Contribute to NervanaSystems/maxas development by creating an account on GitHub.

github.com

643

Scott Gray · Mar 23, 2020 · 12:59 AM UTC

Scott Gray @scottgray76

23 Mar 2020

Replying to @dpkingma @UCSF @annieluet

I've been browsing youtube for old epidemiology talks, particularly from Dr. Mike Osterholm. I think this one from 2 years ago is pretty spot on to what we're going through now. piped.video/C6DNndjBG-c

1918 Pandemic: Expert Panel Discussion

2018 marks the 100th anniversary of the deadly 1918 influenza pande...

youtube.com

Scott Gray · May 8, 2016 · 2:34 AM UTC

Scott Gray @scottgray76

8 May 2016

Replying to @soumithchintala

@amiconfusediam For my kernels that are compute bound I expect them to stay that way. They currently get very high L2 utilization.

Scott Gray · Dec 18, 2016 · 5:17 PM UTC

Scott Gray @scottgray76

18 Dec 2016

I haven't worked on fp16x2 optimization yet, but these kernels run fine on P100.

Scott Gray · Apr 17, 2022 · 6:50 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

And even less specific: "A view of God being created, digital art."

Scott Gray · May 7, 2016 · 7:59 AM UTC

Scott Gray @scottgray76

7 May 2016

Replying to @sedielem

@sedielem @amiconfusediam 8GB is indeed good enough. Just finished some work to allow weights in fp32 and all compute in fp16 (+winograd).

Scott Gray · Jun 16, 2020 · 3:10 PM UTC

Scott Gray @scottgray76

16 Jun 2020

It would be nice to see how this performs in the under-fitting regime. Over-fitting models have lots of spare capacity that is easily compressible.

Scott Gray · Feb 24, 2021 · 8:09 PM UTC

Scott Gray @scottgray76

24 Feb 2021

Replying to @stonehenge2500 @AthenaAkrami @maosbot

I forgot to mention hydroxyzine (10mg) is extremely effective in giving immediate relief from histamine induced brain fog. It's a great way to see if that's the cause and not something else (like POTS). Something to have on hand while waiting for MC stabilizer to kick in.

Scott Gray · Feb 6, 2016 · 12:20 AM UTC

Scott Gray @scottgray76

6 Feb 2016

Replying to @soumithchintala

@amiconfusediam @coffeephoenix That 30s time is actually pretty CPU bound on EW op generation currently. I'll be addressing that soon.

Scott Gray · Feb 24, 2021 · 8:19 PM UTC

Scott Gray @scottgray76

24 Feb 2021

Replying to @stonehenge2500 @AthenaAkrami @maosbot

And more on luteolin, MCAS and post-covid: mastcellmaster.com/covid.php

Scott Gray · Mar 4, 2016 · 7:06 PM UTC

Scott Gray @scottgray76

4 Mar 2016

Replying to @dfarmer

@dfarmer @amiconfusediam It is a 4x algorithmic speedup which means it does indeed reduce the number of flops required (still using FFMA)
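
The algorithmic speedup quoted above can be reproduced by counting multiplications. For a Winograd transform F(m x m, r x r), each input tile of size (m + r - 1) squared needs one multiply per tile element, while direct convolution needs m*m*r*r multiplies for the same outputs. The sketch below counts only those multiplies and ignores the cost of the input, filter and output transforms; the function name is hypothetical.

```python
def winograd_mult_reduction(m, r):
    """Ratio of direct to Winograd multiplies for F(m x m, r x r)."""
    direct = (m * m) * (r * r)      # multiplies per m x m output tile, direct conv
    tile = m + r - 1                # input tile edge: 4 for F(2x2,3x3), 6 for F(4x4,3x3)
    winograd = tile * tile          # one multiply per transformed tile element
    return direct / winograd

print(winograd_mult_reduction(2, 3))  # 2.25 with 4x4 tiles
print(winograd_mult_reduction(4, 3))  # 4.0 with 6x6 tiles
```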

Scott Gray · Feb 27, 2021 · 4:35 PM UTC

Scott Gray @scottgray76

27 Feb 2021

Replying to @scottgray76 @kennethpayne01 @stonehenge2500 @AthenaAkrami @maosbot

Then there is also this hope as well:

Mara Gay @MaraGay

24 Feb 2021

This is both anecdotal and early, but many long covid survivors are feeling significantly better after receiving their first vaccine dose. Including me. Fascinating.

Scott Gray · Jun 9, 2020 · 3:39 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @proteneer @ajlavin

A lot of times this requires a deep understanding of gpu micro-architecture. Sadly, Nvidia seems unwilling to provide this due to concerns about giving competitors a recipe for a hardware spec. They're not alone in this as I'm still in the dark on low level TPU specs.

Scott Gray · Feb 23, 2021 · 12:34 AM UTC

Scott Gray @scottgray76

23 Feb 2021

Replying to @heyCrac @AthenaAkrami @maosbot

I take luteolin (PureLut from Algonot) for my histamine intolerance induced brain fog. Though it takes a couple months for the effect to kick in. This (or NeuroProtek) is also what a lot of the MCAS community uses for mast cell stabilization. Perhaps it could help LongCovid..

Scott Gray · Jul 21, 2018 · 4:46 PM UTC

Scott Gray @scottgray76

21 Jul 2018

Replying to @jeremyphoward @varun19299

I've been sitting on a bunch of grouped/separable conv code for a while now but haven't quite made the time to complete it. It's derived from direct convolution techniques that @ajlavin here developed and first implemented. I'll see if I can make some time to finally finish it.

Scott Gray · May 5, 2022 · 1:26 AM UTC

Scott Gray @scottgray76

5 May 2022

Replying to @ercarp_ @bakztfuture

Definitely getting close it seems..

Scott Gray · May 28, 2024 · 8:19 PM UTC

Scott Gray @scottgray76

28 May 2024

Replying to @cis_female

Why not just use last block to zero the scratchpad: if (res+1 == gridDim.x)? But if you do need to poll you'll need something like ld.volatile or ld.relaxed (not just ldg.cg).

1,181

Scott Gray · Feb 10, 2016 · 5:24 PM UTC

Scott Gray @scottgray76

10 Feb 2016

Replying to @alexjc

@alexjc @sedielem On nvidia hardware popc is 1/4 throughput so only 8x faster than fp32, not 32x. But still worth implementing.
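
The 8x estimate above is 32 bit-level products per popc divided by the 4x throughput penalty. A minimal sketch of the kind of operation being counted follows, assuming the usual XNOR-plus-popcount formulation of a binary dot product (the tweet itself only mentions popc); the function name and bit-packing convention are hypothetical.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed as bits (1 means +1)."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")   # XNOR, then popcount
    return 2 * agree - n                                 # agreements minus disagreements

BITS_PER_POPC = 32
POPC_RELATIVE_THROUGHPUT = 1 / 4          # the tweet's figure versus fp32 ops
speedup_vs_fp32 = BITS_PER_POPC * POPC_RELATIVE_THROUGHPUT

print(binary_dot(0b1011, 0b1101, 4))  # 0
print(speedup_vs_fp32)                # 8.0
```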

Scott Gray · Jul 1, 2016 · 5:11 PM UTC

Scott Gray @scottgray76

1 Jul 2016

Latest neon release fully supports Pascal. All Maxwell kernels work there now. New gemm kernels are also planned.

Scott Gray · Feb 27, 2021 · 4:32 PM UTC

Scott Gray @scottgray76

27 Feb 2021

Replying to @kennethpayne01 @stonehenge2500 @AthenaAkrami @maosbot

I just have run of the mill HI. Though I live with someone with MCAS and POTS and these symptoms are all very familiar to me: drtinapeers.com/longcovid I would be trying several H1/H2 blockers along with the luteolin: hydroxyzine, rupatadine, famotidine, loratadine, cetirizine

Long Covid — Dr. Tina Peers

Potential relief for those suffering from symptoms post COVID-19.

drtinapeers.com

Scott Gray · Jun 24, 2020 · 3:30 AM UTC

Scott Gray @scottgray76

24 Jun 2020

Replying to @scottgray76 @vinodg

Fast code generation is key to all this. Happy that you guys are moving in this direction.

Scott Gray · Oct 29, 2015 · 7:27 AM UTC

Scott Gray @scottgray76

29 Oct 2015

Replying to @reworkpip

@reworkpip @teamrework I suppose I could trek up there for this. It's good to get away from the computer from time to time and mingle..

Scott Gray · Nov 4, 2015 · 5:13 PM UTC

Scott Gray @scottgray76

4 Nov 2015

Replying to @soumithchintala

@amiconfusediam @AlecRad New versions of direct and winograd conv will have this built in.

Scott Gray · Mar 11, 2021 · 3:30 PM UTC

Scott Gray @scottgray76

11 Mar 2021

Replying to @ID_AA_Carmack

I'd assumed you're either compute bound and using an ASIC or memory bound and on a GPU which is a regime that's pretty easy to optimize. Though perhaps it's a bit more in between in which case a few % could be squeezed out? I haven't looked.

Scott Gray · May 23, 2015 · 11:01 PM UTC

Scott Gray @scottgray76

23 May 2015

Replying to @petewarden

@petewarden @karpathy Binary is something Matthieu Courbariaux is exploring right now, also check out his paper on limited precision.

Scott Gray · May 19, 2016 · 12:24 AM UTC

Scott Gray @scottgray76

19 May 2016

Replying to @soumithchintala @amiconfusediam

This article claims it's just an inference chip and perhaps just 8 bit: eetimes.com/document.asp?doc…

Scott Gray · Nov 11, 2015 · 9:38 PM UTC

Scott Gray @scottgray76

11 Nov 2015

Replying to @petewarden

@petewarden Most of the credit goes to @ajlavin for the amazing bit of research to figure this all out.

Scott Gray · Feb 27, 2021 · 4:51 PM UTC

Scott Gray @scottgray76

27 Feb 2021

Replying to @scottgray76 @kennethpayne01 @stonehenge2500 @AthenaAkrami @maosbot

Oh and I should point out I just take the luteolin now (200mg PureLut with each meal) and don't need anything else. I'm no longer reactive to any foods. I did stop the luteolin once and paid the price by having to wait two months for it to kick back in again.

Scott Gray · May 23, 2015 · 11:03 PM UTC

Scott Gray @scottgray76

23 May 2015

Replying to @petewarden

@petewarden @karpathy Also, I'm finishing up some tools now that should allow full exploration of this low precision space in large networks

Scott Gray · Aug 18, 2020 · 10:02 PM UTC

Scott Gray @scottgray76

18 Aug 2020

Replying to @ID_AA_Carmack

I haven't benchmarked this explicitly, it's just something I've noticed in nvprof profile timelines. With a sync included I can see the gpu forcing you to wait till DRAM is updated. Or this could be a feature of Volta HBM. I'll investigate later today.

Scott Gray · Feb 25, 2021 · 4:26 PM UTC

Scott Gray @scottgray76

25 Feb 2021

Replying to @scottgray76 @jekbradbury

Also our tools for scaling models have progressed quite a bit since this was implemented. I think this was the last big model we trained in Tensorflow/GCE.

Scott Gray · Jun 9, 2020 · 2:25 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @vinodg @proteneer

I'm really annoyed that the shfl instruction now requires warp sync. I like the old way of not having to worry about inactive threads not participating in the shuffle.

Scott Gray · Jun 9, 2020 · 2:50 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @scottgray76 @vinodg @proteneer

It should not be the case that the canonical way to do a cta reduction requires ugly warp sync branching.

Scott Gray · Aug 19, 2020 · 5:18 AM UTC

Scott Gray @scottgray76

19 Aug 2020

Replying to @scottgray76 @ID_AA_Carmack

I usually only memset(0) small allocations for use in critical sections and otherwise write code that avoids accumulations into tensors I know are uninitialized.

Scott Gray · Apr 17, 2022 · 8:11 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Scott Gray · Jun 9, 2020 · 2:31 AM UTC

Scott Gray @scottgray76

9 Jun 2020

Replying to @vinodg @proteneer

Sometimes you just want to tell the hardware to do something you know it's capable of. Leave "safe" mode to the compiled path.

Scott Gray · May 8, 2016 · 2:02 AM UTC

Scott Gray @scottgray76

8 May 2016

Replying to @soumithchintala

@amiconfusediam With GDDR5X you get about 15% more bandwidth per core. But the cores are running 42% faster..

Scott Gray · May 12, 2015 · 1:02 AM UTC

Scott Gray @scottgray76

12 May 2015

Replying to @petewarden

@petewarden As soon as I get a bit of spare time I'll write them a custom cgemm kernel. Working together we can probably make this viable.

Scott Gray · Apr 17, 2022 · 6:36 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Sorry.. n = 1 prompt with 10 generations, cherry picked results (though it's very hard to pick the best). Here are some more abstract ones from "A view of God being created from the machine, digital art."

Scott Gray · Aug 19, 2020 · 5:12 AM UTC

Scott Gray @scottgray76

19 Aug 2020

Replying to @ID_AA_Carmack

OK, it looks like the profiler is just lying about the duration of the operation. If you timeline this you'll see: gist.github.com/scott-gray/5…

memset_bench.py

GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

Scott Gray · Apr 17, 2022 · 6:01 PM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Adding ", by Asher Brown Durand" at the end of this prompt.

Scott Gray · Mar 26, 2016 · 5:40 PM UTC

Scott Gray @scottgray76

26 Mar 2016

Replying to @soumithchintala

@amiconfusediam @zygmuntzajac How is the performance at small minibatches? Or are you guys only concerned with batched inference?

Scott Gray · Apr 17, 2022 · 8:50 AM UTC

Scott Gray @scottgray76

17 Apr 2022

Replying to @nin_artificial

Scott Gray · Mar 20, 2020 · 7:46 PM UTC

Scott Gray @scottgray76

20 Mar 2020

Ideally network normalization should be tuned to be able to run with a fixed limited dynamic range (like the brain does). This probably requires leveraging stronger non-linearities to induce more activation/gradient sparsity.