GPU Geek at @OpenAI. I have a long standing interest in neuroscience and its application to machine learning. He/Him.

San Francisco, CA
OpenAI is nothing without its people
19
36
717
55,096
❤️
i love the openai team so much
9
15
548
138,785
In case you missed it, here's @model_mechanic @jcjohnss and @karpathy getting into the gritty details of the DALL-E paper: piped.video/PtdpWC7Sr98 . I hope this kind of interaction becomes the more of the norm in the future.
6
105
You can find the slides to my NIPS talk on Small World Network Architectures here: supercomputersfordl2017.gith… There were lots of other great talks in that workshop. Thanks to Google (mainly @erich_elsen) for organizing.
19
51
Replying to @karpathy
Couple tips on layernorm_fwd: use var(x) == mean(x**2) - mean(x)**2 use vector loads and/or loop rolling and/or more threads to hold input in registers for single pass over global mem.
3
51
16,119
I'll preemptively match Soumith's match. I've gotten way more value than this out of those tools over the years.
4
3
47
6,438
My F(4x4,3x3) kernel just peaked 18 eTflops on a TitanX. More work to do but the wait will soon be over. Again, many thanks to @ajlavin.
1
1
33
Lots more work on sparsity in various forms forthcoming...
Releasing some work today with @scottgray76 @AlecRad and @ilyasut. Contains some simple adaptations for Transformers that extend them to long sequences.
1
2
33
I have no intention of changing my profile link. help.twitter.com/en/rules-an…
1
31
27,874
Replying to @karpathy
Don't forget registers.. you have 64k*4*108 = 28M which is more than shared. And it's the fastest local state to leverage (followed by shared then L2)
1
1
28
3,326
Replying to @karpathy
Speedup will depend on dims and if the original multiple passes were able to be served out of L2. For things like ln, ln_grad, softmax, softmax_grad where you have multiple passes over inputs with reductions in between always do the math to see what you can fit in local state.
1
1
20
3,957
Replying to @soumithchintala
I built these to serve as foundation for other work. @erich_elsen is doing c-api and tensorflow integration.
2
20
Replying to @ID_AA_Carmack
This isn't as well commented as it could be, but you might glean a bit more from this asm source. It's since been incorporated into cublas for maxwell/pascal: github.com/openai/openai-gem…
2
1
18
My GTC talk is now online: on-demand.gputechconf.com/gt… Probably could have used more diagrams and practice but that takes time away from coding
4
13
Replying to @ID_AA_Carmack
Just got my first batch today too. I exercised no such restraint and have more on the way :)
1
12
Replying to @tqchenml @guestrin
Working on some kernels now that are 3x faster than the TVM results. I also have some small block grouped conv kernels that are full util.
1
12
Replying to @nikiparmar09
I have this implemented already (internally) as a mode of blocksparse transformer, but will likely soon work on an even more efficient version based on this separable conv code: github.com/openai/blockspars…
11
This is a really cool project I'm happy to have had the chance to help out with.
Introducing MuseNet, a neural network which discovered how to generate music using many different instruments and styles. Listen & interact: openai.com/blog/musenet/ MuseNet will play an experimental concert today from 12–3pmPT on livestream: twitch.tv/openai
1
13
Replying to @cHHillee @karpathy
This is what I was getting at. Perhaps I wasn’t explicit enough. But yah, with the input packed in fp16x2 registers you can go as high as 32k on the channel dim. Though the backward pass loads two tenors and can only fit 16k.
1
10
1,442
Replying to @jekbradbury
I think knowing that gradient compression works well at scale is useful information for designing interconnects in the future. Also learning how to train within limited dynamic range can enable lower precision training (we already do a lot with fp8 internally).
2
9
Replying to @nin_artificial
#dalle variation (same energy)
2
8
A simple tausworthe generator internally seeded with blockIds, gridIds and clock works just as well for dropout as other implementations.
1
9
Happy to assist if you have any questions. I'm currently adding in support for relative attention. Also thinking about more directly accelerating banded diagonal / convolutional patterns.
1
8
Replying to @nin_artificial
#dalle variation (same energy)
1
1
6
Phenomenal achievement #AlphaGo team. Congratulations!
7
I like using 1-6-9 and 0-6-10 with 2^-60 - 2^3 exp ranges for running means and variances for Adam. And we're using floating/learned biases to train networks in fp8 to great effect.
1
7
Replying to @ID_AA_Carmack
As I understand it the dominate change in memory with aging comes from dentate gyrus volume loss. A reduced capacity to store episodic context likely doesn't help the cortex learn new and better generalizations. nature.com/articles/s41598-0…
1
7
You'd also want a custom cg::reduce that can operate on float2. Anyway this math has been used to successfully train and inference models at scale. Though now-a-days I think rmsnorm is preferred and all this is a moot point.
7
1,295
Replying to @soumithchintala
@amiconfusediam It's nice to be able to contribute back to the community from which we derive so much benefit.
1
5
.@dominikgrewe @soumithchintala I added P100 benchmarks. Efficient small tile implementations are even more key to perf on 56 SM P100.
6
Replying to @Tim_Dettmers
That's pretty expensive.. keep in mind SFU ops are much lower throughput (maybe 4x?). I've been pondering an in register lookup table using prmt and/or lop3. Ideally you generate 2 fp16 outputs in pairs. An int8 mapping might be useful as well.
1
6
453
Replying to @vinodg
Don't think your paper talks about this, but I've seen significant gains over cuBlas with JIT compiled and autotuned gemm kernels (independent of fusion opportunities). Key tuning params are tile size, splitK dims, and tile scheduling strategies. Constant folding helps too.
1
4
Replying to @Tim_Dettmers
How many instructions per element do you think you can get it down to? That is can you keep it fast for large batches, and not just small batch / bandwidth bound?
1
5
441
Replying to @Tim_Dettmers
Mostly yah. though attention op cannot be statically transposed. QK is fine I think.. WV less so. Though I guess you can transpose the V output from the previous projection. Of course by now we'd rather have some fp4 support. Inline conversion for that will likely be possible
1
5
513
Replying to @ID_AA_Carmack
memset() on the gpu generally runs in under 1us, independent of the buffer size. Not sure why pytorch doesn't leverage this and instead calls a custom kernel to fill tensors.
1
5
Replying to @jekbradbury
I'm a little wary of hardware with 2D tori topology strongly baked in as a prior. It might overly constrain our thinking on the kinds of networks we build. This is probably especially true as we progress towards models with more modalities and sub-modalities (like in the brain)
5
I'll swap these in as soon as I get back from GTC. nitter.app/ajlavin/status/7164790…
4
Replying to @cHHillee @karpathy
I was suggesting doing both the numerical shortcut and loading acts into registers for reuse. The shortcut is not unstable given you can do accumulations in close to log(n) serial steps, cancelation is not an issue and you really only need ~3 bits of accurate mantissa at output.
2
6
1,225
In practice this is never observed with training distributions.. and if you're paranoid you can convert to double precision and do subtraction there (this has zero overhead in this bandwidth bound op)
1
3
624
Replying to @proteneer
When I'm coding cuda-c I almost exclusively refer to the ptx ISA documentation. When compiling I always disassemble and make sure the sass looks like I think it should. Then I might use inline ptx to patch things up here and there.
2
2
Replying to @sedielem
@sedielem @karpathy I'll have filter dilation and reflection supported in neon kernels in the next few days.
1
4
Replying to @Tim_Dettmers
Great, I'll let you work on it a bit :) Without the normal remapping 4b=>fp16 is pretty trivial. Just mask and shift the bits to the fp16 denorm position and apply scale/bias (works for sym and asym scemes). 1.5 instructions per element. and.b32 a0, b4x8, 0xf000f000; and.b32 a1, b4x8, 0x0f000f00; and.b32 a3, b4x8, 0x000f000f; and.b32 a3, fp4x8, 0x000f000f; shr.b32 a0, a0, 6; shr.b32 a1, a1, 2; shl.b32 a2, a2, 2; shl.b32 a3, a3, 6; fma.rn.f16x2 a0, a0, scale, bias; fma.rn.f16x2 a1, a1, scale, bias; fma.rn.f16x2 a2, a2, scale, bias; fma.rn.f16x2 a3, a3, scale, bias;
4
491
Replying to @soumithchintala
@amiconfusediam Working with Andrew on a GPU implementation right now.
2
Replying to @karpathy
@karpathy That's just for 4x4 blocking. 6x6 blocking the speedup can be as much as 4X. Im working on this now and expect full utilization
4
Replying to @Tim_Dettmers
From the ptx docs: "The transpose operation is only supported for the wgmma.mma_async variants with .f16/ .bf16 types on matrices accessed from shared memory using matrix descriptors." So getting fp8 transposed is likely going to be tricky and inefficient.
1
3
261
Replying to @petewarden
@petewarden @karpathy And there is also this one: arxiv.org/abs/1502.02551 We've confirmed that stochastic rounding very helpful w/ low prec.
2
5
You could do that, I was just doing this: (float)(lfsr0 ^ lfsr1 ^ lfsr2) * 2.0f**-32 > keep_prob ? 0.0f : 1.0f
4
Replying to @sedielem
@sedielem @coffeephoenix @petewarden @karpathy My fprop does P*Q small MMs of dim KxNxCRS index by blockIdx. Gets full utilization.
2
4
In my opinion you should be able to mostly ignore your lawyers and rely on your engineering to evolve the design faster that the copy cats can catch up. There's a huge amount of value in allowing the wider community to leverage your hardware to the maximal extent.
2
@sedielem @amiconfusediam I think Nvidia is going to find difficult to segment the graphics and d-learning markets joelonsoftware.com/articles/…
1
3
I'll carry on as before but with more focus on supporting cutting edge research at OpenIA. This should generally benefit all.
2
Replying to @rianflo
Nowadays, with tensorcore code, I find I'm frequently trying to max out register usage (at least in kernels designed to be compute bound). Also, there is 2.67x more SRAM in registers than there is in shared on V100. Don't be afraid to use it if it avoids extra trips to DRAM.
1
1
3
Replying to @unixpickle
For TitanX (or pre-tensorcore gpus) you can check out my old blog about the assembler and matmul I wrote. Some of it is no longer relevant but there's still a fair amount that still is. github.com/NervanaSystems/ma…
1
4
643
I've been browsing youtube for old epidemiology talks, particularly from Dr. Mike Osterholm. I think this one from 2 years ago is pretty spot on to what we're going through now. piped.video/C6DNndjBG-c
3
Replying to @soumithchintala
@amiconfusediam For my kernels that are compute bound I expect them to stay that way. They currently get very high L2 utilization.
3
I haven't worked on fp16x2 optimization yet, but these kernels run fine on P100.
1
3
Replying to @nin_artificial
And even less specific: "A view of God being created, digital art."
1
Replying to @sedielem
@sedielem @amiconfusediam 8GB is indeed good enough. Just finished some work to allow weights in fp32 and all compute in fp16 (+winograd).
1
2
It would be nice to see how this performs in the under-fitting regime. Over-fitting models have lots of spare capacity that is easily compressible.
2
I forgot to mention hydroxyzine (10mg) is extremely effective in giving immediate relief from histamine induced brain fog. It's great way to see if that's the cause and not something else (like POTS). Something to have on hand while waiting for MC stabilizer to kick in.
3
2
Replying to @soumithchintala
@amiconfusediam @coffeephoenix That 30s time is actually pretty CPU bound on EW op generation currently. I'll be addressing that soon.
1
2
Replying to @dfarmer
@dfarmer @amiconfusediam It is a 4x algorithmic speedup which means it does indeed reduce the number of flops required (still using FFMA)
1
2
Then there is also this hope as well:
This is both anecdotal and early, but many long covid survivors are feeling significantly better after receiving their first vaccine dose. Including me. Fascinating.
1
2
Replying to @proteneer @ajlavin
A lot of times this requires a deep understanding of gpu micro-architecture. Sadly, Nvidia seems unwilling to provide this due to concerns about giving competitors a recipe for a hardware spec. They're not alone in this as I'm still in the dark on low level TPU specs.
1
1
I take luteolin (PureLut from Algonot) for my histamine intolerance induced brain fog. Though it takes a couple months for the effect to kick in. This (or NeuroProtek) is also what a lot of the MCAS community uses for mast cell stabilization. Perhaps it could help LongCovid..
1
2
I've been sitting on a bunch of grouped/seperable conv code for a while now but haven't quite made the time to complete it. It's derived from direct convolution techniques that @ajlavin here developed and first implemented. I'll see if I can make some time to finally finish it.
1
2
Definitely getting close it seems..
2
2
Replying to @cis_female
Why not just use last block to zero the scratchpad: if (res+1 == gridDim.x)? But if you do need to poll you'll need something like ld.volatile or ld.relaxed (not just ldg.cg).
1
5
1,181
Replying to @alexjc
@alexjc @sedielem On nvidia hardware popc is 1/4 throughput so only 8x faster than fp32, not 32x. But still worth implementing.
1
Latest neon release fully supports Pascal. All Maxwell kernels work there now. New gemm kernels are also planned.
I just have run of the mill HI. Though I live with someone with MCAS and POTS and these symptoms are all very familiar to me: drtinapeers.com/longcovid I would be trying several H1/H2 blockers along with the luteolin: hydroxyzine, rupatadine, famotidine, loratadine, cetirizine
1
1
Replying to @scottgray76 @vinodg
Fast code generation is key to all this. Happy that you guys are moving in this direction.
1
Replying to @reworkpip
@reworkpip @teamrework I suppose I could trek up there for this. It's good to get away from the computer from time to time and mingle..
1
1
Replying to @soumithchintala
@amiconfusediam @AlecRad New versions of direct and winograd conv will have this built in.
1
Replying to @ID_AA_Carmack
I'd assumed you're either compute bound and using an ASIC or memory bound and on a GPU which is a regime that's pretty easy to optimize. Though perhaps it's a bit more in between in which case a few % could be squeezed out? I haven't looked.
2
Replying to @petewarden
@petewarden @karpathy Binary is something Matthieu Courbariaux is exploring right now, also check out his paper on limited precision.
1
1
This article claims it's just an inference chip and perhaps just 8 bit: eetimes.com/document.asp?doc…
2
1
Replying to @petewarden
@petewarden Most of the credit goes to @ajlavin for the amazing bit of research to figure this all out.
1
1
Oh and I should point out I just take the luteolin now (200mg PureLut with each meal) and don't need anything else. I'm no longer reactive to any foods. I did stop the luteolin once and payed that price by having to wait two months for it to kick back in again.
2
2
Replying to @petewarden
@petewarden @karpathy Also, I'm finishing up some tools now that should allow full exploration of this low precision space in large networks
1
1
Replying to @ID_AA_Carmack
I haven't benchmarked this explicitly, it's just something I've noticed in nvprof profile timelines. With a sync included I can see the gpu forcing you to wait till DRAM is updated. Or this could be a feature of Volta HBM. I'll investigate later today.
1
Also our tools for scaling models have progressed quite a bit since this was implemented. I think this was the last big model we trained in Tensorflow/GCE.
1
Replying to @vinodg @proteneer
I'm really annoyed that the shfl instruction now requires warp sync. I like the old way of not having to worry about inactive threads not participating in the shuffle.
2
It should not be the case that the canonical way to do a cta reduction requires ugly warp sync branching.
I usually only memset(0) small allocations for use in critical sections and otherwise write code that avoids accumulations into tensors I know are uninitialized.
1
1
Replying to @vinodg @proteneer
Sometimes you just want to tell the hardware to do something you know it's capable of. Leave "safe" mode to the the compiled path.
1
Replying to @soumithchintala
@amiconfusediam With GDDR5X you get about 15% more bandwidth per core. But the cores are running 42% faster..
1
1
Replying to @petewarden
@petewarden As soon as I get a bit of spare time I'll write them a custom cgemm kernel. Working together we can probably make this viable.
1
Replying to @nin_artificial
Sorry.. n = 1 prompt with 10 generations, cherry picked results (though its very hard to pick the best). Here are some more abstract ones from "A view of God being created from the machine, digital art."
1
Replying to @ID_AA_Carmack
OK, it looks like the profiler is just lying about the duration of the operation. If you timeline this you'll see: gist.github.com/scott-gray/5…
1
1
Replying to @nin_artificial
Adding ", by Asher Brown Durand" at the end of this prompt.
Replying to @soumithchintala
@amiconfusediam @zygmuntzajac How is the performance at small minibatches? Or are you guys only concerned with batched inference?
1
1
Ideally network normalization should be tuned to be able to run with a fixed limited dynamic range (like the brain does). This probably requires leveraging stronger non-linearities to induce more activation/gradient sparsity.