engineer. destroy the whole world

Chicago, IL
Pinned Tweet
wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
107
316
4,807
970,365
Replying to @nearcyan
behold, new food
5
2
689
15,502
Replying to @atomicthumbs
1
9
544
Replying to @ChazakielDoremi
i was an 8 year old without specific expectations so i loved it
502
20,299
arxiv mods rejected this paper. they won’t say why. I don’t really care at this point, took weeks to get approval to submit. I think twitter boys will like it, that’s what matters.
13
207
42,340
the core trick I use comes from `Gradients without Backpropagation`. using the JVP, you can find the alignment of random vectors to the gradient, and reconstruct it. only a forward pass!
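A minimal sketch of the forward-gradient idea described above, assuming a toy quadratic loss and a hand-rolled dual-number class standing in for a real JVP routine. The names `Dual`, `loss`, `jvp`, and `forward_gradient` are illustrative and not from the paper. Each sample draws a random vector `v`, measures its alignment with the gradient through one forward pass, and accumulates `alignment * v`; the average approaches the true gradient.

```python
import random


class Dual:
    """Number val + tan*eps with eps**2 == 0; tan carries the directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a b' + a' b) eps
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__


def loss(w, target):
    """Toy loss (an assumption for this sketch): sum of squared errors."""
    total = Dual(0.0)
    for wi, ti in zip(w, target):
        d = wi - ti
        total = total + d * d
    return total


def jvp(w, v, target):
    """Directional derivative of the loss at w along v, from one forward pass."""
    duals = [Dual(wi, vi) for wi, vi in zip(w, v)]
    return loss(duals, target).tan


def forward_gradient(w, target, num_samples, rng):
    """Average of (grad . v) * v over Gaussian v; unbiased for the true gradient."""
    est = [0.0] * len(w)
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in w]
        alignment = jvp(w, v, target)
        for i, vi in enumerate(v):
            est[i] += alignment * vi / num_samples
    return est
```

For this loss the exact gradient is `2 * (w - target)`, so calling `forward_gradient` with a large `num_samples` should land close to that value. A single sample is much noisier.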
5
6
186
30,699
doesn’t this violate information theory? no, it’s probably that the domain in which models are compressible is the correlation of their loss gradient to noise. or something.
7
5
182
25,208
Replying to @chasebratton
the true mark of wealth
2
2
164
what about bitnet? bitnet does inference in 1.58b, but training uses precision weights. basically they clamp weights to ternary {-1,0,1} in forward pass, and pretend they didn’t in backward pass.
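A rough sketch of the "clamp in the forward pass, pretend you didn't in the backward pass" pattern (a straight-through estimator), assuming a single linear unit with a squared-error loss. It is a simplification for illustration: real BitNet training differs in details such as scaling and activation quantization, and the helper names `ternarize` and `ste_step` are made up here.

```python
def ternarize(weights, eps=1e-8):
    """Map full-precision weights to {-1, 0, 1} after scaling by the mean absolute value."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def ste_step(latent, x, target, lr=0.1):
    """One training step: ternary forward pass, full-precision update.

    The forward pass uses the ternary weights. The backward pass treats the
    quantizer as the identity, so the gradient with respect to the ternary
    weights is applied directly to the full-precision ("latent") weights.
    """
    q, _ = ternarize(latent)
    y = sum(qi * xi for qi, xi in zip(q, x))  # forward pass with ternary weights
    err = y - target                          # d(0.5 * err**2) / dy
    return [w - lr * err * xi for w, xi in zip(latent, x)]
```

The full-precision weights have to be kept around for the whole of training, which is the memory cost the tweet points at.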
1
3
174
34,938
it’s often said academia is unwell. I was unimpressed with how these people operate. Real Science needs a massive cultural change. More openness, less hostility, less structure. I’m not a real researcher; freely discard my comments.
4
2
166
23,067
actually, one acknowledgement: call me schizo but months ago I was discussing the algorithm loudly at dinner and I swear @DarioAmodei was watching me, grinning just like this. just eating with his wife. perhaps some things are fated.
3
2
162
39,269
the other thing is distributed training. the steps are tiny, the optimizer is stateless. imagine: a distributed training cluster across the internet with such low traffic it’s undetectable.
2
2
146
22,116
womp womp: this doesn’t work with ternary weights! if you make the random vectors sparse, it does work! another exciting part is that your ‘alignments’ can just be {-1, 0, +1}, it still works!
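A toy sketch of what sparse random vectors with {-1, 0, +1} alignments could look like, assuming a quadratic loss with an analytic directional derivative and float weights for simplicity. This is not the paper's implementation, and every name and hyperparameter below is an assumption. Each step keeps only the sign of each JVP and moves the weights against the aligned perturbations.

```python
import random


def sparse_ternary_vector(n, density, rng):
    """Each entry is 0 with probability 1 - density, otherwise +1 or -1."""
    return [rng.choice((-1, 1)) if rng.random() < density else 0 for _ in range(n)]


def loss(w, target):
    """Toy loss (assumed for this sketch): sum of squared errors."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def jvp(w, target, v):
    """Directional derivative of the toy loss along v (analytic for this toy)."""
    return sum(2.0 * (wi - ti) * vi for wi, ti, vi in zip(w, target, v))


def sign(x):
    """Ternary alignment: +1, -1, or 0."""
    return (x > 0) - (x < 0)


def train(target, steps=3000, k=4, density=0.25, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(target)
    for _ in range(steps):
        vectors = [sparse_ternary_vector(len(w), density, rng) for _ in range(k)]
        # Only the sign of each directional derivative is kept.
        aligns = [sign(jvp(w, target, v)) for v in vectors]
        for a, v in zip(aligns, vectors):
            for i, vi in enumerate(v):
                w[i] -= lr * a * vi  # step against the aligned perturbation
    return w
```

With these settings a step is fully described by `k` ternary alignments plus whatever is needed to regenerate the vectors, which is where the "few bytes per step" framing comes from.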
2
124
27,250
model exfiltration becomes easier too: a disgruntled worker might gmail themselves gpt4o. woohoo! proliferation! <3
3
1
114
22,743
this means gradient steps are now tiny! an entire step in a few bytes. is there a tradeoff? it doesn’t seem like you need to take more steps to converge.
3
112
26,167
instead of storing model weights, you could instead store training steps, with massive size reduction. download a sota model in a second?
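One way the "store training steps instead of weights" idea could work, sketched under the assumption that each step is fully determined by a seed for the noise generator plus a few ternary alignments. The log format, constants, and function names are hypothetical and not taken from the paper. Replaying the log from the same initial weights reproduces the trained weights exactly.

```python
import random

LR, DENSITY, K = 0.01, 0.25, 4  # assumed hyperparameters, shared by trainer and replayer


def vectors_from_seed(seed, n):
    """Regenerate the K sparse ternary perturbations for one step from its seed."""
    rng = random.Random(seed)
    return [
        [rng.choice((-1, 1)) if rng.random() < DENSITY else 0 for _ in range(n)]
        for _ in range(K)
    ]


def apply_step(w, seed, aligns):
    """Apply one logged step in place: weights move against aligned perturbations."""
    for a, v in zip(aligns, vectors_from_seed(seed, len(w))):
        for i, vi in enumerate(v):
            w[i] -= LR * a * vi


def train_and_log(target, steps):
    """Train on a toy quadratic loss, recording (seed, alignments) per step."""
    w, log = [0.0] * len(target), []
    for seed in range(steps):  # the step index doubles as the seed
        aligns = []
        for v in vectors_from_seed(seed, len(w)):
            d = sum(2.0 * (wi - ti) * vi for wi, ti, vi in zip(w, target, v))
            aligns.append((d > 0) - (d < 0))
        apply_step(w, seed, aligns)
        log.append((seed, aligns))
    return w, log


def replay(log, n):
    """Rebuild weights from the step log alone, starting from zeros."""
    w = [0.0] * n
    for seed, aligns in log:
        apply_step(w, seed, aligns)
    return w
```

Replaying only a prefix of the log, for example `replay(log[:50], n)`, gives the weights as they were after step 50.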
2
1
113
24,184
second, we need better tools to write compute kernels! I tried for a long, long time to get a functional and high performance kernel written for noise_step training. I am still without. kernel language is one of the projects I might work on next.
5
111
20,938
this format has some other cool properties. you can recover the complete history of weights! full-rank finetunes become tiny. you might even be able to flip or mask out past training steps, idk.
2
1
102
22,250
Replying to @jesse_squires
definitely an issue
1
93
now, instead of acknowledgements, I have grievances!
3
2
95
21,309
another neat thing is the JVP can run alongside normal inference for low cost. this enables more practical continual learning.
2
1
87
21,403
Replying to @shaggysurvives
use it like a latex keyboard
59
4,378
Replying to @rebelcrayon
these are very good points. thank you
1
52
4,677
Replying to @atomicthumbs @imgur
dw i will be able to buy imgur for $120 in six months and reverse the policy
1
1
39
6,375
Replying to @PandaAshwinee
I have good news: the algorithms appear not to be the same. there is conceptual similarity in the approach (gradient is being projected onto noise), but the way this is accomplished is different, and the scaling properties should not be assumed to be the same.
1
1
36
12,140
Replying to @Andercot
Recall that blackhole interior has reduced spatial dimension, it is (2, 1), the radial coordinate is timelike. I bet our universe is the (3,1) interior of a black hole in (4,1) spacetime, or even timeless space: (4,0). Odd though that our timelike coordinate seems reversed! We have a singularity of infinite density inevitably in our past, not in our future!
1
1
34
3,338
some sort of chicken game
3
28
3,204
Replying to @floates0x
afaik indian team did not follow procedure properly and is attempting a second synthesis
1
26
9,124
Replying to @hoffridder
These conversations kill me. It's not that it doesn't come up. it is literally constantly *always* coming up, like, you are ignoring or not aware of it! Interviews are structured poorly, sure, but jfc
21
Replying to @VictorTaelin
the lowest precision activations you can get away with to date are 4 bit (uses tricks), but yes i think it’s all integer arithmetic. bitnet uses a precision matrix scalar, but that can be factored out of the dense matmuls. some modalities might require a float encoder. i think there’s been work on binary token embeddings, but again 4 bit activations. i would love to see an all-tern/bit paradigm. idk if that’s possible.
1
24
4,833
Replying to @giffmana @davidad
i’d like to train other models. note that adam over ternary mnist also reaches ~90%. we will see how step size versus weight size scales. my thinking is the steps format is most effective at large scale as the benefit to additional samples diminishes
2
21
3,742
Replying to @fermatslibrary
The relativistic kinetic energy is 48.02 Joules! That's the same energy as dropping your iPhone X on your face, but from 92 feet 4 inches in the air. In a single proton!
2
1
19
They have this extremely useful feature where you can check via search if you can use a song and in what countries. They're removing it for no reason sometime soon. The workaround is post it as private, check if you get flagged, if so, re-edit the video and try again. Vomit.
1
1
19
btw, thank you for reading! with this (oversized) model, the correct weight size is 53_191 bytes. with the config used, with convergence at 1000 steps, the steps cost 25_280 bytes.
1
20
10,699
Replying to @swagitda_
Pretty odd that in 2021 a pdf can turn your computer into your secret enemy and the solution the entire world agrees on is just "guess which PDFs are hexed and don't open em"
1
16
Replying to @andrewmccalip
Some dude sells it on Etsy, for an "element collection". etsy.com/listing/1480854294/…
2
14
4,916
Replying to @gfodor
Replying to @_brickner
I woke up late, here is a cpu implementation colab.research.google.com/dr…
18
4,111
i know a secret. the compute, bandwidth, memory, and energy requirements will all melt away.
OpenAI's O3 model really makes the Doomer hard takeoff or "FOOM" theory look like a bunch of BS. As we start to enter the age of AGI, the massive amounts of required compute, interconnect bandwidth, and energy are real physical constraints that govern scale over the time dimension. FOOM can't happen when you need to build massive solar farms and nuclear power plants.
2
17
1,540
Replying to @samsoniuk
1
17
1,232
for all the computational ability of current models i have never seen them have a genuinely good idea. nothing novel of value really emerges. who has studied this?
2
15
2,233
Replying to @sdamico
in fact there are claims that the visitors gifted equivalent artifacts to both the americans and soviets. whatever you believe, the capabilities are now known to be real, and the propulsion is reactionless. a secret physics must exist.
2
12
4,383
Replying to @mountain_ghosts
god saves the silliest battles for the funniest clowns
12
People have been brainwashed into thinking it's normal and that any other way would make us unsafe lmao
1
11
Replying to @servomechanica
MIT admission is not about being cool and smart, and this is actually a good thing-
2
12
1,248
Replying to @leaacta
it could be an illusion, and maybe the catgirls latent in all computer communities are just less readily visible or something ..or maybe the big ears allow them to hear the subtle screams of cpus accessing memory unsafely, could be both
1
12
Replying to @mattparlmer
the only real engineers are those who work on steam engine locomotives
1
1
12
2,014
If you've never used @fusetools you've never lived. I never realized how awful the DOM was until now. I started like 3 days ago and it's blowing my mind #FuseTools
1
1
12
Replying to @iamdevloper
As an old man I sit in my armchair and fondly remember my life, filled with mystifying segfaults, intriguing runtime type errors, thrilling RCE CVEs, suspenseful execution times. Rust never did corrupt me, I wrote code the hard way, the honest way. I am happy.
3
10
they may have anticipated this timing attack info leak & set a constant higher duration per token
1
10
2,654
ive seen arrows you people couldnt imagine
2
1
10
1,499
can someone 'endorse' me for the ML arxiv? I can't submit my paper at all lol. i am just a guy. it's a cool paper
1
1
11
1,040
Replying to @FeiKhal
if I were a great lakes captain I'd simply make my boat flexible and not have that happen to me
10
was thinking about the long-term positive externality of rustlang, like from an economic perspective. I think the lifetime value of rustlang to humanity is something on the order of +$1T, essentially generated for free by volunteers.
9
the future is gonna be so cool
Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics simulation platform designed for general-purpose robotics and physical AI applications. Genesis's physics engine is developed in pure Python, while being 10-80x faster than existing GPU-accelerated stacks like Isaac Gym and MJX. It delivers a simulation speed ~430,000 faster than in real-time, and takes only 26 seconds to train a robotic locomotion policy transferrable to the real world on a single RTX4090 (see tutorial: genesis-world.readthedocs.io…). The Genesis physics engine and simulation platform is fully open source at github.com/Genesis-Embodied-…. We'll gradually roll out access to our generative framework in the near future. Genesis implements a unified simulation framework all from scratch, integrating a wide spectrum of state-of-the-art physics solvers, allowing simulation of the whole physical world in a virtual realm with the highest realism. We aim to build a universal data engine that leverages an upper-level generative framework to autonomously create physical worlds, together with various modes of data, including environments, camera motions, robotic task proposals, reward functions, robot policies, character motions, fully interactive 3D scenes, open-world articulated assets, and more, aiming towards fully automated data generation for robotics, physical AI and other applications. Open Source Code: github.com/Genesis-Embodied-… Project webpage: genesis-embodied-ai.github.i… Documentation: genesis-world.readthedocs.io… 1/n
10
1,404
Replying to @Kaju_Nut
8
391
Replying to @gbrl_dick
this is the funniest one i’ve seen yet. walking it through your own math problem without mentioning death camp and then claiming gpt plans death camp
10
872
Replying to @leaacta
genuinely wonder what causes the enrichment of catgirls in the rust community / userbase relative to the general computer-person population bc it is definitely a real phenomenon
2
10
place ur bets boys!
Created a manifold market to decide if this paper is real or not, link below
11
3,943
Replying to @everestpipkin
transistors > cells means chip > brain
2
9
Replying to @sharifshameem
I wonder how it scales with large complexity and defining really complex things. At what point is it more cumbersome to describe the behavior in English than to write the code yourself unambiguously?
10
Replying to @teortaxesTex
hell portal link confirmed..?
Replying to @_brickner
I woke up late, here is a cpu implementation colab.research.google.com/dr…
1
10
585
Replying to @acidshill
i think the input from workers was about the thickness of metal they can weld with high quality bonds and no porosity / breakthrough of the material
8
208
problem: schwarzschild radius of our universe in (4,1) does not match! the 5D formula differs. Maybe it’s in (4,0), but literally has no causality! GR is not suitable. Another option is that total mass is over many non-interacting (3,1) shells, spaced along the timelike dimension.
Replying to @Andercot
Recall that blackhole interior has reduced spatial dimension, it is (2, 1), the radial coordinate is timelike. I bet our universe is the (3,1) interior of a black hole in (4,1) spacetime, or even timeless space: (4,0). Odd though that our timelike coordinate seems reversed! We have a singularity of infinite density inevitably in our past, not in our future!
1
9
1,126
Replying to @zswitten
> if only you could evaluate these models objectively by having a general intelligence review a large set of outputs & evaluate them in a highly nuanced wholistic way :)
1
9
627
Replying to @benedictevans
When does Twitter get its Dancing Hotdog moment
6
what is the source originally? what ties this to the hearing? can’t find any back links
1
8
10,485
Replying to @tonofcrates
none (!), it should be an iterator -> iterator method
2
7
1,635
Replying to @torchcompiled
why is this more impressive to me than HD images and audio and video haha
1
8
406
Replying to @Titan1Beast
Replying to @_brickner
I woke up late, here is a cpu implementation colab.research.google.com/dr…
8
2,178
reminder that the average age of an Apollo mission engineer was 28, and the Manhattan project was 25. put young smart nicotine addict males in control, reap the rewards.
Can’t believe we’re running the Iraq playbook on our own government
1
8
512
Replying to @jpohhhh @ArmandDoma
why would they do that? did it work or something
6
575
Replying to @RBehiel @Andercot
It’s funny you mention a deep connection to the Higgs field. Any scalar field like T requires a spin-0 boson. We have already seen one: the Higgs boson! I’m a layman, is it possible that T literally is the Higgs field? This may offer a different view of its nonzero VEV, as mentioned in the original thread. I would appreciate your thoughts
6
1,575
Replying to @KeziyahL
I thought this guy was doing satire, was I wrong?
2
7
merry christmas the world will never be the same
7
1,080
let my people hoe
6
idk jack but would i be wrong to think that it implies a 20T token corpus
1
6
425
so if i do 1 coin flip and see heads, P = 2/3? very upsetting result
4
6
1,648
also I think it would imply our universe is curved and closed, only locally flat. I think it would be spherical. this could be true, if beyond the cosmic horizon it’s very very large.
2
6
268
o3 deserves an even higher score on ARC lmao. also amazing it can infer the transform so reliably from the text format they use
You've seen some of the puzzles o3 failed, but have you seen the attempts? Yesterday, @OpenAI's o3 dramatically beat the SOTA at @arcprize. But there were 34 tasks that even it couldn't solve with 16 hours of thinking. I've compiled and analyzed all of o3's mistakes below 🧵
7
2,894
Replying to @torchcompiled
there will prove to be problems, working to address them for larger tests. a few responses: the demo is a toy, a good kernel is a single forward pass with no perturbation memory, and because of sparsity the cost can be reduced a lot. too small or too large samples lead to poor convergence as you see. it’s not a random search, and the mnist example is 270k, more than a few hundred dimensions. despite these things your sentiment may prove correct, we will see :)
1
7
1,002
Replying to @soundsonacid
stronger const generics and if let chains next 🤞🏻
1
7
633
o3-mini is very reddit. inflexible knowledge from authority, often misses the point. knowledgeable though.
1
5
270
2) Compare it to something normal: > I hate when I drop my phone on my head laying down ∆E_f = ∆U_i 48.02 = mgh iPhone X weighs 174g h = 48.02/(0.174*9.806) h = 28.14m = 92 ft 3.96 in Done!
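The arithmetic above can be checked with a few lines, using the figures given in the tweet (48.02 J, 174 g, g = 9.806 m/s²). The function name and the metre-to-foot conversion are additions for illustration.

```python
def drop_height_m(energy_j, mass_kg, g=9.806):
    """Height from which a mass must fall to gain energy_j joules: h = E / (m * g)."""
    return energy_j / (mass_kg * g)


height_m = drop_height_m(48.02, 0.174)  # about 28.14 m
height_ft = height_m / 0.3048           # exact definition of the international foot
feet = int(height_ft)
inches = (height_ft - feet) * 12
print(f"{height_m:.2f} m = {feet} ft {inches:.1f} in")
```

The inch figure shifts slightly depending on whether the height is rounded to 28.14 m before converting.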
5
funny how much time went into linear attention. you should expect attention to be very similar to sorting. algorithms without degradation are all gonna be O(n log n)
1
5
269
adorable and very impressive thought trace thank you
5
86
Replying to @YosarianTwo
what is the probability of this digest occurring randomly? not a follower of this but does seem like the type of thing that’s gonna turn out to be 10^-200
3
5
1,417
humanity requires universal and perpetual nicotine administration to continue making progress. we are too weak without it. specter of 1971.
1
5
319
probably a gyroscopic effect would cause torque to the body
1
6
489
kat just dropped anxiety fish v0.1.0
3
Replying to @bascule @rustlang
it just gets better and better and better. almost unreasonably clear and easy
4
Replying to @gbrl_dick
there is a further implicit leap that dangerous things should be kept from humanity, which is wrong in a more important way
1
5
874