founding @AntimLabs; prev: physics/math, superconducting qubits

San Francisco, CA
Senior Year Undergrad Research on Phase Qubits FINALLY DONE.
> derived hamiltonian for a Josephson junction-based phase qubit
> mapped it to a spin-1/2 system hamiltonian under magnetic fields
> studied the quantum dynamics and evolution of the hamiltonian
> derived spin-flip probabilities
> explored qubit control via phase shift and applied mag fields for high fidelity
16
25
612
199,550
unfollowed me. removed me as a follower. stole my cover pic which is a picture of notes that me and my friends wrote from my mechanics class. interesting. i’m not mad at all. it’s just very interesting to me how people work.
264
305
15,837
1,095,394
uncontacted tribe of math researchers discovers Obsidian
I discovered this very cool application called Obsidian, it's very good for writing math, especially if you install some plugins that can auto-replace text
36
487
11,141
352,821
i saw this guy’s video on simulating gravity in C++ and that pushed me into learning OpenGL (failure lol). now he’s back with another one simulating black holes. he incorporated ray tracing, Einstein’s field equation, Schwarzschild solution to the field equation, space-time, everything. what an insanely cracked man. @KavanSG
64
561
8,576
279,884
sf is insane. took an uber. a blue tesla model Y. indian driver, late 40s. 25 years of PM experience, ex-apple, verizon, and a cto of an it company. came to USA in 2007 on an h1b and became a citizen after 15 years. got laid off by cognizant as a PM, and now drives an uber.
284
195
6,891
1,038,707
the CTO of Palantir is one of the smartest guys i’ve ever heard talk. what he says, and the way he says it. insane charisma.
43
117
4,488
384,996
One of the best things I’ve heard @naval say and you’ve probably never heard this so I’ll just drop it here. Ties to the theory of analogy and natural philosophy.
39
660
3,924
126,837
this is how villains are born btw
43
50
3,821
209,431
if you are somewhat like me: a really smart kid back in the day and then after or during college started to feel lost, you gotta watch this video. this really flipped the trajectory of my life and i adopted a new mental model. it's important.
34
204
3,424
198,315
~8 minutes in and i can already sense the difference in the abstraction layers at which Dwarkesh and Dr. Sutton operate
71
87
3,492
486,789
Replying to @offsitedark
i’m indian too 😭
33
10
3,142
64,640
the hell happened to indian youtubers?
We came second in the GPT-5 Hackathon. We are building ML for the electric grid. The grid will define the next 100 years. We want to redefine the grid. We are looking for a founding MLE and a founding engineer. Talk soon :)
42
59
2,952
131,198
i was literally talking to this "LLM researcher" about setting temperature in LLMs and i asked, "you know why lowering or raising the temperature results in more deterministic or random outputs, right?" and he said, "yeah, it changes the way tokens are represented." boy wtf, people IN the fucking field have no idea about boltzmann stats or even softmax. i'm gonna cry.
82
65
3,028
392,644
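A minimal sketch of the temperature point above, assuming plain NumPy and made-up logits (the function name and the numbers are illustrative, not taken from any LLM library). Dividing the logits by T before the softmax has the same form as a Boltzmann distribution, p_i ∝ exp(-E_i / kT) with E_i = -logit_i, so a low T concentrates probability on the top token and a high T flattens it.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Boltzmann-style distribution over tokens: p_i ∝ exp(logit_i / T)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]     # hypothetical logits for three tokens
cold = softmax_with_temperature(logits, temperature=0.1)   # near-deterministic
hot = softmax_with_temperature(logits, temperature=10.0)   # near-uniform
```

Nothing about how tokens are represented changes here; only the distribution that gets sampled from does.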
the amount of time it takes you from waking up in the morning to getting straight to work will define your future. trust me on this.
38
127
2,293
60,638
fun story behind how Geoff Hinton got the idea for dropout regularization:
> he used to go to his bank
> observed that the tellers kept changing
> one teller explained they got rotated a lot
> Hinton realized this made it harder for employees to collude and defraud the bank
> took this idea to neural networks: introduce randomness in training.
> randomly drop (zero) a subset of neurons for each example
> prevents the model from memorizing insignificant patterns ("conspiracies," as he called them).
boom, dropout was born
40
129
2,298
141,263
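A rough sketch of the "randomly drop a subset of neurons for each example" step, assuming NumPy and the common "inverted dropout" rescaling. The rescaling is a modern convention and an assumption here, not something the story above specifies, and the function name and shapes are illustrative.

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None, training=True):
    """Zero each unit independently with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return activations                      # no-op at evaluation time
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(activations.shape) >= p_drop   # fresh mask per example, per unit
    # rescale the survivors so the expected activation stays the same
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((1000, 64))                         # toy batch of hidden activations
h_train = dropout(h, p_drop=0.5, rng=rng)
h_eval = dropout(h, training=False)
```

Because every example sees a different mask, no fixed group of units can rely on each other always being present.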
i remember having a 1 hour discussion after class with my stats professor about why we divide by n-1 instead of n for the sample variance, right after i first learnt about it in my stat analysis class. the reason is shockingly elegant but still a little abstract unless you've had enough statistical analysis practice in the field. degrees of freedom is a very abstract idea when you first come across it. if you don't know the reason, please read about it and try to understand the intuition. it's beautiful.
35
144
1,823
157,972
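One way to see the n-1 point numerically, as a small simulation sketch (NumPy assumed; the sample size and variance are arbitrary choices). The deviations are measured from the sample mean, which was itself fit to the same data, so one degree of freedom is already used up and dividing by n underestimates the variance by a factor of (n-1)/n on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                          # population variance (sigma = 2)
n = 5                                   # small samples make the bias easy to see
samples = rng.normal(0.0, 2.0, size=(200_000, n))

# average each estimator over many independent samples of size n
var_n = samples.var(axis=1, ddof=0).mean()           # divide by n
var_n_minus_1 = samples.var(axis=1, ddof=1).mean()   # divide by n - 1
```

With these settings the divide-by-n average should land near (n-1)/n · 4 = 3.2, while the divide-by-(n-1) average should land near 4.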
i just realized PCA is Fourier analysis without the sines and cosines, but with your data’s own modes.
51
78
1,661
162,899
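A small sketch of that analogy, assuming NumPy and synthetic data: the SVD of the centered data gives an orthonormal basis (the data's own "modes"), and projecting onto it gives coefficients that play the role Fourier coefficients play for sines and cosines. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 500 points in 3D that mostly vary along a single direction
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(500, 3))

Xc = X - X.mean(axis=0)                         # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
modes = Vt                                      # rows: orthonormal basis learned from the data
coeffs = Xc @ modes.T                           # expansion coefficients in that basis
explained = S**2 / np.sum(S**2)                 # fraction of variance carried by each mode

X_rank1 = coeffs[:, :1] @ modes[:1]             # truncate to the dominant mode
```

As with a truncated Fourier series, keeping only the leading modes gives a compressed approximation; the difference is that the basis is fit to the data instead of being fixed in advance.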
if i can:
- come from a small city in india,
- go to a small school in the states,
- graduate with a degree in physics and math,
- get reached out to by some of the greatest people in tech,
- start to self-learn ml/dl in december 2024,
- get several offers from awesome startups in silicon valley,
just know that you can do anything.
58
44
1,619
117,923
mathematicians literally do not give a single fuck about computer scientists holy shit
48
49
1,357
105,156
why is indian tpot so obsessed with DSA? it literally took me like a week tops to master it when i did it four years back in cpp. i don’t understand what’s so special about it?
43
29
1,278
130,829
living on a farm with a best-friend who’s a farmer going for a chemistry PhD, while you’re working remotely for a startup has to be one of the best times of my life so grateful for math, physics, and computation
24
41
1,226
41,788
for the past few months, i have been deeply interested in data-driven engineering and, consequently, physics-informed machine learning. in the past week, i have also been studying about protein-structure prediction models like AlphaFold, ESMFold, etc., and this has just deepened my interest in applying novel techniques from deep learning to science. i was thinking if there are other people here who might be around my age (~21 or so) or even a bit older/younger, with an experience in physics/engineering and also in machine learning. i just want to create a group chat or something of that sort (of around 5-6 really focused people) to dive deeper into this intersection and possibly look at building things around this or develop new models/research in the same game. most discussions on X revolve only around NLP/image/video generation, RL for LLMs, etc. i don't see a lot of discussions on AI applied to science (especially physics/engineering), and thus i don't know a lot of people doing this. if you think you find this interesting or have any experience in ml/dl and physics, please DM me or drop a comment below! let's make the best use of twitter!
93
62
1,103
68,513
Replying to @amritwt
it's okay man. all good. we all have takes. we don't unfollow people for their takes, especially if they are not totally un-informed. secondly, i am not woke. rejecting a hypothesis doesn't make someone "woke". thirdly, i never said that i'm mad, i'm not asking for any apology. it was just funny. fourthly, we should judge people on their work, not on their opinions. i do very technically rigorous work. and by that standpoint, you should block most of the tech twitter (actually cracked people) in the industry and academia because most of academia is "woke" for you. thanks.
14
3
1,049
60,698
done. how does it look?
i need my monitor asap
80
20
1,003
67,762
you should do everything in your power to get into a top tier school, get a big company's badge on your resume, win hackathons etc. even if you think you can do stuff without them. you can. but having all this will reduce the friction in you getting big opportunities by a lot. people at those places are not any smarter than you are, they just did more boring work than you. first hand experience because i didn't go to a big school, don't have any big company name on my resume, haven't been to any hackathons etc. and i have witnessed that it's def better to have all those in your arsenal.
16
44
963
42,946
Many people I meet have watched Carl Sagan’s shows, Neil deGrasse Tyson, Brian Cox, and movies like Interstellar or Tenet. I wish I had taken that path too. I haven’t, yet. For me, it began through PDF textbooks I found online or on Telegram during the lockdown. Before that, it was just looking up at the sky and reading the little that school books offered. I also got my hands on Astrophysics for People in a Hurry. When I read it in 2021, I didn’t understand much and stopped after about 30 pages. The author’s humor didn’t help; I’ve never been great with humor. I think starting off with astronomy through textbooks has worked best for me. They’re available at all levels and set the stage well. Maybe I’m being biased. It's my version of saying, I wish I watched more TV growing up.😄
25
43
930
52,404
ai researchers, a gift for you.
19
60
881
33,940
i'm almost sure that most ML engineers don’t know that diffusion models come from thermodynamics. the forward process is inspired by non-equilibrium thermodynamics: data is gradually perturbed by gaussian noise, forming a markov chain, governed by a discretized fokker-planck equation (the eq. that governs the time evolution of probability densities under diffusion processes). the reverse process approximates the time reversal of this entropy-increasing system, learning to reconstruct structure from noise. in the continuous limit, this means solving for a reverse-time stochastic diff. eq.: physics leads, and revolutions follow
20
91
893
61,042
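The equation that post points to did not survive in this capture. Assuming it refers to the standard result used in score-based diffusion models (an assumption, since the original image is not available), the forward and reverse-time SDEs are usually written as follows, with drift f, diffusion coefficient g, and marginal density p_t:

```latex
% forward (noising) SDE, run from t = 0 to t = T
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% reverse-time SDE, run from t = T back to t = 0
\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - g(t)^{2}\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
```

Here \bar{\mathbf{w}} is a Wiener process running backward in time, and the score \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) is the quantity the network is trained to approximate.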
Replying to @bihanmahadewa
he added me on linkedin bro
11
2
882
80,603
bet $100 this guy can’t normalise a wavefunction and doesn’t know about the boltzmann distribution, yet talks about studying thermo and QM. probably doesn’t even know what a wavefunction is, intuitively. physics larping sloppy heads. i’m tired of this farming.
The brain doesn’t need gym, it needs this everyday:
- Linear Algebra
- Calculus
- Statistics
- Probability
- Differential Equations
- Mechanics
- Thermodynamics
- Electromagnetism
- Quantum Mechanics
31
21
795
60,087
simulating a physically intractable process (reverse diffusion) by numerically inverting a stochastic process that increases entropy, to create text/images/videos out of it would never not be fascinating to me. physically, impossible. computationally, close to magic.
19
67
791
33,889
writing a blog on Physics-Informed Neural Networks (PINNs). meanwhile, here’s a simple example of a normal NN vs a PINN fitting to a synthetic dataset (with noise) for a projectile motion.
> simple NN overfits heavily to outliers
> PINN averts this with the help of a modified loss function
> the modified loss includes two more terms: the ODE and initial condition
> the PINN fits to the data with the help of an “informed” decision given by the new loss.
this animation shows the two networks fitting (red is the PINN). my blog will contain solving a 1D heat equation problem btw.
18
55
770
47,270
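A sketch of what a loss with those two extra terms can look like for projectile motion (y'' = -g), assuming NumPy. This is not the author's code: a real PINN gets derivatives from autograd on the network output, and here finite differences on a sampled trajectory stand in for that. The function name, weights, and toy data are all hypothetical.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def pinn_style_loss(t, y_pred, obs_idx, y_obs, y0, v0, w_ode=1.0, w_ic=1.0):
    """Data MSE + ODE residual for y'' = -g + initial-condition penalty."""
    dt = t[1] - t[0]                                    # uniform time grid assumed
    data = np.mean((y_pred[obs_idx] - y_obs) ** 2)
    # second derivative by central differences (autograd in an actual PINN)
    y_tt = (y_pred[2:] - 2.0 * y_pred[1:-1] + y_pred[:-2]) / dt**2
    ode = np.mean((y_tt + G) ** 2)
    # initial velocity by a second-order one-sided difference
    v_start = (-3.0 * y_pred[0] + 4.0 * y_pred[1] - y_pred[2]) / (2.0 * dt)
    ic = (y_pred[0] - y0) ** 2 + (v_start - v0) ** 2
    return data + w_ode * ode + w_ic * ic

t = np.linspace(0.0, 2.0, 101)
y_true = 10.0 * t - 0.5 * G * t**2                      # launch from y0 = 0 with v0 = 10
idx = np.arange(0, 101, 10)
rng = np.random.default_rng(0)
y_obs = y_true[idx] + rng.normal(0.0, 0.5, size=idx.size)   # noisy observations

y_wiggly = y_true + 0.5 * np.sin(15.0 * t)              # stand-in for an overfit curve
loss_physical = pinn_style_loss(t, y_true, idx, y_obs, y0=0.0, v0=10.0)
loss_wiggly = pinn_style_loss(t, y_wiggly, idx, y_obs, y0=0.0, v0=10.0)
```

The ODE term penalizes any curve whose curvature departs from -g, which is why chasing noisy points costs the physics-informed fit more than it gains on the data term.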
good old days. i wanna go back to when i used to read hours and hours after school without a hint of worry in my head.
29
23
742
22,064
coming from physics, this is something i do not particularly like about ML research, especially LLMs: most research ideas/approaches are just trial and error or a permutation and combination of architectures/training objectives etc. researchers don't know why something works or how, nobody does. it's not their fault. but it's just something i find irritating. like a new permutation of architectural designs can completely overthrow a previous SOTA model without any massive changes. there are too many degrees of freedom and the design space is HUGE. adapting to a research field that operates on empiricism, while coming from one that operates on explanatory frameworks gives me massive unrest and leaves me in confusion. theories have longevity in physics/math. while in ML, paradigms change weekly.
78
63
748
56,458
it’s always funny (and weird) to me how most computer/software engineers think that they’ve studied comp org and architecture and now they totally understand how computers work. it took me a semester of digital electronics and quantum mechanics to get a somewhat good understanding of how computers work but i’m still not fully there yet. very often when i’m on my laptop i’m just so fascinated that i’m living in an era where we have these machines and we’ve understood how they work and make them work as per our needs. just something like “oh this chip has billions of transistors on it” has become a mundane thing to say but if you think about it from a physical and engineering POV it just doesn’t make sense and is almost magical. here’s to computers! 🫡
28
56
699
26,646
any physicist can be a computer scientist, but any computer scientist cannot be a physicist. thus, cs majors on X, hold your horses. be humble. you’re stupider than you think you are.
52
34
653
28,228
one of the coldest things ever is how Newton
- invented calculus
- formulated the laws of motion
- explained gravity
- cracked the nature of optics
- laid the foundation for classical physics
all during a pandemic, and THEN, he turned 26. blows my fucking mind off.
32
38
603
18,383
I have an intuition that noise cancellation fucks my ears up. I don’t know the science behind this yet but i just feel it. I don’t use it anymore.
51
11
618
43,696
andrej karpathy is to ai hype what dropout is to neural nets.
23
21
623
37,116
such a criminally underrated guy (@prafdhar). some of his papers:
- GPT-3
- PPO
- Glow (flow-based generative models)
- Improved Denoising Diffusion Probabilistic Models (IDDPM)
- diffusion models beat GANs
- GLIDE (text-to-image diffusion)
- CLIP Latents (hierarchical text-to-image generation)
11
21
628
49,442
osho, dostoevsky, david deutsch, and douglas hofstadter are all you need. everything else is indirectly achievable through them.
24
54
627
18,692
wtf you talking about bro?
5
531
43,761
implemented an LSTM RNN with a Mixture Density Network head on top (MDN-RNN) for future states prediction as a probability for next state (an image) given past hidden states, past latent vector, and an action vector. tomorrow we implement the training script (half done), and then train both the VAE (vision module) and the MDN-RNN (memory module) of the world model. the control module would be very easy.
13
33
532
42,489
i built Stable Diffusion-v1.5 from scratch in pure PyTorch.
> VAE for mapping data to a latent space and building it back up, post diffusion
> U-Net as the main diffusion module, with a scheduler, to predict noise, with Self/Cross-Attention modules
> a DDPM sampler to add/remove noise from the input image
> supports text-to-image, image-to-image, classifier-free-guidance
> the CLIP module for text embeddings to steer an image towards the prompt/negative-prompt
> the pipeline to connect all the pieces together.
i have one problem though. i have loaded the weights but there is a mismatch in tensor shape between one of my groupnorm modules and the pre-trained weights from HuggingFace. i will be working on that in a few days, plus my CPU+GPU power is reckless. next up: i'm thinking of either making a diffusion model for a specific task (related to engineering) that i have in mind, or getting into inference optimization on diffusion models from huggingface's pipeline. idk where to go, but we'll see.
19
28
536
30,802
i know at least 4 different meanings of “kernel”: corn kernels, CUDA kernel, CNN kernel, and the kernel of a group homomorphism. one feeds you, one feeds GPUs, one feeds CNNs, and one feeds my soul.
47
21
525
22,419
btw, how many neuroscientists are employed at the top AI labs? it's kinda weird that labs and companies trying to build "intelligence" have little-to-no researchers who've actually been studying intelligence and consciousness all their lives (like neuroscientists, psychologists, etc.) lol.
42
18
495
23,482
so here's a question: i was thinking about the structure of english and how it might affect the learning of positional embeddings. i went back to @karpathy's GPT2 video and he plotted the wpe matrix for GPT2 where the plot is basically the values of 3 specific channels (out of the 768 dimensions) as a function of the position (1024, context size). he said that the learned pos_embeddings have a structure in them. i got curious and plotted the same for 2 more open-source models: EleutherAI/gpt-neo-125M, and facebook/opt-125m, and i got the same result (i guess?). in the original transformers paper, the authors used a fixed sinusoidal function for positional embeddings. why is it the case that the models learn sinusoidal structure in natural language? is it because english has a sinusoidal structure? subjects usually precede verbs, clauses have temporal or causal order, etc.?
24
21
489
94,685
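For comparison with the learned wpe plots described above, here is a sketch of the fixed sinusoidal encoding from the original Transformer paper, assuming NumPy. The sizes 1024 and 768 come from the post; the function name is illustrative and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]           # shape (n_positions, 1)
    even = np.arange(0, d_model, 2)[None, :]        # the even channel indices 2i
    angles = pos / (10000.0 ** (even / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even channels
    pe[:, 1::2] = np.cos(angles)                    # odd channels
    return pe

pe = sinusoidal_positions(1024, 768)                # GPT-2-sized table
channel = pe[:, 100]                                # one channel as a function of position
```

Plotting a few columns of `pe` against position next to the same columns of a learned wpe matrix is one way to check how sinusoid-like the learned channels really are.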
career update: i joined @AntimLabs as a founding research engineer to work on scaling RL, transfer learning, and advancing reasoning agents! moving full-time to sf next month!
76
5
487
43,734
men’s therapy
17
9
450
9,364
AI models ranked by how cool I think they or their math was: EBMs = Diffusion Models = World Models > VAEs > Normalizing Flows > Transformers > Neural ODEs/PINNs > GNNs > MoE > Diffusion LLMs
32
25
471
152,112
i think maharshi’s work on performance, parallelisation, and inference is genuinely brilliant. he’s a rare kind of cracked. i resonate with a lot of his output. but when it comes to his more philosophical takes, i sometimes find myself at odds. not because they’re wrong, but because they’re deeply entangled in a specific metaphysical substrate, one that is shaped by hindu philosophy. and while there’s beauty in that, it often blends symbolic intuition with what i’d rather treat as computational, biological, or epistemological structure. when we invoke the term “mind”, to me, it’s less a matter of feeling and more a substrate of patterns, like self-referential, capable of modeling itself. introducing words like “heart” or “soul” shifts the axis. it becomes harder to engage critically, because we’ve left the realm where explanations can be improved. we’ve stepped into an aesthetic frame rather than an explanatory one. and the problem with aesthetic frames is that they can be beautiful and still be dead ends in the search for universal constructors of truth.
when someone says “mind” do you consider it as brain, heart, or your soul?
11
9
444
34,453
btw, @isaacbautistas has an awesome video on the math behind diffusion. he also explains the stat mech component and inspiration and an implementation of ddpm etc. it was one of the first few resources i used to really understand diffusion intuitively. do check it out if you're interested! he's a super awesome teacher and explainer!
4
36
426
25,035
i don’t see enough people in AI talk about consciousness. weird.
143
21
412
17,883
super racist bro. not good for health. study physics, won’t be racist anymore. you’ll see the bigger picture. far bigger than what your brain comprehends at this moment.
8
5
407
15,602
Life update: moving to SF full-time tomorrow! 5 months back, I had just graduated college and had no idea what I was gonna do with my life. Incredibly grateful to everyone and everything that worked in my favour.
42
4
424
22,561
most people in my dms are like “dude you’re a genius you’re so smart”. no. i’m not. i’m largely not as smart as most of you guys think. what i am is not ignorant and i pay attention to details. that’s literally all you need to have a good outlook on life. pay attention to details.
17
10
409
18,765
ai research actually requires very little mental capital in contrast to physics/math research. and you can get so many novel research ideas just by reading some research papers or SOTA architecture designs, literally. and with the money that's being moved in the market rn, all you gotta do is push in a little effort and you'll bag the hardware resources to test out that new idea.
11
10
405
18,882
16
6
389
105,208
Replying to @VaguelyVolatile
i never said there’s anything wrong.
1
353
12,702
interesting watch. i had this saved for over a year. idk what to think of this though. anyone who’s seen this wanna comment?
5
22
367
12,303
Replying to @amritwt
it's okay. no further discussions needed. thanks
11
356
16,630
trained a cartpole RL agent with PPO:
> wrote the PPO trainer from scratch in pure PyTorch + NumPy
> actor-critic network obv
> rollouts -> mini-batches -> multiple epochs
> clipped surrogate loss
> policy + value + entropy combined (as per the paper)
16
21
354
16,052
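A sketch of the combined loss listed above (clipped surrogate plus value and entropy terms), written with NumPy only so it runs standalone. It is not the author's trainer: the function name and the coefficient values are placeholders, and a real implementation would use PyTorch tensors so gradients can flow.

```python
import numpy as np

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate policy loss + value loss - entropy bonus."""
    ratio = np.exp(new_logp - old_logp)                  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))   # pessimistic bound
    value_loss = np.mean((values - returns) ** 2)
    return policy_loss + vf_coef * value_loss - ent_coef * np.mean(entropy)

# toy batch: the new policy doubled the probability of both sampled actions
adv = np.array([1.0, -1.0])
old_logp = np.zeros(2)
new_logp = np.log(np.array([2.0, 2.0]))
loss = ppo_loss(new_logp, old_logp, adv, values=np.zeros(2),
                returns=np.zeros(2), entropy=np.zeros(2))
```

In the toy batch, the positive-advantage sample has its gain capped at 1 + clip_eps, while the negative-advantage sample keeps its full penalty. That asymmetry is what keeps policy updates conservative.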
gödel wrecked math, heisenberg wrecked physics, and turing wrecked computation maybe the real red pill is realizing that uncertainty is fundamental, not a bug
28
27
344
15,613
if you wanna learn stats, i can’t recommend this series enough!!!
New Video Series: Statistics & Data Analysis! piped.video/watch?v=QIXUTsdj… 35 videos, 10 hours: Random sampling, Central limit theorem, Distribution estimation, Method of moments, Maximum likelihood estimation, Hypothesis testing, Monte Carlo sampling, Bayesian statistics, and more!
6
21
335
12,181
first of all, losing* second of all, please stfu for god’s sake
Can’t believe this ⬇️ is loosing to this⬇️
13
1
344
19,963
leonardo da vinci’s to-do list, circa 1490:
> measure milan + its suburbs
> square a triangle
> ask prof. fazio about proportion
> dissect pig lungs
> figure out why the sky is blue
> study how birds fly
> learn what makes a face attractive
meanwhile people are deploying ai agents without knowing what a dot product is
20
21
332
11,575
wrote a matrix multiplication library from scratch in C with benchmarks for 3 multiplication procedures:
> it’s shown tested on a 1500x1500 matrix here
> normal: 4.293s
> parallel: 1.215s
> blocked: 1.306s
expected result. parallel > blocked as parallel version has some overhead from thread management as the matrix size is relatively small. good C practice. repo link in comments.
14
12
321
18,940
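A sketch of the difference between the "normal" and "blocked" procedures. The library above is in C; Python with NumPy is used here only to show the loop structure, so the function names and block size are illustrative and the timings will not resemble the C numbers.

```python
import numpy as np

def matmul_naive(A, B):
    """Textbook triple loop: one scalar multiply-add at a time."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, block=32):
    """Same arithmetic, reordered so each step works on small tiles."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # accumulate one tile product; slices past the edge are truncated
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, p0:p0 + block] @ B[p0:p0 + block, j0:j0 + block]
                )
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 40))
B = rng.normal(size=(40, 30))
C_naive = matmul_naive(A, B)
C_blocked = matmul_blocked(A, B, block=16)
```

In C the point of blocking is that each tile fits in cache and gets reused before it is evicted; the sketch delegates the tile product to NumPy, so it only illustrates the traversal order.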
i have a question for those into RL in LLMs/agents/reasoning models: how much of the RL theory do you guys use? like for eg., do you remember how to derive the Bellman equation for state value function? what do you guys do on a daily basis at work? i’m genuinely curious about this.
21
12
314
39,005
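For reference, the derivation that question refers to, in the usual textbook notation (policy \pi, dynamics p, discount \gamma; the notation is an assumption, not taken from the post). It uses the recursion G_t = R_{t+1} + \gamma G_{t+1} and the Markov property:

```latex
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} = s' \right] \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma\, v_\pi(s') \right]
\end{aligned}
```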
Replying to @vxdenton @grok
tf you mean is this true? he himself said it lmao.
3
295
44,589
<5M params and you can teach the model to learn in its own imagination. done 80% with it, hopefully will completely be done in the next couple days.
9
11
297
20,824
if i tell all my physics major friends from college how easy ai research is for them, and they also tell their physics friends, real ai researchers will be in big trouble.
15
12
285
10,728
i wasn’t kidding
i got another dm and you will definitely most likely not believe who dm'd me. my jaw is on the floor.
29
288
51,337
~8 minutes in and i can already sense the difference in the abstraction layers at which Dwarkesh and Dr. Sutton operate
8
2
285
28,706
bf16 doesn’t work when you do sft on qwen-3-1.7b on a multi-turn wordle env. reward stays at 0.2 whereas with fp32 it goes to ~0.65.
19
21
289
19,248
bottomline is, if you can’t model real-world systems with differential equations and solve them to predict behavior, you shouldn’t be writing “engineer” in your bio. and this is one of the most fundamental things engineers do.
12
12
269
10,325
Replying to @amritwt
ran deepseek r1: 14b on this baby at 1.5 tokens/second lol
11
2
267
12,188
Replying to @verrsane
people who’ve run deepseek-r1-14b on a raspberry pi 5 fear nothing. ~1.5 tokens/second of pure bliss. feels amazing.
4
3
268
10,616
need to make prof. @eigensteve more famous. what a spectacular teacher.
7
15
265
5,576
til there's an iit in patna and a guy from there has a paper on mech interp of diffusion models
6
13
287
22,385
can i get a phd now?
Replying to @eigenron
i know how to do this because i did a phd in online rl theory not that long ago but it does not come up much in my day-to-day work re: LLMs besides at a conceptual level
12
6
259
26,795
Replying to @edvardh1
you’re already using it lmao
1
247
29,929
i know gpu parallelization is in the hype rn but i just parallelized my rollout collection for a world model implementation across 10 CPU cores on my new MacBook Air M4. holy shit the collection for 10,000 rollouts for the car-racing-v3 env went from ~5 mins to <1.25 mins. this is called embarrassingly parallel processing. zero gpu, full throttle cpu baby.
7
7
257
16,744
people working in pure math don't care about that afaik. they just wanna do it on their own. they don't do math with the primary goal to solve the mysteries of the universe, they do it for the same reason why a painter paints or a director directs.
If you are working in pure math or theoretical computer science: keep in mind that there is a $500B multi-million GPU supercomputer pointed at automating your research
13
21
246
11,156
1D heat equation solver using PINN is finally working (i used cursor a lot for the plotting) the plot on the monitor is the noisy data + exact solution + NN prediction + PINN prediction together (for a small t_max of 1) the plot on my laptop is just the exact solution (for a larger t_max = 100 and thus the clearer sine profile decay)
6
21
237
11,090
i miss my parents and my home, i wanna go back home soon :(
5
3
232
12,547
i am done implementing the World Models paper (Ha & Schmidhuber) from scratch. i trained the VAE fully, but haven't trained the MDN-RNN or Controller. wrote the training scripts for all and tested them. everything runs. i just don't have the time to work on this anymore and am moving to a different thing. if anyone wants to take this forward, please let me know. i can send you the code + the rollouts data + encoded rollouts from the trained VAE + trained VAE checkpoints. thanks!
7
8
240
19,655
unfollowed a prime 40k followers indian tpot account due to heavy slop posting
27
233
47,119
pushed the transformer based n-gram repo on github
> trained on Mahabharata (vol. 1, books 1-3)
> trained on a single A100
> saved the trained model weights in model_weights.pth
> run the infer.py for inference
github.com/ronaldnetawat/mah…
2
12
229
10,968
most of the indian ‘ai influencers’ on twitter are really not worth following. they’re sloppy and lack real technical rigour. it’s really easy to fake rigour online. it’s also easy to see through the bullshit if you just try to.
23
9
220
11,845
this blog was originally a paper i wrote for my AI class. i wanted to do something with logic, formal systems, and this idea from The Emperor’s New Mind by Sir Penrose that had been in my mind a lot. it's on artificial consciousness, 3000 words. deals with roger penrose's argument against achieving artificial consciousness using gödel's incompleteness theorem. for people into AI, logic, formal systems, physics, artificial consciousness: eigenron.bearblog.dev/the-go…
12
17
228
11,688
life is too short to study everything and it sucks. i feel like going to grad school for like 10 different things.
14
10
200
6,610
> alright students choose a project idea for CS 3X0: AI > me: "prof., i wanna make a chess engine" > prof: "haha, no. too complex." > me: "prof., i can do it" > prof: "alright, suit yourself" 2 months later: > "prof, here it is, you can play PvP or PvAI" > prof: "holy shit. here, take an A" TLDR: you can just do things
6
4
201
7,795
things i’ve been interested in and learning/building lately:
- diffusion models for protein structure prediction
- protein language models
- probability theory
- variational inference
- mechanistic interpretability
- physics-informed ML (SINDy and autoencoders)
so much to learn and so much to explore. i love it.
13
6
196
9,000
even training on just 500 frames across 10 rollouts for just 20 epochs, the VAE reconstruction looks pretty good to me. honestly kinda impressed how stable vae training is.
7
9
204
15,685
i was reviewing my diffusion notes and i wrote that the reverse of a gaussian markovian chain is also gaussian as one of the reasons to choose a gaussian as an approximate posterior for the reverse process. i just realised i didn’t quite understand why that’s the case? can anyone explain easily?
11
7
198
13,230
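One common way to see it, sketched in DDPM notation (\alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{s \le t} \alpha_s; the notation is an assumption, not from the notes mentioned above). Conditioned on x_0, Bayes' rule gives a product and ratio of Gaussians, so the exponent is quadratic in x_{t-1} and the reverse step is exactly Gaussian:

```latex
q(x_{t-1} \mid x_t, x_0)
  = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
  = \mathcal{N}\!\left( x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I \right)

\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,\left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
```

Without conditioning on x_0, the reverse step q(x_{t-1} | x_t) averages over the data distribution and is not Gaussian in general. It becomes approximately Gaussian only when each \beta_t is small, which is the usual justification for the Gaussian parameterization of the learned reverse process.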
Replying to @jdluk87
is this supposed to be sarcasm? because that’s the only scenario in which it makes sense. and i’m an apple user.
1
196
3,550
never realised i hadn’t shared my undergrad senior research project on the dynamics of a phase qubit here. here’s the link if you’re interested: drive.google.com/file/d/1z4w… i never really touched on actual and practical quantum computation. this was merely based on the dynamics and quantum evolution of a 2-state system (isomorphic to a phase qubit).
11
15
190
11,275
people don’t really realise how close to magic general relativity feels. i wish there was a way to convey to each human how big of a breakthrough the 1915 paper was. it’s literally the closest thing to reading the mind of god. what einstein did means he can never be deemed overrated on any scale.
10
7
193
7,905
i was bored on the drive to Atlanta so i trained a small transformer based 256-gram level model on the Mahabharata (~3.7 MB). 1xA100 from Lambda labs at $1.29/hr, took 15 mins to train ($0.75 spent because i forgot to terminate the instance lol). 5000 iters with 6 attention heads and 6 attention blocks. it is not that bad i guess. i am just happy i implemented it 95% off the top of my head in ~1.5 hours and only rarely used gemini for some bugs in evals :p
5
5
193
8,696
Replying to @atulit_gaur
ml interviews ask easy questions like these?
1
185
61,491