member of technical staff & co-founder of @coreautoai - and continuing to aspire to understand deep learning.

Pinned Tweet
It turns out multi-step backpropaganda is better. The paper has a beautiful way of improving backpropagation: one iteration cleanly gets us backprop, multiple iterations get us a preconditioned update.
4
12
213
115,649
A little bit of update from me: I will join the awesome team at @AnthropicAI in two weeks.
104
14
1,304
147,288
10
129
1,211
92,871
Joining the Llama team @AIatMeta today! Time to train models, finally gpu rich :)
22
11
1,049
90,592
Near the office. SF has stepped up its dosa game.
69
9
864
82,018
This paper looks like a big step forward for the Transformer architecture! A foundational improvement, not as shiny as other things, but a really big step forward nonetheless.
10
91
801
Reading this, it's clear that Meta is advancing recommender systems tech faster than other places, including G.
We’re excited to share details on Meta’s Generative Ads Recommendation Model (GEM), a new foundational model built with LLM-scale techniques that’s already helping create more value for businesses, like +5% increase in ad conversions on Instagram. Dive deep into the technology behind GEM and see how it delivers increased ad performance and advertiser ROI by enhancing our ads recommendation models’ ability to serve more relevant and personalized ads: engineering.fb.com/2025/11/1…
18
17
608
173,544
Man, claude solved this verbally by looking at the inputs visually.
Replying to @fchollet
It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn't solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute in the process). Interestingly this first task was the one we had in our university tour presentation to illustrate "it's easy for humans, hard for AI"
20
32
567
95,546
A bittersweet moment for me: Gemini is doing really well, and the teams are doing great. I had a great run of close to 12 years at G, long enough that one could call me OG. For example, for every search query, I noticed that things I was able to contribute to are deeply integrated, from the retriever to the final rankings, allowing me to meaningfully impact the world and help Google’s mission. I decided it was time to do something next year that I naturally gravitate towards and that gives my life’s work meaning. And since today is my last day with corp access, here is the note I sent this week 🚀

Hi all,

I decided to try to find a new environment for a change next year. It was a hard decision to leave the comfort of Google's safety nets. I am quite nervous and excited too, especially since I have only explicitly changed teams twice in close to 12 years, even though the number of codebases to train a transformer has changed more times than I can count (half joking!).

Google as an employer and as a utility has been life changing on multiple fronts. Firstly, the sign-on bonus allowed me to pay off my student loans that were accruing rapidly, and work visas let me stay in the US and led me to go on this ride to what looks like AGI :)

I have a lot of fond memories of working alongside you, and learning from you all has been a unique privilege! I remember permuting the words "learn. observe. contribute" every now and then on the go/who page as a way to keep myself focused on what's important. Getting to work alongside brilliant scientists and engineers in Brain, with an open-door policy on collaboration, allowed me to thrive and was a life-changing and singular experience. And very recently, making Gemini work with its own constraints was very challenging and meaningful to me, and I feel like it is paying off, with good models such as Flash and Nano taking center stage this year.

I am extremely grateful to my mentors and colleagues who helped me get to where I am today. The collaborative spirit and dedication to innovation at Google have been truly inspiring. Finally, I feel good knowing that my life's work here across the stack (quite literally) over the last close to 12 years has delivered value to the world in a positive and meaningful way!

You can find me at rohan.anil@gmail.com -- I am always open to grabbing coffee or lunch, thinking about the next breakthrough research idea, or pair programming on something fun. I am in the East Bay and going to take a small break to recharge.

PS: This is the longest email I have typed and edited, and I promise I didn't use an LLM to write it.

With regards,
Rohan
114
9
558
82,900
Meta researchers dropped PyTorch Distributed Shampoo 🧴 a few days ago: arxiv.org/pdf/2309.06497.pdf 💥 Train neural networks with a second-order method for better performance. The underlying work it is based on has been a passion project for the last 5 years, swimming upstream with @GuptaVineetG, with no love from any conference chairs. Distributed Shampoo in PyTorch with solid results means that, as a co-author of the method, I trust the implementation! Lastly, given the effort they have put in, my guess is it is already in production (:
8
70
544
113,455
That’s insane to convince a cofounder of thinky to bail this fast.
Saturday scoop: Thinking Machines Lab co-founder Andrew Tulloch has joined Meta, the startup confirmed. W/ @keachhagey
2
531
86,946
I got to coauthor papers with two Nobel prize winners, one in Physics and one in Chemistry 😁
4
10
507
26,689
It’s been a privilege to work alongside our Gemini leads and team (across Google DeepMind, Research and Alphabet) on one of the most interesting and challenging projects of my career. We have three versions of Gemini: (a) Ultra, (b) Pro, and (c) Nano. We make significant progress on the scaling and efficiency frontiers. We are shipping and will continue to ship! But I will now catch up on some sleep, and there is a sense of excitement!
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU across 57 subjects with a score above 90%. It also achieves a new state-of-the-art score of 62.4% on the new MMMU multimodal reasoning benchmark, outperforming the previous best model by more than 5 percentage points. Gemini was built by an awesome team of people from @GoogleDeepMind, @GoogleResearch, and elsewhere at @Google, and is one of the largest science and engineering efforts we’ve ever undertaken. As one of the two overall technical leads of the Gemini effort, along with my colleague @OriolVinyalsML, I am incredibly proud of the whole team, and we’re so excited to be sharing our work with you today! There’s quite a lot of different material about Gemini available, starting with: Main blog post: blog.google/technology/ai/go… 60-page technical report authored by the Gemini Team: deepmind.google/gemini/gemin… In this thread, I’ll walk you through some of the highlights.
20
23
471
156,703
A new image generation model just dropped. parti.research.google/ Great work by the team! + Auto-regressive, encoder->decoder Transformer + Classifier-free sampling. + ViT-VQGAN Really amazing results: Image from the website.
14
99
458
I'm back at implementing preconditioners for fun. It's wild how much untapped potential there still is in optimizing neural nets better. Thinking of writing up a tutorial that builds from the basics all the way to SOTA stuff on standard networks.
27
15
472
28,193
Async RL framework for scaling RL
3
57
473
37,415
Shampoo is out of the bottle! Preprint: "Second order optimization made practical" arxiv.org/pdf/2002.09018.pdf We train certain neural nets faster than before. How fast? It has shown up to a ~40% reduction in training time for a Transformer. (@tomerikoriko)
7
104
439
Replying to @iScienceLuvr
One wedding cost is like 10 o3 models.
11
7
415
18,398
❤️
Today, we announced that we plan to expand our use of Google TPUs, securing approximately one million TPUs and more than a gigawatt of capacity in 2026.
6
3
396
92,208
cloud.google.com/blog/produc… PaLM-2 is Generally available for developers! “With this update, developers can access our text model powered by PaLM 2, Embeddings API for text, and other foundation models in Model Garden”
5
81
371
76,916
Today, we present our paper on Google Search Ads CTR model at ORSUM @ACMRecSys, Seattle. arxiv.org/abs/2209.05310 We highlight ML techniques suited to *online learning* that go well beyond traditional accuracy improvements. orsum.inesctec.pt/orsum2022/… A short thread: 1/n
6
81
367
Two good papers landed 🛬 today On test time compute arxiv.org/abs/2501.19393 On long context arxiv.org/abs/2501.19399
2
40
352
35,955
I am relieved I never learned to be a power user of excel.
5
8
343
25,418
Prompt: "A koala bear and grizzly bear playing chess. They are sitting at a table on the beach. You can see the waves crashing into the shores. Bears are very stressed. DSLR camera photo." #imagen #googleai #brain 🐻🐨♟️🏖️
11
38
304
Insane timeline.
"Reflection API" is a sonnet 3.5 wrapper with prompt. And they are currently disguising it by filtering out the string 'claude'. teddit.net/r/LocalLLaMA/comm…
10
4
314
48,155
Transformer: the "Attention is all you need" paper from 2017 used roughly 1e19 flops. Today most labs have a tonne more, like more than a million times more compute. You would expect that, with the right focus and people, we should at least have had similar-level breakthroughs in architectures.
26
8
297
71,974
arxiv.org/abs/2208.01134 Batch Entropy Regularizer that makes untrainable networks train. Remove skip connections and normalization layers. Published at TMLR; works on PaLM-like transformers -- thanks to Lucid for the pointer!
3
51
292
Distributed Shampoo has dethroned Nesterov Adam marking a new era for deep learning optimization. 👑 🤘 Non-diagonal preconditioning is here! This is the AlexNet moment for optimization for deep learning. I am extremely happy. An email from 2021.
@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI
25
31
293
72,592
Last day today @AIatMeta. Reflecting on the last several months, I wanted to highlight a few things I enjoyed working on:
- Building new algorithms for on-policy distillation with @DatHuynh13
- Science of end-to-end thinking models with @agarwl_ and many others
- Working prototype of something beyond transformers with @Happylemon56775
- Newer higher-order optimization and infra with @vinaysrao and many others
- Helping @afrozenator with codistillation for pretraining and helping move to the next arch
- Pushing on the scaling RL plans with many, particularly debugging entropy-collapse type bugs
Most fun / memories are on technical work!
9
7
291
36,742
PaLM 2 is online: ai.google/discover/palm2 🌴🌴 Paper: ai.google/static/documents/p… I learned to code with instructions in Malayalam, so the ability of PaLM 2 instruction-tuned models to explain code makes me quite happy! Possibilities are endless here!
🌴🌴 Very proud of this work; specifically not compromising on model quality, while being extremely fast for inference, so that we can serve the whole wide world i.e bringing technology to everyone!
15
65
282
94,401
Replying to @giffmana
The team is working hard to bring audio inputs to the AI Studio interface for Gemini 1.5 Pro. We have an internal version that handles audio and video and can sample the video less frequently to increase the length of content that can be handled. @karpathy, thanks for the awesome tutorial. 🙏 Here's the prompts we used and the output for something in the direction you suggested. @TeplyashinDenis @joe_stant @SavinovNikolay @machelreid @ankurbpn
7
33
272
61,403
Everyone who is seeing this tweet! It’s time to build cool things with the caching feature. 1. You can now cache prompts of 1M tokens; this may mean a reasonable number of PDFs on arXiv, so a paper and its related work. 2. Then you make a good UX to analyze, summarize and ask questions on this data. Who wants to prototype this demo first?
Replying to @OfficialLoganK
Gemini 1.5 Flash continues to be the best value proposition for anyone building with LLMs. - $0.0875 / 1 million tokens (cache prompts < 128K) - $0.175 / 1 million tokens (cache prompts > 128K) - $1.00 / 1 million tokens per hour (cache storage) Big 🚢 by @shresbm and team!!
14
16
271
135,236
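To make the cache pricing in the quoted tweet concrete, here is a rough cost estimate. It is a minimal sketch, assuming the per-token prices are exactly those quoted above (a snapshot, likely outdated), that "128K" means 128,000 tokens, and that each query re-reads the whole cached prompt. The function name `cached_prompt_cost` is hypothetical, and output tokens and the uncached part of each question are ignored.

```python
# Prices quoted in the tweet above (USD); a snapshot, not current pricing.
PRICE_PER_M_SHORT = 0.0875   # cached prompt under 128K tokens
PRICE_PER_M_LONG = 0.175     # cached prompt over 128K tokens
STORAGE_PER_M_HOUR = 1.00    # cache storage per 1M tokens per hour
THRESHOLD = 128_000          # assumption: "128K" read as 128,000 tokens


def cached_prompt_cost(prompt_tokens: int, queries: int, hours: float) -> float:
    """Estimated USD cost of reading a cached prompt `queries` times
    while keeping it in the cache for `hours`."""
    rate = PRICE_PER_M_SHORT if prompt_tokens < THRESHOLD else PRICE_PER_M_LONG
    millions = prompt_tokens / 1_000_000
    read_cost = millions * rate * queries
    storage_cost = millions * STORAGE_PER_M_HOUR * hours
    return read_cost + storage_cost


# A 1M-token cache (say, a paper plus its related work),
# queried 20 times over a 2-hour session.
total = cached_prompt_cost(1_000_000, queries=20, hours=2)
print(round(total, 2))
```

Under these assumptions storage is a large share of the bill for short sessions, so a demo would probably want to drop the cache as soon as the user is done.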
R1 shows you need a good base model, a large math and code prompt reward set, prompting+cleaning then any RL technique+long decode+gpu go brr
2
11
259
56,200
L👈: "A Koala bear in a suit standing at a podium to teach. Variational bayesian methods is written on the chalkboard. There are lot of confused cats in the crowd" R 👉:"Variational bayesian methods is all you need is written on the chalkboard." 🐨🙀 #imagen #googleai #brain
8
38
250
Going home to India after 5 years - one pandemic, 2 kids, 4 LLM revisions and 4 NeurIPS later. Acquired: - Airplane toys, and an iPad loaded with movies - enough baby formula to last 28 hrs. - updated personal laptop - installed torch and grabbed notable articles from the ICLR blog track from 2024 - ollama llama70b
18
2
259
33,067
You are telling me that o3 is causal attention with a decoder model!?
8
4
243
36,675
I forgot about this tweet but read this top tier paper and get ultra agi pilled.
A few questions for those who are following AlphaEvolve and FunSearch: * Is anyone reproducing it? * Is it very relevant to diverse data generation in verifiable domains? * Is it one step away from a new paradigm beyond current thinking: “solve this problem under x constraint”? 1. Makes use of human-written prompt templates (for both mutations and crossovers), instantiated every generation with (current best code, targets on perf and budgets) as input. 2. Evolutionary search for code: classic explore vs. exploit, using appropriate prompts. Use a fast model for the inner loop (aka Flash), and Pro for merging or refining. 3. Automated verification with “does it compile”, “does it pass unit tests”, and “does it run fast” checks.
5
18
249
51,046
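The three numbered steps in the quoted tweet can be sketched as a generate, verify, select loop. This is a toy under loud assumptions: it is not AlphaEvolve or FunSearch. A random integer mutation stands in for the LLM prompt templates, a parity check stands in for "compiles and passes unit tests", and the loop is purely greedy (exploit only, no crossover, no population archive). All names (`evolve`, `mutate`, `verify`, `score`) are hypothetical.

```python
import random


def evolve(seed, mutate, verify, score, generations=200, population=8, rng_seed=0):
    """Greedy generate-verify-select loop; returns the best verified candidate."""
    rng = random.Random(rng_seed)
    best = seed
    for _ in range(generations):
        # Step 1 stand-in: propose variants of the current best candidate.
        candidates = [mutate(best, rng) for _ in range(population)]
        # Step 3: keep only candidates that pass automated verification.
        verified = [c for c in candidates if verify(c)]
        # Step 2 (exploit only): adopt the best verified candidate if it improves.
        if verified:
            challenger = max(verified, key=score)
            if score(challenger) > score(best):
                best = challenger
    return best


# Toy problem: find an even integer close to 42.
def mutate(x, rng):
    return x + rng.randint(-5, 5)


def verify(x):
    return x % 2 == 0        # stands in for "compiles and passes unit tests"


def score(x):
    return -abs(x - 42)      # stands in for "runs fast"


best = evolve(seed=0, mutate=mutate, verify=verify, score=score)
print(best, verify(best), score(best) >= score(0))
```

Because a candidate only replaces the incumbent when it is both verified and strictly better, the result is never worse than the seed. A real system would also need the explore side of the trade-off that the tweet mentions.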
Rishabh is an amazing researcher. His algorithms underpin post-training at Gemini. I got to work with him at Meta for a short while and was truly impressed. Whichever group gets Rishabh is so lucky to have him!
This is my last week at @AIatMeta. It was a tough decision not to continue with the new Superintelligence TBD lab, especially given the talent and compute density. But after 7.5 years across Google Brain, DeepMind, and Meta, I felt the pull to take on a different kind of risk. The pitch from Mark and @alexandr_wang to build in the Superintelligence team was incredibly compelling. But I ultimately chose to follow Mark's own advice: “In a world that’s changing so fast, the biggest risk you can take is not taking any risk”. In my short time at Meta, we did push the frontier on post-training for "thinking" models. Specifically: - Pushing an 8B dense model to near Deepseek-R1 performance with RL scaling. - Using synthetic data mid-training to warm-start RL. - Developing better on-policy distillation methods. Really enjoyed working with @_arohan_, @brandfonbrener, Leo Li, @ErykHelenowski, @DatHuynh13, Xiaocheng, Jia, Boduo, and Yanjun.
2
4
244
52,254
Prompt: "A train ride in the monsoon rain in Kerala. With a Koala bear wearing a hat looking out of the window. There is a lot of coconut trees out of the window" #imagen #googleai #brain (I will host the imagen team at my home in Kerala if they choose to visit 🚀)
14
13
232
Number of VCs liking this from SF makes me think … there is space to raise money for a startup to make frontier roast dosas.
Near the office. SF has stepped up its dosa game.
19
234
21,043
Just saying, Muon was originally done on the CIFAR speedrun with few GPUs on nanoGPT. While GPUs per researcher is a helpful metric, it is not the primary indicator of success. Distillation was on MNIST, and so were the origins of Shampoo, K-FAC and even transformers.
7
13
242
57,408
GPT-4 can do well on MIT test Community: oh the methodology is all wrong 🌶️ Introducing new optimizer that is 2x faster than AdamW Community: Impressive! Impressive methodology! Said methodology: use half the steps for new method and change learning rate schedule to advantage the method, dont tune the baseline.
10
21
231
111,835
20 days later. I was curious to check.
I was curious to check! There is a huge increase in engagement via image editing largely outside US. Not math and not homeworks.
16
8
222
50,483
Code for Distributed Shampoo: a scalable second order optimization method bit.ly/3uXXtKy 💥 Joint work w @GuptaVineetG State of the art on MLPerf ResNet-50 training to reach 75.9% accuracy at 32,768 batch size Trains in 1729 steps (not a typo), 284 secs on TPUs.

ALT Shampoo optimizer spirit animal.

30
222
I recently completed 11 years at Google. Time flies; it’s been an incredible journey with lots of fun. I have been lucky to be around excellent role models, teachers, and absolutely amazing infrastructure. One highlight has been being able to contribute to really large-scope projects without formal hierarchies, purely from a passion “to just do it”.

One fond memory is the time I was interested in low-level performance optimization and making code run faster on CPUs for server-side services, which eventually led to hacking on protocol buffers for C++ and co-developing an arena-based memory allocator with Chris Fallin in the summer of 2014. I was not on the protobuf team or anywhere close to that organization. It was so much fun that I picked it up and found ways to make it work and get good performance, which led to Chris joining, and we were hacking on it day and night that summer until it was released to all of Google3. This was a massive undertaking: it required fixing every test ever written (or to be written) that calls protobuf, and designing the API to retrofit existing code while matching memory, execution time, and compilation time when arenas were disabled. But we didn’t shy away, and no one stopped us for being at starting SWE levels.

Retrospectively, the part that’s quite unique at Google was that interested parties can find each other and form virtual teams. I have fond memories of Geoff Pike helping out by going into the details of how functions were generated by the protobuf compiler. The leadership around Chris and me supported these virtual teams too! In terms of impact it was massive, as we were seeing up to a 40-50% reduction in CPU usage across core services and huge latency wins too. Such projects with virtual teams still exist and are supported, and I think it has mainly been a function of motivation and actual willingness to do it rather than role levels.

I have been meaning to document all the low-level optimizations that went into arena allocators one day, including ones from Jeff, Sanjay, and Chris. It’s been almost a decade, and I may do so in the summer.
6
6
222
117,344
Following everyone else!
7
4
217
10,818
I am taking the bait here. The OG distillation paper was rejected by the process referred to here. Distillation is probably one of the most impactful techniques in deep learning practice.
To qualify as Science a piece of research must be correct and reproducible. To be correct and reproducible, it must be described in sufficient details in a publication. To be 'published' (to receive a seal of approval) the publication must be checked for correctness by reviewers. To be reproduced, the publication must be widely available to the community and sufficiently interesting. If you do research and don't publish, it's not Science. Without peer review and reproducibility, chances are your methodology was flawed and you fooled yourself into thinking you did something great. No one will ever hear about your work. No one will pick it up and build on top of it. No one will build new technology and products with it. Your work will have been in vain. You'll die bitter and forgotten. If you never published your research but somehow developed it into a product, you might die rich. But you'll still be a bit bitter and largely forgotten.
9
8
197
250,800
Code for SM3 which is a memory efficient adaptive first-order optimizer is now open-sourced under @GoogleAI research repository. It's useful for training very large language models, for eg: BERT-Large, GPT2 etc. cutt.ly/qwl2hNM
2
47
203
Ilya asks, scaling what? The common answer is inference-time compute. I think answers can include: 1. Architectures that are more expressive and thus more expensive at inference time. 2. Objectives and optimization methods that reduce negative transfer across tasks and modalities but require more expensive steps. Just thinking about 1. and 2. leads to better methods.
12
5
197
32,332
First day at Ant went well!
2
1
201
13,262
I was curious to check! There is a huge increase in engagement via image editing largely outside US. Not math and not homeworks.
Wow, $GOOGL Gemini has overtaken ChatGPT in top downloads in the US on iOS. Something to keep an eye on as ChatGPT has dominated the standings for months now. $GOOGL execution and product shipment showing results…
11
6
196
71,678
Dropping a bit of lore on this Halloween that I got reminded of. Before the first TPU was taped out, there was mostly async training of neural nets in Google production. The team was genuinely worried that sync training would be bad, and there was a team considering how to add corruption/asyncness so that models would converge (the theory was that noise helps convergence). This was then disproven by data.
10
1
200
25,298
I completely missed the Parallel Layers used in PaLM. It makes training 15% faster at larger scale, mainly by running the MLP and Attention together! Thanks @achowdhery for pointing this out to me! The savings in compute are quite substantial.
6
13
193
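The difference between the standard (serial) Transformer block and the parallel formulation mentioned in the tweet above can be sketched with toy scalar sublayers. The function names and the scalar stand-ins for `norm`, `attention` and `mlp` are illustrative assumptions, not PaLM code. The point is only the dataflow: in the parallel form both sublayers read the same normalized input, which is what allows their input projections to be fused.

```python
def serial_block(x, norm, attention, mlp):
    """Standard block: the MLP sees the output of the attention sublayer."""
    h = x + attention(norm(x))
    return h + mlp(norm(h))


def parallel_block(x, norm, attention, mlp):
    """Parallel block: attention and MLP both read the same normalized input."""
    n = norm(x)
    return x + attention(n) + mlp(n)


# Toy scalar stand-ins so the dataflow difference is visible.
identity = lambda v: v
toy_attention = lambda v: 2 * v
toy_mlp = lambda v: 3 * v

print(serial_block(1, identity, toy_attention, toy_mlp))    # 12
print(parallel_block(1, identity, toy_attention, toy_mlp))  # 6
```

The two blocks are different functions, so the speedup is a modeling trade-off rather than a free rewrite of the same computation.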
“For example, if the traditional algorithm taught in school multiplies a 4x5 by 5x5 matrix using 100 multiplications, and this number was reduced to 80 with human ingenuity, AlphaTensor has found algorithms that do the same operation using just 76 multiplications.”
Today in @Nature: #AlphaTensor, an AI system for discovering novel, efficient, and exact algorithms for matrix multiplication - a building block of modern computations. AlphaTensor finds faster algorithms for many matrix sizes: dpmd.ai/dm-alpha-tensor & dpmd.ai/nature-alpha-tensor 1/
3
12
183
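The figure of 100 multiplications in the quote above is just the schoolbook count for a 4x5 by 5x5 product (4 * 5 * 5). A small sketch that counts scalar multiplications while multiplying; the helper name `matmul_with_count` is hypothetical:

```python
def matmul_with_count(a, b):
    """Schoolbook matrix product that also counts scalar multiplications."""
    rows, inner, cols = len(a), len(b), len(b[0])
    count = 0
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for t in range(inner):
                out[i][j] += a[i][t] * b[t][j]
                count += 1
    return out, count


a = [[1] * 5 for _ in range(4)]   # 4x5 matrix of ones
b = [[1] * 5 for _ in range(5)]   # 5x5 matrix of ones
product, count = matmul_with_count(a, b)
print(count)        # 100 multiplications for the schoolbook algorithm
print(count - 76)   # 24 fewer with the 76-multiplication algorithm quoted above
```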
Gemini ♊️ is most used LLM today on openrouter.ai/rankings
5
20
186
22,313
Some excellent work by @jeankaddour and colleagues arxiv.org/abs/2307.06440 “We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate” ☠️
Replying to @_arohan_
Our arxiv preprint might be of interest to you: arxiv.org/abs/2307.06440
5
31
176
123,902
This seems like a massive improvement! openreview.net/forum?id=r8J3… One idea is that they encode recency bias in a data-dependent way. Multiply your attention prob with (1-p(k)) for all k between the query and key tokens. So if there is a dominant attention prob between the tokens, it will weigh down the attention prob. Another way to think about it is that it breaks ties by preferring recent tokens. There are likely improvements to be built on top, like making the implementation more efficient. Stick-breaking attention, thanks to @Grad62304977 for bringing it to our attention. Length generalization 🪄
8
22
174
17,622
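The "(1-p(k)) for all k between the query and key" rule from the tweet above can be written out for a single query. This is a minimal sketch, assuming each key's probability is a sigmoid of its logit and that all listed keys precede the query; the function name `stick_breaking_weights` is hypothetical, and the paper's efficient implementation is not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stick_breaking_weights(logits):
    """Attention weights for one query.

    `logits` holds one score per key, ordered from oldest to most recent.
    Each key takes a fraction p of whatever the more recent keys left over.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # product of (1 - p) over all more recent keys
    for i in range(len(logits) - 1, -1, -1):  # most recent key first
        p = sigmoid(logits[i])
        weights[i] = p * remaining
        remaining *= 1.0 - p
    return weights


# Three keys with identical logits: the tie is broken toward recent tokens.
print(stick_breaking_weights([0.0, 0.0, 0.0]))  # [0.125, 0.25, 0.5]
```

Unlike softmax, the weights need not sum to 1, and a strongly attended recent key suppresses everything older than it.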
Prediction: People say pretraining will end, and I think everyone will be surprised how many multipliers we can squeeze from existing data through all kinds of algorithms.
7
9
174
31,686
It’s the best time to be a researcher. So many open problems, lots of compute opening up, and the impact of your work directly impacts the pace of your research!
5
7
182
22,511
Fascinating 1. Use QK cosine sim to predict similarity at byte level 2. Use similarity to chunk bytes based on threshold. 3. Use an encoder 4. Scatter back + linear attention to form decoder layers (in reverse order) 5. Decoder layers for reconstruction loss
Replying to @sukjun_hwang
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
2
9
172
11,591
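Steps 1 and 2 of the summary above (cosine similarity between adjacent positions, then thresholding into chunks) can be sketched as follows. The assumptions are mine: the boundary probability is taken as 0.5 * (1 - cosine similarity) between the current query and the previous key, the first position always opens a chunk, and raw vectors stand in for learned projections. The names `chunk_boundaries` and `chunk` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def chunk_boundaries(queries, keys, threshold=0.5):
    """Return True at every position that starts a new chunk."""
    boundaries = [True]  # the first position always opens a chunk
    for t in range(1, len(queries)):
        # Low similarity to the previous position means a likely boundary.
        p_boundary = 0.5 * (1.0 - cosine(queries[t], keys[t - 1]))
        boundaries.append(p_boundary >= threshold)
    return boundaries


def chunk(items, boundaries):
    """Group items into chunks, starting a new one at each boundary."""
    chunks = []
    for item, is_start in zip(items, boundaries):
        if is_start:
            chunks.append([item])
        else:
            chunks[-1].append(item)
    return chunks


# Toy "byte" representations: two similar positions, then a sharp change.
vectors = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
flags = chunk_boundaries(vectors, vectors)
print(flags)                 # [True, False, True, False]
print(chunk("abcd", flags))  # [['a', 'b'], ['c', 'd']]
```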
My well-calibrated take on DeepSeek: First, most closed labs don’t mention the compute used or the model sizes they pre-train, so it’s hard to compare and one has to use closed API prices for comparison. DeepSeek trained a good model that’s open weights and largely open science (building on top of others) while innovating on MoEs and their take on MLA. And this isn’t the final frontier of efficiency ;)
3
6
162
14,775
Noam is back!
5
4
168
42,209
A Saturday reminder to all new followers that Shampoo stands for a preconditioner. It’s called Shampoo because that's what comes pre/before using a conditioner.
6
2
165
21,375
Transformer paper should get half a decade test of time award for completely transforming the industries and what people work on.
4
7
161
One thing that stood out to me was “Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set” Nothing wrong with this in terms of results achieved but isn’t the timeline now comparing apples and oranges if other llms were never tuned against this set? Does tuning result in losing other abilities?
8
4
160
44,461
I have been at Ant the same number of days that I was at Meta which is less than 1/36th of my time at Google. Completely different experience!
2
1
161
24,439
Today is the 10th anniversary of the Adam paper on arXiv! Even though Shampoo is far better than Adam, it’s undeniable how good Adam is with respect to simplicity.
2
2
160
10,483
Tinker with this visualization here for training neural networks with noise added in the dataset. Made with tensorflow.js and inspired by neural network playground. 👇 google.github.io/bi-tempered…
1
36
147
Arrived to these shores @ 2010 Greencard @ 2023 ✅
24
2
155
30,612
Years of optimization and training instability research finally paying off it seems!
Pre-training is not dead
4
7
152
16,642
10 years ago I left working on iOS communicator at MSFT to work on machine learning at Google, without much connections or a doctoral degree for that matter. Crazy how time flies! And due to a bunch of lucky breaks, very thankful to be doing ML things at Google 🧠
8
4
150
49,436
I was reflecting over the last several months, and I think sharing scientific knowledge and advancing open models aligns more with my values on shared progress and principles of collaboration.
4
5
150
11,559
Since folks are discussing infra: it is not about models per se, it’s about agency. Two incidents that I fondly remember: Covid happened and Meet was slow, so a senior engineer and a friend decided to take it into their own hands, profiling things and making them better. They did a quick set of improvements that dramatically improved performance. The second was XLA taking a lot of time to compile. I was having a skill issue and complaining in a chat room because I was a bit tired. The same engineer ended up diving into the code and making everything faster by 25x. This motivated the XLA team to continue doing this, although the actual changes were pretty straightforward: (a) profile and (b) optimize. Good infrastructure is likely carried by high-agency individuals who just do things. I don’t think it’s about coding models. High agency probably comes from being at the top of Maslow’s pyramid of needs.
7
8
150
19,806
This was one of my last projects at GDM: training one of the most inference-efficient models, one that scales to billions, with an amazing crew. I just aged a lot while keeping the model alive.
Intelligence too cheap to meter
5
3
151
19,371
Previous employer sending patents to sign is a leak that stuff worked really well.
4
4
150
25,782
Maybe a good time to post my limited edition TPU star jacket for bringing up the first version of the TPU used for a Google prod model's training and inference.
7
2
144
12,046
Learning to learn - architecture (already done) - optimizer (already done) - sampling from an auto regressive model? *inspired by entropix but why not just automate all the heuristics.
11
4
142
12,775
MADGRAD: 76.22% Shampoo: 77.8% github.com/google-research/g…
We're introducing an optimizer for deep learning, MADGRAD. This method matches or exceeds the performance of the Adam optimizer across a varied set of realistic large-scale deep learning training problems. github.com/facebookresearch/…
3
26
139
Replying to @jeremyphoward
We are on your side and still cooking 🍳 Scout fits on a single H100 80G and packs so much, and we do understand the need for models on a 4090. We will continue to squeeze intelligence into an even tinier form factor for the community. Deal?
10
145
13,980
Really enjoyed reading paper that trains a tiny model that achieves high arc-“agi” scores Loop(x , y-embed, z-embed); x inputs. Where y is prediction and z is memory initialized to zero. Gradient from last step of the loop (bit nuanced as there are inner loops) Weights shared across loops Use a prediction head to see if prediction is correct, and stop the loop, otherwise repeat the loops (i.e outer loop for 16 times; detach gradients so, y, z updated for final step only)
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/2… Code: github.com/SamsungSAILMontre… Paper: arxiv.org/abs/2510.04871
3
6
141
24,064
Most of my contribution to deep learning: starting from codistillation, sm3, distributed shampoo, locoprop, ngrammers were when I was on h1b.
1
4
143
14,166
Someone passed this wisdom to me today. Deep learning techniques working vs not working is two devils - your prior about the technique - your attention to details about implementation of the technique Need both to make it work.
Deep learning is ~10% idea and ~90% implementation.
1
7
142
15,119
One not-so-nice behavior due to the gold rush and closed nature of labs is everyone claiming on socials or their website to have led X and Y on closed models. It feels a bit unsettling/cringe.
12
143
23,955
Lol are we now tweeting about gradient spikes in training runs or some such. I think people may pay to watch training loss curves go down at this rate.
Nothing confirmed, but rumors tonight that Grok 3 suffered some sort of catastrophic incident yesterday during its training. Hopefully not true. There were similar rumors about the Opus 3.5 delay, nothing was ever confirmed there either
9
5
134
19,531
Thinky machines post on non determinism also made me think of non reproducibility of neural network training that was very well studied at google for ads models.
8
3
139
14,831
It’s pretty cool to hear about someone actually reading a paper of someone else and then talking to them and giving them a job offer on the spot. Really legitimate way to hire.
4
1
133
11,286
Gemini Nano improves on the efficiency frontiers. The models are multimodal as well; see the results in the paper. Nano series: at 1.8B and 3.25B parameters, it packs so much to provide high utility on device. First foundation model on the device! android-developers.googleblo…
Replying to @sundarpichai
Gemini Nano is super efficient for tasks that are on-device. Android developers can sign up for an early access program for Gemini Nano via Android AICore and Pixel 8 Pro users can already see it rolling out in features like Summarize in Recorder and Smart Reply in Gboard + much more to come! blog.google/products/pixel/p…
4
12
130
57,059
Next big jump with Neural Network performance is going to happen when community embraces non-uniformity Eg, stacking of identical layers has become ingrained within our tools and mindsets.
13
11
131
68,951
2025 off to a good start. Per step time needs a bit more work. Getting used to new setup otherwise I think we have improved on Shampoo.
5
3
133
17,057
Bit disappointed that my timeline has very few arxiv paper link / explanation threads. It used to be lot more in the past.
9
8
129
31,170
Gen AI on-device? A foundation model on the phone? Imagine an entire operating system level unlock of capabilities: Well Pixel 8 Pro will have it. Rick announced it here: piped.video/watch?v=pxlaUC… The model was trained with several algorithmic breakthroughs by our team to make it possible!
9
14
132
51,000
My last project at GDM was getting Nano-3 spec’ed with this gang. I am very happy that matformers, per layer embeddings and various innovations are openly published here!
Pocket powerhouse amidst I/O awesomeness! Gemma 3n E4B & E2B are insane models, optimized for on-device while rivaling frontier models. It's a 🪆Matryoshka Transformer (MatFormer)🪆: Natively elastic b/w 4B & 2B pareto-optimally! ⭐️: free models with ZERO training cost! 🧵👇
6
3
129
9,472
There is pre-training and post training - everybody skips training I suppose.
11
4
123
8,774
Will you all get mad if I say Muon is Shampoo?
6
1
132
21,633
I have two kids. The 2nd one is slightly older than 6 months old. I have been finding he is ramping up on his capabilities faster than my first one. Then it struck me, he is distilling better from his sister whenever they are together since they are likely implementing similar circuits. He is more observant to her than to us likely for this reason.
11
1
126
18,692
I want to say that models are now better than me in technical subjects that I had spent years learning. As long as I can ask the right questions, I am able to automate the inner loop of reasoning with LLMs.
Yann LeCun is not so optimistic about LLMs. 😟 "We are not going to get to human level AI by just scaling up MLMs. This is just not going to happen. There's no way. Okay, absolutely no way. And whatever you can hear from some of my uh more adventurous colleagues, it's not going to happen within the next two years. There's absolutely no way in hell. You know the idea that we're we're going to have you know a country of genius in a data center that's complete BS. There's absolutely no way what we're going to have maybe is systems that are trained on sufficiently large amounts of data that any question that any reasonable person may ask will will find an answer through those systems and it would feel like you have a PhD sitting next to you but it's not a PhD you have next to you. It's system with a gigantic memory and retrieval ability not not a system that can invent solutions to to new problems, which is really what a PhD is. " -------- From "Alex Kantrowitz" YT Channel (full video link in comment)
8
7
127
12,512