rohan anil · Apr 19, 2026 · 7:24 PM UTC

rohan anil

Pinned Tweet

rohan anil

@_arohan_

Apr 19

It turns out multi-step backpropaganda is better. The paper has a beautiful way of improving backpropagation: one iteration cleanly gets us backprop, and multiple iterations get us a preconditioned update.

rohan anil

@_arohan_

Apr 19

Replying to @LinYorker @ryu0000000001 @weijie444

arxiv.org/abs/2106.06199 Same update here

213

115,649

rohan anil · Jun 5, 2025 · 4:29 AM UTC

rohan anil

@_arohan_

5 Jun 2025

A little bit of update from me: I will join the awesome team at @AnthropicAI in two weeks.

104

1,304

147,288

rohan anil · Jun 5, 2025 · 4:28 AM UTC

rohan anil

@_arohan_

5 Jun 2025

129

1,211

92,871

rohan anil · Jan 13, 2025 · 9:22 PM UTC

rohan anil

@_arohan_

13 Jan 2025

Joining the Llama team @AIatMeta today! Time to train models, finally gpu rich :)

1,049

90,592

rohan anil · Nov 6, 2025 · 3:21 AM UTC

rohan anil

@_arohan_

6 Nov 2025

Near the office. SF has stepped up its dosa game.

864

82,018

rohan anil · Dec 7, 2022 · 6:33 AM UTC

rohan anil

@_arohan_

7 Dec 2022

This paper looks like a big step forward for the Transformer architecture! A foundational improvement, not as shiny as other things, but a really big step forward nonetheless.

801

rohan anil · Nov 10, 2025 · 8:38 PM UTC

rohan anil

@_arohan_

10 Nov 2025

Reading this, it's clear that Meta is advancing recommender systems tech faster than other places, including G.

Engineering at Meta

@Meta_Engineers

10 Nov 2025

We’re excited to share details on Meta’s Generative Ads Recommendation Model (GEM), a new foundational model built with LLM-scale techniques that’s already helping create more value for businesses, like +5% increase in ad conversions on Instagram. Dive deep into the technology behind GEM and see how it delivers increased ad performance and advertiser ROI by enhancing our ads recommendation models’ ability to serve more relevant and personalized ads: engineering.fb.com/2025/11/1…

608

173,544

rohan anil · Dec 21, 2024 · 12:19 PM UTC

rohan anil

@_arohan_

21 Dec 2024

Man, claude solved this verbally by looking at the inputs visually.

François Chollet

@fchollet

20 Dec 2024

Replying to @fchollet

It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn't solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute in the process). Interestingly this first task was the one we had in our university tour presentation to illustrate "it's easy for humans, hard for AI"

567

95,546

rohan anil · Dec 6, 2024 · 5:41 PM UTC

rohan anil

@_arohan_

6 Dec 2024

A bittersweet moment for me: Gemini is doing really well, and the teams are doing great. I had a great run of close to 12 years at G, long enough that one could call me an OG. For example, for every search query, I noticed that things I was able to contribute to are deeply integrated, from the retriever to the final rankings, allowing me to meaningfully impact the world and help Google’s mission. I decided it was time to do something next year that I naturally gravitate towards and that gives my life’s work meaning. And since today is my last day with corp access, here is the note I sent this week 🚀
Hi all, I decided to try to find a new environment for a change next year. It was a hard decision to leave the comfort of Google's safety nets. I am quite nervous and excited too, especially since I have only explicitly changed teams twice in close to 12 years, even though the number of codebases to train a transformer has changed more times than I can count (half joking!).
Google as an employer and as a utility has been life changing on multiple fronts. Firstly, the sign-on bonus allowed me to pay off my student loans that were accruing rapidly, and work visas let me stay in the US and led me to go on this ride to what looks like AGI :)
I have a lot of fond memories of working alongside you, and learning from you all has been a unique privilege! I remember permuting the words "learn. observe. contribute" every now and then on the go/who page as a way to keep myself focussed on what's important. Getting to work alongside brilliant scientists and engineers in Brain, with an open-door policy on collaboration, allowed me to thrive and was a life-changing and singular experience. And very recently, making Gemini work with its own constraints was very challenging and meaningful to me, and I feel like it is paying off, with the resulting good models such as Flash and Nano taking center stage this year.
I am extremely grateful to my mentors and colleagues who helped me get to where I am today. The collaborative spirit and dedication to innovation at Google have been truly inspiring. Finally, I feel good thinking that my life's work here across the stack (quite literally) over the last close to 12 years has delivered value to the world in a positive and meaningful way!
You can find me at rohan.anil@gmail.com -- I am always open to grabbing coffee or lunch, thinking about the next breakthrough research idea, or pair programming on something fun. I am in the East Bay and going to take a small break to recharge.
PS: This is the longest email I have typed and edited, and I promise I didn't use an LLM to write it.
With regards, Rohan

114

558

82,900

rohan anil · Sep 16, 2023 · 8:33 PM UTC

rohan anil

@_arohan_

16 Sep 2023

Meta researchers just dropped PyTorch Distributed Shampoo 🧴 a few days ago: arxiv.org/pdf/2309.06497.pdf 💥 Train neural networks with a second-order method for better performance. The underlying work it is based on has been a passion project for the last 5 years while swimming upstream with @GuptaVineetG, with no love from any conference chairs. Distributed Shampoo in PyTorch with solid results means that I, as a co-author of the method, trust the implementation! Lastly, given the effort they have put in, my guess is it is already in production (:

544

113,455

rohan anil · Oct 11, 2025 · 7:06 PM UTC

rohan anil

@_arohan_

11 Oct 2025

That’s insane to convince a cofounder of thinky to bail this fast.

Meghan Bobrowsky

@MeghanBobrowsky

11 Oct 2025

Saturday scoop: Thinking Machines Lab co-founder Andrew Tulloch has joined Meta, the startup confirmed. W/ @keachhagey

531

86,946

rohan anil · Oct 9, 2024 · 4:27 PM UTC

rohan anil

@_arohan_

9 Oct 2024

I got to coauthor papers with two Nobel prize winners, one in Physics and one in Chemistry 😁

507

26,689

rohan anil · Dec 6, 2023 · 4:28 PM UTC

rohan anil

@_arohan_

6 Dec 2023

It’s been a privilege to work alongside our Gemini leads and team (across Google DeepMind, Research and Alphabet) on one of the most interesting and challenging projects of my career. We have three versions of Gemini: (a) Ultra, (b) Pro and (c) Nano. We make significant progress on scaling and efficiency frontiers. We are shipping and continue to ship! But I will now catch up on some sleep, and there is a sense of excitement!

Jeff Dean

@JeffDean

6 Dec 2023

I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU across 57 subjects with a score above 90%. It also achieves a new state-of-the-art score of 62.4% on the new MMMU multimodal reasoning benchmark, outperforming the previous best model by more than 5 percentage points. Gemini was built by an awesome team of people from @GoogleDeepMind, @GoogleResearch, and elsewhere at @Google, and is one of the largest science and engineering efforts we’ve ever undertaken. As one of the two overall technical leads of the Gemini effort, along with my colleague @OriolVinyalsML, I am incredibly proud of the whole team, and we’re so excited to be sharing our work with you today! There’s quite a lot of different material about Gemini available, starting with: Main blog post: blog.google/technology/ai/go… 60-page technical report authored by the Gemini Team: deepmind.google/gemini/gemin… In this thread, I’ll walk you through some of the highlights.

471

156,703

rohan anil · Jun 22, 2022 · 5:26 PM UTC

rohan anil

@_arohan_

22 Jun 2022

A new image generation model just dropped. parti.research.google/ Great work by the team! + Auto-regressive, encoder->decoder Transformer + Classifier-free sampling. + ViT-VQGAN Really amazing results: Image from the website.

458

rohan anil · Dec 29, 2024 · 8:18 AM UTC

rohan anil

@_arohan_

29 Dec 2024

I'm back at implementing preconditioners for fun. It's wild how much untapped potential there still is in optimizing neural nets better. Thinking of writing up a tutorial that builds from the basics all the way to SOTA stuff on standard networks.

472

28,193

rohan anil · Jun 14, 2025 · 3:57 AM UTC

rohan anil

@_arohan_

14 Jun 2025

Async RL framework for scaling RL

473

37,415

rohan anil · Feb 24, 2020 · 2:13 AM UTC

rohan anil

@_arohan_

24 Feb 2020

Shampoo is out of the bottle! Preprint: "Second order optimization made practical" arxiv.org/pdf/2002.09018.pdf We train certain neural nets faster than before. How fast? It has shown up to ~40% reduction in training time for a Transformer. (@tomerikoriko)
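For readers new to the method, here is a sketch of the basic Shampoo update for a matrix-shaped parameter $W_t \in \mathbb{R}^{m \times n}$ with gradient $G_t$, as I recall it from the original Shampoo paper rather than from this tweet. The symbols $L_t$, $R_t$, $\eta$ and $\epsilon$ are notation introduced here, and the practical distributed version in the preprint adds further engineering that is not shown.

```latex
\begin{aligned}
L_t &= L_{t-1} + G_t G_t^{\top}, & L_0 &= \epsilon I_m, \\
R_t &= R_{t-1} + G_t^{\top} G_t, & R_0 &= \epsilon I_n, \\
W_{t+1} &= W_t - \eta \, L_t^{-1/4} \, G_t \, R_t^{-1/4}.
\end{aligned}
```

The left and right statistics $L_t$ and $R_t$ precondition the rows and columns of the gradient separately, which is what makes the update non-diagonal while staying far cheaper than a full-matrix preconditioner.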

104

439

rohan anil · Jan 6, 2025 · 10:07 AM UTC

rohan anil

@_arohan_

6 Jan 2025

Replying to @iScienceLuvr

One wedding cost is like 10 o3 models.

415

18,398

rohan anil · Oct 23, 2025 · 10:49 PM UTC

rohan anil

@_arohan_

23 Oct 2025

❤️

Anthropic

@AnthropicAI

23 Oct 2025

Today, we announced that we plan to expand our use of Google TPUs, securing approximately one million TPUs and more than a gigawatt of capacity in 2026.

396

92,208

rohan anil · Jun 9, 2023 · 1:20 AM UTC

rohan anil

@_arohan_

9 Jun 2023

cloud.google.com/blog/produc… PaLM-2 is Generally available for developers! “With this update, developers can access our text model powered by PaLM 2, Embeddings API for text, and other foundation models in Model Garden”

Generative AI support on Vertex AI generally available | Google Cloud Blog

Google Cloud announces Generative AI support on Vertex AI generally available.

cloud.google.com

371

76,916

rohan anil · Sep 23, 2022 · 5:30 PM UTC

rohan anil

@_arohan_

23 Sep 2022

Today, we present our paper on Google Search Ads CTR model at ORSUM @ACMRecSys, Seattle. arxiv.org/abs/2209.05310 We highlight ML techniques suited to *online learning* that go well beyond traditional accuracy improvements. orsum.inesctec.pt/orsum2022/… A short thread: 1/n

367

rohan anil · Feb 3, 2025 · 6:02 AM UTC

rohan anil

@_arohan_

3 Feb 2025

Two good papers landed 🛬 today On test time compute arxiv.org/abs/2501.19393 On long context arxiv.org/abs/2501.19399

352

35,955

rohan anil · Oct 27, 2025 · 7:32 PM UTC

rohan anil

@_arohan_

27 Oct 2025

I am relieved I never learned to be a power user of excel.

Dan Shipper 📧

@danshipper

27 Oct 2025

omg

343

25,418

rohan anil · Dec 20, 2023 · 7:31 AM UTC

rohan anil

@_arohan_

20 Dec 2023

Career milestone. Coauthored paper with Jeff D, Oriol V, Koray K, Demis H this year at the same time with rest of the Gemini team. 🤯 arxiv.org/abs/2312.11805

Gemini: A Family of Highly Capable Multimodal Models

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro,...

arxiv.org

329

133,975

rohan anil · May 27, 2022 · 2:29 PM UTC

rohan anil

@_arohan_

27 May 2022

Prompt: "A koala bear and grizzly bear playing chess. They are sitting at a table on the beach. You can see the waves crashing into the shores. Bears are very stressed. DSLR camera photo." #imagen #googleai #brain 🐻🐨♟️🏖️

304

rohan anil · Sep 9, 2024 · 1:15 AM UTC

rohan anil

@_arohan_

9 Sep 2024

Insane timeline.

Joseph @RealJosephus

8 Sep 2024

"Reflection API" is a sonnet 3.5 wrapper with prompt. And they are currently disguising it by filtering out the string 'claude'. teddit.net/r/LocalLLaMA/comm…

314

48,155

rohan anil · Jan 3, 2025 · 4:48 AM UTC

rohan anil

@_arohan_

3 Jan 2025

Transformer: the "Attention is all you need" paper from 2017 used roughly 1e19 flops. Today most labs have a tonne more, like more than a million times more compute. You would expect that, with the right focus and people, we should at least have had similar-level breakthroughs in architectures.

297

71,974

rohan anil · Aug 17, 2022 · 1:12 AM UTC

rohan anil

@_arohan_

17 Aug 2022

arxiv.org/abs/2208.01134 Batch Entropy Regularizer that makes untrainable networks train. Remove skip connection, normalization layers. Published at TMLR, Works on PaLM like transformers -- thanks to Lucid for the pointer!

292

rohan anil · Aug 1, 2024 · 8:06 PM UTC

rohan anil

@_arohan_

1 Aug 2024

Distributed Shampoo has dethroned Nesterov Adam marking a new era for deep learning optimization. 👑 🤘 Non-diagonal preconditioning is here! This is the AlexNet moment for optimization for deep learning. I am extremely happy. An email from 2021.

MLCommons @MLCommons

1 Aug 2024

@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI

293

72,592

rohan anil · Jun 13, 2025 · 11:31 PM UTC

rohan anil

@_arohan_

13 Jun 2025

Last day today @AIatMeta. Reflecting on the last several months, I wanted to highlight a few things I enjoyed working on:
- Building new algorithms for on-policy distillation with @DatHuynh13
- Science of end-to-end thinking models with @agarwl_ and many others
- A working prototype of something beyond transformers with @Happylemon56775
- Newer higher-order optimization and infra with @vinaysrao and many others
- Helping @afrozenator with codistillation for pretraining and helping move to the next arch
- Pushing on the scaling RL plans with many, particularly debugging entropy-collapse types of bugs
Most fun / memories are on technical work!

291

36,742

rohan anil · May 10, 2023 · 5:45 PM UTC

rohan anil

@_arohan_

10 May 2023

PaLM 2 is online: ai.google/discover/palm2 🌴🌴 Paper: ai.google/static/documents/p… I learned to code with instructions in Malayalam, so this capability shown by PaLM 2 instruction-tuned models to explain code makes me quite happy! Possibilities are endless here!

rohan anil

@_arohan_

10 May 2023

🌴🌴 Very proud of this work; specifically not compromising on model quality, while being extremely fast for inference, so that we can serve the whole wide world i.e bringing technology to everyone!

282

94,401

rohan anil · Feb 23, 2024 · 7:22 PM UTC

rohan anil

@_arohan_

23 Feb 2024

Replying to @giffmana

The team is working hard to bring audio inputs to the AI Studio interface for Gemini 1.5 Pro. We have an internal version that handles audio and video and can sample the video less frequently to increase the length of content that can be handled. @karpathy, thanks for the awesome tutorial. 🙏 Here's the prompts we used and the output for something in the direction you suggested. @TeplyashinDenis @joe_stant @SavinovNikolay @machelreid @ankurbpn

272

61,403

rohan anil · Jun 18, 2024 · 6:03 PM UTC

rohan anil

@_arohan_

18 Jun 2024

Everyone who is seeing this tweet! It’s time to build cool things with the caching feature. 1. You can now cache prompts of 1M tokens; this may mean a reasonable number of PDFs on arXiv, so a paper and its related work. 2. Then you make a good UX to analyze, summarize and ask questions on this data. Who wants to prototype this demo first?

Logan Kilpatrick

@OfficialLoganK

18 Jun 2024

Replying to @OfficialLoganK

Gemini 1.5 Flash continues to be the best value proposition for anyone building with LLMs. - $0.0875 / 1 million tokens (cache prompts < 128K) - $0.175 / 1 million tokens (cache prompts > 128K) - $1.00 / 1 million tokens per hour (cache storage) Big 🚢 by @shresbm and team!!
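To get a feel for the numbers in the quoted tweet, here is a rough cost sketch. The function name `cached_prompt_cost` is hypothetical, the rates are copied from the tweet, and treating a prompt of exactly 128K tokens as the higher tier is an assumption, since the tweet only gives "<" and ">". Uncached input tokens and output tokens are not counted.

```python
def cached_prompt_cost(cached_tokens, hours, reads):
    """Estimate USD cost of keeping a prompt cached and reading it `reads` times.

    Rates are the ones quoted in the tweet above; this is not an official calculator.
    """
    per_million = cached_tokens / 1_000_000
    # $ per 1M cached tokens read; the boundary at exactly 128K is an assumption.
    read_rate = 0.0875 if cached_tokens < 128_000 else 0.175
    storage_rate = 1.00  # $ per 1M cached tokens per hour
    storage = per_million * storage_rate * hours
    reading = per_million * read_rate * reads
    return storage + reading


# A 1M-token cache (say, a paper plus related work) kept for one hour and queried 10 times.
print(round(cached_prompt_cost(1_000_000, hours=1, reads=10), 4))
```

Under these assumptions the storage term dominates for short sessions with few queries, so a demo would want to batch questions while the cache is alive.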

271

135,236

rohan anil · Jan 20, 2025 · 4:21 PM UTC

rohan anil

@_arohan_

20 Jan 2025

R1 shows you need a good base model, a large math and code prompt reward set, prompting+cleaning then any RL technique+long decode+gpu go brr

259

56,200

rohan anil · May 27, 2022 · 2:26 PM UTC

rohan anil

@_arohan_

27 May 2022

L👈: "A Koala bear in a suit standing at a podium to teach. Variational bayesian methods is written on the chalkboard. There are lot of confused cats in the crowd" R 👉:"Variational bayesian methods is all you need is written on the chalkboard." 🐨🙀 #imagen #googleai #brain

250

rohan anil · Dec 15, 2024 · 2:00 PM UTC

rohan anil

@_arohan_

15 Dec 2024

Going home to India after 5 years - one pandemic, 2 kids, 4 LLMs revisions and 4 NeurIPS later Acquired: - Airplane toys, and ipad loaded with movies - enough baby formula to last 28 hrs. - updated personal laptop - installed torch and grabbed notable articles iclr blog track from 2024 - ollama llama70b

259

33,067

rohan anil · Dec 26, 2024 · 4:27 PM UTC

rohan anil

@_arohan_

26 Dec 2024

arxiv.org/pdf/2404.19737

251

72,344

rohan anil · Dec 21, 2024 · 2:13 AM UTC

rohan anil

@_arohan_

21 Dec 2024

You are telling me that o3 is causal attention with a decoder model!?

243

36,675

rohan anil · Oct 10, 2025 · 3:42 AM UTC

rohan anil

@_arohan_

10 Oct 2025

I forgot about this tweet but read this top tier paper and get ultra agi pilled.

rohan anil

@_arohan_

1 Jun 2025

A few questions for those who are following AlphaEvolve and FunSearch: * is anyone reproducing it? * very relevant to diverse data generation in verifiable domains? * one step away from a new paradigm beyond current thinking: “solve this problem under x constraint”? 1. Makes use of human-written prompt templates (for both mutations and crossovers) — instantiate it every generation with (current best code, targets on perf and budgets) as input 2. Evolutionary Search for code. Classic explore vs exploit, and using appropriate prompts. Use a fast model for inner-loop - aka Flash, Pro for merging or refining. 3. Automated verification with “does it compile” “does it pass unit test” “does it run fast” checks
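The three steps in the quoted tweet can be sketched as a toy loop. This is only an illustration under loud assumptions: `propose_mutation` is a hypothetical stand-in for the LLM call (here it randomly nudges a list of integers instead of rewriting code), `verify` stands in for the compile/unit-test checks, and `score` stands in for the "does it run fast" metric. None of this is the actual AlphaEvolve or FunSearch implementation.

```python
import random


def propose_mutation(parent, rng):
    """Hypothetical stand-in for an LLM prompted with the current best candidate."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child


def verify(candidate):
    """Stand-in for automated checks (compiles, passes tests): entries must be non-negative."""
    return all(x >= 0 for x in candidate)


def score(candidate, target):
    """Higher is better: negative distance to a target, standing in for a perf metric."""
    return -sum(abs(a - b) for a, b in zip(candidate, target))


def evolve(seed, target, generations=200, children_per_gen=8, keep=4, rng_seed=0):
    rng = random.Random(rng_seed)
    population = [list(seed)]
    history = []
    for _ in range(generations):
        parents = population[:keep]  # sorted best-first after the first generation
        children = []
        for _ in range(children_per_gen):
            # Explore: sometimes mutate a random survivor. Exploit: otherwise mutate the best.
            parent = rng.choice(parents) if rng.random() < 0.3 else parents[0]
            child = propose_mutation(parent, rng)
            if verify(child):  # only verified candidates enter the pool
                children.append(child)
        pool = parents + children
        population = sorted(pool, key=lambda c: score(c, target), reverse=True)[:keep]
        history.append(score(population[0], target))
    return population[0], history


best, history = evolve(seed=[0, 0, 0], target=[3, 1, 4])
print(best, history[-1])
```

Because survivors are carried over each generation, the best score never decreases; the interesting design questions (prompt templates, fast versus strong model, budget targets) all live inside the stubbed-out proposal step.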

249

51,046

rohan anil · Aug 25, 2025 · 6:10 PM UTC

rohan anil

@_arohan_

25 Aug 2025

Rishabh is an amazing researcher. His algorithms underpin post training at Gemini. I got to work together at meta for a short while and was truly impressed. Whichever group got Rishabh is so lucky to have him!

Rishabh Agarwal

@agarwl_

25 Aug 2025

This is my last week at @AIatMeta. It was a tough decision not to continue with the new Superintelligence TBD lab, especially given the talent and compute density. But after 7.5 years across Google Brain, DeepMind, and Meta, I felt the pull to take on a different kind of risk. The pitch from Mark and @alexandr_wang to build in the Superintelligence team was incredibly compelling. But I ultimately chose to follow Mark's own advice: “In a world that’s changing so fast, the biggest risk you can take is not taking any risk”. In my short time at Meta, we did push the frontier on post-training for "thinking" models. Specifically: - Pushing an 8B dense model to near Deepseek-R1 performance with RL scaling. - Using synthetic data mid-training to warm-start RL. - Developing better on-policy distillation methods. Really enjoyed working with @_arohan_, @brandfonbrener, Leo Li, @ErykHelenowski, @DatHuynh13, Xiaocheng, Jia, Boduo, and Yanjun.

244

52,254

rohan anil · May 26, 2022 · 9:53 PM UTC

rohan anil

@_arohan_

26 May 2022

Prompt: "A train ride in the monsoon rain in Kerala. With a Koala bear wearing a hat looking out of the window. There is a lot of coconut trees out of the window" #imagen #googleai #brain (I will host the imagen team at my home in Kerala if they choose to visit 🚀)

232

rohan anil · Nov 7, 2025 · 4:44 AM UTC

rohan anil

@_arohan_

7 Nov 2025

Number of VCs liking this from SF makes me think … there is space to raise money for a startup to make frontier roast dosas.

rohan anil

@_arohan_

6 Nov 2025

Near the office. SF has stepped up its dosa game.

234

21,043

rohan anil · Jul 15, 2025 · 6:01 AM UTC

rohan anil

@_arohan_

15 Jul 2025

Just saying: Muon was originally done on the CIFAR speedrun with few GPUs on nanogpt. While GPUs per researcher is a helpful metric, it is not the primary indicator of success. Distillation was on MNIST, and so were the origins of Shampoo, K-FAC and even transformers.

242

57,408

rohan anil · Jun 25, 2023 · 6:02 PM UTC

rohan anil

@_arohan_

25 Jun 2023

GPT-4 can do well on MIT test Community: oh the methodology is all wrong 🌶️ Introducing new optimizer that is 2x faster than AdamW Community: Impressive! Impressive methodology! Said methodology: use half the steps for new method and change learning rate schedule to advantage the method, don't tune the baseline.

231

111,835

rohan anil · Oct 2, 2025 · 3:56 PM UTC

rohan anil

@_arohan_

2 Oct 2025

20 days later. I was curious to check.

rohan anil

@_arohan_

13 Sep 2025

I was curious to check! There is a huge increase in engagement via image editing largely outside US. Not math and not homeworks.

222

50,483

rohan anil · Mar 2, 2021 · 8:27 PM UTC

rohan anil

@_arohan_

2 Mar 2021

Code for Distributed Shampoo: a scalable second order optimization method bit.ly/3uXXtKy 💥 Joint work w @GuptaVineetG State of the art on MLPerf ResNet-50 training to reach 75.9% accuracy at 32,768 batch size Trains in 1729 steps (not a typo), 284 secs on TPUs.

Image alt text: Shampoo optimizer spirit animal.

222

rohan anil · May 17, 2024 · 5:35 PM UTC

rohan anil

@_arohan_

17 May 2024

I recently completed 11 years at Google. Time flies; it’s been an incredible journey with lots of fun. I have been lucky to be around excellent role models, teachers and absolutely amazing infrastructure. One highlight has been being able to contribute to really large-scope projects without really having formal hierarchies, purely from passion “to just do it”. One fond memory is the time I was interested in low-level performance optimization and making code run faster on CPUs for server-side services, which eventually led to hacking on protocol buffers for C++ and co-developing an arena-based memory allocator with Chris Fallin in the summer of 2014. I was not in the protobuf team or anywhere close to that organization. It was so much fun that I had picked it up and found ways to make it work and get good performance, which led to Chris joining, and we were hacking on it day and night that summer until it was released to all of Google3. This was a massive undertaking that required fixing every test ever written or to be written that calls protobuf, and designing the API to retrofit existing code while matching memory, execution time and compilation time when arenas were disabled. But we didn’t shy away, nor did anyone stop us because we were at starting levels for SWEs. Retrospectively, the part that’s quite unique at Google was that interested parties can find each other and make virtual teams. I have fond memories of Geoff Pike helping out by going into the details of how functions were generated by the protobuf compiler. The leadership around Chris and me supported these virtual teams too! In terms of impact it was massive, as we were seeing up to 40-50% reduction in CPU usage across core services and huge latency wins too. Such projects with virtual teams still exist and are supported, and I think it’s mainly been a function of motivation and actual willingness to do it rather than role levels.
I have been meaning to document all the low-level optimizations that went into arena allocators one day, including ones from Jeff, Sanjay and Chris; it’s been almost a decade, and I may do so in the summer.

222

117,344

rohan anil · May 18, 2023 · 5:35 AM UTC

rohan anil

@_arohan_

18 May 2023

arxiv.org/abs/2305.10403

217

767,259

rohan anil · Oct 4, 2025 · 12:54 AM UTC

rohan anil

@_arohan_

4 Oct 2025

Following everyone else!

217

10,818

rohan anil · May 29, 2024 · 6:40 AM UTC

rohan anil

@_arohan_

29 May 2024

I am taking the bait here. The OG distillation paper was rejected by the process referred here. Distillation probably is one of the most impactful technique in deep learning practice.

Yann LeCun

@ylecun

28 May 2024

Replying to @elonmusk @jeremymstamper @bneiluj @how_many_roads_

To qualify as Science a piece of research must be correct and reproducible. To be correct and reproducible, it must be described in sufficient details in a publication. To be 'published' (to receive a seal of approval) the publication must be checked for correctness by reviewers. To be reproduced, the publication must be widely available to the community and sufficiently interesting. If you do research and don't publish, it's not Science. Without peer review and reproducibility, chances are your methodology was flawed and you fooled yourself into thinking you did something great. No one will ever hear about your work. No one will pick it up and build on top of it. No one will build new technology and products with it. Your work will have been in vain. You'll die bitter and forgotten. If you never published your research but somehow developed it into a product, you might die rich. But you'll still be a bit bitter and largely forgotten.

197

250,800

rohan anil · Aug 30, 2019 · 1:40 AM UTC

rohan anil

@_arohan_

30 Aug 2019

Code for SM3, a memory-efficient adaptive first-order optimizer, is now open-sourced under the @GoogleAI research repository. It's useful for training very large language models, e.g. BERT-Large, GPT2 etc. cutt.ly/qwl2hNM
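To show where the memory saving comes from, here is a minimal sketch of an SM3-style step for a matrix parameter, as I understand the method: one accumulator per row and one per column replace the full per-entry second-moment table. The function name `sm3_step` and the plain list-of-lists layout are illustrative choices, momentum is omitted, and this is not the open-sourced code.

```python
import math


def sm3_step(w, g, row_acc, col_acc, lr=0.1):
    """One SM3-style step for a matrix parameter stored as a list of rows.

    row_acc and col_acc hold one accumulator per row and per column,
    so optimizer state is O(m + n) instead of O(m * n).
    """
    m, n = len(w), len(w[0])
    new_row = [0.0] * m
    new_col = [0.0] * n
    for i in range(m):
        for j in range(n):
            # Per-entry estimate from the tighter of the two covering accumulators.
            nu = min(row_acc[i], col_acc[j]) + g[i][j] ** 2
            if nu > 0.0:  # convention: 0 / 0 is treated as 0
                w[i][j] -= lr * g[i][j] / math.sqrt(nu)
            # Each accumulator keeps the max estimate over the entries it covers.
            new_row[i] = max(new_row[i], nu)
            new_col[j] = max(new_col[j], nu)
    return w, new_row, new_col


w = [[1.0, 1.0], [1.0, 1.0]]
g = [[1.0, 0.0], [0.0, 2.0]]
w, rows, cols = sm3_step(w, g, row_acc=[0.0, 0.0], col_acc=[0.0, 0.0])
print(w, rows, cols)
```

For a layer with m rows and n columns this stores m + n numbers instead of m * n, which is the point for very large embedding and projection matrices.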

203

rohan anil · Jan 2, 2025 · 6:57 AM UTC

rohan anil

@_arohan_

2 Jan 2025

Ilya asks, scaling what? The common answer is inference-time compute. I think answers can include: 1. Architectures that are more expressive and thus more expensive at inference time 2. Objectives and optimization methods that reduce negative transfer across tasks and modalities but require more expensive steps. Just thinking about 1. and 2. leads to better methods.

197

32,332

rohan anil · Jun 18, 2025 · 2:43 AM UTC

rohan anil

@_arohan_

18 Jun 2025

First day at Ant went well!

201

13,262

rohan anil · Sep 13, 2025 · 6:18 PM UTC

rohan anil

@_arohan_

13 Sep 2025

I was curious to check! There is a huge increase in engagement via image editing largely outside US. Not math and not homeworks.

Rihard Jarc

@RihardJarc

13 Sep 2025

Wow, $GOOGL Gemini has overtaken ChatGPT in top downloads in the US on iOS. Something to keep an eye on as ChatGPT has dominated the standings for months now. $GOOGL execution and product shipment showing results…

196

71,678

rohan anil · Nov 1, 2025 · 2:09 AM UTC

rohan anil

@_arohan_

1 Nov 2025

Dropping a bit of Lore on this halloween that I got reminded of. Before the first TPU was taped out there was mostly async training of neural nets at Google production. The team was genuinely worried that sync training would be bad and there was a team considering figuring out how to add corruption/ asyncness so that models would converge (theory was that noise helps convergence) and was then disproven by data.

200

25,298

rohan anil · Sep 16, 2022 · 4:27 PM UTC

rohan anil

@_arohan_

16 Sep 2022

I completely missed the Parallel Layers used in PaLM. It makes training 15% faster at larger scale. Mainly, run MLP and Attention together! Thanks @achowdhery for pointing this out to me! The savings in compute are quite substantial.
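For reference, the difference between the standard serialized Transformer block and the parallel formulation can be written as follows, using the notation of the PaLM paper as I recall it ($x$ is the block input, $y$ the block output):

```latex
\begin{aligned}
\text{serialized:}\quad y &= x + \mathrm{MLP}\bigl(\mathrm{LayerNorm}\bigl(x + \mathrm{Attention}(\mathrm{LayerNorm}(x))\bigr)\bigr), \\
\text{parallel:}\quad y &= x + \mathrm{MLP}(\mathrm{LayerNorm}(x)) + \mathrm{Attention}(\mathrm{LayerNorm}(x)).
\end{aligned}
```

In the parallel form both branches read the same normalized input, so their input matrix multiplications can be fused, which is where the speedup comes from.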

193

rohan anil · Oct 5, 2022 · 3:47 PM UTC

rohan anil

@_arohan_

5 Oct 2022

“For example, if the traditional algorithm taught in school multiplies a 4x5 by 5x5 matrix using 100 multiplications, and this number was reduced to 80 with human ingenuity, AlphaTensor has found algorithms that do the same operation using just 76 multiplications.”

Google DeepMind

@GoogleDeepMind

5 Oct 2022

Today in @Nature: #AlphaTensor, an AI system for discovering novel, efficient, and exact algorithms for matrix multiplication - a building block of modern computations. AlphaTensor finds faster algorithms for many matrix sizes: dpmd.ai/dm-alpha-tensor & dpmd.ai/nature-alpha-tensor 1/
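The "100 multiplications" baseline in the quote is just the schoolbook count m * n * p for a 4x5 by 5x5 product. A small sketch that counts scalar multiplications makes this concrete; the function name `schoolbook_matmul` is illustrative, and the 80 and 76 figures come from the quote and are not reproduced here.

```python
def schoolbook_matmul(a, b):
    """Multiply matrices given as lists of rows, counting scalar multiplications."""
    m, n, p = len(a), len(b), len(b[0])
    count = 0
    out = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                out[i][j] += a[i][k] * b[k][j]
                count += 1  # one scalar multiplication per (i, j, k) triple
    return out, count


a = [[1] * 5 for _ in range(4)]  # 4x5
b = [[1] * 5 for _ in range(5)]  # 5x5
product, mults = schoolbook_matmul(a, b)
print(mults)  # 4 * 5 * 5 = 100
```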

183

rohan anil · Sep 6, 2024 · 5:11 AM UTC

rohan anil

@_arohan_

6 Sep 2024

Gemini ♊️ is most used LLM today on openrouter.ai/rankings

186

22,313

rohan anil · Jul 22, 2023 · 3:46 AM UTC

rohan anil

@_arohan_

22 Jul 2023

Some excellent work by @jeankaddour and colleagues arxiv.org/abs/2307.06440 “We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate” ☠️

Jean Kaddour

@jeankaddour

17 Jul 2023

Replying to @_arohan_

Our arxiv preprint might be of interest to you: arxiv.org/abs/2307.06440

176

123,902

rohan anil · Jan 10, 2025 · 7:13 PM UTC

rohan anil

@_arohan_

10 Jan 2025

This seems like a massive improvement! openreview.net/forum?id=r8J3… The idea is to encode recency bias in a data-dependent way: multiply your attention prob with (1-p(k)) for all k between the query and key tokens. So if there is a dominant attention prob between the tokens, it will weigh down the attention prob. Another way to think about it is that it breaks ties by preferring recent tokens. There are likely improvements to be built on top, like making the implementation more efficient. Stick-breaking attention, thanks to @Grad62304977 for bringing it to our attention. Length generalization 🪄
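The rule described above can be sketched for a single query. This is a naive illustration, assuming p(k) is a sigmoid of the query-key logit; the function name `stick_breaking_weights` is made up here, and an efficient implementation would work in log space across all queries at once.

```python
import math


def stick_breaking_weights(logits):
    """Attention weights for one query over earlier keys, ordered oldest to newest.

    Each key gets beta = sigmoid(logit); its weight is beta times the product of
    (1 - beta) over every key that sits between it and the query.
    """
    betas = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    weights = [0.0] * len(betas)
    remaining = 1.0  # product of (1 - beta) over keys closer to the query
    for j in range(len(betas) - 1, -1, -1):  # walk from the most recent key backwards
        weights[j] = betas[j] * remaining
        remaining *= 1.0 - betas[j]
    return weights


print(stick_breaking_weights([0.0, 0.0, 0.0]))   # ties broken toward recent keys
print(stick_breaking_weights([0.0, 10.0, 0.0]))  # a dominant middle key suppresses the oldest
```

With equal logits the weights halve going back in time, and a dominant key in the middle almost zeroes out everything older than it. Note that the weights need not sum to one.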

174

17,622

rohan anil · Dec 13, 2024 · 10:29 PM UTC

rohan anil

@_arohan_

13 Dec 2024

Prediction: People say pretraining will end, and I think everyone will be surprised how many multipliers we can squeeze from existing data through all kinds of algorithms.

174

31,686

rohan anil · Aug 23, 2025 · 9:26 PM UTC

rohan anil

@_arohan_

23 Aug 2025

It’s the best time to be a researcher. So many open problems, a lot of compute opening up, and the impact of your work directly impacts the pace of your research work!

182

22,511

rohan anil · Jul 13, 2025 · 4:17 PM UTC

rohan anil

@_arohan_

13 Jul 2025

Fascinating 1. Use QK cosine sim to predict similarity at byte level 2. Use similarity to chunk bytes based on threshold. 3. Use an encoder 4. Scatter back + linear attention to form decoder layers (in reverse order) 5. Decoder layers for reconstruction loss

Sukjun (June) Hwang

@sukjun_hwang

11 Jul 2025

Replying to @sukjun_hwang

H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
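Steps 1 and 2 of the summary above (cosine similarity, then thresholded chunking) can be sketched as below. The exact boundary rule is an assumption on my part: a position starts a new chunk when 0.5 * (1 - cos(q_t, k_{t-1})) reaches a threshold. The names `cosine` and `chunk_boundaries` are illustrative, and the smoothing module, encoder and decoder are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def chunk_boundaries(queries, keys, threshold=0.5):
    """Return the indices that start a new chunk.

    Position t starts a chunk when its query is dissimilar to the previous key,
    i.e. when 0.5 * (1 - cos(q_t, k_{t-1})) >= threshold. Position 0 always starts one.
    """
    boundaries = [0]
    for t in range(1, len(queries)):
        p = 0.5 * (1.0 - cosine(queries[t], keys[t - 1]))
        if p >= threshold:
            boundaries.append(t)
    return boundaries


# Toy byte-level features: the direction flips at positions 2 and 4.
feats = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
print(chunk_boundaries(feats, feats))  # [0, 2, 4]
```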

172

11,591

rohan anil · Jan 25, 2025 · 5:16 PM UTC

rohan anil

@_arohan_

25 Jan 2025

My well-calibrated take on DeepSeek: First, most closed labs don’t mention the compute used or the sizes of the models they pre-train, so it’s hard to compare, and one has to use closed API prices for comparison. DeepSeek trained a good model that’s open weights and largely open science (building on top of others) while innovating on MoEs and their take on MLA. And this isn’t the final frontier of efficiency ;)

162

14,775

rohan anil · Aug 2, 2024 · 6:40 PM UTC

rohan anil

@_arohan_

2 Aug 2024

Noam is back!

168

42,209

rohan anil · Jul 26, 2025 · 6:22 PM UTC

rohan anil

@_arohan_

26 Jul 2025

A Saturday reminder to all new followers that Shampoo stands for a preconditioner. It’s called Shampoo because that's what comes pre/before using a conditioner.

165

21,375

rohan anil · Sep 7, 2022 · 4:39 PM UTC

rohan anil

@_arohan_

7 Sep 2022

Transformer paper should get half a decade test of time award for completely transforming the industries and what people work on.

161

rohan anil · Dec 21, 2024 · 7:59 AM UTC

rohan anil

@_arohan_

21 Dec 2024

One thing that stood out to me was “Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set” Nothing wrong with this in terms of results achieved but isn’t the timeline now comparing apples and oranges if other llms were never tuned against this set? Does tuning result in losing other abilities?

François Chollet

@fchollet

20 Dec 2024

Replying to @fchollet

My full statement here: arcprize.org/blog/oai-o3-pub…

160

44,461

rohan anil · Oct 8, 2025 · 4:11 AM UTC

rohan anil

@_arohan_

8 Oct 2025

I have been at Ant the same number of days that I was at Meta which is less than 1/36th of my time at Google. Completely different experience!

161

24,439

rohan anil · Dec 22, 2024 · 2:32 PM UTC

rohan anil

@_arohan_

22 Dec 2024

Today is 10th anniversary of Adam paper on arxiv! Even though Shampoo is far better than Adam, it’s undeniable how good Adam is with respect to simplicity.

160

10,483

rohan anil · Oct 17, 2020 · 2:47 AM UTC

rohan anil

@_arohan_

17 Oct 2020

Tinker with this visualization here for training neural networks with noise added in the dataset. Made with tensorflow.js and inspired by neural network playground. 👇 google.github.io/bi-tempered…

147

rohan anil · Jul 14, 2023 · 4:14 PM UTC

rohan anil

@_arohan_

14 Jul 2023

Arrived to these shores @ 2010 Greencard @ 2023 ✅

155

30,612

rohan anil · Jun 17, 2025 · 6:51 PM UTC

rohan anil

@_arohan_

17 Jun 2025

Years of optimization and training instability research finally paying off it seems!

elie

@eliebakouch

17 Jun 2025

Pre-training is not dead

152

16,642

rohan anil · Apr 15, 2023 · 6:01 PM UTC

rohan anil

@_arohan_

15 Apr 2023

10 years ago I left working on iOS communicator at MSFT to work on machine learning at Google, without much connections or a doctoral degree for that matter. Crazy how time flies! And due to a bunch of lucky breaks, very thankful to be doing ML things at Google 🧠

150

49,436

rohan anil · Dec 12, 2024 · 4:46 PM UTC

rohan anil

@_arohan_

12 Dec 2024

I was reflecting over the last several months, and I think sharing scientific knowledge and advancing open models aligns more with my values on shared progress and principles of collaboration.

150

11,559

rohan anil · Oct 8, 2025 · 5:22 PM UTC

rohan anil

@_arohan_

8 Oct 2025

Since folks are discussing Infra, it is not about models per se, it’s about agency. Two incidents that I fondly remember: covid happened and Meet was slow, so a senior engineer and a friend decided to take it into their own hands, profiling things and making it better. They did a quick set of improvements that dramatically improved performance. Second was XLA taking a lot of time to compile. I was having a skill issue and complaining in a chat room because I was a bit tired. The same engineer ended up diving into the code and making everything faster by 25x. This motivated the XLA team to continue doing this, although the actual changes were pretty straightforward: (a) profile and (b) optimize. Good infrastructure is likely carried by high-agency individuals who just do things. I don’t think it’s about coding models. High agency probably comes from being at the top of Maslow’s pyramid of needs.

150

19,806

rohan anil · Mar 5, 2025 · 9:51 PM UTC

rohan anil

@_arohan_

5 Mar 2025

This was one of my last projects at GDM: training one of the most inference-efficient models that scales to billions, with an amazing crew. I just aged a lot while keeping the model alive.

Dan Zhang @ ICLR @DZhang50

5 Mar 2025

Intelligence too cheap to meter

151

19,371

rohan anil · Sep 13, 2025 · 12:13 AM UTC

rohan anil

@_arohan_

13 Sep 2025

Previous employer sending patents to sign is a leak that stuff worked really well.

150

25,782

rohan anil · Oct 24, 2025 · 3:22 AM UTC

rohan anil

@_arohan_

24 Oct 2025

Maybe a good time to post my limited edition TPU star jacket for bringing up the first version of TPU used for training for a Google prod model training and inference.

144

12,046

rohan anil · Nov 29, 2024 · 5:10 PM UTC

rohan anil

@_arohan_

29 Nov 2024

Learning to learn - architecture (already done) - optimizer (already done) - sampling from an auto regressive model? *inspired by entropix but why not just automate all the heuristics.

142

12,775

rohan anil · Mar 31, 2021 · 12:19 AM UTC

rohan anil

@_arohan_

31 Mar 2021

MADGRAD: 76.22% Shampoo: 77.8% github.com/google-research/g…

AI at Meta

@AIatMeta

30 Mar 2021

We're introducing an optimizer for deep learning, MADGRAD. This method matches or exceeds the performance of the Adam optimizer across a varied set of realistic large-scale deep learning training problems. github.com/facebookresearch/…

139

rohan anil · Apr 6, 2025 · 12:06 AM UTC

rohan anil

@_arohan_

6 Apr 2025

Replying to @jeremyphoward

We are on your side and still cooking 🍳 Scout fits on a single H100 80G and packs so much, and we do understand the need for models on a 4090. We will continue to squeeze intelligence into an even tinier form factor for the community. Deal?

145

13,980

rohan anil · Oct 7, 2025 · 4:57 PM UTC

rohan anil

@_arohan_

7 Oct 2025

Really enjoyed reading the paper that trains a tiny model that achieves high arc-“agi” scores. Loop(x, y-embed, z-embed): x is the input, y is the prediction, and z is memory initialized to zero. The gradient comes from the last step of the loop (a bit nuanced, as there are inner loops). Weights are shared across loops. Use a prediction head to see if the prediction is correct and stop the loop; otherwise repeat the loop (i.e. an outer loop of up to 16 iterations; detach gradients so y, z are updated for the final step only).

Alexia Jolicoeur-Martineau @jm_alexia

7 Oct 2025

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/2… Code: github.com/SamsungSAILMontre… Paper: arxiv.org/abs/2510.04871

141

24,064
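The loop structure described in the summary above can be sketched as a toy in plain Python. This is only an illustration of the control flow (shared update, zero-initialized memory, a halting check, a cap of 16 outer steps), not the TRM code: `shared_step`, `halt_head`, `recursive_refine`, the scalar state, and the residual-based update are all hypothetical. Inner loops are omitted, and because there is no autodiff here, the gradient detaching between outer steps appears only as a comment.

```python
def shared_step(x, y, z):
    """One outer step; the same function (same 'weights') is reused every time.

    Toy update: the memory z is recomputed as the current residual, then the
    prediction y moves halfway toward the input x.
    """
    z = x - y
    y = y + 0.5 * z
    return y, z

def halt_head(z, tol):
    """Stand-in for a learned halting head: stop once the memory signal is small."""
    return abs(z) < tol

def recursive_refine(x, max_outer_steps=16, tol=0.1):
    y, z = 0.0, 0.0  # prediction and memory both start at zero
    for step in range(1, max_outer_steps + 1):
        # With autodiff, y and z would be detached here so that only the
        # final step contributes gradients.
        y, z = shared_step(x, y, z)
        if halt_head(z, tol):
            break
    return y, step

y, steps = recursive_refine(8.0)
print(y, steps)  # 7.96875 8
```

With a tolerance that is never met, the loop simply runs to the cap, which mirrors the "repeat up to 16 times" behaviour in the summary.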

rohan anil · Sep 20, 2025 · 9:15 PM UTC

rohan anil

@_arohan_

20 Sep 2025

Most of my contribution to deep learning: starting from codistillation, sm3, distributed shampoo, locoprop, ngrammers were when I was on h1b.

143

14,166

rohan anil · May 15, 2025 · 6:24 AM UTC

rohan anil

@_arohan_

15 May 2025

Someone passed this wisdom to me today. Deep learning techniques working vs not working is two devils - your prior about the technique - your attention to details about implementation of the technique Need both to make it work.

Hieu Pham

@hyhieu226

14 May 2025

Deep learning is ~10% idea and ~90% implementation.

142

15,119

rohan anil · Jun 26, 2025 · 5:11 PM UTC

rohan anil

@_arohan_

26 Jun 2025

One not so nice behavior due to the gold rush and closed nature of labs is everyone claiming on socials or their website to have led X and Y on closed models. It feels a bit unsettling/cringe.

143

23,955

rohan anil · Feb 9, 2023 · 4:30 AM UTC

rohan anil

@_arohan_

9 Feb 2023

Adam is the past? arxiv.org/abs/2302.03764

Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions

Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find...

arxiv.org

135

79,253

rohan anil · Nov 17, 2024 · 5:38 AM UTC

rohan anil

@_arohan_

17 Nov 2024

Lol are we now tweeting about gradient spikes in training runs or some such. I think people may pay to watch training loss curves go down at this rate.

Andrew Curran

@AndrewCurran_

17 Nov 2024

Nothing confirmed, but rumors tonight that Grok 3 suffered some sort of catastrophic incident yesterday during its training. Hopefully not true. There were similar rumors about the Opus 3.5 delay, nothing was ever confirmed there either

134

19,531

rohan anil · Sep 11, 2025 · 6:46 PM UTC

rohan anil

@_arohan_

11 Sep 2025

Thinky machines post on non determinism also made me think of non reproducibility of neural network training that was very well studied at google for ads models.

139

14,831

rohan anil · Aug 26, 2025 · 6:49 PM UTC

rohan anil

@_arohan_

26 Aug 2025

It’s pretty cool to hear about someone actually reading a paper of someone else and then talking to them and giving them a job offer on the spot. Really legitimate way to hire.

133

11,286

rohan anil · Dec 6, 2023 · 5:51 PM UTC

rohan anil

@_arohan_

6 Dec 2023

Gemini Nano improves on the efficiency frontier. The models are multimodal as well; see results in the paper. Nano series: at 1.8B and 3.25B parameters, they pack so much to provide high utility on device. First foundation model on the device! android-developers.googleblo…

Sundar Pichai

@sundarpichai

6 Dec 2023

Replying to @sundarpichai

Gemini Nano is super efficient for tasks that are on-device. Android developers can sign up for an early access program for Gemini Nano via Android AICore and Pixel 8 Pro users can already see it rolling out in features like Summarize in Recorder and Smart Reply in Gboard + much more to come! blog.google/products/pixel/p…

130

57,059

rohan anil · Dec 17, 2022 · 7:48 PM UTC

rohan anil

@_arohan_

17 Dec 2022

Next big jump with Neural Network performance is going to happen when community embraces non-uniformity Eg, stacking of identical layers has become ingrained within our tools and mindsets.

131

68,951

rohan anil · Jan 2, 2025 · 5:46 AM UTC

rohan anil

@_arohan_

2 Jan 2025

2025 off to a good start. Per step time needs a bit more work. Getting used to new setup otherwise I think we have improved on Shampoo.

133

17,057

rohan anil · Jun 2, 2024 · 11:23 PM UTC

rohan anil

@_arohan_

2 Jun 2024

Bit disappointed that my timeline has very few arxiv paper link / explanation threads. There used to be a lot more in the past.

129

31,170

rohan anil · Oct 7, 2023 · 2:30 AM UTC

rohan anil

@_arohan_

7 Oct 2023

Gen AI on-device? A foundation model on the phone? Imagine an entire operating system level unlock of capabilities: Well Pixel 8 Pro will have it. Rick announced it here: piped.video/watch?v=pxlaUC… The model was trained with several algorithmic breakthroughs by our team to make it possible!

#MadeByGoogle ‘23: Keynote

Watch the #MadeByGoogle '23 event and get to know the new #Pixel8, ...

youtube.com

132

51,000

rohan anil · May 21, 2025 · 12:04 AM UTC

rohan anil

@_arohan_

21 May 2025

My last project at GDM was getting Nano-3 spec’ed with this gang. I am very happy that matformers, per layer embeddings and various innovations are openly published here!

Aditya Kusupati @adityakusupati

20 May 2025

Pocket powerhouse amidst I/O awesomeness! Gemma 3n E4B & E2B are insane models, optimized for on-device while rivaling frontier models. It's a 🪆Matryoshka Transformer (MatFormer)🪆: Natively elastic b/w 4B & 2B pareto-optimally! ⭐️: free models with ZERO training cost! 🧵👇

129

9,472

rohan anil · Nov 27, 2024 · 5:32 AM UTC

rohan anil

@_arohan_

27 Nov 2024

There is pre-training and post training - everybody skips training I suppose.

123

8,774

rohan anil · Jul 13, 2025 · 8:21 PM UTC

rohan anil

@_arohan_

13 Jul 2025

Will you all get mad if I say Muon is Shampoo?

132

21,633

rohan anil · Oct 29, 2024 · 8:34 PM UTC

rohan anil

@_arohan_

29 Oct 2024

I have two kids. The 2nd one is slightly older than 6 months old. I have been finding he is ramping up on his capabilities faster than my first one. Then it struck me, he is distilling better from his sister whenever they are together since they are likely implementing similar circuits. He is more observant to her than to us likely for this reason.

126

18,692

rohan anil · Jun 3, 2025 · 9:01 PM UTC

rohan anil

@_arohan_

3 Jun 2025

I want to say that models are now better than me in technical subjects that I had spent years learning. As long as I can ask the right questions, I am able to delegate the inner loop of reasoning to LLMs.

Rohan Paul

@rohanpaul_ai

3 Jun 2025

Yann LeCun is not so optimistic about LLMs. 😟 "We are not going to get to human level AI by just scaling up LLMs. This is just not going to happen. There's no way. Okay, absolutely no way. And whatever you can hear from some of my uh more adventurous colleagues, it's not going to happen within the next two years. There's absolutely no way in hell. You know the idea that we're we're going to have you know a country of genius in a data center that's complete BS. There's absolutely no way what we're going to have maybe is systems that are trained on sufficiently large amounts of data that any question that any reasonable person may ask will will find an answer through those systems and it would feel like you have a PhD sitting next to you but it's not a PhD you have next to you. It's system with a gigantic memory and retrieval ability not not a system that can invent solutions to to new problems, which is really what a PhD is. " -------- From "Alex Kantrowitz" YT Channel (full video link in comment)

127

12,512