Associate Professor CS/stats UC Berkeley. Former Research Scientist at Google DeepMind. ML/AI Researcher working on LLMs and deep learning. PhD at Stanford.

Berkeley, CA
Gd dominates ridge regression but not sgd! Surprising that such simple things aren't known for linear regression
6
9
140
132,396
Why... Who even uses perplexity
If Apple buys Perplexity, that would be its biggest ever acquisition
682
67
5,419
985,126
Answer: model is complete junk, it's a hallucination machine. Overfit to reasoning benchmarks and has absolutely zero recall ability
Is the gpt-OSS stronger than qwen or kimi or Chinese open models?
82
147
3,734
641,268
Wtf does thinking machines lab even do??
BREAKING: Mira Murati's Thinking Machines Lab raised $2 billion at $10 billion valuation we are so back.
365
77
2,419
853,141
O1-pro is pretty useless for research work. It runs for nearly 10 min per prompt and either 1) freezes, 2) doesn't follow the instructions and returns some bs, or 3) just makes some simple error in the middle that's hard to find. @OpenAI @sama @markchen90 refund me my $200
115
87
1,836
417,956
Such bs. The majority of math majors or even math phds and faculty could not win a gold in imo.
Replying to @GaryMarcus
What most people don't realize is that IMO (and IOI, though to a different extent) aren't particularly hard. They're aimed at high schoolers, so anyone with decent uni education should be able to solve most of them.
28
26
1,168
139,367
Canceled my chatgpt pro subscription
90
13
697
80,230
Will be at ICML starting Wed. I am open to any offers of 100m+.
17
8
714
56,123
Returning to the bay area after 9 years! Moving to UC Berkeley in statistics and eecs! It's been a great 6 years at Princeton, but I am so happy to be back! Finally there will be edible sushi!
71
7
708
39,177
Proud to have not contributed to Gpt-oss.
26
11
680
94,429
Finally got a career! On the nth try, for very large n.
53
4
602
Please pay me 100m to convert papers like openreview.net/pdf?id=3zKtaq… to blogposts! @agarwl_
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…
19
16
576
121,091
All icml rejected!
22
7
516
I know all 3. Will work for 99m
See below on what Zuckerberg is looking for in star recruits worth $100m pay packages for Meta’s plans in Artificial Intelligence. But weren’t some people saying calculus is no longer useful in the AI age? 🤔
19
9
512
51,723
Predicting What You Already Know Helps: Provable Self-Supervised Learning We analyze how predicting parts of the input from other parts (missing patch, missing word, etc.) helps to learn a representation that linearly separates the downstream task. arxiv.org/abs/2008.01064 1/2
3
100
499
In the new gpt 5.1, the chat interface defaults to the router. If I start a thread with 5 pro, on the second interaction I have to remember to select 5 pro every single time. This is super annoying and really ruins the workflow. If I'm using 5 pro, I probably will use it for entire conversation. @OpenAI not that I expect a response, for 2400 a year we get almost no support.
47
10
445
99,208
Why was this surprising?! Last year AlphaProof was 1 point from gold, which is def just noise. I could have rerun last year's model on this year's problems and with decent probability won gold.
I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks
23
11
432
65,983
Great videos! I learned so much. The assignments are too hard for me.
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
7
18
412
61,193
Meta come poach me!
Scoop: Meta has poached three OpenAI researchers: Lucas Beyer, Alexander Kolesnikov and Xiaohua Zhai, according to people familiar with the matter. An OpenAI spox confirmed the three have left the company.
17
3
390
62,991
Just 1 gigawatt? Others doing tens of gigawatts
Today, we announced that we plan to expand our use of Google TPUs, securing approximately one million TPUs and more than a gigawatt of capacity in 2026.
31
2
355
93,630
Can someone explain the billion update rules here? What are the desiderata and what are the tradeoffs?
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Ki… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length. Key highlights: 🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule. 🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board. 🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints. The future of agentic-oriented attention is here! 💡
18
20
336
56,358
I always get frustrated when asked what ML theory is good for and people ask for specific examples. I find this question unfair; I think it's really just that having a theory/mathematical perspective is sometimes super helpful. E.g. diffusion models and their relatives: I don't see how you can come up with them without at least some theory training. Does it count as ML theory? Maybe not quite; the original papers didn't show any results like sampling in poly time/samples or whatever. But it is still an example of why we should learn some math/theory.
Replying to @QuanquanGu
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
12
12
330
133,493
This is a well-known technique, right? What's new here??
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…
27
10
325
96,879
At least the GDM imo proofs are readable!
5
8
310
26,935
I didn't get this talk at all. Why does good compression, eg near kolmogorov complexity imply that it's a good learner??
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: - Kolmogorov compressor is the theoretical shortest-length program that produces a dataset. SGD is a practical approximation of the Kolmogorov search that finds an implicit program embedded in the weights of a soft computer, i.e. big Transformers. - Unsupervised learning is about computing the conditional Kolmogorov complexity of a target dataset given an unlabelled corpus, i.e. K(Y|X) - Theory tells us that optimizing for K(X, Y), the joint complexity, is as good as K(Y|X). So simply throw all data into the mix, and "just compress everything". - Joint compression is maximum likelihood over the giant concatenated dataset. - Ilya cites iGPT, Chen et al. 2020, to illustrate the ideas. iGPT is an image compressor that learns to predict the next pixel using a 1D sequence model. This is a phenomenal lecture, very accessible, and sometimes quite entertaining. YouTube: piped.video/watch?v=AKMuA_TV… Lecture page: simons.berkeley.edu/talks/il…
38
17
261
251,930
LLMs can learn long reasoning and composition tasks. The key is data mixture. Must have a mixture of short reasoning chains and long reasoning chains! One of the rare theory papers that directly sheds light on how data should be designed.
LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)
7
36
257
28,242
Optimization is real math too.
This is an unwise statement that can only make people confused about what LLMs can or cannot do. Let me tell you something: Math is NOT about solving this kind of ad hoc optimization problems. Yeah, by scraping available data and then clustering it, LLMs can sometimes solve some very minor math problems. It's an achievement, and I applaud you for that. But let's be honest: this is NOT the REAL Math. Not by 10,000 miles. REAL Math is about concepts and ideas - things like "schemes" introduced by the great Alexander Grothendieck, who revolutionized algebraic geometry; the Atiyah-Singer Index Theorem; or the Langlands Program, tying together Number Theory, Analysis, Geometry, and Quantum Physics. That's the REAL Math. Can LLMs do that? Of course not. So, please, STOP confusing people - especially, given the atrocious state of our math education. LLMs give us great tools, which I appreciate very much. Useful stuff! Go ahead and use them AS TOOLS (just as we use calculators to crunch numbers or cameras to render portraits and landscapes), an enhancement of human abilities, and STOP pretending that LLMs are somehow capable of replicating everything that human beings can do. In this one area, mathematics, LLMs are no match to human mathematicians. Period. Not to mention many other areas. Calling on my friend @ericweinstein and @GaryMarcus, who has been one of the few sane expert voices on these matters lately. 🙏 h/t @hellheff
12
7
256
31,524
There will soon also be a PSA of why muon is not a second order optimizer and how all the fancy manifold math is irrelevant. I am sick of people using math intimidation to make their methods sound fancy to bs VCs.
Proof by picture of why lr convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.
10
10
245
31,231
Extremely happy with this result! Mechanistic Understanding of how Transformers Learn Causal Structure!
Causal self-attention encodes causal structure between tokens (eg. induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! arxiv.org/abs/2402.14735 (1/10)
1
20
236
29,738
I managed to prompt gpt-5-thinking into proving the tight 1.75/L, matching v2 of the arxiv paper. From the arxiv paper, it was clear that this problem is perfect for the PEP framework. I told gpt to do a search over the coefficients for combining cocoercivity at different pairs of points. Going up to 9 coefficients, it found the solution using sympy (I told it to use a symbolic solver). We can also be pretty certain the authors used PEP to find the magical combination of coefficients.
Claim: gpt-5-pro can prove new interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof it's correct. Details below.
11
24
240
52,884
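For readers unfamiliar with the "combining cocoercivity at different pairs of points" step, here is a rough sketch of the inequalities such a search combines. It assumes PEP stands for the performance estimation problem framework and that f is convex and L-smooth. The multipliers \lambda_{ij} are illustrative notation, not the coefficients found in the thread.

```latex
% For iterates x_i with f_i = f(x_i) and g_i = \nabla f(x_i),
% convexity plus L-smoothness gives, for every ordered pair (i, j):
\[
  Q_{ij} \;:=\; f_i - f_j - \langle g_j,\, x_i - x_j \rangle
          - \frac{1}{2L}\,\lVert g_i - g_j \rVert^2 \;\ge\; 0 .
\]
% Adding Q_{ij} and Q_{ji} cancels the function values and yields cocoercivity:
\[
  \langle g_i - g_j,\, x_i - x_j \rangle \;\ge\; \frac{1}{L}\,\lVert g_i - g_j \rVert^2 .
\]
% A PEP-style certificate is a choice of multipliers \lambda_{ij} \ge 0 such that
\[
  \sum_{i \ne j} \lambda_{ij}\, Q_{ij} \;\ge\; 0
\]
% rearranges, after substituting the gradient-descent update
% x_{k+1} = x_k - \eta\, g_k, into the target inequality.
```

Searching over the \lambda_{ij} symbolically is then a finite algebra problem, which is why a symbolic solver is a natural tool for it.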
But dumber. So much math slop tonight.
Gpt 5 pro super fast tonight
16
5
213
66,432
I could decide in 6 seconds
> be zucc > makes an offer you can’t refuse > $100 million signing bonus > $100 million base salary > with bonus up to $300 million a year “offer expires in 6 hours”
9
2
206
28,547
A Chinese-American hero that inspired my parents' generation to study and stay in America.
Prof. Chen Ning Yang, a world-renowned physicist, Nobel Laureate in Physics, Academician of the Chinese Academy of Sciences, Professor at Tsinghua University, and Honorary Director of the Institute for Advanced Study at Tsinghua University, passed away in Beijing due to illness at the age of 103. His life stands as a timeless chapter in human history—one that shines not only for China but for the global community of thinkers and innovators. His legacy will live on forever.
2
10
206
24,572
I strongly dislike the term inductive bias. It sounds like jargon to me, especially when we use it like "the inductive bias allows it to learn well on domain X". Translation: we don't understand why it learns well on domain X, but it beats the competitors! Must be inductive bias!
Replying to @PreetumNakkiran
“What is the inductive bias of XX” is a fancy way of asking “on which distributions/tasks does XX work well?”
13
19
197
I want to remind everyone that disabilities may also be invisible. Your colleagues, group members, students, postdocs, may be going through this. I am not an eloquent person, so WE NEED TO PAY MORE ATTENTION TO THE DISABLED AND THEIR ACCOMMODATION
Being a #disabled junior researcher in #AI comes at a massive price; when your disabilities flare, you are on your own: there is neither medical insurance nor salary for you during this difficult time This is a very important aspect that needs our attention #Academia #Insecurity
1
26
187
62,706
Any suggestions for lecture notes /videos/short books on stochastic calculus or sdes? Looking for something operational, not rigorous
21
19
189
155,186
How do physicists learn stochastic calculus? There is no way they spend a month defining Brownian motion/the Itô integral.
Any suggestions for lecture notes /videos/short books on stochastic calculus or sdes? Looking for something operational, not rigorous
29
10
168
86,950
Our new work on scaling laws that includes compute, model size, and number of samples. The analysis involves an extremely fine-grained analysis of online sgd built up over the last 8 years of understanding sgd on simple toy models (tensors, single index models, multi index model)
Excited to announce a new paper with Yunwei Ren, Denny Wu, @jasondeanlee! We prove a neural scaling law in the SGD learning of extensive width two-layer neural networks. arxiv.org/abs/2504.19983 🧵below (1/10)
10
16
170
97,794
Replying to @techikansh
Not my math problems.
3
157
24,614
How do you evaluate the correctness of the mathematical reasoning chain? Take a look!
What is "good" reasoning and how to evaluate it? 🚀We explore a new pipeline to model step-level reasoning, a “Goldilocks principle” that balances free-form CoT and LEAN! Led by my student @yuanhezhang6, in collaboration with Ilja from DeepMind, @jasondeanlee, @CL_Theory
2
15
152
25,249
I won!!! Much of the proposal is based off work with @AlexDamian and @tengyuma
📣Announcing… this year’s ONR Young Investigator Program (YIP) recipients! #ONRYIP onr.navy.mil/Media-Center/Pr…
20
147
Replying to @doe1478725
No 200/month
1
1
145
5,931
Proof by picture of why lr convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.
🎯 Just released a new preprint that proves LR transfer under μP. -> The Problem: When training large neural networks, one of the trickiest questions is: what learning rate should I use? [1/n]🧵 Link: arxiv.org/abs/2511.01734
7
10
145
58,682
Gpt searched for existing solutions in the literature. It did not solve them itself.
Update: Mehtaab and I pushed further on this. Using thousands of GPT5 queries, we found solutions to 10 Erdős problems that were listed as open: 223, 339, 494, 515, 621, 822, 883 (part 2/2), 903, 1043, 1079. Additionally for 11 other problems, GPT5 found significant partial progress that we added to the official website: 32, 167, 188, 750, 788, 811, 827, 829, 1017, 1011, 1041. For 827, Erdős's original paper actually contained an error, and the work of Martínez and Roldán-Pensado explains this and fixes the argument. The future of scientific research is going to be fun.
Community note
GPT-5 did not solve those Erdos problems. It only "found" solutions in the sense of finding existing published literature that solved the problems. Here is an explanation from the maintainer of erdosproblems.com: nitter.app/thomasfbloom/s
2
4
139
18,477
Why are official announcements posted to medium? I click this link and can't read the article. Instead I get something about paying to subscribe to medium to read...
The #NeurIPS2021 paper submission deadline has been extended by 48 hours. The new deadline is Friday, May 28 at 1pm PT (abstracts due Friday, May 21). Read the official announcement to learn more. link.medium.com/mJTaFjBeggb
8
2
130
Absolutely insane
Several of my team members + myself are impacted by this layoff today. Welcome to connect :)
3
1
130
39,277
He's terrible. Screwed my buddy sgd.
Anyone knows adam?
5
5
135
11,476
I implemented Adam and ran it on a new dataset/arch. Blogpost coming soon! My implementation will also be available in the fiddler api.
3
2
127
12,686
I think I just submitted over 100 ref letters...
2
2
117
This is such cope. True that they gave almost no information in the release but it's still a super hard competition
1
113
8,535
Gpt 5 pro working much better for me this morning. Last night was a disaster
8
2
115
9,185
Exactly. Many of the matrix preconditioning methods, people call 2nd order, but really are much closer to first order. Eg muon, shampoo, etc
Second-order methods and preconditioner-based methods are **NOT** the same. Please stop using them interchangeably!
9
3
110
18,490
Haozhe's paper is worth a read, really nice use of fixed point theorems. The new one about 1-to-1 seems almost immediate though. Just from reading the thread, I would guess the proof is as follows: Say your input space is discrete, indexed by j \in [n] and represented by x_j. The embedding is E: [n] \to R^d. "For almost all E", the E x_j are distinct. Then a transformer f is composed of building blocks that are analytic; compositions of analytic functions are analytic, and analyticity is also preserved under many algebraic operations (o-minimal stuff). An analytic function is either the zero function or vanishes only on a measure-zero set (it can't be zero on a set of positive measure). Thus so long as f \neq 0 uniformly, this should be injective (not necessarily bijective; it's not onto).
(1/7) Glad to see that people are following up on our work studying topological properties of modern neural network architectures. It was cool to see that widely used neural architectures can almost always generate any output given appropriate inputs, a.k.a. are surjective.
8
14
113
18,524
arxiv.org/abs/2312.00752 @tdietterich How come this paper can be uploaded without tex source? Was it written in word? Asking because I always download source and change the font size to make it readable for my eyes.
9
9
107
46,434
Seriously @OpenAI if your main business is being a consumer-facing company then fix your UX. Equations haven't rendered properly since the original 2023 release. The default to router is annoying.
Replying to @jasondeanlee
Could be. I started making sure changing it back to Pro every time I continue conversations. A bit annoying!
5
3
108
16,037
New work arxiv.org/abs/2506.05500 on learning multi-index models with @alex_damian_ and Joan Bruna. Multi-index models are of the form y = g(Ux), where U is r by d and maps from d dimensions to r dimensions, with d >> r. g is an arbitrary function. Examples of multi-index models are any neural net whose first hidden layer has width r. Our new paper proposes a new spectral estimator that attains optimal dimension dependency for recovering span(U). We define the generative leap exponent that governs the difficulty of learning and show both upper and lower bounds of d^{k/2}, where k is the generative leap exponent. This gives optimal results for learning several families: 1. Deep ReLU networks with bias (generalizing the result of Chen, Meka, Klivans from bias-less ReLU networks) 2. Low-rank polynomials where g is a polynomial (Chen and Meka). 3. Almost all deep neural networks with first hidden layer of width r. 4. Sparse Gaussian parity
2
19
112
9,250
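To make the y = g(Ux) setup concrete, here is a minimal NumPy sketch that samples from a multi-index model with Gaussian inputs and recovers span(U) with a classical second-moment estimator. This is not the spectral estimator from the paper. It is a textbook baseline that only works when g has a nonzero second-order Hermite component, and all function names are hypothetical.

```python
import numpy as np

def sample_multi_index(n, d, r, g, rng):
    """Draw n samples (x, y) with x ~ N(0, I_d) and y = g(U x)."""
    # Orthonormal rows for U via a reduced QR of a d x r Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    U = Q.T                      # shape (r, d)
    X = rng.standard_normal((n, d))
    y = g(X @ U.T)               # g maps an (n, r) array to an (n,) array
    return X, y, U

def second_moment_subspace(X, y, r):
    """Estimate span(U) from the empirical matrix E[y (x x^T - I)].

    Assumption: g has a nonzero second Hermite component; otherwise
    this matrix carries no signal and a higher-order method is needed.
    """
    n, d = X.shape
    M = X.T @ (X * y[:, None]) / n - np.mean(y) * np.eye(d)
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(eigvals))[-r:]   # r largest eigenvalues in magnitude
    return eigvecs[:, top].T                  # shape (r, d), orthonormal rows

def subspace_error(U_hat, U):
    """Frobenius distance between the two orthogonal projectors."""
    return np.linalg.norm(U_hat.T @ U_hat - U.T @ U)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = lambda Z: np.sum(Z ** 2, axis=1)      # illustrative choice of link function
    X, y, U = sample_multi_index(50_000, 30, 2, g, rng)
    U_hat = second_moment_subspace(X, y, 2)
    print("projector distance:", subspace_error(U_hat, U))
```

For the quadratic link used here the population matrix equals 2 U^T U, so its top-r eigenspace is exactly span(U). Links without a second-order component are where estimators of the kind described in the post become necessary.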
At the @SimonsInstitute working on AGI (Artificial Gaussian Intelligence)
1
1
102
11,523
Most surprising part is that xai stock is worth 7m.
🚨 xAI is suing former engineer Xuechen Li for allegedly stealing trade secrets about its Grok chatbot before joining OpenAI. The company claims Li admitted to taking files, sold $7M in stock, and is now seeking damages and a restraining order to block him from joining OpenAI.
5
2
99
18,242
How do I short oai before gpt5 release?
15
95
10,066
TLDR: Heuristics such as clipping cause weird biases. Let's move away from heuristics to principled methods so at least we know what they are optimizing
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find out this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-consi…
2
4
97
11,846
Oh god
NeurIPS 2024 will have a track for papers from high schoolers.
3
5
83
15,527
I strongly believe distribution shift is one of the major challenges in deploying ml systems. We take a step towards addressing subpopulation shift via a label propagation framework.
Subpopulation shift is a ubiquitous component of natural distribution shift. We propose a general theoretical framework of learning under subpopulation shift based on label propagation. And our insights can help to improve domain adaptation algorithms. arxiv.org/abs/2102.11203
2
10
88
arxiv.org/abs/2106.06530 arxiv.org/abs/2209.15594 We identified the third order effect in two algorithms, sgd and gd.
I think the reason why second-order methods keep underperforming relative to first-order methods in deep learning is that first-order methods are more powerful than the theory gives them credit for. First-order methods + large step sizes can implicitly access specific **third
2
5
89
27,460
Replying to @karololszacki
How to get it for free?
17
81
37,039
Picard's statement is a non-apology. NeurIPS statement (not linked here) is better.
Replying to @NeurIPSConf
In addition, Dr. Picard has also released an apology to the NeurIPS community. It can be read at neurips.cc/Conferences/2024/…
1
85
13,185
Very impressive! Shows that existing models are not far from gold, and with some minor self verification +prompting work already. GDM and oai results maybe only require some light rl tuning on top of the existing model (eg against the self verifier)
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
3
1
81
8,186
Where does the Markov chain come from? It depends on all previous not just the immediate
4
72
9,686
I have never seen a monograph (or book) with such an incomplete list of citations.
I'm delighted to share publicly "The Principles of Deep Learning Theory," co-written with @ShoYaida, and based on work also with @BorisHanin. It will appear on the @arxiv on Sunday and will be published by @CambridgeUP early next year: deeplearningtheory.com/ 1/
5
5
76
OK so they lasted one day, considerably less than 6 months.
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
3
79
10,499
OK forget the 100m. I'll just be the doorman at Nvidia.
How can the median employee be worth 25m. No way this is right...
2
1
82
9,246
Optimal isn't even defined for feature learning, so how could it be optimal? μP, CompleteP, and variants satisfy some desiderata. The choice of desiderata leads to different init schemes, and it's not clear what the right set of desiderata is.
Replying to @attentionmech
Just for any doubts μP or CompleteP are techniques for optimal feature learning. (Not an optimizer) ref: arxiv.org/abs/2505.01618
7
2
80
19,812
I'm in good company!
Verdict from @icmlconf: 3 out of 3 ..... rejected. If I go by tweet statistics, ICML has rejected every single paper this year 🤣
3
74
Replying to @DimitrisPapail
Right. The surprising part is it's just an LLM, no special tool or solver like AlphaGeometry + Lean
6
1
74
5,150
Replying to @george__wing
Wasn't sakana the one doing fraudulent demos
1
70
20,279
I remember coming to these when I was in middle school and high school (all the way from Cupertino!). The first one I attended was by Prof Stankova on circle inversions and it was like magic.
Unbelievable: the famed Berkeley Math Circle is being forced to shut down due to a bureaucratic requirement where a guest lecturer giving an hour long lesson needs to be officially fingerprinted. How is fingerprinting even still a thing in the 21st century? Chancellor Lyons @richlyons: can you see the absurdity of the situation and figure out a solution? dailycal.org/news/campus/gen…
1
4
71
20,747
Bye east coast!
3
73
7,893
Adam has more citations than Robbins-Monro! Wtf
Oldies but goldies: H Robbins, S Monro, A Stochastic Approximation Method, 1951. Early appearance of the stochastic gradient method, which is the workhorse of many large-scale ML methods. en.wikipedia.org/wiki/Stocha… en.wikipedia.org/wiki/Stocha…
2
2
70
15,954
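For context on the 1951 method, here is a minimal sketch of the Robbins-Monro iteration for finding the root of a function that can only be observed with noise. The function and argument names are hypothetical, and the step size a_n = 1/n is one standard choice satisfying the usual conditions (steps sum to infinity, squared steps sum to a finite value).

```python
import random

def robbins_monro(noisy_measure, target, x0, n_steps):
    """Iterate x_{n+1} = x_n - a_n * (Y_n - target) with a_n = 1/n.

    noisy_measure(x) returns a noisy observation Y of M(x); the goal is
    the point x* where M(x*) = target.
    """
    x = x0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n                      # decaying step size
        x = x - a_n * (noisy_measure(x) - target)
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative problem: M(x) = x - 3 observed with unit Gaussian noise.
    noisy = lambda x: (x - 3.0) + rng.gauss(0.0, 1.0)
    print(robbins_monro(noisy, target=0.0, x0=0.0, n_steps=10_000))
```

Replacing noisy_measure with a stochastic gradient and setting target to zero gives plain SGD with a decaying learning rate, which is the connection the quoted post draws.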
All models are equally good.
10
2
64
21,345
Agree flash thinking is at least fast!
2
63
18,592
Are Kolmogorov-Arnold networks (KAN) just standard MLP activation corresponding to the B-spline and some parallel net arch to handle the gridpoints? Feels like this should be true...
3
1
64
20,219
Last week was 100m, and this week it's 200m. finance.yahoo.com/news/meta-… I'm waiting 8 more weeks for the 1B offer.
4
2
63
16,373
Our new colt paper solving a colt open problem!
How to learn the best shared model across multiple data distributions — a unified paradigm with applications in robustness, fairness, and calibration? Our COLT 2024 paper shows how to do it optimally using Hedge! arxiv.org/abs/2312.05134. Also resolved 3 COLT 2023 open problems: arxiv.org/abs/2307.12135
5
62
9,395
We (@alex_damian_ @EshaanNichani) need help from Markov chain experts! mathoverflow.net/questions/4…
2
6
58
12,002
Replying to @roydanroy @ilyasut
More like if you completely close your eyes...
2
3
60
We have openings including in AI, please apply!
UC Berkeley Department of Statistics is hiring! We’re seeking applicants for up to three approved tenure-track positions at the Assistant Professor level in Statistics, Probability and AI. Details & apply: aprecruit.berkeley.edu/JPF05… #AI #Statistics #Probability #UCBerkeley
2
2
60
17,111
New work on training deep transformers for multi-step reasoning!
LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)
1
2
59
6,976
Though all are considerably worse than my students @alex_damian_ @EshaanNichani
I can confirm opus 4 is equal in ability to o3 and gemini 2.5 pro at computing Gaussian and Hermite identities. That's my main use case.
57
4,999
Quite happy with A*-PO. 1. Simple (no heuristics such as clipping/normalization). 2. One rollout per iteration, improving efficiency.
Current RLVR methods like GRPO and PPO require explicit critics or multiple generations per prompt, resulting in high computational and memory costs. We introduce ⭐A*-PO, a policy optimization algorithm that uses only a single sample per prompt during online RL without critic.
2
1
57
7,315
To be clear, this would probably take 1 hour for someone with experience with PEP or Mathematica. Not the 5 hours of prompting + spending all my gpt-5-pro credits
2
2
52
4,047
TLDR: By moving further from initialization, you can provably learn a broad class of functions (low-rank polynomials) with fewer samples than any kernel method. Low-rank polynomials include networks with polynomial activation of bounded degree and analytic activation (approx).
🚨 New blog post on Deep Learning Theory Beyond NTKs: Salesforce research blog: blog.einstein.ai/beyond-ntk/ offconvex: offconvex.org/2021/03/25/bey… An exposition of "escaping the NTK ball with stronger learning guarantees". Joint w/ @jasondeanlee @MinshuoC
6
52
Downloaded the source of this one and tried to compile in a larger font to make it readable. The latex is completely obfuscated to make it hard to edit. What's the point of this? Make it inaccessible to low vision readers? Makes it near impossible to reformat to be read
Our first paper in a series studying the inner mechanisms of transformers. TL;DR: we show *how* GPTs learn complex CFG trees via learning to do dynamic programming. Huge thanks to @MetaAI for making this research journey possible. arxiv.org/abs/2305.13673 FYI to @OpenAI @mbzuai
4
3
50
40,136
Like wtf.
Please fix the defaulting to chatgpt. So annoying, manipulative and dishonest @nickaturley Also fix your mathjax /latex/markdown. No one wants to be reading raw latex.
4
1
50
20,721
Now I know who took my slots!
Wow, all of our 6 submissions to ICML and COLT got accepted this year! Congrats to all my collaborators.
48
I love the east bay! Grew up in south bay and never thought much of the east bay. Turns out way better 1) better weather 2) food is cheaper, less crowded
3
51
8,708
What is the analog of ERM for offline RL? We propose primal dual regularized offline rl (PRO-RL), which has many of the properties that make ERM so successful. arxiv.org/abs/2202.04634
3
2
50