Jason Lee · Sep 26, 2025 · 4:00 PM UTC

Pinned Tweet

Jason Lee @jasondeanlee

26 Sep 2025

Gd dominates ridge regression but not sgd! Surprising that such simple things aren't known for linear regression

140

132,396

Jason Lee · Jun 26, 2025 · 4:14 PM UTC

Jason Lee @jasondeanlee

26 Jun 2025

Why... Who even uses perplexity

Rohan Paul

@rohanpaul_ai

25 Jun 2025

If Apple buys Perplexity, that would be its biggest ever acquisition

682

5,419

985,126

Jason Lee · Aug 6, 2025 · 9:54 AM UTC

Jason Lee @jasondeanlee

6 Aug 2025

Answer: model is complete junk, it's a hallucination machine. Overfit to reasoning benchmarks and has absolutely zero recall ability

Jason Lee @jasondeanlee

6 Aug 2025

Is the gpt-OSS stronger than qwen or kimi or Chinese open models?

147

3,734

641,268

Jason Lee · Jun 20, 2025 · 8:47 PM UTC

Jason Lee @jasondeanlee

20 Jun 2025

Wtf does thinking machines lab even do??

NIK

@ns123abc

20 Jun 2025

BREAKING: Mira Murati's Thinking Machines Lab raised $2 billion at $10 billion valuation we are so back.

365

2,419

853,141

Jason Lee · Dec 22, 2024 · 5:25 PM UTC

Jason Lee @jasondeanlee

22 Dec 2024

O1-pro is pretty useless for research work. It runs for nearly 10 min per prompt and either 1) freezes, 2) doesn't follow the instructions and returns some bs, or 3) just makes some simple error in the middle that's hard to find. @OpenAI @sama @markchen90 refund me my $200

115

1,836

417,956

Jason Lee · Jul 20, 2025 · 2:50 AM UTC

Jason Lee @jasondeanlee

20 Jul 2025

Such bs. The majority of math majors or even math phds and faculty could not win a gold in imo.

лисиця @hehehe52318711

19 Jul 2025

Replying to @GaryMarcus

What most people don't realize is that IMO (and IOI, though to a different extent) aren't particularly hard. They're aimed at high schoolers, so anyone with decent uni education should be able to solve most of them.

1,168

139,367

Jason Lee · Aug 11, 2025 · 9:18 AM UTC

Jason Lee @jasondeanlee

11 Aug 2025

Canceled my chatgpt pro subscription

697

80,230

Jason Lee · Jul 15, 2025 · 12:01 AM UTC

Jason Lee @jasondeanlee

15 Jul 2025

Will be at ICML starting Wed. I am open to any offers of 100m+.

714

56,123

Jason Lee · Jun 12, 2025 · 9:00 PM UTC

Jason Lee @jasondeanlee

12 Jun 2025

Returning to the bay area after 9 years! Moving to UC Berkeley in statistics and eecs! It's been a great 6 years at Princeton, but I am so happy to be back! Finally there will be edible sushi!

708

39,177

Jason Lee · Aug 6, 2025 · 2:04 PM UTC

Jason Lee @jasondeanlee

6 Aug 2025

Proud to have not contributed to Gpt-oss.

680

94,429

Jason Lee · May 20, 2022 · 4:40 PM UTC

Jason Lee @jasondeanlee

20 May 2022

Finally got a career! On the nth try, for very large n.

602

Jason Lee · Oct 28, 2025 · 2:47 AM UTC

Jason Lee @jasondeanlee

28 Oct 2025

Please pay me 100m to convert papers like openreview.net/pdf?id=3zKtaq… to blogposts! @agarwl_

Thinking Machines

@thinkymachines

27 Oct 2025

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…

576

121,091

Jason Lee · May 15, 2022 · 5:21 PM UTC

Jason Lee @jasondeanlee

15 May 2022

All icml rejected!

516

Jason Lee · Jun 28, 2025 · 7:47 PM UTC

Jason Lee @jasondeanlee

28 Jun 2025

I know all 3. Will work for 99m

Jelani Nelson

@minilek

28 Jun 2025

See below on what Zuckerberg is looking for in star recruits worth $100m pay packages for Meta’s plans in Artificial Intelligence. But weren’t some people saying calculus is no longer useful in the AI age? 🤔

512

51,723

Jason Lee · Aug 19, 2020 · 11:37 PM UTC

Jason Lee @jasondeanlee

19 Aug 2020

Predicting What You Already Know Helps: Provable Self-Supervised Learning We analyze how predicting parts of the input from other parts (missing patch, missing word, etc.) helps to learn a representation that linearly separates the downstream task. arxiv.org/abs/2008.01064 1/2

100

499

Jason Lee · Nov 7, 2025 · 10:53 PM UTC

Jason Lee @jasondeanlee

7 Nov 2025

In the new gpt 5.1, the chat interface defaults to the router. If I start a thread with 5 pro, on the second interaction I have to remember to select 5 pro every single time. This is super annoying and really ruins the workflow. If I'm using 5 pro, I probably will use it for entire conversation. @OpenAI not that I expect a response, for 2400 a year we get almost no support.

445

99,208

Jason Lee · Jul 19, 2025 · 7:22 PM UTC

Jason Lee @jasondeanlee

19 Jul 2025

Why was this surprising?! Last year AlphaProof was 1 point from gold, which is def just noise. I could have rerun last year's model on this year's problems and, with decent probability, won gold.

Noam Brown

@polynoamial

19 Jul 2025

I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks

432

65,983

Jason Lee · Jun 23, 2025 · 11:31 PM UTC

Jason Lee @jasondeanlee

23 Jun 2025

Great videos! I learned so much. The assignments are too hard for me.

Percy Liang

@percyliang

18 Jun 2025

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

412

61,193

Jason Lee · Jun 26, 2025 · 1:12 PM UTC

Jason Lee @jasondeanlee

26 Jun 2025

Meta come poach me!

Meghan Bobrowsky

@MeghanBobrowsky

26 Jun 2025

Scoop: Meta has poached three OpenAI researchers: Lucas Beyer, Alexander Kolesnikov and Xiaohua Zhai, according to people familiar with the matter. An OpenAI spox confirmed the three have left the company.

390

62,991

Jason Lee · Oct 24, 2025 · 12:39 AM UTC

Jason Lee @jasondeanlee

24 Oct 2025

Just 1 gigawatt? Others doing tens of gigawatts

Anthropic

@AnthropicAI

23 Oct 2025

Today, we announced that we plan to expand our use of Google TPUs, securing approximately one million TPUs and more than a gigawatt of capacity in 2026.

355

93,630

Jason Lee · Oct 31, 2025 · 1:57 PM UTC

Jason Lee @jasondeanlee

31 Oct 2025

Can someone explain the billion update rules here? What are the desiderata and what are the tradeoffs?

Kimi.ai

@Kimi_Moonshot

30 Oct 2025

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Ki…

Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length.

Key highlights:
🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule.
🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board.
🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints.

The future of agentic-oriented attention is here! 💡

336

56,358
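For readers with the same question about the update rules: the thread does not spell them out, so what follows is a minimal sketch of three recurrences commonly described in the linear-attention literature (plain additive update, delta rule, gated delta rule). It is not Kimi Linear's KDA kernel. The function names, the scalar gate `alpha`, and the write strength `beta` are illustrative assumptions, and the state `S` is taken to be a `d_v x d_k` matrix that is read out as `S @ q`.

```python
import numpy as np

def linear_attention_step(S, k, v):
    """Plain linear attention: add the outer product v k^T to the state."""
    return S + np.outer(v, k)

def delta_rule_step(S, k, v, beta):
    """Delta rule: S (I - beta k k^T) + beta v k^T.

    The first term erases (a beta-fraction of) what S currently returns
    for key k; the second term writes the new value v under that key.
    """
    return S - beta * np.outer(S @ k, k) + beta * np.outer(v, k)

def gated_delta_rule_step(S, k, v, beta, alpha):
    """Gated delta rule: decay the whole state by alpha, then apply the delta rule."""
    return delta_rule_step(alpha * S, k, v, beta)

def read(S, q):
    """Read out the value associated with query q."""
    return S @ q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.normal(size=(3, 4))          # d_v = 3, d_k = 4 (arbitrary sizes)
    k = rng.normal(size=4)
    k /= np.linalg.norm(k)               # unit-norm key
    v = rng.normal(size=3)

    # With beta = 1 and a unit-norm key, the delta rule overwrites exactly:
    # reading with k returns v, regardless of what S held before.
    print(np.allclose(read(delta_rule_step(S, k, v, 1.0), k), v))
    # The additive rule instead stacks v on top of the old content.
    print(np.allclose(read(linear_attention_step(S, k, v), k), S @ k + v))
```

Read this way, one possible framing of the tradeoff is: the additive rule never forgets, so stored keys interfere; the delta rule overwrites per key; the gate adds a global decay on top. Variants then differ mainly in how `alpha` and `beta` are parameterized (for example per channel rather than scalar), which this sketch does not attempt to cover.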

Jason Lee · Oct 30, 2025 · 1:25 PM UTC

Jason Lee @jasondeanlee

30 Oct 2025

I always get frustrated when asked what ML theory is good for and people ask for specific examples. I find this question unfair; I think it's really just that having a theory/mathematical perspective is sometimes super helpful. E.g. diffusion models and their relatives: I don't see how you can come up with them without at least some theory training. Does it count as ML theory? Maybe not quite, the original papers didn't show any results like it samples in poly time/samples or whatever. But it is still an example of why we should learn some math/theory.

Quanquan Gu

@QuanquanGu

29 Oct 2025

Replying to @QuanquanGu

No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.

330

133,493

Jason Lee · Oct 28, 2025 · 2:39 AM UTC

Jason Lee @jasondeanlee

28 Oct 2025

This is a well-known technique, right? What's new here??

Thinking Machines

@thinkymachines

27 Oct 2025

325

96,879

Jason Lee · Jul 21, 2025 · 6:36 PM UTC

Jason Lee @jasondeanlee

21 Jul 2025

At least the GDM imo proofs are readable!

310

26,935

Jason Lee · Aug 16, 2023 · 7:32 PM UTC

Jason Lee @jasondeanlee

16 Aug 2023

I didn't get this talk at all. Why does good compression, e.g. near Kolmogorov complexity, imply that it's a good learner??

Jim Fan

@DrJimFan

16 Aug 2023

There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes:

- Kolmogorov compressor is the theoretical shortest-length program that produces a dataset. SGD is a practical approximation of the Kolmogorov search that finds an implicit program embedded in the weights of a soft computer, i.e. big Transformers.
- Unsupervised learning is about computing the conditional Kolmogorov complexity of a target dataset given an unlabelled corpus, i.e. K(Y|X)
- Theory tells us that optimizing for K(X, Y), the joint complexity, is as good as K(Y|X). So simply throw all data into the mix, and "just compress everything".
- Joint compression is maximum likelihood over the giant concatenated dataset.
- Ilya cites iGPT, Chen et al. 2020, to illustrate the ideas. iGPT is an image compressor that learns to predict the next pixel using a 1D sequence model.

This is a phenomenal lecture, very accessible, and sometimes quite entertaining.
YouTube: piped.video/watch?v=AKMuA_TV…
Lecture page: simons.berkeley.edu/talks/il…

261

251,930
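Two standard identities sit behind the quoted notes and may help in reading them. They are textbook facts, not taken from the lecture itself, and they do not by themselves answer the question of why a good compressor should be a good learner. The first is the chain rule relating joint and conditional Kolmogorov complexity; the second is the correspondence between code length under a model and likelihood, which ignores the cost of describing the parameters $\theta$.

```latex
% Chain rule for Kolmogorov complexity (holds up to a logarithmic additive term):
\[
  K(X, Y) \;=\; K(X) + K(Y \mid X) + O\!\bigl(\log K(X, Y)\bigr).
\]
% Ideal code length of a dataset D under a probabilistic model p_\theta:
\[
  \ell_\theta(D) \;=\; -\log_2 p_\theta(D)
  \quad\Longrightarrow\quad
  \arg\min_\theta \ell_\theta(D) \;=\; \arg\max_\theta p_\theta(D).
\]
```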

Jason Lee · Oct 4, 2025 · 1:46 PM UTC

Jason Lee @jasondeanlee

4 Oct 2025

LLMs can learn long reasoning and composition tasks. The key is data mixture. Must have a mixture of short reasoning chains and long reasoning chains! One of the rare theory papers that directly sheds light on how data should be designed.

Zixuan Wang

@zzZixuanWang

30 May 2025

LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)

257

28,242

Jason Lee · Aug 22, 2025 · 7:23 PM UTC

Jason Lee @jasondeanlee

22 Aug 2025

Optimization is real math too.

Edward Frenkel

@edfrenkel

22 Aug 2025

This is an unwise statement that can only make people confused about what LLMs can or cannot do. Let me tell you something: Math is NOT about solving this kind of ad hoc optimization problems. Yeah, by scraping available data and then clustering it, LLMs can sometimes solve some very minor math problems. It's an achievement, and I applaud you for that. But let's be honest: this is NOT the REAL Math. Not by 10,000 miles. REAL Math is about concepts and ideas - things like "schemes" introduced by the great Alexander Grothendieck, who revolutionized algebraic geometry; the Atiyah-Singer Index Theorem; or the Langlands Program, tying together Number Theory, Analysis, Geometry, and Quantum Physics. That's the REAL Math. Can LLMs do that? Of course not. So, please, STOP confusing people - especially, given the atrocious state of our math education. LLMs give us great tools, which I appreciate very much. Useful stuff! Go ahead and use them AS TOOLS (just as we use calculators to crunch numbers or cameras to render portraits and landscapes), an enhancement of human abilities, and STOP pretending that LLMs are somehow capable of replicating everything that human beings can do. In this one area, mathematics, LLMs are no match to human mathematicians. Period. Not to mention many other areas. Calling on my friend @ericweinstein and @GaryMarcus, who has been one of the few sane expert voices on these matters lately. 🙏 h/t @hellheff

256

31,524

Jason Lee · Feb 16, 2022 · 8:29 PM UTC

Jason Lee @jasondeanlee

16 Feb 2022

Announcing our new result that 3-layer nets are conscious and 2-layer nets are provably not conscious. arxiv.org/abs/2112.02393

Optimization-Based Separations for Neural Networks

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior...

arxiv.org

235

Jason Lee · Nov 5, 2025 · 4:03 PM UTC

Jason Lee @jasondeanlee

5 Nov 2025

There will soon also be a PSA of why muon is not a second order optimizer and how all the fancy manifold math is irrelevant. I am sick of people using math intimidation to make their methods sound fancy to bs VCs.

Jason Lee @jasondeanlee

5 Nov 2025

Proof by picture of why LR convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.

245

31,231

Jason Lee · Feb 26, 2024 · 4:00 PM UTC

Jason Lee @jasondeanlee

26 Feb 2024

Extremely happy with this result! Mechanistic Understanding of how Transformers Learn Causal Structure!

Eshaan Nichani @EshaanNichani

26 Feb 2024

Causal self-attention encodes causal structure between tokens (eg. induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! arxiv.org/abs/2402.14735 (1/10)

236

29,738

Jason Lee · Aug 22, 2025 · 2:31 PM UTC

Jason Lee @jasondeanlee

22 Aug 2025

I managed to prompt gpt-5-thinking into proving the tight 1.75/L, matching v2 of the arxiv paper. From the arxiv paper, it was clear that this problem is perfect for the PEP framework. I told gpt to do a search over the coefficients for combining cocoercivity at different pairs of points. Going up to 9 coefficients, it found the solution using sympy (I told it to use a symbolic solver). We can also be pretty certain the authors used PEP to find the magical combination of coefficients.

Sebastien Bubeck

@SebastienBubeck

20 Aug 2025

Claim: gpt-5-pro can prove new interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof it's correct. Details below.

240

52,884
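As background for "combining cocoercivity at different pairs of points": the building block in PEP-style proofs is the standard inequality for a convex, $L$-smooth function, written once per ordered pair of points. The sketch below states that inequality and the generic shape of the certificate one searches for. The multipliers $\lambda_{ij}$ are generic placeholders; the specific coefficients found in the thread are not reproduced here.

```latex
% For convex, L-smooth f, with g_i = \nabla f(x_i), every pair (i, j) satisfies
\[
  Q_{ij} \;:=\; f(x_i) - f(x_j) - \langle g_j,\, x_i - x_j \rangle
  - \frac{1}{2L}\,\lVert g_i - g_j \rVert^2 \;\ge\; 0 .
\]
% A PEP-style certificate is a choice of multipliers \lambda_{ij} \ge 0 such that,
% identically in the iterates and gradients,
\[
  \text{(target quantity)} \;=\; \sum_{i \neq j} \lambda_{ij}\, Q_{ij}
  \;+\; \text{(sum of squares)} \;\ge\; 0 .
\]
```

Searching over the $\lambda_{ij}$ reduces the proof to a finite algebraic problem, which is why a symbolic solver is a natural tool for it.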

Jason Lee · Nov 8, 2025 · 5:23 AM UTC

Jason Lee @jasondeanlee

8 Nov 2025

But dumber. So much math slop tonight.

Jason Lee @jasondeanlee

8 Nov 2025

Gpt 5 pro super fast tonight

213

66,432

Jason Lee · Jul 4, 2025 · 10:02 PM UTC

Jason Lee @jasondeanlee

4 Jul 2025

I could decide in 6 seconds

NIK

@ns123abc

4 Jul 2025

> be zucc > makes an offer you can’t refuse > $100 million signing bonus > $100 million base salary > with bonus up to $300 million a year “offer expires in 6 hours”

206

28,547

Jason Lee · Oct 18, 2025 · 4:35 PM UTC

Jason Lee @jasondeanlee

18 Oct 2025

A Chinese-American hero that inspired my parents' generation to study and stay in America.

Tsinghua University

@Tsinghua_Uni

18 Oct 2025

Prof. Chen Ning Yang, a world-renowned physicist, Nobel Laureate in Physics, Academician of the Chinese Academy of Sciences, Professor at Tsinghua University, and Honorary Director of the Institute for Advanced Study at Tsinghua University, passed away in Beijing due to illness at the age of 103. His life stands as a timeless chapter in human history—one that shines not only for China but for the global community of thinkers and innovators. His legacy will live on forever.

206

24,572

Jason Lee · Mar 13, 2022 · 1:14 PM UTC

Jason Lee @jasondeanlee

13 Mar 2022

I strongly dislike the term inductive bias. It sounds like jargon to me, and whenever we use it like "the inductive bias allows it to learn well on domain X". Translation: we don't understand why it learns well on domain X, but it beats the competitors! Must be inductive bias!

Preetum Nakkiran @PreetumNakkiran

13 Mar 2022

Replying to @PreetumNakkiran

“What is the inductive bias of XX” is a fancy way of asking “on which distributions/tasks does XX work well?”

197

Jason Lee · May 5, 2023 · 12:26 PM UTC

Jason Lee @jasondeanlee

5 May 2023

I want to remind everyone that disabilities may also be invisible. Your colleagues, group members, students, postdocs, may be going through this. I am not an eloquent person, so WE NEED TO PAY MORE ATTENTION TO THE DISABLED AND THEIR ACCOMMODATION

Maria Skoularidou (she/her)@skoularidou

22 Apr 2022

Being a #disabled junior researcher in #AI comes at a massive price; when your disabilities flare, you are on your own: there is neither medical insurance nor salary for you during this difficult time This is a very important aspect that needs our attention #Academia #Insecurity

187

62,706

Jason Lee · Dec 19, 2022 · 1:40 PM UTC

Jason Lee @jasondeanlee

19 Dec 2022

Any suggestions for lecture notes /videos/short books on stochastic calculus or sdes? Looking for something operational, not rigorous

189

155,186

Jason Lee · Nov 27, 2024 · 1:03 PM UTC

Jason Lee @jasondeanlee

27 Nov 2024

arxiv.org/abs/2411.17668 Our postdoc zihan slays another COLT open problem! proceedings.mlr.press/v247/k…

Anytime Acceleration of Gradient Descent

This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that...

arxiv.org

185

40,113

Jason Lee · Dec 19, 2022 · 3:55 PM UTC

Jason Lee @jasondeanlee

19 Dec 2022

How do physicists learn stochastic calculus? There is no way they spend a month defining Brownian motion / the Itô integral.

Jason Lee @jasondeanlee

19 Dec 2022

Any suggestions for lecture notes /videos/short books on stochastic calculus or sdes? Looking for something operational, not rigorous

168

86,950

Jason Lee · May 5, 2025 · 4:19 PM UTC

Jason Lee @jasondeanlee

5 May 2025

Our new work on scaling laws that includes compute, model size, and number of samples. The analysis involves an extremely fine-grained analysis of online sgd built up over the last 8 years of understanding sgd on simple toy models (tensors, single index models, multi index model)

Eshaan Nichani @EshaanNichani

5 May 2025

Excited to announce a new paper with Yunwei Ren, Denny Wu, @jasondeanlee! We prove a neural scaling law in the SGD learning of extensive width two-layer neural networks. arxiv.org/abs/2504.19983 🧵below (1/10)

170

97,794

Jason Lee · Aug 6, 2025 · 2:50 PM UTC

Jason Lee @jasondeanlee

6 Aug 2025

Replying to @techikansh

Not my math problems.

157

24,614

Jason Lee · Oct 27, 2025 · 8:21 PM UTC

Jason Lee @jasondeanlee

27 Oct 2025

How do you evaluate the correctness of the mathematical reasoning chain? Take a look!

Fanghui Liu @Fanghui_SgrA

27 Oct 2025

What is "good" reasoning and how to evaluate it? 🚀We explore a new pipeline to model step-level reasoning, a “Goldilocks principle” that balances free-form CoT and LEAN! Led by my student @yuanhezhang6, in collaboration with Ilja from DeepMind, @jasondeanlee, @CL_Theory

152

25,249

Jason Lee · May 7, 2021 · 3:03 AM UTC

Jason Lee @jasondeanlee

7 May 2021

I won!!! Much of the proposal is based off work with @AlexDamian and @tengyuma

Office of Naval Research (ONR)

@USNavyResearch

6 May 2021

📣Announcing… this year’s ONR Young Investigator Program (YIP) recipients! #ONRYIP onr.navy.mil/Media-Center/Pr…

147

Jason Lee · Aug 11, 2025 · 10:12 AM UTC

Jason Lee @jasondeanlee

11 Aug 2025

Replying to @doe1478725

No 200/month

145

5,931

Jason Lee · Nov 5, 2025 · 3:58 PM UTC

Jason Lee @jasondeanlee

5 Nov 2025

Soufiane Hayou

@hayou_soufiane

4 Nov 2025

🎯 Just released a new preprint that proves LR transfer under μP. -> The Problem: When training large neural networks, one of the trickiest questions is: what learning rate should I use? [1/n]🧵 Link: arxiv.org/abs/2511.01734

145

58,682

Jason Lee · Oct 17, 2025 · 8:25 PM UTC

Jason Lee @jasondeanlee

17 Oct 2025

Gpt searched for existing solutions in the literature. It did not solve them itself.

Mark Sellke @MarkSellke

17 Oct 2025

Update: Mehtaab and I pushed further on this. Using thousands of GPT5 queries, we found solutions to 10 Erdős problems that were listed as open: 223, 339, 494, 515, 621, 822, 883 (part 2/2), 903, 1043, 1079. Additionally for 11 other problems, GPT5 found significant partial progress that we added to the official website: 32, 167, 188, 750, 788, 811, 827, 829, 1017, 1011, 1041. For 827, Erdős's original paper actually contained an error, and the work of Martínez and Roldán-Pensado explains this and fixes the argument. The future of scientific research is going to be fun.

Community note

GPT-5 did not solve those Erdos problems. It only "found" solutions in the sense of finding existing published literature that solved the problems. Here is an explanation from the maintainer of erdosproblems.com: nitter.app/thomasfbloom/s

139

18,477

Jason Lee · May 15, 2021 · 2:26 AM UTC

Jason Lee @jasondeanlee

15 May 2021

Why are official announcements posted to medium? I click this link and can't read the article. Instead I get something about paying to subscribe to medium to read...

NeurIPS Conference

@NeurIPSConf

14 May 2021

The #NeurIPS2021 paper submission deadline has been extended by 48 hours. The new deadline is Friday, May 28 at 1pm PT (abstracts due Friday, May 21). Read the official announcement to learn more. link.medium.com/mJTaFjBeggb

130

Jason Lee · Oct 23, 2025 · 3:24 AM UTC

Jason Lee @jasondeanlee

23 Oct 2025

Absolutely insane

Yuandong Tian

@tydsh

23 Oct 2025

Several of my team members + myself are impacted by this layoff today. Welcome to connect :)

130

39,277

Jason Lee · Jul 25, 2025 · 4:29 PM UTC

Jason Lee @jasondeanlee

25 Jul 2025

He's terrible. Screwed my buddy sgd.

Yiping Lu

@2prime_PKU

25 Jul 2025

Anyone knows adam?

135

11,476

Jason Lee · Oct 28, 2025 · 1:34 PM UTC

Jason Lee @jasondeanlee

28 Oct 2025

I implemented Adam and ran it on a new dataset/arch. Blogpost coming soon! My implementation will also be available in the fiddler api.

127

12,686

Jason Lee · Dec 3, 2021 · 1:48 AM UTC

Jason Lee @jasondeanlee

3 Dec 2021

I think I just submitted over 100 ref letters...

117

Jason Lee · Jul 20, 2025 · 2:59 AM UTC

Jason Lee @jasondeanlee

20 Jul 2025

This is such cope. True that they gave almost no information in the release but it's still a super hard competition

113

8,535

Jason Lee · Nov 8, 2025 · 11:36 PM UTC

Jason Lee @jasondeanlee

8 Nov 2025

Gpt 5 pro working much better for me this morning. Last night was a disaster

115

9,185

Jason Lee · Nov 6, 2025 · 5:45 PM UTC

Jason Lee @jasondeanlee

6 Nov 2025

Exactly. Many of the matrix preconditioning methods that people call 2nd order are really much closer to first order, e.g. Muon, Shampoo, etc.

Aryan Mokhtari

@AryanMokhtari

6 Nov 2025

Second-order methods and preconditioner-based methods are **NOT** the same. Please stop using them interchangeably!

110

18,490
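To make the "closer to first order" point concrete, here is a minimal sketch of a Muon-style update. It assumes the usual description of Muon as replacing the momentum matrix by (an approximation of) its polar factor, the nearest matrix with orthonormal rows or columns. Everything below is computed from gradients alone; no Hessian or curvature estimate appears. The function names and hyperparameters are hypothetical, and the classical cubic Newton-Schulz iteration is used as a stand-in rather than the tuned iteration of any particular implementation.

```python
import numpy as np

def polar_factor_svd(G):
    """Exact polar factor U V^T of G, computed from its thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def polar_factor_newton_schulz(G, steps=30):
    """Approximate polar factor using only matrix multiplications.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which
    pushes values in (0, sqrt(3)) toward 1. Dividing by the Frobenius
    norm first puts all singular values in (0, 1].
    """
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, M, G, lr=0.02, momentum=0.95):
    """One hypothetical Muon-style step: momentum, then orthogonalize, then update."""
    M = momentum * M + G                 # first-order information only
    W = W - lr * polar_factor_svd(M)     # direction has all singular values equal to 1
    return W, M

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.normal(size=(4, 6))          # stand-in for a gradient matrix
    O = polar_factor_newton_schulz(G)
    print(np.allclose(O @ O.T, np.eye(4), atol=1e-6))
    print(np.allclose(O, polar_factor_svd(G), atol=1e-6))
```

The update direction depends on the gradient only through its singular vectors, so it behaves like steepest descent under a spectral-norm constraint rather than like a Newton step.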

Jason Lee · Oct 30, 2025 · 2:24 PM UTC

Jason Lee @jasondeanlee

30 Oct 2025

Haozhe's paper is worth a read, really nice use of fixed point theorems. The new one about 1 to 1 seems almost immediate though. Just from reading the thread, I would guess the proof is as follows: Say your input space is discrete, indexed by j \in [n] and represented by x_j. The embedding is E: [n] \to R^d. "For almost all E", the E x_j are distinct. Then a transformer f is composed of building blocks that are analytic; a composition of analytic functions is analytic, and analyticity is also preserved under many algebraic operations (o-minimal stuff). An analytic function is either the zero function or vanishes only on a measure zero set (it can't have f(set) = 0 on a set of positive measure). Thus, so long as f \neq 0 uniformly, this should be injective (not necessarily bijective, since it's not onto).

Haozhe Jiang @erichzjiang

28 Oct 2025

(1/7) Glad to see that people are following up on our work studying topological properties of modern neural network architectures. It was cool to see that widely used neural architectures can almost always generate any output given appropriate inputs, a.k.a. are surjective.

113

18,524
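One way to write the guessed argument out, as a sketch only: it formalizes the tweet's reasoning under the tweet's own assumption that every building block is real-analytic, and it is not the proof from either paper. The parameter vector $\theta \in \mathbb{R}^p$ (taken here to include the embedding $E$) and the difference maps $h_{ij}$ are notation introduced for illustration.

```latex
% Inputs x_1, ..., x_n; network f_\theta analytic in \theta \in \mathbb{R}^p.
\[
  h_{ij}(\theta) \;:=\; f_\theta(x_i) - f_\theta(x_j), \qquad i \neq j .
\]
% Each h_{ij} is analytic in \theta. If h_{ij} \not\equiv 0, its zero set is null:
\[
  Z_{ij} \;:=\; \{\theta : h_{ij}(\theta) = 0\}, \qquad \operatorname{Leb}(Z_{ij}) = 0 .
\]
% A finite union of null sets is null, so collisions are a measure-zero event:
\[
  \operatorname{Leb}\Bigl(\,\bigcup_{i < j} Z_{ij}\Bigr) = 0
  \quad\Longrightarrow\quad
  f_\theta \text{ is injective on } \{x_1, \dots, x_n\} \text{ for almost every } \theta .
\]
```

The condition "so long as f \neq 0 uniformly" corresponds here to needing, for each pair $i \neq j$, at least one parameter setting at which the two outputs differ, so that $h_{ij}$ is not identically zero.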

Jason Lee · Dec 10, 2023 · 11:56 AM UTC

Jason Lee @jasondeanlee

10 Dec 2023

arxiv.org/abs/2312.00752 @tdietterich How come this paper can be uploaded without tex source? Was it written in word? Asking because I always download source and change the font size to make it readable for my eyes.

107

46,434

Jason Lee · Nov 8, 2025 · 12:19 AM UTC

Jason Lee @jasondeanlee

8 Nov 2025

Seriously @OpenAI if your main business is being a consumer-facing company then fix your UX. Equations haven't rendered properly since the original 2023 release. The default to router is annoying.

Kangwook Lee

@Kangwook_Lee

7 Nov 2025

Replying to @jasondeanlee

Could be. I started making sure changing it back to Pro every time I continue conversations. A bit annoying!

108

16,037

Jason Lee · Jun 10, 2025 · 3:21 PM UTC

Jason Lee @jasondeanlee

10 Jun 2025

New work arxiv.org/abs/2506.05500 on learning multi-index models with @alex_damian_ and Joan Bruna. Multi-index models are of the form y = g(Ux), where U is an r by d matrix mapping d dimensions to r dimensions with d >> r, and g is an arbitrary function. Examples of multi-index models include any neural net whose first hidden layer has width r. Our new paper proposes a new spectral estimator that attains optimal dimension dependency for recovering span(U). We define the generative leap exponent that governs the difficulty of learning and show both upper and lower bounds of d^{k/2}, where k is the generative leap exponent. This gives optimal results for learning several families: 1. Deep ReLU networks with bias (generalizing the result of Chen, Meka, Klivans from bias-less ReLU networks). 2. Low-rank polynomials, where g is a polynomial (Chen and Meka). 3. Almost all deep neural networks with first hidden layer of width r. 4. Sparse Gaussian parity.

The Generative Leap: Sharp Sample Complexity for Efficiently...

In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$...

arxiv.org

112

9,250
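To illustrate the definition y = g(Ux), here is a small sketch that samples data from a Gaussian multi-index model and checks its defining property: the label is unchanged by any perturbation of x orthogonal to the rows of U. The link function `relu_readout`, the scaling of `U`, and the helper names are assumptions chosen for the example. This is only the data model, not the spectral estimator from the paper.

```python
import numpy as np

def relu_readout(Z):
    """Hypothetical link g: a width-r ReLU hidden layer with unit output weights."""
    return np.maximum(Z, 0.0).sum(axis=1)

def sample_multi_index(n, d, r, g, rng):
    """Draw n samples (x, y) with x ~ N(0, I_d) and y = g(U x), U of shape (r, d)."""
    U = rng.normal(size=(r, d)) / np.sqrt(d)   # hidden index matrix, r << d
    X = rng.normal(size=(n, d))                # Gaussian inputs, one per row
    y = g(X @ U.T)                             # labels see X only through U x
    return X, y, U

def project_out_index_space(V, U):
    """Remove from each row of V its component in the row space of U."""
    coeffs = np.linalg.solve(U @ U.T, U @ V.T)  # shape (r, n)
    return V - (U.T @ coeffs).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, U = sample_multi_index(n=50, d=20, r=3, g=relu_readout, rng=rng)
    # Perturb inputs only in the (d - r)-dimensional subspace that U ignores.
    noise = project_out_index_space(rng.normal(size=X.shape), U)
    print(np.allclose(relu_readout((X + noise) @ U.T), y))
```

Recovering span(U) from samples is the hard part; once it is known, the remaining problem is only r-dimensional.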

Jason Lee · Nov 15, 2024 · 4:12 PM UTC

Jason Lee @jasondeanlee

15 Nov 2024

At the @SimonsInstitute working on AGI (Artificial Gaussian Intelligence)

102

11,523

Jason Lee · Aug 29, 2025 · 9:39 PM UTC

Jason Lee @jasondeanlee

29 Aug 2025

Most surprising part is that xai stock is worth 7m.

AshutoshShrivastava

@ai_for_success

29 Aug 2025

🚨 xAI is suing former engineer Xuechen Li for allegedly stealing trade secrets about its Grok chatbot before joining OpenAI. The company claims Li admitted to taking files, sold $7M in stock, and is now seeking damages and a restraining order to block him from joining OpenAI.

18,242

Jason Lee · Aug 7, 2025 · 2:00 PM UTC

Jason Lee @jasondeanlee

7 Aug 2025

How do I short oai before gpt5 release?

10,066

Jason Lee · Jul 16, 2025 · 2:38 PM UTC

Jason Lee @jasondeanlee

16 Jul 2025

TLDR: Heuristics such as clipping cause weird biases. Let's move away from heuristics to principled methods so at least we know what they are optimizing

Gokul Swamy @g_k_swamy

15 Jul 2025

Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find out this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-consi…

11,846

Jason Lee · Apr 12, 2024 · 2:30 AM UTC

Jason Lee @jasondeanlee

12 Apr 2024

Oh god

Gautam Kamath @thegautamkamath

12 Apr 2024

NeurIPS 2024 will have a track for papers from high schoolers.

15,527

Jason Lee · Feb 23, 2021 · 3:56 AM UTC

Jason Lee @jasondeanlee

23 Feb 2021

I strongly believe distribution shift is one of the major challenges in deploying ml systems. We take a step towards addressing subpopulation shift via a label propagation framework.

Tianle Cai

@tianle_cai

23 Feb 2021

Subpopulation shift is a ubiquitous component of natural distribution shift. We propose a general theoretical framework of learning under subpopulation shift based on label propagation. And our insights can help to improve domain adaptation algorithms. arxiv.org/abs/2102.11203

Jason Lee · Oct 2, 2023 · 2:21 AM UTC

Jason Lee @jasondeanlee

2 Oct 2023

arxiv.org/abs/2106.06530 arxiv.org/abs/2209.15594 We identified the third order effect in two algorithms, sgd and gd.

Label Noise SGD Provably Prefers Flat Global Minimizers

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by...

arxiv.org

Jeremy Cohen @deepcohen

1 Oct 2023

I think the reason why second-order methods keep underperforming relative to first-order methods in deep learning is that first-order methods are more powerful than the theory gives them credit for. First-order methods + large step sizes can implicitly access specific **third

27,460

Jason Lee · Jun 26, 2025 · 10:34 PM UTC

Jason Lee @jasondeanlee

26 Jun 2025

Replying to @karololszacki

How to get it for free?

37,039

Jason Lee · Dec 16, 2024 · 2:34 PM UTC

Jason Lee @jasondeanlee

16 Dec 2024

Picard's statement is a non-apology. NeurIPS statement (not linked here) is better.

NeurIPS Conference

@NeurIPSConf

15 Dec 2024

Replying to @NeurIPSConf

In addition, Dr. Picard has also released an apology to the NeurIPS community. It can be read at neurips.cc/Conferences/2024/…

13,185

Jason Lee · Jul 22, 2025 · 3:26 AM UTC

Jason Lee @jasondeanlee

22 Jul 2025

Very impressive! Shows that existing models are not far from gold, and with some minor self verification +prompting work already. GDM and oai results maybe only require some light rl tuning on top of the existing model (eg against the self verifier)

Lin Yang

@lyang36

22 Jul 2025

🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025

8,186

Jason Lee · Mar 26, 2024 · 4:35 PM UTC

Jason Lee @jasondeanlee

26 Mar 2024

Replying to @chrmanning @strwbilly

Where does the Markov chain come from? It depends on all previous not just the immediate

9,686

Jason Lee · Jun 18, 2021 · 5:53 PM UTC

Jason Lee @jasondeanlee

18 Jun 2021

I have never seen a monograph (or book) with such an incomplete list of citations.

Dan Roberts

@danintheory

18 Jun 2021

I'm delighted to share publicly "The Principles of Deep Learning Theory," co-written with @ShoYaida, and based on work also with @BorisHanin. It will appear on the @arxiv on Sunday and will be published by @CambridgeUP early next year: deeplearningtheory.com/ 1/

Jason Lee · May 29, 2025 · 5:54 PM UTC

Jason Lee @jasondeanlee

29 May 2025

OK so they lasted one day, considerably less than 6 months.

Shashwat Goel

@ShashwatGoel7

29 May 2025

Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇

10,499

Jason Lee · Aug 4, 2025 · 8:33 PM UTC

Jason Lee @jasondeanlee

4 Aug 2025

OK forget the 100m. I'll just be the doorman at Nvidia.

Jason Lee @jasondeanlee

4 Aug 2025

How can the median employee be worth 25m. No way this is right...

9,246

Jason Lee · May 24, 2025 · 8:09 PM UTC

Jason Lee @jasondeanlee

24 May 2025

Optimal isn't even defined for feature learning so how could it be optimal. Mup, completep and variants satisfy some desiderata. The choice of desiderata leads to different init schemes, and it's not clear what the right set of desiderata is.

Shaurya Sharthak @ShauryaSharthak

24 May 2025

Replying to @attentionmech

Just for any doubts μP or CompleteP are techniques for optimal feature learning. (Not an optimizer) ref: arxiv.org/abs/2505.01618

19,812

Jason Lee · May 16, 2022 · 3:35 AM UTC

Jason Lee @jasondeanlee

16 May 2022

I'm in good company!

Yann LeCun

@ylecun

16 May 2022

Verdict from @icmlconf: 3 out of 3 ..... rejected. If I go by tweet statistics, ICML has rejected every single paper this year 🤣

Jason Lee · Jul 19, 2025 · 7:24 PM UTC

Jason Lee @jasondeanlee

19 Jul 2025

Replying to @DimitrisPapail

Right. The surprising part is it's just an LLM, no special tool or solver like AlphaGeometry + Lean

5,150

Jason Lee · Jun 26, 2025 · 4:33 PM UTC

Jason Lee @jasondeanlee

26 Jun 2025

Replying to @george__wing

Wasn't sakana the one doing fraudulent demos

20,279

Jason Lee · Nov 3, 2025 · 2:11 AM UTC

Jason Lee @jasondeanlee

3 Nov 2025

I remember coming to these when I was in middle school and high school (all the way from Cupertino!). The first one I attended was by Prof Stankova on circle inversions and it was like magic.

Doris Tsao

@doristsao

2 Nov 2025

Unbelievable: the famed Berkeley Math Circle is being forced to shut down due to a bureaucratic requirement where a guest lecturer giving an hour long lesson needs to be officially fingerprinted. How is fingerprinting even still a thing in the 21st century? Chancellor Lyons @richlyons: can you see the absurdity of the situation and figure out a solution? dailycal.org/news/campus/gen…

20,747

Jason Lee · Jun 14, 2025 · 5:51 PM UTC

Jason Lee @jasondeanlee

14 Jun 2025

Bye east coast!

7,893

Jason Lee · May 12, 2024 · 10:57 AM UTC

Jason Lee @jasondeanlee

12 May 2024

Adam has more citations than Robbins-Monro! Wtf

Gabriel Peyré

@gabrielpeyre

12 May 2024

Oldies but goldies: H Robbins, S Monro, A Stochastic Approximation Method, 1951. Early appearance of the stochastic gradient method, which is the workhorse of many large-scale ML methods. en.wikipedia.org/wiki/Stocha… en.wikipedia.org/wiki/Stocha…

15,954

Jason Lee · May 22, 2025 · 8:00 PM UTC

Jason Lee @jasondeanlee

22 May 2025

All models are equally good.

21,345

Jason Lee · Dec 22, 2024 · 5:50 PM UTC

Jason Lee @jasondeanlee

22 Dec 2024

Replying to @farooque_99 @OpenAI @sama @markchen90

Agree flash thinking is at least fast!

18,592

Jason Lee · May 12, 2024 · 12:05 PM UTC

Jason Lee @jasondeanlee

12 May 2024

Are Kolmogorov-Arnold networks (KAN) just standard MLPs with an activation corresponding to the B-spline and some parallel net arch to handle the gridpoints? Feels like this should be true...

20,219

Jason Lee · Jul 10, 2025 · 1:00 AM UTC

Jason Lee @jasondeanlee

10 Jul 2025

Last week was 100m, and this week it's 200m. finance.yahoo.com/news/meta-… I'm waiting 8 more weeks for the 1B offer.

16,373

Jason Lee · May 9, 2024 · 6:14 PM UTC

Jason Lee @jasondeanlee

9 May 2024

Our new COLT paper solving a COLT open problem!

Simon Shaolei Du

@SimonShaoleiDu

9 May 2024

How to learn the best shared model across multiple data distributions — a unified paradigm with applications in robustness, fairness, and calibration? Our COLT 2024 paper shows how to do it optimally using Hedge! arxiv.org/abs/2312.05134. Also resolved 3 COLT 2023 open problems: arxiv.org/abs/2307.12135

9,395

Jason Lee · Jan 14, 2024 · 10:17 PM UTC

Jason Lee @jasondeanlee

14 Jan 2024

We (@alex_damian_ @EshaanNichani) need help from Markov chain experts! mathoverflow.net/questions/4…

12,002

Jason Lee · Oct 18, 2020 · 8:21 PM UTC

Jason Lee @jasondeanlee

18 Oct 2020

Replying to @roydanroy @ilyasut

More like if you completely close your eyes...

Jason Lee · Oct 23, 2025 · 6:20 PM UTC

Jason Lee @jasondeanlee

23 Oct 2025

We have openings including in AI, please apply!

Berkeley Statistics @UCBStatistics

23 Oct 2025

UC Berkeley Department of Statistics is hiring! We’re seeking applicants for up to three approved tenure-track positions at the Assistant Professor level in Statistics, Probability and AI. Details & apply: aprecruit.berkeley.edu/JPF05… #AI #Statistics #Probability #UCBerkeley

17,111

Jason Lee · May 30, 2025 · 3:07 PM UTC

Jason Lee @jasondeanlee

30 May 2025

New work on training deep transformers for multi-step reasoning!

Zixuan Wang

@zzZixuanWang

30 May 2025

6,976

Jason Lee · May 22, 2025 · 9:10 PM UTC

Jason Lee @jasondeanlee

22 May 2025

Though all are considerably worse than my students @alex_damian_ @EshaanNichani

Jason Lee @jasondeanlee

22 May 2025

I can confirm opus 4 is equal in ability to o3 and gemini 2.5 pro at computing Gaussian and Hermite identities. That's my main use case.

4,999

Jason Lee · Jun 12, 2025 · 12:49 PM UTC

Jason Lee @jasondeanlee

12 Jun 2025

Quite happy with A*-PO. 1. Simple (no heuristics such as clipping/normalization). 2. One rollout per iteration, improving efficiency.

Zhaolin Gao

@GaoZhaolin

11 Jun 2025

Current RLVR methods like GRPO and PPO require explicit critics or multiple generations per prompt, resulting in high computational and memory costs. We introduce ⭐A*-PO, a policy optimization algorithm that uses only a single sample per prompt during online RL without critic.

7,315

Jason Lee · Dec 11, 2022 · 8:44 PM UTC

Jason Lee @jasondeanlee

11 Dec 2022

Replying to @thegautamkamath

arxiv.org/abs/2212.03714

Jason Lee · Aug 22, 2025 · 3:13 PM UTC

Jason Lee @jasondeanlee

22 Aug 2025

To be clear, this would probably take 1 hour for someone with experience with pep or Mathematica. Not the 5 hours of prompting + spending all my gpt-5-pro credits

4,047

Jason Lee · Mar 31, 2021 · 9:44 PM UTC

Jason Lee @jasondeanlee

31 Mar 2021

TLDR: By moving further from initialization, you can provably learn a broad class of functions (low-rank polynomials) with fewer samples than any kernel method. Low-rank polynomials include networks with polynomial activation of bounded degree and analytic activation (approx).

Yu Bai

@yubai01

31 Mar 2021

🚨 New blog post on Deep Learning Theory Beyond NTKs: Salesforce research blog: blog.einstein.ai/beyond-ntk/ offconvex: offconvex.org/2021/03/25/bey… An exposition of "escaping the NTK ball with stronger learning guarantees". Joint w/ @jasondeanlee @MinshuoC

Jason Lee · May 24, 2023 · 2:07 PM UTC

Jason Lee @jasondeanlee

24 May 2023

Downloaded the source of this one and tried to compile in a larger font to make it readable. The latex is completely obfuscated to make it hard to edit. What's the point of this? Make it inaccessible to low vision readers? Makes it near impossible to reformat to be read.

Zeyuan Allen-Zhu, Sc.D.

@ZeyuanAllenZhu

24 May 2023

Our first paper in a series studying the inner mechanisms of transformers. TL;DR: we show *how* GPTs learn complex CFG trees via learning to do dynamic programming. Huge thanks to @MetaAI for making this research journey possible. arxiv.org/abs/2305.13673 FYI to @OpenAI @mbzuai

40,136

Jason Lee · Nov 9, 2025 · 4:57 PM UTC

Jason Lee @jasondeanlee

9 Nov 2025

Like wtf.

Jason Lee @jasondeanlee

9 Nov 2025

Please fix the defaulting to chatgpt. So annoying, manipulative and dishonest @nickaturley. Also fix your mathjax/latex/markdown. No one wants to be reading raw latex.

20,721

Jason Lee · Dec 22, 2024 · 5:29 PM UTC

Jason Lee @jasondeanlee

22 Dec 2024

Replying to @Kangwook_Lee @OpenAI @sama @markchen90

Good idea!

25,767

Jason Lee · May 17, 2022 · 2:52 PM UTC

Jason Lee @jasondeanlee

17 May 2022

Now I know who took my slots!

Chi Jin @chijinML

17 May 2022

Wow, all of our 6 submissions to ICML and COLT got accepted this year! Congrats to all my collaborators.

Jason Lee · Jun 26, 2025 · 2:20 PM UTC

Jason Lee @jasondeanlee

26 Jun 2025

I love the east bay! Grew up in south bay and never thought much of the east bay. Turns out it's way better: 1) better weather, 2) food is cheaper, less crowded

8,708

Jason Lee · Nov 10, 2025 · 5:31 PM UTC

Jason Lee @jasondeanlee

10 Nov 2025

UT gets more compute! news.utexas.edu/2025/11/10/u…

UT Doubles Size of One of World’s Most Powerful AI Computing Hubs

AUSTIN, Texas — The Center for Generative AI at The University of Texas at Austin, already among the most powerful artificial intelligence hubs in the

news.utexas.edu

33,458

Jason Lee · Mar 9, 2022 · 3:04 AM UTC

Jason Lee @jasondeanlee

9 Mar 2022

What is the analog of ERM for offline RL? We propose primal-dual regularized offline RL (PRO-RL), which has many of the properties that make ERM so successful. arxiv.org/abs/2202.04634