Training Accelerations @OpenAI. Previously @SFResearch, PhD @Stanford.

San Francisco, CA
Pinned Tweet
🥔GPT-5.5 is here in Codex and ChatGPT. 🚀 Don’t want to keep saying “step change” with each release, but this time we feel it’s a pretty big one. It may be an inflection point for a lot of things down the road. Please try using this model in Codex for your coding and other professional use cases – Start with the same tasks as before, expect it to do more with less human in the loop, it will make a big difference over 5.4.
Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.
3
2
53
11,929
We adjusted GPT5 based on all of your valuable feedback -- separated Auto vs. Fast, made 4o available, increased rate limits and made Mini explicit (Mini is an insane model too!)
Updates to ChatGPT: You can now choose between “Auto”, “Fast”, and “Thinking” for GPT-5. Most users will want Auto, but the additional control will be useful for some people. Rate limits are now 3,000 messages/week with GPT-5 Thinking, and then extra capacity on GPT-5 Thinking mini after that limit. Context limit for GPT-5 Thinking is 196k tokens. We may have to update rate limits over time depending on usage. 4o is back in the model picker for all paid users by default. If we ever do deprecate it, we will give plenty of notice. Paid users also now have a “Show additional models” toggle in ChatGPT web settings which will add models like o3, 4.1, and GPT-5 Thinking mini. 4.5 is only available to Pro users—it costs a lot of GPUs. We are working on an update to GPT-5’s personality which should feel warmer than the current personality but not as annoying (to most users) as GPT-4o. However, one learning for us from the past few days is we really just need to get to a world with more per-user customization of model personality.
33
17
377
57,572
📢 Life update: In the vibes of the announcement today I am thrilled to share that I have joined @OpenAI as a researcher! It's just my week 2 here but already amazed by so many things the team has achieved. Looking forward to learning more and contributing!
GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot 🙂. Here’s how it’s been doing.
51
10
346
79,304
We used RL to train a much stronger reasoning model. Excited to have been part of this journey, and way to go!!!
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math. openai.com/index/introducing…
9
12
284
38,229
Thrilled to share our new work "Transformers as Statisticians👩‍🎓👨‍🎓" Unveiling a new mechanism "In-Context Algorithm Selection" for In-Context Learning (ICL) in LLMs/transformers. ++ A comprehensive theory for transformers to do ICL. arxiv.org/abs/2306.04637 Thread⬇️
3
47
219
82,606
🚨 New blog post on Deep Learning Theory Beyond NTKs: Salesforce research blog: blog.einstein.ai/beyond-ntk/ offconvex: offconvex.org/2021/03/25/bey… An exposition of "escaping the NTK ball with stronger learning guarantees". Joint w/ @jasondeanlee @MinshuoC
3
38
176
We released our first open-source language model since GPT-2! It was amazing how the entire team came together at every stage of this work -- squeezing out the absolute best performance, stress-testing and mitigating safety risks to a new standard, and overcoming many unforeseen challenges -- the model is finally here and we can't wait to see the amazing research you will all do on top of it!
Our open models are here. Both of them. openai.com/open-models
4
5
153
26,224
Excited to share that "Transformers as Statisticians" will appear at #NeurIPS2023 as an oral! We have two other posters on learning with attention models and RL theory (thread may follow): arxiv.org/abs/2307.11353 arxiv.org/abs/2306.01243
Thrilled to share our new work "Transformers as Statisticians👩‍🎓👨‍🎓" Unveiling a new mechanism "In-Context Algorithm Selection" for In-Context Learning (ICL) in LLMs/transformers. ++ A comprehensive theory for transformers to do ICL. arxiv.org/abs/2306.04637 Thread⬇️
2
13
140
21,417
📜🆕New extended blog post on recent progress in multi-agent RL theory (joint w/ Chi Jin): yubai.org/blog/marl_theory.h… We talk about "How does RL theory become different when it's multi-agent", and present the various recent developments and opportunities therein.
2
15
112
Flying to #NeurIPS2023 now. Looking forward to meeting old and new friends, and talking about everything LLM / RL! "Transformers as Statisticians" oral would be in the Wed afternoon session.
1
6
84
10,536
Surreal & beyond excited to have joined forces with @Song__Mei again -- more to come soon!
I’m excited to start at OpenAI this May and help ship the oss model. More to come soon!
4
2
84
20,487
Today is the day -- we are excited to bring gpt5 to you. Fortunate to have led several workstreams in GPT5 Thinking and Mini model training. Among many other improvements, with @Song__Mei @minyoung_huh @SebastienBubeck and co, we applied some unexpected but cool techniques to make the model smart, chatty, and a good model all-around. Also honored to have worked together with the crew @yanndubs @ElaineYaLe6 @christinahkim @ericmitchellai @michpokrass @max_a_schwarzer and everyone else, it was fun coming together and doing things! Let us know how you like or dislike it -- this will not be the last model we're gonna train.
GPT-5 is here. Rolling out to everyone starting today. openai.com/gpt-5/
19
7
82
16,485
And as my journey at Salesforce Research @SFResearch comes to an end after 4.5 years, I can't help but feel so fortunate to have been a part of this amazing AI research team. Thanks @huan__wang @CaimingXiong @silviocinguetta @RichardSocher ++everyone for all the support ♥️
3
2
78
11,915
Brightness = how many cells were reprogrammed by the protein. Left is unaltered; middle is existing protein; right is the new model-designed protein.
At @OpenAI, we believe that AI can accelerate science and drug discovery. An exciting example is our work with @RetroBiosciences, where a custom model designed improved variants of the Nobel-prize winning Yamanaka proteins. Today we published a closer look at the breakthrough. ⬇️
4
5
71
7,758
How do deep networks perform hierarchical learning? We theoretically show that networks with wide intermediate representations can express functions hierarchically, and be more sample efficient than "shallow learners" such as the NTK. arxiv.org/abs/2006.13436
2
13
68
#ICLR2022 We present CP-Gen, a modular approach for improving the efficiency (e.g. length, volume) of conformal prediction by tuning prediction sets with more than one parameter. Paper: openreview.net/forum?id=Ht85… Poster (Monday 6:30pm PT): iclr.cc/virtual/2022/poster/…
2
9
54
We also used ReLU attention to study the expressive power of transformers. It matches softmax in our small (gpt2 scale) experiment in the first paper below. arxiv.org/abs/2306.04637 arxiv.org/abs/2307.11353 Nice work @hoonkp @skornblith and co to get ReLU transformers in action!
Replacing softmax with ReLU in Vision Transformers ReLU-attention has better compute-performance scaling than softmax-attention on Vision Transformers arxiv.org/abs/2309.08586
1
7
50
10,259
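A minimal sketch of the softmax vs. ReLU attention contrast mentioned above, for a single query. This is an illustration and not code from either paper: the function names are hypothetical, and dividing the ReLU scores by the sequence length is an assumption made here so the weights stay on a comparable scale.

```python
import math

def softmax_weights(scores):
    """Standard softmax attention weights (max-subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU attention weights: relu(score) / sequence length.

    The 1/n normalization is an assumption of this sketch. Unlike softmax,
    the weights need not sum to 1 and can be exactly zero.
    """
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors (each value is a list of floats)."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

scores = [2.0, -1.0, 0.0, 4.0]  # hypothetical query-key scores
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
print(relu_weights(scores))                  # [0.5, 0.0, 0.0, 1.0]
print(attend(relu_weights(scores), values))  # [0.5, 2.0]
```

The only change between the two variants is the weighting function; the value aggregation is shared.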
📜 Our paper on efficient learning in Extensive-Form Games will appear at #NeurIPS2022 as an Oral Presentation! 🔗 Paper: arxiv.org/abs/2205.15294 📢 Poster: neurips.cc/virtual/2022/post… Joint work with @chijinML @WispyMay Ziang Song @tianchengyu14 🧵1/
1
7
48
CoT is a nice name! And excited that "Transformers as Statisticians" oral will be in this "CoT/Reasoning" session (Wed)--Being a statistician probably does mean that you're good at reasoning 😃
In NeurIPS 2023, there is a section “CoT/Reasoning”. When preparing our CoT paper, I kicked off a discussion on the title. Different names were proposed, like stream of thought (Jason), train of thought (Dale), chain of thought (Dale). Finally I decided to choose “chain of thought”. Happy to see the name is liked by the community and popularized. :)
5
43
6,852
New preprint on offline RL: arxiv.org/pdf/2102.01748.pdf * A variance reduction algorithm for offline RL * Optimal horizon dependence: O(H^2/d_m) sample complexity on time-homogeneous MDPs Joint w/ Ming Yin (@MingYin_0312) and Yu-Xiang Wang
2
6
38
Check out our #ICML2021 paper---We theoretically analyze calibration, and show that over-confident prediction happens for well-specified logistic regression too, not just on large NNs! Paper: arxiv.org/abs/2102.07856 Poster: icml.cc/virtual/2021/poster/…, Wed 9am PT 1/4
2
4
37
Congrats on the ICML best paper @GoogleDeepMind @misovalko @Tdash_Koz & team! Wraps a "trilogy" on learning Extensive-Form Games: * Our #ICML2022 paper, which first got O(X) (tight): arxiv.org/abs/2202.01752 * Their #NeurIPS2021 paper which got O(X^2): arxiv.org/abs/2106.06279
Congrats to @GoogleDeepMind’s Remi Munos, @misovalko, & team on the Outstanding Paper Award at @ICMLConf! “Adapting to game trees in zero-sum imperfect information games” helps answer: how do you make the best move in a game w/ only partial info? Paper: openreview.net/pdf?id=O1j4uF…
1
4
34
9,519
I'll be presenting "Transformers as Statisticians" at the ES-FOMO workshop at #ICML2023, at 1:00pm HT. See you there! Workshop website: icml.cc/virtual/2023/worksho…
Thrilled to share our new work "Transformers as Statisticians👩‍🎓👨‍🎓" Unveiling a new mechanism "In-Context Algorithm Selection" for In-Context Learning (ICL) in LLMs/transformers. ++ A comprehensive theory for transformers to do ICL. arxiv.org/abs/2306.04637 Thread⬇️
2
2
35
7,261
Our paper on low switching cost RL (arxiv.org/abs/1905.12849) has been accepted at @NeurIPSConf 2019. We showed that efficient PAC exploration can be achieved by switching the policy only logarithmically many times. Congrats Tengyang, @nanjiang_cs, and Yu-Xiang!
1
2
31
TLDR: Attention sink/massive tokens emerge in LLMs, simply because most heads need to be * Active for some input sequences; * "Dormant" for others. Started as a fun collab during my time @SFResearch, huge shoutout to @TianyuGuo0505 @druv_pai @Song__Mei +co for the amazing work!
Many LLMs, e.g., GPT2 and Llama, exhibit a fascinating attention sink phenomenon: attention weights often concentrate on the first token. We studied the training dynamics of toy models to demystify the sink formation mechanisms in LLMs. With fantastic @TianyuGuo0505, @druv_pai, @yubai01, @JiantaoJ, and Mike Jordan! ArXiv link: arxiv.org/abs/2410.13835 In detail: Practitioners have consistently found three extreme-token phenomena in LLMs: attention sinks, value-state drains, and residual-state peaks. They often cause trouble in LLM inference and quantization. To understand them, we developed the Bigram-Backcopy task and analyzed a single-layer transformer, revealing two key mechanisms: • Active-dormant mechanism: The attention sink represents the dormant phase of an attention head. • Mutual reinforcement mechanism: Attention sinks and value-state drains mutually reinforce during training. All results can transfer to LLMs! • Llama 2 has a “coding head” that is dormant given Wikipedia texts. • OLMo’s training dynamics closely match the theory and the toy model. We also found that replacing SoftMax attention with ReLU attention can mitigate the extreme-token phenomenon.
30
2,970
Attending #NeurIPS2022 from Mon Evening -> Sat, and presenting 4 papers (1 oral + 3 posters) on multi-agent RL, games, and deep learning theory. I will also be at Salesforce's booth Tuesday afternoon, starting 2:45pm. Let me know if you want to chat!
1
31
The AI Economist: Using multi-agent RL to simulate complex economic systems, guide policy designs, and improve social equality. Impressive work by colleagues @StephanZheng @alexrtrott and all at @SFResearch!
Excited to introduce the AI Economist: Extends ideas from Reinforcement Learning for tackling inequality through learned tax policy design. The framework optimizes productivity and equality. Blog: blog.einstein.ai/the-ai-econ… Paper: arxiv.org/abs/2004.13332 Q&A: salesforce.com/company/news-…
7
29
Can wide neural nets be systematically analyzed beyond the kernel / linearized regime? Our recent work shows that wide NNs can couple with higher-order (e.g. quadratic) submodels and generalize better than the linearized ones! arxiv.org/abs/1910.01619 (joint w/ @jasondeanlee)
2
27
Besides ~saturating AIME, o3-mini is also the first to consistently solve some of the hard math questions in my own "test set" -- have to update that as well 🤣 Congrats @ren_hongyu @shengjia_zhao @_kevinlu + co!
1
26
2,223
En route to #ICML2023 ✈️🌴. Let's chat about LLMs / in-context learning, (multi-agent) RL, and their theories. You can also find me at our posters and workshop papers:
25
4,170
Congrats @LesterMackey!! still remember all the fun stuff at Stanford Stats 300 class and the stats ML reading group -- influenced me and so many of the next generation of statisticians
🙌🎉Our 2025 recipient of the COPSS Presidents' Award, is Lester Mackey! This award is given annually to a young member of the statistical community in recognition of outstanding contributions to the profession of statistics.
1
1
21
4,439
GPT-4o mini is out!
Introducing GPT-4o mini! It’s our most intelligent and affordable small model, available today in the API. GPT-4o mini is significantly smarter and cheaper than GPT-3.5 Turbo. openai.com/index/gpt-4o-mini…
23
4,287
We have a better reasoning model again! I continue to be amazed by how much more the model gains from unlocking more tool use and being post-trained on a better stack.
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.
2
2
20
2,556
Excited to be attending #NeurIPS2019 at Vancouver next week!
20
Happy to be selected as an expert reviewer for @TmlrOrg ! Time to send in a submission for earning that expert certificate :)
We have just finalized our first selection of TMLR Expert Reviewers. These are reviewers who have done particularly exemplary work in evaluating TMLR submissions. See the following page for details and the list of reviewers: openreview.net/group?id=TMLR…
1
20
3,200
Excited to present our paper "Provable Self-Play Algorithms for Competitive Reinforcement Learning" at #ICML2020! Talk: Wednesday (July 15) 9am PT / 10pm PT Paper: arxiv.org/abs/2002.04017 Poster: icml.cc/virtual/2020/poster/… Joint work with Chi Jin. 1/2
1
1
20
Check out NPO, a simple objective for LLM unlearning.
LLM unlearning was mostly based on variants of gradient ascent (GA), susceptible to catastrophic forgetting. We propose Negative Preference Optimization (NPO), demonstrating efficient unlearning on TOFU benchmark. w/ @RuiqiZhang0614, Licong Lin, @yubai01. arxiv.org/abs/2404.05868
1
2
18
5,878
Exciting opportunity for working with Song on LLMs!
My group at Berkeley Stats and EECS has a postdoc opening in the theoretical (e.g., scaling laws, watermark) and empirical aspects (e.g., efficiency, safety, alignment) of LLMs or diffusion models. Send me an email with your CV if interested!
18
4,521
Looking forward to this tomorrow! Thanks for organizing @CsabaSzepesvari @neu_rips @CiaraPikeBurke
Thinking of scaling up multiagent RL to a large number of agents? Provably? Choose your equilibrium concept right and you may be rewarded! Yu Bai will tell us tomorrow how! For details see tinyurl.com/375x6j6b
18
Flying to #ICML2023 tomorrow. Ping me if you'd like to chat!
19
2,944
Lol don't mind at all if deep research gets past me in 2025! @EdwardSun0909 it's on you :)
PhD experts? 🤣🤣 Unless they can perform at @yubai01’s level, they’re irrelevant to the machine learning theory community.
2
16
6,160
Appearing at #NeurIPS2020! Come to our poster session at Tuesday 9-11am PT to have some fun with NTKs, shallow Taylorized models, and better sample complexity than all these via neural hierarchical learning. w/ @MinshuoC @jasondeanlee ++ neurips.cc/virtual/2020/prot…
How do deep networks perform hierarchical learning? We theoretically show that networks with wide intermediate representations can express functions hierarchically, and be more sample efficient than "shallow learners" such as the NTK. arxiv.org/abs/2006.13436
1
3
15
Come chat with us about our Beyond Linearization paper and more! ICLR poster session today 10am - 12pm and 1 - 3pm PDT: iclr.cc/virtual/poster_rkllG… Paper: openreview.net/forum?id=rkll…
Our Beyond Linearization paper is accepted at #ICLR2020 ! openreview.net/forum?id=rkll…
3
17
In new paper led by @EshaanNichani, we utilize the spectral structure + higher-order "QuadNTK" approximation to show benefit of "After NTK" learning.
What happens “after NTK” in wide neural nets, and how does it improve over the NTK? Excited to announce a new paper with @yubai01 and @jasondeanlee! arxiv.org/abs/2206.03688 A thread on the main takeaways below: (1/9)
15
Our Beyond Linearization paper is accepted at #ICLR2020 ! openreview.net/forum?id=rkll…
Can wide neural nets be systematically analyzed beyond the kernel / linearized regime? Our recent work shows that wide NNs can couple with higher-order (e.g. quadratic) submodels and generalize better than the linearized ones! arxiv.org/abs/1910.01619 (joint w/ @jasondeanlee)
1
15
Check out our new work for efficiently learning "rationalizable equilibria" in multiplayer games---Strategies that are both approximate CE/CCE, and supported on rationalizable actions.
Replying to @chijinML
We are excited to announce our recent work arxiv.org/abs/2210.11402 with @YuanhaoWang3, Dingwen Kong, @yubai01, which presents new algorithms and the first sample-efficient guarantees for learning rationalizable equilibria.
12
Was always hoping we could do this, and we are finally doing it!
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: openai.com/open-model-feedba… we are excited to make this a very, very good model! __ we are planning to release our first open-weight language model since GPT-2. we’ve been thinking about this for a long time but other priorities took precedence. now it feels important to do. before release, we will evaluate this model according to our preparedness framework, like we would for any other model. and we will do extra work given that we know this model will be modified post-release. we still have some decisions to make, so we are hosting developer events to gather feedback and later play with early prototypes. we’ll start in SF in a couple of weeks followed by sessions in europe and APAC. if you are interested in joining, please sign up at the link above. we’re excited to see what developers build and how large companies and governments use it where they prefer to run a model themselves.
9
2,320
Our annual AI research grant is now open for applications!
Announcing the Third Annual AI Research Grant! For more details and how to apply: Blog: blog.einstein.ai/announcing-… Website: einstein.ai/outreach/grants Good luck to our future applicants!
11
🆕"When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?" arxiv.org/abs/2110.04184 We theoretically study what RL can learn in multi-player general-sum MGs without exp(# players) samples. Joint w/ Ziang Song (Peking U.) & @WispyMay. 🧵
2
11
Will be attending this workshop Thu - Fri. Looking forward to it!
2-day workshop "New Directions in Reinforcement Learning and Control" @the_IAS in Princeton Nov 7-8. Schedule math.ias.edu/ndrlc and livestream here ias.edu/livestream .
10
#NeurIPS2020 What is the optimal algorithm for multi-agent reinforcement learning in zero-sum Markov games? We present "Near-Optimal Reinforcement Learning via Self-Play" Paper: arxiv.org/abs/2006.12007 Poster session: Tuesday 9-11am PT Joint w/ Chi Jin, Tiancheng Yu.
1
10
Congrats @alexwei_ @SherylHsu02 @polynoamial !! This is a crazy result.
Replying to @alexwei_
9/N Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.
12
2,372
I think "halftime" is such a nice framing -- I had a similar feeling when sitting in Ilya's test-of-time talk at NeurIPS 2024. It felt like major pieces of the puzzle have already come together. Evals--often imaginations of what models can do---lead the way of what we can train them to do.
I finally wrote another blogpost: ysymyth.github.io/The-Second… AI just keeps getting better over time, but NOW is a special moment that i call “the halftime”. Before it, training > eval. After it, eval > training. The reason: RL finally works. Lmk ur feedback so I’ll polish it.
11
2,423
🚀
we are going to take a little more time with our open-weights model, i.e. expect it later this summer but not june. our research team did something unexpected and quite amazing and we think it will be very very worth the wait, but needs a bit longer.
2
10
1,831
#ICLR2022 We present provably sample-efficient algorithms for multi-agent RL with large # players, without exp(# players) blowup! Poster session today (Tue 6:30 - 8:30pm PT): iclr.cc/virtual/2022/poster/… Paper👇
🆕"When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?" arxiv.org/abs/2110.04184 We theoretically study what RL can learn in multi-player general-sum MGs without exp(# players) samples. Joint w/ Ziang Song (Peking U.) & @WispyMay. 🧵
8
Check out our new AI Residency Program!
Our new AI Residency Program aims to foster the next generation of AI researchers. Our program gives candidates real-world experience and makes them more qualified for top PhD programs. Applications close January 3, 2022: bit.ly/AIResJobAppTwitter
10
hm what would that be?
LIVE5TREAM THURSDAY 10AM PT
11
2,145
Great summer research intern opportunity!
We're hiring AI Research Interns for Summer 2025! Spend 3 months with us working on AI Agents, LLMs, Reasoning, Planning & more—with a focus on publishing high-quality academic papers. If you have a strong publication record, apply or DM me! #researchpaper #JobOpening #intern
9
3,538
Check out Song's blog for more nice stuff on statistical physics <-> theoretical ML.
Very nice blog post from Song Mei on the replica method from statistical physics. meisong541.github.io/jekyll/…
1
8
Congrats @yanndubs !! Back to keeping the capability flywheel flying. 🛞
🔥 So excited to share GPT-5! For thinking mode and API models, we’ve improved performance across key: - Axes: factuality, steerability, long-context performance, efficiency - Domains: coding, writing, healthcare But we still have so many ideas for improvement, and as @SebastienBubeck mentioned, we also discovered a new capability flywheel, where GPT-N can help improve GPT-N+1. I can’t wait to see how GPT-6 will pan out! Now, back to crafting! cc @michpokrass @ericmitchellai @max_a_schwarzer @markchen90 @sama @OpenAI
1
9
1,850
Curious exp: A single transformer (TF_alg_select) can simultaneously match Bayes optimal in-context predictions on two tasks (noisy linear models with different noise levels). Those two tasks required different optimal algorithms (ridge regression with different \lambda's)!
1
6
1,235
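A short worked equation for why the two noise levels above call for different ridge parameters. This is the standard Bayesian linear regression identity, not a derivation taken from the paper: the Gaussian prior and its scale \tau are assumptions introduced here for illustration.

```latex
% Assumed setup (illustrative): w \sim \mathcal{N}(0, \tau^2 I_d),
% y_i = \langle w, x_i \rangle + \varepsilon_i, \ \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
\begin{align*}
\hat{w}_{\mathrm{Bayes}}
  &= \mathbb{E}[\, w \mid X, y \,]
   = \Big( X^\top X + \tfrac{\sigma^2}{\tau^2} I_d \Big)^{-1} X^\top y \\
  &= \operatorname*{arg\,min}_{w} \; \| y - X w \|_2^2 + \lambda^\star \| w \|_2^2,
  \qquad \lambda^\star = \frac{\sigma^2}{\tau^2}.
\end{align*}
```

Under this setup a larger noise level \sigma gives a larger optimal \lambda, so matching the Bayes predictor on both tasks requires choosing between two different ridge regressions.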
Check out our work!
If you are attending ICML @icmlconf this week, you are most welcome to check out some recent work from our team: #ICML2021 @SFResearch
7
Replying to @_aidan_clark_
Bad local minima were studied a lot, e.g. Auer et al. 1995 "Exponentially many bad local minima for single neurons": papers.nips.cc/paper_files/p… tho the bad example there is clearly contrived, and the authors did not explicitly draw an implication like "NNs are bad" based on it
6
337
Replying to @EdwardSun0909
Congrats!! When is Deep Research's thesis defense?
1
6
2,761
Replying to @johnschulman2
It's been an honor to have been your colleague, and I wish it could have been longer. Thank you and all the best!
6
965
That's the poster style we all need! 🤣
A huge thanks to everyone who came to the poster session. Posting this for whoever missed the jokes; comments & suggestions are more than welcome.
1
6
1,879
Replying to @sughanthans1
It should be much better on coding and a little better across the board!
6
340
Nice 🧵 by @yuxiangw_cs on low-switching cost RL (aka deployment efficiency). It's a practically relevant setting in between offline RL and "truly" online RL, with much exciting progress and many open questions for both deep RL and RL theory!
Online RL guarantees good exploration but has limited applicability (due to safety / legal concerns for online trials-and-errors). Offline RL (aka Batch RL) shows great promise but requires strong assumptions on logged data. Is there anything in between? 1/7
6
Acknowledgement: Thanks @prfsanjeevarora for hosting offconvex and for all the help! Thanks to the other co-authors of our paper @tourzhao @huan__wang @CaimingXiong @RichardSocher
1
6
Replying to @jasondeanlee
Congrats! The highest compliment for food -- "edible" by Jason
1
7
1,471
To discover and understand the capabilities of LLMs, I believe this combination will become even more powerful. This is joint work with an amazing team of collaborators: Fan Chen (Peking U), @Song__Mei (Berkeley) @huan__wang @CaimingXiong (Salesforce)
1
4
866
Personally, one thing I really like in this project is that, **both experiments and ML theory** (statistics, linear algebra with transformers) played crucial roles in isolating and rigorizing the phenomenon.
1
4
818
In the first mechanism, “Post-ICL Validation”, the TF executes many ICL algorithms in parallel on a train split, and outputs the one with lowest loss on a validation split. Example: A TF can do ridge regression with \lambda_1 on input 1 and \lambda_2 on input 2.
1
3
806
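The “Post-ICL Validation” idea above can be sketched in ordinary code: fit several candidate algorithms on a train split of the in-context examples, then keep the one with the lowest loss on a validation split. This is only an analogy to what the transformer is constructed to do, not the construction itself. The scalar-feature ridge regression, the 50/50 split, and all function names are assumptions of this sketch, and the candidates are fitted in a loop rather than in parallel.

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge for one scalar feature: w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the predictor x -> w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_by_validation(xs, ys, lambdas, train_frac=0.5):
    """Fit one ridge model per lambda on the train split and return the
    (lambda, weight) pair with the lowest validation loss."""
    n_train = int(len(xs) * train_frac)
    x_tr, y_tr = xs[:n_train], ys[:n_train]
    x_va, y_va = xs[n_train:], ys[n_train:]
    fits = [(lam, ridge_fit_1d(x_tr, y_tr, lam)) for lam in lambdas]
    return min(fits, key=lambda fit: mse(fit[1], x_va, y_va))

# Noise-free toy prompt with y = 2x: the unregularized fit wins.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(select_by_validation(xs, ys, lambdas=[0.0, 5.0]))  # (0.0, 2.0)
```

On noisier prompts the same procedure would tend to pick a larger lambda, which is the per-input behavior the example in the tweet describes.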
Looking forward to seeing more exciting works from the team!
6
1,522
We call this capability "In-Context Algorithm Selection". This is similar to what a statistician / ML expert can do in real life: Choose the best algorithm for their data at hand. How can a Transformer (TF) do that? We construct two mechanisms in theory.
1
3
894
The Accuracy-Calibration frontier is much more informative than just calibration errors alone. Nice to see this extensive empirical study.
New paper: Revisiting the Calibration of Modern Neural Networks (arxiv.org/abs/2106.07998). We studied the calibration of MLP-Mixer, Vision Transformers, BiT, and many others. Non-convolutional models are doing surprisingly well! 1/5
4
++ Along the way, we develop a comprehensive & quantitative theory for TFs to do ICL: * Implementing many more ML algs by TF (Lasso, Logistic regression, neural networks...) * New efficient implementation of in-context gradient descent as backbone * Analysis of pretraining * ...
1
3
650
In the second mechanism, "Pre-ICL testing", the TF runs a certain distribution test to decide which ICL algorithm to use. Example: A TF can do linear regression on a regression problem, and logistic regression on a classification problem, using a binary type check.
1
2
736
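The “Pre-ICL testing” example above (a binary type check that routes to linear or logistic regression) can be mimicked in plain code. This is a sketch of the selection logic only, not the transformer construction from the paper. The scalar feature, the gradient-descent settings, and the function names are all assumptions made for illustration.

```python
import math

def is_binary(ys):
    """Type check: do all labels lie in {0, 1}?"""
    return all(y in (0, 1) for y in ys)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_linear_1d(xs, ys):
    """Least squares for one scalar feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_logistic_1d(xs, ys, steps=200, lr=0.5):
    """Logistic regression for one scalar feature via gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def pre_icl_select(xs, ys):
    """Run the test first, then only the algorithm that the test selects."""
    if is_binary(ys):
        return "logistic", fit_logistic_1d(xs, ys)
    return "linear", fit_linear_1d(xs, ys)

print(pre_icl_select([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])[0])       # linear
print(pre_icl_select([-1.0, -2.0, 1.0, 2.0], [0, 0, 1, 1])[0])   # logistic
```

In contrast with post-ICL validation, no candidate is fitted before the decision is made; the test looks only at the form of the labels.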
X = number of information sets for a single player, the main measure of game size for EFGs. Their new ICML paper improves over ours in the H (game horizon) dependency, and importantly does not require the structure of the game tree to be known ahead.
1
3
534
Thanks for flagging! Conditional coverage is definitely an important goal, great to see backproping thru conformal works here. May be interesting to see whether a proper efficiency loss could be designed for conditional coverage (and be optimized) too.
3
Congrats team!
Our NLP team got 16 papers (11 long, 2 short, and 3 finds) at #emnlp2020, which cover dialogue, summarization, question answering, multilingual, few-shot, NLI, semantic parsing, data augmentation, etc. Congrats to team members and coauthors. More info about papers coming soon!
4
Replying to @max_simchowitz
Congrats Max!
3
476
Yeah the meaning of self-play does depend on the context. In many theory works we do have >=2 *different* agents playing against each other, and we called it "self-play" too, to emphasize we don't need guidance from expert opponents / demonstrations (think AlphaZero vs. AlphaGo)
1
3
Thanks Caiming, it was great working with you!
3
444
Replying to @chijinML @Song__Mei
those were really good days!
1
2
371
These mechanisms not only match our findings in experiments. They also allow TFs to achieve strong ICL performance in theory. Example: We construct a TF to do nearly Bayes-optimal ICL in a challenging task---noisy linear models with **mixed** noise levels.
1
1
698
Our results generalize theirs to the case of EFCEs. Besides, we unveil a new connection in their setting as well: Hedge in NFG space = Kernelized MWU (Farina et al.'s efficient impl.) = Standard OMD with dilated entropy regularizer. Once again, OMD <-> NFG 😎 15/
1
2
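For readers unfamiliar with the Hedge / OMD connection mentioned above, here is a minimal sketch of Hedge on a plain probability simplex, where the multiplicative-weights update coincides with online mirror descent under the negative-entropy regularizer. It is the textbook single-simplex version only and does not implement the kernelized or dilated-entropy variants for extensive-form games. The function names and the step size are assumptions of this sketch.

```python
import math

def hedge_update(probs, losses, eta):
    """One Hedge step: p_i <- p_i * exp(-eta * loss_i), then renormalize.

    This is also the closed-form OMD step with the negative-entropy
    regularizer on the simplex.
    """
    scaled = [p * math.exp(-eta * l) for p, l in zip(probs, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def run_hedge(loss_rounds, eta):
    """Start from the uniform distribution and apply Hedge over all rounds."""
    n = len(loss_rounds[0])
    probs = [1.0 / n] * n
    for losses in loss_rounds:
        probs = hedge_update(probs, losses, eta)
    return probs

# Action 0 always has lower loss, so its probability grows round by round.
print(run_hedge([[0.0, 1.0]] * 5, eta=0.5))
```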
& stay tuned for more updates later this week.
1
4
1,119
Replying to @nanjiang_cs
Congrats Nan!
3
675
Replying to @WenSun1
Congrats!
1
220
Replying to @EdwardSun0909
Congrats!
1
518
Replying to @LoVVgE
Running into you at both the local street and NOLA! 🤣
1
149
What's even nicer about the OMD connection: We build on this connection to design a modified OMD algorithm, that achieves better and the first near-optimal sample complexity for learning EFCE under bandit feedback. That is our second main result. 12/
1
1
Replying to @lilianweng
We will miss you! Good luck on your new journey 🩵
1
882
Congrats!
2
451