Open model research @ something new. Prev. co-led Olmo at Ai2. Contact via email. Writes @interconnectsai. Wrote The RLHF Book. 🏔️🏃‍♂️

Seattle
TMax: An open RL recipe for terminal agents

I’m very excited to share a new RL paper today that I had a small part in, a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements.

TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form an “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2.

As a general summary, the release is open data and recipe lessons from hillclimbing the smaller, dense Qwen 3.5 models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are designed more to make training stable than to improve the rate of learning.

I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here.
You can click through to find the model weights, the data, and even some fun further artifacts to study, like all the RL rollouts from a training run, where the model sometimes became aware that it was being tested.

The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements: data, algorithm, codebase, pitfalls, etc.

Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels. (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K, and a meaningful RL run would be hundreds of steps. Credit: Edward Hu.) It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks.

What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMax can be for terminal agents, or at least the start of it. Yes, the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 for training, 6 for inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right.

This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero” model families, which are clean RL runs from a base model on a certain domain.
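To put the compute figures quoted above in perspective, here is a rough back-of-the-envelope sketch. The node count, job duration, number of jobs, and per-step price come from the post; the 8 GPUs per node and the specific step counts are assumptions made only for illustration.

```python
# Back-of-the-envelope compute for the training jobs described in the post.
# ASSUMPTION: 8 GPUs per H100 node (the post does not state this).
GPUS_PER_NODE = 8


def gpu_hours(nodes: int, days: float, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Total GPU-hours for a job that occupies `nodes` nodes for `days` days."""
    return nodes * gpus_per_node * days * 24


# One standard job: 8 nodes for 2 to 3 days.
job_low = gpu_hours(nodes=8, days=2)
job_high = gpu_hours(nodes=8, days=3)

# The post says the recipe took O(100) jobs; exactly 100 is used for illustration.
N_JOBS = 100
recipe_low, recipe_high = N_JOBS * job_low, N_JOBS * job_high

# Hosted fine-tuning figure quoted in the post: about $1K per RL step.
# ASSUMPTION: 200 and 500 are hypothetical stand-ins for "hundreds of steps".
COST_PER_STEP_USD = 1_000
hosted_low, hosted_high = 200 * COST_PER_STEP_USD, 500 * COST_PER_STEP_USD

print(f"one job:     {job_low:,.0f} to {job_high:,.0f} GPU-hours")
print(f"full recipe: {recipe_low:,.0f} to {recipe_high:,.0f} GPU-hours")
print(f"hosted run:  ${hosted_low:,.0f} to ${hosted_high:,.0f}")
```

Under these assumptions a single job is a few thousand GPU-hours, while establishing the recipe is two orders of magnitude more, which is the gap between what an academic group can ablate and what it can build from scratch.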
This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on. Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus.

My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out. I’m happy to support if I can, but I likely can’t reply to every email.)

As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way.

So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, as it is a bit more stable thanks to improved token-level clipping.
Trained some terminal agents with friends! Introducing Tmax, open RL terminal agent models. Under default settings and shorter (65k) token budgets, Tmax outperforms prior open work on terminal use. We are releasing all data+weights+rollouts publicly!
Meta is definitely not alone in this. And it's normally overblown too.
Airbnb CEO Brian Chesky: “We’re relying a lot on Alibaba’s Qwen model. It’s very good. It’s also fast and cheap... We use OpenAI’s latest models, but we typically don’t use them that much in production because there are faster and cheaper models.” The valley is built on Qwen?
I'm happy to announce I'm the next CEO of OpenAI and we're going to start doing open source again
The weirdest VC subsidizing of our time, 10% of the Anthropic series F goes to writers
Life update, she said yes. 🤩👩‍❤️‍👨🐕‍🦺
transparency in AI is dying: no evals, no release notes, just vibes and more bad naming
we updated GPT-4o today! improved both intelligence and personality.
vagueposting is the worst part of the ai community here
i just witnessed a new form of human computer interaction completely blown away by what’s coming in the next month
Anthropic sliding into that code tooling company role instead of AGI race role
On my recent trip, the Waymo market in SF has converged to ~2-3x the wait time and ~2-3x the cost of Uber because that's how much more people are willing to pay for Waymo.
leaders in ai talk like there's some master plan but it's literally just this
Recurring frontier lab gossip: OpenAI has the best post-training/RL and has pushed it super hard on weaker pretraining. Gemini has spectacular pretraining. Making a reasoning model was super easy for them & OpenAI folks were surprised. Anthropic? Secretive, I guess.
A sign Google is waking up is them dropping by far and away the biggest free plans across the industry. This time is for the Claude Code competitor. Super excited, next time can be them getting rid of "Gemini Advanced" or whatever it is. Free Gemini.
Most important plot from IO today -- AI usage is skyrocketing. This is real.
ILYA: "PRETRAINING IS DONE. WE ARE NOW IN THE POST TRAINING ERA."
This is a way bigger deal than the Claude release.
ChatGPT already helps millions of people find what to buy. Now it can help them buy it too. We’re introducing Instant Checkout in ChatGPT with @Etsy and @Shopify, and open-sourcing the Agentic Commerce Protocol that powers it, built with @Stripe, so more merchants and developers can integrate agentic checkout.
Props to Google for including O4-mini in their Flash 2.5 release. A model released *yesterday*, while some companies only compare to their own models. Looking good, Gemini.
o3 does not use different training nor inference methods than o1 (at least in pro mode). No special "search". OpenAI just found a hill and very quickly started hillclimbing it. Excited to build an open-source one and prove this to you in 2025. interconnects.ai/p/openais-o…
Okay okay, spent my weekend gooning around learning GRPO math. Here's some takes. Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches... Reading list below.
Since everyone wants to learn RL for language models now post DeepSeek, reminder that I've been working on this book quietly in the background for months. Policy gradient chapter is coming together. Plugging away at the book every day now. rlhfbook dot com
The first fantastic paper on scaling RL with LLMs just dropped. I strongly recommend taking a look and will be sharing more thoughts on the blog soon. The Art of Scaling Reinforcement Learning Compute for LLMs Khatri & Madaan et al.
TIN HAT TIME on what OpenAI is cooking with Q* RLHF
To start, the hilarious quotes from the @Reuters article:
"long-time executive Mira Murati told employees on Wednesday that a letter about the AI breakthrough called Q* (pronounced Q-Star), precipitated the board's actions."
+ "Given vast computing resources, the new model was able to solve certain mathematical problems... Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success"
Now, buckle up: OpenAI's new technology is Q* (Q-star), two prominent things: Q learning (RL algorithm) and A Star (search algorithm)
1. Q learning makes sense, the first notable RL algorithm with many variants used today, the tokens / words are states and some response is actions.
2. A star is a graph search algorithm known for saving results in memory as it goes. The post says "Given vast computing resources, the new model was able to solve certain mathematical problems" --> needs to store A TON of data in the new RLHF training. Search is important for making multi-turn optimization in training work???
Essentially, I guess applying the A* formula over Q-values for multi-turn reasoning.
Why may this work really well and be hard?
* Optimizing over multiple turns means more model forward passes in memory + more gradients
* need this type of thing to solve hard math problems
* really, it's prolly closer to RLAIF
First draft online version of The RLHF Book is DONE. Recently I've been creating the advanced discussion chapters on everything from Constitutional AI to evaluation and character training, but I also sneak in consistent improvements to the RL specific chapter. rlhfbook.com/
RLHF has a long future ahead of it and this will do a lot to make it more accessible to the next generation.
What's next: Getting a physical copy in your hands (may not be exactly 1to1, we'll see) and minor fixes at a slower cadence (thanks to many github contributors, some of you will get a copy from me).
Here are all the chapters.
1. Introduction: Overview of RLHF and what this book provides.
2. Seminal (Recent) Works: Key models and papers in the history of RLHF techniques.
3. Definitions: Mathematical definitions for RL, language modeling, and other ML techniques leveraged in this book.
4. RLHF Training Overview: How the training objective for RLHF is designed and basics of understanding it.
5. What are preferences?: Why human preference data is needed to fuel and understand RLHF.
6. Preference Data: How preference data is collected for RLHF.
7. Reward Modeling: Training reward models from preference data that act as an optimization target for RL training (or for use in data filtering).
8. Regularization: Tools to constrain these optimization tools to effective regions of the parameter space.
9. Instruction Tuning: Adapting language models to the question-answer format.
10. Rejection Sampling: A basic technique for using a reward model with instruction tuning to align models.
11. Policy Gradients: The core RL techniques used to optimize reward models (and other signals) throughout RLHF.
12. Direct Alignment Algorithms: Algorithms that optimize the RLHF objective directly from pairwise preference data rather than learning a reward model first.
13. Constitutional AI and AI Feedback: How AI feedback data and specific models designed to simulate human preference ratings work.
14. Reasoning and Reinforcement Finetuning: The role of new RL training methods for inference-time scaling with respect to post-training and RLHF.
15. Synthetic Data: The shift away from human to synthetic data and how distilling from other models is used.
16. Evaluation: The ever-evolving role of evaluation (and prompting) in language models.
17. Over-optimization: Qualitative observations of why RLHF goes wrong and why over-optimization is inevitable with a soft optimization target in reward models.
18. Style and Information: How RLHF is often underestimated in its role in improving the user experience of models due to the crucial role that style plays in information sharing.
19. Product, UX, Character: How RLHF is shifting in its applicability as major AI laboratories use it to subtly match their models to their products.
It is a major policy failure that the US cannot accommodate top AI conferences due to visa issues.
I'm going to miss o3
It’s very common for leadership at top labs to be in the know of other labs’ release schedules, so the simplest explanation for Meta releasing today is that next week is going to be bonkers, or at least plausibly outshine Llama 4 (and they wanted to release last week, but other news was too big).
Major reasoning models so far with technical reports (focused on those w RL):
2025-01-22: DeepSeek R1 (arxiv.org/abs/2501.12948)
2025-01-22: Kimi 1.5 (arxiv.org/abs/2501.12599)
2025-03-31: Open-Reasoner-Zero (arxiv.org/abs/2503.24290)
2025-04-10: Seed 1.5-Thinking (arxiv.org/abs/2504.13914)
2025-04-30: Phi-4 Reasoning (arxiv.org/abs/2504.21318)
2025-05-02: Llama-Nemotron (arxiv.org/abs/2505.00949)
2025-05-14: Qwen 3 (arxiv.org/abs/2505.09388)
2025-05-28: Skywork Open Reasoner 1 (arxiv.org/abs/2505.22312)
2025-06-04: Xiaomi MiMo (arxiv.org/abs/2505.07608)
2025-06-10: Magistral (mistral.ai/static/research/m…)
Did I miss any?
For those trying to understand DeepSeek's Group Relative Policy Optimization (GRPO): GRPO is just PPO without a value function, using Monte Carlo estimates of the advantage. So, study why PPO exists (lots of docs / writing on that) and understand that value functions are tricky with LLMs. Left: PPO, right: GRPO.
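To make the comparison above concrete, here is a minimal sketch of the group-relative advantage that stands in for PPO's learned value baseline. It assumes one scalar reward per sampled completion and mean/std normalization within the group, which is one common formulation. The function name and example rewards are hypothetical, and real implementations differ in details such as the normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Monte Carlo advantage estimate for one prompt's group of completions.

    Each completion's advantage is its reward relative to the group mean,
    scaled by the group standard deviation. No value network is involved:
    the group itself acts as the baseline.
    ASSUMPTION: population std plus a small eps; libraries vary on both.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Hypothetical group: four completions for one prompt, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Every token in a completion then shares that completion's advantage inside the usual PPO-style clipped objective. If all completions in a group receive the same reward, every advantage is zero and the group contributes no gradient.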
immortalizing this moment forever when RL is so easy that you can just use random rewards and your benchmarks still go up smh
Here's a recent talk I gave recapping the last 6-12 months of AI progress, why getting perfect models is hard, how labs are likely approaching the next phase of training (for agents), and other interesting tidbits across the reasoning landscape.
Topics:
00:00 Introduction & the state of reasoning
05:50 Hillclimbing imperfect evals
09:18 Technical bottlenecks
13:02 Sycophancy
18:08 The Goldilocks Zone
19:28 What comes next? (hint, planning)
26:40 Q&A
YouTube etc in replies. Thanks @corbtt and @OpenPipeAI for hosting me.
A tier list of China's top 19 open model builders. Who did we miss?
At the frontier
* DeepSeek
* Qwen
Close competitors
* Moonshot AI (Kimi)
* Zhipu / Z AI
Noteworthy
* StepFun
* Tencent (Hunyuan)
* RedNote (Xiaohongshu)
* MiniMax
* OpenGVLab / InternLM
* Skywork
On the rise
* ByteDance Seed
* OpenBMB
* Xiaomi (MiMo)
* Baidu (ERNIE)
Honorable Mentions
* Multimodal Art Projection
* Alibaba International Digital Commerce Group
* Beijing Academy of Artificial Intelligence (BAAI)
* inclusionAI
* Pangu (Huawei)
I learned a lot from these. We have so much more we need to do to understand how their AI ecosystem works.
China's Top 19 Open Model Labs We ranked all the organizations in China releasing open models, from the top of DeepSeek to small, newer academic labs making waves with tech reports and niche models. interconnects.ai/p/chinas-to…
The DeepSeek app sitting at number 1 overall in the US iPhone App Store was not on my bingo card and is the biggest sign yet that the ChatGPT moat can maybe be cracked.
I'm flying so I'm working on my mini RLHF book today :) rlhfbook dot com
A very exciting day for open-source AI! We're releasing our biggest open source model yet -- OLMo 2 32B -- and it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. As usual, all data, weights, code, etc. are available. For a long time, people have asked for a truly open-source version of ChatGPT and we finally have it. This is multiple years coming into efforts following the release of ChatGPT and builds on the efforts of so many at both Ai2 and in the broader open AI ecosystem. With just a bit more progress everyone can pretrain, midtrain, post-train, whatever they need to get a GPT 4 class model in their class. This is a major shift in how open-source AI can grow into real applications. Oh yeah, it's also Apache 2 as always, so happy to make things that are simple to use. I did NOT expect to be undercutting OpenAI's offerings this year but here we are :D
Since everyone loved my Chinese lab ranking based on open model releases, here’s the equivalent for the Western companies based on released open models aligned in tiers to comparable Chinese companies. We have no one comparable in the top two tiers.
did OpenAI tell me to downgrade to a free plan today?
I put together all the interview timelines, reflections, and advice from my job search. I focused on RL jobs, varying from applied ML engineer to research scientist! Lots of people want to get a PhD and this is what getting a job after looks like! natolambert.com/writing/ai-p…
R1 making me feel very heard. Will read more thoroughly later. Laughs in continued shock that RL is working like this.
bruh what's on monday
It’s very common for leadership at top labs to be in the know of other labs’ release schedules, so the simplest explanation for Meta releasing today is that next week is going to be bonkers, or at least plausibly outshine Llama 4 (and they wanted to release last week, but other news was too big).
New export controls incoming, Bloomberg reporting: "But if an AI company wants to fine-tune a general-purpose open weight model for a specific purpose, and that process uses a significant amount of computing power, they would need to apply for a US government license to do so in a Tier 2 country." Controlling who in the entire world can finetune on what seems like a losing and generally bad proposition.
New toy!
Not falling for OpenAI’s hype-vague posting about the new IMO gold model with “general purpose RL” and whatever else “breakthrough.” Google also got IMO gold (harder than mastering AIME), but remember, simple ideas scale best.
> be me > be zuck > need llama 4 to land > send a model/prompt to LMSYS to get a top1 score, cringe be damned > release a different model as "open source" > think people won't find out even with weights
oh no
I'm going to sound like a shill but I describe paying for better AIs right now as a way that you can "pay to win" in your career. Normally dynamics like this are restricted to video games.
I am very impressed by GPT-5 Pro. Had a bug in a script. Claude Code w/ Opus couldn’t find it after repeated attempts. dumped the problem and all relevant code into GPT-5 Pro and it found it first shot. Very impressive.
I'd put good money on this being a high-impact finetune of one of the large, Chinese MoE models. I'm very excited to see more companies able to train models that suit their needs. Bodes very well for the ecosystem that specific data is stronger than a bigger, general model.
Introducing Cursor 2.0. Our first coding model and the best way to code with agents.
They're showcasing the RL to prod pipeline as a good form of continual learning. Totally shocking if you had told me this just 12 months ago.
We've trained a new Tab model that is now the default in Cursor. This model makes 21% fewer suggestions than the previous model while having a 28% higher accept rate for the suggestions it makes. Learn more about how we improved Tab with online RL.
FINALLY!!!
Grok 4 coming soon after Llama 4 with a completely different trajectory should help people finally take in how important culture is to progress in technology generally and AI specifically. I don't agree with many of xAI's values but give full props to hard work.
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct at 8B and 70B on the tasks we focused on. So many things to share: New SFT data, recipes for scaling preference fine-tuning, a new RL optimization stage, extensive evaluation details, etc.
Life update: Today's my last day at @huggingface after 1.5 years. It's been an awesome ride, I'm moving within the open RLHF space to have a slightly more research oriented role, continue to figure out what makes RLHF tick, share everything, and make some in-person friends.
Some key lessons:
* If you don't promote and communicate your work, no one else will.
* It's harder to get visibility for your work than to do good work. People don't like this reality.
* Open-source moves very fast, so it takes clever leadership and guidance to maximize collective effort.
* Open-source ML is in its very early days. We're figuring out what it means to do open-source ML. OSS will forever be changed.
* Open-source succeeds in multiplicity. Just because someone is trying something similar does NOT mean you should stop.
* RLHF is very, very under-explored. Leaders in the space have just tried a few more things. Please join us.
Thanks to everyone who helped this be an awesome ride. I'm sure I'll still collaborate on many of the same projects.
Gemini 2.5 Pro long context goes hard. First model where I could give it an entire paper's LaTeX (15k tokens), tell it to ignore comments, and have it find all typos. Does it perfectly. Even o1pro didn't feel coherent on that!
In some ways the GPT-5 release feels like the Llama 4 release. They just waited too long to get it out. Feels like some weirdness may be happening behind the scenes. Messy release in terms of presentation & technical details. Blip or trend for OpenAI?
RL research is becoming like pretraining/modeling. This is a huge vibe shift. Most research published on RL isn't using enough compute to make many of these decisions matter as much. This is slowly shifting.
practical, modern GRPO tweaks as described in Meta's Code World Models paper
GPT 5o has arrived.
We’re making GPT-5 warmer and friendlier based on feedback that it felt too formal before. Changes are subtle, but ChatGPT should feel more approachable now. You'll notice small, genuine touches like “Good question” or “Great start,” not flattery. Internal tests show no rise in sycophancy compared to the previous GPT-5 personality. Changes may take up to a day to roll out, more updates soon.
I bet pretty soon a Chinese research org drops an LLM scaling laws for RL paper. Closed frontier labs have definitely done this and won't share it, academics haven't mastered the data + infra tweaks yet.
Mastering the GRPO math, its implementation details, and other policy gradient algorithms has made it way easier to understand new research on reasoning algorithms. Read the policy gradients chapter of the rlhf book. Studying pays off, I started writing before R1 was released.
Claude 4.1 Opus > Claude 4.5 Sonnet
Almost everyone I know working in AI these days feels one step away from total burnout. I took the time to take you behind the curtain and show what people working on state-of-the-art AI are struggling with: robotic.substack.com/p/behin…
Qwen's first o1 inspired model. Blog: buff.ly/3ZbWCpz Model weights: buff.ly/4i7AJjA
NotebookLM and OpenAI Advanced Voice Mode make it feel like we need to learn entirely new ways to work with AI/LLMs again. Normally, when this happens we unlock a bunch of value.
Glad to see DeepSeek team members writing more papers than just their fun tech reports :) (maybe I just missed them in the past)
New Zuck post, what a difference a few years makes: Today: "We'll need to be rigorous about mitigating these risks and careful about what we choose to open source." 2024: "Meta is committed to open source AI... and therefore a platform that will be around for the long term."
The Q* hypothesis I can stand behind (from literature):
1. Tree of Thoughts reasoning: something to search over
2. Process reward models: rank all the steps of reasoning
3. GPT4 to score all vertices of the tree (RLAIF)
4. Q-learning to optimize 🚀
interconnects.ai/p/q-star
Excited! I've started as a Research Scientist at @allen_ai (on @ai2_allennlp) working on all things "RLHF research" - it encompasses a lot. Let's show the world that openness can foster broadly beneficial AI. I'm excited to work with everyone who wants to make that happen.
hahahahah there were actually two technical reports for RL reasoning models today, kimi 1.5 also has good stuff on reward shaping + RL infra
Models that are actually really really good. Way better than what we were using in 2024: Gemini 2.5 Pro o1 pro
Recent AI model progress feels mostly like bullshit (2025)
In a world before derivatives, Fermat solved optimization problems by assuming two converging points are not equal (so the algebra never divides by 0), then treating their difference as approximately 0 so it can be dropped. 🤯 The steps people took to discover modern math are crazy. Now we have autodiff.
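A worked instance of the trick described above (usually called adequality), using the classic textbook problem of splitting a length b into two parts with the largest product. The symbols b, x, and e are chosen here for illustration.

```latex
\begin{align*}
f(x) &= x(b - x) \\
f(x + e) &= (x + e)(b - x - e) = bx - x^2 + be - 2xe - e^2 \\
f(x + e) \approx f(x) &\implies be - 2xe - e^2 \approx 0 \\
\text{divide by } e \neq 0 &\implies b - 2x - e \approx 0 \\
\text{then set } e = 0 &\implies x = \tfrac{b}{2}
\end{align*}
```

The modern derivative gives the same answer, since f'(x) = b - 2x vanishes at x = b/2, but the argument above never takes a limit: e is nonzero when dividing and zero one line later.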
Zuck just buying Scale to cut off the competition from data obviously
going to go on the record and say this is a bad idea
4o vision fine-tuning enables autonomous driving
Very happy to show that we can do RL finetuning on 405B models with open-source code, beat Llama 405B instruct with their base model, and beat DeepSeek V3 too. Enjoy building off this team's hard work. Here's Tulu 3 405B. A holiday present from @hamishivi, @vwxyzjn and team.
Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! The last member of the Tülu 3 family demonstrates that our recipe, which includes Reinforcement Learning from Verifiable Rewards (RLVR), scales to 405B - with performance on par with GPT-4o, and surpassing prior open-weight post-trained models of the same size including Llama 3.1
I hear people are pretty into GRPO and RL these days, so I wrote up a pretty comprehensive research survey of recent papers I liked. Kimi 1.5, OpenReasonerZero, DAPO and Dr. GRPO. + discussion on if GRPO is special and further reading. interconnects.ai/p/papers-im…
Anthropic is the only leading AI lab to not release a reasonable open weights model. It is notable that pretty much everyone has a touchpoint here now.
Qwen 3 coming imminently! Meta's smart to have locked in LlamaCon, else Llama 4 maybe would've been delayed again 🤭. Really I'm hype for Llama 4, bring it asap.
DeepSeek makes it quite clear how they trained R1. None of these steps alone are super surprising, but how to sequence and blend them together definitely is.
The DeepSeek R1 recipe, what questions we need to answer to train an o1 replication ourselves at home, and what it means for the near future of AI. interconnects.ai/p/deepseek-…
Thinking Machines proving you can be worth $10B with your one product being great content.
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
Stoked to get to talk to @lexfridman + my homie @dylan522p for 5+ hours to try and get to the bottom of what is actually happening in AI right now. DeepSeek R1 & V3, China v US, open vs closed, decreasing hype, datacenters, everything in between... 🚀 what a fun whirlwind week
Here's my 5-hour conversation with @dylan522p and @natolambert on DeepSeek, China, OpenAI, NVIDIA, xAI, Google, Anthropic, Meta, Microsoft, TSMC, Stargate, megacluster buildouts, RL, reasoning, and a lot of other topics at the cutting edge of AI. This was a mind-blowing, super-technical, and fun conversation. Yes, we discuss r1 and o3-mini, but more importantly we look into the future of technology, geopolitics, and humanity in a world that stands on the precipice of a global AI revolution. The first 4 hours are here on X (4 hours is current limit), and the full 5 hours are up everywhere else. Links in comment.
Timestamps:
0:00 - Introduction
3:33 - DeepSeek-R1 and DeepSeek-V3
25:07 - Low cost of training
51:25 - DeepSeek compute cluster
58:57 - Export controls on GPUs to China
1:09:16 - AGI timeline
1:18:41 - China's manufacturing capacity
1:26:36 - Cold war with China
1:31:05 - TSMC and Taiwan
1:54:44 - Best GPUs for AI
2:09:36 - Why DeepSeek is so cheap
2:22:55 - Espionage
2:31:57 - Censorship
2:44:52 - Andrej Karpathy and magic of RL
2:55:23 - OpenAI o3-mini vs DeepSeek r1
3:14:31 - NVIDIA
3:18:58 - GPU smuggling
3:25:36 - DeepSeek training on OpenAI data
3:36:04 - AI megaclusters
4:11:26 - Who wins the race to AGI?
4:21:39 - AI agents
4:30:21 - Programming and AI
4:37:49 - Open source
4:47:01 - Stargate
4:54:30 - Future of AI
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI. Rip to people who say any of "progress is done," "scale is done," or "llms cant reason." 2024 was awesome. I love my job.
I'm convinced to try it asap, we should all try fp16, look at this plot man. FP16 is like perfect in error reduction. "This is precisely why switching to FP16 provides a fundamental solution. With its 10 mantissa bits, FP16 offers 8 times more precision (2^10 values vs. 2^7 values) than BF16. This higher fidelity means that the outputs of the training and inference engines are much more likely to be numerically identical. The increased precision creates a buffer that absorbs the minor implementation differences between the two engines, preventing rounding errors from accumulating and causing a policy divergence. For RL fine-tuning, the dynamic range of the model’s weights and activations has already been established during pre-training. Therefore, the extreme range of BF16 is less critical, while the precision it sacrifices becomes a dominant drawback. By reverting to FP16, we trade the unnecessary range of BF16 for the critical precision, effectively closing the gap between training and inference without any complex algorithmic or engineering workaround."
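The "8 times more precision" claim in the quote can be checked with a small sketch that rounds a number to a given count of mantissa bits. This is a simplified model written for illustration: the helper names are hypothetical, and exponent range, overflow, and subnormals are deliberately ignored, so it captures precision only and not the wider dynamic range that BF16 offers.

```python
import math

FP16_MANTISSA_BITS = 10  # from the quoted passage
BF16_MANTISSA_BITS = 7   # from the quoted passage


def round_to_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    ASSUMPTION: precision-only model; exponent limits are not enforced.
    Python's round() breaks ties to even, matching the usual rounding mode.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x == m * 2**e with 0.5 <= |m| < 1
    sig_bits = mantissa_bits + 1  # explicit bits plus the implicit leading bit
    return math.ldexp(round(m * 2**sig_bits), e - sig_bits)


def spacing_at_one(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


ratio = spacing_at_one(BF16_MANTISSA_BITS) / spacing_at_one(FP16_MANTISSA_BITS)
print("BF16 grid is", ratio, "times coarser than FP16 near 1.0")

# A small perturbation, like a tiny mismatch between training and inference.
x = 1.0 + 2**-8
print("FP16-style rounding:", round_to_mantissa(x, FP16_MANTISSA_BITS))
print("BF16-style rounding:", round_to_mantissa(x, BF16_MANTISSA_BITS))
```

In this model the FP16-style grid preserves the perturbation while the BF16-style grid rounds it away, which is the kind of lost detail the quoted passage blames for training and inference engines drifting apart.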
25
43
646
144,104
Who's using GPT-OSS and for what? Was it cheaper, better, faster than other open models? Or just not from China? Download numbers are actually very strong on HuggingFace for first model releases.
123
37
590
125,360
My latest post: The American DeepSeek Project Build fully open models in the US in the next two years to enable a flourishing, global scientific AI ecosystem to balance China's surge in open-source and an alternative to building products on top of leading closed models.
50
76
627
145,083
Tbh I’m happily using GPT-4.5. thanks OpenAI for not being too eval obsessed
25
14
619
92,542
Gemini 2.5 is amazing -- a bigger jump than the recent releases of Claude 3.7, Grok 3, and GPT 4.5. Google needs to have the same drive for excellence across the product and cloud orgs. They can reclaim the crown in AI if they commit to it
Gemini 2.5 Pro and Google's second chance with AI Plus some coverage for the latest DeepSeek. interconnects.ai/p/gemini-25…
22
28
621
72,463
o3's search abilities are incredible. Can find extremely niche information without a ton of additional context. Just what I would say to a colleague.
28
22
617
144,592
TLDR on o3 and o4-mini: incremental. pace of progress still really high, no dramatic changes in performance or shocking new features. Pressure to ship fast across the industry has never been higher.
16
32
617
46,616
I'm using GPT 5 Pro a lot. Mostly for research in my case, but I am bullish it or Gemini Deep Think are the smartest models available publicly today. You should use one or both of them.
I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I've struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version and it wrote up 2 paragraphs admiring it (very wholesome). If you're not giving it your hardest problems you're probably missing out.
19
51
481
75,804
America needs to take open models more seriously. This summer the early lead in open model adoption of the US via Llama has been overtaken by Chinese models. With The American Truly Open Models (ATOM) Project we're looking to build support and express the urgency of this issue.
32
100
621
132,162
Watch my @stanford CS 25 lecture next week, "aligning open language models," it'll be good, v excited
18
85
598
141,112
"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" This isn't a new intuition, but a nice new set of results. The paper in question, "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", has prompted a lot of discussion about whether Reinforcement Learning from Verifiable Rewards (RLVR) is actually improving the models we’re training. The core figures are attached. These are all using pass@k as the core metric. Pass@k is the metric that checks to see if the right answer exists in k completions. This is not how practical inference works, but is a good test to see if the model is “in distribution.” You should focus on the bottom three rows here, which are in-distribution for the RL training data. Second, Qwen is known to be predisposed to learning reasoning, so those base models may be stronger. I can’t say a lot more about the base models as that’s an open area of research: what are the right base models for reasoning? Some are surprised that the base model does so well, but really we’ve been saying for a while that RL training is increasing the probability of correct behaviors, i.e. elicitation. With this view, the results align totally with what RLVR should be doing. There are also some caveats on the work that make it have the usual academic grains of salt. Mostly, they only train on the MATH and GSM8K training sets. While this is great for controlled ablations, it’s not great for showing the fundamental limits of RL training. OpenAI and others have shown that scaling RL is a crucial aspect of it, and with only these narrow training sets that isn’t really possible. Second, the paper doesn’t have a ton of plots showing the training curves for their models. It’s safe to assume they’re decent because the results look reasonable, but the base model training is much more reliable than the RL training in another paper trying to make a point. 
The pass@1 results for RL are extremely promising and should reinforce that RLVR is working. That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better. My very first post on inference models made the same point that random sampling with pass@k metrics is important as a baseline for inference scaling! This isn’t new. This is a nice reminder that there’s no free lunch. We should keep checking how this changes as we: Scale RL training to many more prompts, and Scale RL to bigger base models. A final caveat, which I think is minor. These results are all RL-Zero style, i.e. just on the base model with no warm start. DeepSeek and others stated that better performance comes from a warmup with on-policy SFT before RL. This’ll make the RL results above even stronger, where the base model results won’t change.
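For readers who want to compute pass@k themselves, here is a minimal sketch. It uses the standard unbiased estimator popularized by code-generation evaluations (sample n completions, count c correct, estimate the chance that at least one of k draws is correct). Whether the paper discussed above uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct completion.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 16 samples per prompt, 2 of them correct.
p1 = pass_at_k(n=16, c=2, k=1)  # chance a single sample is correct
p8 = pass_at_k(n=16, c=2, k=8)  # chance at least one of 8 samples is correct
print(f"pass@1={p1:.3f}, pass@8={p8:.3f}")
```

With a fixed number of correct samples, pass@k rises quickly with k, which is why a base model with a low pass@1 can still look strong at large k.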
18
84
627
118,668
RL keeps cooking 🤣🤡🫡 “Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains.”
18
40
582
62,533
I gave a talk today at The Curve on the state of open models. Here are the slides, recording soon. Topics include: Chinese ecosystem, reflections on DeepSeek, the demise of Llama, who will fill the U.S. market, what local models do, ATOM project & ai2, and more topics
15
79
595
91,422
There are like 10-20 Chinese orgs shipping open models that I try and keep a somewhat close eye on and there are like 3-4 in the rest of the world 😳
20
38
581
72,983
If you look at most of the models we've received from OpenAI, Anthropic, and Google in the last 18 months you'll hear a lot of "Most of the improvements were in the post-training phase." Here's a simple analogy for how so many gains can be made on mostly the same base model: The intuition I've been using to understand the potential of post-training is what I call the elicitation interpretation of post-training, where all we are doing is extracting and amplifying valuable behaviors in the base model. Consider F1, most of the teams show up to the beginning of the year with a new chassis and engine. Then, they spend all year on aerodynamics and systems changes (ofc a minor oversimplification), and can dramatically improve the performance of the car. The best F1 teams improve way more during a season than chassis to chassis. The same is true for post-training. The best post-training teams extract a ton of performance in a very short time frame. The set of techniques is everything after the end of most of pretraining. It includes "mid-training" like annealing / high-quality end of pre-training, instruction tuning, RLVR, preference-tuning, etc. Then, when you look at models such as GPT-4.5, you can see this as a way more dynamic and exciting base for OpenAI to build onto. We also know that bigger base models can absorb far more diverse changes than their smaller counterparts. This is to say -- scaling also allows post-training to move faster. Of course, to do this, you need the infrastructure to train the models. This is why all the biggest companies are still building gigantic clusters. Still, it is very important to understand how much craft there is to post-training and how these labs are grabbing so much available performance. Improvements have been easier than most people think -- fitting them all together in one model is harder, but we have a lot more to gain still.
17
38
594
95,879
Modern licenses so funny. Amazing looking model. MIT-Modified. Marketing is king. "Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2" on the user interface of such product or service."
21
27
596
63,064
Excited to share my analysis of the LLAMA2 model. In short, this model and paper are incredibly well done. Meta has stepped up the level for open-source and signaled another path for the future of LLMs. Influencers are now right with "equals chatgpt". interconnects.ai/p/llama-2-f…
16
130
571
135,671
Getting ready to invest more time into the RLHF book to prepare for print edition. What do you wish was clearer or had more coverage in it?
10
50
585
38,037
A new essay on the crazy, all or nothing approach to work happening in AI today, the looming human costs, and the lack of a finish line. I wouldn't say it's okay, but I'm not sure how to fix it.
16
52
592
284,914
So, Dylan, where does Google land on this chart?
20
18
578
92,216
Got the essentials, not much more is needed.
11
88
255
14,222
If you're a student wanting an exciting life, good comp, and impact on the real world: work on LLMs
If you are a student interested in building the next generation of AI systems, don't work on LLMs
29
17
524
213,535