I will be at NeurIPS this week! If you want to talk about research, RL, life at @thinkymachines or get some Tinker credits reach out!
14
6
189
18,999
We have a new position paper on "inference time compute" and what we have been working on in the last few months! We present some theory on why it is necessary, how it works, and what it means for "super" intelligence.
24
224
1,362
181,431
We have a new preprint out - your language model is not a reward, it’s a Q function! 1. The likelihood of the preferred answer must go down - it’s a policy divergence 2. MCTS guided decoding on language is equivalent to likelihood search on DPO 3. DPO learns credit assignment
16
149
938
100,382
Super excited to announce what we have been working on in the last six months - Agent Q is out now! This is a framework for self-supervised agent reasoning and search that can self-correct and autonomously improve by self-play and RL on real tasks on the real internet! 👇
14
103
756
166,500
The most surprising thing working on this was that RL with LoRA completely matches full training and develops the same extended reasoning patterns. I think this is a great sign for custom agent training.
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
6
41
509
45,475
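For readers unfamiliar with what LoRA changes, here is a minimal sketch of a LoRA forward pass in its standard form, where the frozen weight `W` is augmented with a low-rank update scaled by `alpha / rank`. The function names and toy numbers are hypothetical and are not taken from the linked post.

```python
def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    # Frozen base projection plus the trainable low-rank update B @ (A @ x).
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1, 0], [0, 1]]          # frozen pretrained weight (toy 2x2 identity)
A = [[1.0, 1.0]]              # rank-1 down-projection (1x2)
B_init = [[0.0], [0.0]]       # zero-initialised up-projection (2x1)
B_trained = [[0.5], [0.0]]    # hypothetical values after some training
x = [2, 3]

at_init = lora_forward(W, A, B_init, x, alpha=2.0, rank=1)
after_training = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)
```

With `B` initialised to zero the adapted layer reproduces the base model exactly, so training starts from the pretrained behaviour and only `A` and `B` receive gradients.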
Excited to announce DPO has gone multi-modal! New paper out on RLHF for text-to-image diffusion models! We obtain large-scale state of the art results with 70% win rates against Stable Diffusion XL on human evals! Deep dive below 🧵
10
81
471
233,679
My Bet: Strawberry is algorithm distillation/procedural cloning. Everyone right now is coming up with ways to distill System 2 into System 1, but that will always be limited. We need to train the model to run the algorithms, not just outputs (and post-train with RL of course).
9
46
459
123,999
I saw this challenge aimoprize.com/ to develop an AI that can win a gold medal at the IMO. I competed at that level a couple of times (only silver medals though) and have been working on RL and LLMs for a bit. Here are my thoughts on what the challenges are: 1/N
17
60
450
161,251
Not to mention that most students don’t even have access to that cluster. I don’t have access to any A100s myself. It is becoming increasingly hard to even do research and that is Stanford, other places have it even worse.
Fei-Fei Li says Stanford's Natural Language computing lab has only 64 GPUs and academia is "falling off a cliff" relative to industry
33
28
399
106,056
Excited to announce our latest work on generative reward models that unify RLHF and RLAIF approaches! We begin with a standard LLM-as-a-judge RLAIF framework and use further RL tuning to align the judge model's evaluations with the preference dataset.
8
48
396
65,625
I actually believe Tinker could be the most advanced ML system in the world. It optimizes everything from the kernel level to a distributed system that can process millions of simultaneous requests with near 100% reliability and insane throughput efficiency.
So excited about this! Tinker provides a simple+powerful interface for post-training/RL research. It also manages all the infrastructure so that users can focus on data and environments. Hidden behind that simple interface is a ton of interesting and complex ML systems challenges! In addition to the work building an efficient RL stack (orchestration, numerics, parallelism, weight transfer, etc.), we also tackled a bunch of new challenges (transparent failure recovery, multi-tenant scheduling, autoscaling, etc.). I had a lot of fun working on early parts of this system and am excited to see what others are able to build with it!
6
19
331
45,806
Fn nailed it - tree search distillation + RL post training!
My Bet: Strawberry is algorithm distillation/procedural cloning. Everyone right now is coming up with ways to distill System 2 into System 1, but that will always be limited. We need to train the model to run the algorithms, not just outputs (and post-train with RL of course).
4
17
308
39,489
Very excited to share what I have been working on with a great team of people at @thinkymachines. Tinker is a whole new way to train and customize models all the way up to frontier scale. Most importantly, it allows everyone to use their own code, data, tools and environments, while it provides a frontier level training stack with a few lines of code.
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
11
22
301
51,681
This is a really cool project where we trained a multi-agent system of 3 LLMs to do cooperative problem-solving end-to-end with reinforcement learning! MARL holds a lot of promise to teach models to be more cooperative with real collaborators! Check out @sumeetrm's thread below!
Introducing MALT: Improving Reasoning with Multi-Agent LLM Training🫡 We present a new multi-agent post-training method that uses credit assigned synthetic data to improve the reasoning capabilities and self-correction rates of a generator, critic, and refinement model working together🧵
5
35
291
56,114
It’s weird how people still blindly copy it. There was a whole paper about this.
Replying to @zjasper
The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:
3
17
281
45,056
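The formula the quoted reply goes on to give is not preserved in this text, so the following is only a generic illustration of the point about importance weights, under stated assumptions. It uses the standard k3 term, (r - 1) - log r with r = p(x)/q(x), and computes exact expectations over a tiny categorical distribution. The three distributions and all function names are hypothetical.

```python
import math

def k3(ratio: float) -> float:
    # k3 term for one sample: (r - 1) - log r, non-negative for every r > 0.
    return (ratio - 1.0) - math.log(ratio)

def exact_kl(q, p):
    # KL(q || p) for two categorical distributions over the same support.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def expected_k3(sampler, q, p, importance_weight):
    # Exact expectation of the k3 term when samples are drawn from `sampler`.
    total = 0.0
    for si, qi, pi in zip(sampler, q, p):
        weight = qi / si if importance_weight else 1.0
        total += si * weight * k3(pi / qi)
    return total

q = [0.5, 0.3, 0.2]   # current policy (hypothetical numbers)
p = [0.4, 0.4, 0.2]   # reference policy
b = [0.2, 0.2, 0.6]   # older behavior policy that generated the samples

on_policy = expected_k3(q, q, p, importance_weight=False)
off_policy_unweighted = expected_k3(b, q, p, importance_weight=False)
off_policy_weighted = expected_k3(b, q, p, importance_weight=True)
```

In this toy setting the on-policy expectation and the importance-weighted off-policy expectation both equal KL(q || p), while the unweighted off-policy expectation does not.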
After the LLaMa 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". TL;DR we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as online RLHF.👇
2
47
250
44,647
We are just entering the RoboGPT era. Will have some big news on this soon!
OpenAI + humanoid robots — we’re collaborating with @Figure_robot to expand our multimodal models to robotic perception, reasoning, and interaction. prnewswire.com/news-releases…
3
5
238
73,477
Given the successful recent releases of Zephyr, NeuralChat and Tulu 2, there has been a lot of discussion around DPO (and variants) for RLHF and comparisons to the classical reward modeling + online RL (PPO) pipeline. What I think is missing from the discussion: 1/N
5
30
239
170,859
For 1% of Stanford's 36.5 Billion endowment we could blow DeepSeek out of the water. For 2.5% we could probably compete with OpenAI. Yet for some reason as a Ph.D. student I can use 4 GPUs on a good day and pray my one 8B fine-tuning run goes well. Food for thought.
½ wrong, ½ right: The problem is not API 💰💰 but whether students can hack on—“research”—the details of models! .@ericschmidt: “said a [US] failure to invest in open-source [AI] would prevent scientific discovery in western universities, which could not afford closed models.”
9
21
234
36,814
The latest SalesForce approach achieves SOTA 55% on SWE-Bench Lite. The key component is a critic model which selects among a number of proposed solutions. It's the same approach we used in Agent Q for web tasks. I am pretty bullish about the idea of TRAINING generative critics.
3
19
211
21,090
I highly recommend zoning out all AIfluencers over the next few weeks (or indefinitely really).
6
4
159
16,427
These models can’t even learn \n\n from 50 gradient steps, much less complex exploration like this. If it generates code to solve math problems, it is clear it had a bunch of curated data in pre-training.
We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on 7B model with only 8K examples, the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the resultant model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-math-7B-instruct and being comparable to PRIME and rStar-MATH that use >50x more data and more complicated components. 🚀 Increased CoT length and self-reflection emerge We share the details and our findings in the blog: hkust-nlp.notion.site/simple… Training code and implementation details here: github.com/hkust-nlp/simpleR…
11
11
152
56,116
I’ll take the opposite view - current methods are saturating and we need at least 1 practical breakthrough and at least two fundamental ones (which will likely take years) just off the top of my head to reach AGI. None of these are oversight or safety related.
Scalable oversight is pretty much the last big research problem left. Once you get an unhackable reward function for anything then you can RL on everything.
11
11
153
30,111
When we first published our work on this 9 months ago it was rejected for being impractical in realistic cases. Six months later it was rejected for lack of novelty. It’s the way academic publishing goes.
Another generative / inference-time scaling reward modeling paper. It's the direction things are going.
4
15
153
14,698
New preprint is out on the interplay between DPO and verbosity. Some of the first feedback we got on DPO was that when training at LARGE scale the model becomes increasingly verbose until it diverges. Verbosity effects have also been observed in the OS community. Credit to @peterjliu
4
25
138
31,013
DeepSeek R1 with "Cold Start" pretty much works as expected. I still don't buy the R1 Zero result, the base models barely output coherent solutions without finagling. My bet is there is some correction/reflection/backtracking-like data in mid-training.
7
5
132
24,103
Despite all the twitter hype there still hasn't been public proof that the "reasoning" models have any emergence. I.e. is there a class of problems that are solvable with "advanced reasoning" that were not under GPT4o with search under some computational budget?
8
11
127
20,857
From the LLaMa 3 blogpost - they use a combination of rejection sampling, DPO and PPO for post-training. Really interested to know what tasks/parts of the process each algorithm benefits the most.
3
14
117
71,745
Our new paper on RL From Human Feedback is out: arxiv.org/abs/2305.18290. In Direct Preference Optimization (DPO) we reparameterize the reward model in a suitable way without any loss in generality and optimize the EXACT RLHF objective directly with a simple classification loss.
2
29
120
22,823
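The "simple classification loss" in the DPO announcement above can be sketched for a single preference pair as follows. This is a minimal illustration of the published DPO objective, not the authors' training code. The function names, the default `beta`, and the example log-probabilities are hypothetical.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference policy.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Binary classification loss: the chosen response should out-score the rejected one.
    return -log_sigmoid(chosen_reward - rejected_reward)

# Policy identical to the reference: the margin is zero.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability mass toward the chosen response.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Each argument is the summed token log-probability of a full response. No separate reward model appears anywhere in the computation, which is the point of the reparameterization.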
Replying to @kalomaze
You are wrong on like three different levels, but something that will blow your mind - GRPO was first published in the DPO paper under the name “PPO-ours” which was group size 4 (but our version was mathematically correct unlike the actual “GRPO”).
2
5
122
36,818
"Superintelligence isn't about discovering new things; it's about discovering new ways to discover" -> Meta RL
Superintelligence isn't about discovering new things; it's about discovering new ways to discover I think our latest work formalizes Meta Chain-of-Thought which we believe lies on the path to ASI When we train models on the problem-solving process itself—rather than the final solution—they internalize how to think about reasoning tasks, not just what to think The next wave of AI is a Meta-CoT loop. We can't predict what novel forms of thinking might emerge, but it points to an extraordinary synthetic future I'm so proud of @synth_labs team & our incredible open science collaborators for getting this work out
2
31
115
16,499
DPO was designed as an offline algorithm, but works better online for data constrained reasons. However I believe the most efficient formulation is as a massive distributed async off-policy RL. This allows you to reuse real with AI feedback while massively scaling data.
Replying to @sanmikoyejo
PrefLearn: How Do Advanced Replay Buffers and Online DPO Affect the Performance of RL Tetris with DQNs by Andy Liang, Abhinav Sinha, Jeremy Tian, and Kenny Dao proposes PrefLearn with superior performance and faster convergence tinyurl.com/preflearn 4/n
4
9
114
18,340
New follow-up work on the effects of synthetic data on model pre-training. It’s becoming increasingly clear that the model collapse issues predicted by prior works are not panning out in theory and practice. Industry labs now even have entire synthetic data pre-training teams.
📢New preprint📢 🔄Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 A deeper dive into the effects of self-generated synthetic data on model-data feedback loops w/ @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/9
1
13
108
17,458
Replying to @jxmnop
Conference reviewing is completely broken these days, I don’t hold this in very high regard.
2
3
104
9,535
Been working on a new off-policy policy-gradient approach that should be more numerically stable (no importance ratios). The final goal is long-context and agentic RL where these can become a big issue, but I wonder how many RL people know what HalfCheetah is these days.
9
6
98
9,287
Missed this paper, but it’s pretty cool - it managed to scale our “Meta-CoT” proposal to 70B models by creating synthetic CoTs from search traces and post-training with RL. Thanks for the shout-out as well!
Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417
1
9
92
10,869
Inference time search is going to be huge for agents and likely very hard to match with standard training. That was my biggest take-away from the Agent Q work as well.
Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵
1
6
79
14,282
O1-mini asking for pairwise feedback. Interesting, I didn't expect this.
7
3
75
11,426
Replying to @kalomaze
You know he’s one of the transformer inventors?
2
74
5,397
Excited and grateful for the opportunity to speak at TED AI this year!
📢Excited to welcome @rm_rafailov researcher at @thinkymachines His work in reinforcement & continuous learning is shaping how AI learns to learn. Hear him at #TEDAI, Oct 21–22 in SF. Apply to attend: tedai-sanfrancisco.ted.com/
2
2
71
10,116
We address the model collapse issue in a new preprint! With AI generated data growing exponentially and making its way into pre-training datasets, new concerns are being raised about potential degradation in model performance. In this new work we claim this might not be an issue!
1
10
71
9,318
We are working to make applications like this even faster!
The Tinker API seems quite promising for async RL training, though I haven’t seen much discussion on this aspect. I ran a few experiments to get an initial sense of how async RL performs with Tinker. The results are pretty impressive! Async matches the sync version under the same settings, but finishes in about half the wall-clock time (max steps off-policy = 4). A few quick notes: - Efficiency can likely improve further; I did notice some rate limiting - These runs use a GRPO-like algorithm; setups with more components (e.g., actor-critic + reward models) could show even greater async benefits - Compared with non-API RL infra, the async implementation with Tinker doesn’t need to manage computing resources, which largely simplifies the complexity Looking forward to seeing more exploration from the community on async RL with Tinker! Experiments based on the math training recipe from tinker-cookbook: github.com/thinking-machines…
3
3
67
15,662
From @StabilityAI Stable Diffusion 3 paper - fine-tuning with Diffusion-DPO achieves close to 20% higher win rate over the base model under human evals!
2
8
67
23,884
Sometimes even LLaMa has had enough.
4
4
65
5,139
“We developed a fully asynchronous online RL training framework that enhanced flexibility. …. This innovation resulted in a ~10x improvement in training efficiency over previous generations.” Asynch distributed RL strikes again!
1
3
65
6,136
My guess is that scaling inference time compute won't do much for agents (maybe a little - aka internal world models) but we need to scale inference-time interaction.
Surprising find: OpenAI's O1 - reasoning-high only hit 30% on SWE-Bench Verified - far below their 48.9% claim. Even more interesting: Claude achieves 53% in the same framework. Something's off with O1's "enhanced reasoning"... 🧵1/8
4
2
65
9,958
Replying to @jxmnop
We can replicate it, it just takes insane amounts of data, compute and infrastructure, all of which are in short supply in academia/oss community.
8
59
3,962
A lot of my own thinking on the inference compute problem has been influenced by discussions with Aviral, check out his stance on this at their blog post!
Lots of buzz around scaling test-time compute! But from an ML viewpoint: what does it mean to "use" test-time compute wisely? How to train to do so? How to measure that scaling it is useful? This new blog from students @mldcmu provides a conceptual perspective on these! 🧵⬇️ blog.ml.cmu.edu/2025/01/08/o…
1
7
58
7,101
In summary, we would likely need a new architecture, one that can: 1. See and understand spatial reasoning. 2. Be able to maintain and use prior knowledge 3. Do latent planning with arbitrary depth over concepts 4. Have hierarchical structure to generate the actual output 16/N
2
5
58
4,405
Replying to @polynoamial
Most people pursue Ph.Ds to do novel science and discoveries. If that is no longer on the table, the whole endeavor seems … pointless.
60
1,320
I make the AI, very nice!
congrats @rm_rafailov on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alien piece comes as no surprise to your mates of course, but at least the general public now has fair warning and a fighting chance. To celebrate with a fitting observation from a preeminent Kazakh journalist: "My country send me to United States to make AI model. Please, come and see my model. If it not success, I will be execute." 🇺🇸🇺🇸🇺🇸🇺🇸🇧🇬🇧🇬🇧🇬🇧🇬
6
55
5,558
Replying to @teortaxesTex
We need a massive coordinated effort to generate a few billion samples of process supervision data. I think the recipe is actually clear, but the data and compute requirements are gonna be orders of magnitude larger.
6
1
53
8,966
The main point of why we need "advanced reasoning" is complexity. The model training data contains solutions for hard problems, but NOT the true data generating process for those solutions. The solution itself is the output of some complex Meta-CoT, which is not written down.
2
2
52
9,661
What about a massive community effort to generate reasoning process training data?
Great to see @scale_AI and @cais initiating a massive effort on harder evals! Many popular benchmarks are now saturated by @OpenAI o1, and we expect rapid progress to continue.
6
52
10,425
This is the dataset we curated for our own reasoning experiments. There is a lot of reasoning data coming out now, but we spent extra time on this to make sure all the problems are high-quality and suitable for RL training!
thrilled to see Big-MATH climbing to #3️⃣ on @huggingface—clear signal the community wants more high-quality, verifiable RL datasets. grateful to everyone who’s been liking, downloading, and supporting ❤️
2
9
52
11,143
Replying to @Shalev_lif
I’ve grown to think that the RL algorithm doesn’t matter as much as long as it’s scalable enough and can work slightly off-policy. The model priors matter a lot.
4
52
2,802
Replying to @McaleerStephen
You will have a great answer in three weeks.
2
3
52
3,812
Pretty good analysis, we have similar observations and are trying to dig more into the DS V3 model itself. I believe these behaviors are also likely due to deliberate use/curation of synthetic data in pre/mid training (maybe even o1 traces in the case of R1).
🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵
2
3
52
4,914
Contrastive Preference Learning (CPL) extends DPO to arbitrary MDPs and achieves great results on robot control! It uses a simple 1-line objective and does not require a reward model or any RL! I am so excited about this work, more below 🧵 arXiv: arxiv.org/abs/2310.13639
1
10
50
6,496
Multi-turn interactive RL should be a bigger focus. Current methods are not well-suited for this - i.e. PPO can't train with user in the loop generally and offline Q-learning still does not work at scale. It's interesting to see more work in that direction.
When prompting language models to complete a task, users often leave important things unsaid. Can language models teach themselves to ask clarifying questions? In STaR-GATE, we explore LMs' ability to self-improve by rewarding the model for generating useful questions!
6
50
7,424
The image below is the solution to the Geometry problem from IMO 2023. Even providing the solution plot, the model struggles with basic geometric structures and messes up multiple spatial relations. It does not understand how these things relate to each other. 4/N
2
6
46
11,288
Similar to many of @ylecun's arguments the model needs to be able to do planning over prior results/techniques AND develop them on the fly if they are not available. This is fundamentally different from token space planning, since it needs to be done over concepts. 9/N
1
3
44
5,051
Check out our position paper: arxiv.org/abs/2501.04682 for a lot more discussion, empirical results and technical details. We have nearly two pages of open research problems and we need people to work on them! If these interest you and want to work on open research, get in touch!
2
5
48
3,069
Finally the “gold” level at the IMO requires a certain amount of “spark”. My favorite example is the “windmill” problem from IMO 2011 (I did not solve it). The solution is only a few lines long and does not require any prior knowledge (try to think about it). 12/N
4
3
47
5,204
To be able to solve these problems, the model would need significant specialized training on spatial understanding and geometry, something which does not emerge from generic Internet data, since that is more semantics based (perhaps textbooks could help). 5/N
1
2
41
6,201
Replying to @kalomaze
I don’t understand this? There is no insider knowledge or anything. The thing in the DPO paper called “PPO-ours” was something we implemented that used four rollouts per prompt and centered the rewards. It uses a correct KL estimate (as part of the reward) unlike the original GRPO paper. At the time I thought about this as an implementation trick rather than anything groundbreaking.
2
44
3,419
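The reply above describes sampling four rollouts per prompt, putting a KL estimate into the reward, and centering the rewards. A minimal sketch of that recipe follows. Using the per-sample log-ratio as the KL penalty is an assumption made here for illustration, as are `beta`, the function names, and the toy numbers. The reply does not specify these details.

```python
from statistics import mean

def kl_shaped_rewards(task_rewards, logps, ref_logps, beta=0.05):
    # Subtract a per-sample KL penalty (log-ratio against the reference) from each reward.
    return [r - beta * (lp - rlp) for r, lp, rlp in zip(task_rewards, logps, ref_logps)]

def centered_advantages(rewards):
    # Use the mean reward of the group as a baseline for every rollout in it.
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Four rollouts sampled for one prompt (hypothetical values).
task_rewards = [1.0, 0.0, 0.0, 1.0]
logps = [-12.0, -15.0, -11.0, -14.0]
ref_logps = [-12.0, -15.0, -11.0, -14.0]   # policy still equals the reference here

shaped = kl_shaped_rewards(task_rewards, logps, ref_logps)
advantages = centered_advantages(shaped)
```

Because the baseline is the group mean, the advantages within a group always sum to zero, so rollouts are only ever compared against their siblings for the same prompt.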
I agree completely, the question is what changed in the base model? The internet data distribution on “reasoning” is the same.
Simply, no. I've been looking at my old results from doing RL with "verifiable" rewards (math puzzle games, python code to pass unit tests) starting from 2019 with GPT-1/2 to 2024 with Qwen Math Deepseek's success likely lies in the base models improving, the RL is constant
4
43
10,813
As a former math competitor this definitely fit my own thinking process - evaluating potential approaches to a solution, pruning directions that don't make progress, exploring branching claims trying to build a graph towards the final goal (solution/proof) based on intuition-v(S)
1
4
41
5,072
You’ve “derived” nothing, this is a definition.
1
42
736
(Meta) CoTs are search inside world models (the prompt is the goal specification).
Are world models necessary to achieve human-level agents, or is there a model-free short-cut? Our new #ICML2025 paper tackles this question from first principles, and finds a surprising answer, agents _are_ world models… 🧵
3
42
3,518
We’ve thrown all algorithms we have at this problem, including PPO and MCTS, over the last 3 years. All of them saturated. What changed is what goes in the “base” model. Literally thousands of papers on this, idk how it's a discussion.
1
1
39
5,723
Out of the gate, the first main challenge is Geometry. Solving these problems requires a significant amount of spatial understanding and reasoning, which a pure LLM likely cannot develop. Even a strong VLM such as GPT4-V struggles a lot with basic spatial understanding. 3/N
4
1
38
7,213
Our robotics foundation model OpenVLA has crossed 10,000 downloads in the last month! If you're using or fine-tuning the model, I'd be really interested to hear about your use cases and experience!
7
38
6,510
Replying to @prathamgrv
How does one “derive” KL divergence in your mind?
3
37
2,180
We are presenting this work at ICML today 11.30-1pm! Stop by to discuss anything related to RLHF and LLM fine-tuning!
(1/N) Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Our new work finds that approaches employing on-policy sampling or negative gradients outperform offline, maximum likelihood objectives.
1
7
38
6,441
Really cool work towards explaining the persistent gap between fully offline and online(ish) RLHF methods.
1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to🤿!
2
3
39
7,042
Another DPO SOTA model. @lmsysorg can we get this one in the arena?
SOLAR: an 11B model that beats every open model, including Mixtral, Yi-34B, Llama 2 70B, and Falcon 180B: huggingface.co/upstage/SOLAR…
5
4
39
40,988
Off-policy RL ftw
Dudes will carefully optimize GRPO clipping epsilon, only to immediately discard their generations after one parameter update.
3
38
6,879
So do we (and advanced reasoning models) just need to do search? No, we need to TEACH the models to do this themselves for two main reasons: 1. Efficiency - training a model to search in-context can teach it to avoid exploring similar branches. 2. Super-Intelligence.
1
1
38
3,400
Our new paper MJ-BENCH evaluating generative reward models for text-to-image generation is now out! We find that Large Vision Language Models can act as zero shot feedback providers for diffusion models! More details below 👇
1
12
36
7,045
Excited for more community integrations!
GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced deepseek-r1 length increasing with LoRA 3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs 4. added Terminal, MCP, visual and multi-agent environments … Open the thread for a deep dive!
4
35
8,134
The competition runs over two days, with three problems and 4.5 hours each day. The problems usually cover four main areas - Algebra, Geometry, Number Theory and Combinatorics. These problems are quite challenging, the average high-schooler would likely get 0. 2/N
1
1
36
7,557
So, do advanced reasoning models also carry out in-context search? We believe so! 1. O1 seems to implement a general search with backtracking and branching. 2. DeepSeek R1 uses additional self-criticism or inner-dialogue. 3. Gemini Think follows a revision-based format.
1
3
36
3,433
Meta-RL- learning to think
"Move 37" is the word-of-day - it's when an AI, trained via the trial-and-error process of reinforcement learning, discovers actions that are new, surprising, and secretly brilliant even to expert humans. It is a magical, just slightly unnerving, emergent phenomenon only achievable by large-scale reinforcement learning. You can't get there by expert imitation. It's when AlphaGo played move 37 in Game 2 against Lee Sedol, a weird move that was estimated to only have 1 in 10,000 chance to be played by a human, but one that was creative and brilliant in retrospect, leading to a win in that game. We've seen Move 37 in a closed, game-like environment like Go, but with the latest crop of "thinking" LLM models (e.g. OpenAI-o1, DeepSeek-R1, Gemini 2.0 Flash Thinking), we are seeing the first very early glimmers of things like it in open world domains. The models discover, in the process of trying to solve many diverse math/code/etc. problems, strategies that resemble the internal monologue of humans, which are very hard (/impossible) to directly program into the models. I call these "cognitive strategies" - things like approaching a problem from different angles, trying out different ideas, finding analogies, backtracking, re-examining, etc. Weird as it sounds, it's plausible that LLMs can discover better ways of thinking, of solving problems, of connecting ideas across disciplines, and do so in a way we will find surprising, puzzling, but creative and brilliant in retrospect. It could get plenty weirder too - it's plausible (even likely, if it's done well) that the optimization invents its own language that is inscrutable to us, but that is more efficient or effective at problem solving. The weirdness of reinforcement learning is in principle unbounded. I don't think we've seen equivalents of Move 37 yet. I don't know what it will look like. I think we're still quite early and that there is a lot of work ahead, both engineering and research. 
But the technology feels on track to find them. piped.video/watch?v=HT-UZkiO…
1
2
33
5,480
To predict the next token in the training data the model needs to internalize the whole meta-reasoning process in its activations, which have limited capacity. This thread makes the point very clearly:
There is a nuanced but important difference between chain-of-thought before and after o1. Before the o1 paradigm (i.e., chain-of-thought prompting), there was a mismatch between what chain of thought was and what we wanted it to be. We wanted chain of thought to reflect the thinking process of the model, but what the model was really doing was just imitating reasoning paths that it had seen in pretraining, e.g., math homework solutions. The problem with this type of data is that it is a post-hoc solution summarized after the author did all the work somewhere else, and not really a sequence of thoughts. So the solutions often had poor information density, with an egregious example being things like “The answer is 5 because…”, where the token “5” has a huge amount of new information. With the o1 paradigm, you can see that the chain of thought looks very different from a textbook math solution (you can view examples in the blog post). These chains of thought are kinda like “inner monologue” or “stream of consciousness”. You can see the model backtracking; it says things like “alternatively, let’s try” or “wait, but”. And I have not measured directly, but I would wager a bet (my psycholinguistics friends would probably be able to confirm) that the information density is *much* more uniform in the chain of thought than average text on the internet.
1
1
35
6,741
Beyond the issues of understanding graphs and spatial reasoning, the challenges only get harder. A strong IMO competitor is not unlike an athlete. It takes years of problem solving for multiple hours a day to develop the background and skills to compete at this level. 6/N
1
1
35
6,053
However, the biggest unanswered question is about Super-Intelligence - can these models discover novel ALGORITHMS of thinking, which allow them to solve problems that classical search CANNOT solve under ANY compute budget? DID THE COMPUTE-PERFORMANCE CURVE MOVE LEFT OR UP?
1
1
35
3,625
A solution is usually a few hundred tokens, but only represents a combination of fewer than 5 concepts - we need some sort of latent compositional planning. It is not clear that current LLM systems can do that (perhaps ToT + Q learning)? 10/N
1
1
33
4,782
It’s important to eat healthy!
7
2
34
3,429
As I said earlier, we need to figure out RL at foundation model scale. This work is yet another piece of the missing puzzle. What I still wonder is how dynamic programming RL training affects the knowledge inherent within a pre-trained model? Some thoughts on this soon.
Super simple code change to get value-based deep RL scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! arxiv.org/abs/2403.03950🧵⬇️
4
34
6,122
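The quoted point above (train a value function with cross entropy instead of MSE) can be made concrete with a small sketch. This is only an illustration under stated assumptions: it uses a two-hot encoding of the scalar target over fixed bins, which is one simple way to turn a regression target into a categorical one and is not necessarily the variant the linked paper recommends. The names `two_hot`, `cross_entropy_value_loss`, `expected_value` and the bin layout are hypothetical.

```python
import numpy as np

def two_hot(targets, bin_centers):
    """Spread each scalar target over its two neighbouring bins.

    The weights are chosen so the expectation under the resulting
    distribution equals the (clipped) target.
    """
    targets = np.clip(np.asarray(targets, dtype=float), bin_centers[0], bin_centers[-1])
    upper = np.searchsorted(bin_centers, targets, side="left")
    upper = np.clip(upper, 1, len(bin_centers) - 1)
    lower = upper - 1
    width = bin_centers[upper] - bin_centers[lower]
    w_upper = (targets - bin_centers[lower]) / width
    probs = np.zeros((len(targets), len(bin_centers)))
    rows = np.arange(len(targets))
    probs[rows, lower] = 1.0 - w_upper
    probs[rows, upper] = w_upper
    return probs

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_value_loss(logits, targets, bin_centers):
    """Classification loss for value learning: CE between predicted bins and two-hot targets."""
    target_probs = two_hot(targets, bin_centers)
    return -(target_probs * log_softmax(logits)).sum(axis=-1).mean()

def expected_value(logits, bin_centers):
    """Recover a scalar value estimate from the predicted categorical distribution."""
    return np.exp(log_softmax(logits)) @ bin_centers

# Hypothetical setup: 5 bins covering returns in [-1, 1].
bin_centers = np.linspace(-1.0, 1.0, 5)
targets = np.array([0.25, -1.0, 2.0])   # 2.0 is clipped to the top bin
logits = np.zeros((3, 5))               # an untrained, uniform prediction
loss = cross_entropy_value_loss(logits, targets, bin_centers)
```

In this setup the network outputs one logit per bin rather than a single scalar, and the scalar estimate is read back out with `expected_value` when a number is needed (for example, to build a bootstrapped target).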
At NeurIPS this whole week. Hit me up if you want to chat about RLHF, generative models, agents, robot learning, world models or anything about foundation models and decision making!
1
4
34
9,462
Evidence is mounting that increased inference time budgets are a capabilities shift. The question is how to really use them. Longer term I believe large scale RL will allow us to discover better optimization strategies.
Salesforce releases DEI, an open AI software engineering agents org with a 55% resolve rate on SWE-Bench Lite Discussion: huggingface.co/papers/2408.0… We propose DEI (Diversity Empowered Intelligence), a framework that leverages SWE agents' unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions.
5
34
4,063
It really surprises me how far we can push a 7B model. It feels like with the right data mix and a 70B range model, we could already match or even outperform GPT 3.5 with an open-source model!
🔥Open-source, open-science, and data curation for the win! Meet Notus 7B, a new LLM tuned with DPO on a new curated UltraFeedback dataset, surpassing Zephyr and Claude 2 on AlpacaEval. Built on the shoulders of giants: 🙌@huggingface Alignment Handbook argilla.io/blog/notus7b
2
7
34
5,549
Doing efficient RL properly at Foundation Model scale is still an open problem in my opinion. It’s especially prominent in agent and robotics applications and we can get significant benefits from figuring this out. This work is a step in that direction.
How can we train LLM Agents to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: arxiv.org/abs/2402.19446 Website: yifeizhou02.github.io/archer…
3
33
13,152
We've been working on distributed, highly scalable online inference, search and RL infrastructure on top of the Neo-X framework, shooting for SOTA, which we aim to be FULLY OPEN. If you're interested in Infra, get in touch! synthlabs.ai/blog/rlhf-and-r…
1
2
33
2,540
Replying to @pashmerepat
This is not at all what an environment is. This is an abstraction that has existed and been built over 30 years.
1
33
2,063
So what does the Meta-CoT look like? It's hard to tell since people don't write down their problem-solving processes. However, we stipulate that in domains with a generator-verifier gap this is fundamentally represented by a SEARCH process.
2
1
33
3,968
This was an awesome project - we teach models to follow constitutional principles with self-supervision (no labels). We also show that a weak model can generate principles for a stronger one, which self-aligns (SUPERALIGNMENT!) and can beat the instruction-tuned (RLHF-ed) model!
Constitutional AI showed LMs can learn to follow constitutions by labeling their own outputs. But why can't we just tell a base model the principles of desired behavior and rely on it to act appropriately? Introducing SAMI: Self-Supervised Alignment with Mutual Information!
8
29
11,819