Assistant Professor of CS & ML at @CarnegieMellon. PhD from UC Berkeley.

Pittsburgh, PA
🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️
11
82
700
71,372
Thrilled to share that I will be joining Carnegie Mellon @SCSatCMU as an Assistant Professor of CS and ML @CSDatCMU @mldcmu in Fall 2024. Extremely thankful to my mentors & collaborators, especially @svlevine! Looking forward to working with amazing students & colleagues at CMU!
65
28
651
112,273
🚨 New paper: we trained a SOTA (> GPT4, Gemini) VLM agent, DigiRL, that can do tasks on an Android phone in real time, in the wild, via autonomous offline + online RL Web: digirl-agent.github.io/ Paper: arxiv.org/abs/2406.11896 🧵 ⬇️ / little gif of learning progress👇:
5
103
607
80,240
Lots of buzz around scaling test-time compute! But from an ML viewpoint: what does it mean to "use" test-time compute wisely? How to train to do so? How to measure that scaling it is useful? This new blog from students @mldcmu provides a conceptual perspective on these! 🧵⬇️ blog.ml.cmu.edu/2025/01/08/o…
7
76
407
36,608
🚨 How can we fine-tune LLMs to implement nuanced algorithmic behaviors at test-time? I've been very behind on posting, but in SCoRe, we studied a special instance: training LLMs to self-correct. arxiv.org/abs/2409.12917 (in v2, we've updated the presentation, 🙏 for feedback!)
7
65
357
51,494
🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL arxiv.org/abs/2502.04327 🧵⬇️
5
54
330
37,300
Posting this a bit late, but if you are applying for a PhD in AI and are interested in decision making and reinforcement learning, please consider applying to my upcoming lab at CMU by December 13! Details about my interests and application process can be found on my website.
4
56
293
44,077
🚨 New paper on RL, synthetic data, LLM math reasoning (MATH / GSM8K) TL;DR: RL on wrong responses (yes, "proper" RL, not filtered SFT or STaR / RFT) scales utility of syn data by **8x**, ❌spurious correlations ✅stitching, credit assignment arxiv.org/abs/2406.14532 🧵⬇️
10
50
282
32,409
Many LLM fine-tuning methods. Unclear what you should use & why? In our new paper, we did an extensive study of on-policy RL, supervised & offline contrastive methods (DPO, IPO) to answer this... 🧵⬇️ On-policy > offline, mode-seeking > mode-covering understanding-rlhf.github.io…
3
64
271
37,669
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ tinyurl.com/rlshadis
4
37
273
17,252
🚨🚨New paper: if you want robots to do bimanual long-horizon tasks well, try RaC: a human in-the-loop data collection protocol that naturally amplifies Recovery & Correction behaviors + trains on them. 📈 data efficiency 10x vs prior results (+many nice properties). 🧵⬇️
5
55
269
25,881
Super simple code change to get value-based deep RL scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! arxiv.org/abs/2403.03950🧵⬇️
3
40
264
52,308
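A minimal sketch of the "cross entropy, not MSE" idea from the post above. It assumes a two-hot encoding of the scalar target over evenly spaced bins, which is one common choice; the paper may use a different target distribution. All function names and the bin layout here are hypothetical.

```python
import math

def two_hot(target, v_min, v_max, num_bins):
    """Spread a scalar target over the two nearest bin centers (assumed encoding)."""
    target = min(max(target, v_min), v_max)      # clamp into the supported range
    width = (v_max - v_min) / (num_bins - 1)     # spacing between bin centers
    pos = (target - v_min) / width               # fractional bin index
    lo = int(math.floor(pos))
    hi = min(lo + 1, num_bins - 1)
    w_hi = pos - lo                              # weight on the upper neighbor
    probs = [0.0] * num_bins
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy_value_loss(logits, target, v_min, v_max):
    """Cross-entropy between predicted bin probabilities and the two-hot target."""
    probs = softmax(logits)
    tgt = two_hot(target, v_min, v_max, len(logits))
    return -sum(t * math.log(p) for t, p in zip(tgt, probs) if t > 0.0)

def expected_value(logits, v_min, v_max):
    """Recover a scalar value as the mean of the predicted distribution."""
    probs = softmax(logits)
    width = (v_max - v_min) / (len(logits) - 1)
    return sum(p * (v_min + i * width) for i, p in enumerate(probs))
```

In a TD setting the `target` would be the bootstrapped return; the critic outputs `logits` over bins instead of a single scalar, and `expected_value` converts back when a scalar Q-value is needed.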
If you want to try training Q-functions via flow-matching, we just released code and runs: Code: github.com/CMU-AIRe/floq Wandbs: docs.google.com/spreadsheets… Also great to see so many other groups training value functions via flow-matching!
3
33
248
21,595
A lot of work focuses on test-time scaling. But we aren't scaling it optimally, simply training a long CoT doesn't mean we use it well. My students developed "v0" of a paradigm to do this optimally by running RL with dense rewards = minimizing regret over long CoT episodes. 🧵⬇️ cohenqu.github.io/mrt.github…
3
33
200
16,675
🚨🚨New paper that proves systematically training w/ RL (or any method w/ rewards or verifiers) to scale test-time compute >> doing it via SFT or distillation. If you scale up prompts too, the gap b/w RL & SFT gets larger w/ more test-time compute budget! arxiv.org/abs/2502.12118
🚨 RL or distillation/SFT: what to use to train next reasoning model? Which 📈 perf faster as we scale test compute? We answer these in a principled way so you don't have to burn GPUs🔥. 🎯 Ans: RL w/ rewards or verification >> SFT/distillation 😱 arxiv.org/pdf/2502.12118 🧵⤵️
2
31
199
21,022
How can we train LLM Agents, to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: arxiv.org/abs/2402.19446 Website: yifeizhou02.github.io/archer…
2
40
190
36,208
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️ Website: matthewyryang.com/e3/ Paper: arxiv.org/abs/2506.09026
2
27
179
12,504
We have been doing work on scaling laws for off-policy RL for some time now and we just put a new paper out: arxiv.org/abs/2508.14881 Here, @preston_fu @_oleh led a study on how to best allocate compute for training value functions in deep RL: 🧵⬇️
2
24
160
7,497
If you are at #icml25 and are interested in RL algorithms, scaling laws for RL, and test-time scaling (& related stuff), come talk to us at various poster sessions (details ⬇️). We are also presenting some things at workshops later in the week, more on that later.
1
10
151
7,184
Conventional wisdom: the BIG blocker holding offline RL behind imitation / SFT, preventing good scaling, etc is the value function. But can we still do well with current value functions? We find: often *policy* learning bottlenecks offline RL scaling: arxiv.org/abs/2406.09329 🧵
6
28
136
15,194
First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. sites.google.com/view/scalin… w/ @svlevine @agarwl_ @younggeng @georgejtucker
2
14
134
Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by @seohong_park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️
2
18
128
7,289
Two new papers on self-improvement: paper 1 today ⬇️ In RISE, we build on online imitation to teach LLMs *how* to improve their own responses *sequentially*. w/ Llama2/3/Mistral, this gives solid +10-20% in 5 turns, outperforms parallel sampling! cohenqu.github.io/rise.githu… 🧵⬇️
1
26
125
17,122
We extended our DigiRL approach (digirl-agent.github.io/) to now utilize trained VLM-based Q-functions for building mobile device control agents, with offline RL. Results: 23% to 71% improvement in device control perf. Website: digiq-agent.com/ Paper: digirl-agent.github.io/DigiQ… 🧵⬇️
2
19
125
7,806
🚨 We are organizing an ICLR workshop on self-improving foundation models w/o human supervision at ICLR 2025 in Singapore! Deadline: Feb 7, AoE (submit your ICML papers!) Details: sites.google.com/berkeley.ed… We have an amazing line up of speakers + panelists, more info coming soon.
3
21
125
26,325
🚨On the topic of online RL fine-tuning, we also released another paper that studies unlearning and forgetting, and attempts to fix it! Typically, RL needs offline data during fine-tuning for stability. But this is hard to scale😢 We can avoid this! zhouzypaul.github.io/wsrl/
2
19
123
8,845
A lot of work in agents these days is using reasoning RL to train agents. But is that good enough? @jackbai_jkb & @JunhongShen1 show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction! test-time-interaction.github…
1
9
120
11,223
My students & collaborators are presenting many things at both the @iclr_conf main conf & workshops on topics including reasoning, test-time compute, RL for digital agents, generalist robot policy finetuning, & core deep RL. Go talk to them! At main conf, we are presenting:
1
10
113
6,801
Does test-time scaling help in open-ended problem domains? And by how much & why? This is quite a bit of a nuanced question to answer fully. To begin studying this, w/ @danielkty96 & @AdtRaghunathan we trained LLMs to scale test-time compute for safety. training-adaptive-reasoners-…
7
17
113
7,340
How can we fine-tune generalist policies autonomously w/ RL (value functions)? @maxsobolmark's new paper on Policy-agnostic RL provides a single way to fine-tune generalist VLAs w/ any backbone, output, size (we fine-tune 7B OpenVLA on real robot) policyagnosticrl.github.io/🧵⬇️
1
24
107
7,775
🚨New paper led by @setlur_amrith on process rewards for reasoning! Our PRMs that model specific notion of "progress" reward (NO human supervision) improve: - compute efficiency of search by 1.5-5x - online RL by 6x - 3-4x vs past PRM results arxiv.org/abs/2410.08146 How? 🧵👇
3
19
103
12,527
A crucial component in modern ML seems to be using the *right*, quality subset of data for learning. What does this mean for offline RL? Given an offline dataset, can we also improve perf. by developing automatic ways to filter data? We answer this in our NeurIPS 2023 paper 🧵
1
13
102
21,197
Exciting to see results of our paper reproduced with Llama models! 🎉 If you are interested in learning more, check out our paper here: arxiv.org/abs/2408.03314 (which also evaluates other strategies for scaling test-time compute) + read other references therein!
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥 How? By combining step-wise reward models with tree search algorithms :) We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think" We're open sourcing the full recipe and sharing a detailed blog post 👇
11
102
11,719
Human video (e.g., Ego 4D) pre-training can improve robot control, including for downstream robotic RL. But can we *also* use RL for actually doing video pre-training? Yes! Value-based offline RL can pre-train on video for your robot! Introducing V-PTR 🧵dibyaghosh.com/vptr/
1
13
95
19,900
Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. rail-berkeley.github.io/susi… 🧵⬇️
1
19
93
18,865
At #NeurIPS2024 main conf, we will present several works on understanding offline RL methods, RL for LLM reasoning, agents, etc. led by my students and collaborators. Come talk to us to learn more and discuss future directions + what we are excited about! More details in 🧵⬇️
1
16
94
6,654
Interested in offline RL that improves with limited online interaction rapidly? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! arxiv.org/abs/2303.05479 A thread 🧵...
1
17
92
15,393
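A rough sketch of what a "1-line change on CQL" could look like. It assumes the change is to lower-bound the Q-values that the conservative penalty pushes down by a reference value (for example, a Monte Carlo return from the dataset). The penalty is simplified to plain averages, and all names are hypothetical; see the paper for the exact objective.

```python
def cql_regularizer(q_policy_actions, q_data_actions):
    """Simplified conservative penalty: push Q down on policy actions, up on dataset actions."""
    pushed_down = sum(q_policy_actions) / len(q_policy_actions)
    pushed_up = sum(q_data_actions) / len(q_data_actions)
    return pushed_down - pushed_up

def calibrated_regularizer(q_policy_actions, q_data_actions, reference_values):
    """Same penalty, but Q-values are never pushed below the reference value."""
    # The assumed one-line change: clip from below before applying the penalty.
    clipped = [max(q, v) for q, v in zip(q_policy_actions, reference_values)]
    return cql_regularizer(clipped, q_data_actions)
```

With this clipping, Q-values already below the reference contribute no gradient to the push-down term, which keeps the learned values on a scale comparable to real returns before online fine-tuning starts.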
Check out our new paper on improving exploration in CoT for LLMs by generating abstractions! 👇 Rather than letting the LLM think longer and longer to explore, we can let it first produce concise insights that help guide structured exploration later. This works really well! Led by @QuYuxiao @Anikait_Singh_ @yoonholeee!
🚨 NEW PAPER: "RLAD: Training LLMs to Discover Abstractions for Reasoning"! We introduce reasoning abstractions: concise insights that help LLMs solve hard reasoning problems by guiding structured exploration. 📄 arxiv.org/abs/2510.02263 🌐 cohenqu.github.io/rlad.githu… 🧵[1/N]
10
87
10,841
I am honored to be selected as one of this year's #AI2050 Early Career Fellows, joining an amazing cohort. Thank you, @schmidtsciences, for this incredible opportunity and for supporting our group's research!
We're thrilled to welcome the 2024 cohort of AI2050 Senior and Early Career Fellows –– 25 visionary researchers tackling AI's toughest challenges to ensure it serves humanity for the better. Learn more about this year’s cohort of fellows: schmidtsciences.org/schmidt-…
6
86
6,508
Check out @GraceLiu78 & @QuYuxiao's new paper on training models to know *when* they know enough. This general approach is effective at both addressing overthinking and excessive information-seeking in multi-step agentic problems, resulting in better use of test-time compute. 👇
NEW PAPER: "CaRT: Teaching LLM Agents to Know When They Know Enough"! LLMs often overthink, ask too many questions, or waste compute. We introduce Counterfactuals and Reasoning for Termination (CaRT) - teaching LLMs when to stop gathering info and make decisions. 🧵[1/9]
3
5
67
11,626
Our new paper on understanding why LLMs make up stuff & hallucinate and how RL fine-tuning with an appropriate conservative reward model can mitigate these issues Paper: arxiv.org/abs/2403.05612 A thread below 🧵⬇️ (+ check @katie_kang_ 's thread for many more details)
We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate arxiv.org/abs/2403.05612 🧵
2
10
62
9,548
At #ICLR25 workshops, my students+collabs will give many oral talks on newer stuff (don't miss!): - robot VLA RL fine-tuning @maxsobolmark - optimizing test-time compute @QuYuxiao - why RL is crucial for test-time scaling @setlur_amrith - scaling laws for value-based RL @_oleh - in-context LLM verifiers & search @Anikait_Singh_ 🧵⬇️
1
6
64
5,596
Check out @katie_kang_'s work on understanding memorization vs learning in reasoning! By probing LLMs in training, we identify if an LLM "learns" to answer a question by memorizing or by learning to "draw inferences" ➡️ a metric to predict generalization, insights for data, etc
LLMs excel at fitting finetuning data, but are they learning to reason or just parroting🦜? We found a way to probe a model's learning process to reveal *how* each example is learned. This lets us predict model generalization using only training data, amongst other insights: 🧵
1
7
53
7,621
New paper on using value functions trained via Cal-QL (arxiv.org/abs/2303.05479) for improving "foundation" policies at test-time: improves precision and robot motion on manipulation tasks! Also check out our work on test-time training with value functions: arxiv.org/abs/2406.09329
Many generalist robot policies have been released, but they're not perfect. How can we make them better? Introducing V-GPS🚀: Value Guided Policy Steering, a simple approach to improve any off-the-shelf generalist policy at deployment time.🧵#CoRL2024 🌐nakamotoo.github.io/V-GPS
2
5
51
8,526
Check out Mian’s new work on training **adversarial** critics that dynamically adapt to the policy, when running RL to train LLMs on hard-to-verify tasks.👇 This alleviates the need to extensively verify all rubrics to obtain reward, making RL more practical and robust.
Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes.. We can go prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a procedure that also trains the judge/critic dynamically during RL. The critic finds just one most likely mistake in response, the generator fixes it, and now the critic updates itself to find new mistakes... this adversarial training procedure does really well!
1
5
50
13,081
Check out our work on training VLM Q-functions for building device-control agents: digirl-agent.github.io/DigiQ…
We just made Q function work on 7B VLMs with TD learning. If you work on end-to-end RL with Q functions, you know it's extremely hard. tbh most people give it up right after they finish the first wandb run. Let me show how we got through: A thread 🧵 1/n arxiv.org/abs/2502.15760
11
51
4,803
Check out these awesome new real-robot online RL fine-tuning results that @andy_peng05 and @zhiyuan_zhou_ got with our WSRL method. WSRL appeared at ICLR earlier this year -- check this out for more details: zhouzypaul.github.io/wsrl/ 👇
We tested WSRL (Warm-start RL) on a Franka Robot, and it leads to really efficient online RL fine-tuning in the real world! WSRL learned the peg insertion task perfectly with only 11 minutes of warmup and *7 minutes* of online RL interactions 👇🧵
5
50
4,408
On my way to NOLA for #NeurIPS2023! We will present several works on offline RL, fast online fine-tuning, using pre-trained models for improving low-level robot control, RL pre-training on human videos, and querying VLMs for maximal efficacy in RL. Come talk to us! Details ⬇️
1
1
41
7,243
We show that if LLMs can be made to effectively use verifiers, search, or look at past attempts at a problem at inference time, this can make better use of the same amount of FLOPs than using bigger models or more pre-training compute. arxiv.org/abs/2408.03314
3
8
42
4,559
Do make sure to attend our workshop on self improvement at @iclr_conf on Sunday: we have an amazing lineup of speakers, contributed papers, and then a panel from 5-6pm!
With a stellar lineup of speakers and panelists, including Yoshua Bengio 🙀, the Scaling Self-Improving Foundation Models at @iclr_conf promises to be 🔥 ⏰ Sunday, April 27 📍 Garnet 214-215
4
39
3,510
Check out @max_simchowitz's insightful new paper showing how imitation learning in continuous action spaces is exponentially harder than discrete spaces!
There’s a lot of awesome research about LLM reasoning right now. But how is learning in the physical world 🤖 different than in language 📚? In a new paper, we show that imitation learning in continuous spaces can be exponentially harder than for discrete state spaces, even when the underlying dynamics are seemingly benign and insensitive to perturbations. (1/n)🧵
2
40
3,146
@preston_fu @_oleh and I wrote a blog post on scaling laws and value function based RL, summarizing our two papers in this direction and discussing open questions! value-scaling.github.io/ Check it out! Feedback & comments are very welcome!
3
34
1,424
Check out this spotlight poster today at 3pm! Go to this paper to learn about process rewards, exploration & hard problems (presented by @ianwu97) and then attend @QuYuxiao's oral talk at the FM-Wild workshop to see how these ideas can also help with long CoT / thinking models!
I couldn't be there @iclr_conf but if you are interested in process verifiers that can boost exploration and get LLMs to solve hard problems, check out our spotlight poster on PAVs at 3pm Hall 3+2B #548. Also chat with the amazing @ianwu97 who will be presenting on our behalf!
3
33
2,289
Go check out @GabrielSarch's new work on how to train VLMs to reason in a grounded way via RL. ViGoRL works quite well! I personally like some of the insights about how to induce useful base behaviors in the model that can be amplified via RL for better visual reasoning. visually-grounded-rl.github.…
How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, a RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
1
3
30
2,856
Before the (exciting) workshops on Sun, catch Vincent’s oral talk at the #ICLR2025 main conference on this paper today at 3:30pm, Hall 1 Apex! And don’t forget to talk with the co-leads Vincent and @YiSu37328759 at the poster 10 a.m - 12:30 p.m Hall 3 + Hall 2B #558.
🚨 How can we fine-tune LLMs to implement nuanced algorithmic behaviors at test-time? I've been very behind on posting, but in SCoRe, we studied a special instance: training LLMs to self-correct. arxiv.org/abs/2409.12917 (in v2, we've updated the presentation, 🙏 for feedback!)
4
28
3,020
This was an awesome collaboration led by @seohong_park, w/ @kvfrans and @svlevine. @seohong_park also wrote a terrific blog post (please check it out for more insights and results + short version, if you don't have time): seohong.me/projects/offrl-bo… Paper: arxiv.org/abs/2406.09329
1
5
27
5,570
Check out our work on training large transformer policies on demo and autonomous data (including failures of existing imitation policies) via offline Q-learning. Q-Transformer improves over RT-1 on real robots & provides a recipe for building ever-improving robotic systems! ⬇️
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: q-transformer.github.io 🧵
24
5,438
Unfortunately I am not at ICML this year, but my students & collaborators are there to present some exciting work on RL, RL x LLMs at the conference (3 papers: 2 posters + 1 oral) & workshops (3 talks + 1 poster; some hot off the press work). Please talk to them! Details⬇️
1
1
23
6,580
Main idea: Instead of predicting Q-values in one shot, floq (flow Q-function) treats Q-learning as generative modelling. arxiv.org/abs/2509.06863 TL;DR: A velocity field iteratively transforms noise --> Q-value, while being supervised densely at every step of this integration.
1
3
24
2,406
We have been looking at scaling in RL for a while now. Last year, it seemed clear that the gap was in policy extraction (arxiv.org/abs/2406.09329), but with the use of diffusion/flow policies (see PA-RL: arxiv.org/abs/2412.06685) and other work in the community, it seemed again that the gap is in critic learning. The question then was: what's a reliable and stable way to scale value functions in RL? We tried monolithic critics several times, with normalization, regularizers, etc, but these approaches always seemed a bit finicky. So, we wanted to build a better way of parameterizing a Q-function.
1
3
24
2,420
This was a fun paper led by @AgrawallaBhavya (his first paper in PhD!), w/ @mic_nau! I learned a lot. We started from a very complex approach & iterated a lot to arrive at the core, simple ideas we describe now. Paper: arxiv.org/abs/2509.06863 Comments & feedback very welcome.
4
23
1,714
At @icmlconf workshops on Fri (Jul 26), we will present 2 papers at the FM-in-the-wild and ARLET workshops: - DigiRL: RL for building a SOTA device-control agent digirl-agent.github.io/ - "Is value learning really the main bottleneck in offline RL?" seohong.me/projects/offrl-bo…
1
5
19
1,276
Overall, we show: - Scaling value learning in RL can work really well. - Value learning in RL can benefit from the same principles that made LLMs & diffusion models scale well. - Opens the door to test-time scaling in value learning. - Should be your go-to choice of RL algo.
1
1
19
1,730
1) We need a scalable interaction env, that emulates phone state in real-time for open-ended training. Most Android device control work provided data (e.g., Android in the Wild), but no simulator. + an autonomous evaluator based on Gemini 1.5 Pro github.com/DigiRL-agent/digi…
1
19
1,557
2) Given a total compute + data budget, we use the above Pareto frontiers to determine how to set hyperparameters of this algorithm to attain maximum performance while still being within the overall data+compute budget. We can do this with some neat geometry (see "iso-cost" curves)...
1
16
1,892
Why should this be of interest to you? Improving value functions still remains important (and I personally have been looking at doing so too), but this paper shows that value functions can help if utilized well (DDPG+BC, OPEX, TTT), yet we don't study how to use a value function much.
Conventional wisdom: the BIG blocker holding offline RL behind imitation / SFT, preventing good scaling, etc is the value function. But can we still do well with current value functions? We find: often *policy* learning bottlenecks offline RL scaling: arxiv.org/abs/2406.09329 🧵
1
17
3,105
That's it! This gives us a SOTA VLM agent, DigiRL, on these tasks. Our method outperforms prompting + off-the-shelf VLMs (Gemini 1.5 Pro, GPT 4) as well as SFT / imitation and other VLMs (CogAgent, AutoUI, etc), with around 70% improvement over the next best approach
1
1
16
1,243
To address these, a scalable way is to run autonomous online RL (SFT / imitation is not enough), where the agent interacts with the phone & internet in real time and learns from its mistakes. We did exactly this. Many systems & methodological changes were needed to enable this: ⬇️
1
15
1,655
Device control is very challenging for foundation models: - real-world stochasticity + non-stationarity + distractors (e.g., pop-ups) - websites + internal device state changing - pixel, gesture control ➡️ need to continuously keep agents up-to-date + learn from "own" failures
1
16
2,288
There has been tons of work on new methods for test-time scaling. See this concurrent paper surveying this area, which provides yet another instantiation: arxiv.org/abs/2501.04682 But an understanding of when, why, & how test-time scaling should work at all is still lacking....
1
3
15
1,688
We find that scaling behavior of "SAC-style" value-based RL is predictable, e.g.: - A Pareto frontier between compute and data requirements - Optimal budget allocation to compute and data - Hyperparameter dependencies We show that we can predict these for larger compute, data, or performance, by only running small-scale experiments.
1
4
14
2,004
@nived_rajaraman will give an oral talk at the VerifAI workshop on why RL or verification is needed to effectively scale test-time compute! Lots of interesting insights in this paper! At VerifAI workshop, 3:45pm, April 27 arxiv.org/abs/2502.12118
🚨 RL or distillation/SFT: what to use to train next reasoning model? Which 📈 perf faster as we scale test compute? We answer these in a principled way so you don't have to burn GPUs🔥. 🎯 Ans: RL w/ rewards or verification >> SFT/distillation 😱 arxiv.org/pdf/2502.12118 🧵⤵️
1
2
15
492
Three method ideas from RL: - Advantage-weighted regression for policy learning - Doubly robust advantage estimators using a step-level value function - Automated curriculum to learn on most informative states using instruction-level value function Check the paper for details
1
1
14
1,303
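The first idea in the post above, advantage-weighted regression, can be sketched as follows. This is the generic textbook form (exponentiated advantages as weights on a log-likelihood loss), not necessarily the exact variant used in the paper; `beta` and `max_weight` are hypothetical hyperparameters chosen for illustration.

```python
import math

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated advantages, clipped so a few large values cannot dominate."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]

def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the actions that were taken."""
    weights = awr_weights(advantages, beta, max_weight)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

Actions with positive advantage get weight above 1 and are imitated more strongly; actions with negative advantage are down-weighted rather than discarded, so the update stays a supervised-style loss.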
Lots of ablations/analyses in the paper, one I would like to highlight. Even one flow step outperforms standard critics due to representation learning. Real gains come from multiple flow steps, which help fix errors to give a better Q-function estimate for training the policy.
1
1
14
1,269
We tested different ways of scaling critic capacity: ✅ More flow steps (sequential compute) → better performance for floq ❌ Larger ensembles (parallel compute) → worse than floq ❌ Deeper ResNets → worse despite sequential compute This shows floq & its training is crucial!
1
2
14
1,254
Concretely, we do so by answering resource optimization problems by fitting scaling laws:
1
2
11
1,533
Don't forget to check out @QuYuxiao & @matthewyryang's poster on dense rewards for test-time scaling & why it matters Today at 11 am in East Exhibition Hall, poster E-2712.
Heading to @icmlconf #ICML2025 this week! DM me if you’d like to chat ☕️ Come by our poster sessions on: 🧠 Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (arxiv.org/abs/2503.07572) 🔍 Learning to Discover Abstractions for LLM Reasoning (drive.google.com/file/d/1Sfa…)
2
13
2,568
This means to learn the self-correction algorithm, we need to: (1) train under the model's own distribution (i.e., kind of like on-policy RL), (2) incentivize the indirect behavior (i.e., find a good first response + correct it >> find best first response + minor edits)
1
3
13
3,110
Our scaling laws are **very** different from LLMs -- lots of new aspects to consider in RL! Main results: 1) data / compute requirements to attain a given performance lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio, a hyperparameter that appears centrally in RL
1
11
1,501
Think of how diffusion models denoise step by step, floq “integrates” step by step to refine Q-value estimates. This scales compute for estimating Q-functions by scaling number of flow integration steps (i.e., forward passes) but no extra params. Just like longer CoTs for LLMs.
2
1
13
1,498
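To make the "integrate step by step" picture concrete, here is a minimal sketch of Euler integration of a velocity field from a noise sample to a Q-value estimate. The velocity function stands in for a learned network; the Gaussian initial noise, the function names, and the toy field are all assumptions for illustration, not the released implementation.

```python
import random

def integrate_q(velocity_fn, state_action, num_steps, noise=None, rng=random):
    """Euler-integrate dz/dt = v(z, t, s, a) from t=0 to t=1; z(1) is the Q estimate."""
    z = rng.gauss(0.0, 1.0) if noise is None else noise   # assumed noise distribution
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity_fn(z, t, state_action)      # one forward pass per step
    return z

def toy_velocity(z, t, state_action):
    """Toy stand-in for a learned field: points straight at a made-up 'true' Q-value."""
    q_true = sum(state_action)        # hypothetical target, for illustration only
    return (q_true - z) / (1.0 - t)   # t < 1 inside the loop, so no division by zero
```

With this straight-line toy field any number of steps lands on the target; a learned field is generally curved, which is where extra integration steps (more compute, same parameters) would be expected to matter.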
One can bake this into an objective that can explicitly train models to use test-time compute or use such an idea to measure if scaling test-time compute is useful. Theoretically this implies that dense process-based rewards, RL, & asymmetry b/w generation & verification might be crucial.
1
11
1,448
Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer…) by @YifeiZhou02, further showing the promise behind such ideas.
2
1
12
521
Premise: Ultimately, if we want models to succeed on hard queries, we need to teach them *how* to discover strategies for solving hard inputs, using more inference compute. But, we train models to directly produce the output given input, not to discover strategies.
1
1
11
1,134
Robotic Offline RL from Internet Videos via Value-Function Pre-Training paper page: huggingface.co/papers/2309.1… Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods.
1
12
1,378
..and now you can use the intersections to predict the optimal budget allocation in terms of target performance. This means we can tell you what UTD to run with (and accordingly how to set other hyperparameters) for best use of a certain resource budget
2
11
932
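The budget-allocation step described above can be illustrated with a small sketch. It assumes the fitted scaling laws have already produced, for a target performance, a few Pareto-frontier points of (UTD, data required, compute required), and that total cost is compute plus a weight times data. Both the cost form and the names are assumptions made for this example.

```python
def best_utd(frontier, data_cost_weight):
    """Pick the frontier point with the lowest combined cost.

    frontier: list of (utd, data_required, compute_required) tuples,
              assumed to come from fitted scaling laws for one target performance.
    data_cost_weight: how expensive one unit of data is relative to one unit of compute.
    """
    def cost(point):
        _, data, compute = point
        return compute + data_cost_weight * data   # assumed linear "iso-cost" form
    return min(frontier, key=cost)

# Hypothetical frontier: higher UTD needs less data but more compute.
example_frontier = [(1, 100.0, 10.0), (2, 60.0, 20.0), (4, 40.0, 50.0)]
```

When data is cheap the lowest UTD wins, and when data is expensive the highest UTD wins, which is the trade-off the iso-cost curves capture geometrically.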
The blog posits that: optimizing for test-time compute = train models to be capable of using more tokens for figuring out "how" to discover solutions at test time... ..and not simply output "what" an answer could be. And this generalization problem is equivalent to solving meta RL (see this: arxiv.org/abs/2107.06277)
1
2
11
1,388
The website has gifs of various methods (I could not upload them here as my X would always crash), but please check the website (the panel looks like this image below): digirl-agent.github.io/
3
12
995
What's the approach? What didn't work? - We spent a *lot* of time trying to SFT our way out. - Generated a lot of data of correction traces by prompting models, paired incorrect and correct responses, and trained on this data This is STaR / RFT, but it didn't work (more⬇️).
1
2
12
1,566
This project was amazing & fun led by @JesseFarebro @agarwl_, with a number of fantastic collaborators @QuanVng Jordi Orbay Adrien Ali Taiga @YevgenChebotar @pcastr @AleksandraFaust @svlevine @xiao_ted @AlexIrpan.
12
943
2) Running RL is not easy as we're fine-tuning VLMs amidst so much stochasticity / non-stationarity. Step 1: fine-tune on existing demos via offline RL (e.g., data collected by rolling out off-the-shelf VLMs in the env), Step 2: autonomous online RL. All done with 1.5B VLM.
1
12
1,464
Our goal was to understand + compare approaches of learning from syn. data. Approach: 1. ask GPT / Gemini to give new problems + solutions (see: arxiv.org/abs/2403.04706): 2. run SFT on it 3. STaR (reject bad data) or run RL (with good+bad data)
2
1
12
1,522
To illustrate advantage vs current PRMs (which model future success probabilities or Q-values as in arxiv.org/abs/2408.03314), note this simple example: Here searching with future success probs. throws useful transitions out of beam search at test-time (similar issues with RL)
1
2
11
1,363
Results wise, floq does really well with same default hparams! On benchmarks, like OGBench (the standard benchmark in offline RL now, though ofc i have my complaints about it), we outperform prior results. Also best method on online RL fine-tuning right now (see Fig 5 in paper)
1
1
11
1,296
OK, but why is solving such a meta RL problem useful (in theory)? Why can it do better than standard RL? Does this violate "no free lunch", i.e., why is more tokens => better generalization if all tokens come from a learned model?
1
1
9
917
Overall, this gave us a way to determine how to scale value-based algos. Predictability is going to be crucial in scaling up, since RL comes with many algorithmic components. This is our first step towards that and I am very excited about it... An aside: a while back I wrote some thoughts on RL research, which has more takes on "predictability": cmu-aire.github.io/pages/blo…
1
11
1,386
Turns out that the core concept that meta RL can exploit to improve generalization is "information" gain towards discovering the correct answer from each subsequent token appearing in the test-time output (see illustration⬇️)
1
1
11
863
We derive updates for training a flow Q-function via TD-learning & identify some important design decisions needed to scale compute this way. E.g., a "healthy" floq model would learn a fairly curved field. We used categorical representations for inputs + embed time differently.
1
1
11
1,416