Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Aviral Kumar

@aviral_kumar2

9 Sep 2025

🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️

700

71,372

Aviral Kumar · Jul 17, 2023 · 6:26 PM UTC

Aviral Kumar

@aviral_kumar2

17 Jul 2023

Thrilled to share that I will be joining Carnegie Mellon @SCSatCMU as an Assistant Professor of CS and ML @CSDatCMU @mldcmu in Fall 2024. Extremely thankful to my mentors & collaborators, especially @svlevine! Looking forward to working with amazing students & colleagues at CMU!

651

112,273

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

Aviral Kumar

@aviral_kumar2

21 Jun 2024

🚨 New paper: we trained a SOTA (> GPT4, Gemini) VLM agent, DigiRL, that can do tasks on an Android phone in real time, in the wild, via autonomous offline + online RL Web: digirl-agent.github.io/ Paper: arxiv.org/abs/2406.11896 🧵 ⬇️ / little gif of learning progress👇:

103

607

80,240

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

Aviral Kumar

@aviral_kumar2

10 Jan 2025

Lots of buzz around scaling test-time compute! But from an ML viewpoint: what does it mean to "use" test-time compute wisely? How to train to do so? How to measure that scaling it is useful? This new blog from students @mldcmu provides a conceptual perspective on these! 🧵⬇️ blog.ml.cmu.edu/2025/01/08/o…

407

36,608

Aviral Kumar · Oct 7, 2024 · 10:14 PM UTC

Aviral Kumar

@aviral_kumar2

7 Oct 2024

🚨 How can we fine-tune LLMs to implement nuanced algorithmic behaviors at test-time? I've been very behind on posting, but in SCoRe, we studied a special instance: training LLMs to self-correct. arxiv.org/abs/2409.12917 (in v2, we've updated the presentation, 🙏 for feedback!)

357

51,494

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

Aviral Kumar

@aviral_kumar2

7 Feb 2025

🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL arxiv.org/abs/2502.04327 🧵⬇️

330

37,300

Aviral Kumar · Nov 30, 2023 · 4:17 PM UTC

Aviral Kumar

@aviral_kumar2

30 Nov 2023

Posting this a bit late, but if you are applying for a PhD in AI and are interested in decision making and reinforcement learning, please consider applying to my upcoming lab at CMU by December 13! Details about my interests and application process can be found on my website.

293

44,077

Aviral Kumar · Jun 24, 2024 · 3:13 PM UTC

Aviral Kumar

@aviral_kumar2

24 Jun 2024

🚨 New paper on RL, synthetic data, LLM math reasoning (MATH / GSM8K) TL;DR: RL on wrong responses (yes, "proper" RL, not filtered SFT or STaR / RFT) scales the utility of synthetic data by **8x**, ❌spurious correlations ✅stitching, credit assignment arxiv.org/abs/2406.14532 🧵⬇️

282

32,409

Aviral Kumar · Apr 22, 2024 · 4:15 PM UTC

Aviral Kumar

@aviral_kumar2

22 Apr 2024

Many LLM fine-tuning methods. Unclear what you should use & why? In our new paper, we did an extensive study of on-policy RL, supervised & offline contrastive methods (DPO, IPO) to answer this... 🧵⬇️ On-policy > offline, mode-seeking > mode-covering understanding-rlhf.github.io…

271

37,669

Aviral Kumar · Jun 24, 2025 · 1:27 PM UTC

Aviral Kumar

@aviral_kumar2

24 Jun 2025

Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ tinyurl.com/rlshadis

Sharpening or Discovery, RL or Meta RL?: How RL Improves LLM Reasoning | Notion

Amrith Setlur and Aviral Kumar, Carnegie Mellon University

pinnate-flare-8f3.notion.site

273

17,252

Aviral Kumar · Sep 10, 2025 · 12:54 PM UTC

Aviral Kumar

@aviral_kumar2

10 Sep 2025

🚨🚨New paper: if you want robots to do bimanual long-horizon tasks well, try RaC: a human in-the-loop data collection protocol that naturally amplifies Recovery & Correction behaviors + trains on them. 📈 data efficiency 10x vs prior results (+many nice properties). 🧵⬇️

269

25,881

Aviral Kumar · Mar 7, 2024 · 6:50 AM UTC

Aviral Kumar

@aviral_kumar2

7 Mar 2024

Super simple code change to get value-based deep RL to scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! arxiv.org/abs/2403.03950 🧵⬇️

264

52,308
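To make the "classification loss, not MSE" idea concrete, here is a minimal sketch of a categorical value loss. It assumes a two-hot target encoding over evenly spaced bins, which is swapped in here for simplicity; the paper may prefer a different target distribution. The function names, the bin range `v_min`/`v_max`, and the bin count are illustrative assumptions, not the paper's code.

```python
import math

def two_hot(target, v_min, v_max, num_bins):
    """Spread a scalar target over its two nearest bin centers (assumed encoding)."""
    centers = [v_min + i * (v_max - v_min) / (num_bins - 1) for i in range(num_bins)]
    target = min(max(target, v_min), v_max)            # clip into the supported range
    pos = (target - v_min) / (v_max - v_min) * (num_bins - 1)
    lo = min(int(math.floor(pos)), num_bins - 2)       # index of the lower neighbor
    w_hi = pos - lo                                    # weight on the upper neighbor
    probs = [0.0] * num_bins
    probs[lo] = 1.0 - w_hi
    probs[lo + 1] = w_hi
    return probs, centers

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy_value_loss(logits, target, v_min=-10.0, v_max=10.0):
    """Cross entropy between the two-hot target and the predicted bin distribution."""
    probs, _ = two_hot(target, v_min, v_max, len(logits))
    logp = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(probs, logp))

def expected_value(logits, v_min=-10.0, v_max=10.0):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    _, centers = two_hot(v_min, v_min, v_max, len(logits))
    logp = log_softmax(logits)
    return sum(math.exp(lp) * c for lp, c in zip(logp, centers))
```

The network outputs one logit per bin instead of a single scalar; the scalar value is read back out with `expected_value` whenever a TD target or a policy update needs it.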

Aviral Kumar · Oct 23, 2025 · 4:58 PM UTC

Aviral Kumar

@aviral_kumar2

23 Oct 2025

If you want to try training Q-functions via flow-matching, we just released code and runs: Code: github.com/CMU-AIRe/floq Wandbs: docs.google.com/spreadsheets… Also great to see so many other groups training value functions via flow-matching!

GitHub - CMU-AIRe/floq: Code Release for floq: Training Critics via Flow-Matching for Scaling...

Code Release for floq: Training Critics via Flow-Matching for Scaling Compute In Value-Based RL - CMU-AIRe/floq

github.com

Aviral Kumar

@aviral_kumar2

9 Sep 2025

248

21,595

Aviral Kumar · Mar 12, 2025 · 1:59 PM UTC

Aviral Kumar

@aviral_kumar2

12 Mar 2025

A lot of work focuses on test-time scaling. But we aren't scaling it optimally: simply training a long CoT doesn't mean we use it well. My students developed "v0" of a paradigm to do this optimally by running RL with dense rewards = minimizing regret over long CoT episodes. 🧵⬇️ cohenqu.github.io/mrt.github…

200

16,675

Aviral Kumar · Feb 19, 2025 · 4:07 AM UTC

Aviral Kumar

@aviral_kumar2

19 Feb 2025

🚨🚨New paper that proves systematically training w/ RL (or any method w/ rewards or verifiers) to scale test-time compute >> doing it via SFT or distillation. If you scale up prompts too, the gap b/w RL & SFT gets larger w/ more test-time compute budget! arxiv.org/abs/2502.12118

Amrith Setlur

@setlur_amrith

19 Feb 2025

🚨 RL or distillation/SFT: what to use to train next reasoning model? Which 📈 perf faster as we scale test compute? We answer these in a principled way so you don't have to burn GPUs🔥. 🎯 Ans: RL w/ rewards or verification >> SFT/distillation 😱 arxiv.org/pdf/2502.12118 🧵⤵️

199

21,022

Aviral Kumar · Mar 1, 2024 · 4:22 PM UTC

Aviral Kumar

@aviral_kumar2

1 Mar 2024

How can we train LLM Agents, to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: arxiv.org/abs/2402.19446 Website: yifeizhou02.github.io/archer…

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate completions for a given prompt, but rather...

arxiv.org

190

36,208

Aviral Kumar · Jun 13, 2025 · 8:35 PM UTC

Aviral Kumar

@aviral_kumar2

13 Jun 2025

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️ Website: matthewyryang.com/e3/ Paper: arxiv.org/abs/2506.09026

179

12,504

Aviral Kumar · Sep 3, 2025 · 2:55 PM UTC

Aviral Kumar

@aviral_kumar2

3 Sep 2025

We have been doing work on scaling laws for off-policy RL for some time now and we just put a new paper out: arxiv.org/abs/2508.14881 Here, @preston_fu @_oleh led a study on how to best allocate compute for training value functions in deep RL: 🧵⬇️

160

7,497

Aviral Kumar · Jul 15, 2025 · 2:04 AM UTC

Aviral Kumar

@aviral_kumar2

15 Jul 2025

If you are at #icml25 and are interested in RL algorithms, scaling laws for RL, and test-time scaling (& related stuff), come talk to us at various poster sessions (details ⬇️). We are also presenting some things at workshops later in the week, more on that later.

151

7,184

Aviral Kumar · Jun 14, 2024 · 2:36 PM UTC

Aviral Kumar

@aviral_kumar2

14 Jun 2024

Conventional wisdom: the BIG blocker holding offline RL behind imitation / SFT, preventing good scaling, etc is the value function. But can we still do well with current value functions? We find: often *policy* learning bottlenecks offline RL scaling: arxiv.org/abs/2406.09329 🧵

136

15,194

Aviral Kumar · Dec 6, 2022 · 11:08 PM UTC

Aviral Kumar

@aviral_kumar2

6 Dec 2022

First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. sites.google.com/view/scalin… w/ @svlevine @agarwl_ @younggeng @georgejtucker

Scaling offline RL

Can we train large models via offline RL on large datasets?

sites.google.com

134

Aviral Kumar · Jun 5, 2025 · 7:00 PM UTC

Aviral Kumar

@aviral_kumar2

5 Jun 2025

Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by @seohong_park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️

128

7,289

Aviral Kumar · Aug 8, 2024 · 12:04 AM UTC

Aviral Kumar

@aviral_kumar2

8 Aug 2024

Two new papers on self-improvement: paper 1 today ⬇️ In RISE, we build on online imitation to teach LLMs *how* to improve their own responses *sequentially*. w/ Llama2/3/Mistral, this gives solid +10-20% in 5 turns, outperforms parallel sampling! cohenqu.github.io/rise.githu… 🧵⬇️

125

17,122

Aviral Kumar · Feb 21, 2025 · 9:01 PM UTC

Aviral Kumar

@aviral_kumar2

21 Feb 2025

We extended our DigiRL approach (digirl-agent.github.io/) to now utilize trained VLM-based Q-functions for building mobile device control agents, with offline RL. Results: 23% to 71% improvement in device control perf. Website: digiq-agent.com/ Paper: digirl-agent.github.io/DigiQ… 🧵⬇️

125

7,806

Aviral Kumar · Jan 21, 2025 · 7:21 PM UTC

Aviral Kumar

@aviral_kumar2

21 Jan 2025

🚨 We are organizing an ICLR workshop on self-improving foundation models w/o human supervision at ICLR 2025 in Singapore! Deadline: Feb 7, AoE (submit your ICML papers!) Details: sites.google.com/berkeley.ed… We have an amazing line up of speakers + panelists, more info coming soon.

125

26,325

Aviral Kumar · Dec 18, 2024 · 12:14 AM UTC

Aviral Kumar

@aviral_kumar2

18 Dec 2024

🚨On the topic of online RL fine-tuning, we also released another paper that studies unlearning and forgetting, and attempts to fix it! Typically, RL needs offline data during fine-tuning for stability. But this is hard to scale😢 We can avoid this! zhouzypaul.github.io/wsrl/

123

8,845

Aviral Kumar · Jun 12, 2025 · 5:18 PM UTC

Aviral Kumar

@aviral_kumar2

12 Jun 2025

A lot of work on agents these days uses reasoning RL to train agents. But is that good enough? @jackbai_jkb & @JunhongShen1 show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction! test-time-interaction.github…

120

11,223

Aviral Kumar · Apr 21, 2025 · 11:35 PM UTC

Aviral Kumar

@aviral_kumar2

21 Apr 2025

My students & collaborators are presenting many things at both the @iclr_conf main conf & workshops on topics including reasoning, test-time compute, RL for digital agents, generalist robot policy finetuning, & core deep RL. Go talk to them! At main conf, we are presenting:

113

6,801

Aviral Kumar · Jul 9, 2025 · 4:15 PM UTC

Aviral Kumar

@aviral_kumar2

9 Jul 2025

Does test-time scaling help in open-ended problem domains? And by how much & why? This is quite a nuanced question to answer fully. To begin studying this, w/ @danielkty96 & @AdtRaghunathan we trained LLMs to scale test-time compute for safety. training-adaptive-reasoners-…

113

7,340

Aviral Kumar · Dec 16, 2024 · 10:06 PM UTC

Aviral Kumar

@aviral_kumar2

16 Dec 2024

How can we fine-tune generalist policies autonomously w/ RL (value functions)? @maxsobolmark's new paper on Policy-agnostic RL provides a single way to fine-tune generalist VLAs w/ any backbone, output, size (we fine-tune 7B OpenVLA on real robot) policyagnosticrl.github.io/🧵⬇️

107

7,775

Aviral Kumar · Oct 15, 2024 · 6:22 AM UTC

Aviral Kumar

@aviral_kumar2

15 Oct 2024

🚨New paper led by @setlur_amrith on process rewards for reasoning! Our PRMs that model a specific notion of "progress" reward (NO human supervision) improve:
- compute efficiency of search by 1.5-5x
- online RL by 6x
- 3-4x vs past PRM results
arxiv.org/abs/2410.08146 How? 🧵👇

103

12,527

Aviral Kumar · Oct 16, 2023 · 8:50 PM UTC

Aviral Kumar

@aviral_kumar2

16 Oct 2023

A crucial component in modern ML seems to be using the *right*, quality subset of data for learning. What does this mean for offline RL? Given an offline dataset, can we also improve perf. by developing automatic ways to filter data? We answer this in our NeurIPS 2023 paper 🧵

102

21,197

Aviral Kumar · Dec 16, 2024 · 9:19 PM UTC

Aviral Kumar

@aviral_kumar2

16 Dec 2024

Exciting to see results of our paper reproduced with Llama models! 🎉 If you are interested in learning more, check out our paper here: arxiv.org/abs/2408.03314 (which also evaluates other strategies for scaling test-time compute) + read other references therein!

Scaling LLM Test-Time Compute Optimally can be More Effective than...

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In...

arxiv.org

Lewis Tunstall

@_lewtun

16 Dec 2024

We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥 How? By combining step-wise reward models with tree search algorithms :) We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think" We're open sourcing the full recipe and sharing a detailed blog post 👇

102

11,719

Aviral Kumar · Oct 11, 2023 · 8:01 PM UTC

Aviral Kumar

@aviral_kumar2

11 Oct 2023

Human video (e.g., Ego 4D) pre-training can improve robot control, including for downstream robotic RL. But can we *also* use RL for actually doing video pre-training? Yes! Value-based offline RL can pre-train on video for your robot! Introducing V-PTR 🧵dibyaghosh.com/vptr/

19,900

Aviral Kumar · Oct 17, 2023 · 4:45 PM UTC

Aviral Kumar

@aviral_kumar2

17 Oct 2023

Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. rail-berkeley.github.io/susi… 🧵⬇️

18,865

Aviral Kumar · Dec 8, 2024 · 4:50 PM UTC

Aviral Kumar

@aviral_kumar2

8 Dec 2024

At #NeurIPS2024 main conf, we will present several works on understanding offline RL methods, RL for LLM reasoning, agents, etc. led by my students and collaborators. Come talk to us to learn more and discuss future directions + what we are excited about! More details in 🧵⬇️

6,654

Aviral Kumar · Mar 10, 2023 · 9:27 PM UTC

Aviral Kumar

@aviral_kumar2

10 Mar 2023

Interested in offline RL that improves with limited online interaction rapidly? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! arxiv.org/abs/2303.05479 A thread 🧵...

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning

A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets followed by fast online fine-tuning with limited interaction. However,...

arxiv.org

15,393
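As a rough illustration of what a "1-line change on CQL" could look like, the sketch below contrasts a simplified CQL-style regularizer with a calibrated variant that floors the policy-action Q-values at a reference value (for example, a Monte Carlo return estimate of the behavior policy). This is my reading of the calibration idea, written under that assumption; the function names and the plain-mean form of the regularizer are hypothetical simplifications, not the released implementation.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(xs) / len(xs)

def cql_regularizer(q_policy, q_data):
    """Simplified conservative penalty: push down Q on policy actions,
    push up Q on dataset actions."""
    return mean(q_policy) - mean(q_data)

def cal_ql_regularizer(q_policy, q_data, ref_values):
    """Calibrated variant (assumed form): Q-values on policy actions are
    floored at a per-state reference value before being pushed down, so
    the penalty stops lowering them once they fall below the reference."""
    floored = [max(q, v) for q, v in zip(q_policy, ref_values)]  # the "1-line" change
    return mean(floored) - mean(q_data)
```

With `q_policy=[1.0, 5.0]`, `q_data=[2.0, 2.0]`, and `ref_values=[3.0, 3.0]`, the first policy Q-value is already below its reference, so only the second one contributes a gradient to the push-down term in the calibrated version.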

Aviral Kumar · Oct 7, 2025 · 11:39 AM UTC

Aviral Kumar

@aviral_kumar2

7 Oct 2025

Check out our new paper on improving exploration in CoT for LLMs by generating abstractions! 👇 Rather than letting the LLM think longer and longer to explore, we can let it first produce concise insights that help guide structured exploration later. This works really well! Led by @QuYuxiao @Anikait_Singh_ @yoonholeee!

Yuxiao Qu

@QuYuxiao

3 Oct 2025

🚨 NEW PAPER: "RLAD: Training LLMs to Discover Abstractions for Reasoning"! We introduce reasoning abstractions: concise insights that help LLMs solve hard reasoning problems by guiding structured exploration. 📄 arxiv.org/abs/2510.02263 🌐 cohenqu.github.io/rlad.githu… 🧵[1/N]

10,841

Aviral Kumar · Dec 10, 2024 · 8:30 PM UTC

Aviral Kumar

@aviral_kumar2

10 Dec 2024

I am honored to be selected as one of this year's #AI2050 Early Career Fellows, joining an amazing cohort. Thank you, @schmidtsciences, for this incredible opportunity and for supporting our group's research!

Schmidt Sciences @schmidtsciences

10 Dec 2024

We're thrilled to welcome the 2024 cohort of AI2050 Senior and Early Career Fellows –– 25 visionary researchers tackling AI's toughest challenges to ensure it serves humanity for the better. Learn more about this year’s cohort of fellows: schmidtsciences.org/schmidt-…

6,508

Aviral Kumar · Oct 22, 2025 · 12:30 PM UTC

Aviral Kumar

@aviral_kumar2

22 Oct 2025

Check out @GraceLiu78 & @QuYuxiao's new paper on training models to know *when* they know enough. This general approach is effective at both addressing overthinking and excessive information-seeking in multi-step agentic problems, resulting in better use of test-time compute. 👇

Grace Liu @GraceLiu78

21 Oct 2025

NEW PAPER: "CaRT: Teaching LLM Agents to Know When They Know Enough"! LLMs often overthink, ask too many questions, or waste compute. We introduce Counterfactuals and Reasoning for Termination (CaRT) - teaching LLMs when to stop gathering info and make decisions. 🧵[1/9]

11,626

Aviral Kumar · Mar 12, 2024 · 7:47 PM UTC

Aviral Kumar

@aviral_kumar2

12 Mar 2024

Our new paper on understanding why LLMs make up stuff & hallucinate and how RL fine-tuning with an appropriate conservative reward model can mitigate these issues Paper: arxiv.org/abs/2403.05612 A thread below 🧵⬇️ (+ check @katie_kang_ 's thread for many more details)

Katie Kang @katie_kang_

12 Mar 2024

We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate arxiv.org/abs/2403.05612 🧵

9,548

Aviral Kumar · Apr 25, 2025 · 11:22 PM UTC

Aviral Kumar

@aviral_kumar2

25 Apr 2025

At #ICLR25 workshops, my students+collabs will give many oral talks on newer stuff (don't miss!):
- robot VLA RL fine-tuning @maxsobolmark
- optimizing test-time compute @QuYuxiao
- why RL is crucial for test-time scaling @setlur_amrith
- scaling laws for value-based RL @_oleh
- in-context LLM verifiers & search @Anikait_Singh_
🧵⬇️

5,596

Aviral Kumar · Nov 20, 2024 · 7:56 PM UTC

Aviral Kumar

@aviral_kumar2

20 Nov 2024

Check out @katie_kang_'s work on understanding memorization vs learning in reasoning! By probing LLMs in training, we identify if an LLM "learns" to answer a question by memorizing or by learning to "draw inferences" ➡️ a metric to predict generalization, insights for data, etc

Katie Kang @katie_kang_

19 Nov 2024

LLMs excel at fitting finetuning data, but are they learning to reason or just parroting🦜? We found a way to probe a model's learning process to reveal *how* each example is learned. This lets us predict model generalization using only training data, amongst other insights: 🧵

7,621

Aviral Kumar · Oct 18, 2024 · 4:40 AM UTC

Aviral Kumar

@aviral_kumar2

18 Oct 2024

New paper on using value functions trained via Cal-QL (arxiv.org/abs/2303.05479) for improving "foundation" policies at test-time: improves precision and robot motion on manipulation tasks! Also check out our work on test-time training with value functions: arxiv.org/abs/2406.09329

Mitsuhiko Nakamoto @mitsuhiko_nm

18 Oct 2024

Many generalist robot policies have been released, but they're not perfect. How can we make them better? Introducing V-GPS🚀: Value Guided Policy Steering, a simple approach to improve any off-the-shelf generalist policy at deployment time.🧵#CoRL2024 🌐nakamotoo.github.io/V-GPS

8,526

Aviral Kumar · Nov 8, 2025 · 3:44 PM UTC

Aviral Kumar

@aviral_kumar2

8 Nov 2025

Check out Mian’s new work on training **adversarial** critics that dynamically adapt to the policy, when running RL to train LLMs on hard-to-verify tasks.👇 This alleviates the need to extensively verify all rubrics to obtain reward, making RL more practical and robust.

Mian Wu

@MerlinNoth79247

7 Nov 2025

Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes. We can go prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a procedure that also trains the judge/critic dynamically during RL. The critic finds just the one most likely mistake in the response, the generator fixes it, and now the critic updates itself to find new mistakes... this adversarial training procedure does really well!

13,081

Aviral Kumar · Mar 7, 2025 · 9:03 PM UTC

Aviral Kumar

@aviral_kumar2

7 Mar 2025

Check out our work on training VLM Q-functions for building device-control agents: digirl-agent.github.io/DigiQ…

Jack Bai

@jackbot_cs

7 Mar 2025

We just made Q function work on 7B VLMs with TD learning. If you work on end-to-end RL with Q functions, you know it's extremely hard. tbh most people give it up right after they finish the first wandb run. Let me show how we got through: A thread 🧵 1/n arxiv.org/abs/2502.15760

4,803

Aviral Kumar · Jul 3, 2025 · 6:01 PM UTC

Aviral Kumar

@aviral_kumar2

3 Jul 2025

Check out these awesome new real-robot online RL fine-tuning results that @andy_peng05 and @zhiyuan_zhou_ got with our WSRL method. WSRL appeared at ICLR earlier this year -- check this out for more details: zhouzypaul.github.io/wsrl/ 👇

Paul Zhou @zhiyuan_zhou_

3 Jul 2025

We tested WSRL (Warm-start RL) on a Franka Robot, and it leads to really efficient online RL fine-tuning in the real world! WSRL learned the peg insertion task perfectly with only 11 minutes of warmup and *7 minutes* of online RL interactions 👇🧵

4,408

Aviral Kumar · Dec 12, 2023 · 6:15 PM UTC

Aviral Kumar

@aviral_kumar2

12 Dec 2023

On my way to NOLA for #NeurIPS2023! We will present several works on offline RL, fast online fine-tuning, using pre-trained models for improving low-level robot control, RL pre-training on human videos, and querying VLMs for maximal efficacy in RL. Come talk to us! Details ⬇️

7,243

Aviral Kumar · Aug 8, 2024 · 3:18 PM UTC

Aviral Kumar

@aviral_kumar2

8 Aug 2024

We show that if LLMs can be made to effectively use verifiers, search, or look at past attempts at a problem at inference time, this can make better use of the same amount of FLOPs than using bigger models or more pre-training compute. arxiv.org/abs/2408.03314

4,559

Aviral Kumar · Apr 26, 2025 · 2:45 PM UTC

Aviral Kumar

@aviral_kumar2

26 Apr 2025

Do make sure to attend our workshop on self improvement at @iclr_conf on Sunday: we have an amazing lineup of speakers, contributed papers, and then a panel from 5-6pm!

Roberta Raileanu @robertarail

26 Apr 2025

With a stellar lineup of speakers and panelists, including Yoshua Bengio 🙀, the Scaling Self-Improving Foundation Models at @iclr_conf promises to be 🔥 ⏰ Sunday, April 27 📍 Garnet 214-215

3,510

Aviral Kumar · Apr 17, 2025 · 6:18 PM UTC

Aviral Kumar

@aviral_kumar2

17 Apr 2025

Check out @max_simchowitz's insightful new paper showing how imitation learning in continuous action spaces is exponentially harder than discrete spaces!

Max Simchowitz

@max_simchowitz

16 Apr 2025

There’s a lot of awesome research about LLM reasoning right now. But how is learning in the physical world 🤖 different than in language 📚? In a new paper, we show that imitation learning in continuous spaces can be exponentially harder than for discrete state spaces, even when the underlying dynamics are seemingly benign and insensitive to perturbations. (1/n)🧵

3,146

Aviral Kumar · Sep 8, 2025 · 9:50 PM UTC

Aviral Kumar

@aviral_kumar2

8 Sep 2025

@preston_fu @_oleh and I wrote a blog post on scaling laws and value function based RL, summarizing our two papers in this direction and discussing open questions! value-scaling.github.io/ Check it out! Feedback & comments are very welcome!

1,424

Aviral Kumar · Apr 24, 2025 · 3:39 AM UTC

Aviral Kumar

@aviral_kumar2

24 Apr 2025

Check out this spotlight poster today at 3pm! Go to this paper to learn about process rewards, exploration & hard problems (presented by @ianwu97) and then attend @QuYuxiao's oral talk at the FM-Wild workshop to see how these ideas can also help with long CoT / thinking models!

Amrith Setlur

@setlur_amrith

24 Apr 2025

I couldn't be there @iclr_conf but if you are interested in process verifiers that can boost exploration and get LLMs to solve hard problems, check out our spotlight poster on PAVs at 3pm Hall 3+2B #548. Also chat with the amazing @ianwu97 who will be presenting on our behalf!

2,289

Aviral Kumar · Jun 4, 2025 · 2:26 AM UTC

Aviral Kumar

@aviral_kumar2

4 Jun 2025

Go check out @GabrielSarch's new work on how to train VLMs to reason in a grounded way via RL. ViGoRL works quite well! I personally like some of the insights about how to induce useful base behaviors in the model that can be amplified via RL for better visual reasoning. visually-grounded-rl.github.…

Gabriel Sarch @GabrielSarch

30 May 2025

How can we get VLMs to move their eyes and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵

2,856

Aviral Kumar · Apr 26, 2025 · 1:29 AM UTC

Aviral Kumar

@aviral_kumar2

26 Apr 2025

Before the (exciting) workshops on Sun, catch Vincent’s oral talk at the #ICLR2025 main conference on this paper today at 3:30pm, Hall 1 Apex! And don’t forget to talk with the co-leads Vincent and @YiSu37328759 at the poster 10 a.m - 12:30 p.m Hall 3 + Hall 2B #558.

Aviral Kumar

@aviral_kumar2

7 Oct 2024

3,020

Aviral Kumar · Jun 14, 2024 · 2:36 PM UTC

Aviral Kumar

@aviral_kumar2

14 Jun 2024

This was an awesome collaboration led by @seohong_park, w/ @kvfrans and @svlevine. @seohong_park also wrote a terrific blog post (please check it out for more insights and results + short version, if you don't have time): seohong.me/projects/offrl-bo… Paper: arxiv.org/abs/2406.09329

5,570

Aviral Kumar · Sep 8, 2023 · 3:43 PM UTC

Aviral Kumar

@aviral_kumar2

8 Sep 2023

Check out our work on training large transformer policies on demo and autonomous data (including failures of existing imitation policies) via offline Q-learning. Q-Transformer improves over RT-1 on real robots & provides a recipe for building ever-improving robotic systems! ⬇️

Yevgen Chebotar

@YevgenChebotar

7 Sep 2023

Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: q-transformer.github.io 🧵

5,438

Aviral Kumar · Jul 22, 2024 · 1:54 AM UTC

Aviral Kumar

@aviral_kumar2

22 Jul 2024

Unfortunately I am not at ICML this year, but my students & collaborators are there to present some exciting work on RL, RL x LLMs at the conference (3 papers: 2 posters + 1 oral) & workshops (3 talks + 1 poster; some hot off the press work). Please talk to them! Details⬇️

6,580

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Aviral Kumar

@aviral_kumar2

9 Sep 2025

Main idea: Instead of predicting Q-values in one shot, floq (flow Q-function) treats Q-learning via generative modelling. arxiv.org/abs/2509.06863 TL;DR: A velocity field iteratively transforms noise --> Q-value, while being supervised densely at every step of this integration.

2,406
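The "velocity field transforms noise --> Q-value" idea can be sketched with a toy, scalar flow-matching setup. Everything below is an illustrative assumption: it uses a linear interpolant between a noise sample and a target Q-value, and it substitutes a closed-form velocity field that is told the target directly, whereas a real critic would be a learned network conditioned on the state and action, trained against something like a TD target. None of the names come from the floq code release.

```python
import random

def interpolant(z0, q_target, t):
    """Point on the straight line from noise z0 (t=0) to the target value (t=1)."""
    return (1.0 - t) * z0 + t * q_target

def velocity_target(z0, q_target):
    """Regression target for the velocity along the linear interpolant."""
    return q_target - z0

def ideal_velocity(z, t, q_target):
    """Stand-in for a learned network: the field that moves any z toward q_target.
    A real model would not see q_target; it would condition on (state, action)."""
    return (q_target - z) / (1.0 - t)

def flow_matching_loss(velocity_fn, q_target, num_samples=64, seed=0):
    """Mean squared error between predicted and target velocity at random
    noise samples and random times, i.e. supervision at every integration step."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z0 = rng.gauss(0.0, 1.0)
        t = rng.uniform(0.0, 0.99)          # stay away from t=1 for the stand-in field
        z_t = interpolant(z0, q_target, t)
        err = velocity_fn(z_t, t, q_target) - velocity_target(z0, q_target)
        total += err * err
    return total / num_samples

def integrate_q(velocity_fn, q_target, z0, num_steps=8):
    """Euler integration from noise to a Q-value estimate; more steps means
    more test-time compute spent on a single value prediction."""
    z = z0
    for k in range(num_steps):
        t = k / num_steps
        z += velocity_fn(z, t, q_target) / num_steps
    return z
```

In this toy version, `integrate_q(ideal_velocity, 3.0, z0=-1.2)` lands on the target value, and `num_steps` is the knob that trades compute for accuracy once the velocity field is learned rather than exact.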

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Aviral Kumar

@aviral_kumar2

9 Sep 2025

We have been looking at scaling in RL for a while now. Last year, it seemed clear that the gap was in policy extraction (arxiv.org/abs/2406.09329), but with the use of diffusion/flow policies (see PA-RL: arxiv.org/abs/2412.06685) and other work in the community, it seemed again that the gap is in critic learning. The question then was: what's a reliable and stable way to scale value functions in RL? We tried monolithic critics several times, with normalization, regularizers, etc., but these approaches always seemed a bit finicky. So, we wanted to build a better approach for parameterizing a Q-function.

Is Value Learning Really the Main Bottleneck in Offline RL?

While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform similarly or better with substantially lower data quality by using...

arxiv.org

2,420

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Aviral Kumar

@aviral_kumar2

9 Sep 2025

This was a fun paper led by @AgrawallaBhavya (his first paper in PhD!), w/ @mic_nau! I learned a lot. We started from a very complex approach & iterated a lot to arrive at the core, simple ideas we describe now. Paper: arxiv.org/abs/2509.06863 Comments & feedback very welcome.

floq: Training Critics via Flow-Matching for Scaling Compute in...

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token...

arxiv.org

1,714

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

Aviral Kumar

@aviral_kumar2

7 Feb 2025

This was a really awesome collab led by @_oleh, with @mic_nau, @preston_fu, @sea_snell, @pabbeel, & @svlevine! @_oleh taught me a lot about scaling laws & how we could extend them to RL in this project! Check out paper (& feedback very welcome): arxiv.org/abs/2502.04327

Value-Based Deep RL Scales Predictably

Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their...

arxiv.org

1,311

Aviral Kumar · Jun 21, 2024 · 3:47 AM UTC

Aviral Kumar

@aviral_kumar2

21 Jun 2024

We are super excited to now use this as a foundation to study value-based RL (see our prev work ArCHer: arxiv.org/abs/2402.19446) on real user-scale problems! Awesome collab co-led by @jackbai_jkb @YifeiZhou02 w/ @mertcemri @pan_jiayipan @svlevine @alsuhr

1,234

Aviral Kumar · Jul 25, 2024 · 10:44 PM UTC

Aviral Kumar

@aviral_kumar2

25 Jul 2024

At @icmlconf workshops on Fri (Jul 26), we will present 2 papers at the FM-in-the-wild and ARLET workshops: - DigiRL: RL for building a SOTA device-control agent digirl-agent.github.io/ - "Is value learning really the main bottleneck in offline RL?" seohong.me/projects/offrl-bo…

1,276

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Overall, we show:
- Scaling value learning in RL can work really well.
- Value learning in RL can benefit from the same principles that made LLMs & diffusion models scale well.
- This opens the door to test-time scaling in value learning.
- floq should be your go-to choice of RL algo.

1,730

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

1) We need a scalable interaction env that emulates phone state in real time for open-ended training. Most Android device-control work provided data (e.g., Android in the Wild), but no simulator. + an autonomous evaluator based on Gemini 1.5 Pro github.com/DigiRL-agent/digi…

GitHub - DigiRL-agent/digirl: Official repo for paper DigiRL: Training In-The-Wild Device-Control...

Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. - DigiRL-agent/digirl

github.com

1,557

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

2) Given a total compute + data budget, we use the above Pareto frontiers to determine how to set hyperparameters of this algorithm to attain maximum performance while still being within the overall data+compute budget. We can do this with some neat geometry (see "iso-cost" curves)...

1,892
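
The tweet above is about maximizing performance under a fixed budget. A closely related step, sketched here, is picking the cheapest point on a data/compute frontier for one fixed target performance. Points of equal total cost form a straight "iso-cost" line in the (data, compute) plane. Everything in this sketch is an assumption for illustration: the frontier functions `toy_data_needed` and `toy_compute_needed`, the `data_price` exchange rate, and the candidate UTD values are made up and are not taken from the paper.

```python
def cheapest_utd(candidates, data_needed, compute_needed, data_price):
    """Pick the UTD whose (compute, data) point has the lowest total cost.

    Total cost = compute + data_price * data, so points of equal cost lie on
    a straight "iso-cost" line in the (data, compute) plane.
    """
    def total_cost(utd):
        return compute_needed(utd) + data_price * data_needed(utd)
    return min(candidates, key=total_cost)


# Hypothetical frontier for one target performance: a higher UTD needs
# less data but more gradient updates. These shapes are invented.
def toy_data_needed(utd):
    return 1000.0 * (1.0 + 4.0 / utd)


def toy_compute_needed(utd):
    return utd * toy_data_needed(utd)


best = cheapest_utd([1, 2, 4, 8, 16], toy_data_needed, toy_compute_needed,
                    data_price=4.0)
print("cheapest UTD on the toy frontier:", best)
```

With real fitted curves in place of the toy ones, the same search would be repeated per target performance to trace out how the best UTD moves with budget.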

Aviral Kumar · Jun 14, 2024 · 2:42 PM UTC

Why should this be of interest to you? Improving value functions still remains important (and I personally have been working on this too), but this paper shows that value functions can help if utilized well (DDPG+BC, OPEX, TTT), yet we don't study how to use a value function much.

Aviral Kumar

@aviral_kumar2

14 Jun 2024

3,105

Aviral Kumar · Jun 21, 2024 · 3:47 AM UTC

That's it! This gives us a SOTA VLM agent, DigiRL, on these tasks. Our method outperforms prompting + off-the-shelf VLM (Gemini 1.5 Pro, GPT 4) as well as SFT / imitation and other VLMs (CogAgent, AutoUI, etc.), with around 70% improvement over the next best approach.

1,243

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

To address these, a scalable way is to run autonomous online RL (SFT / imitation is not enough), where the agent interacts with the phone & internet in real time and learns from its mistakes. We did exactly this. A lot of systems & methodological work needed to be done to enable this: ⬇️

1,655

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

Device control is very challenging for foundation models:
- real-world stochasticity + non-stationarity + distractors (e.g., pop-ups)
- websites + internal device state changing
- pixel, gesture control
➡️ need to continuously keep agents up-to-date + learn from "own" failures

2,288

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

There has been tons of work on new methods for test-time scaling. See this concurrent paper surveying this area, which provides yet another instantiation: arxiv.org/abs/2501.04682 But an understanding of when, why, & how test-time scaling should work at all is still missing....

Towards System 2 Reasoning in LLMs: Learning How to Think With...

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular...

arxiv.org

1,688

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

We find that scaling behavior of "SAC-style" value-based RL is predictable, e.g.:
- A Pareto frontier between compute and data requirements
- Optimal budget allocation to compute and data
- Hyperparameter dependencies
We show that we can predict these for larger compute, data, or performance, by only running small-scale experiments.

2,004
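
To make "predict from small-scale experiments" concrete, here is a minimal sketch of the generic recipe: fit a curve on small runs, then extrapolate. It assumes a plain power law y = a * x**b fitted by least squares in log-log space, which may not be the functional form the paper uses. The function names and the data points are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line through the log-log points.
    b = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
         / sum((u - mean_x) ** 2 for u in log_x))
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at a (possibly larger) x."""
    return a * x ** b


# Made-up "small-scale" measurements that follow y = 2 * x**-0.5 exactly.
small_xs = [1.0, 2.0, 4.0, 8.0]
small_ys = [2.0 * x ** -0.5 for x in small_xs]
a, b = fit_power_law(small_xs, small_ys)
print("fitted exponent:", round(b, 3), "extrapolated value at x=64:",
      round(predict(a, b, 64.0), 3))
```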

Aviral Kumar · Apr 25, 2025 · 11:22 PM UTC

@nived_rajaraman will give an oral talk at the VerifAI workshop on why RL or verification is needed to effectively scale test-time compute! Lots of interesting insights in this paper! At VerifAI workshop, 3:45pm, April 27 arxiv.org/abs/2502.12118

Amrith Setlur

@setlur_amrith

19 Feb 2025

492

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

Three method ideas from RL:
- Advantage-weighted regression for policy learning
- Doubly robust advantage estimators using a step-level value function
- Automated curriculum to learn on most informative states using instruction-level value function
Check the paper for details

1,303
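
For readers who haven't seen advantage-weighted regression, here is a minimal sketch of the textbook form: a log-likelihood loss where each action is weighted by exp(advantage / beta). This is the generic recipe, not necessarily the exact variant used in DigiRL. The temperature `beta` and the clipping value `max_weight` are assumptions chosen for illustration.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """exp(A / beta) per sample, clipped at max_weight for stability."""
    cap = math.log(max_weight)
    # Clip in log space so very large advantages cannot overflow exp().
    return [math.exp(min(a / beta, cap)) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the taken actions."""
    weights = awr_weights(advantages, beta, max_weight)
    weighted = sum(w * lp for w, lp in zip(weights, log_probs))
    return -weighted / len(log_probs)


# Actions with higher advantage get a larger say in the policy update.
print(awr_weights([-1.0, 0.0, 1.0]))
```

In a real training loop the weights would be treated as constants (no gradient through the advantage) and `log_probs` would come from the policy being trained.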

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Lots of ablations/analyses in the paper; here is one I would like to highlight. Even one flow step outperforms standard critics due to representation learning. Real gains come from multiple flow steps, which help fix errors to give a better Q-function estimate for training the policy.

1,269

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

We tested different ways of scaling critic capacity:
✅ More flow steps (sequential compute) → better performance for floq
❌ Larger ensembles (parallel compute) → worse than floq
❌ Deeper ResNets → worse despite sequential compute
This shows floq & its training are crucial!

1,254

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

Concretely, we do so by fitting scaling laws to answer resource optimization problems:

1,533

Aviral Kumar · Jul 17, 2025 · 3:26 PM UTC

Don't forget to check out @QuYuxiao & @matthewyryang's poster on dense rewards for test-time scaling & why it matters. Today at 11 am in East Exhibition Hall, poster E-2712.

Yuxiao Qu

@QuYuxiao

14 Jul 2025

Heading to @icmlconf #ICML2025 this week! DM me if you’d like to chat ☕️ Come by our poster sessions on: 🧠 Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (arxiv.org/abs/2503.07572) 🔍 Learning to Discover Abstractions for LLM Reasoning (drive.google.com/file/d/1Sfa…)

2,568

Aviral Kumar · Oct 7, 2024 · 10:14 PM UTC

This means to learn the self-correction algorithm, we need to: (1) train under the model's own distribution (i.e., kind of like on-policy RL), (2) incentivize the indirect behavior (i.e., find a good first response + correct it >> find best first response + minor edits)

3,110

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

Our scaling laws are **very** different from LLMs -- lots of new aspects to consider in RL! Main results: 1) data / compute requirements to attain a given performance lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio, a hyperparameter that appears centrally in RL

1,501
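
A minimal sketch of what the UTD ratio means in a training loop: after every environment step (one unit of data), run `utd_ratio` gradient updates (units of compute). The callbacks and the counter helper are hypothetical stand-ins, not code from the paper.

```python
def train(num_env_steps, utd_ratio, collect_transition, gradient_update):
    """Run `utd_ratio` gradient updates after every environment step."""
    for _ in range(num_env_steps):
        collect_transition()          # data cost: one new transition
        for _ in range(utd_ratio):
            gradient_update()         # compute cost: one gradient update


def count_costs(num_env_steps, utd_ratio):
    """Count data and compute used by `train`, with dummy callbacks."""
    counts = {"data": 0, "updates": 0}

    def collect():
        counts["data"] += 1

    def update():
        counts["updates"] += 1

    train(num_env_steps, utd_ratio, collect, update)
    return counts


# Same data, different compute: raising UTD trades compute for data.
print(count_costs(100, 1), count_costs(100, 4))
```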

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Think of how diffusion models denoise step by step: floq “integrates” step by step to refine Q-value estimates. This scales compute for estimating Q-functions by scaling the number of flow integration steps (i.e., forward passes) with no extra params. Just like longer CoTs for LLMs.

1,498
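
A minimal sketch of the "integrate step by step" idea, assuming plain Euler integration of a scalar velocity field from t=0 to t=1. The `toy_velocity` function is a hypothetical stand-in for a learned network, and the start value `z0` and the integrator are assumptions; the paper's actual choices may differ. The point is only that each extra step costs one more function call (one forward pass) and no extra parameters.

```python
import math


def integrate_q(velocity, state_action, num_steps, z0=0.0):
    """Euler-integrate a scalar velocity field from t=0 to t=1.

    Each step costs one call to `velocity`, so `num_steps` controls
    compute without adding parameters.
    """
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z += dt * velocity(z, t, state_action)
    return z


def toy_velocity(z, t, state_action):
    # Hypothetical stand-in for a learned network: pulls z toward a fixed
    # value, here simply the sum of the inputs.
    target = sum(state_action)
    return 2.0 * (target - z)


# Closed-form solution of the toy ODE at t=1, used only to measure error.
exact = 3.0 - 3.0 * math.exp(-2.0)
for steps in (1, 2, 8, 32):
    estimate = integrate_q(toy_velocity, (1.0, 2.0), steps)
    print(steps, "steps, integration error:", abs(estimate - exact))
```

On this toy field the integration error shrinks as the number of steps grows, which is the sense in which more steps buy a more refined estimate.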

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

One can bake this into an objective that can explicitly train models to use test-time compute or use such an idea to measure if scaling test-time compute is useful. Theoretically this implies that dense process-based rewards, RL, & asymmetry b/w generation & verification might be crucial.

1,448

Aviral Kumar · Jun 5, 2025 · 7:00 PM UTC

Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer…) by @YifeiZhou02, further showing the promise behind such ideas.

521

Aviral Kumar · Oct 7, 2024 · 10:14 PM UTC

Premise: Ultimately, if we want models to succeed on hard queries, we need to teach them *how* to discover strategies for solving hard inputs, using more inference compute. But, we train models to directly produce the output given input, not to discover strategies.

1,134

Aviral Kumar · Oct 11, 2023 · 8:01 PM UTC

Great collab led by @ChetBhateja, Derek & @its_dibya. w/ @Anikait_Singh_, @manan_tomar, @QuanVng, @YevgenChebotar, @svlevine! I was quite(?) late in posting, but check: nitter.app/svlevine/status/170634…, nitter.app/_akhaliq/status/170618… dibyaghosh.com/vptr/ Paper: arxiv.org/abs/2309.13041

@_akhaliq

25 Sep 2023

Robotic Offline RL from Internet Videos via Value-Function Pre-Training paper page: huggingface.co/papers/2309.1… Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods.

1,378

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

..and now you can use the intersections to predict the optimal budget allocation in terms of target performance. This means we can tell you what UTD to run with (and accordingly how to set other hyperparameters) for best use of a certain resource budget

932

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

The blog posits that: optimizing for test-time compute = train models to be capable of using more tokens for figuring out "how" to discover solutions at test time... ..and not simply output "what" an answer could be. And this generalization problem is equivalent to solving meta RL (see this: arxiv.org/abs/2107.06277)

1,388

Aviral Kumar · Jun 21, 2024 · 3:47 AM UTC

The website has gifs of various methods (I could not upload them here as my X would always crash), but please check the website (the panel looks like this image below): digirl-agent.github.io/

995

Aviral Kumar · Oct 7, 2024 · 10:14 PM UTC

What's the approach? What didn't work?
- We spent a *lot* of time trying to SFT our way out.
- Generated a lot of data of correction traces by prompting models, paired incorrect and correct responses, and trained on this data
This is STaR / RFT, but it didn't work (more⬇️).

1,566

Aviral Kumar · Mar 7, 2024 · 6:50 AM UTC

This project was amazing & fun, led by @JesseFarebro @agarwl_, with a number of fantastic collaborators @QuanVng Jordi Orbay Adrien Ali Taiga @YevgenChebotar @pcastr @AleksandraFaust @svlevine @xiao_ted @AlexIrpan.

943

Aviral Kumar · Jun 21, 2024 · 3:28 AM UTC

2) Running RL is not easy as we're fine-tuning VLMs amidst so much stochasticity / non-stationarity. Step 1: fine-tune on existing demos via offline RL (e.g., data collected by rolling out off-the-shelf VLMs in the env), Step 2: autonomous online RL. All done with 1.5B VLM.

1,464

Aviral Kumar · Jun 24, 2024 · 3:13 PM UTC

Our goal was to understand + compare approaches of learning from syn. data. Approach:
1. ask GPT / Gemini to give new problems + solutions (see: arxiv.org/abs/2403.04706)
2. run SFT on it
3. STaR (reject bad data) or run RL (with good+bad data)

1,522

Aviral Kumar · Sep 3, 2025 · 2:55 PM UTC

There are tons more results and analyses in the paper; please check them out. arxiv.org/abs/2508.14881 Very fun collab led by @_oleh @preston_fu w/ @zhiyuan_zhou_ @mic_nau @pabbeel @svlevine

Compute-Optimal Scaling for Value-Based Deep RL

As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal...

arxiv.org

658

Aviral Kumar · Oct 15, 2024 · 6:22 AM UTC

To illustrate advantage vs current PRMs (which model future success probabilities or Q-values as in arxiv.org/abs/2408.03314), note this simple example: Here searching with future success probs. throws useful transitions out of beam search at test-time (similar issues with RL)

1,363

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

Results-wise, floq does really well with the same default hparams! On benchmarks like OGBench (the standard benchmark in offline RL now, though ofc I have my complaints about it), we outperform prior results. Also best method on online RL fine-tuning right now (see Fig 5 in paper)

1,296

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

OK, but why is solving such a meta RL problem useful (in theory)? Why can it do better than standard RL? Does this violate "no free lunch", i.e., why is more tokens => better generalization if all tokens come from a learned model?

917

Aviral Kumar · Feb 7, 2025 · 7:26 AM UTC

Overall, this gave us a way to determine how to scale value-based algos. Predictability is going to be crucial in scaling up, since RL comes with many algorithmic components. This is our first step towards that and I am very excited about it... An aside: a while back I wrote some thoughts on RL research, which has more takes on "predictability": cmu-aire.github.io/pages/blo…

1,386

Aviral Kumar · Jan 10, 2025 · 3:31 AM UTC

Turns out that the core concept that meta RL can exploit to improve generalization is "information" gain towards discovering the correct answer from each subsequent token appearing in the test-time output (see illustration⬇️)

863

Aviral Kumar · Sep 9, 2025 · 12:46 PM UTC

We derive updates for training a flow Q-function via TD-learning & identify some important design decisions needed to scale compute this way. E.g., a "healthy" floq model would learn a fairly curved field. We used categorical representations for inputs + embed time differently.

1,416
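
To give a feel for what a TD update on a flow Q-function could look like, here is a rough sketch. It takes the generic conditional flow-matching regression (straight-line path from a noise sample to a target) and swaps in a bootstrapped TD target as the endpoint. This is an assumption about the general shape, not the update derived in the paper. The noise range, the uniform time sampling, and all function names are hypothetical, and conditioning on the state-action pair is omitted for brevity.

```python
import random


def td_flow_matching_loss(velocity, reward, next_q, gamma, rng):
    """Squared error of one flow-matching sample with a TD target."""
    y = reward + gamma * next_q       # bootstrapped TD target (scalar)
    z0 = rng.uniform(-1.0, 1.0)       # noise sample; the range is an assumption
    t = rng.random()                  # interpolation time in [0, 1)
    z_t = (1.0 - t) * z0 + t * y      # point on the straight line noise -> target
    target_velocity = y - z0          # velocity of that straight-line path
    return (velocity(z_t, t) - target_velocity) ** 2


def make_oracle_velocity(y):
    # For a known target y, the straight-line velocity can be recovered
    # from (z_t, t) alone, so this field attains zero loss.
    return lambda z, t: (y - z) / (1.0 - t)


rng = random.Random(0)
oracle = make_oracle_velocity(1.0 + 0.9 * 2.0)
print("oracle loss:", td_flow_matching_loss(oracle, 1.0, 2.0, 0.9, rng))
```

In practice `next_q` would itself come from integrating a target copy of the flow at the next state, and the loss would be averaged over a batch and minimized by gradient descent.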