RL Bigrun @xai @spacex Grok 4.x and Grok Build Coding Model Reinforcement Learner

Finally released. More to come.
An early beta of Grok Build, an agentic CLI for coding, building apps, and automating workflows, is now available for SuperGrok Heavy subscribers. Through this early beta, we will improve the model and product based on your feedback. Try it at x.ai/cli
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug we detailed, which was particularly problematic on A100s: On A100 GPUs (and L20s), under certain batch/seq lengths, the kernel triggers its split_kv path. This path had a bug that incorrectly transposed the LSE layout. This corrupted tensor caused a "complete precision collapse" in Cascade Attention, which is what we measured as that massive mismatch KL. Good luck with the FP16/BF16 tests, I'd love to know what you find! yingru.notion.site/When-Spee…
Replying to @_arohan_
:) Original plots come from nitter.app/RichardYRLi/status/197… - also their blog is super good! - still unsure if the FP16 vs BF16 debate is due to hardware issues due to FP32 accumulation sizes - planning to run some experiments!
(1/x) Ever had your #LLM-#RL training mysteriously collapse? 📉 You're not alone. We saw #agentic RL runs fail with exploding #gradients, and found the culprit: a fundamental "training-inference mismatch." Our new #blog post demystifies this vicious cycle. yingru.notion.site/When-Spee…
Inspired by @thinkymachines 's "#LoRA Without Regret" post, I formalized their insight that policy gradient learns ~1 bit per episode via Bayesian #RL formulation. I prove this is a hard information-theoretic ceiling and extend the analysis to actor-critic methods. Full writeup with theorems: richardli.xyz/post/informati… 🧵
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
1/ Great thread by @IdanShenfeld on Policy Mirror Descent (PMD) and its likely use in models like Kimi K2. It's a powerful technique for stabilizing RL. We'd like to highlight our NeurIPS 2019 work which was one of the first to frame policy optimization as a mirror descent problem and demonstrate its effectiveness in deep RL, where our method significantly outperformed the PPO baseline. Our paper: "Divergence-Augmented Policy Optimization" with Professor Tong Zhang. proceedings.neurips.cc/paper…
Everyone’s talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from the Kimi K2 and K1.5 reports, it likely uses Policy Mirror Descent - an RL trick that’s quietly becoming standard in frontier labs. Let’s break down what it is:
Discover theoretical advancements and applications in #GenAI Reasoning & Agents at #INFORMS2024! 🚀 Sessions on: #LLM Agents, RL, Exploration, Alignment, Gene-Editing, Math Reasoning & more! Oct 20 & 21, Summit-342, Seattle Convention Center. @Zanette_ai @RichardYRLi @MingYin_0312 @g_k_swamy @JoeyHejna @KaixuanHuang1 @weixiong_1 @chanwoopark20
Replying to @agarwl_
Great thread on the "training-inference mismatch," which our post detailed last month. The core issue is that the severe mismatch on low-probability tokens creates an off-policy problem, making sequence-level Importance Sampling the principled fix. Simpler ideas you mentioned, like an fp32 lm_head, unfortunately don't work—we tested it. The good news: based on this, the vLLM/veRL & SGLang/SLIME teams are on it! 🚀 Full deep dive: yingru.notion.site/When-Spee…
Spot on. The core challenge is still sample-efficient and scalable exploration for sparse rewards. This is why work building on ideas like Bootstrapped DQN, Ensemble Sampling and EpiNet—such as the new HyperAgent—is so important. arxiv.org/abs/2402.10228
We're still (mostly) in the massively low sample regime for LLMs+RL. Recall that Montezuma's Revenge and other sparse reward settings would take O(1-10M) steps/frames before starting to actually learn anything at all.
(4/x) We were shocked to find that hardware is a first-order variable. The exact same code can succeed on one GPU but fail spectacularly on another. Our tests showed mismatch levels followed H20 < L20 < A100. Results aren't always portable! #MLOps #GPU
(2/x) Why does agentic RL crash? The chain reaction often starts with tool use. Tool outputs are Out-of-Distribution (OOD) context for the LLM. This unfamiliar input makes the policy uncertain, causing it to sample more low-probability tokens—the first domino to fall.
[1/7] Want to save #GPU and #Data budget, addressing #AI #risk? Join us @icmlconf, a group of researchers from @cuhksz, focusing on #optimization, #RL, and #LLMs. #ICML2024
(3/x) Here's the smoking gun: the training-inference mismatch isn't uniform. It's catastrophically worse for these exact low-probability tokens. The training framework assigns them near-zero probability, while the inference engine saw them as merely unlikely. This massive divergence causes collapse.
(9/x) This is an active area of research for us. The training-inference mismatch is a deep problem, especially with the rise of #MoE models and more complex #agentic #systems. We will be continually updating our blog with new findings. Follow us and stay tuned for future updates!
🚨 UPDATE to the "1 bit per episode" analysis (inspired by @johnschulman's post at @thinkymachines ): After discussion with @mgostIH, I need to point out that the limit only applies to *scalar advantage*! REINFORCE with per-timestep advantages can learn O(T) bits when rewards are dense and independent.
Inspired by @thinkymachines 's "#LoRA Without Regret" post, I formalized their insight that policy gradient learns ~1 bit per episode via Bayesian #RL formulation. I prove this is a hard information-theoretic ceiling and extend the analysis to actor-critic methods. Full writeup with theorems: richardli.xyz/post/informati… 🧵
(7/x) Key Takeaways for Practitioners: 1️⃣ Mismatch is a fundamental speed vs. stability trade-off. 2️⃣ Monitor vllm-kl & gradient norms. 3️⃣ Beware of low-prob tokens from OOD inputs (tool use!). 4️⃣ Hardware is a key variable. 5️⃣ Use sequence-level IS as the default fix.
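Takeaway 2 can be sketched as follows, assuming you already log per-token log-probabilities of the sampled tokens from both the rollout engine and the training framework. The helper name `mismatch_kl` and the array names are hypothetical, and the estimator shown (the sample mean of log μ minus log π over tokens drawn from the rollout policy) is one common Monte Carlo estimate of KL(μ‖π), not necessarily the exact `vllm-kl` metric used in the blog.

```python
import numpy as np

def mismatch_kl(rollout_logprob, train_logprob):
    """Monte Carlo estimate of KL(rollout || train) from sampled tokens.

    Both inputs hold log-probabilities of the *same* sampled tokens,
    one value per token. Names are illustrative, not from any library.
    """
    # Upcast first: differencing low-precision log-probs loses accuracy.
    mu = np.asarray(rollout_logprob, dtype=np.float64)
    pi = np.asarray(train_logprob, dtype=np.float64)
    # Tokens were sampled from the rollout policy, so the mean of
    # (log mu - log pi) estimates KL(mu || pi).
    return float(np.mean(mu - pi))

# Toy numbers: the last token is low-probability and badly mismatched.
rollout = np.log(np.array([0.9, 0.8, 0.01], dtype=np.float32))
train = np.log(np.array([0.9, 0.8, 0.0001], dtype=np.float32))
kl = mismatch_kl(rollout, train)
```

In this toy case the whole estimate comes from the single low-probability token, which is the pattern the thread warns about.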
(8/x) The training-inference mismatch is a critical challenge for the future of agentic AI. We documented all our findings, effective and ineffective, to help others navigate this issue. Read the full investigation: yingru.notion.site/When-Spee… By Jiacai Liu @RichardYRLi @Chert_Fu @JarvisMSUst @sivil_taram Yu Shen #LLMRL #Research
(6/x) The principled solution is Sequence-Level Importance Sampling. It provides an unbiased gradient correction by accounting for the entire trajectory's probability, restoring training stability. We also propose Masked Importance Sampling (MIS), which further improves performance.
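A minimal sketch of the trajectory-wide ratio, assuming per-token log-probabilities of one sampled response are available under both the training policy and the rollout policy. The function names are hypothetical; the per-token ratios are computed only for contrast.

```python
import numpy as np

def sequence_is_weight(train_logprob, rollout_logprob):
    """Trajectory-wide importance ratio pi(y|x) / mu(y|x).

    Inputs are per-token log-probs of one sampled response under the
    training policy (pi) and the rollout policy (mu). Hypothetical helper.
    """
    pi = np.asarray(train_logprob, dtype=np.float64)
    mu = np.asarray(rollout_logprob, dtype=np.float64)
    # Product of token ratios == exp of the summed log-ratio (stable form).
    return float(np.exp(np.sum(pi - mu)))

def token_is_weights(train_logprob, rollout_logprob):
    """Per-token ratios, shown only for contrast with the sequence weight."""
    pi = np.asarray(train_logprob, dtype=np.float64)
    mu = np.asarray(rollout_logprob, dtype=np.float64)
    return np.exp(pi - mu)

train = np.log([0.5, 0.4, 0.2])
rollout = np.log([0.5, 0.5, 0.4])
w_seq = sequence_is_weight(train, rollout)  # product of 1.0, 0.8 and 0.5
w_tok = token_is_weights(train, rollout)
```

Summing log-ratios before exponentiating avoids underflow on long responses, which is why the sketch works in log space.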
Great to see the vLLM team tackling the training-inference mismatch! Their new post achieves bitwise consistency—a deep engineering fix to make training & inference identical (at a 2.4x speed cost). They also mentioned our blog in the post. In our blog, we explored the algorithmic side: accepting the mismatch to preserve speed & correcting the resulting off-policy bias with rollout distribution correction like importance sampling and rejection sampling. We found the mismatch is most dangerous for low-probability tokens in agentic/tool-use RL. Two vital approaches to the same core challenge! Our deep-dive on the problem & our algorithmic fix: yingru.notion.site/When-Spee…
🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch
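Why batching can change numerics at all: floating-point addition is not associative, so a kernel that regroups a reduction for a different batch shape can return different bits for the same inputs. The toy example below is only an illustration in NumPy float32, not code from vLLM or TorchTitan.

```python
import numpy as np

big = np.float32(1e8)
one = np.float32(1.0)

# Same three numbers, two reduction orders, float32 arithmetic.
left_to_right = (big + one) - big  # 1.0 is absorbed by 1e8 before the subtraction
regrouped = (big - big) + one      # cancellation happens first, so 1.0 survives
```

Here `left_to_right` is 0.0 and `regrouped` is 1.0, because float32 values near 1e8 are spaced 8 apart. A batch-invariant kernel fixes the reduction order so this regrouping cannot depend on batch size.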
So, why is sequence-level IS the principled solution for the training-inference mismatch, not the token-level heuristic? The token-level approximation is theoretically unsound due to a double mismatch: 🗺️ State Occupancy Mismatch: It uses the state distribution from the rollout policy, failing to correct for taking a different path. 🎯 Mismatched Reward Signal: It evaluates actions using the rollout policy's advantage function—the wrong measuring stick. Sequence-level IS elegantly solves both with a single trajectory-wide ratio. The derivations below show the math is unambiguous. @zzlccc @willccbb @nanjiang_cs #ReinforcementLearning #LLM #AI #RL
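For reference, a sketch of the identity behind the sequence-level correction, written in notation chosen here rather than taken from the blog: μ is the rollout (inference) policy, π_θ the training policy, x the prompt, y the sampled response of length T, and R the trajectory reward. It assumes μ assigns nonzero probability to every response that π_θ can produce.

```latex
\begin{align*}
\rho(y \mid x) &= \frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}
  = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})} \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right] \\
  &= \mathbb{E}_{y \sim \mu}\!\left[ \rho(y \mid x)\, R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
\end{align*}
```

The last equality is the standard change of measure. Reweighting each token by its own ratio while still drawing prefixes from μ does not reproduce this expectation in general, which is the state-occupancy mismatch described above.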
Replying to @RichardYRLi
(6/x) The principled solution is Sequence-Level Importance Sampling. It provides an unbiased gradient correction by accounting for the entire trajectory's probability, restoring training stability. We also propose Masked Importance Sampling (MIS), which further improves performance.
The SimpleTIR paper is officially out! We go beyond our July blog post to provide a deeper mathematical explanation and rigorous proof for why multi-turn RL agents are so unstable. The root cause? A predictable domino effect: OOD Tool Feedback → Low-Prob Tokens → Exploding Importance Ratios and logits update magnitude → Gradient Explosion. We discovered that "void turns" are the critical signal for when this is about to happen. We stress-tested this theory against the latest trajectory quality control methods. None of them could fix the core instability problem (max score: 26.3 on AIME24). But by simply filtering out these "void turns," SimpleTIR makes training fundamentally stable. This minimalist approach rockets the Qwen2.5-7B-Base AIME24 score from 22.1 to 50.5, completely from base model without any SFT. Simple, effective, and battle-tested. Hope this helps the community build more reliable agents! 📜 Paper: huggingface.co/papers/2509.0… 💻 Code: github.com/ltzheng/SimpleTIR ✍️ Blog: simpletir.notion.site/report
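The filtering idea can be sketched as below. Everything here is an assumption made for illustration: the `Turn` and `Trajectory` containers, the `<code>`, `</code>` and `<answer>` markers, and the rule that a turn is "void" when it contains neither a complete code block nor a final answer. The SimpleTIR repository should be consulted for the actual criterion.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    text: str

@dataclass
class Trajectory:
    turns: List[Turn]

def is_void_turn(turn: Turn) -> bool:
    """Hypothetical rule: no complete code block and no final answer."""
    has_code = "<code>" in turn.text and "</code>" in turn.text
    has_answer = "<answer>" in turn.text
    return not (has_code or has_answer)

def drop_void_trajectories(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories in which every turn does something useful."""
    return [
        traj for traj in trajectories
        if not any(is_void_turn(turn) for turn in traj.turns)
    ]

good = Trajectory([Turn("<code>print(2 + 2)</code>"), Turn("<answer>4</answer>")])
bad = Trajectory([Turn("<code>print(2 + 2)"), Turn("<answer>4</answer>")])  # unclosed code
kept = drop_void_trajectories([good, bad])
```

Only `good` survives: the first turn of `bad` opens a code block without closing it and gives no answer.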
Thanks AK for sharing our work! 🔥 🧵 Back to Jan when we just started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research teams seem to hit the same invisible wall (Devin, verl and other reported issues) 🧱 Multi-turn tool reasoning just... BROKE 💔 It's NOT like the elegant simplicity of R1-Zero’s approach. This was pure chaos. Then came the "aha!" moment of our SimpleTIR💡✨ The secret was hiding in plain sight: “void turns” - those meaningless steps where models generate text that leads... absolutely NOWHERE 🕳️ One simple filter changed everything ✨ Our 7B model jumped from 22% (DAPO) to 50% (Multi-Turn Tool Use) on AIME24 📈 No complex algorithms, no fancy techniques. Just removing the void turn examples that were poisoning the training 🎯 Sometimes, the biggest gains come from understanding what NOT to learn 💡 📄 Paper: huggingface.co/papers/2509.0…
💻 Code: github.com/ltzheng/SimpleTIR ✍️ Blog: simpletir.notion.site/report
Replying to @willccbb
what happened?
This is more than just a new recipe for LoRA; it's a masterclass in how first-principles thinking drives frontier research. For years, the community saw a performance gap, but this work shows it was never an architectural flaw—just a misunderstanding of the method. @thinkymachines @johnschulman2
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
(5/x) The fix is Importance Sampling (IS), but the token-level approach is a trap. We show it's theoretically flawed due to biased gradients—it ignores state distribution shifts & uses mismatched rewards. In our experiments, token-level IS still collapsed.
I'll be at #ICML2024 from July 21st to 27th! Check out the picture for detailed locations and times of my presentations. Also, don't miss my lightning talk on "Agile #Human-#AI #Collaboration for #RiskOversight" at the #AlignmentWorkshop @farairesearch on 21st and 22nd. #RL #Probability #GPT #Martingale #Statistics #Uncertainty #OnlineLearning #decisionmaking #Moderation #Alignment #Risk #Oversight #Dynamics #ICML #conference #Vienna #FoundationModel #LLMs #exploration #poster #HyperAgent #Theory #Practice #math
Quoted from Ben Van Roy (Stanford Professor): @karpathy piped.video/fERcdhg0Kds?si=lhVj…
Replying to @karpathy
@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss if the problem is with the algorithms, or if the problem is with posing a problem as an RL problem. 1/x
Efficient Exploration by HyperDQN in Deep Reinforcement Learning. At #ICML2021 RL4RealLife Workshop on Friday 23rd. See you at 23:00 - 1:00 Poster Discussion. #GatherTown: A4 in eventhosts.gather.town/PHQYR… With @ZiniuLi, Hao Liang, and Tong Zhang
Our ICML24 work directly addresses this concern. Effective RL is possible with dramatic simplification: 20x fewer parameters, 7x fewer samples (figure 1), and removal of many conventional tricks (table 3) - while still achieving human-level performance. arxiv.org/abs/2402.10228
Is there a hyperparameter crisis in reinforcement learning? Let's talk about it at NeurIPS. #NeurIPS2024
on A100s? How large is your starting mismatch_KL? Have you converted the rollout_logprob to higher precision when calculating the mismatch_KL?
FYI
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug we detailed, which was particularly problematic on A100s: On A100 GPUs (and L20s), under certain batch/seq lengths, the kernel triggers its split_kv path. This path had a bug that incorrectly transposed the LSE layout. This corrupted tensor caused a "complete precision collapse" in Cascade Attention, which is what we measured as that massive mismatch KL. Good luck with the FP16/BF16 tests, I'd love to know what you find! yingru.notion.site/When-Spee…
📈 Scale real-time online decisions without scaling model size? Yes! Presenting simple, cost-effective, theory-backed RL solutions at #ICML2024: 🕐 Today, 23 Jul, 11:30am-1pm 📍 Hall C 4-9, Poster #1303 🔬 Poster Session 1 Title: "Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent" Code: github.com/szrlee/HyperAgent #Posterior #DeepRL #AI #Uncertainty #Exploration #Regret #Theory #Computation #Efficiency
I'll be at #ICML2024 from July 21st to 27th! Check out the picture for detailed locations and times of my presentations. Also, don't miss my lightning talk on "Agile #Human-#AI #Collaboration for #RiskOversight" at the #AlignmentWorkshop @farairesearch on 21st and 22nd. #RL #Probability #GPT #Martingale #Statistics #Uncertainty #OnlineLearning #decisionmaking #Moderation #Alignment #Risk #Oversight #Dynamics #ICML #conference #Vienna #FoundationModel #LLMs #exploration #poster #HyperAgent #Theory #Practice #math
5/ And of course, all modern work on this topic stands on the shoulders of giants. A huge credit is due to Nemirovsky and Yudin (1983) for their seminal work that introduced the mirror descent method in the first place. A classic from optimization theory powering today's AI. Link: isye.gatech.edu/~nemirovs/Ne…
4/ It's exciting to see these foundational ideas from 2019 becoming central to today's frontier models. As @IdanShenfeld noted, while PMD-like methods were tricky in continuous domains, they are perfectly suited for the discrete action spaces of LLMs. arXiv Link to the NeurIPS2019 paper: arxiv.org/abs/2501.15034
I said hardware differences remain but didn’t say FP16 is better than BF16. I have not done the experiments.
FYI
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug we detailed, which was particularly problematic on A100s: On A100 GPUs (and L20s), under certain batch/seq lengths, the kernel triggers its split_kv path. This path had a bug that incorrectly transposed the LSE layout. This corrupted tensor caused a "complete precision collapse" in Cascade Attention, which is what we measured as that massive mismatch KL. Good luck with the FP16/BF16 tests, I'd love to know what you find! yingru.notion.site/When-Spee…
The #first prior-dependent analysis of posterior sampling #RL with function approx. It implies that integration of prior knowledge (such as offline #data or pre-trained #LLMs ) can significantly improve the #efficiency of RL #agents before online #exploration. @aistats_conf
Prior-dependent analysis of posterior sampling reinforcement learning with function approximation ift.tt/CVNAZhy
4/ Key insight: My original analysis of actor-critic already established this framework—bijective mappings preserve information! {δₜ} ↔ {rₜ} for actor-critic ✓ {Gₜ} ↔ {rₜ} for REINFORCE ✓ Thanks @mgostIH for pointing out REINFORCE fits the same framework! richardli.xyz/post/informati…
5/ I am grateful to @mgostIH for the insightful conversation about per-timestep advantages in REINFORCE. Their observation that using Adv_t = G_t - b (where G_t is the return from timestep t onward) preserves the bijective mapping {G_t} ↔ {r_t} and thus avoids the information collapse was the key insight that led to revising this paper to include the return-based policy gradient formulation. The original version only analyzed scalar advantages and actor-critic TD errors; the discussion clarified that REINFORCE with per-timestep returns also preserves full reward information.
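The bijection can be checked numerically. The sketch below assumes undiscounted returns and a finite episode (so the return after the last step is zero); the function names are made up for illustration.

```python
import numpy as np

def returns_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_{T-1} (undiscounted)."""
    r = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(r[::-1])[::-1]

def rewards_from_returns(returns):
    """Invert the map: r_t = G_t - G_{t+1}, with the return after the end set to 0."""
    g = np.asarray(returns, dtype=np.float64)
    return g - np.append(g[1:], 0.0)

rewards = np.array([1.0, 0.0, 1.0, 1.0])
G = returns_to_go(rewards)             # per-timestep signal
recovered = rewards_from_returns(G)    # every reward is recoverable
scalar_return = float(rewards.sum())   # a scalar advantage only sees this sum
```

Many different reward sequences share the same `scalar_return`, whereas `G` determines the rewards uniquely. That is the difference between the scalar and per-timestep cases.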
Replying to @chijinML
Congrats, Prof. Jin! We learned a lot from your foundational work on RL theory.
[7/7] Come and join us for the discussion! @RichardYRLi @ZiniuLi @ericzhang0410 @RuoyuSun_UI @chcoli #ICML2024
3/ Information capacity comparison:
Scalar advantage:
- Terminal rewards: O(1) bits
- Dense independent rewards: O(log T) bits
Per-timestep advantages [REINFORCE or actor-critic]:
- Terminal rewards: O(1) bits
- Dense independent rewards: O(T) bits
The ceiling depends on BOTH gradient structure AND reward structure 📊
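A worked instance of these orders of growth, under the simplifying assumption (made here, not stated in the thread) that every reward is binary, r_t ∈ {0, 1}, over T steps. The bounds are the standard "entropy is at most log of the number of outcomes" inequality.

```latex
\begin{align*}
\text{Terminal reward } R \in \{0, 1\}: \quad
  & H(R) \le \log_2 2 = 1 \text{ bit} \\
\text{Scalar sum } S = \textstyle\sum_{t=1}^{T} r_t \in \{0, \dots, T\}: \quad
  & H(S) \le \log_2 (T + 1) = O(\log T) \text{ bits} \\
\text{Full sequence } (r_1, \dots, r_T) \in \{0, 1\}^T: \quad
  & H(r_1, \dots, r_T) \le T \text{ bits}
\end{align*}
```

The last bound is attained when the rewards are independent fair coin flips, which is the "dense independent" case in the comparison.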
3/ Why did this matter? This approach led to more stable learning and deeper exploration, making it far more robust when reusing off-policy data. The results spoke for themselves. Our policy mirror descent method (PPO+DA) significantly outperformed the standard PPO baseline on a wide range of Atari games, especially in data-scarce scenarios.
Replying to @wucathy
Would be great to chat on efficient RL and game theoretical decision making under uncertainty! My homepage RichardLi.xyz
2/ The 1-bit ceiling is real, but specific to:
- Scalar advantage: one return number per episode [sum(r_t) - b]
- Terminal rewards
Standard REINFORCE uses per-timestep advantages (Gₜ - b), where Gₜ is return-to-go → {Gₜ} ↔ {rₜ} bijection preserves full reward entropy!
Quite impressive! A big step towards democratizing foundation models training with Adam-mini. 🚀 @ericzhang0410 @chcoli @ZiniuLi
Thrilled to introduce Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training. The design of Adam-mini is inspired by certain Hessian structures we observed on Transformers. Feel free to try it out! Try switching to Adam-mini with the same hyperparams of AdamW, it would work with only half memory. Hope Adam-mini can help save time, cost, and energy in your tasks! Paper: "Adam-mini: Use Fewer Learning Rates To Gain More" arxiv.org/abs/2406.16793 Code: github.com/zyushun/Adam-mini
True
RL and Bayes are not machine learning subfields. They are the axioms of life and universe. You can't get rid of them.
2/ Our key insight in "Divergence-Augmented Policy Optimization" (DAPO) was to view policy updates through the lens of mirror descent on the entire state-action distribution μ(s, a). This balances maximizing future rewards (the r term) against the cost of deviating from our current distribution (the D(μ, μₜ) term). It's a more stable update rule. This is a subtle but crucial shift from just regularizing action probabilities at a given state. Instead of just penalizing policy changes, we penalize changes in the outcomes of the policy across the whole environment
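Read as a generic mirror descent step, the trade-off described above has the following shape. This is a schematic under assumptions made here: η > 0 is a step size and 𝓜 is the set of valid state-action distributions, neither of which is named in the thread, and the paper's exact objective may differ.

```latex
\mu_{t+1} \;=\; \arg\max_{\mu \in \mathcal{M}}
  \;\Big\{ \langle r, \mu \rangle \;-\; \tfrac{1}{\eta}\, D(\mu, \mu_t) \Big\},
\qquad
\langle r, \mu \rangle = \sum_{s, a} \mu(s, a)\, r(s, a)
```

The first term rewards distributions that collect more reward, and the divergence term D keeps the new state-action distribution close to the current one μ_t.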
Replying to @nanjiang_cs
Great to know about this! Thank you!
Visionary
Great keynote by David Silver, arguing that we need to re-focus on RL to get out of the LLM Valley @RL_Conference
Indeed, agents require (online) continual learning.
I agree with @karpathy 's take here. The interview between @RichardSSutton and @dwarkesh_sp was interesting, but I think at times there was a communication gap due to some misunderstandings. I would say that the current LLM training setup is very similar to the classic model-free RL setup, except that with LLMs: (1) the policy is warm-started from a supervised model (no de-novo, self-directed learning); (2) there is a train/test distinction (no continual learning); (3) most of the observation stream comes from human words, which already "carve nature at its joints", bypassing the harder problem of learning useful abstractions from raw sensorimotor streams. (4) when using multimodal models, the perceptual encoder is usually pre-trained and frozen, and often relies on a lot of human engineering (eg contrastive losses, or pixel-prediction losses) to come up with a good set of (soft) tokens. Most of the interview seem to focus on issue #1. However, the discussion seemed confused here due to the fact that LLMs are both a world model (predict what humans would typically say) and a policy (predict what the agent should do). Obviously the model from the supervised pretraining stage is not action-conditioned, so Sutton does not want to call it a WM - but it is a predictor of future observations given the past, so it's like a WM that marginalizes over actions (resulting in a mixture). The WM is then converted into a (goal-conditioned) policy using IFT (imitation learning) and then improved with RLFT, which further confuses the discussion. In current practice, the RLFT stage mostly just uses human provided reasoning tasks, which are bandit problems that do not involve interacting with an environment. But there is a recent move towards true multi-step RL, where LLMs do learn from external environments, as in classic RL. This fact was not emphasized enough in the interview, IMHO. 
Andrej argues that warm-starting is a practical alternative to evolution's outer meta-learning loop, and I agree, so I don't have a problem with #1. But I do agree with Sutton's criticisms #2-#4. In particular, I expect a lot of future progress to come from continual RL applied to multimodal problems (eg. visual GUI-using agents) in non-stationary multi-agent environments (e.g., e-commerce or embodied AI), where the agent learns its own abstractions over time (eg creating tool libraries), it learns both a (goal agnostic) world model and a (goal conditioned) policy (so it can do decision time planning), and both kinds of model become semi-parametric (eg. combining memories and ICL with gradient-based weight updates). Future agents will not just be a frozen "omni-transformer", consuming and generating tokens, they will be heterogeneous adaptive systems, with many different specialized modules, more like the brain. (This may make serving hard, but who said intelligence would be easy to reproduce?) I think Sutton will like this new paradigm more :)
What might be the major difference between AWAC and MARWIL? It is already implemented in Ray RLlib. Is there any comparison? MARWIL: ai.tencent.com/ailab/media/p… Ray RLlib implementation: docs.ray.io/en/latest/rllib-…
Flying out to #NeurIPS2023 today from HK!
Replying to @RichardYRLi
I will be at #NeurIPS2023 next week. Please DM me if you would like to chat about anything related to #ReinforcementLearning / Decision-making #agents / Random Projection for Efficient #Algorithms😋
The FP16 paper also confirmed that our sequence-level algorithm is the only one that survived in their BF16 setups. The algorithm and follow-up analysis can be used for general off-policy problems, not just the mismatch scenario.
I'm also excited to share our work on ensemble theory and algorithm design that could address your mentioned "computational overhead in ensemble-based information gain estimation"! Check out our Ensemble++ paper with code: arxiv.org/abs/2407.13195
Replying to @shaneguML
What is the right notion of online RL? It is a bit tricky, as we have not even solved catastrophic forgetting or loss of plasticity when training general DNNs (which also applies to LLMs). Second, exploration in policy-based deep RL is still full of engineering tricks and lacks principles.
My recent research is highly related to Blackwell's approachability concept.
1/ David Blackwell. Leagues ahead of his time: "What is a good prediction strategy, and how well can you do?" While some things do not seem possible, "Looking for a p [probability] that does well against every x [an outcome] seems hopeless", Blackwell does give us a strategy:
Very interesting! @baharanm @sjoshi804
I'll be giving a 2-hour tutorial on data-efficient learning with my PhD student @sjoshi804 on Monday July 22 at #ICML2024. Join us to learn more about this cool topic! ➡️ We can learn better from better data! ⬅️🙌🌱
This poster is a preliminary version of our work. We believe HyperDQN, a randomized exploration algorithm with a modern #DeepLearning architecture, is one of the right approaches to bringing theory to practice and bridging the gap between them. Thanks to the organizers @yuxili99 of the #RL4RealLife workshop.
Replying to @KaixuanHuang1
Cool! When will this dataset be available?
Replying to @notmisha
work from Tencent AI Lab ai.tencent.com
Bringing groundbreaking solutions to #NeurIPS: "Efficient RL via Hypermodel." Tackling the major RL challenge of data and computational efficiency, our results mark a paradigm shift in the field. Don’t miss this insightful session! #AIResearch #ReinforcementLearning @NeurIPSConf
Replying to @abeirami
Excellent analysis—it's great to see this information-theoretic lens being applied. This aligns perfectly with the framework I laid out in my post on "Information Bandwidth." We both arrive at the same core conclusion: the bottleneck is two-fold. My post is a deep dive into both of these aspects: The Signal Bottleneck: Analyzing how the information ceiling is fundamentally limited by the reward structure itself (e.g., terminal/binary rewards offering ≤1 bit vs. dense rewards offering thousands). The Algorithm Bottleneck: Showing how policy gradient's aggregation structure inherently destroys this signal, while an algorithm like Actor-Critic is designed to preserve it. It's all about the crucial interplay between signal density and algorithm design. Great to be thinking on the same wavelength! The full analysis is here: richardli.xyz/post/informati… @johnschulman2 @AlexGDimakis @rm_rafailov
that is why we use the hierarchical MDP formulation. arxiv.org/abs/2509.02479
Replying to @nanjiang_cs
It makes sense given the autoregressive modeling. But if the intent is to optimize a combinatorial action set at the beginning, then it is open loop, and we do not even know how to optimize it in LLMs. This implies the underlying MDP is always token level.
Replying to @MichaelHBowling
Our ICML24 work directly addresses this concern. Effective RL is possible with dramatic simplification: 20x fewer parameters, 7x fewer samples (figure 1), and removal of many conventional tricks (table 3) - while still achieving human-level performance. arxiv.org/abs/2402.10228
Interestingly, we found adding noise in the right way to value target with MSE provides significant practical and theoretical benefits in computation and data efficiency! arxiv.org/abs/2402.10228 @marcgbellemare @aviral_kumar2
Replying to @jtadkins49
Our ICML24 work directly addresses this concern. Effective RL is possible with dramatic simplification: 20x fewer parameters, 7x fewer samples (figure 1), and removal of many conventional tricks (table 3) - while still achieving human-level performance. arxiv.org/abs/2402.10228
Replying to @D_Nohara @QPHutu
These are all supported in verl (verl.readthedocs.io/en/lates…): bypassing old_logprob and no PPO clipping. This was discussed in Section 4.2.1 of the blog: yingru.notion.site/When-Spee…
Replying to @zzlccc
Excellent question! You've nailed the central challenge of using sequence-level IS in practice. In the blog, our current solution is to directly manage the ratio with methods like Truncated IS (TIS) and Masked IS (MIS). They control variance by capping or zeroing out extreme ratios from outlier trajectories. We found MIS works particularly well. But you're right to imply these are still heuristics. We're actively working on more principled variance reduction techniques for a future post. Stay tuned!
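A minimal sketch of the two ratio-control rules as described in this reply: capping extreme ratios (TIS) versus zeroing them out (MIS). The function names, the single upper threshold `cap`, and its value are assumptions for illustration. The blog's actual thresholds, and whether a lower bound is also applied, are not stated here.

```python
import numpy as np

def truncated_is(weights, cap):
    """Truncated IS: cap each sequence-level ratio from above at `cap`."""
    return np.minimum(np.asarray(weights, dtype=np.float64), cap)

def masked_is(weights, cap):
    """Masked IS: zero out ratios above `cap` instead of capping them."""
    w = np.asarray(weights, dtype=np.float64)
    return np.where(w > cap, 0.0, w)

seq_weights = np.array([0.7, 1.1, 25.0])  # the last trajectory is an outlier
tis = truncated_is(seq_weights, cap=2.0)
mis = masked_is(seq_weights, cap=2.0)
```

With TIS the outlier trajectory still contributes with weight 2.0. With MIS it is dropped from the gradient entirely, while in-range trajectories are untouched by both rules.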
Replying to @agarwl_
Hi Rishabh, we also had a set of experiments showing that (on-policy) GRPO/RLOO can crash on H100 with BF16.
Want to discuss more on the theory-practice gap in (randomized) exploration in RL? Welcome to join the oral presentation at 09:10, 14 Dec (ET). neurips.cc/virtual/2021/work… @ZiniuLi @EcoTheoryRL
(3/3) The practical implications are huge. This provides a clear, theory-backed path for deploying cheaper, more flexible fine-tuning setups and truly unlocks large-scale, low-cost custom agent development. This is what makes a frontier lab, a frontier lab. Congrats @johnschulman2 @thinkymachines
We shall bring that into practice.
Replying to @nanjiang_cs
Thank you for the detailed comments. I totally agree that in LLMs, your definition of "seq-level" PG is essentially the token-level PG, and there is no ambiguity here: it is just the decomposition of the log-prob using the autoregressive property. What I am confused about is the point on optimizing combinatorial actions that are carried out from the beginning. I will try to understand this.
Very interesting! Would be great to chat!
[5/7] RODI-Rep: A computation-efficient distributional RL algorithm for risk-sensitive RL with regret guarantee. [Paper](arxiv.org/abs/2210.14051v3). FoRLaC workshop @icmlconf
A great follow-up point for the theorists here: to be precise, this analysis focuses on the information upper bound (the "bandwidth"). To strictly prove superiority, one might need to show that lower_bound(AC) > upper_bound(PG). However, deriving a tight lower bound on information gain is difficult as it depends heavily on implementation details like critic accuracy. The key takeaway is that PG has a low, fixed structural ceiling due to reward aggregation. AC removes this ceiling, giving it access to a much richer information stream.
[2/7] HyperAgent: 85% data budget saving 📉 / 95% smaller model size 🧠 in deep RL benchmarks. Saving 90% labeling budget 🏷️ in human-AI collaboration for #risk oversight. Closing a theoretical gap in scalable uncertainty estimation and exploration. [HyperAgent: Paper](arxiv.org/abs/2402.10228) | [ICML Poster](icml.cc/virtual/2024/poster/…) [GPT-HyperAgent: Paper](arxiv.org/abs/2407.13195) | [Poster](icml.cc/virtual/2024/35857)
I will be at #NeurIPS2023 next week. Please DM me if you would like to chat about anything related to #ReinforcementLearning / Decision-making #agents / Random Projection for Efficient #Algorithms😋
Replying to @shaneguML
Our ICML24 work directly addresses this concern. Effective RL is possible with dramatic simplification: 20x fewer parameters, 7x fewer samples (figure 1), and removal of many conventional tricks (table 3) - while still achieving human-level performance. arxiv.org/abs/2402.10228
Replying to @DanielRuss0
congratulations!
Replying to @BlackHC
While one might propose enforcing identical calculations (e.g., using "batch invariant kernels" in thinking machines blog), these solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
HyperAgent shows that the path forward may be through simpler, more principled approaches rather than hyperparameter complexity.
Replying to @gerardsans
This is a fantastic and deeply insightful take. You're right that the root problem is a fundamental, conceptual one—not just an engineering mismatch. We see our work as diagnosing and fixing the immediate, practical symptoms (the system collapse) that arise from this "deeper rot." While the field grapples with the long-term challenge of causal grounding, stabilizing the fragile systems we have today is a critical first step. The cookbook analogy is perfect. Really appreciate you adding this valuable layer to the discussion!
Low-probability tokens are the weak link where the training-inference mismatch explodes. This new finding explains the core instability mechanism and complements the original great work on this problem by @fengyao1909 and team. fengyao.notion.site/off-poli…
Replying to @RichardYRLi
(3/x) Here's the smoking gun: the training-inference mismatch isn't uniform. It's catastrophically worse for these exact low-probability tokens. The training framework assigns them near-zero probability, while the inference engine saw them as merely unlikely. This massive divergence causes collapse.
[4/7] Adam-mini: A “mini” version of Adam that saves near-50% memory 💻. [Paper](arxiv.org/abs/2406.16793). ES-FoMo workshop @icmlconf