🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
29
186
1,368
330,931
Super excited that @karpathy noticed our work! Hopefully it helps the broader community realize that *precision* deserves a place in our design space.
10
26
1,654
279,229
much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/blob/…
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
8
89
786
204,741
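For a rough sense of why rank-1 LoRA is so cheap to train, here is a back-of-the-envelope count of trainable parameters for a single linear layer. The 4096x4096 layer size is an assumed example, not something from the thread, and trainable-parameter count is only one contributor to vRAM, so this sketch does not by itself reproduce the 43% figure.

```python
def full_params(d_in: int, d_out: int) -> int:
    # Full fine-tuning updates every entry of the weight matrix W.
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA learns W + B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    return rank * (d_in + d_out)


d = 4096  # assumed hidden size, for illustration only
print("full:", full_params(d, d))
print("LoRA rank=1:", lora_params(d, d, 1))
print("ratio:", lora_params(d, d, 1) / full_params(d, d))
```

At rank 1 the adapter for this assumed layer has a few thousand trainable parameters versus millions for the full matrix, which is where the optimizer-state and gradient memory savings come from.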
exactly. and we will never derive a term like 1/|o|. seeing so many papers still using the original GRPO is sad.
I was surprised by how many didn't know that (1) per-token MLE is whole-sequence MLE, and (2) PG at the token level is the same as PG at the sequence level (optimizing one big combinatorial action). The story is different if you introduce fitted critics/Q-values or intermediate resets.
7
49
577
61,890
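Both points in the exchange above follow from the chain rule of probability. A short derivation sketch, with notation assumed here rather than taken from the thread: o is a response of length |o| to question q, and A(q, o) is a sequence-level advantage (reward minus a baseline that does not depend on o).

```latex
\log \pi_\theta(o \mid q) = \sum_{t=1}^{|o|} \log \pi_\theta(o_t \mid q, o_{<t})
\quad\Longrightarrow\quad
\nabla_\theta J(\theta)
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}
    \left[ A(q, o) \sum_{t=1}^{|o|} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \right]
```

The sum over tokens appears with no 1/|o| factor, so dividing each response's loss by its own length is an extra choice rather than something the policy gradient produces.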
Nothing feels more exciting than writing a thesis proposal on RL for LLMs before 2025 ends!! Covering a subset of my first-author works done in the past 1.5 years (after switching from traditional RL to LLM RL…) Tentative title, of course
16
61
503
60,819
BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…
13
45
506
78,196
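To make the precision point concrete, here is a minimal sketch of how coarsely each 16-bit format rounds the same number. It only illustrates rounding granularity (FP16 stores 10 mantissa bits, BF16 stores 7) and is not the paper's analysis; the helper names and the value 0.3 are assumptions for illustration, and NaN/inf handling is ignored.

```python
import struct


def to_fp16(x: float) -> float:
    # IEEE half precision: 5 exponent bits, 10 stored mantissa bits.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    # bfloat16 keeps the top 16 bits of a float32: 8 exponent bits, 7 stored mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


p = 0.3  # stand-in for a token probability
print("fp16:", to_fp16(p), "abs error:", abs(to_fp16(p) - p))
print("bf16:", to_bf16(p), "abs error:", abs(to_bf16(p) - p))
```

BF16 trades mantissa bits for exponent range, so two engines that compute the same probability along slightly different paths can land on different BF16 values more easily than on different FP16 values.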
🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵
17
71
468
116,763
With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!
7
60
465
45,438
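A minimal sketch of what "importance sampling on the behavior policy" can look like, assuming the behavior policy is the inference engine that actually sampled the tokens. The function names and the cap of 2.0 are hypothetical and this is not oat's actual code; truncating the ratio is one common variant, shown here as an assumption.

```python
import math


def truncated_is_weights(logp_learner, logp_sampler, cap=2.0):
    """Per-token weights min(pi_learner / pi_sampler, cap)."""
    return [min(math.exp(lt - ls), cap) for lt, ls in zip(logp_learner, logp_sampler)]


def weighted_pg_loss(logp_learner, logp_sampler, advantage, cap=2.0):
    # In a real framework the weights are treated as constants (no gradient flows through them).
    weights = truncated_is_weights(logp_learner, logp_sampler, cap)
    return -sum(w * advantage * lp for w, lp in zip(weights, logp_learner))


# Same tokens scored by the trainer and by the inference engine (made-up numbers).
logp_learner = [-1.0, -1.0, -0.5]
logp_sampler = [-1.0, -3.0, -0.6]
print(truncated_is_weights(logp_learner, logp_sampler))
print(weighted_pg_loss(logp_learner, logp_sampler, advantage=1.0))
```

When the two engines agree exactly, every weight is 1 and the loss reduces to the ordinary policy-gradient loss.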
6 months after our paper release, I still recall the debates on removing the length normalization term in DrGRPO. And people gradually think DrGRPO is just about removing the std, ignoring the most important and subtle (length) bias we tried to point out to the community. Even now, many papers (and open-source code) still divide the policy gradient loss by the response length—taking the mean instead of the sum... Fortunately, with Tinker’s implementation as a reference, I hope it will be more convincing for the OSS community to adopt the unbiased RL loss computation. So grateful to Thinking Machines for pushing the boundaries of open science 🚀
3
41
444
40,794
Learning GSPO proposed by Qwen team: fig 1. they propose to use sequence likelihood for importance sampling fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG fig 3. per-token IS in (Dr) GRPO is an approximation of it Am I missing anything?
17
48
434
63,126
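The relationship asked about above can be checked numerically: the sequence-level likelihood ratio is the product of the per-token ratios. The log-probabilities below are made up for illustration, and the sketch deliberately leaves out any length normalization or clipping that a particular method may add on top.

```python
import math


def per_token_ratios(logp_new, logp_old):
    # One importance ratio per token: pi_new(y_t | ...) / pi_old(y_t | ...).
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    # pi_new(y | x) / pi_old(y | x), computed from summed token log-probs.
    return math.exp(sum(logp_new) - sum(logp_old))


logp_new = [-0.9, -1.2, -0.4]
logp_old = [-1.0, -1.0, -0.5]
print(per_token_ratios(logp_new, logp_old))
print(sequence_ratio(logp_new, logp_old))
```

Per-token importance sampling weights each token's gradient by its own ratio only, whereas the sequence-level form weights the whole response by the product.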
Good catch! But in fact this correction is unnecessary. We were aware of this. The N/(N-1) factor affects all training instances equally, and thus can be compensated for by adapting the learning rate. The gradients are the same after compensation. We have acknowledged the connection with RLOO in Appendix A.
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy Optimization). And if you also remove PPO's clipping, you'll get RLOO (REINFORCE Leave-One-Out).
3
26
320
45,029
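The identity behind this exchange is easy to verify: a leave-one-out baseline equals the mean-baseline advantage scaled by N/(N-1). A small sketch with made-up rewards; the function names are illustrative, not from any library.

```python
def mean_baseline_advantages(rewards):
    # A_i = r_i - mean(r): the group-mean baseline (no std normalization).
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    # A_i = r_i - mean of the other N-1 rewards in the group.
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
n = len(rewards)
scaled = [n / (n - 1) * a for a in mean_baseline_advantages(rewards)]
print(scaled)
print(leave_one_out_advantages(rewards))
```

Because the factor is the same constant for every sample in a fixed-size group, it rescales the whole gradient uniformly, which is why it can be folded into the learning rate.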
GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced deepseek-r1 length increasing with LoRA 3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs 4. added Terminal, MCP, visual and multi-agent environments … Open the thread for a deep dive!
5
33
289
57,700
In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem
4
39
291
45,325
Since the release of Dr. GRPO, many are interested in the 𝐥𝐞𝐧𝐠𝐭𝐡 𝐛𝐢𝐚𝐬 in GRPO's formulation & implementation, as well as in PPO's implementations. I did some updates on our paper and prepared a table for better comparison (details in thread):
3
47
269
22,391
Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers — and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip the verifier 🎯 RL that directly maximizes reference answer likelihood ✅ Simpler, faster, built for general reasoning Paper: huggingface.co/papers/2505.2… Code: github.com/sail-sg/VeriFree 🧵
6
43
233
38,867
exciting to see these Figure 1s showcasing how subtle algorithmic improvements can make a big difference in RL dynamics 😆😆
4
30
211
17,566
Go async without regret, with Tinker! I love this so much. Previously, when I heard people say, my async RL is unstable, I understand it as, perhaps it’s your infra/code that’s unstable, and async RL shouldn’t be blamed (in this scenario). With a trustworthy infra like what Tinker has provided, async/off-policy learning using the most basic importance sampling just works!
The Tinker API seems quite promising for async RL training, though I haven’t seen much discussion on this aspect. I ran a few experiments to get an initial sense of how async RL performs with Tinker. The results are pretty impressive! Async matches the sync version under the same settings, but finishes in about half the wall-clock time (max steps off-policy = 4). A few quick notes: - Efficiency can likely improve further; I did notice some rate limiting - These runs use a GRPO-like algorithm; setups with more components (e.g., actor-critic + reward models) could show even greater async benefits - Compared with non-API RL infra, the async implementation with Tinker doesn’t need to manage computing resources, which greatly reduces the complexity Looking forward to seeing more exploration from the community on async RL with Tinker! Experiments based on the math training recipe from tinker-cookbook: github.com/thinking-machines…
2
14
188
34,492
happy to see meta's new work employs unbiased GRPO; this should really be the default, as tobi said (CEO @Shopify)
Replying to @QGallouedec @zzlccc
You really want to make this fix default…
3
12
182
35,981
Yesterday I learned about this train-eval gap and was surprised because I never had this issue before -- RL on LLMs is much easier/more stable than classic RL. Tried the same setup (Qwen2.5-7B + Big-Math), and got stable eval convergence using Dr. GRPO (script: github.com/sail-sg/understan…).
The worst part about doing RL is that even once you've ironed out all the training instabilities and rewards are going up, you still lose 🫠
2
18
168
16,668
Yea, we did experiments for the 30A3 MoE on 64xH100, using a much larger dataset (DAPO's). Thanks Łukasz for the test on H200 with a 32B dense model, amazing!
Replying to @redtachyon
Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.
9
158
23,232
Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇
👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/AnytimeRe…
24
138
16,936
oat🌾 just added support for GRPO, and using it I tried RL-tuning a 3B model on the countdown task. It shows DS R1-zero-like behavior: training from a base model, improving performance & getting longer responses (self-reflection emerges)! Example: github.com/sail-sg/oat/blob/…
1
23
129
16,577
#ICLR2025 is such a disappointing experience! Spent a whole week preparing the PDF revision and rebuttal, getting NO response from reviewers or ACs at all🤷 I do care about discussions more than a binary acceptance/rejection signal, which is why ppl prefer (good) PRM. @iclr_conf
4
7
119
18,325
After the crazy GRPO weekend, let's get rid of the scalar reward or any policy optimization related to it. We explored learning from *verbal feedback* and obtained interesting results:
🚀LLMs can learn directly from verbal feedback — no scalar rewards needed! 😥Scalar rewards compress rich feedback— “redundant but correct” vs “concise but typo-ridden” might both be 0.8 💡We propose to learn Feedback-Conditional Policy (FCP), an extremely scalable paradigm!
2
19
117
13,042
Can’t believe all these experiments are running on my Mac… (with the heavy lifting offloaded to an amazing platform!) Results coming soon…
6
3
109
14,180
Great to see Dr. GRPO is much more sample efficient than the original GRPO
Tina: Tiny Reasoning Models via LoRA "the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA."
2
13
112
8,817
Exactly! This line of code has been overlooked for how long?
(1/3) My favorite figure from the paper. Nearly all open-source RL frameworks introduce an unintentional bias when computing the masked mean 😮. The fix? Just replace mask.sum with a constant.
3
5
106
15,254
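The "replace mask.sum with a constant" fix from the quoted tweet can be seen on a toy batch. This is a pure-Python sketch with hypothetical names; real frameworks do the same thing on padded tensors, and the choice of constant (here the padded length) is an assumption.

```python
def masked_mean_loss(token_losses, mask):
    # Common implementation: divide by the number of unmasked tokens.
    total = sum(l * m for l, m in zip(token_losses, mask))
    return total / sum(mask)


def constant_norm_loss(token_losses, mask, norm_const):
    # Proposed fix: divide by a constant that does not depend on response length.
    total = sum(l * m for l, m in zip(token_losses, mask))
    return total / norm_const


MAX_LEN = 8  # hypothetical padded length, used as the constant
losses = [1.0] * MAX_LEN                   # every real token contributes loss 1.0
short_mask = [1, 1, 0, 0, 0, 0, 0, 0]      # response of length 2
long_mask = [1] * MAX_LEN                  # response of length 8

print(masked_mean_loss(losses, short_mask), masked_mean_loss(losses, long_mask))
print(constant_norm_loss(losses, short_mask, MAX_LEN),
      constant_norm_loss(losses, long_mask, MAX_LEN))
```

With the masked mean, both responses end up with the same total loss, so each token of the long response carries a quarter of the weight of a token in the short one. With the constant, every token carries weight 1/MAX_LEN regardless of which response it belongs to.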
Nice follow-up! Spurious rewards and spurious prompts re-confirm the biases cooked into Qwen base models. Revisiting our results in March (arxiv.org/pdf/2503.20783 Section 2.2 & 3.3): - No template is the best - Much of RL's gain comes from correcting model-template mismatch
Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum can bring 19.4% gains compared to default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄
1
14
105
26,509
Environment Hub by prime-intellect is awesome with its GUIs! Scaling environments is key—they provide the signals RL agents learn from. We've been building 💎GEM with the community: 🌎Envs: math, code, games with python/search tools 🔧Framework-agnostic: 5 integrated frameworks
4
22
103
12,129
I think the core issue is still the length bias in policy gradient. In RL, we essentially ascend A(a|s) * grad(log pi(a|s)), where A is the advantage. When A only depends on the observed reward, there is no length bias; when A also depends on response length (GAE with lambda < 1 in PPO, or the original GRPO), the model might exploit this bias and exhibit behaviors such as outputting increasingly longer responses. nitter.app/zzlccc/status/19031627… revealed this bias using GRPO (Monte Carlo advantage), and this paper re-confirms it using PPO (generalized advantage estimation with lambda < 1). Cool work.
As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper (arxiv.org/abs/2504.05185) shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training. I.e., if the model gets a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer. This is true even if those extra tokens don't actually help solve the problem. What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong). So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness. In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency.
2
7
100
10,137
Exploration is an important topic in traditional RL. How does it affect online RL for LLM reasoning, like o1/r1? The most common way people add exploration for LLMs is through temperature sampling (a.k.a. Boltzmann exploration). I did a simple ablation on 1B models with PPO and found that setting a suitable temperature is crucial! The best setting improves the zero-shot performance on GSM8K of a 1B model from 40.6% to 55.7% with pure RL (and not plateauing)! The implementation is super efficient with the latest oat-0.0.6🚀 Code: github.com/sail-sg/oat
5
14
90
9,062
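For readers unfamiliar with temperature (Boltzmann) sampling, here is a minimal sketch of how the temperature reshapes a next-token distribution. The logits and temperatures are made-up values, not the settings from the ablation above.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats; higher means more exploratory sampling.
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top token (less exploration), while higher temperatures flatten the distribution (more exploration, but noisier rollouts).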
Good catch! I ran into the same reusability issue ~1 year ago with OpenRLHF. That’s why I built oat🌾 (github.com/sail-sg/oat) — a modular RL LLM framework inspired by DeepMind’s ecosystem. Just define your actor, learner, and env in a single script — and you’re good to go :) Example: github.com/sail-sg/understan…
why is it that every RL LLM project that uses verl has a clone of verl in its code rather than directly importing verl? This feels like bad reusability. When verl updates, these projects' code becomes obsolete.
1
7
93
12,611
Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class imagenet? 2. We do have experiments on “larger” dataset X bigger model (30A3 MoE) X H100 GPUs. The performance improvement is clear.
Aight let's unclickbait the fp16 paper. tl;dr cool paper, a little bit overstated in comms, very overstated by poasters. The thing that gave me a pause is that on the surface, it seems to claim that bf16 is horrible, borderline unusable. But that's not really the case (nor is it the claim). Yes, the fsdp-vllm mismatch is real, and yes, it can be mitigated with fp16. This is as true as it is irrelevant, because who cares about the mismatch if the algorithm empirically works? The widely circulated figures show bf16 consistently collapsing, while fp16 runs thrive. I have no reason to doubt this data, but it's performed on a very particular - and rather small - dataset. If you don't go through multiple epochs of your data, you probably won't see such a severe collapse - as many, many people can confirm from their own experiments. Does this mean that fp16 is useless? Not necessarily. As always, there are trade-offs. It might work better in some cases, but maybe you have to fuck with loss scaling. It might be worse in a big data regime, but crucial in a low data regime. Hard to tell for now. But alas, you still have to pay for your lunch.
7
9
91
47,719
wondering what type of research PhD students or small groups of researchers can do to really push the frontier of ai?
1
5
90
8,285
Inspiring talk by Prof Rich Sutton, about intelligence and beyond
2
6
88
5,909
Really fast! Thank you so much, @QGallouedec! But removing std only fixes the question-level difficulty bias; would you like to fix the response-level length bias as well? In fact, this bias not only affects GRPO, it appears in nearly all other algorithms through the use of masked_mean
🪂 Getting GRPO Done Right (Dr GRPO) is now in TRL @zzlccc proved that scaling by the std introduces question-level difficulty bias! You can now remove this bias 🗑️
3
4
86
12,268
Hey @agarwl_ , thanks for your attention! For most small-scale experiments we used 8xA100 & the sanity check setting; for larger dense or MoE training we used 64xH100 & the 'normal' setting (DAPO dataset). We also did not observe severe collapse until we developed the "sanity check" setting - a small dataset with all solvable questions (we can expect 100% rewards in theory and we empirically got 98%). We think this is a clean setting to test any algorithm before we scale it up to massive datasets and GPUs. Using a large dataset (plus algorithmic fixes like TIS or CISPO) can slow down the collapse or make it never happen in practice, even though the mismatch exists.
4
6
86
50,358
This work is accepted by @COLM_conf 2025 See you in Canada! 🍁
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
1
9
78
4,430
RL works without doubt, the problem is how we define the MDP, or even whether we use MDP. RL is about life, we all learn from experience
🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug — it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?
2
11
75
5,437
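For reference, pass@k is usually measured with the standard unbiased estimator below, given n samples per problem of which c are correct. This sketch is only the evaluation metric, not the training objective proposed in the quoted thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without replacement)
    from n generations is correct, when c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 10, 3  # made-up counts for illustration
for k in (1, 2, 5, 8):
    print(k, pass_at_k(n, c, k))
```

Note that pass@1 reduces to the plain success rate c/n, which is what a standard expected-reward objective optimizes; larger k rewards having at least one good sample in the set.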
We do appreciate their efforts in writing the criticisms, but “turns out that the results in this paper are misreported” is a strong claim to make without running the evaluation themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Inc…), questioning existing RL approaches and their evaluation. In response, we re-conducted extensive evaluation on GPQA-Diamond under different settings (various sampling parameters, token budgets) to show that VeriFree remains effective and compares favorably with the verifier-based baseline (apples-to-apples comparison!), and still outperforms the Qwen3 instruct models in non-thinking mode even with a large token budget.
So turns out that the results in this paper are misreported. Qwen3 report shows 4b thinking is GPQA 55.9, much higher than both the 32 here, and post-RL numbers of 45. Similarly for other datasets. No wonder our RL claims keep falling apart. Noticed by @nikhilchandak29
3
12
74
20,250
2) On reinforcement learning: 2a) GRPO is biased: - The length normalization prefers shorter correct answers and longer incorrect answers. -> length bias - The std normalization prefers too-easy or too-hard questions over average questions. -> difficulty bias
1
2
71
4,989
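The difficulty bias in 2a) can be illustrated with binary rewards. This is a sketch under assumptions: the group size of 8, the use of the population std, and the eps value are illustrative choices, and the function names are hypothetical rather than taken from the paper's code.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    # Original GRPO: subtract the group mean, then divide by the group std.
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    # Dr. GRPO: subtract the group mean only.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def rms(xs):
    # Root-mean-square magnitude, a proxy for how strongly a question drives the update.
    return math.sqrt(sum(x * x for x in xs) / len(xs))


hard = [1.0] + [0.0] * 7        # 1 of 8 rollouts correct
medium = [1.0] * 4 + [0.0] * 4  # 4 of 8 rollouts correct

print("GRPO   :", rms(grpo_advantages(hard)), rms(grpo_advantages(medium)))
print("DrGRPO :", rms(dr_grpo_advantages(hard)), rms(dr_grpo_advantages(medium)))
```

Dividing by the std rescales every question to the same advantage magnitude, so low-variance questions (nearly all wrong or nearly all right) are upweighted relative to what their raw reward spread would give them.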
I love the KL-div curve, but not so much the later stage of the reward-mean curve. Why does it plateau? A limit of the base model’s exploration?
🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch
3
5
72
10,107
Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers: 1.🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a Dr. GRPO but I think the paper is more than this loss fix) 2. Main Conference Spotlight: Continual Reinforcement Learning by Planning with Online World Models; I worked on the continual learning topic for one year before diving into RL for LLMs openreview.net/pdf?id=mQeZEs… 3. AI4MATH Workshop Poster: Optimizing Anytime Reasoning via Budget Relative Policy Optimization; a dense reward training framework that controls thinking budget (a cool feature of Gemini 2.5 Pro)
16
66
5,148
We take an understand-then-improve approach to study R1-Zero-like training. We first critically examine two core components: base models and reinforcement learning. 1) On base models: 1a) DeepSeek-V3-Base already exhibits "Aha moment"😲
1
2
61
8,987
scaling down the data is the path we took to discover this mismatch; it is not overfitting because it's the training reward that goes down, not eval. small dataset setting does not mean our finding is irrelevant to large-scale training. it serves as a sanity check for RL
I did indeed encounter a collapse in training reward during RL on H-series GPUs, but that only occurred when training for >10 epochs with a small dataset (e.g., 3k). I attributed this collapse to overfitting. hadn't considered that it might also be related to precision.🤔
5
3
55
13,607
I can feel #ICLR2025 is already starting… welcome everyone to 🇸🇬 Singapore! Let's meet and chat about RL, LLMs and reasoning :)
1
2
55
5,995
2c) To get GRPO Done Right, we propose Dr. GRPO. Two line modifications: removing both length and std normalization (red terms). Our new optimizer is unbiased, and has better token efficiency (by preventing progressively longer incorrect responses of GRPO)
1
55
3,863
These are not all... Please check out our paper, codebase, and models for more fun, such as * RL on Basic algebra (+ − ×÷) questions improves Olympiad-level reasoning capabilities * Llama models can also "Aha" * ... Paper: github.com/sail-sg/understan…
1
3
54
3,771
2/8 Using these prompts, we take questions from MATH to test 6 base models, and found 5/6 base models already exhibit self-reflection behaviors (or the Aha moment😯😯) with the following keyword occurrence frequencies -- Qwen models are the most active
3
2
50
5,508
🚨 RL x LLM folks at #ICLR2025 — come join us during the Friday lunch break! If you haven’t RSVP’d on Whova, you can also register here: lu.ma/s8udv997?tk=B4lfN0 @Benjamin_eecs and I will scout for a chill spot (likely a corner at the venue) and share the location tomorrow. It’ll be an informal meetup — just to connect, chat about ideas, and maybe grab lunch together :) Can’t wait to see y’all there!
2
12
49
3,544
Our analysis on base models and RL suggests a minimalist recipe for R1-Zero training, no tricks: - algo: Dr. GRPO - data: MATH level 3-5 questions - template: Qwen-Math - compute: 27 Hours * 8 * A100 This gives us a 7B SOTA in the Zero-RL setting: 43.3 on AIME 2024!
1
1
46
3,977
2b) Surprisingly, even though PPO's formulation is unbiased, nearly all open-source implementations introduce the length bias by computing the masked_mean. The length bias is partly the reason for the increasingly longer responses.
2
47
4,274
7/8 Moreover, we find response length may not be a good indicator of self-reflections, because they seem not correlated during R1-Zero-like training.
2
2
45
3,814
6/8 We take a closer look at the RL dynamics during R1-Zero-like training. We find that length changes of model responses are mainly due to the rule-based reward shaping, which initially encourages formatting then correctness, verifying our hypothesis.
2
2
43
3,487
5/8 We hypothesize that performing RL could turn superficial self-reflections into effective self-reflections with proper reward shaping.
2
2
42
3,340
I like the apple-to-apple comparison when designing new algorithms! With group rollouts (like in GRPO), the semantic entropy is almost a free lunch to enable curriculum learning on questions to improve training efficiency and stability. A clean and useful method!
Motivation Fine-tuning LLMs beyond their knowledge boundary can hurt performance (arxiv.org/pdf/2402.18243, arxiv.org/pdf/2405.05904). SEED-GRPO mitigates this by measuring semantic entropy — estimating how confident the model is on each training sample. Method We modulate the training sample’s Advantage with uncertainty — effectively giving each problem a “custom learning rate.” High Semantic Entropy⇒ High uncertainty ⇒ weaker advantage (less trust) Low Semantic Entropy⇒ Low uncertainty ⇒ keep relatively high learning signal Result Built on Dr.GRPO(arxiv.org/abs/2503.20783), SEED-GRPO boosts the average score from 51.4 → 56.6 under identical hyperparameters (Table 2, our methods row 3). Huge thanks to @zzlccc for the excellent Dr.GRPO and insightful discussions! 📄 Paper: arxiv.org/abs/2505.12346 💻 Code: github.com/Dreamer312/SEED-G…
2
41
8,106
Replying to @lilianweng
Thanks for the great blog! Regarding the “aha moment”, is it possible to be more conservative when discussing it since 1-2 months ago there were many debates on it and we also systematically studied it and found there is no aha moment:
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
1
2
42
10,509
bf16 is actually a great choice for ablating algorithms that try to fix numerical mismatch — it makes the mismatch worse, so you have more room to observe the improvement
sticking w bf16, wouldn’t wanna make things too easy
2
42
10,660
Interesting paper! A funny thing is that I wrote a bug to introduce random rewards about three months ago, and observed similar evaluation improvement. Did not expect this can turn into a nice paper, great job!
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewards
2
42
3,835
8/8 We hope our findings can help understand a bit deeper about R1-Zero, and provide insights for RL training of base models. BIG thanks to @Cameron_Chann @liwenjun2016 @TianyuPang1 @duchao0726 @mavenlin for their valuable contributions during the past week.❤️
1
2
40
3,156
had fun RL-tuning language models on text games
For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of self-play and some additional analysis, tentatively around the 18th of July. However, to get more feedback on the structure and implementation, we want to open-source the code now. UnstableBaselines is a very simple Async, Online, Multi-Turn, Multi-Agent RL library built on vLLM and Ray. The code is pretty readable and around 1.2k lines long (and includes a cool rendering interface that you can run via "unstable-terminal") 1/7
1
6
40
3,930
1b) As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates: the average benchmark scores immediately improve by ~60%. This makes Qwen2.5 base more like SFT models trained on QA concatenation🤔
1
39
6,381
Self-play on zero-sum language games creates selection pressure for LLMs to develop transferrable reasoning patterns. Enjoyed building the multi-agent, multi-turn RL system and training agents that think strategically through self-play! Paper: huggingface.co/papers/2506.2… Code: github.com/spiral-rl/spiral
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.2… 🧑‍💻 Code: github.com/spiral-rl/spiral
10
37
3,276
Does this invalidate experiments based on Qwen3-4B-Base?
3
2
39
7,171
4/8 However, there are still many self-reflections from base models that do not lead to correct final answers, which we call Superficial Self-Reflections (SSRs)
1
2
38
5,957
1/8 Base models can directly answer questions by proper prompting, such as the ones shown below
1
3
37
5,488
3/8 More surprisingly, Qwen2.5-7B-Math can directly solve the example question reported in SimpleRL-Zero with self-correcting CoTs even before any training
1
2
36
4,453
This project is a follow-up of our pilot study with team efforts ❤️. BIG thannnks to @Cameron_Chann , @liwenjun2016, @QPHutu , @TianyuPang1 , @duchao0726 , @mavenlin !!!
🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵
3
1
31
4,432
1a) and 1b) suggest that there are biases in the base model pretraining: self-reflection behaviors and math-solving abilities are already infused before RL reinforces them with reward signals. But are increasingly longer responses a consequence of such an RL process?
1
30
5,134
And still using the biased policy gradient
I think it was pretty clearly shown (by @zzlccc et al) that DeepSeek was wrong about the emergence of Aha moment through RL in the R1 paper. But seeing some papers still talking about Aha moments. Am I missing something?
28
3,372
Awesome classical RL work in the era of LLM RL
Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓
1
3
28
2,298
I believe our work offers a promising technique for building reasoners with fine-grained budget control— like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!
Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇
3
28
2,963
Congrats on the new release! 😍love the multi-node training support 🚀excited to see our Dr. GRPO’s fixes (removing std and 1/|o|) are adopted so quickly by the community
RL goes brrr in the latest TRL release! 🔥 Scale GRPO with multi-node training & vLLM's tensor parallelism 🚀 6x faster convergence with multi-step optimisation 📊 Support for domain specific rewards Release notes 👇 github.com/huggingface/trl/r…
26
1,849
My friend from University of Alberta, @qingfeng_lan, has provided an excellent summary of policy-based LLM RL-tuning algorithms, grounded in the Policy Gradient Theorem:
🚀RL algorithms are shaping the post-training of LLMs, but how do their objectives connect? In this blog, I explore their relationships and provide a unified perspective through the Policy Gradient Theorem—the backbone of policy gradient methods. Dive in: lancelqf.github.io/note/llm_…
2
25
2,406
The terminal is perhaps the most general environment to drop language agents into and let them learn from experience. Existing bash/shell tools don’t go far enough. Check out Min and the team’s great work on the TTY emulator and beyond! (link to a blog is at the last tweet)👇
LLMs don't need MCPs, they need a terminal. Not the bash/shell tool that the codex/claude are already using, but a real tty emulator, to be used in the same way that humans do, i.e. capable of running any REPL interactively, as we will show in the thread.
1
23
4,994
If life is a trajectory, and we are doing online continual RL, it seems that we are driven by many verifiable rewards 🤔
New blog post about asymmetry of verification and "verifier's law": jasonwei.net/blog/asymmetry-…

Asymmetry of verification–the idea that some tasks are much easier to verify than to solve–is becoming an important idea as we have RL that finally works generally.

Great examples of asymmetry of verification are things like sudoku puzzles, writing the code for a website like instagram, and BrowseComp problems (takes ~100 websites to find the answer, but easy to verify once you have the answer).

Other tasks have near-symmetry of verification, like summing two 900-digit numbers or some data processing scripts. Yet other tasks are much easier to propose feasible solutions for than to verify them (e.g., fact-checking a long essay or stating a new diet like "only eat bison").

An important thing to understand about asymmetry of verification is that you can improve the asymmetry by doing some work beforehand. For example, if you have the answer key to a math problem or if you have test cases for a Leetcode problem. This greatly increases the set of problems with desirable verification asymmetry.

"Verifier's law" states that the ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI. The ability to train AI to solve a task is proportional to whether the task has the following properties:
1. Objective truth: everyone agrees what good solutions are
2. Fast to verify: any given solution can be verified in a few seconds
3. Scalable to verify: many solutions can be verified simultaneously
4. Low noise: verification is as tightly correlated to the solution quality as possible
5. Continuous reward: it’s easy to rank the goodness of many solutions for a single problem

One obvious instantiation of verifier's law is the fact that most benchmarks proposed in AI are easy to verify and so far have been solved. Notice that virtually all popular benchmarks in the past ten years fit criteria #1-4; benchmarks that don’t meet criteria #1-4 would struggle to become popular.

Why is verifiability so important? The amount of learning in AI that occurs is maximized when the above criteria are satisfied; you can take a lot of gradient steps where each step has a lot of signal. Speed of iteration is critical—it’s the reason that progress in the digital world has been so much faster than progress in the physical world.

AlphaEvolve from Google is one of the greatest examples of leveraging asymmetry of verification. It focuses on setups that fit all the above criteria, and has led to a number of advancements in mathematics and other fields. Different from what we've been doing in AI for the last two decades, it's a new paradigm in that all problems are optimized in a setting where the train set is equivalent to the test set.

Asymmetry of verification is everywhere and it's exciting to consider a world of jagged intelligence where anything we can measure will be solved.
2
6
24
2,753
Replying to @mgostIH
I would say it's not a bug but a technique by prior RL researchers... We can even treat the number of timesteps to be included for IS computation as a hyper-parameter, to get the best bias-variance tradeoff
Learning GSPO proposed by Qwen team: fig 1. they propose to use sequence likelihood for importance sampling fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG fig 3. per-token IS in (Dr) GRPO is an approximation of it Am I missing anything?
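To make the comparison in this thread concrete, here is a minimal Python sketch of the three importance-sampling (IS) weights being discussed: the per-token ratio, the full-sequence ratio, and a windowed variant that treats the number of included timesteps as a hyper-parameter. The function names and the window definition (the last `k` timesteps ending at token `t`) are assumptions made for illustration; they are not taken from the GSPO or Dr. GRPO code, and the sequence ratio below is the plain product of token ratios, not necessarily the exact form either paper uses.

```python
import math

def per_token_ratios(new_logps, old_logps):
    """Per-token IS ratio pi_new(o_t | ...) / pi_old(o_t | ...) for each token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """Sequence-level IS ratio: the product of all per-token ratios."""
    return math.exp(sum(new_logps) - sum(old_logps))

def windowed_ratios(new_logps, old_logps, k):
    """Hypothetical middle ground: the weight for token t multiplies the
    per-token ratios of the last k timesteps ending at t.
    k=1 recovers the per-token ratio; k >= len gives the prefix product."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return [
        math.exp(sum(diffs[max(0, t - k + 1): t + 1]))
        for t in range(len(diffs))
    ]

# Toy log-probabilities of one sampled response under the new and old policy.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.5, -1.5, -1.0]
token_w = per_token_ratios(new_logps, old_logps)
seq_w = sequence_ratio(new_logps, old_logps)
full_w = windowed_ratios(new_logps, old_logps, k=len(new_logps))
```

A small `k` keeps the variance low at the cost of bias, while a large `k` moves toward the unbiased but high-variance trajectory-level weight, which is the bias-variance tradeoff mentioned in the reply.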
1
6
23
2,989
Created an issue to track how other LLM RL projects are tackling GRPO biases: github.com/sail-sg/understan… Shoutout to the TRL team for the comprehensive support! @QGallouedec @_lewtun. @verl_project would you like to consider fixing this?
1
6
23
1,858
Excited to see GDM proposing Game Arena to measure the model capabilities. Let’s also scale the environments for agent RL with GEM 💎 github.com/axon-rl/gem !
We have a long history of using games to measure progress in AI. 🎮 That’s why we’re helping unveil the @Kaggle Game Arena: an open-source platform where models go head-to-head in complex games to help us gauge their capabilities. 🧵
7
23
2,433
A few other points to note: 1. Our method was trained from Qwen3 base models with generate_max_length=3000 due to limited computational resources. 2. Qwen3 thinking mode shows strong performance with extremely long reasoning token lengths. 3. We originally followed the evaluation settings of General-Reasoner (github.com/TIGER-AI-Lab/Gene…) in our paper.
3
20
1,816
oat-0.0.5 (github.com/sail-sg/oat) now supports post-training as online learning more efficiently!⚡️ We use 𝐚𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 to minimize GPU idle time, and show that async online DPO achieves 28% time reduction while maintaining the final performance! Since the data is not exactly on-policy, this also shows that online DPO is robust to data staleness💪
4
20
1,265
The man behind OAI’s RL infra!
Harmony format is finally open-sourced. I still remember 3 years ago (before ChatGPT release) @shengjia_zhao, Daniel and I were brainstorming about the right abstraction for RL training, and that is the start point of the entire harmony library. github.com/openai/harmony
1
18
2,401
This animation is absolutely stunning! 🎬🔥 Never imagined we could make a comparison this fun!
Lil update on fixing DeepSeek's GRPO issues when training a small medical model. Shoutout to @zzlccc & @leloykun's weekend work: the 1.5B LLM's MedMCQA score went up from 37% to 52%
2
18
1,748
If I infer correctly, this is Dr. GRPO: 1. summed token loss without length normalization, 2. mean reward as baseline without std normalization
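The two changes listed above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it uses an on-policy surrogate with no clipping or importance weights, plain floats instead of tensors, and the population standard deviation for the GRPO variant. All function names are hypothetical.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Original GRPO style: group-mean baseline, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO style: group-mean baseline only, no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def grpo_loss(token_logps, advantages):
    """Token losses averaged per response (the 1/|o| length normalization)."""
    per_response = [
        -adv * sum(logps) / len(logps)
        for logps, adv in zip(token_logps, advantages)
    ]
    return sum(per_response) / len(per_response)

def dr_grpo_loss(token_logps, advantages):
    """Token losses summed per response, with no length normalization."""
    per_response = [-adv * sum(logps) for logps, adv in zip(token_logps, advantages)]
    return sum(per_response) / len(per_response)

# Toy group of two responses to one prompt: one correct (reward 1), one wrong.
rewards = [1.0, 0.0]
token_logps = [[-1.0, -1.0], [-2.0]]  # per-token log-probs of each response
adv = dr_grpo_advantages(rewards)
loss_grpo = grpo_loss(token_logps, adv)
loss_dr = dr_grpo_loss(token_logps, adv)
```

With the 1/|o| factor, each token of a long response contributes less to the loss than each token of a short one, so a long wrong answer is penalized less per token. Summing instead of averaging removes that length dependence.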
1
2
18
1,219
I worked on OAT while VERL was being developed at ByteDance, and our team built GEM before Environment Hub/OpenEnv were released by Prime-Intellect and Meta. Feeling grateful to be in sync with the community and contribute in my own small way!😆
18
2,275
Thanks @_AndrewZhao for the earlier introduction of our work!
6
17
1,824
Replying to @hu_yifei
We've done a lot of GRPO training on various tasks with bf16. Haven't seen big dips like this... until the sanity check setting. We used standard hyperparameters, but need to train long enough. Let's discuss more if you'd like to reproduce!
1
16
1,957
Excited to see RL experts bring new techniques into RLxLLM fields!
At Reliant we've found RL to be incredibly efficient at improving answer quality to life sciences' hardest questions. Today we're putting out our work on LLM fine-tuning with off-policy RL, matching llama 70B performance with an 8B model - take a look! arxiv.org/abs/2503.14286
2
16
1,682
Let’s do it!
watch this space. assembling a team, plotting, scheming, etc.
3
15
2,971
A pity I can't join #NeurIPS2024 in person... but my amazing collaborators @Cameron_Chann @duchao0726 are presenting "𝐒𝐚𝐦𝐩𝐥𝐞-𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 𝐀𝐥𝐢𝐠𝐧𝐦𝐞𝐧𝐭 𝐟𝐨𝐫 𝐋𝐋𝐌𝐬" at the Language Gamification workshop on Saturday @ West Meeting Room 220-222! Feel free to drop by!
3
15
1,069
1/7 We released a technical report for GEM, building on our previous blog release. Our submissions to the SEA and MTI-LLM workshops at NeurIPS have been accepted for oral / spotlight presentations, and we received valuable feedback from all reviewers! arxiv.org/pdf/2510.01051
1
2
17
1,462
One opinion and one question about what happened recently: 1. I thought 'zero' means learning a skill from scratch, but recent zero models start from well-pretrained LLMs, which might already possess skills but with very low confidence. RL training naturally reinforces these skills as long as they can be explored (by sampling with temperature). 2. How long should a long CoT be before we call it "emergence"?
2
1
15
2,766
Please please
Replying to @karpathy
@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss if the problem is with the algorithms, or if the problem is with posing a problem as an RL problem. 1/x
15
5,428
The idea is simple: we should develop a dedicated environment simulator that's agnostic to training frameworks. To this end, we respect the standard agent-environment interface and follow the OpenAI-Gym's specifications: .reset() and .step()
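As a rough illustration of that interface, here is a toy text environment exposing only `.reset()` and `.step()`, plus a rollout loop that knows nothing about any training framework. The class, its game, and the return signatures are hypothetical and are not GEM's actual API; the five-value `step` return follows the Gymnasium convention, which is an assumption here.

```python
import random

class GuessNumberEnv:
    """Hypothetical multi-turn text environment with a Gym-style interface."""

    def __init__(self, low=1, high=10, max_turns=5):
        self.low, self.high, self.max_turns = low, high, max_turns
        self._target = None
        self._turns = 0

    def reset(self, seed=None):
        """Start a new episode and return (observation, info)."""
        self._target = random.Random(seed).randint(self.low, self.high)
        self._turns = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action):
        """Apply a text action; return (obs, reward, terminated, truncated, info)."""
        self._turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None
        if guess == self._target:
            return "Correct!", 1.0, True, False, {}
        truncated = self._turns >= self.max_turns
        if guess is None:
            obs = "Please answer with an integer."
        else:
            obs = "Too low." if guess < self._target else "Too high."
        return obs, 0.0, False, truncated, {}

def run_episode(env, policy, seed=None):
    """Framework-agnostic rollout: the policy is any callable from text to text."""
    obs, _ = env.reset(seed=seed)
    total_reward = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total_reward += reward
        if terminated or truncated:
            return total_reward

# With low == high the target is fixed, so the outcome is deterministic.
env = GuessNumberEnv(low=1, high=1, max_turns=3)
solved = run_episode(env, policy=lambda obs: "1")
failed = run_episode(env, policy=lambda obs: "not a number")
```

Because the loop only touches `reset` and `step`, the same environment could in principle be driven by an LLM policy from any RL framework by swapping the `policy` callable.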
2
1
16
1,756
paper links if you'd like to see the details: - Dr. GRPO (by our team): arxiv.org/pdf/2503.20783 - GMPO (by @furu_wei's team): arxiv.org/pdf/2507.20673 - GFPO (by @DimitrisPapail's team): arxiv.org/pdf/2508.09726
1
14
1,113
eat oat every morning; training with oat🌾the rest of the day: github.com/sail-sg/oat
1
13
944
In this way, different tasks (math, code, language games, general qa, reasoning gym, python & search tool) are all unified under a single interface. Leveraging this, we run the good old multi-turn REINFORCE and get those beautiful reward curves we all love to see 📈🔥
2
1
13
1,230