Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

Pinned Tweet

Zichen Liu

@zzlccc

21 Mar 2025

🪂Understanding R1-Zero-Like Training: A Critical Perspective
* DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning??
* The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO??
* Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵
📜Full details: github.com/sail-sg/understan…
🛠️Code: github.com/sail-sg/understan…

186

1,368

330,931

Zichen Liu · Nov 1, 2025 · 3:11 AM UTC

Zichen Liu

@zzlccc

1 Nov 2025

Super excited that @karpathy noticed our work! Hopefully it helps the broader community realize that *precision* deserves a place in our design space.

1,654

279,229

Zichen Liu · Oct 2, 2025 · 4:53 AM UTC

Zichen Liu

@zzlccc

2 Oct 2025

much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/blob/…

Thinking Machines

@thinkymachines

29 Sep 2025

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…

786

204,741

Zichen Liu · Sep 22, 2025 · 1:49 AM UTC

Zichen Liu

@zzlccc

22 Sep 2025

exactly. and we will never derive a term like 1/|o|. seeing so many papers still using the original GRPO is sad.

Nan Jiang @nanjiang_cs

21 Sep 2025

I was surprised by how many didn't know that (1) per-token MLE is whole-sequence MLE, and (2) PG at the token level is the same as PG at the sequence level (optimizing one big combinatorial action). The story is different if you introduce a fitted critic/Q-values or intermediate resets.

577

61,890
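Both points in the quoted tweet follow from the autoregressive factorization. A short derivation, using notation assumed here rather than taken from the thread ($\pi_\theta$ for the policy, $q$ for the prompt, $o$ for a response of $|o|$ tokens, $R$ for a response-level reward):

```latex
% (1) Sequence log-likelihood is the sum of per-token log-likelihoods
\log \pi_\theta(o \mid q) = \sum_{t=1}^{|o|} \log \pi_\theta(o_t \mid q, o_{<t})

% (2) The sequence-level policy gradient therefore expands into token-level terms
\nabla_\theta \, \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\big[ R(q, o) \big]
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\Big[ R(q, o) \sum_{t=1}^{|o|} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big]
```

The token-level terms are summed, not averaged, so no $1/|o|$ factor comes out of this derivation.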

Zichen Liu · Oct 25, 2025 · 2:47 AM UTC

Zichen Liu

@zzlccc

25 Oct 2025

Nothing feels more exciting than writing a thesis proposal on RL for LLMs before 2025 ends!! Covering a subset of my first-author works done in the past 1.5 years (after switching from traditional RL to LLM RL…) Tentative title, of course

503

60,819

Zichen Liu · Oct 31, 2025 · 2:35 PM UTC

Zichen Liu

@zzlccc

31 Oct 2025

BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎

Penghui Qi

@QPHutu

31 Oct 2025

🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…

506

78,196
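For context on why the two 16-bit formats behave differently: BF16 keeps 8 exponent bits and 7 mantissa bits, while FP16 keeps 5 exponent bits and 10 mantissa bits, so FP16 rounds more finely inside its narrower range. The sketch below simulates both roundings with the standard library only. The helper names and the sample log-probabilities are made up for illustration and are not taken from the paper or from Oat.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that get dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rounding_errors(x: float) -> tuple[float, float]:
    """Return (bf16_error, fp16_error), the absolute rounding errors for x."""
    return abs(round_to_bf16(x) - x), abs(round_to_fp16(x) - x)


if __name__ == "__main__":
    # Hypothetical token log-probabilities, chosen only for illustration.
    for logprob in (-0.105, -1.2345, -7.891):
        bf16_err, fp16_err = rounding_errors(logprob)
        print(f"{logprob:>8}: bf16 err={bf16_err:.2e}  fp16 err={fp16_err:.2e}")
```

This only illustrates per-value rounding. The mismatch discussed in the thread arises from such errors accumulating differently in the training and inference engines, which this sketch does not model.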

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵

468

116,763

Zichen Liu · Aug 22, 2025 · 3:35 PM UTC

Zichen Liu

@zzlccc

22 Aug 2025

With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!

465

45,438
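A minimal sketch of what importance sampling on the behavior policy can look like when the sampler (inference engine) and the learner (training engine) assign slightly different probabilities to the same tokens. A truncated variant is assumed here; the function names, the cap of 2.0, and the plain-Python loss are hypothetical and are not the code used in oat.

```python
import math


def truncated_is_weights(learner_logps, sampler_logps, cap=2.0):
    """Per-token weights min(pi_learner / pi_sampler, cap)."""
    return [
        min(math.exp(lp - sp), cap)
        for lp, sp in zip(learner_logps, sampler_logps)
    ]


def weighted_pg_loss(learner_logps, sampler_logps, advantage, cap=2.0):
    """Policy-gradient surrogate: -sum_t w_t * A * log pi_learner(o_t).

    In an autograd framework the weights w_t would be treated as constants
    (no gradient flows through them); here everything is a plain float.
    """
    weights = truncated_is_weights(learner_logps, sampler_logps, cap)
    return -sum(w * advantage * lp for w, lp in zip(weights, learner_logps))


if __name__ == "__main__":
    learner = [-0.5, -0.1, -1.2]
    sampler = [-0.5, -3.0, -1.0]  # the second token was far less likely under the sampler
    print(truncated_is_weights(learner, sampler))
    print(weighted_pg_loss(learner, sampler, advantage=1.0))
```

When the two engines agree exactly, every weight is 1 and the loss reduces to the ordinary policy-gradient surrogate.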

Zichen Liu · Oct 3, 2025 · 3:59 AM UTC

Zichen Liu

@zzlccc

3 Oct 2025

6 months after our paper release, I still recall the debates on removing the length normalization term in DrGRPO. And people gradually think DrGRPO is just about removing the std, ignoring the most important and subtle (length) bias we tried to point out to the community. Even now, many papers (and open-source code) still divide the policy gradient loss by the response length—taking the mean instead of the sum... Fortunately, with Tinker’s implementation as a reference, I hope it will be more convincing for the OSS community to adopt the unbiased RL loss computation. So grateful to Thinking Machines for pushing the boundaries of open science 🚀

444

40,794

Zichen Liu · Jul 27, 2025 · 8:28 AM UTC

Zichen Liu

@zzlccc

27 Jul 2025

Learning GSPO proposed by Qwen team:
fig 1. they propose to use sequence likelihood for importance sampling
fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG
fig 3. per-token IS in (Dr) GRPO is an approximation of it
Am I missing anything?

434

63,126
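The relationship between the sequence-level and per-token importance weights can be checked numerically: the sequence-level ratio is the product of the per-token ratios, and a per-token scheme applies each factor separately. This is a sketch with made-up log-probabilities and hypothetical function names; any extra normalization or clipping used by GSPO is not shown.

```python
import math


def token_ratios(new_logps, old_logps):
    """Per-token ratios pi_new(o_t | ...) / pi_old(o_t | ...)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """Sequence-level ratio pi_new(o) / pi_old(o), computed in log space."""
    return math.exp(sum(new_logps) - sum(old_logps))


if __name__ == "__main__":
    new = [-0.2, -1.1, -0.7]
    old = [-0.3, -1.0, -0.9]
    print(token_ratios(new, old))
    print(sequence_ratio(new, old), math.prod(token_ratios(new, old)))
```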

Zichen Liu · Mar 22, 2025 · 10:22 AM UTC

Zichen Liu

@zzlccc

22 Mar 2025

Good catch! But in fact this correction is unnecessary. We were aware of this. The N/(N-1) factor affects all training instances equally, and thus can be compensated for by adapting the learning rate. The gradients are the same after compensation. We have acknowledged the connection with RLOO in Appendix A.

leloy!

@leloykun

22 Mar 2025

I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy Optimization). And if you also remove PPO's clipping, you'll get RLOO (REINFORCE Leave-One-Out).

320

45,029
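The constant-factor relationship discussed in this exchange can be verified directly: subtracting the leave-one-out mean equals subtracting the group mean and then scaling by N/(N-1). The sketch below uses a made-up group of binary rewards and hypothetical function names.

```python
def mean_baseline_advantages(rewards):
    """A_i = r_i - mean(r): the group-mean baseline."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """A_i = r_i - mean of the other N-1 rewards: the leave-one-out baseline."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]  # illustrative group, N = 8
    n = len(rewards)
    for a_mean, a_loo in zip(mean_baseline_advantages(rewards),
                             leave_one_out_advantages(rewards)):
        # The two columns agree up to floating-point error.
        print(f"{a_loo:+.4f}  {n / (n - 1) * a_mean:+.4f}")
```

Because the factor N/(N-1) is the same for every sample when the group size is fixed, it rescales the whole gradient by a constant, which is why it can be absorbed into the learning rate.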

Zichen Liu · Oct 6, 2025 · 4:50 PM UTC

Zichen Liu

@zzlccc

6 Oct 2025

GEM❤️Tinker
GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we
1. supported Tinker and 5 more RL training frameworks
2. reproduced deepseek-r1 length increasing with LoRA
3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs
4. added Terminal, MCP, visual and multi-agent environments
…
Open the thread for a deep dive!

289

57,700

Zichen Liu · Aug 1, 2025 · 7:06 PM UTC

Zichen Liu

@zzlccc

1 Aug 2025

In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem

291

45,325

Zichen Liu · Mar 26, 2025 · 2:47 PM UTC

Zichen Liu

@zzlccc

26 Mar 2025

Since the release of Dr. GRPO, many are interested in the 𝐥𝐞𝐧𝐠𝐭𝐡 𝐛𝐢𝐚𝐬 in GRPO's formulation & implementation, as well as in PPO's implementations. I did some updates on our paper and prepared a table for better comparison (details in thread):

269

22,391

Zichen Liu · May 28, 2025 · 5:01 PM UTC

Zichen Liu

@zzlccc

28 May 2025

Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers — and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip the verifier 🎯 RL that directly maximizes reference answer likelihood ✅ Simpler, faster, built for general reasoning Paper: huggingface.co/papers/2505.2… Code: github.com/sail-sg/VeriFree 🧵

233

38,867

Zichen Liu · Aug 14, 2025 · 2:45 AM UTC

Zichen Liu

@zzlccc

14 Aug 2025

exciting to see these Figure 1s showcasing how subtle algorithmic improvements can make a big difference in rl dynamics 😆😆

211

17,566

Zichen Liu · Oct 12, 2025 · 2:13 AM UTC

Zichen Liu

@zzlccc

12 Oct 2025

Go async without regret, with Tinker! I love this so much. Previously, when I heard people say "my async RL is unstable", I understood it as: perhaps it’s your infra/code that’s unstable, and async RL shouldn’t be blamed (in this scenario). With a trustworthy infra like what Tinker has provided, async/off-policy learning using the most basic importance sampling just works!

Changyu Chen

@Cameron_Chann

11 Oct 2025

The Tinker API seems quite promising for async RL training, though I haven’t seen much discussion on this aspect. I ran a few experiments to get an initial sense of how async RL performs with Tinker. The results are pretty impressive! Async matches the sync version under the same settings, but finishes in about half the wall-clock time (max steps off-policy = 4).
A few quick notes:
- Efficiency can likely improve further; I did notice some rate limiting
- These runs use a GRPO-like algorithm; setups with more components (e.g., actor-critic + reward models) could show even greater async benefits
- Compared with non-API RL infra, the async implementation with Tinker doesn’t need to manage computing resources, which greatly reduces complexity
Looking forward to seeing more exploration from the community on async RL with Tinker! Experiments based on the math training recipe from tinker-cookbook: github.com/thinking-machines…

188

34,492

Zichen Liu · Jun 30, 2025 · 5:59 PM UTC

Zichen Liu

@zzlccc

30 Jun 2025

happy to see meta's new work employs unbiased GRPO; this should really be the default, as tobi said (CEO @Shopify)

tobi lutke

@tobi

22 Mar 2025

Replying to @QGallouedec @zzlccc

You really want to make this fix default…

182

35,981

Zichen Liu · Apr 12, 2025 · 3:47 AM UTC

Zichen Liu

@zzlccc

12 Apr 2025

Yesterday I learned about this train-eval gap and was surprised because I never had this issue before -- RL on LLMs is much easier/more stable than classic RL. Tried the same setup (Qwen2.5-7B + Big-Math), and got stable eval convergence using Dr. GRPO (script: github.com/sail-sg/understan…).

Lewis Tunstall

@_lewtun

10 Apr 2025

The worst part about doing RL is that even once you've ironed out all the training instabilities and rewards are going up, you still lose 🫠

168

16,668

Zichen Liu · Nov 3, 2025 · 3:36 AM UTC

Zichen Liu

@zzlccc

3 Nov 2025

Yea we did experiments for 30A3 MoE on 64xH100, using a much larger dataset (the DAPO one). Thanks Łukasz for the test on H200 with 32B dense, amazing!

Łukasz Borchmann

@LukaszBorchmann

3 Nov 2025

Replying to @redtachyon

Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.

158

23,232

Zichen Liu · May 20, 2025 · 2:38 PM UTC

Zichen Liu

@zzlccc

20 May 2025

Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇

Penghui Qi

@QPHutu

20 May 2025

👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/AnytimeRe…

138

16,936

Zichen Liu · Feb 1, 2025 · 6:28 PM UTC

Zichen Liu

@zzlccc

1 Feb 2025

oat🌾 now supports GRPO, and using it I tried RL-tuning a 3B model on the countdown task. It shows DS R1-zero-like behavior: training from a base model, improving performance & getting longer responses (self-reflection emerges)! Example: github.com/sail-sg/oat/blob/…

129

16,577

Zichen Liu · Dec 3, 2024 · 3:20 PM UTC

Zichen Liu

@zzlccc

3 Dec 2024

#ICLR2025 is such a disappointing experience! Spent a whole week preparing PDF revision and rebuttal, getting NO response from reviewers or ACs at all🤷 I do care about discussions more than a binary acceptance/rejection signal, which is why ppl prefer (good) PRM. @iclr_conf

119

18,325

Zichen Liu · Sep 29, 2025 · 2:32 PM UTC

Zichen Liu

@zzlccc

29 Sep 2025

After the crazy 极GRPO weekend, let's get rid of the scalar reward or any policy optimization related to it. We explored learning from *verbal feedback* and obtained interesting results:

Tianyu Pang

@TianyuPang1

29 Sep 2025

🚀LLMs can learn directly from verbal feedback — no scalar rewards needed! 😥Scalar rewards compress rich feedback— “redundant but correct” vs “concise but typo-ridden” might both be 0.8 💡We propose to learn Feedback-Conditional Policy (FCP), an extremely scalable paradigm!

117

13,042

Zichen Liu · Oct 4, 2025 · 5:27 PM UTC

Zichen Liu

@zzlccc

4 Oct 2025

Can’t believe all these experiments are running on my Mac… (with the heavy lifting offloaded to an amazing platform!) Results coming soon…

109

14,180

Zichen Liu · Apr 27, 2025 · 5:59 AM UTC

Zichen Liu

@zzlccc

27 Apr 2025

Great to see Dr. GRPO is much more sample efficient than the original GRPO

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

23 Apr 2025

Tina: Tiny Reasoning Models via LoRA "the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA."

112

8,817

Zichen Liu · Mar 22, 2025 · 3:42 AM UTC

Zichen Liu

@zzlccc

22 Mar 2025

Exactly! This line of code has been overlooked for how long?

Changyu Chen

@Cameron_Chann

21 Mar 2025

(1/3) My favorite figure from the paper. Nearly all open-source RL frameworks introduce an unintentional bias when computing the masked mean 😮. The fix? Just replace mask.sum with a constant.

106

15,254
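The masked-mean issue described above can be reproduced in a few lines. In the sketch below, two incorrect responses of different lengths receive the same per-token loss. Dividing by `sum(mask)` gives every token of the longer response a smaller weight, while dividing by a constant treats all tokens alike. The constant is assumed here to be a fixed generation budget (`max_len`), and the function names are hypothetical, not any framework's API.

```python
def masked_mean_loss(token_losses, mask):
    """Common pattern: divide by the number of unmasked (response) tokens."""
    total = sum(l * m for l, m in zip(token_losses, mask))
    return total / sum(mask)


def constant_normalized_loss(token_losses, mask, max_len):
    """Alternative: divide by a constant, here an assumed generation budget."""
    total = sum(l * m for l, m in zip(token_losses, mask))
    return total / max_len


if __name__ == "__main__":
    max_len = 8
    # Two incorrect responses padded to max_len; each real token carries loss 1.0.
    short_mask = [1, 1, 0, 0, 0, 0, 0, 0]
    long_mask = [1, 1, 1, 1, 1, 1, 0, 0]
    short_losses = [1.0 * m for m in short_mask]
    long_losses = [1.0 * m for m in long_mask]

    # Masked mean: both responses get the same total, so each token of the
    # long response is weighted 1/6 instead of 1/2.
    print(masked_mean_loss(short_losses, short_mask),
          masked_mean_loss(long_losses, long_mask))
    # Constant normalizer: every token is weighted 1/8 regardless of length.
    print(constant_normalized_loss(short_losses, short_mask, max_len),
          constant_normalized_loss(long_losses, long_mask, max_len))
```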

Zichen Liu · Jun 14, 2025 · 3:09 PM UTC

Zichen Liu

@zzlccc

14 Jun 2025

Nice follow-up! Spurious rewards and spurious prompts re-confirm the biases cooked into Qwen base models. Revisiting our results in March (arxiv.org/pdf/2503.20783 Section 2.2 & 3.3): - No template is the best - Much of RL's gain comes from correcting model-template mismatch

Stella Li ✈️ ICML🇰🇷

@StellaLisy

13 Jun 2025

Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum can bring 19.4% gains compared to default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄

105

26,509

Zichen Liu · Aug 28, 2025 · 1:34 AM UTC

Zichen Liu

@zzlccc

28 Aug 2025

Environment Hub by prime-intellect is awesome with its GUIs! Scaling environments is key—they provide the signals RL agents learn from. We've been building 💎GEM with the community: 🌎Envs: math, code, games with python/search tools 🔧Framework-agnostic: 5 integrated frameworks

103

12,129

Zichen Liu · Apr 14, 2025 · 2:21 AM UTC

Zichen Liu

@zzlccc

14 Apr 2025

I think the core issue is still the length bias in policy gradient. In RL, we essentially ascend A(a|s) * grad(pi(a|s)), where A is the advantage. When A only depends on the observed reward, there is no length bias; when A also depends on response length (GAE with lambda < 1 in PPO, or the original GRPO), the model might exploit this bias and exhibit behaviors such as outputting increasingly longer responses. nitter.app/zzlccc/status/19031627… revealed this bias using GRPO (Monte Carlo advantage), and this paper re-confirms it using PPO (generalized advantage estimation with lambda < 1). Cool work.

Sebastian Raschka

@rasbt

13 Apr 2025

As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper (arxiv.org/abs/2504.05185) shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training. I.e., if the model gets a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer. This is true even if those extra tokens don't actually help solve the problem. What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong). So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness. In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency.

100

10,137

Zichen Liu · Jan 26, 2025 · 2:40 AM UTC

Zichen Liu

@zzlccc

26 Jan 2025

Exploration is an important topic in traditional RL. How does it affect online RL for LLM reasoning, like o1/r1? The most common way people add exploration for LLMs is through temperature sampling (a.k.a. Boltzmann exploration). I did a simple ablation on 1B models with PPO and found that setting a suitable temperature is crucial! The best setting improves the zero-shot performance on GSM8K of a 1B model from 40.6% to 55.7% with pure RL (and not plateauing)! The implementation is super efficient with the latest oat-0.0.6🚀 Code: github.com/sail-sg/oat

9,062
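For readers unfamiliar with temperature sampling: the next-token logits are divided by a temperature before the softmax, so higher temperatures flatten the distribution (more exploration) and lower temperatures sharpen it. A minimal sketch with made-up logits follows; the function names are illustrative and are not oat's API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Boltzmann distribution over tokens: p_i proportional to exp(logit_i / T)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more exploratory distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical next-token logits
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```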

Zichen Liu · May 9, 2025 · 2:02 AM UTC

Zichen Liu

@zzlccc

9 May 2025

Good catch! I ran into the same reusability issue ~1 year ago with OpenRLHF. That’s why I built oat🌾 (github.com/sail-sg/oat) — a modular RL LLM framework inspired by DeepMind’s ecosystem. Just define your actor, learner, and env in a single script — and you’re good to go :) Example: github.com/sail-sg/understan…

🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. - sail-sg/oat

github.com

Tony Chen @tonychenxyz

8 May 2025

why is it that every RL LLM project that uses verl has a clone of verl in its code rather than directly importing verl? This feels like bad reusability. When verl updates, these projects' code becomes obsolete.

12,611

Zichen Liu · Nov 1, 2025 · 4:55 PM UTC

Zichen Liu

@zzlccc

1 Nov 2025

Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class imagenet? 2. We do have experiments on “larger” dataset X bigger model (30A3 MoE) X H100 GPUs. The performance improvement is clear.

Ariel

@redtachyon

1 Nov 2025

Aight let's unclickbait the fp16 paper. tl;dr cool paper, a little bit overstated in comms, very overstated by poasters. The thing that gave me pause is that on the surface, it seems to claim that bf16 is horrible, borderline unusable. But that's not really the case (nor is it the claim). Yes, the fsdp-vllm mismatch is real, and yes, it can be mitigated with fp16. This is as true as it is irrelevant, because who cares about the mismatch if the algorithm empirically works? The widely circulated figures show bf16 consistently collapsing, while fp16 runs thrive. I have no reason to doubt this data, but it's performed on a very particular - and rather small - dataset. If you don't go through multiple epochs of your data, you probably won't see such a severe collapse - as many, many people can confirm from their own experiments. Does this mean that fp16 is useless? Not necessarily. As always, there are trade-offs. It might work better in some cases, but maybe you have to fuck with loss scaling. It might be worse in a big data regime, but crucial in a low data regime. Hard to tell for now. But alas, you still have to pay for your lunch.

47,719

Zichen Liu · Jun 18, 2025 · 4:32 AM UTC

Zichen Liu

@zzlccc

18 Jun 2025

wondering what type of research PhD students or small groups of researchers can do to really push the frontier of ai?

8,285

Zichen Liu · May 30, 2025 · 3:00 AM UTC

Zichen Liu

@zzlccc

30 May 2025

Inspiring talk by Prof Rich Sutton, about intelligence and beyond

5,909

Zichen Liu · Mar 23, 2025 · 5:20 PM UTC

Zichen Liu

@zzlccc

23 Mar 2025

Really fast! Thank you so much, @QGallouedec! But removing std only fixes the question-level difficulty bias. Would you like to fix the response-level length bias as well? In fact this bias not only affects GRPO; it appears in nearly all other algorithms through the use of masked_mean

Quentin Gallouédec @QGallouedec

22 Mar 2025

🪂 Getting GRPO Done Right (Dr GRPO) is now in TRL @zzlccc proved that scaling by the std introduces question-level difficulty bias! You can now remove this bias 🗑️

12,268

Zichen Liu · Nov 1, 2025 · 12:34 AM UTC

Zichen Liu

@zzlccc

1 Nov 2025

Hey @agarwl_ , thanks for your attention! For most small-scale experiments we used 8xA100 & the sanity check setting; for larger dense or MoE training we used 64xH100 & the 'normal' setting (DAPO dataset). We also did not observe severe collapse until we developed the "sanity check" setting - a small dataset with all solvable questions (we can expect 100% rewards in theory and we empirically got 98%). We think this is a clean setting to test any algorithm before we scale it up to massive datasets and GPUs. Using a large dataset (plus algorithmic fixes like TIS or CISPO) can slow down the collapse or make it never happen in practice, even though the mismatch still exists

50,358

Zichen Liu · Jul 8, 2025 · 7:56 AM UTC

Zichen Liu

@zzlccc

8 Jul 2025

This work has been accepted by @COLM_conf 2025. See you in Canada! 🍁

Zichen Liu

@zzlccc

21 Mar 2025

4,430

Zichen Liu · Apr 28, 2025 · 1:38 AM UTC

Zichen Liu

@zzlccc

28 Apr 2025

RL works without doubt, the problem is how we define the MDP, or even whether we use MDP. RL is about life, we all learn from experience

Kunhao Zheng @KunhaoZ

27 Apr 2025

🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug — it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?

5,437

Zichen Liu · May 31, 2025 · 8:57 AM UTC

Zichen Liu

@zzlccc

31 May 2025

We do appreciate their efforts in writing the criticisms, but “turns out that the results in this paper are misreported” is a strong claim to make without running the evaluation themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Inc…), questioning existing RL approaches and their respective evaluations. As a response, we re-conducted extensive evaluation on GPQA-Diamond under different settings (various sampling parameters, token budgets) to show that VeriFree remains effective and favorable compared to the verifier-based baseline (apple-to-apple comparison!), and still outperforms the Qwen3 instruct models in non-thinking mode even with a large token budget.

Shashwat Goel

@ShashwatGoel7

28 May 2025

So turns out that the results in this paper are misreported. Qwen3 report shows 4b thinking is GPQA 55.9, much higher than both the 32 here, and post-RL numbers of 45. Similarly for other datasets. No wonder our RL claims keep falling apart. Noticed by @nikhilchandak29

20,250

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

2) On reinforcement learning:
2a) GRPO is biased:
- The length normalization prefers shorter correct answers, and longer incorrect answers. -> length bias
- The std normalization prefers too easy or too hard questions over average questions. -> difficulty bias

4,989
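The difficulty bias can be illustrated with binary rewards. Dividing the centered rewards by the group's standard deviation multiplies each question's update by 1/std, and std is smallest when nearly all responses are right or nearly all are wrong. The sketch below uses the population standard deviation and a small `eps`; both choices, the function names, and the reward groups are assumptions for illustration and may differ from any given implementation.

```python
import statistics


def std_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def std_scale(rewards, eps=1e-6):
    """Extra per-question weight introduced by the std normalization."""
    return 1.0 / (statistics.pstdev(rewards) + eps)


if __name__ == "__main__":
    groups = {
        "hard (1/8 correct)": [1, 0, 0, 0, 0, 0, 0, 0],
        "average (4/8 correct)": [1, 1, 1, 1, 0, 0, 0, 0],
        "easy (7/8 correct)": [1, 1, 1, 1, 1, 1, 1, 0],
    }
    for name, rewards in groups.items():
        print(f"{name}: weight from 1/std = {std_scale(rewards):.2f}")
```

Questions at either extreme receive a larger multiplier than the average-difficulty question. Removing the std division gives every question the same weight.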

Zichen Liu · Nov 13, 2025 · 3:16 AM UTC

Zichen Liu

@zzlccc

13 Nov 2025

I love the KL-div curve, but not so much the later stage of the reward-mean curve. Why does it plateau? A limit of the base model’s exploration?

vLLM

@vllm_project

12 Nov 2025

🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch

10,107

Zichen Liu · Jul 14, 2025 · 1:07 PM UTC

Zichen Liu

@zzlccc

14 Jul 2025

Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers:
1. 🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a Dr. GRPO but I think the paper is more than this loss fix)
2. Main Conference Spotlight: Continual Reinforcement Learning by Planning with Online World Models; I worked on the continual learning topic for one year before diving into RL for LLMs openreview.net/pdf?id=mQeZEs…
3. AI4MATH Workshop Poster: Optimizing Anytime Reasoning via Budget Relative Policy Optimization; a dense reward training framework that controls thinking budget (a cool feature of Gemini 2.5 Pro)

5,148

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

We take an understand-then-improve approach to study R1-Zero-like training. We first critically examine two core components: base models and reinforcement learning. 1) On base models: 1a) DeepSeek-V3-Base already exhibits "Aha moment"😲

8,987

Zichen Liu · Nov 1, 2025 · 2:27 AM UTC

Zichen Liu

@zzlccc

1 Nov 2025

scaling down the data is the path we took to discover this mismatch; it is not overfitting because it's the training reward that goes down, not eval. small dataset setting does not mean our finding is irrelevant to large-scale training. it serves as a sanity check for RL

Yan Ma

@ManTle_Ma

1 Nov 2025

I did indeed encounter a collapse in training reward during RL on H-series GPUs, but that only occurred when training for >10 epochs with a small dataset (e.g., 3k). I attributed this collapse to overfitting. hadn't considered that it might also be related to precision.🤔

13,607

Zichen Liu · Apr 18, 2025 · 12:36 PM UTC

Zichen Liu

@zzlccc

18 Apr 2025

I can feel #ICLR2025 has already started… welcome everyone to 🇸🇬 Singapore! Let's meet and chat about RL, LLM and reasoning :)

5,995

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

2c) To get GRPO Done Right, we propose Dr. GRPO. Two line modifications: removing both length and std normalization (red terms). Our new optimizer is unbiased, and has better token efficiency (by preventing progressively longer incorrect responses of GRPO)

3,863
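A sketch of the two objectives, written in standard notation that is assumed here rather than copied from the paper ($q$ is a question, $o_1, \dots, o_G$ are the sampled responses, $R_i$ is the reward of $o_i$, and the KL penalty is omitted). The two terms that Dr. GRPO removes are the $1/|o_i|$ length normalization and the division by the group standard deviation.

```latex
% Per-token probability ratio between the current and the sampling policy
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

% GRPO: length-normalized sum, std-normalized advantage
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\!\left( \rho_{i,t}\,\hat{A}_i,\; \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i \right) \right],
\qquad \hat{A}_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}

% Dr. GRPO: no 1/|o_i|, no division by std
\mathcal{J}_{\text{Dr. GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
  \min\!\left( \rho_{i,t}\,\hat{A}_i,\; \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i \right) \right],
\qquad \hat{A}_i = R_i - \operatorname{mean}(R_1, \dots, R_G)
```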

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

These are not all... Please check out our paper, codebase, and models for more fun, such as * RL on Basic algebra (+ − ×÷) questions improves Olympiad-level reasoning capabilities * Llama models can also "Aha" * ... Paper: github.com/sail-sg/understan…

understand-r1-zero/understand-r1-zero.pdf at main · sail-sg/understand-r1-zero

Understanding R1-Zero-Like Training: A Critical Perspective - sail-sg/understand-r1-zero

github.com

3,771

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

2/8 Using these prompts, we took questions from MATH to test 6 base models, and found that 5/6 base models already exhibit self-reflection behaviors (or the Aha moment😯😯) with the following keyword occurrence frequencies -- Qwen models are the most active

5,508

Zichen Liu · Apr 23, 2025 · 1:45 AM UTC

Zichen Liu

@zzlccc

23 Apr 2025

🚨 RL x LLM folks at #ICLR2025 — come join us during the Friday lunch break! If you haven’t RSVP’d on Whova, you can also register here: lu.ma/s8udv997?tk=B4lfN0 @Benjamin_eecs and I will scout for a chill spot (likely a corner at the venue) and share the location tomorrow. It’ll be an informal meetup — just to connect, chat about ideas, and maybe grab lunch together :) Can’t wait to see y’all there!

3,544

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

Our analysis on base models and RL suggests a minimalist recipe for R1-Zero training, no tricks:
- algo: Dr. GRPO
- data: MATH level 3-5 questions
- template: Qwen-Math
- compute: 27 Hours * 8 * A100
This gives us a 7B SOTA in the Zero-RL setting: 43.3 on AIME 2024!

3,977

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

2b) Surprisingly, even though PPO's formulation is unbiased, nearly all open-source implementations introduce the length bias by computing the masked_mean. The length bias is partly the reason for the increasingly longer responses.

4,274

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

7/8 Moreover, we find response length may not be a good indicator of self-reflections, because the two do not appear to be correlated during R1-Zero-like training.

3,814

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

6/8 We take a closer look at the RL dynamics during R1-Zero-like training. We find that length changes of model responses are mainly due to the rule-based reward shaping, which initially encourages formatting then correctness, verifying our hypothesis.

3,487

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

5/8 We hypothesize that performing RL could turn superficial self-reflections into effective self-reflections with proper reward shaping.

3,340

Zichen Liu · Oct 17, 2025 · 5:43 AM UTC

Zichen Liu

@zzlccc

17 Oct 2025

I like the apple-to-apple comparison when designing new algorithms! With group rollouts (like in GRPO), the semantic entropy is almost a free lunch to enable curriculum learning on questions to improve training efficiency and stability. A clean and useful method!

Minghan Chen

@DrTreeMan312

17 Oct 2025

Motivation Fine-tuning LLMs beyond their knowledge boundary can hurt performance (arxiv.org/pdf/2402.18243, arxiv.org/pdf/2405.05904). SEED-GRPO mitigates this by measuring semantic entropy — estimating how confident the model is on each training sample. Method We modulate the training sample’s Advantage with uncertainty — effectively giving each problem a “custom learning rate.” High Semantic Entropy⇒ High uncertainty ⇒ weaker advantage (less trust) Low Semantic Entropy⇒ Low uncertainty ⇒ keep relatively high learning signal Result Built on Dr.GRPO(arxiv.org/abs/2503.20783), SEED-GRPO boosts the average score from 51.4 → 56.6 under identical hyperparameters (Table 2, our methods row 3). Huge thanks to @zzlccc for the excellent Dr.GRPO and insightful discussions! 📄 Paper: arxiv.org/abs/2505.12346 💻 Code: github.com/Dreamer312/SEED-G…

8,106

Zichen Liu · May 17, 2025 · 3:34 PM UTC

Zichen Liu

@zzlccc

17 May 2025

Replying to @lilianweng

Thanks for the great blog! Regarding the “aha moment”, is it possible to be more conservative when discussing it since 1-2 months ago there were many debates on it and we also systematically studied it and found there is no aha moment:

Zichen Liu

@zzlccc

21 Mar 2025

10,509

Zichen Liu · Oct 31, 2025 · 11:47 PM UTC

Zichen Liu

@zzlccc

31 Oct 2025

bf16 is actually a great choice for ablating algorithms that try to fix numerical mismatch — it makes the mismatch worse, so you have more room to observe the improvement

will brown

@willccbb

31 Oct 2025

sticking w bf16, wouldn’t wanna make things too easy

10,660

Zichen Liu · May 28, 2025 · 2:27 AM UTC

Zichen Liu

@zzlccc

28 May 2025

Interesting paper! A funny thing is that I wrote a bug to introduce random rewards about three months ago, and observed similar evaluation improvement. Did not expect this can turn into a nice paper, great job!

Stella Li ✈️ ICML🇰🇷

@StellaLisy

27 May 2025

🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewards

3,835

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

8/8 We hope our findings can help understand a bit deeper about R1-Zero, and provide insights for RL training of base models. BIG thanks to @Cameron_Chann @liwenjun2016 @TianyuPang1 @duchao0726 @mavenlin for their valuable contributions during the past week.❤️

3,156

Zichen Liu · Jun 23, 2025 · 2:46 PM UTC

Zichen Liu

@zzlccc

23 Jun 2025

had fun RL-tuning language models on text games

León

@LeonGuertler

23 Jun 2025

For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of self-play and some additional analysis, tentatively around the 18th of july. However, to get more feedback on the structure and implementation, we want to open-source the code now. UnstableBaselines is a very simple Async, Online, Multi-Turn, Multi-Agent RL library built on vLLM and Ray. The code is pretty readable and around 1.2k lines long (and includes a cool rendering interface that you can run via "unstable-terminal") 1/7

3,930

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

1b) As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates: the average benchmark scores immediately improve by ~60%. This makes Qwen2.5 base more like SFT models trained on QA concatenation🤔

6,381

Zichen Liu · Jul 1, 2025 · 4:15 PM UTC

Zichen Liu

@zzlccc

1 Jul 2025

Self-play on zero-sum language games creates selection pressure for LLMs to develop transferrable reasoning patterns. Enjoyed building the multi-agent, multi-turn RL system and training agents that think strategically through self-play! Paper: huggingface.co/papers/2506.2… Code: github.com/spiral-rl/spiral

Bo Liu (Benjamin Liu)

@Benjamin_eecs

1 Jul 2025

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.2… 🧑‍💻 Code: github.com/spiral-rl/spiral

3,276

Zichen Liu · May 21, 2025 · 10:16 AM UTC

Zichen Liu

@zzlccc

21 May 2025

Does this invalidate experiments based on Qwen3-4B-Base?

7,171

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

4/8 However, there are still many self-reflections from base models that do not lead to correct final answers, which we call Superficial Self-Reflections (SSRs)

5,957

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

1/8 Base models can directly answer questions by proper prompting, such as the ones shown below

5,488

Zichen Liu · Feb 6, 2025 · 5:40 PM UTC

Zichen Liu

@zzlccc

6 Feb 2025

3/8 More surprisingly, Qwen2.5-Math-7B can directly solve the example question reported in SimpleRL-Zero with self-correcting CoTs even before any training

4,453

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

This project is a follow-up of our pilot study with team efforts ❤️. BIG thannnks to @Cameron_Chann , @liwenjun2016, @QPHutu , @TianyuPang1 , @duchao0726 , @mavenlin !!!

Zichen Liu

@zzlccc

6 Feb 2025

4,432

Zichen Liu · Mar 21, 2025 · 7:12 PM UTC

Zichen Liu

@zzlccc

21 Mar 2025

1a) and 1b) suggest that there are biases in base model pretraining: self-reflection behaviors and math-solving abilities are already infused before RL reinforces them through reward signals. But is the increasingly longer response a consequence of such an RL process?

5,134

Zichen Liu · May 17, 2025 · 3:13 AM UTC

Zichen Liu

@zzlccc

17 May 2025

And still using the biased policy gradient

Rupesh Srivastava @rupspace

16 May 2025

I think it was pretty clearly shown (by @zzlccc et al) that DeepSeek was wrong about the emergence of Aha moment through RL in the R1 paper. But seeing some papers still talking about Aha moments. Am I missing something?

3,372

Zichen Liu · Jun 6, 2025 · 5:02 AM UTC

Zichen Liu

@zzlccc

6 Jun 2025

Awesome classical RL work in the era of LLM RL

Seohong Park @seohong_park

5 Jun 2025

Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓

2,298

Zichen Liu · May 21, 2025 · 5:06 AM UTC

Zichen Liu

@zzlccc

21 May 2025

I believe our work offers a promising technique for building reasoners with fine-grained budget control— like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!

Zichen Liu

@zzlccc

20 May 2025

2,963

Zichen Liu · Feb 24, 2025 · 3:31 AM UTC

Zichen Liu

@zzlccc

24 Feb 2025

Replying to @vwxyzjn

You may try vLLM’s native API: .sleep() and .wake_up() My local dev branch of oat (github.com/sail-sg/oat) uses it to collocate all vLLM actors and deepspeed learners.

🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. - sail-sg/oat

github.com

1,041

Zichen Liu · Mar 25, 2025 · 2:45 PM UTC

Zichen Liu

@zzlccc

25 Mar 2025

Congrats on the new release! 😍love the multi-node training support 🚀excited to see our Dr. GRPO’s fixes (removing std and 1/|o|) are adopted so quickly by the community

Lewis Tunstall

@_lewtun

24 Mar 2025

RL goes brrr in the latest TRL release! 🔥 Scale GRPO with multi-node training & vLLM's tensor parallelism 🚀 6x faster convergence with multi-step optimisation 📊 Support for domain specific rewards Release notes 👇 github.com/huggingface/trl/r…

1,849

Zichen Liu · Apr 3, 2025 · 5:34 AM UTC

Zichen Liu

@zzlccc

3 Apr 2025

My friend from University of Alberta, @qingfeng_lan, has provided an excellent summary of policy-based LLM RL-tuning algorithms, grounded in the Policy Gradient Theorem:

Qingfeng Lan @qingfeng_lan

2 Apr 2025

🚀RL algorithms are shaping the post-training of LLMs, but how do their objectives connect? In this blog, I explore their relationships and provide a unified perspective through the Policy Gradient Theorem—the backbone of policy gradient methods. Dive in: lancelqf.github.io/note/llm_…

2,406

Zichen Liu · Oct 13, 2025 · 4:52 PM UTC

Zichen Liu

@zzlccc

13 Oct 2025

The terminal is perhaps the most general environment to drop language agents into and let them learn from experience. Existing bash/shell tools don’t go far enough. Check out Min and the team’s great work on the TTY emulator and beyond! (link to a blog is at the last tweet)👇

Min Lin

@mavenlin

13 Oct 2025

LLMs don't need MCPs, they need a terminal. Not the bash/shell tool that the codex/claude are already using, but a real tty emulator, to be used in the same way that humans do, i.e. capable of running any REPL interactively, as we will show in the thread.

4,994

Zichen Liu · Jul 16, 2025 · 2:31 AM UTC

Zichen Liu

@zzlccc

16 Jul 2025

If life is a trajectory, and we are doing online continual RL, it seems that we are driven by many verifiable rewards 🤔

Jason Wei

@_jasonwei

16 Jul 2025

New blog post about asymmetry of verification and "verifier's law": jasonwei.net/blog/asymmetry-… Asymmetry of verification–the idea that some tasks are much easier to verify than to solve–is becoming an important idea as we have RL that finally works generally. Great examples of asymmetry of verification are things like sudoku puzzles, writing the code for a website like instagram, and BrowseComp problems (takes ~100 websites to find the answer, but easy to verify once you have the answer). Other tasks have near-symmetry of verification, like summing two 900-digit numbers or some data processing scripts. Yet other tasks are much easier to propose feasible solutions for than to verify them (e.g., fact-checking a long essay or stating a new diet like "only eat bison"). An important thing to understand about asymmetry of verification is that you can improve the asymmetry by doing some work beforehand. For example, if you have the answer key to a math problem or if you have test cases for a Leetcode problem. This greatly increases the set of problems with desirable verification asymmetry. "Verifier's law" states that the ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI. The ability to train AI to solve a task is proportional to whether the task has the following properties: 1. Objective truth: everyone agrees what good solutions are 2. Fast to verify: any given solution can be verified in a few seconds 3. Scalable to verify: many solutions can be verified simultaneously 4. Low noise: verification is as tightly correlated to the solution quality as possible 5. Continuous reward: it’s easy to rank the goodness of many solutions for a single problem One obvious instantiation of verifier's law is the fact that most benchmarks proposed in AI are easy to verify and so far have been solved. 
Notice that virtually all popular benchmarks in the past ten years fit criteria #1-4; benchmarks that don’t meet criteria #1-4 would struggle to become popular. Why is verifiability so important? The amount of learning in AI that occurs is maximized when the above criteria are satisfied; you can take a lot of gradient steps where each step has a lot of signal. Speed of iteration is critical—it’s the reason that progress in the digital world has been so much faster than progress in the physical world. AlphaEvolve from Google is one of the greatest examples of leveraging asymmetry of verification. It focuses on setups that fit all the above criteria, and has led to a number of advancements in mathematics and other fields. Different from what we've been doing in AI for the last two decades, it's a new paradigm in that all problems are optimized in a setting where the train set is equivalent to the test set. Asymmetry of verification is everywhere and it's exciting to consider a world of jagged intelligence where anything we can measure will be solved.

2,753

Zichen Liu · Jul 27, 2025 · 5:05 PM UTC

Zichen Liu

@zzlccc

27 Jul 2025

Replying to @mgostIH

I would say it's not a bug but a technique by prior RL researchers... We can even treat the number of timesteps to be included for IS computation as a hyper-parameter, to get the best bias-variance tradeoff

Zichen Liu

@zzlccc

27 Jul 2025

2,989

Zichen Liu · Apr 10, 2025 · 4:16 AM UTC

Zichen Liu

@zzlccc

10 Apr 2025

Created an issue to track how other LLM RL projects are tackling GRPO biases: github.com/sail-sg/understan… Shoutout to the TRL team for the comprehensive support! @QGallouedec @_lewtun. @verl_project would you like to consider fixing this?

1,858

Zichen Liu · Aug 5, 2025 · 6:21 AM UTC

Zichen Liu

@zzlccc

5 Aug 2025

Excited to see GDM proposing Game Arena to measure the model capabilities. Let’s also scale the environments for agent RL with GEM 💎 github.com/axon-rl/gem !

A Gym for Agentic LLMs. Contribute to axon-rl/gem development by creating an account on GitHub.

github.com

Google DeepMind

@GoogleDeepMind

4 Aug 2025

We have a long history of using games to measure progress in AI. 🎮 That’s why we’re helping unveil the @Kaggle Game Arena: an open-source platform where models go head-to-head in complex games to help us gauge their capabilities. 🧵

2,433

Zichen Liu · Aug 28, 2025 · 4:41 AM UTC

Zichen Liu

@zzlccc

28 Aug 2025

Replying to @karpathy

Hi Andrej, you might also find this interesting: axon-rl.notion.site/gem — we share a similar vision to Env Hub and are working on building a “gym” for LLMs.

💎 GEM: A Gym for Generalist LLMs | Notion

We’re entering the era of experience, where LLM training moves beyond static datasets, towards LLM agents learning from experience gathered in complex, expressive environments. As a step towards this...

axon-rl.notion.site

2,978

Zichen Liu · May 31, 2025 · 9:03 AM UTC

Zichen Liu

@zzlccc

31 May 2025

There are some other points to note: 1. Our method is trained with generate_max_length=3000 due to limited computational resources, and is trained from Qwen3 base models. 2. Qwen3 thinking mode shows strong performance with extremely long reasoning token lengths. 3. We originally followed the evaluation settings of General-Reasoner (github.com/TIGER-AI-Lab/Gene…) in our paper.

1,816

Zichen Liu · Dec 17, 2024 · 2:53 PM UTC

Zichen Liu

@zzlccc

17 Dec 2024

oat-0.0.5 (github.com/sail-sg/oat) now supports post-training as online learning more efficiently!⚡️ We use 𝐚𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 to minimize GPU idle time, and show that async online DPO achieves 28% time reduction while maintaining the final performance! Since the data is not exactly on-policy, this also shows that online DPO is robust to data staleness💪

1,265

Zichen Liu · Aug 6, 2025 · 2:46 AM UTC

Zichen Liu

@zzlccc

6 Aug 2025

The man behind OAI’s RL infra!

Jiayi Weng

@Trinkle23897

5 Aug 2025

Harmony format is finally open-sourced. I still remember 3 years ago (before ChatGPT release) @shengjia_zhao, Daniel and I were brainstorming about the right abstraction for RL training, and that is the start point of the entire harmony library. github.com/openai/harmony

2,401

Zichen Liu · Mar 23, 2025 · 5:12 PM UTC

Zichen Liu

@zzlccc

23 Mar 2025

This animation is absolutely stunning! 🎬🔥 Never imagined we could make a comparison this fun!

nisten🇨🇦e/acc

@nisten

22 Mar 2025

Lil update on fixing deepseeks GRPO issues when training a small medical model. shoutout to @zzlccc & @leloykun 's weekend work 1.5B llms medmcqa score went up from 37% to 52%

1,748

Zichen Liu · Sep 30, 2025 · 1:33 PM UTC

Zichen Liu

@zzlccc

30 Sep 2025

If I infer correctly, this is Dr. GRPO: 1. summed token loss without length normalization, 2. mean reward as baseline without std normalization
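The two points above can be sketched in plain Python. The function names and toy numbers are illustrative assumptions, not code from oat, TRL, or any other library, and population std is used only for simplicity.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Original GRPO: subtract the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO: mean reward as baseline, no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_weights(advantages, lengths, length_normalize):
    """Weight each token's log-prob gradient receives in the summed loss.

    length_normalize=True mimics the 1/|o| term of the original GRPO;
    False is the plain summed token loss.
    """
    weights = []
    for adv, n_tokens in zip(advantages, lengths):
        w = adv / n_tokens if length_normalize else adv
        weights.append([w] * n_tokens)
    return weights


# Two wrong responses with the same advantage but different lengths.
wrong_adv = [-0.5, -0.5]
lengths = [2, 4]
with_norm = per_token_weights(wrong_adv, lengths, length_normalize=True)
without_norm = per_token_weights(wrong_adv, lengths, length_normalize=False)
```

With the 1/|o| term, each token of the longer wrong response is penalized half as much as each token of the shorter one. Without it, every token carries the same weight.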

1,219

Zichen Liu · Aug 22, 2025 · 3:35 PM UTC

Zichen Liu

@zzlccc

22 Aug 2025

A few lines of code change: github.com/sail-sg/oat/pull/…

fix: truncated importance sampling to handle precision mismatch by lkevinzc · Pull Request #62 ·...

[Figure: training rewards and gradient norm] With truncated importance sampling (orange), the training becomes more stable. Reference: https://fengyao.notion.site/off-policy-rl

github.com
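For readers who have not seen the technique, a minimal sketch of truncated importance sampling in plain Python. This is not the code from the linked pull request: the function names and the cap of 2.0 are assumptions for illustration. The sketch assumes per-token log-probabilities of the same sampled tokens are available from both the learner and the sampler (inference engine).

```python
import math


def truncated_is_weights(learner_logps, sampler_logps, cap=2.0):
    """Per-token ratio pi_learner / pi_sampler, clipped from above at `cap`."""
    return [
        min(math.exp(lp - sp), cap)
        for lp, sp in zip(learner_logps, sampler_logps)
    ]


def weighted_pg_terms(learner_logps, sampler_logps, advantage, cap=2.0):
    """Scale each token's policy-gradient coefficient by its truncated weight."""
    weights = truncated_is_weights(learner_logps, sampler_logps, cap)
    return [w * advantage for w in weights]


# First token: both engines agree, so the weight is 1.
# Second token: the raw ratio exp(2) is about 7.39, which is capped at 2.0.
weights = truncated_is_weights([0.0, -1.0], [0.0, -3.0], cap=2.0)
```

The cap bounds the variance that a few badly mismatched tokens can inject, at the cost of some bias.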

2,261

Zichen Liu · Oct 25, 2025 · 2:47 AM UTC

Zichen Liu

@zzlccc

25 Oct 2025

I worked on OAT while VERL was being developed at ByteDance, and our team built GEM before Environment Hub/OpenEnv were released by Prime-Intellect and Meta. Feeling grateful to be in sync with the community and contribute in my own small way!😆

2,275

Zichen Liu · Jul 1, 2025 · 4:28 PM UTC

Zichen Liu

@zzlccc

1 Jul 2025

Thanks @_AndrewZhao for the earlier introduction of our work!

Andrew Zhao

@_AndrewZhao

1 Jul 2025

Self-play is so back arxiv.org/pdf/2506.24119

1,824

Zichen Liu · Oct 31, 2025 · 9:53 PM UTC

Zichen Liu

@zzlccc

31 Oct 2025

Replying to @hu_yifei

We've done a lot of GRPO training on various tasks with bf16. Haven't seen big dips like this... until the sanity check setting. We used standard hyperparameters, but need to train long enough. Let's discuss more if you'd like to reproduce!

1,957

Zichen Liu · Mar 20, 2025 · 12:46 AM UTC

Zichen Liu

@zzlccc

20 Mar 2025

Excited to see RL experts bring new techniques into RLxLLM fields!

Marc G. Bellemare @marcgbellemare

19 Mar 2025

At Reliant we've found RL to be incredibly efficient at improving answer quality to life sciences' hardest questions. Today we're putting out our work on LLM fine-tuning with off-policy RL, matching llama 70B performance with an 8B model - take a look! arxiv.org/abs/2503.14286

1,682

Zichen Liu · Mar 13, 2025 · 3:19 AM UTC

Zichen Liu

@zzlccc

13 Mar 2025

Let’s do it!

will brown

@willccbb

12 Mar 2025

watch this space. assembling a team, plotting, scheming, etc.

2,971

Zichen Liu · Dec 14, 2024 · 12:33 AM UTC

Zichen Liu

@zzlccc

14 Dec 2024

A pity not joining #NeurIPS2024 in person...but my amazing collaborators @Cameron_Chann @duchao0726 are presenting "𝐒𝐚𝐦𝐩𝐥𝐞-𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 𝐀𝐥𝐢𝐠𝐧𝐦𝐞𝐧𝐭 𝐟𝐨𝐫 𝐋𝐋𝐌𝐬" at Language Gamification workshop on Saturday @ West Meeting Room 220-222! Welcome to drop by!

1,069

Zichen Liu · Oct 6, 2025 · 4:50 PM UTC

Zichen Liu

@zzlccc

6 Oct 2025

1/7 We released a technical report for GEM, building on our previous blog release. Our submissions to the SEA and MTI-LLM workshops at NeurIPS have been accepted for oral / spotlight presentations, and we received valuable feedback from all reviewers! arxiv.org/pdf/2510.01051

1,462

Zichen Liu · Jan 25, 2025 · 2:23 AM UTC

Zichen Liu

@zzlccc

25 Jan 2025

One opinion, and one question for what happened recently: 1. I thought 'zero' means learning a skill from scratch, but recent zero models start from well-pretrained LLMs, which might already possess skills but with very low confidence. RL training naturally reinforces these skills as long as they can be explored (by sampling with temperature). 2. How long should long-cot be so that we call it "emergence"?

2,766

Zichen Liu · Oct 20, 2025 · 2:44 AM UTC

Zichen Liu

@zzlccc

20 Oct 2025

Please please

Csaba Szepesvari @CsabaSzepesvari

19 Oct 2025

Replying to @karpathy

@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss if the problem is with the algorithms, or if the problem is with posing a problem as an RL problem. 1/x

5,428

Zichen Liu · Aug 1, 2025 · 7:06 PM UTC

Zichen Liu

@zzlccc

1 Aug 2025

The idea is simple: we should develop a dedicated environment simulator that's agnostic to training frameworks. To this end, we respect the standard agent-environment interface and follow the OpenAI-Gym's specifications: .reset() and .step()
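To make the interface concrete, here is a toy text environment exposing only .reset() and .step(). It is a hypothetical example, not GEM's actual API: the class name, the task, and the 5-tuple return (observation, reward, terminated, truncated, info, as in the later Gymnasium convention) are assumptions for illustration.

```python
class GuessNumberEnv:
    """Toy text game: the agent must guess a hidden integer."""

    def __init__(self, target=7, max_turns=5):
        self.target = target
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        """Start a new episode and return the first observation and an info dict."""
        self.turn = 0
        return "Guess an integer between 1 and 10.", {}

    def step(self, action):
        """Consume one text action and return the next transition."""
        self.turn += 1
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None
        terminated = guess == self.target
        truncated = not terminated and self.turn >= self.max_turns
        reward = 1.0 if terminated else 0.0
        if terminated:
            obs = "Correct!"
        elif guess is None:
            obs = "Please answer with an integer."
        else:
            obs = "Too low." if guess < self.target else "Too high."
        return obs, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Framework-agnostic rollout loop: `policy` is any callable from text to text."""
    obs, _ = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total_reward += reward
        done = terminated or truncated
    return total_reward
```

Because the rollout loop only touches reset() and step(), the same environment can sit behind any training framework, with an LLM call in place of `policy`.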

1,756

Zichen Liu · Aug 14, 2025 · 2:50 AM UTC

Zichen Liu

@zzlccc

14 Aug 2025

paper links if you'd like to see the details: - Dr. GRPO (by our team): arxiv.org/pdf/2503.20783 - GMPO (by @furu_wei's team): arxiv.org/pdf/2507.20673 - GFPO (by @DimitrisPapail's team): arxiv.org/pdf/2508.09726

1,113

Zichen Liu · Jul 26, 2025 · 2:09 AM UTC

Zichen Liu

@zzlccc

26 Jul 2025

eat oat every morning; training with oat🌾the rest of the day: github.com/sail-sg/oat

944

Zichen Liu · Aug 1, 2025 · 7:06 PM UTC

Zichen Liu

@zzlccc

1 Aug 2025

In this way, different tasks (math, code, language games, general qa, reasoning gym, python & search tool) are all unified under a single interface. Leveraging this, we run the good old multi-turn REINFORCE and get those beautiful reward curves we all love to see 📈🔥
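As a reminder of what multi-turn REINFORCE computes, a small sketch in plain Python. It is not the implementation used in the work above: the function names, the per-turn reward layout, and the discount `gamma` are assumptions for illustration.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each turn to the end of the episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))


def reinforce_loss(turn_logps, returns):
    """REINFORCE surrogate: minus the return-weighted log-probability of each turn.

    turn_logps[t] is the summed log-probability of the tokens the policy
    generated at turn t.
    """
    return -sum(lp * g for lp, g in zip(turn_logps, returns))


# A three-turn episode that is rewarded only at the end.
episode_returns = returns_to_go([0.0, 0.0, 1.0], gamma=0.5)
loss = reinforce_loss([-1.0, -2.0], [1.0, 1.0])
```

Minimizing this loss raises the log-probability of turns that were followed by high returns. In practice a baseline is usually subtracted from the returns to reduce variance.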

1,230