Penghui Qi · Feb 5, 2026 · 3:39 PM UTC

Penghui Qi

Pinned Tweet

Penghui Qi

@QPHutu

Feb 5

This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔 Check out our paper for more details👇

556

49,709

Penghui Qi · Oct 31, 2025 · 1:59 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…

105

652

220,292
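To make the BF16 vs. FP16 claim above concrete, here is a minimal sketch (standard-library Python, not the paper's code) that rounds the same value to both formats and compares the rounding error. The probability value `p` and the helper names `to_fp16` and `to_bf16` are illustrative assumptions; the BF16 helper ignores NaN handling.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even, no NaN handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

p = 0.7373  # hypothetical token probability
fp16_err = abs(to_fp16(p) - p)
bf16_err = abs(to_bf16(p) - p)
print(f"fp16 error: {fp16_err:.2e}, bf16 error: {bf16_err:.2e}")

# A step of 2**-10 above 1.0 survives in FP16 but is rounded away in BF16.
step = 1 + 2 ** -10
print(to_fp16(step) == step, to_bf16(step) == 1.0)
```

FP16 keeps 10 mantissa bits against 7 for BF16, at the cost of a narrower exponent range, so two implementations of the same computation tend to disagree less after rounding to FP16.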

Penghui Qi · Nov 1, 2025 · 12:02 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Thanks for this fix. Actually, it is not quite this easy: GradScaler should be introduced to avoid gradient underflow, otherwise the performance can be even worse than BF16. See: docs.pytorch.org/docs/stable… VeRL Example: github.com/sail-sg/Precision…

Johannes Hagemann

@johannes_hage

31 Oct 2025

I present the fix for RL training instability:

194

36,139
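The gradient underflow mentioned in the reply above can be illustrated without any framework. This is a minimal sketch assuming a fixed, hypothetical loss scale of 2^16; PyTorch's GradScaler adjusts its scale dynamically, and this is not the VeRL patch itself.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 2.0 ** 16  # hypothetical static scale, chosen for illustration

true_grad = 1e-8  # below the smallest positive FP16 subnormal (about 6e-8)

naive = to_fp16(true_grad)                # stored directly in FP16: underflows to 0.0
scaled = to_fp16(true_grad * LOSS_SCALE)  # scaled first: lands in FP16's normal range
recovered = scaled / LOSS_SCALE           # unscaled in higher precision before the update

print(naive, recovered)
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so small gradients survive the FP16 backward pass and are divided back down before the optimizer step.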

Penghui Qi · Oct 31, 2025 · 2:35 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

⛈️ VeRL does not natively support FP16 training. A naive implementation will suffer from gradient underflow. 💊 🚀We provide a minimal patch for VeRL to enable effective FP16 training, with about 10 lines of code change.👇 ⌨️ github.com/sail-sg/Precision…

Precision-RL/verl_fp16.patch at main · sail-sg/Precision-RL

Defeating the Training-Inference Mismatch via FP16 - sail-sg/Precision-RL

github.com

Penghui Qi

@QPHutu

31 Oct 2025

115

17,676

Penghui Qi · Nov 13, 2025 · 2:48 AM UTC

Penghui Qi

@QPHutu

13 Nov 2025

Finally! Although it's 2.4× slower right now (I believe many optimizations are coming), the results are really promising! It is a huge step towards truly on-policy RL! Amazing work!

vLLM

@vllm_project

12 Nov 2025

🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch

15,148
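One commonly cited reason that outputs can depend on batching is that floating-point addition is not associative, so a different reduction order can change the last bits of a result. A minimal sketch in plain Python (the values are arbitrary and do not come from the vLLM blog):

```python
import math

# Three contributions to a reduction, e.g. partial sums inside a kernel.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # one reduction order
right_to_left = a + (b + c)  # another order, as a different batch split might produce

bitwise_equal = left_to_right == right_to_left
close = math.isclose(left_to_right, right_to_left)

print(bitwise_equal, close)
```

The two sums are close but not bitwise equal, which is why exact train-inference agreement requires fixing the kernels and reduction order rather than relying on tolerance.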

Penghui Qi · Nov 2, 2025 · 12:43 AM UTC

Penghui Qi

@QPHutu

2 Nov 2025

Indeed, many people never saw their BF16 training collapse, but the problem exists, as many reports show. We reproduce this instability by designing a sanity test (just like MNIST for CV) for better understanding. Large models+datasets are here👇 Give it a try, you may be surprised.

Zichen Liu

@zzlccc

1 Nov 2025

Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class imagenet? 2. We do have experiments on “larger” dataset X bigger model (30A3 MoE) X H100 GPUs. The performance improvement is clear.

19,080

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/AnytimeRe…

33,475

Penghui Qi · Nov 1, 2025 · 2:48 PM UTC

Penghui Qi

@QPHutu

1 Nov 2025

This is exactly what we want to share with our FP16 tech report! Thanks @Grad62304977 for the great explanation.

Grad

@Grad62304977

1 Nov 2025

Replying to @redtachyon

Well sort of, with just GRPO and not actually taking care of the mismatch at the algorithm level, you will encounter instability with bf16 under normal training settings like here (and as many papers for actual models like Kimi Linear have mentioned). Their point is that, given these instabilities occur in the long term because of the mismatch, patching it up with an algorithm fix is great, but it is unclear whether that just delays the instability (for now that’s what it looks like). Also, even if we’re not unstable, it could turn into just lower performance (higher mismatch is sort of just training more off-policy; if we can be more on-policy for free, then it’s likely we would get a performance boost too)

15,015

Penghui Qi · Oct 31, 2025 · 2:47 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

Huge thanks to @Grad62304977 for quickly testing out our findings on using FP16 for RL fine-tuning and confirming the results!🥇

Grad

@Grad62304977

31 Oct 2025

Replying to @Grad62304977

5,969

Penghui Qi · Nov 14, 2025 · 11:37 AM UTC

Penghui Qi

@QPHutu

14 Nov 2025

More amazing progress on truly on-policy RL!💯 I believe it is a headache for the community to find a reproducible setting where the mismatch consistently causes training collapse. If so, you may want to check this sanity test. Link to the dataset👇 huggingface.co/datasets/sail…

LMSYS Org

@lmsysorg

14 Nov 2025

💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't found a significant difference from the baseline yet. We're calling on the community: Send us your reproducible training collapse examples! We want to see them 🤣🤣

11,605

Penghui Qi · Nov 3, 2025 · 3:17 AM UTC

Penghui Qi

@QPHutu

3 Nov 2025

Many thanks for these exciting results. I’ve been waiting all weekend for someone to reproduce them, and I’m thrilled they’re here.

Łukasz Borchmann

@LukaszBorchmann

3 Nov 2025

Replying to @redtachyon

Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.

8,460

Penghui Qi · Nov 2, 2025 · 2:22 AM UTC

Penghui Qi

@QPHutu

2 Nov 2025

Thank you @karpathy for finding our paper interesting. This is very encouraging.

Andrej Karpathy

@karpathy

1 Nov 2025

Replying to @MarFot78 @zzlccc

I think if you zoomed into the paper too you’d find it just as if not more interesting.

7,264

Penghui Qi · Nov 2, 2025 · 9:12 AM UTC

Penghui Qi

@QPHutu

2 Nov 2025

Replying to @RichardYRLi @danielhanchen

Hi @RichardYRLi , I tried this disable_cascade_attn many times, including the latest vllm version. But unfortunately it made no difference in our experiments. So I guess it really depends on the setting.

6,042

Penghui Qi · Oct 31, 2025 · 1:59 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

This was a wonderful team effort, huge thanks to my amazing collaborators @zzlccc @NickZhou523786 @TianyuPang1 @duchao0726 @mavenlin

5,017

Penghui Qi · Sep 26, 2024 · 2:57 PM UTC

Penghui Qi

@QPHutu

26 Sep 2024

Excited to share that our paper, Pipeline Parallelism with Controllable Memory, was accepted by #NeurIPS2024. Code: github.com/sail-sg/zero-bubb… Paper: arxiv.org/pdf/2405.15362

2,129

Penghui Qi · Nov 14, 2025 · 12:13 PM UTC

Penghui Qi

@QPHutu

14 Nov 2025

Big news👇 VeRL just released v0.6.1, which natively supports FP16 for both FSDP and Megatron.🔥 Big thanks to @zhu_xuekai and the Megatron guys for their awesome reproduction and code contribution!💯 Let's make the open source community even better!💪

Xuekai Zhu @zhu_xuekai

14 Nov 2025

Recently, we @QPHutu collaborated with VERL maintainers to add a new feature in v0.6.1: 🚀FP16 training (FSDP) & inference, mitigating the mismatch issue. Give it a try! PR: github.com/volcengine/verl/p… 😆PS: The open source community is awesome!

6,408

Penghui Qi · Nov 1, 2025 · 2:06 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Wonderful to see this curve! It aligns with our observation: when running many epochs on a small dataset, training consistently collapses (the sanity test in Section 4). But it is definitely not overfitting, because it is the training reward going down, not only the validation reward.

Yan Ma

@ManTle_Ma

1 Nov 2025

I did indeed encounter a collapse in training reward during RL on H-series GPUs, but that only occurred when training for >10 epochs with a small dataset (e.g., 3k). I attributed this collapse to overfitting. I hadn't considered that it might also be related to precision.🤔

2,276

Penghui Qi · Oct 31, 2025 · 3:51 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

Replying to @DimitrisPapail

We only have A100s, which unfortunately don’t support FP8. It’s an interesting direction to try FP8; we hope the community can study precision further to enable better understanding.

1,831

Penghui Qi · Aug 8, 2025 · 11:26 AM UTC

Penghui Qi

@QPHutu

8 Aug 2025

Pretty disappointed with the @NeurIPS rebuttal.😔 Initial scores: 5, 4, 4, 3. Zero response from the two 4-score reviewers. 👌 The 3-score reviewer only wants to reject, regardless of the rebuttal.🤮

5,457

Penghui Qi · May 21, 2025 · 10:34 PM UTC

Penghui Qi

@QPHutu

21 May 2025

Thanks @_akhaliq for sharing our work. 🔥🧨🎆We believe optimizing anytime reasoning is an interesting and promising topic: it lets users control their own token budget, with the LLM providing the best-effort solution under that budget, just like what Gemini 2.5 introduced.🏹🎯

@_akhaliq

21 May 2025

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

1,243

Penghui Qi · Nov 4, 2025 · 4:16 AM UTC

Penghui Qi

@QPHutu

4 Nov 2025

Replying to @huskydogewoof @jm_alexia

Have you enabled the grad scaler? I didn't find it when searching your blog and code. Gradient underflow is a well-known issue for FP16; typically a few lines of code change can solve it. For more details 👇

Penghui Qi

@QPHutu

1 Nov 2025

4,860

Penghui Qi · Nov 1, 2025 · 3:42 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

This is crazy if true. BTW, we are using image 2, the stable one in our sanity test.

Grad

@Grad62304977

1 Nov 2025

Just to scare the timeline a bit more, I was observing that RL with the data formatted as in the first image would crash and go unstable while image 2 was fully stable

1,833

Penghui Qi · Nov 1, 2025 · 12:53 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @liliang_ren

Not yet. But I think in @fengyao1909 and @RichardYRLi 's blog, it was already proven not enough to avoid collapse. fengyao.notion.site/off-poli… yingru.notion.site/When-Spee…

Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | Notion

Feng Yao* Liyuan Liu* Dinghuai Zhang Chengyu Dong Jingbo Shang Jianfeng Gao

fengyao.notion.site

1,724

Penghui Qi · Oct 31, 2025 · 3:02 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

Another interesting plot on BF16 vs. FP16

Joey (e/λ)

@shxf0072

31 Oct 2025

BF16 / FP16 plot of the training-inference mismatch. Yes, FP16 is closer to on-policy. Also, interestingly, vllm bf16 + hf fp16 >> hf bf16 + vllm fp16

1,228

Penghui Qi · Jul 30, 2025 · 3:36 AM UTC

Penghui Qi

@QPHutu

30 Jul 2025

💩👇Can anyone tell me what algorithm this bullshit is?👇 💩 🏊I can understand that someone just wants to sell a concept quickly in a paper. It's OK if the algorithm is not perfect and you don't want to release code.🥇 But why release such bullshit?😡

1,221

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

5/5) Our methods involve tree-like generation and training. We leverage the prefix caching feature in @vllm_project for generation, and FlexAttention in @PyTorch for fast training, without computational waste. 🚀🚀 We opensource our full implementation in github.com/sail-sg/AnytimeRe…

295

Penghui Qi · Nov 1, 2025 · 2:37 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

This is very true! Thanks for the great comment!

Yi Wan @YiWan89352121

31 Oct 2025

This is great. Another good reminder that you really need to understand your entire system to make sure it works properly. Sometimes I feel that the LLM field is evolving so fast that maybe no one in the world fully understands the agents we’re building anymore.

1,542

Penghui Qi · May 23, 2025 · 2:27 PM UTC

Penghui Qi

@QPHutu

23 May 2025

To be honest, I don't think such truncation is a reasonable setting. In an online service, even if the thinking is too long and gets truncated, the model should provide the best summary so far. Simply regarding truncated responses as incorrect doesn't make sense for a product.

Junxian He @junxian_he

22 May 2025

As we all know large reasoning models are extremely overthinking even when answering 1+1. Many approaches have been proposed to mitigate this issue, but it doesn't need to be that complex -- we (and several concurrent works) found that just continuing standard RL training while truncating long responses is surprisingly effective. Why? The key is the step function reward implicitly defined in context window truncation. As such, we provide a unified view to formulate various length-based reward shaping approaches, and design several SOTA variants based on step function. We achieved +6.1 acc on AIME based on R1-distilled Qwen model while reducing token usage by 63%. I really like the visualization in the Table👇to compare various length-based reward-shaping methods . Check @WeiLiu99 post and our paper for more details!

1,468

Penghui Qi · Mar 4, 2025 · 3:29 PM UTC

Penghui Qi

@QPHutu

4 Mar 2025

@deepseek_ai adopted our V-Shape version of DualPipe in their official repository. github.com/deepseek-ai/DualP… See our blog at huggingface.co/blog/ufotalen…

414

Penghui Qi · Jul 3, 2025 · 2:36 AM UTC

Penghui Qi

@QPHutu

3 Jul 2025

Not surprising. SFT Memorizes, RL Generalizes.

Xiang Yue @xiangyue96

2 Jul 2025

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we evaluated over 20 open-weight reasoning models and found that: ➡️Only models trained with RL exhibit broad transfer of math reasoning skills to other tasks. ➡️Models trained with SFT show limited or no transfer—especially to non-reasoning domains. To quantify this, we introduce the Transferability Index (TI), which measures how much gain in math could transfer to others. A positive score indicates effective transfer; a negative one suggests loss of general capability. We evaluate the models on three benchmark categories: - Math reasoning: MATH-500, AIME24/25, Olympiad - Other reasoning: GPQA-D (Science), LiveCodeBench2 (Code), ACPBench (Agent Planning), HeadQA (Medical) - Non-reasoning: CoQA (Conversational QA), IFEval (Instruction Following), HalluEval (Hallucination), MC-TACO (Commonsense) Our findings challenge the blind pursuit of leaderboard performance in math reasoning via SFT. Simply creating more math-like SFT data may inadvertently harm a model’s broader generalization. Instead, RL appears to be key for truly transferable reasoning development.

642

Penghui Qi · May 28, 2025 · 11:46 AM UTC

Penghui Qi

@QPHutu

28 May 2025

This is interesting. There are so many biases in GRPO optimization, in addition to the length bias and difficulty bias discussed in DR.GRPO.

Stella Li ✈️ ICML🇰🇷

@StellaLisy

27 May 2025

Replying to @StellaLisy

⚙️ Looking closer into GRPO: there is a "clipping bias" that amplifies high-prior model behaviors. Code reasoning could be one of the magical behaviors for Qwen-Math💻 Empirically, we disabled clipping (fig.)-the gains disappeared‼️

350

Penghui Qi · Mar 21, 2025 · 10:09 PM UTC

Penghui Qi

@QPHutu

21 Mar 2025

So proud of being part of this great work! Thanks to all collaborators @zzlccc @Cameron_Chann @liwenjun2016 @TianyuPang1 @duchao0726 @mavenlin

Zichen Liu

@zzlccc

21 Mar 2025

🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…

276

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

2/5) GRPO (V2 in this figure) has high variance when the thinking is long. We propose BRPO to mitigate this issue, with consistently lower variance, thus resulting in more stable and efficient RL training.✅✅

493

Penghui Qi · May 20, 2025 · 3:09 PM UTC

Penghui Qi

@QPHutu

20 May 2025

Thanks Changyu for the great comment. I guess one reason why this topic is still underexplored is that it faces nontrivial engineering challenges for efficient implementation. So we open source our code to facilitate further exploration on this interesting topic.

Changyu Chen

@Cameron_Chann

20 May 2025

Really interesting work for reasoning performance given any budget: - Better than GRPO at anytime including the final performance - Provide flexibility for users to trade off the cost and performance - Great engineering effort - code optimized for tree-like generation & training

329

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

1/5) We optimize anytime reasoning by sampling the token budget for thinking from a prior distribution. This approach naturally introduces verifiable dense rewards into the objective, leading to better credit assignment in RL optimization. 🎯🎯

722

Penghui Qi · Nov 8, 2025 · 1:54 AM UTC

Penghui Qi

@QPHutu

8 Nov 2025

Replying to @D_Nohara

Yea both additional forward pass and clipping are not necessary for PG-Seq-MIS

264

Penghui Qi · Jul 1, 2025 · 11:52 PM UTC

Penghui Qi

@QPHutu

1 Jul 2025

Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average.👇

Bo Liu (Benjamin Liu)

@Benjamin_eecs

1 Jul 2025

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.2… 🧑‍💻 Code: github.com/spiral-rl/spiral

597

Penghui Qi · May 21, 2025 · 2:11 PM UTC

Penghui Qi

@QPHutu

21 May 2025

Exactly! This is precisely what we want to optimize in our paper.🎯🎯🎯

Zichen Liu

@zzlccc

21 May 2025

I believe our work offers a promising technique for building reasoners with fine-grained budget control— like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!

264

Penghui Qi · Feb 27, 2025 · 3:23 PM UTC

Penghui Qi

@QPHutu

27 Feb 2025

DualPipe could be better without the Dual! Check our blog for more information. hackmd.io/@GUk3RmbpT_eyTE6-K…

DualPipe could be better without the Dual - HackMD

DualPipe and ZeroBubble

hackmd.io

DeepSeek

@deepseek_ai

27 Feb 2025

🚀 Day 4 of #OpenSourceWeek: Optimized Parallelism Strategies ✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training. 🔗 github.com/deepseek-ai/DualP… ✅ EPLB - an expert-parallel load balancer for V3/R1. 🔗 github.com/deepseek-ai/eplb 📊 Analyze computation-communication overlap in V3/R1. 🔗 github.com/deepseek-ai/profi…

352

Penghui Qi · Jul 15, 2025 · 3:47 PM UTC

Penghui Qi

@QPHutu

15 Jul 2025

👇👇This is my favorite poster among all of our work. 🙈🙈 💡Want linearly scaled activation memory with almost zero cost in pipeline parallelism?🥊 Check out our paper for more details: arxiv.org/pdf/2503.01328

Xinyi Wan @UfotalentZju

15 Jul 2025

🚀 New at #ICML25: PipeOffload breaks the memory wall for LLM training with Pipeline Parallelism — linear activation memory scaling with zero performance loss. High memory usage is no longer a blocker. 👉 Follow our series of work pushing limits of PP!

224

Penghui Qi · May 21, 2025 · 11:03 PM UTC

Penghui Qi

@QPHutu

21 May 2025

Replying to @natolambert

This is exactly what we are doing in our paper "Optimizing Anytime Reasoning via Budget Relative Policy Optimization". By directly optimizing RL under various thinking budgets (sampled from a prior distribution), we can achieve better performance under all thinking budgets!🚀🚀

166

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

4/5) Ablations on verifiable dense rewards and our BRPO variance reduction show the effectiveness of our approach.💪💪

305

Penghui Qi · Nov 14, 2025 · 12:36 PM UTC

Penghui Qi

@QPHutu

14 Nov 2025

Replying to @lmsysorg

I guess max_new_tokens = 2048 is a bit short to make a substantial difference. The mismatch accumulates over sequence length.

177
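A back-of-the-envelope sketch of why length matters: if the training and inference log-probabilities differ by a small amount per token, the sequence-level probability ratio is the exponential of the summed gaps. The constant per-token gap below is a hypothetical worst case (all gaps with the same sign); in practice the gaps vary and partly cancel.

```python
import math

def sequence_ratio(per_token_log_gap: float, seq_len: int) -> float:
    """Sequence-level ratio pi_train(y) / pi_infer(y) for a constant per-token log-prob gap."""
    return math.exp(per_token_log_gap * seq_len)

gap = 1e-4  # hypothetical mean log-prob gap per token
for n in (2048, 8192, 32768):
    print(n, round(sequence_ratio(gap, n), 3))
```

Under this assumption the ratio stays near 1 for a couple of thousand tokens and drifts far from 1 for long reasoning traces.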

Penghui Qi · Nov 1, 2025 · 12:42 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @fabmilo

The major problem is, fp32 is too slow. We run 130h for this fp32 rollout, only got 1200 steps (even without fp32 training). The good thing is, fp32 seems quite stable, even when the initial mismatch is large (due to bf16 training).

1,502

Penghui Qi · Feb 27, 2025 · 3:11 PM UTC

Penghui Qi

@QPHutu

27 Feb 2025

DualPipe could be better without the Dual! @deepseek_ai

Xinyi Wan @UfotalentZju

27 Feb 2025

♨️DualPipe open sourced by Deepseek is hot. ❓However, DualPipe could be better without the Dual! ✅By simply cutting the schedule in half, we can remove the unnecessary parameter duplication. 💡Check our blog on hackmd.io/@ufotalent/r1lVXsa…

535

Penghui Qi · Oct 31, 2025 · 4:01 PM UTC

Penghui Qi

@QPHutu

31 Oct 2025

Replying to @Grad62304977 @natolambert

Did you enable GradScaler? Naive FP16 would suffer from gradient underflow, which makes it even worse.

207

Penghui Qi · Nov 13, 2025 · 4:36 AM UTC

Penghui Qi

@QPHutu

13 Nov 2025

Replying to @samsja19

It’s much faster?

284

Penghui Qi · May 24, 2025 · 2:12 PM UTC

Penghui Qi

@QPHutu

24 May 2025

Thanks for the comment! We're really glad to see people enjoying our work. We truly hope the community can gain some interesting insights from our research—that's the best reward for us.

Jing Wang @NickySGJingWang

23 May 2025

Some of the best reasoning-related papers I've seen in the past three months—truly impressive work. Another great contribution from the SAIL, following the success of Dr.GRPO.

210

Penghui Qi · Jan 22, 2024 · 7:13 AM UTC

Penghui Qi

@QPHutu

22 Jan 2024

Replying to @_akhaliq

Thanks for the post. Here is a playground for zero bubble schedulers: huggingface.co/spaces/sail/z…

166

Penghui Qi · May 20, 2025 · 2:21 PM UTC

Penghui Qi

@QPHutu

20 May 2025

3/5) Our method significantly outperforms GRPO under various prior distributions, in both anytime and standard reasoning tasks.🧨🧨

286

Penghui Qi · Jul 30, 2025 · 3:46 AM UTC

Penghui Qi

@QPHutu

30 Jul 2025

I feel like my intelligence has been seriously insulted. (Originally posted in Chinese; translated by GPT4.)

Penghui Qi

@QPHutu

30 Jul 2025

201

Penghui Qi · Nov 5, 2025 · 3:18 PM UTC

Penghui Qi

@QPHutu

5 Nov 2025

Replying to @huskydogewoof

Thanks for sharing. It’s interesting; to be honest, I have no idea why it’s like this. But I do agree with your takeaway: neither BF16 nor FP16 is the best for everything, and we should carefully choose the precision based on the task.

242

Penghui Qi · Nov 3, 2025 · 1:37 AM UTC

Penghui Qi

@QPHutu

3 Nov 2025

Replying to @RichardYRLi @danielhanchen

I tested on two vllm versions, 0.8.4 and 0.11.0

333

Penghui Qi · May 20, 2025 · 2:56 PM UTC

Penghui Qi

@QPHutu

20 May 2025

Anytime algorithm is an old topic and should get more attention in LLM reasoning. If you want a better test-time scaling, try anytime reasoning!

Zichen Liu

@zzlccc

20 May 2025

Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇

266

Penghui Qi · Aug 2, 2025 · 12:56 AM UTC

Penghui Qi

@QPHutu

2 Aug 2025

Really meaningful engineering effort for the community. Congrats on the great work!

Zichen Liu

@zzlccc

1 Aug 2025

In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem

261

Penghui Qi · Nov 1, 2025 · 12:48 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @ziruirayliu

Yes, we also reported a similar phenomenon (Figure 3). I think the key is using FP16 in both training and inference, instead of inference alone.

876

Penghui Qi · Nov 1, 2025 · 12:16 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @mavenlin @redtachyon

Absolutely right. Many factors can mitigate the collapse; it is less common to see the collapse on a large dataset (but it may converge more slowly). We design a sanity test (Section 4) where collapse happens a lot, and include some "real" tasks in the last 6 experiments in Figure 1. Do give it a try!

Penghui Qi · Aug 8, 2025 · 11:37 AM UTC

Penghui Qi

@QPHutu

8 Aug 2025

As a reviewer, I gave concrete suggestions (not just criticism) for each paper, responded promptly to the rebuttal, and actively participated in the discussion. I raised my score for both papers I handled eventually. Hope to meet equally responsible reviewers next time.😀

Penghui Qi

@QPHutu

8 Aug 2025

642

Penghui Qi · Nov 10, 2025 · 7:03 AM UTC

Penghui Qi

@QPHutu

10 Nov 2025

Replying to @D_Nohara

This is super nice! Thanks for sharing!

223

Penghui Qi · May 23, 2025 · 2:36 PM UTC

Penghui Qi

@QPHutu

23 May 2025

The point is, we should always decouple the generation of thinking and summary. Otherwise, the summary will be too weak for a product, because the summary is only trained on complete thinking processes and is vulnerable to incomplete ones.

Penghui Qi

@QPHutu

23 May 2025

146

Penghui Qi · Jan 23, 2024 · 2:26 AM UTC

Penghui Qi

@QPHutu

23 Jan 2024

Replying to @ddvd233 @sivil_taram

ZB-V does increase the IO transfer by one time, but the other methods (ZB-1p/ZB-2p) have exactly the same communication volume as 1F1B. Here is a rough comparison for better understanding.

102

Penghui Qi · May 23, 2025 · 2:45 PM UTC

Penghui Qi

@QPHutu

23 May 2025

Replying to @WeiLiu99 @junxian_he @yuzhenh17 @junteng88716710 @yuntiandeng @yyoraa22 @ruochenz1018

The experiments in Table 3 seem unfair to other methods, because LASER is given the chance to generate longer responses, but truncation methods are not? So the accuracy improvement is a bit misleading, if I understand correctly.

206

Penghui Qi · Nov 1, 2025 · 2:14 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @hassanience

Thanks. I don't think it is a "bug"; the tradeoff between bf16 and fp16 is well-known, but people often take it for granted that bf16 should be the default because it is better in many cases (especially for pre-training). We hope our findings can motivate a reconsideration of this tradeoff.

1,085

Penghui Qi · Nov 1, 2025 · 12:35 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @DimitrisPapail

This experiment (2k steps) took about 82 hours on a single A100. So I guess it may be about one day if using FP8 on a single H100?

297

Penghui Qi · May 23, 2025 · 2:39 PM UTC

Penghui Qi

@QPHutu

23 May 2025

For example, if we want to control the token budget just like Gemini 2.5, the model should always be able to summarize a thinking process that is cut off by the thinking budget limit.

Penghui Qi

@QPHutu

23 May 2025

244

Penghui Qi · Jul 16, 2025 · 1:34 AM UTC

Penghui Qi

@QPHutu

16 Jul 2025

Not attending #icml25 in person, but welcome any discussion about DR.GRPO and BRPO (anytime reasoning).

Zichen Liu

@zzlccc

14 Jul 2025

Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers: 1.🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a Dr. GRPO but I think the paper is more than this loss fix) 2. Main Conference Spotlight: Continual Reinforcement Learning by Planning with Online World Models; I worked on the continual learning topic for one year before diving into RL for LLMs openreview.net/pdf?id=mQeZEs… 3. AI4MATH Workshop Poster: Optimizing Anytime Reasoning via Budget Relative Policy Optimization; a dense reward training framework that controls thinking budget (a cool feature of Gemini 2.5 Pro)

212

Penghui Qi · May 24, 2025 · 4:49 AM UTC

Penghui Qi

@QPHutu

24 May 2025

Replying to @junxian_he

Thanks for clarifying. For truncation methods, I would expect a performance degradation when deploying, because of the train-eval mismatch. That's also why I think it is unreasonable. I do believe LASER has mitigated this issue thanks to its dynamic control. Thanks for the great work.

125

Penghui Qi · Jan 22, 2024 · 7:30 AM UTC

Penghui Qi

@QPHutu

22 Jan 2024

Replying to @QPHutu @_akhaliq

Code: github.com/sail-sg/zero-bubb…

GitHub - sail-sg/zero-bubble-pipeline-parallelism: Zero Bubble Pipeline Parallelism

Zero Bubble Pipeline Parallelism. Contribute to sail-sg/zero-bubble-pipeline-parallelism development by creating an account on GitHub.

github.com

Penghui Qi · Jan 23, 2024 · 3:00 AM UTC

Penghui Qi

@QPHutu

23 Jan 2024

Replying to @ddvd233 @sivil_taram

For the cases you mentioned, I think the answer is no for most parameters. Actually, fully eliminating bubbles is not always achievable. It requires enough micro-batches (typically >= 2x pipeline stages) and enough memory (typically 1.5x for ZB-V and 2x for ZB-2p).
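The rule of thumb in the reply above can be written as a small check. This is only a sketch of the stated thresholds: the function name is hypothetical, and the memory figure is treated as a multiple of a baseline the tweet does not name (assumed here to be 1F1B).

```python
# Memory multipliers quoted in the tweet; the baseline (assumed 1F1B) is an assumption.
MEMORY_FACTOR = {"ZB-V": 1.5, "ZB-2p": 2.0}

def zero_bubble_feasible(num_microbatches: int, num_stages: int,
                         memory_budget: float, schedule: str = "ZB-V") -> bool:
    """Return True if the 'typical' conditions for a bubble-free schedule are met."""
    enough_microbatches = num_microbatches >= 2 * num_stages
    enough_memory = memory_budget >= MEMORY_FACTOR[schedule]
    return enough_microbatches and enough_memory

print(zero_bubble_feasible(num_microbatches=16, num_stages=8, memory_budget=1.5))
print(zero_bubble_feasible(num_microbatches=8, num_stages=8, memory_budget=2.0, schedule="ZB-2p"))
```

Since the tweet says "typically", these thresholds are a guide for when full bubble elimination is plausible, not a guarantee.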

Penghui Qi · Nov 2, 2025 · 2:04 AM UTC

Penghui Qi

@QPHutu

2 Nov 2025

Replying to @raydelvecc

Thanks 😀

383

Penghui Qi · Nov 1, 2025 · 11:26 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @SagarJoglekar

Yes, it’s more common to see a collapse on small datasets

Penghui Qi · Jul 30, 2025 · 4:56 AM UTC

Penghui Qi

@QPHutu

30 Jul 2025

To explain why it's bullshit: 1. It does not make any sense as an RL algorithm, especially in RLVR where the reward is sparse. 2. It is totally different from the algorithm introduced in the paper.

Penghui Qi

@QPHutu

30 Jul 2025

237

Penghui Qi · Nov 1, 2025 · 12:23 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @redtachyon @mavenlin

I think so. We do use a small dataset and run many epochs to reach the theoretical 100% training accuracy (that's how we designed this sanity test). Under this sanity test, the differences between algorithms are amplified.

Penghui Qi · May 24, 2025 · 3:55 AM UTC

Penghui Qi

@QPHutu

24 May 2025

Replying to @WeiLiu99 @junxian_he @yuzhenh17 @junteng88716710 @yuntiandeng @yyoraa22 @ruochenz1018

Thanks for the detailed response. I do agree LASER has addressed or mitigated some issues in truncation methods thanks to its dynamic length rewards. For Table 3, I'm still confused about whether the eval is based on full or truncated context. It would be great to make this clear in the paper.

101

Penghui Qi · Jun 25, 2025 · 6:36 AM UTC

Penghui Qi

@QPHutu

25 Jun 2025

Really interesting and insightful paper!

Aviral Kumar

@aviral_kumar2

13 Jun 2025

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️ Website: matthewyryang.com/e3/ Paper: arxiv.org/abs/2506.09026

149

Penghui Qi · Jan 23, 2024 · 3:07 AM UTC

Penghui Qi

@QPHutu

23 Jan 2024

Replying to @ddvd233 @sivil_taram

The backward pass usually costs about 2x the time of the forward pass, because matrix multiplications usually account for most of the FLOPs in a network and the backward pass of a matrix multiplication costs exactly 2 times its forward pass.
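The 2x figure follows from counting multiply-adds. For Y = X @ W with X of shape (m, k) and W of shape (k, n), the backward pass needs two matrix products of the same size as the forward one. A minimal sketch (the shapes are arbitrary examples):

```python
def matmul_flops(rows: int, inner: int, cols: int) -> int:
    """FLOPs of a (rows x inner) @ (inner x cols) product, counting a multiply-add as 2."""
    return 2 * rows * inner * cols

m, k, n = 4096, 1024, 4096  # example shapes: X is (m, k), W is (k, n)

forward = matmul_flops(m, k, n)      # Y  = X @ W
grad_input = matmul_flops(m, n, k)   # dX = dY @ W.T
grad_weight = matmul_flops(k, m, n)  # dW = X.T @ dY
backward = grad_input + grad_weight

print(backward / forward)
```

Zero-bubble schedules exploit this split: the input-gradient and weight-gradient products are separate operations that can be scheduled independently.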

Penghui Qi · Nov 1, 2025 · 2:08 AM UTC

Penghui Qi

@QPHutu

1 Nov 2025

Replying to @ziruirayliu

yes, abs. diff in probability

Penghui Qi · Nov 2, 2025 · 2:03 AM UTC

Penghui Qi

@QPHutu

2 Nov 2025

Replying to @ziruirayliu

Agreed. We have verified the results on both A100 and H100.

Penghui Qi · Jul 15, 2025 · 4:00 PM UTC

Penghui Qi

@QPHutu

15 Jul 2025

Also check out our series of work to improve pipeline parallelism 👇👇 💡Although zero-bubble PP is the most impactful one, my favorite is the second one: PP with controllable memory, which includes a lot of insights and interesting methods to build an efficient PP.💯

Xinyi Wan @UfotalentZju

15 Jul 2025

This is a part of our Zero Bubble Pipeline Parallelism series, pushing throughput🚀, memory💾 and scalability📈of LLM training to extreme! Project link: github.com/sail-sg/zero-bubb…

161

Penghui Qi · Nov 8, 2025 · 1:50 AM UTC

Penghui Qi

@QPHutu

8 Nov 2025

Replying to @HemilDesai10

yea we use fp32 optimizer for both bf16 and fp16

127
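One common reason to keep optimizer state and master weights in FP32, sketched here with standard-library Python: an update much smaller than the weight is rounded away if the weight itself is stored in half precision. The update size and step count are hypothetical, and a Python float (64-bit) stands in for FP32.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

update = 1e-4    # hypothetical lr * grad, smaller than half the FP16 spacing near 1.0
w_fp16 = 1.0     # weight stored in FP16 after every step
w_master = 1.0   # higher-precision master copy (stand-in for FP32)

for _ in range(100):
    w_fp16 = to_fp16(w_fp16 + update)  # every update is rounded away
    w_master = w_master + update       # updates accumulate

print(w_fp16, w_master)
```

The FP16 weight never moves, while the master copy ends near 1.01; casting the master copy down to FP16 or BF16 for the forward pass avoids losing those updates.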