Researcher @SeaAIL. PhD student @NUSingapore. Working on RL, LLM Reasoning, and MLSys.

Pinned Tweet
This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔 Check out our paper for more details👇
14
76
556
49,709
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…
20
105
652
220,292
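For intuition on the BF16 vs. FP16 post above, here is a minimal sketch (not the paper's code) comparing the rounding error of the two formats on a single value. FP16 keeps 10 mantissa bits while BF16 keeps 7, so FP16 rounds more finely inside its narrower range. The value `p` and the helper names `to_fp16` / `to_bf16` are made up for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round a Python float to bfloat16 via its float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even, then drop the low 16 bits of the float32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

p = 0.7391  # hypothetical token probability
err_fp16 = abs(to_fp16(p) - p) / p  # relative rounding error in FP16
err_bf16 = abs(to_bf16(p) - p) / p  # relative rounding error in BF16
print(f"fp16 rel. error: {err_fp16:.2e}, bf16 rel. error: {err_bf16:.2e}")
```

If training and inference engines each round slightly differently at this granularity, a coarser format leaves more room for their outputs to drift apart; that is the intuition only, the paper has the actual analysis.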
Thanks for this fix. Actually, it is not quite this easy: GradScaler should be introduced to avoid gradient underflow, otherwise the performance can be even worse than BF16. See: docs.pytorch.org/docs/stable… VeRL Example: github.com/sail-sg/Precision…
I present the fix for RL training instability:
2
9
194
36,139
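A minimal sketch of the gradient underflow that GradScaler is meant to prevent, using plain Python's half-precision packing rather than PyTorch. The gradient value and the static scale are hypothetical; PyTorch's GradScaler adjusts its scale dynamically, which this sketch does not attempt.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8        # hypothetical gradient, below FP16's smallest subnormal
loss_scale = 2.0 ** 16  # hypothetical static loss scale

# Without scaling, the gradient is flushed to zero when stored in FP16.
naive = to_fp16(tiny_grad)

# With scaling, the scaled gradient is representable in FP16 ...
scaled = to_fp16(tiny_grad * loss_scale)
# ... and is unscaled in higher precision before the optimizer step.
recovered = scaled / loss_scale

print(naive, recovered)
```

The idea is the same one the loss-scaling patch relies on: multiply the loss before backward so small gradients survive FP16, then divide the gradients back out in FP32.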
⛈️ VeRL does not natively support FP16 training. A naive implementation will suffer from gradient underflow. 💊 🚀We provide a minimal patch for VeRL to enable effective FP16 training, with about 10 lines of code change.👇 ⌨️ github.com/sail-sg/Precision…
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…
4
13
115
17,676
Finally! Although it's 2.4× slower right now (I believe many optimizations are coming), the results are really promising! It is a huge step towards truly on-policy RL! Amazing work!
🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch
3
12
99
15,148
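Why batch invariance matters for bitwise consistency: floating-point addition is not associative, so a kernel whose reduction order changes with the batch layout can produce a different result for the same sequence. A toy illustration in plain Python (not vLLM or TorchTitan code):

```python
import math

vals = [0.1, 0.2, 0.3]

# Same three numbers, two different reduction orders.
left_to_right = (vals[0] + vals[1]) + vals[2]
right_to_left = vals[0] + (vals[1] + vals[2])

print(left_to_right == right_to_left)              # the two sums differ in the last bit
print(math.isclose(left_to_right, right_to_left))  # but agree to within rounding
```

Differences this small are harmless for a single forward pass, but they are exactly what keeps the KL between trainer and sampler from being 0.0.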
Indeed, many ppl never saw their bf16 training collapse, but the problem exists, as many reports show. We reproduce this instability by designing a sanity test (just like MNIST for CV) for better understanding. Large models+datasets are here👇 Give it a try, you may be surprised.
Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class ImageNet? 2. We do have experiments on a “larger” dataset X bigger model (30A3 MoE) X H100 GPUs. The performance improvement is clear.
2
7
89
19,080
👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/AnytimeRe…
2
26
78
33,475
This is exactly what we want to share with our fp16 tech report! Thanks @Grad62304977 for the great explanation.
Replying to @redtachyon
Well sort of, with just GRPO and not actually taking care of the mismatch at the algorithm level, u will encounter instability with bf16 under normal training settings like here (and as many papers for actual models like Kimi linear have mentioned). Their point is that given these instabilities occur in the long term bcs of the mismatch, patching it up with an algorithm fix is great but unclear if that just delays the instability (for now that’s what it looks like). Also even if we’re not unstable, it could turn into just lower performance (higher mismatch is sort of just training more off policy, if we can be more on policy for free then it’s likely we would get a performance boost too)
3
4
71
15,015
Huge thanks to @Grad62304977 for quickly testing out our findings on using FP16 for RL fine-tuning and confirming the results!🥇
1
1
37
5,969
More amazing progress on truly on-policy RL!💯 I believe it is a headache for the community to find a reproducible setting where the mismatch consistently causes training collapse. If so, you may want to check this sanity test. Link to this dataset👇 huggingface.co/datasets/sail…
💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't found a significant difference from the baseline yet. We're calling on the community: Send us your reproducible training collapse examples! We want to see them 🤣🤣
3
8
75
11,605
Many thanks for these exciting results. I’ve been waiting all weekend for someone to reproduce them, and I’m thrilled they’re here.
Replying to @redtachyon
Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.
2
29
8,460
Thank you @karpathy for finding our paper interesting. This is very encouraging.
Replying to @MarFot78 @zzlccc
I think if you zoomed into the paper too you’d find it just as if not more interesting.
2
27
7,264
Hi @RichardYRLi , I tried this disable_cascade_attn many times, including the latest vllm version. But unfortunately it made no difference in our experiments. So I guess it really depends on the setting.
1
1
25
6,042
This was a wonderful team effort, huge thanks to my amazing collaborators @zzlccc @NickZhou523786 @TianyuPang1 @duchao0726 @mavenlin
1
1
23
5,017
Excited to share that our paper, Pipeline Parallelism with Controllable Memory, was accepted by #NeurIPS2024. Code: github.com/sail-sg/zero-bubb… Paper: arxiv.org/pdf/2405.15362
6
18
2,129
Big news👇 VeRL just released v0.6.1, which natively supports FP16 for both FSDP and Megatron.🔥 Big thanks to @zhu_xuekai and the Megatron guys for their awesome reproduction and code contribution!💯 Let's make the open-source community even better!💪
Recently, we @QPHutu collaborated with VERL maintainers to add a new feature in v0.6.1: 🚀FP16 training (FSDP) & inference, mitigating the mismatch issue. Give it a try! PR: github.com/volcengine/verl/p… 😆PS: The open-source community is awesome!
1
4
38
6,408
Wonderful to see this curve! It aligns with our observation: when running many epochs on a small dataset, training consistently collapses (the sanity test in Section 4). But it is definitely not overfitting, because it is the training reward that goes down, not only the validation reward.
I did indeed encounter a collapse in training reward during RL on H-series GPUs, but that only occurred when training for >10 epochs with a small dataset (e.g., 3k). I attributed this collapse to overfitting. I hadn't considered that it might also be related to precision.🤔
3
16
2,276
Replying to @DimitrisPapail
we only have A100, which doesn’t support fp8 unfortunately. It’s an interesting direction to try fp8, we hope the community could do more study on the precision to enable better understanding
2
16
1,831
Pretty disappointed with the @NeurIPS rebuttal.😔 Initial scores: 5, 4, 4, 3. Zero response from two 4-score reviewers. 👌 The 3-score reviewer only wants to reject, regardless of the rebuttal.🤮
2
1
15
5,457
Thanks @_akhaliq for sharing our work. 🔥🧨🎆We believe optimizing anytime reasoning is an interesting and promising topic: users control their own token budget, and LLMs provide the best-effort solution under that budget, just like what Gemini 2.5 introduced.🏹🎯
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
2
14
1,243
Have you enabled grad scaler? I didn't find it when searching in your blog and code. Gradient underflow is a well-known issue for FP16, and typically a few lines of code change can solve this. For more details 👇
Thanks for this fix. Actually, it is not quite this easy: GradScaler should be introduced to avoid gradient underflow, otherwise the performance can be even worse than BF16. See: docs.pytorch.org/docs/stable… VeRL Example: github.com/sail-sg/Precision…
1
2
13
4,860
This is crazy if true. BTW, we are using image 2, the stable one in our sanity test.
Just to scare the timeline a bit more, I was observing that RL with the data formatted as in the first image would crash and go unstable while image 2 was fully stable
11
1,833
Another interesting plot on BF16 vs. FP16
bf16 / fp16 plot: training and inference mismatch. Yes, fp16 is close to on-policy. Also interestingly, vllm bf16 + hf fp16 >> hf bf16 + vllm fp16
1
10
1,228
💩👇Can anyone tell me what algorithm this bullshit is?👇 💩 🏊I can understand that someone just wants to sell a concept quickly in the paper. It's ok if the algorithm is not perfect and you don't want to release code.🥇 But why would you release such bullshit?😡
2
9
1,221
5/5) Our methods involve tree-like generation and training. We leverage the prefix caching feature in @vllm_project for generation, and FlexAttention in @PyTorch for fast training, without computational waste. 🚀🚀 We open-source our full implementation at github.com/sail-sg/AnytimeRe…
1
8
295
This is very true! Thanks for the great comment!
This is great. Another good reminder that you really need to understand your entire system to make sure it works properly. Sometimes I feel that the LLM field is evolving so fast that maybe no one in the world fully understands the agents we’re building anymore.
8
1,542
To be honest, I don't think such truncation is a reasonable setting. In an online service, even if the thinking is too long and gets truncated, it should provide the best summary so far. Simply regarding them as incorrect doesn't make sense for a product.
As we all know, large reasoning models overthink heavily, even when answering 1+1. Many approaches have been proposed to mitigate this issue, but it doesn't need to be that complex -- we (and several concurrent works) found that just continuing standard RL training while truncating long responses is surprisingly effective. Why? The key is the step function reward implicitly defined in context window truncation. As such, we provide a unified view to formulate various length-based reward shaping approaches, and design several SOTA variants based on step function. We achieved +6.1 acc on AIME based on R1-distilled Qwen model while reducing token usage by 63%. I really like the visualization in the Table👇to compare various length-based reward-shaping methods. Check @WeiLiu99's post and our paper for more details!
1
7
1,468
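The "step function reward implicitly defined in context window truncation" from the quoted post can be sketched as follows. This is only an illustration under stated assumptions: a binary correctness reward, and a truncated response always counted as incorrect. The function name and arguments are hypothetical, not taken from the paper's code.

```python
def truncated_reward(is_correct: bool, num_tokens: int, max_tokens: int) -> float:
    """Reward under hard truncation: a step function of response length.

    Responses longer than max_tokens are cut off before the answer and
    scored as incorrect; shorter ones keep their correctness reward.
    """
    if num_tokens > max_tokens:
        return 0.0  # truncated, so treated as a failure regardless of content
    return 1.0 if is_correct else 0.0

# A correct answer keeps its reward only if it fits in the window.
print(truncated_reward(True, 900, 1000))   # fits in the window
print(truncated_reward(True, 1200, 1000))  # would have been right, but truncated
```

This is also the setting the reply above objects to: the second case gets no credit even though a best-effort summary of the partial thinking might have been correct.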
@deepseek_ai adopted our V-Shape version of DualPipe in their official repository. github.com/deepseek-ai/DualP… See our blog in huggingface.co/blog/ufotalen…
3
7
414
Not surprising. SFT Memorizes, RL Generalizes.
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we evaluated over 20 open-weight reasoning models and found that: ➡️Only models trained with RL exhibit broad transfer of math reasoning skills to other tasks. ➡️Models trained with SFT show limited or no transfer—especially to non-reasoning domains. To quantify this, we introduce the Transferability Index (TI), which measures how much gain in math could transfer to others. A positive score indicates effective transfer; a negative one suggests loss of general capability. We evaluate the models on three benchmark categories: - Math reasoning: MATH-500, AIME24/25, Olympiad - Other reasoning: GPQA-D (Science), LiveCodeBench2 (Code), ACPBench (Agent Planning), HeadQA (Medical) - Non-reasoning: CoQA (Conversational QA), IFEval (Instruction Following), HalluEval (Hallucination), MC-TACO (Commonsense) Our findings challenge the blind pursuit of leaderboard performance in math reasoning via SFT. Simply creating more math-like SFT data may inadvertently harm a model’s broader generalization. Instead, RL appears to be key for truly transferable reasoning development.
7
642
This is interesting. So many biases in GRPO optimization, in addition to the length bias and difficulty bias discussed in DR.GRPO.
Replying to @StellaLisy
⚙️ Looking closer into GRPO: there is a "clipping bias" that amplifies high-prior model behaviors. Code reasoning could be one of the magical behaviors for Qwen-Math💻 Empirically, we disabled clipping (fig.)-the gains disappeared‼️
7
350
So proud of being part of this great work! Thanks to all collaborators @zzlccc @Cameron_Chann @liwenjun2016 @TianyuPang1 @duchao0726 @mavenlin
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
1
7
276
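A rough sketch of the length bias mentioned in the quoted post, assuming the usual description of it: GRPO averages each response's token losses over that response's own length, so the same negative advantage is spread thinner over a longer wrong answer. Dividing by a constant instead removes that dependence. Function names and numbers are illustrative, not from the released code.

```python
def per_response_mean_weight(advantage: float, length: int) -> float:
    """Per-token gradient weight when the loss is averaged over the response's own length."""
    return advantage / length

def constant_norm_weight(advantage: float, norm_const: int) -> float:
    """Per-token gradient weight when every response is divided by the same constant."""
    return advantage / norm_const

adv_wrong = -1.0                 # hypothetical advantage of an incorrect response
short_len, long_len = 100, 1000  # hypothetical response lengths in tokens

# Length-normalized: the long wrong answer is penalized 10x less per token.
print(per_response_mean_weight(adv_wrong, short_len),
      per_response_mean_weight(adv_wrong, long_len))

# Constant normalization: both are penalized equally per token.
print(constant_norm_weight(adv_wrong, 1000),
      constant_norm_weight(adv_wrong, 1000))
```

Under the length-normalized form, making a wrong answer longer is a cheap way to reduce its per-token penalty, which is one candidate explanation for ever-increasing output length.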
2/5) GRPO (V2 in this figure) has high variance when the thinking is long. We propose BRPO to mitigate this issue, with consistently lower variance, thus resulting in more stable and efficient RL training.✅✅
1
1
6
493
Thanks Changyu for the great comment. I guess one reason why this topic is still underexplored is that it faces nontrivial engineering challenges for efficient implementation. So we open source our code to facilitate further exploration on this interesting topic.
Really interesting work for reasoning performance given any budget: - Better than GRPO at anytime including the final performance - Provide flexibility for users to trade off the cost and performance - Great engineering effort - code optimized for tree-like generation & training
5
329
1/5) We optimize anytime reasoning by sampling the token budget for thinking from a prior distribution. This approach naturally introduces verifiable dense rewards into the objective, leading to better credit assignment in RL optimization. 🎯🎯
1
1
5
722
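A minimal sketch of the idea in 1/5: draw the thinking budget from a prior, score the best-effort answer at that budget, and take the expectation. Everything here is hypothetical (the budgets, the uniform prior, and the `correct_at_budget` stand-in); it shows why rewards at several budgets are denser than a single final reward, not how BRPO is implemented.

```python
import random

def anytime_objective(correct_at_budget, budgets, probs):
    """Expected reward when the thinking budget is drawn from a prior over budgets."""
    return sum(p * correct_at_budget(b) for b, p in zip(budgets, probs))

def sample_budget(budgets, probs, rng):
    """Draw one thinking budget from the prior."""
    return rng.choices(budgets, weights=probs, k=1)[0]

budgets = [1000, 2000, 4000, 8000]  # hypothetical token budgets
probs = [0.25, 0.25, 0.25, 0.25]    # hypothetical uniform prior

# Stand-in for "truncate thinking at b tokens, summarize, verify the answer":
# this imaginary trajectory only becomes correct after 3000 thinking tokens.
def correct_at_budget(b):
    return 1.0 if b >= 3000 else 0.0

print(anytime_objective(correct_at_budget, budgets, probs))
print(sample_budget(budgets, probs, random.Random(0)))
```

Each budget yields its own verifiable reward, so a trajectory that gets to the answer earlier scores higher than one that only gets there at the very end.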
Replying to @D_Nohara
Yea both additional forward pass and clipping are not necessary for PG-Seq-MIS
1
5
264
Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average.👇
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.2… 🧑‍💻 Code: github.com/spiral-rl/spiral
5
597
Exactly! This is precisely what we want to optimize in our paper.🎯🎯🎯
I believe our work offers a promising technique for building reasoners with fine-grained budget control— like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!
5
264
DualPipe could be better without the Dual! Check our blog for more information. hackmd.io/@GUk3RmbpT_eyTE6-K…
🚀 Day 4 of #OpenSourceWeek: Optimized Parallelism Strategies ✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training. 🔗 github.com/deepseek-ai/DualP… ✅ EPLB - an expert-parallel load balancer for V3/R1. 🔗 github.com/deepseek-ai/eplb 📊 Analyze computation-communication overlap in V3/R1. 🔗 github.com/deepseek-ai/profi…
5
352
👇👇This is my favorite poster among all of our work. 🙈🙈 💡Want linearly scaled activation memory with almost zero cost in pipeline parallelism?🥊 Check out our paper for more details: arxiv.org/pdf/2503.01328
🚀 New at #ICML25: PipeOffload breaks the memory wall for LLM training with Pipeline Parallelism — linear activation memory scaling with zero performance loss. High memory usage is no longer a blocker. 👉 Follow our series of work pushing limits of PP!
1
1
5
224
Replying to @natolambert
This is exactly what we are doing in our paper "Optimizing Anytime Reasoning via Budget Relative Policy Optimization". By directly optimizing RL under various thinking budgets (sampled from a prior distribution), we can achieve better performance under all thinking budgets!🚀🚀
5
166
4/5) Ablations on verifiable dense rewards and our BRPO variance reduction show the effectiveness of our approach.💪💪
1
1
5
305
Replying to @lmsysorg
I guess max_new_tokens = 2048 is a bit short to make a substantial difference. The mismatch will accumulate over seq length.
1
7
177
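A back-of-the-envelope sketch of why the mismatch grows with sequence length: the sequence-level probability ratio is the product of per-token ratios, so per-token log-prob gaps add up. The constant, same-sign gap used here is a worst-case assumption for illustration; real gaps vary in sign and size.

```python
import math

def sequence_ratio(per_token_log_gap: float, num_tokens: int) -> float:
    """Sequence-level probability ratio if every token has the same log-prob gap."""
    return math.exp(per_token_log_gap * num_tokens)

gap = 1e-4  # hypothetical per-token log-prob gap between trainer and sampler

for length in (2048, 16384):
    print(length, sequence_ratio(gap, length))
```

With the same per-token gap, the longer rollout ends up much further from ratio 1.0, which is why short generation limits can hide the problem.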
Replying to @fabmilo
The major problem is, fp32 is too slow. We run 130h for this fp32 rollout, only got 1200 steps (even without fp32 training). The good thing is, fp32 seems quite stable, even when the initial mismatch is large (due to bf16 training).
4
1,502
DualPipe could be better without the Dual! @deepseek_ai
♨️DualPipe open sourced by Deepseek is hot. ❓However, DualPipe could be better without the Dual! ✅By simply cutting the schedule in half, we can reduce the unnecessary parameter duplication. 💡Check our blog on hackmd.io/@ufotalent/r1lVXsa…
4
535
Do you enable GradScaler? A naive FP16 would suffer from gradient underflow which makes it even worse.
1
4
207
Replying to @samsja19
it’s much faster?
1
4
284
Thanks for the comment! We're really glad to see people enjoying our work. We truly hope the community can gain some interesting insights from our research—that's the best reward for us.
Some of the best reasoning-related papers I've seen in the past three months—truly impressive work. Another great contribution from the SAIL, following the success of Dr.GRPO.
4
210
Replying to @_akhaliq
Thanks for the post. Here is a playground for zero bubble schedulers: huggingface.co/spaces/sail/z…
1
4
166
3/5) Our method significantly outperforms GRPO under various prior distributions, in both anytime and standard reasoning tasks.🧨🧨
1
4
286
I feel like my intelligence has been seriously insulted. (Translated by GPT4.)
💩👇Can anyone tell me what algorithm this bullshit is?👇 💩 🏊I can understand that someone just wants to sell a concept quickly in the paper. It's ok if the algorithm is not perfect and you don't want to release code.🥇 But why would you release such bullshit?😡
1
4
201
Replying to @huskydogewoof
Thanks for sharing. It’s interesting; to be honest, I have no idea why it’s like this. But I do agree with your takeaway: neither BF16 nor FP16 is the best for everything, and we should carefully choose the precision based on the task.
1
3
242
I tested on two vllm versions, 0.8.4 and 0.11.0
1
3
333
Anytime algorithms are an old topic and should get more attention in LLM reasoning. If you want better test-time scaling, try anytime reasoning!
Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇
3
266
Really meaningful engineering effort for the community. Congrats on the great work!
In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem
3
261
Replying to @ziruirayliu
Yes we also reported similar phenomenon (Figure 3). I think the key is using fp16 in both training and inference, instead of solely inference.
1
1
3
876
Absolutely right. Many factors can mitigate the collapse, and collapse is seen less often on a large dataset (but it may converge slower). We design a sanity test (Section 4) where collapse happens a lot, and some "real" tasks in the last 6 exp in Figure 1. Do give it a try!
1
3
84
As a reviewer, I gave concrete suggestions (not just criticism) for each paper, responded promptly to the rebuttal, and actively participated in the discussion. I raised my score for both papers I handled eventually. Hope to meet equally responsible reviewers next time.😀
Pretty disappointed with the @NeurIPS rebuttal.😔 Initial scores: 5, 4, 4, 3. Zero response from two 4-score reviewers. 👌 The 3-score reviewer only wants to reject, regardless of the rebuttal.🤮
2
3
642
Replying to @D_Nohara
This is super nice! Thanks for sharing!
3
223
The point is, we should always decouple the generation of thinking and summary. Otherwise, the summary will be too weak for a product, because the summary is only trained on complete thinking processes and is vulnerable to incomplete thinking.
To be honest, I don't think such truncation is a reasonable setting. In an online service, even if the thinking is too long and gets truncated, it should provide the best summary so far. Simply regarding them as incorrect doesn't make sense for a product.
2
146
ZB-V does increase the io transfer by one time, but other methods (ZB-1p/ZB-2p) have exactly the same communication volume as 1F1B. Here is a rough comparison for better understanding.
1
1
2
102
The experiments in table 3 seem unfair to other methods, because LASER is given a chance to generate longer responses, but truncation methods are not? So the accuracy improvement is a bit misleading if I understand correctly.
1
2
206
Replying to @hassanience
Thanks. I don't think it is a "bug". The tradeoff between bf16 and fp16 is well-known, but ppl often take it for granted to use bf16 by default because it is better in many cases (especially for pre-training). We hope our findings can motivate a reconsideration of this tradeoff.
1
2
1,085
Replying to @DimitrisPapail
This experiment (2k steps) took about 82 hours on a single A100. So I guess it may be about one day if using FP8 on a single H100?
1
2
297
For example, if we want to control the token budget just like Gemini 2.5, the model should always be able to summarize thinking that is left incomplete by the thinking budget limit.
To be honest, I don't think such truncation is a reasonable setting. In an online service, even if the thinking is too long and gets truncated, it should provide the best summary so far. Simply regarding them as incorrect doesn't make sense for a product.
2
244
Not attending #icml25 in person, but welcome any discussion about DR.GRPO and BRPO (anytime reasoning).
Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers: 1.🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a Dr. GRPO but I think the paper is more than this loss fix) 2. Main Conference Spotlight: Continual Reinforcement Learning by Planning with Online World Models; I worked on the continual learning topic for one year before diving into RL for LLMs openreview.net/pdf?id=mQeZEs… 3. AI4MATH Workshop Poster: Optimizing Anytime Reasoning via Budget Relative Policy Optimization; a dense reward training framework that controls thinking budget (a cool feature of Gemini 2.5 Pro)
1
2
212
Replying to @junxian_he
Thanks for clarifying. For truncation methods, I would expect a performance degradation at deployment, because of the train-eval mismatch. That's also why I think it is unreasonable. I do believe LASER has mitigated this issue due to dynamic control. Thanks for the great work.
2
125
For the cases you mentioned, I think the answer is no for most parameters. Actually, fully eliminating bubbles is not always achievable. It requires enough micro-batches (typically >= 2x pipeline stages) and enough memory (typically 1.5x for ZB-V and 2x for ZB-2p).
1
39
Replying to @raydelvecc
Thanks 😀
1
383
Replying to @SagarJoglekar
yes, it’s more common to see a collapse on small datasets
1
37
To explain why it's bullshit: 1. It does not make any sense as an RL algorithm, especially in RLVR where reward is sparse. 2. It is totally different from the algorithm introduced in the paper.
💩👇Can anyone tell me what algorithm this bullshit is?👇 💩 🏊I can understand that someone just wants to sell a concept quickly in the paper. It's ok if the algorithm is not perfect and you don't want to release code.🥇 But why would you release such bullshit?😡
1
1
237
I think so. We do use a small dataset and run many epochs to achieve the theoretically 100% training accuracy (that's how we design this sanity test). Under this sanity test, the differences between algorithms would be amplified.
1
42
Thanks for the detailed response. I do agree LASER has addressed or mitigated some issues in truncation methods due to its dynamic length rewards. For table 3, I'm still confused about whether the eval is based on full or truncated context. It would be great to make it clear in the paper.
1
1
101
Really interesting and insightful paper!
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️ Website: matthewyryang.com/e3/ Paper: arxiv.org/abs/2506.09026
1
149
The backward pass usually costs about 2x the time of the forward pass, because matrix multiplications usually account for most of the FLOPs in a network, and the backward pass of a matrix multiplication costs exactly 2 times the forward pass.
1
1
49
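The 2x figure above follows from counting matmuls for a single linear layer: the forward pass does one, the backward pass does two (one for the input gradient, one for the weight gradient), all of the same size. A small FLOP-counting sketch, with made-up shapes and the usual convention of 2 FLOPs per multiply-add:

```python
def linear_flops(m: int, k: int, n: int) -> dict:
    """FLOP counts for Y = X @ W with X of shape (m, k) and W of shape (k, n)."""
    forward = 2 * m * k * n      # Y  = X @ W
    grad_input = 2 * m * n * k   # dX = dY @ W.T
    grad_weight = 2 * k * m * n  # dW = X.T @ dY
    return {"forward": forward, "backward": grad_input + grad_weight}

flops = linear_flops(m=4096, k=1024, n=4096)  # hypothetical shapes
print(flops["backward"] / flops["forward"])
```

Splitting the backward pass into those two matmuls is also what lets zero-bubble schedules reorder the weight-gradient part independently of the input-gradient part.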
Replying to @ziruirayliu
yes, abs. diff in probability
1
1
78
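One way the "abs. diff in probability" metric from the reply above could be computed, given per-token log-probs of the sampled tokens from the training engine and the inference engine. The function name and the toy numbers are hypothetical; this is a sketch, not the paper's evaluation code.

```python
import math

def mean_abs_prob_diff(train_logprobs, infer_logprobs):
    """Mean absolute difference in token probability between two engines."""
    if len(train_logprobs) != len(infer_logprobs) or not train_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    diffs = [abs(math.exp(a) - math.exp(b))
             for a, b in zip(train_logprobs, infer_logprobs)]
    return sum(diffs) / len(diffs)

# Toy example: the engines disagree on the first token only.
train_lp = [math.log(0.50), math.log(0.25)]
infer_lp = [math.log(0.40), math.log(0.25)]
print(mean_abs_prob_diff(train_lp, infer_lp))
```

Measuring in probability space rather than log space keeps very unlikely tokens from dominating the average.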
Replying to @ziruirayliu
Agreed. We have verified the results on both A100 and H100.
1
63
Also check out our series of work to improve pipeline parallelism 👇👇 💡Although zero-bubble PP is the most impactful one, my favorite is the second one: PP with controllable memory, which includes a lot of insights and interesting methods to build an efficient PP.💯
This is a part of our Zero Bubble Pipeline Parallelism series, pushing throughput🚀, memory💾 and scalability📈of LLM training to extreme! Project link: github.com/sail-sg/zero-bubb…
1
161
Replying to @HemilDesai10
yea we use fp32 optimizer for both bf16 and fp16
1
127