Yingru Li · May 15, 2026 · 2:50 AM UTC

Yingru Li

Pinned Tweet

Yingru Li

@RichardYRLi

May 15

Finally released. More to come.

xAI

@xai

May 14

An early beta of Grok Build, an agentic CLI for coding, building apps, and automating workflows is now available for SuperGrok Heavy subscribers. Through this early beta, we will improve the model and product based on your feedback. Try it at x.ai/cli

971

2,831,756

Yingru Li · Nov 2, 2025 · 5:43 AM UTC

Yingru Li

@RichardYRLi

2 Nov 2025

@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug we detailed, which was particularly problematic on A100s: On A100 GPUs (and L20s), under certain batch/seq lengths, the kernel triggers its split_kv path. This path had a bug that incorrectly transposed the LSE layout. This corrupted tensor caused a "complete precision collapse" in Cascade Attention, which is what we measured as that massive mismatch KL. Good luck with the FP16/BF16 tests, I'd love to know what you find! yingru.notion.site/When-Spee…

Daniel Han

@danielhanchen

2 Nov 2025

Replying to @_arohan_

:) Original plots come from nitter.app/RichardYRLi/status/197… - also their blog is super good! - still unsure if the FP16 vs BF16 debate is due to hardware issues due to FP32 accumulation sizes - planning to run some experiments!

339

122,824
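
Editor's sketch for the reply above: a minimal illustration of why a corrupted LSE (log-sum-exp) output matters when attention over a split KV range is merged. This is plain Python with hypothetical function names, not the FlashAttention-2 or vLLM code. The "bug" here is simulated by merging one query with another query's LSE values, which only stands in for the layout error described in the tweet.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV chunk; returns (output, lse)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z
           for d in range(len(values[0]))]
    return out, m + math.log(z)  # lse = log(sum(exp(scores)))

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial outputs using their LSE values as log-weights."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

# Toy data (illustrative values only).
queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

full = [partial_attention(q, keys, values)[0] for q in queries]
part_a = [partial_attention(q, keys[:2], values[:2]) for q in queries]
part_b = [partial_attention(q, keys[2:], values[2:]) for q in queries]

# Correct merge: each query uses its own LSE values.
merged = [merge(a[0], a[1], b[0], b[1]) for a, b in zip(part_a, part_b)]

# Simulated layout bug: query 1 is merged with query 0's LSE values.
corrupted = merge(part_a[1][0], part_a[0][1], part_b[1][0], part_b[0][1])
```

With the correct LSE values the merged result matches full attention up to rounding. With the mis-indexed LSE values the output for the second query is visibly wrong, even though each partial output was computed correctly.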

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(1/x) Ever had your #LLM-#RL training mysteriously collapse? 📉 You're not alone. We saw #agentic RL runs fail with exploding #gradients, and found the culprit: a fundamental "training-inference mismatch." Our new #blog post demystifies this vicious cycle. yingru.notion.site/When-Spee…

332

49,009

Yingru Li · Oct 2, 2025 · 4:45 PM UTC

Yingru Li

@RichardYRLi

2 Oct 2025

Inspired by @thinkymachines 's "#LoRA Without Regret" post, I formalized their insight that policy gradient learns ~1 bit per episode via Bayesian #RL formulation. I prove this is a hard information-theoretic ceiling and extend the analysis to actor-critic methods. Full writeup with theorems: richardli.xyz/post/informati… 🧵

Information Bandwidth in Reinforcement Learning | Yingru Li

An information-theoretic analysis showing that scalar advantage formulations learn ≤ log₂(B) bits per episode, while per-timestep advantages preserve full reward entropy.

richardli.xyz

Thinking Machines

@thinkymachines

29 Sep 2025

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…

312

49,260

Yingru Li · Nov 11, 2025 · 3:28 AM UTC

Yingru Li

@RichardYRLi

11 Nov 2025

1/ Great thread by @IdanShenfeld on Policy Mirror Descent (PMD) and its likely use in models like Kimi K2. It's a powerful technique for stabilizing RL. We'd like to highlight our NeurIPS 2019 work which was one of the first to frame policy optimization as a mirror descent problem and demonstrate its effectiveness in deep RL, where our method significantly outperformed the PPO baseline. Our paper: "Divergence-Augmented Policy Optimization" with Professor Tong Zhang. proceedings.neurips.cc/paper…

idan shenfeld

@IdanShenfeld

9 Nov 2025

Everyone’s talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from Kimi K2\1.5 reports, it likely uses Policy Mirror Descent - an RL trick that’s quietly becoming standard in frontier labs. Let’s break down what it is:

9,893

Yingru Li · Oct 14, 2024 · 1:16 AM UTC

Yingru Li

@RichardYRLi

14 Oct 2024

Discover theoretical advancements and applications in #GenAI Reasoning & Agents at #INFORMS2024! 🚀 Sessions on: #LLM Agents, RL, Exploration, Alignment, Gene-Editing, Math Reasoning & more! Oct 20 & 21, Summit-342, Seattle Convention Center. @Zanette_ai @RichardYRLi @MingYin_0312 @g_k_swamy @JoeyHejna @KaixuanHuang1 @weixiong_1 @chanwoopark20

7,647

Yingru Li · Oct 7, 2025 · 3:34 PM UTC

Yingru Li

@RichardYRLi

7 Oct 2025

Replying to @agarwl_

Great thread on the "training-inference mismatch," which our post detailed last month. The core issue is that the severe mismatch on low-probability tokens creates an off-policy problem, making sequence-level Importance Sampling the principled fix. Simpler ideas you mentioned, like an fp32 lm_head, unfortunately don't work—we tested it. The good news: based on this, the vLLM/veRL & SGLang/SLIME teams are on it! 🚀 Full deep dive: yingru.notion.site/When-Spee…

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch | Notion

Authors: Jiacai Liu* Yingru Li*† Yuqian Fu Jiawei Wang Qian Liu Yu Shen†

yingru.notion.site

2,383

Yingru Li · Oct 20, 2025 · 5:55 PM UTC

Yingru Li

@RichardYRLi

20 Oct 2025

Spot on. The core challenge is still sample-efficient and scalable exploration for sparse rewards. This is why work building on ideas like Bootstrapped DQN, Ensemble Sampling and EpiNet—such as the new HyperAgent—is so important. arxiv.org/abs/2402.10228

Peter Henderson

@PeterHndrsn

20 Oct 2025

We're still (mostly) in the massively low sample regime for LLMs+RL. Recall that Montezuma's Revenge and other sparse reward settings would take O(1-10M) steps/frames before starting to actually learn anything at all.

3,819

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(4/x) We were shocked to find that hardware is a first-order variable. The exact same code can succeed on one GPU but fail spectacularly on another. Our tests showed mismatch levels followed H20 < L20 < A100. Results aren't always portable! #MLOps #GPU

2,166

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(2/x) Why does agentic RL crash? The chain reaction often starts with tool use. Tool outputs are Out-of-Distribution (OOD) context for the LLM. This unfamiliar input makes the policy uncertain, causing it to sample more low-probability tokens—the first domino to fall.

5,441

Yingru Li · Jul 21, 2024 · 10:26 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[1/7] Want to save #GPU and #Data budget, addressing #AI #risk? Join us @icmlconf, a group of researchers from @cuhksz, focusing on #optimization, #RL, and #LLMs. #ICML2024

1,613

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(3/x) Here's the smoking gun: the training-inference mismatch isn't uniform. It's catastrophically worse for these exact low-probability tokens. The training framework assigns them near-zero probability, while the inference engine saw them as merely unlikely. This massive divergence causes collapse.

2,788

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(9/x) This is an active area of research for us. The training-inference mismatch is a deep problem, especially with the rise of #MoE models and more complex #agentic #systems. We will be continually updating our blog with new findings. Follow us and stay tuned for future updates!

1,259

Yingru Li · Nov 6, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

6 Nov 2025

🚨 UPDATE to the "1 bit per episode" analysis (inspired by @johnschulman's post at @thinkymachines ): After discussion with @mgostIH, I need to point out that the limit only applies to *scalar advantage*! REINFORCE with per-timestep advantages can learn O(T) bits when rewards are dense and independent.

Yingru Li

@RichardYRLi

2 Oct 2025

2,536

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(7/x) Key Takeaways for Practitioners: 1️⃣ Mismatch is a fundamental speed vs. stability trade-off. 2️⃣ Monitor vllm-kl & gradient norms. 3️⃣ Beware of low-prob tokens from OOD inputs (tool use!). 4️⃣ Hardware is a key variable. 5️⃣ Use sequence-level IS as the default fix.

1,509
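
Editor's sketch for takeaway 2 above: one way to monitor a mismatch KL from per-token log-probabilities. The tweet does not define the "vllm-kl" metric, so the estimators below (the plain sample mean and the non-negative "k3" form) are assumptions, and the function name is hypothetical. Both estimate KL(rollout || training) from tokens that were sampled by the rollout engine.

```python
import math

def mismatch_kl(rollout_logprobs, train_logprobs):
    """Monte Carlo estimates of KL(pi_rollout || pi_train) on sampled tokens."""
    assert len(rollout_logprobs) == len(train_logprobs)
    n = len(rollout_logprobs)
    # Upcast before subtracting; with tensors this would be a cast to float32.
    log_ratios = [float(t) - float(r)
                  for r, t in zip(rollout_logprobs, train_logprobs)]
    # k1: mean of log(pi_rollout / pi_train); unbiased but can be negative.
    k1 = -sum(log_ratios) / n
    # k3: mean of (ratio - 1 - log ratio); always non-negative.
    k3 = sum(math.exp(lr) - 1.0 - lr for lr in log_ratios) / n
    return k1, k3

# Illustrative values: the last token is "merely unlikely" for the rollout
# engine but near-zero probability for the training framework.
rollout_lp = [-0.1, -2.0, -7.0]
train_lp = [-0.1, -2.1, -12.0]
k1, k3 = mismatch_kl(rollout_lp, train_lp)
```

In this toy example almost all of the estimate comes from the single low-probability token, which matches the qualitative picture in the thread.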

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(8/x) The training-inference mismatch is a critical challenge for the future of agentic AI. We documented all our findings, effective and ineffective, to help others navigate this issue. Read the full investigation: yingru.notion.site/When-Spee… By Jiacai Liu @RichardYRLi @Chert_Fu @JarvisMSUst @sivil_taram Yu Shen #LLMRL #Research

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch | Notion

Authors: Jiacai Liu* Yingru Li*† Yuqian Fu Jiawei Wang Qian Liu Yu Shen†

yingru.notion.site

2,110

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(6/x) The principled solution is Sequence-Level Importance Sampling. It provides an unbiased gradient correction by accounting for the entire trajectory's probability, restoring training stability. We also propose Masked Importance Sampling (MIS), which further improves performance.

3,205
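
Editor's sketch for the tweet above: how a sequence-level importance ratio differs from per-token ratios. The function names, the cap value, and the exact form of the truncated and masked variants are assumptions for illustration; the tweet names Masked Importance Sampling (MIS) but does not define it.

```python
import math

def token_level_ratios(train_lp, rollout_lp):
    """One ratio per token: pi_train(y_t | ...) / pi_rollout(y_t | ...)."""
    return [math.exp(t - r) for t, r in zip(train_lp, rollout_lp)]

def sequence_level_ratio(train_lp, rollout_lp):
    """A single ratio for the whole trajectory: pi_train(y) / pi_rollout(y)."""
    return math.exp(sum(train_lp) - sum(rollout_lp))

def truncated_weight(ratio, cap=2.0):
    """Assumed truncated variant: clip large ratios to bound the variance."""
    return min(ratio, cap)

def masked_weight(ratio, cap=2.0):
    """Assumed masked variant: drop the sequence when the ratio exceeds cap."""
    return ratio if ratio <= cap else 0.0

# Illustrative per-token log-probabilities for one sampled response.
train_lp = [-0.2, -0.5, -9.0]
rollout_lp = [-0.2, -0.5, -6.0]

tok = token_level_ratios(train_lp, rollout_lp)
seq = sequence_level_ratio(train_lp, rollout_lp)
```

The sequence-level ratio is the product of the token-level ratios, so a single badly mismatched token rescales the gradient contribution of the entire trajectory, not only of that token.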

Yingru Li · Nov 2, 2025 · 9:23 AM UTC

Yingru Li

@RichardYRLi

2 Nov 2025

Replying to @QPHutu @danielhanchen

Has the latest version of vLLM fixed the mentioned issue? github.com/vllm-project/flas…

Fix LSE output error in FA2 kv-split by griii · Pull Request #87 · vllm-project/flash-attention

Background During vLLM inference, some features like Cascade Attention require the LSE output from the attention mechanism. When the FlashAttention-2 kernel operates with seqlenq_ngroups_swapped = ...

github.com

3,309

Yingru Li · Nov 12, 2025 · 8:01 PM UTC

Yingru Li

@RichardYRLi

12 Nov 2025

Great to see the vLLM team tackling the training-inference mismatch! Their new post achieves bitwise consistency—a deep engineering fix to make training & inference identical (at a 2.4x speed cost). They also mentioned our blog in the post. In our blog, we explored the algorithmic side: accepting the mismatch to preserve speed & correcting the resulting off-policy bias with rollout distribution correction like importance sampling and rejection sampling. We found the mismatch is most dangerous for low-probability tokens in agentic/tool-use RL. Two vital approaches to the same core challenge! Our deep-dive on the problem & our algorithmic fix: yingru.notion.site/When-Spee…

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch | Notion

Authors: Jiacai Liu* Yingru Li*† Yuqian Fu Jiawei Wang Qian Liu Yu Shen†

yingru.notion.site

vLLM

@vllm_project

12 Nov 2025

🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq → same output regardless of batching) 2️⃣ Ensure forward passes in training use identical kernels as inference 3️⃣ Add custom backward passes in PyTorch ✅ Verified on Qwen3 1.7B + GSM8K: • batch_inv_ON (bitwise exact) → KL=0.0, faster convergence, higher reward • batch_inv_OFF → reduced reward, instability We audited every op, imported vLLM’s fused kernels (SiLU MLPs, RMSNorm+residual), and wrote matching backward passes. Run is fully on-policy, deterministic, and reproducible. Next: • Unified model code • torch.compile support • Perf tuning (current bitwise RL ≈2.4× slower) • Broader model + op coverage 🔗 blog.vllm.ai/2025/11/10/bitw… #vLLM #TorchTitan #RL #LLM #AIResearch

2,004
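
Editor's sketch for the quoted vLLM post: floating-point addition is not associative, so a different reduction order can give a different result. This is a general property of IEEE 754 arithmetic and is shown here only as background for why batching can change numerics; it is not a description of vLLM's kernels.

```python
def sum_in_order(values):
    """Accumulate left to right, the way a simple sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]
left_to_right = sum_in_order(values)        # (0.1 + 0.2) + 0.3
right_to_left = sum_in_order(values[::-1])  # (0.3 + 0.2) + 0.1
differs = left_to_right != right_to_left
```

The two sums differ in the last bit. A kernel whose reduction order depends on batch shape can therefore return slightly different log-probabilities for the same sequence.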

Yingru Li · Sep 27, 2025 · 9:08 AM UTC

Yingru Li

@RichardYRLi

27 Sep 2025

So, why is sequence-level IS the principled solution for the training-inference mismatch, not the token-level heuristic? The token-level approximation is theoretically unsound due to a double mismatch: 🗺️ State Occupancy Mismatch: It uses the state distribution from the rollout policy, failing to correct for taking a different path. 🎯 Mismatched Reward Signal: It evaluates actions using the rollout policy's advantage function—the wrong measuring stick. Sequence-level IS elegantly solves both with a single trajectory-wide ratio. The derivations below show the math is unambiguous. @zzlccc @willccbb @nanjiang_cs #ReinforcementLearning #LLM #AI #RL

Yingru Li

@RichardYRLi

26 Sep 2025

Replying to @RichardYRLi

1,483
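
Editor's sketch of the standard argument behind the tweet above, since the derivation images are not preserved here. Notation is assumed: $\pi_\theta$ is the training policy, $\mu$ the rollout (inference) policy, $x$ the prompt, $y = (y_1, \dots, y_T)$ the sampled response, and $R(x, y)$ its reward. The identity requires $\mu(y \mid x) > 0$ wherever $\pi_\theta(y \mid x) > 0$.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big] \\
  &= \mathbb{E}_{y \sim \mu(\cdot \mid x)}
     \Big[ \frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}\,
           R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big], \\
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}
  &= \prod_{t=1}^{T}
     \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}
   = \prod_{t=1}^{T} \rho_t .
\end{align}

% Token-level surrogate, for comparison: each term keeps only its own \rho_t
% and drops the prefix factor \prod_{s<t} \rho_s that reweights how likely
% the state (x, y_{<t}) is under \pi_\theta.
\begin{equation}
\hat{g}_{\text{token}}
  = \mathbb{E}_{y \sim \mu(\cdot \mid x)}
    \Big[ \sum_{t=1}^{T} \rho_t\, A_t\,
          \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big].
\end{equation}
```

The dropped prefix factor corresponds to the state occupancy mismatch named in the tweet. The second point in the tweet concerns which policy's advantage $A_t$ is used in the surrogate.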

Yingru Li · Sep 3, 2025 · 6:42 PM UTC

Yingru Li

@RichardYRLi

3 Sep 2025

The SimpleTIR paper is officially out! We go beyond our July blog post to provide a deeper mathematical explanation and rigorous proof for why multi-turn RL agents are so unstable. The root cause? A predictable domino effect: OOD Tool Feedback → Low-Prob Tokens → Exploding Importance Ratios and logits update magnitude → Gradient Explosion. We discovered that "void turns" are the critical signal for when this is about to happen. We stress-tested this theory against the latest trajectory quality control methods. None of them could fix the core instability problem (max score: 26.3 on AIME24). But by simply filtering out these "void turns," SimpleTIR makes training fundamentally stable. This minimalist approach rockets the Qwen2.5-7B-Base AIME24 score from 22.1 to 50.5, completely from base model without any SFT. Simple, effective, and battle-tested. Hope this helps the community build more reliable agents! 📜 Paper: huggingface.co/papers/2509.0… 💻 Code: github.com/ltzheng/SimpleTIR ✍️ Blog: simpletir.notion.site/report

Paper page - SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning


huggingface.co

Qian Liu

@sivil_taram

3 Sep 2025

Thanks AK for sharing our work! 🔥 🧵 Back to Jan when we just started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research teams seem to hit the same invisible wall (Devin, verl and other reported issues) 🧱 Multi-turn tool reasoning just... BROKE 💔 It's NOT like the elegant simplicity of R1-Zero’s approach. This was pure chaos. Then came the "aha!" moment of our SimpleTIR💡✨ The secret was hiding in plain sight: “void turns” - those meaningless steps where models generate text that leads... absolutely NOWHERE 🕳️ One simple filter changed everything ✨ Our 7B model jumped from 22% (DAPO) to 50% (Multi-Turn Tool Use) on AIME24 📈 No complex algorithms, no fancy techniques. Just removing the void turn examples that were poisoning the training 🎯 Sometimes, the biggest gains come from understanding what NOT to learn 💡 📄 Paper: huggingface.co/papers/2509.0… 💻 Code: github.com/ltzheng/SimpleTIR ✍️ Blog: simpletir.notion.site/report

1,518
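
Editor's sketch of the void-turn filter described above. The detection rule is an assumption: a turn is treated as void when it contains neither a complete fenced code block nor a boxed final answer. The helper names and the trajectory format (a list of turn strings) are hypothetical, and the SimpleTIR code linked above may define and apply the filter differently.

```python
import re

FENCE = "`" * 3  # built programmatically to keep this example self-contained

# Assumed markers of a "useful" turn: a closed code fence or a boxed answer.
CODE_BLOCK = re.compile(re.escape(FENCE) + r"\w*\n.*?" + re.escape(FENCE), re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")

def is_void_turn(turn_text):
    """True when the turn has no complete code block and no final answer."""
    return not CODE_BLOCK.search(turn_text) and not FINAL_ANSWER.search(turn_text)

def filter_trajectories(trajectories):
    """Keep only trajectories in which no turn is void."""
    return [t for t in trajectories if not any(is_void_turn(turn) for turn in t)]

good = [
    "Let me compute.\n" + FENCE + "python\nprint(1)\n" + FENCE,
    "The answer is \\boxed{1}.",
]
bad = [
    "Let me compute.\n" + FENCE + "python\nprint(1",  # code fence never closed
    "The answer is \\boxed{1}.",
]
kept = filter_trajectories([good, bad])
```

Here the second trajectory is dropped because its first turn ends with an unclosed code block, so it would be excluded from the policy update under this rule.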

Yingru Li · Oct 3, 2025 · 6:42 PM UTC

Yingru Li

@RichardYRLi

3 Oct 2025

Replying to @willccbb

what happened?

3,948

Yingru Li · Sep 29, 2025 · 6:41 PM UTC

Yingru Li

@RichardYRLi

29 Sep 2025

This is more than just a new recipe for LoRA; it's a masterclass in how first-principles thinking drives frontier research. For years, the community saw a performance gap, but this work shows it was never an architectural flaw—just a misunderstanding of the method. @thinkymachines @johnschulman2

Thinking Machines

@thinkymachines

29 Sep 2025

1,096

Yingru Li · Sep 26, 2025 · 1:00 PM UTC

Yingru Li

@RichardYRLi

26 Sep 2025

(5/x) The fix is Importance Sampling (IS), but the token-level approach is a trap. We show it's theoretically flawed due to biased gradients—it ignores state distribution shifts & uses mismatched rewards. In our experiments, token-level IS still collapsed.

1,722

Yingru Li · Jul 18, 2024 · 3:30 PM UTC

Yingru Li

@RichardYRLi

18 Jul 2024

I'll be at #ICML2024 from July 21st to 27th! Check out the picture for detailed locations and times of my presentations. Also, don't miss my lightning talk on "Agile #Human-#AI #Collaboration for #RiskOversight" at the #AlignmentWorkshop @farairesearch on 21st and 22nd. #RL #Probability #GPT #Martingale #Statistics #Uncertainty #OnlineLearning #decisionmaking #Moderation #Alignment #Risk #Oversight #Dynamics #ICML #conference #Vienna #FoundationModel #LLMs #exploration #poster #HyperAgent #Theory #Practice #math

Scalable posterior sampling, High-dimensional probability, Learning dynamics, Martingale, Foundation model, Online decision-making, Uncertainty quantification, Online learning, Probability, Statistics, RL


1,399

Yingru Li · Oct 20, 2025 · 5:40 AM UTC

Yingru Li

@RichardYRLi

20 Oct 2025

Quoted from Ben Van Roy (Stanford Professor): @karpathy piped.video/fERcdhg0Kds?si=lhVj…

Csaba Szepesvari @CsabaSzepesvari

19 Oct 2025

Replying to @karpathy

@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss if the problem is with the algorithms, or if the problem is with posing a problem as an RL problem. 1/x

2,173

Yingru Li · Jul 21, 2021 · 8:45 PM UTC

Yingru Li

@RichardYRLi

21 Jul 2021

Efficient Exploration by HyperDQN in Deep Reinforcement Learning. At #ICML2021 RL4RealLife Workshop on Friday 23rd. See you at 23:00 - 1:00 Poster Discussion. #GatherTown: A4 in eventhosts.gather.town/PHQYR… With @ZiniuLi, Hao Liang, and Tong Zhang

Yingru Li · Dec 11, 2024 · 2:39 PM UTC

Yingru Li

@RichardYRLi

11 Dec 2024

Our ICML24 work directly addresses this concern. Effective RL is possible with dramatic simplification: 20x fewer parameters, 7x fewer samples (figure 1), and removal of many conventional tricks (table 3) - while still achieving human-level performance. arxiv.org/abs/2402.10228

Q-Star Meets Scalable Posterior Sampling: Bridging Theory and...

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors...

arxiv.org

Jacob Adkins @jtadkins49

10 Dec 2024

Is there a hyperparameter crisis in reinforcement learning? Let's talk about it at NeurIPS. #NeurIPS2024

1,452

Yingru Li · Nov 2, 2025 · 2:41 PM UTC

Yingru Li

@RichardYRLi

2 Nov 2025

Replying to @agarwl_ @danielhanchen

On A100s? How large is your starting mismatch_KL? Have you converted the rollout_logprob to higher precision when calculating the mismatch_KL?

3,251

Yingru Li · Nov 2, 2025 · 2:35 PM UTC

Yingru Li

@RichardYRLi

2 Nov 2025

Replying to @agarwl_ @danielhanchen

FYI

Yingru Li

@RichardYRLi

2 Nov 2025

1,043

Yingru Li · Jul 23, 2024 · 7:24 AM UTC

Yingru Li

@RichardYRLi

23 Jul 2024

📈 Scale real-time online decisions without scaling model size? Yes! Presenting simple, cost-effective, theory-backed RL solutions at #ICML2024: 🕐 Today, 23 Jul, 11:30am-1pm 📍 Hall C 4-9, Poster #1303 🔬 Poster Session 1 Title: "Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent" Code: github.com/szrlee/HyperAgent #Posterior #DeepRL #AI #Uncertainty #Exploration #Regret #Theory #Computation #Efficiency

Yingru Li

@RichardYRLi

18 Jul 2024

598

Yingru Li · Nov 11, 2025 · 3:28 AM UTC

Yingru Li

@RichardYRLi

11 Nov 2025

5/ And of course, all modern work on this topic stands on the shoulders of giants. A huge credit is due to Nemirovsky and Yudin (1983) for their seminal work that introduced the mirror descent method in the first place. A classic from optimization theory powering today's AI. Link: isye.gatech.edu/~nemirovs/Ne…

349

Yingru Li · Nov 11, 2025 · 3:28 AM UTC

Yingru Li

@RichardYRLi

11 Nov 2025

4/ It's exciting to see these foundational ideas from 2019 becoming central to today's frontier models. As @IdanShenfeld noted, while PMD-like methods were tricky in continuous domains, they are perfectly suited for the discrete action spaces of LLMs. arXiv Link to the NeurIPS2019 paper: arxiv.org/abs/2501.15034

Divergence-Augmented Policy Optimization

In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle...

arxiv.org

398

Yingru Li · Nov 3, 2025 · 3:04 AM UTC

Yingru Li

@RichardYRLi

3 Nov 2025

Replying to @danielhanchen @QPHutu

I said hardware differences remain but didn’t say FP16 is better than BF16. I have not done the experiments.

386

Yingru Li · Nov 2, 2025 · 3:18 PM UTC

Yingru Li

@RichardYRLi

2 Nov 2025

Replying to @samsja19 @danielhanchen @samxpatterson @agarwl_

FYI

Yingru Li

@RichardYRLi

2 Nov 2025

271

Yingru Li · Mar 19, 2024 · 6:50 AM UTC

Yingru Li

@RichardYRLi

19 Mar 2024

The #first prior-dependent analysis of posterior sampling #RL with function approx. It implies that integration of prior knowledge (such as offline #data or pre-trained #LLMs ) can significantly improve the #efficiency of RL #agents before online #exploration. @aistats_conf

Stat.ML Papers @StatMLPapers

19 Mar 2024

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation ift.tt/CVNAZhy

360

Yingru Li · Nov 6, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

6 Nov 2025

4/ Key insight: My original analysis of actor-critic already established this framework—bijective mappings preserve information! {δₜ} ↔ {rₜ} for actor-critic ✓ {Gₜ} ↔ {rₜ} for REINFORCE ✓ Thanks @mgostIH for pointing out REINFORCE fits the same framework! richardli.xyz/post/informati…

Information Bandwidth in Reinforcement Learning | Yingru Li

An information-theoretic analysis showing that scalar advantage formulations learn ≤ log₂(B) bits per episode, while per-timestep advantages preserve full reward entropy.

richardli.xyz

284

Yingru Li · Nov 6, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

6 Nov 2025

5/ I am grateful to @mgostIH for the insightful conversation about per-timestep advantages in REINFORCE. Their observation that using Adv_t = G_t - b (where G_t is the return from timestep t onward) preserves the bijective mapping {G_t} ↔ {r_t} and thus avoids the information collapse was the key insight that led to revising this paper to include the return-based policy gradient formulation. The original version only analyzed scalar advantages and actor-critic TD errors; the discussion clarified that REINFORCE with per-timestep returns also preserves full reward information.

273

Yingru Li · Oct 2, 2025 · 5:29 PM UTC

Yingru Li

@RichardYRLi

2 Oct 2025

Replying to @chijinML

Congrats! Prof. Jin. We learned a lot from your foundational work on RL Theory.

1,395

Yingru Li · Sep 14, 2024 · 7:26 AM UTC

Yingru Li

@RichardYRLi

14 Sep 2024

Replying to @agarwl_ @max_a_schwarzer

Exciting times for RL indeed! BTW, our ICML 2024 work (arxiv.org/abs/2402.10228) shows comparable sample efficiency for deep RL, but with much smaller networks & simpler training. Love to hear your thoughts!

Q-Star Meets Scalable Posterior Sampling: Bridging Theory and...

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors...

arxiv.org

357

Yingru Li · Jun 14, 2025 · 3:02 PM UTC

Yingru Li

@RichardYRLi

14 Jun 2025

Replying to @nanjiang_cs

Is this value targeted regression? arxiv.org/abs/2403.11175

Prior-dependent analysis of posterior sampling reinforcement...

This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound...

arxiv.org

588

Yingru Li · Jul 21, 2024 · 10:35 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[7/7] Come and join us for the discussion! @RichardYRLi @ZiniuLi @ericzhang0410 @RuoyuSun_UI @chcoli #ICML2024

209

Yingru Li · Nov 6, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

6 Nov 2025

3/ Information capacity comparison: Scalar advantage: - Terminal rewards: O(1) bits - Dense independent rewards: O(log T) bits Per-timestep advantages [REINFORCE or actor-critic]: - Terminal rewards: O(1) bits - Dense independent rewards: O(T) bits The ceiling depends on BOTH gradient structure AND reward structure 📊

183
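
Editor's sketch for the comparison above, under one specific assumption: the T per-step rewards are independent fair coin flips. The full reward sequence then carries T bits, while the scalar sum of rewards is Binomial(T, 0.5) and carries only about half of log2(T) bits plus a constant. The function name is hypothetical.

```python
import math

def binomial_entropy_bits(T, p=0.5):
    """Exact Shannon entropy (in bits) of the sum of T Bernoulli(p) rewards."""
    h = 0.0
    for k in range(T + 1):
        pk = math.comb(T, k) * p**k * (1 - p)**(T - k)
        if pk > 0:
            h -= pk * math.log2(pk)
    return h

T = 64
scalar_bits = binomial_entropy_bits(T)  # information in the summed return
sequence_bits = float(T)                # information in the full reward sequence
```

For T = 64 the summed return carries roughly 4 bits while the sequence carries 64, which is the logarithmic versus linear gap the tweet describes.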

Yingru Li · Nov 11, 2025 · 3:28 AM UTC

Yingru Li

@RichardYRLi

11 Nov 2025

3/ Why did this matter? This approach led to more stable learning and deeper exploration, making it far more robust when reusing off-policy data. The results spoke for themselves. Our policy mirror descent method (PPO+DA) significantly outperformed the standard PPO baseline on a wide range of Atari games, especially in data-scarce scenarios.

309

Yingru Li · Dec 10, 2023 · 5:39 PM UTC

Yingru Li

@RichardYRLi

10 Dec 2023

Replying to @wucathy

Would be great to chat on efficient RL and game theoretical decision making under uncertainty! My homepage RichardLi.xyz

165

Yingru Li · Nov 6, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

6 Nov 2025

2/ The 1-bit ceiling is real, but specific to: - Scalar advantage: one return number per episode [sum(r_t) -b] - Terminal rewards Standard REINFORCE uses per-timestep advantages (Gₜ - b), where Gₜ is return-to-go → {Gₜ} ↔ {rₜ} bijection preserves full reward entropy!

218
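
Editor's sketch of the bijection mentioned above, assuming undiscounted returns-to-go. The function names are hypothetical. The per-timestep returns can be inverted back to the rewards, whereas the single scalar return cannot.

```python
def returns_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_T (undiscounted)."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g += r
        out.append(g)
    return out[::-1]

def rewards_from_returns(returns):
    """Inverse map: r_t = G_t - G_{t+1}, with G_{T+1} = 0."""
    following = returns[1:] + [0.0]
    return [g - n for g, n in zip(returns, following)]

rewards_a = [1.0, 0.0, 2.0, -1.0]
rewards_b = [0.0, 1.0, 0.0, 1.0]

G_a = returns_to_go(rewards_a)
G_b = returns_to_go(rewards_b)
recovered = rewards_from_returns(G_a)
```

Both reward sequences have the same scalar return (2.0), so a scalar advantage cannot tell them apart, while their returns-to-go differ and recover the original rewards.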

Yingru Li · Jun 29, 2024 · 7:27 AM UTC

Yingru Li

@RichardYRLi

29 Jun 2024

Quite impressive! A big step towards democratizing foundation models training with Adam-mini. 🚀 @ericzhang0410 @chcoli @ZiniuLi

Yushun Zhang

@yushun_zzz

28 Jun 2024

Thrilled to introduce Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training. The design of Adam-mini is inspired by certain Hessian structures we observed on Transformers. Feel free to try it out! Try switching to Adam-mini with the same hyperparams of AdamW, it would work with only half memory. Hope Adam-mini can help save time, cost, and energy in your tasks! Paper: "Adam-mini: Use Fewer Learning Rates To Gain More" arxiv.org/abs/2406.16793 Code: github.com/zyushun/Adam-mini

253

Yingru Li · Mar 6, 2025 · 5:47 AM UTC

Yingru Li

@RichardYRLi

6 Mar 2025

True

Shane Gu

@shaneguML

5 Mar 2025

RL and Bayes are not machine learning subfields. They are the axioms of life and universe. You can't get rid of them.

165

Yingru Li · Nov 11, 2025 · 3:28 AM UTC

Yingru Li

@RichardYRLi

11 Nov 2025

2/ Our key insight in "Divergence-Augmented Policy Optimization" (DAPO) was to view policy updates through the lens of mirror descent on the entire state-action distribution μ(s, a). This balances maximizing future rewards (the r term) against the cost of deviating from our current distribution (the D(μ, μₜ) term). It's a more stable update rule. This is a subtle but crucial shift from just regularizing action probabilities at a given state. Instead of just penalizing policy changes, we penalize changes in the outcomes of the policy across the whole environment

363
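
Editor's sketch of a generic mirror-descent step consistent with the description above. The step size $\eta$, the feasible set $\mathcal{M}$ of valid state-action distributions, and the inner-product notation are assumptions; the paper's exact objective and choice of divergence $D$ may differ.

```latex
\begin{equation}
\mu_{t+1}
  = \arg\max_{\mu \in \mathcal{M}}
    \; \langle r, \mu \rangle - \frac{1}{\eta}\, D(\mu, \mu_t),
\qquad
\langle r, \mu \rangle = \sum_{s, a} \mu(s, a)\, r(s, a).
\end{equation}

% Special case, for intuition only: if D is the KL divergence and the only
% constraint is that \mu lies on the probability simplex, the step has the
% closed form
\begin{equation}
\mu_{t+1}(s, a) \;\propto\; \mu_t(s, a)\, \exp\big( \eta\, r(s, a) \big).
\end{equation}
```

The closed form ignores the flow constraints that tie a state-action distribution to the environment dynamics, so it illustrates the trade-off between reward and divergence and is not the algorithm itself.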

Yingru Li · Jun 14, 2025 · 3:23 PM UTC

Yingru Li

@RichardYRLi

14 Jun 2025

Replying to @nanjiang_cs

Great to know about this! Thank you!

228

Yingru Li · Sep 13, 2024 · 7:05 PM UTC

Yingru Li

@RichardYRLi

13 Sep 2024

Visionary

Pablo Samuel Castro @pcastr

11 Aug 2024

Great keynote by David Silver, arguing that we need to re-focus on RL to get out of the LLM Valley @RL_Conference

170

Yingru Li · Oct 4, 2025 · 3:04 AM UTC

Yingru Li

@RichardYRLi

4 Oct 2025

Indeed, agents require (online) continual learning.

Kevin Patrick Murphy

@sirbayes

3 Oct 2025

I agree with @karpathy 's take here. The interview between @RichardSSutton and @dwarkesh_sp was interesting, but I think at times there was a communication gap due to some misunderstandings. I would say that the current LLM training setup is very similar to the classic model-free RL setup, except that with LLMs: (1) the policy is warm-started from a supervised model (no de-novo, self-directed learning); (2) there is a train/test distinction (no continual learning); (3) most of the observation stream comes from human words, which already "carve nature at its joints", bypassing the harder problem of learning useful abstractions from raw sensorimotor streams. (4) when using multimodal models, the perceptual encoder is usually pre-trained and frozen, and often relies on a lot of human engineering (eg contrastive losses, or pixel-prediction losses) to come up with a good set of (soft) tokens. Most of the interview seem to focus on issue #1. However, the discussion seemed confused here due to the fact that LLMs are both a world model (predict what humans would typically say) and a policy (predict what the agent should do). Obviously the model from the supervised pretraining stage is not action-conditioned, so Sutton does not want to call it a WM - but it is a predictor of future observations given the past, so it's like a WM that marginalizes over actions (resulting in a mixture). The WM is then converted into a (goal-conditioned) policy using IFT (imitation learning) and then improved with RLFT, which further confuses the discussion. In current practice, the RLFT stage mostly just uses human provided reasoning tasks, which are bandit problems that do not involve interacting with an environment. But there is a recent move towards true multi-step RL, where LLMs do learn from external environments, as in classic RL. This fact was not emphasized enough in the interview, IMHO. 
Andrej argues that warm-starting is a practical alternative to evolution's outer meta-learning loop, and I agree, so I don't have a problem with #1. But I do agree with Sutton's criticisms #2-#4. In particular, I expect a lot of future progress to come from continual RL applied to multimodal problems (eg. visual GUI-using agents) in non-stationary multi-agent environments (e.g., e-commerce or embodied AI), where the agent learns its own abstractions over time (eg creating tool libraries), it learns both a (goal agnostic) world model and a (goal conditioned) policy (so it can do decision time planning), and both kinds of model become semi-parametric (eg. combining memories and ICL with gradient-based weight updates). Future agents will not just be a frozen "omni-transformer", consuming and generating tokens, they will be heterogeneous adaptive systems, with many different specialized modules, more like the brain. (This may make serving hard, but who said intelligence would be easy to reproduce?) I think Sutton will like this new paradigm more :)

631

Yingru Li · Sep 11, 2020 · 9:45 AM UTC

Yingru Li

@RichardYRLi

11 Sep 2020

Replying to @svlevine @abhishekunique7

What might be the major difference between AWAC and MARWIL? MARWIL is already implemented in Ray RLlib. Is there any comparison? MARWIL: ai.tencent.com/ailab/media/p… Ray RLlib implementation: docs.ray.io/en/latest/rllib-…

Yingru Li · Sep 22, 2018 · 8:32 AM UTC

Yingru Li

@RichardYRLi

22 Sep 2018

@DeepMindAI @GameAiConf @GameAINorth @GameAI2018 @GameAI_jp @GameAI_QMUL @StarCraft @esportstarcraft @TencentGames_ #Tencent #TencentAILab #TStarBots #Starcraft #GameAI #awesome #exciting

机器之心 JIQIZHIXIN

@jiqizhixin

21 Sep 2018

#Tencent TStarBots Defeat @StarCraft II's Powerful Builtin #AI in the Full Game syncedreview.com/2018/09/21/…

Yingru Li · Dec 9, 2023 · 8:07 PM UTC

Yingru Li

@RichardYRLi

9 Dec 2023

Flying out to #NeurIPS2023 today from HK!

Yingru Li

@RichardYRLi

5 Dec 2023

Replying to @RichardYRLi

I will be at #NeurIPS2023 next week. Please DM me if you would like to chat about anything related to #ReinforcementLearning / Decision-making #agents / Random Projection for Efficient #Algorithms😋

305

Yingru Li · Nov 1, 2025 · 4:40 AM UTC

Yingru Li

@RichardYRLi

1 Nov 2025

Replying to @RichardYRLi @agarwl_

The FP16 paper also confirmed that our sequence-level algorithm is the only one that survived in their BF16 setups. The algorithm and follow-up analysis can be used for general off-policy problems, not just the mismatch scenarios.

140

Yingru Li · Dec 20, 2024 · 3:58 PM UTC

Yingru Li

@RichardYRLi

20 Dec 2024

Replying to @RichardYRLi @carlo_sferrazza @pabbeel

I'm also excited to share our work on ensemble theory and algorithm design that could address your mentioned "computational overhead in ensemble-based information gain estimation"! Check out our Ensemble++ paper with code: arxiv.org/abs/2407.13195

Scalable Exploration via Ensemble++

Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While...

arxiv.org

Yingru Li · Nov 3, 2025 · 4:54 PM UTC

Yingru Li

@RichardYRLi

3 Nov 2025

Replying to @huskydogewoof @jm_alexia

FYI: link to our blogpost yingru.notion.site/When-Spee…

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch | Notion

Authors: Jiacai Liu* Yingru Li*† Yuqian Fu Jiawei Wang Qian Liu Yu Shen†

yingru.notion.site

Yingru Li · Jan 29, 2025 · 7:42 AM UTC

Yingru Li

@RichardYRLi

29 Jan 2025

Replying to @shaneguML

What is the right notion of online RL? It is a bit tricky. First, we have not even solved the catastrophic forgetting or plasticity issues in training general DNNs (this also applies to LLMs). Second, exploration in policy-based deep RL is still full of engineering tricks and lacks principles.

210

Yingru Li · Jun 8, 2020 · 2:31 AM UTC

Yingru Li

@RichardYRLi

8 Jun 2020

My recent research is highly related to Blackwell's approachability concept.

Sham Kakade

@ShamKakade6

7 Jun 2020

1/ David Blackwell. Leagues ahead of his time: "What is a good prediction strategy, and how well can you do?" While some things do not seem possible, "Looking for a p [probability] that does well against every x [an outcome] seems hopeless", Blackwell does give us a strategy:

Yingru Li · Jul 19, 2024 · 2:28 AM UTC

Yingru Li

@RichardYRLi

19 Jul 2024

Very interesting! @baharanm @sjoshi804

Baharan Mirzasoleiman @baharanm

19 Jul 2024

I'll be giving a 2-hour tutorial on data-efficient learning with my PhD student @sjoshi804 on Monday July 22 at #ICML2024. Join us to learn more about this cool topic! ➡️ We can learn better from better data! ⬅️🙌🌱

466

Yingru Li · Jul 22, 2021 · 5:32 AM UTC

Yingru Li

@RichardYRLi

22 Jul 2021

This poster is a preliminary version of our work. We believe HyperDQN, a randomized exploration algorithm with a modern #DeepLearning architecture, is one of the right approaches for bringing theory to practice and bridging the gap between them. Thanks to the organizers @yuxili99 of the #RL4RealLife workshop.

Yingru Li · Feb 13, 2025 · 4:41 PM UTC

Yingru Li

@RichardYRLi

13 Feb 2025

Replying to @KaixuanHuang1

Cool! When will this dataset be available?

209

Yingru Li · Sep 22, 2018 · 8:35 AM UTC

Yingru Li

@RichardYRLi

22 Sep 2018

Replying to @notmisha

work from Tencent AI Lab ai.tencent.com

Yingru Li · Dec 5, 2023 · 12:37 PM UTC

Yingru Li

@RichardYRLi

5 Dec 2023

Bringing groundbreaking solutions to #NeurIPS: "Efficient RL via Hypermodel." Tackling the major RL challenge of data and computational efficiency, our results mark a paradigm shift in the field. Don’t miss this insightful session! #AIResearch #ReinforcementLearning @NeurIPSConf

434

Yingru Li · Oct 5, 2025 · 1:36 PM UTC

Yingru Li

@RichardYRLi

5 Oct 2025

Replying to @abeirami

Excellent analysis—it's great to see this information-theoretic lens being applied. This aligns perfectly with the framework I laid out in my post on "Information Bandwidth." We both arrive at the same core conclusion: the bottleneck is two-fold. My post is a deep dive into both of these aspects: The Signal Bottleneck: Analyzing how the information ceiling is fundamentally limited by the reward structure itself (e.g., terminal/binary rewards offering ≤1 bit vs. dense rewards offering thousands). The Algorithm Bottleneck: Showing how policy gradient's aggregation structure inherently destroys this signal, while an algorithm like Actor-Critic is designed to preserve it. It's all about the crucial interplay between signal density and algorithm design. Great to be thinking on the same wavelength! The full analysis is here: richardli.xyz/post/informati… @johnschulman2 @AlexGDimakis @rm_rafailov

Information Bandwidth in Reinforcement Learning | Yingru Li

An information-theoretic analysis showing that scalar advantage formulations learn ≤ log₂(B) bits per episode, while per-timestep advantages preserve full reward entropy.

richardli.xyz

461
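To make the "≤1 bit" figure for terminal binary rewards concrete, here is a minimal sketch that computes the Shannon entropy of a reward signal in bits. It is only an illustration of the standard entropy bound, not the analysis from the post; the function names `entropy_bits` and `binary_entropy_bits` are hypothetical, and how the bound maps onto the post's own quantities is not spelled out here.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete distribution.

    Terms with zero probability contribute nothing. This is the most
    information a single observation of the signal can carry.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def binary_entropy_bits(p_success):
    """Entropy of a terminal pass/fail reward with success probability p_success."""
    return entropy_bits([p_success, 1.0 - p_success])

# A binary reward carries at most 1 bit, reached only at a 50% success rate.
for p in (0.5, 0.1, 0.01):
    print(f"p={p}: {binary_entropy_bits(p):.3f} bits")

# A scalar signal with K distinct values carries at most log2(K) bits.
K = 8
print(entropy_bits([1.0 / K] * K), math.log2(K))
```

Under this reading, a very easy or very hard task (success rate near 0 or 1) yields far less than one bit per episode, which is one way to see why a sparse terminal reward is a narrow channel.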

Yingru Li · Sep 22, 2025 · 4:50 AM UTC

Yingru Li

@RichardYRLi

22 Sep 2025

that is why we use the hierarchical MDP formulation. arxiv.org/abs/2509.02479

Yingru Li

@RichardYRLi

22 Sep 2025

Replying to @nanjiang_cs

That makes sense given autoregressive modeling. But if the intent is to optimize over a combinatorial action set from the beginning, then it is open loop, and we do not even know how to optimize it in LLMs. This implies the underlying MDP is always token level.

273

Yingru Li · Dec 11, 2024 · 2:42 PM UTC

Yingru Li

@RichardYRLi

11 Dec 2024

Replying to @MichaelHBowling

Q-Star Meets Scalable Posterior Sampling: Bridging Theory and...

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors...

arxiv.org

Yingru Li · Mar 8, 2024 · 7:03 PM UTC

Yingru Li

@RichardYRLi

8 Mar 2024

Replying to @seohong_park @JesseFarebro

Interestingly, we found that adding noise in the right way to the value target with MSE provides significant practical and theoretical benefits in computational and data efficiency! arxiv.org/abs/2402.10228 @marcgbellemare @aviral_kumar2

695

Yingru Li · Dec 11, 2024 · 2:43 PM UTC

Yingru Li

@RichardYRLi

11 Dec 2024

Replying to @jtadkins49

Q-Star Meets Scalable Posterior Sampling: Bridging Theory and...

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors...

arxiv.org

212

Yingru Li · Nov 8, 2025 · 6:32 AM UTC

Yingru Li

@RichardYRLi

8 Nov 2025

Replying to @D_Nohara @QPHutu

These are all supported in verl (verl.readthedocs.io/en/lates…): bypassing old_logprob and no PPO clipping. This was discussed in Section 4.2.1 of the blog yingru.notion.site/When-Spee…

134

Yingru Li · Sep 28, 2025 · 2:18 PM UTC

Yingru Li

@RichardYRLi

28 Sep 2025

Replying to @zzlccc

Excellent question! You've nailed the central challenge of using sequence-level IS in practice. In the blog, our current solution is to directly manage the ratio with methods like Truncated IS (TIS) and Masked IS (MIS). They control variance by capping or zeroing out extreme ratios from outlier trajectories. We found MIS works particularly well. But you're right to imply these are still heuristics. We're actively working on more principled variance reduction techniques for a future post. Stay tuned!
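The capping versus zeroing distinction can be sketched in a few lines. This is a minimal illustration assuming per-token log-probabilities from the training and rollout policies are available; the function names and the threshold `cap=2.0` are hypothetical and are not taken from the blog or from any library.

```python
import math

def sequence_log_ratio(train_logprobs, rollout_logprobs):
    """Sequence-level log importance ratio: sum of per-token log-prob differences."""
    return sum(t - r for t, r in zip(train_logprobs, rollout_logprobs))

def truncated_is_weight(log_ratio, cap=2.0):
    """Truncated IS (TIS): cap the ratio at `cap` (hypothetical threshold)."""
    return min(math.exp(log_ratio), cap)

def masked_is_weight(log_ratio, cap=2.0):
    """Masked IS (MIS): drop the trajectory (weight 0) when the ratio exceeds `cap`."""
    ratio = math.exp(log_ratio)
    return ratio if ratio <= cap else 0.0

# An outlier trajectory whose training log-probs sit well above the rollout ones.
log_ratio = sequence_log_ratio([-1.0, -2.0], [-1.5, -2.5])  # = 1.0, ratio ~ 2.72
print(truncated_is_weight(log_ratio))  # capped at 2.0
print(masked_is_weight(log_ratio))     # zeroed out
```

Both variants bound the weight and hence its variance; the difference is that TIS still lets the outlier trajectory contribute a (biased) gradient, while MIS removes it from the update entirely.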

Yingru Li · Nov 1, 2025 · 1:21 AM UTC

Yingru Li

@RichardYRLi

1 Nov 2025

Replying to @agarwl_

Hi Rishabh, we also had a set of experiments showing that (on-policy) GRPO/RLOO can crash on H100 with BF16.

202

Yingru Li · Dec 13, 2021 · 2:27 PM UTC

Yingru Li

@RichardYRLi

13 Dec 2021

Want to discuss the theory-practice gap in (randomized) exploration in RL? You are welcome to join the oral presentation at 09:10, 14 Dec (ET). neurips.cc/virtual/2021/work… @ZiniuLi @EcoTheoryRL

Yingru Li · Sep 29, 2025 · 6:41 PM UTC

Yingru Li

@RichardYRLi

29 Sep 2025

(3/3) The practical implications are huge. This provides a clear, theory-backed path for deploying cheaper, more flexible fine-tuning setups and truly unlocks large-scale, low-cost custom agent development. This is what makes a frontier lab a frontier lab. Congrats @johnschulman2 @thinkymachines

172

Yingru Li · Nov 8, 2025 · 6:00 AM UTC

Yingru Li

@RichardYRLi

8 Nov 2025

Replying to @ProfKuang @profkuang @MengdiWang10

We shall bring that into practice.

Yingru Li · Oct 10, 2019 · 3:04 AM UTC

Yingru Li

@RichardYRLi

10 Oct 2019

Replying to @daniel_goldsmth

@overleaf Please

Yingru Li · Dec 6, 2021 · 6:26 PM UTC

Yingru Li

@RichardYRLi

6 Dec 2021

2 years ago @NeurIPSConf

Yingru Li · Sep 22, 2025 · 8:21 AM UTC

Yingru Li

@RichardYRLi

22 Sep 2025

Replying to @nanjiang_cs

Thank you for the detailed comments. I totally agree that in LLMs, your definition of "seq-level" PG is essentially the token-level PG, and there is no ambiguity here. It is just the decomposition of the log prob using the autoregressive property. What I am confused about is the point on optimizing combinatorial actions that are carried out from the beginning. I will try to understand this.
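The decomposition referred to here can be written out explicitly. The notation below (prompt $x$, response $y = (y_1, \dots, y_T)$, policy $\pi_\theta$, sequence reward $R$) is an assumption for illustration and does not come from the thread.

```latex
% Autoregressive factorization of the sequence log-probability
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})

% Hence the sequence-level policy gradient is a sum of token-level terms
\nabla_\theta \log \pi_\theta(y \mid x)\, R(x, y)
  = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, R(x, y)
```

With the same reward multiplying every token term, the "sequence-level" and "token-level" estimators coincide, which is the sense in which there is no ambiguity.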

Yingru Li · Jul 18, 2024 · 4:30 PM UTC

Yingru Li

@RichardYRLi

18 Jul 2024

Replying to @bonniesjli @GoogleDeepMind

Very interesting! Would be great to chat!

1,142

Yingru Li · Jul 21, 2024 · 10:26 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[5/7] RODI-Rep: A computation-efficient distributional RL algorithm for risk-sensitive RL with regret guarantee. [Paper](arxiv.org/abs/2210.14051v3). FoRLaC workshop @icmlconf

Yingru Li · Jul 21, 2021 · 8:45 PM UTC

Yingru Li

@RichardYRLi

21 Jul 2021

Poster: liziniu.org/docs/icml2021_wo… Paper: liziniu.org/docs/icml2021_wo…

Yingru Li · Oct 5, 2025 · 1:39 PM UTC

Yingru Li

@RichardYRLi

5 Oct 2025

Replying to @RichardYRLi @abeirami

A great follow-up point for the theorists here: to be precise, this analysis focuses on the information upper bound (the "bandwidth"). To strictly prove superiority, one might need to show that lower_bound(AC) > upper_bound(PG). However, deriving a tight lower bound on information gain is difficult as it depends heavily on implementation details like critic accuracy. The key takeaway is that PG has a low, fixed structural ceiling due to reward aggregation. AC removes this ceiling, giving it access to a much richer information stream.

191

Yingru Li · Jul 21, 2024 · 10:26 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[2/7] HyperAgent: 85% data budget saving 📉 / 95% smaller model size 🧠 in deep RL benchmarks. Saving 90% labeling budget 🏷️ in human-AI collaboration for #risk oversight. Closing a theoretical gap in scalable uncertainty estimation and exploration. [HyperAgent: Paper](arxiv.org/abs/2402.10228) | [ICML Poster](icml.cc/virtual/2024/poster/…) [GPT-HyperAgent: Paper](arxiv.org/abs/2407.13195) | [Poster](icml.cc/virtual/2024/35857)

Yingru Li · Dec 5, 2023 · 12:37 PM UTC

Yingru Li

@RichardYRLi

5 Dec 2023

437

Yingru Li · Dec 11, 2024 · 2:41 PM UTC

Yingru Li

@RichardYRLi

11 Dec 2024

Replying to @shaneguML

Q-Star Meets Scalable Posterior Sampling: Bridging Theory and...

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors...

arxiv.org

Yingru Li · Oct 20, 2025 · 12:06 PM UTC

Yingru Li

@RichardYRLi

20 Oct 2025

Replying to @Dilip_Arumugam @CsabaSzepesvari @karpathy

Ben also had a talk.

Yingru Li

@RichardYRLi

20 Oct 2025

Quoted from Ben Van Roy (Stanford Professor): @karpathy piped.video/fERcdhg0Kds?si=lhVj…

129

Yingru Li · Oct 29, 2025 · 1:08 AM UTC

Yingru Li

@RichardYRLi

29 Oct 2025

Replying to @QuanquanGu

it is all connected. arxiv.org/abs/2504.13173 arxiv.org/pdf/2505.23735

It's All Connected: A Journey Through Test-Time Memorization,...

Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of...

arxiv.org

207

Yingru Li · Jul 21, 2024 · 10:26 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[3/7] ReMax: A simple RLHF algorithm alternative to PPO, using 50% less GPU memory 💻, running 2x as fast⚡, and eliminating 4+ hyperparameters. [Paper](arxiv.org/abs/2310.10505) [ICML Poster](icml.cc/virtual/2024/poster/…)

ReMax: A Simple, Effective, and Efficient Reinforcement Learning...

Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful...

arxiv.org

Yingru Li · Nov 21, 2022 · 12:57 PM UTC

Yingru Li

@RichardYRLi

21 Nov 2022

Replying to @DanielRuss0

congratulations!

Yingru Li · Sep 28, 2025 · 12:33 PM UTC

Yingru Li

@RichardYRLi

28 Sep 2025

Replying to @BlackHC

While one might propose enforcing identical calculations (e.g., using the "batch invariant kernels" from the Thinking Machines blog), these solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.

240

Yingru Li · Dec 11, 2024 · 2:39 PM UTC

Yingru Li

@RichardYRLi

11 Dec 2024

HyperAgent shows that the path forward may be through simpler, more principled approaches rather than hyperparameter complexity.

Yingru Li · Sep 27, 2025 · 1:03 PM UTC

Yingru Li

@RichardYRLi

27 Sep 2025

Replying to @gerardsans

This is a fantastic and deeply insightful take. You're right that the root problem is a fundamental, conceptual one—not just an engineering mismatch. We see our work as diagnosing and fixing the immediate, practical symptoms (the system collapse) that arise from this "deeper rot." While the field grapples with the long-term challenge of causal grounding, stabilizing the fragile systems we have today is a critical first step. The cookbook analogy is perfect. Really appreciate you adding this valuable layer to the discussion!

493

Yingru Li · Sep 27, 2025 · 6:32 AM UTC

Yingru Li

@RichardYRLi

27 Sep 2025

Low-probability tokens are the weak link where the training-inference mismatch explodes. This new finding explains the core instability mechanism and complements the original great work on this problem by @fengyao1909 and team. fengyao.notion.site/off-poli…

Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | Notion

Feng Yao* Liyuan Liu* Dinghuai Zhang Chengyu Dong Jingbo Shang Jianfeng Gao

fengyao.notion.site

Yingru Li

@RichardYRLi

26 Sep 2025

Replying to @RichardYRLi

381

Yingru Li · Jul 21, 2024 · 10:26 AM UTC

Yingru Li

@RichardYRLi

21 Jul 2024

[4/7] Adam-mini: A “mini” version of Adam that saves near-50% memory 💻. [Paper](arxiv.org/abs/2406.16793). ES-FoMo workshop @icmlconf