Ahmad Beirami ✈️ ICML · Jan 13, 2026 · 10:38 PM UTC

Jan 13

We are hiring Members of Technical Staff (Research Engineers)! Current LLM agents lack reliability, creating a gap between demos and production. We solve this by automating the complex workflow of debugging, evaluation, and iteration required to make agents robust. 👇

649

82,811

Ahmad Beirami ✈️ ICML · Oct 21, 2025 · 12:42 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

21 Oct 2025

They should have split the $10k into 10-100 stacks of $100-$1k and given them to identical copies of the same model to see anything remotely meaningful. Right now we are looking at noise!

This tweet is unavailable

1,301

117,557

Ahmad Beirami ✈️ ICML · Aug 25, 2025 · 10:53 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

25 Aug 2025

What are the founders going to own? 🤔

1,036

127,337

Ahmad Beirami ✈️ ICML · Aug 25, 2025 · 3:12 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

25 Aug 2025

There is no free lunch here. Pruning hurts robustness and capabilities other than those captured by the success metric used to prune.

Avi Chawla

@_avichawla

25 Aug 2025

I removed 74% of neurons from a neural network. It dropped the accuracy by just 0.50%. Here's a breakdown (with code):

737

92,476

Ahmad Beirami ✈️ ICML · Jun 2, 2025 · 8:31 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

2 Jun 2025

After three incredible years, today is my last day at Google DeepMind! I am truly grateful to the amazing colleagues who made the journey 1000x more fruitful and enjoyable! I am forever indebted to my collaborators who showed me how to be better at everything via demonstrations.

750

85,415

Ahmad Beirami ✈️ ICML · Sep 10, 2025 · 7:19 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

10 Sep 2025

This is a great example of what good research looks like. You start with a real problem. You peel it layer by layer to find the root cause. You form a new hypothesis and keep digging. At the end, you have something insightful to share!

Thinking Machines

@thinkymachines

10 Sep 2025

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/def…

698

53,874

Ahmad Beirami ✈️ ICML · Aug 12, 2025 · 7:32 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

12 Aug 2025

The best AI researchers zoom at three abstraction levels:
- High: paper-level ideas & math
- Mid: code-level implementation
- Low: GPU/TPU reality (kernels/memory)
Low exposes bottlenecks. High accelerates exploration. Mid makes it real. The job is to translate between them!

664

39,280

Ahmad Beirami ✈️ ICML · Aug 9, 2025 · 12:55 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

9 Aug 2025

The main ingredient that led to GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt. Let me elaborate on what I mean by that and a cheaper way of doing it offline.
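As a rough illustration of what calibrating the reward via multiple rollouts per prompt can look like, here is a minimal sketch of the group-relative advantage commonly associated with GRPO: each rollout's reward is centered by the mean reward of the rollouts for the same prompt and scaled by their standard deviation. The function name, the `eps` constant, and the toy rewards are assumptions for illustration and are not taken from the thread.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of rollouts.

    rewards: array-like of shape (num_prompts, rollouts_per_prompt).
    Returns an array of the same shape where each row has (roughly)
    zero mean and unit standard deviation.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)  # per-prompt baseline
    std = rewards.std(axis=1, keepdims=True)    # per-prompt scale
    return (rewards - mean) / (std + eps)       # eps guards against zero variance

# Two hypothetical prompts whose raw rewards live on different scales.
rewards = [[1.0, 0.0, 0.0, 1.0],
           [10.0, 9.0, 9.0, 10.0]]
adv = group_relative_advantages(rewards)
```

In this toy case both prompts end up with the same advantages even though their raw rewards differ by an offset, which is one reading of "calibration": the per-prompt baseline removes prompt difficulty from the learning signal.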

655

117,600

Ahmad Beirami ✈️ ICML · May 27, 2025 · 6:01 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

27 May 2025

As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
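One way to see the claim, sketched under the standard KL-regularized objective (the symbols p for the base model, r for the reward, β > 0 for the regularization strength, and Z for the normalizer are notation assumed here, not taken from the tweet):

```latex
\pi^\star(\cdot \mid x)
  = \arg\max_{\pi}\;
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
    - \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,p(\cdot \mid x)\big)
\quad\Longrightarrow\quad
\pi^\star(y \mid x)
  = \frac{p(y \mid x)\,\exp\big(r(x,y)/\beta\big)}{Z(x)},
\qquad
Z(x) = \sum_{y'} p(y' \mid x)\,\exp\big(r(x,y')/\beta\big).
```

With a bounded reward, p(y|x) = 0 forces π⋆(y|x) = 0: the optimum can only reweight outcomes the base model already produces, and an outcome with tiny base probability needs a very large r/β to gain noticeable mass.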

523

77,724

Ahmad Beirami ✈️ ICML · Feb 4, 2025 · 1:50 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

4 Feb 2025

A very nice blogpost on GRPO (the method that was used to train R1) by Youssef Mroueh

474

53,634

Ahmad Beirami ✈️ ICML · Aug 10, 2025 · 3:19 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

10 Aug 2025

Post-training research was fueled by the KL-regularized RL mathematical foundation. That led to a lot of algorithmic research and a ton of progress over a few years. This helped us learn how to "distill" metrics back into models. But today we are optimizing workflows/agents.

453

41,955

Ahmad Beirami ✈️ ICML · Sep 6, 2025 · 8:34 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

6 Sep 2025

Unpopular opinion: When a paper has a senior mentor and a junior mentee, the senior author must make sure the claims are correct and well supported. They must check every claim and gate the submission until it meets that bar. The junior author is the generator. The senior author is the verifier. The verifier should teach/distill some checks to the generator, but the verifier keeps final responsibility. If a wrong claim gets out, it is on the verifier!

395

43,476

Ahmad Beirami ✈️ ICML · Apr 13, 2022 · 2:24 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

13 Apr 2022

The question that a reviewer should ask themselves is: Does this paper take a gradient step in the right direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.

368

Ahmad Beirami ✈️ ICML · Oct 4, 2025 · 7:48 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

4 Oct 2025

This post should have been titled "Finetuning without regret"
Finetuning models for specific narrow use cases to save costs, and to keep data on prem during inference, makes sense in general. However, that is bottlenecked by two main issues IMO:
1. Today, OSS models are lacking in capabilities, so the quality ceiling stays low even after finetuning.
2. Finetuning is delicate. It is easy to overoptimize and regress on core skills like reasoning or instruction following. This is hard to detect outside big labs, since most teams lack broad eval sets and only measure the narrow task they care about.
We have not solved 1 yet, but Tinker shows promise on 2. The direction seems right: build tooling that makes finetuning easier, from low-level infra up to higher-level algorithms and evals that keep performance on track.
I peeked at the code and liked how clean it is, and it reads as built from first principles. If OSS models catch up to frontier quality, I can see this service really taking off.

Thinking Machines

@thinkymachines

1 Oct 2025

Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker

355

69,929

Ahmad Beirami ✈️ ICML · Nov 5, 2025 · 1:20 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

5 Nov 2025

Once you see a math concept geometrically, it becomes much easier to think about, and it’s hard to go back to any other way of seeing it.

353

17,157

Ahmad Beirami ✈️ ICML · Feb 3, 2025 · 12:59 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

3 Feb 2025

𝐛𝐞𝐬𝐭-𝐨𝐟-𝐧 is a strong baseline for
- improving agents
- scaling inference-time compute
- preference alignment
- jailbreaking models
How does 𝐁𝐨𝐧 work, and why is it so strong? Find some answers in the paper we wrote over two Christmas breaks!🧵
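For readers new to the method, best-of-n can be sketched in a few lines: draw n candidates from the base model, score each with a reward function, and keep the top one. The `sample_fn` and `reward_fn` callables and the toy stand-ins below are hypothetical placeholders, not an API from the paper.

```python
import random

def best_of_n(prompt, sample_fn, reward_fn, n, rng):
    """Draw n candidates from the base model and return the highest-reward one."""
    candidates = [sample_fn(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))

# Toy stand-ins (hypothetical): the "model" emits a random integer in [0, 99]
# and the "reward" simply prefers larger integers.
def toy_sample(prompt, rng):
    return rng.randint(0, 99)

def toy_reward(prompt, y):
    return y

rng = random.Random(0)
single = [best_of_n("q", toy_sample, toy_reward, 1, rng) for _ in range(2000)]
bo8 = [best_of_n("q", toy_sample, toy_reward, 8, rng) for _ in range(2000)]
mean_single = sum(single) / len(single)  # average reward with n = 1
mean_bo8 = sum(bo8) / len(bo8)           # average reward with n = 8
```

In this toy setup the average reward rises with n while every returned response is still a sample from the base model, which is why the method needs no training.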

351

46,590

Ahmad Beirami ✈️ ICML · May 7, 2025 · 12:49 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 May 2025

A great PhD advisor helps you discover the best version of yourself! In doing so, they'll help set a research vision, set yearly plans, break the problem down to solvable pieces, how to measure success, when to pivot, how to communicate clearly & concisely (written/verbal), etc.

Enes Arda @realenesarda

6 May 2025

Replying to @abeirami

may I ask what you think an advisor's most important role is?

304

29,097

Ahmad Beirami ✈️ ICML · Sep 9, 2025 · 1:28 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

9 Sep 2025

I occasionally get messages asking how to follow my path and get into Meta, DeepMind, or similar places. That is the wrong question. Do not focus on the brand! Focus on what you want to work on, then find the opportunity that fits your goals best.

295

23,772

Ahmad Beirami ✈️ ICML · Nov 2, 2025 · 11:57 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

2 Nov 2025

Everyone should know Chernoff bounds, and Sunil has written an excellent introductory post. Check it out!

Sunil Madhow @MadhowSunil

2 Nov 2025

Ever wonder why the Chernoff bound feels like magic? A geometric answer: KL divergence loves exponential families. This post shares some reflections — and sets up a series on how KL geometry connects classical statistics, online learning (OCO), and more. sunilmadhow.github.io/posts/…

297

57,598

Ahmad Beirami ✈️ ICML · Oct 8, 2025 · 3:54 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

8 Oct 2025

ICLR season, and my timeline is flooded with paper threads that jump straight to "we beat SOTA". But the solution only makes sense in the context of the problem, which is usually missing.
What most threads skip:
- What problem are you solving?
- Why does it matter?
- What did prior work miss?
Instead, we get a tour of the method and a leaderboard screenshot. Remember that the audience for the problem is much larger than the audience for your particular solution.

294

26,537

Ahmad Beirami ✈️ ICML · Feb 5, 2024 · 1:00 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

5 Feb 2024

[#eacl2024 paper] TL;DR We introduce 𝗴𝗿𝗮𝗱𝗶𝗲𝗻𝘁-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝗱 𝘁𝗲𝗮𝗺𝗶𝗻𝗴 (𝗚𝗕𝗥𝗧), an effective method for triggering language models to produce unsafe responses, even when the LM is finetuned to be safe through 𝑎𝑙𝑖𝑔𝑛𝑚𝑒𝑛𝑡.

280

47,513

Ahmad Beirami ✈️ ICML · May 11, 2025 · 1:01 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

11 May 2025

NeurIPS seems to be on track to get 25k+ submissions 😳

270

38,123

Ahmad Beirami ✈️ ICML · Nov 9, 2024 · 1:05 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

9 Nov 2024

RLHF provably can't teach models any new knowledge. If you need to teach new skills, you need to look at pre-training and SFT. Why? 👇

273

83,668

Ahmad Beirami ✈️ ICML · May 7, 2025 · 3:38 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 May 2025

For a long time, the biggest problem in machine learning has been improving and understanding robustness and generalization to OOD. We are just increasingly making more & more problems in-distribution but the models still don't generalize out-of-the-box to the tail of problems.

262

29,547

Ahmad Beirami ✈️ ICML · May 6, 2025 · 6:46 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

6 May 2025

That's exactly the kind of advisor you do not want to do a PhD with!

Aravind Srinivas

@AravSrinivas

6 May 2025

At this point, you can use any AIs (perplexity, chatgpt, etc) as a PhD advisor on whatever topic you're getting deep into. It's pretty good. I have my preference. The core point: Research advice is no longer an elite academic university thing.

253

37,322

Ahmad Beirami ✈️ ICML · Dec 15, 2024 · 12:16 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

15 Dec 2024

In today's publication culture, most authors are after being SOTA, showing tables with 𝐛𝐨𝐥𝐝 numbers, and writing the minimum viable paper! The goal of a scientific paper should be to push the field forward with new intuition/insights on how to think about solving a problem.

251

23,019

Ahmad Beirami ✈️ ICML · Jun 5, 2024 · 7:32 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

5 Jun 2024

[#ICML2024] 𝗖𝗼𝗻𝘁𝗿𝗼𝗹𝗹𝗲𝗱 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝗳𝗿𝗼𝗺 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀
CD is an inference-time alignment method that, 𝘸𝘪𝘵𝘩 𝘢 𝘴𝘪𝘯𝘨𝘭𝘦 𝘵𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘳𝘶𝘯:
- Gives configurable reward/KL tradeoffs
- Transfers to a new base model
- Allows reward aggregation

Controlled Decoding from Language Models
Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami
KL-regularized reinforcement learning (RL) is a popular alignment framework to control the language model responses towards high reward outcomes. We pose a tokenwise RL objective and propose a modular solver for it, called controlled decoding (CD). CD exerts control through a separate prefix scorer module, which is trained to learn a value function for the reward. The prefix scorer is used at inference time to control the generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that prefix scorers for multiple rewards may be combined at inference time, effectively solving a multi-objective RL problem with no additional train

252

39,967

Ahmad Beirami ✈️ ICML · Nov 2, 2023 · 12:30 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

2 Nov 2023

⚠️#Internship2024 Alert⚠️ Are you a 𝐏𝐡𝐃 𝐬𝐭𝐮𝐝𝐞𝐧𝐭 excited about #ResponsibleAI in foundation models? Do you have experience training/evaluating them? Team members: @PreethiLahoti @infoxiao @ananthbshankar @sroy_subhrajit @aradsinha @YaoQinUCSD (RADML at @GoogleResearch)

252

76,588

Ahmad Beirami ✈️ ICML · Jun 9, 2025 · 7:44 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

9 Jun 2025

Good time to remind ourselves that: If a paper is great, the credit goes to the first author. If a paper has any flaws, the responsibility falls on the last author.

236

27,336

Ahmad Beirami ✈️ ICML · May 6, 2025 · 7:24 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

6 May 2025

A PhD is not about getting over the noisy conference bar and publishing a bunch of papers. If that's what your PhD is about, you need to take a step back and reconsider your path. And a PhD advisor's job is certainly not just to help make papers happen.

Momin Abbas @MominAbbas2

6 May 2025

Replying to @khademinori @abeirami

Given the ongoing issues we continue to hear about in the peer review process at top venues, and considering the rapid advancements in AI, it's not hard to imagine a future—not too far off—where an AI system successfully passes peer review at a top research venue.

235

68,834

Ahmad Beirami ✈️ ICML · Nov 11, 2024 · 12:38 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

11 Nov 2024

To get a research-y job, you need to have published a few good papers. Your job talk & interviews will cover 1-3 papers you significantly contributed to. Maximizing the # of pubs actually 𝐡𝐮𝐫𝐭𝐬 your prospects as it leads you to shallow "minimum publishable unit" papers.

224

25,007

Ahmad Beirami ✈️ ICML · Nov 27, 2024 · 6:31 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

27 Nov 2024

The question that a reviewer should ask themselves is: Does this paper take a gradient step in a promising direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.

222

27,173

Ahmad Beirami ✈️ ICML · Jan 1, 2025 · 7:52 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

1 Jan 2025

Excited to share 𝐈𝐧𝐟𝐀𝐥𝐢𝐠𝐧! The alignment optimization objective implicitly assumes 𝘴𝘢𝘮𝘱𝘭𝘪𝘯𝘨 from the resulting aligned model. But we are increasingly using different and sometimes sophisticated inference-time compute algorithms. How to resolve this discrepancy?🧵

InfAlign: Inference-aware language model alignment
Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

218

22,661

Ahmad Beirami ✈️ ICML · Oct 21, 2025 · 9:10 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

21 Oct 2025

The debate over “Is RL needed?” often misses the point. On-policy RL is one tool for fitting the reward-tilted posterior π⋆(y∣x) ∝ p(y∣x) · exp(r(x,y)∕β), where p is the base generative model and r is a reward. This distills the capability signaled by the reward back into the generator.
RL isn’t required: you can sample, approximate, or distill the same target via best-of-N, beam search, rejection sampling, MCMC, or DPO. They are all equivalent theoretically.
The real question is generalization and efficiency:
- generalization: how well does a chosen method transfer to unseen prompts? For training-based methods, as with any distillation problem, performance depends on the dataset and the loss you optimize.
- efficiency: what is the cost of sampling from the desired distribution? The cost is on par with sampling from the base model for most training solutions, while inference-time solutions generally pay a larger decoding cost.
A fair comparison across methods entails looking rigorously at "performance" vs "inference cost" (ignoring training cost). Considering only performance, best-of-N is all you need!
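To make the "no RL required" point concrete, here is a small sketch of one of the inference-time options named above, rejection sampling, on a toy finite outcome space. It proposes from the base model p and accepts with probability exp((r − r_max)/β), which targets the distribution proportional to p · exp(r/β). The three-outcome model, the rewards, and β = 1 are made-up values for illustration.

```python
import math
import random

def tilted_distribution(p, r, beta):
    """Exact pi*(y) proportional to p(y) * exp(r(y) / beta) on a finite set."""
    weights = {y: p[y] * math.exp(r[y] / beta) for y in p}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def rejection_sample(p, r, beta, rng):
    """Propose y ~ p and accept with probability exp((r(y) - r_max) / beta)."""
    outcomes = list(p)
    probs = [p[y] for y in outcomes]
    r_max = max(r.values())  # requires a bounded reward
    while True:
        y = rng.choices(outcomes, weights=probs, k=1)[0]
        if rng.random() < math.exp((r[y] - r_max) / beta):
            return y

p = {"a": 0.7, "b": 0.2, "c": 0.1}  # toy base model (assumed)
r = {"a": 0.0, "b": 1.0, "c": 2.0}  # toy reward (assumed)
beta = 1.0
target = tilted_distribution(p, r, beta)

rng = random.Random(0)
n = 20000
counts = {y: 0 for y in p}
for _ in range(n):
    counts[rejection_sample(p, r, beta, rng)] += 1
empirical = {y: c / n for y, c in counts.items()}  # should be close to target
```

The price shows up in efficiency rather than correctness: the acceptance rate shrinks as β decreases or the reward range grows, which matches the larger decoding cost attributed to inference-time solutions above.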

215

26,298

Ahmad Beirami ✈️ ICML · Nov 7, 2025 · 7:58 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 Nov 2025

This is one of the most exciting agentic AI results I have seen! An AI agent (through several rounds of reasoning and experimentation) discovers distributed systems algorithms (e.g., GPU load balancing) that perform on par with those designed by world-renowned human experts in the field.

Ali Parandeh Gheibi

@aparandehgheibi

7 Nov 2025

We built a Systems Researcher AI agent! Glia discovers novel distributed systems algorithms matching PhD-level experts in creativity & performance. We ran it on various networked systems problems and obtained publication-worthy results on each! Let me tell you how we did it 🧵

208

37,040

Ahmad Beirami ✈️ ICML · May 16, 2025 · 11:24 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

16 May 2025

Thrilled to share that 0 papers were accepted to ACL and 0 papers were submitted to NeurIPS! If you got a paper accepted, please let us know why you're thrilled about it. Why should we read it? I'm specifically interested in papers on post training / RL / agents.

215

74,361

Ahmad Beirami ✈️ ICML · Sep 20, 2025 · 3:36 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

20 Sep 2025

My thoughts on the broken state of AI conference reviewing:
Years ago, when I was in graduate school and a postdoc in Information Theory, I always felt fortunate to be invited to review for IEEE Transactions on Information Theory or IEEE Transactions on Signal Processing. I felt I was being considered an expert on the subject matter and that my opinion helped make a difference in the review process. I was treated with respect, and I was happy to be part of it.
I started publishing in mainstream ML venues in 2017 and was invited to be a reviewer for ML conferences in 2018. I jumped in with the same passion, but it soon became clear that being a reviewer didn’t really matter. I would spend time engaging with a paper only to be overturned by an AC who didn’t read or understand my review, or didn’t care to engage and discuss their rationale. I witnessed terribly wrong papers being accepted and good papers being unduly rejected. I quickly started to feel my role as a reviewer was irrelevant. I wasn’t treated with respect, and I couldn’t make a meaningful difference in the system. I probably would have stopped reviewing for ML long ago if I hadn’t been promoted to AC, and I know a lot of senior researchers who stopped engaging and caring.
When I became an AC, I tried to engage reviewers, treat them with respect, and make sure they felt empowered and valued (especially if I made a decision that went against their recommendation). This is crucial for a functioning review system because reviewers need to feel their work matters if we want them to keep coming back and engaging.
Now, we’ve reached a point where many reviewers feel their work doesn’t matter. The more experienced reviewers aren’t excited to engage, and the review quality keeps going down to the extent that we really lack qualified reviewers in the system. What solution have we found to this problem? Force everyone to be a reviewer. This is what EMNLP, NeurIPS, CVPR, etc., did this year, which (in my opinion) is a terrible way to address the root cause. We should have thought about how to bring back dignity to the reviewer’s role instead.
Fast forward to today, and we are taking steps to make the role of an AC irrelevant as well. When PCs/SACs overturn an AC’s decision without even engaging the AC in the process, the AC doesn’t feel respected or useful. This has been happening in the NLP conferences for a while and happened this year at NeurIPS.
Not only does an AC likely feel irrelevant, the SACs will feel the same. We will quickly see experienced members of the community stop accepting AC/SAC roles because they feel it’s a useless activity. The quality of ACs/SACs will go down, and more of the burden will shift to the PCs, who have to decide on 20k+ papers. The end result is that the entire system will soon crumble.

212

51,903

Ahmad Beirami ✈️ ICML · Sep 5, 2025 · 12:35 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

5 Sep 2025

This advice is underrated! Today, industry research is focused on short-term (3-6 month) bets. Academics have an opportunity to balance their portfolio with medium-term (1-2 years) and long-term (3-10 years) bets. Putting all academic efforts in the short-term basket is suboptimal!

Andrew Gordon Wilson

@andrewgwils

5 Sep 2025

Advice for academics: don't try to beat industry at their own game. Invent a new more interesting game, with different rules.

207

16,244

Ahmad Beirami ✈️ ICML · Apr 3, 2024 · 1:16 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

3 Apr 2024

Have you been perplexed by the surprising performance of 𝗯𝗲𝘀𝘁-𝗼𝗳-𝗻 in alignment compared to SOTA methods (𝗣𝗣𝗢/𝗗𝗣𝗢/𝗜𝗣𝗢)? We have theory that explains this phenomenon.

preprint
Asymptotics of Language Model Alignment
Joy Qiping Yang (University of Sydney), Salman Salamatian (MIT), Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami (Google Research)

196

38,987

Ahmad Beirami ✈️ ICML · Jan 23, 2024 · 4:08 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Jan 2024

A periodic reminder to reviewers: If you ask authors for more experiments, then you need to communicate a clear hypothesis you're trying to verify with those (e.g., effectiveness on imbalanced data, generalization beyond a certain modality, scalability, etc). Otherwise don't!

198

26,797

Ahmad Beirami ✈️ ICML · Nov 28, 2022 · 8:28 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Nov 2022

Please RT Are you a *PhD* student conducting research on generative models? Are you excited about #ResponsibleAI aspects of generation? We are looking to host a student researcher/#internship in 2023. Please get in touch via *email*. P.S. I'll be at #NeurIPS2022 and #emnlp2022

197

Ahmad Beirami ✈️ ICML · Oct 20, 2025 · 3:39 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

20 Oct 2025

Here is a simple observation that should be better known than it is, so I thought I'd share broadly.
Let's say we have a language model p(y|x), where x is a prompt and y is an outcome (a full sequence). Sampling from p(y|x)^α for some α≥1 is commonly known as "tilting" the distribution. Tilting has a lot of use cases in probability and statistics, which are also applicable to language modeling (see comments for references).
1. Tilting a language model can be interpreted as "outcome-level" temperature sampling (as opposed to the common token-level temperature sampling). Unlike token-level temperature sampling, outcome-level tilting is intractable because the partition function Z_α(x) = Σ_y p(y|x)^α is a sum over the exponentially large space of all possible outcomes y. One needs to either come up with sophisticated sampling methods (inference-time solutions) or try to directly distill the tilted distribution in a new model (training-time solutions) to be able to sample from it.
2. Sampling from the distribution proportional to p(y|x)^∞ (α →∞) concentrates all probability mass on the single maximum likelihood outcome y^* = argmax_y p(y|x), which again is intractable. We generally try to approximate this using beam search. Remember that this is different from the (token-level) temperature 0 sampling, which is greedy decoding.
3. One can interpret sampling from p(y|x)^α as the solution to a KL-regularized RL problem. Let the reward be r(x,y) = log p(y|x). Then consider solving the following KL-regularized RL problem:
max_q(.|x) E_q[r(x,y)] - β KL(q(.|x)||p(.|x))
This would lead to sampling from p(y|x)^(1+1/β). In other words, tilting the distribution is equivalent to (implicitly) using log-likelihood as a reward.
4. KL-regularized RL and Best-of-N (BoN) are almost equivalent (see comments for references). Thus, Best-of-N that uses the (log-)likelihood as its re-ranking function is equivalent to tilting. Increasing the value of N is equivalent to increasing the value of α (subject to a monotonic reparameterization).
Happy tilting!
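Points 1 and 2 can be checked by brute force on a model small enough to enumerate. The sketch below uses a hypothetical two-token model (all probabilities are invented for illustration) and compares outcome-level tilting, which normalizes p(y)^α over whole sequences, with token-level sharpening, which raises each conditional to the power α separately.

```python
# Toy two-token language model (hypothetical numbers).
first = {"A": 0.6, "B": 0.4}                      # p(token_1)
second = {"A": {"A": 0.5, "B": 0.5},              # p(token_2 | token_1 = A)
          "B": {"A": 1.0, "B": 0.0}}              # p(token_2 | token_1 = B)

def sequence_probs():
    """Probability of every full two-token sequence under the base model."""
    return {a + b: first[a] * second[a][b] for a in first for b in second[a]}

def _sharpen(dist, alpha):
    """Raise a distribution to the power alpha and renormalize."""
    w = {t: q ** alpha for t, q in dist.items()}
    z = sum(w.values())
    return {t: v / z for t, v in w.items()}

def outcome_tilt(alpha):
    """Outcome-level tilting: normalize p(y)^alpha over whole sequences."""
    return _sharpen(sequence_probs(), alpha)

def token_tilt(alpha):
    """Token-level temperature 1/alpha: sharpen each conditional separately."""
    f = _sharpen(first, alpha)
    s = {a: _sharpen(second[a], alpha) for a in second}
    return {a + b: f[a] * s[a][b] for a in f for b in s[a]}

p = sequence_probs()
out = outcome_tilt(20.0)                  # mass moves to the most likely sequence
tok = token_tilt(20.0)                    # mass follows the greedy first token
map_sequence = max(p, key=p.get)          # most likely full sequence
greedy_first = max(first, key=first.get)  # first token chosen by greedy decoding
```

Here the most likely sequence starts with "B", while greedy decoding commits to "A" at the first step, so the two notions of low temperature concentrate on different outcomes. Enumeration is only possible because the toy space has four sequences, which is the intractability caveat in point 1.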

198

21,836

Ahmad Beirami ✈️ ICML · Nov 15, 2025 · 5:45 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

15 Nov 2025

PSA: If a paper is published (in a top venue), it doesn't mean it is useful (or even correct). You should develop your own judgement system decoupled from the noisy signals to efficiently extract useful information nuggets from the literature!

Mufan Li @mufan_li

14 Nov 2025

Replying to @abeirami @thegautamkamath @jon_barron

I've sunk way too much time reading published papers of this type early on in my PhD, and it's really sad when I imagine how much more time is wasted for others.

190

30,316

Ahmad Beirami ✈️ ICML · Aug 15, 2024 · 12:26 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

15 Aug 2024

Are you interested in theoretical aspects of sampling from language models? These tutorial slides should have good pointers to get started: theertha.info/papers/isit_20… p.s. The slides include my 𝑳𝒂𝒏𝒈𝒖𝒂𝒈𝒆 𝑴𝒐𝒅𝒆𝒍 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕: 𝑻𝒉𝒆𝒐𝒓𝒚 & 𝑷𝒓𝒂𝒄𝒕𝒊𝒄𝒆 talk

Ananda Theertha Suresh @th33rtha

13 Aug 2024

Excited to share the slides of the 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐦𝐨𝐝𝐞𝐥 𝐢𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞: 𝐭𝐡𝐞𝐨𝐫𝐲 & 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦𝐬 tutorial that @abeirami and I gave at #ISIT2024. We covered basics of 𝘪𝘯𝘧𝘦𝘳𝘦𝘯𝘤𝘦, 𝘢𝘭𝘪𝘨𝘯𝘮𝘦𝘯𝘵 and 𝘦𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘤𝘺 from a theoretical lens.

188

25,692

Ahmad Beirami ✈️ ICML · Aug 28, 2025 · 1:08 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Aug 2025

In 2009, a prominent signal processing professor said the market was tough and h-index ≥6 was needed just to get a faculty interview. We now seem to be drifting toward the same bar for PhD program entrance.

186

16,769

Ahmad Beirami ✈️ ICML · Jul 25, 2025 · 1:19 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

25 Jul 2025

Only 2 out of 120 papers in my NeurIPS SAC batch have all reviewer scores ≥4! Don't despair over the low scores; focus on writing a clear and concise rebuttal! Good luck!

190

60,657

Ahmad Beirami ✈️ ICML · Feb 15, 2024 · 11:05 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

15 Feb 2024

Giving a new talk tomorrow at @USC 𝑳𝒂𝒏𝒈𝒖𝒂𝒈𝒆 𝑴𝒐𝒅𝒆𝒍 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕: 𝑻𝒉𝒆𝒐𝒓𝒚 & 𝑷𝒓𝒂𝒄𝒕𝒊𝒄𝒆 hosted by @mahdisoltanol viterbi.usc.edu/calendar/?ev… While the talk is mostly an 𝑎𝑙𝑖𝑔𝑛𝑚𝑒𝑛𝑡 tutorial, I will also touch upon some of our recent work👇

Generative language models have advanced to a level where they can effectively solve a variety of open-domain tasks with little task specific supervision. However, the generated content from these models may still not satisfy the preference of a human user. The goal of the 𝑎𝑙𝑖𝑔𝑛𝑚𝑒𝑛𝑡 process is to remedy this issue by generating content from an aligned model that improves a reward (e.g., make the generation safer) but does not perturb much from the base model. A simple baseline for this task is best-of-N, where N responses are drawn from the base model, ranked based on a reward, and the highest ranking one is selected. More sophisticated techniques generally solve a KL-regularized reinforcement learning (RL) problem with the goal of maximizing expected reward subject to a KL divergence constraint between the aligned model and the base model. An alignment technique is preferred if its reward-KL tradeoff curve dominates other techniques.

186

45,139

Ahmad Beirami ✈️ ICML · Oct 1, 2025 · 6:36 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

1 Oct 2025

The new post by @johnschulman2 et al. on LoRA vs Full-FT is excellent! In this short 🧵 I’ll
- engage with the findings
- examine the info-theoretic intuition
- offer a sharper argument with some further thoughts

Thinking Machines

@thinkymachines

29 Sep 2025

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…

178

29,764

Ahmad Beirami ✈️ ICML · May 6, 2025 · 8:59 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

6 May 2025

The problem is clear when you see a student asking themselves: "What am I submitting to NeurIPS?" vs "What important problem am I going to try to solve in the next few years?"

This tweet is unavailable

177

38,914

Ahmad Beirami ✈️ ICML · Sep 10, 2025 · 1:06 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

10 Sep 2025

This is the conclusion slide of a talk I gave more than a year ago on RL/Alignment! It still holds true today.

Slide titled “Takeaways (alignment recipe).”

Step 1: Perform Best-of-n and make sure it works as desired.
– Inspect a few responses and verify the reward-induced ranking makes sense.
– Best-of-n gives the best trade-offs; if it doesn’t work, no fancy method will.
– You can debug best-of-n much faster.

Step 2: Only then train your favorite alignment method.
– Track KL(π‖p) throughout training:
• KL > 100: results are unlikely to be useful.
• KL > 15: inspect outcomes for reward hacking.
• KL < 8: you are probably OK.

Bottom banner in a black box repeats “(1) Look at your data! (2) Look at your data! (3) Look at your data!” in blue, green, and red.
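Step 2 of the recipe can be turned into a small monitoring helper. The sketch below estimates KL(π‖p) as the average of log π(y) − log p(y) over responses sampled from the trained policy π, then maps the estimate to the slide's bands. The function names and the toy log-probabilities are hypothetical, the slide does not state units (nats per response are assumed), and it gives no guidance for values between 8 and 15.

```python
def estimate_kl(logp_policy, logp_base):
    """Monte Carlo estimate of KL(pi || p) from responses sampled from pi.

    logp_policy[i] and logp_base[i] are the sequence log-probabilities of the
    i-th sampled response under the trained policy and the base model.
    """
    diffs = [lp - lb for lp, lb in zip(logp_policy, logp_base)]
    return sum(diffs) / len(diffs)

def kl_status(kl):
    """Map a KL estimate to the slide's rule-of-thumb bands."""
    if kl > 100:
        return "unlikely to be useful"
    if kl > 15:
        return "inspect outcomes for reward hacking"
    if kl < 8:
        return "probably OK"
    return "between bands: the slide gives no guidance"

# Hypothetical log-probabilities for three sampled responses.
kl = estimate_kl([-10.0, -12.0, -11.0], [-14.0, -15.0, -16.0])
status = kl_status(kl)
```

Logging this value at every evaluation step alongside a handful of sampled responses keeps the "look at your data" advice attached to the number.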

184

20,165

Ahmad Beirami ✈️ ICML · May 28, 2025 · 6:21 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 May 2025

I need a Bayesian agent that can read all new RL posts, and update its belief on what works and what not, and report back every day. Who is building that?

175

15,726

Ahmad Beirami ✈️ ICML · Sep 14, 2025 · 7:11 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

14 Sep 2025

Even though Nando's problem statement is underdetermined, let's assume we are FTing a generalist, i.e., learn a new capability while retaining the core capabilities of the model (reasoning, instruction following, etc). The answer is quite different depending on your identity👇

Nando de Freitas

@NandoDF

14 Sep 2025

If you have 10K data instances, would you: 1. SFT an LLM with 10K data, or 2. Learn a reward with 5K, and RL the LLM on the remaining 5K with the learned reward 3. Other (explain)?

171

37,090

Ahmad Beirami ✈️ ICML · Apr 23, 2025 · 10:11 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Apr 2025

Excited that our paper "safety alignment should be made more than just a few tokens deep" was recognized as an #ICLR2025 Outstanding Paper! We identified a common root cause of many safety vulnerabilities and pointed out some paths forward to address it!

ICLR @iclr_conf

23 Apr 2025

Replying to @iclr_conf

Outstanding Papers Safety Alignment Should be Made More Than Just a Few Tokens Deep. Xiangyu Qi, et al. Learning Dynamics of LLM Finetuning. Yi Ren and Danica J. Sutherland. AlphaEdit: Null-Space Constrained Model Editing for Language Models. Junfeng Fang, et al.

169

13,569

Ahmad Beirami ✈️ ICML · Jun 8, 2025 · 5:41 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

8 Jun 2025

Time for my rants again: A PhD is not about counting papers :)

Richard She

@Richard_She

2 Jun 2025

Replying to @abeirami

happy for you bruv, but that's only half a Ph.D.

162

25,274

Ahmad Beirami ✈️ ICML · May 11, 2024 · 9:15 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

11 May 2024

Happy #WomeninMathematics day! May 12 marks the birthday of Maryam Mirzakhani, a mathematician who was awarded the Fields medal (highest honor in math) for her contributions to geometry and dynamical systems. Two of my fav mathematicians: Maryam Mirzakhani & Ingrid Daubechies

151

31,590

Ahmad Beirami ✈️ ICML · Jul 2, 2025 · 12:40 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

2 Jul 2025

As the NeurIPS review deadline is around the corner, please remember that you cannot use any non-local LLM like ChatGPT/Gemini for understanding the paper or drafting/revising your review, as that breaks the confidentiality agreement.

156

23,289

Ahmad Beirami ✈️ ICML · Aug 31, 2025 · 11:58 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

31 Aug 2025

A research career is built *slowly* one good paper at a time! "publishing one [great] paper which catches the right people’s attention" can be very helpful to one's career! publishing one bad paper which catches the wrong people's attention can also be extremely damaging to one's career!

Alexander Terenin @avt_im

30 Aug 2025

PSA: having many NeurIPS papers does not lead, and has never causally led, to high-paying jobs or top internships. However, working with the right advisor in grad school, or publishing one paper which catches the right people’s attention - these can both do this.

158

40,289

Ahmad Beirami ✈️ ICML · Oct 7, 2023 · 8:51 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 Oct 2023

Hey @emnlpmeeting: is this a joke or a metareview? I initially classified it as a joke but then changed to metareview. #emnlp2023

155

124,586

Ahmad Beirami ✈️ ICML · Jul 18, 2023 · 12:47 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

18 Jul 2023

Both @icmlconf and @NeurIPSConf were held in the US in 2022-23! The US is one of the most visa-unfriendly states (appointment wait times 6+ months, processing another 6+ months), and this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023

149

26,043

Ahmad Beirami ✈️ ICML · Jan 4, 2024 · 3:12 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

4 Jan 2024

Have you been compiling reward-KL tradeoffs to compare different alignment methods? Have you been using 𝐛𝐞𝐬𝐭-𝐨𝐟-𝐧 as a baseline? Have you wondered about the analytical formula that is commonly claimed for its KL divergence? 𝐾𝐿 (𝑏𝑒𝑠𝑡-𝑜𝑓-𝑛 || 𝑏𝑎𝑠𝑒) = 𝑙𝑜𝑔(𝑛) - (𝑛-1)/𝑛
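
To get a feel for the numbers, the expression in the tweet can be evaluated directly. This is only a small sketch: the function name `bon_kl_formula` is made up here, and the natural logarithm (nats) is an assumption, since the tweet does not state the base.

```python
import math

def bon_kl_formula(n):
    """Evaluate the quoted expression log(n) - (n - 1) / n (natural log assumed)."""
    return math.log(n) - (n - 1) / n

# The value is 0 at n = 1 and grows roughly like log(n) - 1 for large n.
for n in (1, 2, 4, 16, 64):
    print(n, round(bon_kl_formula(n), 4))
```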

146

27,626

Ahmad Beirami ✈️ ICML · Dec 8, 2023 · 2:51 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

8 Dec 2023

We publish a ton of our core work on language models. Here are a few samples just from our team released in the last few weeks: - arxiv.org/abs/2310.16523 - arxiv.org/abs/2310.15141 - arxiv.org/abs/2310.16959 - arxiv.org/abs/2310.17022 How's that "giving almost nothing back"?

Luca Soldaini 🎀

@soldni

8 Dec 2023

I can't help being a bit sad in thinking how much the Gemini release is taking from the academic and open LMs communities, while giving almost nothing back. I can't help but wish our relationship with commercial players would be more of a two way channel

144

58,121

Ahmad Beirami ✈️ ICML · Jul 28, 2025 · 3:24 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Jul 2025

Instead of complaining that peer review is dead, take a positive step to improve it today. The reviewers are not aliens, they are us!
- Revise your review and make it clear. Identify the crucial points that impacted your score negatively and positively.
- If the paper is lacking information about its claims, communicate your asks and the reasoning concretely. Don't just ask for 2 more experiments because you feel the authors didn't work hard enough. Don't ask for experiments unless they verify a hypothesis (which you clearly explained).
- Look for the missing information that you identified and make sure it is really not in the paper.
- Recommend acceptance if the paper's claim adds a new nugget of information to the literature (no matter the size of the nugget), and if the paper has substantiated the claim via theoretical / empirical evidence.
Believe me, this doesn't take much time, and will improve the state of peer review significantly!

149

18,983

Ahmad Beirami ✈️ ICML · May 10, 2025 · 10:58 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

10 May 2025

TMLR's bar is not lower. The bar is drawn differently to remove subjectivity from the acceptance criteria. I have read many good TMLR papers and many disastrous ICML papers. In fact I think those disastrous ICML papers wouldn't stand a chance at TMLR.

Neel Nanda

@NeelNanda5

10 May 2025

Replying to @PandaAshwinee @sethkarten

TMLR defines its bar for acceptance in a way I consider pretty clearly lower than conferences. Completely agreed that peer review is dumb and noisy in many ways. But, like, I think that prestige is a real and meaningful phenomenon that really affects careers

151

38,199

Ahmad Beirami ✈️ ICML · Jun 8, 2025 · 10:08 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

8 Jun 2025

This take couldn't get more wrong! Industry people do some of their most interesting research with interns since internships allow for more risky projects!

This tweet is unavailable

146

10,926

Ahmad Beirami ✈️ ICML · Nov 8, 2025 · 10:45 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

8 Nov 2025

Apple’s “smart” charging decided my Mac only needed to be full by 7:30am so I woke up to a half-charged laptop before a flight. I miss dumb, predictable tech.

142

14,158

Ahmad Beirami ✈️ ICML · Sep 3, 2025 · 11:42 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

3 Sep 2025

Varadhan's Lemma remains the most effective tool for characterizing large deviations in probability theory. It says: the log of an exponential average is governed by the outcome that best balances payoff and cost, where cost is how improbable that outcome is under the data distribution (the large-deviation rate), and payoff is the numerical score being averaged. Tilted ERM is inspired by this principle. Let f be the per-example loss (which is the payoff) on data z₁,…,zₙ. The tilted empirical risk with tilt t is R_t = (1/t) log( (1/n) ∑ᵢ exp(t f(zᵢ)) ). The induced weights are wᵢ ∝ exp(t f(zᵢ)). For t>0 the weighting emphasizes higher losses, for t<0 it emphasizes lower losses, as t→0 it approaches the mean loss, and for large |t| it approaches the max or min loss. This is exactly the payoff versus cost tradeoff, recast for ML. Screenshot: Dembo and Zeitouni’s book.
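
The tilted risk and its induced weights are easy to compute directly. The sketch below follows the formulas in the tweet; the function names, the toy loss values, and the max-shift used for numerical stability are additions made here for illustration.

```python
import math

def tilted_risk(losses, t):
    """R_t = (1/t) * log(mean(exp(t * f_i))), computed with a max-shift for stability."""
    if t == 0:
        return sum(losses) / len(losses)  # the t -> 0 limit is the mean loss
    m = max(t * f for f in losses)
    s = sum(math.exp(t * f - m) for f in losses) / len(losses)
    return (m + math.log(s)) / t

def tilted_weights(losses, t):
    """Normalized weights w_i proportional to exp(t * f_i)."""
    m = max(t * f for f in losses)
    w = [math.exp(t * f - m) for f in losses]
    z = sum(w)
    return [x / z for x in w]

losses = [0.1, 0.5, 2.0]  # toy per-example losses (assumption)
for t in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(t, round(tilted_risk(losses, t), 4))
```

With these toy losses, large positive `t` pushes the risk toward the largest loss and large negative `t` toward the smallest, while `tilted_weights` shows which examples dominate at each tilt.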

Probability and Statistics

@probnstat

2 Sep 2025

S. R. S. Varadhan, a recipient of the 2007 Abel Prize, is a towering figure in modern probability theory. His most notable contribution is the creation of a unified theory of large deviations. This theory provides a powerful framework for understanding the probability of rare events, which are often exponentially unlikely. While traditional probability focuses on average or typical outcomes, Varadhan's work gives precise mathematical tools to analyze the behavior of systems when they deviate significantly from the norm. His work has far-reaching applications in fields from statistical physics to finance and risk management.

ALT Source: https://share.google/images/rsoGPGNNjRSiYOrJi

142

16,182

Ahmad Beirami ✈️ ICML · Jul 23, 2025 · 11:58 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Jul 2025

If you are interested in safety/security jailbreaking of LLMs, defenses against them, and how the safety issues become more complicated when we design agentic workflows, this tutorial by @HamedSHassani, @aminkarbasi, @AlexRobey23 is highly recommended

144

9,863

Ahmad Beirami ✈️ ICML · Mar 28, 2022 · 6:33 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Mar 2022

I left Meta AI a few weeks ago. I am filled with gratitude for the opportunity to “think big” and propose Project CAIRaoke to redefine the future of #ConversationalAI! I am also humbled to have worked with so many amazing people along the way!

140

Ahmad Beirami ✈️ ICML · Jul 31, 2024 · 8:30 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

31 Jul 2024

To the reviewer who claimed 8% improvement is marginal and not significant enough for a top conference paper: The goal of a scientific paper is to further our collective understanding of how to solve a problem; it's not to launch a new algorithm in a production setting.

139

10,553

Ahmad Beirami ✈️ ICML · Nov 14, 2025 · 4:26 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

14 Nov 2025

Even from NeurIPS experiments a decade ago, the bottom 25% of papers had a 5% chance of acceptance whereas the top 25% had a 50% chance. Review scores are noisy! With the recent flood of papers and the downfall of reviews, the correlation between merit and acceptance is going to 0 fast. Can someone estimate the convergence rate?

Micah Goldblum @micahgoldblum

13 Nov 2025

An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3

140

26,289

Ahmad Beirami ✈️ ICML · Aug 23, 2025 · 2:51 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Aug 2025

Too often, researchers propose solutions before they’ve defined the problem.

139

37,681

Ahmad Beirami ✈️ ICML · Jun 13, 2025 · 9:40 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

13 Jun 2025

I hope my Iranian and Israeli friends and their loved ones are safe, and I wish this war ends in a peaceful terminating state soon!

138

18,551

Ahmad Beirami ✈️ ICML · Nov 1, 2025 · 2:34 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

1 Nov 2025

arXiv is already beyond the point where anyone can follow everything, and a fair amount of junk gets through. I miss the days when I could read all new cs.IT abstracts over my morning coffee! I understand the goal of filtering obvious junk, but the proposed gatekeeping will misclassify valuable work, slow timely release, and likely push authors to other platforms for immediate dissemination. There is already a good number of papers that sit in moderation for weeks before appearing without a clear signal on why. The better path IMHO is still light moderation that leans on author reputation/endorsement, with clear and verifiable criteria rather than noisy peer review requirements. Peer review itself is in crisis, so using it as a hard prerequisite risks taking down arXiv too.

Thomas G. Dietterich @tdietterich

31 Oct 2025

The Computer Science section of @arxiv is now requiring prior peer review for Literature Surveys and Position Papers. Details in a new blog post

136

28,634

Ahmad Beirami ✈️ ICML · May 27, 2025 · 6:26 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

27 May 2025

Very cool findings! But this says more about Qwen models than about RL.

Stella Li ✈️ ICML🇰🇷

@StellaLisy

27 May 2025

🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewards

138

24,369

Ahmad Beirami ✈️ ICML · Sep 13, 2025 · 4:07 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

13 Sep 2025

Cool paper packed with lots of useful information, which also has important implications for context engineering!

Arvindh Arun

@arvindh__a

12 Sep 2025

Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵

132

16,683

Ahmad Beirami ✈️ ICML · Oct 4, 2025 · 6:39 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

4 Oct 2025

Does each policy gradient update only contain O(1) bits of information about the model we are trying to train while each SFT update has O(T) where T is the number of tokens in the example? I think the answer is not a clean binary. Let’s examine this in a simple setup.

Let’s consider a simple language model. It is one that could only be prompted with "Tell me a joke" (and won't work with anything else), and it can generate exactly N jokes, each exactly T tokens, and each with probability 1/N. We will see that T appears in none of our arguments.

Let’s assume that we have a binary reward that determines whether the joke is desirable (say safe and good). And let’s assume that each joke is good with probability α, and bad with probability (1 − α). So, the ideal solution is clear. Filter out the bad jokes and generate each of the K good jokes with probability 1/K each. Here K is a random variable whose mean is αN. This is what we’d get from solving population level RL with no KL regularization. Let’s call this solution p*.

Now, let’s consider a policy gradient update on trajectory τ. Before doing that, we need to define a model family p_θ; and for that let’s just define a simple model family where θ is an element of the N-dimensional simplex, i.e., θ_i ∈ [0, 1], and ∑_i θ_i = 1. Then, p_θ(τ_i) = θ_i, where τ_i is the i’th joke (or trajectory). With this notation let τ be a random trajectory drawn from p_θ. We have G = S · Adv, where S = ∇ log p_θ(τ). Let’s consider the very first step of training where θ = (1/N, …, 1/N) and hence S is independent of p*, and history = []. Thus, I(G; p* | history) ≤ I(Adv; p* | S) ≤ h(α). This bound is intended only for the very first step of training when θ is uniform and S is independent of p*. After the first step, S can depend on history, so this simple bound need not apply as written but overall information is probably still upper bounded by 1 bit.

Let’s also consider supervised learning in this case, and consider training on data that keeps getting randomly generated through positive examples. The end state of training is also going to be the same p* in this case, but let’s see how much mutual information one example contains about p*. The update is going to be G = ∇ log p_θ(τ) where τ is a positive example. I(G; p* | history) = I(τ; p* | history) ≤ log(1/α).

In the RL case, each example resolves at most 1 bit of information about the model by determining whether or not a random rollout was good. In the SFT case, the information depends on how rare such an outcome is. If α > 1/2, then SFT also provides at most 1 bit of information about the ideal model. But if this is a rare outcome, then SFT provides significantly more information about the outcome.

Think about it this way: if there is only one good joke out of the millions that the model may generate, then RL needs millions of rollouts to stumble upon that good joke, whereas if SFT points directly to that joke, then we can just learn it and move on. But if there is just one bad joke in the millions of jokes that the model can generate, neither SFT nor RL is sample efficient because we need to see millions of rollouts to determine which joke was bad and filter it out.

Notes and clarifications.
• T does not appear in these arguments. The contrast here is not O(1) vs O(T) in this toy setup. It is O(1) for a binary sequence-level reward in the first PG step, versus up to log(1/α) for a positive SFT example.
• Data processing. You can view p* as filtering each outcome independently with probability (1 − α), so any K-sized good set where E{K} = αN is possible. Under this process the log(1/α) bound follows from how a positive example shrinks the consistent good-set space.
• Richer rewards. The ≤ 1 bit conclusion for RL depends on the reward being binary at the sequence level. If the reward is real-valued and informative, a single update can carry more than 1 bit. The discussion here is about the binary case. Cases like GRPO/RLOO where the reward is calibrated against several rollouts potentially contain more information.
• Exploration. The “rare good joke” gap assumes naive on-policy sampling. Guided exploration or data selection (for example using a reward model or replay that favors promising trajectories or some other form of supervision) can reduce the number of rollouts needed to find rare good outcomes.
• KL regularization. In practice RLHF is usually KL-regularized around a reference model. The KL shapes token-level behavior and can significantly change these information-theoretic arguments.
• In practice, it is not uncommon to do poor man's RL through SFT by rolling out the model many times, keeping only the good examples, filtering out the bad ones, and doing SFT on the good examples. If we do that, then we will end up needing 1/α rollouts per SFT example, so even though we get log(1/α) bits of information per SFT example update, we will still get O(1) information per original model rollout.

Happy to hear your thoughts on these arguments!
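
The two bounds in the argument above can be compared numerically. This is a rough sketch under stated assumptions: the helper names are invented here, and base-2 logarithms are used so that the values read as bits (the thread speaks of "1 bit" but does not fix the base of log(1/α)).

```python
import math

def binary_entropy_bits(alpha):
    """h(alpha) in bits: the bound quoted for the first policy-gradient step."""
    if alpha in (0.0, 1.0):
        return 0.0
    return -alpha * math.log2(alpha) - (1 - alpha) * math.log2(1 - alpha)

def sft_bound_bits(alpha):
    """log2(1/alpha): the bound quoted for one positive SFT example."""
    return math.log2(1.0 / alpha)

for alpha in (0.9, 0.5, 0.1, 1e-3, 1e-6):
    rl = binary_entropy_bits(alpha)
    sft = sft_bound_bits(alpha)
    # Rejection-sampled SFT needs about 1/alpha rollouts per kept example,
    # so the per-rollout figure is the SFT bound multiplied by alpha.
    per_rollout = sft * alpha
    print(f"alpha={alpha:g}  RL<={rl:.4f}  SFT<={sft:.4f}  SFT/rollout<={per_rollout:.6f}")
```

For rare good outcomes (small α) the per-example SFT bound grows while the RL bound shrinks, and the last column shows how the advantage disappears once the cost of generating the positive examples by rollouts is counted.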

133

18,666

Ahmad Beirami ✈️ ICML · Feb 3, 2024 · 1:21 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

3 Feb 2024

[PSA] Please use \citet{} and \citep{} correctly or the paper will be hard to read. \citet{}: The author name is intended to be part of the sentence: X et al. (2024). \citep{}: The citation appears in parentheses and is not read as part of the sentence: (X et al., 2024).

128

14,308

Ahmad Beirami ✈️ ICML · Oct 2, 2025 · 12:24 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

2 Oct 2025

Disrespecting and threatening the chairs is a terrible idea, even if you believe the decision on your paper is totally unfair! If you feel the urge to email the chairs, wait a day and reflect on the result a bit more before doing so. One of the best pieces of advice I have received: What you say today, you cannot take back tomorrow. What you don't say today, you can still say tomorrow! P.S. You cannot imagine how much effort the PCs put in to assemble the program. So we should all be thankful to them for taking on this thankless job on behalf of all of us!

NeurIPS Conference

@NeurIPSConf

1 Oct 2025

We are disappointed that some community members have been sending threatening and disrespectful messages to our organizers after NeurIPS decisions. While we always welcome feedback, we wish to remind all that our organizers are volunteers who work tirelessly on behalf of the community, and such messages will not be tolerated and may potentially be investigated as code of conduct violations.

127

17,176

Ahmad Beirami ✈️ ICML · May 11, 2025 · 1:58 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

11 May 2025

The review quality in TMLR is better because: 1. The authors suggest the AE. This means that the AE is more likely to be the right fit for the paper. 2. The AE selects the best reviewers for the paper who may or may not be in the reviewer pool. ...

Neel Nanda

@NeelNanda5

11 May 2025

Replying to @PandaAshwinee @abeirami

I've had good and garbage reviews at the 3 big conferences and TMLR. I don't recall a clear difference. Good to hear you've found TMLR to be good!

130

58,482

Ahmad Beirami ✈️ ICML · May 7, 2025 · 1:36 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 May 2025

I consider candidates with >1 published papers. I read one of their papers to ground the interview conversation and make a decision based on - quality of their paper (writing, experiment design, etc) - their knowledge in their area of competency - their general knowledge - ...

Mohsen davami @MohsenDava64048

7 May 2025

Replying to @abeirami

Say you were to take interns/students. Who'd you pick between a candidate who's passionate about fundamental questions without top-tier conference papers and one who has numerous top-tier conference papers on topics that no one would care about in a few years?

126

37,323

Ahmad Beirami ✈️ ICML · May 15, 2025 · 2:32 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

15 May 2025

Make the review process for all conferences fully open like ICLR and TMLR. Among other things, this will deter submissions of half-baked papers.

131

11,173

Ahmad Beirami ✈️ ICML · Sep 28, 2025 · 11:27 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Sep 2025

When we (RL) finetune a generalist model on a reward, there are three major axes that matter.

Slide listing three goals: 1) Reward—improve reward in the finetuned checkpoint; 2) Capabilities—maintain pre-finetune abilities (instruction following, reasoning); 3) Efficiency—reach the reward-vs-capabilities Pareto frontier with minimal compute and sample complexity.


124

9,399

Ahmad Beirami ✈️ ICML · Feb 21, 2024 · 10:57 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

21 Feb 2024

Off policy algorithms (even the simplest ones) are indeed useful for RLHF and can train a value function on massive amounts of data and 𝗼𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗜𝗣𝗢/𝗗𝗣𝗢/𝗣𝗣𝗢. arxiv.org/abs/2310.17022

Controlled decoding from language models

KL-regularized reinforcement learning (RL) is a popular alignment framework to control the language model responses towards high reward outcomes. We propose a modular solver for this RL objective, called controlled decoding (CD), which exerts control through a separate prefix scorer module. At training time, the prefix scorer learns a value function for the reward, and it is used at inference time to control the generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that a single prefix scorer can learn multiple rewards and different reward combinations can be configurable at inference time, effectively solving a multi-objective RL problem with no additional training. We show that the benefits of applying CD transfer to an unseen base model with no further tuning.
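
One way to picture "control through a separate prefix scorer" is to reweight the frozen base model's next-token probabilities by an exponentiated value estimate. The sketch below is an illustration of that general idea only, not the paper's implementation: the function name, the `strength` parameter, and the toy numbers are all assumptions.

```python
import math

def controlled_next_token(base_probs, prefix_values, strength=1.0):
    """Reweight frozen base-model probabilities by exp(strength * value) and renormalize."""
    scores = [p * math.exp(strength * v) for p, v in zip(base_probs, prefix_values)]
    z = sum(scores)
    return [s / z for s in scores]

base = [0.5, 0.3, 0.2]     # hypothetical base-model next-token probabilities
values = [0.0, 1.0, 2.0]   # hypothetical prefix-scorer values for each candidate token

# strength = 0 leaves the base distribution untouched; larger values shift
# probability mass toward tokens the scorer rates higher.
print(controlled_next_token(base, values, strength=0.0))
print(controlled_next_token(base, values, strength=1.0))
```

Because the base model is never updated in this picture, swapping in a different scorer or a different `strength` changes behavior at inference time without retraining.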


Nathan Lambert

@natolambert

21 Feb 2024

A brief summary on what REINFORCE is in terms of RLHF and history of RL. The algorithm known as REINFORCE is really just the vanilla policy gradient approach. The name comes from Williams 1992, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning". Policy gradient algorithms directly update the weights of the policy based on some estimate of reward.

Mostly, it's important to know how PPO compares and why it emerged, and why that may not matter for RLHF. REINFORCE methods were known for high variance on the policy gradients, leading to unstable learning. Basic methods like baselining and other regularization tools (off-policy gradients, actor-critic methods, etc) all emerged. The notable paper on the path towards PPO (which lots of people use for RLHF today) was Trust Region Policy Optimization (TRPO). Both TRPO and PPO try to answer the same question: How do we take the right sized step with a potentially noisy policy gradient estimate? TRPO does this with a second order approximation. PPO does this with a first order approximation. It ended up being much simpler to implement, and its popularity is now obvious.

We'll see if we go down the same paths of off-policy algorithms being useful for RLHF (much more data to learn from) and actor-critic algorithms (separate learning of value function and policy). The things that were important for state-based control may not be important for language because our reward functions are very different with a reward model. Google may need extremely good reward models to get REINFORCE to work, as they mentioned in the Gemma paper. Regardless, a big way around high variance gradients is to take big batch sizes. We know Google has the compute for that, we're not sure all of us DPO hackers do.
Links / references Williams' paper: link.springer.com/content/pd… Policy gradient slides from @svlevine: rail.eecs.berkeley.edu/deepr… Policy gradient blog post from @lilianweng : lilianweng.github.io/posts/2… Spinning up docs: spinningup.openai.com/en/lat…

125

28,502

Ahmad Beirami ✈️ ICML · Sep 26, 2023 · 3:13 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

26 Sep 2023

The list of #NeurIPS2023 accepted papers is available here: neurips.cc/virtual/2023/pape…

125

26,335

Ahmad Beirami ✈️ ICML · Aug 24, 2025 · 3:40 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

24 Aug 2025

My personal thoughts on conference acceptance rate being artificially kept low. From my own experience, if we focus on merit based acceptance with two simple criteria (like TMLR): - claims are substantiated with theoretical/empirical evidence - claims push the science envelope by epsilon. Then, we will still end up with ~25% acceptance rate and won't need to artificially reject any papers. The problem we have with the noisy review system is that we are accepting a lot of papers that are not even correct.

Ahmad Beirami ✈️ ICML @abeirami

13 Apr 2022

125

27,832

Ahmad Beirami ✈️ ICML · Jul 25, 2022 · 5:04 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

25 Jul 2022

Excited to share that our paper titled “Winning is not everything: enhancing game development with intelligent agents,” @IEEETxnOnGames, June 2020, has been selected to receive the 2023 Outstanding Paper Award by @ieeecis Awards Committee.

120

Ahmad Beirami ✈️ ICML · Oct 23, 2021 · 3:54 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Oct 2021

[2022 #Internships at the @facebookai Conversational AI Research (CAIR) team] The CAIR team is seeking to recruit multiple PhD student interns to work with us (@SatwikKottur, @Chinnadhurai, @abeirami, and others) on different aspects of #ConversationalAI and #NLProc. 1/5

120

Ahmad Beirami ✈️ ICML · Jul 5, 2025 · 3:05 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

5 Jul 2025

Yes, soft distillation generalizes well. Hard distillation (SFT on teacher generated data) does not. Wasn't this (widely) known already?
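
For readers unfamiliar with the distinction, the two objectives can be written down for a single next-token position. This is a minimal sketch under one common reading of the terms; the function names and the toy distributions are assumptions, and the code only defines the losses, it does not test the generalization claim.

```python
import math

def soft_distill_loss(teacher_probs, student_probs):
    """Cross-entropy of the student against the teacher's full next-token distribution."""
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs) if p > 0)

def hard_distill_loss(sampled_token, student_probs):
    """Negative log-likelihood of one token sampled from the teacher (SFT-style)."""
    return -math.log(student_probs[sampled_token])

teacher = [0.6, 0.3, 0.1]  # hypothetical teacher next-token probabilities
student = [0.4, 0.4, 0.2]  # hypothetical student next-token probabilities

print("soft:", round(soft_distill_loss(teacher, student), 4))
for token in range(3):
    print("hard, token", token, ":", round(hard_distill_loss(token, student), 4))
```

For a fixed student, the hard loss averaged over teacher samples equals the soft loss; the difference is that each soft example exposes the teacher's whole distribution, while each hard example exposes a single sampled token.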

Dimitris Papailiopoulos

@DimitrisPapail

5 Jul 2025

A model SFT’d on curated synth data, generated from a teacher, is not the same as distillation. It’s a form of “off policy RL”. Filtering entails a form of reward. Which is why models trained on mostly synthetic data like phi-4 can become better than the teacher in some tasks.

122

31,785

Ahmad Beirami ✈️ ICML · Aug 1, 2024 · 3:07 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

1 Aug 2024

If a paper clears the bar, give it a score ≥6. Here is how I think about ratings: - Should be oral? 8/9 - Should be spotlight? 7/8 - Clears the acceptance bar? 6/7 - Could be accepted after minor revs? 4/5 - Could be accepted after major revs? 3/4 - Fundamentally flawed 2/3

Ahmad Beirami ✈️ ICML @abeirami

13 Apr 2022

121

27,560

Ahmad Beirami ✈️ ICML · Aug 7, 2024 · 2:44 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 Aug 2024

If you decide to withdraw your paper without a rebuttal, it's nice to write a short (3-4 sentences) withdrawal note to thank the reviewers for their feedback, describe what you agree/disagree with, and what you plan to do. Besides, you may get the same reviewers again.

113

11,717

Ahmad Beirami ✈️ ICML · Oct 23, 2025 · 10:54 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

23 Oct 2025

That's called loss maximization!

Yuandong Tian

@tydsh

23 Oct 2025

Several of my team members + myself are impacted by this layoff today. Welcome to connect :)

116

12,865

Ahmad Beirami ✈️ ICML · Nov 9, 2024 · 4:40 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

9 Nov 2024

There is a subtle distinction between RL in RLHF and RL in domains with a clear reward signal that captures what we want, like winning in games or correctness in math. With a clear reward, RL is quite effective and can lead to novel sequences of actions (e.g. move 37). But, ... 👇

Phillip Isola @phillip_isola

9 Nov 2024

Replying to @abeirami

interesting, but I think the early work on RLHF was pretty impressive on teaching new skills, without pre-training/SFT, e.g., openai.com/index/learning-fr… how do you equate your argument with that? those allowed large KL?

116

28,661

Ahmad Beirami ✈️ ICML · Nov 16, 2025 · 12:27 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

16 Nov 2025

PSA: If you are writing a paper that would be obsolete by the time it gets published (i.e., has no archival value), then don't! That should be a blogpost or a tweet. A paper should be reserved to communicate an insight beyond a bunch of bold numbers in a table.

Santiago M.

@sanmking

15 Nov 2025

Replying to @abeirami

At the same time, if you are reading a published paper. You are almost certainly already behind!

111

13,945

Ahmad Beirami ✈️ ICML · Dec 27, 2024 · 11:34 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

27 Dec 2024

Very interesting paper by @th33rtha et al. For categorical/Gaussian distributions, they derive the rate at which a sample is forgotten to be 1/k after k rounds of recursive training (hence 𝐦𝐨𝐝𝐞𝐥 𝐜𝐨𝐥𝐥𝐚𝐩𝐬𝐞 happens more slowly than intuitively expected).

113

13,479

Ahmad Beirami ✈️ ICML · Mar 28, 2022 · 6:33 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Mar 2022

As for next steps, I am excited to share that I have joined #GoogleResearch to lead efforts around robust and fair development of core machine learning techniques. I am also moving to New York City, and excited to be back to the east coast!

110

Ahmad Beirami ✈️ ICML · Nov 6, 2023 · 1:45 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

6 Nov 2023

When I first started reviewing ML papers, I was fighting to reject bad papers. These days I find myself fighting to accept good papers. The change in perspective has also made my reviews more constructive even in cases where the recommendation has to be reject.

107

7,785

Ahmad Beirami ✈️ ICML · Oct 27, 2025 · 10:28 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

27 Oct 2025

0: Do Best-of-N on your rubric/reward first, look at the outcomes, and verify that it works as intended.

Casper Hansen

@casper_hansen_

27 Oct 2025

i think i figured out the correct pipeline for RL 1 : forget RL and just do DSPy with GEPA. develop your agent loop and eval here. it's cheaper and faster. 2 : convert it to RL and compare to baseline. 3 : hopefully RL worked, otherwise revert to prompt optimizers

111

22,743

Ahmad Beirami ✈️ ICML · Jun 13, 2025 · 12:08 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

13 Jun 2025

Has your LLM been leaking? We have the perfect solution for you! Introducing our new venture, Gemini Waterproofing (Since 1989)

107

11,461

Ahmad Beirami ✈️ ICML · Aug 7, 2025 · 3:58 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

7 Aug 2025

We are still in an Evaluation crisis!

111

11,532

Ahmad Beirami ✈️ ICML · Jul 28, 2025 · 11:20 AM UTC

Ahmad Beirami ✈️ ICML @abeirami

28 Jul 2025

There is a rich set of research questions in design and optimization of agentic workflows with a ton of room for theoretical & algorithmic work! A great starting point to get exposed to them is the MIPRO paper (@kristahopsalong @lateinteraction et al.) and the DSPy framework.

111

8,716

Ahmad Beirami ✈️ ICML · Aug 19, 2025 · 12:41 PM UTC

Ahmad Beirami ✈️ ICML @abeirami

19 Aug 2025

Most “robustness” work (adversarial, shift, etc.) is just training on reweighted samples (augmented, model-generated, or mined). OOD generalization then comes from: (1) inductive bias, (2) similarity to train data, (3) luck. The 3rd one is the most important of the three.

108

12,579