stealth // ex Gemini RL+Inference @GoogleDeepMind // Chat AI @AIatMeta // RL Agents @EA // ML+Information Theory @MIT+@Harvard+@GeorgiaTech

{NYC, SFO, YYZ}
We are hiring Members of Technical Staff (Research Engineers)! Current LLM agents lack reliability, creating a gap between demos and production. We solve this by automating the complex workflow of debugging, evaluation, and iteration required to make agents robust. 👇
25
38
649
82,811
They should have split the $10k into 10-100 stacks of $100-1k and given each stack to an identical copy of the same model to be able to see anything remotely meaningful. Right now we are looking at noise!
55
27
1,301
117,557
What are the founders going to own? 🤔
50
11
1,036
127,337
There is no free lunch here. Pruning hurts robustness and capabilities other than those captured by the success metric used to prune.
I removed 74% of neurons from a neural network. It dropped the accuracy by just 0.50%. Here's a breakdown (with code):
25
8
737
92,476
After three incredible years, today is my last day at Google DeepMind! I am truly grateful to the amazing colleagues who made the journey 1000x more fruitful and enjoyable! I am forever indebted to my collaborators who showed me how to be better at everything via demonstrations.
38
13
750
85,415
This is a great example of what good research looks like. You start with a real problem. You peel it layer by layer to find the root cause. You form a new hypothesis and keep digging. At the end, you have something insightful to share!
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/def…
6
32
698
53,874
The best AI researchers zoom at three abstraction levels: - High: paper-level ideas & math - Mid: code-level implementation - Low: GPU/TPU reality (kernels/memory) Low exposes bottlenecks. High accelerates exploration. Mid makes it real. The job is to translate between them!
9
40
664
39,280
The main ingredient that led to GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt. Let me elaborate on what I mean by that and a cheaper way of doing it offline.
11
53
655
117,600
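A minimal sketch of the multi-rollout calibration mentioned in the GRPO post above, assuming the commonly used group-relative form: each rollout's reward is centered by the mean reward of the rollouts sampled for the same prompt and scaled by their standard deviation. The function name `group_relative_advantages` and the `eps` parameter are illustrative and not taken from the post.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of several rollouts for ONE prompt.

    The group mean acts as a per-prompt baseline, so an easy prompt
    (all rollouts rewarded) and a hard prompt (none rewarded) both
    produce near-zero advantages instead of a biased learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same prompt with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt where every rollout succeeds carries no relative signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The baseline here is computed online from the group itself; an offline variant would estimate the per-prompt mean reward ahead of time, which is one possible reading of the "cheaper way" the post alludes to.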
As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
15
55
523
77,724
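One way to see the claim in the post above, sketched under the standard KL-regularized objective with base model p, reward r, and regularization strength β > 0. The symbols π and Z are assumptions chosen to match common usage, and the reward is assumed finite.

```latex
\begin{aligned}
\pi^\star(\cdot \mid x) &= \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,p(\cdot \mid x)\big), \\
\pi^\star(y \mid x) &= \frac{p(y \mid x)\,\exp\!\big(r(x,y)/\beta\big)}{Z(x)},
  \qquad Z(x) = \sum_{y'} p(y' \mid x)\,\exp\!\big(r(x,y')/\beta\big), \\
p(y \mid x) = 0 \;&\Longrightarrow\; \pi^\star(y \mid x) = 0 .
\end{aligned}
```

Under these assumptions the optimum only reweights outcomes that the base model already produces with nonzero probability, which matches the "amplify existing weak skills" reading.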
A very nice blogpost on GRPO (the method that was used to train R1) by Youssef Mroueh
5
44
474
53,634
Post-training research was fueled by the KL-regularized RL mathematical foundation. That led to a lot of algorithmic research and a ton of progress over a few years. This helped us learn how to "distill" metrics back into models. But today we are optimizing workflows/agents.
7
44
453
41,955
Unpopular opinion: When a paper has a senior mentor and a junior mentee, the senior author must make sure the claims are correct and well supported. They must check every claim and gate the submission until it meets that bar. The junior author is the generator. The senior author is the verifier. The verifier should teach/distill some checks to the generator, but the verifier keeps final responsibility. If a wrong claim gets out, it is on the verifier!
9
20
395
43,476
The question that a reviewer should ask themselves is: Does this paper take a gradient step in the right direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.
9
36
368
This post should have been titled "Finetuning without regret" Finetuning models for specific narrow use cases to save costs, and to keep data on prem during inference, makes sense in general. However, that is bottlenecked by two main issues IMO: 1. Today, OSS models are lacking in capabilities, so the quality ceiling stays low even after finetuning. 2. Finetuning is delicate. It is easy to overoptimize and regress on core skills like reasoning or instruction following. This is hard to detect outside big labs, since most teams lack broad eval sets and only measure the narrow task they care about. We have not solved 1 yet, but Tinker shows promise on 2. The direction seems right: build tooling that makes finetuning easier, from low-level infra up to higher-level algorithms and evals that keep performance on track. I peeked at the code and liked how clean it is, and it reads as built from first principles. If OSS models catch up to frontier quality, I can see this service really taking off.
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
16
25
355
69,929
Once you see a math concept geometrically, it becomes much easier to think about, and it’s hard to go back to any other way of seeing it.
23
26
353
17,157
𝐛𝐞𝐬𝐭-𝐨𝐟-𝐧 is a strong baseline for - improving agents - scaling inference-time compute - preference alignment - jailbreaking models How does 𝐁𝐨𝐧 work? and why is it so strong? Find some answers in the paper we wrote over two Christmas breaks!🧵
5
51
351
46,590
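For readers who have not seen it written down, best-of-n as referred to in the post above can be sketched in a few lines. The sampler and reward below are toy stand-ins (a uniform number in [0, 1) and the identity reward), chosen only so the example runs on its own; they are assumptions, not the setup of the paper.

```python
import random

def best_of_n(sample, reward, n, rng):
    """Draw n candidates from the base sampler and return the highest-reward one."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins (assumptions): the "model" emits a number in [0, 1),
# and the reward is the number itself.
rng = random.Random(0)
base_sample = lambda r: r.random()
reward_fn = lambda y: y

trials = 2000
mean_base = sum(reward_fn(base_sample(rng)) for _ in range(trials)) / trials
mean_bo8 = sum(
    reward_fn(best_of_n(base_sample, reward_fn, 8, rng)) for _ in range(trials)
) / trials
```

In this toy case the expected reward moves from 1/2 for the base sampler to 8/9 for best-of-8, at the price of eight base samples per returned output.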
A great PhD advisor helps you discover the best version of yourself! In doing so, they'll help you set a research vision, make yearly plans, break the problem down into solvable pieces, and figure out how to measure success, when to pivot, how to communicate clearly & concisely (written/verbal), etc.
Replying to @abeirami
may I ask what you think an advisor's most important role is?
3
17
304
29,097
I occasionally get messages asking how to follow my path and get into Meta, DeepMind, or similar places. That is the wrong question. Do not focus on the brand! Focus on what you want to work on, then find the opportunity that fits your goals best.
4
11
295
23,772
Everyone should know Chernoff bounds, and Sunil has written an excellent introductory post. Check it out!
Ever wonder why the Chernoff bound feels like magic? A geometric answer: KL divergence loves exponential families. This post shares some reflections — and sets up a series on how KL geometry connects classical statistics, online learning (OCO), and more. sunilmadhow.github.io/posts/…
3
23
297
57,598
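As a quick numerical companion to the Chernoff bound posts above, the sketch below checks the textbook Bernoulli form, P(sample mean ≥ a) ≤ exp(−n · KL(a‖p)) for a > p, against the exact binomial tail. The helper names and the values of n, p, a are illustrative choices.

```python
import math

def bernoulli_kl(a, p):
    """KL(Bernoulli(a) || Bernoulli(p)) in nats, for 0 < a, p < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def binomial_upper_tail(n, p, a):
    """Exact P(sample mean of n Bernoulli(p) draws >= a)."""
    k_min = math.ceil(a * n - 1e-12)  # small slack guards against float round-off
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )

n, p, a = 100, 0.5, 0.7
exact = binomial_upper_tail(n, p, a)
bound = math.exp(-n * bernoulli_kl(a, p))
```

The exponent of the bound is exactly n times a KL divergence, which is the connection to KL geometry that the quoted post highlights.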
ICLR season, and my timeline is flooded with paper threads that jump straight to we beat SOTA. But the solution only makes sense in the context of the problem, which is usually missing. What most threads skip: - What problem are you solving? - Why does it matter? - What did prior work miss? Instead, we get a tour of the method and a leaderboard screenshot. Remember that the audience for the problem is much larger than the audience for your particular solution.
5
22
294
26,537
[#eacl2024 paper] TL;DR We introduce 𝗴𝗿𝗮𝗱𝗶𝗲𝗻𝘁-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝗱 𝘁𝗲𝗮𝗺𝗶𝗻𝗴 (𝗚𝗕𝗥𝗧), an effective method for triggering language models to produce unsafe responses, even when the LM is finetuned to be safe through 𝑎𝑙𝑖𝑔𝑛𝑚𝑒𝑛𝑡.
4
35
280
47,513
NeurIPS seems to be on track to get 25k+ submissions 😳
7
10
270
38,123
RLHF provably can't teach models any new knowledge. If you need to teach new skills, you need to look at pre-training and SFT. Why? 👇
11
20
273
83,668
For a long time, the biggest problem in machine learning has been improving and understanding robustness and generalization to OOD. We are just increasingly making more & more problems in-distribution but the models still don't generalize out-of-the-box to the tail of problems.
19
17
262
29,547
That's exactly the kind of advisor you do not want to do a PhD with!
At this point, you can use any AIs (perplexity, chatgpt, etc) as a PhD advisor on whatever topic you're getting deep into. It's pretty good. I have my preference. The core point: Research advice is no longer an elite academic university thing.
10
11
253
37,322
In today's publication culture, most authors are after being SOTA, showing tables with 𝐛𝐨𝐥𝐝 numbers, and writing the minimum viable paper! The goal of a scientific paper should be to push the field forward with new intuition/insights on how to think about solving a problem.
5
25
251
23,019
[#ICML2024] 𝗖𝗼𝗻𝘁𝗿𝗼𝗹𝗹𝗲𝗱 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝗳𝗿𝗼𝗺 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 CD is an inference-time alignment that 𝘸𝘪𝘵𝘩 𝘢 𝘴𝘪𝘯𝘨𝘭𝘦 𝘵𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘳𝘶𝘯 - Gives configurable reward/KL tradeoffs - Transfers to a new base model - Allows reward aggregation
2
37
252
39,967
⚠️#Internship2024 Alert⚠️ Are you a 𝐏𝐡𝐃 𝐬𝐭𝐮𝐝𝐞𝐧𝐭 excited about #ResponsibleAI in foundation models? Do you have experience training/evaluating them? Team members: @PreethiLahoti @infoxiao @ananthbshankar @sroy_subhrajit @aradsinha @YaoQinUCSD (RADML at @GoogleResearch)
9
52
252
76,588
Good time to remind ourselves that: If a paper is great, the credit goes to the first author. If a paper has any flaws, the responsibility falls on the last author.
6
4
236
27,336
A PhD is not about getting over the noisy conference bar and publishing a bunch of papers. If that's what your PhD is about, you need to take a step back and reconsider your path. And a PhD advisor's job is certainly not just to help make papers happen.
Given the ongoing issues we continue to hear about in the peer review process at top venues, and considering the rapid advancements in AI, it's not hard to imagine a future—not too far off—where an AI system successfully passes peer review at a top research venue.
7
12
235
68,834
To get a research-y job, you need to have published a few good papers. Your job talk & interviews will cover 1-3 papers you significantly contributed to. Maximizing the # of pubs actually 𝐡𝐮𝐫𝐭𝐬 your prospects as it leads you to shallow "minimum publishable unit" papers.
3
5
224
25,007
The question that a reviewer should ask themselves is: Does this paper take a gradient step in a promising direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.
6
18
222
27,173
Excited to share 𝐈𝐧𝐟𝐀𝐥𝐢𝐠𝐧! Alignment optimization objective implicitly assumes 𝘴𝘢𝘮𝘱𝘭𝘪𝘯𝘨 from the resulting aligned model. But we are increasingly using different and sometimes sophisticated inference-time compute algorithms. How to resolve this discrepancy?🧵
6
38
218
22,661
The debate over “Is RL needed?” often misses the point. On-policy RL is one tool for fitting the reward-tilted posterior π⋆(y∣x) ∝ p(y∣x) · exp(r(x,y)∕β), where p is the base generative model and r is a reward. This distills the capability signaled by the reward back into the generator. RL isn’t required: you can sample, approximate, or distill the same target via best-of-N, beam search, rejection sampling, MCMC, or DPO. They are all equivalent theoretically. The real question is generalization and efficiency: - generalization: how well does a chosen method transfer to unseen prompts? For training-based methods, as with any distillation problem, performance depends on the dataset and the loss you optimize. - efficiency: what is the cost of sampling from the desired distribution? The cost is on-par with sampling from the base model for most training solutions while inference-time solutions generally pay a larger decoding cost. A fair comparison across methods entails looking rigorously at the "performance" vs "inference cost" (ignoring training cost). Considering only performance, best-of-N is all you need!
9
21
215
26,298
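The reward-tilted posterior π⋆(y∣x) ∝ p(y∣x) · exp(r(x,y)∕β) from the post above can be made concrete on a toy problem. The sketch below assumes a single prompt with three outcomes (the probabilities and rewards are made up for illustration) and compares the exact target with samples obtained by rejection sampling from the base model, one of the training-free routes the post lists.

```python
import math
import random

def tilted_posterior(p, r, beta):
    """Exact pi*(y) proportional to p(y) * exp(r(y) / beta) over a finite outcome set."""
    weights = {y: p[y] * math.exp(r[y] / beta) for y in p}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def rejection_sample(p, r, beta, rng):
    """Sample from pi* using only base-model samples and reward evaluations."""
    r_max = max(r.values())
    outcomes, probs = list(p), list(p.values())
    while True:
        y = rng.choices(outcomes, weights=probs)[0]
        # Accept with probability exp((r(y) - r_max) / beta), which is <= 1.
        if rng.random() < math.exp((r[y] - r_max) / beta):
            return y

# Toy base model and reward (assumptions for illustration).
p = {"a": 0.7, "b": 0.2, "c": 0.1}
r = {"a": 0.0, "b": 1.0, "c": 2.0}
beta = 1.0

target = tilted_posterior(p, r, beta)
rng = random.Random(0)
draws = [rejection_sample(p, r, beta, rng) for _ in range(20000)]
empirical = {y: draws.count(y) / len(draws) for y in p}
```

The sampler reaches the same target without any training, but it pays for it at inference time: the expected number of base samples per accepted output grows as β shrinks, which is the efficiency tradeoff the post points to.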
This is one of the most exciting agentic AI results I have seen! An AI agent (through several rounds of reasoning and experimentation) discovers distributed systems algorithms (e.g., GPU load balancing) that perform on par with those designed by world-renowned human experts in the field.
We built a Systems Researcher AI agent! Glia discovers novel distributed systems algorithms matching PhD-level experts in creativity & performance. We ran it on various networked systems problems and obtained publication-worthy results on each! Let me tell you how we did it 🧵
6
26
208
37,040
Thrilled to share that 0 papers were accepted to ACL and 0 papers were submitted to NeurIPS! If you got a paper accepted, please let us know why you're thrilled about it. Why should we read it? I'm specifically interested in papers on post training / RL / agents.
11
8
215
74,361
My thoughts on the broken state of AI conference reviewing: Years ago, when I was in graduate school and a postdoc in Information Theory, I always felt fortunate to be invited to review for IEEE Transactions on Information Theory or IEEE Transactions on Signal Processing. I felt I was being considered an expert on the subject matter and that my opinion helped make a difference in the review process. I was treated with respect, and I was happy to be part of it. I started publishing in mainstream ML venues in 2017 and was invited to be a reviewer for ML conferences in 2018. I jumped in with the same passion, but it soon became clear that being a reviewer didn’t really matter. I would spend time engaging with a paper only to be overturned by an AC who didn’t read or understand my review, or didn’t care to engage and discuss their rationale. I witnessed terribly wrong papers being accepted and good papers being unduly rejected. I quickly started to feel my role as a reviewer was irrelevant. I wasn’t treated with respect, and I couldn’t make a meaningful difference in the system. I probably would have stopped reviewing for ML long ago if I hadn’t been promoted to AC, and I know a lot of senior researchers who stopped engaging and caring. When I became an AC, I tried to engage reviewers, treat them with respect, and make sure they felt empowered and valued (especially if I made a decision that went against their recommendation). This is crucial for a functioning review system because reviewers need to feel their work matters if we want them to keep coming back and engaging. Now, we’ve reached a point where many reviewers feel their work doesn’t matter. The more experienced reviewers aren’t excited to engage, and the review quality keeps going down to the extent that we really lack qualified reviewers in the system. What solution have we found to this problem? Force everyone to be a reviewer. 
This is what EMNLP, NeurIPS, CVPR, etc., did this year, which (in my opinion) is a terrible way to address the root cause. We should have thought about how to bring back dignity to the reviewer’s role instead. Fast forward to today, and we are taking steps to make the role of an AC irrelevant as well. When PCs/SACs overturn an AC’s decision without even engaging the AC in the process, the AC doesn’t feel respected or useful. This has been happening in the NLP conferences for a while and happened this year at NeurIPS. Not only does an AC likely feel irrelevant, the SACs will feel the same. We will quickly see experienced members of the community stop accepting AC/SAC roles because they feel it’s a useless activity. The quality of ACs/SACs will go down, and more of the burden will shift to the PCs, who have to decide on 20k+ papers. The end result is that the entire system will soon crumble.
14
18
212
51,903
This advice is underrated! Today, industry research is focused on short-term (3-6 months) bets. Academics have an opportunity to balance their portfolio with medium-term (1-2 years) and long-term (3-10 years) bets. Putting all academic efforts in the short-term basket is suboptimal!
Advice for academics: don't try to beat industry at their own game. Invent a new more interesting game, with different rules.
3
13
207
16,244
Have you been perplexed by the surprising performance of 𝗯𝗲𝘀𝘁-𝗼𝗳-𝗻 in alignment compared to SOTA methods (𝗣𝗣𝗢/𝗗𝗣𝗢/𝗜𝗣𝗢)? We have a theory that explains this phenomenon.
3
20
196
38,987
A periodic reminder to reviewers: If you ask authors for more experiments, then you need to communicate a clear hypothesis you're trying to verify with those (e.g., effectiveness on imbalanced data, generalization beyond a certain modality, scalability, etc). Otherwise don't!
3
16
198
26,797
Please RT Are you a *PhD* student conducting research on generative models? Are you excited about #ResponsibleAI aspects of generation? We are looking to host a student researcher/#internship in 2023. Please get in touch via *email*. P.S. I'll be at #NeurIPS2022 and #emnlp2022
2
77
197
Here is a simple observation that should be better known than it is, so I thought I'd share broadly. Let's say we have a language model p(y|x), where x is a prompt and y is an outcome (a full sequence). Sampling from p(y|x)^α for some α≥1 is commonly known as "tilting" the distribution. Tilting has a lot of use cases in probability and statistics, which are also applicable to language modeling (see comments for references).
1. Tilting a language model can be interpreted as "outcome-level" temperature sampling (as opposed to the common token-level temperature sampling). Unlike token-level temperature sampling, outcome-level tilting is intractable because the partition function Z_α(x) = Σ_y p(y|x)^α is a sum over the exponentially large space of all possible outcomes y. One needs to either come up with sophisticated sampling methods (inference-time solutions) or try to directly distill the tilted distribution in a new model (training-time solutions) to be able to sample from it.
2. Sampling from the distribution proportional to p(y|x)^∞ (α →∞) concentrates all probability mass on the single maximum likelihood outcome y^* = argmax_y p(y|x), which again is intractable. We generally try to approximate this using beam search. Remember that this is different from the (token-level) temperature 0 sampling, which is greedy decoding.
3. One can interpret sampling from p(y|x)^α as the solution to a KL-regularized RL problem. Let the reward be r(x,y) = log p(y|x). Then consider solving the following KL-regularized RL problem: max_q(.|x) E_q[r(x,y)] - β KL(q(.|x)||p(.|x)) This would lead to sampling from p(y|x)^(1+1/β). In other words, tilting the distribution is equivalent to (implicitly) using log-likelihood as a reward.
4. KL-regularized RL and Best-of-N (BoN) are almost equivalent (see comments for references). Thus, Best-of-N that uses the (log-)likelihood as its re-ranking function is equivalent to tilting. Increasing the value of N is equivalent to increasing the value of α (subject to a monotonic reparameterization).
Happy tilting!
10
16
198
21,836
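The difference between outcome-level tilting and token-level temperature described in the tilting post above shows up even in a two-token toy model. The sketch below assumes a vocabulary {A, B}, sequences of length two, and made-up conditional probabilities; token-level temperature 1/α is implemented by raising each conditional to the power α and renormalizing.

```python
# Toy autoregressive model over two-token sequences (assumed numbers).
first = {"A": 0.6, "B": 0.4}
second = {"A": {"A": 0.55, "B": 0.45}, "B": {"A": 0.9, "B": 0.1}}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def sequence_probs(first, second):
    """Joint probability of every full sequence y = (token1, token2)."""
    return {a + b: first[a] * second[a][b] for a in first for b in second[a]}

def outcome_level_tilt(alpha):
    """Distribution proportional to p(y)^alpha over whole sequences."""
    base = sequence_probs(first, second)
    return normalize({y: p**alpha for y, p in base.items()})

def token_level_temperature(alpha):
    """Raise each next-token conditional to the power alpha (temperature 1/alpha)."""
    f = normalize({t: p**alpha for t, p in first.items()})
    s = {a: normalize({t: p**alpha for t, p in second[a].items()}) for a in second}
    return sequence_probs(f, s)

tilted = outcome_level_tilt(4.0)
tempered = token_level_temperature(4.0)
mode_outcome = max(tilted, key=tilted.get)        # most likely full sequence
greedy_outcome = max(tempered, key=tempered.get)  # where token-level sharpening heads
```

Here the most likely full sequence is BA (probability 0.36), while token-level sharpening concentrates on AA because A is the more likely first token. Enumerating all sequences only works because the toy space has four outcomes, which is the intractability point made in item 1.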
PSA: If a paper is published (in a top venue), it doesn't mean it is useful (or even correct). You should develop your own judgement system decoupled from the noisy signals to efficiently extract useful information nuggets from the literature!
I've sunk way too much time reading published papers of this type early on in my PhD, and it's really sad when I imagine how much more time is wasted for others.
4
9
190
30,316
Are you interested in theoretical aspects of sampling from language models? These tutorial slides should have good pointers to get started: theertha.info/papers/isit_20… p.s. The slides include my 𝑳𝒂𝒏𝒈𝒖𝒂𝒈𝒆 𝑴𝒐𝒅𝒆𝒍 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕: 𝑻𝒉𝒆𝒐𝒓𝒚 & 𝑷𝒓𝒂𝒄𝒕𝒊𝒄𝒆 talk
Excited to share the slides of the 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐦𝐨𝐝𝐞𝐥 𝐢𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞: 𝐭𝐡𝐞𝐨𝐫𝐲 & 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦𝐬 tutorial that @abeirami and I gave at #ISIT2024. We covered basics of 𝘪𝘯𝘧𝘦𝘳𝘦𝘯𝘤𝘦, 𝘢𝘭𝘪𝘨𝘯𝘮𝘦𝘯𝘵 and 𝘦𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘤𝘺 from a theoretical lens.
26
188
25,692
In 2009, a prominent signal processing professor said the market was tough and h-index ≥6 was needed just to get a faculty interview. We now seem to be drifting toward the same bar for PhD program entrance.
7
3
186
16,769
Only 2 out of 120 papers in my NeurIPS SAC batch have all reviewer scores ≥4! Don't despair with the low scores and focus on writing a clear and concise rebuttal! Good luck!
6
4
190
60,657
Giving a new talk tomorrow at @USC 𝑳𝒂𝒏𝒈𝒖𝒂𝒈𝒆 𝑴𝒐𝒅𝒆𝒍 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕: 𝑻𝒉𝒆𝒐𝒓𝒚 & 𝑷𝒓𝒂𝒄𝒕𝒊𝒄𝒆 hosted by @mahdisoltanol viterbi.usc.edu/calendar/?ev… While the talk is mostly an 𝑎𝑙𝑖𝑔𝑛𝑚𝑒𝑛𝑡 tutorial, I will also touch upon some of our recent work👇
5
15
186
45,139
The new post by @johnschulman2 et al. on LoRA vs Full-FT is excellent! In this short 🧵I’ll - engage with the findings - examine the info-theoretic intuition - offer a sharper argument with some further thoughts
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…
3
16
178
29,764
The problem is clear when you see a student asking themselves: "What am I submitting to NeurIPS?" vs "What important problem am I going to try to solve in the next few years?"
9
15
177
38,914
This is the conclusion slide of a talk I gave more than a year ago on RL/Alignment! It still holds true today.
2
12
184
20,165
I need a Bayesian agent that can read all new RL posts, update its belief on what works and what does not, and report back every day. Who is building that?
9
2
175
15,726
Even though Nando's problem statement is underdetermined, let's assume we are FTing a generalist, i.e., learn a new capability while retaining the core capabilities of the model (reasoning, instruction following, etc). The answer is quite different depending on your identity👇
If you have 10K data instances, would you: 1. SFT an LLM with 10K data, or 2. Learn a reward with 5K, and RL the LLM on the remaining 5K with the learned reward 3. Other (explain)?
3
13
171
37,090
Excited that our paper "safety alignment should be made more than just a few tokens deep" was recognized as an #ICLR2025 Outstanding Paper! We identified a common root cause to many safety vulnerabilities and pointed out some paths forward to address it!
Replying to @iclr_conf
Outstanding Papers Safety Alignment Should be Made More Than Just a Few Tokens Deep. Xiangyu Qi, et al. Learning Dynamics of LLM Finetuning. Yi Ren and Danica J. Sutherland. AlphaEdit: Null-Space Constrained Model Editing for Language Models. Junfeng Fang, et al.
16
7
169
13,569
Time for my rants again: A PhD is not about counting papers :)
Replying to @abeirami
happy for you bruv, but that's only half a Ph.D.
5
3
162
25,274
Happy #WomeninMathematics day! May 12 marks the birthday of Maryam Mirzakhani, a mathematician who was awarded the Fields medal (highest honor in math) for her contributions to geometry and dynamical systems. Two of my fav mathematicians: Maryam Mirzakhani & Ingrid Daubechies
1
21
151
31,590
As the NeurIPS review deadline is around the corner, please remember that you cannot use any non-local LLM like chatgpt/gemini for understanding the paper or drafting/revising your review, as that breaks the confidentiality agreement.
5
6
156
23,289
A research career is built *slowly* one good paper at a time! "publishing one [great] paper which catches the right people’s attention" can be very helpful to one's career! publishing one bad paper which catches the wrong people's attention can also be extremely damaging to one's career!
PSA: having many NeurIPS papers does not lead, and has never causally led, to high-paying jobs or top internships. However, working with the right advisor in grad school, or publishing one paper which catches the right people’s attention - these can both do this.
5
4
158
40,289
Hey @emnlpmeeting: is this a joke or a metareview? I classified it as a joke initially but then changed to metareview. #emnlp2023
8
3
155
124,586
Both @icmlconf and @NeurIPSConf were held in the US in 2022-23! The US is one of the most visa-unfriendly countries (appointment wait times 6+ months, processing another 6+ months), and this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023
8
21
149
26,043
Have you been compiling reward-KL tradeoffs to compare different alignment methods? Have you been using 𝐛𝐞𝐬𝐭-𝐨𝐟-𝐧 as a baseline? Have you wondered about the analytical formula that is commonly claimed for it? 𝐾𝐿 (𝑏𝑒𝑠𝑡-𝑜𝑓-𝑛 || 𝑏𝑎𝑠𝑒) = 𝑙𝑜𝑔(𝑛) - (𝑛-1)/𝑛
4
22
146
27,626
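The expression log(n) − (n−1)/n in the post above can be checked numerically in a toy setting. The sketch assumes a base model that is uniform over m outcomes, each with a distinct reward, and n draws with replacement; under those assumptions the best-of-n distribution has a closed form and the KL divergence can be summed exactly. The function names are illustrative.

```python
import math

def bon_kl_discrete(n, m):
    """Exact KL(best-of-n || base) in nats for a uniform base over m outcomes
    with distinct rewards, drawing n candidates with replacement."""
    kl = 0.0
    for i in range(1, m + 1):
        # Probability that the best of n draws is the outcome ranked i-th by reward.
        q = (i / m) ** n - ((i - 1) / m) ** n
        if q > 0.0:
            kl += q * math.log(q * m)  # base probability is 1/m
    return kl

def bon_kl_formula(n):
    """The analytical expression quoted in the post."""
    return math.log(n) - (n - 1) / n

small_space = bon_kl_discrete(4, 5)
large_space = bon_kl_discrete(4, 10_000)
formula = bon_kl_formula(4)
```

In this toy setting the exact value stays at or below the expression and approaches it as the number of outcomes grows, so the expression behaves like a limit for a very large outcome space rather than an identity for every base model.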
We publish a ton of our core work on language models. Here are a few samples just from our team released in the last few weeks: - arxiv.org/abs/2310.16523 - arxiv.org/abs/2310.15141 - arxiv.org/abs/2310.16959 - arxiv.org/abs/2310.17022 How's that "giving almost nothing back"?
I can't help being a bit sad in thinking how much the Gemini release is taking from the academic and open LMs communities, while giving almost nothing back I can't help but wish our relationship with commercial players would be more of a two way channel
2
11
144
58,121
Instead of complaining that peer review is dead, take a positive step to improve it today. The reviewers are not aliens, they are us! - Revise your review and make it clear. Identify the crucial points that impacted your score negatively and positively. - If the paper is lacking information about its claims, communicate your asks and the reasoning concretely. Don't just ask for 2 more experiments because you feel the authors didn't work hard enough. Don't ask for experiments unless they verify a hypothesis (which you clearly explained). - Look for the missing information that you identified and make sure they are not in the paper. - Recommend acceptance if the paper's claim is adding a new nugget of information to the literature (no matter the size of the nugget), and if the paper has substantiated the claim via theoretical / empirical evidence. Believe me, this doesn't take much time, and will improve the state of peer review significantly!
10
14
149
18,983
TMLR's bar is not lower. The bar is drawn differently to remove subjectivity from the acceptance criteria. I have read many good TMLR papers and many disastrous ICML papers. In fact I think those disastrous ICML papers wouldn't stand a chance at TMLR.
TMLR defines its bar for acceptance in a way I consider pretty clearly lower than conferences Completely agreed that peer review is dumb and noisy in many ways. But, like, I think that prestige is a real and meaningful phenomenon that really affects careers
9
8
151
38,199
This take couldn't get more wrong! Industry people do some of their most interesting research with interns since internships allow for more risky projects!
1
5
146
10,926
Apple’s “smart” charging decided my Mac only needed to be full by 7:30am so I woke up to a half-charged laptop before a flight. I miss dumb, predictable tech.
12
6
142
14,158
Varadhan's Lemma remains the most effective tool for characterizing large deviations in probability theory. It says: the log of an exponential average is governed by the outcome that best balances payoff and cost, where cost is how improbable that outcome is under the data distribution (the large-deviation rate), and payoff is the numerical score being averaged. Tilted ERM is inspired by this principle. Let f be the per-example loss (which is the payoff) on data z₁,…,zₙ. The tilted empirical risk with tilt t is R_t = (1/t) log( (1/n) ∑ᵢ exp(t f(zᵢ)) ). The induced weights are wᵢ ∝ exp(t f(zᵢ)). For t>0 the weighting emphasizes higher losses, for t<0 it emphasizes lower losses, as t→0 it approaches the mean loss, and for large |t| it approaches the max or min loss. This is exactly the payoff versus cost tradeoff, recast for ML. Screenshot: Dembo and Zeitouni’s book.
S. R. S. Varadhan, a recipient of the 2007 Abel Prize, is a towering figure in modern probability theory. His most notable contribution is the creation of a unified theory of large deviations. This theory provides a powerful framework for understanding the probability of rare events, which are often exponentially unlikely. While traditional probability focuses on average or typical outcomes, Varadhan's work gives precise mathematical tools to analyze the behavior of systems when they deviate significantly from the norm. His work has far-reaching applications in fields from statistical physics to finance and risk management.
7
6
142
16,182
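The tilted empirical risk R_t = (1/t) log( (1/n) ∑ᵢ exp(t f(zᵢ)) ) and its induced weights from the post above are short to implement. The sketch below uses a max-shift for numerical stability (a standard log-sum-exp trick, added here as an implementation choice) and treats t = 0 as the plain mean; the loss values are made up.

```python
import math

def tilted_risk(losses, t):
    """R_t = (1/t) * log(mean(exp(t * loss))); t = 0 is defined as the mean loss."""
    if t == 0:
        return sum(losses) / len(losses)
    scaled = [t * loss for loss in losses]
    m = max(scaled)  # shift by the max so exp() never overflows
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(losses))
    return log_mean_exp / t

def tilted_weights(losses, t):
    """Induced per-example weights w_i proportional to exp(t * loss_i)."""
    scaled = [t * loss for loss in losses]
    m = max(scaled)
    w = [math.exp(s - m) for s in scaled]
    z = sum(w)
    return [x / z for x in w]

losses = [0.1, 0.2, 0.3, 5.0]  # one outlier with a large loss
near_mean = tilted_risk(losses, 1e-6)
near_max = tilted_risk(losses, 200.0)
near_min = tilted_risk(losses, -200.0)
weights_pos = tilted_weights(losses, 2.0)
weights_neg = tilted_weights(losses, -2.0)
```

With positive t the outlier dominates the weights, and with negative t it is nearly ignored, which is the payoff-versus-cost tradeoff the post describes.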
If you are interested in safety/security jailbreaking of LLMs, defenses against them, and how the safety issues become more complicated when we design agentic workflows, this tutorial by @HamedSHassani, @aminkarbasi, @AlexRobey23 is highly recommended
2
14
144
9,863
I left Meta AI a few weeks ago. I am filled with gratitude for the opportunity to “think big” and propose Project CAIRaoke to redefine the future of #ConversationalAI! I am also humbled to have worked with so many amazing people along the way!
1
140
To the reviewer who claimed 8% improvement is marginal and not significant enough for a top conference paper: The goal of a scientific paper is to further our collective understanding of how to solve a problem, not to launch a new algorithm in a production setting.
6
11
139
10,553
Even from NeurIPS experiments a decade ago, the bottom 25% of papers had a 5% chance of acceptance whereas the top 25% had a 50% chance. Review scores are noisy! With the recent flood of papers and the downfall of reviews, the correlation between merit and acceptance is going to 0 fast. Can someone estimate the convergence rate?
An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
2
8
140
26,289
Too often, researchers propose solutions before they’ve defined the problem.
8
9
139
37,681
I hope my Iranian and Israeli friends and their loved ones are safe, and I wish this war ends in a peaceful terminating state soon!
2
8
138
18,551
arXiv is already beyond the point where anyone can follow everything, and a fair amount of junk gets through. I miss the days when I could read all new cs.IT abstracts over my morning coffee! I understand the goal of filtering obvious junk, but the proposed gatekeeping will misclassify valuable work, slow timely release, and likely push authors to other platforms for immediate dissemination. There is already a good number of papers that sit in moderation for weeks before appearing without a clear signal on why. The better path IMHO is still light moderation that leans on author reputation/endorsement, with clear and verifiable criteria rather than noisy peer review requirements. Peer review itself is in crisis, so using it as a hard prerequisite risks taking down arXiv too.
The Computer Science section of @arxiv is now requiring prior peer review for Literature Surveys and Position Papers. Details in a new blog post
7
2
136
28,634
Very cool findings! But this says more about Qwen models than about RL.
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewards
5
5
138
24,369
Cool paper packed with lots of useful information, which also has important implications for context engineering!
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
4
8
132
16,683
Does each policy gradient update contain only O(1) bits of information about the model we are trying to train, while each SFT update contains O(T) bits, where T is the number of tokens in the example? I think the answer is not a clean binary. Let’s examine this in a simple setup.
Consider a simple language model. It can only be prompted with "Tell me a joke" (and won't work with anything else), and it can generate exactly N jokes, each exactly T tokens long and each with probability 1/N. We will see that T appears in none of our arguments.
Assume that we have a binary reward that determines whether a joke is desirable (say, safe and good), and that each joke is good with probability α and bad with probability (1 − α). The ideal solution is then clear: filter out the bad jokes and generate each of the K good jokes with probability 1/K. Here K is a random variable whose mean is αN. This is what we’d get from solving population-level RL with no KL regularization. Let’s call this solution p*.
Now consider a policy gradient update on a trajectory τ. First we need a model family p_θ. Let’s take a simple one where θ lies on the probability simplex over the N jokes, i.e., θ_i ∈ [0, 1] and ∑_i θ_i = 1, and p_θ(τ_i) = θ_i, where τ_i is the i’th joke (or trajectory). With this notation, let τ be a random trajectory drawn from p_θ. The update is G = S · Adv, where S = ∇ log p_θ(τ). Consider the very first step of training, where θ = (1/N, …, 1/N), so S is independent of p*, and history = []. Thus, I(G; p* | history) ≤ I(Adv; p* | S) ≤ h(α), where h is the binary entropy function, so h(α) ≤ 1 bit. This bound is intended only for the very first step of training, when θ is uniform and S is independent of p*. After the first step, S can depend on the history, so this simple bound need not apply as written, but the overall information is probably still upper bounded by 1 bit.
Let’s also consider supervised learning in this setup, training on a stream of randomly drawn positive examples. The end state of training is the same p*, but how much mutual information does one example contain about p*? The update is G = ∇ log p_θ(τ), where τ is a positive example, and I(G; p* | history) = I(τ; p* | history) ≤ log(1/α).
In the RL case, each example resolves at most 1 bit of information about the model by determining whether or not a random rollout was good. In the SFT case, the information depends on how rare a good outcome is. If α > 1/2, then SFT also provides at most 1 bit of information about the ideal model. But if good outcomes are rare, then SFT provides significantly more information per example. Think about it this way: if there is only one good joke among the millions that the model may generate, then RL needs millions of rollouts to stumble upon that good joke, whereas if SFT points directly to that joke, we can just learn it and move on. But if there is just one bad joke among the millions that the model can generate, neither SFT nor RL is sample efficient, because we need to see millions of rollouts to determine which joke was bad and filter it out.
Notes and clarifications.
• T does not appear in these arguments. The contrast in this toy setup is not O(1) vs O(T). It is O(1) for a binary sequence-level reward in the first PG step, versus up to log(1/α) for a positive SFT example.
• Data processing. You can view p* as the result of filtering out each outcome independently with probability (1 − α), so any good set of size K is possible, with E[K] = αN. Under this process, the log(1/α) bound follows from how a positive example shrinks the space of consistent good sets.
• Richer rewards. The ≤ 1 bit conclusion for RL depends on the reward being binary at the sequence level. If the reward is real-valued and informative, a single update can carry more than 1 bit. The discussion here is about the binary case. Cases like GRPO/RLOO, where the reward is calibrated against several rollouts, potentially contain more information.
• Exploration. The “rare good joke” gap assumes naive on-policy sampling. Guided exploration or data selection (for example, using a reward model, replay that favors promising trajectories, or some other form of supervision) can reduce the number of rollouts needed to find rare good outcomes.
• KL regularization. In practice, RLHF is usually KL-regularized around a reference model. The KL term shapes token-level behavior and can significantly change these information-theoretic arguments.
• In practice, it is not uncommon to do poor man's RL through SFT: roll out the model many times, keep only the good examples, filter out the bad ones, and do SFT on the good examples. If we do that, we end up needing 1/α rollouts per SFT example, so even though we get log(1/α) bits of information per SFT update, we still get O(1) information per original model rollout.
Happy to hear your thoughts on these arguments!
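To make the numbers in this thread concrete, here is a minimal Python sketch that simply tabulates the three quantities named above for a few values of α: the first-step policy gradient bound h(α), the SFT quantity log(1/α), and the per-rollout figure α · log(1/α) for the "poor man's RL" recipe (1/α rollouts per kept example). The function names are made up for illustration, everything is measured in bits (log base 2), and the script only evaluates the expressions as stated in the thread; it does not prove them.

```python
import math


def binary_entropy_bits(alpha: float) -> float:
    """h(alpha) in bits: the first-step bound quoted for a binary reward."""
    if alpha <= 0.0 or alpha >= 1.0:
        return 0.0
    return -alpha * math.log2(alpha) - (1 - alpha) * math.log2(1 - alpha)


def sft_bits_per_example(alpha: float) -> float:
    """log2(1/alpha): the quantity quoted for one positive SFT example."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return math.log2(1.0 / alpha)


def filtered_sft_bits_per_rollout(alpha: float) -> float:
    """Rejection-sampling SFT: about 1/alpha rollouts are needed per kept
    example, so the per-rollout figure is alpha * log2(1/alpha)."""
    return alpha * sft_bits_per_example(alpha)


if __name__ == "__main__":
    header = f"{'alpha':>8} {'h(alpha)':>10} {'log2(1/alpha)':>14} {'per rollout':>12}"
    print(header)
    for alpha in (0.9, 0.5, 0.1, 1e-3, 1e-6):
        print(
            f"{alpha:>8g} {binary_entropy_bits(alpha):>10.4f} "
            f"{sft_bits_per_example(alpha):>14.4f} "
            f"{filtered_sft_bits_per_rollout(alpha):>12.6f}"
        )
```

Note that α · log2(1/α) is exactly the first term of h(α), so in this toy accounting the filtered-SFT figure per rollout never exceeds the binary-reward figure, which matches the last bullet above.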
[PSA] Please use \citet{} and \citep{} correctly or the paper will be hard to read.
\citet{}: Author name is intended to be part of the sentence: X et al. (2024)
\citep{}: Citation appears in parentheses and is not read as part of the sentence: (X et al., 2024)
Disrespecting and threatening the chairs is a terrible idea, even if you believe the decision on your paper is totally unfair! If you feel the urge to email the chairs, wait a day and reflect on the result a bit more before doing so. One of the best pieces of advice I have received: What you say today, you cannot take back tomorrow. What you don't say today, you can still say tomorrow! P.S. You cannot imagine how much effort the PCs put in to assemble the program. So we should all be thankful to them for taking on this thankless job on behalf of all of us!
We are disappointed that some community members have been sending threatening and disrespectful messages to our organizers after NeurIPS decisions. While we always welcome feedback, we wish to remind all that our organizers are volunteers who work tirelessly on behalf of the community, and such messages will not be tolerated and may potentially be investigated as code of conduct violations.
The review quality in TMLR is better because: 1. The authors suggest the AE. This means that the AE is more likely to be the right fit for the paper. 2. The AE selects the best reviewers for the paper who may or may not be in the reviewer pool. ...
I've had good and garbage reviews at the 3 big conferences and TMLR. I don't recall a clear difference. Good to hear you've found TMLR to be good!
I consider candidates with >1 published papers. I read one of their papers to ground the interview conversation and make a decision based on - quality of their paper (writing, experiment design, etc) - their knowledge in their area of competency - their general knowledge - ...
Replying to @abeirami
Say you were to take interns/students. Who'd you pick between a candidate who's passionate about fundamental questions without top-tier conference papers and one who has numerous top-tier conference papers on topics that no one would care about in a few years?
Make the review process for all conferences fully open like ICLR and TMLR. Among other things, this will deter submissions of half baked papers.
When we (RL) finetune a generalist model on a reward, there are three major axes that matter.
Off policy algorithms (even the simplest ones) are indeed useful for RLHF and can train a value function on massive amounts of data and 𝗼𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗜𝗣𝗢/𝗗𝗣𝗢/𝗣𝗣𝗢. arxiv.org/abs/2310.17022
A brief summary of what REINFORCE is in terms of RLHF and the history of RL. The algorithm known as REINFORCE is really just the vanilla policy gradient approach. The name comes from Williams 1992, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning". Policy gradient algorithms directly update the weights of the policy based on some estimate of reward.
Mostly, it's important to know how PPO compares and why it emerged, and why that may not matter for RLHF. REINFORCE methods were known for high variance in the policy gradients, leading to unstable learning. Basic methods like baselining and other regularization tools (off-policy gradients, actor-critic methods, etc.) all emerged in response.
The notable paper on the path towards PPO (which lots of people use for RLHF today) was Trust Region Policy Optimization (TRPO). Both TRPO and PPO try to answer the same question: How do we take the right-sized step with a potentially noisy policy gradient estimate? TRPO does this with a second-order approximation. PPO does this with a first-order approximation. It ended up being much simpler to implement, and its popularity is now obvious.
We'll see if we go down the same paths of off-policy algorithms being useful for RLHF (much more data to learn from) and actor-critic algorithms (separate learning of value function and policy). The things that were important for state-based control may not be important for language because our reward functions are very different with a reward model.
Google may need extremely good reward models to get REINFORCE to work, as they mentioned in the Gemma paper. Regardless, a big way around high-variance gradients is to take big batch sizes. We know Google has the compute for that; we're not sure all of us DPO hackers do.
Links / references
Williams' paper: link.springer.com/content/pd…
Policy gradient slides from @svlevine: rail.eecs.berkeley.edu/deepr…
Policy gradient blog post from @lilianweng: lilianweng.github.io/posts/2…
Spinning up docs: spinningup.openai.com/en/lat…
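For readers who want to see the vanilla policy gradient described above in code, here is a minimal sketch on a toy multi-armed bandit, using only the Python standard library. Everything here is an illustrative assumption: the function names, the softmax-over-logits policy, the fixed per-arm rewards, and the hyperparameters. The baseline is simply the batch mean reward, one of the simplest variance-reduction choices; it is not the setup from the Gemma paper or from any RLHF system.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=200, batch_size=32, lr=0.5, seed=0):
    """REINFORCE with a batch-mean baseline on a bandit with fixed rewards.

    rewards[a] is the (deterministic, hypothetical) reward of arm a.
    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n  # uniform initial policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of actions on-policy.
        actions = rng.choices(range(n), weights=probs, k=batch_size)
        batch_rewards = [rewards[a] for a in actions]
        # Batch-mean baseline (it includes each sample's own reward, so the
        # estimate is scaled slightly; fine for a sketch).
        baseline = sum(batch_rewards) / batch_size
        grad = [0.0] * n
        for a, r in zip(actions, batch_rewards):
            advantage = r - baseline
            for j in range(n):
                # d/d logit_j of log softmax(logits)[a] = 1[j == a] - probs[j]
                indicator = 1.0 if j == a else 0.0
                grad[j] += advantage * (indicator - probs[j]) / batch_size
        # Gradient ascent on expected reward.
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    final_probs = reinforce_bandit([0.0, 0.0, 1.0])
    print([round(p, 3) for p in final_probs])
```

Shrinking `batch_size` makes the gradient estimate noisier, which is the variance issue the summary mentions; larger batches are the brute-force fix.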
The list of #NeurIPS2023 accepted papers is available here: neurips.cc/virtual/2023/pape…
My personal thoughts on conference acceptance rate being artificially kept low. From my own experience, if we focus on merit based acceptance with two simple criteria (like TMLR): - claims are substantiated with theoretical/empirical evidence - claims push the science envelope by epsilon. Then, we will still end up with ~25% acceptance rate and won't need to artificially reject any papers. The problem we have with the noisy review system is that we are accepting a lot of papers that are not even correct.
The question that a reviewer should ask themselves is: Does this paper take a gradient step in the right direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.
Excited to share that our paper titled “Winning is not everything: enhancing game development with intelligent agents,” @IEEETxnOnGames, June 2020, has been selected to receive the 2023 Outstanding Paper Award by @ieeecis Awards Committee.
[2022 #Internships at the @facebookai Conversational AI Research (CAIR) team] The CAIR team is seeking to recruit multiple PhD student interns to work with us (@SatwikKottur, @Chinnadhurai, @abeirami, and others) on different aspects of #ConversationalAI and #NLProc. 1/5
Yes, soft distillation generalizes well. Hard distillation (SFT on teacher generated data) does not. Wasn't this (widely) known already?
A model SFT’d on curated synth data, generated from a teacher, is not the same as distillation. It’s a form of “off policy RL”. Filtering entails a form of reward. Which is why models trained on mostly synthetic data like phi-4 can become better than the teacher in some tasks.
If a paper clears the bar, give it a score ≥6. Here is how I think about ratings: - Should be oral? 8/9 - Should be spotlight? 7/8 - Clears the acceptance bar? 6/7 - Could be accepted after minor revs? 4/5 - Could be accepted after major revs? 3/4 - Fundamentally flawed 2/3
The question that a reviewer should ask themselves is: Does this paper take a gradient step in the right direction? Is the community better off with this paper published? If the answer is yes, then the recommendation should be to accept.
If you decide to withdraw your paper without a rebuttal, it's nice to write a short (3-4 sentences) withdrawal note to thank the reviewers for their feedback, describe what you agree/disagree with, and what you plan to do. Besides, you may get the same reviewers again.
That's called loss maximization!
Several of my team members + myself are impacted by this layoff today. Welcome to connect :)
There is a subtle distinction between RL in RLHF and RL in domains with a clear reward signal that captures what we want, like winning in games or correctness in math. With a clear reward, RL is quite effective and can lead to novel sequences of actions (e.g. move 37). But, ... 👇
Replying to @abeirami
interesting, but I think the early work on RLHF was pretty impressive on teaching new skills, without pre-training/SFT, e.g., openai.com/index/learning-fr… how do you equate your argument with that? those allowed large KL?
PSA: If you are writing a paper that would be obsolete by the time it gets published (i.e., has no archival value), then don't! That should be a blogpost or a tweet. A paper should be reserved to communicate an insight beyond a bunch of bold numbers in a table.
Replying to @abeirami
At the same time, if you are reading a published paper, you are almost certainly already behind!
Very interesting paper by @th33rtha et al. For categorical/Gaussian distributions, they derive the rate at which a sample is forgotten to be 1/k after k rounds of recursive training (hence 𝐦𝐨𝐝𝐞𝐥 𝐜𝐨𝐥𝐥𝐚𝐩𝐬𝐞 happens more slowly than intuitively expected).
As for next steps, I am excited to share that I have joined #GoogleResearch to lead efforts around robust and fair development of core machine learning techniques. I am also moving to New York City, and excited to be back to the east coast!
When I first started reviewing ML papers, I was fighting to reject bad papers. These days I find myself fighting to accept good papers. The change in perspective has also made my reviews more constructive even in cases where the recommendation has to be reject.
0: Do Best-of-N on your rubric/reward first, look at the outcomes, and verify that it works as intended.
i think i figured out the correct pipeline for RL 1 : forget RL and just do DSPy with GEPA. develop your agent loop and eval here. it's cheaper and faster. 2 : convert it to RL and compare to baseline. 3 : hopefully RL worked, otherwise revert to prompt optimizers
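Step 0 above (Best-of-N on your rubric/reward) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate` and `reward` are hypothetical callables that you would replace with your own model sampler and rubric scorer, and the toy versions in the demo exist only so the script runs end to end. The function returns all candidates ranked by score, so you can read through the top and bottom of the list and check that the reward prefers what you actually want.

```python
import itertools
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical sampler: prompt -> response
    reward: Callable[[str, str], float],  # hypothetical scorer: (prompt, response) -> score
    n: int = 8,
) -> List[Tuple[str, float]]:
    """Sample n responses, score each, and return them sorted best-first."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(c, reward(prompt, c)) for c in candidates]
    # Highest score first; ranked[0] is the Best-of-N pick.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins (assumptions for the demo, not a real model or rubric).
    canned = itertools.cycle(["ok", "a longer answer", "the longest answer of all"])

    def toy_generate(prompt: str) -> str:
        return next(canned)

    def toy_reward(prompt: str, response: str) -> float:
        return float(len(response))  # pretend the rubric rewards length

    for response, score in best_of_n("Tell me a joke", toy_generate, toy_reward, n=3):
        print(f"{score:5.1f}  {response}")
```

If the top-ranked samples look wrong by hand (here, a length-based reward would happily pick padded answers), that is a sign the rubric needs fixing before any RL or prompt optimization is run against it.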
Has your LLM been leaking? We have the perfect solution for you! Introducing our new venture, Gemini Waterproofing (Since 1989)
We are still in an Evaluation crisis!
There is a rich set of research questions in design and optimization of agentic workflows with a ton of room for theoretical & algorithmic work! A great starting point to get exposed to them is the MIPRO paper (@kristahopsalong @lateinteraction et al.) and the DSPy framework.
Most “robustness” work (adversarial, shift, etc.) is just training on reweighted samples (augmented, model-generated, or mined). OOD generalization then comes from: (1) inductive bias (2) similarity to train data (3) luck The 3rd one is the most important of the three.