Rafael Rafailov @ NeurIPS · Dec 2, 2025 · 7:05 PM UTC

Pinned Tweet

Rafael Rafailov @ NeurIPS

@rm_rafailov

2 Dec 2025

I will be at NeurIPS this week! If you want to talk about research, RL, life at @thinkymachines or get some Tinker credits reach out!

189

18,999

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:04 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

We have a new position paper on "inference time compute" and what we have been working on in the last few months! We present some theory on why it is necessary, how it works, and what it means for "super" intelligence.

224

1,362

181,431

Rafael Rafailov @ NeurIPS · Apr 19, 2024 · 2:18 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

19 Apr 2024

We have a new preprint out - your language model is not a reward, it’s a Q function! 1. The likelihood of the preferred answer must go down - it’s a policy divergence 2. MCTS guided decoding on language is equivalent to likelihood search on DPO 3. DPO learns credit assignment

149

938

100,382

Rafael Rafailov @ NeurIPS · Aug 13, 2024 · 8:53 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

13 Aug 2024

Super excited to announce what we have been working on in the last six months - Agent Q is out now! This is a framework for self-supervised agent reasoning and search that can self-correct and autonomously improve by self-play and RL on real tasks on the real internet! 👇

103

756

166,500

Rafael Rafailov @ NeurIPS · Sep 29, 2025 · 5:21 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Sep 2025

The most surprising thing working on this was that RL with LoRA completely matches full training and develops the same extended reasoning patterns. I think this is a great sign for custom agent training.

Thinking Machines

@thinkymachines

29 Sep 2025

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lor…

509

45,475

Rafael Rafailov @ NeurIPS · Nov 30, 2023 · 4:45 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

30 Nov 2023

Excited to announce DPO has gone multi-modal! New paper out on RLHF for text-to-image diffusion models! We obtain large-scale state of the art results with 70% win rates against Stable Diffusion XL on human evals! Deep dive below 🧵

471

233,679

Rafael Rafailov @ NeurIPS · Aug 26, 2024 · 7:58 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

26 Aug 2024

My Bet: Strawberry is algorithm distillation/procedural cloning. Everyone right now is coming up with ways to distill System 2 into System 1, but that will always be limited. We need to train the model to run the algorithms, not just outputs (and post-train with RL of course).

459

123,999

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

I saw this challenge aimoprize.com/ to develop an AI that can win a gold medal at the IMO. I competed at that level a couple of times (only silver medals though) and have been working on RL and LLMs for a bit. Here are my thoughts on what the challenges are: 1/N

AIMO Prize

aimoprize.com

450

161,251

Rafael Rafailov @ NeurIPS · May 11, 2024 · 10:28 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

11 May 2024

Not to mention that most students don’t even have access to that cluster. I don’t have access to any A100s myself. It is becoming increasingly hard to even do research and that is Stanford, other places have it even worse.

Tsarathustra @tsarnick

10 May 2024

Fei-Fei Li says Stanford's Natural Language computing lab has only 64 GPUs and academia is "falling off a cliff" relative to industry

399

106,056

Rafael Rafailov @ NeurIPS · Oct 7, 2024 · 8:24 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

7 Oct 2024

Excited to announce our latest work on generative reward models that unify RLHF and RLAIF approaches! We begin with a standard LLM-as-a-judge RLAIF framework and use further RL tuning to align the judge model's evaluations with the preference dataset.

396

65,625

Rafael Rafailov @ NeurIPS · Oct 2, 2025 · 6:53 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

2 Oct 2025

I actually believe Tinker could be the most advanced ML system in the world. It optimizes everything from the kernel level to a distributed system that can process millions of simultaneous requests with near 100% reliability and insane throughput efficiency.

Myle Ott @myleott

1 Oct 2025

So excited about this! Tinker provides a simple+powerful interface for post-training/RL research. It also manages all the infrastructure so that users can focus on data and environments. Hidden behind that simple interface is a ton of interesting and complex ML systems challenges! In addition to the work building an efficient RL stack (orchestration, numerics, parallelism, weight transfer, etc.), we also tackled a bunch of new challenges (transparent failure recovery, multi-tenant scheduling, autoscaling, etc.). I had a lot of fun working on early parts of this system and am excited to see what others are able to build with it!

331

45,806

Rafael Rafailov @ NeurIPS · Sep 12, 2024 · 6:00 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

12 Sep 2024

Fn nailed it - tree search distillation + RL post training!

Rafael Rafailov @ NeurIPS

@rm_rafailov

26 Aug 2024

308

39,489

Rafael Rafailov @ NeurIPS · Oct 1, 2025 · 6:13 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

1 Oct 2025

Very excited to share what I have been working on with a great team of people at @thinkymachines. Tinker is a whole new way to train and customize models all the way up to frontier scale. Most importantly, it allows everyone to use their own code, data, tools and environments, while it provides a frontier level training stack with a few lines of code.

Thinking Machines

@thinkymachines

1 Oct 2025

Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker

301

51,681

Rafael Rafailov @ NeurIPS · Mar 6, 2025 · 8:44 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Mar 2025

This is a really cool project where we trained a multi-agent system of 3 LLMs to do cooperative problem-solving end-to-end with reinforcement learning! MARL holds a lot of promise to teach models to be more cooperative with real collaborators! Check out @sumeetrm's thread below!

Sumeet Motwani

@sumeetrm

6 Mar 2025

Introducing MALT: Improving Reasoning with Multi-Agent LLM Training🫡 We present a new multi-agent post-training method that uses credit assigned synthetic data to improve the reasoning capabilities and self-correction rates of a generator, critic, and refinement model working together🧵

291

56,114

Rafael Rafailov @ NeurIPS · Sep 28, 2025 · 5:10 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

28 Sep 2025

It’s weird how people still blindly copy it. There was a whole paper about this.

Quanquan Gu

@QuanquanGu

28 Sep 2025

Replying to @zjasper

The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:

281

45,056

Rafael Rafailov @ NeurIPS · Aug 8, 2024 · 6:28 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

8 Aug 2024

After the LLaMa 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". TL;DR we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as online RLHF.👇

250

44,647

Rafael Rafailov @ NeurIPS · Feb 29, 2024 · 8:35 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Feb 2024

We are just entering the RoboGPT era. Will have some big news on this soon!

OpenAI

@OpenAI

29 Feb 2024

OpenAI + humanoid robots — we’re collaborating with @Figure_robot to expand our multimodal models to robotic perception, reasoning, and interaction. prnewswire.com/news-releases…

238

73,477

Rafael Rafailov @ NeurIPS · Nov 27, 2023 · 6:42 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

27 Nov 2023

Given the successful recent releases of Zephyr, NeuralChat and Tulu 2, there has been a lot of discussion around DPO (and variants) for RLHF and comparisons to the classical reward modeling + online RL (PPO) pipeline. What I think is missing from the discussion: 1/N

239

170,859

Rafael Rafailov @ NeurIPS · Feb 12, 2025 · 10:24 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

12 Feb 2025

For 1% of Stanford's 36.5 Billion endowment we could blow DeepSeek out of the water. For 2.5% we could probably compete with OpenAI. Yet for some reason as a Ph.D. student I can use 4 GPUs on a good day and pray my one 8B fine-tuning run goes well. Food for thought.

Stanford NLP Group

@stanfordnlp

12 Feb 2025

½ wrong, ½ right: The problem is not API 💰💰 but whether students can hack on—“research”—the details of models! .@ericschmidt: “said a [US] failure to invest in open-source [AI] would prevent scientific discovery in western universities, which could not afford closed models.”

234

36,814

Rafael Rafailov @ NeurIPS · Aug 15, 2024 · 1:19 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

15 Aug 2024

The latest SalesForce approach achieves SOTA 55% on SWE-Bench Lite. The key component is a critic model which selects among a number of proposed solutions. It's the same approach we used in Agent Q for web tasks. I am pretty bullish about the idea of TRAINING generative critics.

211

21,090

Rafael Rafailov @ NeurIPS · Dec 22, 2024 · 7:49 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

22 Dec 2024

I highly recommend zoning out all AIfluencers over the next few weeks (or indefinitely really).

159

16,427

Rafael Rafailov @ NeurIPS · Jan 25, 2025 · 6:13 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

25 Jan 2025

These models can’t even learn \n\n from 50 gradient steps, much less complex exploration like this. If it generates code to solve math problems, it is clear it had a bunch of curated data in pre-training.

Junxian He @junxian_he

25 Jan 2025

We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on 7B model with only 8K examples, the results are surprisingly strong. 🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the resultant model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-math-7B-instruct and being comparable to PRIME and rStar-MATH that use >50x more data and more complicated components. 🚀 Increased CoT length and self-reflection emerge We share the details and our findings in the blog: hkust-nlp.notion.site/simple… Training code and implementation details here: github.com/hkust-nlp/simpleR…

152

56,116

Rafael Rafailov @ NeurIPS · Sep 6, 2025 · 4:10 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Sep 2025

I’ll take the opposite view - current methods are saturating and we need at least 1 practical breakthrough and at least two fundamental ones (which will likely take years) just off the top of my head to reach AGI. None of these are oversight or safety related.

Stephen McAleer

@McaleerStephen

5 Sep 2025

Scalable oversight is pretty much the last big research problem left. Once you get an unhackable reward function for anything then you can RL on everything.

153

30,111

Rafael Rafailov @ NeurIPS · Jun 9, 2025 · 12:38 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jun 2025

When we first published our work on this 9 months ago it was rejected for being impractical in realistic cases. Six months later it was rejected for lack of novelty. It’s the way academic publishing goes.

Nathan Lambert

@natolambert

8 Jun 2025

Another generative / inference-time scaling reward modeling paper. It's the direction things are going.

153

14,698

Rafael Rafailov @ NeurIPS · Apr 1, 2024 · 4:20 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

1 Apr 2024

New preprint is out on interplay between DPO and verbosity. Some of the first feedback we got on DPO was that training on LARGE scale the model becomes increasingly verbose until it diverges. Verbosity effects have also been observed in the OS community. Credit to @peterjliu

138

31,013

Rafael Rafailov @ NeurIPS · Jan 20, 2025 · 2:39 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

20 Jan 2025

DeepSeek R1 with "Cold Start" pretty much works as expected. I still don't buy the R1 Zero result, the base models barely output coherent solutions without finagling. My bet is there is some correction/reflection/backtracking-like data in mid-training.

132

24,103

Rafael Rafailov @ NeurIPS · Dec 21, 2024 · 6:03 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

21 Dec 2024

Despite all the twitter hype there still hasn't been public proof that the "reasoning" models have any emergence. I.e. is there a class of problems that are solvable with "advanced reasoning" that were not under GPT4o with search under some computational budget?

127

20,857

Rafael Rafailov @ NeurIPS · Apr 19, 2024 · 5:25 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

19 Apr 2024

From the LLaMa 3 blogpost - they use a combination of rejection sampling, DPO and PPO for post-training. Really interested to know which tasks/parts of the process each algorithm benefits the most.

117

71,745

Rafael Rafailov @ NeurIPS · May 30, 2023 · 5:27 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

30 May 2023

Our new paper on RL From Human Feedback is out: arxiv.org/abs/2305.18290. In Direct Preference Optimization (DPO) we reparameterize the reward model in a suitable way without any loss in generality and optimize the EXACT RLHF objective directly with a simple classification loss.

120

22,823
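
For readers who want the "simple classification loss" from the DPO announcement above spelled out, here is a minimal sketch in plain Python. It follows the per-example loss as written in the DPO paper (arxiv.org/abs/2305.18290): the negative log-sigmoid of a scaled difference of log-ratios between the policy and a frozen reference model. The function name, argument names, and the default beta=0.1 are illustrative assumptions, not the authors' code, and the inputs are assumed to be summed log-probabilities of whole responses.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the default beta are hypothetical, chosen for illustration.
    """
    # Implicit reward of each response: log pi(y|x) - log pi_ref(y|x).
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) == log(1 + exp(-margin)), computed in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the policy favors the chosen response more than the reference does.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

This is a binary classification loss over preference pairs: no separate reward model is trained and no rollouts are sampled. In practice the log-probabilities would come from batched model forward passes rather than hand-written floats.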

Rafael Rafailov @ NeurIPS · Nov 10, 2025 · 7:18 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Nov 2025

Replying to @kalomaze

You are wrong on like three different levels, but something that will blow your mind - GRPO was first published in the DPO paper under the name “PPO-ours” which was group size 4 (but our version was mathematically correct unlike the actual “GRPO”).

122

36,818

Rafael Rafailov @ NeurIPS · Jan 10, 2025 · 12:20 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Jan 2025

"Superintelligence isn't about discovering new things; it's about discovering new ways to discover" -> Meta RL

nathan lile

@NathanThinks

10 Jan 2025

Superintelligence isn't about discovering new things; it's about discovering new ways to discover I think our latest work formalizes Meta Chain-of-Thought which we believe lies on the path to ASI When we train models on the problem-solving process itself—rather than the final solution—they internalize how to think about reasoning tasks, not just what to think The next wave of AI is a Meta-CoT loop. We can't predict what novel forms of thinking might emerge, but it points to an extraordinary synthetic future I'm so proud of @synth_labs team & our incredible open science collaborators for getting this work out

115

16,499

Rafael Rafailov @ NeurIPS · Jan 13, 2025 · 12:24 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

13 Jan 2025

DPO was designed as an offline algorithm, but works better online for data constrained reasons. However I believe the most efficient formulation is as a massive distributed async off-policy RL. This allows you to reuse real with AI feedback while massively scaling data.

Sanmi Koyejo @sanmikoyejo

11 Jan 2025

Replying to @sanmikoyejo

PrefLearn: How Do Advanced Replay Buffers and Online DPO Affect the Performance of RL Tetris with DQNs by Andy Liang, Abhinav Sinha, Jeremy Tian, and Kenny Dao proposes PrefLearn with superior performance and faster convergence tinyurl.com/preflearn 4/n

114

18,340

Rafael Rafailov @ NeurIPS · Oct 23, 2024 · 6:28 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

23 Oct 2024

New follow-up work on the effects of synthetic data on model pre-training. It’s becoming increasingly clear that the model collapse issues predicted by prior works are not panning out in theory and practice. Industry labs now even have entire synthetic data pre-training teams.

Rylan Schaeffer @RylanSchaeffer

23 Oct 2024

📢New preprint📢 🔄Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 A deeper dive into the effects of self-generated synthetic data on model-data feedback loops w/ @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/9

108

17,458

Rafael Rafailov @ NeurIPS · Oct 31, 2025 · 5:08 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

31 Oct 2025

Replying to @jxmnop

Conference reviewing is completely broken these days, I don’t hold this in very high regard.

104

9,535

Rafael Rafailov @ NeurIPS · Mar 11, 2025 · 12:51 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

11 Mar 2025

Been working on a new off-policy policy-gradient approach that should be more numerically stable (no importance ratios). The final goal is long-context and agentic RL where these can become a big issue, but I wonder how many RL people know what HalfCheetah is these days.

9,287

Rafael Rafailov @ NeurIPS · Jul 23, 2025 · 6:09 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

23 Jul 2025

Missed this paper, but it’s pretty cool - it managed to scale our “Meta-CoT” proposal to 70B models by creating synthetic CoTs from search traces and post-training with RL. Thanks for the shout-out as well!

Joongwon Kim

@danieljwkim

3 Jul 2025

Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417

10,869

Rafael Rafailov @ NeurIPS · Nov 18, 2024 · 9:19 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

18 Nov 2024

Inference time search is going to be huge for agents and likely very hard to match with standard training. That was my biggest take-away from the Agent Q work as well.

hr0nix

@hr0nix

15 Nov 2024

Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵

14,282

Rafael Rafailov @ NeurIPS · Nov 6, 2024 · 12:18 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Nov 2024

O1-mini asking for pairwise feedback. Interesting, I didn't expect this.

11,426

Rafael Rafailov @ NeurIPS · Oct 24, 2025 · 2:31 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

24 Oct 2025

Replying to @kalomaze

You know he’s one of the transformer inventors?

5,397

Rafael Rafailov @ NeurIPS · Oct 10, 2025 · 7:28 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Oct 2025

Excited and grateful for the opportunity to speak at TED AI this year!

TEDAI San Francisco @TEDAISF

10 Oct 2025

📢Excited to welcome @rm_rafailov researcher at @thinkymachines His work in reinforcement & continuous learning is shaping how AI learns to learn. Hear him at #TEDAI, Oct 21–22 in SF. Apply to attend: tedai-sanfrancisco.ted.com/

10,116

Rafael Rafailov @ NeurIPS · May 1, 2024 · 8:02 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

1 May 2024

We address the model collapse issue in a new preprint! With AI generated data growing exponentially and making its way into pre-training datasets, new concerns are being raised about potential degradation in model performance. In this new work we claim this might not be an issue!

9,318

Rafael Rafailov @ NeurIPS · Oct 11, 2025 · 5:42 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

11 Oct 2025

We are working to make applications like this even faster!

Changyu Chen

@Cameron_Chann

11 Oct 2025

The Tinker API seems quite promising for async RL training, though I haven’t seen much discussion on this aspect. I ran a few experiments to get an initial sense of how async RL performs with Tinker. The results are pretty impressive! Async matches the sync version under the same settings, but finishes in about half the wall-clock time (max steps off-policy = 4). A few quick notes: - Efficiency can likely improve further; I did notice some rate limiting - These runs use a GRPO-like algorithm; setups with more components (e.g., actor-critic + reward models) could show even greater async benefits - Compared with the non-API RL infra, the async implementation with Tinker doesn’t need to manage computing resources, which largely simplifies the complexity Looking forward to seeing more exploration from the community on async RL with Tinker! Experiments based on the math training recipe from tinker-cookbook: github.com/thinking-machines…

15,662

Rafael Rafailov @ NeurIPS · Mar 9, 2024 · 6:55 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Mar 2024

From @StabilityAI Stable Diffusion 3 paper - fine-tuning with Diffusion-DPO achieves close to 20% higher win rate over the base model under human evals!

23,884

Rafael Rafailov @ NeurIPS · Nov 22, 2024 · 1:45 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

22 Nov 2024

Sometimes even LLaMa has had enough.

5,139

Rafael Rafailov @ NeurIPS · Apr 5, 2025 · 8:48 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

5 Apr 2025

“We developed a fully asynchronous online RL training framework that enhanced flexibility. …. This innovation resulted in a ~10x improvement in training efficiency over previous generations.” Asynch distributed RL strikes again!

6,136

Rafael Rafailov @ NeurIPS · Jan 16, 2025 · 8:44 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

16 Jan 2025

My guess is that scaling inference time compute won't do much for agents (maybe a little - aka internal world models) but we need to scale inference-time interaction.

Alejandro Cuadron

@Alex_Cuadron

5 Jan 2025

Surprising find: OpenAI's O1 - reasoning-high only hit 30% on SWE-Bench Verified - far below their 48.9% claim. Even more interesting: Claude achieves 53% in the same framework. Something's off with O1's "enhanced reasoning"... 🧵1/8

9,958

Rafael Rafailov @ NeurIPS · Nov 12, 2024 · 9:21 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

12 Nov 2024

Replying to @jxmnop

We can replicate it, it just takes insane amounts of data, compute and infrastructure, all of which are in short supply in academia/oss community.

3,962

Rafael Rafailov @ NeurIPS · Jan 10, 2025 · 3:44 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Jan 2025

A lot of my own thinking on the inference compute problem has been influenced by discussions with Aviral, check out his stance on this at their blog post!

Aviral Kumar

@aviral_kumar2

10 Jan 2025

Lots of buzz around scaling test-time compute! But from an ML viewpoint: what does it mean to "use" test-time compute wisely? How to train to do so? How to measure that scaling it is useful? This new blog from students @mldcmu provides a conceptual perspective on these! 🧵⬇️ blog.ml.cmu.edu/2025/01/08/o…

7,101

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

In summary, we would likely need a new architecture, one that can: 1. See and understand spatial reasoning. 2. Be able to maintain and use prior knowledge 3. Do latent planning with arbitrary depth over concepts 4. Have hierarchical structure to generate the actual output 16/N

4,405

Rafael Rafailov @ NeurIPS · Sep 17, 2024 · 6:43 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

17 Sep 2024

"Hmm"🤔

6,938

Rafael Rafailov @ NeurIPS · Feb 7, 2025 · 12:28 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

7 Feb 2025

Replying to @polynoamial

Most people pursue Ph.Ds to do novel science and discoveries. If that is no longer on the table, the whole endeavor seems … pointless.

1,320

Rafael Rafailov @ NeurIPS · Jun 12, 2025 · 5:41 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

12 Jun 2025

I make the AI, very nice!

James Alcorn

@JamesAlcorn94

12 Jun 2025

congrats @rm_rafailov on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alien piece comes as no surprise to your mates of course, but at least the general public now has fair warning and a fighting chance. To celebrate with a fitting observation from a preeminent Kazakh journalist: "My country send me to United States to make AI model. Please, come and see my model. If it not success, I will be execute." 🇺🇸🇺🇸🇺🇸🇺🇸🇧🇬🇧🇬🇧🇬🇧🇬

5,558

Rafael Rafailov @ NeurIPS · Sep 12, 2024 · 7:23 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

12 Sep 2024

Replying to @teortaxesTex

We need a massive coordinated effort to generate a few billion samples of process supervision data. I think the recipe is actually clear, but the data and compute requirements are gonna be orders of magnitude larger.

8,966

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:04 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

The main point of why we need "advanced reasoning" is complexity. The model training data contains solutions for hard problems, but NOT the true data generating process for those solutions. The solution itself is the output of some complex Meta-CoT, which is not written down.

9,661

Rafael Rafailov @ NeurIPS · Sep 16, 2024 · 6:35 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

16 Sep 2024

What about a massive community effort to generate reasoning process training data?

Noam Brown

@polynoamial

16 Sep 2024

Great to see @scale_AI and @cais initiating a massive effort on harder evals! Many popular benchmarks are now saturated by @OpenAI o1, and we expect rapid progress to continue.

10,425

Rafael Rafailov @ NeurIPS · Feb 28, 2025 · 7:55 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

28 Feb 2025

This is the dataset we curated for our own reasoning experiments. There is a lot of reasoning data coming out now, but we spend extra time on this to make sure all the problems are high-quality and suitable for RL training!

nathan lile

@NathanThinks

28 Feb 2025

thrilled to see Big-MATH climbing to #3️⃣ on @huggingface—clear signal the community wants more high-quality, verifiable RL datasets. grateful to everyone who’s been liking, downloading, and supporting ❤️

11,143

Rafael Rafailov @ NeurIPS · Jan 25, 2025 · 6:20 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

25 Jan 2025

Replying to @Shalev_lif

I’ve grown to think that the RL algorithm doesn’t matter as much as long as it’s scalable enough and can work slightly off-policy. The model priors matter a lot.

2,802

Rafael Rafailov @ NeurIPS · Jan 13, 2024 · 11:50 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

13 Jan 2024

Replying to @McaleerStephen

You will have a great answer in three weeks.

3,812

Rafael Rafailov @ NeurIPS · Feb 6, 2025 · 7:39 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Feb 2025

Pretty good analysis, we have similar observations and are trying to dig more into the DS V3 model itself. I believe these behaviors are also likely due to deliberate use/curation of synthetic data in pre/mid training (maybe even o1 traces in the case of R1).

Zichen Liu

@zzlccc

6 Feb 2025

🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵

4,914

Rafael Rafailov @ NeurIPS · Oct 24, 2023 · 1:30 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

24 Oct 2023

Contrastive Preference Learning (CPL) extends DPO to arbitrary MDPs and achieves great results on robot control! It uses a simple 1-line objective and does not require a reward model or any RL! I am so excited about this work, more below 🧵 arXiv: arxiv.org/abs/2310.13639

6,496

Rafael Rafailov @ NeurIPS · Sep 28, 2025 · 5:14 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

28 Sep 2025

Replying to @HanchungLee

arxiv.org/abs/2506.09477

On a few pitfalls in KL divergence gradient estimation for RL

We point out a few pitfalls in implementing gradient estimation for KL divergence in RL training for LLM, as seen in a number of open source projects and papers. The first major pitfall is to...

arxiv.org

1,975

Rafael Rafailov @ NeurIPS · Jan 19, 2025 · 5:31 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

19 Jan 2025

Replying to @iScienceLuvr

I’m getting more partial to the “q-STaR” formulation these days, but likely multiple objectives work. I would bet different labs actually ended up using slightly different approaches. arxiv.org/abs/2501.04682

Towards System 2 Reasoning in LLMs: Learning How to Think With...

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular...

arxiv.org

6,255

Rafael Rafailov @ NeurIPS · Mar 29, 2024 · 4:01 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Mar 2024

Multi-turn interactive RL should be a bigger focus. Current methods are not well-suited for this - i.e. PPO can't train with user in the loop generally and offline Q-learning still does not work at scale. It's interesting to see more work in that direction.

Philipp Fränken @jphilippfranken

29 Mar 2024

When prompting language models to complete a task, users often leave important things unsaid. Can language models teach themselves to ask clarifying questions? In STaR-GATE, we explore LMs' ability to self-improve by rewarding the model for generating useful questions!

7,424

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

The image below is the solution to the Geometry problem from IMO 2023. Even providing the solution plot, the model struggles with basic geometric structures and messes up multiple spatial relations. It does not understand how these things relate to each other. 4/N

11,288

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

Similar to many of @ylecun's arguments the model needs to be able to do planning over prior results/techniques AND develop them on the fly if they are not available. This is fundamentally different from token space planning, since it needs to be done over concepts. 9/N

5,051

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

Check out our position paper: arxiv.org/abs/2501.04682 for a lot more discussion, empirical results and technical details. We have nearly two pages of open research problems and we need people to work on them! If these interest you and you want to work on open research, get in touch!

Towards System 2 Reasoning in LLMs: Learning How to Think With...

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular...

arxiv.org

3,069

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

Finally the “gold” level at the IMO requires a certain amount of “spark”. My favorite example is the “windmill” problem from IMO 2011 (I did not solve it). The solution is only a few lines long and does not require any prior knowledge (try to think about it). 12/N

5,204

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

To be able to solve these problems, the model would need significant specialized training on spatial understanding and geometry, something which does not emerge from generic Internet data, since that is more semantics based (perhaps textbooks could help). 5/N

6,201

Rafael Rafailov @ NeurIPS · Nov 10, 2025 · 8:21 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Nov 2025

Replying to @kalomaze

I don’t understand this? There is no insider knowledge or anything. The thing in the DPO paper called “PPO-ours” was something we implemented that used four rollouts per prompt and centered the rewards. It uses a correct KL estimate (as part of the reward) unlike the original GRPO paper. At the time I thought about this as an implementation trick rather than anything groundbreaking.

3,419
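
A minimal sketch of the recipe described in the tweet above: several rollouts per prompt, a KL term folded into the reward, and rewards centered per prompt. The function name, the `beta` coefficient, and the use of summed sequence log-probabilities as the per-sample KL estimate are assumptions for illustration; this is not the actual “PPO-ours” code.

```python
import numpy as np

def centered_kl_rewards(task_rewards, logp_policy, logp_ref, beta=0.1):
    """Shape and center rewards for K rollouts per prompt.

    All inputs have shape (num_prompts, K). logp_policy and logp_ref are
    summed sequence log-probabilities of each sampled response under the
    current policy and the reference model (hypothetical inputs).
    """
    task_rewards = np.asarray(task_rewards, dtype=float)
    # For samples drawn from the policy, log pi(y|x) - log pi_ref(y|x) is an
    # unbiased single-sample estimate of KL(pi || pi_ref).
    kl_est = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    # The KL penalty enters as part of the reward, not as a separate loss term.
    shaped = task_rewards - beta * kl_est
    # Center within each prompt's group of rollouts (mean baseline).
    return shaped - shaped.mean(axis=1, keepdims=True)

# One prompt, four rollouts, policy identical to the reference (KL estimate 0).
advantages = centered_kl_rewards(
    task_rewards=[[1.0, 0.0, 0.0, 1.0]],
    logp_policy=[[-5.0, -5.0, -5.0, -5.0]],
    logp_ref=[[-5.0, -5.0, -5.0, -5.0]],
)
print(advantages)
```

Each row of the result sums to zero, so a rollout is only pushed up relative to the other rollouts for the same prompt.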

Rafael Rafailov @ NeurIPS · Jan 27, 2025 · 9:02 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

27 Jan 2025

I agree completely, the question is what changed in the base model? The internet data distribution on “reasoning” is the same.

Prithviraj (Raj) Ammanabrolu

@rajammanabrolu

26 Jan 2025

Simply, no. I've been looking at my old results from doing RL with "verifiable" rewards (math puzzle games, python code to pass unit tests) starting from 2019 with GPT-1/2 to 2024 with Qwen Math. Deepseek's success likely lies in the base models improving, the RL is constant

10,813

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:04 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

As a former math competitor this definitely fit my own thinking process - evaluating potential approaches to a solution, pruning directions that don't make progress, exploring branching claims trying to build a graph towards the final goal (solution/proof) based on intuition-v(S)

5,072

Rafael Rafailov @ NeurIPS · Jun 11, 2025 · 3:35 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

11 Jun 2025

Replying to @hallerite @prathamgrv

You’ve “derived” nothing, this is a definition.

736

Rafael Rafailov @ NeurIPS · Jun 5, 2025 · 11:18 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

5 Jun 2025

(Meta) CoTs are search inside world models (the prompt is the goal specification).

Jon Richens @jonathanrichens

4 Jun 2025

Are world models necessary to achieve human-level agents, or is there a model-free short-cut? Our new #ICML2025 paper tackles this question from first principles, and finds a surprising answer, agents _are_ world models… 🧵

3,518

Rafael Rafailov @ NeurIPS · Jan 25, 2025 · 8:29 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

25 Jan 2025

Replying to @DimitrisPapail @lateinteraction

We’ve thrown all algorithms we have at this problem, including PPO and MCTS, over the last 3 years. All of them saturated. What changed is what goes in the “base” model. Literally thousands of papers on this, idk how it's a discussion.

5,723

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

Out of the gate, the first main challenge is Geometry. Solving these problems requires a significant amount of spatial understanding and reasoning, which a pure LLM likely cannot develop. Even a strong VLM such as GPT4-V struggles a lot with basic spatial understanding. 3/N

7,213

Rafael Rafailov @ NeurIPS · Jul 3, 2024 · 2:21 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

3 Jul 2024

Our robotics foundation model OpenVLA has crossed 10,000 downloads in the last month! If you're using or fine-tuning the model, I'd be really interested to hear about your use cases and experience!

6,510

Rafael Rafailov @ NeurIPS · Jun 11, 2025 · 3:15 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

11 Jun 2025

Replying to @prathamgrv

How does one “derive” KL divergence in your mind?

2,180
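
For reference, the standard definition being discussed, for two discrete distributions $P$ and $Q$ over the same outcomes $x$ (the notation here is the usual textbook one, not taken from the thread):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x) \,\log \frac{P(x)}{Q(x)}
```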

Rafael Rafailov @ NeurIPS · Jul 23, 2024 · 6:18 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

23 Jul 2024

We are presenting this work at ICML today 11.30-1pm! Stop by to discuss anything related to RLHF and LLM fine-tuning!

Anikait Singh

@Anikait_Singh_

22 Apr 2024

(1/N) Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Our new work finds that approaches employing on-policy sampling or negative gradients outperform offline, maximum likelihood objectives.

6,441

Rafael Rafailov @ NeurIPS · Mar 6, 2025 · 7:06 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Mar 2025

Really cool work towards explaining the persistent gap between fully offline and online(ish) RLHF methods.

Gokul Swamy @g_k_swamy

4 Mar 2025

1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to🤿!

7,042

Rafael Rafailov @ NeurIPS · Dec 17, 2023 · 5:20 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

17 Dec 2023

Another DPO SOTA model. @lmsysorg can we get this one in the arena?

BURKOV

@burkov

17 Dec 2023

SOLAR: an 11B model that beats every open model, including Mixtral, Yi-34B, Llama 2 70B, and Falcon 180B: huggingface.co/upstage/SOLAR…

40,988

Rafael Rafailov @ NeurIPS · Mar 13, 2025 · 6:14 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

13 Mar 2025

Off-policy RL ftw

hr0nix

@hr0nix

13 Mar 2025

Dudes will carefully optimize GRPO clipping epsilon, only to immediately discard their generations after one parameter update.

6,879

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

So do we (and advanced reasoning models) just need to do search? No, we need to TEACH the models to do this themselves for two main reasons: 1. Efficiency - training a model to search in-context can teach it to avoid exploring similar branches. 2. Super-Intelligence.

3,400

Rafael Rafailov @ NeurIPS · Jul 10, 2024 · 1:36 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Jul 2024

Our new paper MJ-BENCH evaluating generative reward models for text-to-image generation is now out! We find that Large Vision Language Models can act as zero shot feedback providers for diffusion models! More details below 👇

7,045

Rafael Rafailov @ NeurIPS · Oct 6, 2025 · 8:02 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

6 Oct 2025

Excited for more community integrations!

Zichen Liu

@zzlccc

6 Oct 2025

GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced deepseek-r1 length increasing with LoRA 3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs 4. added Terminal, MCP, visual and multi-agent environments … Open the thread for a deep dive!

8,134

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

The competition runs over two days, each with three problems and 4.5 hours of working time. The problems usually cover four main areas - Algebra, Geometry, Number Theory and Combinatorics. These problems are quite challenging; the average high-schooler would likely score 0. 2/N

7,557

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

So, do advanced reasoning models also carry out in-context search? We believe so! 1. O1 seems to implement a general search with backtracking and branching. 2. DeepSeek R1 uses additional self-criticism or inner-dialogue. 3. Gemini Think follows a revision-based format.

3,433

Rafael Rafailov @ NeurIPS · Jan 28, 2025 · 9:41 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

28 Jan 2025

Meta-RL- learning to think

Andrej Karpathy

@karpathy

28 Jan 2025

"Move 37" is the word-of-day - it's when an AI, trained via the trial-and-error process of reinforcement learning, discovers actions that are new, surprising, and secretly brilliant even to expert humans. It is a magical, just slightly unnerving, emergent phenomenon only achievable by large-scale reinforcement learning. You can't get there by expert imitation. It's when AlphaGo played move 37 in Game 2 against Lee Sedol, a weird move that was estimated to only have 1 in 10,000 chance to be played by a human, but one that was creative and brilliant in retrospect, leading to a win in that game. We've seen Move 37 in a closed, game-like environment like Go, but with the latest crop of "thinking" LLM models (e.g. OpenAI-o1, DeepSeek-R1, Gemini 2.0 Flash Thinking), we are seeing the first very early glimmers of things like it in open world domains. The models discover, in the process of trying to solve many diverse math/code/etc. problems, strategies that resemble the internal monologue of humans, which are very hard (/impossible) to directly program into the models. I call these "cognitive strategies" - things like approaching a problem from different angles, trying out different ideas, finding analogies, backtracking, re-examining, etc. Weird as it sounds, it's plausible that LLMs can discover better ways of thinking, of solving problems, of connecting ideas across disciplines, and do so in a way we will find surprising, puzzling, but creative and brilliant in retrospect. It could get plenty weirder too - it's plausible (even likely, if it's done well) that the optimization invents its own language that is inscrutable to us, but that is more efficient or effective at problem solving. The weirdness of reinforcement learning is in principle unbounded. I don't think we've seen equivalents of Move 37 yet. I don't know what it will look like. I think we're still quite early and that there is a lot of work ahead, both engineering and research. 
But the technology feels on track to find them. piped.video/watch?v=HT-UZkiO…

5,480

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:04 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

To predict the next token in the training data, the model needs to internalize the whole meta-reasoning process in its activations, which have limited capacity. This thread makes the point very clearly:

Jason Wei

@_jasonwei

10 Nov 2024

There is a nuanced but important difference between chain-of-thought before and after o1. Before the o1 paradigm (i.e., chain-of-thought prompting), there was a mismatch between what chain of thought was and what we wanted it to be. We wanted chain of thought to reflect the thinking process of the model, but what the model was really doing was just imitating reasoning paths that it had seen in pretraining, e.g., math homework solutions. The problem with this type of data is that it is a post-hoc solution summarized after the author did all the work somewhere else, and not really a sequence of thoughts. So the solutions often had poor information density, with an egregious example being things like “The answer is 5 because…”, where the token “5” has a huge amount of new information. With the o1 paradigm, you can see that the chain of thought looks very different from a textbook math solution (you can view examples in the blog post). These chains of thought are kinda like “inner monologue” or “stream of consciousness”. You can see the model backtracking; it says things like “alternatively, let’s try” or “wait, but”. And I have not measured directly, but I would wager a bet (my psycholinguistics friends would probably be able to confirm) that the information density is *much* more uniform in the chain of thought than average text on the internet.

6,741

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

Beyond the issues of understanding graphs and spatial reasoning, the challenges only get harder. A strong IMO competitor is not unlike an athlete. It takes years of problem solving for multiple hours a day to develop the background and skills to compete at this level. 6/N

6,053

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

However, the biggest unanswered question is about Super-Intelligence - can these models discover novel ALGORITHMS of thinking, which allow them to solve problems that classical search CANNOT solve under ANY compute budget? DID THE COMPUTE-PERFORMANCE CURVE MOVE LEFT OR UP?

3,625

Rafael Rafailov @ NeurIPS · Nov 29, 2023 · 1:16 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

29 Nov 2023

A solution is usually a few hundred tokens, but only represents a combination of fewer than 5 concepts - we need some sort of latent compositional planning. It is not clear that current LLM systems can do that (perhaps ToT + Q-learning?). 10/N

4,782

Rafael Rafailov @ NeurIPS · Sep 10, 2024 · 3:47 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

10 Sep 2024

It’s important to eat healthy!

3,429

Rafael Rafailov @ NeurIPS · Mar 7, 2024 · 5:37 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

7 Mar 2024

As I said earlier, we need to figure out RL at foundation model scale. This work is yet another piece of the missing puzzle. What I still wonder is how dynamic programming RL training affects the knowledge inherent within a pre-trained model? Some thoughts on this soon.

Aviral Kumar

@aviral_kumar2

7 Mar 2024

Super simple code change to get value-based deep RL scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! arxiv.org/abs/2403.03950🧵⬇️

6,122
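
A rough sketch of the idea in the quoted tweet: train a value head with a cross-entropy loss over discretized returns instead of regressing a scalar with MSE. The tweet does not say how targets are built, so the two-hot encoding, the bin layout, and all function names below are assumptions for illustration only.

```python
import numpy as np

def two_hot(target, bin_centers):
    """Spread a scalar target over its two neighbouring bins.

    The resulting distribution has expected value equal to the
    (clipped) target. bin_centers must be sorted ascending.
    """
    bin_centers = np.asarray(bin_centers, dtype=float)
    target = float(np.clip(target, bin_centers[0], bin_centers[-1]))
    upper = int(np.searchsorted(bin_centers, target, side="left"))
    probs = np.zeros(len(bin_centers))
    if bin_centers[upper] == target:
        probs[upper] = 1.0  # target sits exactly on a bin center
        return probs
    lower = upper - 1
    width = bin_centers[upper] - bin_centers[lower]
    probs[lower] = (bin_centers[upper] - target) / width
    probs[upper] = (target - bin_centers[lower]) / width
    return probs

def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target_probs * log_probs).sum())

bins = np.linspace(0.0, 1.0, 5)      # hypothetical value range [0, 1]
target_dist = two_hot(0.6, bins)     # categorical target replacing the scalar 0.6
loss = cross_entropy(np.zeros(5), target_dist)
print(target_dist, loss)
```

At prediction time the scalar value is recovered as the expectation of the bin centers under the predicted distribution.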

Rafael Rafailov @ NeurIPS · Dec 9, 2023 · 10:54 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Dec 2023

At NeurIPS this whole week. Hit me up if you want to chat about RLHF, generative models, agents, robot learning, world models or anything about foundation models and decision making!

9,462

Rafael Rafailov @ NeurIPS · Aug 14, 2024 · 5:57 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

14 Aug 2024

Evidence is mounting that increased inference time budgets are a capabilities shift. The question is how to really use them. Longer term I believe large scale RL will allow us to discover better optimization strategies.

@_akhaliq

14 Aug 2024

Salesforce releases DEI, an open AI software engineering agents org with a 55% resolve rate on SWE-Bench Lite. Discussion: huggingface.co/papers/2408.0… We propose DEI (Diversity Empowered Intelligence), a framework that leverages SWE agents' unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions.

4,063

Rafael Rafailov @ NeurIPS · Dec 1, 2023 · 2:13 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

1 Dec 2023

It really surprises me how far we can push a 7B model. It feels like with the right data mix and a 70B range model, we could already be able to match or even out-perform GPT 3.5 with an open-source model!

Argilla @argilla_io

1 Dec 2023

🔥Open-source, open-science, and data curation for the win! Meet Notus 7B, a new LLM tuned with DPO on a new curated UltraFeedback dataset, surpassing Zephyr and Claude 2 on AlpacaEval. Built on the shoulders of giants: 🙌@huggingface Alignment Handbook argilla.io/blog/notus7b

5,549

Rafael Rafailov @ NeurIPS · Mar 2, 2024 · 5:15 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

2 Mar 2024

Doing efficient RL properly at Foundation Model scale is still an open problem in my opinion. It’s especially prominent in agent and robotics applications and we can get significant benefits from figuring this out. This work is a step in that direction.

Aviral Kumar

@aviral_kumar2

1 Mar 2024

How can we train LLM Agents, to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: arxiv.org/abs/2402.19446 Website: yifeizhou02.github.io/archer…

13,152

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

We've been working on distributed, highly scalable online inference, search and RL infrastructure on top of the Neo-X framework, shooting for SOTA, which we aim to be FULLY OPEN. If you're interested in Infra, get in touch! synthlabs.ai/blog/rlhf-and-r…

Introducing Large-Scale RLHF and RLAIF in GPT-NeoX

SynthLabs & EleutherAI bring large-scale preference learning (DPO, KTO, reward models) to GPT-NeoX, boosting performance 30–40% for wider access.

synthlabs.ai

2,540

Rafael Rafailov @ NeurIPS · Oct 25, 2025 · 6:40 AM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

25 Oct 2025

Replying to @pashmerepat

This is not at all what an environment is. This is an abstraction that has existed and been built over 30 years.

2,063

Rafael Rafailov @ NeurIPS · Jan 9, 2025 · 8:04 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

9 Jan 2025

So what does the Meta-CoT look like? It's hard to tell since people don't write down their problem-solving processes. However, we stipulate that in domains with a generator-verifier gap this is fundamentally represented by a SEARCH process.

3,968

Rafael Rafailov @ NeurIPS · Apr 23, 2024 · 9:05 PM UTC

Rafael Rafailov @ NeurIPS

@rm_rafailov

23 Apr 2024

This was an awesome project - we teach models to follow constitutional principles with self-supervision (no labels). We also show that a weak model can generate principles for a stronger one, which self-aligns (SUPERALIGNMENT!) and can beat the instruction-tuned (RLHF-ed) model!

Philipp Fränken @jphilippfranken

23 Apr 2024

Constitutional AI showed LMs can learn to follow constitutions by labeling their own outputs. But why can't we just tell a base model the principles of desired behavior and rely on it to act appropriately? Introducing SAMI: Self-Supervised Alignment with Mutual Information!

11,819