Research Scientist @HuggingFace. PhD in Deep RL approaches for Robotic Navigation @INRIA.

Lyon, France
As part of our open reproduction of R1, we have roughly reproduced DeepSeek's MATH-500 eval numbers with Hugging Face's lighteval suite. We had to improve our LaTeX parser to get the last few %.
23
104
1,193
145,286
A month ago I joined🤗@huggingface as a Research Scientist. They're great: opening an office in Lyon, allowing me to work on open-source projects and trusting me to define my own schedule. I am proud to have added the Decision Transformer to🤗transformers. huggingface.co/blog/decision…
5
47
424
Today @huggingface 🤗 releases a long-awaited tutorial on training Decision Transformer models as a blog post and Colab notebook. This is part of a series on the application of transformer models in Deep RL settings. 👉 huggingface.co/blog/train-de… #reinforcementlearning #transformers
1
36
236
Sneak peek at the WIP of an upcoming FPS environment for my @godotengine Reinforcement Learning library. Agents trained using async PPO and population-based training with sample-factory. 👉 github.com/edbeeching/godot_… It will soon be available on the @huggingface hub!
4
22
222
We will soon release the @huggingface LLM alignment handbook. Using these recipes you can build state-of-the-art chatbots such as Zephyr-7b, released today. Register your interest by starring the GitHub repo: github.com/huggingface/align… You can find out about Zephyr-7b in this thread:
3
35
208
50,047
The winning AI Math Olympiad model is out! It uses an approach we call Self-Consistency with Tool Integrated Reasoning. The constraints of Kaggle (T4 GPUs) required us to use activation-aware quantization in order not to degrade model performance. Details and code to follow next week.
Introducing NuminaMath-7B-TIR, the small but mighty model that won the first progress prize of the AI Math Olympiad 🥇! > Fine-tuned with iterative SFT on DeepSeekMath-7B from @deepseek_ai > Stage 1: learn math with chain of thought samples > Stage 2: learn code with tool-integrated reasoning (TIR) > Inference: self-consistency decoding with tool-integrated reasoning to generate solutions 🤖 Model: huggingface.co/AI-MO/NuminaM… ♾️ Demo: huggingface.co/spaces/AI-MO/… This has been quite a wild journey and I am grateful to have collaborated with a cracked team of researchers from Numina and Hugging Face - kudos to @edwardbeeching @JiaLi52524397 @ben_lipkin @vwxyzjn @krasul @AlbertQJiang and Roman Soletskyi for creating high-quality datasets & training kick ass models! #AIMO #Kaggle #AIMathOlympiad
1
24
184
19,809
We are proud to release the first open-source multi-modal, multi-task and multi-domain model! Called JAT. A crucial step for generalist agents. What started out as an open reproduction of GATO with @QGallouedec, @ClementRomac and myself, has evolved into a far greater project.
5
36
172
28,270
Our prize winning Math recipe is now released with datasets, training code and a new 72B math model. See thread for more details:
We have just released the ✨NuminaMath datasets: the largest collection of ~1M math competition problem-solution pairs, ranging in difficulty from junior challenge to Math Olympiad preselection. These datasets were used to win the 1st Progress Prize of the AI Math Olympiad and consist of two subsets: ⛓️ Chain of Thought (CoT): 860k problem-solution pairs templated with CoT to enhance mathematical reasoning in natural language 🛠️ Tool-integrated reasoning (TIR): 73k synthetic solutions derived from GPT-4 with code-execution feedback to decompose hard problems into simpler subproblems that can be solved with Python Models trained on NuminaMath achieve best-in-class performance among open weight models and approach or surpass proprietary models on math competition benchmarks 🔥 Our datasets and models can be found on the 🤗 Hub: huggingface.co/collections/A…
4
26
160
39,867
When I am not busy Aligning LLMs, I spend my free time developing Godot RL Agents, an RL library for the #Godot game engine. Today we released version 0.7.0 with a number of new features, bugfixes and examples. Thanks to all the contributors for creating cool example envs:
4
19
146
12,246
Announcing the release of Sample Factory 2.0. A lightning fast production grade Deep RL library. Sample Factory 2.0 is a collaboration between @petrenko_ai from @uscresl and 🤗 @huggingface. 👉 github.com/alex-petrenko/sam… Find out more on this 🧵
5
39
143
The last environment for my @godotengine Reinforcement Learning lib, a team FPS. The release is planned for tomorrow once I have updated the docs. 👉 github.com/edbeeching/godot_… Env source code and builds are already available on the @huggingface hub: 👉 huggingface.co/datasets?sort…
1
25
141
16,577
We have a new leader on the Open LLM leaderboard. Congrats to ausboss/llama-30b-supercot! They combined chain-of-thought datasets, code explanations and instructions, snippets, logical deductions and Alpaca GPT-4 prompts. Check it out here: huggingface.co/spaces/Huggin…
4
28
131
33,688
Godot RL Agents v0.4.0 has been released. 👉github.com/edbeeching/godot_… This includes: ‣ Godot 4 support ‣ 3 RL frameworks: Sample Factory, Stable Baselines 3 and RLlib ‣ 2 Advanced Racing and FPS environments ‣ Updated docs (still WIP🙂) Find out more in this thread 🧵
3
24
95
14,717
Thanks to the community for their feedback on DPO vs. IPO vs. KTO. In particular, we thank the authors of IPO, who have worked with us this week to improve TRL's IPO implementation. IPO now is comparable to DPO! Check out the updated blogpost. huggingface.co/blog/pref-tun…
1
26
96
20,711
Today we demonstrate how the performance of Llama 1B can be scaled to outperform Llama 8B with tree search, guided by a Process Reward Model. During our efforts to replicate DeepMind's Test Time Compute paper, we found that beam search resulted in poor diversity when n>16. 👇🧵
2
12
77
6,482
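The search idea in the post above can be illustrated with a generic beam search. This is only a minimal sketch under stated assumptions: `expand` is a hypothetical stand-in for sampling candidate next reasoning steps from the LLM, `score` is a hypothetical stand-in for the Process Reward Model, and none of it is taken from the actual implementation.

```python
from typing import Callable


def beam_search(
    start: str,
    expand: Callable[[str], list[str]],  # hypothetical: proposes next partial solutions
    score: Callable[[str], float],       # hypothetical: stands in for a PRM score
    beam_width: int,
    depth: int,
) -> str:
    """Keep the `beam_width` highest-scoring partial solutions at each step."""
    beams = [start]
    for _ in range(depth):
        # Expand every surviving partial solution by one step.
        candidates = [child for beam in beams for child in expand(beam)]
        if not candidates:
            break
        # Rank all candidates by score and keep only the best few.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=score)
```

Because only the top-scoring candidates survive each step, the surviving beams tend to share long prefixes, which is one plausible reading of the diversity problem mentioned in the post.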
We have added Online Direct Preference Optimization to TRL. We observe that online methods, while slower to optimize, outperform their offline counterparts at various model scales.
2
11
72
22,165
After many requests, v0.1 of the LLM alignment handbook is now available. We've worked hard to make this as accessible as possible, so you can run: 🏋️‍♂️ Full fine-tuning with @MSFTDeepSpeed ZeRO-3 on A100s 🐭 LoRA or QLoRA fine-tuning on consumer GPUs Code: github.com/huggingface/align…
1
17
69
8,893
It seems like every week there is a new LLM or chatbot being released. In order to keep track of the progress of the open-source community, I created the🤗open LLM leaderboard. It benchmarks against 4 key metrics from the @EleutherAI LM Harness. huggingface.co/spaces/Huggin…
5
13
68
15,213
A year ago I created the Open LLM Leaderboard. Now it has over 10,000 likes and is the #2 Space. In the next month it will overtake Stable Diffusion and become the #1 Space on Hugging Face!
The Open LLM leaderboard is now the #2 most liked space ever on @huggingface with 10,000+ likes (huggingface.co/spaces?sort=l…)! Also, there are now hundreds of leaderboards for tons of different tasks, domains, languages,... on spaces (huggingface.co/spaces?search…) Very cool to see HF becoming the place to be for AI evaluation!
3
6
58
24,988
Over the past few weeks, we've been focused on pushing the boundaries of competitive programming models by reproducing key elements of DeepSeek-R1. Today, we're excited to release 3 open-source artifacts: 🧵
1
4
57
4,855
In our latest blog post, we summarize our extensive evaluation of three state of the art alignment algorithms. DPO vs IPO vs KTO. The results demonstrate a complex interaction between key hyper-parameters, models and datasets. #RLHF #DPO huggingface.co/blog/pref-tun…
2
14
49
17,099
Proud to release an RL interface for the @godotengine Included are wrappers for both Ray RLlib @raydistributed and StableBaselines3 @araffin2 Find out more on the GitHub page github.com/edbeeching/godot_…
9
44
Does your LLM know what a pizza looks like? You need a Vision Language Model. Here at @huggingface we have just added VLM finetuning support to TRL's SFTTrainer.
3
5
44
19,934
I've added a Fall Guys style environment to Godot RL Agents. The agent learned its behavior fairly quickly, 20 minutes / 2M steps of PPO and default hyperparameters. Check out the library and more examples here: github.com/edbeeching/godot_… The amazing assets are from @KayLousberg
2
5
41
3,059
Replying to @jaxgriot
You can see our roadmap on the repo:
1
35
3,855
One of my contributions was the Tree of Thoughts algorithm that interleaved generation with code execution and correction. The constraints of running on Kaggle required an optimized and elegant solution to scale up to majority voting with 48 candidate solutions per problem.
4
3
37
3,401
Fine-tune Vision Language Models in a few lines of code:
1
3
28
3,289
Happy to announce that our recent work on augmenting a Deep RL agent with differentiable projective geometry and spatially structured memory is now available on ArXiv. arxiv.org/abs/2002.02286
5
26
@huggingface are proud to have teamed up with Numina and @MistralAI to win the first AI Math Olympiad. We will be sharing the details of our method over the coming weeks. This will include open source models, training code and evaluation pipelines.
Six months ago, we launched Numina to lead open research in AI4Math. Today we are super excited to share that our Numina Math 7B model won the 1st progress prize of the AI Math Olympiad 🔥🔥🔥 kaggle.com/competitions/ai-m…
1
2
25
1,786
Imitation Learning support has been added to Godot RL Agents. You can now learn complex behaviours from player demonstrations and then fine-tune with RL. Check out the trained agent (a Neural Network) from our example game.
1
3
21
1,961
Tuesday the 3rd of May at 10am CEST I will be defending my PhD thesis "Large Scale Automatic Learning of Autonomous Agent Behavior with Structured Deep Reinforcement Learning". I will be livestreaming the defense, you are all welcome to come watch. piped.video/vHiEB5LDEho
1
3
19
I'm updating my @godotengine Reinforcement Learning library to Godot 4 and adding @huggingface integration. 👉 github.com/edbeeching/godot_… I am also adding a number of example games, such as this racing game. Are there any other example games people would like me to add?
1
5
20
Commands to launch the evals in openr1:
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
...
18
5,661
@chriswolfvision your last day here at INSA Lyon / LIRIS. Best of luck at NaverLabs and enjoy the C64!
1
13
After 3 months of hard work, we are at the top of the leaderboard for the first AI Math Olympiad!
Good data is all you need
1
14
1,160
Sample Factory integration allows for the training of complex AI behaviors, such as this team FPS game. Pew Pew Pew!
1
1
12
543
You can try out Zephyr-7b here: huggingfaceh4-zephyr-chat.hf… This work was done with my H4 colleagues @_lewtun @natolambert @nazneenrajani and many others at 🤗!
1
1
12
435
As part of TRL's v0.10.1 release I also added Liger kernel support to TRL's SFT Trainer. It works with DeepSpeed ZeRO-3 out of the box and enables a 4x larger batch size! Thanks to the amazing open source work from AI researchers @LinkedIn
TRL v0.10.1 is here and it's beefy 💪 🔁 Online DPO by @GoogleDeepMind for aligning better LLMs 🐯 Liger kernel integration from @LinkedIn to supercharge SFT 🖼️ DPO for VLMs: 🌋 LLaVa, ✨ PaliGemma, 🐶 Idefics2 👩‍⚖️ Use LLMs as a judge to compute win rates during training 🔍 Anchored Preference Optimization by @ContextualAI for fine-grained human/AI feedback github.com/huggingface/trl/r…
3
11
1,881
Replying to @chriswolfvision
Christmas has definitely come early!
2
10
We have published a blog post with more details on how we trained and deployed our AIMO-winning model. Find out more about the Self-Consistency with Tool-Integrated Reasoning decoding algorithm (SC-TIR) that I implemented for the winning pipeline. huggingface.co/blog/winning-…
2
2
12
839
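The self-consistency part of SC-TIR comes down to majority voting over the final answers of several sampled candidates. A minimal sketch, assuming each candidate yields an integer answer or `None` when its generated code failed to run; the function name is hypothetical and this is not the competition code:

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[Optional[int]]) -> Optional[int]:
    """Return the most common answer, ignoring candidates that produced none."""
    # Assumption: a failed candidate (e.g. code that crashed) is recorded as None.
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(valid).most_common(1)[0][0]
```

For example, `majority_vote([42, 7, 42, None, 7, 42])` returns `42`, since three of the five valid candidates agree on it.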
For SFT we used UltraChat, which consists of ~1.6M dialogues generated by GPT-3.5. We originally trained on all the data, but found the resulting model had an annoying personality 😅. So we filtered this down to ~200k examples that focused on helpfulness huggingface.co/datasets/stin…
1
10
407
We also include an example Racing game. What environments or functionality would you like to see in the next version of Godot RL Agents?
2
11
496
We have just released an Imitation Learning tutorial for Godot RL Agents as part of Hugging Face's Deep RL class. Learn how to train an agent to solve this complex RL environment. huggingface.co/learn/deep-rl…
1
10
455
Training wise, we used 🤗 TRL and DeepSpeed ZeRO-3 for all our experiments: - SFTTrainer: huggingface.co/docs/trl/sft_… - DPOTrainer: huggingface.co/docs/trl/dpo_… Total compute cost: $500 or 8h on 16 x A100s Kudos to @krasul for implementing DPO in TRL!
1
9
310
Our OlympicCoder-32B model achieves top-tier performance, surpassing all open-weight models we tested—even some 100x larger! Learn more about how we built the dataset, benchmark, and models: huggingface.co/blog/open-r1/…
1
8
576
We also wanted to share the winning recipe with the community, so we have released the training code for those who want to take a deeper dive into LLMs for Mathematics! github.com/project-numina/ai…
1
8
580
Tomorrow, February 8 at 11 AM Pacific Time (8PM CET) we will be presenting a workshop on aligning LLMs with DPO. We will discuss the theory behind it and get hands-on with the Hugging Face Transformer Reinforcement Learning (TRL) library. Register now: eventbrite.com/e/aligning-ll…
7
388
- CodeForces-CoTs – A dataset of 100k competitive programming samples in C++ and Python.
- The IOI Benchmark – A new set featuring 2024 International Olympiad in Informatics problems.
- OlympicCoder Models (7B & 32B) – Fine-tuned models that outperform closed-source models
1
1
9
589
First introduced in a paper by @ShawnGuo13 at @GoogleDeepMind , Online DPO is a new alignment method to boost the performance of LLMs. The integration is the result of a fantastic collaboration between @ShawnGuo13 , @mnoukhov, @vwxyzjn , @QGallouedec, @_lewtun and myself.
1
1
7
707
To build a strong math model, the team at projectnumina.ai led by @JiaLi52524397 built two datasets of math problems with 1M examples, comprising problems answered with Chain of Thought and Tool Integrated Reasoning: huggingface.co/datasets/AI-M… huggingface.co/datasets/AI-M…
1
1
7
680
@huggingface has released StarChat2, a programming assistant based on BigCode's StarCoder2. We used a variant of the Zephyr recipe to add chat to the strong math and code capabilities of StarCoder2. Demo: huggingface.co/spaces/Huggin… Training code: github.com/huggingface/align… MT Bench
8
387
Zephyr is a mistral-7b finetune that outperforms llama2-70b on MT Bench and is the highest performing 7b model on the Open LLM Leaderboard. We used a combination of instruction fine-tuning and Direct Preference Optimization on publicly available datasets.
1
8
611
For DPO we used UltraFeedback, which contains 64k prompts and completions spanning a wide range of open and closed access models. Each completion is ranked by GPT-4 according to criteria like helpfulness, and given a score to derive AI preferences from. hf.co/datasets/openbmb/Ultra…
1
7
330
All the code is open source in the alignment handbook: github.com/huggingface/align…
2
7
428
For evaluations we used the excellent MT Bench from @lmsysorg This multi-turn benchmark evaluates chatbot capabilities across various domains like creative writing, code and math. It provides a much higher signal on chatbot perf than other leaderboards huggingface.co/spaces/lmsys/…
1
7
367
@_lewtun and I used this data for two-stage fine-tuning. For the competition, we released a 7B model. To see how our recipe scales, today we are releasing a 72B model with comparable performance to GPT-o when evaluated with Tool Integrated Reasoning. huggingface.co/AI-MO/NuminaM…
1
2
6
713
Preference Alignment for Multimodal models is now supported in TRL, amazing work by @QGallouedec and the team at @huggingface ! What algorithm should we implement next?
🤔 Can we train a VLM to 𝐩𝐫𝐞𝐟𝐞𝐫? This is now possible, thanks to the new TRL/DPO support for VLMs! 🎉 As an example, we've trained a model to reduce hallucinations. Check out: 📰 Blog post: huggingface.co/blog/dpo_vlm 🐙 TRL: github.com/huggingface/trl Thanks to @mervenoyann, @vwxyzjn and @krasul who helped me with this work!
5
592
The agent's observations are raycasts rather than pixels, so during training I do not render any cameras and run headless. You can also increase the physics rate, which gives a nice speedup.
4
184
A broad range of things: implementing new models in the transformers library, reading papers, working on open-source projects such as my Deep RL interface for the Godot Game Engine, building environments for Embodied AI, and sharing expertise with the rest of the team here at 🤗
5
The DAPO paper did some nice ablations of this and confirmed our intuition / more limited empirical observations: arxiv.org/pdf/2503.14476
1
125
It is easy to get started with Online DPO, check out the example script: github.com/huggingface/trl/b…
1
1
4
584
Hi, I had a look at your GitHub and it should be fairly easy to integrate the model in the transformers library and host the model checkpoints on the🤗Hub. The dataset license is indeed restrictive, we are looking into this. I will send you an email about the model integration.
5
Delighted to see that our work on augmenting RL agents with egocentric neural memory has been accepted to ECML-PKDD 2020!
How to automatically discover objects and affordances from reward through projective egocentric memory: @edwardbeeching's paper has been accepted to ECML-PKDD 2020 (with Jilles Dibangoye, Olivier Simonin and yours truly). @chroma_inria @LIRISLyon @citi_lab
1
3
Replying to @_lewtun @teknium
Thanks for highlighting this powerful @huggingface Datasets feature, although I don't think you can infer that 99.9% of convs are single turn, just that 99.9% of convs have between 2-8 turns.
1
2
190
and a 3D Lunar lander example:
1
4
243
Replying to @araffin2
For this run of the algorithm the best one at 10M steps is A? But let me guess, it is the same algorithm?
4
Anonymous comment from a colleague earlier: "Ok DeepSpeed has defeated me for another day. Will revisit tomorrow." I have a love-hate relationship with DeepSpeed, when it works it is magical, but it can be quite frustrating to debug when it doesn't work out of the box.
3
337
fixed, let us know if you spot anything else
1
1
110
Physics can be sped up for all environments, enabling accelerated training. Cut training time from hours to minutes.
1
1
3
422
We also added a bunch of new examples including, learning how to park a car:
1
2
164
Replying to @abidlabs
Do {thing} or we are going to go wash your hair. Works every time, apart from when you need to wash their hair.
1
1
468
Racer hovercrafts!
1
2
132
You can submit your own models for evaluation at the bottom of the leaderboard and they will be queued and run automatically on spare nodes on the 🤗 research cluster! You can even submit delta weights for non-commercial models such as llama.
1
3
305
A reminder:
* DPO: recasts the RLHF objective as a loss over a prompt and its positive and negative completions
* IPO: replaces DPO's sigmoid, which can potentially cause overfitting, with an identity function
* KTO: rather than paired positive and negative completions, takes unpaired good and bad examples
1
2
281
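The difference between the DPO and IPO objectives is easier to see per preference pair. A minimal sketch, assuming the standard forms from the DPO and IPO papers; the function and argument names are hypothetical, the inputs are the policy-minus-reference log-probabilities of each completion, and KTO is left out because its unpaired formulation needs additional terms:

```python
import math


def dpo_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """DPO: negative log-sigmoid of the scaled margin between the two completions."""
    margin = chosen_logratio - rejected_logratio
    # log(1 + exp(-x)) equals -log(sigmoid(x)).
    return math.log1p(math.exp(-beta * margin))


def ipo_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """IPO: squared distance between the margin and a fixed target of 1 / (2 * beta)."""
    margin = chosen_logratio - rejected_logratio
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

In this form the DPO loss keeps shrinking as the margin grows, while the IPO loss is minimized at a finite margin, which is one way to read the overfitting remark above.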
This release adds a long requested feature: grid sensors. Set sail on high seas with a pirate example demonstrating the new sensor.
1
3
185
The next version of the library will add a long-requested feature: imitation learning support, which should allow the learning of far more complex behaviors!
3
226
While the observations about each algorithm remain the same with OpenHermes, that is, the ranking is DPO > KTO > IPO, the sweet spot for beta varies wildly between algorithms: the best choices of beta for DPO, KTO and IPO are 0.6, 0.3 and 0.01, respectively.
1
3
494
Replying to @chriswolfvision
I have a UDP joke but maybe you won't get it.
3
The observation is a small 12x12 cone of raycasts, a normalized vector pointing to the goal and a 4D 1-hot indicating which of the 4 levels the agent is on. Reward is the improvement in best distance to goal.
1
2
60
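The reward described above, the improvement in best distance to goal, can be sketched as follows. This is an illustrative Python version with a hypothetical class name, not the environment's actual code, and it assumes the reward is zero on steps where the agent does not beat its best distance so far:

```python
class BestDistanceReward:
    """Rewards the agent only when it gets closer to the goal than ever before."""

    def __init__(self, initial_distance: float) -> None:
        # Best (smallest) distance to the goal seen so far in the episode.
        self.best = initial_distance

    def step(self, distance: float) -> float:
        # Assumption: no penalty for moving away, just zero reward.
        improvement = max(0.0, self.best - distance)
        self.best = min(self.best, distance)
        return improvement
```

With this shaping the agent cannot collect reward by oscillating back and forth, because ground that has already been covered pays nothing a second time.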
Replying to @_lewtun
That's easy for you to say. I had to review it!
3
110
Sample Factory achieves high throughput, training at hundreds of thousands of interactions per second 🔥 It includes a number of advanced features: 🟢 Multi-agent training 🟢 Self-play 🟢 Multi-GPU population-based training 🟢 Support for vectorized and GPU accelerated envs
1
2
Awesome analytics, our H4 models have been downloaded over 10M times!
The new analytics tab in Hub orgs is very cool and we can see that the H4 models have been downloaded ~10M times, driven mostly by Zephyr / StarChat. Funnily enough, I thought the huge spike was from Zephyr, but it's actually from StarChat ... perhaps someone accidentally put it in their CI pipeline 😅
967
Great work and thanks for sharing the recipe. I will review the PR now :)
1
258
I totally agree, running the experiments for the post left me with more questions than answers. I think we may have a more extensive follow-up where we evaluate on some other benchmarks such as Alpaca eval.
13