As part of our open reproduction of R1, we have roughly reproduced DeepSeek's MATH-500 eval numbers with Hugging Face's lighteval suite.
We had to improve our LaTeX parser to get the last few %.
A month ago I joined 🤗 @huggingface as a Research Scientist. They're great: opening an office in Lyon, allowing me to work on open-source projects and trusting me to define my own schedule. I am proud to have added the Decision Transformer to 🤗 transformers. huggingface.co/blog/decision…
Sneak peek at a WIP FPS environment for my @godotengine Reinforcement Learning library.
Agents trained using async PPO and population-based training with sample-factory.
👉 github.com/edbeeching/godot_…
It will soon be available on the @huggingface hub!
We will soon release the @huggingface LLM alignment handbook. Using these recipes you can build state-of-the-art chatbots such as Zephyr-7b, released today. Register your interest by starring the GitHub repo: github.com/huggingface/align…
You can find out about Zephyr-7b in this thread:
The winning AI Math Olympiad model is out!
Using an approach we call Self-Consistency with Tool Integrated Reasoning. Constraints of Kaggle (T4 GPUs) required us to use activation aware quantization in order to not degrade model performance. Details and code to follow next week.
Introducing NuminaMath-7B-TIR, the small but mighty model that won the first progress prize of the AI Math Olympiad 🥇!
> Fine-tuned with iterative SFT on DeepSeekMath-7B from @deepseek_ai
> Stage 1: learn math with chain of thought samples
> Stage 2: learn code with tool-integrated reasoning (TIR)
> Inference: self-consistency decoding with tool-integrated reasoning to generate solutions
🤖 Model: huggingface.co/AI-MO/NuminaM…
♾️ Demo: huggingface.co/spaces/AI-MO/…
This has been quite a wild journey and I am grateful to have collaborated with a cracked team of researchers from Numina and Hugging Face - kudos to @edwardbeeching @JiaLi52524397 @ben_lipkin @vwxyzjn @krasul @AlbertQJiang and Roman Soletskyi for creating high-quality datasets & training kick ass models!
#AIMO #Kaggle #AIMathOlympiad
We are proud to release JAT, the first open-source multi-modal, multi-task and multi-domain model! A crucial step for generalist agents. What started out as an open reproduction of GATO with @QGallouedec, @ClementRomac and myself has evolved into a far greater project.
We have just released the ✨NuminaMath datasets: the largest collection of ~1M math competition problem-solution pairs, ranging in difficulty from junior challenge to Math Olympiad preselection.
These datasets were used to win the 1st Progress Prize of the AI Math Olympiad and consist of two subsets:
⛓️ Chain of Thought (CoT): 860k problem-solution pairs templated with CoT to enhance mathematical reasoning in natural language
🛠️ Tool-integrated reasoning (TIR): 73k synthetic solutions derived from GPT-4 with code-execution feedback to decompose hard problems into simpler subproblems that can be solved with Python
Models trained on NuminaMath achieve best-in-class performance among open weight models and approach or surpass proprietary models on math competition benchmarks 🔥
Our datasets and models can be found on the 🤗 Hub: huggingface.co/collections/A…
When I am not busy Aligning LLMs, I spend my free time developing Godot RL Agents, an RL library for the #Godot game engine. Today we released version 0.7.0 with a number of new features, bugfixes and examples. Thanks to all the contributors for creating cool example envs:
Announcing the release of Sample Factory 2.0. A lightning fast production grade Deep RL library.
Sample Factory 2.0 is a collaboration between @petrenko_ai from @uscresl and 🤗 @huggingface.
👉 github.com/alex-petrenko/sam…
Find out more on this 🧵
We have a new leader on the Open LLM leaderboard.
Congrats to ausboss/llama-30b-supercot!
They combined chain-of-thought datasets, code explanations and instructions, snippets, logical deductions and Alpaca GPT-4 prompts.
Check it out here: huggingface.co/spaces/Huggin…
Godot RL Agents v0.4.0 has been released.
👉github.com/edbeeching/godot_…
This includes:
‣ Godot 4 support
‣ 3 RL frameworks: Sample Factory, Stable Baselines 3 and rllib
‣ 2 Advanced Racing and FPS environments
‣ Updated docs (still WIP🙂)
Find out more in this thread 🧵
Thanks to the community for their feedback on DPO vs. IPO vs. KTO. In particular, we thank the authors of IPO, who have worked with us this week to improve TRL's IPO implementation. IPO now is comparable to DPO! Check out the updated blogpost.
huggingface.co/blog/pref-tun…
Today we demonstrate how the performance of Llama 1B can be scaled to outperform Llama 8B with tree search, guided by a Process Reward Model.
During our efforts to replicate DeepMind's Test Time Compute paper, we found that beam search resulted in poor diversity when n>16.
👇🧵
We have added Online Direct Preference Optimization to TRL. We observe that online methods, while slower to optimize, outperform their offline counterparts at various model scales.
After many requests, v0.1 of the LLM alignment handbook is now available.
We've worked hard to make this as accessible as possible, so you can run:
🏋️♂️ Full fine-tuning with @MSFTDeepSpeed ZeRO-3 on A100s
🐭 LoRA or QLoRA fine-tuning on consumer GPUs
Code: github.com/huggingface/align…
It seems like every week there is a new LLM or chatbot being released. In order to keep track of the progress of the open-source community, I created the🤗open LLM leaderboard. It benchmarks against 4 key metrics from the @EleutherAI LM Harness.
huggingface.co/spaces/Huggin…
A year ago I created the Open LLM Leaderboard. Now it has over 10,000 likes and is the #2 Space. In the next month it will overtake Stable Diffusion and become the #1 Space on Hugging Face!
The Open LLM leaderboard is now the #2 most liked space ever on @huggingface with 10,000+ likes (huggingface.co/spaces?sort=l…)!
Also, there are now hundreds of leaderboards for tons of different tasks, domains, languages,... on spaces (huggingface.co/spaces?search…)
Very cool to see HF becoming the place to be for AI evaluation!
Over the past few weeks, we've been focused on pushing the boundaries of competitive programming models by reproducing key elements of DeepSeek-R1. Today, we're excited to release 3 open-source artifacts: 🧵
In our latest blog post, we summarize our extensive evaluation of three state of the art alignment algorithms. DPO vs IPO vs KTO.
The results demonstrate a complex interaction between key hyper-parameters, models and datasets.
#RLHF #DPO huggingface.co/blog/pref-tun…
Does your LLM know what a pizza looks like? You need a Vision Language Model. Here at @huggingface we have just added VLM finetuning support to TRL's SFTTrainer.
I've added a Fall Guys style environment to Godot RL Agents. The agent learned its behavior fairly quickly, 20 minutes / 2M steps of PPO and default hyperparameters.
Check out the library and more examples here:
github.com/edbeeching/godot_…
The amazing assets are from @KayLousberg
One of my contributions was the Tree of Thoughts algorithm that interleaved generation with code execution and correction. The constraints of running on Kaggle required an optimized and elegant solution to scale up to majority voting with 48 candidate solutions per problem.
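The majority-voting step mentioned above can be sketched roughly as follows. This is only an illustration, not the competition code: the function name `majority_vote` is hypothetical, and it assumes each candidate solution has already been reduced to a single integer answer, with `None` standing in for candidates whose generated code failed.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most common answer among the candidates.

    `None` entries (assumed here to mark failed candidates) are ignored.
    Returns None when no candidate produced an answer.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common orders equal counts by first occurrence,
    # so the earliest candidate wins a tie.
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    # Four hypothetical candidates for one problem; one of them failed.
    print(majority_vote([42, None, 42, 7]))
```

With 48 candidates per problem, the same function applies unchanged; only the length of `answers` grows.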
Happy to announce that our recent work on augmenting a Deep RL agent with differentiable projective geometry and spatially structured memory is now available on ArXiv.
arxiv.org/abs/2002.02286
To celebrate the release of #GodotEngine 4.0, I have added a tutorial on creating custom Godot RL envs in Godot RL Agents:
github.com/edbeeching/godot_…
The tutorial was created as part of the Hugging Face Deep RL course, check it out to learn about Deep RL!
👉huggingface.co/deep-rl-cours…
@huggingface are proud to have teamed up with Numina and @MistralAI to win the first AI Math Olympiad. We will be sharing the details of our method over the coming weeks. This will include open source models, training code and evaluation pipelines.
Six months ago, we launched Numina to lead open research in AI4Math. Today we are super excited to share that our Numina Math 7B model won the 1st progress prize of the AI Math Olympiad 🔥🔥🔥 kaggle.com/competitions/ai-m…
Imitation Learning support has been added to Godot RL Agents, you can now learn complex behaviours from player demonstrations and then fine-tune with RL. Check out the trained agent (a Neural Network) from our example game.
Tuesday the 3rd of May at 10am CEST I will be defending my PhD thesis "Large Scale Automatic Learning of Autonomous Agent Behavior with Structured Deep Reinforcement Learning". I will be livestreaming the defense, you are all welcome to come watch. piped.video/vHiEB5LDEho
I'm updating my @godotengine Reinforcement Learning library to Godot 4 and adding @huggingface integration.
👉 github.com/edbeeching/godot_…
I am also adding a number of example games, such as this racing game.
Are there any other example games people would like me to add?
Happy to announce that a preprint of our ECCV 2020 spotlight paper "Learning to plan with uncertain topological maps" is now on arXiv: arxiv.org/abs/2007.05270. We approximate a classical path planning algorithm and learn to plan under uncertainty. @chriswolfvision @chroma_inria
We open-source the datasets and codebase for training JAT. We look forward to the community's contribution to this project.
Find out more in our blogpost: huggingface.co/blog/jat
Today marks the 0.5.0 release of an RL interface I develop for the @GodotEngine.
The release adds many new features, including ONNX support to export and run trained agents in Godot without the need for python.
Check it out:
github.com/edbeeching/godot_…
As part of TRL's v0.10.1 release I also added Liger kernel support to TRL's SFT Trainer. It works with DeepSpeed ZeRO-3 out of the box and enables a 4x larger batch size!
Thanks to the amazing open source work from AI researchers @LinkedIn
TRL v0.10.1 is here and it's beefy 💪
🔁 Online DPO by @GoogleDeepMind for aligning better LLMs
🐯 Liger kernel integration from @LinkedIn to supercharge SFT
🖼️ DPO for VLMs: 🌋 LLaVa, ✨ PaliGemma, 🐶 Idefics2
👩⚖️ Use LLMs as a judge to compute win rates during training
🔍 Anchored Preference Optimization by @ContextualAI for fine-grained human/AI feedback
github.com/huggingface/trl/r…
We have published a blog post with more details on how we trained and deployed our AIMO-winning model.
Find out more about the Self-Consistency with Tool-Integrated Reasoning decoding algorithm (SC-TIR) that I implemented for the winning pipeline.
huggingface.co/blog/winning-…
So I developed a novel method called Diverse Verifier Tree Search, which outperforms beam search at large n.
We take a deep dive into the details in our blog post:
huggingface.co/spaces/Huggin…
For SFT we used UltraChat, which consists of ~1.6M dialogues generated by gpt-3.5
We originally trained on all the data, but found the resulting model had an annoying personality 😅. So we filtered this down to ~200k examples that focused on helpfulness
huggingface.co/datasets/stin…
We have just released an Imitation Learning tutorial for Godot RL Agents as part of Hugging Face's Deep RL class. Learn how to train an agent to solve this complex RL environment.
huggingface.co/learn/deep-rl…
Our OlympicCoder-32B model achieves top-tier performance, surpassing all open-weight models we tested—even some 100x larger!
Learn more about how we built the dataset, benchmark, and models:
huggingface.co/blog/open-r1/…
We also wanted to share the winning recipe with the community, so we have released the training code for those who want to take a deeper dive into LLMs for Mathematics!
github.com/project-numina/ai…
Tomorrow, February 8 at 11 AM Pacific Time (8PM CET) we will be presenting a workshop on aligning LLMs with DPO.
We will discuss the theory behind it and get hands-on with the Hugging Face Transformer Reinforcement Learning (TRL) library.
Register now: eventbrite.com/e/aligning-ll…
- CodeForces-CoTs – A dataset of 100k competitive programming samples in C++ and Python.
- The IOI Benchmark – A new set featuring 2024 International Olympiad in Informatics problems.
- OlympicCoder Models (7B & 32B) – Fine-tuned models that outperform closed-source models
First introduced in a paper by @ShawnGuo13 at @GoogleDeepMind , Online DPO is a new alignment method to boost the performance of LLMs.
The integration is the result of a fantastic collaboration between @ShawnGuo13 , @mnoukhov, @vwxyzjn , @QGallouedec, @_lewtun and myself.
@huggingface has released StarChat2, a programming assistant based on BigCode's StarCoder2.
We used a variant of the Zephyr recipe to add chat to the strong math and code capabilities of StarCoder2.
Demo: huggingface.co/spaces/Huggin…
Training code: github.com/huggingface/align…
Zephyr is a mistral-7b finetune that outperforms llama2-70b on MT Bench and is the highest performing 7b model on the Open LLM Leaderboard. We used a combination of instruction fine-tuning and Direct Preference Optimization on publicly available datasets.
For DPO we used UltraFeedback, which contains 64k prompts and completions spanning a wide range of open and closed access models.
Each completion is ranked by GPT-4 according to criteria like helpfulness, and given a score to derive AI preferences from.
hf.co/datasets/openbmb/Ultra…
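One way to turn scored completions into preference pairs is sketched below. It is an assumption-laden illustration, not the exact recipe: the field names `text` and `score` and the function `to_preference_pair` are hypothetical, and the rule used here (highest-scored completion as "chosen", a random other completion as "rejected") is just one plausible choice.

```python
import random
from typing import Dict, List, Tuple


def to_preference_pair(completions: List[Dict], rng: random.Random) -> Tuple[str, str]:
    """Build one (chosen, rejected) pair from the scored completions of a prompt.

    Each completion is a dict with hypothetical keys "text" and "score".
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions to form a pair")
    ranked = sorted(completions, key=lambda c: c["score"], reverse=True)
    chosen = ranked[0]                 # best-scored completion
    rejected = rng.choice(ranked[1:])  # any of the remaining completions
    return chosen["text"], rejected["text"]


if __name__ == "__main__":
    example = [
        {"text": "answer A", "score": 9.0},
        {"text": "answer B", "score": 6.5},
        {"text": "answer C", "score": 4.0},
    ]
    print(to_preference_pair(example, random.Random(0)))
```

Sampling the rejected completion at random, rather than always taking the worst one, keeps some harder comparisons in the data; whether that helps is an empirical question.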
For evaluations we used the excellent MT Bench from @lmsysorg
This multi-turn benchmark evaluates chatbot capabilities across various domains like creative writing, code and math.
It provides a much higher signal on chatbot perf than other leaderboards
huggingface.co/spaces/lmsys/…
@_lewtun and I used this data for two-stage fine-tuning. For the competition, we released a 7B model. We wanted to see how our recipe scales, so today we are releasing a 72B model with comparable performance to GPT-o when evaluated with Tool Integrated Reasoning.
huggingface.co/AI-MO/NuminaM…
We've just released Godot RL Agents version 0.8.0,
which adds Imitation Learning support and multi-policy training.
Thanks to all the contributors, find out more details in the release notes:
github.com/edbeeching/godot_…
Preference Alignment for Multimodal models is now supported in TRL, amazing work by @QGallouedec and the team at @huggingface ! What algorithm should we implement next?
🤔 Can we train a VLM to 𝐩𝐫𝐞𝐟𝐞𝐫?
This is now possible, thanks to the new TRL/DPO support for VLMs! 🎉
As an example, we've trained a model to reduce hallucinations.
Check out:
📰 Blog post: huggingface.co/blog/dpo_vlm
🐙 TRL: github.com/huggingface/trl
Thanks to @mervenoyann, @vwxyzjn and @krasul who helped me with this work!
The agent's observations are raycasts rather than pixels, so during training I do not render any cameras and run headless. You can also accelerate the physics, which gives a nice speedup.
A broad range of things: implementing new models in the transformers library, reading papers, working on open-source projects such as my Deep RL interface for the Godot Game Engine, building environments for Embodied AI, and sharing expertise with the rest of the team here at 🤗
Hi, I had a look at your GitHub and it should be fairly easy to integrate the model in the transformers library and host the model checkpoints on the🤗Hub. The dataset license is indeed restrictive, we are looking into this. I will send you an email about the model integration.
How to automatically discover objects and affordances from reward through projective egocentric memory: @edwardbeeching's paper has been accepted to ECML-PKDD 2020 (with Jilles Dibangoye, Olivier Simonin and yours truly). @chroma_inria @LIRISLyon @citi_lab
Thanks for highlighting this powerful @huggingface Datasets feature, although I don't think you can infer that 99.9% of convs are single turn, just that 99.9% of convs have between 2-8 turns.
Anonymous comment from a colleague earlier:
"Ok DeepSpeed has defeated me for another day. Will revisit tomorrow."
I have a love-hate relationship with DeepSpeed, when it works it is magical, but it can be quite frustrating to debug when it doesn't work out of the box.
You can submit your own models for evaluation at the bottom of the leaderboard and they will be queued and run automatically on spare nodes on the 🤗 research cluster! You can even submit delta weights for non-commercial models such as llama.
A reminder:
* DPO: casts the RLHF objective via a loss based on a prompt and its positive and negative completions
* IPO: replaces DPO's sigmoid, which can potentially cause overfitting, with an identity function
* KTO: rather than a +ve / -ve pair, takes unpaired good and bad examples
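To make the difference between the first two losses concrete, here is a minimal per-example sketch in plain Python. It assumes the log-ratios log(pi_policy / pi_ref) of the chosen and rejected completions have already been computed; the function names are illustrative and are not TRL's API. KTO is left out because its unpaired formulation needs more machinery than fits in a short example.

```python
import math


def dpo_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """DPO: -log(sigmoid(beta * margin)), margin = chosen - rejected log-ratio."""
    x = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    margin = chosen_logratio - rejected_logratio
    return (margin - 1.0 / (2.0 * beta)) ** 2


if __name__ == "__main__":
    # Hypothetical log-ratios for one preference pair.
    print(dpo_loss(1.2, -0.3, beta=0.1))
    print(ipo_loss(1.2, -0.3, beta=0.1))
```

The DPO loss keeps decreasing as the margin grows, while the IPO loss is minimized at a finite margin of 1 / (2 * beta), which is one way to see why beta behaves so differently across the two algorithms.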
The next version of the library will add a long-requested feature: imitation learning support, which should allow the learning of far more complex behaviors!
While the observations about each algorithm remain the same with OpenHermes, that is, the ranking is DPO > KTO > IPO, the sweet spot for beta varies wildly with each algorithm: the best choice of beta for DPO, KTO, and IPO is 0.6, 0.3 and 0.01, respectively.
The observation is a small 12x12 cone of raycasts, a normalized vector pointing to the goal and a 4D 1-hot indicating which of the 4 levels the agent is on. Reward is the improvement in best distance to goal.
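A reward of the "improvement in best distance to goal" kind can be sketched as below. This is a hedged Python illustration rather than the environment's actual code (which lives in the Godot project); the class name `BestDistanceReward` and its methods are hypothetical.

```python
class BestDistanceReward:
    """Pay out only when the agent gets closer to the goal than ever before."""

    def __init__(self, initial_distance: float) -> None:
        # Best (smallest) distance to the goal seen so far in this episode.
        self.best = initial_distance

    def step(self, distance: float) -> float:
        """Return the improvement over the best distance, or 0.0 if there is none."""
        improvement = max(self.best - distance, 0.0)
        self.best = min(self.best, distance)
        return improvement


if __name__ == "__main__":
    reward = BestDistanceReward(initial_distance=10.0)
    for d in (8.0, 9.0, 7.5):
        print(reward.step(d))
```

Because moving away and coming back earns nothing, the agent cannot farm reward by oscillating around the same spot.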
Sample Factory achieves high throughput, training at hundreds of thousands of interactions per second 🔥
It includes a number of advanced features:
🟢 Multi-agent training
🟢 Self-play
🟢 Multi-GPU population-based training
🟢 Support for vectorized and GPU accelerated envs
The new analytics tab in Hub orgs is very cool and we can see that the H4 models have been downloaded ~10M times, driven mostly by Zephyr / StarChat
Funnily enough, I thought the huge spike was from Zephyr, but it's actually from StarChat ... perhaps someone accidentally put it in their CI pipeline 😅
I totally agree, running the experiments for the post left me with more questions than answers. I think we may have a more extensive follow-up where we evaluate on some other benchmarks such as Alpaca eval.