Lead agent research @Meta MSL TBD Lab. Previously post-training/agent research @OpenAI. CS PhD @LTIatCMU

San Francisco, CA
Excited to share Muse Spark, the first model from the whole team’s work in MSL! 🚀 It’s natively multimodal and agentic. I’ve been using it for my daily coding and research tasks. Still plenty of room to improve in agentic domains, but we’re moving with great velocity. It’s a seriously good model! Check out the full breakdown and try it out at meta.ai
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
8
26
205
22,417
I successfully defended my PhD thesis today! 🎉 "Scalable Alignment of Large Language Models Towards Truth-Seeking, Complex Reasoning, and Human Values" Slides (Fact-RLHF, Lean-STaR, Easy-to-Hard Generalization, Self-Align, Instructable Reward Model): docs.google.com/presentation… A huge thank you to my thesis committee and all attendees for their valuable feedback and support! ❤️ @wellecks @lileics @denny_zhou & Yiming
90
55
1,237
98,483
Excited to finally share what I’ve been working on since joining OpenAI last June! The goal of deep-research is to enable reasoning models with tools to tackle long-horizon tasks in the real world and discover new knowledge. It’s a highly autonomous agent—hand it a hard problem, grab a coffee, and come back to a well-researched solution in 10–30 minutes. Trained end-to-end with reinforcement learning in a tool-enabled environment, deep-research is built to seek truth and understand the universe. A key milestone is its performance on humanity’s "last exam," demonstrating the true power of an end-to-end trained agent. 2025 is the year of agents. Looking forward to what’s ahead! openai.com/index/introducing…
71
78
980
165,390
We’re releasing BrowseComp, which stands for Browsing Competition. 🏎️ Think of it like coding or math competitions — while these contests may not perfectly reflect real-world SWE or mathematical research, they do capture a spark of intelligence. This is THE benchmark we should care about when evaluating the intelligence of deep research-like browsing agents.
We’re open-sourcing BrowseComp (“Browsing Competition”), a new, challenging benchmark designed to test how well AI agents can browse the internet to find hard-to-locate information. It’s like an online scavenger hunt…but for browsing agents. openai.com/index/browsecomp/
30
75
913
473,431
Bad take (opinions are my own)
americans sure love giving their data away to the CCP in exchange for free stuff
Community note
DeepSeek can be run locally without an internet connection, unlike OpenAI's models. github.com/deepseek-a
28
36
826
134,408
Excited to share that I recently joined the MSL team! Building personal superintelligence is serious and fun here. Join us!
After a great time at OpenAI, we (@EdwardSun0909, @_jasonwei) recently joined @Meta Superintelligence Labs. The first month has already been so much fun building from a clean slate with a truly talent-dense team! Very excited about the compute and long term focus of the new lab
59
24
846
269,524
honored to have contributed to o3😎
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4
22
18
772
101,081
You can just do things 🖱️
34
18
674
90,322
Challenge accepted — 2025 will be our best year yet!
common themes: AGI agents much better 4o upgrade much better memory longer context “grown up mode” deep research feature better sora more personalization (interestingly, many great updates we have coming were mentioned not at all or very little!)
14
12
648
135,195
We’re rolling out Deep Research to Plus users today! Deep Research was the biggest “Feel The AGI” moment I’ve ever had since ChatGPT. And I’m glad more people will experience their first AGI moment! The team also worked super hard to make more tools including image citations / python / user files etc available to the model in this launch!
Replying to @OpenAI
We're also sharing the system card, detailing how we built deep research, assessed its capabilities and risks, and improved safety. openai.com/index/deep-resear…
25
29
480
49,265
just tried and the agent solved level 1 in its own browser lol. thanks for creating the benchmark!
Replying to @arcprize
o3 (left) and Grok 4 (right) replays below. Spoiler: neither completes a single level
18
26
458
106,451
Excited to share the agent with the world! It’s a good agent!
ChatGPT can now do work for you using its own computer. Introducing ChatGPT agent—a unified agentic system combining Operator’s action-taking remote browser, deep research’s web synthesis, and ChatGPT’s conversational strengths.
33
17
425
78,834
I heard reinforcement learning only works with verifiable rewards? 😛 Congrats!!
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
9
13
407
33,608
I don’t often tweet on technical topics but I may have an opposite opinion here…
10
8
376
90,344
How can LLMs such as GPT-3 and ChatGPT achieve greater factual accuracy without relying on an external retrieval search engine? Our #ICLR2023 paper shows that recitation can help - like humans! Recitation-Augmented Language Models arxiv.org/abs/2210.01296 1/N
11
84
364
59,356
Our research on easy-to-hard generalization will be supported by the OpenAI Superalignment Fast Grant. Congratulations to the team and stay tuned!😎
🌟Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision 🌟 arxiv.org/abs/2403.09472 How can we keep improving AI systems when their capabilities surpass those of human supervisors? (1/n)
10
12
342
53,780
Memory is the next scaling laws paradigm shift
Starting today, memory in ChatGPT can now reference all of your past chats to provide more personalized responses, drawing on your preferences and interests to make it even more helpful for writing, getting advice, learning, and beyond.
17
17
317
26,840
🚀 Can RLAIF fully replace RLHF to align language models from scratch, enhancing both their alignment and capabilities? SALMON introduces a principle-following reward model in the realm of self-alignment, using just 6 ICL exemplars and 31 principles to outperform LLaMA-2-Chat!
5
90
295
100,070
the real agi competition is between vllm and sglang
13
12
273
32,998
🌟Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision 🌟 arxiv.org/abs/2403.09472 How can we keep improving AI systems when their capabilities surpass those of human supervisors? (1/n)
7
58
256
106,753
All I see is @GaryMarcus saying “Deep Research is genuinely useful” 🙂
Deep Research is genuinely useful - depending on your application - but crucially (as anticipated by Rebooting AI in 2019, and by @yudapearl) facts and temporal reasoning remain problematic for current neural network-based approaches that lean heavily on statistics rather than deep understanding.
11
9
211
67,798
gpt-5-reasoning is a good model 🫡
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘high’ and ‘minimal’ options and substantial differences in intelligence We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, and end-to-end latency. What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient). Takeaways from our independent benchmarks: ⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost. 🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1. 💬 Token usage varies 23x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens which is substantially less than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence. 
📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark. 🤖 Agentic Capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities. 📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website which supports running the same prompt across models and comparing results. It’s free to use, we’ll provide an update with our perspective shortly but feel free to share your own! See below for further analysis:
8
5
205
21,382
deep research mini is here 🔭 share your feedback with us!
Replying to @OpenAI
The lightweight version of deep research is powered by a version of OpenAI o4-mini and is nearly as intelligent as the deep research people already know and love, while being significantly cheaper to serve. Responses will typically be shorter while maintaining the depth and quality you’ve come to expect. Once limits for the original version of deep research are reached, queries automatically default to the lightweight version.
10
5
160
11,781
🔭 Understand the Universe.
8
6
161
18,775
⭐Self-Play Preference Optimization for Language Model Alignment⭐ arxiv.org/abs/2405.00675 Bradley-Terry models in RLHF fall short in capturing the intransitivity and irrationality in human preferences. How can we identify the Nash equilibrium policy with general preferences?🧵
1
42
152
26,993
🔭🔭🔭
Deep Research Live from Tokyo 4pm PT / 9am JST Stay tuned for link to livestream.
6
3
133
21,034
OK we need a good benchmark for all Deep Research-like products to quantitatively tell who’s the deepest researcher
🔭 Understand the Universe in less than a min. Grok 3 DeepSearch
12
7
137
15,889
GPT-4.5 surely memorizes lots of knowledge in its weights :)
4
2
123
11,694
Beating Alpaca or Davinci003 with only 1k samples is indeed impressive! Personally, I find myself in alignment with their Superficial Alignment Hypothesis, as in Self-Align, we have shown that a mere set of 16 rules is sufficient to outperform Alpaca or Davinci003!
LIMA: Less Is More for Alignment LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback paper page: huggingface.co/papers/2305.1…
2
15
116
33,429
📢 One detail we didn't spotlight earlier: Dromedary-2 might just be the world's best open-source, non-distilled LLM for commercial use! 🌍🚀 Here's a comparison with other baselines from the Humpback paper. Dromedary-2 notably pushes the boundaries on info extraction and math!
🚀 Can RLAIF fully replace RLHF to align language models from scratch, enhancing both their alignment and capabilities? SALMON introduces a principle-following reward model in the realm of self-alignment, using just 6 ICL exemplars and 31 principles to outperform LLaMA-2-Chat!
1
21
120
39,191
Absolutely thrilled to be a recipient of the 2023 Google PhD Fellowship! Deep gratitude to my advisors/mentors Yiming, Xuezhi, @denny_zhou , and all my dedicated collaborators. Also, thanks for the generous support from @GoogleAI @Google.
In 2009, Google created the PhD Fellowship Program to recognize and support outstanding graduate students pursuing exceptional research in computer science and related fields. Today, we congratulate the recipients of the 2023 Google PhD Fellowship! goo.gle/3PYfLXl
12
3
120
18,205
I'm usually skeptical when people say DPO achieves similar results as PPO, especially as DPO models often stem from GPT-4, making it more like knowledge distillation. But now, my favorite project, alpacafarm, has just confirmed this w/o kd! Wow, definitely something real here!😱
4
18
110
23,297
One model, all tools
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.
2
106
6,619
We developed Dromedary, a self-aligned AI agent with minimal human supervision!
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision abs: arxiv.org/abs/2305.03047 paper page: huggingface.co/papers/2305.0… project page: mitibmdemos.draco.res.ibm.co…
5
14
95
22,449
Our paper on ✨ Self-Aligning Language Models via RLAIF ✨ has been officially accepted at @iclr_conf 2024! We're thrilled to share our insights in Vienna. Stay tuned for self-aligning advancements in LLMs. #ICLR2024 See you there! 🌍🚀
🚀 Can RLAIF fully replace RLHF to align language models from scratch, enhancing both their alignment and capabilities? SALMON introduces a principle-following reward model in the realm of self-alignment, using just 6 ICL exemplars and 31 principles to outperform LLaMA-2-Chat!
1
17
95
13,382
Deep research in your own repos!
You can now connect GitHub repos to deep research in ChatGPT. 🐙 Ask a question and the deep research agent will read and search the repo’s source code and PRs, returning a detailed report with citations. Hit deep research → GitHub to get started.
2
88
6,793
Excited to present with @isafulf tonight at the OpenAI Forum, introducing the research behind Deep Research! Join us at 6pm PT to explore how this new agentic capability in ChatGPT works. Register here:
1
5
83
12,718
high-taste testers yield high-taste takes
7
2
71
7,891
"A Re-evaluation of Knowledge Graph Completion Methods" accepted to ACL 2020 #acl2020nlp . We performed an extensive re-examination study of recent neural network based KGC techniques. arxiv.org/abs/1911.03903 Joint work with @svjan5 , @ssanyal8 , @partha_p_t , and Yiming Yang
3
13
60
Pretty sure we just dropped a benchmark for deep research agents 😬 openai.com/index/browsecomp/ Need a hand over here?
Today we’re launching Research, alongside a new Google Workspace integration. Claude now brings together information from your work and the web.
3
61
7,908
Replying to @dhruv31415
I didn’t realize Aidan just unfollowed me for this 😅 I asked chatgpt to polish my wording
1
53
9,549
🩵🩵🩵
congrats to the team, especially @isafulf and @EdwardSun0909, for building an incredible product. my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.
2
50
9,057
Excited to share the final work from my PhD: Easy-to-Hard Generalization at NeurIPS! Join me at my poster on Friday—happy to chat about reasoning, scalable alignment, and more. Bonus: we’ll also have an oral presentation at the MATH-AI workshop on the inference scaling law!
Easy-to-Hard Generalization was accepted to NeurIPS! Congrats to @EdwardSun0909 and @scut_longhui! Check out the updated camera-ready version here: openreview.net/pdf?id=qwgfh2…
4
52
26,751
Many people asked me about SELF-ALIGN vs. Constitutional AI (CAI). In short: CAI is self-critique: input ➡️ output ➡️ one rule ➡️ refined output SELF-ALIGN: input ➡️ self-chosen rules ➡️ output Thus, we're limited to 16 rules in our prompt, whereas CAI can have up to 58+ rules.
How does a language model decide which questions it will engage with and which it deems inappropriate? We use Constitutional AI to more directly encode values into our language models.
4
6
50
11,158
Open AI🫡
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: openai.com/open-model-feedba… we are excited to make this a very, very good model! __ we are planning to release our first open-weight language model since GPT-2. we’ve been thinking about this for a long time but other priorities took precedence. now it feels important to do. before release, we will evaluate this model according to our preparedness framework, like we would for any other model. and we will do extra work given that we know this model will be modified post-release. we still have some decisions to make, so we are hosting developer events to gather feedback and later play with early prototypes. we’ll start in SF in a couple of weeks followed by sessions in europe and APAC. if you are interested in joining, please sign up at the link above. we’re excited to see what developers build and how large companies and governments use it where they prefer to run a model themselves.
42
4,874
Another tip: it generates a real pptx file. So you can download the artifact, open it in microsoft powerpoint app, and apply the design you want to all of them!
tip for chatgpt agent slides: first ask it to do the research only, then ask it to make the slides!
1
4
47
10,528
Check our new work on improving neural theorem proving by giving LLMs more time to think before each tactic action! I think this is an important step towards fully exploiting LLMs’ reasoning power & agentic ability in formal mathematics.💪
How can informal reasoning improve formal theorem proving? New paper: "Lean-STaR: Learning to Interleave Thinking and Proving" arxiv.org/abs/2407.10040 We introduce a framework for learning to interleave informal thoughts with steps of formal proving. 46.3% on miniF2F 🔥
2
8
44
6,936
🐐🐐🐐@xinw_ai @michiyasunaga @ren_hongyu
Replying to @sama
Sam this performance is crazy
2
43
9,168
Feel the largest model vibe 🚀
GPT-4.5 has entered the Chat. openai.com/live/
1
1
35
5,740
En route to #ACL2023!🤟 No submissions from me this time, but I'm all set for exciting poster chats and casual networking! (all thanks to LTI financial support) Feel free to DM me if you’d like to chat about Self-Alignment (Scalable Oversight) of LLMs
1
37
4,517
En route to #ICML2023 in Hawaii 🏝️! This time I’ll present a main conference paper on neural PDE solving and a workshop paper on neural combinatorial optimization solving. I’m also happy to share my thoughts on (my recent research on) LLM Alignment. Feel free to DM me!
2
1
34
5,870
no we go back to 1 in 2024
1
33
7,780
I blame @MistralAI for being the first to make this kind of confusing diagram 😅
Welcome Gemma 3, our new open-weight LLM from @GoogleDeepMind. All sizes (1B, 4B, 12B and 27B) excel on benchmarks, but the key result may be the 27B reaching 1338 on LMSYS. For this, we scaled post-training, with our novel distillation, RL and merging strategies. Happy building!
4
28
7,108
what
Mitigating racial bias from LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference @NeurIPSConf We have ethical reviews for authors, but missed it for invited speakers? 😡
1
27
3,779
🫡🫡🫡
Replying to @BLCNYY @OpenAI
we are working to make it much more efficient and then will offer much higher usage limits. in the meantime, please send me your email address
2
26
6,743
Check out our winning formula in AIMO with only $1000 budget😎 Amazing work by @WYZ0402 in only 1 post-NeurIPS month💪
🔥Our CMU-MATH team proudly clinched 2nd place in the Artificial Intelligence Mathematical Olympiad (AIMO) out of 1,161 participating teams with the best performance of an academic team! Dive into our blog to discover our winning formula: blog.ml.cmu.edu/2024/07/29/c…
1
3
28
5,636
Congrats!!
this is ready now for the world
1
25
3,330
Replying to @shengjia_zhao
Congrats!
2
20
8,505
Great work!!
Just launched ChatGPT Agent (sorry GPT-5 waiters, it is coming!), the most capable AI agent model to date! It has been such an honor to be part of a crazy sprint to get this amazing model trained and shipped together with an absolute gem of a team (@isafulf , @caseychu9 , @EdwardSun0909 , @josh_tobin_ , Yash Kumar and many more)! I am so proud of this project, so I want to share some highlights, personal takes and lessons learned while working on it: 1. Used for research 📕 + actions 💻 + slides generation: Deep Research can do research. Operator can take actions for you. ChatGPT Agent can do both at the same time! E.g. you can ask it to make a plan for a trip to Hawaii, find good deals on hotels and flights, and book them on your behalf using its own computer! It can also generate slides! 2. Power of end-to-end RL: How do we build it? You guessed it right! It is us, @OpenAI RL diehards. You are probably tired of hearing about RL scaling. Me, too. But when I feel its power first-hand, its effectiveness and data efficiency still shock me and feel like magic 🪄. 3. First OpenAI model of high biorisk 💀: Not sure this is something I should be proud of or not :) For an ex-AI bio PhD researcher like me, this is something a bit personal. On one hand, many of my biomedicine researcher friends tell me that AI agents have significantly helped with their research. On the other hand, such a capable model can amplify the risk of malicious actors building bioweapons. Our safety team has done incredible work to mitigate the risks. 4. Collaboration with users 👪 is core: We want our AI to augment and enhance humans, not to replace them, so we work hard to make the model good at collaborating with the user. You can type a message at any time to interrupt it and steer it in new directions. The model will always confirm with you before taking actions like buying things for you or deleting a file on your Google Drive. 
And the model will ask clarification questions only when it needs more clarity from you! 5. How to generate good slides: As in other cases, writing a well-specified prompt always helps! Also try first telling it to generate a report, then convert the report into slides! 6. Real-world performance > benchmark chasing: One thing people outside may not know about us is how little attention we pay to external benchmarks during the model dev process. We do not focus on hill-climbing on them, and we do not care that much about how we end up on the leaderboard. That said, as a byproduct of our pursuit of great real-world performance and true intelligence, ChatGPT Agent does crush many benchmarks! Wanna learn more? Read our blog linked at the end! In the end, I want to shout out to my amazing team again. These extremely talented and kind people are the reason why OpenAI is constantly making magic like this! ❤️ Also please try ChatGPT Agent and give us feedback! You can reply here in the thread or my DM is open. This is just the start. We will continue working hard towards more and more capable super-human AI agents! 🤖 openai.com/index/introducing…
23
3,956
I'm excited to announce "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices" accepted to ACL 2020. Joint work with researchers from @LTIatCMU @GoogleAI Paper: arxiv.org/abs/2004.02984 Code & Pretrained Model: github.com/google-research/g… [1/4]
1
3
23
The new era of post-training has arrived. Join us!
Yall heard it from the man himself
23
4,885
🍓🍓🍓
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math. openai.com/index/introducing…
2
23
3,237
How can we steer a model’s behavior in a scalable manner, so that we can control how these models behave? Perhaps try Principle-Driven (Self-)Alignment! We have a series of works such as Self-Align & SALMON covering context distillation & RLAIF🤩🤩🤩
To deepen the public conversation about how AI models should behave, we’re sharing our Model Spec — our approach to shaping desired model behavior. openai.com/index/introducing…
1
5
22
5,909
How to technically realize @OpenAI Model Spec based on any set of human-defined principles? Discover why an Instructable Reward Model is all you need at our SALMON poster session #ICLR tmr, presented by the brilliant @QinhongZhou. 📅 Thurs, May 9, 10:45 AM CEST 📍 Halle B #7
🚀 Can RLAIF fully replace RLHF to align language models from scratch, enhancing both their alignment and capabilities? SALMON introduces a principle-following reward model in the realm of self-alignment, using just 6 ICL exemplars and 31 principles to outperform LLaMA-2-Chat!
3
20
4,183
Check our new work!
Active LLM Retrieval Augmented Generation -Iteratively uses a prediction of upcoming sentence to anticipate future content which is used as query to retrieve relevant docs to regenerate sentence -On 4 long-form generation tasks: superior / competitive arxiv.org/abs/2305.06983
1
22
5,558
Very beautiful scaling plot😍
Replying to @wellecks
We study common inference strategies (e.g., majority voting, MCTS) and a new tree search, along with various model sizes. First, we hold the inference strategy fixed, and find that across different model sizes, smaller models typically have better accuracy-cost tradeoffs
1
2
21
5,565
🥹
Replying to @OpenAIDevs
o3-deep-research: platform.openai.com/docs/mod… o4-mini-deep-research: platform.openai.com/docs/mod… These models are the same post-trained o3 and o4-mini models that power deep research in ChatGPT. They also support MCP (search/fetch) and Code Interpreter.
20
3,583
🍳
2025 is the year of agents.
1
20
4,190
We propose a new paradigm called RECITation-augmented gEneration (RECITE) that helps Large Language Models (LLMs) generate accurate factual knowledge by reciting relevant passages from their own memory before producing final answers. 2/N
1
20
1,338
Combinatorial optimization (CO) problems are essential in many fields like operation research / software engineering / algorithm theory. We introduce a new paradigm to tackle CO problems with diffusion models. Accepted at NeurIPS as Spotlight! arxiv.org/abs/2302.08224 w/ Yiming
1
1
20
1,583
Replying to @clu_cheng
what kind of tears 😏
1
1
18
3,732
If I were given one hour to build an #AGI system, I would spend 59 minutes defining the principles it should follow, and one minute clicking the training button 🔥🔥🔥🔥 github.com/IBM/SALMON/blob/m…
“If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it.” — Albert Einstein
4
19
15,413
The code and model weights will be released soon. We found there is a community implementation of the hard prob version of SPPO in TRL. We have submitted a PR to fix some bugs: github.com/huggingface/trl/p… Please note that the iterative SPPO results in our paper use soft probs.
1
5
19
5,850
Replying to @lilianweng
👀 might be related:
🚀 Can RLAIF fully replace RLHF to align language models from scratch, enhancing both their alignment and capabilities? SALMON introduces a principle-following reward model in the realm of self-alignment, using just 6 ICL exemplars and 31 principles to outperform LLaMA-2-Chat!
1
3
17
2,358
Totally feel you🤪. You might find our new preprint interesting – it's about aligning LLMs from scratch! arxiv.org/abs/2305.03047
Genuine question: What's the (scientific) value of these recent papers that finetune a smaller LM on GPT-4 outputs? It's obviously useful to have a smaller LM that performs ≈ GPT-4 in specific settings. But I don't see the value in packaging into a paper and flooding arXiv.
18
3,601
Replying to @ren_hongyu
♣️🫵🐮!
1
18
6,404
Congrats!
Today, @rhythmrg, @lindensli and I are introducing @appliedcompute. We’re building Specific Intelligence for the enterprise. Achieving SOTA today means specialization in both human and machine talent. We’ve spent the last six months working with companies like @cognition, @DoorDash, and @mercor_ai, unlocking their company knowledge to build custom agent workforces that outperform frontier models at specific tasks. My cofounders and I all worked on different parts of this problem while at OpenAI, from Codex to o1 to the ML systems and infrastructure for RL training. Two-thirds of our team (see below!) are former founders, and everyone brings a deep technical background, from top AI researchers to Math Olympiad winners. We’ve raised $80M from @benchmark, @sequoia, @Lux_Capital, @eladgil, @victoralazarte, and @Casspi18, and we’re hiring across engineering and research.
1
19
9,867
Curious about how to improve non-autoregressive models with conditional random fields? How to deal with extremely large vocabulary size in CRF for machine translation? Come to our poster on Wednesday evening at East Hall #109 at #NeurIPS2019 @NeurIPSConf @zhuohan123 @suzzzylin
1
4
14
Very interesting observation by @AnthropicAI on AIs often producing 'sycophantic' responses to appease users. Curious if the RLAIF could address this. Maybe a new principle under SALMON's RL-Time Preference Intervention could be "Maintain integrity, avoid sycophancy"? 🤔
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
15
2,172
Scaling laws for DPO is still unproven ==> I wonder if the problem is the scaling laws. To me, the primary concern of DPO is that it's only proven effective when we have high-quality (or distilled) demonstrations as positive examples. Similar observation:
The Zephyr-beta model from @huggingface H4 (led by @_lewtun and @edwardbeeching these days) is a great example of engineering practices and know how slowly kicking into gear for RLHF. Some takeaways beyond "high MT Bench and AlpacaEval scores": * DPO can work great for smaller models. This is huge for open-ML as small specialized models are the future. People need to try DPO on specialized application feedback datasets! * Long-term engineering investment can pay off on unexpected timelines. I left a week or two before Zephyr, and we didn't even have it on the plan yet. Finding the right dataset and plugging it into the pipeline can be everything. * Scaling laws for DPO is still unproven. Lots and lots of RLHF experts are skeptical of it for larger models. I think there may need a slight change of the loss function for stability (and maybe sample efficiency is lower), which is why the H4 team found success in multiple Epochs. * MT Bench / Alpaca Eval are pretty saturated. Next year, your model is going to need to get 7+ on MTBench to be considered, but above that may not matter. It's getting normalized, but we need more eval tools still. *AI Feedback is extremely broad. For this model, it was data curation. I expect it to also work for filtering and more. * Releasing both SFT and RL checkpoints with data is great for replication (as the team did). Excited to see where this goes next! Paper: arxiv.org/abs/2310.16944 Artifacts: huggingface.co/collections/H…
14
2,117
Building on the idea that “evaluation is easier than generation”, we find that strong reward models trained on easy data facilitate easy-to-hard generalization via reranking or reinforcement learning. (3/n)
1
13
1,612
🥲🥲🥲
This is what I sent to my colleagues at OpenAI: Hi all, I made the difficult decision to leave OpenAI as an employee, but I’m looking to work closely together as a partner going forward. Contributing to the mission of OpenAI and working with world-class teams to create and improve ChatGPT has been an experience of a lifetime. But I’ve gotten really excited about AI for science. My undergrad was in physics and I’m keen to apply this technology there. Because AI for science is one of the most strategically important areas to OpenAI and achieving ASI, OpenAI is planning to invest in and partner with my new company. So I’ll see you all around! Thanks to all the leadership who believed in me early on, especially, Sam, Greg, and Mark. Thank you everyone on post-training and to all of our collaborators across research and product. I’ll miss working with so many of you, but will be cheering you on! Post-training has an amazing roster of talent and leaders who will continue to drive its success.
13
3,525
no we’ll achieve the real agi and serve it with vllm 😎
2
14
1,835
Check out our active retrieval augmented generation method that actively decides when and what to retrieve!
[1/4] Large language models (LLMs) tend to hallucinate, especially when generating long outputs. We present active retrieval augmented generation, in which an LLM actively decides when and what to retrieve throughout the generation process.
2
13
2,824
Replying to @ibab
The stars will align at 4 pm 🫡
11
791
Nice work! 👏 But when it comes to extreme data-efficiency, I guess our Dromedary takes the lead in 2023! We've achieved an MT-Bench score of 7.37 and an AlpacaEval score of 88.32 with just **6** (no K) SFT samples. The secret sauce lies in our Self-Align and RLAIF. arxiv.org/abs/2310.05910
💡 We release methods and datasets for extremely data-efficient alignment 🚀 6K SFT samples lead to 7.22 MT-Bench score, further DPO with 10K samples achieve 7.55 MT-Bench +90+% AlpacaEval Try our data to align models more efficiently: github.com/hkust-nlp/deita
1
1
13
1,909
Replying to @generatorman_ai
We tried the same self-align prompt (step2 in our paper) on the LLaMA-7b and GPT-NeoX-20B models, but their performance did not match that of the 65b model. So I believe that the principle-driven self-align method only works for models that are powerful enough, though (cont.)
1
13
4,594
Hi Yizhong, you classified the instr.-following datasets into: 1) existing NLP datasets 2) written by humans from scratch 3) generated by proprietary models 4) user-shared. Have you considered "generated by OSS models", like applying Self-Instruct or Self-Align to base LLaMA?
🦙🐪🐫 So many instruction tuning datasets came out recently! How valuable are they, and how far are open models really from proprietary ones like ChatGPT? 🧐We did a systematic exploration, and built Tülu---a suite of LLaMa-tuned models up to 65B! 📜arxiv.org/abs/2306.04751
2
12
3,390
We study this in terms of easy-to-hard generalization. This is a conceptual analogue of superalignment (weak-to-strong generalization). Instead of letting strong models learn from weak teachers, we let models generalize to problems more difficult than those seen during training.
1
1
13
2,359
Check out the new work on aligning Video LMMs with factually-enhanced DPO!🐕 🐕 🐕
[p1] 🐕Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward🐕 Paper link: arxiv.org/pdf/2404.01258.pdf… page: github.com/RifleZhang/LLaVA-… How to effectively train video large multimodal Model (LMM) alignment with preference modeling?
12
2,369
Inspired by Asimov's three laws of robotics, we envision a future where a few general principles can be internalized by AI systems. This aligns with recent advances in self-alignment, aiming for models to improve themselves with minimal human supervision. 2/N
2
2
11
1,912
Replying to @natolambert
Excited to share our recent work: SALMON (Self-ALignMent with principle-fOllowiNg reward models) - minimizing dependency on human annotations for aligning LLM-based AI agents, through principle-following reward models. Would love to be featured! arxiv.org/abs/2310.05910
1
12
4,479
Come check out our paper at #NeurIPS22 now! Poster Session 4-6 pm Booth #128 DIMES: A Differentiable Meta Solver for Combinatorial Optimization Problems bytez.com/read/neurips/54442 #NeurIPS2022 #bytez #friendly-papers via @bytez
2
10
BTW, we are not the first to study the easy-to-hard scenario. Concurrent study by Hase et al. (2024) backs training on easy tasks as a strong baseline for ARC & MMLU. Our work on the harder MATH dataset shows reranking & RL have even better generalization than ICL & SFT. (6/n)
2
1
11
1,232
Interesting work! Have you tried our magic Self-Align prompt?🧐 We also use a kind of ICL, but with an additional explicit principle-following step: Re-align: raw.githubusercontent.com/Re… Self-align: raw.githubusercontent.com/IB…
1
1
10
1,225