Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

Palo Alto, CA
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Super excited to finally release DataComp! There is still a lot we don't understand about Internet-scale datasets. DataComp makes research on datasets more accessible and leads to better training sets. The results so far are very encouraging and there is much more to explore!
Introducing DataComp, a new benchmark for multimodal datasets! We release 12.8B image-text pairs, 300+ experiments and a 1.4B subset that outcompetes compute-matched CLIP runs from OpenAI & LAION 📜 arxiv.org/abs/2304.14108 🖥️ github.com/mlfoundations/dat… 🌐 datacomp.ai
I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.
Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.
Very excited about our new agent benchmark! I think it's a nice way of evaluating how well agents can do complex tasks in terminal (command line) environments.
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr lots of room for improvement! tbench.ai/
Very excited about this! DCLM already led to a great training set for language models, and there is (much) more to understand + more room for improvement here.
I am really excited to introduce DataComp for Language Models (DCLM), our new testbed for controlled dataset experiments aimed at improving language models. 1/x
Very excited about this!
Announcing the Open Thoughts project. We are building the best reasoning datasets out in the open. Building off our work with Stratos, today we are releasing OpenThoughts-114k and OpenThinker-7B.
If you are working on empirical phenomena in deep learning, consider submitting to our ICML workshop "Identifying and Understanding Deep Learning Phenomena" (deep-phenomena.org/). The deadline is May 5, but relevant work that was already published elsewhere is still welcome!
I learned a lot about the nuances of language model scaling laws from this project. Also the checkpoints are available now: huggingface.co/formll/resolv…
🧵1/8 We resolve the discrepancy between the compute-optimal scaling laws of Kaplan et al. (exponent 0.88, Figure 14, left) and Hoffmann et al. (“Chinchilla”, exponent 0.5). Paper: arxiv.org/abs/2406.19146 Data + Code: github.com/formll/resolving-…
Similar to previous DataComp projects, we systematically experiment with every step of the data generation pipeline to build a state-of-the-art training set. Overall we conducted more than 1,000 individual experiments.
More details on openthoughts.ai/blog/ot3, Ryan’s thread below, and the paper itself arxiv.org/abs/2506.04178
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data scales. Full details are in our ✨new paper✨ - below we share the highlights: BTW, it also works on non-Qwen models😉 (1/N)
Replying to @giffmana
Thanks for the kind words, Lucas! I hope we get a chance to work together some day, I'm a big fan of your work. BTW my lab is always looking for good postdocs. Comp is probably worse than OpenAI, but long-time lab members get to go on runs with @Vaishaal's dog Kaya. He's great!
Very nice community progress on open-data reasoning models since the R1 release!
Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, carefully curated datasets with verified R1 annotations produce SoTA reasoning models. Our 32B model outperforms all 32B models including DeepSeek-R1-Distill-Qwen-32B (a closed data model) in MATH500 and GPQA Diamond, and shows similar performance in other benchmarks. (1/n)
Together with the paper we also release our new dataset OpenThoughts3-1.2M and the corresponding model OpenThinker3-7B, which is currently the best open-data 7B reasoning model.
Congrats! Do you know "Benjamen Recht"? He won the 2017 test of time award nips.cc/Conferences/2017/Awa… Could be related?
Replying to @beenwrekt
What about starting the hierarchy of abstractions with "Ray Optics" = "Linear Models" in the ML context? Then we can ask which phenomena in deep learning we cannot explain with linear models. Or how the explanation in linear models differs from what we see in deep nets.
Humans can get 99%+ at least on the 600 object classes. proceedings.mlr.press/v119/s…
Replying to @giffmana @srchvrs
Hi Lucas, you were probably thinking of arxiv.org/abs/2005.09619 . The high-level conclusion that the V2 distribution shift is only a statistical peculiarity is wrong. Engstrom et al. ran their crowdsourcing experiment with a different setup compared to ours.
Cool to see more work on data for AI agents!
Agentic AI will transform every enterprise–but only if agents are trusted experts. The key: Evaluation & tuning on specialized, expert data. I’m excited to announce two new products to support this–@SnorkelAI Evaluate & Expert Data-as-a-Service–along w/ our $100M Series D! --- Snorkel Evaluate is our new data-centric agentic AI evaluation platform for specialized, mission-critical enterprise settings where vibe checks and out-of-the-box metrics driven by simple LLM prompts are not enough. Snorkel Expert Data-as-a-Service is our white glove service for expert-level AI datasets, powering frontier LLM developers in areas like expert knowledge, reasoning, agentic action and tool use, and more! Both built on top of @SnorkelAI’s Data Development Platform, using our programmatic technology to drive higher-quality expert data, faster– for getting specialized AI to real production value. If you’re building enterprise AI and want to partner around the key ingredient in AI today–the data–book a demo and let's talk! snorkel.ai/demo/ Finally, see thread for details on 🧵👇 - 📽️ A walkthrough of Snorkel Evaluate and Expert Data-as-a-Service on an agentic AI enterprise task - 📅 An upcoming event on Enterprise Agentic AI with innovators from @Accenture @BNY @Comcast @Stanford @QBE & others - 📊 An upcoming series of benchmark datasets and model artifact releases 👀 Want early access to the full agentic AI dataset? Retweet this post and we'll send you the link!
Great quote from Jon Kelner (I hope I'm attributing this correctly, has been a while): "The only advantage we have over Gauss is computers."
Ali Rahimi (@alirahimi19) gave an awesome talk for his test of time award with Ben Recht (@beenwrekt) today at NIPS. "Machine Learning has become alchemy" -> let's return to more scientific rigor in ML. Yes!! Go to minute 54 in facebook.com/nipsfoundation/…
The workshop is co-organized with a great set of collaborators: Samy Bengio @kenjihata @aleks_madry @arimorcos @bneyshabur @maithra_raghu @alirahimi0 @HanieSedghi and Ying Xiao.
Replying to @vsmolyakov
Comparisons are good! But for classification problems it's important to look at the test error as well. The following paper has a detailed comparison of various optimization algorithms in deep learning: arxiv.org/abs/1705.08292
Replying to @giffmana @srchvrs
Yup, we still need (much) better ways of relating data distributions. I think OOD is currently just a catch-all phrase for a range of disparate phenomena we need to tease apart. The main way I know for generating truly IID data is randomly splitting a large dataset.
Replying to @srchvrs
If there is evidence for adaptive overfitting on CIFAR-10, I'd be very curious to know. We had a close look and found little to no signs of adaptive overfitting: arxiv.org/abs/1902.10811 (Yadav & Bottou also checked MNIST: arxiv.org/abs/1905.10498).
Finally: the claim that ImageNet models suffer from accuracy drops under distribution shift is valid independently of ImageNetV2, e.g., ObjectNet shows very similar phenomena. You can find more analyses here: arxiv.org/abs/2007.00644
Actually finally: Another result that may be of interest: humans don't see an accuracy drop between ImageNet and ImageNetV2: proceedings.mlr.press/v119/s… . So regardless of the source of bias in ImageNetV2 (crowdsourcing or selection frequencies), models should be able to handle this.
Once you correct for the statistical bias Engstrom et al. analyze on the original ImageNetV2 data, the accuracy drop (11 percentage points for many models) only decreases by 0 - 1 pp. Happy to chat more if you're interested and send you our analysis.
You can find the differences between the MTurk setups in Appendix B.2 of their paper. Note that this is exactly what we described in the ImageNetV2 paper: small differences in the crowdsourcing setup can lead to substantial accuracy changes.
Dear Prof. Tramer, I work for a large technology news outlet. We are planning to write a major piece about your novel breakthrough idea Y and would like to interview you for the article. When can I call you? Please also send us photos of you in front of a whiteboard.
There is a range of vaccines in development that work by directly injecting the spike protein, see the section "Protein-Based Vaccines" in the NYT vaccine tracker: nytimes.com/interactive/2020…
Replying to @giffmana @srchvrs
Yup I agree that the models do much better OOD than what you'd expect for a worst-case shift. Also in-distribution progress translates very nicely to OOD, see arxiv.org/abs/2107.04649
Replying to @CyrusRashtchian
Is it clear that this is a proof of concept? We can train to 100% adversarial training accuracy but fail to generalize to the test set (see arxiv.org/abs/1804.11285 ). Is the network trained on the test set now similarly overfitting and would fail to generalize to new data?
Looking forward to the ImageNet workshop you're co-organizing at NeurIPS! I agree that there are still many interesting questions around ImageNet and related datasets :-)
Replying to @ArmenAgha
Interesting! By how much?
Very much worth watching!
“Where else would you apply a mystery to mission critical systems?” Highly recommend spending an hour to watch James Mickens’ keynote on ML and security: piped.video/ajGX7odA87k
For context, it's worth noting that the CLIP paper has shown really impressive results since we wrote this paragraph. See Section 3.3 in arxiv.org/abs/2103.00020
Replying to @roydanroy @hardmaru
Are there pre-trained models? Just kidding, will have a look. (Also I really enjoyed @dcpage3's blog posts!)
Reminds me of a "Linearization Principle" I have heard about before ... ;-)
My understanding is that the leading contender, in the West at least, is Novavax, and their Phase 1 / 2 results looked very good. I don't know if protein-based vaccines are harder to produce or whether the mRNA-based vaccines were just backed by entities that acted more quickly.
Replying to @matei_zaharia
We are working on it. In Appendix B.2.5 of our paper we already report results of a preliminary experiment with 9 human participants on CIFAR-10. The results suggest that our new test set is not harder for humans.
We have released FALCONN v1.2. Nice results on @fulhack's awesome ann-benchmarks (10^6 GloVe vectors):
We submitted to cs.LG (primary) and stat.ML (secondary).
Replying to @srchvrs
Yes, all great points. I agree that CNNs are often hard to train, and that we may be developing techniques that only work on specific datasets. I was just curious because "adaptive overfitting" is often used specifically for overfitting to a test set. Thanks for clarifying!
Replying to @giffmana @srchvrs
I agree with the paper you mentioned. Note that the abstract still says "In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset." For understanding ObjectNet in detail, I'd be curious to see what acc. I get :-)
Replying to @Tsourolampis
Congratulations!
Replying to @bernease
CSE2 G04 (Gates building) - would be great to see you there :-)
"Slight deterioration" vs "don't generalize OOD" is (partially) a matter of definition, so best addressed by concrete numbers on test sets. E.g., the 11% drop on V2 is about 5 years of progress on ImageNet. You know well how hard-earned each percentage point on ImageNet is :-)
That was also my first reaction :-)
I've had a Forerunner 935 for three years now and think it's a great watch. The heart rate monitor in mine recently stopped working and Garmin is sending me a replacement watch for free. A few friends of mine also have the FR 935 and all like it.