Ludwig Schmidt (@lschmidt3) | nitter

Ludwig Schmidt @lschmidt3

5 Jun 2025

Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.

22

206

1,324

187,563

Ludwig Schmidt @lschmidt3

28 Apr 2023

Super excited to finally release DataComp! There is still a lot we don't understand about Internet-scale datasets. DataComp makes research on datasets more accessible and leads to better training sets. The results so far are very encouraging and there is much more to explore!

Gabriel Ilharco

@gabriel_ilharco

28 Apr 2023

Introducing DataComp, a new benchmark for multimodal datasets! We release 12.8B image-text pairs, 300+ experiments and a 1.4B subset that outcompetes compute-matched CLIP runs from OpenAI & LAION 📜 arxiv.org/abs/2304.14108 🖥️ github.com/mlfoundations/dat… 🌐 datacomp.ai

1

14

111

16,974

Ludwig Schmidt @lschmidt3

23 Jun 2025

I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.

Andy Konwinski

@andykonwinski

23 Jun 2025

Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.

2

5

91

11,687

Ludwig Schmidt @lschmidt3

20 May 2025

Very excited about our new agent benchmark! I think it's a nice way of evaluating how well agents can do complex tasks in terminal (command line) environments.

Mike A. Merrill

@Mike_A_Merrill

19 May 2025

Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr lots of room for improvement! tbench.ai/

2

5

79

12,411

Ludwig Schmidt @lschmidt3

19 Jun 2024

Very excited about this! DCLM already led to a great training set for language models, and there is (much) more to understand + more room for improvement here.

Vaishaal Shankar @Vaishaal

18 Jun 2024

I am really excited to introduce DataComp for Language Models (DCLM), our new testbed for controlled dataset experiments aimed at improving language models. 1/x

7

53

7,153

Ludwig Schmidt @lschmidt3

28 Jan 2025

Very excited about this!

Ryan Marten

@ryanmart3n

28 Jan 2025

Announcing the Open Thoughts project. We are building the best reasoning datasets out in the open. Building off our work with Stratos, today we are releasing OpenThoughts-114k and OpenThinker-7B.

2

46

6,175

Ludwig Schmidt @lschmidt3

30 Apr 2019

If you are working on empirical phenomena in deep learning, consider submitting to our ICML workshop "Identifying and Understanding Deep Learning Phenomena" (deep-phenomena.org/). The deadline is May 5, but relevant work that was already published elsewhere is still welcome!

2

8

39

Ludwig Schmidt @lschmidt3

16 Jul 2024

I learned a lot about the nuances of language model scaling laws from this project. Also the checkpoints are available now: huggingface.co/formll/resolv…

formll/resolving-scaling-law-discrepancies · Hugging Face

Tomer Porian @tomerporian

2 Jul 2024

🧵1/8 We resolve the discrepancy between the compute optimal scaling laws of Kaplan et al. (exponent 0.88, Figure 14, left) and Hoffmann et al. (“Chinchilla”, exponent 0.5). Paper: arxiv.org/abs/2406.19146 Data + Code: github.com/formll/resolving-…

1

35

7,263

Ludwig Schmidt @lschmidt3

5 Jun 2025

Similar to previous DataComp projects, we systematically experiment with every step of the data generation pipeline to build a state-of-the-art training set. Overall we conducted more than 1,000 individual experiments.

1

35

4,436

Ludwig Schmidt @lschmidt3

5 Jun 2025

More details on openthoughts.ai/blog/ot3, Ryan’s thread below, and the paper itself arxiv.org/abs/2506.04178

OpenThoughts3 - A new SOTA Reasoning Data Recipe

Pushing the boundaries of open reasoning datasets through rigorous experimentation.

openthoughts.ai

Ryan Marten

@ryanmart3n

5 Jun 2025

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data scales. Full details are in our ✨new paper✨ - below we share the highlights: BTW, it also works on non-Qwen models😉 (1/N)

3

32

6,616

Ludwig Schmidt @lschmidt3

7 Jun 2025

Replying to @giffmana

Thanks for the kind words, Lucas! I hope we get a chance to work together some day, I'm a big fan of your work. BTW my lab is always looking for good postdocs. Comp is probably worse than OpenAI, but long-time lab members get to go on runs with @Vaishaal's dog Kaya. He's great!

1

1

31

3,151

Ludwig Schmidt @lschmidt3

12 Feb 2025

Very nice community progress on open-data reasoning models since the R1 release!

Negin Raoof

@NeginRaoof_

12 Feb 2025

Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, carefully curated datasets with verified R1 annotations produce SoTA reasoning models. Our 32B model outperforms all 32B models including DeepSeek-R1-Distill-Qwen-32B (a closed data model) in MATH500 and GPQA Diamond, and shows similar performance in other benchmarks. (1/n)

1

28

3,859

Ludwig Schmidt @lschmidt3

5 Jun 2025

Together with the paper we also release our new dataset OpenThoughts3-1.2M and the corresponding model OpenThinker3-7B, which is currently the best open-data 7B reasoning model.

1

28

5,638

Ludwig Schmidt @lschmidt3

8 Dec 2020

Replying to @beenwrekt @HazyResearch

Congrats! Do you know "Benjamen Recht"? He won the 2017 test of time award nips.cc/Conferences/2017/Awa… Could be related?

1

16

Ludwig Schmidt @lschmidt3

27 Jan 2018

Replying to @beenwrekt

What about starting the hierarchy of abstractions with "Ray Optics" = "Linear Models" in the ML context? Then we can ask which phenomena in deep learning we cannot explain with linear models. Or how the explanation in linear models differs from what we see in deep nets.

2

2

14

Ludwig Schmidt @lschmidt3

14 Jan 2021

Replying to @quocleix @hieupham789 @lmthang @ZihangDai

Humans can get 99%+ at least on the 600 object classes. proceedings.mlr.press/v119/s…

1

2

15

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @giffmana @srchvrs

Hi Lucas, you were probably thinking of arxiv.org/abs/2005.09619 . The high-level conclusion that the V2 distribution shift is only a statistical peculiarity is wrong. Engstrom et al. ran their crowdsourcing experiment with a different setup compared to ours.

Identifying Statistical Bias in Dataset Replication

Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work,...

2

1

14

Ludwig Schmidt @lschmidt3

30 May 2025

Cool to see more work on data for AI agents!

Alex Ratner

@ajratner

29 May 2025

Agentic AI will transform every enterprise–but only if agents are trusted experts. The key: Evaluation & tuning on specialized, expert data. I’m excited to announce two new products to support this–@SnorkelAI Evaluate & Expert Data-as-a-Service–along w/ our $100M Series D! --- Snorkel Evaluate is our new data-centric agentic AI evaluation platform for specialized, mission-critical enterprise settings where vibe checks and out-of-the-box metrics driven by simple LLM prompts are not enough. Snorkel Expert Data-as-a-Service is our white glove service for expert-level AI datasets, powering frontier LLM developers in areas like expert knowledge, reasoning, agentic action and tool use, and more! Both built on top of @SnorkelAI’s Data Development Platform, using our programmatic technology to drive higher-quality expert data, faster– for getting specialized AI to real production value. If you’re building enterprise AI and want to partner around the key ingredient in AI today–the data–book a demo and let's talk! snorkel.ai/demo/ Finally, see thread for details on 🧵👇 - 📽️ A walkthrough of Snorkel Evaluate and Expert Data-as-a-Service on an agentic AI enterprise task - 📅 An upcoming event on Enterprise Agentic AI with innovators from @Accenture @BNY @Comcast @Stanford @QBE & others - 📊 An upcoming series of benchmark datasets and model artifact releases 👀 Want early access to the full agentic AI dataset? Retweet this post and we'll send you the link!

13

2,967

Ludwig Schmidt @lschmidt3

21 Aug 2019

Great quote from Jon Kelner (I hope I'm attributing this correctly, has been a while): "The only advantage we have over Gauss is computers."

10

Ludwig Schmidt @lschmidt3

6 Dec 2017

Ali Rahimi (@alirahimi19) gave an awesome talk for his test of time award with Ben Recht (@beenwrekt) today at NIPS. "Machine Learning has become alchemy" -> let's return to more scientific rigor in ML. Yes!! Go to minute 54 in facebook.com/nipsfoundation/…

9

Ludwig Schmidt @lschmidt3

23 Mar 2019

Great article: nature.com/articles/d41586-0…

9

Ludwig Schmidt @lschmidt3

30 Apr 2019

The workshop is co-organized with a great set of collaborators: Samy Bengio @kenjihata @aleks_madry @arimorcos @bneyshabur @maithra_raghu @alirahimi0 @HanieSedghi and Ying Xiao.

6

Ludwig Schmidt @lschmidt3

11 Jan 2018

Replying to @vsmolyakov

Comparisons are good! But for classification problems it's important to look at the test error as well. The following paper has a detailed comparison of various optimization algorithms in deep learning: arxiv.org/abs/1705.08292

6

Ludwig Schmidt @lschmidt3

3 Sep 2021

Replying to @giffmana @srchvrs

Yup, we still need (much) better ways of relating data distributions. I think OOD is currently just a catch-all phrase for a range of disparate phenomena we need to tease apart. The main way I know for generating truly IID data is randomly splitting a large dataset.

2

1

6

Ludwig Schmidt @lschmidt3

22 Apr 2020

Replying to @srchvrs

If there is evidence for adaptive overfitting on CIFAR-10, I'd be very curious to know. We had a close look and found little to no signs of adaptive overfitting: arxiv.org/abs/1902.10811 (Yadav & Bottou also checked MNIST: arxiv.org/abs/1905.10498).

3

4

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @lschmidt3 @giffmana @srchvrs

Finally: the claim that ImageNet models suffer from accuracy drops under distribution shift is valid independently of ImageNetV2, e.g., ObjectNet shows very similar phenomena. You can find more analyses here: arxiv.org/abs/2007.00644

3

5

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @lschmidt3 @giffmana @srchvrs

Actually finally: Another result that may be of interest: humans don't see an accuracy drop between ImageNet and ImageNetV2: proceedings.mlr.press/v119/s… . So regardless of the source of bias in ImageNetV2 (crowdsourcing or selection frequencies), models should be able to handle this.

1

5

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @lschmidt3 @giffmana @srchvrs

Once you correct for the statistical bias Engstrom et al. analyze on the original ImageNetV2 data, the accuracy drop (11 percentage points for many models) only decreases by 0 - 1 pp. Happy to chat more if you're interested and send you our analysis.

2

4

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @lschmidt3 @giffmana @srchvrs

You can find the differences between the MTurk setups in Appendix B.2 of their paper. Note that this is exactly what we described in the ImageNetV2 paper: small differences in the crowdsourcing setup can lead to substantial accuracy changes.

1

4

Ludwig Schmidt @lschmidt3

9 Apr 2021

Replying to @florian_tramer @mavroudisv @jhasomesh

Dear Prof. Tramer, I work for a large technology news outlet. We are planning to write a major piece about your novel breakthrough idea Y and would like to interview you for the article. When can I call you? Please also send us photos of you in front of a whiteboard.

4

Ludwig Schmidt @lschmidt3

21 Dec 2020

Replying to @AlexGDimakis @TomSercu

There is a range of vaccines in development that work by directly injecting the spike protein, see the section "Protein-Based Vaccines" in the NYT vaccine tracker: nytimes.com/interactive/2020…

1

3

Ludwig Schmidt @lschmidt3

3 Sep 2021

Replying to @giffmana @srchvrs

Yup I agree that the models do much better OOD than what you'd expect for a worst-case shift. Also in-distribution progress translates very nicely to OOD, see arxiv.org/abs/2107.04649

2

3

Ludwig Schmidt @lschmidt3

29 Jul 2020

Replying to @CyrusRashtchian

Is it clear that this is a proof of concept? We can train to 100% adversarial training accuracy but fail to generalize to the test set (see arxiv.org/abs/1804.11285 ). Is the network trained on the test set now similarly overfitting and would fail to generalize to new data?

Adversarially Robust Generalization Requires More Data

Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce...

2

3

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @lschmidt3 @giffmana @srchvrs

Looking forward to the ImageNet workshop you're co-organizing at NeurIPS! I agree that there are many interesting questions still around ImageNet and related datasets :-)

1

3

Ludwig Schmidt @lschmidt3

1 Sep 2023

Replying to @ArmenAgha

Interesting! By how much?

1

3

595

Ludwig Schmidt @lschmidt3

18 Aug 2018

Very much worth watching!

Ben Recht @beenwrekt

17 Aug 2018

“Where else would you apply a mystery to mission critical systems?” Highly recommend spending an hour to watch James Mickens’ keynote on ML and security: piped.video/ajGX7odA87k

2

Ludwig Schmidt @lschmidt3

31 Aug 2021

Replying to @srchvrs @Nils_Reimers

For context, it's worth noting that the CLIP paper has shown really impressive results since we wrote this paragraph. See Section 3.3 in arxiv.org/abs/2103.00020

2

Ludwig Schmidt @lschmidt3

21 Aug 2019

Replying to @roydanroy @hardmaru

Are there pre-trained models? Just kidding, will have a look. (Also I really enjoyed @dcpage3's blog posts!)

2

Ludwig Schmidt @lschmidt3

27 Jan 2018

Replying to @lschmidt3 @beenwrekt

Reminds me of a "Linearization Principle" I have heard about before ... ;-)

2

Ludwig Schmidt @lschmidt3

21 Dec 2020

Replying to @lschmidt3 @AlexGDimakis @TomSercu

My understanding is that the leading contender in the west at least is Novavax and their Phase 1 / 2 results looked very good. I don't know if protein-based vaccines are harder to produce or whether the mRNA-based vaccines were just backed by entities that acted more quickly.

1

2

Ludwig Schmidt @lschmidt3

22 Feb 2019

Replying to @matei_zaharia

We are working on it. In Appendix B.2.5 of our paper we already report results of a preliminary experiment with 9 human participants on CIFAR-10. The results suggest that our new test set is not harder for humans.

2

Ludwig Schmidt @lschmidt3

17 Mar 2016

We have released FALCONN v1.2. Nice results on @fulhack's awesome ann-benchmarks (10^6 GloVe vectors):

2

3

Ludwig Schmidt @lschmidt3

15 Feb 2019

Replying to @maithra_raghu @beenwrekt @OpenAI

We submitted to cs.LG (primary) and stat.ML (secondary).

1

1

Ludwig Schmidt @lschmidt3

23 Apr 2020

Replying to @srchvrs

Yes, all great points. I agree that CNNs are often hard to train, and that we may be developing techniques that only work on specific datasets. I was just curious because "adaptive overfitting" is often used specifically for overfitting to a test set. Thanks for clarifying!

1

Ludwig Schmidt @lschmidt3

3 Sep 2021

Replying to @giffmana @srchvrs

I agree with the paper you mentioned. Note that the abstract still says "In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset." For understanding ObjectNet in detail, I'd be curious to see what acc. I get :-)

1

Ludwig Schmidt @lschmidt3

13 Nov 2018

Replying to @Tsourolampis

Congratulations!

1

Ludwig Schmidt @lschmidt3

30 Sep 2021

Replying to @bernease

CSE2 G04 (Gates building) - would be great to see you there :-)

1

Ludwig Schmidt @lschmidt3

3 Sep 2021

Replying to @lschmidt3 @giffmana @srchvrs

"Slight deterioration" vs "don't generalize OOD" is (partially) a matter of definition, so best addressed by concrete numbers on test sets. E.g., the 11% drop on V2 is about 5 years of progress on ImageNet. You know well how hard-earned each percentage point on ImageNet is :-)

1

Ludwig Schmidt @lschmidt3

11 Oct 2020

Replying to @thesasho @GregBodwin

That was also my first reaction :-)

1

Ludwig Schmidt @lschmidt3

6 Sep 2020

I've had a Forerunner 935 for three years now and think it's a great watch. The heart rate monitor in mine recently stopped working and Garmin is sending me a replacement watch for free. A few friends of mine also have the FR 935 and all like it.

1