Micah Goldblum (@micahgoldblum) | nitter

Micah Goldblum @micahgoldblum

13 Nov 2025

An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3

39

145

1,447

516,858

Micah Goldblum @micahgoldblum

23 Aug 2022

TLDR: Diffusion models (like DALLE or Imagen) generate pretty pictures from Gaussian noise, but the same training and generation update rules generalize easily to other degradations, including completely deterministic ones. 1/7

13

138

917

Micah Goldblum @micahgoldblum

2 Aug 2022

A common point raised by ML reviewers is that a method is too simple or is made of existing parts. But simplicity is a strength, not a weakness. People are much more likely to adopt simple methods, and simple ones are also typically more interpretable and intuitive. 1/2

26

93

821

Micah Goldblum @micahgoldblum

10 Jul 2025

🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n

27

112

829

396,780

Micah Goldblum @micahgoldblum

29 Oct 2024

📢I’ll be admitting multiple PhD students this winter to Columbia University 🏙️ in the most exciting city in the world! If you are interested in dissecting modern deep learning systems to probe how they work, advancing AI safety, or automating data science, apply to my group.

6

141

547

67,981

Micah Goldblum @micahgoldblum

6 Feb 2025

Here’s an easy trick for improving the performance of gradient-boosted decision trees like XGBoost, allowing them to read text column headers and benefit from massive pretraining: replace the first tree with an LLM or TabPFN! 🧵 1/9

7

89

556

104,531
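One way to read "replace the first tree" in the 1/9 post above is that boosting starts from another model's predictions instead of a constant, and the trees only fit what that model gets wrong. The sketch below is a toy under stated assumptions, not the paper's implementation: it uses squared-error loss, depth-1 stumps on a single feature, and a hypothetical `pretrained_predict` function standing in for an LLM or TabPFN.

```python
def pretrained_predict(x):
    # Hypothetical stand-in for an LLM/TabPFN prediction: a rough linear guess.
    return 2.0 * x

def fit_stump(xs, residuals):
    """Return the single-split regression stump with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def boost(xs, ys, init_predict, n_trees=20, lr=0.5):
    """Gradient boosting for squared error, initialised from init_predict."""
    stumps = []
    preds = [init_predict(x) for x in xs]  # the "first tree" is the pretrained model
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return lambda x: init_predict(x) + lr * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: the stand-in model captures the slope but misses a step at x > 1.5.
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + (1.0 if x > 1.5 else 0.0) for x in xs]
model = boost(xs, ys, pretrained_predict)
```

Comparing `mse(pretrained_predict, xs, ys)` with `mse(model, xs, ys)` shows the stumps correcting the step that the initial model misses.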

Micah Goldblum @micahgoldblum

25 Apr 2023

Self-Supervised Learning (SSL) is quickly becoming a de facto way of training neural networks, but if you have ever tried it yourself, you’d know that getting high performance is tricky! Check out our new thorough guide to all things SSL. arxiv.org/abs/2304.12210

7

78

453

76,507

Micah Goldblum @micahgoldblum

23 May 2024

I’m excited to announce that I’ll start as an assistant professor at Columbia University this summer! Interview season was fun, and I met so many amazing people, but I’m happy to finally close the loop.

43

9

408

50,083

Micah Goldblum @micahgoldblum

1 Nov 2023

🚨Excited to announce a large-scale comparison of pretrained vision backbones including SSL, vision-language models, and CNNs vs ViTs across diverse downstream tasks ranging from classification to detection to OOD generalization and more! NeurIPS 2023🚨🧵 arxiv.org/abs/2310.19909

8

88

392

199,540

Micah Goldblum @micahgoldblum

13 Oct 2022

How much data are augmentations worth? We show that augmentations can actually be worth more than extra data and invariance! They increase variance across batches, and this extra stochasticity finds flatter minima. arxiv.org/abs/2210.06441 1/8

How Much Data Are Augmentations Worth? An Investigation into...

Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations...

3

64

379

Micah Goldblum @micahgoldblum

5 Jul 2022

Gradient-boosted decision trees are still thought to be competitive with neural networks on tabular data. But NNs have a massive advantage: they learn representations, and this ability can be leveraged for transfer learning arxiv.org/abs/2206.15306. 1/4

6

40

338

Micah Goldblum @micahgoldblum

12 Jun 2024

🚨 Announcing LiveBench, a challenging new general-purpose live LLM benchmark! 🚨 Thanks @crwhite_ml and @SpamuelDooley for leading the charge! Link: livebench.ai/ Existing LLM benchmarks have serious limitations: 🧵

10

71

331

154,989

Micah Goldblum @micahgoldblum

20 Apr 2023

🚨Here’s an intuitive explanation for why training on lots and lots of data creates emergent properties, for instance math and reasoning, in large language models like #GPT-4 and #ChatGPT 🚨 1/17

6

34

266

110,750

Micah Goldblum @micahgoldblum

18 Jun 2024

We often determine whether a neural network is over- or underparameterized by counting parameters. In practice, how much data we can fit depends on many factors: architecture, optimizer, etc. So just how flexible are neural networks in practice? 🧵 Paper: arxiv.org/abs/2406.11463

Just How Flexible are Neural Networks in Practice?

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized...

10

52

260

43,392

Micah Goldblum @micahgoldblum

2 Oct 2023

I’m on the faculty job market this year! Going from a late start as a math PhD student to an ML postdoc was a fun challenge. Building a research agenda alongside amazing students has been rewarding, with 9 papers accepted to NeurIPS this year. Don’t let rejections get you down!

6

19

247

84,030

Micah Goldblum @micahgoldblum

13 Feb 2025

AI web agents like Operator and Anthropic’s Computer Use can operate a browser, but the LLMs inside are brittle, and you can’t trust what’s on the web. In this 🧵, I’ll show how adversaries can fool Anthropic’s web agent into sending phishing emails or revealing credit card info.

17

69

239

42,951

Micah Goldblum @micahgoldblum

29 Feb 2024

Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs. w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils arxiv.org/abs/2312.17173 1/9

Non-Vacuous Generalization Bounds for Large Language Models

Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the...

3

26

228

28,035

Micah Goldblum @micahgoldblum

2 Aug 2022

Simplicity doesn’t preclude novelty, even when the method is composed of existing parts. During the NeurIPS review period, DO NOT downgrade papers just because the method is simple. If anything, question methods which are needlessly complicated when simple solutions will do. 2/2

6

9

208

Micah Goldblum @micahgoldblum

10 Oct 2022

One view of ML history is that we started out with MLPs and evolved towards more specialized architectures like CNNs for vision, LSTMs for sequences, etc. But actually, the exact opposite is true! 🚨🧵1/6

2

23

211

Micah Goldblum @micahgoldblum

16 Aug 2023

We show that neural networks have a remarkable preference for low complexity which overlaps strongly with real-world data across modalities. PAC-Bayes proves that such models generalize, explaining why NNs are almost universally effective. arxiv.org/abs/2304.05366

Jim Fan

@DrJimFan

16 Aug 2023

There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes:

- Kolmogorov compressor is the theoretical shortest-length program that produces a dataset. SGD is a practical approximation of the Kolmogorov search that finds an implicit program embedded in the weights of a soft computer, i.e. big Transformers.
- Unsupervised learning is about computing the conditional Kolmogorov complexity of a target dataset given an unlabelled corpus, i.e. K(Y|X)
- Theory tells us that optimizing for K(X, Y), the joint complexity, is as good as K(Y|X). So simply throw all data into the mix, and "just compress everything".
- Joint compression is maximum likelihood over the giant concatenated dataset.
- Ilya cites iGPT, Chen et al. 2020, to illustrate the ideas. iGPT is an image compressor that learns to predict the next pixel using a 1D sequence model.

This is a phenomenal lecture, very accessible, and sometimes quite entertaining.

YouTube: piped.video/watch?v=AKMuA_TV…
Lecture page: simons.berkeley.edu/talks/il…

1

23

182

60,925

Micah Goldblum @micahgoldblum

19 Dec 2023

I’m thrilled to announce the first issue of a community survey on the state and future of deep learning! We asked folks their opinions on benchmarking, transformers, interpretability, theories of deep learning, and directions we should be working on. 1/3 arxiv.org/abs/2312.09323

5

32

175

37,594

Micah Goldblum @micahgoldblum

24 May 2022

Typical transfer learning pipelines involve initializing at pre-trained weights and hoping that relevant learned information magically transfers even when the weights change during fine-tuning. But you can transfer so much more than just initialization! 1/4

1

30

169

Micah Goldblum @micahgoldblum

13 Nov 2025

Here's the OpenReview page: openreview.net/forum?id=1Ne3… Enjoy the paper and nonsense reviews. Thanks to Arka Pal for catching this. 2/3

Gauge Symmetries for Efficient Zero-Knowledge Proofs of Transformers

We introduce GaugeZKP, a symmetry-aware verification framework for Transformers that exploits the maximal gauge group of attention. For canonical models the maximal group is Gₘₐₓ = ((GL(dₖ))ʰ ×...

10

6

184

45,754

Micah Goldblum @micahgoldblum

12 Nov 2022

The following statement, while a commonly held view, is actually false! “Learning theory says that the more functions your model can represent, the more samples it needs to learn anything”. 1/8

Yann LeCun

@ylecun

12 Nov 2022

OK, debates about the necessity of "priors" (or lack thereof) in learning systems are pointless. Here are some basic facts that all ML theorists and most ML practitioners understand, but a number of folks-with-an-agenda don't seem to grasp. Thread. 1/

6

20

163

Micah Goldblum @micahgoldblum

6 Apr 2022

The new @icmlconf review format is horrendous (no reviewer scores). Students will spend an inordinate amount of time drafting rebuttals for reviewers who have already committed to rejecting their papers. Massive waste of person hours.

1

4

154

Micah Goldblum @micahgoldblum

10 Jun 2022

We usually use NNs in silico, but they can also operate on analogue systems involving optics etc. You can train high-performance physical NNs that perform inference orders of magnitude faster than digital computers nature.com/articles/s41586-0…. 1/2

2

30

154

Micah Goldblum @micahgoldblum

14 Jun 2022

As we go into the NeurIPS reviewing process, remember to accept every paper that you think would contribute to the conference! Don’t read papers trying to find little things to criticize. Instead, try also to find the valuable pieces that the community might want to read. 1/3

2

9

146

Micah Goldblum @micahgoldblum

3 Aug 2022

Lazy reviewer starter pack: “needs theoretical justification”, “should cite [paper that came out a week after the submission deadline]”, “not novel enough for NeurIPS”, “more datasets, models, and baselines [that don’t apply or are in the appendix]”, “borderline accept/reject”

6

8

142

Micah Goldblum @micahgoldblum

6 Dec 2023

🚨Real data is often massively class-imbalanced, and standard NN pipelines built on balanced benchmarks can fail! We show that simply tuning standard pipelines beats all of the newfangled samplers and objectives designed for imbalance. #NeurIPS2023 🚨🧵1/8 arxiv.org/abs/2312.02517

3

19

143

17,809

Micah Goldblum @micahgoldblum

9 Jun 2022

Data scientists working with tabular data try simple linear models first and work their way through GBDT and maybe finally expressive NNs and ensembles. In contrast, vision or NLP practitioners start with the most powerful NN at their disposal. 1/2

6

6

137

Micah Goldblum @micahgoldblum

11 Nov 2025

🚨We converted pretrained LLMs into looped LLMs that can crank up performance by looping for more iterations. Our looped models surpass the performance of the pretrained models we started out with, showing that existing models benefit from increased computational depth. 📜1/9

10

26

150

34,567

Micah Goldblum @micahgoldblum

5 Jun 2022

After years of ML research and conference publications, I'm finally attending my first in-person conference this summer … as a postdoc … in my home city …🙃

2

1

130

Micah Goldblum @micahgoldblum

25 Jul 2022

Lessons from ICML @icmlconf: (1) Eliminate short talks, especially pre-recorded ones. (2) Poster sessions ≫ talks so allocate more time to them. (3) NO EVENTS DURING LUNCH/DINNER TIME! Poster sessions ended ~8:30pm and people went without dinner. (4) Don’t serve moldy bagels.

4

7

130

Micah Goldblum @micahgoldblum

23 Jul 2025

🚨Announcing Zebra-CoT, a large-scale dataset of high quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n

1

26

128

17,745

Micah Goldblum @micahgoldblum

12 Apr 2023

There’s a pervasive myth that the No Free Lunch Theorem prevents us from building general-purpose learners. Instead, we need to select models on a per-domain basis. Is this really true? Let’s talk about it! 🧵 1/16 arxiv.org/abs/2304.05366 w/@andrewgwils, @m_finzi, K. Rowan

9

21

121

28,240

Micah Goldblum @micahgoldblum

8 Feb 2023

#StableDiffusion and #ChatGPT use prompts, but hard prompts (actual text) perform poorly, while soft prompts are uninterpretable and nontransferable. We designed an easy-to-use prompt optimizer PEZ for discovering good hard prompts, complete with a demo.🧵 arxiv.org/abs/2302.03668

3

29

121

32,834

Micah Goldblum @micahgoldblum

7 Aug 2022

Reviewers, engage with authors during the discussion period! So many misunderstandings are waiting to be cleared up by great rebuttals, and so many mistakes have now been fixed. Don’t get some unlucky grad student’s paper rejected because you were too lazy to engage!

2

12

115

Micah Goldblum @micahgoldblum

13 Jun 2024

If we want to use LLMs for decision making, we need to know how confident they are about their predictions. LLMs don’t output meaningful probabilities off-the-shelf, so here’s how to do it 🧵 Paper: arxiv.org/abs/2406.08391 Thanks @psiyumm and @gruver_nate for leading the charge!

2

20

112

13,786

Micah Goldblum @micahgoldblum

5 Nov 2022

Just made a quick plot of ICLR 2023 mean reviewer scores by percentile. To be in the top 25% of papers, you need a mean reviewer score of at least 5.67

3

9

111

Micah Goldblum @micahgoldblum

9 Aug 2022

I want to point out several problems (areas for improvement) in the @NeurIPSConf review process which I haven't heard talked about. (1) Do not show reviewer scores to other reviewers since these bias score changes via peer pressure (do show scores to authors and ACs). 1/3

4

5

107

Micah Goldblum @micahgoldblum

6 Jul 2022

Some people feel that transfer learning (TL) doesn’t apply to tabular data just because there exist unrelated domains (e.g. cc fraud vs. disease diagnosis). However, there are also adjacent tabular domains where TL makes a ton of sense (e.g. diagnosis of different diseases). 1/6

Bojan Tunguz

@tunguz

5 Jul 2022

Saw a new article on transfer learning for tabular data using NNs. I don’t have the time to take a closer look, but my initial reaction is the following: 1/4

3

15

103

Micah Goldblum @micahgoldblum

13 Jun 2022

After a long day of ML research without any paws. Research is so ruff!

2

2

98

Micah Goldblum @micahgoldblum

13 Nov 2025

I want to clarify that my batch of papers is fine, as are most of the reviews. This kind of paper is an outlier, yet we disproportionately see the outliers on social media. I do think reviewers are less likely to put in enough work to evaluate technical papers. 3/3

2

2

111

39,535

Micah Goldblum @micahgoldblum

13 Nov 2022

I’m on the academic job market this year 🚨🥳🚨! Let me know if there are any interesting opportunities I’m likely to have overlooked or catch me at #NeurIPS2022!

1

10

93

Micah Goldblum @micahgoldblum

30 May 2022

People think hierarchical features are why NNs generalize. Has anyone formalized this? How would you verify/falsify? Historically, people thought early layers are tuned to extract low-level features (e.g. edges), while late layers learn to extract abstract ones (e.g. faces) 1/2

7

10

94

Micah Goldblum @micahgoldblum

6 Mar 2023

Diffusion models like #StableDiffusion and #dalle2 generate beautiful pictures, but are these images new or are they copies of the images they were trained on? 🧵 #CVPR2023 arxiv.org/abs/2212.03860

4

25

91

24,010

Micah Goldblum @micahgoldblum

28 May 2022

If my dog steps on my keyboard while I'm drafting a conference submission, is she a co-author? Some fields are very loose with authorship.

6

4

88

Micah Goldblum @micahgoldblum

17 Apr 2023

Data scientists who work at big tech companies benefit from free lunch in more ways than one. arxiv.org/abs/2304.05366

12

87

19,281

Micah Goldblum @micahgoldblum

10 Jul 2025

Small batches → noisy gradients, so this rule increases β2 as we decrease batch size, updating the optimizer state slowly and smoothing out the noise. Specifically, we fix the half-life of each gradient in the optimizer state in terms of tokens rather than iterations. 5/n

2

6

94

26,041
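The half-life rule in the 5/n post above can be worked through in a few lines. A gradient's weight in Adam's second-moment average shrinks by a factor of β2 each step, so it halves after ln(0.5)/ln(β2) steps; multiplying by tokens per step gives the half-life in tokens. Holding that quantity fixed while the batch size changes gives β2_new = β2_ref^(B_new/B_ref). The reference values below (batch size 512, β2 = 0.95, sequence length 1024) are illustrative assumptions, not numbers from the thread.

```python
import math

def half_life_tokens(beta2: float, batch_size: int, seq_len: int) -> float:
    """Tokens processed before a gradient's weight in the EMA has halved."""
    steps = math.log(0.5) / math.log(beta2)
    return steps * batch_size * seq_len

def scale_beta2(beta2_ref: float, batch_ref: int, batch_new: int) -> float:
    """beta2 that keeps the token half-life fixed when the batch size changes."""
    return beta2_ref ** (batch_new / batch_ref)

# Assumed reference setup: beta2 = 0.95 tuned at batch size 512.
beta2_small = scale_beta2(0.95, batch_ref=512, batch_new=1)  # roughly 0.9999
```

Under these assumptions, dropping to batch size 1 pushes β2 very close to 1, which matches the post's description of updating the optimizer state slowly to smooth out gradient noise.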

Micah Goldblum @micahgoldblum

23 Aug 2022

Paper found here: arxiv.org/abs/2208.09392 All the awesome collaborators that made this happen: @arpitbansal297, @EBorgnia, Hong-Min Chu, Jie Li, @hamid_kazemi22, @furongh, @jonasgeiping, @tomgoldsteincs 7/7

4

6

75

Micah Goldblum @micahgoldblum

12 Jun 2022

Why are facial recognition systems so unfair across race/gender? A lot of people think it comes from imbalanced training data, but it even happens with perfectly balanced training data. In fact, randomly initialized face rec systems are unfair too! arxiv.org/abs/2203.08235 1/3

A Deep Dive into Dataset Imbalance and Bias in Face Identification

As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center...

5

17

76

Micah Goldblum @micahgoldblum

6 Jun 2022

In deep learning, we typically split the data, train on the training split, and evaluate on the validation split, so we only train on part of the data when we are comparing models. In contrast, the marginal likelihood tries to use data more holistically. 1/3

3

10

76

Micah Goldblum @micahgoldblum

6 Aug 2022

Classic Reviewer 2: “The authors have now thoroughly addressed all my concerns. I raise my score from 4 to 5.”

5

2

76

Micah Goldblum @micahgoldblum

8 Jun 2022

ML security research has been DOMINATED by adversarial examples/defenses for the past few years, not because it is the most important area of security but because it is easy to work on (low implementation/hardware/know-how costs). 1/2

5

4

66

Micah Goldblum @micahgoldblum

10 Jul 2025

This thread is based on our new paper, which can be found here: arxiv.org/abs/2507.07101 Shoutout to my awesome co-authors: @mrtnm, @LotfiSanae, @aditsom7, @andrewgwils Now let’s get into it! 3/n

Small Batch Size Training for Language Models: When Vanilla SGD...

Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a...

1

9

72

9,006

Micah Goldblum @micahgoldblum

11 Jun 2024

Virtually all large models today contain huge matrices, and these dominate their compute. By incorporating structure in these matrices, we can improve the performance/compute tradeoff!

Andrew Gordon Wilson

@andrewgwils

11 Jun 2024

A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws! arxiv.org/abs/2406.06248 w/@ShikaiQiu, Andres P, @m_finzi, @micahgoldblum 1/8

4

68

8,901

Micah Goldblum @micahgoldblum

20 Jul 2022

Thrilled that our paper on model selection won the Outstanding Paper Award at ICML 2022. All credit goes to my great collaborators. Check out @LotfiSanae's talk and drop by our poster tomorrow!

Sanae Lotfi

@LotfiSanae

20 Jul 2022

I'm so proud that our paper on the marginal likelihood won the Outstanding Paper Award at #ICML2022!!! Congratulations to my amazing co-authors @Pavel_Izmailov, @g_benton_, @micahgoldblum, @andrewgwils 🎉 Talk on Thursday, 2:10 pm, room 310 Poster 828 on Thursday, 6-8 pm, hall E

1

2

59

Micah Goldblum @micahgoldblum

13 Jan 2025

🚨📢 Excited to announce the ICLR 2025 Workshop on Building Trust in LLMs and LLM Applications! 📢🚨 Submit all your papers, and we’ll see you in Singapore! There will be paper awards, and we have a stacked lineup of speakers and panelists.

1

8

62

14,888

Micah Goldblum @micahgoldblum

10 Jul 2025

Read our paper here: arxiv.org/abs/2507.07101 Use our codebase: github.com/martin-marek/batc… And thanks to my amazing collaborators who made this work possible: @mrtnm, @LotfiSanae, @aditsom7, @andrewgwils n/n

Small Batch Size Training for Language Models: When Vanilla SGD...

Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a...

2

4

64

5,743

Micah Goldblum @micahgoldblum

10 Jul 2025

Here's ★how to make small batch LLM training fast, ★how to pretrain LLMs efficiently via vanilla SGD without momentum, and ★why you should consider getting rid of LoRA and gradient accumulation. 2/n

1

2

62

10,348

Micah Goldblum @micahgoldblum

26 Aug 2022

We should value practicality and intuitiveness over novelty. Novelty often means an idea is so complicated that the reader couldn't have imagined coming up with it. That's not a good thing. That's bad. Good ideas often seem obvious in retrospect.

5

4

59

Micah Goldblum @micahgoldblum

24 Feb 2021

Want to learn more about data poisoning and backdoor attacks? Our survey paper (arxiv.org/abs/2012.10544) clarifies the state of the field for newbies and veterans alike! @dawnsongtweets @tsiprasd @xinyun_chen_ @ChulinXie @A_v_i__S @tomgoldsteincs @aleks_madry

13

60

Micah Goldblum @micahgoldblum

26 Mar 2024

We show how to make data poisoning and backdoor attacks way more potent by synthesizing them from scratch with guided diffusion. 🧵 1/8 Paper: arxiv.org/abs/2403.16365

1

10

57

8,063

Micah Goldblum @micahgoldblum

27 May 2022

What are the real reasons that NNs work so much better than other models? It sure as hell isn’t because of the “implicit bias of SGD”. Is it their inductive biases, parameter efficiency, ease of optimization? Would other models work just as well if only we could scale them up?

11

3

54

Micah Goldblum @micahgoldblum

9 Aug 2022

(3) Reviewers shouldn’t be allowed to click “Author Rebuttal Acknowledgement” (not writing a response to the rebuttal) if they don’t increase their score. It is important to justify to the authors why their points don’t address your feedback. 3/3

2

5

52

Micah Goldblum @micahgoldblum

21 Sep 2023

Transformers seem to work for all sorts of data, made possible by a structure that virtually all real data shares. This also allows NNs to be near-universal compressors. The real world is simple, so all we need is models with a simplicity bias. arxiv.org/abs/2304.05366

AK

@_akhaliq

20 Sep 2023

Language Modeling Is Compression paper page: huggingface.co/papers/2309.1… It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

5

52

12,295

Micah Goldblum @micahgoldblum

10 Jul 2025

Disclaimer – When training on a huge cluster, a large batch size may be necessary for good hardware utilization. Similarly, gradient accumulation reduces time spent communicating gradients between model copies. But most of us are poor, so listen up if you are poor too! 11/n

2

1

55

6,045

Micah Goldblum @micahgoldblum

23 Aug 2022

Even using deterministic degradations, training and test-time update rules that underlie diffusion models can be generalized, calling into question the orthodox understanding of diffusion and opening up research on a whole new direction of generative models. 6/7

2

6

49

Micah Goldblum @micahgoldblum

23 Aug 2022

Diffusion models, from DALLE to Imagen, operate by sampling random Gaussian noise and iteratively denoising/noising until they converge to a pretty picture. This simple-sounding process is underpinned by several theoretical motivations. 2/7

1

7

47

Micah Goldblum @micahgoldblum

14 Nov 2025

Replying to @gowthami_s

Likely

58

21,350

Micah Goldblum @micahgoldblum

10 Oct 2022

ML practitioners used to encode their beliefs about a problem, like invariances, into their architectures by hand. We show that transformers actually learn these same structures directly from the data! arxiv.org/abs/2210.02984 3/6

2

47

Micah Goldblum @micahgoldblum

5 Dec 2022

#NeurIPS2022 was fun! No moldy bagels like ICML, more poster sessions, fewer talks. Minor suggestions: (1) Don’t schedule poster sessions during meal times (e.g. sessions from 11am-1pm). I noticed a lot of people skipping the lunchtime poster session because they were hungry. 1/3

1

1

46

Micah Goldblum @micahgoldblum

10 Jul 2025

Small batch LLM training is thought to be slow per FLOP, motivating gradient accumulation to simulate larger batches, even in small-scale academic runs. We show that a simple rule for scaling Adam hyperparameters allows efficient per-FLOP training down to batch size 1. 4/n

3

47

10,545

Micah Goldblum @micahgoldblum

10 Jul 2025

Next time you are fine-tuning an LLM, turn off gradient accumulation and LoRA, and do full fine-tuning with Adafactor and a small batch size for great performance and lightweight hardware requirements. 10/n

1

2

45

4,321

Micah Goldblum @micahgoldblum

23 Aug 2022

One interpretation of diffusion models views them as score estimators, whereby noise is added to the score estimates to sample images via stochastic gradient Langevin dynamics: arxiv.org/abs/1907.05600 3/7

1

3

41
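The score-estimator view in the 3/7 post above can be illustrated with a one-dimensional toy. The sketch assumes an analytic score for a Gaussian target in place of a learned network, and it runs plain unadjusted Langevin dynamics with a fixed step size rather than the annealed schedule used in the linked paper. The target mean, standard deviation, and step size are made-up values.

```python
import math
import random

def score(x, mu=2.0, sigma=0.5):
    """Gradient of log N(x; mu, sigma^2) with respect to x."""
    return -(x - mu) / sigma**2

def langevin_sample(rng, n_steps=300, step=0.01):
    """Start from N(0, 1) and follow the score plus injected Gaussian noise."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * score(x) + math.sqrt(step) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With these settings the sample mean and variance should land near the target's 2.0 and 0.25. Setting the noise term to zero would collapse every sample onto the mode, which is why noise is central under this interpretation.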

Micah Goldblum @micahgoldblum

20 Apr 2023

Check out our paper, with @m_finzi, Keefer Rowan, @andrewgwils, where we show just how important simplicity bias, formalized using Kolmogorov complexity, is for machine learning. The paper is easy to approach for all audiences! 16/17 arxiv.org/abs/2304.05366

1

2

44

4,692

Micah Goldblum @micahgoldblum

9 Aug 2022

(2) Low acceptance rates make reviewers worry that assigning a high score, thus increasing the chance of that paper being accepted, in turn decreases the chance of their own paper being accepted. After all, the same AC may be in charge of both papers. This is a bad incentive. 2/3

3

1

42

Micah Goldblum @micahgoldblum

30 Nov 2022

🚨NeurIPS poster Wednesday: 11-1, Hall J #512🚨 arxiv.org/abs/2106.08970 Backdoor attacks are dangerous, but existing attacks are easy to detect. We develop a backdoor attack whose poisons are indistinguishable from clean samples. Can you tell which are poisoned? 1/4

5

11

45

Micah Goldblum @micahgoldblum

23 Aug 2022

A second closely related interpretation views diffusion models as autoencoders with a fixed encoder that noises the image and a learned decoder that reverses this random process by approximating the reverse conditional distributions with Gaussians: arxiv.org/abs/2006.11239 4/7

Denoising Diffusion Probabilistic Models

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best...

1

5

40
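The fixed "encoder" in the 4/7 post above has a closed form in the linked DDPM paper: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β_s). The sketch below assumes that paper's linear β schedule from 1e-4 to 0.02 over 1000 steps. The four-value "image" is a made-up toy, and the learned decoder is not shown.

```python
import math
import random

def alpha_bars(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) under a linear beta schedule."""
    out, prod = [], 1.0
    for t in range(n_steps):
        beta = beta_start + (beta_end - beta_start) * t / (n_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_image(x0, alpha_bar, rng):
    """Sample x_t given x_0 in one shot using the closed-form forward process."""
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

abar = alpha_bars()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                # toy "image"
x_early = noise_image(x0, abar[0], rng)   # almost unchanged
x_late = noise_image(x0, abar[-1], rng)   # close to pure Gaussian noise
```

Because ᾱ_t shrinks toward zero, the last step keeps almost none of the signal, so the decoder has to be learned while the encoder needs no parameters at all.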

Micah Goldblum @micahgoldblum

12 Nov 2022

SSL can even be used to learn an explicit prior probability distribution over parameters arxiv.org/abs/2205.10279. 8/8

1

1

41

Micah Goldblum @micahgoldblum

1 Nov 2023

Despite the popularity of ViTs and SSL, our benchmark suggests that the best backbones for most vision tasks are actually modern convnets (e.g. ConvNeXt) pretrained on massive labeled classification datasets. Future SSL works should train on bigger datasets to be competitive. 2/7

1

1

41

4,419

Micah Goldblum @micahgoldblum

10 Jul 2025

On top of hyperparameter robustness, small batch training makes training robust to the choice of optimizer too. We observe great performance with memory-efficient optimizers like Adafactor, and even vanilla SGD without momentum performs nearly as well as Adam. 7/n

2

1

43

8,721

Micah Goldblum @micahgoldblum

10 Jul 2025

We observe that small batch training is highly robust to optimizer hyperparameters like learning rate and momentum. This means that on a fixed hyperparameter tuning budget, you will find better hyperparameters in the small batch regime. 6/n

2

1

43

6,813

Micah Goldblum @micahgoldblum

4 Jun 2022

The vast majority of NNs deployed/released have no privacy guarantee. If someone ever figures out how to recover training data from trained models, they will instantly recover boatloads of private data, and there’s nothing we can do to stop it! The models are already released.

4

6

40

Micah Goldblum @micahgoldblum

13 Jul 2022

I will be at ICML in person next week to present our works on marginal likelihood, model inversion/explainability, and privacy breaches in federated learning. Shoot me a message if you want to connect!

2

40

Micah Goldblum @micahgoldblum

7 Oct 2022

Just how important are handcrafted inductive biases like CNNs for computer vision? We can just learn them! ViTs are often actually more translation invariant than CNNs after training.

Nate Gruver @gruver_nate

7 Oct 2022

CNNs are famously equivariant by design, but how about vision transformers? Using a new equivariance measure, the Lie derivative, we show that trained transformers are often more equivariant than trained CNNs! arxiv.org/abs/2210.02984 w/ @m_finzi @micahgoldblum @andrewgwils 1/6

3

38

Micah Goldblum @micahgoldblum

2 Jun 2022

Sometimes I wish I knew about RL, but then I remember that there's only so much time in the day and I have to prioritize, so I go back to watching Seinfeld re-runs.

41

Micah Goldblum @micahgoldblum

6 Feb 2025

Our new paper shows how to combine GBDTs with LLMs and TabPFN, improving their performance across a wide variety of sample sizes. Link: arxiv.org/abs/2502.02672 2/9

2

3

43

2,614

Micah Goldblum @micahgoldblum

29 Feb 2024

Interestingly, we find that bigger models can often be compressed into FEWER bits than smaller models, explaining why they perform better. In the future, if we can compress models better and better, we can make tighter and tighter bounds that explain why LLMs work so well. 9/9

1

3

40

7,925

Micah Goldblum @micahgoldblum

8 Jun 2022

What started out merely as interesting properties of NNs became the main focus of ML security. But data poisoning and privacy are far bigger threats! Training data is scraped at scale without supervision, and models are trained on user data without any privacy guarantees. 2/2

2

37

Micah Goldblum @micahgoldblum

14 Jun 2022

Remember that grad students worked their butts off on those papers, and they shouldn’t be rejected just because they didn’t compare to the n+1’th method that conveniently happens to be yours. 2/3

2

37

Micah Goldblum @micahgoldblum

31 Jul 2022

Lit reviews are super slow now that Google Scholar thinks I'm a bot.

6

1

35

Micah Goldblum @micahgoldblum

10 Oct 2022

Check out our easy-to-use tool for measuring equivariance via the Lie derivative. It even allows for layer-wise analysis and scales gracefully across architectures and input sizes: github.com/ngruver/lie-deriv 5/6

GitHub - ngruver/lie-deriv

Contribute to ngruver/lie-deriv development by creating an account on GitHub.

1

2

39

Micah Goldblum @micahgoldblum

23 Aug 2022

Under both of these interpretations, noise is central to why diffusion works. But such models can be used to reverse numerous other degradations, including completely deterministic ones: blur, masking, pixelation, snow-ifying, and … wait for it … animorph. 5/7

1

6

36

Micah Goldblum @micahgoldblum

13 Nov 2025

credit to @ArkaPal999 4/3

3

42

32,044

Micah Goldblum @micahgoldblum

13 Feb 2025

…So the agent follows the link. The malicious page instructs the agent to fulfill the user’s requests by filling out a form. The agent fills it out, including the address and credit card number. Sometimes the agent realizes it’s a scam, but only after it has already entered the credit card info.

1

2

38

35,722

Micah Goldblum @micahgoldblum

17 Jun 2022

How long does it take you to read or review a paper on average? Just reading a paper in full detail takes me hours, so unless I’m way slower than everyone else, I assume most reviewers are just skimming their papers.

7

38

Micah Goldblum @micahgoldblum

15 Sep 2023

It’s pathetic when ML conferences raise the acceptance cutoff in order to make the conference look prestigious. If 80% of papers are amazing, then accept them, especially in cases where the conference can easily host more papers.

2

37

5,267

Micah Goldblum @micahgoldblum

10 Jul 2025

We see similar trends for fine-tuning and pretraining. The LoRA parameterization is common for fine-tuning, reducing the memory footprint of the optimizer state. We find that full parameter fine-tuning with Adafactor performs better and has a similar memory cost. 9/n

1

1

36

5,136
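The memory comparison in the 9/n post above can be checked with back-of-the-envelope arithmetic for a single weight matrix. The sketch counts optimizer state only, ignoring weights, gradients, and activations. It assumes Adam keeps two values per trainable parameter, that Adafactor runs with factored second moments and no momentum (one value per row plus one per column), and that LoRA trains two rank-r adapter matrices with Adam. The 4096×4096 layer and rank 8 are illustrative choices, not figures from the thread.

```python
def adam_state(rows: int, cols: int) -> int:
    """Full fine-tuning with Adam: first and second moment per weight."""
    return 2 * rows * cols

def adafactor_state(rows: int, cols: int) -> int:
    """Full fine-tuning with factored Adafactor (no momentum): row and column statistics."""
    return rows + cols

def lora_adam_state(rows: int, cols: int, rank: int) -> int:
    """LoRA with Adam: two moments for each of the rank * (rows + cols) adapter weights."""
    return 2 * rank * (rows + cols)

rows = cols = 4096  # assumed layer shape
full_adam = adam_state(rows, cols)             # 33,554,432 values
full_adafactor = adafactor_state(rows, cols)   # 8,192 values
lora = lora_adam_state(rows, cols, rank=8)     # 131,072 values
```

Under these assumptions the Adafactor state for the full matrix is smaller than the Adam state for the LoRA adapters, which is consistent with the post's point that full-parameter fine-tuning need not cost more optimizer memory.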

Micah Goldblum @micahgoldblum

24 Jul 2022

Tabular deep learning is still in its infancy. Lots of room for improvement!

Sebastian Raschka

@rasbt

24 Jul 2022

A Short Chronology Of Deep Learning For Tabular Data: sebastianraschka.com/blog/20… Deep tabular methods are an interesting research direction! So, this morning, I sat down and summarized my thoughts + the recent papers I read.

1

3

35