🤖Prof at Columbia University 🏙️. All things machine learning.🤖

An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
39
145
1,447
516,858
TLDR: Diffusion models (like DALLE or Imagen) generate pretty pictures from Gaussian noise, but the same training and generation update rules generalize easily to other degradations, including completely deterministic ones. 1/7
13
138
917
A common point raised by ML reviewers is that a method is too simple or is made of existing parts. But simplicity is a strength, not a weakness. People are much more likely to adopt simple methods, and simple ones are also typically more interpretable and intuitive. 1/2
26
93
821
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
27
112
829
396,780
📢I’ll be admitting multiple PhD students this winter to Columbia University 🏙️ in the most exciting city in the world! If you are interested in dissecting modern deep learning systems to probe how they work, advancing AI safety, or automating data science, apply to my group.
6
141
547
67,981
Here’s an easy trick for improving the performance of gradient-boosted decision trees like XGBoost allowing them to read text column headers and to benefit from massive pretraining: replace the first tree with an LLM or TabPFN! 🧵 1/9
7
89
556
104,531
Self-Supervised Learning (SSL) is quickly becoming a defacto way of training neural networks, but if you have ever tried it yourself, you’d know that getting high performance is tricky! Check out our new thorough guide to all things SSL. arxiv.org/abs/2304.12210
7
78
453
76,507
I’m excited to announce that I’ll start as an assistant professor at Columbia University this summer! Interview season was fun, I met so many amazing people, but I’m happy to finally close the loop.
43
9
408
50,083
🚨Excited to announce a large-scale comparison of pretrained vision backbones including SSL, vision-language models, and CNNs vs ViTs across diverse downstream tasks ranging from classification to detection to OOD generalization and more! NeurIPS 2023🚨🧵 arxiv.org/abs/2310.19909
8
88
392
199,540
How much data are augmentations worth? We show that augmentations can actually be worth more than extra data and invariance! They increase variance across batches, and this extra stochasticity finds flatter minima. arxiv.org/abs/2210.06441 1/8
3
64
379
Gradient-boosted decision trees are still thought to be competitive with neural networks on tabular data. But NNs have a massive advantage, they learn representations, and this ability can be leveraged for transfer learning arxiv.org/abs/2206.15306. 1/4
6
40
338
🚨 Announcing LiveBench, a challenging new general-purpose live LLM benchmark! 🚨 Thanks @crwhite_ml and @SpamuelDooley for leading the charge! Link: livebench.ai/ Existing LLM benchmarks have serious limitations: 🧵
10
71
331
154,989
🚨Here’s an intuitive explanation for why training on lots and lots of data creates emergent properties, for instance math and reasoning, in large language models like #GPT-4 and #ChatGPT 🚨 1/17
6
34
266
110,750
We often determine whether a neural network is over or under parameterized by counting parameter. In practice, how much data we can fit depends on many factors: architecture, optimizer, etc. So just how flexible are neural networks in practice? 🧵 Paper: arxiv.org/abs/2406.11463
10
52
260
43,392
I’m on the faculty job market this year! Going from a late start as a math PhD student to a ML postdoc was a fun challenge. Building a research agenda alongside amazing students has been rewarding with 9 papers accepted to NeurIPS this year. Don’t let rejections get you down!
6
19
247
84,030
AI web agents like Operator and Anthropic’s Computer Use can operate a browser, but the LLMs inside are brittle, and you can’t trust what’s on the web. In this 🧵, I’ll show how adversaries can fool Anthropic’s web agent into sending phishing emails or revealing credit card info.
17
69
239
42,951
Simplicity doesn’t preclude novelty, even when the method is composed of existing parts. During the NeurIPS review period, DO NOT downgrade papers just because the method is simple. If anything, question methods which are needlessly complicated when simple solutions will do. 2/2
6
9
208
One view of ML history is that we started out with MLPs and evolved towards more specialized architectures like CNNs for vision, LSTMs for sequences, etc. But actually, the exact opposite is true! 🚨🧵1/6
2
23
211
We show that neural networks have a remarkable preference for low complexity which overlaps strongly with real-world data across modalities. PAC-Bayes proves that such models generalize, explaining why NNs are almost universally effective. arxiv.org/abs/2304.05366
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: - Kolmogorov compressor is the theoretical shortest-length program that produces a dataset. SGD is a practical approximation of the Kolmogorov search that finds an implicit program embedded in the weights of a soft computer, i.e. big Transformers. - Unsupervised learning is about computing the conditional Kolmogorov complexity of a target dataset given an unlabelled corpus, i.e. K(Y|X) - Theory tells us that optimizing for K(X, Y), the joint complexity, is as good as K(Y|X). So simply throw all data into the mix, and "just compress everything". - Joint compression is maximum likelihood over the giant concatenated dataset. - Ilya cites iGPT, Chen et al. 2020, to illustrate the ideas. iGPT is an image compressor that learns to predict the next pixel using a 1D sequence model. This is a phenomenal lecture, very accessible, and sometimes quite entertaining. YouTube: piped.video/watch?v=AKMuA_TV… Lecture page: simons.berkeley.edu/talks/il…
1
23
182
60,925
I’m thrilled to announce the first issue of a community survey on the state and future of deep learning! We asked folks their opinions on benchmarking, transformers, interpretability, theories of deep learning, and directions we should be working on. 1/3 arxiv.org/abs/2312.09323
5
32
175
37,594
Typical transfer learning pipelines involve initializing at pre-trained weights and hoping that relevant learned information magically transfers even when the weights change during fine-tuning. But you can transfer so much more than just initialization! 1/4
1
30
169
The following statement, while a commonly held view, is actually false! “Learning theory says that the more functions your model can represent, the more samples it needs to learn anything”. 1/8
OK, debates about the necessity or "priors" (or lack thereof) in learning systems are pointless. Here are some basic facts that all ML theorists and most ML practitioners understand, but a number of folks-with-an-agenda don't seem to grasp. Thread. 1/
6
20
163
The new @icmlconf review format is horrendous (no reviewer scores). Students will spend an inordinate amount of time drafting rebuttals for reviewers who have already committed to rejecting their papers. Massive waste of person hours.
1
4
154
We usually use NNs in silico, but they can also operate on analogue systems involving optics etc. You can train high-performance physical NNs that perform inference orders of magnitude faster than digital computers nature.com/articles/s41586-0…. 1/2
2
30
154
As we go into the NeurIPS reviewing process, remember to accept every paper that you think would contribute to the conference! Don’t read papers trying to find little things to criticize. Instead, try also to find the valuable pieces that the community might want to read. 1/3
2
9
146
Lazy reviewer starter pack: “needs theoretical justification”, “should cite [paper that came out a week after the submission deadline]”, “not novel enough for NeurIPS”, “more datasets, models, and baselines [that don’t apply or are in the appendix]”, “borderline accept/reject”
6
8
142
🚨Real data is often massively class-imbalanced, and standard NN pipelines built on balanced benchmarks can fail! We show that simply tuning standard pipelines beats all of the newfangled samplers and objectives designed for imbalance. #NeurIPS2023 🚨🧵1/8 arxiv.org/abs/2312.02517
3
19
143
17,809
Data scientists working with tabular data try simple linear models first and work their way through GBDT and maybe finally expressive NNs and ensembles. In contrast, vision or NLP practitioners start with the most powerful NN at their disposal. 1/2
6
6
137
🚨We converted pretrained LLMs into looped LLMs that can crank up performance by looping for more iterations. Our looped models surpass the performance of the pretrained models we started out with, showing that existing models benefit from increased computational depth. 📜1/9
10
26
150
34,567
After years of ML research and conference publications, I'm finally attending my first in-person conference this summer … as a postdoc … in my home city …🙃
2
1
130
Lessons from ICML @icmlconf: (1) Eliminate short talks, especially pre-recorded ones. (2) Poster sessions ≫ talks so allocate more time to them. (3) NO EVENTS DURING LUNCH/DINNER TIME! Poster sessions ended ~8:30pm and people went without dinner. (4) Don’t serve moldy bagels.
4
7
130
🚨Announcing Zebra-CoT, a large-scale dataset of high quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n
1
26
128
17,745
There’s a pervasive myth that the No Free Lunch Theorem prevents us from building general-purpose learners. Instead, we need to select models on a per-domain basis. Is this really true? Let’s talk about it! 🧵 1/16 arxiv.org/abs/2304.05366 w/@andrewgwils, @m_finzi, K. Rowan
9
21
121
28,240
#StableDiffusion and #ChatGPT use prompts, but hard prompts (actual text) perform poorly, while soft prompts are uninterpretable and nontransferable. We designed an easy-to-use prompt optimizer PEZ for discovering good hard prompts, complete with a demo.🧵 arxiv.org/abs/2302.03668
3
29
121
32,834
Reviewers, engage with authors during the discussion period! So many misunderstandings are waiting to be cleared up by great rebuttals, and mistakes that have now been fixed. Don’t get some unlucky grad student’s paper rejected because you were too lazy to engage!
2
12
115
If we want to use LLMs for decision making, we need to know how confident they are about their predictions. LLMs don’t output meaningful probabilities off-the-shelf, so here’s how to do it 🧵 Paper: arxiv.org/abs/2406.08391 Thanks @psiyumm and @gruver_nate for leading the charge!
2
20
112
13,786
Just made a quick plot of ICLR 2023 mean reviewer scores by percentile. To be in the top 25% of papers, you need a mean reviewer score of at least 5.67
3
9
111
I want to point out several problems (areas for improvement) in the @NeurIPSConf review process which I haven't heard talked about. (1) Do not show reviewer scores to other reviewers since these bias score changes via peer pressure (do show scores to authors and ACs). 1/3
4
5
107
Some people feel that transfer learning (TL) doesn’t apply to tabular data just because there exist unrelated domains (e.g. cc fraud vs. disease diagnosis). However, there are also adjacent tabular domains where TL makes a ton of sense (e.g. diagnosis of different diseases). 1/6
Saw a new article on transfer learning for tabular data using NNs. I don’t have the time to take a closer look, but my initial reaction is the following: 1/4
3
15
103
After a long day of ML research without any paws. Research is so ruff!
2
2
98
I want to clarify that my batch of papers is fine, as are most of the reviews. This kind of paper is an outlier, yet we disproportionately see the outliers on social media. I do think reviewers are less likely to put in enough work to evaluate technical papers. 3/3
2
2
111
39,535
I’m on the academic job market this year 🚨🥳🚨! Let me know if there are any interesting opportunities I’m likely to have overlooked or catch me at #NeurIPS2022!
1
10
93
People think hierarchical features are why NNs generalize. Has anyone formalized this? How would you verify/falsify? Historically, people thought early layers are tuned to extract low-level features (e.g. edges), while late layers learn to extract abstract ones (e.g. faces) 1/2
7
10
94
Diffusion models like #StableDiffusion and #dalle2 generate beautiful pictures, but are these images new or are they copies of the images they were trained on? 🧵 #CVPR2023 arxiv.org/abs/2212.03860
4
25
91
24,010
If my dog steps on my keyboard while I'm drafting a conference submission, is she a co-author? Some fields are very loose with authorship.
6
4
88
Data scientists who work at big tech companies benefit from free lunch in more ways than one. arxiv.org/abs/2304.05366
12
87
19,281
Small batches → noisy gradients, so this rule increases β2 as we decrease batch size, updating the optimizer state slowly and smoothing out the noise. Specifically, we fix the half-life of each gradient in the optimizer state in terms of tokens rather than iterations. 5/n
2
6
94
26,041
Paper found here: arxiv.org/abs/2208.09392 All the awesome collaborators that made this happen: @arpitbansal297, @EBorgnia, Hong-Min Chu, Jie Li, @hamid_kazemi22, @furongh, @jonasgeiping, @tomgoldsteincs 7/7
4
6
75
Why are facial recognition systems so unfair across race/gender? A lot of people think it comes from imbalanced training data, but it even happens with perfectly balanced training data. In fact, randomly initialized face rec systems are unfair too! arxiv.org/abs/2203.08235 1/3
5
17
76
In deep learning, we typically split the data, train on the training split, and evaluate on the validation split, so we only train on part of the data when we are comparing models. In contrast, the marginal likelihood tries to use data more holistically. 1/3
3
10
76
Classic Reviewer 2: “The authors have now thoroughly addressed all my concerns. I raise my score from 4 to 5.”
5
2
76
ML security research has been DOMINATED by adversarial examples/defenses for the past few years, not because it is the most important area of security but because it is easy to work on (low implementation/hardware/know-how costs). 1/2
5
4
66
Virtually all large models today contain huge matrices, and these dominate their compute. By incorporating structure in these matrices, we can improve the performance/compute tradeoff!
A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws! arxiv.org/abs/2406.06248 w/@ShikaiQiu, Andres P, @m_finzi, @micahgoldblum 1/8
4
68
8,901
Thrilled that our paper on model selection won the Outstanding Paper Award at ICML 2022. All credit goes to my great collaborators. Check out @LotfiSanae's talk and drop by our poster tomorrow!
I'm so proud that our paper on the marginal likelihood won the Outstanding Paper Award at #ICML2022!!! Congratulations to my amazing co-authors @Pavel_Izmailov, @g_benton_, @micahgoldblum, @andrewgwils 🎉 Talk on Thursday, 2:10 pm, room 310 Poster 828 on Thursday, 6-8 pm, hall E
1
2
59
🚨📢 Excited to announce the ICLR 2025 Workshop on Building Trust in LLMs and LLM Applications! 📢🚨 Submit all your papers, and we’ll see you in Singapore! There will be paper awards, and we have a stacked lineup of speakers and panelists.
1
8
62
14,888
Here's ★how to make small batch LLM training fast, ★how to pretrain LLMs efficiently via vanilla SGD without momentum, and ★why you should consider getting rid of LoRA and gradient accumulation. 2/n
1
2
62
10,348
We should value practicality and intuitiveness over novelty. Novelty often means an idea is so complicated that the reader couldn't have imagined coming up with it. That's not a good thing. That's bad. Good ideas often seem obvious in retrospect.
5
4
59
Want to learn more about data poisoning and backdoor attacks? Our survey paper (arxiv.org/abs/2012.10544) clarifies the state of the field for newbies and veterans alike! @dawnsongtweets @tsiprasd @xinyun_chen_ @ChulinXie @A_v_i__S @tomgoldsteincs @aleks_madry
13
60
We show how to make data poisoning and backdoor attacks way more potent by synthesizing them from scratch with guided diffusion. 🧵 1/8 Paper: arxiv.org/abs/2403.16365
1
10
57
8,063
What are the real reasons that NNs work so much better than other models? It sure as hell isn’t because of the “implicit bias of SGD”. Is it their inductive biases, parameter efficiency, ease of optimization? Would other models work just as well if only we could scale them up?
11
3
54
(3) Reviewers shouldn’t be allowed to click “Author Rebuttal Acknowledgement” (not writing a response to the rebuttal) if they don’t increase their score. It is important to justify to the authors why their points don’t address your feedback. 3/3
2
5
52
Transformers seem to work for all sorts of data, made possible by a shared structure that virtually all real data shares. This also allows NNs to be near-universal compressors. The real world is simple, so all we need is models with a simplicity bias. arxiv.org/abs/2304.05366
Language Modeling Is Compression paper page: huggingface.co/papers/2309.1… It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
5
52
12,295
Disclaimer – When training on a huge cluster, a large batch size may be necessary for good hardware utilization. Similarly, gradient accumulation reduces time spent communicating gradients between model copies. But most of us are poor, so listen up if you are poor too! 11/n
2
1
55
6,045
Even using deterministic degradations, training and test-time update rules that underlie diffusion models can be generalized, calling into question the orthodox understanding of diffusion and opening up research on a whole new direction of generative models. 6/7
2
6
49
Diffusion models, from DALLE to Imagen, operate by sampling random Gaussian noise and iteratively denoising/noising until they converge to a pretty picture. This simple-sounding process is underpinned by several theoretical motivations. 2/7
1
7
47
ML practitioners used to encode their beliefs about a problem, like invariances, into their architectures by hand. We show that transformers actually learn these same structures directly from the data! arxiv.org/abs/2210.02984 3/6
2
47
#NeurIPS2022 was fun! No moldy bagels like ICML, more poster sessions, fewer talks. Minor suggestions: (1) Don’t schedule poster sessions during meal times (e.g. sessions from 11am-1pm). I noticed a lot of people skipping the lunchtime poster session because they were hungry. 1/3
1
1
46
Small batch LLM training is thought to be slow per FLOP, motivating gradient accumulation to simulate larger batches, even in small-scale academic runs. We show that a simple rule for scaling Adam hyperparameters allows efficient per-FLOP training down to batch size 1. 4/n
3
47
10,545
Next time you are fine-tuning an LLM, turn off gradient accumulation and LoRA, and do full fine-tuning with Adafactor and a small batch size for great performance and lightweight hardware requirements. 10/n
1
2
45
4,321
One interpretation of diffusion models views them as score estimators, whereby noise is added to the score estimates to sample images via stochastic gradient Langevin dynamics: arxiv.org/abs/1907.05600 3/7
1
3
41
Check out our paper, with @m_finzi, Keefer Rowan, @andrewgwils, where we show just how important simplicity bias, formalized using Kolmogorov complexity, is for machine learning. The paper is easy to approach for all audiences! 16/17 arxiv.org/abs/2304.05366
1
2
44
4,692
(2) Low acceptance rates make reviewers worry that assigning a high score, thus increasing the chance of that paper being accepted, in turn decreases the chance of their own paper being accepted. After all, the same AC may be in charge of both papers. This is a bad incentive. 2/3
3
1
42
🚨NeurIPS poster Wednesday: 11-1, Hall J #512🚨 arxiv.org/abs/2106.08970 Backdoor attacks are dangerous, but existing attacks are easy to detect. We develop a backdoor attack whose poisons are indistinguishable from clean samples. Can you tell which are poisoned? 1/4
5
11
45
A second closely related interpretation views diffusion models as autoencoders with a fixed encoder that noises the image and a learned decoder that reverses this random process by approximating the reverse conditional distributions with Gaussians: arxiv.org/abs/2006.11239 4/7
1
5
40
SSL can even be used to learn an explicit prior probability distribution over parameters arxiv.org/abs/2205.10279. 8/8
1
1
41
Despite the popularity of ViTs and SSL, our benchmark suggests that the best backbones for most vision tasks are actually modern convnets (e.g. ConvNeXt) pretrained on massive labeled classification datasets. Future SSL works should train on bigger datasets to be competitive. 2/7
1
1
41
4,419
On top of hyperparameter robustness, small batch training makes training robust to the choice of optimizer too. We observe great performance with memory-efficient optimizers like Adafactor, and even vanilla SGD without momentum performs nearly as well as Adam. 7/n
2
1
43
8,721
We observe that small batch training is highly robust to optimizer hyperparameters like learning rate and momentum. This means that on a fixed hyperparameter tuning budget, you will find better hyperparameters in the small batch regime. 6/n
2
1
43
6,813
The vast majority of NNs deployed/released have no privacy guarantee. If someone ever figures out how to recover training data from trained models, they will instantly recover boatloads of private data, and there’s nothing we can do to stop it! The models are already released.
4
6
40
I will be at ICML in person next week to present our works on marginal likelihood, model inversion/explainability, and privacy breaches in federated learning. Shoot me a message if you want to connect!
2
40
Just how important are handcrafted inductive biases like CNNs for computer vision? We can just learn them! ViTs are often actually more translation invariant than CNNs after training.
CNNs are famously equivariant by design, but how about vision transformers? Using a new equivariance measure, the Lie derivative, we show that trained transformers are often more equivariant than trained CNNs! arxiv.org/abs/2210.02984 w/ @m_finzi @micahgoldblum @andrewgwils 1/6
3
38
Sometimes I wish I knew about RL, but then I remember that there's only so much time in the day and I have to prioritize, so I go back to watching Seinfeld re-runs.
41
Our new paper shows how to combine GBDTs with LLMs and TabPFN, improving their performance across a wide variety of sample sizes. Link: arxiv.org/abs/2502.02672 2/9
2
3
43
2,614
Interestingly, we find that bigger models can often be compressed into FEWER bits than smaller models, explaining why they perform better. In the future, if we can compress models better and better, we can make tighter and tighter bounds that explain why LLMs work so well. 9/9
1
3
40
7,925
What started out merely as interesting properties of NNs became the main focus of ML security. But data poisoning and privacy are far bigger threats! Training data is scraped at scale without supervision, and models are trained on user data without any privacy guarantees. 2/2
2
37
Remember that grad students worked their butts off on those papers, and they shouldn’t be rejected just because they didn’t compare to the n+1’th method that conveniently happens to be yours. 2/3
2
37
Lit reviews are super slow now that Google Scholar thinks I'm a bot.
6
1
35
Check out our easy-to-use tool for measuring equivariance via the Lie derivative. It even allows for layer-wise analysis and scales gracefully across architectures and input sizes: github.com/ngruver/lie-deriv 5/6
1
2
39
Under both of these interpretations, noise is central to why diffusion works. But such models can be used to reverse numerous other degradations, including completely deterministic ones: blur, masking, pixelation, snow-ifying, and … wait for it … animorph. 5/7
1
6
36
…So the agent follows the link. The malicious page instructs the agent to fulfill the user’s requests by filling out a form. The agent fills it out, including the address and credit card number. Sometimes the agent realizes it’s a scam but only after it already enters cc info.
1
2
38
35,722
How long does it take you to read or review a paper on average? Just reading a paper in full detail takes me hours, so unless I’m way slower than everyone else, I assume most reviewers are just skimming their papers.
7
38
It’s pathetic when ML conferences raise the acceptance cutoff in order to make the conference look prestigious. If 80% of papers are amazing, then accept them, especially in cases where the conference can easily host more papers.
2
37
5,267
We see similar trends for fine-tuning and pretraining. The LoRA parameterization is common for fine-tuning, reducing the memory footprint of the optimizer state. We find that full parameter fine-tuning with Adafactor performs better and has a similar memory cost. 9/n
1
1
36
5,136
Tabular deep learning is still in its infancy. Lots of room for improvement!
A Short Chronology Of Deep Learning For Tabular Data: sebastianraschka.com/blog/20… Deep tabular methods are an interesting research direction! So, this morning, I sat down and summarized my thoughts + the recent papers I read.
1
3
35