codegen @ FAIR, prev. DINO stuff @ INRIA & FAIR

1/ This week we released DINOv2: a series of general vision encoders pretrained without supervision. Good out-of-the-box performance on a variety of domains, matching or surpassing other publicly available encoders.
Vision transformers need registers! Or at least, it seems they 𝘸𝘢𝘯𝘵 some… ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”. Just add new tokens (“[reg]”): - no artifacts - interpretable attention maps 🦖 - improved performance!
Thanks python, very helpful
"Massive activations in LLMs" is the paper you need and that everyone should read
what happens in the residual stream of gemma3? l2 norm of activation explodes at the end of every transformer block after x=x+res. key architectural difference between gemma2 and 3 is softcapping vs qknorm. 1b is not even multimodal (fig reps gemma2-2b vs 3-1b). what's wrong?
Want strong SSL, but not the complexity of DINOv2? CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
DINOv2+registers=♥️ We are releasing code and checkpoints for DINOv2 augmented with registers and a slightly better training recipe. No more of those pesky artifacts! Simple one-liner, try it out: dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
Is there a good reason we use softmax losses in contrastive learning, instead of just doing MSE? ie L = ||xi-xi'||² - lambda sum_k ||xi-xk'||² I'd guess the optimization dynamics are maybe friendlier, but does anyone have a good pointer? Both for CLIP and SSL btw
Also: yes, it's a JEPA. Yes, you hated on @ylecun , but he was right. Yes, as usual
Want strong SSL, but not the complexity of DINOv2? CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
Qq has anyone ever seen the best AI researcher and the best sion euw in the same room because if not guys I've got a theory
Funniest bug of my phd: model loses 1 point if pretrain and eval use different conda env The difference was libjpeg vs libjpeg-turbo iiuc the jpeg algo is not entirely standardized (wtf?) and libjpeg != libjpeg-turbo Tiny differences in decoding artifacts caused a 1 point drop!
if you train a model exclusively on JPEG images, will performance drop on other image file formats?
Still not sure why the ML community adopted conda instead of plain old virtualenv
Alright actual serious post. Lingua := super simple codebase + torch.compile for speed --> clean, hackable, but still efficient *It can train a 7B >llama2 in 24h*. Crazy. If you got the gpus, not only can you train a good 7B, you can *iterate* on it. You can do *research*
🚨 RELEASE ALERT ‼️ github.com/facebookresearch/… THIS CHANGES EVERYTHING $META just dropped a game-changing codebase! Now everyone can do LLM research! 😱 🧵10 best things people are already building with lingua 🔥👇
I did not realize people used frameworks for simple distributed training. Tip: for 80% of training runs you just need DDP, and it's trivial to set up. For the rest, go with FSDP (either PyTorch FSDP2 or the single-file FSDP in the CAPI repo)
wait. distributed training with pure pytorch is not that bad. why did we all collectively get gaslit into using accelerate...
Mistral's "Le Chat" logo is a design masterclass The two dots make a smol cat
Bonus trick: you can remove the gradient reduction of the first backward (which is useless) by wrapping it in no_sync(). Remember to also include the forward pass in the no_sync context, else it does not work
This simple pytorch trick will cut your GPU memory use in half / double your batch size (for real). Instead of adding losses and then computing backward, it's better to compute the backward on each loss (which frees the computational graph). Results will be exactly identical
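A minimal sketch of the per-loss backward trick, assuming the losses come from separate forward passes (e.g. several crops or micro-batches). The toy model, data and function names are made up for illustration. It only checks that the gradients match; the memory saving comes from each graph being freed right after its own backward call.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                     # toy model (illustrative)
batches = [torch.randn(4, 8) for _ in range(3)]   # stand-ins for crops / micro-batches


def grads_sum_then_backward():
    """Add all losses, then one backward: every graph stays alive until the end."""
    model.zero_grad()
    total = sum(model(x).pow(2).mean() for x in batches)
    total.backward()
    return [p.grad.clone() for p in model.parameters()]


def grads_backward_per_loss():
    """Backward on each loss: its graph is freed before the next forward pass."""
    model.zero_grad()
    for x in batches:
        loss = model(x).pow(2).mean()
        loss.backward()  # gradients accumulate into p.grad
    return [p.grad.clone() for p in model.parameters()]


ref = grads_sum_then_backward()
new = grads_backward_per_loss()
# Compared with a small tolerance, since the summation order differs.
same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(ref, new))
print(same)
```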
SIGBOVIK 2025 is out A live thread of papers that make me lightly smile: 1/ UPPERCASE IS ALL YOU NEED
Hey I'm a doctor now, neat
🚨New doctor in the house!🚨 Congrats to @TimDarcet for his tremendous work (DINOv2, registers, CAPI) & successful PhD defense followed by ~2 hrs of questions -- he's got stamina! Congrats to his incredible team of advisors from Inria & Meta: @julienmairal @p_bojanowski M. Oquab
In case there is any ambiguity: DINOv2 is 100% a product of dumb hill-climbing on ImageNet-1k knn accuracy (and linear too) Overfitting an eval can be bad. But sometimes the reward signal is reliable, and leads to truly good models. It's about finding a balance
Replying to @lucasmaes_ @jxmnop
Oh I am a big fan of self supervised learning. Also ssl has never been benchmark maxing on imagenet afaik. I am mainly complaining about the supervised classification imagenet hill climb
hey we heard you liked dinov2 so we got you more of the same shit dinov3 is like dinov2 in the sense that it's much better than the things before rumor has it that plugging dinov3 on your benchmark is a low hanging sota but be quiet im not supposed to tell
Introducing DINOv3 🦕🦕🦕 A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale. High quality dense features, combining unprecedented semantic and geometric scene understanding. Three reasons why this matters…
The gaussian mixture fits MNIST in like 3 iterations and the fit is super good maybe EM GMM is all we needed after all
lfg it's fitting
If you need a replacement for an example image in a CV paper, you know what to do
Just FYI, computer vision papers submitted to IEEE that include this image of Ms. Forsén will no longer be considered for publication
Intriguing new property: on some images, the different registers naturally adopt a “slot attention-like” behavior, each attending to a different object! Needless to say, this was never required of the model (or even encouraged). Cool future research direction!
In case some of you were (like me) curious about this stat for AI conferences: here it is for ICLR2024
Statistics from @ICSE2024. Authors submitting, *each*, 33, 27, 24, ... papers. Interactive dashboard: app.powerbi.com/view?r=eyJrI…
ViT need registers got an outstanding paper award! Many thanks to the committee for the honor
Some people are still not using Fréchet DINOv2 distance?
Oh wow, FID is fragile...
I realized I have a strong opinion on experiment management that not everybody shares: when I launch an experiment, I want **zero** parameters on the command line. **All** information should be committed to the repo for full reproducibility. The only command is `./<scriptname>.sh`
Replying to @davnords
I've come to really not like submitit honestly. In practice what I was doing in my codebase is just writing my own sbatch files, and every exp is a different script (which is committed to the repo)
Our hypothesis is: the model recognizes useless patches, discards the info in them, and uses them as 𝘢𝘨𝘨𝘳𝘦𝘨𝘢𝘵𝘰𝘳𝘴 𝘰𝘧 𝘨𝘭𝘰𝘣𝘢𝘭 𝘪𝘯𝘧𝘰𝘳𝘮𝘢𝘵𝘪𝘰𝘯.
Hey! If you are using DINOv2, whether in a startup, in research or whatever, could you send me a DM? I want your feedback on the model. Reward for you? Simple: next model is gonna be 𝘦𝘷𝘦𝘯 𝘮𝘰𝘳𝘦 suited to your needs 👌
Current state of neurips abstract submissions This neurips is gonna be crazy
With satellite imagery, it’s hard to get labels. Solution? DINOv2! WRI+Meta trained a satellite DINOv2 for tree height estimation. They created an interactive map of tree height of the whole globe (!) at 1-meter res (!): meta-forest-monitoring-okw37… Quiz: Can you recognize this city?
What I mean when I say “registers”: additional learnable tokens (like the [CLS]), but these ones are not used at output. No additional info at input, not used at output: these tokens could seem useless!
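A rough shape-level sketch of that idea, assuming a single softmax self-attention step as a stand-in for the transformer blocks. The sizes, the random "learnable" tokens and the `encoder` function are all hypothetical; this is not the DINOv2 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_patches, n_registers = 16, 9, 4            # illustrative sizes

cls_token = rng.normal(size=(1, dim))             # learnable in a real model
registers = rng.normal(size=(n_registers, dim))   # learnable, carry no input info
patch_tokens = rng.normal(size=(n_patches, dim))  # would come from the patch embedding


def encoder(tokens):
    """Stand-in for the transformer blocks: one softmax self-attention step."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens


# Input: [CLS] + registers + patches, all attending to each other.
x = np.concatenate([cls_token, registers, patch_tokens], axis=0)
y = encoder(x)

# Output: keep [CLS] and the patch tokens, throw the registers away.
cls_out = y[0]
patch_out = y[1 + n_registers:]
print(x.shape, cls_out.shape, patch_out.shape)
```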
These visuals really highlight super well the differences between DINOv2 and CLIP: the latter has these text-induced abstractions that span across visual concepts, while the former has more advanced geometric concepts
Replying to @HThasarathan
Our method reveals model-specific patterns too: DinoV2 (left) shows specialized geometric features (depth, perspective), while SigLIP (right) captures unique text-aware visual concepts: This opens new paths for understanding model differences! (7/9)
Summary of "Massive activations in LLMs": - "artifact" tokens are in all transformers, ViTs and LLMs - their weirdness is ~only on 1 channel - they are the same as the quantization outliers - their purpose is *not* global information - there's a fix simpler than registers
Replying to @gaur_manu
Could you give a summary for all the lazy readers who won't open the link?
echo "echo 'sleep 0.5' >> ~/.bashrc" >> ~/.bashrc
Every time a colleague of mine does not lock their laptop, I add something to their .bashrc. alias vim='nano' is a good one, but moving file to a random folder is even funnier. rm is too evil, don't do it!

Very happy to say DINOv2 got outstanding certification finalist at TMLR! The models had an amazing reception already, but this kind of award is the cherry on top 😁
Replying to @TmlrOrg
Outstanding Finalist 2: “DINOv2: Learning Robust Visual Features without Supervision," by Maxime Oquab, Timothée Darcet (@TimDarcet), Théo Moutakanni (@TheoMoutakanni) et al. 5/n
ICLR results are out so it's bragging time: ViT need reg got an oral and very good scores (top-15), so that's cool. Thanks a lot to the reviewers who found it good. If you want to try a model with registers, we published some DINOv2 checkpoints earlier:
DINOv2+registers=♥️ We are releasing code and checkpoints for DINOv2 augmented with registers and a slightly better training recipe. No more of those pesky artifacts! Simple one-liner, try it out: dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
I am once again asking you to use einops for this kind of operation
Never thought I'd see a transpose with 8 numbers
PSA: when someone asks you a question including words such as "false positive rate", 𝗱𝗼 𝗻𝗼𝘁 𝗮𝗻𝘀𝘄𝗲𝗿 𝗿𝗶𝗴𝗵𝘁 𝗮𝘄𝗮𝘆. Simply state that you know your rights, and go on wikipedia to consult the 𝔐𝔞𝔡 𝕮𝔬𝔫𝔣𝔲𝔰𝔦𝔬𝔫 𝔐𝔞𝔱𝔯𝔦𝔵 𝔬𝔣 𝕳𝔢𝔩𝔩
Fewer than 1 in 5 doctors can correctly answer a basic question about statistics
But in fact, the model learns to use them. And they work quite well: a single register entirely fixes the attention maps, and gives a boost on downstream tasks. Adding more further increases the scores a bit. We improve upon DINOv2, which was already quite stronk 💪
Worth mentioning that this clustering idea does not come from nowhere: The iBOT head comes from the DINO head, which comes from the SwAV prototypes, which is an online version of the DeepCluster clustering We've been doing clustering all along
Replying to @TimDarcet
2. Loss? “DINO head”: good results, too unstable Idea: preds and targets have diff. distribs, so EMA head does not work on targets → need to separate the 2 heads So we just use a clustering on the target side instead, and it works
Re generative NN: To bypass the intractable log-likelihood, you can either: - optimize the wrong objective and hope it will work (GAN/VAE/diffusion) - use a NN that you can invert (surprisingly easy) Is this right?? Is there a downside to coupling flows?? Do they work??
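For the "NN that you can invert" option, here is a minimal sketch of one affine coupling layer (the building block used in RealNVP-style flows). The tiny linear "networks" `W_s` and `W_t` are placeholders. The point is only that the layer stays exactly invertible with a cheap log-determinant, whatever function computes the scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                # input dimension, split in two halves
W_s = rng.normal(scale=0.1, size=(d // 2, d // 2))   # toy "network" for the scale
W_t = rng.normal(scale=0.1, size=(d // 2, d // 2))   # toy "network" for the shift


def scale_and_shift(x1):
    """Arbitrary (non-invertible) functions of the untouched half."""
    return np.tanh(x1 @ W_s), x1 @ W_t


def forward(x):
    x1, x2 = x[:, : d // 2], x[:, d // 2:]
    s, t = scale_and_shift(x1)
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=1)            # log |det Jacobian| of the layer
    return np.concatenate([x1, y2], axis=1), log_det


def inverse(y):
    y1, y2 = y[:, : d // 2], y[:, d // 2:]
    s, t = scale_and_shift(y1)         # y1 equals x1, so s and t can be recomputed
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=1)


x = rng.normal(size=(5, d))
y, log_det = forward(x)
x_rec = inverse(y)
print(np.allclose(x, x_rec))
```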
Do check out the paper! It’s got much more detail than I can give here. Thanks to Maxime Oquab, Julien Mairal and Piotr Bojanowski who were patient enough to work with me, and competent enough to compensate for my mistakes 😅. arxiv.org/abs/2309.16588
fuck your fancy personal page template im rawdoggin the html and you wont even make me use css
Actually the accept rate decreases monotonically with number of 1st author submissions: the more prolific the first author is, the lower the quality of their paper.
The acceptance rate among aspiring ICLR2024 first authors who submitted >= 4 papers was 15%! Contrast that with the base acceptance rate that year: 30.5%. Unsettling.
It's wild how few major modifications to the Transformer architecture took off. Most improvement papers stayed under the radar. Makes me wonder how many niche innovations actually work, but got lost because of community momentum
As expected, that was popular. Here is my attempt at consolidating all the answers into a list. - Prenorm: normalization in the residual blocks before the attention operation and the FFN respectively - GQA (Group Query Attention): more Q than (K, V)
“Fine with me if you need global aggregators, but please don’t do this in my feature maps. I need those for downstream tasks! Here, have a few registers instead” - historical reconstruction of how it happened
Also of note: the repo contains a single-file fully standalone implem of FSDP in 500 LOC by the goat @fvsmassa . There's 0 guarantee associated with it, but if you want an understandable implem of FSDP it's a good one (and no it's not slow I've got 58% MFU lfg)
Replying to @TimDarcet
nice work !! Did you roll out your own fsdp implementation? github.com/facebookresearch/…
This starts with a very simple observation: ~all ViTs have attention maps focused on a few seemingly random patches. DINO has clean attention maps, sure, but then why did the artifacts reappear in DINOv2? What 𝘢𝘳𝘦 these artifacts?
Okay this uiua thing is actually pretty fun
uiua goes unbelievably hard wtf array-orientated, stack based, glyph programming language and now I wanna make the game of life in it this weekend
Hey guys quick update vision transformers don't need registers after all brb gotta test some stuff
LLMs are great, but their internals are less explored. I'm excited to share very interesting findings in paper “Massive Activations in Large Language Models” LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)
You may not like it, but this is what peak personal page looks like
fuck your fancy personal page template im rawdoggin the html and you wont even make me use css
🚨 RELEASE ALERT ‼️ github.com/facebookresearch/… THIS CHANGES EVERYTHING $META just dropped a game-changing codebase! Now everyone can do LLM research! 😱 🧵10 best things people are already building with lingua 🔥👇
Open science is how we continue to push technology forward and today at Meta FAIR we’re sharing eight new AI research artifacts including new models, datasets and code to inspire innovation in the community. More in the video from @jpineau1. This work is another important step towards our goal of achieving Advanced Machine Intelligence (AMI). What we’re releasing: • Meta Spirit LM: An open source language model for seamless speech and text integration. • Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. Plus a new developer suite to make it easier for developers to build with SAM 2. • Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance. • SALSA: New code to enable researchers to benchmark AI-based attacks in support of validating security for post-quantum cryptography. • Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale. • Meta Open Materials: New open source models and the largest dataset of its kind to accelerate AI-driven discovery of new inorganic materials. • MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder with coverage across 80 languages. • Self-Taught Evaluator: a new method for generating synthetic preference data to train reward models without relying on human annotations. Access to state-of-the-art AI creates opportunities for everyone. We’re excited to share this work and look forward to seeing the community innovation that results from it. Details and access to everything released by FAIR today ➡️ go.fb.me/hgtkel
Thanks @_akhaliq and @arankomatsuzaki for featuring our paper! It's great to see it 1st on the trending list on HF papers 😁 huggingface.co/papers
Vision Transformers Need Registers paper page: huggingface.co/papers/2309.1… Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
We find a few properties of these artifacts. 1. They appear on patches with useless information (redundant to their neighbors). 2. They contain little information about the original patch. It “forgot” its original value!
Happy to share that DINOv2 was accepted at TMLR! A special thanks to the reviewers and action editor. I found the review process to be actually pleasant and constructive. I believe that right now TMLR is possibly the best place to publish in ML
DINOv2: Learning Robust Visual Features without Supervision Maxime Oquab, Timothée Darcet, Théo Moutakanni et al.. Action editor: Abhishek Kumar. openreview.net/forum?id=a68S… #supervised #visual #features
I also view layernorm as hyperplane proj + hypersphere proj Hyperplane proj makes no sense, hence we do RMSnorm now Although don't forget the epsilon. We project onto the hyper*ball* actually
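A small numerical check of that geometric reading, assuming LayerNorm and RMSNorm without the learnable gain and bias. The functions below are plain re-implementations for illustration, not library code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 8, 1e-5
x = rng.normal(size=d) * 3.0 + 1.0


def layernorm(x, eps):
    centered = x - x.mean()                                 # projection onto the hyperplane sum(x) = 0
    return centered / np.sqrt(np.mean(centered ** 2) + eps)  # rescale towards radius sqrt(d)


def rmsnorm(x, eps):
    return x / np.sqrt(np.mean(x ** 2) + eps)               # no centering, only the rescaling


ln, rms = layernorm(x, eps), rmsnorm(x, eps)
radius = np.sqrt(d)

print(abs(ln.sum()) < 1e-9)             # LayerNorm output lies on the hyperplane
print(np.linalg.norm(ln) < radius)      # eps keeps it strictly inside the sphere
print(np.linalg.norm(rms) < radius)     # same for RMSNorm: a ball, not a sphere
```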
Absolutely gold article. Changed the way I see Layer Norm
Llamadrama being discussed in public?
You misread. There had been multiple LLM projects within FAIR for years. Some were open sourced as research prototypes (e.g. OPT175B, Galactica, BlenderBot...). In mid-2022, FAIR started a large LLM project called Zetta, which was still going in late 2022 when ChatGPT came out. A small group at FAIR-Paris was working on theorem proving. They needed an LLM for their own purpose and thought Zetta was too big and not ready. They developed their own model, which eventually became Llama-1. What happened internally between Zetta and Llama is somewhat similar to what just happened between DeepSeek and the big US players: a small team of talented folks innovated and beat the large teams.
On the other hand, the output tokens seem to contain 𝗹𝗼𝘁𝘀 of global information. We probe on a few different classification datasets. We find that these tokens contain much more class information than other patch tokens, and almost as much as the [CLS]!
Guess what model Depth Anything V2 is based on? 🦖🦖🦖 (yes, I only have one tune. No, I won't stop)
Replying to @giffmana
Depth anything V2 one shots this problem btw. All it requires is an algorithm to create a coherent world via past imagery and depth calculations.
Object counting is a surprisingly unsolved problem so far, especially in terms of foundation models. AFAIK DINOv2 and CLIP-style models fail pretty hard. Of course, the VLMs on top can't do better than the encoder so they also fail there. One of the remaining things to solve
What's a vision model I can use to count toy pieces like this? GPT-4o tells me things like, "um, about 20", or counts incorrectly. Bonus points if it's easy to use via API and can work with plain english prompts
Do try out the new depth estimation parallax view, it's trippy
Thanks to DINO's nice attention maps, the model's behavior is quite interpretable! That's really cool
Another banger by @TheoMoutakanni : RayDINO, a DINO for chest X-ray. Excellent results on a ton of benchmarks with the frozen model, with great generalization and low bias. Check it out! arxiv.org/abs/2405.01469
6/ With these capabilities emerge new interesting properties. A very nice one is the ability to perform semantic keypoint matching between images simply by matching the closest features. This works across very different domains !
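"Matching the closest features" can be sketched as a nearest-neighbour search under cosine similarity. The features below are synthetic (image A's patches are noisy copies of some of image B's), so this only illustrates the matching step, not the quality of any real encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, dim = 6, 10, 32                       # illustrative sizes

feats_b = rng.normal(size=(n_b, dim))           # patch features of image B
# Synthetic setup: each patch of A is a noisy copy of a distinct patch of B.
true_match = rng.choice(n_b, size=n_a, replace=False)
feats_a = feats_b[true_match] + 0.05 * rng.normal(size=(n_a, dim))


def match(fa, fb):
    """For each feature in fa, return the index of the most similar feature in fb."""
    fa = fa / np.linalg.norm(fa, axis=1, keepdims=True)
    fb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
    sim = fa @ fb.T                             # cosine similarity matrix
    return sim.argmax(axis=1)


pred = match(feats_a, feats_b)
print((pred == true_match).all())
```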
lfg it's fitting
Damn expectation-maximisation of a GMM got hands (it's the easiest algo in stat learning im just bad)
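For reference, a bare-bones EM loop for a Gaussian mixture. It is a toy sketch on synthetic 1-D data with two components (not MNIST), and the initial guesses are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

k = 2
means = np.array([-1.0, 1.0])        # rough initial guess
variances = np.ones(k)
weights = np.full(k, 1.0 / k)

for _ in range(20):
    # E-step: responsibilities resp[n, j] = p(component j | x_n), computed in log space
    log_p = (-0.5 * (data[:, None] - means) ** 2 / variances
             - 0.5 * np.log(2 * np.pi * variances)
             + np.log(weights))
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted maximum-likelihood updates
    nk = resp.sum(axis=0)
    weights = nk / len(data)
    means = (resp * data[:, None]).sum(axis=0) / nk
    variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk

print(means.round(2), variances.round(2), weights.round(2))
```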
2/ As opposed to other recent SSL works, the goal is to provide vision encoders that work off-the-shelf, without any fine-tuning. In this setup, we improve significantly over previous SSL works, and even match or surpass CLIP-type models on a variety of tasks
Published my first paper, and my second one. I like them. I used to feel anxious about not being able to publish anything. It's getting better.
BRAG ABOUT SOMETHING YOU’RE PROUD OF ACCOMPLISHING IN 2023 ✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨
Hey everyone, will be in Milan for ECCV until Thursday. Always up for a chat! Also I'm gonna need a job after my PhD ends in a few months so if you have some opportunities I'm interested 😇
Replying to @giffmana
To my knowledge it's common to just ditch the torch scheduler, use an array of learning rates, and at each iter do something equivalent to `optim.set_lr(lrs[iter])` Eg github.com/facebookresearch/…
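A sketch of that pattern, assuming a warmup + cosine schedule (the schedule shape and all values are illustrative). `set_lr` is a hypothetical helper, not a torch method: it writes into `param_groups`, which is the layout `torch.optim` optimizers expose. A `SimpleNamespace` stands in for the optimizer so the snippet runs without torch.

```python
import math
from types import SimpleNamespace


def cosine_with_warmup(base_lr, final_lr, total_iters, warmup_iters):
    """Precompute one learning rate per iteration."""
    lrs = []
    for it in range(total_iters):
        if it < warmup_iters:
            lrs.append(base_lr * (it + 1) / warmup_iters)        # linear warmup
        else:
            progress = (it - warmup_iters) / max(1, total_iters - warmup_iters - 1)
            lrs.append(final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress)))
    return lrs


def set_lr(optimizer, lr):
    """Hypothetical helper: overwrite the lr of every parameter group."""
    for group in optimizer.param_groups:
        group["lr"] = lr


# Stand-in object with the same `param_groups` layout as a torch optimizer.
optim = SimpleNamespace(param_groups=[{"lr": 0.0}])
lrs = cosine_with_warmup(base_lr=1e-3, final_lr=1e-5, total_iters=100, warmup_iters=10)

for it in range(100):
    set_lr(optim, lrs[it])
    # forward / backward / optimizer step would go here
```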
ctrl+enter, run an epoch scroll twitter for 30s ctrl+enter, run another epoch scroll twitter for 30s the loss is not going down any more, divide lr by 3 ctrl+enter, run another epoch ...
Lmao they waited for the 405B release just to be able to 1-up it
Apple would be no.1 >>>all if they had just looked into LLMs for Siri 5 years ago Biggest blunder in the field so far
I don’t get why Apple hasn’t incorporated Whisper. They should be funding a team of, say, 5 researchers systematically iterating on it. Apple should have the world’s best voice-to-text and TTS models.
As always thank you to all the people who helped me, @BaldassarreFe , Maxime Oquab, @julienmairal , @p_bojanowski With this I’m wrapping up my PhD. An amazing journey, thanks to the excellent advisors and colleagues! And also I’m looking for a job lmao
Some plots are worth it just for the aesthetics
Another banger by @TheoMoutakanni : RayDINO, a DINO for chest X-ray. Excellent results on a ton of benchmarks with the frozen model, with great generalization and low bias. Check it out! arxiv.org/abs/2405.01469
Mfw the shitpost with 7 people in the target audience gets no engagement

PLONKKK
🚀 DinoV3 just became the new go-to backbone for geoloc! It outperforms CLIP-like models (SigLip2, finetuned StreetCLIP)… and that’s shocking 🤯 Why? CLIP models have an innate advantage — they literally learn place names + images. DinoV3 doesn’t.
Quite a few people have been asking me "can registers work with LLMs?" Here is a paper that says yes !
Think before you speak: Training Language Models With Pause Tokens - Performing training and inference on LMs with a learnable pause token appended to the input prefix - Gains on 8 tasks, e.g. +18% on SQuAD arxiv.org/abs/2310.02226
AIMv2 looks great! When SSL and text-supervised training both work so well, it was inevitable that combining both would be a great idea Big congrats to @DonkeyShot21 @alaa_nouby @MustafaShukor1 and team!
We release AIMv2, the second iteration of the AIM family of large autoregressive vision encoders. This time we bring multimodality into the game 🔥 Paper: arxiv.org/abs/2411.14402 Repo: github.com/apple/ml-aim Model Gallery: huggingface.co/collections/a…
2. Loss? “DINO head”: good results, too unstable Idea: preds and targets have diff. distribs, so EMA head does not work on targets → need to separate the 2 heads So we just use a clustering on the target side instead, and it works
Replying to @JFPuget
Chinese room argument? IMO consciousness is super badly defined but it may be something that emerges at a system level. The same way a collection of individual cells are "conscious", a collection of "not conscious" elements (layers etc) might be conscious
Btw for those at ECCV come to the DINOv2 demo in the Meta stand @TheoMoutakanni is showing the literal "map of every tree" (spoiler it's pretty cool)
Big news on the DINOv2 side! - Apache2 license (commercial use) - Releasing the segmentation and depth heads - significantly updated demo, with keypoint matching! - New fairness evaluations on FACET
The viennese street artists are a different breed
Language modeling solved NLP. So vision people have tried masked image modeling (MIM). The issue? It’s 𝘩𝘢𝘳𝘥. BeiT/MAE are not great for representations. iBOT works well, but is too unstable to train without DINO. →Pure MIM lags behind DINOv2
The biggest step change in the DINOv2 project was a skillful yolo run by Maxime yoloing is a dangerous but powerful weapon
In AI research there is tremendous value in intuitions on what makes things work. In fact, this skill is what makes “yolo runs” successful, and can accelerate your team tremendously. However, there’s no track record on how good someone’s intuition is. A fun way to do this is “betting”, where researchers try to predict the results of an experiment, or whether an approach would ultimately be successful. When I was at Google Brain in 2022, I made a bet for what accuracy a 540B-parameter LLM would get on mate-in-one in chess after finetuning. I had great fun asking my friends to participate—their predictions ranged from 10% to 80% (I think it ended up being around 30%). I particularly enjoyed a few bets with @LiamFedus (now my manager at OpenAI). Back in the day when we were writing a paper on emergent abilities, we bet on whether he would be able to predict the final accuracy of a task based on the log-prob trends from smaller models, and I won that one. More recently, we had a bet on how much data would be needed for a model to reach a certain performance, and I lost that bet by an order of magnitude. It was a nice ego check for me. (Bro tip: if you bet a dinner, specify the price range before you lose) Having a track record holds you to be accountable for intuitions and helps you remember when you were wrong. The best researchers excite their peers about only a few things, and some of those things work well in a big way. You don’t want to be excited about everything, but then only a small portion of those things actually work. Finally, I think there is also a lot of value in correctly predicting that a research direction won't go well—these “negative bets” aren’t typically rewarded in today’s culture, but I believe there is a lot of value in saving your team time.
In case you haven't got it yet: google scholar pdf reader extension for chrome is a must chromewebstore.google.com/de…
Highly recommend scholar inbox it's the highest SNR for papers that fit your tastes / topic
Dropout: A Simple Way to Prevent Neurons from Depression
Next week I'll be talking about registers, what they are and why we need them, at Cohere for AI! More info: cohere.com/events/c4ai-Timot…
Next week on Wednesday, February 7th, our Geo-Regional Asia Group is excited to welcome Timothée Darcet, PhD student, building large vision models at @Meta AI (FAIR) & @Inria to present "Vision Transformers need Registers." Learn more: cohere.com/events/c4ai-Timot…
Very clear and simple tutorial on how to use DINOv2 as an image featurizer. Check it out !
DINOv2, a SOTA ViT trained by @Meta on 142 million images, is now part of 🤗 Transformers! It's one of the strongest vision backbones at the moment, so I created a tutorial on training a linear classifier on top of it for semantic segmentation, using DINOv2's frozen features 1/2
Code and weights are Apache2 so don’t hesitate to try it out! If you have torch you can load the models in a single line anywhere The repo is a flat folder of like 10 files, it should be pretty readable
Contrastive losses in general push the model to use the whole space. In DINOv2 we used the specific KoLeo loss, which pushes the embedding distribution towards higher entropy. Higher entropy --> uniform distribution (on the hypersphere) --> full usage of the space
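A minimal sketch of a KoLeo-style regularizer, assuming the usual Kozachenko-Leonenko form: minus the mean log distance from each L2-normalized embedding to its nearest neighbour in the batch. The `eps` value and the toy batches are assumptions for illustration.

```python
import numpy as np


def koleo_loss(x, eps=1e-8):
    """-mean(log(nearest-neighbour distance)) over L2-normalized embeddings."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # ignore the distance of a point to itself
    nearest = dists.min(axis=1)
    return -np.mean(np.log(nearest + eps))


rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 16))                               # roughly uniform on the sphere
clumped = np.ones((64, 16)) + 0.01 * rng.normal(size=(64, 16))   # nearly collapsed batch

# A spread-out batch gets a lower loss than a clumped one.
print(koleo_loss(spread) < koleo_loss(clumped))
```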
Okay caveat of my last post: maybe those are all middle authorship? Let's look at the same plot but only for _first_ and _last_ authors. First authors: (1/2)
In case some of you were (like me) curious about this stat for AI conferences: here it is for ICLR2024
Wait till they hear about selective checkpointing github.com/facebookresearch/…
Gradient Checkpointing is the single most effective way of reducing GPU memory footprint. This thing is fantastic! Am I missing something, or is it that good?
Always check the image normalization! It can completely change results. eg CLIP uses its own specific norm, and openclip uses either the CLIP values or the inception values depending on the model. When in doubt, often you can check in timm
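To see how much the choice matters, here is a small sketch comparing three common mean/std conventions on the same pixel values. The constants are the widely used ImageNet, OpenAI CLIP and 0.5/0.5 "inception" values. The helper and the test image are illustrative, and any given checkpoint should still be checked against its own config.

```python
import numpy as np

# (mean, std) per RGB channel, for inputs scaled to [0, 1]
NORMS = {
    "imagenet": ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    "clip": ((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    "inception": ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
}


def normalize(img, name):
    """img: array of shape (H, W, 3) with values in [0, 1]."""
    mean, std = NORMS[name]
    return (img - np.array(mean)) / np.array(std)


img = np.full((2, 2, 3), 0.8)                 # a flat light-gray test image
for name in NORMS:
    print(name, normalize(img, name)[0, 0].round(3))
```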
Replying to @gabriberton
Notable models that use non-imagenet norm are Dust3r, OpenIBL, many image matching models, and some (many?) remote sensing models. This is an issue when you create a fair codebase to benchmark multiple models (where ideally you can simply swap the model to compute the results).
Qualitatively the features are pretty good imo DINOv2+reg still has artifacts (despite my best efforts), while the MAE features are mostly color, not semantics (see shadow in the first image, or rightmost legs in the second) CAPI has both semantic and smooth feature maps
So the reason I was asking about this is because the squared L2 has the very pleasant property of reducing to "just push away from the avg" and that would eliminate all batch size issues (you can use an EMA avg). It's basically what DINO does, w/ softmax+CE loss instead of L2
Is there a good reason we use softmax losses in contrastive learning, instead of just doing MSE? ie L = ||xi-xi'||² - lambda sum_k ||xi-xk'||² I'd guess the optimization dynamics are maybe friendlier, but does anyone have a good pointer? Both for CLIP and SSL btw
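The "just push away from the avg" property can be written out. This is a short derivation assuming $K$ negatives $x_k'$ with mean $\bar{x}'$, a symbol introduced here for convenience:

```latex
\begin{align*}
\bar{x}' &= \frac{1}{K}\sum_{k=1}^{K} x_k' \\
\sum_{k=1}^{K} \|x_i - x_k'\|^2
  &= \sum_{k=1}^{K} \|(x_i - \bar{x}') - (x_k' - \bar{x}')\|^2 \\
  &= K\,\|x_i - \bar{x}'\|^2
     - 2\,(x_i - \bar{x}')^\top \underbrace{\sum_{k=1}^{K} (x_k' - \bar{x}')}_{=\,0}
     + \sum_{k=1}^{K} \|x_k' - \bar{x}'\|^2 \\
  &= K\,\|x_i - \bar{x}'\|^2 + \sum_{k=1}^{K} \|x_k' - \bar{x}'\|^2
\end{align*}
```

The last sum does not depend on $x_i$, so as a function of $x_i$ the repulsive term $-\lambda \sum_k \|x_i - x_k'\|^2$ equals $-\lambda K \|x_i - \bar{x}'\|^2$ plus a constant. Only the average of the negatives matters, and that average could be tracked with an EMA.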
Let’s dissect the anatomy of a masked image model a bit. 1. take an image, convert its patches to representations. 2. given part of this image, train a model to predict the content of the missing parts 3. measure a loss between pred and target
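The three steps can be sketched on a toy example. Everything here is a placeholder: the "representations" are raw flattened pixels, the "model" just predicts the mean of the visible patches, and the loss is a plain MSE. It is not the CAPI recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))           # toy grayscale image
p = 4                                # patch size, giving a 2x2 grid of patches

# 1. image -> patches -> representations (here simply the flattened pixels)
patches = image.reshape(2, p, 2, p).transpose(0, 2, 1, 3).reshape(4, p * p)

# 2. hide part of the image and predict the hidden part from the visible part
mask = np.array([False, True, False, True])       # True = masked patch
visible = patches[~mask]
# Placeholder predictor: the mean of the visible patches, repeated per masked patch.
pred = np.repeat(visible.mean(axis=0, keepdims=True), mask.sum(), axis=0)

# 3. loss between prediction and target, on the masked positions only
target = patches[mask]
loss = np.mean((pred - target) ** 2)
print(patches.shape, pred.shape, float(loss))
```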