1/ You might have seen it: DINOv3 is out! 🦖🦕 In this thread, we share key insights on our Gram anchoring ⚓︎ and how it helps to get smooth feature maps. 👇
We strengthen MaskCLIP features w. a simple weighted aggregation strategy w. weights learnt using self-sup. DINO🦖 ➡️patch-level CLIP abilities 🧪SoTA in open-voc sem. seg. 🚀pass in CLIP & new conv 3x3 Check out CLIP-DINOiser 📜 arxiv.org/abs/2312.12359 🖥️github.com/wysoczanska/clip_…
🚨Happy to release on arXiv CLIP-DINOiser: Teaching CLIP a few DINO tricks🦖🎓 We obtain dense CLIP features in 1 forward pass w/o feature alteration and w/ almost no computational extra cost to facilitate open-vocabulary semantic segmentation 🧶 🖥️: wysoczanska.github.io/CLIP_D… [1/N]
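A minimal sketch of what a weighted aggregation of patch features can look like. In CLIP-DINOiser the weights are learnt with DINO as a guide; here `affinity` is a hand-written, hypothetical matrix and `aggregate` is an illustrative helper, not the released code.

```python
def aggregate(features, weights):
    """Replace each patch feature by a weighted average of all patch features.

    features: N lists of length C (one per patch).
    weights:  N x N non-negative affinities (hypothetical values here).
    """
    channels = len(features[0])
    pooled = []
    for row in weights:
        total = sum(row) or 1.0  # avoid dividing by zero for an empty row
        pooled.append([
            sum(w * f[k] for w, f in zip(row, features)) / total
            for k in range(channels)
        ])
    return pooled


# Patches 0 and 2 are assumed to belong to the same object, patch 1 is not.
noisy = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.4]]
affinity = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
pooled = aggregate(noisy, affinity)
```

Patches that share an affinity row end up with the same pooled feature, which is the smoothing effect the aggregation is after.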
Poster #72 #ECCV2024 is already up and running with @mkwysoczanska @MichaelRamamon @abursuc. Come by if you want to talk about open-vocabulary semantic segmentation at low cost!
Replying to @ducha_aiki @abursuc
Thanks for catching my best expression. Things are serious when you talk about CLIP-DINOiser!
Densifying CLIP with 𝘯𝘰 𝘦𝘹𝘵𝘳𝘢 𝘵𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘯𝘰𝘳 𝘢𝘯𝘯𝘰𝘵𝘢𝘵𝘪𝘰𝘯 ? Check out the great results of CLIP-DIY🔧 Very proud of @mkwysoczanska for the excellent work 👏 Paper: arxiv.org/abs/2309.14289
#CVPR2023 🚀Unsupervised object localization at 80 FPS 🚀on a V100 after a quick 2h training on a single GPU with no annotation. Check out FOUND which exploits DINO features with a single self-trained conv1x1. paper: arxiv.org/abs/2212.07834 presentation: piped.video/watch?v=jfYQfFcr…
Tomorrow morning we will present dino.txt and show how to efficiently align text to frozen DINOv2 features. And no need to choose between image- and patch-level alignment, we show a simple strategy to do both at once! Come by for discussion, poster 370
Replying to @BaldassarreFe
Stop by the CVPR poster for more eye-candy results. We will be there with @dahyunkang_, @oriane_simeoni, and @BaldassarreFe. Happy to answer any questions! [7/N] 📅 Sunday, June 15 🕥 10:30 - 12:30 📍 Poster 370
4/ Our idea: Gram Anchoring ⚓️. We align the Gram matrix of output patch features with that of a trusted teacher ↔️ therefore operating on pairwise patch similarities, but w/o constraining the features themselves. We use our 200k it. checkpoint (see above) as the anchor teacher🎯
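A minimal sketch of the idea, assuming the alignment is a mean squared difference between the Gram matrices of L2-normalized patch features (the post does not spell out the exact loss). The function names and toy features are illustrative.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def gram_matrix(patches):
    """Pairwise cosine similarities between patch features (N x N)."""
    unit = [l2_normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(p, q)) for q in unit] for p in unit]


def gram_anchoring_loss(student_patches, teacher_patches):
    """Mean squared difference between student and teacher Gram matrices.

    Assumption: squared-error form; only pairwise similarities are compared.
    """
    gs = gram_matrix(student_patches)
    gt = gram_matrix(teacher_patches)
    n = len(gs)
    return sum(
        (gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[-y, x] for x, y in teacher]  # same features rotated by 90 degrees
collapsed = [[1.0, 0.0]] * 3             # all patches identical

loss_rotated = gram_anchoring_loss(rotated, teacher)
loss_collapsed = gram_anchoring_loss(collapsed, teacher)
```

The rotated student has different features but the same pairwise similarities, so its loss is (numerically) zero: the loss acts on the similarity structure, not on the features themselves. The collapsed student is penalized.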
3/ 🔎Analyzing patch features, we found that the locality of patch-level cosine similarity strikingly degrades over training (e.g., 1M vs 200k iterations). Our key question: how to improve patch locality without affecting the global features' quality? 🤔
🍾POP-3D @NeurIPSConf If you want to discuss open-vocabulary 3D occupancy prediction from images only, reach out to @AVobecky who will present the poster on Thursday: Poster #115 @🗺️Great Hall & Hall B1+B2 (level 1) ⏲️Th. 14 Dec. 10:45 a.m. CST — 12:45 p.m. CST
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images by @AVobecky @oriane_simeoni D.Hurych @SpyrosGidaris @abursuc P.Pérez, J.Sivic tl;dr: open-vocab 3D semantic occupancy maps from multi-cam inputs w/ tri-modal self-supervised learning #neurips2023 vobecant.github.io/POP3D/
13/ DINOv3 resources 📜 Paper: ai.meta.com/research/publica… 🌍 Blog: ai.meta.com/blog/dinov3-self… 💻Code: github.com/facebookresearch/… 🤗 Pretrained backbones in HF Transformers: huggingface.co/collections/f…
Our @CVPR tutorial about "object localization for free" is today in room East 11, starting at 8:30am PDT (with @WeidiXie, @tkipf and P. Pérez). Come and join us if you want to hear about and discuss different successful approaches to object localization with no annotation!
Don't miss the "Object localization for free: Going beyond self-supervised learning" @CVPR tutorial (by @oriane_simeoni @WeidiXie @tkipf P. Pérez) for an in-depth coverage of different angles on object localization with no human supervision #cvpr2023 osimeoni.github.io/object-lo…
12/ DINOv3 is a big effort from the great team ❤️ composed of @huyvvo, @maxseitzer, @BaldassarreFe, Maxime Oquab, Cijo Jose, Vasil Khalidov, @MarcSzafraniec, Seungeun Yi, @MichaelRamamon, @fvsmassa, @d_haziza, @LucaWehrstedt, @jianyuan_wang, …
2/ Let’s start with the why. With DINOv3, we aimed to scale up the model & training to get stronger features 🦾. Extending the training w/ DINOv2 losses gave a consistent improvement on classification results, but we observed a big drop over time on dense ones 😟
11/ We hope that you enjoy the quality of DINOv3 dense features as much as we do 🦖🦕
10/ Cosine similarities before & after Gram anchoring ⚓️ — some eye candy! 👀✨
7/ Can we boost patch locality more? Yes, with high-res features✨ We feed 2x res images to Gram teacher, then downsample features to student's size. Downsampling (~averaging fine patches) smooths artifacts & keeps fine details → big performance gains (orange curve below) 🚀
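A minimal sketch of the downsampling step, assuming plain 2x2 average pooling of the teacher's high-res patch grid (the post only says "~averaging fine patches", so the exact interpolation is an assumption). `downsample_2x` is an illustrative name.

```python
def downsample_2x(grid):
    """Average each 2x2 block of patch features.

    grid: nested lists of shape [2H][2W][C] -> returns [H][W][C].
    """
    height, width = len(grid) // 2, len(grid[0]) // 2
    channels = len(grid[0][0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            # The four fine patches covering one student-size patch.
            block = [grid[2 * i + di][2 * j + dj] for di in (0, 1) for dj in (0, 1)]
            row.append([sum(p[k] for p in block) / 4.0 for k in range(channels)])
        out.append(row)
    return out


# Toy 2x2 high-res grid with a single channel, pooled to a 1x1 grid.
high_res = [[[1.0], [3.0]], [[5.0], [7.0]]]
student_size = downsample_2x(high_res)
```

The pooled grid has the student's spatial size, so its Gram matrix can be compared directly with the student's.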
Very happy to be in such a brilliant team 🪩
🎉Crossing 10K citations 🎉 Thanks to all who've joined us on this quest! 🙏 ✅Follow us to get notified of our latest research: scholar.google.com/citations… We're proud of the team and collaborators, though we know citations are just a metric and not the end game
Thank you @matas_jiri & @giotolias for the invitation! It was a great day with very insightful talks & discussions with VRG students/researchers. Prague is still just as welcoming even under the snow ❄️
8/ Sure, running high-res inference is more expensive, but we add Gram loss late in training (1M iters) and run just 70k iters! We initially tried from 200k iters, to prevent degradation, but even damaged features can be quickly fixed, reaching the same quality as early use 🤯
5/ Gram Anchoring quickly & drastically boosts dense task performance, even surpassing the Gram teacher checkpoint after only 10k iterations 🎯, showing that patch locality can be quickly recovered and improved⚡
14/ Also, don't hesitate to check out this v. interesting study by @JRaugel and team which shows how much DINOv3 and the human brain have in common 🧠
Can AI help understand how the brain learns to see the world? Our latest study, led by @JRaugel from FAIR at @AIatMeta and @ENS_ULM, is now out! 📄 arxiv.org/pdf/2508.18226 🧵 A thread:
6/ But how? We hypothesize that Gram anchoring leads to better and more consistent patch features, which also improves the quality of iBOT targets. We observe a decrease in the iBOT loss and little impact on DINO losses.
How to obtain high-quality 3D representations from 2D self-supervised backbones? #CVPR2024 Check out ScaLR which narrows down from 30 to 10pts the difference to fully-supervised models with our homemade WaffleIron model github.com/valeoai/WaffleIro…. Both works led by @gillespuy👏.
📢We introduce the ScaLR models (code+checkpoints) for LiDAR perception distilled from vision foundation models tl;dr: don’t neglect the choice of teacher, student, and pretraining datasets -> their impact is probably more important than the distillation method #CVPR2024 🧵 [1/8]
Last December @AVobecky presented our POP-3D model which generates 3D occupancy with open-vocabulary representation from images only, trained w/o annotation. We also proposed a new small benchmark for open-voc. 3D-occupancy w/ natural language queries. More details below ⬇️
[@NeurIPSConf'23]🚨Did you miss it? Our POP-3D generates open-vocabulary 3D occupancy predictions from 📷 surround-view images only, w/o human labels & w/ distillation from pre-trained models. We also propose a new small 3D-occupancy open-vocabulary benchmark #neurips2023⬇️ [1/N]
Replying to @92HsChoi
Thank you for the question! This remains an open question. The degradation of the features might be due to an imbalance between the two losses: the DINO loss constrains the CLS token while iBOT constrains the patch features
9/ Which checkpoint for the Gram anchor? We use an early teacher model (200k iters). However, the downsampled high-res features trick boosts Gram matrix quality across most teachers. So, the choice of teacher checkpoint matters less
2/ This survey expands and deepens the 🔊 CVPR’23 tutorial on “Object localization for free” done in collaboration with @tkipf, @WeidiXie who covered perpendicular object-centric feature learning and multi-modal approaches 📽️piped.video/watch?v=AxhjGR1W…
@TimDarcet, @TheoMoutakanni, Leonel Sentana, @_claireroberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, @julienmairal, @hjegou, @monsieurlabatut, @p_bojanowski
I am glad you enjoy it as well :) We spent quite some time looking at those maps, improving our fruit localisation skills
Excited to be at @DeepIndaba in Dakar and to take part tomorrow in the 3rd Weakly Sup. Computer Vision workshop organized by R. de Charette & co. 💬 I will give a talk about one of my favorite subjects: using the power of self-sup. features for object localization. Come say hi!
The link to the workshop: wscv-indaba.github.io/2024 which will start at 8:30 🕗
Replying to @minotauronlucy
I get that not explaining the degradation is frustrating, it remains an open question that we are exploring.
Replying to @giffmana
Thanks and glad you also enjoy the name !
Replying to @jd_markovchain
The Gram matrix (so the patch cosine similarities) is better early in training (although features are worse), but over time, while the features themselves improve, the Gram matrix worsens. This is likely due to an imbalance between the global and local losses. Hope that helps!
Might also come in handy if you are sightseeing in Paris for @ICCVConference 🙃
A work in collaboration with an intern at the time Chloé Sekkat, with @ctu_cs+@valeoai PhD student @AVobecky and fellow @valeoai colleagues @gillespuy, @EloiZablocki and Patrick Perez. Don't hesitate to come and chat with us during Tuesday morning poster session 👩‍💼👨‍💼
IMHO, understanding that there is somewhat of an orthogonality between having discriminative features and good patch locality is interesting and opens up more avenues
1/ We discuss methods which 🚀 solely exploit ❄️ self-supervised ViT features to extract object localization or propose a smart training 🏋️ strategy to improve the object localization performance
1/ Summary. We generate coarse masks for objects by looking for patches that do not belong to the background (2/) and then self-train our conv1x1 model to predict such masks refined by a bilateral solver (3/).
Replying to @thomas_fel_
Thank you! I am not sure about the answer, I will ask and come back to you!
Thanks @AljosaOsep! Your paper looks very interesting but maybe not exactly in the scope of our survey. We focus here on fully-unsupervised class-agnostic methods which have never had access to any annotation. But thanks for the ref, we keep it for possible expansions!
Replying to @abursuc
Thanks @abursuc for your kind words!
2/💡Look for the background to discover objects. We exploit that DINO gives little attention to background patches and find all patches correlated to the least attention patch. The complement to this mask highlights all objects in the scene → no need to know how many.
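A minimal sketch of this background-first step, assuming cosine similarity as the correlation measure and a hypothetical threshold of 0.3 (the post gives neither). `object_mask` and the toy inputs are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def object_mask(features, attention, threshold=0.3):
    """Return one boolean per patch: True where the patch is NOT background.

    features:  N patch features; attention: N attention scores.
    threshold: hypothetical similarity cut-off, chosen for illustration.
    """
    # Seed = the patch that receives the least attention.
    seed = min(range(len(attention)), key=attention.__getitem__)
    # Background = every patch correlated with the seed.
    background = [cosine(f, features[seed]) > threshold for f in features]
    # Objects = complement of the background mask.
    return [not is_bg for is_bg in background]


patch_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
patch_attention = [0.05, 0.10, 0.50, 0.35]
coarse_mask = object_mask(patch_features, patch_attention)
```

Because the mask is the complement of the background, it covers every object at once, without fixing the number of objects in advance.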
3/ Self-training of our conv1x1. FOUND is a single conv1x1 layer trained to extract information from DINO features using the coarse masks (2/) as pseudo-labels after refinement using a bilateral solver.
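A minimal sketch of the self-training step. A conv1x1 applies the same linear map to every patch, so it is written here as per-patch logistic regression trained on pseudo-labels. The binary cross-entropy loss, learning rate and step count are assumptions, and the bilateral-solver refinement is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_conv1x1(features, pseudo_labels, steps=200, lr=0.5):
    """Fit one weight vector + bias shared by all patches (a 1x1 conv)."""
    channels, n = len(features[0]), len(features)
    w, b = [0.0] * channels, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * channels, 0.0
        for f, y in zip(features, pseudo_labels):
            # Gradient of binary cross-entropy w.r.t. the logit.
            err = sigmoid(sum(wk * fk for wk, fk in zip(w, f)) + b) - y
            for k in range(channels):
                grad_w[k] += err * f[k] / n
            grad_b += err / n
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(features, w, b):
    return [sigmoid(sum(wk * fk for wk, fk in zip(w, f)) + b) > 0.5 for f in features]


# Toy frozen features and coarse pseudo-labels (1 = object, 0 = background).
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pseudo_labels = [0, 0, 1, 1]
weights, bias = train_conv1x1(frozen_features, pseudo_labels)
predicted = predict(frozen_features, weights, bias)
```

Only the weight vector and bias are trained; the backbone features stay frozen, which is what keeps training and inference cheap.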
Replying to @ZadaianchukML
Thanks @ZadaianchukML! COMUS is a very nice work! I am not aware of a survey for USS (we quickly discuss some works in perspectives, will make sure to add COMUS in next update, sorry for the oversight). I am thinking of adding a sec. for USS in the Awesome page as it's v. relevant
Ah yes, I was missing a sentence for it to be clear, maybe. With the Gram loss we only impact the locality of the patch cosine sim. & this does not hurt the discriminativity of the features (no drop in global perf). This is the orthogonality I was talking about
Replying to @jd_markovchain
You are reading correctly, we do not use any external model. You can see the better patch locality of the 200k checkpoint in the cosine similarity maps in post 3/
For those who have a CVPR registration but unfortunately couldn't attend, we will be live. Details here cvpr.thecvf.com/virtual/2023…
As mentioned in a previous reply, we tried applying the Gram loss at 200k iters & observed that it mainly prevents the dense degradation, but was less efficient than using it at 1M iters, while reaching similar performance