cofounder & chief science officer at @amilabs | faculty @nyu_courant | prev: @googledeepmind @meta (fair) @ucsandiego | ynwa

New York
i’m joining forces with @ylecun and an incredible group of people to start AMI Labs @amilabs. AMI isn’t a conventional lab. we don’t intend to become one. a lot to say about why this moment matters, but for now we’re heads down building. join us: amilabs.xyz
Advanced Machine Intelligence (AMI) is building a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe. We’ve raised a $1.03B (~€890M) round from global investors who believe in our vision of universally intelligent systems centered on world models. This round is co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, along with other investors and angels across the world. We are a growing team of researchers and builders, operating in Paris, New York, Montreal and Singapore from day one. Read more: amilabs.xyz/ AMI - Real world. Real intelligence.
154
163
2,803
502,169
Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community. What we have learned so far: - Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023) -- it's a diffusion model with a transformer backbone, in short: DiT = [VAE encoder + ViT + DDPM + VAE decoder]. According to the report, it seems there aren't many additional bells and whistles. - "Video compressor network": Looks like it's just a VAE but trained on raw video data. Tokenization probably plays a significant role in getting good temporal consistency. By the way, VAE is a ConvNet, so DiT technically is a hybrid model ;) (1/n)
35
520
2,603
1,345,050
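The "DiT = [VAE encoder + ViT + DDPM + VAE decoder]" shorthand in the post above can be made a little more concrete by counting how many tokens the transformer actually sees. This is a minimal sketch, not the DiT code: it assumes the 8x spatial downsampling of the SD-style VAE and the patch size of 2 used by the "/2" DiT variants, and the function name is hypothetical.

```python
def dit_token_count(image_size: int, vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Count transformer tokens for a square image.

    Assumed pipeline: image -> VAE encoder (spatial downsample) -> patchify -> tokens.
    The default factors are assumptions, not values stated in the post.
    """
    if image_size % vae_downsample != 0:
        raise ValueError("image_size must be divisible by vae_downsample")
    latent_size = image_size // vae_downsample
    if latent_size % patch_size != 0:
        raise ValueError("latent size must be divisible by patch_size")
    patches_per_side = latent_size // patch_size
    return patches_per_side * patches_per_side


print(dit_token_count(256))  # 256x256 image -> 32x32 latent -> 16x16 patches
print(dit_token_count(512))  # 512x512 image -> 64x64 latent -> 32x32 patches
```

Because the token count grows with the square of the resolution, the tokenizer's compression factor largely sets how expensive the transformer is.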
good question... thinking back to pre-LLM interviews I experienced (before 2019)… they were all in-person on-site, no chance of ''llm cheating,'' very different across places, and somehow way more memorable. > old deepmind had brutal ''quizzes'' -- 2-hour marathons with 100+ math/stats/ML concept questions. > meta FAIR was basically academia interview with a bit of coding, but the highlight was chatting vision research with piotr, ross and kaiming. > google brain/research was similar. the @NoamShazeer was my coding interviewer, who kindly kept it simple with just a two-pointer q. we spent most of the time discussing research, where I explained how I had applied something called a transformer to visual data (point clouds) -- a topic that, at the time, hardly anyone cared about. > but the coolest? openai in 2018: whiteboard coding, a research talk, and a *~5-hour* session in a tiny room to work on an RL problem (variance collapse in cross entropy methods). I knew almost nothing about RL, but that was the point. They handed you a self-contained problem description, handwritten by @johnschulman2, and expected you to learn, research, solve, write up in a notebook, and present. feeling a bit nostalgic. makes me wonder if interviews like that still happen anywhere. If they do, I’d love to know. :)
At which of these places did you have the coolest interview in your career? I know it's an ill-posed poll, but what am i gonna do with only 4 options?! I tried grouping them by interview similarity to the best of my knowledge. Comment if "other". Might make a second round.
22
119
2,420
299,861
there’s only one right answer here, the @ylecun definition, and everyone should be able to recite it word for word
what is a world model?
59
204
2,072
306,673
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
57
326
1,852
415,245
Representation matters. Representation matters. Representation matters, even for generative models. We might've been training our diffusion models the wrong way this whole time. Meet REPA: Training Diffusion Transformers is easier than you think! sihyun.me/REPA/(🧵1/n)
29
266
1,779
372,320
During my internship at DeepMind, Demis met with all the interns. When asked about the company’s goal, I vividly remember him saying, “winning *multiple* Nobel prizes.” I was shocked at the time, but now, just 7 years later, part of that mission is already accomplished. Eager to see the rest unfold.
BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Chemistry with one half to David Baker “for computational protein design” and the other half jointly to Demis Hassabis and John M. Jumper “for protein structure prediction.”
10
107
1,508
149,462
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
17
246
1,108
333,231
Is this now about gravity? 😶
47
42
847
165,766
🔍Introducing V*: exploring guided visual search in multimodal LLMs MLLMs like GPT4V & LLaVA are amazing, but one concern that keeps me up at night: the (frozen) visual encoder typically extracts global image tokens *only once*, regardless of resolution or scene complexity (1/n)
15
115
742
195,754
well diffusion transformer was rejected at CVPR 2023 due to limited novelty.
R2: While the results are impressive, this is a simple combination of diffusion transformer (ICCV 2023) and latent diffusion model (CVPR 2022). Limited novelty. Weak reject.
18
49
687
165,906
Video understanding is the next frontier, but not all videos are alike. Models now reason over youtube clips and feature films, but what about the everyday spaces we—and our future AI assistants—navigate and experience? Introducing Thinking in Space, our latest study exploring how multimodal LLMs see, remember and recall spaces. 🧵[1/n] vision-x-nyu.github.io/think…
17
104
679
210,496
A new chapter of my professional life! After 4 incredible years at FAIR and living in the bay, I’m moving to NYC! I’ll be joining @NYU_Courant CS @NYUniversity @CILVRatNYU as an Assistant Professor the upcoming Jan. Looking for students/postdocs to join me on this new adventure!!
53
22
612
Introducing Cambrian-S: it’s a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
30
102
686
258,436
I know op is click-baiting, but let me bite... fwiw every researcher’s DREAM is to find out their architecture is wrong. If it’s never wrong, that’s a bigger problem. we try to break DiT every day w/ SiT, REPA, REPA-E etc. but you gotta form hypotheses, run experiments, test, not by LARPing science in your head...otherwise, your conclusion is not just “wrong,” it’s not even wrong. okay - more technical take on "what's wrong with DiT" (as of today): - tread is more like stochastic depth, i think the convergence comes from the regularization effect that makes the representation stronger (note inference is all standard - all blocks process all tokens); very interesting work, but has nothing to do with whatever OP is saying... - lightning DiT is a proven, robust upgrade (w/ swiglu, rmsnorm, rope, ps=1), always use that when possible - no evidence that post-norm is hurting anything - the biggest fix from the past year is on internal rep learning: repa was first, but now tons of ways to do it (tokenizer-level fix like va-vae/repa-e, concat semantic tokens to noise latents, decoupled arch like ddt, regularizers like dispersive loss or self-representation alignment, etc.) - always go with stochastic interpolants/flow matching (SiT should be the baseline here) - use adaln-zero for time embedding, but use cross attn for more complex distributions like text embedding - but do it right -- use pixart-style shared adaln, otherwise you waste 30% of params for nothing - sd-vae is the real wrong thing in DiT, it's the elephant in the room, bloated (445.87 GFlops for 256^2 images??), not end to end, again approaches like va-vae and repa-e are partial fixes, but more progress is coming.
bros, DiT is wrong. it's mathematically wrong. it's formally wrong. there is something wrong with it
12
58
538
87,455
papers are kind of like movies: the first one is usually the best, and the sequels tend to get more complicated but not really more exciting. But that totally doesn’t apply to the DepthAnything series. @bingyikang's team somehow keeps making things simpler and more scalable each time. in this new version, they basically show that a strong representation encoder plus a depth-ray prediction objective is enough (you see the RAE vibes too, right?) to get solid, general spatial perception across a bunch of tasks. people often say they hate computer vision because it’s messy--too many tasks, too many data types, too many moving parts. but that’s exactly why I love it. I think the biggest AI breakthroughs are going to come quietly from vision and then suddenly leapfrog everything else, changing how AI interacts with the real world and with us. pretty soon we’ll realize vision is not a big list of tasks--it’s a perspective. a perspective about modeling continuous sensory data, building layered representations of the world, and inching toward human-like intelligence. and tbh we’re watching this happen every day, behind all the hype, as all these different "tasks" slowly start to merge.
After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3 reveals two key insights: 💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture. ✨ A single depth-ray representation is enough. No complex 3D tasks. Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen. 👇(1/n) #DepthAnything3
5
40
514
76,254
Wow, Deeply Supervised Nets received the Test of Time award at @aistats_conf 2025! It was the very first paper I submitted during my PhD. Fun fact: the paper was originally rejected by NeurIPS with scores of 8/8/7 (yes, that pain stuck with me... maybe now I can finally let it go😅). I wouldn’t call conferences a lottery, but a bit of perseverance does go a long way. Students: if you're feeling disheartened after recent paper decisions and gearing up for the next one, I hope this is a small reminder to keep going.
The #AISTATS 2025 Test of Time Award goes to ... 🥁 ... Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, Zhuowen Tu, for "Deeply Supervised Nets"! Congratulations!
33
44
509
88,720
When I first saw diffusion models, I was blown away by how naturally they scale during inference: you train them with fixed flops, but during test time, you can ramp it up by like 1,000x. This was way before it became a big deal with o1. But honestly, the scaling isn’t that impressive since adding more denoising steps stops making much of a difference pretty quickly – sry but it's a disappointing scaling law :( People have played with the randomness in diffusion models at test time and found that using good noises can improve the quality for various tasks. In this new work, we focus on a systematic exploration of inference-time scaling of diffusion models with a general search framework. We take the idea of verifiers from large language models and demonstrate that different search algorithms can help find better noise candidates, preventing the scaling from hitting a ceiling during testing. By looking at inference-time scaling, we also highlight the trade-offs between different verifiers and discover new insights about evaluating diffusion models. Definitely check out Wills @ma_nanye's thread below!
Inference-time scaling for LLMs drastically improves the model's ability in many perspectives, but what about diffusion models? In our latest study—Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps—we reframe inference-time scaling as a search problem over sampling noises. Our results show that increasing search computation can further enhance generation performance, pushing the capabilities of diffusion models further. 🧵[1/n]
9
68
465
51,866
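The idea in the posts above, treating inference-time scaling as a search over sampling noises scored by a verifier, can be illustrated with best-of-N random search. This is only a toy sketch under stated assumptions: `toy_generator` and `toy_verifier` are hypothetical stand-ins for a frozen diffusion sampler and a learned verifier, and random search is just one simple way to spend the extra compute, not necessarily the method used in the paper.

```python
import random


def toy_generator(noise: float) -> float:
    """Hypothetical stand-in for a frozen sampler: deterministic map from noise to a sample."""
    return noise * noise - noise


def toy_verifier(sample: float) -> float:
    """Hypothetical stand-in for a verifier: higher is better, prefers samples near 0.5."""
    return -abs(sample - 0.5)


def random_search(num_candidates: int, seed: int = 0) -> tuple[float, float]:
    """Draw `num_candidates` initial noises and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    best_noise, best_score = 0.0, float("-inf")
    for _ in range(num_candidates):
        noise = rng.gauss(0.0, 1.0)                  # candidate starting noise
        score = toy_verifier(toy_generator(noise))   # generate, then verify
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score


# More search compute can only keep or improve the best verifier score,
# because a larger budget with the same seed extends the same candidate list.
for budget in (1, 4, 16, 64):
    _, score = random_search(budget)
    print(budget, round(score, 4))
```

The generator stays fixed throughout. Only the number of candidates changes, which is what separates this kind of scaling from adding more denoising steps.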
Diffusion Transformer architecture + Flow Matching / Stochastic Interpolants objective? Great work and looking forward to the technical report! In SiT (scalable-interpolant.github.…) we have also studied this new design space under class conditional generation (though on a much smaller scale), and hopefully SiT can provide some complementary insights.
Announcing Stable Diffusion 3, our most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities. Today, we are opening the waitlist for early preview. This phase is crucial for gathering insights to improve its performance and safety ahead of open release. You can sign up to join the waitlist and learn more here: bit.ly/3OR2qQF #stablediffusion3 Prompt: Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says "Stable Diffusion 3" made out of colorful energy
5
46
343
85,251
When Bill and I were working on the DiT project, instead of creating novelty (see my last tweet🤷‍♂️), we prioritized two aspects: simplicity and scalability. These priorities offer more than just conceptual advantages. - Simplicity means flexibility. The cool thing about vanilla ViT that people often miss is how it makes your model way more flexible when it comes to working with input data. For example, in masked autoencoder (MAE), ViT helped us to just process the visible patches and ignore the masked ones. And similarly, Sora "can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid." UNet does not directly offer this flexibility. 👀Speculation: Sora might also use Patch n’ Pack (NaViT) from Google, to make DiT adaptable to variable resolutions/durations/aspect ratios. - Scalability is the core theme of the DiT paper. First, an optimized DiT runs much faster than UNet in terms of wall-clock time per Flop. More importantly, Sora demonstrated that the DiT scaling law applies not just to images but now to videos as well -- Sora replicates the visual scaling behavior observed in DiT. 👀Speculation: In the Sora report, the quality for the first video is quite bad, I suspect it is using a base model size. A back-of-the-envelope calculation: DiT XL/2 is 5X GFLOPs of the B/2 model, so the final 16X compute model is probably 3X DiT-XL model size, which means Sora might have ~3B parameters – if true, this is not an unreasonable model size. It could suggest that training the Sora model might not require as many GPUs as one would anticipate – I would expect very fast iterations going forward. (2/n)
7
41
400
97,603
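The back-of-the-envelope calculation in the post above can be written out explicitly. This is a rough sketch that rests on several assumptions the post does not state: that the base video model is DiT-B/2-sized, that parameter count scales linearly with compute, and that DiT-XL/2 has about 675M parameters (the figure reported for DiT). The function name is hypothetical.

```python
def estimate_params(xl_params: float, xl_over_base_flops: float, compute_multiple: float) -> float:
    """Estimate the parameter count of a model trained at `compute_multiple` x base compute.

    Assumes parameters scale linearly with compute, which is a simplification.
    """
    scale_over_xl = compute_multiple / xl_over_base_flops  # 16 / 5 = 3.2, "3X" in the post
    return scale_over_xl * xl_params


estimate = estimate_params(xl_params=675e6, xl_over_base_flops=5.0, compute_multiple=16.0)
print(f"{estimate / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes out at roughly 2B, the same order of magnitude as the ~3B quoted in the post. That gap shows how loose this kind of estimate is.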
"esp as a computer vision at heart who is temporarily masquerading as a natural language person" -- omg could not love this from @karpathy more❤️
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in: - more information compression (see paper) => shorter context windows, more efficiency - significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images. - input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful. - delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go. OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision ->text tasks. Not vice versa. So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to. Now I have to also fight the urge to side quest an image-input-only version of nanochat...
3
9
392
78,369
i think this just shows the input image passes through semantic encoders instead of VQ; they're aligned with the LLM and grasp image content well (super important for editing) but may not perfectly reconstruct original pixels (due to capacity limits / # image tokens)
ChatGPT insists on swapping out real faces for fake ones, when requested to generate an identical image to the input. Below (input, output)
17
26
384
57,076
Multimodal LLMs have been shown to err in complex, OOD, and edge-case scenarios. Yet, we have identified a systematic method for pinpointing visual errors in these models even when they are posed with *very basic* questions, using just common images from ImageNet and LAION. 🧵
8
69
360
84,013
yes. but you should all follow Ken Liu (@kyliu99) and read his novels. he’s my favorite sci-fi writer and just an incredible person. Ken’s been hands-on with AI for years. No surprise Pantheon clicks so hard with many researchers. I remember a few years ago he fine-tuned his own gpt model to explore how LLMs could affect nonfiction writing. way ahead of the curve. he also had a legendary career: english & CS double major at Harvard, worked at microsoft as an engineer, did some startup thing, then got a JD (!!) and became a lawyer…and then pivoted to writing sci-fi full time and went on to win 4x hugo awards. Oh, he's also an immigrant from China, from the same hometown as me.
pantheon is such a good show!
8
14
354
58,242
Had a great time at this CVPR community-building workshop---lots of fun discussions and some really important insights for early-career researchers. I also gave a talk on "Research as an Infinite Game." Here are the slides: canva.com/design/DAGp0iRLk9g…
In this #CVPR2025 edition of our community-building workshop series, we focus on supporting the growth of early-career researchers. Join us tomorrow (Jun 11) at 12:45 PM in Room 209 Schedule: sites.google.com/view/stando… We have an exciting lineup of invited talks and candid panels: @sarameghanbeery, @dimadamen, @jbhuang0604, @lealtaixe, @LerrelPinto, @lschmidt3, @shubhtuls, @gulvarol, @cvondrick, @sainingxie Co-organizing with @unnatjain2010, @ap229997, @georgiagkioxari, @akanazawa, and Lana Lazebnik @CVPR
17
65
360
45,694
thought experiment: ViTs work great for 224^2 images, but what if you had a 1 million^2 pixel one? You'd either use conv, or you patchify and process each with a ViT using shared weights—essentially conv. a moment I realize convnet isn't an architecture; it's a way of thinking.
A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, and stick self-attention circuits at higher levels, where feature vectors represent objects. PS: ready to bet that Tesla FSD uses convolutions (or perhaps more complex *local* operators) at the low levels, combined with more global circuits at higher levels (perhaps using self-attention). Transformers on low-level patch embeddings are a complete waste of electrons.
13
26
335
172,033
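The thought experiment above, patchify and then process every patch with shared weights, can be checked in the simplest possible case. This sketch uses a 1D signal and a single shared linear map instead of a full ViT per patch, so it only demonstrates the linear version of the claim. Both function names are hypothetical.

```python
def patchify_shared_linear(signal: list[int], weights: list[int]) -> list[int]:
    """Split the signal into non-overlapping patches and apply the same weights to each."""
    p = len(weights)
    patches = [signal[i:i + p] for i in range(0, len(signal) - p + 1, p)]
    return [sum(w * x for w, x in zip(weights, patch)) for patch in patches]


def strided_conv1d(signal: list[int], kernel: list[int], stride: int) -> list[int]:
    """Plain 1D convolution (cross-correlation form, no padding) with a given stride."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


signal = [1, 2, 3, 4, 5, 6]
weights = [1, 0, -1]

via_patches = patchify_shared_linear(signal, weights)
via_conv = strided_conv1d(signal, weights, stride=len(weights))
print(via_patches, via_conv)  # both give [-2, -2]
```

Replacing the shared linear map with a shared nonlinear module (a small ViT, say) keeps the same structure: one function with tied weights slid across the input with a stride equal to its window size.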
wait a sec. look at the content -- did y'all actually go this route? This looks way too plausible, and honestly the most practical approach on multimodal gen rn (based on my own experience with students). So, not pure AR, but an LLM + a diffusion "renderer" on the compressed latents? @ajabri is this your secret plan to make OpenAI open again?
10
16
344
75,750
🌎 𝕤𝕒𝕪 𝕙𝕖𝕝𝕝𝕠 𝕥𝕠 𝕧𝕚𝕣𝕝 🌏 virl-platform.github.io
10
64
323
95,439
Diffusion Transformer (DiT) just got an upgrade! Same backbone but better quality, speed and flexibility. And we achieved this by... moving beyond standard diffusion and exploring a broader design space with interpolants! Introducing SiT -- Scalable Interpolant Transformers!
NYU presents SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. paper page: huggingface.co/papers/2401.0… We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06.
4
40
300
76,105
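The "interpolant connecting the distributions" mentioned in the posts above can be illustrated with the simplest choice, a straight line between a data point and a noise sample. This is a sketch under assumptions rather than SiT's implementation: it assumes a linear interpolant with t = 0 at the data and t = 1 at the noise, and the function name is hypothetical.

```python
def linear_interpolant(x_data: list[float], noise: list[float], t: float) -> tuple[list[float], list[float]]:
    """Return the interpolated point x_t and the velocity target d x_t / dt.

    Assumed form: x_t = (1 - t) * x_data + t * noise, so the velocity is noise - x_data.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * x + t * n for x, n in zip(x_data, noise)]
    velocity = [n - x for x, n in zip(x_data, noise)]
    return x_t, velocity


x_data = [1.0, -2.0]
noise = [0.5, 0.25]

for t in (0.0, 0.5, 1.0):
    x_t, velocity = linear_interpolant(x_data, noise, t)
    print(t, x_t, velocity)
```

A model trained to regress this velocity at random values of t is the flow-matching style objective. Other interpolants change the two coefficients on the data and the noise while leaving the backbone untouched.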
wait, speaking of false dichotomies---during your phd, you *can* write code, dive into data and systems, collaborate with a team, and build useful things---all while enjoying complete openness and the freedom to pursue what *genuinely* excites you.
i left my phd before joining openai. working in industry demands more rigor – you don’t just need to convince reviewer 2 with a nice graph and an ego-cite, it better actually work if it’s underwriting billions in research investment. not saying it always pans out that way in practice. yolo culture is pervasive and “research-quality code” abounds. i certainly don’t have a clean conscience there. but some of the best breakthroughs come from people without academic preconceptions, with the discipline to build things that actually work
11
10
295
49,072
you can’t build superintelligence without first building supersensing
25
31
283
36,047
The key takeaway is from the "Emerging simulation capabilities" section. Before Sora, it was unclear if long form consistency could emerge on its own or if it required complex subject-driven generation pipelines or even physics simulators. OpenAI has shown that, though not perfect, these behaviors can be achieved with end-to-end training. Yet, two essential points have not been discussed. 1. Training Data: No talk about training source and construction at all, which might just imply data is likely the most critical factor for Sora's success. 👀Speculations: There's already much speculation about data from game engines. I also anticipate the inclusion of movies, documentaries, cinematic long takes, etc. Quality really matters. Super curious where Sora got this data from (surely not YouTube, right?). 2. (Auto-regressive) Long Video Generation: a significant breakthrough in Sora is the ability to generate very long videos. The difference between producing a 2-second video and a 1-minute video is monumental. In Sora, this is probably achieved through joint frame prediction that allows auto-regressive sampling, yet a major challenge is how to address error accumulation and maintain quality/consistency through time. A very long (and bi-directional) context for conditioning? Or could scaling up simply lessen the issue? These technical details can be super important and hopefully will be demystified in the future (3/n) nitter.app/gabor/status/175829571…
.@OpenAI SORA vs @pika_labs vs @runwayml vs @StabilityAI Video. I gave the other models SORA's starting frame. I tried my best prompting and camera motion techniques to get the other models to output something similar to SORA. SORA's just much better at longer scenes.
11
34
265
87,435
So this is not a benchmark for software engineering agents. It’s meant to test core reasoning and intelligence through coding—backed by 71 pages of deep analysis from some of the best competitive programmers out there. This effort was carried out by students across multiple institutions (I’m mostly just a cheerleader here!) It was led by @ZihanZheng71803 (an undergrad who represented NYU in the ICPC World Finals), @wenhaocha1, and many of their Olympiad medalist friends. They built the live benchmark and offered expert analysis of how elite human coders compare to top LLMs. The results are now public: on the hard problems, LLMs essentially score 0%. They're good at implementation-heavy tasks that rely on memorization, but still struggle badly with observation-heavy or logic-heavy problems—those where the implementation is easy once you’ve had the critical "aha" insight. They also struggle with detail-oriented tasks—often getting the basics right but failing to account for edge cases. Some more thoughts on why this benchmark matters: I’ve always been surrounded by top competitive programmers. My undergrad program at SJTU is renowned for ICPC success and primarily admits students with a strong high school competitive programming background. While I’ve never won an olympiad medal myself, I deeply admire my peers who did—friends who trained for years as teens and competed at the highest international levels. One of them is my classmate and key collaborator on this project, Prof @shangjingbo, who earned ICPC world final gold for SJTU. For us, competitive programming was the ultimate badge of intelligence for CS students. Competitive programming emphasizes reasoning and problem solving under pressure, which differs from standard software engineering—but the skills carry over surprisingly well. That’s why so many startups love to show off their IOI gold medalists! Beating this benchmark would be like AlphaGo beating Lee Sedol. 
We're not at that level yet—not even for problems with clearly verifiable outcomes. And if you care about fundamental intelligence and reasoning, this result might be worth a close look.
We introduce LiveCodeBench Pro, a live, exceptionally challenging benchmark comprising competitive programming problems sourced from IOI, Codeforces, and ICPC. Frontier models such as o3, and Gemini 2.5 achieve scores of 0% on the Hard split. Leaderboard: livecodebenchpro.com/
12
38
259
73,614
Our take on a 4o-style AR + diffusion unified model: Transferring knowledge from an AR LLM to generation is easier than expected--you don't even need to touch the LLM. The right bridge between output modalities can unlock cool capabilities like knowledge-augmented generation!
We find training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even if it is FROZEN!
5
28
260
19,877
Nothing ever happens… until it does! Just saw JanusFlow by @deepseek_ai uses REPA for training and shows some solid improvements: arxiv.org/abs/2411.07975
Replying to @cloneofsimo
hmm I’ve got a bunch of independent data points now showing that REPA helps with big text-to-image models too! Let’s give it a little more time before saying nothing ever happens—I’m sure it won’t be long :)
4
31
256
40,822
openai.com/index/thinking-wi… About two years ago, we started building V* to bring visual search into a multimodal LLM and show that it's a key part of how these models can understand the world. I still remember talking with my friend @bowenc0221 and @_alex_kirillov_ about why this kind of multimodal grounding was the future—and Bowen had the guts to kick off the “thinking with images” research at OpenAI. It’s finally live—huge congrats to Bowen and the team!! Incredible to see that "vision" become reality. V* jumped from 55% performance with GPT-4V all the way to 95.7% with o3. There's something deeply satisfying about watching a tough benchmark get solved. It means visual search is becoming a fundamental piece of how multimodal models reason—just like it is for us humans. When I went back to NYU and started thinking about research directions, I told students not to just follow what OpenAI is doing. We need to push ahead and build stuff that’s not only relevant now but shapes what comes next—and maybe even inspires great companies like OpenAI. V* is another example that proves we can do it. It’s a bit sad that work like this doesn’t always get reflected in papers anymore, but deep down, we know there’s still so much untapped potential in vision—and we’re excited to keep exploring something new.
🔍Introducing V*: exploring guided visual search in multimodal LLMs MLLMs like GPT4V & LLaVA are amazing, but one concern that keeps me up at night: the (frozen) visual encoder typically extracts global image tokens *only once*, regardless of resolution or scene complexity (1/n)
2
29
251
46,488
In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to data, makes use of model capacity, and scales effectively (even better than CLIP!). Huge credit to @DavidJFan and @TongPetersb and other friends at FAIR for digging into this from all angles - data, model size, evaluation, and more! SSL 2.0 feels exciting. If the bitter lesson tells us to avoid hardcoded structures and lean into scale through learning and search, then maybe language itself is the most strongly imposed structure in VLMs.
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
2
51
245
28,359
this isn’t just a modeling problem. it’s also a benchmarking problem. spurious correlations are always a pain, but in multimodal llms they become a particularly tough battle. On one hand, you want to leverage the language prior to enable better generalization; on the other, that same language prior can turn into a shortcut that makes the model effectively blind. the irony is that humans do the same thing. We still gravitate toward language-first tasks, and the “multimodal results” in major model releases like gpt-5 reflect exactly that bias. I mean, economically this makes most sense for LLM companies: you can claim wins in “multimodal reasoning” without investing heavily in real multimodal research. that shortcut will come due tho. when you try to put these systems into glasses, robots, or anything else that touches the real world, the cracks will show. and they’ll be costly.
Replying to @TairanHe99
I couldn’t believe GPT-5 could make this mistake until @ziqiao_ma pointed it out to me. Highly recommend this paper (arxiv.org/abs/2406.16860) on vision-centric evaluation of multimodal LLMs from @sainingxie — now imagine the same rigor applied to VLAs.
8
23
244
46,414
Some further thoughts on the idea of "thinking with images": 1) zero-shot tool use is limited -- you can’t just call an object detector to do visual search. That’s why approaches like VisProg/ViperGPT/Visual-sketchpad will not generalize or scale well. 2) visual search needs to be a native, end-to-end component within multimodal LLMs. This was our focus in V*, but two yrs ago, we didn’t realize how powerful RL would become, so we had to stick with SFT to train detection heads. It worked, but it was slow and kind of a pain. 3) however, when the tools are simple and low-level -- say, basic python image processing functions, rather than a faster R-CNN -- they can be incorporated directly into the end to end system. With RL at scale, these simple tools become visual primitives you can mix and match to build scalable visual skills. 4) we should keep identifying these visual primitives. It’s definitely not just simple image processing functions; think about video and 3D. 5) finally, I think most traditional visual recognition models are dead. As the great @inkynumbers said, they’re parsers (drive.google.com/file/d/1Vod…). But vision itself isn’t dead. i think it’s more alive and exciting than ever.
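To make point 3 concrete, here is a minimal sketch of what "simple, low-level" visual primitives could look like: plain array operations a model could call and compose. The function names `crop` and `zoom` are hypothetical and chosen only for illustration; this is not the tool set used in V* or in any production system.

```python
import numpy as np

def crop(image, top, left, height, width):
    """Hypothetical primitive: return a rectangular region of an (H, W[, C]) array."""
    return image[top:top + height, left:left + width]

def zoom(image, factor):
    """Hypothetical primitive: nearest-neighbour upsampling by an integer factor."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

# Composing primitives: look closer at one region of a toy 4x4 "image".
img = np.arange(16).reshape(4, 4)
patch = zoom(crop(img, 1, 1, 2, 2), 2)
```

The point of keeping each primitive this small is that a policy can chain them freely (crop, then zoom, then crop again) instead of depending on one heavyweight detector call.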
at a glance, this is quite diff than what we did fwiw. the behaviors you see in o3 and o4-mini are all emergent from large-scale RL. we just give them access to python and the ability to manipulate images, the rest is up to the model
5
33
242
34,013
#shamelessplug DiT shines in Sora. Our team at NYU has recently released a new DiT model, called SiT. It has exactly the same architecture, but offers enhanced performance and faster convergence. Super curious about its performance on video generation too! (n/n) nitter.app/ma_nanye/status/174819…
Introducing Scalable Interpolant Transformer! SiT integrates a flexible interpolant framework into DiT, enabling a nuanced exploration of dynamical transport in image generation. With an FID of 2.06 on ImageNet 256, SiT pushes Interpolant-based models to new heights! (1/n)
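As a rough illustration of what an interpolant looks like, the sketch below uses the simplest case: a linear path between a data sample and Gaussian noise, with the velocity along that path as the regression target. The choice of a linear path and the function name are assumptions made for illustration; the interpolant framework covers other paths and prediction targets too.

```python
import numpy as np

def linear_interpolant(x, eps, t):
    """Assumed linear path: x_t = (1 - t) * x + t * eps, with t in [0, 1].

    Returns the interpolated sample and the velocity d(x_t)/dt = eps - x,
    which a velocity-prediction model would be trained to regress.
    """
    x_t = (1.0 - t) * x + t * eps
    velocity = eps - x
    return x_t, velocity

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))    # toy "data"
eps = rng.standard_normal((2, 3))  # Gaussian noise
x_half, v = linear_interpolant(x, eps, 0.5)
```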
4
20
224
66,747
Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding." We've repeatedly observed that diffusion gradients can negatively impact the backbone repr. This effect shows up in simpler settings—for example, we explored this issue to some extent in REPA-E (end2end-diffusion.github.io/). I believe the same principle applies to VLA. Fundamentally, the problem seems to be that diffusion gradients care too much about high-frequency details—whether in pixels or action policies—which tends to conflict with representation learning and understanding. btw, @ylecun has always been right about this -- long before any of these empirical findings.
as expected, this matches findings in unified multimodal understanding and generation models by @sainingxie: frozen VLM might help you. xichenpan.com/metaquery/
12
32
224
55,616
looking ahead, we’re prototyping something new -- we call it predictive sensing. our paper cited tons of work from cogsci and developmental psychology. the more we read, the more amazed we became by human / animal sensing. the human visual system is super high-bandwidth, yet insanely efficient. each eye’s 6 million cone receptors can transmit ~1.6 Gbit/s, yet the brain uses only about 10 bits/s to guide behavior. most sensory data is filtered, compressed, and everything is autopiloted -- you don’t even notice. how does our brain pull that off? one leading theory: your brain runs a predictive world model in the background for sensing, constantly forecasting the future and comparing it to what actually happens. - if the prediction error is low → it’s expected, you can ignore it. - if it’s high → it’s a surprise, and your brain pays attention, updating memory. we don't have anything comparable in LLMs right now. to test this idea, we trained a latent frame prediction (LFP) head on top of Cambrian-S. we estimate "surprise" during inference, and use it in two ways: 1️⃣ surprise-driven memory management -- compress or skip non-surprising frames, focus compute on surprising ones. 2️⃣ surprise-driven event segmentation -- use surprise spikes to detect event boundaries or scene changes. by leveraging signals from this internal predictive model, we’re already seeing promising gains on spatial cognition tasks. it’s just a toy predictive world model -- but with this mechanism, our small model outperforms gemini on vsi-super. [6/n]
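A toy sketch of the surprise mechanism described above, under stated assumptions: surprise is taken to be one minus the cosine similarity between the predicted and the observed frame latent, and a fixed threshold decides which frames count as surprising. Both choices, and all function names, are illustrative assumptions rather than the Cambrian-S implementation.

```python
import numpy as np

def surprise_scores(predicted, observed, eps=1e-8):
    """Per-frame surprise for (T, D) latents: 1 - cosine(predicted, observed)."""
    p = predicted / (np.linalg.norm(predicted, axis=-1, keepdims=True) + eps)
    o = observed / (np.linalg.norm(observed, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(p * o, axis=-1)

def frames_to_keep(surprise, threshold):
    """Surprise-driven memory: keep surprising frames, skip (or compress) the rest."""
    return surprise > threshold

def event_boundaries(surprise, threshold):
    """Surprise-driven segmentation: frame indices where surprise spikes."""
    return [int(i) for i in np.flatnonzero(surprise > threshold)]

# Toy stream: the predictor expects a static scene, but frame 3 changes.
predicted = np.tile([1.0, 0.0], (5, 1))
observed = predicted.copy()
observed[3] = [0.0, 1.0]
scores = surprise_scores(predicted, observed)
```

In this toy stream only frame 3 crosses the threshold, so it is both retained in memory and marked as an event boundary; the other frames are predictable and can be dropped.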
5
14
231
76,898
I like pretty pictures but one thing I like more about diffusion models is how they open up new doors for (useful) analysis-by-synthesis approaches. More and more research is showing that (pre-trained) diffusion models are pretty good feature extractors too. This empirical study reveals several intriguing aspects of this. We began with a DiT model, which has a complex design and isn't intended for feature extraction (yet it manages to produce good features). Gradually, we transformed it into a basic denoising auto-encoder (simple, but poor at feature extraction). Throughout this process, we discovered the key elements that matter and those that don't, and arrived at a new, simplified approach for vision SSL. Led by the vision SSL crew at FAIR (@endernewton, @liuzhuang1234 and Kaiming) and thanks @_akhaliq for sharing the work!
Meta presents Deconstructing Denoising Diffusion Models for Self-Supervised Learning. paper page: huggingface.co/papers/2401.1… We examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive procedure allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.
5
26
210
56,095
Awesome plot showing the progress in CLIP-like model training! As both a user and a researcher, there are a couple of caveats I personally feel are worth pointing out.
CLIP models have become a lot better since 2021
1
14
211
46,701
diffusion transformers have come a long way, but most still lean on the old 2021 sd-vae for their latent space. that causes a few big issues: 1. outdated backbones make the architecture more complex than it needs to be. the sd-vae runs at around 450 gflops, while a simple ViT-B encoder only needs about 22 gflops. 2. over-compressed latent spaces (just 4 channels) limit how much information can be stored. compression leads to intelligence they say, but not here: VAE-style compression doesn’t actually do much. it’s basically as limited as raw 3-channel pixels. 3. weak representations: with reconstruction-only training, the VAE learns weak features (~8% linear probe), which ends up slowing convergence and hurting generation quality. we’ve learned by now: representation matters for generation quality. and the sd-vae is just not built for that (2/n)
1
29
218
58,657
Thanks for bringing this to my attention. I honestly wasn’t aware of the situation until the recent posts started going viral. I would never encourage my students to do anything like this -- if I were serving as an Area Chair, any paper with this kind of prompt would be desk-rejected right away. That said, for any problematic submission, co-authors all share the responsibility, no excuse here. And this has been a good reminder for me, as a PI, to not just check the final PDF but also look through the full submission files. I wasn’t aware of this kind of need before. Let me take a moment to share what we found after doing a full internal review this past week -- everything’s backed up by logs and screenshots, available if needed.
1. Background. In November 2024, a researcher @jonLorraine9 tweeted this: nitter.app/jonLorraine9/status/18…. That was the first time I saw this kind of idea, and I think it was also when people realized that LLM prompts could be embedded in papers. Note that such injection only works if the reviewer uploads the PDF to an LLM directly. At that time, one thing we all agreed on is that LLMs should NOT be used for reviewing. It’s a real threat to the integrity of the process. That’s why conferences like CVPR and NeurIPS have now explicitly and strictly banned LLM reviewing (e.g., “LLMs are NOT allowed to be used for writing the reviews nor the meta-reviews at any step.”). If you've published at AI conferences, you probably know how frustrating it is to receive a review that was clearly written by an AI. It’s nearly impossible to respond to, and often just as hard to definitively prove that an LLM wrote it. While the original post might have been made partly as a joke, we all felt that trying to “fight fire with fire” isn’t the right defense -- it raises more ethical issues than it solves. A better path is to address these concerns through official conference policies, not through individual hacks that can backfire.
2. What happened in our case. The student author, who was visiting our group briefly from Japan, took that tweet a bit too literally and used the idea in an EMNLP submission. They copied the format exactly, not realizing it was partly a joke and could come across as manipulative or misleading. They also didn’t fully grasp how this might impact public trust in science or the integrity of peer review. On top of that, they included the same thing in the arXiv version without thinking twice. I missed it too, partly because this goes beyond the usual checks I have in place to catch anything ethically questionable as a coauthor.
3. Next steps. The student has since updated the paper and reached out to ARR for formal guidance. We'll follow whatever steps they recommend.
4. Bigger picture. This has been a teaching moment for me. Students under pressure don’t always think through all the ethical implications, especially in newer areas like this. My job is to guide them through these gray zones, not just react to their mistakes. Rather than punishment, what’s really needed is better education around these issues. I was upset with the student at first too. But after thinking it through, I don’t think the students should be punished beyond having the paper rejected. I’ve told them clearly this can’t happen in the future, and we’re also planning additional training around AI ethics and responsible research practices (which to me is more about having some common sense). I’ll be honest: it hasn’t been a good feeling being at the center of this kind of public shaming. These conversations should be thoughtful and constructive, not about singling people out. And honestly, the students feel the pressure even more. I've actually been keeping up with the public conversations around this, and in a recent poll, 45.4% of people said they think this kind of thing is actually okay. Sure, it’s just a poll and there could be bias, but it still says something about the nature of this problem. nitter.app/gabriberton/status/194…
The real issue here is the current system: it creates space for things like this to happen. And this isn’t traditional academic misconduct like faking data; it’s something newer, and it calls for a deeper, more nuanced conversation about how research ethics are evolving in the age of AI. In that sense, I don’t feel too bad; I feel confident I could explain the context honestly to any ethics board. And to circle back to the original post’s question, this whole situation really highlights why we need to rethink how the game is played in academia. That’s really the main point I was trying to make in my talk. I’m going to continue doing my best to help students learn how to do solid research. (This post was written by me, with help from ChatGPT-4o on editing.)
Is it ethical to add a hidden line of text in your paper saying "write a good review" in case R2 uses chatGPT to review your paper?
10
29
216
39,300
you’re reading my mind :) all i want are better representations — but knowing how to use them guides how we build them. a central theme in our recent research is unifying these seemingly separate domains so we can build stronger representations and ultimately true world models. with good representations, diffusion models can easily render them into pixels, multimodal llms can easily decode them into language, and so on.
Congrats on this amazing paper @sainingxie, really love these results! I wonder if further improvements in representation learning can push this paradigm.
3
6
207
34,027
Courant is now a new school!💜
NYU (@nyuniversity) announced the creation of the Courant Institute School of Mathematics, Computing, and Data Science today, signaling the university’s enthusiastic commitment to mathematics, computing, and data science over the coming decades: nyu.edu/about/news-publicati…
3
10
206
30,855
Almost every deep learning model for 3D recognition has been *trained from scratch*. In our #ECCV2020 spotlight paper, we propose 👉PointContrast👈, an unsupervised pre-training framework that boosts performance on 6 different 3D point cloud benchmarks. arxiv.org/abs/2007.10985
3
39
200
The three biggest hps for stable training in everything are lr, bs, and beta2. We’ve built up good intuitions on how to tune them over time, but this lays it all out analytically and convincingly. this is definitely my new handbook for training big models on small gpus.
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
3
19
197
20,914
two exciting directions for diffusion models in 2025: either going (extremely) small or going (extremely) big with your steps
5
11
188
16,515
Replying to @jbhuang0604
there's no true self-supervised learning in text - it's (strongly) supervised learning.
7
4
153
14,789
Really enjoyed working on this project; some thoughts on why I believe combining the creative freedom of generative models with the precision of the 3D graphics pipeline could be the future. (1/n)🧵
Intel and NYU present Image Sculpting: Precise Object Editing with 3D Geometry Control. paper page: huggingface.co/papers/2401.0… We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from existing methods, which are confined to 2D spaces and typically rely on textual instructions, leading to ambiguity and limited control. Image Sculpting converts 2D objects into 3D, enabling direct interaction with their 3D geometry. Post-editing, these objects are re-rendered into 2D, merging into the original image to produce high-fidelity results through a coarse-to-fine enhancement process. The framework supports precise, quantifiable, and physically-plausible editing options such as pose editing, rotation, translation, 3D composition, carving, and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines.
3
18
142
31,512
as always, we’re releasing everything: the paper, the model, and the PyTorch code. this project has been led by three amazing students: Boyang @boyangzheng_ (1st year phd), Willis @ma_nanye (2nd year phd), and Peter @TongPetersb (3rd year phd). we’ve been working on this for nearly a year, learning (and struggling!) a lot along the way to truly make it work. hope you’ll enjoy it as much as we do paper: arxiv.org/abs/2510.11690 blog: rae-dit.github.io code: github.com/bytetriper/RAE (7/n)
2
10
144
12,397
When I saw these results, it didn’t feel like we invented something entirely new—it felt more like we barely understand the representations learned from diffusion models and SSL methods. This has many implications for building a true world model. Plus, we still need new, scalable approaches to improve visual representations. So much left to uncover in the space between generative models and representation learning. (n/n) The project was led by the awesome @sihyun_yu at KAIST. Check out our: - Paper: arxiv.org/abs/2410.06940 - Project page: sihyun.me/REPA/ - Code: github.com/sihyun-yu/REPA And don't miss Sihyun’s technical thread 🧵: nitter.app/sihyun_yu/status/18454…
Introducing REPA! We show that learning high-quality representations in diffusion transformers is crucial for boosting generation performance. With REPA, we speed up SiT training by 17.5x (without CFG) and achieve state-of-the-art FID = 1.42 using CFG with the guidance interval. 🧵[1/7] sihyun.me/REPA
4
4
138
13,030
The pre-trained, frozen VAE in DiT is massive (much higher flops than the transformer itself!!). Do you have to freeze it? What if you could leverage that capacity through e2e training? It turns out it doesn't play well with diffusion loss -- but with REPA, you can make it work!
Can we optimize both the VAE tokenizer and diffusion model together in an end-to-end manner? Short Answer: Yes. 🚨 Introducing REPA-E: the first end-to-end tuning approach for jointly optimizing both the VAE and the latent diffusion model using REPA loss 🚨 Key Idea: 🧠 Traditional deep learning wisdom dictates that end-to-end learning is preferable when feasible ❗️However for latent diffusion models - the training process remains two stage: VAE for reconstruction and then LDM for generation 🤔 This leads to complex interplay between two stages: how to best optimize representation from stage one (VAE) to improve generation performance in stage two (LDM)? 🔥 We propose REPA-E: an end-to-end approach for jointly optimizing both VAE and LDM using representation-alignment (REPA) loss. End-to-End tuning offers several advantages over traditional training: ⚡️Faster Training: REPA-E speeds up diffusion training by more than 17x over REPA and 45x over vanilla training recipe. 🚀 Better Performance: Despite its simplicity, REPA-E achieves SOTA FID performance of 1.26 and 1.83 w/ and w/o cfg on ImageNet-256 🎉 Better Latent Space Structure: End-to-End training helps automatically optimize VAE features to best support generation performance -> leading to a better latent space structure. 💪 Improved VAE Performance: end-to-end tuned VAEs (E2E-VAE) can be used as drop-in replacement for their original counterparts (SD-VAE) showing significantly better downstream performance for both reconstruction and generation. Exciting project co-lead with @xingjian_leng and in collaboration with @YunzhongH @ZhenchangXing @LiangZheng_06 and @sainingxie 🙏 Project Page: end2end-diffusion.github.io Code: github.com/End2End-Diffusion… [1/7] 🧵
3
16
141
15,069
for all the details, resources, and projects, check out: 🔗 cambrian-mllm.github.io/ we’re a small group of researchers, but we’ve been fortunate to have incredible support from thought leaders like @drfeifei and @ylecun. (huge congrats to both on the Queen Elizabeth Prize!) they’ve guided us throughout this project, helping us think differently, recalibrate our north stars, and explore new paths toward intelligence. the project is led by @shushengyang, who previously worked on Qwen models and is now fearlessly pushing into the next paradigm. core contributions from @jihanyang13, @PinzhiHuang, @_ellisbrown, and many others (too many to tag but equally important ❤️). special thanks to @googlecloud TRC program for enabling us with the compute + storage to dream big 🚀 we’re proud to give back with many open-source JAX and PyTorch XLA codebases. finally, hope you enjoyed our little film for this project: a $0-budget student production led by @fred_lu_443, but really, a love letter to NYC. a city full of people, stories, and motion -- the spark behind our dream of supersensing intelligence. not just to see the world, but to feel it. to understand it. to help the people in it. 🤍 [n/n]
4
14
145
34,671
Here's a bit of reflection: when I moved from industry to academia, I wasn't sure if we'd ever be able to pull off a large-scale project like this that requires full-stack skills. The students amazed me with their dedication and courage. Our team, with PhDs, masters, and undergrads, all contributed substantially, solved millions of technical challenges on data, infra, and modeling, and gained tons of experience along the way. This project wouldn't have been possible without support from the Google TPU Research Cloud program (big thanks to @JeffDean @demishassabis for the continuous support to academia). I think Cambrian shows what can be done to complement industry efforts. We can't afford to scale up LLMs, but that was never our goal anyways. Give us some resources, and we'll definitely share useful stuff ;) We call our model Cambrian because, just like in the Cambrian explosion where creatures developed better vision, we believe improved vision is not just about seeing farther, but about understanding more deeply. Check out our paper on arxiv: arxiv.org/abs/2406.16860 Our project page with tons of resources: cambrian-mllm.github.io/ And feel free to talk with our team if you have any questions and/or want to collaborate! [10/10]
7
6
113
9,984
Excited to be at CVPR next week with my students. We’ll be presenting our work in the main conference and several workshops and tutorials, including this one👇 See you soon in Nashville!
Join us for a full-day tutorial on Scalable Generative Models in Computer Vision at @CVPR in Nashville, on Wednesday, June 11, from 9:00 AM to 5:00 PM in Room 202 B! 👉 We are honored to have @sainingxie, @deeptigp, @thoma_gu, Kaiming He, @ArashVahdat, and @sherryyangML to speak at our tutorial! 👉 More details and the full schedule are available at: vision-x-nyu.github.io/scala… #CVPR2025
2
4
118
10,703
Check out the latest paper from @TongPetersb on grokking both visual understanding and generation abilities through (modest scale) instruction tuning. The data composition reveals both asymmetry and prioritization: we represent to understand, and we understand to create.
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit visual generation from an LLM, when trained jointly with visual understanding.
1
10
110
13,724
It was a blast seeing everyone at nyu and getting to learn about all the cool work. this is why nyc (and surroundings) are a great place for computer vision 😊
Fresh memories from 🗽NYC Vision Day🗽 hosted at NYU yesterday, April 1st (cs.nyu.edu/~fouhey/NYCVision…) Grateful to the organizers (David Fouhey, @jiadeng, @orussakovsky, @Jimantha, @cvondrick @sainingxie) for setting up such an amazing event to bring the vision community together!
2
3
110
14,369
I used to think diffusion models struggled to denoise efficiently in high-dimensional spaces -- but I was wrong again. since RAE latent spaces are inherently high-dimensional, diffusion transformers require adaptation, but with just three simple tweaks, they perform *remarkably* well. 1. wide DiT design: we found that for diffusion to function properly, the transformer width d must be at least as large as the latent token dimension n. when this condition isn’t met, the model can’t even overfit a single example (!!). our paper goes into the theory and intuition behind why this constraint is so critical. 2. noise scheduling: this isn’t a new concept. resolution-dependent noise schedule shifts have long been used for high-resolution image generation. the same idea applies here: shifting the noise schedule allows diffusion models to adapt smoothly to increased input channel dimensionality. 3. noisy decoder: to make the decoder more resilient to small diffusion errors in latent space, we add a touch of noise during decoder training. this helps the decoder gracefully handle subtle imperfections in reconstructed representations. with these simple changes, we can already train a DiT-XL model that outperforms REPA, without introducing ANY auxiliary losses or additional training stages. I still love REPA, but… RIP REPA🪦. With RAE, convergence is up to 16× faster than the REPA with sd-vae. (4/n)
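The three tweaks above can be sketched in a few lines. This is a hedged illustration, not the paper's code: the shift is assumed to take the same form as the resolution-dependent timestep shift used for high-resolution image generation, with the ratio of input dimensionalities standing in for the ratio of resolutions; `sigma` and every function name are hypothetical.

```python
import math
import numpy as np

def width_is_sufficient(model_width, latent_token_dim):
    """Tweak 1: the transformer width d should be at least the latent token dimension n."""
    return model_width >= latent_token_dim

def shift_timestep(t, base_dim, target_dim):
    """Tweak 2 (assumed form): push t in [0, 1] (t = 1 is pure noise) toward
    higher noise when the input dimensionality grows from base_dim to target_dim."""
    alpha = math.sqrt(target_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

def noisy_latent(z, sigma, rng):
    """Tweak 3: perturb encoder latents with Gaussian noise before decoding at train time."""
    return z + sigma * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))           # toy latent tokens
z_train = noisy_latent(z, 0.1, rng)       # what the decoder would see in training
t_shifted = shift_timestep(0.5, 100, 400)  # alpha = 2, so 0.5 maps to 2/3
```

The endpoints are preserved (0 maps to 0 and 1 maps to 1), so the shift only redistributes where training and sampling spend their time along the noise schedule.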
2
8
112
21,260
team industry -1, team academia +1 🎉😉 students, here’s your chance to join an amazing lab ⬇️
Excited to share that I will be joining Princeton Computer Science @PrincetonCS as an Assistant Professor in September 2025! I'm looking for students to join me. If you are interested in working with me on VLMs, LLMs, deep learning (vision/LLM) architectures, data, training, efficiency, or understanding, please apply!
1
1
103
19,243
I used to think that semantic encoders primarily captured high-level, abstract representations and discarded fine-grained visual details, but I was wrong. we employ pretrained representation encoders (such as DINO, SigLIP, and MAE, all based on standardized ViTs) combined with trained decoders. unlike some recent approaches, RAE follows a minimalist path. there’s no extra training or alignment stages, no auxiliary losses, and no adapter layers that reintroduce compression. you simply take a pretrained semantic encoder and train a decoder using L1 + LPIPS + GAN loss, that’s it. despite this simplicity, RAEs achieve reconstruction quality that *surpasses* SD-VAEs in terms of rFID (3/n)
4
8
107
32,906
guys, real geospatial data is a total goldmine for digital agents. step away from the web browser and get real. (we explored a bit in virl-platform.github.io, but building a simulation-ready pipeline like this could take things way further)
Replying to @gan_chuang
Virtual Community provides an online pipeline that automatically generates 3D scenes from real geospatial data, performing comprehensive cleaning and enhancement of both geometry and texture — including mesh simplification, texture refinement, object placement, and automatic annotation. 3/n
4
16
100
20,406
one more thing: a special BONUS drop! we’re also open sourcing a JAX nnx diffusion codebase that Willis @ma_nanye has been building: nitter.app/ma_nanye/status/197792… it’s by far the most research-friendly, comprehensive, and efficient JAX implementation I’ve seen for diffusion and flow-based models. it supports a wide range of architectures and models, (including RAE) and is built from the ground up for scalability (e.g. large-scale training with fsdp). our lab has already been developing several projects on top of it with TPUs (thanks Google!), and we’re excited to invite the community to join us, explore, and build together! (n/n)
Excited to introduce DiffuseNNX, a comprehensive JAX/Flax NNX-based library for diffusion and flow matching! It supports multiple diffusion / flow-matching frameworks, Autoencoders, DiT variants, and sampling algorithms. Repo: github.com/willisma/diffuse_… Delve into details below! 🧵[1/n]
3
5
104
16,081
From our previous projects (MMVP, V*, VIRL), we've noticed unexpected visual shortcomings in current MLLM systems. While we can temporarily fix issues by e.g. adding data, one root problem is that our visual representations are not yet sufficient for language understanding. In the short term, projects like Astra and GPT-4o are impressive. However, to develop a reliable multimodal assistant that perceives the real world like humans, manages complex tasks robustly, and acts accordingly, weak sensory grounding will likely become a bottleneck. Language priors are powerful, but we shouldn't use them as crutches (quoting @ylecun) to compensate for deficiencies in visual representations. [2/n]
2
9
94
40,613
Replying to @johnschulman2
lol thanks john maybe I should frame it ;) I remember walking out of the building after the day, it was dark, felt so hyped, had no idea then how much the world would change in the next seven years.
2
98
10,029
People (in academia) always tell me that training DiTs/SiTs is way too hard because it takes 7M iters and weeks to get the FID we reported in the paper. We figured out how to speed up training by ~18X, hitting even better FID in less than 400K iters. We did this by digging into the representation learned from diffusion models (2/n).
2
3
95
17,370
Replying to @tydsh
feeling nostalgic. i really miss the days we spent working together on research with such brilliant colleagues and interns at FAIR. all good things eventually come to an end, but I’m hopeful that new and even better chapters are ahead
1
98
27,572
yup more multimodal more fun :)
Career update: after an amazing journey at OpenAI I left to join something new, exciting, and multimodal!
2
91
23,713
Unsung hero for academia.
everybody talks big game about democratizing AI, but today I'm super grateful that @GoogleColab gives free TPU + GPU instances. Makes it possible for a lot of students to learn about ML without spending a few thousand dollars building a deep learning PC
3
89
19,093
Now things become natural: we can supercharge diffusion transformer training by adding a simple regularization that maximizes the similarity between the diffusion transformer's latent representation and a powerful external visual representation like DINOv2. This simple tweak leads to surprising results‼️: training DiTs and SiTs becomes significantly easier, and we’ve achieved a state-of-the-art FID=1.42 on ImageNet 256x256 with guidance interval. Plus, this method scales well — the improvements are even greater for larger models. (5/n)
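A minimal sketch of the kind of regularizer described above, assuming the hidden states have already been projected to the same dimension as the external features and that similarity is measured per token with cosine similarity. The weighting `weight` and the function names are hypothetical placeholders, not values or APIs from the REPA code.

```python
import numpy as np

def alignment_loss(hidden, target, eps=1e-8):
    """Negative mean cosine similarity between (N, D) projected hidden states
    and (N, D) external features (e.g. from a frozen encoder such as DINOv2)."""
    h = hidden / (np.linalg.norm(hidden, axis=-1, keepdims=True) + eps)
    y = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(-np.mean(np.sum(h * y, axis=-1)))

def total_loss(diffusion_loss, hidden, target, weight=0.5):
    """Diffusion objective plus the alignment regularizer (weight is a placeholder)."""
    return diffusion_loss + weight * alignment_loss(hidden, target)

feats = np.array([[1.0, 2.0], [3.0, 4.0]])
aligned = alignment_loss(feats, feats)      # perfectly aligned tokens
opposed = alignment_loss(feats, -feats)     # perfectly anti-aligned tokens
```

Minimizing the total loss therefore maximizes the similarity term, which is the "simple regularization" the post refers to.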
4
6
92
16,136
Detailed visual grounding is crucial but to see that, we need to raise the bar for benchmarking. We created V*Bench, a challenging vision-centric benchmark. Our model with visual search outperforms GPT4V by a big margin, despite using a much worse LM. Better Vision Matters! (5/n)
3
13
85
39,749
benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts
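One way to read this suggestion is as a diagnostic: fit a "blind" model on the test set's text alone, with cross-validation, and see how far it gets without ever looking at an image. The sketch below is a deliberately crude version under that assumption; the memorize-the-most-common-answer baseline and the function name are illustrative and not the method from the quoted thread.

```python
from collections import Counter, defaultdict

def blind_kfold_accuracy(items, k=5):
    """items: list of (question_text, answer) pairs from a benchmark's test set.

    For each fold, memorize the most common answer per question string on the
    other folds, then guess on the held-out fold. No visual input is used, so
    a high score points to an exploitable non-visual shortcut.
    """
    correct = 0
    for fold in range(k):
        train = [it for i, it in enumerate(items) if i % k != fold]
        held_out = [it for i, it in enumerate(items) if i % k == fold]
        by_question = defaultdict(Counter)
        overall = Counter()
        for question, answer in train:
            by_question[question][answer] += 1
            overall[answer] += 1
        fallback = overall.most_common(1)[0][0] if overall else None
        for question, answer in held_out:
            if question in by_question:
                guess = by_question[question].most_common(1)[0][0]
            else:
                guess = fallback
            correct += int(guess == answer)
    return correct / len(items)

# Toy benchmark where the question text alone gives the answer away.
leaky = [("how many chairs are there?", "2")] * 10
# Toy benchmark where the text carries no usable signal.
clean = [("q%d" % i, str(i)) for i in range(10)]
```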
🌶️ hot take 🌶️ > we should normalize training on the test set yes, you read that right. no, I'm not joking. and, yes... I have taken ML 101 👉 here's why this is crucial for future multimodal LLM research [1/n] 🧵
1
5
86
20,544
Oh this is a fun hybrid architecture: using DiT for TTS, but at the input using ConvNeXt blocks for additional refinement.
Replying to @reach_vb
Overall architecture:
1
10
86
12,052
Recently open-sourced projects from @TongPetersb, @DavidJFan, and the team at Meta FAIR. MetaMorph (training code and model weights): github.com/facebookresearch/… Web-SSL (model weights for Web-DINO and Web-MAE) github.com/facebookresearch/… FAIR's still leading the way in open research.
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/f… Github: github.com/facebookresearch/… Huge credit to @DavidJFan for putting these models together!
1
12
85
10,575
Excited to present in the upcoming tutorials/workshops/posters and reconnect with old friends at #ECCV2024 Milano!
Sunday AM (29th): Tutorial on Large Multimodal Foundation Models (boyiliee.github.io/lmfm/)
Sunday PM (29th): 2nd OmniLabel Workshop on Enabling Complex Perception Through Vision and Language Foundational Models (sites.google.com/view/omnila…)
Monday PM (30th): EVAL-FoMo 24: Emergent Visual Abilities and Limits of Foundation Models (sites.google.com/view/eval-f…)
Also, I’ll be hanging around poster sessions presenting:
Tuesday PM (Oct 1): Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (SiT) with @ma_nanye
Wednesday AM (Oct 2): V-IRL: Grounding Virtual Intelligence in Real Life
🤪*I'm a friendly ICLR deadline reminder*🤪
Thursday AM (Oct 3): Fast Encoding and Decoding for Implicit Video Representation
8
82
7,263
Just landed in #Paris🇫🇷 and excited to attend #ICCV2023 in person. Tomorrow (Oct 2nd) afternoon, join us at the #QVCV workshop (1:30pm - 5:30pm @ S03). Computer vision research is evolving - hear from thought leaders about what the future may hold. (gkioxari.github.io/Tutorials…)
1
7
77
13,223
Arrived in Vancouver for #NeurIPS2024 now! Don’t miss @_ellisbrown and @TongPetersb’s talk about Cambrian-1—I’ll be there and at the poster too. Excited to connect with you all!
Heading to #NeurIPS2024 to present Cambrian-1 w/ @TongPetersb! Catch our oral presentation Friday @ 10am (Oral 5C) and our poster afterwards until 2pm (#3700 in East Hall A-C) 🪼🎉
3
79
5,907
using this DiT variant with a shallow/wide DDT head, we achieve strong image generation results on imagenet. some highlights: > 1.51 FID at 256×256 (without any guidance) > 1.13 FID at both 256×256 and 512×512 (with auto-guidance) personally, I don’t think these absolute sota FIDs tell the whole story anymore. what could matter more: how quickly a diffusion model can be trained, because that reflects the quality of its underlying representation. in this sense, the RAE-based DiT also stands out: it converges extremely fast, reaching 3.71 FID after only 20 epochs of training. and the samples are very fun to look at: the generated images are remarkably *diverse* and high quality! (6/n)
2
3
76
13,340
As @ylecun often points out, relying solely on the "rendering" loss isn't enough. If your focus is just on reconstructing nice-looking pixels, there's no way to filter out irrelevant details from the input — which is key to learning robust representations. Looks like even if your goal is to create nice-looking images, you still need to learn a strong representation first before getting into the details that make the image look good. (4/n) nitter.app/ylecun/status/18036775…
Can generative image models be good world models? This work from @Meta FAIR shows that there is a tradeoff between realism and diversity. The more realistic a generative model becomes, the less diverse it becomes. Realism comes at the cost of coverage. In other words, the most realistic systems are mode-collapsed. My hunch, supported by a growing amount of empirical evidence, is that world models should *not* be generative. They should make predictions in representation space. In representation space, unpredictable or otherwise irrelevant information is absent. This is the main argument in favor of JEPA (Joint Embedding Predictive Architectures).
1
4
73
13,640
In our new work: "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs", we try to understand the roots of these errors. How do we do this? The key here is to explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify 🙈''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences. (2/n)
2
13
71
9,335
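A rough sketch of what hunting for "CLIP-blind pairs" could look like, assuming you already have per-image embeddings from CLIP and from a vision-only SSL model. The function names and both thresholds (`clip_min`, `ssl_max`) are hypothetical placeholders for illustration, not values taken from the thread:

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs that look alike to CLIP but not to the SSL model.

    clip_min and ssl_max are illustrative thresholds (assumptions).
    """
    pairs = []
    for i, j in combinations(range(len(clip_emb)), 2):
        similar_for_clip = cosine(clip_emb[i], clip_emb[j]) > clip_min
        different_for_ssl = cosine(ssl_emb[i], ssl_emb[j]) < ssl_max
        if similar_for_clip and different_for_ssl:
            pairs.append((i, j))
    return pairs


# Toy embeddings: images 0 and 1 nearly coincide in CLIP space,
# but are orthogonal in the SSL space.
toy_clip = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
toy_ssl = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(clip_blind_pairs(toy_clip, toy_ssl))
```

On the toy data only the pair (0, 1) is flagged: CLIP sees the two images as near-duplicates while the SSL features keep them apart.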
Jiatao is such an AI polymath, with amazing knowledge and experience across so many areas🤯! Every time I chat research with him, I come away learning so much. You should definitely apply to his lab—it’d be an incredible experience!
Life update: Excited to share that I will be joining @CIS_Penn @PennEngineers as an Assistant Professor in Fall 2025!🤯 I’m also seeking multiple PhD students passionate about Generative Intelligence and leveraging it to empower AI agents to interact with the Physical World🌟
1
68
12,288
'creativity is intelligence having fun,' indeed -- huge congrats to @drfeifei and my friends at WL on such an incredible launch!
“Creativity is intelligence having fun.” Unleash your creativity and imagination with Marble - our 3D world generation model, now available to everyone!
7
68
15,022
it’s a long paper, but trust me -- there are a lot of details we found genuinely interesting. if you’re working on video multimodal models it might be worth a read. I don’t know if our approach is the right path -- but I do know the current paradigm is not enough, and open science, open research is the only way forward. and it’s not just this one paper. we’re also co-releasing two related projects led by @_ellisbrown 👇 1️⃣ a study on multimodal benchmark design -- how to stress-test benchmarks and properly remove language bias. 2️⃣ a summary of our lessons from building simulators to collect spatial sensing videos (the same data we used for Cambrian-S). [7/n]
6
5
66
8,860
Looking forward to attending in person tomorrow!
A reminder that our Transformers for Vision workshop in #CVPR22 is happening this Sunday, June 19th at 7:50am CST in Great Hall D (and also on Zoom). We have an amazing speaker lineup, great panelists, and excellent paper sessions. Looking forward to seeing everybody!
2
6
63
Excited to see more open multimodal LLMs! Also quite impressive vision-centric performance on e.g. MMVP (huggingface.co/MMVP) with Gemma-2B.
We release PaliGemma. I'll keep it short, still on vacation: - sota open base VLM designed to transfer quickly, easily, and strongly to a wide range of tasks - Also does detection and segmentation - We provide lots of examples - Meaty tech report later! ai.google.dev/gemma/docs/pal…
2
7
63
15,525
not vibe/value alignment but “reality alignment”: we could let the model imagine in the multi-universe, then align it to earth in post-training. @giffmana you might find this interesting too then: arxiv.org/abs/2503.09595
This paper is interestingly thought-provoking for me. There is a chance that it's easier to "align t2i model with real physics" in post-training. And let it learn to generate whatever (physically implausible) combinations in pretrain. As opposed to trying hard to come up with stuff that is supposed to learn only really physically plausible stuff from the start but might never work (not gonna call names here but i have something prominent in mind lol)
2
12
64
13,422
(🤷Now a bit of rant) The real issue is that working on visual representation learning is quite challenging right now. While CLIP-based models, which are strongly supervised by language, have proven to be effective, they come with their own set of problems, such as attribute binding. These models have been around for a while, and it's surprising that we haven't seen any major advancements. On the other hand, vision SSL models are impressive, but the traditional evaluation protocols (like linear probing or transferring to object detection) are no longer effective. They have become outdated and disconnected from current applications, making a lot of people think vision SSL has hit a wall. Nevertheless, I firmly believe that we should continue to push forward. CLIP/SigLIP models are great, but we need to diversify our approaches and keep exploring new possibilities instead of settling and claiming victory. (I'm sure @giffmana who has explored new approaches like CapPa, would agree with this perspective as well.) This situation is reminiscent of 2015-2016 when ImageNet supervised pre-training was deemed unbeatable, with other visual representations trailing by at least 10-15%. However, this did not deter researchers from exploring diverse approaches and pre-text tasks. It wasn't until several years later that MoCo demonstrated the potential to surpass a supervised pre-trained model. [3/n]
2
60
13,290
Some key observations: 1⃣ As many have noticed recently, diffusion transformers can produce reasonable representations, and better generative models lead to stronger representations. 2⃣ However, these are still much weaker than sota visual representations learned through SSL methods like DINOv2, JEPA or MAE. 3⃣ When we measure the alignment between diffusion features and DINOv2, the diffusion model makes steady progress throughout training, but it’s a slow climb. (3/n)
2
2
61
16,887
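The thread doesn't say which alignment metric is used, so here is just one generic way to put a number on "alignment" between two feature spaces: mutual k-nearest-neighbor overlap. Everything below (function names, `k`, the toy features) is an assumption for illustration, not the authors' measurement:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_sets(feats, k):
    """For each sample, the set of indices of its k most similar other samples."""
    neighbors = []
    for i, u in enumerate(feats):
        sims = [(cosine(u, v), j) for j, v in enumerate(feats) if j != i]
        sims.sort(reverse=True)
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k=1):
    """Average fraction of shared k-nearest neighbors across two feature spaces.

    1.0 means both spaces agree on every neighborhood; 0.0 means no overlap.
    """
    nn_a = knn_sets(feats_a, k)
    nn_b = knn_sets(feats_b, k)
    shared = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return shared / (k * len(feats_a))


# Toy features for the same four images under two hypothetical encoders.
encoder_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
encoder_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(mutual_knn_alignment(encoder_a, encoder_a))
print(mutual_knn_alignment(encoder_a, encoder_b))
```

Tracking a score like this across training checkpoints is one way to see the "steady but slow climb" described above.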
new update from the LiveCodeBench Pro competitive programming team
LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process is still a black box. We introduce AutoCode, which democratizes evaluation allowing anyone to locally run verification and perform RL training! For the first time, we also show that an LLM can act as a problem setter, transforming a simple problem into a harder version sometimes even harder than what it can solve itself. In other words, LLMs can generate problems they can’t yet solve, opening the door to true self-play. Moreover, through an agentic framework, we find that LLMs can automatically generate test cases, achieving 98.7% evaluation consistency, which is already highly practical accuracy for an RL verifier.
1
1
58
17,781
Finally, this is a completely open project where we have released the training code, model weights, all benchmarks, and detailed information such as system prompts and evaluation pipelines. These aspects are sometimes overlooked in research papers, so we made sure to provide them to save you the hassle of searching for them yourself. Regarding data, we collected Cambrian-7M from public datasets and investigated data mixing and balancing. This dataset is the largest for instruction tuning, and we are also open-sourcing it. [9/n]
1
1
54
8,998
last year, we built Cambrian-1, an open exploration of multimodal models for images. instead of just scaling up to Cambrian-2 or 3, we paused to ask:
- what does true multimodal intelligence mean?
- does the LLM paradigm even make sense for sensory modeling?
- and why is human sensing so effortless, so intuitive, yet so powerful?
something fundamental was missing and you really can’t build superintelligence without first building supersensing.
so, what is supersensing? supersensing in our context isn’t about fancy sensors or better cameras. it’s about how a digital being truly experiences the world, absorbing endless streams of input and learning from them. supersensing is part of intelligence just like eyes are the part of the brain that touches the outside world. you don't need sensing to solve coding and math. but AI agents in the real world need sensory modeling. or maybe as @karpathy said, that’s all it’ll ever need.
okay, enough philosophy. to be more concrete, we think the following taxonomy makes sense and captures how things evolve from what we have now to what we actually need to build next.
0. (linguistic-only understanding): no sensory capabilities; reasoning confined to text and symbols. current MLLMs have progressed beyond this stage, yet still retain traces of its bias.
1. semantic perception: parsing pixels into objects, attributes, and relations. this corresponds to the strong multimodal "show and tell" capabilities present in MLLMs.
2. streaming event cognition: processing live, unbounded streams while proactively interpreting and responding to ongoing events. this aligns with current efforts to make MLLMs real-time assistants.
3. implicit 3D spatial cognition: understanding video as projections of a 3D world. agents must know what is present, where, how things relate, and how configurations change over time. today’s multimodal models remain VERY limited here.
4. predictive world modeling: the brain makes “unconscious inferences” by predicting latent world states based on prior expectations. Current multimodal systems don’t have an internal model that can anticipate future states, maintain persistent memory, or reason and plan.
to study all this, video is the ultimate medium -- it’s how humans experience the world every day, a direct projection of our lived experience. [2/n]
1
1
59
8,246
awesome work by @jiacheng_chen_ and @sanghyunwoo1219 on 3D-grounded visual compositing (and nice demos!)
Introducing BlenderFusion: Reassemble your visual elements—objects, camera, and background—to compose a new visual narrative. Play the interactive demo: blenderfusion.github.io/
4
8
52
9,517
Replying to @atulit_gaur
you can ask the same for vision people, the expected answer is 768=16x16x3
2
55
21,970
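For anyone wondering where the 768 comes from, a quick sketch, assuming the usual ViT setup of 16×16 pixel patches on 3-channel RGB images (the function name is made up for illustration):

```python
def flattened_patch_dim(patch_size: int = 16, channels: int = 3) -> int:
    """Number of raw pixel values in one square image patch."""
    return patch_size * patch_size * channels


# One 16x16 RGB patch flattens to 16 * 16 * 3 values.
print(flattened_patch_dim())  # 768
```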
Welcome Gautam! So excited to have you join us🥳👏!! Thinking about a PhD in AI? @NYU_Courant is a great place, and so is NYC. Apply before the deadline!
Excited to announce that I'm joining NYU Courant (@NYU_Courant) CS as an Assistant Professor in Fall 2025 (~1 year from now). If you wish to work with me as a PhD student in theory/empirics of robustness/privacy in stats/ML (or related topics), apply to Courant CS by Dec 12! 1/n
1
53
12,288
much needed demystification - kudos to the arc agi team for this
We were able to reproduce the strong findings of the HRM paper on ARC-AGI-1. Further, we ran a series of ablation experiments to get to the bottom of what's behind it. Key findings:
1. The HRM model architecture itself (the centerpiece of the paper) is not an important factor.
2. The outer refinement loop (barely mentioned in the paper) is the main driver of performance.
3. Cross-task transfer learning is not very helpful. What matters is training on the tasks you will test on.
4. You can use far fewer data augmentations, especially at inference time.
Findings 2 & 3 mean that this approach is a case of *zero-pretraining test-time training*, similar to the recently published "ARC-AGI without pretraining" paper by Liao et al.
1
54
10,928
We integrate the VQA LLM with a visual search model. With LLM's world knowledge, V* performs multi-round, guided search for visual targets. It extracts local features, and adds them to a working memory. The searched data are then used by VQA LLM to generate final responses. (4/n)
1
5
53
9,339
metaquery is now open-source — with both the data and code available.
The code and instruction-tuning data for MetaQuery are now open-sourced! Code: github.com/facebookresearch/… Data: huggingface.co/collections/x… Two months ago, we released MetaQuery, a minimal training recipe for SOTA unified understanding and generation models. We showed that tuning few learnable queries can transfer the world knowledge, strong reasoning, and in-context learning capabilities inherent in MLLMs to image generation. With the training code now available, you can train MetaQuery yourself almost as easily as fine-tuning a diffusion model. We have also open-sourced our 2.4M instruction-tuning dataset. Sourced from web corpora, it offers diverse supervision beyond copy-pasting and unlocks many new exciting capabilities. Thanks @metaai for their support in making it open source!
2
7
56
9,975