General intelligence and continual learning at Meta TBD Lab. Previously: PhD at MIT, M.S. at CMU, B.S. at PKU.

Bay Area
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training), a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch (no custom kernels) 🚀 10× the GPU FLOPs utilization of previous nonlinear test-time training (TTT) methods 🧠 Huge memory size (up to 40% of model params) Project page with code: tianyuanzhang.com/projects/t… (videos generated with our AR video diffusion) 1/9
7
80
425
101,817
3D Gaussians are great, but how can you interact with them 🌹👋? Introducing #PhysDreamer: Create your own realistic interactive 3D assets from only static images! Discover how we do this below👇 🧵1/: Website: physdreamer.github.io/
13
80
381
60,830
Model and training code for LaCT on language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are released 🤓 Come play!
3
18
112
9,736
Image depth/normal models have advanced so fast. How can we use them for consistent video depth/normal? Introducing Buffer Anytime, a framework to learn video depth/normal without video annotations!
Recent video depth models thrive on large-scale annotated datasets, but what if they’re unavailable? Introducing Buffer Anytime: a zero-shot framework using image priors to predict video depth and normals. Trained exclusively on unannotated videos, it achieves surprisingly smooth and consistent predictions that surpass image-based methods and are comparable to state-of-the-art fully-supervised video models. 🌐 Website: bufferanytime.github.io 📄 Paper: arxiv.org/pdf/2411.17249 Huge thanks to my amazing collaborators: @tianyuanzhang99, @KaiZhang9546, @HaoTan5, @Sai__Bi, Yiwei Hu, @zexiangxu, @MilosHasan, @GordonWetzstein and @fujun_luan Check our website for more amazing results!
2
7
112
10,599
Attending CVPR in Seattle this week. Happy to chat about anything!
3
2
100
10,845
Very interesting work from MIT office mates! Diffusion Forcing with History Guidance introduces a novel approach to video generation, excelling at ultra-long sequences—800+ frames shown in the paper!
Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: boyuan.space/history-guidanc… (1/7)
1
7
66
6,217
part science, part empirical, part magic. All driven by extreme curiosity!!
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streamin…
4
1
39
9,301
Check out our learning-based deterministic model for novel view synthesis: no NeRF or 3DGS, yet it produces consistent renderings (it still needs camera poses). We try to use few 3D inductive biases and keep it simple!
Novel view synthesis has long been a core challenge in 3D vision. But how much 3D inductive bias is truly needed? —Surprisingly, very little! Introducing "LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias"—a fully transformer-based approach that enables scalable, generalizable, and fully data-driven novel view synthesis, from sparse posed inputs. 🧵(1/6) Project Page: haian-jin.github.io/projects…
1
1
34
2,898
Thanks Songlin and Xinyu for hosting. Here is the recording and slides.
1
3
33
4,713
An image of an object tells us more than its visual geometry: it’s also a physical snapshot of the object in static equilibrium. Can we use that cue to get more information about the object? Check out Minghao’s work on this topic: PhysComp!
Excited to share PhysComp (accepted at NeurIPS 2024 as a spotlight), which turns single images into 3D objects designed to survive real-world forces! 3D shapes reconstructed from an image are often meant for more than visualization: they’re used in gaming, design, and engineering. Yet many methods ignore physical principles, leading to models that are unstable or deform under real-world forces like gravity. This inconsistency undermines their functionality and fails the aesthetic expectations set by the original image. (1/5)
1
2
31
2,989
The core idea behind LaCT (Large-Chunk Test-Time Training) is simple: 1. Use extremely large online chunk sizes (2K–1M tokens) for TTT to ensure high GPU utilization. 2. Use window attention for local memory and test-time training (TTT) for non-local memory! 2/9
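A rough sketch of the chunk-wise update described above, in plain Python. It is a simplification under several assumptions, not the paper's implementation: the fast weight is a single linear map `W` (not a SwiGLU MLP), the online loss is the negative dot product between `W k` and `v`, each chunk is read before it is written, and window attention is left out. The function names and signatures are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lact_linear_sketch(queries, keys, values, chunk_size, lr=1.0):
    """Hypothetical large-chunk TTT loop with a linear fast weight W."""
    d_in, d_out = len(keys[0]), len(values[0])
    W = [[0.0] * d_in for _ in range(d_out)]  # fast weights start at zero
    outputs, num_updates = [], 0
    for start in range(0, len(keys), chunk_size):
        end = start + chunk_size
        # Apply: every query in the chunk reads the same, current fast weights.
        for q in queries[start:end]:
            outputs.append(matvec(W, q))
        # Update: one gradient step per chunk on loss = -sum_i <W k_i, v_i>.
        # Its gradient w.r.t. W is -sum_i v_i k_i^T, so the step adds
        # lr * sum_i v_i k_i^T to W.
        for k, v in zip(keys[start:end], values[start:end]):
            for i in range(d_out):
                for j in range(d_in):
                    W[i][j] += lr * v[i] * k[j]
        num_updates += 1
    return outputs, W, num_updates


# Toy data: the first chunk stores two key-value pairs, the second reads them.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0], [0.0, 0.0]]
outputs, W, num_updates = lact_linear_sketch(keys, keys, values, chunk_size=2)
print(outputs[2], outputs[3], num_updates)
```

With a linear fast weight this reduces to a chunked form of unnormalized linear attention; the nonlinear memory discussed in the thread comes from using an MLP as the fast weight instead.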
1
5
29
3,225
Replying to @bingyikang
Very cool paper. I guess one reason behind color > size > velocity > shape is that, in your dataset, the color attribute affects lots of pixels and so influences the L2 diffusion loss a lot.
1
27
3,667
I feel we need both. Compression and sparsity are orthogonal, sometimes even the opposite.
Take a look at this blog introducing sparse attention and its implementation, which I think is currently more promising than compression-based methods for long-context modeling.
2
27
5,279
Very cool results. Step towards monocular 4D reconstruction!
🚀 Introducing CAT4D! 🚀 CAT4D transforms any real or generated video into dynamic 3D scenes with a multi-view video diffusion model. The outputs are dynamic 3D models that we can freeze and look at from novel viewpoints, in real-time!
Be sure to try our interactive viewer!
24
2,357
We do the opposite! 🧠 Update fast weights using extremely large chunks (2048–1M tokens). This simple idea has profound implications: 🚀 Parallelism & compute intensity → 10× FLOPs utilization 🦣 Scaling of state size → up to 40% of model params in our experiments 🛠️ Simplicity → no cumbersome custom kernels, just a few lines of PyTorch ⚡ Fast research iteration + easy integration with sophisticated TTT optimizers and TTT architectures. 4/9
1
4
22
2,082
📚 TTT (Sun et al.) is a new way to design more powerful recurrent models. It proposes adapting a model’s fast weights during inference to store in-context information or learn in context. It opens a vast design space for new RNNs! But prior TTT methods suffer from low GPU utilization (<5%), even with custom CUDA kernels, because they update fast weights every token or every 16–64 tokens. Not parallel enough! 3/9
1
3
21
2,464
📌 Autoregressive Video Generation: We test LaCT at scale by distilling a 14B-parameter video diffusion transformer (WAN-T2V) into an AR video diffusion model, replacing full attention with LaCT + SWA. (The generated videos in these tweets all come from this model.) 8/9
3
2
21
1,972
📌 Novel View Synthesis aims to render images of a static scene from previously unseen viewpoints given a set of input images. LaCT handles up to 1M tokens, outperforming 3D Gaussian Splatting with up to 128 input images at 960×536 resolution (8×8 patches → ~1M tokens) on the DL3DV dataset. (Attached videos are our novel view rendering results.) 6/9
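The ~1M token figure can be checked with a bit of arithmetic, assuming each 960×536 image is split into non-overlapping 8×8 patches and each patch becomes exactly one token (the helper name is made up for illustration):

```python
def num_tokens(num_images, width, height, patch):
    """Token count when every non-overlapping patch x patch tile is one token."""
    per_image = (width // patch) * (height // patch)  # 120 * 67 = 8040
    return num_images * per_image


total = num_tokens(num_images=128, width=960, height=536, patch=8)
print(total)  # 1029120, i.e. roughly 1M tokens
```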
1
2
18
1,645
Results are so good! I was imagining annotating millions of dynamic online videos with it.
Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!
17
2,193
📌 Language Models: Compared to linear memory models like GLA & DeltaNet, LaCT delivers: 🔟 5-10× larger nonlinear state ⏱ Comparable training wall-clock time 📉 Similar or better loss per token — especially at the last 2K tokens in a sequence 🔍 Similar or better retrieval accuracy (S-NIAH benchmark) 7/9
1
1
18
1,476
Happening in 5 min
Test-time training (TTT) is an elegant framework for adapting context to model weights. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
18
1,779
We test LaCT on 3 diverse tasks: 🖼️ Novel View Synthesis (image sets) 📝 Language Modeling (1D sequences) 🎥 Video Diffusion (sequence of images) Let’s look at each ⬇️ 5/9
2
2
16
1,738
Intelligence needs long-context memories! We hope this work will inspire and accelerate future research in this field. 🙏 Huge shoutout to my amazing co-authors and collaborators @HaoTan5, @Sai__Bi, @YicongHong, @KaiZhang9546, @fujun_luan, @SonglinYang4, Kalyan Sunkavalli, Bill Freeman! 9/9 (novel view synthesis results attached)
2
15
1,631
Got a chance to play ping pong in VR with this virtual agent in May, it’s so cool! Imagine more sophisticated interactions with virtual agents in the future.
Thrilled to share our #SIGGRAPH2024 work on physics-based character animation for ping pong!🏓We show not only agent-agent matches but also human-agent interactions via VR, allowing humans to challenge our trained agents!🎮 🌐: jiashunwang.github.io/Physic… 📜: arxiv.org/abs/2407.16210
2
12
1,613
Amazing results!
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense Point Cloud Reconstruction ✅ Point Tracking Project Page: vgg-t.github.io/ Code & Weights: github.com/facebookresearch/…
1
12
2,120
🐮Curious about results of gpt-5-thinking pro
1
12
1,701
Impressive reconstruction (scene and object) results. This gives me the feeling that attention is all you need for 3D reconstruction.
Thanks @_akhaliq for promoting our work. We show that long context learning (we use up to 16k tokens) also finds its place in sparse-view reconstruction! Together with @Sai__Bi, @HaoTan5, @zexiangxu, @ambie_kk, Kalyan Sunkavalli, Nanxuan Zhao!
1
1
11
2,485
Nice interpretation! Want to add one more detail about Muon in test-time training (learned from @jxbz). Muon is probably a perfect fit for large chunk sizes, since small chunks give low-rank gradients, and Muon would amplify some of the noise in those low-rank gradients. Adding momentum also increases the rank.
1
10
269
Really love this line of work!
Just uploaded a 1-hr exclusive video for Part 2.1, with many technical details. piped.video/bpp6Dz8N2zY. Part 2.2 will be online in about a week.
10
962
Played with it a little bit on X. Speed seems fine, around 20 seconds for a batch of 4 images at 1024x768. It seems a discretized tokenizer is used, causing artifacts in small details like small faces and fingers. Maybe some parallel decoding is used to accelerate sampling.
Interesting @xai release when people are waiting for their Sora generations 👀 "Aurora is an autoregressive mixture-of-experts network trained to predict the next token from interleaved text and image data." So does this mean it's natively multimodal? Also interesting to see they make autoregressive image generation work that well. x.ai/blog/grok-image-generat…
1
9
1,112
Really impressive results. I think data driven approaches will be able to do fully inverse/forward rendering soon, including strong specular effects, hard shadows and transparencies.
Check out our recent work “Neural Gaffer: Relighting Any Object via Diffusion” 📷🌈, an end-to-end 2D relighting diffusion model that accurately relights any object in a single image under various lighting conditions. 🧵1/N: Website: neural-gaffer.github.io/
6
1,103
Replying to @stochasticchasm
Good take! About the last point, I don't think so. We use a SwiGLU MLP just because it's everywhere, and it's better than pure linear. The fast weight function (a single MLP layer, a whole transformer block, or something totally new), the online training objective (key-value association, next-token prediction), and the online optimizer are all design spaces here. Which one is most important? I am not sure. But the good news is that LaCT makes research exploration much easier, since you only need to write PyTorch code rather than kernel code for high GPU utilization.
1
8
532
Replying to @zhu_zhaocheng
Feels like the biggest problem is that JAX has a smaller open-source community than torch in most areas now.
2
7
1,018
So cool
check out my latest trailer, Sand, crafted using my favorite ai tools: @midjourney for image generation @runwayml gen-3 for video creation @VideoleapApp for seamless editing making videos like this feels like magic
1
6
1,256
Replying to @Grad62304977
Good question! For the S-NIAH task, we still do chunk updates. If the number of tokens decoded so far is smaller than the chunk size, no update is performed! Another interesting experiment not in the paper: training with a chunk size of 2048 and running inference with a chunk size of 1024 leads to only a slight (nearly indistinguishable) change in loss.
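The decoding rule above can be sketched as a small counter, assuming the fast weights are updated only once a full chunk of new tokens has accumulated and stay frozen in between. The function name is hypothetical, and the sketch says nothing about how the pending tokens are handled in the meantime.

```python
def count_decode_updates(num_decoded_tokens, chunk_size):
    """Return (fast-weight updates performed, tokens still waiting for a full chunk)."""
    pending, updates = 0, 0
    for _ in range(num_decoded_tokens):
        pending += 1
        if pending == chunk_size:  # a full chunk is available: update once
            updates += 1
            pending = 0
    return updates, pending


print(count_decode_updates(1000, 2048))  # fewer tokens than one chunk
print(count_decode_updates(5000, 2048))  # two full chunks plus a remainder
```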
1
5
140
🎥 We represent 3D objects as 3D Gaussians and synthesize a 2D video of the object in motion. We estimate the materials through differentiable simulation and differentiable rendering. Check out more results at our project page: physdreamer.github.io/ 4/N
1
5
1,231
First dinner at Berkeley!
4
3
Very good demos. But it feels like, to get better results with fewer images, LFN needs more modeling of the scene, and thus drops one big advantage over NeRF: no explicit modelling/constraints of the rendering process. -- Using depth regularization assumes a near-Lambertian scene and no occlusion.
1
3
Replying to @thjashin
Just curious, have you tried random-order pretraining, then using RL to learn the order?
1
3
2,795
Replying to @Xianbao_QIAN
A real world model should be much more complex and capable than this.
4
264
Realistic interaction requires physical materials of the 3D objects, yet these materials can be spatially varying and are hard to estimate from static images 🥲. However, video generation models, having seen millions of videos 🎬, contain visual priors of object dynamics. 2/N
2
4
1,277
Very cool results!
🔥Spatial intelligence needs fast, *interactive* 3D world generation 🎮 — introducing WonderWorld: generating 3D scenes interactively following your movement and content requests, and see them in <10 seconds! 🧵1/6 Web: kovenyu.com/WonderWorld/ arXiv: arxiv.org/pdf/2406.09394
4
748
Seems that only a few “retrieval” heads need a KV cache.
Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying full KV cache only to critical heads, we achieve: ⚡ 2.55x memory reduction ⚡ 2.18x decoding speedup ⚡ 3.3M tokens on a single A100 GPU
4
809
Replying to @Grad62304977
I guess some details have not been thoroughly tested. One I can think of is the chunk momentum: currently m_chunk = avg(m_i_within_chunk), which is not chunk-size invariant... Maybe m_chunk = prod(m_i_within_chunk) makes more sense.
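One way to read "chunk-size invariant" is to ask how much of the old momentum state survives after a fixed number of tokens, under the assumption that the state is multiplied by m_chunk once per chunk. The sketch below tracks only that decay factor; the per-token coefficients and the function name are illustrative.

```python
import math


def momentum_decay(m, chunk_size, rule):
    """Total decay applied to the old momentum state over the whole sequence."""
    total = 1.0
    for start in range(0, len(m), chunk_size):
        chunk = m[start:start + chunk_size]
        if rule == "avg":
            m_chunk = sum(chunk) / len(chunk)
        else:  # "prod"
            m_chunk = math.prod(chunk)
        total *= m_chunk  # the state is decayed once per chunk
    return total


m = [0.9] * 8  # illustrative per-token momentum coefficients
for rule in ("avg", "prod"):
    print(rule, momentum_decay(m, 8, rule), momentum_decay(m, 4, rule))
```

Under this reading, the averaging rule gives a different total decay when the same 8 tokens are processed as one chunk or as two, while the product rule gives the same decay either way.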
1
4
112
This one is interesting. Selective memory needs RL to learn what is important to memorize and what is not. This paper triggers me to think about handling multimodal memory with a more neural approach.
5
115
Check out Tianwei’s fast autoregressive video diffusion. A promising step towards real-time interactive video generation!
Video diffusion models generate high-quality videos but are too slow for interactive applications. We @MIT_CSAIL @AdobeResearch introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵
4
692
Replying to @stochasticchasm
Goooood question. I am not sure whether the initial fast weight is unimportant. I hope it can be proven unnecessary. Here are some immediate thoughts: if you zero-init a SwiGLU-MLP fast weight, it will have no gradient and thus will remain zero forever. Instead, one might want to zero-init the first layer of the SwiGLU MLP and init the second layer with an identity matrix. This way, it's a no-op for the first chunk and some sort of memory for the second and later chunks. Would be an interesting problem to explore.
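The all-zero case can be checked by hand on a scalar toy version of a SwiGLU MLP, f(x) = w2 · silu(w1·x) · (w3·x). The squared-error loss, the scalar weights, and the function names below are assumptions made for illustration; the sketch covers only the all-zero init, not the alternative init suggested above.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def swiglu_grads(w1, w2, w3, x, v):
    """Gradients of 0.5 * (f(x) - v)^2 w.r.t. (w1, w2, w3) for a scalar SwiGLU."""
    a, g = w1 * x, w3 * x               # gate pre-activation and linear branch
    s = sigmoid(a)
    silu = a * s
    h = silu * g                        # hidden activation
    f = w2 * h                          # output
    df = f - v                          # dLoss/df
    dw2 = df * h
    dh = df * w2
    dsilu_da = s * (1.0 + a * (1.0 - s))
    dw1 = dh * g * dsilu_da * x
    dw3 = dh * silu * x
    return dw1, dw2, dw3


zero_init = swiglu_grads(0.0, 0.0, 0.0, x=1.5, v=1.0)
nonzero_init = swiglu_grads(0.5, 0.5, 0.5, x=1.5, v=1.0)
print(all(grad == 0.0 for grad in zero_init))     # every gradient vanishes
print(any(grad != 0.0 for grad in nonzero_init))  # gradients flow
```

At the all-zero point the hidden activation is zero, so the output-weight gradient vanishes, and the zero output weight blocks any gradient from reaching the two input weights.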
1
3
132
Replying to @leloykun
Interesting! Will see if it can be used not only in pretraining but also in test-time training!
1
3
367
Replying to @stochasticchasm
Partially because I am new to language. Also, I want to point out that the novel-view-synthesis experiment is a perfect task for testing a model's in-context memory capacity. It's nearly a pure "smart retrieval" task: "smart" in the sense that it needs some 3D physical reasoning, "retrieval" in the sense that all the information about the novel views is already given in the input tokens. And retrieval is very hard for all linear attentions (or fixed-state-size models).
1
3
83
Just read it, very elegant, with clear experiments. When reading Tables 9 and 10, I wished they had accompanying figures where the x-axis is length and the y-axis is ppl or per-position loss.
1
3
114
Exciting work! I feel a similar problem also occurs with Stable Diffusion, where generated images hardly follow the composition of the text prompt.
1
3
224
Work done with @Koven_Yu, @ChrisWu6080, Brandon Y. Feng, Changxi Zheng, @Jimantha, @jiajunwu_cs, and Bill Freeman. 5/5
2
963
And just now it broke Cmt
2
Soon there's gonna be a list of Tiny/Quantized/... BERT papers
1
Replying to @_AmilDravid
Thanks for pointing out!
2
697
Replying to @GuoMh14
Congrats and enjoy!
2
139
Replying to @jam3scampbell
On the personalization aspect of memory, there is a 20-80 law here: 20% of attributes can largely describe 80% of one’s personality and interests. Feels like the real technical challenge lies in modeling the remaining 80% of the memory.
2
219
Replying to @cloneofsimo
This seems to be the general problem of teacher forcing. Both next-token prediction and diffusion are trained with teacher forcing, leading to a train-inference mismatch.
1
286
From the provided visualization of the generation process on X, it seems to be AR from the top row to the bottom row. Maybe row-wise AR, or raster-order patch-wise AR, or something in between with gradually more aggressive parallel decoding. One-token-at-a-time is too slow for image AR.
2
1
329
I feel so, ttt is a promising option.
2
142
might be a path forward!
1
24
Replying to @Grad62304977
I think it depends on the PE. We didn't explore PE much and just used the same RoPE. For the novel-view-synthesis experiment, Figure 4(a) and (b) show results with different numbers of input images (chunk size = number of tokens in all input images). The model is only trained with 8 input images at the tested resolution, so there is some zero-shot generalization. But the novel-view-synthesis task takes posed images as input, which come with "natural" and "physically correct" PEs: a Plücker ray for each pixel! And transformers with such Plücker rays as PE also show zero-shot length generalization. So my takeaway is: if the PE is correct, chunk-size generalization and length generalization are possible.
1
1
114
We introduce #PhysDreamer. The key idea is to distill the object dynamics priors learned by a video generation model to estimate the physical materials of static 3D objects. 3/N
1
1
1
911
Replying to @Xianbao_QIAN
Good question! It's slow. We mention the speed in the final Limitations section of the paper. Our implementation takes 1 minute on a V100 GPU to produce 1 second of simulated video.
1
1
277
I also use Metashape and it’s way faster and more robust, but I don’t know why. One general assumption is that heavy engineering is very important for implementing algorithms with long pipelines like SfM, and Metashape just has the money to engineer the system quite well.
1
1
57
Replying to @cloneofsimo
Was thinking: does causal attention really need PE? Feels like the causal structure should already convey some low-frequency position information.
1
1
587
Thanks for the reference!
1
189
Replying to @timudk @rbhuta95
Thanks for sharing! Is there a way to control camera motion magnitude and scene motion magnitude separately?
24
Thanks for the reply. I think your work is a good one; it strikes a good balance between using more priors and staying versatile. Looking forward to your future work!
1
Replying to @Grad62304977
Might be a good idea. But one bug (or feature) is that the first token's PE in each chunk is quite far away from the last token's PE in the previous chunk (maybe you want to match the cycles with the chunk size...).
1
55
Replying to @nathan84686947
I see what you are thinking. The idea that "there is no chunk structure in language" is wrong. There are concepts of episodes, where one topic ends, and aligning the chunk size of LaCT with the episodes might be good. One would also need some caching mechanism when there are too many episodes.
1
252
Replying to @dr_cintas
So cool! From these demo videos, the generation speed feels “real time” (a few seconds). That’s quite impressive if they are using diffusion models on a CPU machine.
1
378
Replying to @rrika9
Reminds me of the range analysis paper on SDFs: Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis.
1
41
I think mini-batch wise operation cannot be applied to text-to-img?
1
46
Replying to @tiangeluo
In terms of 3D/4D visual consistency, memory is easy to learn. Hard part is still the physics.
1
10
Replying to @giffmana
What does the +2 mean in the number of tokens per image? like register or CLS tokens in the image encoder?
1
1
273