Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

Pinned Tweet

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training), a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch (no custom kernels) 🚀 10× GPU FLOPs utilization compared to previous nonlinear test-time training (TTT) methods 🧠 Huge memory size (up to 40% of model params) Project page with code: tianyuanzhang.com/projects/t… (videos generated with our AR video diffusion) 1/9

425

101,817

Tianyuan Zhang · Apr 22, 2024 · 11:18 PM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Apr 2024

3D Gaussian is great, but how can you interact with it 🌹👋? Introducing #PhysDreamer: Create your own realistic interactive 3D assets from only static images! Discover how we do this below👇 🧵1/: Website: physdreamer.github.io/

381

60,830

Tianyuan Zhang · Jul 17, 2025 · 7:43 PM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Jul 2025

Model and training code for LaCT on language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are released 🤓 Come play!

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

112

9,736

Tianyuan Zhang · Nov 29, 2024 · 10:09 PM UTC

Tianyuan Zhang

@tianyuanzhang99

29 Nov 2024

Image depth/normal models advanced so fast, how can we use them for consistent video depth/normal? Introducing Buffer Anytime, a framework to learn video depth/normal without video annotations!

Zhengfei Kuang @zfkuang1

29 Nov 2024

Recent video depth models thrive on large-scale annotated datasets, but what if they’re unavailable? Introducing Buffer Anytime: a zero-shot framework using image priors to predict video depth and normals. Trained exclusively on unannotated videos, it achieves surprisingly smooth and consistent predictions that surpass image-based methods and are comparable to state-of-the-art fully-supervised video models. 🌐 Website: bufferanytime.github.io 📄 Paper: arxiv.org/pdf/2411.17249 Huge thanks to my amazing collaborators: @tianyuanzhang99, @KaiZhang9546, @HaoTan5, @Sai__Bi, Yiwei Hu, @zexiangxu, @MilosHasan, @GordonWetzstein and @fujun_luan. Check our website for more amazing results!

112

10,599

Tianyuan Zhang · Jun 17, 2024 · 5:32 PM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Jun 2024

Attending CVPR at Seattle this week. Happy to chat about anything!

100

10,845

Tianyuan Zhang · Dec 22, 2024 · 1:08 AM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Dec 2024

Replying to @jbhuang0604

one talk from Sasha Rush is very relevant: piped.video/watch?v=6PEJ96…

Speculations on Test-Time Scaling (o1)

Tutorial on the technical background behind OpenAI o1. Talk written...

youtube.com

6,283

Tianyuan Zhang · Feb 11, 2025 · 8:37 PM UTC

Tianyuan Zhang

@tianyuanzhang99

11 Feb 2025

Very interesting work from MIT office mates! Diffusion Forcing with History Guidance introduces a novel approach to video generation, excelling at ultra-long sequences—800+ frames shown in the paper!

Boyuan Chen

@BoyuanChen0

11 Feb 2025

Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: boyuan.space/history-guidanc… (1/7)

6,217

Tianyuan Zhang · Aug 8, 2025 · 3:36 AM UTC

Tianyuan Zhang

@tianyuanzhang99

8 Aug 2025

part science, part empirical, part magic. All driven by extreme curiosity!!

Guangxuan Xiao @Guangxuan_Xiao

8 Aug 2025

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streamin…

9,301

Tianyuan Zhang · Oct 24, 2024 · 11:14 PM UTC

Tianyuan Zhang

@tianyuanzhang99

24 Oct 2024

Check out our learning-based deterministic model for novel view synthesis without NeRF or 3DGS that still produces consistent renderings (it still needs camera poses). We try to use few 3D inductive biases and keep it simple!

Haian Jin

@Haian_Jin

24 Oct 2024

Novel view synthesis has long been a core challenge in 3D vision. But how much 3D inductive bias is truly needed? —Surprisingly, very little! Introducing "LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias"—a fully transformer-based approach that enables scalable, generalizable, and fully data-driven novel view synthesis, from sparse posed inputs. 🧵(1/6) Project Page: haian-jin.github.io/projects…

2,898

Tianyuan Zhang · Jun 9, 2025 · 11:28 PM UTC

Tianyuan Zhang

@tianyuanzhang99

9 Jun 2025

Thanks Songlin and Xinyu for hosting. Here is the recording and slides.

Songlin Yang

@SonglinYang4

9 Jun 2025

Replying to @SonglinYang4 @tianyuanzhang99

Recording: piped.video/watch?v=5QxQUr-m… Slides: asap-seminar.github.io/asset…

4,713

Tianyuan Zhang · Dec 9, 2024 · 1:38 AM UTC

Tianyuan Zhang

@tianyuanzhang99

9 Dec 2024

An image of an object tells more than the object's visual geometry: it is also a physical snapshot of the object in a state of static equilibrium. Can we use that cue to get more information about the object? Check out Minghao’s work on this topic: PhysComp!

Minghao Guo

@GuoMh14

8 Dec 2024

Excited to share PhysComp (accepted by NeurIPS 2024 as spotlight) that turns single images into 3D objects designed to survive real-world forces! Reconstructing 3D shapes from an image often aims to be beyond visualization—they’re used in gaming, design, and engineering. Yet, many methods ignore physical principles, leading to unstable or deformed models under real-world forces like gravity. This inconsistency undermines their functionality and fails the aesthetic expectations set by the original image. (1/5)

2,989

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

The core idea behind LaCT (Large-Chunk Test-Time Training) is simple: 1. Use extremely large online chunk sizes (2K–1M tokens) for ttt to ensure high GPU utilization. 2. Use window attention for local memory, and test-time training (TTT) for non-local memory! 2/9

3,225
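
The large-chunk idea in the 2/9 tweet above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: it swaps in a tiny linear fast weight `W` trained with a squared key-value loss (the paper's memory is nonlinear), assumes an apply-then-update order within each chunk, divides the step by the chunk length, and omits the window-attention branch entirely. The names `large_chunk_ttt`, `chunk_gradient`, `chunk_size`, and `lr` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def chunk_gradient(W, keys, values):
    """Gradient of sum_i 0.5 * ||W k_i - v_i||^2 over one chunk (assumed loss)."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        err = [p - t for p, t in zip(matvec(W, k), v)]
        for r in range(d_out):
            for c in range(d_in):
                grad[r][c] += err[r] * k[c]
    return grad


def large_chunk_ttt(keys, values, queries, chunk_size, lr):
    """Process a sequence chunk by chunk with one fast-weight update per chunk."""
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]  # fast weights start at zero
    outputs, num_updates = [], 0
    for start in range(0, len(keys), chunk_size):
        end = start + chunk_size
        # Apply: every query in the chunk reads the weights from earlier chunks.
        outputs.extend(matvec(W, q) for q in queries[start:end])
        # Update: a single gradient step computed from the whole chunk at once.
        chunk_keys, chunk_values = keys[start:end], values[start:end]
        grad = chunk_gradient(W, chunk_keys, chunk_values)
        n = len(chunk_keys)
        W = [[w - lr * g / n for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
        num_updates += 1
    return outputs, W, num_updates


# Toy sequence of 8 tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]] * 4
values = [[2.0, 0.0], [0.0, 3.0]] * 4
outputs, W, num_updates = large_chunk_ttt(keys, values, keys, chunk_size=4, lr=0.5)
```

With `chunk_size=4` the eight tokens trigger two updates; with `chunk_size=1` they would trigger eight sequential ones, which is the per-token regime the thread describes as hard to parallelize.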

Tianyuan Zhang · Nov 5, 2024 · 4:29 PM UTC

Tianyuan Zhang

@tianyuanzhang99

5 Nov 2024

Replying to @bingyikang

Very cool paper. I guess one reason behind color > size > velocity > shape is that, in your dataset, the color attribute affects lots of pixels and influences the L2 diffusion loss a lot.

3,667

Tianyuan Zhang · Jun 25, 2025 · 8:26 PM UTC

Tianyuan Zhang

@tianyuanzhang99

25 Jun 2025

I feel we need both. compression and sparsity are orthogonal, sometimes even the opposite.

Wenhao Chai

@wenhaocha1

25 Jun 2025

Take a look at this blog introducing sparse attention and its implementation, which I think is currently more promising than compression-based methods for long-context modeling.

5,279

Tianyuan Zhang · Nov 28, 2024 · 3:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

28 Nov 2024

Very cool results. Step towards monocular 4D reconstruction!

Rundi Wu @ChrisWu6080

28 Nov 2024

🚀 Introducing CAT4D! 🚀 CAT4D transforms any real or generated video into dynamic 3D scenes with a multi-view video diffusion model. The outputs are dynamic 3D models that we can freeze and look at from novel viewpoints, in real-time! Be sure to try our interactive viewer!

2,357

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

We do the opposite!🧠 Update fast weights using extremely large chunks (2048–1M tokens). This simple idea has profound implications: 🚀 Parallelism & compute intensity → 10× FLOPs utilization 🦣 Scaling of state size → up to 40% of model params in our experiments 🛠️ Simplicity → no cumbersome custom kernels, just a few lines of PyTorch ⚡ Fast research iteration + easy integration with sophisticated TTT optimizers and TTT architectures. 4/9

2,082

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

📚: TTT (Sun et al.) is a new way to design more powerful recurrent models. It proposes adapting a model’s fast weights during inference to store in-context info or learn in-context. It opens a vast design space for new RNNs! But prior TTT methods suffer from low GPU utilization (<5%), even with custom CUDA kernels, because they update fast weights every token or every 16–64 tokens. Not parallel enough! 3/9

2,464

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

📌 Autoregressive Video Generation: We test LaCT at scale: distilling a 14B-param video diffusion transformer (WAN-T2V) into an AR video diffusion model by replacing full attention with LaCT + SWA. (The generated videos in these tweets all come from this model.) 8/9

1,972

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

📌 Novel View Synthesis aims to render images of a static scene from previously unseen viewpoints given a set of input images. LaCT handles up to 1M tokens, outperforming 3D Gaussian Splatting with up to 128 × 960×536 input images (patch-size as 8x8 -> 1M tokens) on DL3DV dataset. (attached videos are our novel view rendering results) 6/9

1,645
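
The "1M tokens" figure in the 6/9 tweet above follows from simple patch arithmetic. The sketch below assumes one token per non-overlapping 8×8 patch and nothing else (no extra register or pose tokens); `num_tokens` is a hypothetical helper name.

```python
def num_tokens(num_images, width, height, patch):
    """Count tokens when each image is cut into non-overlapping patch x patch tiles."""
    if width % patch or height % patch:
        raise ValueError("image size must be divisible by the patch size")
    return num_images * (width // patch) * (height // patch)


# 128 input images at 960x536 with 8x8 patches: 120 * 67 = 8040 tokens per image.
tokens = num_tokens(num_images=128, width=960, height=536, patch=8)
print(tokens)  # 1029120, i.e. roughly 1M tokens
```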

Tianyuan Zhang · Dec 6, 2024 · 5:38 AM UTC

Tianyuan Zhang

@tianyuanzhang99

6 Dec 2024

Results are so good! Was imagining annotating millions of dynamic data with it from online videos.

Zhengqi Li @zhengqi_li

6 Dec 2024

Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!

2,193

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

📌 Language Models: Compared to linear memory models like GLA & DeltaNet, LaCT delivers: 🔟 5-10× larger nonlinear state ⏱ Comparable training wall-clock time 📉 Similar or better loss per token — especially at the last 2K tokens in a sequence 🔍 Similar or better retrieval accuracy (S-NIAH benchmark) 7/9

1,476

Tianyuan Zhang · Jun 9, 2025 · 6:03 PM UTC

Tianyuan Zhang

@tianyuanzhang99

9 Jun 2025

Happening in 5 min

Songlin Yang

@SonglinYang4

9 Jun 2025

Test-time training (TTT) is an elegant framework for adapting context to model weights. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.

1,779

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

We test LaCT on 3 diverse tasks: 🖼️ Novel View Synthesis (image sets) 📝 Language Modeling (1D sequences) 🎥 Video Diffusion (sequence of images) Let’s look at each ⬇️ 5/9

1,738

Tianyuan Zhang · Jun 3, 2025 · 12:00 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Intelligence needs long-context memories! We hope this work would inspire and accelerate future research in this field. 🙏 Huge shoutout to my amazing co-authors and collaborators @HaoTan5 , @Sai__Bi , @YicongHong , @KaiZhang9546 , @fujun_luan, @SonglinYang4 , Kalyan Sunkavalli, Bill Freeman! 9 / 9 (here attached novel view synthesis results)

1,631

Tianyuan Zhang · Jul 24, 2024 · 9:35 PM UTC

Tianyuan Zhang

@tianyuanzhang99

24 Jul 2024

Got a chance to play ping pong in VR with this virtual agent in May, it’s so cool! Imagine more sophisticated interactions with virtual agents in the future.

Jiashun Wang

@JiashunWang

24 Jul 2024

Thrilled to share our #SIGGRAPH2024 work on physics-based character animation for ping pong!🏓We show not only agent-agent matches but also human-agent interactions via VR, allowing humans to challenge our trained agents!🎮 🌐: jiashunwang.github.io/Physic… 📜: arxiv.org/abs/2407.16210

1,613

Tianyuan Zhang · Jun 11, 2025 · 1:56 AM UTC

Tianyuan Zhang

@tianyuanzhang99

11 Jun 2025

Just arrived in Nashville for CVPR! Looking forward to chatting about any topic!

869

Tianyuan Zhang · Mar 17, 2025 · 6:12 AM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Mar 2025

Amazing results!

Jianyuan

@jianyuan_wang

17 Mar 2025

Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense Point Cloud Reconstruction ✅ Point Tracking Project Page: vgg-t.github.io/ Code & Weights: github.com/facebookresearch/…

2,120

Tianyuan Zhang · Aug 11, 2025 · 6:41 AM UTC

Tianyuan Zhang

@tianyuanzhang99

11 Aug 2025

Replying to @wenhaocha1 @ZihanZheng71803

🐮Curious about results of gpt-5-thinking pro

1,701

Tianyuan Zhang · May 1, 2024 · 3:14 AM UTC

Tianyuan Zhang

@tianyuanzhang99

1 May 2024

Impressive reconstruction (scene and object) results. This gives me a feeling that attention is all you need for 3D reconstruction.

Kai Zhang @KaiZhang9546

1 May 2024

Thanks @_akhaliq for promoting our work. We show that long context learning (we use up to 16k tokens) also finds its place in sparse-view reconstruction! Together with @Sai__Bi, @HaoTan5, @zexiangxu, @ambie_kk, Kalyan Sunkavalli, Nanxuan Zhao!

2,485

Tianyuan Zhang · Dec 4, 2024 · 10:05 PM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Dec 2024

Check out our work on a random-order GPT-style visual autoregressive model.

662

Tianyuan Zhang · Jun 3, 2025 · 1:22 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @Grad62304977 @kellerjordan0

Nice interpretation! I want to add one more detail about Muon in test-time training (learned from @jxbz). Muon is probably a perfect fit for large chunk sizes, since a small chunk gives low-rank gradients, and Muon would increase the magnitude of some noise in the low-rank gradients. Adding momentum also increases the rank.

269
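
The remark above that a small chunk gives low-rank gradients can be checked on a toy case. This sketch assumes a linear fast weight with a squared loss, whose chunk gradient is a sum of outer products `err_i k_i^T` and therefore has rank at most the chunk size; the actual LaCT fast weight is nonlinear, and nothing here models Muon itself. `matrix_rank` and `chunk_gradient` are hypothetical helpers, and exact fractions are used so the rank is not affected by rounding.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over rational numbers."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def chunk_gradient(errors, keys):
    """Sum of outer products err_i k_i^T over the tokens of one chunk."""
    d_out, d_in = len(errors[0]), len(keys[0])
    grad = [[0] * d_in for _ in range(d_out)]
    for e, k in zip(errors, keys):
        for r in range(d_out):
            for c in range(d_in):
                grad[r][c] += e[r] * k[c]
    return grad


# Four tokens with 4-dimensional keys and prediction errors (made-up integers).
keys = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
errors = [[1, 2, 3, 4], [2, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 2]]

for chunk_size in (1, 2, 4):
    grad = chunk_gradient(errors[:chunk_size], keys[:chunk_size])
    print(chunk_size, matrix_rank(grad))
```

In this toy data the rank equals the chunk size until it saturates at the matrix dimension, so a 4×4 fast weight only receives a full-rank gradient once the chunk holds at least four tokens.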

Tianyuan Zhang · Sep 14, 2024 · 6:20 PM UTC

Tianyuan Zhang

@tianyuanzhang99

14 Sep 2024

Really love this line of works!

Zeyuan Allen-Zhu, Sc.D.

@ZeyuanAllenZhu

14 Sep 2024

Just uploaded a 1-hr exclusive video for Part 2.1, with many technical details. piped.video/bpp6Dz8N2zY. Part 2.2 will be online in about a week.

962

Tianyuan Zhang · Dec 9, 2024 · 11:41 PM UTC

Tianyuan Zhang

@tianyuanzhang99

9 Dec 2024

Played with it a little bit on X. Speed seems fine, around 20 seconds for a batch of 4 images at 1024x768. It seems a discretized tokenizer is used, causing artifacts in small details, like small faces and fingers. Maybe some parallel decoding is used to accelerate sampling.

Tianle Cai

@tianle_cai

9 Dec 2024

Interesting @xai release when people are waiting for their Sora generations 👀 "Aurora is an autoregressive mixture-of-experts network trained to predict the next token from interleaved text and image data." So does this mean it's natively multimodal? Also interesting to see they make autoregressive image generation work that well. x.ai/blog/grok-image-generat…

1,112

Tianyuan Zhang · Jun 22, 2024 · 2:33 AM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Jun 2024

Really impressive results. I think data driven approaches will be able to do fully inverse/forward rendering soon, including strong specular effects, hard shadows and transparencies.

Haian Jin

@Haian_Jin

22 Jun 2024

Check out our recent work “Neural Gaffer: Relighting Any Object via Diffusion” 📷🌈, an end-to-end 2D relighting diffusion model that accurately relights any object in a single image under various lighting conditions. 🧵1/N: Website: neural-gaffer.github.io/

1,103

Tianyuan Zhang · Jun 4, 2025 · 6:09 AM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jun 2025

Replying to @stochasticchasm

Good take! About the last point, I don't think so. We use SwiGLU-MLP just because it's everywhere, and it's better than pure linear. The fast weight function (a single MLP layer, a whole transformer block, or something totally new), the online training objectives (key-value association, next-token prediction), and the online optimizers are all design spaces here. Which one is more important? I am not sure. But the good news is that LaCT makes research exploration much easier, since you only need to write PyTorch code rather than kernel code for high GPU utilization.

532

Tianyuan Zhang · May 10, 2023 · 8:08 PM UTC

Tianyuan Zhang

@tianyuanzhang99

10 May 2023

Replying to @zhu_zhaocheng

Feels like the biggest problem is that JAX has a smaller open-source community than torch in most areas now.

1,018

Tianyuan Zhang · Oct 21, 2021 · 12:34 AM UTC

Tianyuan Zhang

@tianyuanzhang99

21 Oct 2021

Excited to share our new work on Autonomous driving: @yuewang314, Vitor Guizilini, Yilun Wang, @zhaohang0124, @JustinMSolomon DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries arxiv.org/abs/2110.06922 Our work allows end-to-end multi-camera 3D detection.

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to...

arxiv.org

Tianyuan Zhang · Aug 6, 2024 · 7:25 AM UTC

Tianyuan Zhang

@tianyuanzhang99

6 Aug 2024

So cool

Justin Ryan ✨@justinryanai

5 Aug 2024

check out my latest trailer, Sand, crafted using my favorite ai tools: @midjourney for image generation @runwayml gen-3 for video creation @VideoleapApp for seamless editing making videos like this feels like magic

1,256

Tianyuan Zhang · Jun 3, 2025 · 1:25 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @Grad62304977

Good question! For the S-NIAH task, we still do chunk updates. If the currently decoded tokens are fewer than the chunk size, no update is performed! Another interesting experiment not in the paper: training with a 2048 chunk size and running inference with a 1024 chunk size leads to only a slight (nearly indistinguishable) change in loss.

140
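
The decoding rule described above (no update until a full chunk has accumulated) amounts to buffering tokens. The sketch below is one assumed way to express it; `ChunkedUpdater`, `step`, and `update_fn` are hypothetical names, and `update_fn` stands in for whatever fast-weight update the model performs.

```python
class ChunkedUpdater:
    """Buffer decoded tokens and trigger one update per full chunk."""

    def __init__(self, chunk_size, update_fn):
        self.chunk_size = chunk_size
        self.update_fn = update_fn  # called with the list of buffered tokens
        self.buffer = []
        self.num_updates = 0

    def step(self, token):
        """Record one decoded token; return True if an update was performed."""
        self.buffer.append(token)
        if len(self.buffer) < self.chunk_size:
            return False  # fewer tokens than a chunk: leave fast weights alone
        self.update_fn(self.buffer)
        self.buffer = []
        self.num_updates += 1
        return True


# Decode 10 tokens with a chunk size of 4: updates fire after tokens 4 and 8.
seen_chunks = []
updater = ChunkedUpdater(chunk_size=4, update_fn=seen_chunks.append)
fired = [updater.step(t) for t in range(10)]
```

After the loop, two tokens remain in the buffer and have not yet influenced the fast weights.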

Tianyuan Zhang · Apr 22, 2024 · 11:20 PM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Apr 2024

🎥 We represent 3D objects as 3D Gaussians and synthesize a 2D video of the object in motion. We estimate the materials through differentiable simulation and differentiable rendering. Check out more results at our project page: physdreamer.github.io/ 4/N

1,231

Tianyuan Zhang · Jun 26, 2019 · 6:55 AM UTC

Tianyuan Zhang

@tianyuanzhang99

26 Jun 2019

First dinner at Berkeley!

Tianyuan Zhang · May 11, 2022 · 7:08 PM UTC

Tianyuan Zhang

@tianyuanzhang99

11 May 2022

Replying to @taiyasaki @vincesitzmann

Very good demos. But it feels like, to get better results with fewer images, LFN needs more modeling of the scene, and thus loses one big advantage over NeRF: no explicit modeling/constraints of the rendering process. -- Using depth reg assumes a near-Lambertian scene and no occlusion

Tianyuan Zhang · Apr 14, 2025 · 8:35 PM UTC

Tianyuan Zhang

@tianyuanzhang99

14 Apr 2025

Replying to @thjashin

Just curious, have you tried random-order pretraining, then using RL to learn the order?

2,795

Tianyuan Zhang · Apr 23, 2024 · 4:55 PM UTC

Tianyuan Zhang

@tianyuanzhang99

23 Apr 2024

Replying to @Xianbao_QIAN

Real world model should be much more complex and capable than this.

264

Tianyuan Zhang · Apr 22, 2024 · 11:19 PM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Apr 2024

Realistic interaction requires physical materials of the 3D objects, yet these materials can be spatially varying and are hard to estimate from static images 🥲. However, video generation models, having seen millions of videos 🎬, contain visual priors of object dynamics. 2/N

1,277

Tianyuan Zhang · Aug 20, 2023 · 4:41 AM UTC

Tianyuan Zhang

@tianyuanzhang99

20 Aug 2023

Replying to @ShivamDuggal4 @MITEECS @MIT_CSAIL @pathak2206 @CarnegieMellon @roboVisionCMU

Congrats!

352

Tianyuan Zhang · Sep 17, 2024 · 8:34 PM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Sep 2024

Very cool results!

Hong-Xing (Koven) Yu

@Koven_Yu

16 Sep 2024

🔥Spatial intelligence needs fast, *interactive* 3D world generation 🎮 — introducing WonderWorld: generating 3D scenes interactively following your movement and content requests, and see them in <10 seconds! 🧵1/6 Web: kovenyu.com/WonderWorld/ arXiv: arxiv.org/pdf/2406.09394

748

Tianyuan Zhang · Oct 15, 2024 · 5:16 PM UTC

Tianyuan Zhang

@tianyuanzhang99

15 Oct 2024

Seems that only a few “retrieval” heads need a KV cache.

Guangxuan Xiao @Guangxuan_Xiao

15 Oct 2024

Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying full KV cache only to critical heads, we achieve: ⚡ 2.55x memory reduction ⚡ 2.18x decoding speedup ⚡ 3.3M tokens on a single A100 GPU

809

Tianyuan Zhang · Jun 3, 2025 · 1:32 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @Grad62304977

I guess maybe some details are not well explored experimentally. I can think of one detail about the chunk momentum: currently, m_chunk = avg(m_i_within_chunk), which is not chunk-size invariant... Maybe m_chunk = prod(m_i_within_chunk) makes more sense.

112
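
The chunk-size invariance point above can be made concrete under one assumed reading: the chunk momentum is the factor by which older gradient information is decayed at each chunk update, so the total decay over a sequence is the product of the chunk momenta. Under that assumption, averaging the per-token values gives a total that depends on the chunk size, while taking their product does not. `total_decay`, `mean_rule`, and `prod_rule` are hypothetical names.

```python
import math


def mean_rule(chunk):
    """Chunk momentum as the average of per-token momenta."""
    return sum(chunk) / len(chunk)


def prod_rule(chunk):
    """Chunk momentum as the product of per-token momenta."""
    return math.prod(chunk)


def total_decay(per_token_momentum, chunk_size, rule):
    """Product of chunk momenta across the whole sequence."""
    decay = 1.0
    for start in range(0, len(per_token_momentum), chunk_size):
        decay *= rule(per_token_momentum[start:start + chunk_size])
    return decay


momenta = [0.5] * 8  # same per-token momentum for 8 tokens

for chunk_size in (2, 4):
    print(chunk_size,
          total_decay(momenta, chunk_size, mean_rule),
          total_decay(momenta, chunk_size, prod_rule))
```

With the averaging rule the total decay is 0.5^4 for chunks of two tokens but 0.5^2 for chunks of four, whereas the product rule yields 0.5^8 in both cases.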

Tianyuan Zhang · Jul 17, 2025 · 4:40 PM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Jul 2025

Replying to @Grad62304977 @natanielruizg @jxmnop

This one is interesting. Selective memory needs RL to learn what is important to memorize and what is not. This paper triggers me to think about handling multimodal memory with a more neural approach.

115

Tianyuan Zhang · Dec 7, 2024 · 4:01 AM UTC

Tianyuan Zhang

@tianyuanzhang99

7 Dec 2024

Check out Tianwei’s fast autoregressive video diffusion. A promising step towards real-time interactive video generation!

Tianwei Yin

@TianweiY

7 Dec 2024

Video diffusion models generate high-quality videos but are too slow for interactive applications. We @MIT_CSAIL @AdobeResearch introduce CausVid, a fast autoregressive video diffusion model that starts playing the moment you hit "Generate"! A thread 🧵

692

Tianyuan Zhang · Jun 4, 2025 · 6:20 AM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jun 2025

Replying to @stochasticchasm

Goooood question. I am not sure whether the initial fast weight is unimportant. I hope it can be proven to be unnecessary. Here are some immediate thoughts: if you zero-init a SwiGLU-MLP fast weight, it will have no gradient and thus will remain zero forever. Instead, one might want to zero-init the first layer of the SwiGLU-MLP and use an "identity matrix" to init the second layer. This way, it's a no-op for the first chunk, and some sort of memory for the second and later chunks. Would be an interesting problem to explore.

132
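
The claim above that an all-zero SwiGLU-MLP receives no gradient can be verified on a scalar toy version. This sketch assumes the common form `w_down * silu(w_gate * x) * (w_up * x)` with one scalar per weight and a squared loss; it is an illustration, not the LaCT fast-weight code. The names `swiglu`, `loss`, and `numerical_grad` are hypothetical, and the gradient is estimated with central differences.

```python
import math


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu(x, w_gate, w_up, w_down):
    """Scalar SwiGLU-MLP: down-projection of silu(gate) times up-projection."""
    return w_down * silu(w_gate * x) * (w_up * x)


def loss(params, x, y):
    """Squared error between the SwiGLU output and a target y."""
    return 0.5 * (swiglu(x, *params) - y) ** 2


def numerical_grad(params, x, y, eps=1e-4):
    """Central-difference gradient of the loss with respect to each weight."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, x, y) - loss(minus, x, y)) / (2 * eps))
    return grads


# All-zero init: each partial derivative carries a factor that is zero
# (w_down = 0 for the gate and up weights, silu(0) * 0 for the down weight).
zero_grad = numerical_grad([0.0, 0.0, 0.0], x=1.5, y=2.0)

# A nonzero init, for contrast, gets a usable gradient signal.
nonzero_grad = numerical_grad([0.3, -0.2, 0.5], x=1.5, y=2.0)
```

Since gradient descent adds a multiple of the gradient to the weights, an all-zero gradient at the all-zero point means plain gradient steps never leave it.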

Tianyuan Zhang · May 8, 2024 · 12:37 AM UTC

Tianyuan Zhang

@tianyuanzhang99

8 May 2024

Interesting findings with tons of experiments!

568

Tianyuan Zhang · Jun 21, 2025 · 6:28 PM UTC

Tianyuan Zhang

@tianyuanzhang99

21 Jun 2025

Replying to @leloykun

Interesting! Will see if it can be used not only in pretraining but also in test-time training!

367

Tianyuan Zhang · Sep 6, 2024 · 3:30 AM UTC

Tianyuan Zhang

@tianyuanzhang99

6 Sep 2024

Replying to @peterljq @xuyilun2 @CongyueD @ishaanpreetam @TianweiY @ShivamDuggal4 @_atewari

🤣🤣

147

Tianyuan Zhang · Jun 4, 2025 · 6:12 AM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jun 2025

Replying to @stochasticchasm

Partially because I am new to language. Also, I want to point out that the novel-view-synthesis experiment is a perfect task to test a model's in-context memory capacity. It's nearly a pure "smart retrieval" task: "smart" in the sense that it needs some 3D physical reasoning, and "retrieval" in the sense that all information about novel views is already given in the input tokens. And retrieval is very hard for all linear attentions (or fixed-state-size models).

Tianyuan Zhang · Jun 23, 2025 · 11:29 PM UTC

Tianyuan Zhang

@tianyuanzhang99

23 Jun 2025

Replying to @ninoscherrer @SonglinYang4 @oswaldjoh

Just read, very elegant. And clear experiments. When reading Tables 9 and 10, I was wishing they had accompanying figures where the x-axis is length and the y-axis is ppl or per-position loss.

114

Tianyuan Zhang · Mar 24, 2023 · 1:54 AM UTC

Tianyuan Zhang

@tianyuanzhang99

24 Mar 2023

Replying to @james_y_zou @mertyuksekgonul @federicobianchy @ria_kalluri @jurafsky

Exciting work! I feel a similar problem also occurs with Stable Diffusion, where generated images hardly follow the composition of the text prompt.

224

Tianyuan Zhang · Apr 22, 2024 · 11:24 PM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Apr 2024

Work done with @Koven_Yu, @ChrisWu6080, Brandon Y. Feng, Changxi Zheng, @Jimantha, @jiajunwu_cs, and Bill Freeman. 5/5

963

Tianyuan Zhang · Nov 17, 2021 · 7:14 AM UTC

Tianyuan Zhang

@tianyuanzhang99

17 Nov 2021

Replying to @jbhuang0604 @CVPR @overleaf

And just now it broke Cmt

Tianyuan Zhang · Aug 27, 2019 · 11:34 PM UTC

Tianyuan Zhang

@tianyuanzhang99

27 Aug 2019

Soon there's gonna be a list of Tiny/Quantized ... BERT papers

Xin Eric Wang

@xwang_lk

27 Aug 2019

A list of V*BERT papers: VideoBERT: arxiv.org/abs/1904.01766 ViLBERT: arxiv.org/abs/1908.02265 LXMERT: arxiv.org/abs/1908.07490 VisualBERT: arxiv.org/abs/1908.03557 Unicoder-VL: arxiv.org/abs/1908.06066 B2T2: arxiv.org/abs/1908.05054 VL-BERT: arxiv.org/abs/1908.08530 ... ...

Tianyuan Zhang · Jun 3, 2025 · 1:08 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @_AmilDravid

Thanks for pointing out!

697

Tianyuan Zhang · Dec 8, 2024 · 10:51 PM UTC

Tianyuan Zhang

@tianyuanzhang99

8 Dec 2024

Replying to @GuoMh14

Congrats and enjoy!

139

Tianyuan Zhang · Apr 13, 2025 · 10:05 PM UTC

Tianyuan Zhang

@tianyuanzhang99

13 Apr 2025

Replying to @jam3scampbell

For the personalization aspect of memory, there is a 20-80% law here: 20% of attributes can largely describe 80% of one’s personality and interests. Feels like the real technical challenge lies in modeling the rest 80% of the memory.

219

Tianyuan Zhang · Mar 31, 2025 · 8:02 PM UTC

Tianyuan Zhang

@tianyuanzhang99

31 Mar 2025

Replying to @cloneofsimo

This seems to be the general problem of teacher forcing. Both next-token prediction and diffusion are trained with teacher forcing, leading to a train-inference mismatch.

286

Tianyuan Zhang · Dec 9, 2024 · 11:54 PM UTC

Tianyuan Zhang

@tianyuanzhang99

9 Dec 2024

From the provided visualization of generation process on X, it seems to be ar from top-row to bottom-row. Maybe row-wise ar, or raster-order patch-wise ar. Or something in between with gradually more aggressive parallel decoding. One-token-at-a-time is too slow for image ar.

329

Tianyuan Zhang · Jun 4, 2025 · 12:38 AM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jun 2025

Replying to @rm_rafailov @karpathy

I feel so, ttt is a promising option.

142

Tianyuan Zhang · Jun 3, 2025 · 6:07 PM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @MinuteMovies3 @dwarkesh_sp

might be a path forward!

Tianyuan Zhang · Jun 3, 2025 · 1:43 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @Grad62304977

I think it depends on PE. We didn't do much exploration of PE and just use the same RoPE. For the novel-view-synthesis experiment, Figure 4(a), (b) shows results with different numbers of input images (the chunk size = number of tokens in all input images). The model is only trained with 8 input images at the tested resolution, so there is some zero-shot generalization. But the task of novel view synthesis uses posed images as input, which come with "natural" and "physically correct" PEs: a Plücker ray for each pixel! And transformers with such Plücker rays as PE also show zero-shot length generalization. So my takeaway is: if the PE is correct, chunk-size generalization and length generalization are possible.

114
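
For readers unfamiliar with the Plücker-ray encoding mentioned above, here is a minimal sketch. It assumes the common convention of concatenating the unit ray direction `d` with the moment `o × d`, where `o` is any point on the ray (for a camera ray, the camera center); the ordering and normalization actually used in the paper are not stated in the thread. `plucker_ray` and `cross` are hypothetical helper names.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(origin, direction):
    """Return the 6-vector (d, o x d) for a ray through `origin` along `direction`."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)
    return d + moment


# The same ray described from two different points along it.
ray_a = plucker_ray(origin=(1.0, 2.0, 3.0), direction=(0.0, 0.0, 2.0))
ray_b = plucker_ray(origin=(1.0, 2.0, 8.0), direction=(0.0, 0.0, 2.0))
```

Both calls give the same six numbers because the moment does not change when the origin slides along the ray, so the encoding describes the ray itself rather than an arbitrary point on it.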

Tianyuan Zhang · Apr 22, 2024 · 11:20 PM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Apr 2024

We introduce #PhysDreamer. The key idea is to distill the object dynamics priors learned by a video generation model to estimate the physical materials of static 3D objects. 3/N

911

Tianyuan Zhang · Apr 23, 2024 · 4:54 PM UTC

Tianyuan Zhang

@tianyuanzhang99

23 Apr 2024

Replying to @Xianbao_QIAN

Good question! It's slow. We mentioned the speed in the final Limitation section of the paper. Our implementation takes 1 min of a V100 GPU to produce 1 second of the simulated video.

277

Tianyuan Zhang · Dec 16, 2024 · 5:23 AM UTC

Tianyuan Zhang

@tianyuanzhang99

16 Dec 2024

Replying to @HongjieWang3

Nice work!

435

Tianyuan Zhang · Feb 26, 2024 · 1:40 AM UTC

Tianyuan Zhang

@tianyuanzhang99

26 Feb 2024

Replying to @Aaronf_hd @janusch_patas

I also use Metashape and it’s way faster and more robust. But I don’t know why. One general assumption is that heavy engineering is very important for implementing algorithms with long pipelines like SfM, and Metashape just has the money to engineer the system quite well.

Tianyuan Zhang · Jun 3, 2025 · 4:41 PM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @ahatamiz1

👌!

325

Tianyuan Zhang · Nov 19, 2024 · 5:59 AM UTC

Tianyuan Zhang

@tianyuanzhang99

19 Nov 2024

Replying to @cloneofsimo

Was thinking: does causal attention really need PE? Feels like the causal structure should already tell some low-frequency position information.

587

Tianyuan Zhang · Nov 19, 2024 · 3:56 PM UTC

Tianyuan Zhang

@tianyuanzhang99

19 Nov 2024

Replying to @andregraubner @cloneofsimo

Thanks for the reference!

189

Tianyuan Zhang · Nov 10, 2021 · 10:50 PM UTC

Tianyuan Zhang

@tianyuanzhang99

10 Nov 2021

Replying to @tolga_birdal @CVPR

Agree！

Tianyuan Zhang · Jul 4, 2022 · 9:18 PM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jul 2022

Replying to @Bw_Li1024 @eccvconf @DrChenWang @pranay_ar @seungchankim25 @smash0190

Congrats!

Tianyuan Zhang · Nov 22, 2023 · 4:58 AM UTC

Tianyuan Zhang

@tianyuanzhang99

22 Nov 2023

Replying to @timudk @rbhuta95

Thanks for the sharing! Is there a way to control camera motion magnitude and scene motion magnitude separately?

Tianyuan Zhang · May 11, 2022 · 11:24 PM UTC

Tianyuan Zhang

@tianyuanzhang99

11 May 2022

Replying to @liangchensong @taiyasaki @vincesitzmann

Thanks for the reply. I think your work is a good one; it shows a good balance between more priors and being versatile. Looking forward to your future work!

Tianyuan Zhang · Jun 3, 2025 · 1:52 AM UTC

Tianyuan Zhang

@tianyuanzhang99

3 Jun 2025

Replying to @Grad62304977

Might be a good idea. But one bug (or feature) is that the first token's PE in each chunk is quite far away from the last token's PE in the previous chunk (maybe you want to match the cycles with the chunk size...).

Tianyuan Zhang · Jun 4, 2025 · 8:56 PM UTC

Tianyuan Zhang

@tianyuanzhang99

4 Jun 2025

Replying to @nathan84686947

I see what you are thinking. The idea that "there is no chunk structure in language" is wrong. There are concepts of episodes, where one topic ends, and aligning the chunk size of LaCT with the episodes might be good. Also, add some caching mechanism for when there are too many episodes.

252

Tianyuan Zhang · Apr 23, 2024 · 1:02 AM UTC

Tianyuan Zhang

@tianyuanzhang99

23 Apr 2024

Replying to @UUUUUsher

Thanks!

212

Tianyuan Zhang · Jul 19, 2024 · 8:40 PM UTC

Tianyuan Zhang

@tianyuanzhang99

19 Jul 2024

Replying to @TairanHe99

Congrats!

241

Tianyuan Zhang · Jun 19, 2025 · 8:19 AM UTC

Tianyuan Zhang

@tianyuanzhang99

19 Jun 2025

Replying to @wenhaocha1

Agree!

222

Tianyuan Zhang · May 25, 2023 · 11:22 PM UTC

Tianyuan Zhang

@tianyuanzhang99

25 May 2023

Replying to @dr_cintas

So cool! From these demo videos, the speed of generation feels “real time” (a few seconds). That’s quite impressive if they are using diffusion models on a CPU machine.

378

Tianyuan Zhang · Apr 16, 2023 · 6:16 AM UTC

Tianyuan Zhang

@tianyuanzhang99

16 Apr 2023

Replying to @rrika9

Reminds me of the range analysis paper on SDFs: Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis.

Tianyuan Zhang · Jan 1, 2025 · 5:17 PM UTC

Tianyuan Zhang

@tianyuanzhang99

1 Jan 2025

Replying to @cloneofsimo @ArnaudDoucet1

I think mini-batch wise operation cannot be applied to text-to-img?

Tianyuan Zhang · Aug 6, 2025 · 7:35 PM UTC

Tianyuan Zhang

@tianyuanzhang99

6 Aug 2025

Replying to @tiangeluo

In terms of 3D/4D visual consistency, memory is easy to learn. Hard part is still the physics.

Tianyuan Zhang · Jun 18, 2025 · 2:32 AM UTC

Tianyuan Zhang

@tianyuanzhang99

18 Jun 2025

Replying to @giffmana

What does the +2 mean in the number of tokens per image? like register or CLS tokens in the image encoder?

273

Tianyuan Zhang · Mar 24, 2025 · 5:54 PM UTC

Tianyuan Zhang

@tianyuanzhang99

24 Mar 2025

Replying to @TianweiY @MIT @AdobeResearch @reveimage

Congrats!

285