CS Ph.D. Student @Berkeley_AI. B.Eng. @SJTU1896 CS. Previously with @GoogleDeepMind, @MSFTResearch. Vision, generative models, robotics.

Pinned Tweet
One memory can't rule them all. We present LoGeR, a new hybrid memory architecture for long-context geometric reconstruction. LoGeR enables stable reconstruction over up to 10k frames / kilometer scale, with linear-time scaling in sequence length, fully feedforward inference, and no post-optimization. Yet it matches or surpasses strong optimization-based pipelines. (1/5) @GoogleDeepMind @Berkeley_AI
64
445
3,392
561,709
Excited to share MonST3R! A simple way to estimate geometry from unposed videos of dynamic scenes. We achieve competitive results on several downstream tasks (video depth, camera pose) and believe this is a promising step toward feed-forward 4D reconstruction: monst3r-project.github.io
22
138
723
131,546
MonST3R is accepted to ICLR'25 as a Spotlight! We have also added a fully feed-forward reconstruction mode that runs in real time for video input (samples at: monst3r-paper.github.io/page…). Check more details here: github.com/Junyi42/monst3r/p…
2
30
327
21,908
Introducing St4RTrack! 🖖 Simultaneous 4D reconstruction and tracking in the world coordinate frame, done feed-forward, just by changing the meaning of two pointmaps! st4rtrack.github.io
7
50
277
52,442
Code for inference, visualization, training, and evaluation is released! - GitHub.com/Junyi42/monst3r
6
32
224
21,518
Very impressive! At VideoMimic.net, we already: learn from 3rd-person human videos + RL -- for locomotion. Excited to see where this path goes next!
One of our goals is to have Optimus learn straight from internet videos of humans doing tasks. Those are often 3rd person views captured by random cameras etc. We recently had a significant breakthrough along that journey, and can now transfer a big chunk of the learning directly from human videos to the bots (1st person views for now). This allows us to bootstrap new tasks much faster compared to teleoperated bot data alone (heavier operationally). Many new skills are emerging through this process, are called for via natural language (voice/text), and are run by a single neural network on the bot (multi-tasking). Next: expand to 3rd person video transfer (aka random internet), and push reliability via self-play (RL) in the real-, and/or synthetic- (sim / world models) world. If you're great at AI and want to be part of its biggest real-world applications ever, you really need to join Tesla right now.
2
20
207
17,872
Humanoids need to perceive the environment in the real world. Using 4D reconstruction techniques, we turn casual human videos into training data for an environment-aware humanoid policy. Super excited to share: VideoMimic.net
our new system trains humanoid robots using data from cell phone videos, enabling skills such as climbing stairs and sitting on chairs in a single policy (w/ @redstone_hong @junyi42 @davidrmcall)
2
17
132
11,352
Just arrived in Nashville for #CVPR25! 🥰 I'll present St4RTrack tomorrow morning (10:30–12:30) at the 4D Vision Workshop, poster #137 in Hall 104 B. Feel free to come and chat!
1
5
99
8,862
🚀 Introducing "Telling Left from Right" at #CVPR2024
- Identify the problem of geometry-aware semantic correspondence (SC)
- Evaluate foundation model features' geometric awareness
- 🏆 Achieve SOTA with a lightweight post-processor
🔗 (w/ code!): telling-left-from-right.gith…
2
14
91
9,582
On my way to Seattle ✈️ for my first ever #CVPR! Excited to meet old and new friends. 😄 I'll be presenting our work telling-left-from-right.gith… on Wed. (19th) morning at #284. If you're interested in how a plug-in processor can enhance the Geo-aware SC of SD+DINO, please stop by.
4
66
7,086
I'll be presenting MonST3R at ICLR! 🇸🇬 Friday 25th, 10am-12:30pm, Hall 3+2B #97. Come by if you are interested!
2
2
60
3,087
The results are so cool! 4D reconstruction is a very challenging task - I tried to explore it before MonST3R but couldn't make it work. I'm thrilled to see MonST3R contributing a part to this reconstruction pipeline!
🚀 Introducing CAT4D! 🚀 CAT4D transforms any real or generated video into dynamic 3D scenes with a multi-view video diffusion model. The outputs are dynamic 3D models that we can freeze and look at from novel viewpoints, in real-time! Be sure to try our interactive viewer!
1
2
51
5,170
Hard to see the details in the figure? Check it out for yourself: monst3r-project.github.io/pa… We've created an interesting 4D online demo that you can easily explore!
3
3
44
7,008
Nice work! Very cool results by carefully-designed generative inpainting on MonST3R's partial pointmaps. Glad to see MonST3R/dynamic 3d reconstruction is playing an important role.
🔥 Free4D creates explicit 4D Gaussian scene representations from a single image, enabling high-quality, controllable, and real-time rendering. 👉 Project (with interactive demo): free4d.github.io/ Paper: arxiv.org/abs/2503.20785 Code (open-sourced): github.com/TQTQliu/Free4D
6
41
5,242
Super excited to attend #NeurIPS2023 from 11th to 16th! 🥰 I'll be presenting our work 'A Tale of Two Features' (sd-complements-dino.github.i…) at the 'Thu Morning session #212'. Looking forward to meeting new and old friends in New Orleans! 🌟
4
35
3,736
In "telling left from right", we showed it's important to make 2D semantic correspondence geometry-aware. In DenseMatcher, we further lift it to 3D, and with this we enable generalizable robot skills across categories! Fun collaboration with Joseph, Yuanchen, and the team!
🍌 We present DenseMatcher! 🤖 DenseMatcher enables robots to acquire generalizable skills across diverse object categories by only seeing one demo, by finding correspondences between 3D objects even with different types, shapes, and appearances.
2
7
34
3,737
Appreciate the share @sstj389! Code is now live at: github.com/Junyi42/sd-dino. We exploit complementarity of SD and DINOv2 features, achieving superior results via a simple fusion. Surprisingly, fused features outperform supervised methods on SPair-71k. Welcome to explore further!
A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence. Super cool results on their webpage! arxiv: arxiv.org/abs/2305.15347 project page: sd-complements-dino.github.i…
1
5
27
3,304
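A minimal sketch of the "simple fusion" idea mentioned in the tweet above, assuming per-pixel SD and DINOv2 descriptors have already been extracted as plain lists of floats. This is not the sd-dino code: the function names and the mixing weight `alpha` are hypothetical, and the square-root weighting is just one convenient choice that keeps the fused vector at unit length.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(sd_feat: Sequence[float], dino_feat: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Concatenate two normalized descriptors into one fused descriptor.

    `alpha` is a hypothetical mixing weight in [0, 1]. With square-root weights,
    the dot product of two fused vectors equals
    alpha * cos_sd + (1 - alpha) * cos_dino.
    """
    sd = l2_normalize(sd_feat)
    dino = l2_normalize(dino_feat)
    w_sd, w_dino = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [w_sd * x for x in sd] + [w_dino * x for x in dino]


def best_match(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest dot product with the query."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Matching a source pixel then amounts to fusing its two descriptors and calling `best_match` against the fused descriptors of every target pixel.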
Huge thanks to my amazing collaborators! -- Charles Herrmann+, @JunhwaHur, @jampani_varun, @trevordarrell, @forrestercole, @DeqingSun*, and Ming-Hsuan Yang* Special thanks to @brenthyi for the support in setting up the online demo Model is released at: github.com/Junyi42/monst3r
1
1
25
1,645
Very great study! This is a much more comprehensive analysis into the 3d/geometric awareness of vision models compared to our telling-left-from-right, while we focus more on correspondence scenarios and how to improve it.
Google presents "Probing the 3D Awareness of Visual Foundation Models". Visual foundation models can learn representations that encode the depth and orientation of the visible surface but struggle with multiview consistency, possibly because they are learning view-dependent representations. repo: github.com/mbanani/probe3d abs: arxiv.org/abs/2404.08636
22
2,864
Great to see a much more clever unsupervised way to fuse SD & DINO features compared to concatenation. 😃 The smooth and globally consistent correspondence from these features is really nice!
Our paper "Zero-Shot Image Feature Consensus with Deep Functional Maps" is accepted at #ECCV2024! @eccvconf Want better image correspondences with noisy and inaccurate features? Let's go to the spectral space with Laplacian eigenfunctions! ArXiv: arxiv.org/abs/2403.12038
1
1
22
3,767
Just arrived in lovely Paris 🇫🇷🗼 for my first in-person conference #ICCV2023! Thrilled about the valuable input and networking ahead!! 🥰
1
16
1,469
We handle dynamic videos with DUSt3R's pointmap representation: estimate xyz coordinates for two frames, aligned in the camera coordinate frame of the first frame ➡️ No constraint on dynamic/static scenes in the representation! But how does DUSt3R actually work for dynamic scenes? 🤔
1
16
2,905
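To make the pointmap representation in the tweet above concrete, here is a small sketch under stated assumptions. In DUSt3R and MonST3R the network predicts the pointmaps directly; this sketch instead builds them from a hypothetical known depth map, pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and a known relative pose, purely to show what "xyz per pixel, expressed in the first frame's camera coordinates" means. All names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pointmap = List[List[Point]]


def backproject(depth: Sequence[Sequence[float]], fx: float, fy: float,
                cx: float, cy: float) -> Pointmap:
    """Lift a depth map to one xyz point per pixel, in that image's own camera frame."""
    return [
        [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]


def to_reference_frame(pointmap: Pointmap,
                       rotation: Sequence[Sequence[float]],
                       translation: Sequence[float]) -> Pointmap:
    """Express every point in the reference (first) camera frame: X_ref = R @ X + t."""
    def transform(p: Point) -> Point:
        return tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
    return [[transform(p) for p in row] for row in pointmap]
```

Nothing in this layout says whether a pixel belongs to a static or a moving object: each pixel simply carries a 3D location, which is why the representation places no constraint on scene dynamics.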
What's more exciting is our "joint dense reconstruction & camera pose estimation" result, while being 10x faster than the previous method. We visualize the optimized global point cloud and estimated camera poses:
1
14
1,557
Last, we also show the results of feed-forward pairwise pointmaps prediction, compared with DUSt3R: Row 1: we can still handle dynamic focals; Rows 2,3: we can do "impossible matching" in dynamic scenes; Rows 4,5: we can better estimate geometry in large scenes
1
14
2,148
It doesn't work out of the box. But as this is primarily a data issue, we propose a simple approach to adapt DUSt3R to dynamic scenes: fine-tuning on a small set of dynamic videos, which works surprisingly well.
1
13
2,024
Thanks to Björn Ommer for covering our work at the #NeurIPS23 opening keynote yesterday! Interested in more details? Check our work (sd-complements-dino.github.i…) and other concurrent works (diffusionfeatures.github.io, ubc-vision.github.io/LDM_cor…, diffusion-hyperfeatures.gith…) at this NeurIPS! 😄
13
875
For a video input consisting of more than two frames, we can aggregate all the pairwise pointmap results to build a global point cloud. With this unified representation, we can simply pull out per-frame camera pose, intrinsics, and video depth.
1
12
1,673
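A rough sketch of how depth and a focal length could be read off a per-frame pointmap, as the tweet above describes. It assumes the pointmap is already expressed in that frame's own camera coordinates and that the principal point (`cx`, `cy`) is known, for example the image center. Plain least squares is swapped in here for illustration; it is not claimed to be the estimator MonST3R uses, and pose recovery is left out.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_from_pointmap(pointmap: Sequence[Sequence[Point]]) -> List[List[float]]:
    """In the frame's own camera coordinates, depth is the z channel of the pointmap."""
    return [[p[2] for p in row] for row in pointmap]


def focal_from_pointmap(pointmap: Sequence[Sequence[Point]], cx: float, cy: float) -> float:
    """Least-squares focal length under a pinhole model with a shared focal length.

    Minimizes sum((u - cx - f * x / z)^2 + (v - cy - f * y / z)^2) over f.
    """
    num = 0.0
    den = 0.0
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            if z <= 0:
                continue  # skip points at or behind the camera
            xn, yn = x / z, y / z
            num += (u - cx) * xn + (v - cy) * yn
            den += xn * xn + yn * yn
    if den == 0.0:
        raise ValueError("pointmap has no usable off-axis points")
    return num / den
```

With a focal length and depth in hand, the remaining per-frame quantity is the camera pose, which in practice comes from aligning the frame's points to the global point cloud.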
We achieve competitive results compared to task-specific methods, e.g., DepthCrafter in video depth, LEAP-VO in camera pose estimation
1
12
1,571
I will also be presenting VideoMimic at the Agents in Interaction workshop: Poster #182–#201 | June 12 (Thu), 11:45–12:15 | ExHall D. @redstone_hong will also give a spotlight talk on VideoMimic on Thu, come check it out! More details ⬇️
Excited to present VideoMimic this week at #CVPR2025! 🎥🤖 📌 POETs Workshop "Embodied Humans" Spotlight Talk | June 12, Thu, -10:10 | Room 101B 📌 Agents in Interaction: From Humans to Robots Poster #182-#201 | June 12, Thu, -12:15 | ExHall D. Come by and chat! #VideoMimic #Humanoids #Robotics
11
860
This is a very simple, reasonable, and effective method to improve diffusion features! Nice gains over "telling left from right" and "tale of two features"!
Replying to @rmsnorm
We show you can, with just 30 minutes of task-agnostic finetuning on a single GPU. 🤯 No noise. Better features. Better performance. Across many tasks. And no timestep searching headaches! 👇
1
9
1,035
From static to dynamic, MonST3R reconstructs pointmaps at their own times. St4RTrack instead estimates both at the same moment, predicting how points in frame1 move to frame2 and reconstructing the geometry of frame2. Same architecture, now for simultaneous tracking + reconstruction.
1
8
1,122
🚀 Excited about applying DM to heterogeneous data? Check out our #ICCV2023 work "LayoutDiffusion"! It excels in graphic layout generation. 🗓️ Presenting on Wed. (4th) at 2:30pm, Foyer Sud. Drop by and let's chat! Paper: arxiv.org/abs/2303.11589 Code: layoutdiffusion.github.io
1
9
798
One paper accepted to #ICCV2023 🥳 Hope to see you in Paris!!
1
8
946
Replying to @Vinc3nt_Leroy
Thanks, Vincent! Big thanks to the DUSt3R team for providing a great foundation to build on!
1
8
706
Replying to @JeromeRevaud
Thanks Jerome, DUSt3R work is amazing!
7
571
We present "more interactive" results on our webpage. Come and check it out! st4rtrack.github.io/page1
1
6
673
A reason seems to be given in Yang Song's ICLR21 work. 🤔 They term this "uniquely identifiable encoding" (the encoding of the same input is determined only by the data distribution, because the SDE doesn't rely on trainable params). They also provide an empirical example on CIFAR.
6
366
🤯
Sora is our first video generation model - it can create HD videos up to 1 min long. AGI will be able to simulate the physical world, and Sora is a key step in that direction. thrilled to have worked on this with @billpeeb at @openai for the past year openai.com/sora
6
582
So true 😂
Me after the #CVPR2024 paper deadline
5
732
Such a representation can be learned with supervision from small-scale, synthetic 4D datasets. But to better generalize to real scenes, St4RTrack can also adapt to new videos *without any 4D labels*, using only 2D reprojection cues like trajectories & monocular depth.
1
5
682
Fantastic work! It's also really great to see that fusing diffusion and DINOv2 features shines in other tasks! 😄
🧵 I'm excited to share that our paper "Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features" has been accepted at @CVPR 2024! A big thank you to my co-authors. Project page- diff3f.github.io/ arXiv- arxiv.org/abs/2311.17024 #CVPR2024
1
5
736
Replying to @ch3njus
We tried mimicking this sequence in simulation with RL. It does a decent job jumping off, but struggles with climbing up due to insufficient arm strength. Unfortunately, we only have one G1 (without hands), so testing risky motions in the real world is quite limited for now 👀 @UnitreeRobotics
1
5
309
I'll also be around the poster of DenseMatcher Friday afternoon at Hall 3+2B #569 with @hkz222! Check the poster @ju_yuanchen made 👇
#ICLR2025 Thrilled for our ICLR 2025 Spotlight: DenseMatcher 🍌! Hall 3 + Hall 2B #569, Fri 25 Apr, 3-5:30 AM EDT. Meet my awesome collaborators Junzhe, Junyi @junyi42, Kaizhe @hkz222 & our advisor Huazhe @HarryXu12 to discuss! ☺️
1
1
5
947
Current foundation model features (SD & DINO) deliver impressive SC results. But their matching accuracy peaks at ~60%, a leap away from human level. What gaps remain? 🤔 We've uncovered a key challenge: these features struggle with geometric ambiguity, or "telling left from right".
1
5
320
Check more insights about St4RTrack from Haven's thread!
When it comes to recovering a dynamic world, many researchers focus on extracting geometry and camera parameters (via SfM or SLAM), while others concentrate on motion estimation, whether through correspondences, optical flow, or point tracking. 🧵1/4
5
1,158
More interesting are our joint reconstruction and tracking results; even the fully feed-forward mode gives promising results ⬇️
1
5
484
Replying to @vincesitzmann
Thanks for sharing, Vincent!! 🥰 I totally agree; data is definitely a key to reconstruction.
4
369
We took a deep dive into "geometry-aware SC" by introducing a specialized geo-aware SC subset. Our findings? 📊 A striking performance gap between our Geo. subset (dashed bar) and the conventional Std. set (solid bar) in SOTA methods. (Note that this subset accounts for ~50% of total keypoints.)
1
4
232
More qualitative comparisons w/ prior art. Our method successfully establishes geometrically correct semantic correspondence even in cases of extreme view variation. Please see the webpage for more results.
1
4
205
To evaluate the method, we propose a benchmark, *WorldTrack*, for world coordinate tracking and dynamic 3D reconstruction.
1
4
541
This is really impressive!! Congrats on the great work @zhengqi_li!
Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!
1
4
857
Huge thanks to the amazing collaborators! Charles Herrmann, @JunhwaHur, Luisa Polania, @jampani_varun, @DeqingSun, Ming-Hsuan Yang
4
329
Replying to @amw7
Thanks Andrew! Very nice to meet you :-)
4
186
That being said, we believe there's still huge potential for sim2real. There's plenty of juice left in the reconstruction pipeline that we haven't fully squeezed yet, like handling more challenging motions and ego-view rendering. Excited about what's ahead.
1
4
270
More on our webpage:
- Proposed new large-scale, challenging SC benchmark for pretraining and evaluation
- Found that large foundation features grasp the global pose of instances
- Leveraged this info for improved correspondence
- Details & analysis of our proposed Geo. subset
1
3
481
Huge thanks to my incredible collaborators for another round of interesting work 😄: Charles Herrmann, @JunhwaHur, Eric Chen, @jampani_varun, @DeqingSun, and Ming-Hsuan Yang. Excited to delve deeper with this follow-up to our previous work at sd-complements-dino.github.i…!
3
492
Replying to @hangg70
Thanks Hang! Totally agree on the point.
3
191
Replying to @jianyuan_wang
Congrats Jianyuan!! Sorry to hear that you had to go through that.. 😅
3
371
@HavenFeng and I are both around the conference. Feel free to talk to us if you are interested in our latest work, St4RTrack!
3
322
Are these problems an innate failing of these features, or can they be alleviated through better post-processing? Yes, they can be! We developed a highly efficient post-processor f(·) that boosts the raw features with just 0.32% extra runtime.
1
3
181
Ours (above). A side-by-side comparison with the previous method (DIFT) shows that, though it also generalizes well, it falls short on resolving geometric confusion (e.g., matching left/front leg to right/back leg). DIFT (below).
1
3
180
We tried to make f(·) small to best retain the raw feature information and ensure generalizability to OOD cases. E.g., the processor is trained on real images yet generalizes to anime images; it is trained with keypoint annotations but extends to query points beyond supervision.
1
3
167
Replying to @ion_barrel
Thank you, time traveller 😂
3
827
Replying to @Michael_J_Black
Thanks, Michael!!
2
226
Replying to @ju_yuanchen
Thanks so much for sharing our work, Yuanchen!
2
160
Replying to @ndsong95
Thanks for sharing, Chonghyuk!
2
138
Reports of 2022
2
306
Replying to @theo_gervet
Cool!! It seems that there's also a concurrent work with this idea: yanjieze.com/GNFactor/
2
141
Thanks, Chris! Currently, it only supports monocular video input since we treat different timestamps of a video as "multiview", but I think it is not challenging to adapt to multiple videos as input 🙂
1
2
105
Replying to @Just_Me1313
Thanks for the feedback! There's a typo in the loading code. I just pushed a commit to the GitHub repo and it should be working now. :-)
1
1
45
Replying to @PETEcemetery
Yes, the output .glb file contains both the point cloud and (optional) camera frustums.
1
133
Replying to @janusch_patas
Thanks for sharing our work!!
1
79
Replying to @AlxandreRufino
Thanks, Alexandre! We have released the code for inference, and the camera trajectory could be exported in a certain format (please refer to github.com/Junyi42/monst3r/i… for more details) 🙂
1
1
387
Replying to @ndsong95 @Nik__V__
Yes, we are currently limited by data annotation. We require ground truth depth, camera poses, and intrinsics for the training data, which limits the data we can use. Even the only real-world dataset we use, Waymo, is domain-specific to driving.
1
1
111
Replying to @zhifan_zhu
Thanks for the question! Stereo4D is a great data contribution, and we believe training St4RTrack on it could further boost performance. We're excited to explore this in the future!
1
116
Hi, new to Twitter~ I'm looking for an AI MPhil/PhD position for Fall 2024, and welcome everyone to be friends 🥰
1
Replying to @ChuanxiaZ
Thanks for the kind words, Chuanxia!
1
97
Replying to @ZeYanjie
Boo-hoo, I want to go to California and soak up the sun 😭
1
Replying to @LightQuantumhah
Blind guess: this semester's Theory of Computation / Cryptography 😇
1
1
401
Replying to @HarryXu12
Thanks Huazhe! Will try to release the code soon 😃
1
228
Thanks, Baifeng!!
1
139
🤣🤣
Excited to announce my newest breakthrough project!! 🔥🔥 State-of-the-art results (100%!!) on widely used academic benchmarks (MMLU, GSM8K, HumanEval, OpenbookQA, ARC Challenge, etc.) 🔥🔥 1M param LLM trained on 100k tokens 🤯 How?? Introducing **phi-CTNL** 🧵👇 1/6
1
445
Replying to @LightQuantumhah
The political-work staff even get to happily cut back on training, which is really nice 😇
1
1
Replying to @ChiehHubertLin
Congratulations!
1
35