It's been 6 months since I slammed the brakes on several PhD research projects to go work at π... 😅 super excited to finally share our results! A short 🧵 with some details:
At Physical Intelligence (π) our mission is to bring general-purpose AI into the physical world. We're excited to show the first step towards this mission - our first generalist model π₀ 🧠 🤖 Paper, blog, uncut videos: physicalintelligence.company…
9
49
653
101,754
In LLM land, a slow model is annoying. In robotics, a slow model can be disastrous! Visible pauses at best, dangerously jerky motions at worst. But large VLAs are slow by nature. What can we do about this? An in-depth 🧵:
13
64
525
89,727
The biggest problem with our RL diffusion paper was that nobody could run our Jax+TPU code. No more! I've reimplemented DDPO in PyTorch, plus replicated our results using LoRA for low-memory training! Links below 👇
10
44
294
90,915
These people copy + pasted the website I designed (including the body text), changed some stuff here and there, and then released it as their own work without attribution. The best part is they didn't even change the preview thumbnail so it still says "Octo". Incredible stuff!
7
6
233
74,417
Here's a link to the recording for anyone that's interested! piped.video/live/ELUMFpJCUS0…
3
17
180
20,177
If you're at #CoRL2024, come check out my talk at the X-Embodiment workshop at 1:30pm! Thanks to @KarlPertsch for inviting me to speak!
2
9
149
39,435
This caption is a bit funny to me because we've put precisely zero effort into optimizing our model implementation. Thanks JAX!
3
4
115
14,373
My favorite slide that I made for my talk last weekend -- a very silly thought experiment in which we compare language datasets to robotics datasets (in the most shallow way possible). Yes it is to scale; I learned that the maximum shape size in Keynote is 20,000pts
5
4
92
20,881
I'm surprised more window manager enthusiasts don't know about yabai. This is what my macOS setup looks like -- maybe I'm just not a power user, but I don't miss i3 one bit!
No you cannot, you can only get it down to 200ms or so. At least when i had one 4y ago, and believe me i searched!
9
2
74
22,253
Replying to @cloneofsimo
not just RoPE, tons of ppl copy the original Vaswani et al. posemb code without thinking. it's *much much worse* for diffusion/flow timestep encodings; if you're not careful, you can end up using an encoding calibrated for t ∈ [0, 1000] with t ∈ [0, 1].
5
67
36,865
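To make the scale mismatch concrete, here is a minimal sketch using the textbook sinusoidal encoding from Vaswani et al. (geometrically spaced frequencies starting at 1, with `max_period=10000`). It is not code from any of the projects mentioned here; the function names, `dim=16`, and the 0.2 threshold are illustrative assumptions. It counts how many embedding channels barely move when t only spans [0, 1] versus [0, 1000].

```python
import math

def vaswani_embedding(pos, dim=16, max_period=10000.0):
    """Textbook sinusoidal encoding: sin and cos at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]

def channel_ranges(values, dim=16):
    """Max minus min of every embedding channel over the given inputs."""
    embs = [vaswani_embedding(v, dim) for v in values]
    return [max(e[c] for e in embs) - min(e[c] for e in embs) for c in range(dim)]

def count_flat(ranges, threshold=0.2):
    """Number of channels whose value hardly changes (threshold is an arbitrary choice)."""
    return sum(r < threshold for r in ranges)

unit_ts = [i / 100 for i in range(101)]        # t in [0, 1], e.g. flow matching time
step_ts = [float(i * 10) for i in range(101)]  # t in [0, 1000], e.g. discrete diffusion steps

n_flat_unit = count_flat(channel_ranges(unit_ts))
n_flat_steps = count_flat(channel_ranges(step_ts))
print("nearly constant channels, t in [0, 1]:   ", n_flat_unit)
print("nearly constant channels, t in [0, 1000]:", n_flat_steps)
```

With t restricted to [0, 1], most channels are close to constant, so the network sees very little timestep signal. One possible fix (an assumption here, not something stated in the thread) is to rescale t before encoding, or to choose periods that match the actual range of t.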
🔥🔥🔥 wake up babe, new BridgeData just dropped 🔥🔥🔥 Are you a fan of the original BridgeData? Doesn't matter! BridgeData V2 has 60k trajectories, 24 environments, 13 skills, and 100+ objects.
4
8
62
13,100
Replying to @kevin_zakka
bro capitalizes TeleOperate like it's a Special Move from a fantasy novel
4
57
3,257
We combined the 3 hottest things in machine learning: transformers, diffusion, and cute animal names, and what we got was Octo🐙: an open-source, cross-embodied, generalist robot policy backbone! 1/n
1
7
58
8,872
If you're at #ICLR2024, come check out my DDPO poster tomorrow -- Thurs 10:45am, poster #21, Hall B! It's crazy looking back to see how much has changed since the paper first came out nearly 1 year ago. Really makes me feel how fast things move in this field.
We've updated the DDPO website with some new results for training diffusion models with RL! Our aesthetic bunny is now much more... aesthetic. Latest here: rl-diffusion.github.io/ Includes code, LoRA training for low memory, pretrained models, etc. Some highlights 👇
1
4
48
10,208
I worked a lot on the model design, and I think we ended up with a pretty cool way to adapt a pre-trained VLM backbone for action prediction using a diffusion-style objective (we use flow matching, of course, like all the cool kids these days)
1
3
38
1,704
Here's a link to the code: github.com/kvablack/ddpo-pyt… If you want to learn more about DDPO, you can check out the project website (rl-diffusion.github.io) or @svlevine's original thread below:
We figured out how to train diffusion models with RL to generate images aligned with user goals! Our RL method gets ants to play chess and dolphins to ride bikes. Reward from powerful vision-language models (i.e., RL from AI feedback): rl-diffusion.github.io/ A 🧵👇
2
4
39
10,889
Flow matching is a great fit for modeling continuous action distributions, especially as we scale up data collection and train on many distinct tasks/behaviors/strategies at the same time
2
6
37
5,055
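For readers who haven't seen flow matching before, here is a toy sketch of the idea under one common convention (linear interpolation from noise at t=0 to the data at t=1). This is an assumption for illustration and not necessarily the convention used in π₀; all function names are hypothetical, and an "oracle" velocity for a single known action stands in for the learned network.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Regression target for the network under the linear path: d x_t / d t."""
    return [a - n for n, a in zip(noise, action)]

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained model: the exact velocity toward one known action.
action = [0.5, -0.25, 1.0]

def oracle_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in action]
sample = euler_sample(oracle_velocity, noise)
print(sample)
```

In real training, a network would be regressed onto `target_velocity` at random t, and the multi-modality mentioned above comes from the learned field averaging over many (noise, action) pairs rather than a single target.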
Replying to @anshulkundaje
this simply isn't true, at least not at Berkeley. students with all sorts of backgrounds get in, and everyone I know has grown significantly during their PhD. junior students are not at all like postdocs, and usually have a lot to learn about doing research.
2
33
4,376
Our solution, real-time chunking (RTC), combines action chunking with inpainting — the actions within the inference delay are frozen, while the rest are “inpainted” in a way that’s consistent with the previous plan.
1
2
37
5,225
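A toy sketch of just the "freezing" half of that idea, assuming the inference delay is measured in control steps. The helper names are hypothetical, and `proposal` is a placeholder for whatever the policy's inpainting procedure produces; the inpainting itself needs the policy's denoising process and is not shown here.

```python
def frozen_mask(horizon, delay_steps):
    """True for actions that will already have been executed by the time inference finishes."""
    if not 0 <= delay_steps <= horizon:
        raise ValueError("delay_steps must be between 0 and horizon")
    return [i < delay_steps for i in range(horizon)]

def merge_chunks(prev_remaining, proposal, delay_steps):
    """Keep the frozen prefix from the previous plan, take everything else from the new chunk."""
    if delay_steps > len(prev_remaining):
        raise ValueError("previous plan is too short to cover the inference delay")
    mask = frozen_mask(len(proposal), delay_steps)
    return [prev_remaining[i] if frozen else proposal[i]
            for i, frozen in enumerate(mask)]

# Example with scalar "actions": 3 actions left in the old plan, a delay of 2 steps.
prev_remaining = [1, 2, 3]
proposal = [9, 8, 7, 6]
merged = merge_chunks(prev_remaining, proposal, delay_steps=2)
print(merged)
```

The point of freezing is that the robot keeps executing the old plan while the model is still thinking, so the new chunk has to agree with whatever was executed during that window.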
We just released the code and model weights for DDPO! Excited to see what the community will do with this 😃 Project website: rl-diffusion.github.io Code (links to weights/demos inside): github.com/jannerm/ddpo
3
1
33
7,940
Overall, working at @physical_int has been a blast and joining was definitely the right decision. I can't believe it's only been 6 months and I can't wait for what comes next!
3
25
2,190
DDPO updates: after fixing some numerical precision issues, of all things, the aesthetic quality results look much better! (1/4)
We've updated the DDPO website with some new results for training diffusion models with RL! Our aesthetic bunny is now much more... aesthetic. Latest here: rl-diffusion.github.io/ Includes code, LoRA training for low memory, pretrained models, etc. Some highlights 👇
1
5
24
14,017
Replying to @ZhongpaiGao
This is a common sentiment, but I disagree. A website isn't automatically a template unless it's explicitly advertised as such. Copying it without permission is no different than copying someone's artwork or writing.
1
23
1,984
Here's a little secret: π₀-small, which also uses flow matching but not a VLM backbone, was our "main model" for 4+ months and was outperforming many strong baselines! IMO the most exciting benefit of adding the VLM init was drastically improved language following
1
23
1,429
Replying to @chris_j_paxton
The paper is original (as far as I can tell). The website is plagiarized. The second body paragraph is almost word-for-word identical, not to mention the overall design, which was obviously copied and slightly modified.
2
1
21
3,382
Finally arrived in Vienna for #ICLR2024! @mitsuhiko_nm and I will be at the first poster session tomorrow morning presenting SuSIE --- a simple recipe for generalizable robotic manipulation using a pretrained diffusion model. Come check us out at poster #69, hall B, 10:45am!
Diffusion models make great images. But can they drive robots? Usually that gets complicated really fast. We figured out how to get a Stable Diffusion model (based on Instruct pix2pix) to drive robotic instruction following. Simple recipe, works on a wide range of tasks. Thread👇
2
20
8,117
Finally, there’s a subtle issue with non-real-time inference that’s easy to overlook: distribution shift. Pauses for inference are not in the training data! We found that RTC was not only faster, but also more precise and consistent than our old synchronous strategy.
2
1
23
2,973
Importantly, this requires no training-time changes! It’s applicable to any diffusion- or flow-based policy at inference time. With RTC, we get smooth real-time execution.
1
24
2,463
Thanks to LoRA and mixed precision, you can now finetune Stable Diffusion with less than 10GB of GPU memory. It runs on my 1080ti!
2
2
17
1,679
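A rough sketch of why LoRA helps with memory: instead of training a full weight matrix, you train two thin low-rank factors. The layer size and rank below are made-up illustrative numbers, not Stable Diffusion's actual dimensions, and the function names are hypothetical. Mixed precision is a separate saving that this sketch does not cover.

```python
def full_params(d_in, d_out):
    """Trainable parameters when finetuning a dense d_out x d_in weight directly."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r update B @ A, with B: d_out x r and A: r x d_in."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 1024, 1024, 4
frozen = full_params(d_in, d_out)
trainable = lora_params(d_in, d_out, rank)
ratio = trainable / frozen
print(f"{trainable} trainable vs {frozen} frozen ({ratio:.2%})")
```

Fewer trainable parameters means smaller gradients, smaller optimizer state, and much smaller checkpoints.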
Replying to @anshulkundaje
yes, most admits have written at least one paper (not necessarily at a top conference), but that doesn't make them "basically postdocs". also, almost everyone views their ugrad work as immature and weak compared to their PhD work -- which supports your wider argument for academia!
1
15
1,301
Model size is not the only contributor to latency. Personally, I’m betting that the VLAs that solve physical intelligence will not be able to fit in onboard robot computers. That means we will need centralized inference servers, and we will have network latency.
2
20
1,881
For super cool uncut videos of evaluations and all that other good stuff, check out our blog post: physicalintelligence.company…
1
18
2,675
Shoutout to my friend @shreyaskapur and his new library Iceberg (github.com/revalo/iceberg) for the sick animation! Fun fact, the animation would also not have been possible without LoRA, because otherwise the checkpoints are way too big to save every epoch.
2
16
1,057
For smooth execution, we need to always produce the next action as soon as it’s needed. This is called a “real-time constraint”. With high-latency models, this requires concurrency: generating new actions while executing old ones. But naive concurrency does not work.
1
16
2,578
Replying to @anshulkundaje
there is selection towards people with prior research experience, but "multiple first author papers at top conferences" is not common at all
3
14
1,307
To prepare for this future, we added up to +200ms of artificial latency to π0.5 (>300ms total), and the speed and performance of RTC were totally unaffected!
1
16
1,681
my therapist: incompressible drake isn't real he can't hurt you incompressible drake:
1
14
636
tag urself im sixx ttutttas
2
13
1,095
Google Brain really cooked with bfloat16
13
1,607
I mean, technically the model is optimized... by the XLA compiler, not by a human! from arxiv.org/abs/2502.19645
1
1
13
2,281
how to check if the latest "AI feature" rollout is being powered by a real LLM (google messages "magic compose" passes with flying colors)
11
867
Also at ICML 👍
1
11
2,171
Octo has been accepted to RSS 2024! For the full paper, we added some juicy new experiments (including bimanual ALOHA). And of course we're also releasing some new and improved models! The best part of finally uploading to arXiv is getting those sweet sweet AK tweets 😉
Octo: An Open-Source Generalist Robot Policy. Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a
1
10
862
I'm especially proud of `octo.data`. We knew we had to take advantage of the incredible Open X-Embodiment dataset, but it turns out that just because the data exists doesn't mean that loading it is easy! We went through a lot of pain so that you don't have to. 3/n
1
1
9
637
nice! you have a minor bug in your timestep sampling function -- it should be `(self.s - sample) / self.s`. also, you don't need `num_train_timesteps` ;)
1
6
560
if productization is the bottleneck, then shouldn't there be a non-product prototype of a general-purpose humanoid, or even autonomous decluttering?
1
2
5
658
Replying to @CyberRobooo
The robot is teleoperated. Here is the original post from @watneyrobotics, which you copied and watermarked as your own.
Ever seen a robot do a cannonball?
1
5
411
Replying to @ericjang11
PyTorch always manages to be wrong 🫣
1
2
1,319
copying someone else's original work without acknowledgement is plagiarism. technically, it's also copyright infringement (with or without acknowledgement), although academics are typically protected by fair use.
1
5
685
Thanks again to my awesome collaborators @michaeljanner, @du_yilun, @ikostrikov, and my advisor @svlevine
4
377
I think a) is the hard part and still an open research problem with plenty of people working on it
1
4
226
Replying to @ekzhang1
> JAX is the framework most LLMs are trained on
Is this true? Which companies use JAX? (genuinely curious)
1
3
454
Sometimes, data is all you need. We got *6 different methods* -- both image-conditioned and language-conditioned, imitation learning and RL -- to achieve zero-shot generalization to new tasks, objects, and environments. Download the data for yourself 👇 rail-berkeley.github.io/brid…
1
4
553
And especially grateful to my co-leads, @its_dibya @HomerWalke @KarlPertsch @oier_mees. Everyone went absolutely all-in on making this into a killer project -- working together has been one whale of an opportunity! n/n
1
3
483
I'm sure it varies by lab based on the PI's recruiting preferences. I personally have 1 ugrad paper in a very small, non-AI conference. I would say the majority of my PhD friends don't have the "extensive prior experience" background
1
2
244
Check out Sergey's thread for more:
What kind of general-purpose robotic learning algorithm can learn to perform such a huge range of skills in so many different environments, based on either language commands or goals? Let me explain😉 Thread below👇
4
664
Replying to @BoyuanChen0
shameless plug for SuSIE (rail-berkeley.github.io/susi…) -- beats RT-2 and video prediction by quite a large margin (and we tried *very* hard to get video prediction working)
3
137
Replying to @yacineMTB
do I need to record a keyboard overlay? it's literally instant
I dislike apple in general but unfortunately I also like having an 18 hour battery life and my drivers working on the first try
3
348
This is super useful! I've never actually measured FID because I was too lazy to install 2+ year old repos...
FID computation can be quite esoteric, here's a simple helper to do it in JAX. You can compute FID online during training! This implementation can closely match the numbers from OpenAI's guided-diffusion evaluations. Code: github.com/kvfrans/jax-fid-p…
3
1,620
that's great that you're working on that, and I'm not saying that your company isn't going to solve it ;) but calling general-purpose manipulation a product problem is like calling self-driving a product problem ever since the 2005 DARPA challenge in the Mojave desert
2
3
187
but have they been successful? I guess my point here is that if a research lab has not once successfully demonstrated what you want your product to do, then it seems quite inaccurate to say that "productizing" is "the bottleneck"
1
3
198
Regardless, BERTScore seemed to be more than sufficient for our tasks. Also worth mentioning that for the counting issue, LLaVA directly produces the "fake" number -- so better response scoring wouldn't fix it, only a better VLM. Again, I'm sure GPT-4 could do it ;).
1
3
68
And finally, here are the maximally aesthetic majestic animals that I can't stop staring at (4/4)
3
318
Replying to @giffmana @yacineMTB
I was bored this afternoon so I timed it using my phone's 240fps slow-mo camera
Yabai on my 2023 MacBook Pro: 71ms (17 frames)
i3 on my deskmate's 2020 X1 Carbon: 88ms (21 frames)
seems pretty instant to me 😅
3
146
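The frame counts above convert to milliseconds like this (a small sketch; the function name is made up, and the only inputs are the 240fps rate and the frame counts from the post):

```python
def frames_to_ms(frames, fps=240):
    """Elapsed time in milliseconds for a number of frames at a given frame rate."""
    return frames * 1000 / fps

yabai_ms = frames_to_ms(17)   # Yabai measurement
i3_ms = frames_to_ms(21)      # i3 measurement
resolution_ms = frames_to_ms(1)
print(f"yabai: {yabai_ms:.1f}ms, i3: {i3_ms:.1f}ms, one frame: {resolution_ms:.1f}ms")
```

Note that a single frame at 240fps is a bit over 4ms, so each measurement is only accurate to within a few milliseconds.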
Replying to @ramkumarkoppu
Of course! Our experiments use Pi0.5, which is more or less the same architecture. There is not yet an official implementation of RTC in openpi though.
3
1,036
Replying to @ramkumarkoppu
pi_zero also starts from a pretrained VLM (PaliGemma), which has seen a lot of Internet data! the idea of this slide is to compare datasets rather than models (the labels are a bit confusing in that sense -- should probably read "GPT-2/Llama 3 training dataset")
2
687
yep, Pi0.5 is also based on PaliGemma 1.
2
68
Thanks! In my experience, LLaVA isn't good enough to give accurate numerical scores if you ask for them directly. Even ChatGPT with few-shot prompting seemed to struggle with this. I'm sure GPT-4 could do it -- or maybe I'm just not a good enough prompt engineer :'(.
1
2
95
Super cool new work from Google DeepMind that, most importantly, continues the legacy of the "raccoon washing dishes" prompt (3/4)
@PaulVicol and I are excited to introduce DRaFT, a method that fine-tunes diffusion models on rewards (such as scores from human preference models) by backpropagating through the diffusion sampling! with @kswersk, @fleet_dj arXiv: arxiv.org/abs/2309.17400 (1/5)
1
2
474
Replying to @giffmana
do you mean during data collection or post-hoc? post-hoc isn't possible because you don't have access to the robot's (and the world's) dynamics. I guess adding pauses during data collection could work but it feels... quite unsavory
1
2
445
I used to be exactly like you. I transitioned to macOS with yabai/skhd a year ago and quickly found it to be *better* than i3 with the right setup. 0ms switching, intuitive controls, plus edge cases like multiple displays with different resolutions "just work"
1
2
129
We tried a few CLIP experiments but didn't get anything compelling. Stable Diffusion already uses the CLIP text encoder, so you wouldn't expect CLIP applied naively to magically improve things that Stable Diffusion is bad at. LLaVA is generally more powerful than CLIP as well.
1
2
72
Replying to @tarikkelestemur
Replying to @kvablack
I mean, technically the model is optimized... by the XLA compiler, not by a human! from arxiv.org/abs/2502.19645
1
347
Good point, we used L/14 throughout. No idea how LLaVA compares to the largest CLIP models. I do think the LM backbone helps a lot with reasoning capabilities, but that might not come into play until more complex reward functions (which I hope to see in future work!)
1
36
Replying to @payandath @svlevine
Midjourney would absolutely blow us out of the water, haha. These results are more about showing that our algorithm works and can achieve some promising results. I don't think anything based on Stable Diffusion can get anywhere near Midjourney atm.
1
1
27
Shoutout to @carperai's @iScienceLuvr (below) and also @huggingface for integrating DDPO into their DRLX and TRL libraries, respectively (2/4)
Really excited to share something I've been working on for the last couple months: DRLX - A library for doing RLHF on diffusion models! Implements the DDPO algorithm but more algorithms coming soon! Read more about the library and our experiments here: carper.ai/enhancing-diffusio…
1
1
185
Replying to @ThePrimeagen
I was 48 hours away from having to ship a frontend thing and I decided to learn Svelte instead of trying to use React (which I had used before) and I think it saved my ass
64
Replying to @offchan420
I think the aesthetic quality DDPO finetuning is decent, but these experiments are really quite small-scale compared to production usage. I think it's up to industry to scale up methods like DDPO and see what really works best.
1
75
Replying to @liliyu_lili
In the SDXL tech report there's an interesting tidbit that COCO FID is negatively correlated with image quality. Their chart has SD2-1 at worse FID than SD1-5, while yours has the opposite. I'm not an expert so I might be missing something -- is there an obvious explanation?
1
160
Thanks!
1. the long term goal is to solve robot manipulation
2. on the model side, none (actions are just vectors); on the hardware/data side, probably a lot, since we would have to build or buy a hand and then collect lots of data with it
1
1
76
Always feels nice to have your work shared by others!
🐙 Octo: An Open-Source Generalist Robot Policy Transformer-based diffusion policy, pretrained on 800k robot episodes from the Open X-Embodiment dataset proj: octo-models.github.io/ abs: arxiv.org/abs/2405.12213
1
512
Flexibility was our #1 design principle. You can swap out different observation spaces, action spaces and training objectives with only a config change. This allowed us to get great results across 6 robot setups and 3 institutions! 2/n
1
1
237