It's been 6 months since I slammed the brakes on several PhD research projects to go work at π... 😅 super excited to finally share our results! A short 🧵 with some details:
At Physical Intelligence (π) our mission is to bring general-purpose AI into the physical world.
We're excited to show the first step towards this mission - our first generalist model π₀ 🧠 🤖
Paper, blog, uncut videos: physicalintelligence.company…
In LLM land, a slow model is annoying. In robotics, a slow model can be disastrous! Visible pauses at best, dangerously jerky motions at worst. But large VLAs are slow by nature. What can we do about this? An in-depth 🧵:
The biggest problem with our RL diffusion paper was that nobody could run our Jax+TPU code. No more! I've reimplemented DDPO in PyTorch, plus replicated our results using LoRA for low-memory training!
Links below 👇
These people copy + pasted the website I designed (including the body text), changed some stuff here and there, and then released it as their own work without attribution. The best part is they didn't even change the preview thumbnail so it still says "Octo". Incredible stuff!
This caption is a bit funny to me because we've put precisely zero effort into optimizing our model implementation. Thanks JAX!
ALT "Also, despite its larger size, π0 outperforms both RDT-1B and Diffusion Policy in speed thanks to its optimized JAX implementation (all other methods are implemented in PyTorch)."
My favorite slide that I made for my talk last weekend -- a very silly thought experiment in which we compare language datasets to robotics datasets (in the most shallow way possible). Yes it is to scale; I learned that the maximum shape size in Keynote is 20,000pts
ALT Comparison of robotics and language datasets in terms of hours: OXE, π dataset, GPT-2 training dataset, and Llama 3 training dataset.
I'm surprised more window manager enthusiasts don't know about yabai. This is what my macOS setup looks like -- maybe I'm just not a power user, but I don't miss i3 one bit!
not just RoPE, tons of ppl copy the original Vaswani et al. posemb code without thinking. it's *much much worse* for diffusion/flow timestep encodings; if you're not careful, you can end up using an encoding calibrated for t ∈ [0, 1000] with t ∈ [0, 1].
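To make the mismatch concrete, here's a minimal Python sketch (not taken from any particular codebase). `posemb` is a hypothetical Vaswani-style sinusoidal encoding with the usual 10,000 max period, and `channels_that_move` is a made-up helper that counts how many output channels change noticeably over a range of inputs. The 8-dim size, the 0.2 threshold, and the x1000 rescale are illustrative assumptions.

```python
import math

def posemb(t, dim=8, max_period=10_000.0):
    """Vaswani-style encoding: angle_i = t / max_period**(i / half)."""
    half = dim // 2
    angles = [t / max_period ** (i / half) for i in range(half)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

def channels_that_move(embed, ts, tol=0.2):
    """Count channels whose value varies by more than `tol` across `ts`."""
    embs = [embed(t) for t in ts]
    return sum(1 for ch in zip(*embs) if max(ch) - min(ch) > tol)

ts = [i / 10 for i in range(11)]  # timesteps spread over [0, 1]

# Feeding t in [0, 1] straight into an encoding designed for large positions.
naive = channels_that_move(posemb, ts)

# Same encoding, but t is first rescaled to [0, 1000] (assumed fix).
scaled = channels_that_move(lambda t: posemb(1000.0 * t), ts)

print(f"channels carrying signal: naive={naive}, scaled={scaled}")
```

Fed t ∈ [0, 1] directly, most channels barely change, so the network gets very little timestep signal. Rescaling t, or picking periods that match the input range, is one way around it.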
🔥🔥🔥 wake up babe, new BridgeData just dropped 🔥🔥🔥
Are you a fan of the original BridgeData? Doesn't matter! BridgeData V2 has 60k trajectories, 24 environments, 13 skills, and 100+ objects.
We combined the 3 hottest things in machine learning: transformers, diffusion, and cute animal names, and what we got was Octo🐙: an open-source, cross-embodied, generalist robot policy backbone!
1/n
If you're at #ICLR2024, come check out my DDPO poster tomorrow -- Thurs 10:45am, poster #21, Hall B!
It's crazy looking back to see how much has changed since the paper first came out nearly 1 year ago. Really makes me feel how fast things move in this field.
We've updated the DDPO website with some new results for training diffusion models with RL! Our aesthetic bunny is now much more... aesthetic.
Latest here: rl-diffusion.github.io/
Includes code, LoRA training for low memory, pretrained models, etc. Some highlights 👇
I worked a lot on the model design, and I think we ended up with a pretty cool way to adapt a pre-trained VLM backbone for action prediction using a diffusion-style objective (we use flow matching, of course, like all the cool kids these days)
ALT our VLM + action expert flow matching architecture for predicting robot actions
We figured out how to train diffusion models with RL to generate images aligned with user goals! Our RL method gets ants to play chess and dolphins to ride bikes. Reward from powerful vision-language models (i.e., RL from AI feedback): rl-diffusion.github.io/
A 🧵👇
Flow matching is a great fit for modeling continuous action distributions, especially as we scale up data collection and train on many distinct tasks/behaviors/strategies at the same time
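A rough sketch of what flow matching looks like for an action vector, assuming the common straight-line path from Gaussian noise (t = 0) to data (t = 1). This is the textbook setup, not necessarily the exact convention used in π₀. `flow_matching_pair`, `euler_sample`, and the oracle velocity field are hypothetical names for illustration; in practice the velocity function would be a trained network conditioned on observations.

```python
import random

def flow_matching_pair(action, t, rng):
    """Build one training example on the straight line from noise to action."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]  # d x_t / d t
    return x_t, target_velocity

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: an oracle velocity field that always points at one known action.
target = [0.5, -1.0, 2.0]
oracle = lambda x, t: [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle, start)
print(sample)
```

With a single target the oracle field carries any starting noise to that action. A learned field trained on many demonstrations can instead represent several distinct behaviors at once.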
this simply isn't true, at least not at Berkeley. students with all sorts of backgrounds get in, and everyone I know has grown significantly during their PhD. junior students are not at all like postdocs, and usually have a lot to learn about doing research.
Our solution, real-time chunking (RTC), combines action chunking with inpainting — the actions within the inference delay are frozen, while the rest are “inpainted” in a way that’s consistent with the previous plan.
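The freezing part of this can be sketched in a few lines of Python. This is only the bookkeeping, under simplifying assumptions: `stitch_chunk` and `frozen_mask` are hypothetical helpers, actions are plain floats, and `proposal` stands in for whatever the policy's sampler returns. The guided inpainting that keeps the remaining actions consistent with the previous plan is not shown.

```python
def frozen_mask(chunk_len, delay):
    """True for the steps that will already be executing while inference runs."""
    return [i < delay for i in range(chunk_len)]

def stitch_chunk(prev_chunk, start, delay, proposal):
    """Combine the old plan and a new proposal into the next chunk.

    prev_chunk: actions from the previous plan
    start:      index in prev_chunk at which inference began
    delay:      control steps that elapse during inference
    proposal:   new chunk, aligned so proposal[0] matches prev_chunk[start]
    """
    assert start + delay <= len(prev_chunk), "old plan runs out before inference ends"
    mask = frozen_mask(len(proposal), delay)
    return [
        prev_chunk[start + i] if frozen else proposal[i]
        for i, frozen in enumerate(mask)
    ]

prev_chunk = [float(i) for i in range(10)]       # old plan: 0.0 .. 9.0
proposal = [100.0 + i for i in range(8)]         # stand-in for sampler output
new_chunk = stitch_chunk(prev_chunk, start=4, delay=3, proposal=proposal)
print(new_chunk)
```

The first `delay` entries of the new chunk are copied from the old plan, since the robot will have executed them by the time the new chunk is ready. Everything after that is free to change.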
We just released the code and model weights for DDPO! Excited to see what the community will do with this 😃
Project website: rl-diffusion.github.io
Code (links to weights/demos inside): github.com/jannerm/ddpo
ALT teaser figure showing progressions of images throughout RL training: a picture of a llama becoming more and more compressible by becoming smaller and with a blurry background (top), a picture of a lion becoming more and more aesthetic by adopting a minimalistic artistic style (middle), and a picture of a raccoon washing dishes becoming more and more true to the prompt (bottom)
Overall, working at @physical_int has been a blast and joining was definitely the right decision. I can't believe it's only been 6 months and I can't wait for what comes next!
This is a common sentiment, but I disagree. A website isn't automatically a template unless it's explicitly advertised as such. Copying it without permission is no different than copying someone's artwork or writing.
Here's a little secret: π₀-small, which also uses flow matching but not a VLM backbone, was our "main model" for 4+ months and was outperforming many strong baselines! IMO the most exciting benefit of adding the VLM init was drastically improved language following
The paper is original (as far as I can tell). The website is plagiarized. The second body paragraph is almost word-for-word identical, not to mention the overall design, which was obviously copied and slightly modified.
Finally arrived in Vienna for #ICLR2024! @mitsuhiko_nm and I will be at the first poster session tomorrow morning presenting SuSIE --- a simple recipe for generalizable robotic manipulation using a pretrained diffusion model. Come check us out at poster #69, hall B, 10:45am!
Diffusion models make great images. But can they drive robots? Usually that gets complicated really fast. We figured out how to get a Stable Diffusion model (based on Instruct pix2pix) to drive robotic instruction following. Simple recipe, works on a wide range of tasks. Thread👇
Finally, there’s a subtle issue with non-real-time inference that’s easy to overlook: distribution shift. Pauses for inference are not in the training data! We found that RTC was not only faster, but also more precise and consistent than our old synchronous strategy.
Importantly, this requires no training-time changes! It’s applicable to any diffusion- or flow-based policy at inference time. With RTC, we get smooth real-time execution.
yes, most admits have written at least one paper (not necessarily at a top conference), but that doesn't make them "basically postdocs". also, almost everyone views their ugrad work as immature and weak compared to their PhD work -- which supports your wider argument for academia!
Model size is not the only contributor to latency. Personally, I’m betting that the VLAs that solve physical intelligence will not be able to fit in onboard robot computers. That means we will need centralized inference servers, and we will have network latency.
Shoutout to my friend @shreyaskapur and his new library Iceberg (github.com/revalo/iceberg) for the sick animation!
Fun fact, the animation would also not have been possible without LoRA, because otherwise the checkpoints are way too big to save every epoch.
For smooth execution, we need to always produce the next action as soon as it’s needed. This is called a “real-time constraint”. With high-latency models, this requires concurrency: generating new actions while executing old ones. But naive concurrency does not work.
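A back-of-the-envelope version of the real-time constraint, as a hedged Python sketch. The 50 Hz control rate and 50-action chunk are made-up numbers for illustration, and `delay_steps` / `latest_start_index` are hypothetical helper names.

```python
def delay_steps(latency_ms, control_hz):
    """Control steps that elapse while one inference call runs (rounded up)."""
    return -(-latency_ms * control_hz // 1000)  # integer ceiling division

def latest_start_index(chunk_len, latency_ms, control_hz):
    """Last step of the current chunk at which the next inference call can
    start and still deliver new actions before the chunk runs out."""
    return chunk_len - delay_steps(latency_ms, control_hz)

CONTROL_HZ = 50   # assumed control rate
CHUNK_LEN = 50    # assumed actions per chunk

for latency_ms in (100, 300):
    d = delay_steps(latency_ms, CONTROL_HZ)
    s = latest_start_index(CHUNK_LEN, latency_ms, CONTROL_HZ)
    print(f"latency={latency_ms}ms: {d} actions consumed during inference, "
          f"start next inference by step {s}")
```

The robot keeps executing the old chunk for `delay_steps` actions while the new one is being generated, so the new chunk has to agree with actions that are already committed. That is why naive concurrency is not enough.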
To prepare for this future, we added up to +200ms of artificial latency to π0.5 (>300ms total), and the speed and performance of RTC were totally unaffected!
ALT a screenshot from the DDPO website showing reward-hacking against LLaVA, where the diffusion model generates some wacky looking text instead of the correct number of animals
Octo has been accepted to RSS 2024! For the full paper, we added some juicy new experiments (including bimanual ALOHA). And of course we're also releasing some new and improved models!
The best part of finally uploading to arXiv is getting those sweet sweet AK tweets 😉
Octo
An Open-Source Generalist Robot Policy
Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a
I'm especially proud of `octo.data`. We knew we had to take advantage of the incredible Open X-Embodiment dataset, but it turns out that just because the data exists doesn't mean that loading it is easy! We went through a lot of pain so that you don't have to.
3/n
nice! you have a minor bug in your timestep sampling function -- it should be `(self.s - sample) / self.s`. also, you don't need `num_train_timesteps` ;)
copying someone else's original work without acknowledgement is plagiarism. technically, it's also copyright infringement (with or without acknowledgement), although academics are typically protected by fair use.
Sometimes, data is all you need. We got *6 different methods* -- both image-conditioned and language-conditioned, imitation learning and RL -- to achieve zero-shot generalization to new tasks, objects, and environments.
Download the data for yourself 👇
rail-berkeley.github.io/brid…
And especially grateful to my co-leads, @its_dibya @HomerWalke @KarlPertsch @oier_mees. Everyone went absolutely all-in on making this into a killer project
-- working together has been one whale of an opportunity!
n/n
I'm sure it varies by lab based on the PI's recruiting preferences. I personally have 1 ugrad paper in a very small, non-AI conference. I would say the majority of my PhD friends don't have the "extensive prior experience" background
What kind of general-purpose robotic learning algorithm can learn to perform such a huge range of skills in so many different environments, based on either language commands or goals?
Let me explain😉
Thread below👇
shameless plug for SuSIE (rail-berkeley.github.io/susi…) -- beats RT-2 and video prediction by quite a large margin (and we tried *very* hard to get video prediction working)
do I need to record a keyboard overlay? it's literally instant
I dislike apple in general but unfortunately I also like having an 18 hour battery life and my drivers working on the first try
FID computation can be quite esoteric, here's a simple helper to do it in JAX. You can compute FID online during training! This implementation can closely match the numbers from OpenAI's guided-diffusion evaluations.
Code: github.com/kvfrans/jax-fid-p…
that's great that you're working on that, and I'm not saying that your company isn't going to solve it ;) but calling general-purpose manipulation a product problem is like calling self-driving a product problem ever since the 2005 DARPA challenge in the Mojave desert
but have they been successful? I guess my point here is that if a research lab has not once successfully demonstrated what you want your product to do, then it seems quite inaccurate to say that "productizing" is "the bottleneck"
Regardless, BERTScore seemed to be more than sufficient for our tasks. Also worth mentioning that for the counting issue, LLaVA directly produces the "fake" number -- so better response scoring wouldn't fix it, only a better VLM. Again, I'm sure GPT-4 could do it ;).
I was bored this afternoon so I timed it using my phone's 240fps slow-mo camera
Yabai on my 2023 MacBook Pro: 71ms (17 frames)
i3 on my deskmate's 2020 X1 Carbon: 88ms (21 frames)
seems pretty instant to me 😅
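For anyone checking the math, the frame counts convert to milliseconds like this (a tiny Python sketch; `frames_to_ms` is just an illustrative helper, and the only inputs are the 240fps rate and the frame counts above):

```python
def frames_to_ms(frames, fps=240):
    """Convert a slow-mo frame count into elapsed milliseconds."""
    return 1000.0 * frames / fps

measurements = {"yabai": 17, "i3": 21}  # frame counts from the slow-mo footage

for name, frames in measurements.items():
    print(f"{name}: {frames_to_ms(frames):.1f} ms")

# One frame at 240fps is the resolution limit of this measurement.
print(f"resolution: {frames_to_ms(1):.2f} ms per frame")
```

Each frame is a bit over 4ms, so the 4-frame gap between the two setups is real but small.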
Of course! Our experiments use Pi0.5, which is more or less the same architecture. There is not yet an official implementation of RTC in openpi though.
pi_zero also starts from a pretrained VLM (PaliGemma), which has seen a lot of Internet data! the idea of this slide is to compare datasets rather than models (the labels are a bit confusing in that sense -- should probably read "GPT-2/Llama 3 training dataset")
Thanks! In my experience, LLaVA isn't good enough to give accurate numerical scores if you ask for them directly. Even ChatGPT with few-shot prompting seemed to struggle with this. I'm sure GPT-4 could do it -- or maybe I'm just not a good enough prompt engineer :'(.
@PaulVicol and I are excited to introduce DRaFT, a method that fine-tunes diffusion models on rewards (such as scores from human preference models) by backpropagating through the diffusion sampling!
with @kswersk, @fleet_dj
arXiv: arxiv.org/abs/2309.17400
(1/5)
do you mean during data collection or post-hoc? post-hoc isn't possible because you don't have access to the robot's (and the world's) dynamics. I guess adding pauses during data collection could work but it feels... quite unsavory
I used to be exactly like you. I transitioned to macOS with yabai/skhd a year ago and quickly found it to be *better* than i3 with the right setup. 0ms switching, intuitive controls, plus edge cases like multiple displays with different resolutions "just work"
We tried a few CLIP experiments but didn't get anything compelling. Stable Diffusion already uses the CLIP text encoder, so you wouldn't expect CLIP applied naively to magically improve things that Stable Diffusion is bad at. LLaVA is generally more powerful than CLIP as well.
Good point, we used L/14 throughout. No idea how LLaVA compares to the largest CLIP models. I do think the LM backbone helps a lot with reasoning capabilities, but that might not come into play until more complex reward functions (which I hope to see in future work!)
Midjourney would absolutely blow us out of the water, haha. These results are more about showing that our algorithm works and can achieve some promising results. I don't think anything based on Stable Diffusion can get anywhere near Midjourney atm.
Really excited to share something I've been working on for the last couple months:
DRLX - A library for doing RLHF on diffusion models!
Implements the DDPO algorithm but more algorithms coming soon!
Read more about the library and our experiments here: carper.ai/enhancing-diffusio…
I was 48 hours away from having to ship a frontend thing and I decided to learn Svelte instead of trying to use React (which I had used before) and I think it saved my ass
I think the aesthetic quality from DDPO finetuning is decent, but these experiments are really quite small-scale compared to production usage. I think it's up to industry to scale up methods like DDPO and see what really works best.
In the SDXL tech report there's an interesting tidbit that COCO FID is negatively correlated with image quality. Their chart has SD2-1 at worse FID than SD1-5, while yours has the opposite. I'm not an expert so I might be missing something -- is there an obvious explanation?
Thanks!
1. the long term goal is to solve robot manipulation
2. on the model side, none (actions are just vectors); on the hardware/data side, probably a lot, since we would have to build or buy a hand and then collect lots of data with it
🐙 Octo: An Open-Source Generalist Robot Policy
Transformer-based diffusion policy, pretrained on 800k robot episodes from the Open X-Embodiment dataset
proj: octo-models.github.io/
abs: arxiv.org/abs/2405.12213
Flexibility was our #1 design principle. You can swap out different observation spaces, action spaces and training objectives with only a config change. This allowed us to get great results across 6 robot setups and 3 institutions!
2/n