"Interactive Language: Talking to Robots in Real Time" interactive-language.github.… - Real-time, interactive, open-vocabulary, language+pixels -> actions - A new scale (~600,000 traj.) for language-conditioned behavior - Dataset, sim, models, code all to be released! (1/n)...
8
176
791
Excited to share some work with colleagues last summer at Facebook Reality Labs that is now up on arXiv! “DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation” Here’s a snippet of some fun interpolations in the shape latent space.
15
151
636
It may be time to settle this. Poll in next tweet.
22
48
448
Last Spring I took off from Google DeepMind, and I've been heads-down building since with an amazing team. Excited to share more today -- introducing Generalist. It's felt to me for a couple of years, since we started bringing multimodal LLMs into robotics, that a subset of the ingredients for creating truly general-purpose robot intelligence seems to be falling into place. But what's been needed is a new focus at the intersection of data, models, and hardware. No amount of downloading data from the internet will, by itself, create the fast, fluid, precise, reactive layer of intelligence needed to interact with the physical world. In due time we'll be excited to share more, but what we're sharing today is about what the models have grown to be capable of. We think we've hit a new point on the frontier of general-purpose real-world intelligence – new levels of simultaneously fast, smooth, precise, reactive, bi-manual coordinated dexterity. Looking forward to sharing even more. Super proud of the team we've put together, and where we're headed. Reach out if you'd like to chat about working together!
Today we're excited to share a glimpse of what we're building at Generalist. As a first step towards our mission of making general-purpose robots a reality, we're pushing the frontiers of what end-to-end AI models can achieve in the real world. Here's a preview of our early results in autonomous general-purpose dexterous capabilities – fast, reactive, smooth, precise, bi-manual coordinated sensorimotor control.
27
33
414
36,783
Today we share more on PaLM-E! (palm-e.github.io) Thread 🧵with blog post link at the end. PaLM-E can do a lot of things across robotics, vision, and language… but let’s look at a few capabilities in detail, step by step 😉 👇
What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website: palm-e.github.io
6
64
245
102,846
From Scott Kuindersma's @BostonDynamics talk on Friday -- Atlas jumping between boxes now with computer vision in the loop. From robotics today seminar -- see @RoboticsSeminar or roboticstoday.github.io/ for more.
6
80
242
Excited to share more about our "Implicit Behavioral Cloning" work! ✅*code* just released: github.com/google-research/i… ✅*videos*: implicitbc.github.io/ Will be sharing more this week at #CoRL2021. I'll also maybe write a TL;DR thread soon, meanwhile, check out the website!
2
46
240
You may have seen this week some pretty powerful large "foundational" models. (i.e., PaLM, DALLE-2). With "Socratic Models" we look into combining such models... composing them zero-shot to do various new tasks, including across modalities. A couple more thoughts below 🧵
With multiple foundation models “talking to each other”, we can combine commonsense across domains, to do multimodal tasks like zero-shot video Q&A or image captioning, no finetuning needed. Socratic Models: website + code: socraticmodels.github.io paper: arxiv.org/abs/2204.00598
2
41
236
A comparison of the largest model sizes used for real-robot control:
3
21
164
37,416
Can robots model the world with keypoints, and learn how to see, predict, and control them into the future? "Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning" @lucas_manuelli, @YunzhuLiYZ, me, @rtedrake arxiv.org/pdf/2009.05085.pdf (1/n)
5
31
137
TL;DR: “How can NeRF be useful for robotics?” One option: train precise correspondence models, made possible by generating training data from NeRF’s beautiful geometry. @yen_chen_lin did an amazing job leading this project.
Hi everyone, I'm happy to share our new #ICRA2022 paper on 𝐦𝐚𝐤𝐢𝐧𝐠 𝐍𝐞𝐑𝐅 𝐮𝐬𝐞𝐟𝐮𝐥 𝐟𝐨𝐫 𝐫𝐨𝐛𝐨𝐭𝐬! NeRF-Supervision is a method that learns dense visual descriptors from NeRF for category-level robotic pick and place. yenchenlin.me/nerf-supervisi…
1
21
133
Very nice real-time reactive robot manipulation demo from @MarcToussaint17's group.
Finally a step from Logic-Geometric Programming to a reactive robotic manipulation framework: "Sequence-of-Constraints MPC: Reactive Timing-Optimal Control of Sequential Manipulation" Paper & Videos: user.tu-berlin.de/mtoussai/2… Thanks to all collaborators! @DannyDriess
1
10
99
New xArm robot (the "Lite 6"), and they're selling some for $1,199. kickstarter.com/projects/ufa… I've really enjoyed using the bigger xArm 6 for robot research. They're simple but pretty high quality for the price point. Exciting to see prices jump even lower.
5
11
85
Very nice! Was hoping somebody would get Diffusion working really well for real-world robot policy learning. Comprehensive display of results (see website), nice visualizations and tasks. 👏 @chichengcc and @SongShuran's lab together with Siyuan and Eric (TRI) and Yilun (MIT) !
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion abs: arxiv.org/abs/2303.04137 project page: diffusion-policy.cs.columbia…
2
8
66
16,521
The number of people trying to claim they already did Q* without even knowing what Q* is... this is hilarious
2
1
66
10,451
One way of thinking about these results — this is the widest diversity of complex tasks I’ve *ever* seen performed by *any* robot. Finally, something actually exceeds the ~2010 PR1 videos :) Also to clarify the video below is teleop, but they have autonomous results for a smaller set but still impressive mix of tasks. Amazing work @zipengfu @tonyzzhao @chelseabfinn
Mobile ALOHA's hardware is very capable. We brought it home yesterday and tried more tasks! It can: - do laundry👔👖 - self-charge⚡️ - use a vacuum - water plants🌳 - load and unload a dishwasher - use a coffee machine☕️ - obtain drinks from the fridge and open a beer🍺 - open doors🚪 - play with pets🐱 - throw away trash - turn on/off a lamp💡 Project website: mobile-aloha.github.io/ Co-lead @tonyzzhao, advised by @chelseabfinn (amazing photographing from @qingqing_zhao_ )
2
8
65
20,120
Excited to have this paper come out, it studies a lot of ideas under one roof! Melts together ideas/models from: "LMs as Zero-Shot Planners", SayCan, Socratic Models, PaLM, Chain-of-thought.
Have you ever “heard” yourself talk in your head? Turns out it's a useful tool for robots too! Introducing Inner Monologue: feeding continual textual feedback into LLMs allows robots to articulate a grounded “thought process” to execute long, abstract instructions 🧵👇
2
3
63
Another recent new video showing more computer vision in the loop for @BostonDynamics robots, this one extracted from a short talk by Marc Raibert here -- piped.video/watch?v=C8-w9eF2… @VentureBeat In this one, Spot picking up clothes.
28
55
🤙 Hardware improvements make life much better in unexpected ways. Example: I used to think there was a software bug causing gripper latency… nope, just static friction in the original grippers!! + new sim 💯 👏 team for @tonyzzhao’s Google DeepMind internship!
Led by @GoogleDeepMind, we present ALOHA 2 🤙: An Enhanced Low-Cost Hardware for Bimanual Teleoperation. ALOHA 2 🤙 significantly improves the durability of the original ALOHA 🏖️, enabling fleet-scale data collection on more complex tasks. As usual, everything is open-sourced!
1
5
53
5,957
Replying to @ericjang11
Love this. The key trend it turns out is that the year is monotonically increasing.
2
2
46
Having robots learn dexterous tasks requiring real-time hand-eye coordination is hard. Can learning visual correspondence make it easier? New paper: “Self-Supervised Correspondence in Visuomotor Policy Learning” Pdf: arxiv.org/abs/1909.06933 Video: piped.video/watch?v=nDRBKb4A…
12
48
New 🤖 paper led by the awesome @WiYoungsun! arxiv.org/pdf/2202.00868.pdf The paper is essentially "using the Force* to deform neural fields" (In this case, DeepSDF-style representations.) A cool thing here is that robots can have tactile (e.g., force-torque) sensing...
2
6
47
Happening tomorrow — join us online or in New Orleans!
Join us next week at the CVPR Tutorial on Vision-Based Robot Learning! We’ll distribute Colabs that show you how to run Socratic Models for language-driven robot pick & place right in your browser (in person, or online!) sites.google.com/view/cvpr20…
3
40
First Question: “Which is the best action space for learning?” 🤔... Second Question: “Can we just *not* choose any one specific action space, and let the model figure it out?” 🙋‍♂️🎉 One step closer to action spaces that *just work* :)
For end-to-end robot learning: pixels to joint angles? or to cartesian poses? IKP uses Implicit BC + (differentiable) kinematics to learn inductive patterns in both action spaces. arxiv.org/abs/2203.01983 w/ @AdityaGanapathi @peteflorence Jake Varley @kaylburns @Ken_Goldberg
1
1
39
You may have noticed, even earlier today, that Large Language Models are getting better. Now, this work from colleagues on our team at Google, shows how to use Large Language Models to make robots work better at planning in the real world. LLMs —> 🤖👍🏻
Super excited to introduce SayCan (say-can.github.io): 1st publication of a large effort we've been working on for 1+ years Robots ground large language models in reality by acting as their eyes and hands while LLMs help robots execute long, abstract language instructions
1
3
31
In your head, when you *plan* into the future - how much planning do you do in "language" in your head? - if not using language, do you visualize? If you visualize, is it photoreal from an ego view, or something else? - is there a 3rd, not language or visualizing, way you plan?
14
1
31
11,092
Recent talk at MIT by Toyota Research Institute on their work scaling diffusion models and dexterous data collection piped.video/live/fwBbj6UmK-I… @Ben_Burchfiel, Siyuan, @eacousineau, @naveenoid, Russ, and co.
3
27
3,345
Here’s another multimodal reasoning question addressed with chain-of-thought, this time doing visual math questions, no OCR required despite needing spatial-textual context, just does everything all in one model. This prompt by @xf1280!
2
3
24
2,834
🎙We podcasted! Thanks for putting it together @kevin_zakka and it was great as always chatting with @ericjang11. I think we covered a bunch of topics. For example I didn’t expect to learn what stigmergy is (thanks @ericjang11!).
Super stoked to release the very first episode of my Casual Robotics podcast: "Progress Towards General Purpose Robots" ft. the brilliant @ericjang11 and @peteflorence. casualrobotics.ai/
3
2
25
For one, “Let’s think step by step” comes to multimodal models! Zero-shot chain-of-thought has been one of these emergent behaviors that has caught considerable interest in researching LLM capabilities… With PaLM-E-562B, zero-shot visual chain-of-thought comes “included”.
2
4
25
3,436
Why RoboPianist? - Full bi-manual anthropomorphic hands, contact-rich manipulation. - CV/NLP have had ambitious high-quality quantitative benchmarks, this helps add more in robotics. - Tons of expansion opportunity: learning from humans / multimodal input / generative music..
Introducing 𝗥𝗼𝗯𝗼𝗣𝗶𝗮𝗻𝗶𝘀𝘁 🎹🤖, a new benchmark for high-dimensional robot control! Solving it requires mastering the piano with two anthropomorphic hands. This has been one year in the making, and I couldn’t be happier to release it today! Some highlights below:
1
5
23
5,952
One of the bigger bimanual teleop demos I’ve seen 🤣
Excavator team work 👷‍♂️🤝🏼👷‍♂️
1
1
24
3,881
Replying to @coreylynch
@coreylynch 👏 very nice bimanual policies!
1
6
2,928
One of the largest areas for impact of *generalist multimodal models* may be in the medical domain. 🩺🩻👩‍⚕️(radiology, dermatology, genomics…) With this new step, Med-PaLM becomes multimodal — a generalist biomedical AI. And it’s finetuned from PaLM-E! 🌴-🤖—> 🥼
Medicine is inherently multimodal. Thrilled to share Med-PaLM M, the first demonstration of a generalist multimodal biomedical AI system with a stellar team @GoogleAI @GoogleDeepMind @GoogleHealth Paper: arxiv.org/pdf/2307.14334.pdf
1
2
22
4,675
In Honolulu to present PaLM-E! 2 time slots: - today (Tuesday) 2:00-3:30 pm poster session, Exhibit Hall 1, #237 - tomorrow (Wednesday) 10:30 am - 12:00 noon, Google DeepMind booth Several authors here and looking forward to chatting with folks!
1
22
3,147
One more -- onboard view & faster.
1
8
21
I'm looking forward to sharing more on our Implicit BC work, and we should have our own implementations out soon. Meanwhile though, Kevin did a very nice PyTorch implementation here of one of the results!
Always fun to use homework as an excuse to implement friends/collaborators' newest work :) Learned a lot taking a stab at @peteflorence and @andyzengtweets's newest Implicit Behavior Cloning in @PyTorch. github.com/kevinzakka/ibc
1
4
20
In the language of Moravec’s paradox: Training on billions of the easy problems seems to be making some of the hard problems more tractable.
1
20
3,914
Check out our new blog post! We talk a bit more about our research process and the questions in our recent Implicit BC work (implicitbc.github.io/).
It can be challenging for robots to imitate precise and decisive behaviors. Introducing Implicit Behavioral Cloning, a simple method that scales to difficult real-world tasks and achieves state-of-the-art performance on human-expert offline RL benchmarks→ goo.gle/3FurkP6
1
20
If you or anybody you know has time at home and has ever wanted to learn 3D CAD, here's a step-by-step tutorial with GIFs at every step. From a 16 y.o. we taught last summer: "I learned more in a couple hours than I did in a year poking at CAD". stageoneeducation.com/cad-tu…
Hey teachers...challenge your kids to learn #CAD with the great GIF tutorials at stageoneeducation.com/ #RemoteLearning #PBL #CTE #Rockets
9
19
Sonic The Hedgehog robot! A solution to the classic “legs vs wheels” debate? Six legs *and* a single omnidirectional wheel. Also a different take on legs+wheels than “wheels at the bottom of legs”, for example Boston Dynamics’ ~2017 Handle bot, Ascento, etc. Also relevant: see “Ballbot” (piped.video/8BtDuzu2WeI?si=9a0T…) and other ball-balancing robots, including BB-8, but add legs.
A six-legged robot that transforms into a ball; it can also roll in any direction. piped.video/yn3FWb-vQQ4 #DIY #handmade #robot #robotics #Biomimicry #生物ロボット #生物模倣 #バイオミミクリー #Armadillo #hexapod #MorpHex #ZentaRobotics
1
3
20
2,579
In addition to the paper, I wanted to highlight that @simonlc_ made a beautifully distilled, narrated, and animated explainer video, intro-ing key topics in simulating contact, which is a pillar of robotics. See Simon's tweet for full YouTube link. Some snippets:
I'm excited to present Single-Level Differentiable Contact Simulation. It's a novel formulation that unifies contact dynamics and collision detection in a single optimization problem. paper: arxiv.org/abs/2212.06764 code: github.com/simon-lc/Silico.j… video: piped.video/oaGLTR13iF8
1
19
2,718
Excited to be starting this week as a Research Scientist @GoogleAI working with many talented folks on the Brain Robotics team! In SF Bay Area — looking forward to spending time with old friends and new ones too.
19
Always such a joy seeing new Atlas videos :) Looks like a hard task, especially with the contact mode switching into/from sliding, and the constraints involved, on both grabbing and stowing. Also looks like a heavy widget. Interesting new fingers too!
Can't trip Atlas up! Our humanoid robot gets ready for real work combining strength, perception, and mobility.
1
1
19
2,503
And one more callout: Low-level physical control skills ("manipulation") that are 1. highly capable and 2. broadly general, remain very challenging. This is not much addressed by these works from our team this week. Tons more work to do there. Moravec's Paradox continues.
2
19
Replying to @Stone_Tao
The hardest solved problem in robotics is camera calibration. The hardest unsolved problem in robotics research is communication to the broader public about what is/isn’t hard.
1
1
19
1,069
Excited to have this come out. A large effort with a lot of folks behind this. Note these videos (previous one and this one below) are "1x speed" (real time)! Here are rollouts for one of the ~87,000 strings the robot can do, "push the yellow star between the green blocks"
1
14
Now that I have a kiddo of my own, when it’s my own birthday, my main thought is, *wow* thank you mom and dad!
17
2,411
Here’s a many-step zero-shot CoT example (prompt by @ayzwah!). Note large VQA training datasets (VQAv2, OKVQA, etc.) typically only have 1-, 2-, 3-word answers, so these many-step answers are considerably out-of-distribution.
1
2
17
1,793
If you work on robot learning -- What do you think would have higher impact on robotics over a 1-year horizon: 100 million diverse high-quality dexterous demonstrations, or the entirety of the rest of robotics research for the year? Context in replies
57% 100 M demonstrations
43% Rest of robot research
308 votes • Final results
11
3
16
7,294
Huy Ha from @SongShuran's lab is brand new to Twitter/X today, currently at 3 followers. Give him a follow? Amazing work by him on this project. Addresses scalability and LLMs and diffusion policies. And check out that website! Also, code is all available :)
How can we put robotics on the same scaling trend as large language models while not compromising on rich low-level manipulation and control?
1
1
16
3,591
🌴🤖: 🦾👀✍️
What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website: palm-e.github.io
1
14
2,290
New blog post is out on Transporter Nets! @andyzengtweets made us a new Blendered explainer visual, and I love it. Code is open source now too and major kudos to @ayzwah for help with the code release. github.com/google-research/r…
Can models more efficiently learn rearrangement tasks by overlaying 3D space instead of using object-centric representations? Check out Transporter Nets, an open-source framework for sample-efficient robot manipulation, with related benchmark tasks. See ↓ goo.gle/37k9KOW
1
16
Do you want to help? Open-source project for low-cost, Arduino-based, partially-3D-printed ventilator: github.com/jcl5m1/ventilator This is to address the potential case of COVID-19 hospitalizations depleting all FDA-approved ventilators. Started by Johnny Lee. Plenty of help needed.
1
10
15
Also here is link for paper, didn’t have it in last tweet! arxiv.org/pdf/1901.05103v1.p… DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, Steven Lovegrove.
1
3
14
It is pretty remarkable to me how quickly just some creative programming can combine models in this style. Our open-source example for image captioning is about 40 lines of non-boilerplate code. (colab.research.google.com/dr…)
1
2
14
Congrats @DrJimFan!! And @yukez !!
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research. We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real. 2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs. Join us on the journey: research.nvidia.com/labs/gea…
1
12
6,176
Blog post by Vincent Vanhoucke which weaves together: - New Yorker cartoons by @BobMankoff - new work in AI from last week including from our team, and - “language as the connective tissue of AI”. vanhoucke.medium.com/bob-man…
3
14
Replying to @snikolov
Nice, this is a great point. Should we call B "multimodality models" then? I think I like that. Here's a reference of folks calling B "multimodal", the 4th MULA workshop: mula-workshop.github.io/ Maybe they could call all that multimodality learning
4
13
Introduction by way of a massively oversimplified Haiku -- VLM problem: Suck at 3D reasoning Generate data :) Actually getting this done, at scale, comes with a very creative pipeline, and lots of analysis. Awesome work led by @BoyuanChen0 and amazing hosting by @xf1280 !
Introducing Spatial VLM, a Vision-Language Model with 3D Spatial Reasoning Capabilities by @GoogleDeepmind. We investigate to what extent synthetic data can help VLMs learn - 3D relationship - quantitative distance - CoT spatial reasoning - RL reward spatial-vlm.github.io (1/6)
2
14
3,257
One capability we study is *interactive language guidance* in which the robot can be iteratively guided by a human to accomplish long-horizon complex tasks requiring multiple minutes of coordinated actions. (These videos are long, sped up to 4x.)
1
11
Multimodal chain-of-thought can be very helpful to get a sense of what the model is picking up on. While the question here is only a 1-bit (yes/no) answer, the chain-of-thought provides much more than 1 bit of information on what the model sees.
1
13
1,297
If you know anybody who’d be interested, encourage them to apply! I signed up to work with students interested in robotics, computer vision, ML. What it is: mentorship in the intangibles of navigating the research world, and intended for students from under-represented groups.
Applications are open for our CS Research Mentorship Program — CSRMP. Students from underrepresented groups in computing are paired with @Google mentors to support their pursuit of research pathways. Learn more & apply by Nov 18 ➡️ research.google/outreach/csr…
1
13
Moving on from chain-of-thought, another capability of PaLM-E that “just comes included” is the ability to do multi-image reasoning… despite only ever being trained on single-image examples.
1
4
13
2,741
I also want to call out another effort, "SayCan", released this week from colleagues on our team. Clearly, trying some "multi-foundation-model" (Socratic) approach + "use LLM for robot planning" (SayCan) will be on the docket for things to try next :)
Super excited to introduce SayCan (say-can.github.io): 1st publication of a large effort we've been working on for 1+ years Robots ground large language models in reality by acting as their eyes and hands while LLMs help robots execute long, abstract language instructions
1
13
Extending multi-image further, we can do more than just 2 images... For this, let’s look at a capability we showed last year with Socratic Models (socraticmodels.github.io/, led by @andyzengtweets), where we could do long-form egocentric video understanding, some examples here:
1
2
13
1,002
The most practical resolution I can think of is for A to become "non-unimodal". But unfortunately that's kind of a mouthful.
3
12
So… it trains about as fast as you can say “Neural Radiance Field” 🤯
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding paper: nvlabs.github.io/instant-ngp… project page: nvlabs.github.io/instant-ngp… github: github.com/NVlabs/instant-ng…
1
1
10
Congrats Figure team, nice to see the hands and manipulation learning! 👏@coreylynch oscar @adcock_brett and co!
Figure-01 has learned to make coffee ☕️ Our AI learned this after watching humans make coffee This is end-to-end AI: our neural networks are taking video in, trajectories out Join us to train our robot fleet: figure.ai/careers
2
11
3,111
But of course, avoiding forgetting is a low bar :) I’ve been glad to see that folks are picking up on the **transfer** story of PaLM-E – for example see r/MachineLearning: teddit.net/r/MachineLearning…
1
1
11
1,170
Very nice @kanjun ! Love this: rich interactable 3D worlds, 10k steps/sec, single-GPU trainable, 215 hours of human data… all open source with great docs (github.com/Avalon-Benchmark/…), excited to see where this heads!
Replying to @kanjun
Avalon is the world's fastest 3D simulator for RL agents. All baselines train on 1 GPU in ~1 day. We want academic researchers to be able to study aspects of intelligence missing from today’s models, even w/o access to large-scale compute. Get started: generallyintelligent.com/ava…
1
3
11
With PaLM-E, we can do this end-to-end, all in one model, with no explicit textual intermediate stage. A wide set of temporal/visual reasoning capabilities are in scope. Lots of potential AR & Robotics applications here.
2
1
12
1,108
Another capability of PaLM-E-562B is that it's, quantitatively, an excellent language model. Roughly as good as PaLM-540B. Notable that scaling the model significantly reduces catastrophic language forgetting 🤔
Replying to @DannyDriess
We observe a notable trend with model scale: the larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks – quantitatively, the 562B PaLM-E model nearly retains all of its language capabilities.
1
12
1,100
Interesting to see where people landed here. In the 2nd poll, the cumulative total is that 71% voted for 1 B demos. And 29% still held out! Here’s the talk from CoRL where I originally had this type of question (timestamped link), I simplified it a bit for this Twitter poll piped.video/aeCZ7DY8KHw?si=oGVH… In the talk there’s a crude back-of-the-envelope estimate that if for 1 year instead of working on CoRL papers everybody just collected demos, we’d have ~230 million demos. As I say in the talk, I’m not actually saying we should do that… we need a diversity of ideas, and CoRL is a great conference full of lots of great ideas, but the point of the question is to get people to think. Phrasing the question this way is elucidating because (a) with this group the “people time” is already paid for, and (b) we know that robot researchers typically collect high-quality demo data 🙂. Maybe this makes you realize that collecting a lot of data is indeed feasible… ideally in a way where people are also doing other research too. And maybe it helps you think about how, if at all, you might adjust your own work for a future world where we have such a large scale of robot demonstration data. Rather than thinking about whether or not 100 M demos would “solve” any subset of robotics, I think it’s more helpful to think about how the robotics landscape would change in such a world. NLP and Computer Vision are quite different since LLMs, CLIP-style models, large-scale text-to-image diffusion models, multimodal LLMs have existed. What shifts in importance? What is then possible, and what doesn't change? What is perhaps even more important than before? Re: the poll’s question, my own view is that both would be immensely valuable. I am both gung-ho on getting lots of demos and also on research on lots of other, potentially unrelated things. In the talk I originally said “nonzero probability” that 230 million demos might be more valuable than everything else combined. 
That’s a weaker statement than the poll’s question. Wild guess, probability is maybe around 50%. If forced to pick as the poll asked, we’ll say maybe over 50%, so I’d take the 100 M demos. Also keep in mind that the impact of 100 M demos might be pretty immediate in under 1 year, whereas methodological research ideas typically take longer to bear fruit in terms of impact. Thoughts?
1
1
12
1,668
The importance of context in communication: Blinking headlights while driving either means - Go f yourself - Thank you - Get out of my way - Go ahead - Your lights are off A single bit, but potentially many bits of context. :)
1
11
2,138
Here's some examples (see thread) from @maxbraun, running our Socratic-Models-based image captioning in our open-source colab (colab.research.google.com/dr…) If anybody would prefer, we can also provide a "request-result-over-Twitter" API :) -- just send some images.
Replying to @maxbraun
0.2773 A high-tech cafeteria where robots serve delicious food without a single human in sight. 0.2362 An empty room with only a cleaning robot to keep it company. 0.2253 A robotic future?
2
11
In a recent-ish podcast (recorded in October, released in January), I had a few comments on where large-scale multimodal models are headed and “one big model” approach... (see around 42 minutes here)
Check out our interview with Google's @peteflorence! We chat about how robotics can benefit from dense visual representations, neural radiance fields, and large language models. It's an exciting time for robotics, take a listen! 👇 thegradientpub.substack.com/…
1
1
9
1,422
Replying to @pfau
Yitang Zhang bounded gaps between primes at age ~58, in ~2013
10
In Socratic Models, this worked by writing out a language-based world-state history – a timestamped log of textually-represented events:
1
1
10
930
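As a rough illustration of the "timestamped log of textually-represented events" described above, here is a minimal Python sketch. The `Event` class, the `render_history` and `build_prompt` helpers, and the exact log layout are hypothetical names chosen for this example; they are not taken from the Socratic Models code, and the real prompt format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One textually-represented event in the world-state history (hypothetical)."""
    timestamp: str    # e.g. "08:05"
    description: str  # e.g. a caption produced by a vision-language model


def render_history(events: List[Event]) -> str:
    """Render the events as a timestamped text log, one event per line."""
    return "\n".join(f"{e.timestamp}: {e.description}" for e in events)


def build_prompt(events: List[Event], question: str) -> str:
    """Prepend the log to a question so a language model can answer from it."""
    return f"{render_history(events)}\nQ: {question}\nA:"


history = [
    Event("08:00", "I am in a kitchen."),
    Event("08:05", "I pick up a mug."),
]
print(build_prompt(history, "What did I pick up this morning?"))
```

The point of the design is that once video is reduced to text like this, long-form video Q&A becomes a reading-comprehension problem for a language model.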
For context, with more and more research happening on B (right), it can be hard to search for things related to A (left). Maybe I'm the only person that runs into this though :) Poll
24% Use different name for A
50% Use different name for B
27% Doesn't bother me
1,568 votes • Final results
4
2
10
For robotics, PaLM-E is a rapid learner of new planning tasks, requiring only a handful of samples to start generalizing well in a given domain. Here we plot PaLM-E sample complexity relative to baseline – the difference is solely transfer learning. (Subset of Table 2)
1
2
10
1,325
Can NeRF help reinforcement learning? See Danny’s (@DannyDriess) thread on “NeRF-RL”! A few more comments in this thread too.
New preprint on Reinforcement Learning with Neural Radiance Fields Paper: arxiv.org/abs/2206.01634 Video: dannydriess.github.io/nerf-r… Amazing collaboration between @DannyDriess, @IngmarSchubert, @peteflorence, @YunzhuLiYZ, @Marc__Toussaint (1/6)
1
1
10
Rough math: $27k per ALOHA2 x 9 ALOHA2s = $243k… Equivalently, only would have needed to hold onto about $13k of NVIDIA stock from 5 yrs ago 😉 (+1802%)
1
10
257
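The rough math in the ALOHA 2 post above can be checked with a few lines of Python. This is only a sketch of the arithmetic using the figures quoted in the post ($27k per unit, 9 units, +1802% return); the variable names are made up for illustration.

```python
# Figures quoted in the post; variable names are illustrative only.
cost_per_aloha2 = 27_000      # USD per ALOHA 2 setup
num_units = 9
stock_return_pct = 1802       # "+1802%" over 5 years

fleet_cost = cost_per_aloha2 * num_units

# A +1802% return multiplies the initial stake by 1 + 18.02 = 19.02.
growth_multiplier = 1 + stock_return_pct / 100

# Initial stake that would have grown to cover the fleet cost.
required_stake = fleet_cost / growth_multiplier

print(f"Fleet cost: ${fleet_cost:,}")
print(f"Initial stake needed: ${required_stake:,.0f}")
```

The required stake comes out just under $13k, consistent with the "about $13k" in the post.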
Here is the link for the blog post:
Today we share PaLM-E, a generalist, embodied language model for robotics. The largest instantiation, 562 billion parameters, is also a state-of-the-art visual-language model, has PaLM’s language skills, and can be successfully applied across robot types →goo.gle/3JsszmK
2
10
1,076
Has anyone figured out what the optimal first Wordle word guess is? There’s no information for the first word. Should be the same optimal first guess every time.
10
8
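One common way to make the Wordle question above concrete is to score each candidate first guess by the expected information (entropy) of the feedback pattern it produces, then pick the highest-scoring word. The sketch below assumes a tiny made-up word list (`WORDS`) and uses one-step entropy as the criterion; that is a heuristic and not a proof of optimality for the full game. Because nothing is known before the first guess, the score depends only on the word list, so the same word wins every time.

```python
import math
from collections import Counter
from typing import List


def feedback(guess: str, answer: str) -> str:
    """Wordle-style feedback: 'g' = green, 'y' = yellow, 'b' = gray."""
    result = ["b"] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1
    # Second pass: mark yellows, respecting duplicate-letter counts.
    for i, g in enumerate(guess):
        if result[i] != "g" and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)


def expected_information(guess: str, answers: List[str]) -> float:
    """Entropy (in bits) of the feedback pattern, with answers equally likely."""
    counts = Counter(feedback(guess, a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_first_guess(guesses: List[str], answers: List[str]) -> str:
    """Guess with the highest one-step expected information (a heuristic)."""
    return max(guesses, key=lambda g: expected_information(g, answers))


# Toy word list, for illustration only (not the real Wordle dictionary).
WORDS = ["crane", "slate", "trace", "speed", "abide"]
print(best_first_guess(WORDS, WORDS))
```

Running this against a full answer list would take longer but uses exactly the same functions.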
For this multi-image reasoning, since PaLM-E flexibly supports multimodal sentences, it can answer questions about specific relationships between images. While the previous example was a “what matches?” question, this one is a “what’s different?” question.
1
1
9
1,395
But certainly "learning" / "meta-learning" the form of the interaction itself seems possible.
1
9
And I want to close with a Haiku. Prompt in gray by @brian_ichter, and the completion written by PaLM-E-562B:
2
9
1,235
Replying to @MikkoMononen
Cool. A couple references to broaden your rabbit hole :) you might find interesting: 1: groups.csail.mit.edu/robotic… (says it’s about UAVs but I think you’ll see could be applied to any motion planning really.) 2. arxiv.org/pdf/2101.11565.pdf (also talk on YouTube: piped.video/wciDaoNSwwk)
1
8
Interesting to look back at that interview now – finishing out the results of PaLM-E has definitely shifted my perspective! (btw, thanks @gradientpub + @andrey_kurenkov for having me on!)
1
8
708
This morning's Hard Fork podcast is a nice intro to RT-2. @hausman_k thread for more links nitter.app/hausman_k/status/16849… Also see @haqhuy thread on "Scaling up and Distilling Down" nitter.app/haqhuy/status/16849671… And background on Moravec's paradox by @chelseabfinn: piped.video/watch?v=raHM3k-u…
How can we put robotics on the same scaling trend as large language models while not compromising on rich low-level manipulation and control?
1
1
8
2,615
Model size is certainly not everything, but I think the comparisons are notable. RoboCat/Gato are pretty large, so are MVP/VC-1 (both ViT-L). RT-2 is a considerable step up. Everything else is pretty small. "Largest over time" view, log scale:
1
8
873
V curious to see this hand in action! @ericjang11 great show btw piped.video/X7HmltUWXgs?si=Z6pD…
NEO just picked its first cup, excited to finally share some hands-on details🧵
1
8
3,066
Great task to show off the new Tesla hardware, very nice, looking forward to more! 👏@julianibarz @aelluswamy
Optimus folds a shirt
8
2,115