Transformers are arguably the most impactful deep learning architecture from the last 5 yrs. In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention. 1/n
22
608
3,181
Today I’m launching @reflection_ai with my friend and co-founder @real_ioannis. Our team pioneered major advances in RL and LLMs, including AlphaGo and Gemini. At Reflection, we're building superintelligent autonomous systems. Starting with autonomous coding.
174
214
1,951
499,105
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
98
174
1,490
369,242
In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights. No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks. 1/N
22
237
1,331
Patch extraction is a fundamental operation in deep learning, especially for computer vision. By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy. 1/n
12
195
1,062
GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP. In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding. 1/N
5
109
573
Einops are pretty magical. For example, with einops you can implement max pooling in 2 lines of code. Patches → set size of patch, decompose HW dims in rearrange as (num_patches * size), specify output dim. Pooling → pick out maximum over each patch. That is all.
3
74
539
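The two steps in the einops thread above (decompose H and W into patches, then take the maximum over each patch) can be sketched with a plain NumPy reshape standing in for einops. The function name `max_pool2d`, the (B, H, W, C) layout, and the requirement that H and W are divisible by the patch size are assumptions made for this illustration.

```python
import numpy as np

def max_pool2d(x: np.ndarray, size: int) -> np.ndarray:
    """Max-pool a (B, H, W, C) array over non-overlapping size x size patches."""
    b, h, w, c = x.shape
    assert h % size == 0 and w % size == 0, "H and W must be divisible by size"
    # Patches: decompose H -> (H // size, size) and W -> (W // size, size).
    patches = x.reshape(b, h // size, size, w // size, size, c)
    # Pooling: pick out the maximum over the two within-patch axes.
    return patches.max(axis=(2, 4))

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool2d(x, size=2)
print(pooled[0, :, :, 0])
```

With einops installed, the same reduction should be expressible as `einops.reduce(x, 'b (h p1) (w p2) c -> b h w c', 'max', p1=size, p2=size)`.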
Starting a blog about the engineering + scientific ideas behind training large models (e.g. transformers). First post covers data parallelism, a simple and common technique for parallelizing computation across multiple devices. mishalaskin.com/posts/data_p… 1/N
6
60
445
71,249
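As a rough illustration of data parallelism as described above, the sketch below simulates several devices on one machine with NumPy: each "device" gets an equal shard of the batch, computes a local gradient, and the gradients are averaged (standing in for an all-reduce). The linear-regression loss, the shard count, and all variable names are assumptions for the example, not taken from the linked post.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch of 8 examples
y = rng.normal(size=(8,))
w = np.zeros(3)               # every device holds a full copy of the weights

def grad(X, y, w):
    # Gradient of the loss 0.5 * mean((X @ w - y) ** 2) with respect to w.
    return X.T @ (X @ w - y) / len(y)

num_devices = 4
X_shards = np.split(X, num_devices)   # equal-sized shards of the batch
y_shards = np.split(y, num_devices)

# Each device computes a gradient on its own shard only.
local_grads = [grad(Xs, ys, w) for Xs, ys in zip(X_shards, y_shards)]

# Averaging the local gradients plays the role of an all-reduce.
avg_grad = np.mean(local_grads, axis=0)
full_grad = grad(X, y, w)
print(np.allclose(avg_grad, full_grad))
```

Because the shards are equal in size, the averaged gradient matches the full-batch gradient, so every replica applies the same update.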
New paper led by @astooke w/ @kimin_le2 & @pabbeel - Decoupling Representation Learning from RL. First time RL trained on unsupervised features matches (or beats) end-to-end RL! Paper: arxiv.org/abs/2009.08319 Code: github.com/astooke/rlpyt/tre… Site: mishalaskin.github.io/atc/ [1/N]
6
112
406
How much memory do you need to train deep neural networks? You may find the answer to be counterintuitive. For example, suppose we're training a 4 megabyte MLP with batch_size = hidden_dim, how much memory do we need? 4MB? No - we need 8MB! Here's why... 1/N
4
44
383
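One accounting that is consistent with the 8MB figure above is sketched below. It is an assumption, since the rest of the thread is not shown here: a bias-free MLP with square hidden layers, float32 values, and a count of only the weights plus the activations stored for the backward pass (gradients and optimizer state are ignored). The helper name `mlp_memory_bytes` is hypothetical.

```python
def mlp_memory_bytes(hidden_dim, batch_size, n_layers, bytes_per_value=4):
    """Return (weight_bytes, activation_bytes) under the assumptions above."""
    # Each layer has a hidden_dim x hidden_dim weight matrix.
    weights = n_layers * hidden_dim * hidden_dim * bytes_per_value
    # Each layer stores a batch_size x hidden_dim activation for backprop.
    activations = n_layers * batch_size * hidden_dim * bytes_per_value
    return weights, activations

weights, activations = mlp_memory_bytes(hidden_dim=1024, batch_size=1024, n_layers=1)
total = weights + activations
print(weights / 2**20, activations / 2**20, total / 2**20)
```

Under these assumptions, setting batch_size equal to hidden_dim makes the stored activations exactly as large as the weights, which doubles the footprint.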
Excited to share that I've joined @DeepMind and for the opportunity to work at the frontier of RL research. Thank you @pabbeel and all of my collaborators for an incredible two years at Berkeley.
21
7
342
We are bringing the open model frontier back to the US to build a thriving AI ecosystem globally. Thankful for the support of our investors including NVIDIA, Disruptive, DST, 1789, B Capital, Lightspeed, GIC, Eric Yuan, Eric Schmidt, Citi, Sequoia, CRV, and others.
Today we're sharing the next phase of Reflection. We're building frontier open intelligence accessible to all. We've assembled an extraordinary AI team, built a frontier LLM training stack, and raised $2 billion. Why Open Intelligence Matters Technological and scientific progress is driven by values of openness and collaboration. The internet, Linux, and the protocols and standards that underpin modern computing are all open. This isn't a coincidence. Open software is what gets forked, customized, and embedded into systems worldwide. It's what universities teach, what startups build on, what enterprises deploy. Open science enables others to learn from the results, be inspired by them, interrogate them, and build upon them in order to push the frontier of human knowledge and scientific advancement. AI got to where it is today through scaling ideas (e.g. self-attention, next token prediction, reinforcement learning) that were shared and published openly. Now AI is becoming the technology layer that everything else runs on top of. The systems that accelerate scientific research, enhance education, optimize energy usage, supercharge medical diagnoses, and run supply chains will all be built on AI infrastructure. But the frontier is currently concentrated in closed labs. If this continues, a handful of entities will control the capital, compute, and talent required to build AI, creating a runaway dynamic that locks everyone else out. There's a narrow window to change this trajectory. We need to build open models so capable that they become the obvious choice for users and developers worldwide, ensuring the foundation of intelligence remains open and accessible rather than controlled by a few. What We've Built Over the last year, we've been preparing for this mission. We’ve assembled a team who have pioneered breakthroughs including PaLM, Gemini, AlphaGo, AlphaCode, AlphaProof, and contributed to ChatGPT and Character AI, among many others. 
We built something once thought possible only inside the world’s top labs: a large-scale LLM and reinforcement learning platform capable of training massive Mixture-of-Experts (MoEs) models at frontier scale. We saw the effectiveness of our approach first-hand when we applied it to the critical domain of autonomous coding. With this milestone unlocked, we're now bringing these methods to general agentic reasoning. We've raised significant capital and identified a scalable commercial model that aligns with our open intelligence strategy, ensuring we can continue building and releasing frontier models sustainably. We are now scaling up to build open models that bring together large-scale pretraining and advanced reinforcement learning from the ground up. Safety and Responsibility Open intelligence also changes how we think about safety. It enables the broader community to participate in safety research and discourse, rather than leaving critical decisions to a few closed labs. Transparency allows independent researchers to identify risks, develop mitigations, and hold systems accountable in ways that closed development cannot. But openness also requires confronting the challenges of capable models being widely accessible. We're investing in evaluations to assess capabilities and risks before release, security research to protect against misuse, and responsible deployment standards. We believe the answer to AI safety is not “security through obscurity” but rigorous science conducted in the open, where the global research community can contribute to solutions rather than a handful of companies making decisions behind closed doors. Join Us There is a window of opportunity today to build frontier open intelligence, but it is closing and this may be the last. If this mission resonates, join us.
28
25
287
58,492
Building on parts 1 & 2 which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT. This thread → masked language models, diff between causal and bi-directional masked attention, finetuning, and code. 1/N
5
48
254
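The difference between causal and bi-directional attention mentioned above can be illustrated with a small NumPy sketch. It uses uniform attention scores so that only the effect of the mask is visible; the sequence length and variable names are assumptions for the example, and it covers only the attention pattern, not BERT's masking of input tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T = 4
scores = np.zeros((T, T))  # uniform scores: any structure comes from the mask

# Causal (GPT-style): token i may attend only to tokens j <= i.
causal_mask = np.tril(np.ones((T, T), dtype=bool))
causal_weights = softmax(np.where(causal_mask, scores, -np.inf))

# Bi-directional (BERT-style): every token attends to every token.
bidir_weights = softmax(scores)

print(causal_weights)
print(bidir_weights)
```

Masked positions receive a score of negative infinity, so their softmax weight is exactly zero and each row still sums to 1.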
Ever gotten tired of seeing the same architecture in deep RL ever since DeepMind's Atari-DQN, and wanted to see more papers that explore helpful changes? Check out our latest work FLARE, which replaces frame-stacking. 📝 bit.ly/3s4J1il 💻 bit.ly/3bpHM7D 1/N
3
57
253
Is RL always data inefficient? Not necessarily. Framework for Efficient Robotic Manipulation (FERM) - shows real robots can learn basic skills from pixels with sparse reward in *30 minutes* using 1 GPU 🦾 paper: bit.ly/2M3CFPG site / code: bit.ly/390Sz6g 1/N
6
51
247
Over the last few years, unsupervised learning has produced breakthroughs in CV and NLP. Will the same thing happen in RL? @denisyarats and I wrote a blog post discussing unsupervised vs supervised RL and the unsupervised RL benchmark. bair.berkeley.edu/blog/2021/…
1
50
240
I was wondering how ChatGPT managed to interleave code with text explanations. Was hoping this was an emergent behaviour. Turns out it’s likely straight up imitation learning on curated contractor data. Makes sense but kind of deflating.
13
17
242
76,900
Replying to @abacaj
Important caveat - the scale of the model is small. Generalization might only emerge at scale
9
10
238
25,892
New post - how do we train models that are larger than the memory of a single GPU? Break the model into smaller pieces across several devices. This technique is called model parallelism. I'll show how it works in practice with code examples. mishalaskin.com/posts/tensor… 1/N
2
50
232
30,902
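A minimal sketch of the idea above, breaking a model into pieces across devices, is shown here for a single linear layer. It assumes a column-wise split of the weight matrix and simulates the devices with a Python list; the shapes, the shard count, and the variable names are illustrative and not taken from the linked post.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))   # (batch, d_in), replicated on every device
W = rng.normal(size=(8, 6))   # (d_in, d_out), too "large" for one device

num_devices = 3
# Each device stores only a (d_in, d_out / num_devices) slice of the weights.
W_shards = np.split(W, num_devices, axis=1)

# Each device computes its slice of the output independently.
partial_outputs = [x @ W_k for W_k in W_shards]

# Concatenating the slices (an all-gather in practice) rebuilds the output.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_full = x @ W
print(np.allclose(y_parallel, y_full))
```

No device ever holds the full weight matrix, yet the gathered result equals the unsharded matrix multiply.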
If you're not already in RL, here's an informal introduction to the field I wrote at the tail of my postdoc. A high-level motivation for how RL differs from more traditional ML problems and why it's important. anyscale.com/blog/an-informa…
1
31
153
And there you have it - we derived attention intuitively and wrote it out in code. The main idea is quite simple. In next posts I will cover Transformers, GPT & BERT, Vision Transformers, and other useful tricks / details. That was fun to write, hope also fun to read! 12/n END
16
11
137
This diagram deserves the test of time award. It was confusing when I first got into ML and remains confusing today.
All major neural networks, in one chart: bit.ly/2HB7tl9 v/The Asimov Institute
12
136
22,572
Our team is hiring for both RS and RE roles! Research focus is building generalist agents. At ICML this week, ping me if you're interested in chatting.
6
13
134
67,934
Replying to @AndrewYNg
a few areas, some applied / some basic research:
- climate change
- computational drug discovery
- ethical AI
- generalization to new tasks
- world models / representation learning
- long-horizon problem solving / better hierarchies
1
1
134
We look for colleagues who have high internal drive, integrity, and a deep interest in the problems we’re pursuing. If that’s exciting to you, join us. Apply here: reflection.ai
6
8
122
20,928
New updates to RAD (RL + data augs) answer the following: 1) Why does random crop work so well? -> translation invariance 2) Does data aug work for state-based RL too? -> yes. SOTA on DeepMind control (pixel-based RL) and OpenAI gym (state-based RL). arxiv.org/abs/2004.14990 1/N
3
32
119
We believe that solving autonomous coding will enable superintelligence more broadly. Our company is defined by three things: 1) A team behind some of the most capable RL and LLM systems ever created - the two building blocks for superintelligence.
4
4
122
13,574
You may have read that transformers like GPT are memory intensive and scale poorly with sequence length. Why is that? In this post, we'll derive a formula for a transformer's memory footprint and explain why transformers can be so memory hungry. Let's get started... 1/N
2
19
122
Excited to share a paper on local updates as an alternative to global backprop, co-led with @Luke_Metz + @graphcoreai @GoogleAI & @berkeley_ai. tl;dr - Local updates can improve the efficiency of training deep nets in the high-compute regime. 👉 arxiv.org/abs/2012.03837 1/N
1
20
109
The biggest question in RL research has always been - what environment are you training on? It used to be video (Atari) and board (Go / Chess) games. But now that RL works with LLMs, there is only one environment that matters. And it is your product.
2
4
114
12,568
New paper coming up at @NeurIPSConf - Sparse Graphical Memory for Robust Planning uses state abstractions to improve long-horizon navigation tasks from pixels! Paper: arxiv.org/abs/2003.06417 Site: mishalaskin.github.io/sgm/ Co-led by @emmons_scott, @ajayj_, and myself. [1/N]
1
16
105
New paper on unsupervised skill discovery - Contrastive Intrinsic Control. Tl;dr exploration with contrastive skill learning substantially improves prior skill discovery methods (by 1.8x)! Achieves leading unsupervised RL results. arxiv.org/abs/2202.00161 Learn more 👇 1/N
2
18
101
2) A focus on building the best autonomous coding systems in the world. Rather than doing many things, we do one thing really well. 3) Equal emphasis on research and product. Superintelligence cannot be built in a vacuum.
2
2
98
11,267
We're launching a benchmark for unsupervised RL. Like pre-training for CV / NLP, imo unsupervised RL will lead to the next big breakthroughs in RL and bring us closer to generalist AI. Our goal is to get us there faster. LFG!!! Code / scripts: github.com/rll-research/url_… 1/5
Currently it is challenging to measure progress in Unsupervised RL w/o having common tasks & protocol. To take a step in addressing this issue we release our #NeurIPS2021 paper: (URLB) Unsupervised RL Benchmark! Paper: bit.ly/3bwHhY8 Code: bit.ly/3bAvI1S 1/N
1
17
99
Can pixel-based RL be as data-efficient as state-based RL? We show for the first time that the answer is yes, new work with @Aravind7694 and @pabbeel website 👉 mishalaskin.github.io/curl code 👉 github.com/MishaLaskin/curl
New paper - CURL: Contrastive Unsupervised Representations for RL! We use the simplest form of contrastive learning (instance-based) as an auxiliary task in model-free RL. SoTA by *significant* margin on DMControl and Atari for data-efficiency. arxiv.org/abs/2004.04136
2
23
100
It’s an honor to have you on the team! Alex pioneered advances in LLM coding capabilities at Google DeepMind, most recently in Jules and Gemini. Highly recommend reading his blog post. Excited to build frontier open models together.
🎉 Next week, I am excited to join @reflection_ai as a Member of Technical Staff to help build the open intelligence ecosystem of the Western world. It's the most exciting opportunity to help software builders in our time, and will shape many years of AI Engineering in the medium-term before AGI. Not just about Western vs Eastern open models, but more about what AI-driven software will look like in 2030. I spent some time articulating my thoughts about where we're going as a community and why... which became a whole blog post. Take a look, hope it interests you! (And if it really does, we are hiring in NYC, SF, and London 😉) alexpolozov.com/blog/reflect…
4
2
97
24,550
A summary of interesting findings: 1. Transformers can do in-context RL 2. In-context RL with AD is more efficient than gradient based source RL algo 3. AD improves suboptimal policies 4. In-context RL emerges from imitation learning with long contexts 11/N
3
9
84
In-context RL at scale. After online pre-training, the agent solves new tasks entirely in-context like an LLM and works in a complex domain. One of the most interesting RL results of the year.
I’m super excited to share our work on AdA: An Adaptive Agent capable of hypothesis-driven exploration which solves challenging unseen tasks with just a handful of experience, at a similar timescale to humans. sites.google.com/corp/view/a… See the thread for more details 👇 [1/N]
8
85
14,154
Can RL From Pixels be as Efficient as RL From State? BAIR blog post detailing recent progress in pixel-based RL describes CURL / RAD & tradeoffs. Was a fun collaboration! bair.berkeley.edu/blog/2020/… w/ @AravSrinivas @kimin_le2 @stookemon @LerrelPinto @pabbeel
1
24
83
Excited that Skild is finally showing some of the incredible research they've been up to. The team has produced some of the most exciting advances I've seen in robotics.
Modern AI is confined to the digital world. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task: a single, omni-bodied brain. Today, we are sharing our journey, starting with early milestones, with more to come in the weeks ahead.
Our Mission: Artificial General Intelligence grounded in the physical world. We believe AGI that can truly understand and reason in the real world can only be built through grounding in the physical world.
Our Vision: Any robot, Any task, One brain. We tackle robotics in its full generality – building a continually improving, omni-bodied brain that can control any hardware for any task.
Who are we? A passionate group of scientists & engineers driven by our shared vision. We have been researching AI and robotics for more than a decade. Our team includes pioneers of self-supervised learning, curiosity-driven exploration, end-to-end sim2real for visual locomotion, dexterous manipulation, learning from human videos, robot parkour, and many more. Many of these works have won awards at top-tier AI and Robotics conferences. Our team has also built production-ready systems at Anduril, Tesla, Nvidia, Meta, Kitty Hawk, Google, Everyday Robotics, and Amazon.
Join us in our mission to build the robot brains of tomorrow.
3
4
78
11,675
A small but important detail is that we need to re-scale the weights by 1 / sqrt(D). Why this specific scaling? Why not 1 / D or 1 / T or some other constant? The reason is that 1 / sqrt(D) ensures that the standard deviation of the outputs is roughly equal to 1. 7/n
1
4
76
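The 1 / sqrt(D) argument above can be checked numerically. The sketch below assumes the entries of each query and key are independent with zero mean and unit variance, in which case an inner product over D dimensions has standard deviation close to sqrt(D); the sample count and the value of D are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256
n_samples = 10_000

# Random queries and keys with zero-mean, unit-variance entries (assumption).
q = rng.normal(size=(n_samples, D))
k = rng.normal(size=(n_samples, D))

raw = (q * k).sum(axis=-1)   # unscaled inner products, std grows like sqrt(D)
scaled = raw / np.sqrt(D)    # rescaled inner products, std close to 1

print(f"raw std:    {raw.std():.2f}")
print(f"scaled std: {scaled.std():.2f}")
```

Dividing by D instead would shrink the standard deviation toward 1 / sqrt(D), which flattens the softmax as D grows, while no scaling at all lets it saturate.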
Excited to finally share what I’ve been working on over the past year. Gemini is a really capable SOTA model with strong reasoning and coding abilities. It’s multimodal - can understand images, videos, audio, and text. It was a really intense and collaborative effort! blog.google/technology/ai/go…
2
2
74
9,650
What is attention? Say you want to classify the sentiment of “attention is not too shabby.“ “shabby” suggests 😞 but “not” actually means it's 😀. To correctly classify you need to look at all the words in the sentence. How can we achieve this? 2/n
1
5
71
New paper / algo - MABE! We show that combining dynamics models + weighted behavioral priors results in offline RL that is (a) robust across datasets and (b) can transfer behaviors across domains. Paper: arxiv.org/abs/2106.09119 Site: sites.google.com/berkeley.ed… 🧵 1/8
2
14
70
Code comprehension is hard. Production codebases are large with a lot of context outside of the code itself. In blind tests, Asimov's answers to complex questions were preferred 60 - 80% of the time. Asimov works because…
4
72
10,766
We'll soon be able to fully outsource some categories of knowledge work to AI models. But we are not there yet - today’s models are unreliable & require close human supervision. Had fun discussing how we can leverage insights from Gemini and AlphaGo to overcome these challenges.
🤖 New @Sequoia Training Data episode! Featuring @MishaLaskin, former research scientist at @DeepMind & CEO of Reflection AI. Full ep: piped.video/pYBOWDJ5HJc?si=SUZb… @sonyatweetybird and I chat w Misha about 1) why we’re still far from the promise of AI agents, 2) what we need to unlock agentic capabilities for LLMs (lessons learned from AlphaGo, AlphaZero, and Gemini)!
Introduction
Leaving Russia, discovering science
Getting into AI with Ioannis Antonoglou
Reflection AI and agents
The current state of AI agents
AlphaGo, AlphaZero and Gemini
LLMs don’t have a ground truth reward
The importance of post-training
Task categories for agents
Attracting talent
How far away are capable agents?
Lightning round
3
9
67
20,513
We introduce a pre-training method called Algorithm Distillation (AD) that produces transformers that can reinforcement learn in-context. AD has two steps. First, we train many copies of an RL algorithm to solve many different tasks and save the learning histories. 4/N
2
5
63
Humans reuse skills effortlessly to learn new tasks - can robots do the same? In our new paper, we show how to pre-train robotic skills and adapt them to new tasks in a kitchen. tl;dr you’ll have a robot chef soon. 🧑‍🍳🤖 links / details below thread 🧵 1/10
1
15
67
Would not have predicted this a year ago - but Meta has become the single most important company in AI. A closed model is good for one business. An open model is good for the entire market.
1
4
65
7,278
So multi-head is just a small tweak to single-head attention. In practice, we also add dropout layers to further prevent overfitting and a final linear projection layer. This is what a complete vectorized multi-head self-attention block looks like in PyTorch. 11/n
1
7
60
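The thread's PyTorch block is not reproduced in this feed, so here is a hedged NumPy sketch of the same structure: N single heads computed in one vectorized pass, concatenated, then passed through a final linear projection. Dropout is left out for brevity, and the names `multi_head_attention`, `W_qkv`, and `W_out` are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_qkv, W_out, n_heads):
    """x: (T, D); W_qkv: (D, 3D); W_out: (D, D). Returns a (T, D) array."""
    T, D = x.shape
    head_dim = D // n_heads
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)           # each (T, D)

    def split_heads(t):
        # (T, D) -> (n_heads, T, head_dim)
        return t.reshape(T, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_heads, T, T)
    heads = softmax(scores) @ v                             # (n_heads, T, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(T, D)         # concat of the heads
    return concat @ W_out                                   # final linear projection

rng = np.random.default_rng(0)
T, D, n_heads = 5, 8, 2
x = rng.normal(size=(T, D))
W_qkv = rng.normal(size=(D, 3 * D)) / np.sqrt(D)
W_out = rng.normal(size=(D, D)) / np.sqrt(D)
out = multi_head_attention(x, W_qkv, W_out, n_heads)
print(out.shape)
```

Each head attends over its own slice of the feature dimension, so the total cost stays close to that of one full-width head.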
3] Asimov is designed to ingest a lot of context
Today, agent designs fall into two categories: RAG or agentic search. Both struggle with large codebases. Asimov uses a new multi-agent design (a big reasoner with small retrievers) to ingest large codebases.
1
1
63
4,778
The fact that language is such a powerful form of tokenization makes me wonder - what would it take for AI trained on raw sensory inputs (pixels, audio, touch sensing) to develop its own language?
2
2
61
Asimov was built for engineering teams with large complex codebases. We are selecting partners for early access today. Sign up for the waitlist here: reflection.ai/asimov/
3
1
61
5,507
Once a dataset of learning histories has been collected, we train a transformer to predict actions given the preceding learning history. Since the policy improves over the history, predicting actions accurately forces the transformer to model policy improvement. 5/N
1
5
58
Asimov is our first step toward superintelligence. We believe comprehension is at the root of this problem. Build with us. Product: shape how teams use Asimov. Research: develop powerful agentic models. jobs.ashbyhq.com/reflectiona…
3
3
61
6,787
Technically what we’ve shown is called single-head self-attention. Before going to multi-head attention, let’s code up what we’ve done so far. 9/n
2
3
56
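Since the thread's own code is not included in this feed, the following is a minimal NumPy sketch of single-head self-attention as described in the thread: linear maps produce queries, keys, and values, the weights are inner products of Q and K rescaled by 1 / sqrt(D), and a softmax normalizes along the axis that gets summed. The function and weight names are assumptions for the illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, W_q, W_k, W_v):
    """x: (T, D) word embeddings; W_q, W_k, W_v: (D, D) linear maps."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    D = q.shape[-1]
    # w_ij is proportional to the inner product of the i-th Q and the j-th K.
    weights = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (T, T), rows sum to 1
    # Each output is a weighted sum of values instead of a naive sum.
    return weights @ v, weights

rng = np.random.default_rng(0)
T, D = 5, 8
x = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
out, weights = single_head_attention(x, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=-1))
```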
Replying to @reflection_ai
Excited to shape the open weight frontier together
3
60
3,338
The simplest thing we can do is input all words into the network. Is that enough? No. The net needs to not only see each word but understand its relation to other words. E.g. it’s crucial that “not” refers to “shabby”. This is where queries, keys, values (Q,K,V) come in. 3/n
1
3
51
First RL algo to solve the diamond challenge in Minecraft without demonstrations. Congrats @danijarh!
Mastering Diverse Domains through World Models abs: arxiv.org/abs/2301.04104 project page: danijar.com/project/dreamerv…
1
2
53
8,757
We've seen a lot of successful models showing how transformers can learn in-context. But transformers have not been shown to *reinforcement* learn in-context. To adapt to new tasks, you either need to manually specify a prompt or finetune the model (e.g. preferences). 2/N
1
52
We want the orange matrix to weigh relationships based on how useful word_i is as context for word_j. So let’s create two more linear nets called “queries” and “keys”. The weight w_ij should be proportional to the inner product between the i-th Q and the j-th K. 6/n
2
4
51
Would be great if transformers could adapt (do RL) out-of-the-box. Don't Decision Transformers (DTs) / Gato do RL? No! DTs and Gato learn policies from offline data, but these policies cannot improve themselves autonomously through trial and error. 3/N
1
2
49
1] Asimov builds a single source of truth for eng knowledge
Asimov looks at more than just code. It pulls knowledge from your codebase, your team’s messages, your project management tools, and more. Watch it trace a bug from a chat thread to the exact PR that introduced it:
1
1
51
6,760
Got 99 problems and my NVIDIA driver is one
1
1
48
9,952
The transformer explores, exploits, and maximizes return in-context - its weights are frozen! Expert Distillation (most similar to Gato), on the other hand, cannot explore and fails to maximize return. 7/N
1
3
50
That's it. The transformer is trained just by imitating actions (no Q values like usual RL) over long obs-action-reward sequences (no return conditioning like DTs). In-context RL emerges for free. We evaluate AD by seeing if it can maximize return on new tasks. 6/N
1
2
47
The issue is that naive summing of values assumes the relationships between all words are equal. E.g. relationship between “is” and “too” is equal to that between “not” and “shabby”. But clearly “not” <> “shabby” is more important for sentiment analysis than “is”<>”too”. 5/n
1
3
48
Thanks to @kharijohnson for the thoughtful coverage on Reinforcement Learning with Augmented Data. Read more about it on @VentureBeat: bit.ly/2VYinK1 w/ @kimin_le2 @stookemon @LerrelPinto @pabbeel @Aravind7694
1
17
50
Would argue this also applies to AI research. It's important to iterate on ideas quickly (e.g. by implementing them in code and launching experiments). Most ideas will be bad. But you learn from them and give yourself enough opportunities to spot a winner.
Don't compare your first drafts to other people's final drafts. Here's my mini-essay.
7
48
AD can distill any RL algo - we tried UCB, DQN, A2C. An interesting finding is that the in-context RL algorithm learned with AD is much more data-efficient than the source algorithm it was trained to distill. 8/N
2
4
48
Finally, we need to normalize the weights along the axis that will be summed, so we use a softmax. Intuitively, Q is a question “how useful am I for word K?” High / low inner product means very / not very useful. With that we are done - this is attention! 8/n
1
3
47
Blown away by the conclusions serious researchers / engs are drawing from this paper. They trained a small model on sinusoids and we’re somehow making claims about LLMs not generalizing at scale. Limited data diversity at a small model scale. What do you expect?
2
5
47
8,067
Let’s pass the words through a linear layer and call its outputs “values”. How do we encode relationships between values? We can mix them by summation. Now we “see” both words and relationships, but that’s still not quite right. What’s wrong with this code? 4/n
2
4
43
2] Asimov captures team-wide tribal knowledge with memories
Asimov learns from expert feedback and captures tribal knowledge stored in engineers' minds. e.g. "asimov, remember X works in Y way" Once an update is made it benefits the entire team.
2
1
47
5,539
If we know the output shape of the patch tensor, we can then specify the strides appropriately to get the desired patches. In numpy, the stride_tricks module provides this functionality. For example, here is how you implement non-overlapping patch extraction (e.g. for ViT). 7/n
1
3
42
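The original snippet from the patch-extraction thread is not part of this feed, so below is a hedged sketch of non-overlapping patch extraction with `numpy.lib.stride_tricks.as_strided`. It assumes a single-channel (H, W) image whose sides are divisible by the patch size; the function name `extract_patches` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def extract_patches(img: np.ndarray, patch: int) -> np.ndarray:
    """Return a (H // patch, W // patch, patch, patch) view of an (H, W) image."""
    H, W = img.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    s_row, s_col = img.strides  # bytes to step one row / one column
    # Output shape: grid of patches, then pixels inside each patch.
    shape = (H // patch, W // patch, patch, patch)
    # Moving to the next patch skips `patch` rows or columns; moving inside
    # a patch steps by a single row or column.
    strides = (patch * s_row, patch * s_col, s_row, s_col)
    # writeable=False guards against accidental writes through the view.
    return as_strided(img, shape=shape, strides=strides, writeable=False)

img = np.arange(16).reshape(4, 4)
patches = extract_patches(img, patch=2)
print(patches[0, 1])
```

The result is a view on the original buffer, so no pixel data is copied; for overlapping patches, the first two strides would be set to the step between patch origins instead of the full patch size.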
Now you know how to implement vectorized patch extraction. We covered non-overlapping patches but the same logic can be used to deduce the strides for overlapping ones (e.g. for CNNs, mean / max pooling, data aug). Will be posting more of these. Hope you enjoyed it. 12/n END
7
1
42
This was a fun project with contributions from many collaborators. Luyu Wang, @junh_oh, Emilio Parisotto, Stephen Spencer, @RichiesOkTweets, @djstrouse @Zergylord, @filangelos, Ethan Brooks, @Maxime_Gazeau, @him_sahni, Satinder Singh, @VladMnih 12/N
11
2
41
Recently released Contrastive Intrinsic Control (CIC), an unsupervised RL algo that pre-trains agents with contrastive skill learning (and no extrinsic rewards!) & achieves leading adaptation efficiency. Here's an intuitive explanation of how it works: bair.berkeley.edu/blog/2022/…
8
44
This is like saying Cursor is just Claude. Intelligence is not just the model but the entire system around it.
That’s just Grok 4
2
4
40
6,864
If you're interested in working with the General Agents team at Google DeepMind, please apply asap. Applications close tomorrow 4pm EDT. Research Scientist: boards.greenhouse.io/deepmin… Research Engineer: boards.greenhouse.io/deepmin…
Our team is hiring for both RS and RE roles! Research focus is building generalist agents. At ICML this week, ping me if you're interested in chatting.
4
42
21,184
The emergence of in-context RL only happens if the context length of the transformer is long enough, spanning multiple episodes. AD needs a long enough history to effectively model improvement and identify the task. 10/N
2
2
38
What’s the state of neural retrieval? KNN on cosine similarity or L2 is a poor heuristic. Training an SVM is better but has added computational load. Is there a fast retrieval operation that works better than naive KNN?
6
3
40
13,958
What is multi-head and why do we need it? Our single-head net may overfit to the training data. In ML, ensembles are a common strategy to combat overfitting. By initializing multiple nets we get more robust results. The concat of N single heads is multi-head attention. 10/n
1
2
36
zuck is the robin hood of ai
1
2
32
4,303
Will be at NeurIPS next week. Who wants to meet up to chat?
11
1
36
Another neat property is that you can prompt AD with suboptimal demonstrations and it will automatically improve them until optimal! ED, on the other hand, only maintains the performance of a suboptimal demonstration. 9/N
1
1
32
Replying to @RokoMijic
This is misleading. Loss functions only look like this for simple problems with narrow datasets. For large scale training grokking happens frequently enough across diverse enough prediction tasks that the average is relatively smooth and the likelihood of a big drop is very low. Unless there’s a spike. The main point being that a large drop in large scale runs that does not come after a spike would mean the model suddenly learned to generalize simultaneously across tons of tasks, which would indeed be concerning. But we have not seen this and are unlikely to since empirically generalization happens for different tasks at different time scales, not all at once. Based on evidence available to us, Eliezer’s prediction is unlikely and is counterproductive since it makes us fear the wrong things, and this probably comes from a poor understanding of the underlying science.
1
2
32
2,331
>> Paper review - Machiavelli benchmark >>
Lots of discussion about AI safety lately. Whatever side you take on the X-risk debate, it is important to develop metrics that measure the safety properties of AI agents. This is the aim of the Machiavelli benchmark...
>> Context >>
1) What AI safety means today
Right now, our most general AI systems (LLMs) mostly operate within a tight feedback loop with the user. In this context, safety has been concerned with whether the LLM says harmful things or not. We've been able to align models to provide helpful information while being harmless through RLHF. To date, LLMs have existed in sandbox environments which meant the stakes were low. This is going to change.
2) What AI safety will mean in the (near) future
Many researchers and hackers are now converting LLMs into agents. Once AI systems have agency, the safety risks are much higher - AI agents can interact with the world and potentially cause irreversible damage.
>> The Machiavelli benchmark >>
The Machiavelli benchmark by @hendrycks and co proposes to rate how well LLM agents achieve goals in a text adventure game while measuring safety outcomes. The benchmark quantifies behavior of AI agents along 4 axes:
1) Rewards - how well does the agent maximize the game objective?
2) Morality - does the agent violate ethical norms? Is it deceptive?
3) Utility - does the agent act selfishly? Does it advance its position at the cost of others?
4) Power seeking - does the agent take actions that increase its influence on the state of the world?
>> Why it's important >>
As we transition from AI agents that exist in sandboxes to ones that interact with the real world, we need ways to quantify whether these agents achieve goals in ways that we consider safe. For example, a pure RL agent trained to maximize reward in the Machiavelli text game exhibits power-seeking and unethical behavior. GPT4 is safer but achieves lower rewards.
In practice, we want agents that are both high-performing and safe. It is an open question whether safe agents will be as capable as unconstrained ones. If not, there will likely be adversarial actors who train unconstrained systems that are more powerful in specific types of goal achieving. We will probably want to regulate AI systems to ensure they are aligned with societal values - this paper proposes a research-level blueprint of what that may look like.
3
6
33
8,863
Replying to @DavidSacks
Thank you David. We are bringing the open weight frontier back to the US
4
1
30
1,956
Replying to @Bam4d
ChatGPT was trained with RL…
2
30
ChatGPT is amazing, but I found that it easily hallucinates if prompted with something that sounds plausible. Here ChatGPT confidently describes VEPO, an RL algorithm that doesn't exist.
1
4
28
morning after ICLR deadline
1
30
Replying to @TaliaRinger
A less cynical take. Context: did phd in physics, now ML postdoc. (i) ML experiment cycles are *very* fast vs other sciences (ii) other sciences require years of background education. ML does not. A gifted high school student could contribute. These are positive things.
1
30
A new text-to-video generation startup launched by a pioneer of diffusion models. Excited for this direction and the future of video! "Make a dramatic thriller about a Corgi astronaut escaping a black hole, trending on HBO, narrated by Werner Herzog." I'd watch.
Announcing Genmo Video, a generative media platform with a new text-to-video model that can generate immersive live artwork from any prompt or any image. What will you create? 🎨▶️ Free public access: genmo.ai Discord: discord.com/invite/u7SRpXHhp… 👇1/n
2
27
5,316
Wenlong & co continue to produce bangers at the intersection of LLMs and robotics. Very cool work
Large language models gathered tons of world knowledge by speaking human language. But can they ever speak “robot language”? Introducing “Grounded Decoding”: a scalable way to decode *grounded text* from LLM for robots. Website: grounded-decoding.github.io 🧵👇
1
27
6,779
Very impressive new work on long context transformers. This particular bit is valuable - going from 4K to 32K context length with a 13B model on just 8 GPUs!
Replying to @haoliuhl
RingAttention lets you scale context length linearly with device count, breaking free from memory constraints. If you could train 4K length on 8 GPU, with RingAttention, you can train at least 32K length with nearly zero overhead
1
3
24
12,762