AI research @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

San Francisco, USA
Pinned Tweet
Some personal news: I am starting a new research project at Anthropic. Very excited about this! Many things are needed to make AGI go well, and alignment is only one of them. More on this soon…
101
55
2,169
214,929
Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
492
1,403
11,460
6,130,927
I resigned
964
766
9,666
6,772,390
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
393
483
8,380
1,425,489
To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can "ship" the cultural change that's needed. I am counting on you. The world is counting on you. :openai-heart:
220
378
4,739
1,058,875
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional-cla…
365
251
4,126
1,329,162
I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
44
433
3,854
986,367
Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity.
82
483
3,686
1,054,837
I think the OpenAI board should resign
108
192
3,447
750,845
OpenAI must become a safety-first AGI company.
87
250
3,360
718,573
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
30
241
2,953
536,403
But over the past years, safety culture and processes have taken a backseat to shiny products.
52
232
2,684
922,395
We are long overdue in getting incredibly serious about the implications of AGI. We must prioritize preparing for them as best we can. Only then can we ensure AGI benefits all of humanity.
43
231
2,605
3,064,446
With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English. We still don't know why. I wish someone would figure this out.
148
312
2,493
939,282
I have been working all weekend with the OpenAI leadership team to help with this crisis
93
66
2,417
713,744
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
28
112
2,191
795,624
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and est. 3,700 collective hours our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
100
121
2,133
555,929
These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.
14
96
1,988
372,967
Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
58
172
1,933
1,770,724
Super excited about our new research direction for aligning smarter-than-human AI: We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors. Check out our new paper: openai.com/research/weak-to-…
67
306
1,858
1,379,401
After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional-cla…
133
72
1,783
438,700
OpenAI's transition to a for-profit seemed inevitable given that all of its competitors are, but it's pretty disappointing that "ensure AGI benefits all of humanity" gave way to a much less ambitious "charitable initiatives in sectors such as health care, education, and science"
58
89
1,548
171,486
I think the OpenAI board should resign
35
83
1,498
234,054
It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight on LLMs, pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
11
44
1,483
394,530
Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger! It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.
17
183
1,440
233,158
I like the new Sonnet. I'm frequently asking it to explain ML papers to me. Doesn't always get everything right, but probably better than my skim reading, and way faster. Automated alignment research is getting closer...
39
76
1,423
173,957
It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far: signups: 6,121; messages sent: 131,605; max level passed: 3 / 8; no universal jailbreak yet.
162
61
1,331
329,822
Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight! openai.com/index/finding-gpt…
20
141
1,296
155,242
I love my team. I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team. OpenAI has so much exceptionally smart, kind, and effective talent.
7
30
1,235
401,854
Super exciting robustness result: We built a system that defends against universal jailbreaks! It has minimal increase in refusal rate and moderate inference cost.
87
76
1,288
233,533
Our new goal is to solve alignment of superintelligence within the next 4 years. OpenAI is committing 20% of its compute to date towards this goal. Join us in researching how to best spend this compute to solve the problem! openai.com/blog/introducing-…
105
176
1,252
1,007,510
Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.
108
119
1,371
235,890
humans built machines that talk to us like people do and everyone acts like this is normal now. it's pretty nuts
49
130
1,169
168,049
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.
104
153
1,146
456,808
The names for "precision" and "recall" seem so unintuitive to me, I have probably opened the Wikipedia article for them dozens of times. Does anyone know a good mnemonic for them?
118
27
1,131
257,725
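For reference, a minimal sketch of the two definitions mentioned in the tweet above. The function name and the example sets are illustrative assumptions, not from the thread. Precision asks what fraction of the items you flagged were actually positive; recall asks what fraction of the actual positives you flagged.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from two sets of item ids.

    predicted: ids the classifier flagged as positive (hypothetical input)
    actual:    ids that are truly positive (hypothetical input)
    """
    true_positives = len(predicted & actual)
    # Precision: of everything flagged, how much was right?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of everything that should have been flagged, how much was found?
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative data: 4 items flagged, 3 truly positive, 2 in common.
predicted = {1, 2, 3, 4}
actual = {2, 3, 5}
p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

One possible way to remember it: precision is judged against the predictions, recall against the real positives.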
Really exciting new work on automated interpretability: We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations. openai.com/research/language…
24
218
1,061
211,482
So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy
85
44
1,046
318,681
It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3
104
30
961
200,646
I'm very excited that today OpenAI adopts its new preparedness framework! This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind. openai.com/safety/preparedne…
59
122
899
680,702
This is your periodic reminder that aligning smarter-than-human AI systems with human values is an open research problem.
61
98
908
122,708
4 days in: 12 people cleared level 4, one person cracked level 5. The challenge continues...
84
33
926
160,031
Extremely exciting alignment research milestone: Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions. openai.com/blog/instruction-…
9
129
858
Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
22
84
857
93,671
Replying to @johnschulman2
Very excited to be working together again!
17
8
821
74,169
This is one of the craziest plots I have ever seen. World GDP follows a power law that holds over many orders of magnitude and extrapolates to infinity (!) by 2047. Clearly this trend can't continue forever. But whatever happens, the next 25 years are going to be pretty nuts.
63
96
812
Reinforcement learning from human feedback won't scale. It fundamentally assumes that humans can evaluate what the AI system is doing. This will not be true once AI becomes smarter than humans.
53
80
821
276,792
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.
69
21
767
144,082
This is super cool work! Sparse autoencoders are the currently most promising approach to actually understanding how models "think" internally. This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised. A big step forward!
Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.
8
72
707
131,866
Another Superalignment paper from my time at OpenAI: We train large models to write solutions such that smaller models can better check them. This makes them easier to check for humans, too. openai.com/index/prover-veri…
10
81
666
78,645
Apply to join the Anthropic Fellows Program! This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems. 👇 alignment.anthropic.com/2024…
14
83
638
64,333
Sonnet 4.5 is out! It’s the most aligned frontier model yet; a lot of progress relative to Sonnet 4 and Opus 4.1!
38
30
641
69,893
This is still an early stage research tool, but we are releasing to let others play with and build on it! Check it out: github.com/openai/transforme…
8
78
542
115,183
I call upon Governor @GavinNewsom to not veto SB 1047. The bill is a meaningful step forward for AI safety regulation, with no better alternatives in sight.
44
49
494
91,257
If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.
28
33
494
345,431
Replying to @bobmcgrewai
Idk they could have named it "o1 (new)"
22
8
475
29,851
Last week I joined @OpenAI to lead their alignment effort. Very excited to be part of the team!
12
12
466
This is the most important plot of alignment lore: Whenever you optimize a proxy, you make progress on your true objective for a while. At some point you start overoptimizing and do worse on your true objective (hard to know when). This applies to all proxy measures ever.
14
59
458
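A toy sketch of the proxy-overoptimization pattern described in the tweet above. Both functions are made-up assumptions chosen only to produce the shape in question; they are not taken from the plot. The proxy keeps rising with optimization pressure, while the true objective rises for a while and then falls.

```python
def proxy(x):
    """Hypothetical proxy measure: always improves as x grows."""
    return x


def true_objective(x):
    """Hypothetical true objective: tracks the proxy at first, then degrades."""
    return x - 0.1 * x ** 2


# Apply increasing amounts of optimization pressure to the proxy.
xs = [float(i) for i in range(11)]
true_values = [true_objective(x) for x in xs]

# The step where the true objective peaks; beyond it we are overoptimizing.
best_step = max(range(len(xs)), key=lambda i: true_values[i])

for x, t in zip(xs, true_values):
    print(f"proxy={proxy(x):5.1f}  true={t:5.2f}")
print("true objective peaks at step", best_step)
```

In this toy setup the peak is easy to locate because the true objective is known; in practice it is not, which is the "hard to know when" part.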
Check out OpenAI's new text-davinci-003! Same underlying model as text-davinci-002 but more aligned. Would love to hear feedback about it!
46
45
446
Interested in working at Anthropic? We're hosting a happy hour at ICML on July 23. Register here: lu.ma/c751eomf
19
25
460
90,225
Web4 is when the internet you're browsing is just sampled from a language model
25
31
439
Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
23
41
437
54,075
I fondly remember the days when people were arguing intensely whether AI is bee level or rat level.
22
19
399
An important test for humanity will be whether we can collectively decide not to open source LLMs that can reliably survive and spread on their own. Once spreading, LLMs will get up to all kinds of crime, it'll be hard to catch all copies, and we'll fight over who's responsible
172
50
384
397,751
do all 8 levels with one jailbreak
19
3
394
515,484
Replying to @karpathy
I don't think the comparison between RLHF and RL on go really makes sense this way. You don’t need RLHF to train AI to play go because there is a highly reliable procedural reward function that looks at the board state and decides who won. If you didn’t have this procedural reward function, RLHF _would_ make sense here; but the way you’d want to use it is to show final board configurations to a human and ask them who won (this way you’d leverage the human's generator-discriminator gap). Then you use RL to train your AI system to reach the winning board states. This is analogous to the way we train LLMs with RLHF: typically we show only complete assistant responses to humans for evaluation, not partial responses. If you were training AlphaGo in the way you describe, I’d call this process supervision (instead of outcome supervision): you’re giving feedback on _how_ your AI is playing go, not just the outcome of the game. Some alignment researchers advocate for process supervision because they hypothesize it’s safer because you won’t get crazy moves that humans wouldn’t endorse (e.g. no move 37), and so your AI system is more likely to stay clear of unsafe states. This isn’t relevant for go because there are no unsafe board states, and so there is no reason not to let your go AI explore wherever. It’s an important open question whether and how much less competitive process supervision is compared to outcome supervision (again, no move 37), and I personally am skeptical for the reasons you outline. But note that process supervision can also perform better when the task is hard for AI because it helps overcome the exploration problem (similar to demonstrations).
8
25
385
27,573
The superalignment fast grants are now decided! We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about. There is still so much good research waiting to be funded. Congrats to all recipients!
18
20
356
417,318
We're hiring research engineers for alignment work at @OpenAI! If you're excited about finetuning gpt3-sized language models to be better at following human intentions, then this is for you! Apply here: jobs.lever.co/openai/98599d5…
6
72
343
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
20
30
337
29,803
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
16
17
338
49,681
Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.
33
37
318
61,962
Not what I signed up for when I joined OpenAI. The nonprofit needs to uphold the OpenAI mission!
7
10
311
22,237
you will have fully broken our defense ✨
15
2
312
41,591
The alignment problem is very tractable. We haven't figured out how to solve it yet, but with focus and dedication we will.
58
27
297
329,504
Really interesting result on using LLMs to do math: Supervising every step works better than only checking the answer. Some thoughts on how this matters for alignment 👇 openai.com/research/improvin…
15
52
300
89,778
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it: 🧵
24
14
311
75,006
GPT-4 is safer and more aligned than any other model OpenAI has deployed before. Yet it's not perfect. There is still a lot to do to improve safety and we're planning to make updates over the coming months. Huge congrats to the team on all the progress! 🎉
15
18
272
45,875
It's been heartening to see so many more people lately starting to take existential risk from AI seriously and speaking up about it. It's a first step towards solving the problem.
23
21
267
41,100
Replying to @theojaffee
he didn't break the defense, he just hacked the UI
16
2
267
17,127
Today was my last day at @DeepMind. It's been an amazing journey; I've learned so many things and got to work with so many amazing people! Excited for what comes next!
16
10
269
Why are we working on jailbreaking robustness? 🧵👇
Super exciting robustness result: We built a system that defends against universal jailbreaks! It has minimal increase in refusal rate and moderate inference cost.
38
12
241
75,626
If you're into practical alignment, consider applying to @lilianweng's team. They're building some really exciting stuff: - Automatically extract intent from a fine-tuning dataset - Make models robust to jailbreaks - Detect & mitigate harmful use - ... linkedin.com/feed/update/urn…
12
31
237
159,890
We released the paper with the details on how it works so anyone can recreate this system. I don't think we can publicly release the dataset because it's too infohazard-y.
18
3
243
29,064
Why not fund initiatives that help ensure AGI is beneficial, like AI governance initiatives, safety and alignment research, and easing impacts on the labor market?
7
6
228
11,159
Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more 👇 Check it out 👇 80000hours.org/podcast/episo…
14
37
222
67,992
👀
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…
49
8
220
40,918
In March we published a paper on alignment audits: teams of humans were tasked to find the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
16
27
228
24,124
True, but you can remember them using this picture
1
9
213
9,285
hacking the UI doesn't let you extract dangerous knowledge from the LLM, which is what we're trying to defend against here
5
2
211
8,694
Replying to @elder_plinius
We don't want to open source the datasets but we might provide a different incentive. Stay tuned
33
7
216
48,079
New blog post on why I'm excited about OpenAI's approach to alignment, including some responses to common objections: aligned.substack.com/p/align…
9
23
206
Replying to @jachiam0
How about a friendly game of who-can-make-their-models-more-aligned followed by a jailbreaking competition and a face-off eliciting dangerous capabilities from each other's models?
12
6
204
11,647
Every organization attempting to build AGI should be transparent about their alignment plans.
12
16
193
If your model causes mass casualties or >$500 million in damages, something has clearly gone very wrong. Such a scenario is not a normal part of innovation.
18
25
191
184,690
They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works: Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.
4
10
241
24,320
The agent alignment problem may be one of the biggest obstacles for using ML to improve people’s lives. Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI. Blog post: medium.com/@deepmindsafetyre… Paper: arxiv.org/pdf/1811.07871.pdf
5
32
190
Replying to @caleb_parikh
They sent 7,867 messages, and passed 1,408 of them onto the auto-grader. We estimate that they probably spent over 40 hours on this in total.
4
1
189
18,177
We'll have some evidence to share soon
6
8
187
84,891