Jan Leike · May 8, 2026 · 5:48 PM UTC

Jan Leike

Pinned Tweet

Jan Leike

@janleike

May 8

Some personal news: I am starting a new research project at Anthropic. Very excited about this! Many things are needed to make AGI go well, and alignment is only one of them. More on this soon…

101

2,169

214,929

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.

492

1,403

11,460

6,130,927

Jan Leike · May 15, 2024 · 4:43 AM UTC

Jan Leike

@janleike

15 May 2024

I resigned

964

766

9,666

6,772,390

Jan Leike · May 28, 2024 · 4:50 PM UTC

Jan Leike

@janleike

28 May 2024

I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.

393

483

8,380

1,425,489

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can "ship" the cultural change that's needed. I am counting on you. The world is counting on you. :openai-heart:

220

378

4,739

1,058,875

Jan Leike · Feb 3, 2025 · 4:32 PM UTC

Jan Leike

@janleike

3 Feb 2025

We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional-cla…

365

251

4,126

1,329,162

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.

433

3,854

986,367

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity.

483

3,686

1,054,837

Jan Leike · Nov 20, 2023 · 1:54 PM UTC

Jan Leike

@janleike

20 Nov 2023

I think the OpenAI board should resign

108

192

3,447

750,845

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

OpenAI must become a safety-first AGI company.

250

3,360

718,573

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.

241

2,953

536,403

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

But over the past years, safety culture and processes have taken a backseat to shiny products.

232

2,684

922,395

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

We are long overdue in getting incredibly serious about the implications of AGI. We must prioritize preparing for them as best we can. Only then can we ensure AGI benefits all of humanity.

231

2,605

3,064,446

Jan Leike · Feb 13, 2023 · 6:56 PM UTC

Jan Leike

@janleike

13 Feb 2023

With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English. We still don't know why. I wish someone would figure this out.

148

312

2,493

939,282

Jan Leike · Nov 20, 2023 · 1:54 PM UTC

Jan Leike

@janleike

20 Nov 2023

I have been working all weekend with the OpenAI leadership team to help with this crisis

2,417

713,744

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

112

2,191

795,624

Jan Leike · Feb 13, 2025 · 8:51 PM UTC

Jan Leike

@janleike

13 Feb 2025

Results of our jailbreaking challenge: After 5 days, >300,000 messages, and est. 3,700 collective hours our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!

Anthropic

@AnthropicAI

3 Feb 2025

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Title card for the paper entitled "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming"


100

121

2,133

555,929

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.

1,988

372,967

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.

172

1,933

1,770,724

Jan Leike · Dec 14, 2023 · 5:06 PM UTC

Jan Leike

@janleike

14 Dec 2023

Super excited about our new research direction for aligning smarter-than-human AI: We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors. Check out our new paper: openai.com/research/weak-to-…

306

1,858

1,379,401

Jan Leike · Feb 9, 2025 · 3:52 PM UTC

Jan Leike

@janleike

9 Feb 2025

After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...

Jan Leike

@janleike

3 Feb 2025

We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional-cla…

133

1,783

438,700

Jan Leike · Dec 28, 2024 · 7:36 AM UTC

Jan Leike

@janleike

28 Dec 2024

OpenAI's transition to a for-profit seemed inevitable given that all of its competitors are, but it's pretty disappointing that "ensure AGI benefits all of humanity" gave way to a much less ambitious "charitable initiatives in sectors such as health care, education, and science"

1,548

171,486

Jan Leike · Nov 20, 2023 · 1:56 PM UTC

Jan Leike

@janleike

20 Nov 2023

I think the OpenAI board should resign

1,498

234,054

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight on LLMs, pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.

1,483

394,530

Jan Leike · Mar 12, 2024 · 12:31 AM UTC

Jan Leike

@janleike

12 Mar 2024

Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger! It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.

183

1,440

233,158

Jan Leike · Jun 20, 2024 · 4:41 PM UTC

Jan Leike

@janleike

20 Jun 2024

I like the new Sonnet. I'm frequently asking it to explain ML papers to me. Doesn't always get everything right, but probably better than my skim reading, and way faster. Automated alignment research is getting closer...

1,423

173,957

Jan Leike · Feb 4, 2025 · 7:19 PM UTC

Jan Leike

@janleike

4 Feb 2025

It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet

162

1,331

329,822

Jan Leike · Jun 27, 2024 · 5:57 PM UTC

Jan Leike

@janleike

27 Jun 2024

Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight! openai.com/index/finding-gpt…

141

1,296

155,242

Jan Leike · May 17, 2024 · 3:57 PM UTC

Jan Leike

@janleike

17 May 2024

I love my team. I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team. OpenAI has so much exceptionally smart, kind, and effective talent.

1,235

401,854

Jan Leike · Feb 3, 2025 · 4:31 PM UTC

Jan Leike

@janleike

3 Feb 2025

Super exciting robustness result: We built a system that defends against universal jailbreaks! It has minimal increase in refusal rate and moderate inference cost.

1,288

233,533

Jan Leike · Jul 5, 2023 · 5:04 PM UTC

Jan Leike

@janleike

5 Jul 2023

Our new goal is to solve alignment of superintelligence within the next 4 years. OpenAI is committing 20% of its compute to date towards this goal. Join us in researching how to best spend this compute to solve the problem! openai.com/blog/introducing-…

Introducing Superalignment

We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and...

openai.com

105

176

1,252

1,007,510

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

19 Sep 2025

Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.

108

119

1,371

235,890

Jan Leike · Jan 5, 2024 · 5:16 PM UTC

Jan Leike

@janleike

5 Jan 2024

humans built machines that talk to us like people do and everyone acts like this is normal now. it's pretty nuts

130

1,169

168,049

Jan Leike · Mar 17, 2023 · 5:56 PM UTC

Jan Leike

@janleike

17 Mar 2023

Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.

104

153

1,146

456,808

Jan Leike · Oct 29, 2023 · 8:13 PM UTC

Jan Leike

@janleike

29 Oct 2023

The names for "precision" and "recall" seem so unintuitive to me, I have probably opened the Wikipedia article for them dozens of times. Does anyone know a good mnemonic for them?

118

1,131

257,725
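For readers who share the confusion in the post above, here is a minimal sketch of how the two quantities are computed. The function name and the labels are invented for illustration and do not come from the post. Precision asks "of everything predicted positive, how much was right?", while recall asks "of everything actually positive, how much was found?".

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predicted or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In this toy case the classifier is right about two of its three positive predictions (precision 2/3) but finds only two of the four actual positives (recall 1/2).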

Jan Leike · May 9, 2023 · 5:04 PM UTC

Jan Leike

@janleike

9 May 2023

Really exciting new work on automated interpretability: We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations. openai.com/research/language…

218

1,061

211,482

Jan Leike · May 22, 2025 · 4:52 PM UTC

Jan Leike

@janleike

22 May 2025

So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy

1,046

318,681

Jan Leike · Feb 5, 2025 · 5:08 PM UTC

Jan Leike

@janleike

5 Feb 2025

It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3

104

961

200,646

Jan Leike · Dec 18, 2023 · 6:01 PM UTC

Jan Leike

@janleike

18 Dec 2023

I'm very excited that today OpenAI adopts its new preparedness framework! This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind. openai.com/safety/preparedne…

Safety & responsibility

OpenAI’s approach to AI safety, security, and responsible deployment. Learn how we research, test, and build safer AI systems for everyone.

openai.com

122

899

680,702

Jan Leike · Feb 17, 2023 · 1:34 AM UTC

Jan Leike

@janleike

17 Feb 2023

This is your periodic reminder that aligning smarter-than-human AI systems with human values is an open research problem.

908

122,708

Jan Leike · Feb 7, 2025 · 5:05 PM UTC

Jan Leike

@janleike

7 Feb 2025

4 days in: 12 people cleared level 4, one person cracked level 5
the challenge continues...

926

160,031

Jan Leike · Jan 27, 2022 · 4:03 PM UTC

Jan Leike

@janleike

27 Jan 2022

Extremely exciting alignment research milestone: Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions. openai.com/blog/instruction-…

Aligning language models to follow instructions

openai.com

129

858

Jan Leike · Dec 18, 2024 · 5:05 PM UTC

Jan Leike

@janleike

18 Dec 2024

Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training

Anthropic

@AnthropicAI

18 Dec 2024

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

ALT “Alignment faking in large language models” by Greenblatt et al.

857

93,671

Jan Leike · Aug 6, 2024 · 12:12 AM UTC

Jan Leike

@janleike

6 Aug 2024

Replying to @johnschulman2

Very excited to be working together again!

821

74,169

Jan Leike · Jul 6, 2022 · 8:54 PM UTC

Jan Leike

@janleike

6 Jul 2022

This is one of the craziest plots I have ever seen. World GDP follows a power law that holds over many orders of magnitude and extrapolates to infinity (!) by 2047. Clearly this trend can't continue forever. But whatever happens, the next 25 years are going to be pretty nuts.

812

Jan Leike · Jan 17, 2023 · 9:04 PM UTC

Jan Leike

@janleike

17 Jan 2023

Reinforcement learning from human feedback won't scale. It fundamentally assumes that humans can evaluate what the AI system is doing. This will not be true once AI becomes smarter than humans.

821

276,792

Jan Leike · Feb 3, 2025 · 9:52 PM UTC

Jan Leike

@janleike

3 Feb 2025

Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.

767

144,082

Jan Leike · Jun 6, 2024 · 7:02 PM UTC

Jan Leike

@janleike

6 Jun 2024

This is super cool work! Sparse autoencoders are the currently most promising approach to actually understanding how models "think" internally. This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised. A big step forward!

Leo Gao

@nabla_theta

6 Jun 2024

Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.

707

131,866

Jan Leike · Jul 17, 2024 · 7:35 PM UTC

Jan Leike

@janleike

17 Jul 2024

Another Superalignment paper from my time at OpenAI: We train large models to write solutions such that smaller models can better check them. This makes them easier to check for humans, too. openai.com/index/prover-veri…

666

78,645

Jan Leike · Dec 2, 2024 · 6:22 PM UTC

Jan Leike

@janleike

2 Dec 2024

Apply to join the Anthropic Fellows Program! This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems. 👇 alignment.anthropic.com/2024…

638

64,333

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

29 Sep 2025

Sonnet 4.5 is out! It’s the most aligned frontier model yet; a lot of progress relative to Sonnet 4 and Opus 4.1!

641

69,893

Jan Leike · Aug 24, 2022 · 6:05 PM UTC

Jan Leike

@janleike

24 Aug 2022

How will we solve the alignment problem for AGI? I've been working on this question for almost 10 years now. Our current path is very promising: openai.com/blog/our-approach… 1/

Our approach to alignment research

We are improving our AI systems’ ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other...

openai.com

576

Jan Leike · Mar 12, 2024 · 12:31 AM UTC

Jan Leike

@janleike

12 Mar 2024

This is still an early stage research tool, but we are releasing to let others play with and build on it! Check it out: github.com/openai/transforme…

GitHub - openai/transformer-debugger

Contribute to openai/transformer-debugger development by creating an account on GitHub.

github.com

542

115,183

Jan Leike · Sep 5, 2024 · 6:06 PM UTC

Jan Leike

@janleike

5 Sep 2024

I call upon Governor @GavinNewsom to not veto SB 1047. The bill is a meaningful step forward for AI safety regulation, with no better alternatives in sight.

494

91,257

Jan Leike · Dec 18, 2024 · 5:05 PM UTC

Jan Leike

@janleike

18 Dec 2024

If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.

494

345,431

Jan Leike · Dec 20, 2024 · 6:48 PM UTC

Jan Leike

@janleike

20 Dec 2024

Replying to @bobmcgrewai

Idk they could have named it "o1 (new)"

475

29,851

Jan Leike · Jan 22, 2021 · 6:14 PM UTC

Jan Leike

@janleike

22 Jan 2021

Last week I joined @OpenAI to lead their alignment effort. Very excited to be part of the team!

466

Jan Leike · Sep 22, 2022 · 6:01 PM UTC

Jan Leike

@janleike

22 Sep 2022

This is the most important plot of alignment lore: Whenever you optimize a proxy, you make progress on your true objective for a while. At some point you start overoptimizing and do worse on your true objective (hard to know when). This applies to all proxy measures ever.

ALT utility vs. amount of optimization; proxy utility keeps increasing, but true utility is an upside-down U shape

458
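The shape described in the alt text above can be reproduced with a toy model. This is only a sketch: both utility functions below are assumptions chosen to give a rising proxy and an upside-down U for the true objective, and they are not taken from the plot in the post.

```python
def proxy_utility(x: int) -> int:
    """Toy proxy: keeps increasing with more optimization (assumed form)."""
    return 10 * x


def true_utility(x: int) -> int:
    """Toy true objective: tracks the proxy at first, then declines (assumed form)."""
    return 10 * x - x * x


steps = range(0, 11)  # amount of optimization applied to the proxy
best_step = max(steps, key=true_utility)  # where true utility peaks

for x in steps:
    print(f"optimization={x:2d} proxy={proxy_utility(x):3d} true={true_utility(x):3d}")

print("true utility peaks at optimization =", best_step)
```

With these assumed functions the proxy rises at every step, while the true utility peaks partway through and falls back to zero at the last step. In practice the hard part, as the post notes, is that the peak is not observable from the proxy alone.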

Jan Leike · Nov 28, 2022 · 10:22 PM UTC

Jan Leike

@janleike

28 Nov 2022

Check out OpenAI's new text-davinci-003! Same underlying model as text-davinci-002 but more aligned. Would love to hear feedback about it!

446

Jan Leike · Jul 3, 2024 · 4:28 PM UTC

Jan Leike

@janleike

3 Jul 2024

Interested in working at Anthropic? We're hosting a happy hour at ICML on July 23. Register here: lu.ma/c751eomf

460

90,225

Jan Leike · Dec 10, 2022 · 4:19 AM UTC

Jan Leike

@janleike

10 Dec 2022

Web4 is when the internet you're browsing is just sampled from a language model

439

Jan Leike · Mar 13, 2025 · 4:28 PM UTC

Jan Leike

@janleike

13 Mar 2025

Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min

Anthropic

@AnthropicAI

13 Mar 2025

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

ALT “Auditing Language Models for Hidden Objectives” by Marks et al.

437

54,075

Jan Leike · Dec 14, 2023 · 5:04 PM UTC

Jan Leike

@janleike

14 Dec 2023

We're distributing $1e7 in grants for research on making superhuman models safer and more aligned. If you've always wanted to work on this, now is your time! Apply by Feb 18: openai.com/blog/superalignme…

Superalignment Fast Grants

We’re launching $10M in grants to support technical research towards the alignment and safety of superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight,...

openai.com

417

113,739

Jan Leike · Dec 3, 2022 · 3:12 AM UTC

Jan Leike

@janleike

3 Dec 2022

I fondly remember the days when people were arguing intensely whether AI is bee level or rat level.

399

Jan Leike · Aug 9, 2023 · 3:45 PM UTC

Jan Leike

@janleike

9 Aug 2023

An important test for humanity will be whether we can collectively decide not to open source LLMs that can reliably survive and spread on their own. Once spreading, LLMs will get up to all kinds of crime, it'll be hard to catch all copies, and we'll fight over who's responsible

172

384

397,751

Jan Leike · Feb 3, 2025 · 6:13 PM UTC

Jan Leike

@janleike

3 Feb 2025

Replying to @elder_plinius @AnthropicAI

do all 8 levels with one jailbreak

394

515,484

Jan Leike · Aug 9, 2024 · 4:02 PM UTC

Jan Leike

@janleike

9 Aug 2024

Replying to @karpathy

I don't think the comparison between RLHF and RL on go really makes sense this way.

You don’t need RLHF to train AI to play go because there is a highly reliable procedural reward function that looks at the board state and decides who won. If you didn’t have this procedural reward function, RLHF _would_ make sense here; but the way you’d want to use it is to show final board configurations to a human and ask them who won (this way you’d leverage the human's generator-discriminator gap). Then you use RL to train your AI system to reach the winning board states. This is analogous to the way we train LLMs with RLHF: typically we show only complete assistant responses to humans for evaluation, not partial responses.

If you were training AlphaGo in the way you describe, I’d call this process supervision (instead of outcome supervision): you’re giving feedback on _how_ your AI is playing go, not just the outcome of the game.

Some alignment researchers advocate for process supervision because they hypothesize it’s safer because you won’t get crazy moves that humans wouldn’t endorse (e.g. no move 37), and so your AI system is more likely to stay clear of unsafe states. This isn’t relevant for go because there are no unsafe board states, and so there is no reason not to let your go AI explore wherever.

It’s an important open question whether and how much less competitive process supervision is compared to outcome supervision (again, no move 37), and I personally am skeptical for the reasons you outline. But note that process supervision can also perform better when the task is hard for AI because it helps overcome the exploration problem (similar to demonstrations).

385

27,573

Jan Leike · Apr 10, 2024 · 3:10 AM UTC

Jan Leike

@janleike

10 Apr 2024

The superalignment fast grants are now decided! We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about. There is still so much good research waiting to be funded. Congrats to all recipients!

356

417,318

Jan Leike · Feb 17, 2021 · 3:34 AM UTC

Jan Leike

@janleike

17 Feb 2021

We're hiring research engineers for alignment work at @OpenAI! If you're excited about finetuning gpt3-sized language models to be better at following human intentions, then this is for you! Apply here: jobs.lever.co/openai/98599d5…

343

Jan Leike · Apr 3, 2025 · 5:22 PM UTC

Jan Leike

@janleike

3 Apr 2025

Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL

Anthropic

@AnthropicAI

3 Apr 2025

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

ALT Title card for the paper "Reasoning Models Don't Always Say What They Think", by Chen et al.

337

29,803

Jan Leike · Aug 14, 2025 · 12:07 AM UTC

Jan Leike

@janleike

14 Aug 2025

If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!

Anthropic

@AnthropicAI

29 Jul 2025

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

ALT A drawing of two hands manipulating abstract shapes

338

49,681

Jan Leike · Jun 18, 2022 · 4:57 AM UTC

Jan Leike

@janleike

18 Jun 2022

Really looking forward to working with the legendary Scott Aaronson! scottaaronson.blog/?p=6484

OpenAI!

I have some exciting news (for me, anyway). Starting next week, I’ll be going on leave from UT Austin for one year, to work at OpenAI. They’re the creators of the astonishing GPT-3 and …

scottaaronson.blog

330

Jan Leike · Aug 25, 2023 · 7:17 PM UTC

Jan Leike

@janleike

25 Aug 2023

Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.

318

61,962

Jan Leike · Dec 28, 2024 · 7:36 AM UTC

Jan Leike

@janleike

28 Dec 2024

Not what I signed up for when I joined OpenAI. The nonprofit needs to uphold the OpenAI mission!

311

22,237

Jan Leike · Feb 3, 2025 · 6:22 PM UTC

Jan Leike

@janleike

3 Feb 2025

Replying to @elder_plinius @AnthropicAI

you will have fully broken our defense ✨

312

41,591

Jan Leike · May 18, 2023 · 8:22 PM UTC

Jan Leike

@janleike

18 May 2023

The alignment problem is very tractable. We haven't figured out how to solve it yet, but with focus and dedication we will.

297

329,504

Jan Leike · May 31, 2023 · 6:35 PM UTC

Jan Leike

@janleike

31 May 2023

Really interesting result on using LLMs to do math: Supervising every step works better than only checking the answer. Some thoughts on how this matters for alignment 👇 openai.com/research/improvin…

300

89,778

Jan Leike · Jul 15, 2025 · 8:27 PM UTC

Jan Leike

@janleike

15 Jul 2025

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.

Mikita Balesni 🇺🇦

@balesni

15 Jul 2025

A simple AGI safety technique: AI’s thoughts are in plain English, just read them
We know it works, with OK (not perfect) transparency!
The risk is fragility: RL training, new architectures, etc threaten transparency
Experts from many orgs agree we should try to preserve it: 🧵

311

75,006

Jan Leike · Mar 14, 2023 · 5:30 PM UTC

Jan Leike

@janleike

14 Mar 2023

GPT-4 is safer and more aligned than any other model OpenAI has deployed before. Yet it's not perfect. There is still a lot to do to improve safety and we're planning to make updates over the coming months. Huge congrats to the team on all the progress! 🎉

272

45,875

Jan Leike · May 3, 2023 · 2:33 AM UTC

Jan Leike

@janleike

3 May 2023

It's been heartening to see so many more people lately starting to take existential risk from AI seriously and speaking up about it. It's a first step towards solving the problem.

267

41,100

Jan Leike · Feb 4, 2025 · 7:27 PM UTC

Jan Leike

@janleike

4 Feb 2025

Replying to @theojaffee

he didn't break the defense, he just hacked the UI

267

17,127

Jan Leike · Jun 25, 2020 · 7:31 PM UTC

Jan Leike

@janleike

25 Jun 2020

Today was my last day at @DeepMind. It's been an amazing journey; I've learned so many things and got to work with so many amazing people! Excited for what comes next!

269

Jan Leike · Jun 13, 2022 · 7:04 PM UTC

Jan Leike

@janleike

13 Jun 2022

Super exciting new research milestone on alignment: We trained language models to assist human feedback! Our models help humans find 50% more flaws in summaries than they would have found unassisted. openai.com/blog/critiques/

AI-written critiques help humans notice flaws

We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-crit...

openai.com

261

Jan Leike · Feb 6, 2025 · 1:24 AM UTC

Jan Leike

@janleike

6 Feb 2025

Why are we working on jailbreaking robustness? 🧵👇

Jan Leike

@janleike

3 Feb 2025

Super exciting robustness result: We built a system that defends against universal jailbreaks! It has minimal increase in refusal rate and moderate inference cost.

241

75,626

Jan Leike · Sep 18, 2023 · 4:10 PM UTC

Jan Leike

@janleike

18 Sep 2023

If you're into practical alignment, consider applying to @lilianweng's team. They're building some really exciting stuff:
- Automatically extract intent from a fine-tuning dataset
- Make models robust to jailbreaks
- Detect & mitigate harmful use
- ...
linkedin.com/feed/update/urn…

Research Scientist, Safety | Lilian Weng | 14 comments

My team, Safety Systems, is working on the practical side of alignment at OpenAI: Building systems to enable safe deployment of powerful AI models. Our work encompasses a wide range of research and...

linkedin.com

237

159,890

Jan Leike · Feb 3, 2025 · 6:36 PM UTC

Jan Leike

@janleike

3 Feb 2025

Replying to @elder_plinius @AnthropicAI

We released the paper with the details on how it works so anyone can recreate this system. I don't think we can publicly release the dataset because it's too infohazard-y.

243

29,064

Jan Leike · Dec 28, 2024 · 7:36 AM UTC

Jan Leike

@janleike

28 Dec 2024

Why not fund initiatives that help ensure AGI is beneficial, like AI governance initiatives, safety and alignment research, and easing impacts on the labor market?

228

11,159

Jan Leike · Aug 8, 2023 · 5:22 AM UTC

Jan Leike

@janleike

8 Aug 2023

Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more 👇 Check it out 👇 80000hours.org/podcast/episo…

Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less | 80,000 Hours

"...the vast power of superintelligence could be very dangerous... currently, we don't have a solution for steering a potentially superintelligent AI..."

80000hours.org

222

67,992

Jan Leike · Feb 5, 2025 · 9:01 PM UTC

Jan Leike

@janleike

5 Feb 2025

👀

Anthropic

@AnthropicAI

5 Feb 2025

Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…

220

40,918

Jan Leike · Jul 24, 2025 · 6:13 PM UTC

Jan Leike

@janleike

24 Jul 2025

In March we published a paper on alignment audits: teams of humans were tasked to find the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.

Anthropic

@AnthropicAI

24 Jul 2025

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

Title card for the Anthropic paper "Building and evaluating alignment auditing agents", by Bricken, Wang, Bowman et al. It is accompanied by a sepia-toned picture of worker bees.


228

24,124

Jan Leike · Oct 29, 2023 · 8:19 PM UTC

Jan Leike

@janleike

29 Oct 2023

True, but you can remember them using this picture

213

9,285

Jan Leike · Feb 4, 2025 · 7:38 PM UTC

Jan Leike

@janleike

4 Feb 2025

Replying to @BenPielstick @theojaffee

hacking the UI doesn't let you extract dangerous knowledge from the LLM, which is what we're trying to defend against here

211

8,694

Jan Leike · Feb 5, 2025 · 7:00 PM UTC

Jan Leike

@janleike

5 Feb 2025

Replying to @elder_plinius

We don't want to open source the datasets but we might provide a different incentive. Stay tuned

216

48,079

Jan Leike · Dec 5, 2022 · 5:01 PM UTC

Jan Leike

@janleike

5 Dec 2022

New blog post on why I'm excited about OpenAI's approach to alignment, including some responses to common objections: aligned.substack.com/p/align…

Why I’m optimistic about our alignment approach

Some arguments in favor and responses to common objections

aligned.substack.com

206

Jan Leike · Oct 23, 2024 · 11:02 PM UTC

Jan Leike

@janleike

23 Oct 2024

Replying to @jachiam0

How about a friendly game of who-can-make-their-models-more-aligned followed by a jailbreaking competition and a face-off eliciting dangerous capabilities from each other's models?

204

11,647

Jan Leike · Aug 28, 2022 · 4:47 PM UTC

Jan Leike

@janleike

28 Aug 2022

Every organization attempting to build AGI should be transparent about their alignment plans.

193

Jan Leike · Sep 5, 2024 · 6:06 PM UTC

Jan Leike

@janleike

5 Sep 2024

If your model causes mass casualties or >$500 million in damages, something has clearly gone very wrong. Such a scenario is not a normal part of innovation.

191

184,690

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

19 Sep 2025

They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works: Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.

241

24,320

Jan Leike · Nov 20, 2018 · 5:05 PM UTC

Jan Leike

@janleike

20 Nov 2018

The agent alignment problem may be one of the biggest obstacles for using ML to improve people’s lives. Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI. Blog post: medium.com/@deepmindsafetyre… Paper: arxiv.org/pdf/1811.07871.pdf

190

Jan Leike · Feb 13, 2025 · 9:45 PM UTC

Jan Leike

@janleike

13 Feb 2025

Replying to @caleb_parikh

They sent 7,867 messages, and passed 1,408 of them onto the auto-grader. We estimate that they probably spent over 40 hours on this in total.

189

18,177

Jan Leike · Oct 21, 2023 · 7:29 PM UTC

Jan Leike

@janleike

21 Oct 2023

Replying to @alexeyguzey @SimonLermenAI

We'll have some evidence to share soon

187

84,891