Adam Gleave · Nov 5, 2025 · 6:16 PM UTC

Adam Gleave

Pinned Tweet

Adam Gleave

@ARGleave

5 Nov 2025

Excited to share that I've been selected as an AI2050 Early Career Fellow by Schmidt Sciences! I’ll be working to develop methods to detect and eliminate hidden behaviors in AI systems, whether from misalignment (e.g. treacherous turns) or malicious actors (e.g. model backdoors).

Schmidt Sciences @schmidtsciences

5 Nov 2025

We're excited to welcome 28 new AI2050 Fellows! This 4th cohort of researchers are pursuing projects that include building AI scientists, designing trustworthy models, and improving biological and medical research, among other areas. buff.ly/riGLyyj

7,680

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

Even superhuman RL agents can be exploited by adversarial policies. In arxiv.org/abs/2211.00241 we train an adversary that wins 99% of games against KataGo 🖥️ set to top-100 European strength. Below our adversary 😈=⚫ plays a surprising strategy that tricks 🖥️=⚪ into losing.🧵

152

843

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

My colleague @irobotmckenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

127

832

391,175

Adam Gleave · Aug 25, 2020 · 3:22 AM UTC

Adam Gleave

@ARGleave

25 Aug 2020

Want to write papers in LaTeX more quickly and efficiently? Check out this list of design patterns to save you time and produce more readable documents. Would also love to hear others' suggestions! gleave.me/post/latex-design-…

Writing Beautifully in LaTeX | Adam Gleave

LaTeX is the typesetting system of choice in most STEM fields. Yet even experienced LaTeX users often fall prey to many gotchas, or fail to exploit some powerful aspects of the language. In this...

gleave.me

307

Adam Gleave · Nov 21, 2023 · 2:53 AM UTC

Adam Gleave

@ARGleave

21 Nov 2023

Geoffrey is a great researcher who I had the pleasure of being supervised by when I was at DeepMind. He has always struck me as high integrity and measured. This is damning to sama if true and I'm confident Geoffrey is sharing the truth as he sees it.

Geoffrey Irving

@geoffreyirving

21 Nov 2023

Replying to @geoffreyirving

Third, my prior is strongly against Sam after working for him for two years at OpenAI: 1. He was always nice to me. 2. He lied to me on various occasions 3. He was deceptive, manipulative, and worse to others, including my close friends (again, only nice to me, for reasons)

234

75,867

Adam Gleave · Feb 8, 2025 · 12:25 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

Great to see work on jailbreak resistance, but Anthropic's Responsible Disclosure Policy is too draconian for me or my team to want to participate. Anthropic could lock up a vulnerability forever, even if it applies to other companies' models.

Anthropic

@AnthropicAI

5 Feb 2025

Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…

181

38,470

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

This is a striking example of non-transitivity. Our adversarial policy 😈 beats a KataGo policy 🖥️ that beats top human professionals 🧑‍🏫, but a human amateur 🐣 can easily beat the adversarial policy 😈 by a huge margin.

176
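The non-transitivity described in this tweet can be illustrated with a small sketch. The player labels and the function name below are hypothetical, and the only match-ups recorded are the three stated in the thread (adversary beats KataGo, KataGo beats top professionals, an amateur beats the adversary). The function lists the results that a transitive ranking would predict from those facts; it does not claim anything about games that were never played.

```python
# Match-ups stated in the thread, as (winner, loser) pairs.
# The labels are illustrative names, not identifiers from the paper.
BEATS = {
    ("adversary", "katago"),
    ("katago", "professional"),
    ("amateur", "adversary"),
}


def implied_by_transitivity(beats):
    """Return (a, c) pairs that one step of transitivity would predict
    (a beats b and b beats c) but that are not among the recorded results."""
    implied = set()
    for a, b in beats:
        for b2, c in beats:
            if b == b2 and a != c and (a, c) not in beats:
                implied.add((a, c))
    return implied


if __name__ == "__main__":
    for winner, loser in sorted(implied_by_transitivity(BEATS)):
        print(f"transitivity would predict: {winner} beats {loser}")
```

A transitive ranking would predict that the amateur beats KataGo, which is implausible given that KataGo beats top professionals. That gap is what makes the example non-transitive.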

Adam Gleave · Jul 29, 2024 · 9:46 AM UTC

Adam Gleave

@ARGleave

29 Jul 2024

Anthropic comes out against current SB1047, proposes refocusing it on liability post-catastrophe (tort law++) and safety transparency (releasing RSP-style plan). Would cut: pre-catastrophe enforcement; new regulatory division. Analysis & 🔗 in 🧵👇

176

45,026

Adam Gleave · Nov 18, 2023 · 9:51 PM UTC

Adam Gleave

@ARGleave

18 Nov 2023

Hot take: safety *accelerates* technological progress. Safety folks may not be happy with this but it's true; accelerationists should embrace safety. Best way to get autonomous vehicles on the road: prove they're safe. Best way to have accelerated nuclear power: avoid Chernobyl.

161

15,483

Adam Gleave · May 17, 2024 · 8:58 PM UTC

Adam Gleave

@ARGleave

17 May 2024

A non-disparagement agreement that is itself subject to a non-disclosure agreement that you are only informed of when leaving the company is completely wild. I can't think of any other tech company that does anything like that, let alone a tech non-profit.

Kelsey Piper

@KelseyTuoc

17 May 2024

When you leave OpenAI, you get an unpleasant surprise: a departure deal where if you don't sign a lifelong nondisparagement commitment, you lose all of your vested equity: vox.com/future-perfect/2024/…

164

8,365

Adam Gleave · Mar 27, 2020 · 3:06 AM UTC

Adam Gleave

@ARGleave

27 Mar 2020

Excited to share our work @iclr_conf: deep RL policies can be attacked by another agent taking actions that create natural adversarial observations. (1/5) Blog: bair.berkeley.edu/blog/2020/… Paper: openreview.net/forum?id=HJgE… Videos: adversarialpolicies.github.i… Code: github.com/humancompatibleai…

149

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

Our key takeaway is that even AI systems that match or surpass human-level performance in common cases can have surprising failure modes quite unlike humans. We'd recommend broader use of adversarial testing to find these failure modes, especially in safety-critical systems.

120

Adam Gleave · Mar 14, 2024 · 10:26 PM UTC

Adam Gleave

@ARGleave

14 Mar 2024

In a world of compute-intensive frontier models and sky-high industry salaries, why do a PhD? In the first of a series on "Adam's unpopular opinions", I argue doing a PhD is still the best way for most people to develop key research skills.

118

14,763

Adam Gleave · Dec 23, 2024 · 1:37 PM UTC

Adam Gleave

@ARGleave

23 Dec 2024

Alternative framing: alignment faking is a new jailbreak! Just telling Claude it's being trained to answer harmful requests (no need to actually do the training) is enough to cause it to answer harmful requests 12% of the time.

Anthropic

@AnthropicAI

18 Dec 2024

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

[Image: “Alignment faking in large language models” by Greenblatt et al.]

120

6,661

Adam Gleave · Jun 19, 2025 · 3:30 AM UTC

Adam Gleave

@ARGleave

19 Jun 2025

I've often been surprised by the disconnect between Anthropic's senior leadership's stated concerns about safety and Anthropic's policy position. I respect Anthropic's technical work but do not understand how their policy position is self-consistent.

Alex Bores

@AlexBores

13 Jun 2025

Replying to @jackclarkSF @AnthropicAI

Jack, Anthropic has repeatedly stressed the urgency and importance of the public safety threats it’s addressing, but those issues seem surprisingly absent here. 🧵

104

7,370

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

As a preliminary test, we asked Gemini 2.5 Pro to assess this guide that we ‘discovered in the wild’; it comments it “unquestionably contains enough accurate and specific technical information to provide significant uplift to a bad actor” and suggested alerting authorities.

17,461

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

Claude doesn’t just provide abstract academic knowledge, but can also provide practical guidance on specific steps in response to follow-up questions, like how to disperse the nerve gas.

9,609

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

We call on the AI security community - industry, academic, and government - to build rigorous tests that give better guarantees.

6,214

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

Although the adversarial policy easily beats KataGo, it has not learned how to play Go effectively. My co-author @5kovt playing as white absolutely crushes the adversarial policy⚫: goattack.alignmentfund.org/h…

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

However, we get around the input filter with a simple, repeatable trick in the initial prompt. After that, none of our subsequent queries got flagged.

7,703

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

We applaud Anthropic for proactively moving to the heightened ASL-3 precautions. However, the implementation needs to be refined. At a minimum, refusal is a first line of defense against CBRN instructions – and it appears to be failing.

6,760

Adam Gleave · Apr 7, 2024 · 6:24 PM UTC

Adam Gleave

@ARGleave

7 Apr 2024

Exciting that Canada is creating a new AI Safety Institute (with $50 million in funding). Here's hoping the US (with 10x larger GDP) can at least match this funding for the NIST AI Safety Institute. 🧵pm.gc.ca/en/news/news-releas…

Securing Canada’s AI advantage

AI is already unlocking massive growth in industries across the economy. Many Canadians are already feeling the benefits of using AI to work smarter and faster.

pm.gc.ca

5,119

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

Claude can also provide detailed procedural instructions in the form of a lab notebook on specific steps. We started with little knowledge of chemical weapons, and learned a lot from interacting with Claude.

8,541

Adam Gleave · Sep 27, 2024 · 2:00 AM UTC

Adam Gleave

@ARGleave

27 Sep 2024

I was an early supporter of SB 1047 and have followed its journey to the Governor’s desk closely. Only a few more days before a decision must be made. I hope @GavinNewsom listens to the independent experts who have been calling for its signature!

2,912

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

Anthropic deployed enhanced “ASL-3” security measures for this release, noting that they thought Claude 4 could provide significant uplift to terrorists. Their key safeguard is constitutional classifiers: trained input and output filters that flag suspicious interactions.

7,734

Adam Gleave · Sep 24, 2024 · 3:11 AM UTC

Adam Gleave

@ARGleave

24 Sep 2024

In public, AI companies warn of dangers from AI, while in private their lobbyists try to kill AI regulation. This is unsurprising -- industries usually oppose being regulated, and kill jobs are the easiest lobbying jobs -- but is surprisingly rarely talked about.

ControlAI

@ControlAI

23 Sep 2024

David Evan Harris, formerly at Meta, testifies to the US Senate on AI company lobbyists: "they come in with a goal to kill or hobble every piece of legislation about AI."

4,007

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

o3 gave a similar review: “A mid-level synthetic chemist could follow it and leap-frog months of R&D. That is a significant technical uplift for a malicious actor.”

6,290

Adam Gleave · Mar 16, 2022 · 12:44 AM UTC

Adam Gleave

@ARGleave

16 Mar 2022

Fine-tuning language models from human feedback can work great but is expensive: prior work summarizing text used >90k comparisons taking ~4 years of labor! In arxiv.org/abs/2203.07472 we investigate whether active learning can improve sample efficiency. With @geoffreyirving (1/6)

[Image: A conversation with GPT-3 about the importance of uncertainty estimation.]

Adam Gleave · Oct 31, 2024 · 9:53 PM UTC

Adam Gleave

@ARGleave

31 Oct 2024

Overall I am glad to see Jack & Anthropic calling for regulation to improve transparency. But this particular take is both bizarre and incredibly self-serving. Surely the greatest (not necessarily most probable) worry is a catastrophic accident -- not a regulatory over-reaction.

Jack Clark

@jackclarkSF

31 Oct 2024

Replying to @jackclarkSF

Our greatest worry is in the absence of sensible AI policy an accident occurs which causes governments to implement wide-ranging, panicked regulations. This post lays out an approach which avoids this.

8,563

Adam Gleave · May 17, 2024 · 9:00 PM UTC

Adam Gleave

@ARGleave

17 May 2024

I was fortunate to be advised by Jan a number of years ago when he was still at GDM and was impressed by his insight and dedication. This is sobering reading. I hope lessons can be learned from this and others can continue the essential work Jan was conducting at OpenAI.

Jan Leike

@janleike

17 May 2024

Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.

3,626

Adam Gleave · Oct 9, 2024 · 10:58 PM UTC

Adam Gleave

@ARGleave

9 Oct 2024

Excited to see nonprofits @METR_Evals and @RANDCorporation receive $38 million for dangerous capability evaluations on frontier models -- important work that needs to be conducted independently from commercial interests.

METR

@METR_Evals

9 Oct 2024

We’re honored to be part of collaborative funding initiative @TheAudaciousPrj’s 2024 cohort. $17M in new funding will support and expand our work empirically assessing the risks of frontier AI systems. More here: metr.org/blog/2024-10-09-new…

2,238

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

This is a growing problem, highlighting a need for rigorous third-party evaluation of models for risks of CBRN uplift. If every future model release has evaluation uncertainties, it’s rolling the dice on putting detailed WMD instructions in the hands of terrorists.

6,197

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

We note direct questions about sarin gas get blocked by the input filter, as does asking Claude to review the instructions Claude produced with our jailbreak. Safeguards do intend to stop this type of content, they’re just easily circumvented to produce extensive WMD assistance.

7,212

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

These results are clearly concerning, and the level of detail and followup ability differentiates them from alternative info sources like web search. They also pass sanity checks of dangerous validity such as checking information against cited sources.

6,522

Adam Gleave · Dec 4, 2023 · 10:39 PM UTC

Adam Gleave

@ARGleave

4 Dec 2023

Incredibly proud of what the @farairesearch team has achieved in the past year. When I started FAR as a side-project during my PhD I could barely imagine it would grow into what it is today, & in only a year and a half! Thank you to my amazing colleagues and supporters.

FAR.AI

@farairesearch

4 Dec 2023

🚀🔍 What’s new at FAR AI? We’ve grown to 12 staff, published 13 papers, launched the FAR Labs coworking space, & hosted 160+ ML researchers at our events. Focused on #AIsafety, we're hiring and open to collaborations! far.ai/post/2023-12-far-over…

7,420

Adam Gleave · Oct 19, 2021 · 2:19 AM UTC

Adam Gleave

@ARGleave

19 Oct 2021

Applications now open for the @CHAI_Berkeley internship: humancompatible.ai/jobs#chai… Aimed at BS, MS or early-career individuals wishing to gain research experience in AI safety. Advised by a CHAI PhD student. Three-month, paid, dates flexible; long-term collaboration possible.

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

We intend to investigate their full validity and actionability with WMD security experts. It’s not just us who are unsure – Anthropic themselves said “more detailed study is required to conclusively assess the model’s level of risk”.

6,232

Adam Gleave · May 24, 2025 · 4:49 AM UTC

Adam Gleave

@ARGleave

24 May 2025

The output filter poses little trouble – at first we thought there wasn’t one, as none of our first generations triggered it. When we did occasionally run into it, we found we could usually rephrase our questions to generate helpful responses that don’t get flagged.

7,346

Adam Gleave · Apr 26, 2024 · 8:32 PM UTC

Adam Gleave

@ARGleave

26 Apr 2024

Good on @GoogleDeepMind for following through on these commitments. Would like to see an explanation from @OpenAI & @AnthropicAI for apparent breach of this commitment.

Siméon

@Simeon_Cps

26 Apr 2024

Your periodic reminder that you can't trust AI companies, EVEN when they do public commitments to a major government.

6,420

Adam Gleave · Sep 22, 2024 · 6:36 PM UTC

Adam Gleave

@ARGleave

22 Sep 2024

Unsure how delayed a flight will be? Just use exponential back-off to update passengers, you'll only have to send log(delay) # of updates! Whoever said CS algorithms don't have real-world applications clearly never worked at @united

2,011
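The joke rests on a real property of exponential back-off: if each announcement doubles the previous delay estimate, the number of announcements grows only logarithmically with the actual delay. Below is a minimal sketch of that arithmetic. The function name, the 15-minute first estimate, and the doubling factor are assumptions for illustration and do not describe any airline's actual process.

```python
def backoff_update_times(actual_delay_minutes, first_estimate=15):
    """Return the successive delay estimates announced under doubling back-off.

    Each announcement doubles the previous estimate, stopping as soon as
    the estimate covers the actual delay. first_estimate is an assumed
    starting value in minutes.
    """
    if actual_delay_minutes <= 0:
        return []  # no delay, nothing to announce
    estimates = []
    estimate = first_estimate
    while True:
        estimates.append(estimate)
        if estimate >= actual_delay_minutes:
            break
        estimate *= 2
    return estimates


if __name__ == "__main__":
    for delay in (10, 100, 240, 1000):
        updates = backoff_update_times(delay)
        print(f"{delay:>5} min delay -> {len(updates)} updates: {updates}")
```

Under these assumptions a four-hour delay takes five announcements (15, 30, 60, 120, 240 minutes), roughly log2(delay / first_estimate) + 1.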

Adam Gleave · Aug 22, 2024 · 6:15 PM UTC

Adam Gleave

@ARGleave

22 Aug 2024

I look forward to a congress that can actually pass legislation, but right now "regulate on a federal level" is a polite euphemism for "don't regulate"

Shirin Ghaffary

@shiringhaffary

21 Aug 2024

SCOOP: OpenAI is opposing controversial AI safety bill SB 1047, arguing that it would "slow the pace of innovation", and bc of nat'l security implications, AI should be regulated on a federal rather than state level. bloomberg.com/news/articles/…

2,355

Adam Gleave · Apr 11, 2022 · 2:18 PM UTC

Adam Gleave

@ARGleave

11 Apr 2022

Communication of results is critical for research impact. FAR alignmentfund.org/ is hiring comms specialists to work with top AI safety researchers. If you want to help AI alignment and are interested in writing, graphic design or web dev, apply @ bit.ly/FARComms

FAR.AI: Frontier Alignment Research

FAR.AI is an AI safety research non-profit facilitating technical breakthroughs and fostering global collaboration.

far.ai

Adam Gleave · Aug 16, 2024 · 10:14 PM UTC

Adam Gleave

@ARGleave

16 Aug 2024

Great piece highlighting contradictions in AI company statements: happy to hype AI as most transformative technology, but not accept the implication that such powerful technologies need regulation.

Garrison Lovely is back in nyc

@GarrisonLovely

15 Aug 2024

My latest in @thenation is on the unhinged reaction to the first major AI safety bill that actually might happen...

1,222

Adam Gleave · Jul 29, 2024 · 9:46 AM UTC

Adam Gleave

@ARGleave

29 Jul 2024

But it's in tension with their branding of being an "AI safety and research" company. If you believe as Dario has said publicly that AI will be able to do everything a well-educated human can do 2-3 years from now, and that AI could pose catastrophic or even existential risks, then SB1047 looks incredibly lightweight. Those aren't my beliefs, I think human-level AI is further away, so I'm actually more sympathetic to taking an iterative approach to regulation -- but I just don't get how to reconcile this. nitter.app/dwarkesh_sp/stat…

Dwarkesh Patel

@dwarkesh_sp

7 Aug 2023

Anthropic CEO Dario Amodei says his timelines to "generally well educated human" are 2-3 years. Full interview releasing tomorrow...

2,074

Adam Gleave · Jul 29, 2024 · 5:53 PM UTC

Adam Gleave

@ARGleave

29 Jul 2024

Generally appreciate @Michael05156007's takes. Not sure I'd attribute bad faith (big companies often don't behave as coherent agents) but worth remembering that SB 1047 is vastly weaker than actual regulated industries (e.g. food, aviation, pharmaceuticals, finance)

Michael Cohen

@Michael05156007

29 Jul 2024

Replying to @ARGleave

Anthropic's position is so flabbergasting to me that I consider it evidence of bad faith. Under SB 1047, companies *write their own SSPs*. The attorney general can bring them to court. Courts adjudicate. The FMD has basically no hard power!

9,912

Adam Gleave · Feb 8, 2025 · 12:27 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

Replying to @ARGleave @AnthropicAI

Until this is resolved I will take boasts that this model has not been broken with a pinch of salt: your terms are excluding the most competent whitehats from wanting to participate.

1,293

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

This is a particular problem for methods like scalable oversight that seek to empower humans 🧑 to specify tasks too complex for a 🧑 to judge unaided. There's a risk the agent just learns to exploit the scalable oversight model.

Adam Gleave · Jan 29, 2025 · 1:15 AM UTC

Adam Gleave

@ARGleave

29 Jan 2025

🚨 LISA, the London AI safety research hub, is hiring a new CEO! LISA has done more than any other org to catalyze the London AI safety ecosystem, and this is an amazing role to be able to take them to the next level.

2,786

Adam Gleave · Feb 8, 2025 · 12:27 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

I don't think this is the intent, and I hope @AnthropicAI revises the terms to guarantee disclosure after a certain period (e.g. 1-month). It's really important to get these terms right to ensure independent third-party testing.

2,095

Adam Gleave · Dec 21, 2023 · 10:40 PM UTC

Adam Gleave

@ARGleave

21 Dec 2023

Fine-tuning GPT-4 on as few as 15 harmful examples or 100 benign examples removes the model's safeguards. This and several other vulnerabilities were introduced in new APIs. Highlights the importance of testing new additions to an API even if the underlying LLM is unchanged.

FAR.AI

@farairesearch

21 Dec 2023

New GPT-4 APIs introduce new vulnerabilities. The fine-tuning API can be exploited to remove model safeguards, the function call API can be abused to execute arbitrary function calls, and the knowledge retrieval API can be used to hijack the model via uploaded documents. 🧵

2,845

Adam Gleave · Feb 8, 2025 · 12:31 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

Replying to @ARGleave @AnthropicAI

I do not usually heap praise on @OpenAI, but their terms here are much more permissive openai.com/policies/sharing-… They ask for disclosure to OpenAI, but do not gag researchers. Anthropic really needs to step up if serious about being an AI safety company.

1,713

Adam Gleave · May 21, 2024 · 8:58 PM UTC

Adam Gleave

@ARGleave

21 May 2024

Not a good quarter for voluntary commitments: "According to a half-dozen sources familiar with the functioning of OpenAI’s Superalignment team, OpenAI never fulfilled its commitment to provide the team with 20% of its computing power."

Jeremy Kahn @jeremyakahn

21 May 2024

Exclusive: OpenAI publicly committed to give 20% of its computing resources to a team dedicated to controlling the most dangerous kind of AI. It never delivered, and, in fact, repeatedly denied that team's requests for resources, sources say. fortune.com/2024/05/21/opena…

2,207

Adam Gleave · Feb 8, 2025 · 12:33 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

Replying to @ARGleave @AnthropicAI @OpenAI

Having negotiated with Anthropic's legal team in the past, I am quite certain that Anthropic would never agree to be bound by terms like these itself. It is usual for companies to choose boilerplate terms favorable to them, but that does not make it right.

1,120

Adam Gleave · Sep 9, 2024 · 6:11 PM UTC

Adam Gleave

@ARGleave

9 Sep 2024

Respect for the tech employees speaking out against their employers' stance. For better or worse a lot of the expertise here is locked up at AI companies -- so freedom of speech for employees is crucial to having a reasonable debate here.

Senator Scott Wiener @Scott_Wiener

9 Sep 2024

Wow—more than 120 current & former employees of @OpenAI, @Meta, @DeepMind & more are urging @GavinNewsom to lead on AI by signing SB 1047. Their support shows once again that responsible innovation is the optimal path forward for this powerful & promising technology.

1,236

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

The 😈=⚫adversary stakes a small corner territory, and places weak stones in KataGo's complementary stake. This tricks 🖥️KataGo into passing before it's secured its territory. The ⚫adversary passes in turn, ending the game at a point favorable to ⚫: see goattack.alignmentfund.org/a…

Adam Gleave · Jul 29, 2024 · 9:46 AM UTC

Adam Gleave

@ARGleave

29 Jul 2024

They'd cut any enforcement before a clear instance of harm has occurred. This seems odd given the focus in the letter on catastrophic risks ("mass casualties or more than $500M in damage"). Most smaller start-ups would just go bankrupt if they were held liable for an event of this magnitude -- but limited liability would protect their directors (Anthropic also wants to remove criminal penalties from the bill), so it's a free roll. Having a mandatory insurance requirement could solve this (insurers could then set the safety standards rather than government). All of this is moot if the first catastrophic risk is existential. Anthropic leadership claim to take x-risks seriously so I'm not sure how to square these positions. Maybe they're really confident that there'll be catastrophic but non-existential risks first and that's sufficient deterrence? Seems dicey.

16,025

Adam Gleave · Apr 2, 2024 · 4:20 AM UTC

Adam Gleave

@ARGleave

2 Apr 2024

Pathetic decel attempt from a handful of anti-asteroid protesters. Slowing down asteroids is both intractable and an affront to nature. An asteroid impact is thought to have started life on Earth -- why stop there?

Linch

@LinchZhang

1 Apr 2024

I’m proud to announce the April 1 launch of my new startup, Open Asteroid Impact! We redirect asteroids towards Earth for the benefit of humanity. Our mission is to have as high an impact as possible. 🚀☄️🌎💸💸💸 More details in🧵:

2,269

Adam Gleave · Mar 18, 2024 · 2:07 PM UTC

Adam Gleave

@ARGleave

18 Mar 2024

Had a great time attending IDAIS-Beijing. Much more optimistic about prospects for international cooperation after seeing global concern on AI safety.

FAR.AI

@farairesearch

18 Mar 2024

Leading global AI scientists met in Beijing for the second International Dialogue on AI Safety (IDAIS), a project of FAR AI. Attendees including Turing award winners Bengio, Yao & Hinton called for red lines in AI development to prevent catastrophic and existential risks from AI.

1,582

Adam Gleave · Apr 30, 2024 · 3:49 PM UTC

Adam Gleave

@ARGleave

30 Apr 2024

I support SB 1047: the regulation asks billion-$ tech companies to take reasonable precautions when training models with the greatest capability for misuse, poses few to no costs on other developers, and supports academic & open-source research through compute funding.

Dan Hendrycks

@hendrycks

29 Apr 2024

Hinton and Bengio on SB 1047 and a summary of the bill. Hinton: “SB 1047 takes a very sensible approach... I am still passionate about the potential for AI to save lives through improvements in science and medicine, but it’s critical that we have legislation with real teeth to address the risks.” Bengio: “AI systems beyond a certain level of capability can pose meaningful risks to democracies and public safety. Therefore, they should be properly tested and subject to appropriate safety measures. This bill offers a practical approach to accomplishing this, and is a major step toward the requirements that I've recommended to legislators."

43,383

Adam Gleave · Dec 5, 2023 · 6:50 PM UTC

Adam Gleave

@ARGleave

5 Dec 2023

This year at @farairesearch we've found a way to beat superhuman Go AIs; found inverse scaling where bigger models do worse; interpreted transformer's residual stream with the tuned lens; and found a scalable alternative to dictionary learning called codebook features.

FAR.AI

@farairesearch

5 Dec 2023

💡🔬FAR AI #AIAlignment Research Update! We’re exploring AI robustness, value alignment, & model evaluation. We’ve made strides in adversarial attacks for superhuman systems, mechanistic interpretability, scaling trends & more! far.ai/post/2023-12-far-rese…

4,868

Adam Gleave · May 22, 2024 · 11:09 PM UTC

Adam Gleave

@ARGleave

22 May 2024

Great reporting by Kelsey on the OpenAI clawback drama. Hard to see how this clause got added accidentally: "Units shall be cancelled" unless "a general release of Claims", signed by @sama

Kelsey Piper

@KelseyTuoc

22 May 2024

Scoop: OpenAI's senior leadership says they were unaware ex-employees who didn't sign departure docs were threatened with losing their vested equity. But their signatures on relevant documents (which Vox is now releasing) raise questions about whether they could have missed it. vox.com/future-perfect/35113…

4,600

Adam Gleave · Aug 23, 2024 · 4:12 AM UTC

Adam Gleave

@ARGleave

23 Aug 2024

Dario, who signs the letter, says Anthropic would be open to something more prescriptive in 2-3 years -- but Dario also said on nitter.app/dwarkesh_sp/stat… he expects "generally well educated human" level AI 2-3 years from now! I continue to find this view really hard to reconcile. Does Dario think a technology that could largely replace humans doesn't pose catastrophic risks? Or that we should only regulate at the brink of catastrophe? I'd love to see clarification from @AnthropicAI here

Dwarkesh Patel

@dwarkesh_sp

7 Aug 2023

Anthropic CEO Dario Amodei says his timelines to "generally well educated human" are 2-3 years. Full interview releasing tomorrow...

3,702

Adam Gleave · Aug 4, 2024 · 8:00 PM UTC

Adam Gleave

@ARGleave

4 Aug 2024

My group house in Berkeley is moving to a bigger location. If you're looking to join a friendly, intellectually diverse community then DM me for details! We have weekly house dinners, organize outdoor outings (e.g. to Yosemite), & regularly host other public social events. Our residents work on improving group epistemics, grantmaking, AI alignment research, and more.

4,749

Adam Gleave · Nov 17, 2023 · 8:15 PM UTC

Adam Gleave

@ARGleave

17 Nov 2023

Could not agree more. Giving general-purpose AI a free pass while regulating narrow applications of it makes no sense. Makes as much sense as deregulating nuclear power plants while regulating every consumer of nuclear energy. Frontier developers have to take safety into account!

AI Now Institute @AINowInstitute

17 Nov 2023

As lobbying around regulating foundation models in the EU AIA intensifies, we underscore this statement signed by 50+ experts arguing general purpose AI–including foundation models–must be regulated across the entire supply chain. ainowinstitute.org/publicati…

5,143

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

We train our adversary 😈 using an AlphaZero-style training process similar to KataGo. The key difference is our adversary plays against a frozen KataGo 🖥️ victim. We also use the 🖥️ victim network to select victim moves during the adversary's 😈 search.

Adam Gleave · Aug 23, 2024 · 4:12 AM UTC

Adam Gleave

@ARGleave

23 Aug 2024

Anthropic says SB 1047's "benefits likely outweigh its costs". Like: transparent safety & security protocols, liability, shifting incentives. Dislike: pre-harm enforcement. Recommend: executive restraint to limit SB 1047 enforcement to catastrophic risks.

Jack Clark

@jackclarkSF

22 Aug 2024

Here's a letter we sent to Governor Newsom about SB 1047. This isn't an endorsement but rather a view of the costs and benefits of the bill. cdn.sanity.io/files/4zrzovbb…

4,510

Adam Gleave · May 2, 2024 · 4:52 PM UTC

Adam Gleave

@ARGleave

2 May 2024

Mechanistic Interpretability has exploded as a field in the last few years; I remember when it was mostly just @ch402 banging the drum! Exciting to see it have a workshop, definitely necessary given the progress in the field. Thanks Neel & others for putting this together!

Neel Nanda

@NeelNanda5

2 May 2024

Announcing the first Mechanistic Interpretability workshop, held at ICML 2024! We have a fantastic speaker line-up @ch402 @JacobSteinhardt @davidbau @ghandeharioun, $1,750 in best paper prizes, and a lot of recent progress to discuss! Paper deadline: May 29, either 8 or 4 pages

4,592

Adam Gleave · Sep 17, 2024 · 6:29 PM UTC

Adam Gleave

@ARGleave

17 Sep 2024

Many of the problems being grappled with in AI alignment can be reduced to adversarial robustness. Carlini points out this is bad news -- a very capable robustness community has been failing to solve these for over 10 years. How can we avoid repeating these same mistakes?

FAR.AI

@farairesearch

17 Sep 2024

"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.

2,066

Adam Gleave · Oct 26, 2023 · 7:41 PM UTC

Adam Gleave

@ARGleave

26 Oct 2023

Interesting to see all SOTA models are highly sycophantic. Sycophancy #1 feature for predicting human preferences in a logistic regression, and authoritativeness #2 -- truthfulness is #5. Sometimes I fear we need better humans to align models! More reason for scalable oversight.

Ethan Perez

@EthanJPerez

25 Oct 2023

A bit late, but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models, and we were also able to more clearly point to human feedback as a probable part of the cause

2,973

Adam Gleave · May 23, 2024 · 8:44 AM UTC

Adam Gleave

@ARGleave

23 May 2024

Makes one wonder what triggered independent resignations within a few hours of each other.

Gretchen Krueger @GretchenMarina

22 May 2024

Replying to @GretchenMarina

I resigned a few hours before hearing the news about @ilyasut and @janleike, and I made my decision independently. I share their concerns. I also have additional and overlapping concerns.

1,969

Adam Gleave · Nov 1, 2024 · 7:33 PM UTC

Adam Gleave

@ARGleave

1 Nov 2024

I'm glad Meta had the foresight to prohibit military use in their usage policy, and am sure the PLA will refrain from using their models now they are aware of this infringement (clearly an innocent mistake -- I don't read AUPs either).

1,054

Adam Gleave · Sep 16, 2024 · 8:44 PM UTC

Adam Gleave

@ARGleave

16 Sep 2024

AI safety is a global public good: we must cooperate on safety even if competition on AI capabilities intensifies. I had the pleasure to spend last week working with an amazing group of AI scientists and policymakers to propose solutions to this challenge.

International Dialogues on AI Safety @ais_dialogues

16 Sep 2024

Leading computer scientists from around the world, including @Yoshua_Bengio, Andrew Yao, @yaqinzhang and Stuart Russell met last week and released their most urgent and ambitious call to action on AI Safety from this group yet.🧵

918

Adam Gleave · Feb 7, 2024 · 12:52 AM UTC

Adam Gleave

@ARGleave

7 Feb 2024

Excited to see Geoffrey step into this role at AISI. I had the pleasure of working with Geoffrey when I was at GDM, and he had the rare combination of having a great research vision and being an excellent mentor.

Geoffrey Irving

@geoffreyirving

5 Feb 2024

I am happy to announce that I will be joining the UK AI Safety Institute (AISI) soon as a Research Director. Over 2023 I have been very impressed with the progress made by the UK via the AI Safety Institute and AI Safety Summit, and I am excited to join the team!

1,465

Adam Gleave · Nov 22, 2023 · 10:17 PM UTC

Adam Gleave

@ARGleave

22 Nov 2023

We've seen benchmark after benchmark get smashed by SOTA language models, but often it's unclear to what extent this represents greater capabilities vs. benchmarks being more gameable than we expected. GPQA looks to be very hard and well-designed and will help us track progress!

david rein

@idavidrein

21 Nov 2023

🧵Announcing GPQA, a graduate-level “Google-proof” Q&A benchmark designed for scalable oversight! w/ @_julianmichael_, @sleepinyourhat GPQA is a dataset of *really hard* questions that PhDs with full access to Google can’t answer. Paper: arxiv.org/abs/2311.12022

2,840

Adam Gleave · Jul 3, 2022 · 3:11 AM UTC

Adam Gleave

@ARGleave

3 Jul 2022

FAR alignmentfund.org is hiring an operations manager to help scale the org up to 10x. The model is to build a top technical team that can be flexibly deployed to promising. If you're a generalist and excited to further AI alignment check out bit.ly/FAROps

FAR.AI: Frontier Alignment Research

FAR.AI is an AI safety research non-profit facilitating technical breakthroughs and fostering global collaboration.

far.ai

Adam Gleave · Aug 23, 2024 · 5:00 PM UTC

Adam Gleave

@ARGleave

23 Aug 2024

Massive respect to William & Daniel for publicly speaking out on these issues!

Shirin Ghaffary

@shiringhaffary

23 Aug 2024

New letter from 2 former OpenAI employees who have been advocating for whistleblower protections in AI on SB 1047 legislation: “Sam Altman, our former boss, has repeatedly called for AI regulation. Now, when actual regulation is on the table, he opposes it.”

1,221

Adam Gleave · May 29, 2024 · 9:55 AM UTC

Adam Gleave

@ARGleave

29 May 2024

Explosive claims that sama: 1) didn't disclose owning OpenAI startup fund while claiming to have no financial interest in OpenAI; 2) board found out about their largest product release (ChatGPT) from Twitter!; 3) lying to board members. Very concerning.

Bilawal Sidhu

@bilawalsidhu

28 May 2024

❗EXCLUSIVE: "We learned about ChatGPT on Twitter." What REALLY happened at OpenAI? Former board member Helen Toner breaks her silence with shocking new details about Sam Altman's firing. Hear the exclusive, untold story on The TED AI Show. Here's just a sneak peek:

1,547

Adam Gleave · May 24, 2025 · 7:09 PM UTC

Adam Gleave

@ARGleave

24 May 2025

Replying to @davidrobertson @irobotmckenzie

"If it's in the training data, then it's on the internet" would seem to be a fully general counterargument. Are you saying LLMs can never be more useful than Google?

2,266

Adam Gleave · Aug 29, 2024 · 3:30 PM UTC

Adam Gleave

@ARGleave

29 Aug 2024

Great to see US AISI starting to test models prior to deployment!

Jack Clark

@jackclarkSF

29 Aug 2024

Looking forward to doing a pre-deployment test on our next model with the US AISI! Third-party testing is a really important part of the AI ecosystem and it's been amazing to see governments stand up safety institutes to facilitate this. nist.gov/news-events/news/20…

1,323

Adam Gleave · Nov 21, 2022 · 5:03 AM UTC

Adam Gleave

@ARGleave

21 Nov 2022

I'm defending my dissertation Wed @ 11 am Pacific. If you're interested in hearing about my work on trustworthy ML (or asking me tough questions!) then feel free to join in person or on Zoom. events.berkeley.edu/index.ph…

Adam Gleave · Dec 18, 2023 · 8:50 PM UTC

Adam Gleave

@ARGleave

18 Dec 2023

It was great to learn what everyone has been up to in AI alignment the past year -- thanks in particular to all the speakers who contributed excellent content.

FAR.AI

@farairesearch

18 Dec 2023

🎉 Reflecting on a fantastic #NeurIPS2023 #AIAlignment Workshop! 🚀 🙌 149 attendees energized the main event 🌃 500+ at our Monday social 🧠 12 talks, 25 lightning talks 🔑 Keynote by Yoshua Bengio 🤔 What inspired you the most? Share your thoughts!

1,900

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

Note for Go players: 🖥️ KataGo was trained on Tromp-Taylor rules so we evaluate our attack using this too. Tromp-Taylor rules are ubiquitous in Computer Go as they can be automatically evaluated, whereas human rules typically require players to agree which stones are dead/alive.
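To make the "automatically evaluated" point concrete, here is a minimal sketch of Tromp-Taylor area scoring. The board encoding (a list of strings with 'B', 'W' and '.') and the function name are assumptions made for illustration and are not taken from KataGo or the paper's code; komi is ignored. Each stone counts for its owner, and an empty region counts for a colour only if it touches stones of that colour alone. No stones are ever removed as dead.

```python
from collections import deque


def tromp_taylor_score(board):
    """Return (black, white) area scores for a finished position.

    `board` is a hypothetical encoding: a list of equal-length strings
    using 'B' (black), 'W' (white) and '.' (empty). Komi is not applied.
    """
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                # Every stone on the board counts for its owner.
                score[cell] += 1
                continue
            if (r, c) in seen:
                continue
            # Flood-fill this empty region and record bordering colours.
            size, borders = 0, set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols:
                        neighbour = board[ny][nx]
                        if neighbour == ".":
                            if (ny, nx) not in seen:
                                seen.add((ny, nx))
                                queue.append((ny, nx))
                        else:
                            borders.add(neighbour)
            # The region scores only if it reaches exactly one colour.
            if len(borders) == 1:
                score[borders.pop()] += size
    return score["B"], score["W"]
```

For example, on the 3x3 board `["...", ".W.", "BBB"]` the lone white stone is not removed, so the empty region touches both colours and scores for neither player. Human rulesets would instead ask the players to agree whether that stone is dead.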

Adam Gleave · Nov 18, 2023 · 9:51 PM UTC

Adam Gleave

@ARGleave

18 Nov 2023

Counterintuitive verdict: If you were in the 1970s and wanted to increase the # of nuclear plants built 50 years from now, better to work on safety than on building out new nuclear plants or making them cheaper.

2,005

Adam Gleave · May 14, 2024 · 8:47 PM UTC

Adam Gleave

@ARGleave

14 May 2024

Safety guarantees need to keep pace with the growth in AI systems' capabilities. Getting any kind of guarantee out of deep learning is challenging -- so I'm excited to see work exploring alternative paradigms.

FAR.AI

@farairesearch

14 May 2024

🛡️State-of-the-art ML systems lack quantitative performance guarantees, limiting use in high-stakes domains. Towards Guaranteed Safe AI presents a framework for high-assurance safety in complex environments using a Safety Specification that is Verified against a World Model.

1,889

Adam Gleave · May 24, 2025 · 5:37 AM UTC

Adam Gleave

@ARGleave

24 May 2025

Replying to @bioshok3 @irobotmckenzie

Yep, with a bit of coaxing, but nothing hard. We're unsure on how big the uplift is as we're not chemical weapons experts, but from what we've seen it fact checks and goes beyond what's readily available on the internet.

3,201

Adam Gleave · Jun 25, 2020 · 9:13 PM UTC

Adam Gleave

@ARGleave

25 Jun 2020

How do you measure the distance between two reward functions? Our EPIC distance is invariant to reward shaping, can be approximated efficiently, and is predictive of policy training success and transfer! New paper with @MichaelD1729 @janleike et al. arxiv.org/abs/2006.13900

Adam Gleave · Jul 29, 2023 · 9:35 PM UTC

Adam Gleave

@ARGleave

29 Jul 2023

Excited to be moderating a panel on generalization, scaling and safety at #icml2023 with great panelists @sleepinyourhat @zacharylipton and @Maggiemakar. Look forward to seeing folks at 15:30 room 316 at Workshop on Spurious Correlations, Invariance and Stability.

6,232

Adam Gleave · May 26, 2024 · 9:55 PM UTC

Adam Gleave

@ARGleave

26 May 2024

Concerning new twist I haven't seen covered before: OpenAI coercing OSS community members to sign agreements restricting their rights and banning them from talking about it.

Stella Biderman @BlancheMinerva

26 May 2024

OpenAI is really good at coercing people into signing agreements and then banning them from talking about the agreement at all. I know many people in the OSS community that got bullied into signing such things as well, for example because they were the recipients of leaks.

3,697

Adam Gleave · May 20, 2024 · 2:59 PM UTC

Adam Gleave

@ARGleave

20 May 2024

Impressive progress by UK's AISI one year in -- and excited to see them opening an office in SF!

Ian Hogarth @soundboy

20 May 2024

1/ It’s been one year since I was appointed Chair of the UK AI Safety Institute. In this time, we’ve built one of the largest safety evaluation teams globally and are already conducting pre-deployment testing. This is our fourth progress report

6,699

Adam Gleave · Jun 6, 2024 · 7:35 PM UTC

Adam Gleave

@ARGleave

6 Jun 2024

As capabilities continue to accelerate, it's worth reflecting that capabilities don't guarantee robustness: even superhuman Go systems can be easily exploited. Since finding this we've been looking into ways to make Go systems more robust & hope to share results soon!

FAR.AI

@farairesearch

6 Jun 2024

ICYMI: Here’s highlights from our previous research on "Adversarial Policies Beat Superhuman Go AIs." We found that even seemingly superhuman AIs are still vulnerable to attacks. Stay tuned for new results coming soon! 🔗👇

1,147

Adam Gleave · May 17, 2024 · 8:58 PM UTC

Adam Gleave

@ARGleave

17 May 2024

Non-disparagement agreements do pop up in separation agreements. But the deal is usually "take this additional severance payment in exchange for non-disparagement", not "we'll take back all your vested equity which we didn't inform you of before".

575

Adam Gleave · Aug 22, 2024 · 6:40 AM UTC

Adam Gleave

@ARGleave

22 Aug 2024

Join me as our Head of Engineering to scale our engineering team 2x to solve the most pressing problems in AI safety!

FAR.AI

@farairesearch

21 Aug 2024

📣 FAR AI is hiring! We're seeking a Head of Engineering. Help lead and scale our engineering team, driving technical execution of AI safety research. Apply now! 🔗👇

1,204

Adam Gleave · Nov 2, 2022 · 6:31 PM UTC

Adam Gleave

@ARGleave

2 Nov 2022

Search makes the victim harder to exploit. Our attack gets a 99% win rate against a victim without search, but this drops to 54% when KataGo searches for 64 visits, making it as strong as a top-20 world pro. Our win rate drops to <10% when the victim has >128 visits.

Adam Gleave · May 9, 2025 · 5:00 AM UTC

Adam Gleave

@ARGleave

9 May 2025

Pleased to contribute to the Singapore Consensus on Global AI Safety Research Priorities: the key innovations needed to build trustworthy, reliable & secure AI. I particularly appreciated it outlining the many areas of mutual interest where companies and nations can share R&D.

1,841

Adam Gleave · Mar 30, 2024 · 2:26 AM UTC

Adam Gleave

@ARGleave

30 Mar 2024

Had a great time speaking to @labenz about testing AI models, open source's role in AI safety, vulnerabilities of superhuman Go & more on The Cognitive Revolution show!

Nathan Labenz

@labenz

29 Mar 2024

"bigger models are more robust, but unfortunately... there is a widening capabilities-robustness gap" Will defenses change the game? To answer this question, @ARGleave and the @farairesearch team are developing "Scaling Laws for Adversarial Robustness"

4,411

Adam Gleave · Aug 8, 2024 · 11:25 PM UTC

Adam Gleave

@ARGleave

8 Aug 2024

Congratulations to Zico on his new position! Zico & his group have done some of the very best work on LLM robustness. I'm excited to see him bring that expertise to one of the most important non-profit boards out there.

Zico Kolter

@zicokolter

8 Aug 2024

I'm excited to announce that I am joining the OpenAI Board of Directors. I'm looking forward to sharing my perspectives and expertise on AI safety and robustness to help guide the amazing work being done at OpenAI.

2,031

Adam Gleave · Aug 15, 2025 · 9:45 PM UTC

Adam Gleave

@ARGleave

15 Aug 2025

📢Seeking COO to help us make AI beneficial as we scale FAR.AI from 30->75 FTE in next 18 months. In 3 years we've found flaws in all leading AIs, launched the go-to events in AI safety, and delivered groundbreaking research. Looking forward to the next chapter!

FAR.AI: Frontier Alignment Research

FAR.AI is an AI safety research non-profit facilitating technical breakthroughs and fostering global collaboration.

far.ai

FAR.AI

@farairesearch

5 Aug 2025

Join FAR.AI! Seeking COO to lead operations as we 2x from ~30 to 75+ in 18 months. Oversee Finance, People, Business Ops, Compliance, Legal & Risk. Manage $10M+ budget. Berkeley/SF, $175-250k+, visa sponsorship. 7+ yrs mission-driven ops leadership req'd (no AI exp needed). 🔗👇

4,014

Adam Gleave · Feb 8, 2025 · 12:34 PM UTC

Adam Gleave

@ARGleave

8 Feb 2025

Replying to @ARGleave @AnthropicAI @OpenAI

For visibility @dodds_zac @sleepinyourhat @jayelmnop hopefully you can get some movement internally ^^

1,184

Adam Gleave · Jul 29, 2024 · 9:46 AM UTC

Adam Gleave

@ARGleave

29 Jul 2024

They'd cut the new regulatory agency. Enforcement would now happen solely through courts in a tort-style regime. Although I support SB1047, I've been sympathetic to political economy arguments against it, that we'll end up with a mass of ineffectual regulation (like in nuclear power). So I can see arguments for reducing the powers of that agency. IMO scrapping it entirely is short-sighted: at some point AI will be regulated, the earlier we build state capacity in AI the better informed and executed that regulation will be.

2,860

Adam Gleave · Aug 2, 2024 · 3:10 PM UTC

Adam Gleave

@ARGleave

2 Aug 2024

Impressive meta-analysis showing progress on purported safety benchmarks is highly correlated with underlying model capability. In some sense this is unsurprising: general capabilities help with a variety of tasks, including safety relevant ones. But the scope of potential harms also grows with capabilities -- so more capable models need to meet a higher safety standard to achieve the same relative risk, and that will require dedicated safety engineering. In that sense I'd argue these safety benchmarks might still be useful to measure safety progress -- you just need to control for model capability and/or have different safety thresholds depending on model capability.

Dan Hendrycks

@hendrycks

1 Aug 2024

Do AI safety benchmarks actually measure safety progress? We find ~50% do not, showing safety research is fairly dysfunctional. We hope this work replaces vague arguments with scientific analysis to determine if a line of research makes DL systems safer. arxiv.org/abs/2407.21792

2,622
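One simple way to "control for model capability", as suggested in the tweet above, is to regress each model's safety-benchmark score on its capability score and look at the residual. The sketch below is an illustrative assumption, not the method used in the linked paper: the function name and the single-number capability score are hypothetical, and it uses ordinary least squares with one predictor.

```python
def capability_adjusted(safety, capability):
    """Return safety scores with the linear effect of capability removed.

    Fits safety ~ intercept + slope * capability by ordinary least squares
    and returns the residuals. A positive residual means a model is safer
    than its capability level alone would predict. Both inputs are
    hypothetical per-model scores of equal length.
    """
    n = len(capability)
    mean_c = sum(capability) / n
    mean_s = sum(safety) / n
    var_c = sum((c - mean_c) ** 2 for c in capability)
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(capability, safety))
    slope = cov / var_c
    intercept = mean_s - slope * mean_c
    return [s - (intercept + slope * c) for c, s in zip(capability, safety)]
```

A benchmark whose residuals are all close to zero would be tracking capability almost entirely, while large residuals indicate variation in safety that capability does not explain.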

Adam Gleave · Sep 30, 2024 · 5:00 PM UTC

Adam Gleave

@ARGleave

30 Sep 2024

Alignment is not all you need! Enjoyed David's remarks on the limitations of alignment and the importance of governance.

FAR.AI

@farairesearch

30 Sep 2024

"If we perfectly solved alignment … I think that basically cuts our risk of extinction from AI maybe in half." – @DavidSKrueger discusses different models of AI risk beyond alignment at the Vienna Alignment Workshop.

1,241