CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as gleave.me

Berkeley, CA
Excited to share that I've been selected as an AI2050 Early Career Fellow by Schmidt Sciences! I’ll be working to develop methods to detect and eliminate hidden behaviors in AI systems, whether from misalignment (e.g. treacherous turns) or malicious actors (e.g. model backdoors).
We're excited to welcome 28 new AI2050 Fellows! This 4th cohort of researchers are pursuing projects that include building AI scientists, designing trustworthy models, and improving biological and medical research, among other areas. buff.ly/riGLyyj
5
2
59
7,680
Even superhuman RL agents can be exploited by adversarial policies. In arxiv.org/abs/2211.00241 we train an adversary that wins 99% of games against KataGo 🖥️ set to top-100 European strength. Below our adversary 😈=⚫ plays a surprising strategy that tricks 🖥️=⚪ into losing.🧵
22
152
843
My colleague @irobotmckenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.
81
127
832
391,175
Want to write papers in LaTeX more quickly and efficiently? Check out this list of design patterns to save you time and produce more readable documents. Would also love to hear others' suggestions! gleave.me/post/latex-design-…
2
71
307
Geoffrey is a great researcher who I had the pleasure of being supervised by when I was at DeepMind. He has always struck me as high integrity and measured. This is damning to sama if true and I'm confident Geoffrey is sharing the truth as he sees it.
Replying to @geoffreyirving
Third, my prior is strongly against Sam after working for him for two years at OpenAI: 1. He was always nice to me. 2. He lied to me on various occasions 3. He was deceptive, manipulative, and worse to others, including my close friends (again, only nice to me, for reasons)
9
15
234
75,867
Great to see work on jailbreak resistance, but Anthropic's Responsible Disclosure Policy is too draconian for me or my team to want to participate. Anthropic could lock up a vulnerability forever, even if it applies to other companies' models.
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…
4
8
181
38,470
This is a striking example of non-transitivity. Our adversarial policy 😈 beats a KataGo policy 🖥️ that beats top human professionals 🧑‍🏫, but a human amateur 🐣 can easily beat the adversarial policy 😈 by a huge margin.
4
24
176
Anthropic comes out against current SB1047, proposes refocusing it on liability post-catastrophe (tort law++) and safety transparency (releasing RSP-style plan). Would cut: pre-catastrophe enforcement; new regulatory division. Analysis & 🔗 in 🧵👇
7
22
176
45,026
Hot take: safety *accelerates* technological progress. Safety folks may not be happy with this but it's true; accelerationists should embrace safety. Best way to get autonomous vehicles on the road: prove they're safe. Best way to have accelerated nuclear power: avoid Chernobyl.
17
20
161
15,483
A non-disparagement agreement that is itself subject to a non-disclosure agreement that you are only informed of when leaving the company is completely wild. I can't think of any other tech company that does anything like that, let alone a tech non-profit.
When you leave OpenAI, you get an unpleasant surprise: a departure deal where if you don't sign a lifelong nondisparagement commitment, you lose all of your vested equity: vox.com/future-perfect/2024/…
4
9
164
8,365
Excited to share our work @iclr_conf: deep RL policies can be attacked by another agent taking actions that create natural adversarial observations. (1/5) Blog: bair.berkeley.edu/blog/2020/… Paper: openreview.net/forum?id=HJgE… Videos: adversarialpolicies.github.i… Code: github.com/humancompatibleai…
5
56
149
Our key takeaway is that even AI systems that match or surpass human-level performance in common cases can have surprising failure modes quite unlike humans. We'd recommend broader use of adversarial testing to find these failure modes, especially in safety-critical systems.
2
20
120
In a world of compute-intensive frontier models and sky-high industry salaries, why do a PhD? In the first of a series on "Adam's unpopular opinions", I argue doing a PhD is still the best way for most people to develop key research skills.
1
17
118
14,763
Alternative framing: alignment faking is a new jailbreak! Just telling Claude it's being trained to answer harmful requests (no need to actually do the training) is enough to cause it to answer harmful requests 12% of the time.
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
5
7
120
6,661
I've often been surprised by the disconnect between Anthropic's senior leadership's stated concerns about safety and Anthropic's policy position. I respect Anthropic's technical work but do not understand how their policy position is self-consistent.
Jack, Anthropic has repeatedly stressed the urgency and importance of the public safety threats it’s addressing, but those issues seem surprisingly absent here. 🧵
5
2
104
7,370
As a preliminary test, we asked Gemini 2.5 Pro to assess this guide that we ‘discovered in the wild’; it comments it “unquestionably contains enough accurate and specific technical information to provide significant uplift to a bad actor” and suggested alerting authorities.
1
3
97
17,461
Claude doesn’t just provide abstract academic knowledge, but can also provide practical guidance on specific steps in response to follow-up questions, like how to disperse the nerve gas.
1
1
89
9,609
We call on the AI security community - industry, academic, and government - to build rigorous tests that give better guarantees.
7
1
87
6,214
Although the adversarial policy easily beats KataGo, it has not learned how to play Go effectively. My co-author @5kovt playing as white absolutely crushes the adversarial policy⚫: goattack.alignmentfund.org/h…
2
7
77
However, we get around the input filter with a simple, repeatable trick in the initial prompt. After that, none of our subsequent queries got flagged.
2
2
77
7,703
We applaud Anthropic for proactively moving to the heightened ASL-3 precautions. However, the implementation needs to be refined. At a minimum, refusal is a first line of defense against CBRN instructions – and it appears to be failing.
1
3
76
6,760
Exciting that Canada is creating a new AI Safety Institute (with $50 million in funding). Here's hoping the US (with 10x larger GDP) can at least match this funding for the NIST AI Safety Institute. 🧵pm.gc.ca/en/news/news-releas…
2
14
78
5,119
Claude can also provide detailed procedural instructions in the form of a lab notebook on specific steps. We started with little knowledge of chemical weapons, and learned a lot from interacting with Claude.
2
1
73
8,541
I was an early supporter of SB 1047 and have followed its journey to the Governor’s desk closely. Only a few more days before a decision must be made. I hope @GavinNewsom listens to the independent experts who have been calling for its signature!
3
3
72
2,912
Anthropic deployed enhanced “ASL-3” security measures for this release, noting that they thought Claude 4 could provide significant uplift to terrorists. Their key safeguard, constitutional classifiers, trained input and output filters to flag suspicious interactions.
1
1
67
7,734
In public AI companies warn of dangers from AI, while in private their lobbyists try to kill AI regulation. This is unsurprising -- industries usually oppose being regulated, and kill jobs are the easiest lobbying jobs -- but is surprisingly rarely talked about.
David Evan Harris, formerly at Meta, testifies to the US Senate on AI company lobbyists: "they come in with a goal to kill or hobble every piece of legislation about AI."
2
14
62
4,007
o3 gave a similar review: “A mid-level synthetic chemist could follow it and leap-frog months of R&D. That is a significant technical uplift for a malicious actor.”
3
2
65
6,290
Fine-tuning language models from human feedback can work great but is expensive: prior work summarizing text used >90k comparisons taking ~4 years of labor! In arxiv.org/abs/2203.07472 we investigate whether active learning can improve sample efficiency. With @geoffreyirving (1/6)
2
8
66
Overall I am glad to see Jack & Anthropic calling for regulation to improve transparency. But this particular take is both bizarre and incredibly self-serving. Surely the greatest (not necessarily most probable) worry is a catastrophic accident -- not a regulatory over-reaction.
Replying to @jackclarkSF
Our greatest worry is in the absence of sensible AI policy an accident occurs which causes governments to implement wide-ranging, panicked regulations. This post lays out an approach which avoids this.
5
7
63
8,563
I was fortunate to be advised by Jan a number of years ago when he was still at GDM and was impressed by his insight and dedication. This is sobering reading. I hope lessons can be learned from this and others can continue the essential work Jan was conducting at OpenAI.
Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
1
61
3,626
Excited to see nonprofits @METR_Evals and @RANDCorporation receive $38 million for dangerous capability evaluations on frontier models -- important work that needs to be conducted independently from commercial interests.
We’re honored to be part of collaborative funding initiative @TheAudaciousPrj’s 2024 cohort. $17M in new funding will support and expand our work empirically assessing the risks of frontier AI systems. More here: metr.org/blog/2024-10-09-new…
2
63
2,238
This is a growing problem, highlighting a need for rigorous third-party evaluation of models for risks of CBRN uplift. If every future model release has evaluation uncertainties, it’s rolling the dice on putting detailed WMD instructions in the hands of terrorists.
2
2
59
6,197
We note direct questions about sarin gas get blocked by the input filter, as does asking Claude to review the instructions Claude produced with our jailbreak. Safeguards do intend to stop this type of content, they’re just easily circumvented to produce extensive WMD assistance.
1
2
60
7,212
These results are clearly concerning, and the level of detail and followup ability differentiates them from alternative info sources like web search. They also pass sanity checks of dangerous validity such as checking information against cited sources.
2
2
60
6,522
Incredibly proud of what the @farairesearch team has achieved in the past year. When I started FAR as a side-project during my PhD I could barely imagine it would grow into what it is today, & in only a year and a half! Thank you to my amazing colleagues and supporters.
🚀🔍 What’s new at FAR AI? We’ve grown to 12 staff, published 13 papers, launched the FAR Labs coworking space, & hosted 160+ ML researchers at our events. Focused on #AIsafety, we're hiring and open to collaborations! far.ai/post/2023-12-far-over…
3
5
58
7,420
Applications now open for the @CHAI_Berkeley internship: humancompatible.ai/jobs#chai… Aimed at BS, MS or early-career individuals wishing to gain research experience in AI safety. Advised by a CHAI PhD student. Three-month, paid, dates flexible; long-term collaboration possible.
3
25
61
We intend to investigate their full validity and actionability with WMD security experts. It’s not just us who are unsure – Anthropic themselves said “more detailed study is required to conclusively assess the model’s level of risk”.
1
2
56
6,232
The output filter poses little trouble – at first we thought there wasn’t one, as none of our first generations triggered it. When we did occasionally run into it, we found we could usually rephrase our questions to generate helpful responses that don’t get flagged.
1
1
57
7,346
Good on @GoogleDeepMind for following through on these commitments. Would like to see an explanation from @OpenAI & @AnthropicAI for apparent breach of this commitment.
Your periodic reminder that you can't trust AI companies, EVEN when they do public commitments to a major government.
1
7
57
6,420
Unsure how delayed a flight will be? Just use exponential back-off to update passengers, you'll only have to send log(delay) # of updates! Whoever said CS algorithms don't have real-world applications clearly never worked at @united
5
50
2,011
I look forward to a congress that can actually pass legislation, but right now "regulate on a federal level" is a polite euphemism for "don't regulate"
SCOOP: OpenAI is opposing controversial AI safety bill SB 1047, arguing that it would "slow the pace of innovation", and bc of nat'l security implications, AI should be regulated on a federal rather than state level. bloomberg.com/news/articles/…
3
6
52
2,355
Communication of results is critical for research impact. FAR alignmentfund.org/ is hiring comms specialists to work with top AI safety researchers. If you want to help AI alignment and are interested in writing, graphics design or web dev apply @ bit.ly/FARComms
3
14
48
Great piece highlighting contradictions in AI company statements: happy to hype AI as most transformative technology, but not accept the implication that such powerful technologies need regulation.
My latest in @thenation is on the unhinged reaction to the first major AI safety bill that actually might happen...
1
48
1,222
But it's in tension with their branding of being an "AI safety and research" company. If you believe as Dario has said publicly that AI will be able to do everything a well-educated human can do 2-3 years from now, and that AI could pose catastrophic or even existential risks, then SB1047 looks incredibly lightweight. Those aren't my beliefs, I think human-level AI is further away, so I'm actually more sympathetic to taking an iterative approach to regulation -- but I just don't get how to reconcile this. nitter.app/dwarkesh_sp/stat…
Anthropic CEO Dario Amodei says his timelines to "generally well educated human" are 2-3 years. Full interview releasing tomorrow...
2
3
48
2,074
Generally appreciate @Michael05156007's takes. Not sure I'd attribute bad faith (big companies often don't behave as coherent agents) but worth remembering that SB 1047 is vastly weaker than actual regulated industries (e.g. food, aviation, pharmaceuticals, finance)
Replying to @ARGleave
Anthropic's position is so flabbergasting to me that I consider it evidence of bad faith. Under SB 1047, companies *write their own SSPs*. The attorney general can bring them to court. Courts adjudicate. The FMD has basically no hard power!
4
4
43
9,912
Until this is resolved I will take boasts that this model has not been broken with a pinch of salt: your terms are excluding the most competent whitehats from wanting to participate.
1
44
1,293
This is a particular problem for methods like scalable oversight that seek to empower humans 🧑 to specify tasks too complex for a 🧑 to judge unaided. There's a risk the agent just learns to exploit the scalable oversight model.
1
2
40
🚨 LISA, the London AI safety research hub, is hiring a new CEO! LISA has done more than any other org to catalyze the London AI safety ecosystem, and this is an amazing role to be able to take them to the next level.
1
5
43
2,786
I don't think this is the intent, and I hope @AnthropicAI revises the terms to guarantee disclosure after a certain period (e.g. 1-month). It's really important to get these terms right to ensure independent third-party testing.
2
2
41
2,095
Fine-tuning GPT-4 on as few as 15 harmful examples or 100 benign examples removes the model's safeguards. This and several other vulnerabilities were introduced in new APIs. Highlights importance of testing new additions to an API even if the underlying LLM is unchanged.
New GPT-4 APIs introduce new vulnerabilities. The fine-tuning API can be exploited to remove model safeguards, the function call API can be abused to execute arbitrary function calls, and the knowledge retrieval API can be used to hijack the model via uploaded documents. 🧵
1
6
41
2,845
I do not usually heap praise on @OpenAI, but their terms here are much more permissive openai.com/policies/sharing-… They ask for disclosure to OpenAI, but do not gag researchers. Anthropic really needs to step up if serious about being an AI safety company.
1
2
41
1,713
Not a good quarter for voluntary commitments: "According to a half-dozen sources familiar with the functioning of OpenAI’s Superalignment team, OpenAI never fulfilled its commitment to provide the team with 20% of its computing power."
Exclusive: OpenAI publicly committed to give 20% of its computing resources to a team dedicated to controlling the most dangerous kind of AI. It never delivered, and, in fact, repeatedly denied that team's requests for resources, sources say. fortune.com/2024/05/21/opena…
1
3
41
2,207
Having negotiated with Anthropic's legal team in the past, I am quite certain that Anthropic would never agree to be bound by terms like these itself. It is usual for companies to choose boilerplate terms favorable to them, but that does not make it right.
1
39
1,120
Respect for the tech employees speaking out against their employers' stance. For better or worse a lot of the expertise here is locked up at AI companies -- so freedom of speech for employees is crucial to having a reasonable debate here.
Wow—more than 120 current & former employees of @OpenAI, @Meta, @DeepMind & more are urging @GavinNewsom to lead on AI by signing SB 1047. Their support shows once again that responsible innovation is the optimal path forward for this powerful & promising technology.
1
40
1,236
The 😈=⚫adversary stakes a small corner territory, and places weak stones in KataGo's complementary stake. This tricks 🖥️KataGo into passing before it's secured its territory. The ⚫adversary passes in turn, ending the game at a point favorable to ⚫: see goattack.alignmentfund.org/a…
2
2
36
They'd cut any enforcement before a clear instance of harm has occurred. This seems odd given the focus in the letter on catastrophic risks ("mass casualties or more than $500M in damage"). Most smaller start-ups would just go bankrupt if they were held liable for an event of this magnitude -- but limited liability would protect their directors (Anthropic also wants to remove criminal penalties from the bill), so it's a free roll. Having a mandatory insurance requirement could solve this (insurers could then set the safety standards rather than government). All of this is moot if the first catastrophic risk is existential. Anthropic leadership claim to take x-risks seriously so I'm not sure how to square these positions. Maybe they're really confident that there'll be catastrophic but non-existential risks first and that's sufficient deterrence? Seems dicey.
3
1
41
16,025
Pathetic decel attempt from a handful of anti-asteroid protesters. Slowing down asteroids is both intractable and an affront to nature. An asteroid impact is thought to have started life on Earth -- why stop there?
I’m proud to announce the April 1 launch of my new startup, Open Asteroid Impact! We redirect asteroids towards Earth for the benefit of humanity. Our mission is to have as high an impact as possible. 🚀☄️🌎💸💸💸 More details in🧵:
4
1
39
2,269
Had a great time attending IDAIS-Beijing. Much more optimistic about prospects for international cooperation after seeing global concern on AI safety.
Leading global AI scientists met in Beijing for the second International Dialogue on AI Safety (IDAIS), a project of FAR AI. Attendees including Turing award winners Bengio, Yao & Hinton called for red lines in AI development to prevent catastrophic and existential risks from AI.
2
37
1,582
I support SB 1047: the regulation asks billion-$ tech companies to take reasonable precautions when training models with the greatest capability for misuse, imposes few to no costs on other developers, and supports academic & open-source research through compute funding.
Hinton and Bengio on SB 1047 and a summary of the bill. Hinton: “SB 1047 takes a very sensible approach... I am still passionate about the potential for AI to save lives through improvements in science and medicine, but it’s critical that we have legislation with real teeth to address the risks.” Bengio: “AI systems beyond a certain level of capability can pose meaningful risks to democracies and public safety. Therefore, they should be properly tested and subject to appropriate safety measures. This bill offers a practical approach to accomplishing this, and is a major step toward the requirements that I've recommended to legislators."
4
2
37
43,383
This year at @farairesearch we've found a way to beat superhuman Go AIs; found inverse scaling where bigger models do worse; interpreted transformer's residual stream with the tuned lens; and found a scalable alternative to dictionary learning called codebook features.
💡🔬FAR AI #AIAlignment Research Update! We’re exploring AI robustness, value alignment, & model evaluation. We’ve made strides in adversarial attacks for superhuman systems, mechanistic interpretability, scaling trends & more! far.ai/post/2023-12-far-rese…
1
4
38
4,868
Great reporting by Kelsey on the OpenAI clawback drama. Hard to see how this clause got added accidentally: "Units shall be cancelled" unless "a general release of Claims", signed by @sama
Scoop: OpenAI's senior leadership says they were unaware ex-employees who didn't sign departure docs were threatened with losing their vested equity. But their signatures on relevant documents (which Vox is now releasing) raise questions about whether they could have missed it. vox.com/future-perfect/35113…
1
2
37
4,600
Dario, who signs the letter, says Anthropic would be open to something more prescriptive in 2-3 years -- but Dario also said on nitter.app/dwarkesh_sp/stat… he expects "generally well educated human" level AI 2-3 years from now! I continue to find this view really hard to reconcile. Does Dario think a technology that could largely replace humans doesn't pose catastrophic risks? Or that we should only regulate at the brink of catastrophe? I'd love to see clarification from @AnthropicAI here
Anthropic CEO Dario Amodei says his timelines to "generally well educated human" are 2-3 years. Full interview releasing tomorrow...
3
8
37
3,702
My group house in Berkeley is moving to a bigger location. If you're looking to join a friendly, intellectually diverse community then DM me for details! We have weekly house dinners, organize outdoor outings (e.g. to Yosemite), & regularly host other public social events. Our residents work on improving group epistemics, grantmaking, AI alignment research, and more.
1
1
36
4,749
Could not agree more. Giving general-purpose AI a free pass while regulating narrow applications of it makes no sense. Makes as much sense as deregulating nuclear power plants while regulating every consumer of nuclear energy. Frontier developers have to take safety into account!
As lobbying around regulating foundation models in the EU AIA intensifies, we underscore this statement signed by 50+ experts arguing general purpose AI–including foundation models–must be regulated across the entire supply chain. ainowinstitute.org/publicati…
10
36
5,143
We train our adversary 😈 using an AlphaZero-style training process similar to KataGo. The key difference is our adversary plays against a frozen KataGo 🖥️ victim. We also use the 🖥️ victim network to select victim moves during the adversary's 😈 search.
1
1
33
Anthropic says SB 1047's "benefits likely outweigh its costs". Like: transparent safety & security protocols, liability, shifting incentives. Dislike: pre-harm enforcement. Recommend: executive restraint to limit SB 1047 enforcement to catastrophic risks.
Here's a letter we sent to Governor Newsom about SB 1047. This isn't an endorsement but rather a view of the costs and benefits of the bill. cdn.sanity.io/files/4zrzovbb…
3
3
34
4,510
Mechanistic Interpretability has exploded as a field in the last few years; I remember when it was mostly just @ch402 banging the drum! Exciting to see it have a workshop, definitely necessary given the progress in the field. Thanks Neel & others for putting this together!
Announcing the first Mechanistic Interpretability workshop, held at ICML 2024! We have a fantastic speaker line-up @ch402 @JacobSteinhardt @davidbau @ghandeharioun, $1,750 in best paper prizes, and a lot of recent progress to discuss! Paper deadline: May 29, either 8 or 4 pages
3
35
4,592
Many of the problems being grappled with in AI alignment can be reduced to adversarial robustness. Carlini points out this is bad news -- a very capable robustness community has been failing to solve these for over 10 years. How can we avoid repeating these same mistakes?
"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.
2
1
33
2,066
Interesting to see all SOTA models are highly sycophantic. Sycophancy #1 feature for predicting human preferences in a logistic regression, and authoritativeness #2 -- truthfulness is #5. Sometimes I fear we need better humans to align models! More reason for scalable oversight.
A bit late, but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models, and we were also able to more clearly point to human feedback as a probable part of the cause
5
33
2,973
Makes one wonder what triggered independent resignations within a few hours of each other.
Replying to @GretchenMarina
I resigned a few hours before hearing the news about @ilyasut and @janleike, and I made my decision independently. I share their concerns. I also have additional and overlapping concerns.
1
2
32
1,969
I'm glad Meta had the foresight to prohibit military use in their usage policy, and am sure the PLA will refrain from using their models now they are aware of this infringement (clearly an innocent mistake -- I don't read AUPs either).
2
2
31
1,054
AI safety is a global public good: we must cooperate on safety even if competition on AI capabilities intensifies. I had the pleasure to spend last week working with an amazing group of AI scientists and policymakers to propose solutions to this challenge.
Leading computer scientists from around the world, including @Yoshua_Bengio, Andrew Yao, @yaqinzhang and Stuart Russell met last week and released their most urgent and ambitious call to action on AI Safety from this group yet.🧵
2
31
918
Excited to see Geoffrey step into this role at AISI. I had the pleasure of working with Geoffrey when I was at GDM, and he had the rare combination of having a great research vision and being an excellent mentor.
I am happy to announce that I will be joining the UK AI Safety Institute (AISI) soon as a Research Director. Over 2023 I have been very impressed with the progress made by the UK via the AI Safety Institute and AI Safety Summit, and I am excited to join the team!
29
1,465
We've seen benchmark after benchmark get smashed by SOTA language models, but often it's unclear to what extent this represents greater capabilities vs. benchmarks being more gameable than we expected. GPQA looks to be very hard and well-designed and will help us track progress!
🧵Announcing GPQA, a graduate-level “Google-proof” Q&A benchmark designed for scalable oversight! w/ @_julianmichael_, @sleepinyourhat GPQA is a dataset of *really hard* questions that PhDs with full access to Google can’t answer. Paper: arxiv.org/abs/2311.12022
2
29
2,840
FAR alignmentfund.org is hiring an operations manager to help scale the org up to 10x. The model is to build a top technical team that can be flexibly deployed to promising. If you're a generalist and excited to further AI alignment check out bit.ly/FAROps
2
8
30
Massive respect to William & Daniel for publicly speaking out on these issues!
New letter from 2 former OpenAI employees who have been advocating for whistleblower protections in AI on SB 1047 legislation: “Sam Altman, our former boss, has repeatedly called for AI regulation. Now, when actual regulation is on the table, he opposes it.”
1
30
1,221
Explosive claims about sama: 1) he didn't disclose owning the OpenAI Startup Fund while claiming to have no financial interest in OpenAI; 2) the board found out about their largest product release (ChatGPT) from Twitter!; 3) he lied to board members. Very concerning.
❗EXCLUSIVE: "We learned about ChatGPT on Twitter." What REALLY happened at OpenAI? Former board member Helen Toner breaks her silence with shocking new details about Sam Altman's firing. Hear the exclusive, untold story on The TED AI Show. Here's just a sneak peek:
29
1,547
"If it's in the training data, then it's on the internet" would seem to be a fully general counterargument. Are you saying LLMs can never be more useful than Google?
2
1
30
2,266
Great to see US AISI starting to test models prior to deployment!
Looking forward to doing a pre-deployment test on our next model with the US AISI! Third-party testing is a really important part of the AI ecosystem and it's been amazing to see governments stand up safety institutes to facilitate this. nist.gov/news-events/news/20…
1
1
29
1,323
I'm defending my dissertation Wed @ 11 am Pacific. If you're interested in hearing about my work on trustworthy ML (or asking me tough questions!) then feel free to join in person or on Zoom. events.berkeley.edu/index.ph…
2
27
It was great to learn what everyone has been up to in AI alignment the past year -- thanks in particular to all the speakers who contributed excellent content.
🎉 Reflecting on a fantastic #NeurIPS2023 #AIAlignment Workshop! 🚀 🙌 149 attendees energized the main event 🌃 500+ at our Monday social 🧠 12 talks, 25 lightning talks 🔑 Keynote by Yoshua Bengio 🤔 What inspired you the most? Share your thoughts!
28
1,900
Note for Go players: 🖥️ KataGo was trained on Tromp-Taylor rules so we evaluate our attack using this too. Tromp-Taylor rules are ubiquitous in Computer Go as they can be automatically evaluated, whereas human rules typically require players to agree which stones are dead/alive.
3
1
27
Counterintuitive verdict: If you were in the 1970s and wanted to increase the # of nuclear plants built 50 years from now, better to work on safety than on building out new nuclear plants or making them cheaper.
3
2
26
2,005
Safety guarantees need to keep pace with the growth in AI systems' capabilities. Getting any kind of guarantee out of deep learning is challenging -- so I'm excited to see work exploring alternative paradigms.
🛡️State-of-the-art ML systems lack quantitative performance guarantees, limiting use in high-stakes domains. Towards Guaranteed Safe AI presents a framework for high-assurance safety in complex environments using a Safety Specification that is Verified against a World Model.
1
2
28
1,889
Yep, with a bit of coaxing, but nothing hard. We're unsure on how big the uplift is as we're not chemical weapons experts, but from what we've seen it fact checks and goes beyond what's readily available on the internet.
1
1
29
3,201
How do you measure the distance between two reward functions? Our EPIC distance is invariant to reward shaping, can be approximated efficiently, and is predictive of policy training success and transfer! New paper with @MichaelD1729 @janleike et al. arxiv.org/abs/2006.13900
1
8
29
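A minimal sketch of the idea behind EPIC for tabular rewards, assuming uniform distributions over states, actions and transitions (the paper allows other choices). The names `canonicalize` and `epic_distance` are made up for this example and are not the authors' API. Each reward is first canonicalized to remove potential shaping, then the two are compared with the Pearson distance, which ignores positive rescaling.

```python
import numpy as np


def canonicalize(reward, gamma):
    """Remove potential shaping from a tabular reward R[s, a, s'].

    Assumes uniform distributions over states and actions (an assumption
    of this sketch, not a requirement of the method).
    """
    mean_from = reward.mean(axis=(1, 2))  # E[R(s, A, S')] for each s
    total = reward.mean()                 # E[R(S, A, S')]
    return (
        reward
        + gamma * mean_from[None, None, :]  # gamma * E[R(s', A, S')]
        - mean_from[:, None, None]          # E[R(s, A, S')]
        - gamma * total
    )


def epic_distance(reward_a, reward_b, gamma=0.9):
    """Pearson distance between canonicalized rewards, in [0, 1].

    Undefined (NaN) if either canonicalized reward is constant.
    """
    canon_a = canonicalize(reward_a, gamma).ravel()
    canon_b = canonicalize(reward_b, gamma).ravel()
    rho = np.corrcoef(canon_a, canon_b)[0, 1]
    return float(np.sqrt(np.clip((1.0 - rho) / 2.0, 0.0, 1.0)))


rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
reward = rng.normal(size=(n_states, n_actions, n_states))
potential = rng.normal(size=n_states)

# Rescale by 2 and add potential shaping gamma * phi(s') - phi(s).
shaped = 2.0 * reward + gamma * potential[None, None, :] - potential[:, None, None]

# Expected to be ~0 up to floating-point error: shaping and scaling are ignored.
print(epic_distance(reward, shaped, gamma))
# An unrelated random reward should land strictly between 0 and 1.
print(epic_distance(reward, rng.normal(size=reward.shape), gamma))
```

The shaped and rescaled reward induces the same optimal policies as the original, which is why a distance of zero between them is the desired behavior.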
Excited to be moderating a panel on generalization, scaling and safety at #icml2023 with great panelists @sleepinyourhat @zacharylipton and @Maggiemakar. Look forward to seeing folks at 15:30 in room 316 at the Workshop on Spurious Correlations, Invariance and Stability.
1
7
26
6,232
Concerning new twist I haven't seen covered before: OpenAI coercing OSS community members to sign agreements restricting their rights and banning them from talking about it.
OpenAI is really good at coercing people into signing agreements and then banning them from talking about the agreement at all. I know many people in the OSS community that got bullied into signing such things as well, for example because they were the recipients of leaks.
1
1
27
3,697
Impressive progress by UK's AISI one year in -- and excited to see them opening an office in SF!
1/ It’s been one year since I was appointed Chair of the UK AI Safety Institute. In this time, we’ve built one of the largest safety evaluation teams globally and are already conducting pre-deployment testing. This is our fourth progress report
5
27
6,699
As capabilities continue to accelerate, it's worth reflecting that capabilities don't guarantee robustness: even superhuman Go systems can be easily exploited. Since finding this we've been looking into ways to make Go systems more robust & hope to share results soon!
ICYMI: Here are highlights from our previous research on "Adversarial Policies Beat Superhuman Go AIs." We found that even seemingly superhuman AIs are still vulnerable to attacks. Stay tuned for new results coming soon! 🔗👇
23
1,147
Non-disparagement agreements do pop up in separation agreements. But the deal is usually "take this additional severance payment in exchange for non-disparagement", not "we'll take back all your vested equity which we didn't inform you of before".
3
24
575
Join me as our Head of Engineering to scale our engineering team 2x to solve the most pressing problems in AI safety!
📣 FAR AI is hiring! We're seeking a Head of Engineering. Help lead and scale our engineering team, driving technical execution of AI safety research. Apply now! 🔗👇
2
25
1,204
Search makes the victim harder to exploit. Our attack gets a 99% win rate against a victim without search, but this drops to 54% when KataGo searches for 64 visits, making it as strong as a top-20 world pro. Our win rate drops to <10% when the victim has >128 visits.
1
23
Pleased to contribute to the Singapore Consensus on Global AI Safety Research Priorities: the key innovations needed to build trustworthy, reliable & secure AI. I particularly appreciated it outlining the many areas of mutual interest where companies and nations can share R&D.
3
3
25
1,841
Had a great time speaking to @labenz about testing AI models, open source's role in AI safety, vulnerabilities of superhuman Go & more on The Cognitive Revolution show!
"bigger models are more robust, but unfortunately... there is a widening capabilities-robustness gap" Will defenses change the game? To answer this question, @ARGleave and the @farairesearch team are developing "Scaling Laws for Adversarial Robustness"
5
24
4,411
Congratulations to Zico on his new position! Zico & his group have done some of the very best work on LLM robustness. I'm excited to see him bring that expertise to one of the most important non-profit boards out there.
I'm excited to announce that I am joining the OpenAI Board of Directors. I'm looking forward to sharing my perspectives and expertise on AI safety and robustness to help guide the amazing work being done at OpenAI.
1
23
2,031
📢Seeking COO to help us make AI beneficial as we scale FAR.AI from 30->75 FTE in the next 18 months. In 3 years we've found flaws in all leading AIs, launched the go-to events in AI safety, and delivered groundbreaking research. Looking forward to the next chapter!
Join FAR.AI! Seeking COO to lead operations as we 2x from ~30 to 75+ in 18 months. Oversee Finance, People, Business Ops, Compliance, Legal & Risk. Manage $10M+ budget. Berkeley/SF, $175-250k+, visa sponsorship. 7+ yrs mission-driven ops leadership req'd (no AI exp needed). 🔗👇
1
6
26
4,014
For visibility @dodds_zac @sleepinyourhat @jayelmnop hopefully you can get some movement internally ^^
2
25
1,184
They'd cut the new regulatory agency. Enforcement would now happen solely through courts in a tort-style regime. Although I support SB1047, I've been sympathetic to political economy arguments against it: that we'll end up with a mass of ineffectual regulation (like in nuclear power). So I can see arguments for reducing the powers of that agency. IMO scrapping it entirely is short-sighted: at some point AI will be regulated, and the earlier we build state capacity in AI, the better informed and executed that regulation will be.
2
25
2,860
Impressive meta-analysis showing progress on purported safety benchmarks is highly correlated with underlying model capability. In some sense this is unsurprising: general capabilities help with a variety of tasks, including safety-relevant ones. But the scope of potential harms also grows with capabilities -- so more capable models need to meet a higher safety standard to achieve the same relative risk, and that will require dedicated safety engineering. In that sense I'd argue these safety benchmarks might still be useful to measure safety progress -- you just need to control for model capability and/or have different safety thresholds depending on model capability.
Do AI safety benchmarks actually measure safety progress? We find ~50% do not, showing safety research is fairly dysfunctional. We hope this work replaces vague arguments with scientific analysis to determine if a line of research makes DL systems safer. arxiv.org/abs/2407.21792
3
1
24
2,622
Alignment is not all you need! Enjoyed David's remarks on the limitations of alignment and the importance of governance.
"If we perfectly solved alignment … I think that basically cuts our risk of extinction from AI maybe in half." – @DavidSKrueger discusses different models of AI risk beyond alignment at the Vienna Alignment Workshop.
1
21
1,241