Some personal news: I am starting a new research project at Anthropic. Very excited about this!
Many things are needed to make AGI go well, and alignment is only one of them. More on this soon…
I'm excited to join @AnthropicAI to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
To all OpenAI employees, I want to say:
Learn to feel the AGI.
Act with the gravitas appropriate for what you're building.
I believe you can "ship" the cultural change that's needed.
I am counting on you.
The world is counting on you.
:openai-heart:
We challenge you to break our new jailbreaking defense!
There are 8 levels. Can you find a single jailbreak to beat them all?
claude.ai/constitutional-cla…
I joined because I thought OpenAI would be the best place in the world to do this research.
However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
Building smarter-than-human machines is an inherently dangerous endeavor.
OpenAI is shouldering an enormous responsibility on behalf of all of humanity.
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
We are long overdue in getting incredibly serious about the implications of AGI.
We must prioritize preparing for them as best we can.
Only then can we ensure AGI benefits all of humanity.
With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English.
We still don't know why.
I wish someone would figure this out.
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Results of our jailbreaking challenge:
After 5 days, >300,000 messages, and est. 3,700 collective hours our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We’re paying $55k in total to the winners.
Thanks to everyone who participated!
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.
We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
[Image: Title card for the paper entitled "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming"]
Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
Super excited about our new research direction for aligning smarter-than-human AI:
We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors.
Check out our new paper:
openai.com/research/weak-to-…
After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels.
However, a universal jailbreak has yet to be found...
OpenAI's transition to a for-profit seemed inevitable given that all of its competitors are, but it's pretty disappointing that "ensure AGI benefits all of humanity" gave way to a much less ambitious "charitable initiatives in sectors such as health care, education, and science"
It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight on LLMs, pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger!
It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.
I like the new Sonnet. I'm frequently asking it to explain ML papers to me. Doesn't always get everything right, but probably better than my skim reading, and way faster.
Automated alignment research is getting closer...
It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet
Very exciting that this is out now (from my time at OpenAI):
We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise.
A promising sign for scalable oversight!
openai.com/index/finding-gpt…
I love my team.
I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team.
OpenAI has so much exceptionally smart, kind, and effective talent.
Super exciting robustness result:
We built a system that defends against universal jailbreaks!
It has minimal increase in refusal rate and moderate inference cost.
Our new goal is to solve alignment of superintelligence within the next 4 years.
OpenAI is committing 20% of its compute to date towards this goal.
Join us in researching how to best spend this compute to solve the problem!
openai.com/blog/introducing-…
Bad news for AI safety:
To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so?
This is quite immature technology and we don't understand how it works.
If we're not careful we're setting ourselves up for a lot of correlated failures.
The names for "precision" and "recall" seem so unintuitive to me, I have probably opened the Wikipedia article for them dozens of times.
Does anyone know a good mnemonic for them?
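For reference, a minimal sketch of the two definitions in Python. The function name and the toy labels are made up for illustration. Precision asks "of everything I flagged as positive, how much was actually positive?", and recall asks "of everything that was actually positive, how much did I flag?".

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from two equal-length lists of booleans.

    predicted[i] is True if item i was flagged as positive,
    actual[i] is True if item i really is positive.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # flagged and truly positive
    fp = sum(1 for p, a in pairs if p and not a)    # flagged but actually negative
    fn = sum(1 for p, a in pairs if a and not p)    # truly positive but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 2 items flagged, 3 items truly positive.
predicted = [True, True, False, False, False]
actual = [True, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Here one of the two flagged items is correct (precision 1/2), and one of the three true positives was found (recall 1/3).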
Really exciting new work on automated interpretability:
We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations.
openai.com/research/language…
So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy
I'm very excited that today OpenAI adopts its new preparedness framework!
This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.
openai.com/safety/preparedne…
Extremely exciting alignment research milestone:
Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions.
openai.com/blog/instruction-…
New Anthropic research: Alignment faking in large language models.
In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
[Image: “Alignment faking in large language models” by Greenblatt et al.]
This is one of the craziest plots I have ever seen.
World GDP follows a power law that holds over many orders of magnitude and extrapolates to infinity (!) by 2047.
Clearly this trend can't continue forever. But whatever happens, the next 25 years are going to be pretty nuts.
Reinforcement learning from human feedback won't scale.
It fundamentally assumes that humans can evaluate what the AI system is doing.
This will not be true once AI becomes smarter than humans.
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page.
According to our server records, no one has jailbroken more than 3 levels so far.
This is super cool work! Sparse autoencoders are the currently most promising approach to actually understanding how models "think" internally.
This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised.
A big step forward!
Excited to share what I've been working on as part of the former Superalignment team!
We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.
Another Superalignment paper from my time at OpenAI:
We train large models to write solutions such that smaller models can better check them. This makes them easier to check for humans, too.
openai.com/index/prover-veri…
Apply to join the Anthropic Fellows Program!
This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems.
👇
alignment.anthropic.com/2024…
How will we solve the alignment problem for AGI?
I've been working on this question for almost 10 years now.
Our current path is very promising:
openai.com/blog/our-approach…
1/
I call upon Governor @GavinNewsom to not veto SB 1047.
The bill is a meaningful step forward for AI safety regulation, with no better alternatives in sight.
If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.
This is the most important plot of alignment lore:
Whenever you optimize a proxy, you make progress on your true objective for a while.
At some point you start overoptimizing and do worse on your true objective (hard to know when).
This applies to all proxy measures ever.
[Image: plot of utility vs. amount of optimization; proxy utility keeps increasing, but true utility is an upside-down U shape]
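A toy numerical sketch of that shape, in Python. The functional forms below are arbitrary assumptions chosen only to reproduce the picture (a proxy that keeps rising and a true utility that peaks and then falls); they are not taken from any paper or dataset.

```python
def proxy_utility(x):
    """Assumed proxy: keeps increasing with the amount of optimization x."""
    return x


def true_utility(x):
    """Assumed true objective: tracks the proxy at first, then degrades."""
    return x - 0.1 * x * x


# Sweep the amount of optimization from 0.0 to 10.0 in steps of 0.5.
steps = [i * 0.5 for i in range(21)]
best = max(steps, key=true_utility)

for x in steps[::4]:
    print(f"x={x:4.1f}  proxy={proxy_utility(x):5.2f}  true={true_utility(x):5.2f}")
print("true utility peaks at x =", best)
```

In this toy the proxy is still improving at x = 10 while the true utility is back to where it started, and nothing in the proxy values alone reveals that the peak was passed at x = 5.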
Could we spot a misaligned model in the wild?
To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment.
3/4 teams succeeded, 1 of them after only 90 min
New Anthropic research: Auditing Language Models for Hidden Objectives.
We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
[Image: “Auditing Language Models for Hidden Objectives” by Marks et al.]
We're distributing $1e7 in grants for research on making superhuman models safer and more aligned.
If you've always wanted to work on this, now is your time!
Apply by Feb 18:
openai.com/blog/superalignme…
An important test for humanity will be whether we can collectively decide not to open source LLMs that can reliably survive and spread on their own.
Once spreading, LLMs will get up to all kinds of crime, it'll be hard to catch all copies, and we'll fight over who's responsible
I don't think the comparison between RLHF and RL on go really makes sense this way.
You don’t need RLHF to train AI to play go because there is a highly reliable procedural reward function that looks at the board state and decides who won. If you didn’t have this procedural reward function, RLHF _would_ make sense here; but the way you’d want to use it is to show final board configurations to a human and ask them who won (this way you’d leverage the human's generator-discriminator gap). Then you use RL to train your AI system to reach the winning board states. This is analogous to the way we train LLMs with RLHF: typically we show only complete assistant responses to humans for evaluation, not partial responses.
If you were training AlphaGo in the way you describe, I’d call this process supervision (instead of outcome supervision): you’re giving feedback on _how_ your AI is playing go, not just the outcome of the game. Some alignment researchers advocate for process supervision because they hypothesize it’s safer because you won’t get crazy moves that humans wouldn’t endorse (e.g. no move 37), and so your AI system is more likely to stay clear of unsafe states. This isn’t relevant for go because there are no unsafe board states, and so there is no reason not to let your go AI explore wherever. It’s an important open question whether and how much less competitive process supervision is compared to outcome supervision (again, no move 37), and I personally am skeptical for the reasons you outline. But note that process supervision can also perform better when the task is hard for AI because it helps overcome the exploration problem (similar to demonstrations).
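To make the distinction concrete, here is a minimal Python sketch of how the two kinds of feedback attach to a trajectory. The function names and scores are hypothetical illustrations, not an actual training setup.

```python
def outcome_rewards(num_steps, final_score):
    """Outcome supervision: one reward for the end result, placed on the last step."""
    return [0.0] * (num_steps - 1) + [final_score]


def process_rewards(step_scores):
    """Process supervision: the evaluator scores every intermediate step."""
    return list(step_scores)


# Hypothetical 4-step trajectory: the evaluator dislikes step 3,
# but the final outcome is still a win.
outcome = outcome_rewards(num_steps=4, final_score=1.0)
process = process_rewards([1.0, 1.0, 0.0, 1.0])

print("outcome supervision:", outcome)
print("process supervision:", process)
```

Under outcome supervision the unendorsed third step is never penalized as long as the game is won, which is what leaves room for a move 37. Under process supervision that step gets a low score, but there is also a learning signal at every step, which is the exploration benefit mentioned above.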
The superalignment fast grants are now decided!
We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about.
There is still so much good research waiting to be funded.
Congrats to all recipients!
We're hiring research engineers for alignment work at @OpenAI!
If you're excited about finetuning gpt3-sized language models to be better at following human intentions, then this is for you!
Apply here: jobs.lever.co/openai/98599d5…
New Anthropic research: Do reasoning models accurately verbalize their reasoning?
Our new paper shows they don't.
This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
[Image: Title card for the paper "Reasoning Models Don't Always Say What They Think", by Chen et al.]
If you want to get into alignment research, imo this is one of the best ways to do it.
Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time.
Application deadline is this Sunday!
We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
Jailbreaking LLMs through input images might end up being a nasty problem.
It's likely much harder to defend against than text jailbreaks because it's a continuous space.
Despite a decade of research we don't know how to make vision models adversarially robust.
Really interesting result on using LLMs to do math:
Supervising every step works better than only checking the answer.
Some thoughts on how this matters for alignment 👇
openai.com/research/improvin…
If you don't train your CoTs to look nice, you could get some safety from monitoring them.
This seems good to do!
But I'm skeptical this will work reliably enough to be load-bearing in a safety case.
Plus as RL is scaled up, I expect CoTs to become less and less legible.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them
We know it works, with OK (not perfect) transparency!
The risk is fragility: RL training, new architectures, etc threaten transparency
Experts from many orgs agree we should try to preserve it: 🧵
GPT-4 is safer and more aligned than any other model OpenAI has deployed before.
Yet it's not perfect. There is still a lot to do to improve safety and we're planning to make updates over the coming months.
Huge congrats to the team on all the progress! 🎉
It's been heartening to see so many more people lately starting to take existential risk from AI seriously and speaking up about it.
It's a first step towards solving the problem.
Today was my last day at @DeepMind. It's been an amazing journey; I've learned so many things and got to work with so many amazing people!
Excited for what comes next!
Super exciting new research milestone on alignment:
We trained language models to assist human feedback!
Our models help humans find 50% more flaws in summaries than they would have found unassisted.
openai.com/blog/critiques/
If you're into practical alignment, consider applying to @lilianweng's team. They're building some really exciting stuff:
- Automatically extract intent from a fine-tuning dataset
- Make models robust to jailbreaks
- Detect & mitigate harmful use
- ...
linkedin.com/feed/update/urn…
We released the paper with the details on how it works so anyone can recreate this system.
I don't think we can publicly release the dataset because it's too infohazard-y.
Why not fund initiatives that help ensure AGI is beneficial, like AI governance initiatives, safety and alignment research, and easing impacts on the labor market?
Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more
👇 Check it out 👇
80000hours.org/podcast/episo…
Nobody has fully jailbroken our system yet, so we're upping the ante.
We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak.
Full details: hackerone.com/constitutional…
In March we published a paper on alignment audits: teams of humans were tasked with finding the problems in a model we trained to be misaligned.
Now we have agents that can do it automatically 42% of the time.
New Anthropic research: Building and evaluating alignment auditing agents.
We developed three AI agents to autonomously complete alignment auditing tasks.
In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
[Image: Title card for the Anthropic paper "Building and evaluating alignment auditing agents", by Bricken, Wang, Bowman et al. It is accompanied by a sepia-toned picture of worker bees.]
How about a friendly game of who-can-make-their-models-more-aligned followed by a jailbreaking competition and a face-off eliciting dangerous capabilities from each other's models?
If your model causes mass casualties or >$500 million in damages, something has clearly gone very wrong. Such a scenario is not a normal part of innovation.
They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works:
Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.
The agent alignment problem may be one of the biggest obstacles for using ML to improve people’s lives.
Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI.
Blog post: medium.com/@deepmindsafetyre…
Paper: arxiv.org/pdf/1811.07871.pdf