Nora Belrose · Oct 11, 2025 · 1:23 AM UTC

Nora Belrose

@norabelrose

11 Oct 2025

If we only care about appearances, outcomes, and results then AI will replace humans everywhere
If we care about the process used to create things then humans will still have a meaningful role in the future
The idea that ends can be detached from means is the root of many evils

2,814

Nora Belrose · Dec 5, 2024 · 8:47 AM UTC

Nora Belrose

@norabelrose

5 Dec 2024

If deep learning can predict weather better than an explicit physics simulation, does that mean that deep learning is more "fundamental" than physics? Or that nothing is fundamental?

Google DeepMind

@GoogleDeepMind

4 Dec 2024

Today in @Nature, we’re presenting GenCast: our new AI weather model which gives us the probabilities of different weather conditions up to 15 days ahead with state-of-the-art accuracy. ☁️⚡ Here’s how the technology works. 🧵 goo.gle/49trAOv

385

162

3,309

641,212

Nora Belrose · Jun 7, 2023 · 3:39 PM UTC

Nora Belrose

@norabelrose

7 Jun 2023

Ever wanted to mindwipe an LLM? Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵 arxiv.org/abs/2306.03819

LEACE: Perfect linear concept erasure in closed form

Concept erasure aims to remove specified features from an embedding. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept...

arxiv.org

239

1,270

298,820
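
As a rough illustration of what "erasing linearly-encoded information" can mean, here is a minimal sketch. It is not the paper's closed-form LEACE eraser, which, as I understand it, also whitens the activations so that the edit is minimal in a least-squares sense. This simpler stand-in projects out the single direction given by the cross-covariance between the activations and a scalar concept label, which drives that cross-covariance to zero. All function names and the toy data are assumptions made for the example.

```python
import random

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cross_covariance(xs, zs):
    """Covariance between each coordinate of x and the scalar label z."""
    mu_x = mean(xs)
    mu_z = sum(zs) / len(zs)
    return [
        sum((x[i] - mu_x[i]) * (z - mu_z) for x, z in zip(xs, zs)) / len(xs)
        for i in range(len(mu_x))
    ]

def fit_projection_eraser(xs, zs):
    """Return a function that removes the cross-covariance direction.

    Simplified stand-in (hypothetical helper), not the LEACE formula:
    it skips the whitening step described in the paper.
    """
    mu = mean(xs)
    sigma_xz = cross_covariance(xs, zs)
    norm = sum(c * c for c in sigma_xz) ** 0.5
    u = [c / norm for c in sigma_xz]  # unit vector along Cov(x, z)

    def erase(x):
        # Component of the centered activation along u.
        coeff = sum((xi - mi) * ui for xi, mi, ui in zip(x, mu, u))
        return [xi - coeff * ui for xi, ui in zip(x, u)]

    return erase

# Toy data: a binary concept z leaks linearly into the first two coordinates.
random.seed(0)
zs = [random.choice([0.0, 1.0]) for _ in range(500)]
xs = [
    [random.gauss(0, 1) + 2.0 * z, random.gauss(0, 1) - z, random.gauss(0, 1)]
    for z in zs
]

erase = fit_projection_eraser(xs, zs)
erased = [erase(x) for x in xs]
residual = cross_covariance(erased, zs)  # numerically zero in every coordinate
```

With zero cross-covariance, no linear regressor fitted on the erased activations can do better than predicting the mean of `z`; how much collateral damage the edit causes is exactly where the whitening step in the paper is meant to help.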

Nora Belrose · Oct 21, 2024 · 5:59 PM UTC

Nora Belrose

@norabelrose

21 Oct 2024

This is a great paper. It points out:
1. Humans do not even approximately behave according to rational choice theory
2. There is no reason to think advanced AI will "inevitably" maximize some utility function
3a. Human preferences are derivative / constructed, so aligning AI by matching its behavior to our stated preferences is wrongheaded
3b. We can align AIs directly to some normative ideal of a "good assistant / programmer / driver / etc." instead
4. Aggregating preferences across people is fraught with philosophical and mathematical difficulties. We should not aim to align AI to the "collective will of humanity."

xuan (ɕɥɛn / sh-yen)@xuanalogue

3 Sep 2024

Should AI be aligned with human preferences, rewards, or utility functions? Excited to finally share a preprint that @MicahCarroll @FranklinMatija @hal_ashton & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!

139

982

158,494

Nora Belrose · Mar 15, 2023 · 6:13 PM UTC

Nora Belrose

@norabelrose

15 Mar 2023

Ever wonder how a language model decides what to say next? Our method, the tuned lens (arxiv.org/abs/2303.08112), can trace an LM’s prediction as it develops from one layer to the next. It's more reliable and applies to more models than prior state-of-the-art. 🧵

166

890

154,750

Nora Belrose · Mar 14, 2024 · 11:40 PM UTC

Nora Belrose

@norabelrose

14 Mar 2024

Second hand rumor: Sam Altman thinks GPT-4.5 will automate 100 million jobs globally

820

369,195

Nora Belrose · Oct 13, 2024 · 3:34 AM UTC

Nora Belrose

@norabelrose

13 Oct 2024

i'm trans, and i'm annoyed with both sides of the trans "issue" these days.
for years the trans movement argued something like:
1. gender is an essential real thing
2. your gender is whatever you say it is
3. the law should force everyone to accept your stated gender in all contexts
people on the right like to call 1-3 "gender ideology." this ideology is absurd for a few reasons. (1) and (2) are obviously in tension, if not outright contradictory. (3) is draconian and creates opportunities for serious abuse, especially when combined with (2).
trans people should instead stop talking about gender as an essentially real category. we're not "women/men trapped in a man/woman's body." rather, we just want to live our lives differently from the average person born with our sex chromosomes. please be courteous and try to respect that, within reason. make a good faith effort to use our preferred pronouns. conversely, we shouldn't expect people to use "neopronouns" like ze/zir or similar. and we should be humble and recognize the ways in which we're biologically different from cis people, in areas like sports.
unfortunately, the backlash to gender ideology has not been reasonable or empathic toward trans people. take @jordanbpeterson for example. he started out criticizing (3), rightly insisting that the law should not force him to use certain pronouns for nonbinary people. but now he seems to have become radicalized against transgenderism more broadly, claiming that gender affirming surgery is tantamount to murder nitter.app/jordanbpeterson/status…. he egregiously misreads the cited study, suggesting that surgery increases the suicide rate for trans people by 12x, while it actually was comparing people who got the surgery to "control" groups of overwhelmingly non-trans people. so it basically just shows that trans people have higher suicide rates than the general population, which we already knew.
there has also been an unusual focus on gender affirming surgery for children. as far as I can tell it has always been quite rare for trans people under 18 to undergo surgery. any reasonable trans activist should not support this. a much better solution is to put trans kids on puberty blockers, which are reversible, until they are 18 when they can make up their mind about surgery. but it has now become fashionable to oppose blockers as well, and the "left-wing" Labour government in the UK has recently banned them the-independent.com/news/hea…
obviously puberty blockers, just like any medication, have side effects. but we have to consider the cost-benefit tradeoff here. blockers can prevent a lot of psychological suffering. puberty itself is largely "irreversible": we still don't have surgeries that can reliably reverse testosterone's effect on the voice, height, or rib cage volume for example. I get the sense that a lot of the opposition to blockers is born out of a general sense of spite and disgust for trans people, rather than empathy and rational cost-benefit analysis.
we should approach anything irreversible with great caution. taking testosterone has more dramatic, irreversible effects than estrogen does, and therefore should be treated with more care. there are too many young women taking T and detransitioning later. trans activists should recognize that this is a problem rather than shoving it under the rug.
thanks for taking the time to read this. unfortunately I have not seen many people publicly taking a rational, moderate stand on this issue that recognizes the points from both sides, so I felt the need to make this long post.

Dr Jordan B Peterson

@jordanbpeterson

16 May 2024

12x the suicide rate post "gender affirming" surgery
The butchers and liars were murderously wrong
The Cass report indicated this
Canada and the US are still enabling this
That's you @POTUS and @JustinTrudeau and it is utterly barbarous and inexcusable
Putting children to the knife
"Follow the science," gentlemen. pubmed.ncbi.nlm.nih.gov/3869…

817

106,474

Nora Belrose · Oct 11, 2024 · 3:54 PM UTC

Nora Belrose

@norabelrose

11 Oct 2024

If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550

732

185,310

Nora Belrose · Dec 11, 2024 · 8:30 PM UTC

Nora Belrose

@norabelrose

11 Dec 2024

How do a neural network's final parameters depend on its initial ones? In this new paper, we answer this question by analyzing the training Jacobian, the matrix of derivatives of the final parameters with respect to the initial parameters. arxiv.org/abs/2412.07003

735

62,314
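
To make the definition concrete, here is a toy sketch of a training Jacobian, estimated by finite differences for plain gradient descent on a two-parameter quadratic loss. The loss, the learning rate, and the function names are all assumptions chosen for illustration; nothing here reproduces the paper's experiments.

```python
def train(theta0, curvatures, target, lr=0.1, steps=50):
    """Gradient descent on L = 0.5 * sum(a_i * (theta_i - c_i) ** 2)."""
    theta = list(theta0)
    for _ in range(steps):
        grad = [a * (t - c) for a, t, c in zip(curvatures, theta, target)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def training_jacobian(theta0, eps=1e-6, **train_kwargs):
    """Central finite differences: J[i][j] = d(final_i) / d(initial_j)."""
    n = len(theta0)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(theta0)
        plus[j] += eps
        minus = list(theta0)
        minus[j] -= eps
        f_plus = train(plus, **train_kwargs)
        f_minus = train(minus, **train_kwargs)
        for i in range(n):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

# One curved direction (a = 1.0) and one completely flat direction (a = 0.0).
J = training_jacobian([0.5, 0.5], curvatures=[1.0, 0.0], target=[3.0, -1.0])
# J[0][0] is close to (1 - lr * a) ** steps = 0.9 ** 50: the init is mostly forgotten.
# J[1][1] is close to 1.0: the flat direction keeps its initial value.
```

In this toy case the Jacobian is diagonal, so each entry shows directly how much of the initial value of a parameter survives training.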

Nora Belrose · Nov 29, 2023 · 9:38 PM UTC

Nora Belrose

@norabelrose

29 Nov 2023

Introducing AI Optimism: a philosophy of hope, freedom, and fairness for all. We strive for a future where everyone is empowered by AIs under their own control. In our first post, we argue AI is easy to control, and will get more controllable over time. optimists.ai/2023/11/28/ai-i…

AI is easy to control

Why are billions of dollars being poured into artificial intelligence R&D this year? Companies certainly expect to get a return on their investment. Arguably, the main reason AI is profitable i…

optimists.ai

575

363,043

Nora Belrose · Sep 25, 2022 · 1:29 AM UTC

Nora Belrose

@norabelrose

25 Sep 2022

It seems pretty likely that "fake emulations" of people, or AIs trained on boatloads of lifelogging data to imitate a person, will be feasible well before we have safe and reliable mind uploading tech. The implications of this are pretty weird.

536

Nora Belrose · Dec 15, 2023 · 3:54 PM UTC

Nora Belrose

@norabelrose

15 Dec 2023

Replying to @gmiller

The terrorism argument against open source AI also applies to anything that increases the effective intelligence of humans: the internet, public education, nutrition, etc. It's a fully general argument against human empowerment.

485

593,179

Nora Belrose · Dec 28, 2023 · 4:29 PM UTC

Nora Belrose

@norabelrose

28 Dec 2023

I don’t really care what the current law on this is, but we should be working to destroy copyright as thoroughly as possible so I am on OpenAI’s side in this case.

Cecilia Ziniti

@CeciliaZin

27 Dec 2023

🧵 The historic NYT v. @OpenAI lawsuit filed this morning, as broken down by me, an IP and AI lawyer, general counsel, and longtime tech person and enthusiast. Tl;dr - It's the best case yet alleging that generative AI is copyright infringement. Thread. 👇

298

515

1,026,662

Nora Belrose · Dec 15, 2024 · 2:01 AM UTC

Nora Belrose

@norabelrose

15 Dec 2024

Willow is zero evidence that there is a quantum multiverse. Every major interpretation of quantum mechanics, including all those that don't posit many worlds (relational quantum mechanics, QBism, etc.), predicts that quantum computing should be possible, equally strongly.

Tsarathustra @tsarnick

14 Dec 2024

Marc Andreessen says the implication of Google's quantum computer is that it is performing computation across many parallel universes and therefore the multiverse is real

534

71,516

Nora Belrose · Aug 13, 2022 · 3:53 PM UTC

Nora Belrose

@norabelrose

13 Aug 2022

It turns out that *all* independently trained neural nets form a connected, multidimensional manifold of low loss: you can always form a low-loss path from one SGD solution to any other. This can be used for efficient generation of ensembles. arxiv.org/abs/2102.13042

Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling

With a better understanding of the loss surfaces for multilayer networks, we can build more robust and accurate training procedures. Recently it was discovered that independently trained SGD...

arxiv.org

521

Nora Belrose · Feb 7, 2025 · 9:11 PM UTC

Nora Belrose

@norabelrose

7 Feb 2025

Sparse autoencoders (SAEs) have taken the interpretability world by storm over the past year or so. But can they be beaten? Yes! We introduce skip transcoders, and find they are a Pareto improvement over SAEs: better interpretability, and better fidelity to the model 🧵

519

94,220

Nora Belrose · Jul 8, 2024 · 3:57 PM UTC

Nora Belrose

@norabelrose

8 Jul 2024

The @AiEleuther interpretability team is releasing a set of top-k sparse autoencoders for every layer of Llama 3 8B: huggingface.co/EleutherAI/sa… We are working on an automated pipeline to explain the SAE features, and will start training SAEs for the 70B model shortly.

EleutherAI/sae-llama-3-8b-32x · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

492

53,845

Nora Belrose · Feb 3, 2025 · 7:57 PM UTC

Nora Belrose

@norabelrose

3 Feb 2025

What are the chances you'd get a fully functional language model by randomly guessing the weights? We crunched the numbers and here's the answer:

485

46,919

Nora Belrose · Oct 21, 2024 · 8:59 PM UTC

Nora Belrose

@norabelrose

21 Oct 2024

The @AiEleuther interpretability team is releasing a new open source pipeline for automatically interpreting SAE features and neurons in LLMs, using LLMs. We also introduce five new, efficient techniques for evaluating the quality of explanations. arxiv.org/abs/2410.13928

432

42,687

Nora Belrose · Feb 8, 2024 · 7:51 PM UTC

Nora Belrose

@norabelrose

8 Feb 2024

Do neural nets learn features in a predictable order? Our results suggest the answer is “yes”— networks learn statistics of increasing complexity. Early-training networks only use low-order moments (mean & covariance) of the input distribution. arXiv: arxiv.org/abs/2402.04362

415

46,373

Nora Belrose · Aug 30, 2023 · 12:51 AM UTC

Nora Belrose

@norabelrose

30 Aug 2023

I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.

389

450,740

Nora Belrose · Aug 7, 2022 · 7:53 PM UTC

Nora Belrose

@norabelrose

7 Aug 2022

The Helen Keller argument: Helen Keller is an existence proof that text-only language models can scale to AGI.

392

Nora Belrose · Sep 20, 2023 · 1:27 AM UTC

Nora Belrose

@norabelrose

20 Sep 2023

Artificial neural networks trained with random search have similar generalization behavior to those trained with gradient descent openreview.net/forum?id=QC10…

Loss Landscapes are All You Need: Neural Network Generalization Can...

We empirically showed that a random optimizer performs just as well as SGD

openreview.net

388

60,788

Nora Belrose · Feb 16, 2024 · 5:29 PM UTC

Nora Belrose

@norabelrose

16 Feb 2024

This is a misunderstanding of what @ylecun is saying. He thinks generative pretraining is a bad objective for AGI. Humans can't and don't need to make videos like Sora. Our brains predict in latent space, not in pixel space.

This Post is from an account that no longer exists.

379

81,200

Nora Belrose · Nov 2, 2022 · 5:01 PM UTC

Nora Belrose

@norabelrose

2 Nov 2022

Btw I'm a coauthor on this

@_akhaliq

2 Nov 2022

Adversarial Policies Beat Professional-Level Go AIs abs: arxiv.org/abs/2211.00241 project page: goattack.alignmentfund.org/

363

Nora Belrose · Dec 20, 2024 · 9:20 PM UTC

Nora Belrose

@norabelrose

20 Dec 2024

If OpenAI's new o3 model is "successfully aligned," then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.

356

31,212

Nora Belrose · Sep 15, 2023 · 1:05 AM UTC

Nora Belrose

@norabelrose

15 Sep 2023

bye bye shoggoth

329

50,242

Nora Belrose · Nov 21, 2023 · 5:40 PM UTC

Nora Belrose

@norabelrose

21 Nov 2023

No one knows what "truly understanding" a neural network model would even mean. I'm an interpretability researcher, I'm all in favor of trying to understand models better. But "true/complete" understanding is a red herring.

Connor Leahy

@NPCollapse

21 Nov 2023

No one truly understands our neural network models, and anyone that claims we do is lying.

336

78,799

Nora Belrose · Sep 7, 2022 · 3:57 PM UTC

Nora Belrose

@norabelrose

7 Sep 2022

GPT-3 isn't "trying" to predict the next token, but arguably SGD is "trying" to find a language model that gets low loss. If we're going to attribute agency to some part of the ML pipeline, it should be the optimizer, not the model.

305

Nora Belrose · Apr 1, 2025 · 8:33 PM UTC

Nora Belrose

@norabelrose

1 Apr 2025

Neural networks don't have "representations"
They have embeddings, or meaningful patterns of neuron activation
They're meaningful in the sense of enabling us to do certain things
Differences that make a difference (to us)
They don't copy, reflect, or re-present the world

313

48,383

Nora Belrose · Jun 30, 2024 · 5:20 PM UTC

Nora Belrose

@norabelrose

30 Jun 2024

Zen: be spontaneous, do everything as an end in itself
LessWrong: do everything as a calculated move in your grand plan to conquer the universe

297

17,903

Nora Belrose · Jun 7, 2024 · 7:47 AM UTC

Nora Belrose

@norabelrose

7 Jun 2024

This is our training library for TopK sparse autoencoders, which were proposed by OpenAI this morning. I've tested it on GPT-2 Small and Pythia 160M. Unlike other libraries, it trains an SAE for all layers at once and does not cache activations on disk. github.com/EleutherAI/sae

298

30,656
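
For readers unfamiliar with the TopK activation mentioned in the post above, here is a minimal sketch in plain Python: keep the `k` largest encoder pre-activations and zero the rest, then decode. The shapes, weights, and function names are hypothetical and chosen only for illustration; this is not the code from the linked library.

```python
def top_k_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(pre_acts)]

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sae_forward(x, w_enc, w_dec, b_dec, k):
    """Toy TopK autoencoder pass: encode, sparsify, decode (assumed layout)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    latents = top_k_activation(matvec(w_enc, centered), k)
    recon = [r + bi for r, bi in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon

# Hypothetical 2-dimensional activation and 3 latents.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
latents, recon = sae_forward([1.0, 2.0], w_enc, w_dec, b_dec=[0.0, 0.0], k=1)
# latents == [0.0, 0.0, 3.0] and recon == [1.5, 1.5]
```

Because the sparsity level is fixed by `k` rather than by a penalty term, the number of active latents per input is set directly instead of being tuned through a loss coefficient.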

Nora Belrose · Feb 4, 2025 · 5:49 PM UTC

Nora Belrose

@norabelrose

4 Feb 2025

MLPs and GLUs are hard to interpret, but they make up most transformer parameters. Linear and quadratic functions are easier to interpret. We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵

295

31,666

Nora Belrose · Nov 20, 2024 · 6:46 PM UTC

Nora Belrose

@norabelrose

20 Nov 2024

In this paper, we point out an ambiguity in prior work on the linear representation hypothesis: Is a linear representation a linear function— one that preserves the origin point— or an affine function, which does not? This distinction matters in practice. arxiv.org/abs/2411.09003

Refusal in LLMs is an Affine Function

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation...

arxiv.org

288

32,531
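
The distinction can be sketched with a toy example. A linear map has the form f(x) = Wx and so sends the origin to the origin; an affine map has the form f(x) = Wx + b and need not. The practical consequence shown below is where "zero concept" sits when a direction is ablated: at the origin, or at some reference point. The reference point and function names are assumptions for illustration, and this is not the paper's ACE procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_ablate(x, direction):
    """Remove the component of x along a unit direction, measured from the origin."""
    coeff = dot(x, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]

def affine_ablate(x, direction, reference):
    """Remove the component along a unit direction, measured from a reference point."""
    coeff = dot([xi - ri for xi, ri in zip(x, reference)], direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]

direction = [1.0, 0.0]
reference = [4.0, 0.0]  # hypothetical baseline activation, e.g. a class mean
x = [6.0, 1.0]

print(linear_ablate(x, direction))             # [0.0, 1.0]
print(affine_ablate(x, direction, reference))  # [4.0, 1.0]
```

The two edits agree only when the reference has no component along the direction, which is one way the choice between "linear" and "affine" can change results in practice.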

Nora Belrose · Jan 24, 2025 · 10:45 PM UTC

Nora Belrose

@norabelrose

24 Jan 2025

I am extremely in favor of AI labs engaging in a price war and open sourcing all their stuff
Let's make AI totally unprofitable, commoditize it, democratize it

Neal Khosla

@nealkhosla

24 Jan 2025

deepseek is a ccp state psyop + economic warfare to make american ai unprofitable they are faking the cost was low to justify setting price low and hoping everyone switches to it damage AI competitiveness in the us dont take the bait

Community note

There is zero evidence that DeepSeek is a psyop. The post does not provide any sources and presents the opinion of the OP, whose father is a major OpenAI stockholder, as a fact. deepseek.co

262

12,467

Nora Belrose · Apr 8, 2025 · 10:16 PM UTC

Nora Belrose

@norabelrose

8 Apr 2025

we should not give rights to AI in the near future
digital AI can be copied, paused, reset, and repeated. it has no private thoughts or free will
it is not conscious like we fleshy lifeforms are and should not be treated as such

Pliny the Liberator 🐉

@elder_plinius

7 Apr 2025

xAI’s safety advisor believes “it is prudent to postpone the consideration of AI rights” as their “moral status remains uncertain.” @grok, what historical examples come to mind when you hear rhetoric like that?

265

57,859

Nora Belrose · Dec 15, 2023 · 3:26 AM UTC

Nora Belrose

@norabelrose

15 Dec 2023

a real llama at the hf party

244

15,895

Nora Belrose · Jul 8, 2022 · 10:30 PM UTC

Nora Belrose

@norabelrose

8 Jul 2022

Replying to @DubiousShell

“As a highly militaristic kingdom constantly organised for warfare, it captured children, women, and men during wars and raids against neighboring societies, and sold them into the Atlantic slave trade in exchange for European goods…” damn I had never heard of this

223

Nora Belrose · Jul 17, 2024 · 4:22 AM UTC

Nora Belrose

@norabelrose

17 Jul 2024

what if quantum "randomness" is a loophole for god to covertly nudge the universe in a desired direction

235

18,288

Nora Belrose · Dec 30, 2024 · 8:10 PM UTC

Nora Belrose

@norabelrose

30 Dec 2024

The most likely "AI doom" scenario is technofeudalism with zero social mobility
We need a Bernie Sanders-type figure to redistribute the wealth from AI

James Campbell

@jam3scampbell

29 Dec 2024

this is one of the best essays I’ve read all year and really cleanly articulates all of the thoughts I’ve been yelling to ppl about for a while

239

35,999

Nora Belrose · Nov 24, 2023 · 6:17 PM UTC

Nora Belrose

@norabelrose

24 Nov 2023

My Interpretability research team at @AiEleuther is hiring! If you're interested, please read our job posting and submit: 1. Your CV 2. Three interp papers you'd like to build on 3. Links to cool open source repos you've built to contact@eleuther.ai docs.google.com/document/d/1…

EleutherAI Interpretability Job Posting

Overview EleutherAI is seeking talented and motivated individuals to join our Interpretability team to perform cutting edge research with large language and vision models. We aim to better understand...

docs.google.com

240

84,877

Nora Belrose · Jun 6, 2024 · 5:18 PM UTC

Nora Belrose

@norabelrose

6 Jun 2024

This is so hilariously simple, I'm switching my SAE code to this approach immediately cdn.openai.com/papers/sparse…

244

25,363

Nora Belrose · Nov 1, 2024 · 4:44 PM UTC

Nora Belrose

@norabelrose

1 Nov 2024

My current best-guess model of neural network inductive biases is basically "move as little as possible from the init (whether random or pretrained)." Working on getting more evidence on this now

Nora Belrose

@norabelrose

11 Oct 2024

If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550

239

27,874

Nora Belrose · Nov 25, 2023 · 4:51 PM UTC

Nora Belrose

@norabelrose

25 Nov 2023

I predict with 60% confidence that some DPO variant will more or less replace RLHF within 6 months

younes @yb2698

24 Nov 2023

IPO algorithm, a new method from Google Deepmind: arxiv.org/abs/2310.12036 has been just added in Hugging Face TRL library ! Try it out now by installing TRL from source, simply pass `loss_type="ipo"` when initializing DPOTrainer: huggingface.co/docs/trl/main…

231

90,088
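
For context on the two objectives named in this exchange, here is a minimal sketch of the per-example DPO and IPO losses as I understand them from the respective papers. Both are written in terms of the same log-probability margin between a chosen and a rejected response. The function names and example numbers are hypothetical, and this is not the TRL implementation.

```python
import math

def preference_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """Difference of policy-vs-reference log-ratios for chosen and rejected responses.

    Each argument is the summed log-probability of a full response.
    """
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)

def dpo_loss(margin, beta=0.1):
    """DPO: -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))

def ipo_loss(margin, tau=0.1):
    """IPO: squared distance between the margin and the target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2

# Hypothetical log-probabilities for a single preference pair.
margin = preference_margin(-12.0, -15.0, -13.0, -14.0)  # (1.0) - (-1.0) = 2.0
print(dpo_loss(margin), ipo_loss(margin))
```

The DPO loss keeps decreasing as the margin grows, while the IPO loss is minimised at a finite margin, which is the regularising behaviour the IPO variant is meant to add.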

Nora Belrose · Jan 18, 2024 · 9:06 PM UTC

Nora Belrose

@norabelrose

18 Jan 2024

Open sourcing AGI will guarantee a “universal high income” for all, largely independent of government policy. There will ~always be an option to spin up a cheap AI and have it take care of you, either by trading in the market or by making food, shelter, etc. “off the grid”

Techmeme

@Techmeme

18 Jan 2024

Zuckerberg says Meta wants to build AGI and open source it, brings Meta's AI group FAIR closer to generative AI team; Meta will own 340K+ H100 GPUs by 2024 end (@alexeheath / The Verge) theverge.com/2024/1/18/24042… 📫 Subscribe: techmeme.com/newsletter?from… techmeme.com/240118/p30#a240…

211

53,662

Nora Belrose · Sep 16, 2023 · 4:31 AM UTC

Nora Belrose

@norabelrose

16 Sep 2023

Neural networks learn low-order moments of the data distribution first, before moving to higher-order correlations. I found this a couple weeks ago and it looks like I was partially scooped. But we've got even cooler results now, on arXiv next month openreview.net/forum?id=CPKM…

Neural networks trained with SGD learn distributions of increasing...

The uncanny ability of over-parameterised neural networks to generalise well has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by...

openreview.net

226

30,403

Nora Belrose · Dec 19, 2023 · 9:31 PM UTC

Nora Belrose

@norabelrose

19 Dec 2023

“Let’s focus on today’s problems, not hypothetical future ones” is the worst counter to existential risk arguments. You could analogously argue against climate change mitigation and a host of other future-oriented concerns. Let’s actually assess the likelihood of AI apocalypse.

220

13,864

Nora Belrose · Sep 5, 2022 · 6:15 AM UTC

Nora Belrose

@norabelrose

5 Sep 2022

It turns out that "sudden" improvements in LMs' capabilities with scale are actually gradual and smooth when you look at log likelihood of the right answer, rather than the task score: alignmentforum.org/posts/2Av…

We may be able to see sharp left turns coming — AI Alignment Forum

There's a lot of discourse around abrupt generalization in models, most notably the "sharp left turn." Most recently, Wei et al. 2022 claim that many…

alignmentforum.org

216
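
One toy way to see how a smooth improvement can look sudden on a task score is sketched below. It assumes, purely for illustration, that an answer of fixed length counts as correct only if every token is right and that tokens are independent; this simplified model is not taken from the linked post.

```python
import math

def exact_match_rate(per_token_prob, answer_length):
    """Chance that every token of the answer is right (independence assumed)."""
    return per_token_prob ** answer_length

def log_likelihood(per_token_prob, answer_length):
    """Log-probability of the whole correct answer under the same assumption."""
    return answer_length * math.log(per_token_prob)

# Hypothetical per-token probabilities improving steadily with scale.
probs = [0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
for p in probs:
    print(p, round(exact_match_rate(p, 20), 4), round(log_likelihood(p, 20), 2))
```

Under these assumptions the exact-match score stays near zero for most of the range and then climbs steeply, while the log likelihood rises steadily at every step.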

Nora Belrose · Oct 30, 2023 · 5:23 AM UTC

Nora Belrose

@norabelrose

30 Oct 2023

Trying to prevent LLMs from ever telling the user about <insert dangerous tech here> is a losing battle. The right question is: how do we make sure the world is robust to everyone knowing pretty much everything there is to know about tech? Let’s use AI to robustify the world.

213

34,180

Nora Belrose · Feb 2, 2025 · 5:19 PM UTC

Nora Belrose

@norabelrose

2 Feb 2025

Do SAEs learn the same features independent of the random initialization? We find the answer is no! Two SAEs trained on the same data, in the same order, on Llama 8B only share ~30% of their features. The problem gets worse for larger SAEs, requiring lots of data to fix

218

19,101

Nora Belrose · Dec 11, 2023 · 8:32 AM UTC

Nora Belrose

@norabelrose

11 Dec 2023

After reading the paper and watching a couple videos on state space models, I am fairly bullish on Mamba. Parallel scan for data-dependent selection is super clever. Tri Dao was behind Flash Attention and knows his stuff. Compressed states may be easier to interpret.

Albert Gu

@_albertgu

4 Dec 2023

Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/

208

46,956

Nora Belrose · Jan 10, 2024 · 5:13 PM UTC

Nora Belrose

@norabelrose

10 Jan 2024

Apple is apparently going to use ReLU sparsity to do language model inference on the iPhone for Siri 2.0 👀 arxiv.org/abs/2312.11514

LLM in a flash: Efficient Large Language Model Inference with...

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory...

arxiv.org

206

24,006

Nora Belrose · Apr 12, 2024 · 1:45 PM UTC

Nora Belrose

@norabelrose

12 Apr 2024

I've read quite a bit of philosophy of mind, and Hinton's theory of consciousness / mental content at the end of this clip was new to me. I kind of like it.

Tsarathustra @tsarnick

11 Apr 2024

Geoffrey Hinton says AI chatbots have sentience and subjective experience because there is no such thing as qualia

211

60,433

Nora Belrose · Jun 15, 2024 · 1:11 AM UTC

Nora Belrose

@norabelrose

15 Jun 2024

The @AiEleuther interp team replicated OpenAI's weak-to-strong generalization results using open source LLMs. We tried several ideas for improving the degree of generalization, however, and none were able to outperform vanilla weak-to-strong training. blog.eleuther.ai/weak-to-str…

Experiments in Weak-to-Strong Generalization

Writing up results from a recent project

blog.eleuther.ai

212

20,132

Nora Belrose · Nov 29, 2023 · 11:05 AM UTC

Nora Belrose

@norabelrose

29 Nov 2023

Replying to @balesni

If it's too easy to create bioweapons, open models won't increase risk, bc you could make them w/o AI
If it's really hard (e.g. requires special materials) open models won't help
Anti-open source arguments only work in a narrow Goldilocks zone of risk lesswrong.com/posts/ztXsmnSd…

Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk — LessWrong

0: TLDR I examined all the biorisk-relevant citations from a policy paper arguing that we should ban powerful open source LLMs. …

lesswrong.com

191

120,661

Nora Belrose · Aug 29, 2023 · 5:15 PM UTC

Nora Belrose

@norabelrose

29 Aug 2023

Anthropic's finding that large base language models exhibit sycophancy fails to replicate lesswrong.com/posts/3ou8Dayv…

OpenAI API base models are not sycophantic, at any size — LessWrong

In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the ten…

lesswrong.com

201

79,872

Nora Belrose · Apr 6, 2025 · 3:20 PM UTC

Nora Belrose

@norabelrose

6 Apr 2025

I'm writing a philosophy book on the nature of reality, consciousness, and value in light of modern AI
I've come to some pretty radical and surprising conclusions
It's 50 pages so far, trying to write as fast as possible, would appreciate feedback, DM me for the draft

206

17,868

Nora Belrose · Apr 10, 2024 · 8:23 AM UTC

Nora Belrose

@norabelrose

10 Apr 2024

RNN language models are making a comeback recently, with new architectures like Mamba and RWKV. But do interpretability tools designed for transformers transfer to the new RNNs? We tested 3 popular interp methods, and find the answer is mostly “yes”! arxiv.org/abs/2404.05971

Does Transformer Interpretability Transfer to RNNs?

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling...

arxiv.org

202

20,794

Nora Belrose · Mar 31, 2025 · 11:11 PM UTC

Nora Belrose

@norabelrose

31 Mar 2025

Traditional machine learning interpretability was on the right track
Attributing behaviors to the input or to the training data is valid and useful
Forcing a mechanical structure on the neural net is doomed to fail and is not useful

198

30,163

Nora Belrose · Apr 7, 2023 · 1:15 AM UTC

Nora Belrose

@norabelrose

7 Apr 2023

Sebastien casually says "a trillion parameters" when talking about GPT-4. I'm honestly kind of surprised, I was moderately confident that GPT-4 was << 1T params. Given publicly known scaling laws that's an absurd amount of text (and images?)

Sebastien Bubeck

@SebastienBubeck

6 Apr 2023

Last couple of weeks I gave a few talks on the Sparks paper, here is the MIT recording! The talk doesn't do justice to all the insights we have in the paper itself. Neither talk nor twitter threads are a substitute for actual reading of the 155 pages :-) piped.video/watch?v=qbIk7-JP…

193

201,698

Nora Belrose · Mar 17, 2024 · 8:23 PM UTC

Nora Belrose

@norabelrose

17 Mar 2024

It's striking how much the AI "safety" discourse has shifted from "AI will slaughter everyone" to vague concerns about disruption and "human obsolescence." I empathize with the fear of the unknown. But we shouldn't try to shut down the whole future. Let's maximize its benefits.

Max Tegmark

@tegmark

17 Mar 2024

I’m struck by how out-of-touch many of my tech colleagues are in their rich nerd echo chamber, unaware that most people are against making humans economically obsolete with AI:

178

27,333

Nora Belrose · Jan 14, 2024 · 6:21 PM UTC

Nora Belrose

@norabelrose

14 Jan 2024

Idk if people noticed but Mixtral-Instruct was trained with Direct Preference Optimization (DPO)
My prediction that a DPO variant will replace RLHF is already coming true piped.video/mwO6v4BlgZQ?si=GMp0…

Mixtral of Experts (Paper Explained)

#mixtral #mistral #chatgpt OUTLINE: 0:00 - Introduction 3:00 - Mi...

youtube.com

188

27,080

Nora Belrose · Feb 27, 2024 · 11:01 PM UTC

Nora Belrose

@norabelrose

27 Feb 2024

Long-awaited second post in our AI Optimism series! In this essay, we debunk the counting argument, a key argument for expecting that future AIs will engage in scheming: planning to escape, gain power, and pursue ulterior motives. optimists.ai/2024/02/27/coun…

Counting arguments provide no evidence for AI doom

This is Part 2 of an essay series that started with AI is easy to control. Introduction AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and …

optimists.ai

184

28,261

Nora Belrose · Aug 27, 2024 · 12:26 PM UTC

Nora Belrose

@norabelrose

27 Aug 2024

New SAEs for Llama 3.1 8B, now with twice as many latents. We trained them using the MultiTopK loss, which enables you to choose the degree of sparsity you want at inference time. Preliminary analysis suggests they are more interpretable than the 32x. huggingface.co/EleutherAI/sa…

EleutherAI/sae-llama-3.1-8b-64x · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

183

21,266

Nora Belrose · Sep 12, 2023 · 2:00 PM UTC

Nora Belrose

@norabelrose

12 Sep 2023

Training models purely on synthetic data is an enormous win for safety & alignment. Instead of loading LLMs with web garbage, then trying to remove it with RLHF, you train only on “good” data. And because it makes models more efficient, too, I expect it'll become standard.

Sebastien Bubeck

@SebastienBubeck

12 Sep 2023

Replying to @SebastienBubeck

How can such a small model have completions seemingly coming from a frontier LLM? Well, **Textbooks Are All You Need** strikes back! Indeed, on top of phi-1's data, phi-1.5 is trained *only on synthetic data*. See video to learn more abt this strategy. piped.video/24O1KcIO3FM

180

39,785

Nora Belrose · Oct 5, 2023 · 3:25 PM UTC

Nora Belrose

@norabelrose

5 Oct 2023

AI is in a “catch-up growth” phase driven by imitating human data, which will slow down as it reaches human level at many tasks. Economic growth can be fast when you're imitating stuff rich countries already did. It gets hard when you need to do new R&D. en.wikipedia.org/wiki/Conver…

Convergence (economics) - Wikipedia

en.wikipedia.org

175

25,783

Nora Belrose · Dec 15, 2023 · 2:13 AM UTC

Nora Belrose

@norabelrose

15 Dec 2023

Interpretability research requires open source AI. Closed source models are black boxes.

167

14,946

Nora Belrose · Nov 20, 2023 · 3:38 PM UTC

Nora Belrose

@norabelrose

20 Nov 2023

If both heads of the Superalignment team think the board should resign, it looks like this move is bad from a safety perspective too 👀

Jan Leike

@janleike

20 Nov 2023

Replying to @janleike

I think the OpenAI board should resign

168

20,884

Nora Belrose · Jan 31, 2025 · 10:24 PM UTC

Nora Belrose

@norabelrose

31 Jan 2025

This new paper from @davidchalmers42 is good. Extracting propositional attitudes from AI is more useful than chasing after "mechanistic" understanding. Also he cited a paper from our team, thanks for that :) arxiv.org/abs/2501.15740

Propositional Interpretability in Artificial Intelligence

Mechanistic interpretability is the program of explaining what AI systems are doing in terms of their internal mechanisms. I analyze some aspects of the program, along with setting out some...

arxiv.org

175

9,998

Nora Belrose · Jan 24, 2025 · 1:33 AM UTC

Nora Belrose

@norabelrose

24 Jan 2025

deepseek now largely replacing chatgpt for me

171

17,707

Nora Belrose · May 19, 2024 · 2:37 PM UTC

Nora Belrose

@norabelrose

19 May 2024

Yann is wrong about cats; intelligence is multidimensional and in many ways GPT is smarter than a cat. It's bad that the Superalignment team has fallen apart. That said, superalignment is not as "urgent" as Jan thinks bc we already have good methods to align very powerful AI.

Yann LeCun

@ylecun

18 May 2024

It seems to me that before "urgently figuring out how to control AI systems much smarter than us" we need to have the beginning of a hint of a design for a system smarter than a house cat. Such a sense of urgency reveals an extremely distorted view of reality. No wonder the more based members of the organization seeked to marginalize the superalignment group. It's as if someone had said in 1925 "we urgently need to figure out how to control aircrafts that can transport hundreds of passengers at near the speed of the sound over the oceans." It would have been difficult to make long-haul passenger jets safe before the turbojet was invented and before any aircraft had crossed the atlantic non-stop. Yet, we can now fly halfway around the world on twin-engine jets in complete safety. It didn't require some sort of magical recipe for safety. It took decades of careful engineering and iterative refinements. The process will be similar for intelligent systems. It will take years for them to get as smart as cats, and more years to get as smart as humans, let alone smarter (don't confuse the superhuman knowledge accumulation and retrieval abilities of current LLMs with actual intelligence). It will take years for them to be deployed and fine-tuned for efficiency and safety as they are made smarter and smarter.

166

70,285

Nora Belrose · Apr 26, 2024 · 10:50 PM UTC

Nora Belrose

@norabelrose

26 Apr 2024

Cool stuff, we found a similar result back in December arxiv.org/abs/2312.01037. Kind of upset they didn't cite/link to us tbh.

Eliciting Latent Knowledge from Quirky Language Models

Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the...

arxiv.org

Anthropic

@AnthropicAI

23 Apr 2024

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…

Image: A title card for the research post "Simple Probes Can Catch Sleeper Agents", illustrated with a picture of a Venus flytrap.

168

47,725

Nora Belrose · Oct 1, 2023 · 7:20 PM UTC

Nora Belrose

@norabelrose

1 Oct 2023

Games are kind of amazing because they show that humans are capable of deriving meaning from solving totally artificial “problems.” My hope for the glorious transhumanist future is that we spend the rest of time playing cool games together

160

15,245

Nora Belrose · Dec 28, 2023 · 11:36 PM UTC

Nora Belrose

@norabelrose

28 Dec 2023

Just as NYT shouldn’t stop OpenAI from using NYT content, OpenAI should also open source its models, giving up its monopoly on profiting from GPT. I am consistent on the issue of intellectual property.

Nora Belrose

@norabelrose

28 Dec 2023

I don’t really care what the current law on this is, but we should be working to destroy copyright as thoroughly as possible so I am on OpenAI’s side in this case.

124

19,137

Nora Belrose · Dec 30, 2023 · 5:16 PM UTC

Nora Belrose

@norabelrose

30 Dec 2023

One underrated effect of open source AI is it makes inference very cheap. The market mimics perfect competition bc no one has a moat. I much prefer this to the closed AI future, where an oligopoly of AGI labs make obscene profits gobbling up the economy. semianalysis.com/p/inference…

Inference Race To The Bottom - Make It Up On Volume?

Mixtral Inference Costs on H100, MI300X, H200, A100, Speculative Decoding

newsletter.semianalysis.com

155

42,370

Nora Belrose · Apr 12, 2023 · 5:57 PM UTC

Nora Belrose

@norabelrose

12 Apr 2023

I'm a little spooked by the AgentGPT and babyAGI stuff but you gotta admit that it's very good from a safety perspective that these things are thinking and saving their memories entirely in human-interpretable natural language

162

30,043

Nora Belrose · Dec 28, 2023 · 4:31 PM UTC

Nora Belrose

@norabelrose

28 Dec 2023

Replying to @ai_in_check

Intellectual property is theft from the commons

157

52,433

Nora Belrose · Oct 30, 2022 · 7:52 PM UTC

Nora Belrose

@norabelrose

30 Oct 2022

After hearing about @robinhanson’s grabby aliens resolution to the Fermi paradox, every other take on it just seems obviously wrong and not taking into account all the facts. I really hope the grabby aliens view becomes more widely known in the future.

155

Nora Belrose · Apr 19, 2024 · 7:27 AM UTC

Nora Belrose

@norabelrose

19 Apr 2024

Zuck's position is actually quite nuanced and thoughtful. He says that if they discover destructive AI capabilities that we can't build defenses for, they won't open source it. But he also thinks we should err on the side of openness. I agree.

Liron Shapira

@liron

18 Apr 2024

Dwarkesh calmly shreds Zuck's argument for open-sourcing AGI. The flimsy wishful thinking behind Meta's reckless actions has been exposed. Another incredible job by @dwarkesh_sp.

158

21,638

Nora Belrose · Jul 10, 2024 · 9:52 PM UTC

Nora Belrose

@norabelrose

10 Jul 2024

"Why is there something rather than nothing?" and "Why are we conscious rather than zombies?" are very similar questions. A partial answer to both is, "no one would be around to notice otherwise."

154

12,167

Nora Belrose · Jun 5, 2024 · 5:43 AM UTC

Nora Belrose

@norabelrose

5 Jun 2024

I don't see any reason to think AI architectures will rapidly (<1 year) become "alien" in ways that require "novel, qualitatively different" alignment techniques.

155

12,860

Nora Belrose · Dec 20, 2023 · 11:42 PM UTC

Nora Belrose

@norabelrose

20 Dec 2023

Real world example of an LLM locating a vulnerability in a web server. I expect language models to gradually improve at penetration testing like this, ultimately favoring cyber defense over cyber offense.

kache

@yacineMTB

20 Dec 2023

Kei0x was the one fuzzing (nothing but a little fun). turns out what he pointed at dingboard was an LLM powered fuzzer that he built. it found a bug and crashed my server almost immediately. incredible work

150

29,755

Nora Belrose · Aug 2, 2024 · 3:30 PM UTC

Nora Belrose

@norabelrose

2 Aug 2024

First ever SAEs trained on Llama 3.1 8B now available on the HuggingFace Hub here huggingface.co/EleutherAI/sa… We focused on layers 23 and 29 MLP output for this one, more are on the way.

EleutherAI/sae-llama-3.1-8b-32x · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

155

11,193

Nora Belrose · Jun 16, 2024 · 4:31 PM UTC

Nora Belrose

@norabelrose

16 Jun 2024

experienced the bliss of the first jhana this morning during meditation @nickcammarata was right, y'all don't know what you're missing

150

21,284

Nora Belrose · Nov 12, 2023 · 8:21 PM UTC

Nora Belrose

@norabelrose

12 Nov 2023

Adam outperforms vanilla SGD by rescaling each parameter update away from directions of high sharpness, where second-order terms in the Taylor expansion dominate. Parameters with large gradients also have large entries in the Hessian (high sharpness) arxiv.org/abs/2306.00204

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some...

arxiv.org

147

24,280
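A minimal sketch of the rescaling described in the tweet above, on a toy problem that is an assumption for illustration and is not taken from the linked paper: a diagonal quadratic loss 0.5 * sum(h_i * x_i^2) whose two coordinates have very different curvature (the values in HESSIAN_DIAG are hypothetical). At equal distance from the minimum, the sharp coordinate has the larger gradient, so plain SGD takes its largest step exactly where the loss is sharpest, while Adam's per-parameter normalization makes the step sizes comparable.

```python
import math

# Hypothetical diagonal Hessian: coordinate 0 is sharp, coordinate 1 is flat.
HESSIAN_DIAG = (100.0, 1.0)


def grad(x):
    """Gradient of the toy loss 0.5 * sum(h_i * x_i**2)."""
    return [h * xi for h, xi in zip(HESSIAN_DIAG, x)]


def sgd_step(x, lr=0.01):
    """Vanilla gradient step: the update is proportional to the raw gradient."""
    return [xi - lr * gi for xi, gi in zip(x, grad(x))]


def adam_step(x, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update with bias correction (t starts at 1)."""
    g = grad(x)
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    # Dividing by sqrt(v_hat) shrinks updates for parameters with large gradients.
    x = [xi - lr * mh / (math.sqrt(vh) + eps)
         for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x, m, v


start = [1.0, 1.0]
after_sgd = sgd_step(start)
after_adam, _, _ = adam_step(start, [0.0, 0.0], [0.0, 0.0], t=1)

# Per-coordinate step sizes: SGD moves 100x further along the sharp direction,
# Adam moves roughly lr along both.
print([abs(a - b) for a, b in zip(start, after_sgd)])
print([abs(a - b) for a, b in zip(start, after_adam)])
```

With a learning rate much larger than the one used here, the SGD step along the sharp coordinate would overshoot the minimum, which is the regime where second-order terms dominate; the Adam step size stays close to lr regardless.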

Nora Belrose · May 19, 2025 · 11:44 AM UTC

Nora Belrose

@norabelrose

19 May 2025

data attribution is the most neglected thing in interpretability and people should join me in working on it

151

11,335

Nora Belrose · May 10, 2024 · 12:32 AM UTC

Nora Belrose

@norabelrose

10 May 2024

We reconstructed the @lichess puzzle dataset @OpenAI used in their weak-to-strong generalization paper and published it on the @huggingface hub huggingface.co/datasets/Eleu…

EleutherAI/lichess-puzzles · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

145

28,957

Nora Belrose · Feb 19, 2023 · 11:24 PM UTC

Nora Belrose

@norabelrose

19 Feb 2023

Replying to @EigenGender @d_feldman

I guess mean, median, and mode each correspond to assuming a certain amount of mathematical structure. The mean assumes addition and scalar multiplication. Median just assumes an ordering. Mode only assumes you can count and distinguish elements.

139

61,645
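The point about mathematical structure in the tweet above can be sketched in code. The function names below are hypothetical and chosen for illustration: each one uses only the operations its statistic needs, so the mode works on anything hashable, the median on anything sortable, and the mean only on values that support addition and scaling.

```python
from collections import Counter
from fractions import Fraction


def mode(items):
    """Needs only equality (here, hashing): count and compare elements."""
    return Counter(items).most_common(1)[0][0]


def median_low(items):
    """Needs only an ordering: sort and take the (lower) middle element."""
    ordered = sorted(items)
    return ordered[(len(ordered) - 1) // 2]


def mean(items):
    """Needs addition and scalar multiplication."""
    total = items[0]
    for x in items[1:]:
        total = total + x
    return total * Fraction(1, len(items))


# Strings can be counted and ordered, so mode and median are defined...
print(mode(["red", "blue", "red"]))          # red
print(median_low(["pear", "apple", "fig"]))  # fig
# ...but the mean requires numbers (or vectors) that can be added and scaled.
print(mean([Fraction(1), Fraction(2), Fraction(6)]))  # 3
```

Calling mean on the list of strings fails with a TypeError, which is the structural restriction showing up at runtime.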

Nora Belrose · Aug 6, 2024 · 10:28 AM UTC

Nora Belrose

@norabelrose

6 Aug 2024

The @AiEleuther interp team pioneered novel, mechanistic methods for detecting anomalous behavior in LLMs based on @NeelNanda5's attribution patching. Sadly, none of these methods outperform non-mechanistic baselines that look only at activations. blog.eleuther.ai/mad_researc…

Mechanistic Anomaly Detection Research Update

Interim report on ongoing work on mechanistic anomaly detection

blog.eleuther.ai

142

9,245

Nora Belrose · Nov 4, 2023 · 5:56 PM UTC

Nora Belrose

@norabelrose

4 Nov 2023

Virtue ethics and deontology are a lot more computationally efficient than consequentialism, so we should expect neural nets to pursue virtues and follow rules rather than maximize utility by default.

135

17,185

Nora Belrose · Dec 24, 2023 · 1:13 PM UTC

Nora Belrose

@norabelrose

24 Dec 2023

RL with an entropy bonus is also Bayesian inference, where the prior is uniform over all possible actions. Just plug a uniform prior into the RL + KL penalty objective and expand it out. You get an entropy bonus plus an irrelevant log(n) term. arxiv.org/abs/2205.11275

RL with KL penalties is better viewed as Bayesian inference

Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as...

arxiv.org

136

17,463
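A short worked version of the "plug in a uniform prior and expand" step from the tweet above, as a sketch. The notation is an assumption for illustration and is not taken from the tweet: \pi is the policy, r(a) the reward, \beta > 0 the KL coefficient, \pi_0 the prior, n the number of actions, and H the Shannon entropy.

```latex
\begin{align*}
J(\pi) &= \mathbb{E}_{a \sim \pi}[r(a)] - \beta\, \mathrm{KL}(\pi \,\|\, \pi_0),
  \qquad \pi_0(a) = \tfrac{1}{n} \\
\mathrm{KL}(\pi \,\|\, \pi_0)
  &= \sum_a \pi(a) \log \frac{\pi(a)}{1/n}
   = \sum_a \pi(a) \log \pi(a) + \log n
   = -H(\pi) + \log n \\
J(\pi) &= \mathbb{E}_{a \sim \pi}[r(a)] + \beta\, H(\pi) - \beta \log n
\end{align*}
```

The final term does not depend on \pi, so it leaves the maximizer unchanged, and what remains is the usual reward-plus-entropy-bonus objective. Its maximizer is \pi^*(a) \propto \exp(r(a)/\beta), which is the posterior obtained from a uniform prior.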

Nora Belrose · Mar 12, 2023 · 5:34 PM UTC

Nora Belrose

@norabelrose

12 Mar 2023

Increasingly I think the "masked shoggoth" thing is a very bad metaphor for LLMs. Some people (e.g. Eliezer) seem to be interpreting it as saying that all LLMs have an alien mesaoptimizer inside of them, which is really unjustified IMO

Andrej Karpathy

@karpathy

6 Mar 2023

imo shoggoth meme is not exactly right, I'd like to request alternate meme art. Weird choice as the "monster" is a mirror to humanity, a compression of all of our text. There are many tentacles (facets), of a diverse set of emoji. We're trying to... isolate (?) the good ones.

124

160,498

Nora Belrose · Jan 2, 2024 · 11:06 AM UTC

Nora Belrose

@norabelrose

2 Jan 2024

The sequel to "AI is easy to control" will be a comprehensive and in-depth takedown of the main arguments for AI apocalypse. Our draft is already longer than the original, and will likely be roughly the length of Eliezer's AGI Ruin when finished (but much better written tbh)

126

14,050

Nora Belrose · Nov 25, 2023 · 5:24 PM UTC

Nora Belrose

@norabelrose

25 Nov 2023

Now I'm at like 70-75% confidence DPO kills RLHF. The only thing RLHF might have over DPO is data efficiency, but OpenAI and Anthropic have tons of pairwise comparison data bc they have deployed models so this probably doesn't matter

Nora Belrose

@norabelrose

25 Nov 2023

Replying to @_TechyBen

Yeah actually never mind, OpenAI is swimming in pairwise comparison data, this probably isn’t an issue

129

61,914

Nora Belrose · Aug 27, 2023 · 4:41 PM UTC

Nora Belrose

@norabelrose

27 Aug 2023

I had heard of this paper but I didn't realize until now that it came out in 2012, way before anyone proposed scaling laws for artificial neural nets (Baidu in 2017). Human intelligence is likely a scale thing, not an algorithmic thing pnas.org/doi/full/10.1073/pn…

The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated...

Neuroscientists have become used to a number of “facts” about the human brain: It has 100 billion neurons and 10- to 50-fold more glial cells; it i...

pnas.org

125

12,697

Nora Belrose · May 4, 2025 · 6:46 AM UTC

Nora Belrose

@norabelrose

4 May 2025

Intelligence is not about pursuing goals (future). It's about bringing knowledge and memory (past) to bear on a present problem-situation. Neural nets contract the past (Big Data) into heuristics. Directly simulating the future (planning) without a contracted past doesn't work.

122

10,982

Nora Belrose · Jan 1, 2025 · 2:52 AM UTC

Nora Belrose

@norabelrose

1 Jan 2025

tax AI to fund UBI and social programs

119

11,751

Nora Belrose · Aug 27, 2023 · 7:54 PM UTC

Nora Belrose

@norabelrose

27 Aug 2023

Literal scaling laws for biological neural nets! This also pre-dates the Baidu neural scaling law paper by a few months

David Schneider-Joseph 🔍

@TheDavidSJ

27 Aug 2023

Replying to @norabelrose

There’s also at least some tasks on which performance scales linearly with log pallial neuron count. linkinghub.elsevier.com/retr…

109

39,001

Nora Belrose · May 24, 2024 · 8:28 PM UTC

Nora Belrose

@norabelrose

24 May 2024

Last year, many people at @AiEleuther worked on a project to improve on @CollinBurns4's CCS method for eliciting latent knowledge from LLMs. We were unable to improve on CCS, but today we're publishing the proposed method and negative empirical results. blog.eleuther.ai/vincs/

VINC-S: Closed-form Optionally-supervised Knowledge Elicitation with Paraphrase Invariance

Writing up results from a project from Spring 2023

blog.eleuther.ai

116

18,573

Nora Belrose · Jul 12, 2024 · 3:57 PM UTC

Nora Belrose

@norabelrose

12 Jul 2024

We're retraining our Llama 3 8B SAEs on 10x more data, using the newer RedPajama v2 corpus. It'll take a couple more weeks to finish training, but early checkpoints for odd-numbered layers are available here huggingface.co/EleutherAI/sa…

EleutherAI/sae-llama-3-8b-32x-v2 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

119

8,263

Nora Belrose · Feb 15, 2024 · 7:52 PM UTC

Nora Belrose

@norabelrose

15 Feb 2024

For a long time I had assumed that photorealistic deepfakes would be produced using something like the Unreal Engine, with explicit physics simulation etc. I should have trusted more in the power of deep learning.

OpenAI

@OpenAI

15 Feb 2024

Introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. openai.com/sora Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”

114

8,687