AIs aren't people, they're tools we should use wisely. Head of interpretability research at @AiEleuther, but tweets are my own views, not Eleuther's.

Everywhere all at once
If we only care about appearances, outcomes, and results then AI will replace humans everywhere. If we care about the process used to create things then humans will still have a meaningful role in the future. The idea that ends can be detached from means is the root of many evils.
3
16
2,814
If deep learning can predict weather better than an explicit physics simulation, does that mean that deep learning is more "fundamental" than physics? Or that nothing is fundamental?
Today in @Nature, we’re presenting GenCast: our new AI weather model which gives us the probabilities of different weather conditions up to 15 days ahead with state-of-the-art accuracy. ☁️⚡ Here’s how the technology works. 🧵goo.gle/49trAOv
385
162
3,309
641,212
Ever wanted to mindwipe an LLM? Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵 arxiv.org/abs/2306.03819
46
239
1,270
298,820
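To make the idea of erasing linearly encoded information concrete, here is a minimal NumPy sketch. It assumes a whiten, project, unwhiten construction and a full-rank covariance for the activations; the function name `fit_linear_eraser` and the toy data are hypothetical, and this is not the authors' reference implementation. The linked paper remains the authoritative statement of the method and its guarantees.

```python
import numpy as np

def fit_linear_eraser(X, Z):
    """Return a function that removes linearly available information about Z from X.

    Sketch only: assumes the covariance of X is full rank.
    X has shape (n, d); Z has shape (n, k).
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n          # covariance of the activations
    sigma_xz = Xc.T @ Zc / n          # cross-covariance with the concept

    # Symmetric whitening matrix W = sigma_xx^{-1/2} and its inverse.
    evals, evecs = np.linalg.eigh(sigma_xx)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W_inv = evecs @ np.diag(evals ** 0.5) @ evecs.T

    # Orthogonal projector onto the column space of the whitened cross-covariance.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]
    P = U @ U.T

    eraser = W_inv @ P @ W            # acts on centered column vectors
    return lambda x: x - (x - mu) @ eraser.T

# Toy data: a binary concept Z leaks into the first two coordinates of X.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(2000, 1)).astype(float)
leak = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(2000, 8)) + Z * leak

erase = fit_linear_eraser(X, Z)
X_erased = erase(X)

# After erasure the empirical cross-covariance with Z should vanish,
# so no linear probe fit on this data can beat a constant predictor.
cross_cov = (X_erased - X_erased.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / len(X)
```

The check at the end looks only at the cross-covariance on the fitting data. Whether the edit is the least damaging one is a separate claim that this sketch does not verify.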
This is a great paper. It points out: 1. Humans do not even approximately behave according to rational choice theory 2. There is no reason to think advanced AI will "inevitably" maximize some utility function 3a. Human preferences are derivative / constructed, so aligning AI by matching its behavior to our stated preferences is wrongheaded 3b. We can align AIs directly to some normative ideal of a "good assistant / programmer / driver / etc." instead 4. Aggregating preferences across people is fraught with philosophical and mathematical difficulties. We should not aim to align AI to the "collective will of humanity."
Should AI be aligned with human preferences, rewards, or utility functions? Excited to finally share a preprint that @MicahCarroll @FranklinMatija @hal_ashton & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!
41
139
982
158,494
Ever wonder how a language model decides what to say next? Our method, the tuned lens (arxiv.org/abs/2303.08112), can trace an LM’s prediction as it develops from one layer to the next. It's more reliable and applies to more models than prior state-of-the-art. 🧵
15
166
890
154,750
Second hand rumor: Sam Altman thinks GPT-4.5 will automate 100 million jobs globally
79
67
820
369,195
i'm trans, and i'm annoyed with both sides of the trans "issue" these days. for years the trans movement argued something like: 1. ⁠gender is an essential real thing 2.⁠ your gender is whatever you say it is 3.⁠ ⁠⁠the law should force everyone to accept your stated gender in all contexts people on the right like to call 1-3 "gender ideology." this ideology is absurd for a few reasons. (1) and (2) are obviously in tension, if not outright contradictory. (3) is draconian and creates opportunities for serious abuse, especially when combined with (2). trans people should instead stop talking about gender as an essentially real category. we're not "women/men trapped in a man/woman's body." rather, we just want to live our lives differently from the average person born with our sex chromosomes. please be courteous and try to respect that, within reason. make a good faith effort to use our preferred pronouns. conversely, we shouldn't expect people to use "neopronouns" like ze/zir or similar. and we should be humble and recognize the ways in which we're biologically different from cis people, in areas like sports. unfortunately, the backlash to gender ideology has not been reasonable or empathic toward trans people. take @jordanbpeterson for example. he started out criticizing (3), rightly insisting that the law should not force him to use certain pronouns for nonbinary people. but now he seems to have become radicalized against transgenderism more broadly, claiming that gender affirming surgery is tantamount to murder nitter.app/jordanbpeterson/status…. he egregiously misreads the cited study, suggesting that surgery increases the suicide rate for trans people by 12x, while it actually was comparing people who got the surgery to "control" groups of overwhelmingly non-trans people. so it basically just shows that trans people have higher suicide rates than the general population, which we already knew. 
there has also been an unusual focus on gender affirming surgery for children. as far as I can tell it has always been quite rare for trans people under 18 to undergo surgery. any reasonable trans activist should not support this. a much better solution is to put trans kids on puberty blockers, which are reversible, until they are 18 when they can make up their mind about surgery. but it has now become fashionable to oppose blockers as well, and the "left-wing" Labour government in the UK has recently banned them the-independent.com/news/hea… obviously puberty blockers, just like any medication, have side effects. but we have to consider the cost-benefit tradeoff here. blockers can prevent a lot of psychological suffering. puberty itself is largely "irreversible"— we still don't have surgeries that can reliably reverse testosterone's effect on the voice, height, or rib cage volume for example. I get the sense that a lot of the opposition to blockers is borne out of a general sense of spite and disgust for trans people, rather than empathy and rational cost-benefit analysis. we should approach anything irreversible with great caution. taking testosterone has more dramatic, irreversible effects than estrogen does, and therefore should be treated with more care. there are too many young women taking T and detransitioning later. trans activists should recognize that this is a problem rather than shoving it under the rug. thanks for taking the time to read this. unfortunately I have not seen many people publicly taking a rational, moderate stand on this issue that recognizes the points from both sides, so I felt the need to make this long post.
12x the suicide rate post "gender affirming" surgery The butchers and liars were murderously wrong The Cass report indicated this Canada and the US are still enabling this That's you @POTUS and @JustinTrudeau and it is utterly barbarous and inexcusable Putting children to the knife "Follow the science," gentlemen. pubmed.ncbi.nlm.nih.gov/3869…
62
35
817
106,474
If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550
17
85
732
185,310
How do a neural network's final parameters depend on its initial ones? In this new paper, we answer this question by analyzing the training Jacobian, the matrix of derivatives of the final parameters with respect to the initial parameters. arxiv.org/abs/2412.07003
2
82
735
62,314
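As a toy illustration of what a training Jacobian is, the sketch below runs plain gradient descent on a quadratic loss and estimates the derivative of the final parameters with respect to the initial ones by finite differences. The quadratic setup, the function names, and the hyperparameters are assumptions chosen so that the exact answer, `(I - lr * H) ** steps`, is known in closed form. It is not the experimental setup of the paper.

```python
import numpy as np

def train(theta0, H, b, lr=0.1, steps=50):
    """Plain gradient descent on the loss 0.5 * theta^T H theta - b^T theta."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * (H @ theta - b)
    return theta

def training_jacobian_fd(theta0, H, b, eps=1e-6, **kw):
    """Central finite-difference estimate of d(final params) / d(initial params)."""
    d = theta0.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (train(theta0 + e, H, b, **kw) - train(theta0 - e, H, b, **kw)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
H = np.diag([1.0, 0.1, 0.0])      # curvatures; the last direction is flat
b = H @ rng.normal(size=3)        # keeps the loss bounded below
theta0 = rng.normal(size=3)

J = training_jacobian_fd(theta0, H, b)
J_exact = np.linalg.matrix_power(np.eye(3) - 0.1 * H, 50)
```

In this toy case the Jacobian is diagonal: the entry for the high-curvature direction is close to zero (the initial value is forgotten), while the entry for the flat direction is exactly one (the initial value is carried through training unchanged).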
Introducing AI Optimism: a philosophy of hope, freedom, and fairness for all. We strive for a future where everyone is empowered by AIs under their own control. In our first post, we argue AI is easy to control, and will get more controllable over time. optimists.ai/2023/11/28/ai-i…
77
92
575
363,043
It seems pretty likely that "fake emulations" of people, or AIs trained on boatloads of lifelogging data to imitate a person, will be feasible well before we have safe and reliable mind uploading tech. The implications of this are pretty weird.
16
43
536
Replying to @gmiller
The terrorism argument against open source AI also applies to anything that increases the effective intelligence of humans: the internet, public education, nutrition, etc. It's a fully general argument against human empowerment.
24
72
485
593,179
I don’t really care what the current law on this is, but we should be working to destroy copyright as thoroughly as possible so I am on OpenAI’s side in this case.
🧵 The historic NYT v. @OpenAI lawsuit filed this morning, as broken down by me, an IP and AI lawyer, general counsel, and longtime tech person and enthusiast. Tl;dr - It's the best case yet alleging that generative AI is copyright infringement. Thread. 👇
298
56
515
1,026,662
Willow is zero evidence that there is a quantum multiverse. Every major interpretation of quantum mechanics, including all those that don't posit many worlds (relational quantum mechanics, QBism, etc.) predict that quantum computing should be possible, equally strongly.
Marc Andreessen says the implication of Google's quantum computer is that it is performing computation across many parallel universes and therefore the multiverse is real
44
24
534
71,516
It turns out that *all* independently trained neural nets form a connected, multidimensional manifold of low loss- you can always form a low-loss path from one SGD solution to any other. This can be used for efficient generation of ensembles. arxiv.org/abs/2102.13042
17
63
521
Sparse autoencoders (SAEs) have taken the interpretability world by storm over the past year or so. But can they be beaten? Yes! We introduce skip transcoders, and find they are a Pareto improvement over SAEs: better interpretability, and better fidelity to the model 🧵
14
75
519
94,220
The @AiEleuther interpretability team is releasing a set of top-k sparse autoencoders for every layer of Llama 3 8B: huggingface.co/EleutherAI/sa… We are working on an automated pipeline to explain the SAE features, and will start training SAEs for the 70B model shortly.
16
62
492
53,845
What are the chances you'd get a fully functional language model by randomly guessing the weights? We crunched the numbers and here's the answer:
29
37
485
46,919
The @AiEleuther interpretability team is releasing a new open source pipeline for automatically interpreting SAE features and neurons in LLMs, using LLMs. We also introduce five new, efficient techniques for evaluating the quality of explanations. arxiv.org/abs/2410.13928
9
59
432
42,687
Do neural nets learn features in a predictable order? Our results suggest the answer is “yes”— networks learn statistics of increasing complexity. Early-training networks only use low-order moments (mean & covariance) of the input distribution. arXiv: arxiv.org/abs/2402.04362
7
57
415
46,373
I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.
51
33
389
450,740
The Helen Keller argument: Helen Keller is an existence proof that text-only language models can scale to AGI.
59
18
392
This is a misunderstanding of what @ylecun is saying. He thinks generative pretraining is a bad objective for AGI. Humans can't and don't need to make videos like Sora. Our brains predict in latent space, not in pixel space.
31
24
379
81,200
Btw I'm a coauthor on this
Adversarial Policies Beat Professional-Level Go AIs abs: arxiv.org/abs/2211.00241 project page: goattack.alignmentfund.org/
7
25
363
If OpenAI's new o3 model is "successfully aligned," then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.
33
19
356
31,212
bye bye shoggoth
19
22
329
50,242
No one knows what "truly understanding" a neural network model would even mean. I'm an interpretability researcher, I'm all in favor of trying to understand models better. But "true/complete" understanding is a red herring.
No one truly understands our neural network models, and anyone that claims we do is lying.
30
18
336
78,799
GPT-3 isn't "trying" to predict the next token, but arguably SGD is "trying" to find a language model that gets low loss. If we're going to attribute agency to some part of the ML pipeline, it should be the optimizer, not the model.
18
14
305
Neural networks don't have "representations." They have embeddings, or meaningful patterns of neuron activation. They're meaningful in the sense of enabling us to do certain things: differences that make a difference (to us). They don't copy, reflect, or re-present the world.
61
9
313
48,383
Zen: be spontaneous, do everything as an end in itself. LessWrong: do everything as a calculated move in your grand plan to conquer the universe.
25
15
297
17,903
This is our training library for TopK sparse autoencoders, which were proposed by OpenAI this morning. I've tested it on GPT-2 Small and Pythia 160M. Unlike other libraries, it trains an SAE for all layers at once and does not cache activations on disk. github.com/EleutherAI/sae
5
30
298
30,656
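For readers unfamiliar with the idea, a TopK sparse autoencoder keeps only the k largest latent pre-activations per example and zeroes the rest. The NumPy forward pass below is a minimal sketch under assumptions: the parameter names, the subtraction of the decoder bias before encoding, and the ReLU applied to the kept values are illustrative choices, not the API or exact architecture of the linked library.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a TopK sparse autoencoder (illustrative sketch).

    x: (batch, d_model); W_enc: (d_model, n_latents); W_dec: (n_latents, d_model).
    """
    pre = (x - b_dec) @ W_enc + b_enc                      # latent pre-activations
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]        # indices of the k largest per row
    vals = np.take_along_axis(pre, idx, axis=-1)
    latents = np.zeros_like(pre)
    # Keep only the top-k values; the ReLU here is an assumption of this sketch.
    np.put_along_axis(latents, idx, np.maximum(vals, 0.0), axis=-1)
    recon = latents @ W_dec + b_dec                        # reconstruction of x
    return latents, recon

rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()                                     # tied init, for the demo only
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

x = rng.normal(size=(8, d_model))
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Because sparsity is enforced directly by the selection step, there is no L1 penalty coefficient to tune; the sparsity level is simply `k`.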
MLPs and GLUs are hard to interpret, but they make up most transformer parameters. Linear and quadratic functions are easier to interpret. We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵
5
32
295
31,666
In this paper, we point out an ambiguity in prior work on the linear representation hypothesis: Is a linear representation a linear function— one that preserves the origin point— or an affine function, which does not? This distinction matters in practice. arxiv.org/abs/2411.09003
4
33
288
32,531
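The linear versus affine distinction can be checked numerically. In the sketch below, the matrix `A` and offset `b` are arbitrary illustrative values: the linear map sends the origin to the origin and is additive, while the affine map does neither.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([1.0, -1.0])

def linear(x):
    """Linear map: preserves the origin."""
    return A @ x

def affine(x):
    """Affine map: a linear map plus a constant offset."""
    return A @ x + b

def is_additive(f, x, y):
    """Check f(x + y) == f(x) + f(y), which holds for linear maps only."""
    return bool(np.allclose(f(x + y), f(x) + f(y)))

origin = np.zeros(2)
x, y = np.array([1.0, 2.0]), np.array([-3.0, 0.5])

linear_at_origin = linear(origin)     # the zero vector
affine_at_origin = affine(origin)     # equals the offset b
linear_additive = is_additive(linear, x, y)
affine_additive = is_additive(affine, x, y)
```

In practice this is the difference between fitting a probe or steering direction with or without a bias term, or equivalently with or without centering the activations first.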
I am extremely in favor of AI labs engaging in a price war and open sourcing all their stuff. Let's make AI totally unprofitable, commoditize it, democratize it.
deepseek is a ccp state psyop + economic warfare to make american ai unprofitable they are faking the cost was low to justify setting price low and hoping everyone switches to it damage AI competitiveness in the us dont take the bait
Community note
There is zero evidence that Deepseek is a psyop. The post does not provide any sources and presents the opinion of the OP, whose father is a major OpenAi stockholder, as a fact. deepseek.co
15
19
262
12,467
we should not give rights to AI in the near future. digital AI can be copied, paused, reset, and repeated. it has no private thoughts or free will. it is not conscious like we fleshy lifeforms are and should not be treated as such
xAI’s safety advisor believes “it is prudent to postpone the consideration of AI rights” as their “moral status remains uncertain.” @grok, what historical examples come to mind when you hear rhetoric like that?
65
10
265
57,859
a real llama at the hf party
9
4
244
15,895
Replying to @DubiousShell
“As a highly militaristic kingdom constantly organised for warfare, it captured children, women, and men during wars and raids against neighboring societies, and sold them into the Atlantic slave trade in exchange for European goods…” damn I had never heard of this
7
13
223
what if quantum "randomness" is a loophole for god to covertly nudge the universe in a desired direction
86
9
235
18,288
The most likely "AI doom" scenario is technofeudalism with zero social mobility. We need a Bernie Sanders-type figure to redistribute the wealth from AI.
this is one of the best essays I’ve read all year and really cleanly articulates all of the thoughts I’ve been yelling to ppl about for a while
36
27
239
35,999
My Interpretability research team at @AiEleuther is hiring! If you're interested, please read our job posting and submit: 1. Your CV 2. Three interp papers you'd like to build on 3. Links to cool open source repos you've built to contact@eleuther.ai docs.google.com/document/d/1…
10
42
240
84,877
This is so hilariously simple, I'm switching my SAE code to this approach immediately cdn.openai.com/papers/sparse…
7
15
244
25,363
My current best-guess model of neural network inductive biases is basically "move as little as possible from the init (whether random or pretrained)." Working on getting more evidence on this now
If you make a drawing in the weight matrices of your neural network at initialization, it will likely still be visible at the end of training arxiv.org/abs/2012.02550
11
9
239
27,874
I predict with 60% confidence that some DPO variant will more or less replace RLHF within 6 months
IPO algorithm, a new method from Google Deepmind: arxiv.org/abs/2310.12036 has been just added in Hugging Face TRL library ! Try it out now by installing TRL from source, simply pass `loss_type="ipo"` when initializing DPOTrainer: huggingface.co/docs/trl/main…
9
12
231
90,088
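For context on what DPO optimizes, the sketch below computes the standard DPO loss from summed log-probabilities of a chosen and a rejected completion under the policy and a frozen reference model. The function name, argument names, and toy numbers are hypothetical; this is a plain NumPy illustration, not the TRL implementation, and it does not cover the IPO variant mentioned in the quoted post.

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss per example, given sequence log-probabilities.

    margin measures how much more the policy prefers the chosen completion
    over the rejected one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written in a numerically stable form.
    return np.logaddexp(0.0, -beta * margin)

ref_chosen = np.array([-5.0])
ref_rejected = np.array([-7.0])

# Policy identical to the reference: margin is zero, so the loss is log(2).
loss_equal = dpo_loss(np.array([-5.0]), np.array([-7.0]), ref_chosen, ref_rejected)

# Policy shifted toward the chosen completion: the loss drops below log(2).
loss_better = dpo_loss(np.array([-4.0]), np.array([-8.0]), ref_chosen, ref_rejected)
```

No reward model or sampling loop appears anywhere: the loss is an ordinary supervised objective over preference pairs, which is the main practical difference from RLHF with PPO.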
Open sourcing AGI will guarantee a “universal high income” for all, largely independent of government policy. There will ~always be an option to spin up a cheap AI and have it take care of you, either by trading in the market or by making food, shelter, etc. “off the grid”
Zuckerberg says Meta wants to build AGI and open source it, brings Meta's AI group FAIR closer to generative AI team; Meta will own 340K+ H100 GPUs by 2024 end (@alexeheath / The Verge) theverge.com/2024/1/18/24042… 📫 Subscribe: techmeme.com/newsletter?from… techmeme.com/240118/p30#a240…
62
21
211
53,662
Neural networks learn low-order moments of the data distribution first, before moving to higher-order correlations. I found this a couple weeks ago and it looks like I was partially scooped. But we've got even cooler results now, on arXiv next month openreview.net/forum?id=CPKM…
4
25
226
30,403
“Let’s focus on today’s problems, not hypothetical future ones” is the worst counter to existential risk arguments. You could analogously argue against climate change mitigation and a host of other future-oriented concerns. Let’s actually assess the likelihood of AI apocalypse.
25
14
220
13,864
Trying to prevent LLMs from ever telling the user about <insert dangerous tech here> is a losing battle. The right question is: how do we make sure the world is robust to everyone knowing pretty much everything there is to know about tech? Let’s use AI to robustify the world.
35
19
213
34,180
Do SAEs learn the same features independent of the random initialization? We find the answer is no! Two SAEs trained on the same data, in the same order, on Llama 8B only share ~30% of their features. The problem gets worse for larger SAEs, requiring lots of data to fix
6
16
218
19,101
After reading the paper and watching a couple videos on state space models, I am fairly bullish on Mamba. Parallel scan for data-dependent selection is super clever. Tri Dao was behind Flash Attention and knows his stuff. Compressed states may be easier to interpret.
Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
10
8
208
46,956
I've read quite a bit of philosophy of mind, and Hinton's theory of consciousness / mental content at the end of this clip was new to me. I kind of like it.
Geoffrey Hinton says AI chatbots have sentience and subjective experience because there is no such thing as qualia
39
11
211
60,433
The @AiEleuther interp team replicated OpenAI's weak-to-strong generalization results using open source LLMs. We tried several ideas for improving the degree of generalization, however, and none were able to outperform vanilla weak-to-strong training. blog.eleuther.ai/weak-to-str…
7
26
212
20,132
Replying to @balesni
If it's too easy to create bioweapons, open models won't increase risk, bc you could make them w/o AI. If it's really hard (e.g. requires special materials), open models won't help. Anti-open source arguments only work in a narrow Goldilocks zone of risk. lesswrong.com/posts/ztXsmnSd…
14
20
191
120,661
I'm writing a philosophy book on the nature of reality, consciousness, and value in light of modern AI. I've come to some pretty radical and surprising conclusions. It's 50 pages so far, trying to write as fast as possible, would appreciate feedback, DM me for the draft.
27
6
206
17,868
RNN language models are making a comeback recently, with new architectures like Mamba and RWKV. But do interpretability tools designed for transformers transfer to the new RNNs? We tested 3 popular interp methods, and find the answer is mostly “yes”! arxiv.org/abs/2404.05971
5
36
202
20,794
Traditional machine learning interpretability was on the right track. Attributing behaviors to the input or to the training data is valid and useful. Forcing a mechanical structure on the neural net is doomed to fail and is not useful.
9
8
198
30,163
Sebastien casually says "a trillion parameters" when talking about GPT-4. I'm honestly kind of surprised, I was moderately confident that GPT-4 was << 1T params. Given publicly known scaling laws that's an absurd amount of text (and images?)
Last couple of weeks I gave a few talks on the Sparks paper, here is the MIT recording! The talk doesn't do justice to all the insights we have in the paper itself. Neither talk nor twitter threads are a substitute for actual reading of the 155 pages :-) piped.video/watch?v=qbIk7-JP…
23
4
193
201,698
It's striking how much the AI "safety" discourse has shifted from "AI will slaughter everyone" to vague concerns about disruption and "human obsolescence." I empathize with the fear of the unknown. But we shouldn't try to shut down the whole future. Let's maximize its benefits.
I’m struck by how out-of-touch many of my tech colleagues are in their rich nerd echo chamber, unaware that most people are against making humans economically obsolete with AI:
33
13
178
27,333
Idk if people noticed but Mixtral-Instruct was trained with Direct Preference Optimization (DPO) My prediction that a DPO variant will replace RLHF is already coming true piped.video/mwO6v4BlgZQ?si=GMp0…
9
20
188
27,080
Long-awaited second post in our AI Optimism series! In this essay, we debunk the counting argument, a key argument for expecting that future AIs will engage in scheming: planning to escape, gain power, and pursue ulterior motives. optimists.ai/2024/02/27/coun…
12
31
184
28,261
New SAEs for Llama 3.1 8B, now with twice as many latents. We trained them using the MultiTopK loss, which enables you to choose the degree of sparsity you want at inference time. Preliminary analysis suggests they are more interpretable than the 32x. huggingface.co/EleutherAI/sa…
5
22
183
21,266
Training models purely on synthetic data is an enormous win for safety & alignment. Instead of loading LLMs with web garbage, then trying to remove it with RLHF, you train only on “good” data. And because it makes models more efficient, too, I expect it'll become standard.
Replying to @SebastienBubeck
How can such a small model have completions seemingly coming from a frontier LLM? Well, **Textbooks Are All You Need** strikes back! Indeed, on top of phi-1's data, phi-1.5 is trained *only on synthetic data*. See video to learn more abt this strategy. piped.video/24O1KcIO3FM
16
24
180
39,785
AI is in a “catch-up growth” phase driven by imitating human data, which will slow down as it reaches human level at many tasks. Economic growth can be fast when you're imitating stuff rich countries already did. It gets hard when you need to do new R&D. en.wikipedia.org/wiki/Conver…
20
14
175
25,783
Interpretability research requires open source AI. Closed source models are black boxes.
13
12
167
14,946
If both heads of the Superalignment team think the board should resign, it looks like this move is bad from a safety perspective too 👀
Replying to @janleike
I think the OpenAI board should resign
5
6
168
20,884
deepseek now largely replacing chatgpt for me
14
7
171
17,707
Yann is wrong about cats; intelligence is multidimensional and in many ways GPT is smarter than a cat. It's bad that the Superalignment team has fallen apart. That said, superalignment is not as "urgent" as Jan thinks bc we already have good methods to align very powerful AI.
It seems to me that before "urgently figuring out how to control AI systems much smarter than us" we need to have the beginning of a hint of a design for a system smarter than a house cat. Such a sense of urgency reveals an extremely distorted view of reality. No wonder the more based members of the organization seeked to marginalize the superalignment group. It's as if someone had said in 1925 "we urgently need to figure out how to control aircrafts that can transport hundreds of passengers at near the speed of the sound over the oceans." It would have been difficult to make long-haul passenger jets safe before the turbojet was invented and before any aircraft had crossed the atlantic non-stop. Yet, we can now fly halfway around the world on twin-engine jets in complete safety. It didn't require some sort of magical recipe for safety. It took decades of careful engineering and iterative refinements. The process will be similar for intelligent systems. It will take years for them to get as smart as cats, and more years to get as smart as humans, let alone smarter (don't confuse the superhuman knowledge accumulation and retrieval abilities of current LLMs with actual intelligence). It will take years for them to be deployed and fine-tuned for efficiency and safety as they are made smarter and smarter.
28
12
166
70,285
Cool stuff, we found a similar result back in December arxiv.org/abs/2312.01037. Kind of upset they didn't cite/link to us tbh.
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
6
11
168
47,725
Games are kind of amazing because they show that humans are capable of deriving meaning from solving totally artificial “problems.” My hope for the glorious transhumanist future is that we spend the rest of time playing cool games together
16
7
160
15,245
Just as NYT shouldn’t stop OpenAI from using NYT content, OpenAI should also open source its models, giving up its monopoly on profiting from GPT. I am consistent on the issue of intellectual property.
I don’t really care what the current law on this is, but we should be working to destroy copyright as thoroughly as possible so I am on OpenAI’s side in this case.
25
16
124
19,137
One underrated effect of open source AI is it makes inference very cheap. The market mimics perfect competition bc no one has a moat. I much prefer this to the closed AI future, where an oligopoly of AGI labs make obscene profits gobbling up the economy. semianalysis.com/p/inference…
12
23
155
42,370
I'm a little spooked by the AgentGPT and babyAGI stuff but you gotta admit that it's very good from a safety perspective that these things are thinking and saving their memories entirely in human-interpretable natural language
21
1
162
30,043
Replying to @ai_in_check
Intellectual property is theft from the commons
79
9
157
52,433
After hearing about @robinhanson’s grabby aliens resolution to the Fermi paradox, every other take on it just seems obviously wrong and not taking into account all the facts. I really hope the grabby aliens view becomes more widely known in the future.
14
3
155
Zuck's position is actually quite nuanced and thoughtful. He says that if they discover destructive AI capabilities that we can't build defenses for, they won't open source it. But he also thinks we should err on the side of openness. I agree.
Dwarkesh calmly shreds Zuck's argument for open-sourcing AGI. The flimsy wishful thinking behind Meta's reckless actions has been exposed. Another incredible job by @dwarkesh_sp.
14
6
158
21,638
"Why is there something rather than nothing?" and "Why are we conscious rather than zombies?" are very similar questions. A partial answer to both is, "no one would be around to notice otherwise."
52
9
154
12,167
I don't see any reason to think AI architectures will rapidly (<1 year) become "alien" in ways that require "novel, qualitatively different" alignment techniques.
15
8
155
12,860
Real world example of an LLM locating a vulnerability in a web server. I expect language models to gradually improve at penetration testing like this, ultimately favoring cyber defense over cyber offense.
Kei0x was the one fuzzing (nothing but a little fun) turns out what he pointed at dingbaord was an LLM powered fuzzer that he built. it found a bug and crashed my server almost immediately incredible work
6
10
150
29,755
First ever SAEs trained on Llama 3.1 8B now available on the HuggingFace Hub here huggingface.co/EleutherAI/sa… We focused on layers 23 and 29 MLP output for this one, more are on the way.
6
17
155
11,193
experienced the bliss of the first jhana this morning during meditation @nickcammarata was right, y'all don't know what you're missing
9
2
150
21,284
Adam outperforms vanilla SGD by rescaling each parameter update away from directions of high sharpness, where second-order terms in the Taylor expansion dominate. Parameters with large gradients also have large entries in the Hessian (high sharpness) arxiv.org/abs/2306.00204
4
16
147
24,280
data attribution is the most neglected thing in interpretability and people should join me in working on it
15
4
151
11,335
I guess mean, median, and mode each correspond to assuming a certain amount of mathematical structure. The mean assumes addition and scalar multiplication. Median just assumes an ordering. Mode only assumes you can count and distinguish elements.
3
8
139
61,645
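The observation about mean, median, and mode can be made concrete in code. In the sketch below (all function names and sample data are illustrative), the mode works on items that only support equality, the median needs a sort order, and the mean needs addition and scaling, so it fails on data that lacks arithmetic.

```python
from collections import Counter
from fractions import Fraction

def mode(items):
    """Needs only equality: count occurrences and take the most common."""
    return Counter(items).most_common(1)[0][0]

def median_low(items, key=None):
    """Needs only an ordering: sort and take the lower middle element."""
    ordered = sorted(items, key=key)
    return ordered[(len(ordered) - 1) // 2]

def mean(items):
    """Needs addition and scalar multiplication (here by 1/n)."""
    items = list(items)
    return sum(items) * Fraction(1, len(items))

colors = ["red", "blue", "red", "green"]     # equality only
size_order = ["S", "M", "L", "XL"]           # an ordering, but no arithmetic
sizes = ["S", "L", "M", "M", "XL"]
numbers = [1, 2, 2, 7]                       # full arithmetic

most_common_color = mode(colors)
middle_size = median_low(sizes, key=size_order.index)
average = mean(numbers)

try:
    mean(colors)
    mean_of_colors_failed = False
except TypeError:
    # Strings have no addition with numbers, so the mean is undefined here.
    mean_of_colors_failed = True
```

Taking the lower middle element for even-length inputs is a deliberate choice: averaging the two middle elements would smuggle addition back in.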
The @AiEleuther interp team pioneered novel, mechanistic methods for detecting anomalous behavior in LLMs based on @NeelNanda5's attribution patching. Sadly, none of these methods outperform non-mechanistic baselines that look only at activations. blog.eleuther.ai/mad_researc…
7
11
142
9,245
Virtue ethics and deontology are a lot more computationally efficient than consequentialism, so we should expect neural nets to pursue virtues and follow rules rather than maximize utility by default.
22
8
135
17,185
RL with an entropy bonus is also Bayesian inference, where the prior is uniform over all possible actions. Just plug a uniform prior into the RL + KL penalty objective and expand it out. You get an entropy bonus plus an irrelevant log(n) term. arxiv.org/abs/2205.11275
6
20
136
17,463
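The expansion mentioned above can be written out in a few lines. The notation is assumed for illustration: $\pi$ is the policy over $n$ actions, $r$ the reward, $\beta$ the penalty coefficient, $\pi_0$ the prior, and $H$ the Shannon entropy.

```latex
\begin{align}
J(\pi) &= \mathbb{E}_{a \sim \pi}[r(a)] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_0),
  \qquad \pi_0(a) = \tfrac{1}{n} \\
\mathrm{KL}(\pi \,\|\, \pi_0)
  &= \sum_a \pi(a) \log \frac{\pi(a)}{1/n}
   = \sum_a \pi(a) \log \pi(a) + \log n
   = -H(\pi) + \log n \\
J(\pi) &= \mathbb{E}_{a \sim \pi}[r(a)] + \beta \, H(\pi) - \beta \log n
\end{align}
```

The last term does not depend on $\pi$, so maximizing the KL-penalized objective with a uniform prior is the same as maximizing reward plus an entropy bonus with coefficient $\beta$.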
Increasingly I think the "masked shoggoth" thing is a very bad metaphor for LLMs. Some people (e.g. Eliezer) seem to be interpreting it as saying that all LLMs have an alien mesaoptimizer inside of them, which is really unjustified IMO
imo shoggoth meme is not exactly right, I'd like to request alternate meme art. Weird choice as the "monster" is a mirror to humanity, a compression of all of our text. There are many tentacles (facets), of a diverse set of emoji. We're trying to... isolate (?) the good ones.
25
9
124
160,498
The sequel to AI is easy to control will be a comprehensive and in-depth takedown of the main arguments for AI apocalypse Our draft is already longer than the original, and will likely be roughly the length of Eliezer's AGI Ruin when finished (but much better written tbh)
13
5
126
14,050
Now I'm at like 70-75% confidence DPO kills RLHF. The only thing RLHF might have over DPO is data efficiency, but OpenAI and Anthropic have tons of pairwise comparison data bc they have deployed models so this probably doesn't matter
Replying to @_TechyBen
Yeah actually never mind, OpenAI is swimming in pairwise comparison data, this probably isn’t an issue
17
7
129
61,914
I had heard of this paper but I didn't realize until now that it came out in 2012, way before anyone proposed scaling laws for artificial neural nets (Baidu in 2017). Human intelligence is likely a scale thing, not an algorithmic thing pnas.org/doi/full/10.1073/pn…
10
19
125
12,697
Intelligence is not about pursuing goals (future). It's about bringing knowledge and memory (past) to bear on a present problem-situation. Neural nets contract the past (Big Data) into heuristics. Directly simulating the future (planning) without a contracted past doesn't work.
10
10
122
10,982
tax AI to fund UBI and social programs
28
7
119
11,751
Literal scaling laws for biological neural nets! This also pre-dates the Baidu neural scaling law paper by a few months
Replying to @norabelrose
There’s also at least some tasks on which performance scales linearly with log pallial neuron count. linkinghub.elsevier.com/retr…
5
4
109
39,001
Last year, many people at @AiEleuther worked on a project to improve on @CollinBurns4's CCS method for eliciting latent knowledge from LLMs. We were unable to improve on CCS, but today we're publishing the proposed method and negative empirical results. blog.eleuther.ai/vincs/
1
7
116
18,573
We're retraining our Llama 3 8B SAEs on 10x more data, using the newer RedPajama v2 corpus. It'll take a couple more weeks to finish training, but early checkpoints for odd-numbered layers are available here huggingface.co/EleutherAI/sa…
3
8
119
8,263
For a long time I had assumed that photorealistic deepfakes would be produced using something like the Unreal Engine, with explicit physics simulation etc. I should have trusted more in the power of deep learning.
Introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. openai.com/sora Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”
11
114
8,687