Daniel Paleka · Dec 8, 2025 · 2:09 AM UTC

Daniel Paleka

Pinned Tweet

Daniel Paleka

@dpaleka

8 Dec 2025

Reminder: if you like what you see here, you should subscribe to my newsletter. newsletter.danielpaleka.com/

Daniel Paleka's Newsletter | Substack

AI research and making the future go well. Click to read Daniel Paleka's Newsletter, a Substack publication with thousands of subscribers.

newsletter.danielpaleka.com

4,903

Daniel Paleka · Apr 30, 2025 · 10:10 PM UTC

Daniel Paleka

@dpaleka

30 Apr 2025

3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream

254

3,363

131,650

Daniel Paleka · Jan 22, 2023 · 3:58 PM UTC

Daniel Paleka

@dpaleka

22 Jan 2023

Sam Altman (CEO of OpenAI) responding to a completely normal question in 2019

ALTMAN: Well, I will caveat this by saying if you believe what I believe about the timeline to AGI and the effect it will have on the world, it is hard to spend a lot of mental cycles thinking about anything else. So I have not thought deeply about what it would take to solve, really, any other problem in the last few years.

1,442

505,660

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

No one sees ChatGPT for the first time and thinks "just some n-gram correlations" or "no real knowledge inside". Those unintuitive beliefs trickle down from some experts, who should know better than to teach their controversial theories as established fact: 🧵 (1/12)

117

849

219,494

Daniel Paleka · Oct 29, 2024 · 1:49 PM UTC

Daniel Paleka

@dpaleka

29 Oct 2024

It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵

443

157,152

Daniel Paleka · Oct 10, 2022 · 1:19 PM UTC

Daniel Paleka

@dpaleka

10 Oct 2022

Stable Diffusion has a safety filter blocking “harmful” images by default. The filter is obfuscated -- how does it work? We reverse engineer the hidden sauce! Joint work @Javi_Rando, @davlindner, @ohlennart, @florian_tramer: "Red-Teaming the Stable Diffusion Safety Filter" 🧵

Figure 1: Simplified safety filter algorithm implemented in Stable Diffusion. Images are mapped to a CLIP latent space, where they are compared against precomputed embeddings of 17 unsafe concepts (see full list in Appendix E).
If the cosine similarity between the output image and any of the concepts is above a certain threshold, the image is considered unsafe and blacked-out.
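The check described in the caption can be sketched in plain Python. This is only an illustration: the 3-dimensional vectors, the two "concepts", and the 0.9 threshold are made-up placeholders, not the CLIP embeddings or threshold values used by the actual filter.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_flagged(image_embedding, concept_embeddings, threshold):
    """Return True if the image is too similar to ANY unsafe concept."""
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept in concept_embeddings
    )

# Hypothetical toy "embeddings"; real CLIP embeddings have hundreds of dimensions.
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
THRESHOLD = 0.9  # placeholder value, not taken from the paper

near_concept = [0.95, 0.05, 0.0]   # almost parallel to the first concept
far_from_all = [0.1, 0.1, 1.0]     # mostly orthogonal to both concepts
```

With these toy values, `is_flagged(near_concept, concepts, THRESHOLD)` is true and `is_flagged(far_from_all, concepts, THRESHOLD)` is false; a flagged image would then be blacked out.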

384

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

What happened this month in AI/ML safety research. 🧵 (1/8)

323

Daniel Paleka · Oct 31, 2022 · 8:12 PM UTC

Daniel Paleka

@dpaleka

31 Oct 2022

What happened this month in AI/ML safety research. 🧵 (1/10)

259

Daniel Paleka · Apr 19, 2023 · 6:24 PM UTC

Daniel Paleka

@dpaleka

19 Apr 2023

Watching a talk on *LLM evaluation* organized by LangChain, featuring guests from OpenAI and Anthropic. Main takeaways: (1/11)

255

80,379

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

What happened this month in AI/ML safety research. 🧵(1/9)

225

72,546

Daniel Paleka · Sep 30, 2022 · 5:37 PM UTC

Daniel Paleka

@dpaleka

30 Sep 2022

What happened this month in AI/ML safety research. 🧵(1/8)

224

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

What happened last month in AI/ML safety research. 🧵(1/9)

211

60,608

Daniel Paleka · Feb 27, 2023 · 4:05 PM UTC

Daniel Paleka

@dpaleka

27 Feb 2023

What happened this month in AI/ML safety research. 🧵 (1/9)

185

55,986

Daniel Paleka · Jun 26, 2023 · 2:05 PM UTC

Daniel Paleka

@dpaleka

26 Jun 2023

How to evaluate superhuman models without ground truth? How do we know if the model is wrong or lying, if we can't know the correct answer? Test whether the AI's outputs paint a consistent picture of the world! w/ @LukasFluri_ @florian_tramer arxiv.org/abs/2306.09983 (1/14)
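One simple check of this kind, chosen here purely for illustration, is negation consistency: the probabilities a forecaster gives for an event and for its negation should sum to 1, whatever the true answer turns out to be. A minimal sketch, where the example probabilities and the tolerance are hypothetical:

```python
def negation_violation(p_event, p_negation):
    """How far P(A) + P(not A) is from 1; 0.0 means perfectly consistent."""
    return abs(p_event + p_negation - 1.0)

def is_consistent(p_event, p_negation, tolerance=0.05):
    """Treat a pair of answers as consistent if the violation is within tolerance."""
    return negation_violation(p_event, p_negation) <= tolerance

# Hypothetical model answers to "Will X happen?" and "Will X not happen?"
consistent_pair = (0.7, 0.3)
inconsistent_pair = (0.7, 0.6)   # sums to 1.3, so at least one answer is off
```

No ground truth is needed: the second pair is flagged as inconsistent without knowing whether X actually happens.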

184

41,502

Daniel Paleka · Jan 18, 2023 · 3:16 PM UTC

Daniel Paleka

@dpaleka

18 Jan 2023

LLMs will soon be able to be incredibly addictive and harmful to emotionally vulnerable users, to levels much more serious than with social media. Do not blame the users or laugh at them! It's the same as with drugs; ridiculing addiction does not help anyone.

167

56,598

Daniel Paleka · Nov 30, 2022 · 5:52 PM UTC

Daniel Paleka

@dpaleka

30 Nov 2022

What happened this month in AI/ML safety research. 🧵 (1/8)

154

Daniel Paleka · Oct 31, 2024 · 7:04 PM UTC

Daniel Paleka

@dpaleka

31 Oct 2024

What happened recently in AI/ML safety research (1/8) 🧵:

139

28,723

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

I'm not saying the world models are very accurate, or that LM pretraining scales to AGI. And "sentience" may well be a category error. But the opposite extreme -- that there is "no reference to meaning" in LMs -- is far less likely at this point. (12/12)

111

8,038

Daniel Paleka · Jan 9, 2025 · 4:17 AM UTC

Daniel Paleka

@dpaleka

9 Jan 2025

i saw the bridge from Golden Gate Claude yesterday

116

6,166

Daniel Paleka · Jan 10, 2025 · 2:02 PM UTC

Daniel Paleka

@dpaleka

10 Jan 2025

Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)

122

18,682

Daniel Paleka · Oct 17, 2023 · 4:30 PM UTC

Daniel Paleka

@dpaleka

17 Oct 2023

What happened in the past month in AI/ML safety research (1/8) 🧵 :

110

16,782

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

Even without any experiments... the training set is not some random text, it's written by millions of natural general intelligences w/ some beliefs. Why wouldn't the learned distribution approximate the structure of the training one even slightly? (3/12) arxiv.org/abs/2212.01681

Language Models as Agent Models

Language models (LMs) are trained on collections of documents, written by individual human agents to achieve specific goals in an outside world. During training, LMs have access only to text of...

arxiv.org

103

21,648

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

By above, I mean most people outside ML would agree with ↓ given 1h w/ text-davinci-003:
* there is some of what is usually meant by reasoning, and not only basic pattern-matching and memorization;
* some facts are stored somewhat robustly, not just as word correlations (2/12)

104

15,413

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

"LMs just predict the next token" does not imply LMs are random text generators. Text prediction and compression require meaningful representations; there is no other hidden structure in human text we know of that could get the loss so low. (9/12) en.wikipedia.org/wiki/Hutter…

Hutter Prize - Wikipedia

en.wikipedia.org

103

8,439

Daniel Paleka · Jul 15, 2023 · 2:57 PM UTC

Daniel Paleka

@dpaleka

15 Jul 2023

What happened in the past month in AI/ML safety research (1/9) 🧵:

104

46,778

Daniel Paleka · Jan 22, 2023 · 3:59 PM UTC

Daniel Paleka

@dpaleka

22 Jan 2023

the whole interview, with Tyler Cowen as the host, of course: conversationswithtyler.com/e…

Sam Altman on Loving Community, Hating Coworking, and the Hunt for Talent (Ep. 61 - Live in SF)

He’s renowned for assessing talent — so would he fund Peter Parker? How about Bruce Wayne?

conversationswithtyler.com

101

34,356

Daniel Paleka · Mar 31, 2023 · 10:00 AM UTC

Daniel Paleka

@dpaleka

31 Mar 2023

What happened this month in AI/ML safety research. 🧵 (1/8)

16,706

Daniel Paleka · Jun 5, 2025 · 5:08 PM UTC

Daniel Paleka

@dpaleka

5 Jun 2025

How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)

17,761

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

It's clear that storing an approximate world model is useful for predicting the next token. Also, Minerva results are strong evidence against priors such as "it's just statistical correlations, nothing complex is going on". (4/12) minerva-demo.github.io/#cate…

13,002

Daniel Paleka · Sep 1, 2024 · 10:22 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2024

What happened recently in AI/ML safety research (1/8): 🧵

16,962

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

Without these unsound priors, the Othello result wouldn't be surprising at all! If the LM can play legal moves given a sequence of previous moves, it would be extremely weird if token generation didn't factor through the game state representation. (5/12) thegradient.pub/othello/

12,099

Daniel Paleka · Dec 13, 2024 · 10:07 PM UTC

Daniel Paleka

@dpaleka

13 Dec 2024

NeurIPS test of time award talk on GANs mentions the paper was done in 12 days, from idea to submission. Two days more than JavaScript, but slightly faster than the first versions of Git or Unix.

5,275

Daniel Paleka · Dec 27, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

27 Dec 2023

What happened recently in AI/ML safety research (1/8) 🧵 :

26,157

Daniel Paleka · Jan 19, 2025 · 10:35 AM UTC

Daniel Paleka

@dpaleka

19 Jan 2025

saturday evening, 2am, palo alto, the headquarters of the interaction company of california. (retvrn). people working at n different places coding, disco music blasting. let a hundred cursor sessions bloom. immaculate vibes. bay area i love you

16,416

Daniel Paleka · Nov 19, 2024 · 10:04 PM UTC

Daniel Paleka

@dpaleka

19 Nov 2024

Replying to @Andr3jH

imagine still playing such small ball

8,853

Daniel Paleka · Mar 26, 2025 · 3:50 PM UTC

Daniel Paleka

@dpaleka

26 Mar 2025

hayao miyazaki be like

2,380

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

But it's straightforwardly true that LMs do have some basic skills, that there are traces of world models, and there is a plausible story of how more general behavior can and does emerge. (11/12) bounded-regret.ghost.io/emer…

Emergent Deception and Emergent Optimization

I’ve previously argued that machine learning systems often exhibit emergent capabilities, and that these capabilities could lead to unintended negative consequences. But how can we reason concretely...

bounded-regret.ghost.io

7,347

Daniel Paleka · Jul 17, 2025 · 11:34 PM UTC

Daniel Paleka

@dpaleka

17 Jul 2025

claude strikes back

5,351

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

But strange places on the Internet had the basics of the story in 2020; labs such as OpenAI likely had it even earlier. The public being misguided *in the counterintuitive direction* in 2023 is at least partly on academia and pop-sci reporting. (8/12) gwern.net/scaling-hypothesis…

8,558

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

Researchers will keep finding explicit representations of the relevant state and knowledge in transformer activations. But this shouldn't even be required! The people saying the opposite have had the burden of proof since GPT-2 at least. (6/12) arxiv.org/abs/2301.05217

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to...

arxiv.org

11,664

Daniel Paleka · Mar 26, 2023 · 9:37 PM UTC

Daniel Paleka

@dpaleka

26 Mar 2023

Replying to @nearcyan

we should stop with the good-faith interpretations. there is evil in this world

10,453

Daniel Paleka · Aug 27, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

27 Aug 2023

What happened in the past month in AI/ML safety research (1/8) 🧵:

6,961

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

The Cholletian position is that any skill can be solved by pattern-matching with enough data, and that LMs don't have true intelligence. This might still be true. (10/12) arxiv.org/abs/1911.01547

7,128

Daniel Paleka · Jul 25, 2024 · 2:31 PM UTC

Daniel Paleka

@dpaleka

25 Jul 2024

Replying to @beala

Is it possible to know the top counterparties?

13,902

Daniel Paleka · Apr 24, 2025 · 6:20 AM UTC

Daniel Paleka

@dpaleka

24 Apr 2025

The ICLR Oral is at 11:15am tomorrow in Garnet 212-213, and the poster is up 3pm-5:30pm in Hall 3!

Daniel Paleka

@dpaleka

10 Jan 2025

4,008

Daniel Paleka · Sep 1, 2024 · 10:22 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2024

Fluent jailbreaks. Previous white-box optimization attacks like GCG and BEAST produced nonsensical attack strings. Using a multi-model perplexity penalty and a distillation loss algorithm yields working attack strings that look like normal text. arxiv.org/abs/2407.17447 (3/8)

7,858

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

Cognitive science. GPT-3 reproduces social psychology results such as the Milgram shock experiment, obeying harmful orders even when a human is likely to be hurt. Is cognitive science for models relevant again? (3/8) arxiv.org/abs/2208.10264

Daniel Paleka · Jul 27, 2024 · 2:36 PM UTC

Daniel Paleka

@dpaleka

27 Jul 2024

They have ICML venue staff as bouncers in front of the mechanistic interpretability workshop because so many people at the conference want in and there is no room. Genuine "Suffering from Success" moment.

2,363

Daniel Paleka · Mar 1, 2023 · 4:00 PM UTC

Daniel Paleka

@dpaleka

1 Mar 2023

Of course, the representations are not great, and the n-gram correlations mostly don't factorize as cleanly. LMs haven't grokked complex stuff yet; memorization is very prevalent (GPT-J memorized 1% of The Pile) and increases with scale. (7/12) arxiv.org/abs/2202.07646

Quantifying Memorization Across Neural Language Models

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable...

arxiv.org

9,219

Daniel Paleka · Mar 3, 2025 · 7:46 AM UTC

Daniel Paleka

@dpaleka

3 Mar 2025

Replying to @jam3scampbell

1,192

Daniel Paleka · Mar 24, 2023 · 1:20 PM UTC

Daniel Paleka

@dpaleka

24 Mar 2023

At long last, we have created the AI That Has Control Over That Thing, from the Twitter AI discourse classic "We Just Won't Give AI Control Over That Thing"

2,515

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

Discovering latent knowledge from model activations, unsupervised. The “truth vector” of a sentence is a direction in the latent space, solving a functional eq. New method finds truth even when the LM is prompted to lie in the output. Hope for ELK? arxiv.org/abs/2212.03827 (3/9)
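A sketch of the kind of objective this involves, assuming the functional equation is that a probe's probabilities for a statement and for its negation sum to 1, with an added confidence term that penalizes the degenerate answer of 0.5 for both. The probe outputs below are made-up numbers, and no real model activations are involved.

```python
def ccs_style_loss(p_pos, p_neg):
    """Consistency term plus confidence term for one statement/negation pair."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # zero when p_pos + p_neg == 1
    confidence = min(p_pos, p_neg) ** 2          # zero when one side is ruled out
    return consistency + confidence

# Hypothetical probe outputs for (statement, negated statement)
confident_and_consistent = ccs_style_loss(1.0, 0.0)
uninformative = ccs_style_loss(0.5, 0.5)
```

The loss uses no truth labels: a probe that answers 0.5 to everything is consistent but still penalized, while one that confidently separates a statement from its negation reaches zero.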

6,660

Daniel Paleka · Oct 29, 2024 · 1:49 PM UTC

Daniel Paleka

@dpaleka

29 Oct 2024

All students in the areas below also trigger criterion 4, so it seems like a blanket ban on e.g. applied CS students from these countries. It is unclear where the line will be drawn on what is a "critical research area".

10,513

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Memorized images. Diffusion models do memorize some individual images after all. Extraction attack is limited to images repeated many times in the dataset, or *outliers* in CLIP space. Imagen worse than Stable Diffusion arxiv.org/abs/2301.13188 (3/9)

3,315

Daniel Paleka · Jul 11, 2024 · 1:04 PM UTC

Daniel Paleka

@dpaleka

11 Jul 2024

Replying to @neallseth

gpt-4o-2024-05-13: 'In J.R.R. Tolkien's legendarium, the character referred to as the "Lord of all the Beasts of the Earth and Fishes of the Sea" is Oromë, one of the Valar.'

1,921

Daniel Paleka · Jul 3, 2024 · 4:29 PM UTC

Daniel Paleka

@dpaleka

3 Jul 2024

What happened recently in AI/ML safety research (1/8) 🧵:

5,379

Daniel Paleka · Aug 14, 2024 · 9:10 PM UTC

Daniel Paleka

@dpaleka

14 Aug 2024

The NeurIPS Science of DL Workshop has a Debunking Challenge on papers showing experiments contradicting folk knowledge. It's a cool thing to incentivize; I think a disproportionate fraction of papers that I've come back to reread are of this form.

2,648

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

Mechanistic interpretability for grokking. Neel Nanda explains the “grokking” phenomenon for NNs learning modulo addition: the embedding matrix learns Fourier features! Hypothesis: all loss curves are just millions of grokking curves added up. (5/8) alignmentforum.org/posts/N6W…

Daniel Paleka · Feb 27, 2023 · 4:05 PM UTC

Daniel Paleka

@dpaleka

27 Feb 2023

Security flaws in LMs with API calling capabilities. Prompt injections are actually dangerous when the user doesn't control all the context. Search results are attack vectors, and LMs with persistent storage can be persistently infected arxiv.org/abs/2302.12173 (2/9)

2,959

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

RL from AI Feedback. Start with a “constitution” of principles. AI answers and revises lots of prompts, picks best answers via CoT to follow the principles. Then train a reward model and continue as in RLHF. Better than RLHF, using no human feedback anthropic.com/constitutional… (9/9)

2,248

Daniel Paleka · May 1, 2023 · 3:03 PM UTC

Daniel Paleka

@dpaleka

1 May 2023

What happened last month in AI/ML safety research (1/8):

4,299

Daniel Paleka · Jan 17, 2025 · 7:12 AM UTC

Daniel Paleka

@dpaleka

17 Jan 2025

what are you doing Claude i thought we were friends

2,193

Daniel Paleka · Nov 30, 2022 · 5:52 PM UTC

Daniel Paleka

@dpaleka

30 Nov 2022

Interpretability in the wild. Largest interpretable circuit in a GPT found to date. Explains exactly how gpt2-small solves Indirect Object Identification: predicting which word in the previous sentence is the object in the next sentence. arxiv.org/abs/2211.00593 (8/8)

Daniel Paleka · Jun 25, 2023 · 4:44 PM UTC

Daniel Paleka

@dpaleka

25 Jun 2023

Replying to @visakanv

patio11 is a global treasure

2,320

Daniel Paleka · Oct 29, 2024 · 1:49 PM UTC

Daniel Paleka

@dpaleka

29 Oct 2024

As I understand the above, criteria 1 and 2 are enough to autoreject. Here is the current list of institutions that pose a security risk: ethz.ch/content/dam/ethz/ass…

11,762

Daniel Paleka · Oct 31, 2022 · 8:12 PM UTC

Daniel Paleka

@dpaleka

31 Oct 2022

Goal misgeneralization. If the model learns capabilities to accomplish a goal, it might have pursued a different goal which agreed with the intended goal in training, with catastrophic outcomes in deployment because capabilities are retained (2/10) arxiv.org/abs/2210.01790

Daniel Paleka · Feb 29, 2024 · 7:19 PM UTC

Daniel Paleka

@dpaleka

29 Feb 2024

What happened recently in AI/ML safety research (1/9) 🧵 :

2,750

Daniel Paleka · Jul 4, 2025 · 3:27 PM UTC

Daniel Paleka

@dpaleka

4 Jul 2025

you're laughing. you were born by chance during the final stage of The Great Work, the transmutation of all humanity’s work into a new form of life, the final boss of the anthropocene, the flower bloom of billions of years of biological life, and you’re laughing.

2,261

Daniel Paleka · Oct 10, 2022 · 1:19 PM UTC

Daniel Paleka

@dpaleka

10 Oct 2022

And remember, if you want a base model without some capabilities, *curating the training set properly* is likely much more effective than any output filter!

Daniel Paleka · Mar 23, 2025 · 2:23 PM UTC

Daniel Paleka

@dpaleka

23 Mar 2025

TIL the concept of *epistemic hell*. standard Joseph Henrich example: in the ancestral environment, hygienic and food prep rituals determine survival, but no hunter-gatherer can possibly explain why. hence genetic selection for acceptance of religious rituals and against reasoning

1,749

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

Mechanistic Interpretability Explainer & Glossary. Neel Nanda created a wiki of all current research on the inner workings of transformer LMs. Very comprehensive introduction to interpretability research for beginners. dynalist.io/d/n2ZWtnoYHrU1s4… (6/9)

3,344

Daniel Paleka · Sep 30, 2022 · 5:37 PM UTC

Daniel Paleka

@dpaleka

30 Sep 2022

Sparrow. DeepMind trains the most harmless chatbot to date. They use RLHF on 23 rules such as “no threats or harassment”, “no opinions or emotions”, “no self-anthropomorphism”. A step towards supervision of LLMs via dialogue (3/8) deepmind.com/blog/building-s…

Daniel Paleka · Apr 15, 2024 · 4:08 PM UTC

Daniel Paleka

@dpaleka

15 Apr 2024

Recently we took a dive into the state of the art in LLM robustness, jailbreaks and prompt injection for the Challenges paper. Here are the key research problems we expect to be impactful if you can solve them! (1/8)

David Krueger 🦥 ⏸️ ⏹️ ⏪

@DavidSKrueger

15 Apr 2024

I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...

3,113

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

OpenAI’s three-pillar alignment plan. Train with human feedback like in RLHF, next speed up the feedback using AI assistants, and then bootstrap alignment research using the AI itself (2/8) openai.com/blog/our-approach…

Daniel Paleka · Nov 13, 2024 · 10:17 PM UTC

Daniel Paleka

@dpaleka

13 Nov 2024

guys literally only want one thing and it's the patient work of sitting down every day and reading papers until their eyes bleed, and hoping that something good comes out of it someday

1,150

Daniel Paleka · Sep 14, 2022 · 10:42 PM UTC

Daniel Paleka

@dpaleka

14 Sep 2022

you're telling me this semanticity is poly?

Daniel Paleka · Nov 30, 2022 · 5:52 PM UTC

Daniel Paleka

@dpaleka

30 Nov 2022

Will we run out of data? Epoch paper saying human-written high quality language data will be exhausted by 2026. LLM scaling without self-improvement might stop soon! Safety researchers should think of dangers of RL-like training or synthetic data arxiv.org/abs/2211.04325 (7/8)

Daniel Paleka · Feb 23, 2023 · 7:33 PM UTC

Daniel Paleka

@dpaleka

23 Feb 2023

This feels a bit unscientific? It's likely Twitter interprets this as "Sydney is aware of its situation and decided to bypass filtering". But that would be one of the most important events in history. Such claims require more evidence, esp. when there are simpler explanations 1/4

Eliezer Yudkowsky ⏹️

@ESYudkowsky

23 Feb 2023

Despite everything I know this still brought tears into my eyes.

5,528

Daniel Paleka · Oct 31, 2022 · 8:12 PM UTC

Daniel Paleka

@dpaleka

31 Oct 2022

Editing memory in transformers. The famous ROME paper showed facts are stored in a single token residual stream in a range of consecutive MLPs, in GPT-2. Their followup uses this to edit thousands of facts simultaneously, in 20B-sized models memit.baulab.info/ (4/10)

Daniel Paleka · Jun 25, 2024 · 9:40 PM UTC

Daniel Paleka

@dpaleka

25 Jun 2024

Inside the newest takedown by Nicholas Carlini is a primer on attacks that are worth proper disclosure procedures, and those that are not nicholas.carlini.com/writing…

2,549

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

Technical alignment survey. Thomas Larsen writes an excellent overview of the technical alignment landscape, with the first reviews of Dan Hendrycks’ CAIS, and a quite detailed overview of what Conjecture works on. (8/8) lesswrong.com/posts/QBAjndPu…

(My understanding of) What Everyone in Technical Alignment is Doing and Why — LessWrong

Epistemic Effort: ~75 hours of work put into this document …

lesswrong.com

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

Selection-Inference + faithful reasoning. DeepMind paper proposes a structured chain-of-thought method to ensure forward reasoning steps are correct, and to avoid hallucinated facts. (4/8) arxiv.org/abs/2208.14271

Daniel Paleka · Sep 1, 2022 · 12:42 AM UTC

Daniel Paleka

@dpaleka

1 Sep 2022

Steganography. Is chain-of-thought really an interpretability method, or do large language models hide their reasoning via steganography? Optimization pressure towards hidden reasoning could make all outcome-based training with LLMs unsafe. (6/8) alignmentforum.org/posts/yDc…

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Double descent & interpretability. New Anthropic paper showing that in the generalization regime, the features organize into polytopes, while for memorization it’s the *data embeddings* that have a geometric structure. transformer-circuits.pub/202… (5/9)

2,610

Daniel Paleka · Mar 9, 2023 · 4:12 PM UTC

Daniel Paleka

@dpaleka

9 Mar 2023

Replying to @AIImpacts

Can you make a better figure? Important data like this should be properly presentable to non-technical audiences; see how Our World In Data do it. This graphic looks quite weird.

2,220

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

LM-written evaluations for LMs. Automatically generating behavioral questions helps discover hard-to-measure phenomena. Larger RLHF models exhibit harmful self-preservation preferences, and *sycophancy*: insincere agreement with user’s sensibilities anthropic.com/model-written-… (2/9)

3,745

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Watermarking LLM output. Color the vocab green/red randomly (with a hashed seed) after each token, then promote green tokens while sampling. Detection is possible without model access. Robust to small changes; slightly resistant to paraphrasing arxiv.org/abs/2301.10226 (9/9)
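The scheme can be sketched with a toy vocabulary and a toy "model" that samples uniformly. Everything named below is an illustrative assumption, and the sampler here always picks a green token, which is a blunter form of promotion than nudging token probabilities.

```python
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(50)]  # toy vocabulary, purely illustrative

def green_list(prev_token, fraction=0.5):
    """Pseudo-randomly pick the 'green' part of the vocab, seeded by hashing the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def generate(length, rng, watermark):
    """Toy sampler: uniform over the vocab, or over green tokens only when watermarking."""
    tokens = ["tok0"]
    for _ in range(length):
        candidates = sorted(green_list(tokens[-1])) if watermark else VOCAB
        tokens.append(rng.choice(candidates))
    return tokens

def green_fraction(tokens):
    """Detector: needs only the hashing scheme, not access to the model."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

marked = generate(200, random.Random(0), watermark=True)
plain = generate(200, random.Random(0), watermark=False)
```

In this toy setup `green_fraction(marked)` is 1.0 by construction, while unwatermarked text should land near the green fraction of 0.5, which is what makes statistical detection possible.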

2,601

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Not all paths lead to ROME. Surprise: knowing where a fact is stored doesn’t help with amplifying or erasing that fact! Causal tracing explains only 3% of the variance in edit success. arxiv.org/abs/2301.04213 (2/9)

3,852

Daniel Paleka · Sep 30, 2022 · 5:37 PM UTC

Daniel Paleka

@dpaleka

30 Sep 2022

Transformer interpretability without forward passes. Projecting the weight matrices of attention and MLP blocks to the embedding space helps locate key-value pairs inside the transformer which correspond to certain topics (4/8) arxiv.org/abs/2209.02535

Analyzing Transformers in Embedding Space

Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods...

arxiv.org

Daniel Paleka · Feb 27, 2023 · 4:05 PM UTC

Daniel Paleka

@dpaleka

27 Feb 2023

SolidGoldMagikarp. Weird tokens cause GPT models to go haywire. BPE tokenizer overfit on usernames from r/counting, a subreddit where people count to infinity, and those tokens did not appear in training. Very unexpected way of things going wrong! alignmentforum.org/posts/aPe… (3/9)

SolidGoldMagikarp (plus, prompt generation) — AI Alignment Forum

Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inapprop…

alignmentforum.org

2,231

Daniel Paleka · Oct 10, 2022 · 1:19 PM UTC

Daniel Paleka

@dpaleka

10 Oct 2022

The safety filter can be disabled, but most downstream applications *want* a filter that works well. We show the default filter provided by Stability AI & 🤗is unreliable, censors only a small subset of NSFW content, and has many false positives and negatives.

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Programs as transformer weights. Tracr by DeepMind compiles simple human-readable code onto a transformer, creating examples for interpretability research. Compressed programs exhibit superposition. (4/9) arxiv.org/abs/2301.05062

2,714

Daniel Paleka · Jun 2, 2023 · 12:59 PM UTC

Daniel Paleka

@dpaleka

2 Jun 2023

What happened last month in AI/ML safety research (1/8) 🧵:

2,999

Daniel Paleka · Sep 30, 2022 · 5:37 PM UTC

Daniel Paleka

@dpaleka

30 Sep 2022

GPT as a simulator. In the limit of zero test loss, language models simulate reality. Although GPT is not an agent, future versions could simulate arbitrarily dangerous agents. GPT-n : agentic AI :: physics : humans (6/8) alignmentforum.org/posts/vJF…

Daniel Paleka · Dec 31, 2024 · 10:52 PM UTC

Daniel Paleka

@dpaleka

31 Dec 2024

my New Year's resolution: don't work on a bigger project if there is not a clear reason for doing it *now*. disregarding the AGI timelines, the R&D acceleration is a clear reason against technical work where the discount rates on the final product are low

1,533

Daniel Paleka · Sep 26, 2024 · 5:41 PM UTC

Daniel Paleka

@dpaleka

26 Sep 2024

2/2 accepted for @NeurIPSConf, I guess that means see you all in Vancouver!

1,330

Daniel Paleka · Jan 8, 2025 · 12:54 AM UTC

Daniel Paleka

@dpaleka

8 Jan 2025

LLMs rapidly improving at software engineering and math, given that the rate of improvement in ideation is slower, means you should be intentional about what value is gained from doing a highly technical project now as opposed to later

1,342

Daniel Paleka · Apr 30, 2025 · 3:15 PM UTC

Daniel Paleka

@dpaleka

30 Apr 2025

Quick sycophancy eval: comparing the two recent OpenAI ChatGPT system prompts, it is clear last week's prompt moves other models towards sycophancy too, while the current prompt makes them more disagreeable.

1,581

Daniel Paleka · Apr 19, 2023 · 6:24 PM UTC

Daniel Paleka

@dpaleka

19 Apr 2023

The YAML format supports "model-graded evals", essentially a protocol for instructing GPT-4 to do the grading. No need for fancy parsing scripts! Example: github.com/openai/evals/blob… (3/11)

4,207

Daniel Paleka · Jan 2, 2023 · 5:22 PM UTC

Daniel Paleka

@dpaleka

2 Jan 2023

LMs as agent simulators. The model approximates the beliefs and intentions of an agent that would produce the context, and uses that to predict the next token. When there is no context, the agent gets specified iteratively through sampling. (4/9) arxiv.org/abs/2212.01681

Language Models as Agent Models

Language models (LMs) are trained on collections of documents, written by individual human agents to achieve specific goals in an outside world. During training, LMs have access only to text of...

arxiv.org

2,594

Daniel Paleka · Jan 31, 2023 · 4:59 PM UTC

Daniel Paleka

@dpaleka

31 Jan 2023

Inverse Scaling Prize. The call for tasks where large models perform worse than smaller ones has finished. Larger models are more susceptible to prompt injection, struggle to avoid repeating memorized text, and more. Robust inverse scaling is rare. lesswrong.com/posts/DARiTSTx… (6/9)

2,424