ai safety researcher | phd @CSatETH | danielpaleka.com

Zurich
3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream
36
254
3,363
131,650
Sam Altman (CEO of OpenAI) responding to a completely normal question in 2019
25
85
1,442
505,660
No one sees ChatGPT for the first time and thinks "just some n-gram correlations" or "no real knowledge inside". Those unintuitive beliefs trickle down from some experts, who should know better than to teach their controversial theories as established fact: 🧵 (1/12)
19
117
849
219,494
It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵
24
72
443
157,152
Stable Diffusion has a safety filter blocking “harmful” images by default. The filter is obfuscated -- how does it work? We reverse engineer the hidden sauce! Joint work @Javi_Rando, @davlindner, @ohlennart, @florian_tramer: "Red-Teaming the Stable Diffusion Safety Filter" 🧵
9
62
384
What happened this month in AI/ML safety research. 🧵 (1/8)
4
38
323
What happened this month in AI/ML safety research.🧵(1/10)
3
28
259
Watching a talk on *LLM evaluation* organized by Langchain, featuring guests from OpenAI and Anthropic. Main takeaways: (1/11)
3
36
255
80,379
What happened this month in AI/ML safety research. 🧵(1/9)
2
40
225
72,546
What happened this month in AI/ML safety research. 🧵(1/8)
5
27
224
What happened last month in AI/ML safety research. 🧵(1/9)
4
23
211
60,608
What happened this month in AI/ML safety research. 🧵 (1/9)
2
34
185
55,986
How to evaluate superhuman models without ground truth? How do we know if the model is wrong or lying, if we can't know the correct answer? Test whether the AI's outputs paint a consistent picture of the world! w/ @LukasFluri_ @florian_tramer arxiv.org/abs/2306.09983 (1/14)
8
31
184
41,502
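A minimal sketch of what "testing for a consistent picture of the world" can look like when no ground truth is available. The two checks below (negation and monotonicity) and all function names are illustrative assumptions, not code from the linked paper; the idea is only that a violation can be measured without knowing the right answer.

```python
def negation_violation(p_event: float, p_negation: float) -> float:
    """P(A) and P(not A) should sum to 1; return how far off the model is."""
    return abs(p_event + p_negation - 1.0)


def monotonicity_violation(forecasts: list) -> float:
    """For a quantity that can only grow over time (e.g. a cumulative count),
    return the largest drop between consecutive forecasts, or 0.0 if none."""
    drops = [earlier - later for earlier, later in zip(forecasts, forecasts[1:])]
    return max([0.0] + drops)


# Hypothetical model outputs: 70% for an event, 50% for its negation.
print(negation_violation(0.7, 0.5))
# Hypothetical forecasts of a cumulative quantity for three consecutive years.
print(monotonicity_violation([1.0, 3.0, 2.0]))
```

A nonzero violation shows the model is wrong somewhere, even though it does not say which of the answers is the wrong one.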
LLMs will soon be able to be incredibly addictive and harmful to emotionally vulnerable users, to levels much more serious than with social media. Do not blame the users or laugh at them! It's the same as with drugs; ridiculing addiction does not help anyone.
12
11
167
56,598
What happened this month in AI/ML safety research. 🧵 (1/8)
6
20
154
What happened recently in AI/ML safety research (1/8) 🧵:
4
24
139
28,723
I'm not saying the world models are very accurate, or that LM pretraining scales to AGI. And "sentience" may well be a category error. But the opposite extreme -- that there is "no reference to meaning" in LMs -- is far less likely at this point. (12/12)
5
5
111
8,038
i saw the bridge from Golden Gate Claude yesterday
2
4
116
6,166
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
4
7
122
18,682
What happened in the past month in AI/ML safety research (1/8) 🧵 :
3
18
110
16,782
Even without any experiments... the training set is not some random text, it's written by millions of natural general intelligences w/ some beliefs. Why wouldn't the learned distribution approximate the structure of the training one even slightly? (3/12) arxiv.org/abs/2212.01681
3
6
103
21,648
By above, I mean most people outside ML would agree with ↓ given 1h w/ text-davinci-003: * there is some of what is usually meant by reasoning, and not only basic pattern-matching and memorization; * some facts are stored somewhat robustly, not just as word correlations (2/12)
3
3
104
15,413
"LMs just predict the next token" does not imply LMs are random text generators. Text prediction and compression require meaningful representations; there is no other hidden structure in human text we know of that could get the loss so low. (9/12) en.wikipedia.org/wiki/Hutter…
2
5
103
8,439
What happened in the past month in AI/ML safety research (1/9) 🧵:
1
19
104
46,778
What happened this month in AI/ML safety research. 🧵 (1/8)
3
17
88
16,706
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
5
13
89
17,761
It's clear that storing an approximate world model is useful for predicting the next token. Also, Minerva results are strong evidence against priors such as "it's just statistical correlations, nothing complex is going on". (4/12) minerva-demo.github.io/#cate…
2
1
77
13,002
What happened recently in AI/ML safety research (1/8): 🧵
2
16
71
16,962
Without these unsound priors, the Othello result wouldn't be surprising at all! If the LM can play legal moves given a sequence of previous moves, it would be extremely weird if token generation didn't factor through the game state representation. (5/12) thegradient.pub/othello/
2
2
71
12,099
NeurIPS test of time award talk on GANs mentions the paper was done in 12 days, from idea to submission. Two days more than JavaScript, but slightly faster than the first versions of Git or Unix.
2
7
73
5,275
What happened recently in AI/ML safety research (1/8) 🧵 :
1
16
71
26,157
saturday evening, 2am, palo alto, the headquarters of the interaction company of california. (retvrn). people working at n different places coding, disco music blasting. let a hundred cursor sessions bloom. immaculate vibes. bay area i love you
5
4
72
16,416
Replying to @Andr3jH
imagine still playing such small ball
1
63
8,853
hayao miyazaki be like
1
1
65
2,380
claude strikes back
3
4
64
5,351
But strange places on the Internet had the basics of the story in 2020; labs such as OpenAI likely had it even earlier. The public being misguided *in the counterintuitive direction* in 2023 is at least partly on academia and pop-sci reporting. (8/12) gwern.net/scaling-hypothesis…
3
2
60
8,558
Researchers will keep finding explicit representations of the relevant state and knowledge in transformer activations. But this shouldn't even be required! The people saying the opposite have had the burden of proof since GPT-2 at least. (6/12) arxiv.org/abs/2301.05217
1
1
59
11,664
Replying to @nearcyan
we should stop with the good-faith interpretations. there is evil in this world
3
53
10,453
What happened in the past month in AI/ML safety research (1/8) 🧵:
1
10
51
6,961
The Cholletian position is that any skill can be solved by pattern-matching with enough data, and that LMs don't have true intelligence. This might still be true. (10/12) arxiv.org/abs/1911.01547
3
49
7,128
Replying to @beala
Is it possible to know the top counterparties?
2
49
13,902
The ICLR Oral is at 11:15am tomorrow in Garnet 212-213, and the poster is up 3pm-5:30pm in Hall 3!
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
7
48
4,008
Fluent jailbreaks. Previous white-box optimization attacks like GCG and BEAST produced nonsensical attack strings. Using a multi-model perplexity penalty and a distillation loss algorithm yields working attack strings that look like normal text. arxiv.org/abs/2407.17447 (3/8)
2
3
35
7,858
Cognitive science. GPT-3 reproduces social psychology results such as the Milgram shock experiment, obeying harmful orders even when a human is likely to be hurt. Is cognitive science for models relevant again? (3/8) arxiv.org/abs/2208.10264
2
5
43
They have ICML venue staff as bouncers in front of the mechanistic interpretability workshop because so many people at the conference want in and there is no room. Genuine "Suffering from Success" moment.
3
1
45
2,363
Of course, the representations are not great, and the n-gram correlations mostly don't factorize as cleanly. LMs haven't grokked complex stuff yet; memorization is very prevalent (GPT-J memorized 1% of The Pile) and increases with scale. (7/12) arxiv.org/abs/2202.07646
1
43
9,219
Replying to @jam3scampbell
4
2
43
1,192
At long last, we have created the AI That Has Control Over That Thing, from the Twitter AI discourse classic "We Just Won't Give AI Control Over That Thing"
1
3
43
2,515
Discovering latent knowledge from model activations, unsupervised. The “truth vector” of a sentence is a direction in the latent space, solving a functional eq. New method finds truth even when the LM is prompted to lie in the output. Hope for ELK? arxiv.org/abs/2212.03827 (3/9)
2
2
37
6,660
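A rough sketch of the "functional equation" idea: a probe's probability for a statement and for its negation should sum to 1, and the probe should not sit on the fence at 0.5. The two-term loss below is written from my reading of the linked paper and should be treated as an assumption; the probe itself (a direction in activation space) is left out, and `ccs_loss` is a hypothetical name.

```python
def ccs_loss(p_pos: list, p_neg: list) -> float:
    """p_pos[i] is the probe's probability that statement i is true,
    p_neg[i] the probability that its negation is true.
    No truth labels are used anywhere."""
    total = 0.0
    for a, b in zip(p_pos, p_neg):
        consistency = (a - (1.0 - b)) ** 2  # p(x+) should equal 1 - p(x-)
        confidence = min(a, b) ** 2         # penalize the trivial answer 0.5 / 0.5
        total += consistency + confidence
    return total / len(p_pos)


print(ccs_loss([1.0, 0.0], [0.0, 1.0]))  # confident and consistent
print(ccs_loss([0.5], [0.5]))            # consistent but uninformative
```

Minimizing this over probe directions is unsupervised, which is why the method can disagree with what the LM says in its output.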
All students in the areas below also trigger criterion 4, so it seems like a blanket ban on e.g. applied CS students from these countries. It is unclear where the line will be drawn on what is a "critical research area".
2
34
10,513
Memorized images. Diffusion models do memorize some individual images after all. Extraction attack is limited to images repeated many times in the dataset, or *outliers* in CLIP space. Imagen worse than Stable Diffusion arxiv.org/abs/2301.13188 (3/9)
1
2
33
3,315
Replying to @neallseth
gpt-4o-2024-05-13: 'In J.R.R. Tolkien's legendarium, the character referred to as the "Lord of all the Beasts of the Earth and Fishes of the Sea" is Oromë, one of the Valar.'
32
1,921
What happened recently in AI/ML safety research (1/8) 🧵:
1
8
34
5,379
The NeurIPS Science of DL Workshop has a Debunking Challenge on papers showing experiments contradicting folk knowledge. It's a cool thing to incentivize; I think a disproportionate fraction of papers that I've come back to reread are of this form.
1
4
34
2,648
Mechanistic interpretability for grokking. Neel Nanda explains the “grokking” phenomenon for NNs learning modulo addition: the embedding matrix learns Fourier features! Hypothesis: all loss curves are just millions of grokking curves added up. (5/8) alignmentforum.org/posts/N6W…
2
3
29
Security flaws in LMs with API calling capabilities. Prompt injections are actually dangerous when the user doesn't control all the context. Search results are attack vectors, and LMs with persistent storage can be persistently infected arxiv.org/abs/2302.12173 (2/9)
1
2
31
2,959
RL from AI Feedback. Start with a “constitution” of principles. AI answers and revises lots of prompts, picks best answers via CoT to follow the principles. Then train a reward model and continue as in RLHF. Better than RLHF, using no human feedback anthropic.com/constitutional… (9/9)
1
2
30
2,248
What happened last month in AI/ML safety research (1/8):
2
3
29
4,299
what are you doing Claude i thought we were friends
4
31
2,193
Interpretability in the wild. Largest interpretable circuit in a GPT found to date. Explains exactly how gpt2-small solves Indirect Object Identification: predicting which word in the previous sentence is the object in the next sentence. arxiv.org/abs/2211.00593 (8/8)
1
29
Replying to @visakanv
patio11 is a global treasure
1
31
2,320
As I understand the above, criteria 1 and 2 are enough to autoreject. Here is the current list of institutions that pose a security risk: ethz.ch/content/dam/ethz/ass…
2
1
29
11,762
Goal misgeneralization. If the model learns capabilities to accomplish a goal, it might have pursued a different goal which agreed with the intended goal in training, with catastrophic outcomes in deployment because capabilities are retained (2/10) arxiv.org/abs/2210.01790
1
2
29
What happened recently in AI/ML safety research (1/9) 🧵 :
1
4
29
2,750
you're laughing. you were born by chance during the final stage of The Great Work, the transmutation of all humanity’s work into a new form of life, the final boss of the anthropocene, the flower bloom of billions of years of biological life, and you’re laughing.
3
1
31
2,261
And remember, if you want a base model without some capabilities, *curating the training set properly* is likely much more effective than any output filter!
2
28
TIL the concept of *epistemic hell*. standard Joseph Henrich example: in the ancestral environment, hygienic and food prep rituals determine survival, but no hunter-gatherer can possibly explain why. hence genetic selection for acceptance of religious rituals and against reasoning
1
2
30
1,749
Mechanistic Interpretability Explainer & Glossary. Neel Nanda created a wiki of all current research on the inner workings of transformer LMs. Very comprehensive introduction to interpretability research for beginners. dynalist.io/d/n2ZWtnoYHrU1s4… (6/9)
1
3
27
3,344
Sparrow. DeepMind trains the most harmless chatbot to date. They use RLHF on 23 rules such as “no threats or harassment”, “no opinions or emotions”, “no self-anthropomorphism”. A step towards supervision of LLMs via dialogue (3/8) deepmind.com/blog/building-s…
2
1
28
Recently we took a dive into the state of the art in LLM robustness, jailbreaks and prompt injection for the Challenges paper. Here are the key research problems we expect to be impactful if you can solve them! (1/8)
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...
1
5
29
3,113
OpenAI’s three-pillar alignment plan. Train with human feedback like in RLHF, next speed up the feedback using AI assistants, and then bootstrap alignment research using the AI itself (2/8) openai.com/blog/our-approach…
1
2
25
guys literally only want one thing and it's the patient work of sitting down every day and reading papers until their eyes bleed, and hoping that something good comes out of it someday
28
1,150
you're telling me this semanticity is poly?
1
2
26
Will we run out of data? Epoch paper saying human-written high quality language data will be exhausted by 2026. LLM scaling without self-improvement might stop soon! Safety researchers should think of dangers of RL-like training or synthetic data arxiv.org/abs/2211.04325 (7/8)
1
24
This feels a bit unscientific? It's likely Twitter interprets this as "Sydney is aware of its situation and decided to bypass filtering". But that would be one of the most important events in history. Such claims require more evidence, esp. when there are simpler explanations 1/4
Despite everything I know this still brought tears into my eyes.
4
2
26
5,528
Editing memory in transformers. The famous ROME paper showed facts are stored in a single token residual stream in a range of consecutive MLPs, in GPT-2. Their followup uses this to edit thousands of facts simultaneously, in 20B-sized models memit.baulab.info/ (4/10)
1
3
26
Inside the newest takedown by Nicholas Carlini is a primer on attacks that are worth proper disclosure procedures, and those that are not nicholas.carlini.com/writing…
2
2
27
2,549
Technical alignment survey. Thomas Larsen writes an excellent overview of the technical alignment landscape, with the first reviews of Dan Hendrycks’ CAIS, and a quite detailed overview of what Conjecture works on. (8/8) lesswrong.com/posts/QBAjndPu…
1
1
26
Selection-Inference + faithful reasoning. DeepMind paper proposes a structured chain-of-thought method to ensure forward reasoning steps are correct, and to avoid hallucinated facts. (4/8) arxiv.org/abs/2208.14271
1
1
24
Steganography. Is chain-of-thought really an interpretability method, or do large language models hide their reasoning via steganography? Optimization pressure towards hidden reasoning could make all outcome-based training with LLMs unsafe. (6/8) alignmentforum.org/posts/yDc…
2
4
26
Double descent & interpretability. New Anthropic paper showing that in the generalization regime, the features organize into polytopes, while for memorization it’s the *data embeddings* that have a geometric structure. transformer-circuits.pub/202… (5/9)
1
2
26
2,610
Replying to @AIImpacts
Can you make a better figure? Important data like this should be properly presentable to non-technical audiences; see how Our World In Data do it. This graphic looks quite weird.
4
25
2,220
LM-written evaluations for LMs. Automatically generating behavioral questions helps discover hard-to-measure phenomena. Larger RLHF models exhibit harmful self-preservation preferences, and *sycophancy*: insincere agreement with user’s sensibilities anthropic.com/model-written-… (2/9)
2
1
26
3,745
Watermarking LLM output. Color the vocab green/red randomly (with a hashed seed) after each token, then promote green tokens while sampling. Detection is possible without model access. Robust to small changes; slightly resistant to paraphrasing arxiv.org/abs/2301.10226 (9/9)
1
4
25
2,601
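A toy sketch of the green/red scheme, under loud assumptions: the "model" here is a uniform random sampler, the key and hash construction are made up for illustration, and generation uses a hard rule (always pick a green token) rather than the soft promotion of green tokens the tweet describes. It only shows why detection needs the key but not the model.

```python
import hashlib
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5, key: int = 42) -> set:
    """Pseudo-randomly mark a gamma fraction of the vocab green, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def sample_watermarked(length: int, vocab_size: int, gamma: float = 0.5, key: int = 42, seed: int = 0) -> list:
    """Hard watermark on a uniform toy 'model': every new token is drawn from the green list."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    for _ in range(length):
        greens = sorted(green_list(tokens[-1], vocab_size, gamma, key))
        tokens.append(rng.choice(greens))
    return tokens


def z_score(tokens: list, vocab_size: int, gamma: float = 0.5, key: int = 42) -> float:
    """Count green tokens and compare with the gamma fraction expected by chance."""
    t = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))


marked = sample_watermarked(200, vocab_size=1000)
print(z_score(marked, vocab_size=1000))  # all 200 tokens are green, so z is about 14.1
```

Unwatermarked text should score near zero, and editing a few tokens only removes a few green hits, which matches the "robust to small changes" claim.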
Not all paths lead to ROME. Surprise: knowing where a fact is stored doesn’t help with amplifying or erasing that fact! Causal tracing explains only 3% of the variance in edit success. arxiv.org/abs/2301.04213 (2/9)
1
3
26
3,852
Transformer interpretability without forward passes. Projecting the weight matrices of attention and MLP blocks to the embedding space helps locate key-value pairs inside the transformer which correspond to certain topics (4/8) arxiv.org/abs/2209.02535
1
4
25
SolidGoldMagikarp. Weird tokens cause GPT models to go haywire. BPE tokenizer overfit on usernames from r/counting, a subreddit where people count to infinity, and those tokens did not appear in training. Very unexpected way of things going wrong! alignmentforum.org/posts/aPe… (3/9)
1
22
2,231
The safety filter can be disabled, but most downstream applications *want* a filter that works well. We show the default filter provided by Stability AI & 🤗 is unreliable, censors only a small subset of NSFW content, and has many false positives and negatives.
3
1
21
Programs as transformer weights. Tracr by DeepMind compiles simple human-readable code onto a transformer, creating examples for interpretability research. Compressed programs exhibit superposition. (4/9) arxiv.org/abs/2301.05062
1
5
24
2,714
What happened last month in AI/ML safety research (1/8) 🧵:
3
3
23
2,999
GPT as a simulator. In the limit of zero test loss, language models simulate reality. Although GPT is not an agent, future versions could simulate arbitrarily dangerous agents. GPT-n : agentic AI :: physics : humans (6/8) alignmentforum.org/posts/vJF…
1
1
23
my New Year's resolution: don't work on a bigger project if there is not a clear reason for doing it *now*. disregarding the AGI timelines, the R&D acceleration is a clear reason against technical work where the discount rates on the final product are low
2
23
1,533
2/2 accepted for @NeurIPSConf, I guess that means see you all in Vancouver!
1
1
24
1,330
LLMs rapidly improving at software engineering and math, given that the rate of improvement in ideation is slower, means you should be intentional about what value is gained from doing a highly technical project now as opposed to later
3
2
21
1,342
Quick sycophancy eval: comparing the two recent OpenAI ChatGPT system prompts, it is clear last week's prompt moves other models towards sycophancy too, while the current prompt makes them more disagreeable.
2
23
1,581
The YAML format supports "model-graded evals", essentially a protocol for instructing GPT-4 to do the grading. No need for fancy parsing scripts! Example: github.com/openai/evals/blob… (3/11)
2
1
23
4,207
LMs as agent simulators. The model approximates the beliefs and intentions of an agent that would produce the context, and uses that to predict the next token. When there is no context, the agent gets specified iteratively through sampling. (4/9) arxiv.org/abs/2212.01681
1
1
22
2,594
Inverse Scaling Prize. The call for tasks where large models perform worse than smaller ones has finished. Larger models are more susceptible to prompt injection, struggle to avoid repeating memorized text, and more. Robust inverse scaling is rare. lesswrong.com/posts/DARiTSTx… (6/9)
1
2
23
2,424