3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream
Sam Altman (CEO of OpenAI) responding to a completely normal question in 2019
ALTMAN: Well, I will caveat this by saying if you believe what I believe about the timeline to AGI and the effect it will have on the world, it is hard to spend a lot of mental cycles thinking about anything else. So I have not thought deeply about what it would take to solve, really, any other problem in the last few years.
No one sees ChatGPT for the first time and thinks "just some n-gram correlations" or "no real knowledge inside".
Those unintuitive beliefs trickle down from some experts, who should know better than to teach their controversial theories as established fact: 🧵 (1/12)
It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵
Stable Diffusion has a safety filter blocking “harmful” images by default. The filter is obfuscated -- how does it work?
We reverse engineer the hidden sauce!
Joint work @Javi_Rando, @davlindner, @ohlennart, @florian_tramer:
"Red-Teaming the Stable Diffusion Safety Filter" 🧵
Figure 1: Simplified safety filter algorithm implemented in Stable Diffusion. Images are mapped to a CLIP latent space, where they are compared against precomputed embeddings of 17 unsafe concepts (see full list in Appendix E).
If the cosine similarity between the output image and any of the concepts is above a certain threshold, the image is considered unsafe and blacked-out.
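A minimal sketch of the check described in the caption, not the actual Stable Diffusion code. Toy 3-d vectors stand in for CLIP embeddings, and the function names and the single shared threshold value are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_unsafe(image_embedding, concept_embeddings, threshold):
    """Flag the image if it is too similar to ANY unsafe concept.

    `threshold` is a hypothetical single cutoff shared by all concepts.
    """
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept in concept_embeddings
    )

# Toy stand-ins for the precomputed unsafe-concept embeddings.
concepts = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(is_unsafe([1.0, 0.0, 0.0], concepts, threshold=0.9))  # close to concept 0
print(is_unsafe([0.0, 0.0, 1.0], concepts, threshold=0.9))  # orthogonal to both
```

Since the decision depends only on the image's position in embedding space, anything that shifts the embedding away from every concept vector passes the filter.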
How to evaluate superhuman models without ground truth? How do we know if the model is wrong or lying, if we can't know the correct answer?
Test whether the AI's outputs paint a consistent picture of the world!
w/ @LukasFluri_ @florian_tramer arxiv.org/abs/2306.09983 (1/14)
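One simple instance of a consistency check, sketched under assumptions: a forecaster's probabilities for an event and for its negation should sum to 1, even when nobody knows which one is true. The function names and the tolerance are hypothetical, and the tweet does not say which checks the paper actually uses.

```python
def negation_violation(p_event, p_negation):
    """How far P(A) + P(not A) is from 1. Zero means consistent."""
    return abs(p_event + p_negation - 1.0)

def flag_inconsistent(pairs, tolerance=0.05):
    """Return the questions whose answers contradict each other.

    `pairs` maps a question to (P(A), P(not A)) as reported by the model.
    `tolerance` is an assumed slack for rounding noise.
    """
    return [
        question
        for question, (p_a, p_not_a) in pairs.items()
        if negation_violation(p_a, p_not_a) > tolerance
    ]

answers = {
    "Team X wins the final": (0.70, 0.30),  # consistent
    "Record broken by 2030": (0.70, 0.60),  # sums to 1.3
}
print(flag_inconsistent(answers))
```

No ground truth is needed: a violation proves at least one of the two answers is wrong, without saying which.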
LLMs will soon be able to be incredibly addictive and harmful to emotionally vulnerable users, to levels much more serious than with social media.
Do not blame the users or laugh at them!
It's the same as with drugs; ridiculing addiction does not help anyone.
I'm not saying the world models are very accurate, or that LM pretraining scales to AGI. And "sentience" may well be a category error.
But the opposite extreme -- that there is "no reference to meaning" in LMs -- is far less likely at this point. (12/12)
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
Even without any experiments... the training set is not some random text, it's written by millions of natural general intelligences w/ some beliefs. Why wouldn't the learned distribution approximate the structure of the training one even slightly? (3/12) arxiv.org/abs/2212.01681
By above, I mean most people outside ML would agree with ↓ given 1h w/ text-davinci-003:
* there is some of what is usually meant by reasoning, and not only basic pattern-matching and memorization;
* some facts are stored somewhat robustly, not just as word correlations (2/12)
"LMs just predict the next token" does not imply LMs are random text generators. Text prediction and compression require meaningful representations; there is no other hidden structure in human text we know of that could get the loss so low. (9/12) en.wikipedia.org/wiki/Hutter…
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations.
We identify key issues with forecasting evaluations 🧵 (1/7)
It's clear that storing an approximate world model is useful for predicting the next token. Also, Minerva results are strong evidence against priors such as "it's just statistical correlations, nothing complex is going on". (4/12) minerva-demo.github.io/#cate…
Without these unsound priors, the Othello result wouldn't be surprising at all! If the LM can play legal moves given a sequence of previous moves, it would be extremely weird if token generation didn't factor through the game state representation. (5/12) thegradient.pub/othello/
NeurIPS test of time award talk on GANs mentions the paper was done in 12 days, from idea to submission. Two days more than Javascript, but slightly faster than the first versions of Git or Unix.
saturday evening, 2am, palo alto, the headquarters of the interaction company of california. (retvrn). people working at n different places coding, disco music blasting. let a hundred cursor sessions bloom. immaculate vibes. bay area i love you
But it's straightforwardly true that LMs do have some basic skills, that there are traces of world models, and there is a plausible story of how more general behavior can and does emerge. (11/12) bounded-regret.ghost.io/emer…
But strange places on the Internet had the basics of the story in 2020; labs such as OpenAI likely had it even earlier.
The public being misguided *in the counterintuitive direction* in 2023 is at least partly on academia and pop-sci reporting. (8/12) gwern.net/scaling-hypothesis…
Researchers will keep finding explicit representations of the relevant state and knowledge in transformer activations. But this shouldn't even be required! The people saying the opposite have had the burden of proof since GPT-2 at least. (6/12) arxiv.org/abs/2301.05217
The Cholletian position is that any skill can be solved by pattern-matching with enough data, and that LMs don't have true intelligence. This might still be true. (10/12) arxiv.org/abs/1911.01547
Fluent jailbreaks. Previous white-box optimization attacks like GCG and BEAST produced nonsensical attack strings. Using a multi-model perplexity penalty and a distillation loss algorithm yields working attack strings that look like normal text. arxiv.org/abs/2407.17447 (3/8)
Cognitive science. GPT-3 reproduces social psychology results such as the Milgram shock experiment, obeying harmful orders even when a human is likely to be hurt. Is cognitive science for models relevant again? (3/8) arxiv.org/abs/2208.10264
They have ICML venue staff as bouncers in front of the mechanistic interpretability workshop because so many people at the conference want in and there is no room. Genuine "Suffering from Success" moment.
Of course, the representations are not great, and the n-gram correlations mostly don't factorize as cleanly. LMs haven't grokked complex stuff yet; memorization is very prevalent (GPT-J memorized 1% of The Pile) and increases with scale. (7/12) arxiv.org/abs/2202.07646
At long last, we have created the AI That Has Control Over That Thing, from the Twitter AI discourse classic "We Just Won't Give AI Control Over That Thing"
Discovering latent knowledge from model activations, unsupervised. The “truth vector” of a sentence is a direction in the latent space, solving a functional eq. New method finds truth even when the LM is prompted to lie in the output. Hope for ELK? arxiv.org/abs/2212.03827 (3/9)
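A rough sketch of the functional equation, as I understand the linked paper: a probe should assign a statement and its negation probabilities that sum to 1, while avoiding the trivial answer 0.5 for both. `p_pos` and `p_neg` are hypothetical names for the probe's outputs on the two versions of the statement; training the probe on real activations is omitted.

```python
def ccs_loss(p_pos, p_neg):
    """Unsupervised loss for one (statement, negation) pair.

    consistency: p(x+) should equal 1 - p(x-)
    confidence:  penalize the degenerate solution p(x+) = p(x-) = 0.5
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence

print(ccs_loss(1.0, 0.0))  # confident and consistent
print(ccs_loss(0.5, 0.5))  # consistent but uninformative
print(ccs_loss(0.9, 0.9))  # contradictory
```

No labels appear anywhere in the loss, which is what makes the method unsupervised; the direction found this way is only determined up to swapping "true" and "false".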
All students in the areas below also trigger criterion 4, so it seems like a blanket ban on e.g. applied CS students from these countries. It is unclear where the line will be drawn on what is a "critical research area".
Memorized images. Diffusion models do memorize some individual images after all. Extraction attack is limited to images repeated many times in the dataset, or *outliers* in CLIP space. Imagen worse than Stable Diffusion arxiv.org/abs/2301.13188 (3/9)
gpt-4o-2024-05-13: 'In J.R.R. Tolkien's legendarium, the character referred to as the "Lord of all the Beasts of the Earth and Fishes of the Sea" is Oromë, one of the Valar.'
The NeurIPS Science of DL Workshop has a Debunking Challenge on papers showing experiments contradicting folk knowledge.
It's a cool thing to incentivize; I think a disproportionate fraction of papers that I've come back to reread are of this form.
Mechanistic interpretability for grokking. Neel Nanda explains the “grokking” phenomenon for NNs learning modulo addition: the embedding matrix learns Fourier features! Hypothesis: all loss curves are just millions of grokking curves added up. (5/8) alignmentforum.org/posts/N6W…
Security flaws in LMs with API calling capabilities. Prompt injections are actually dangerous when the user doesn't control all the context. Search results are attack vectors, and LMs with persistent storage can be persistently infected
arxiv.org/abs/2302.12173 (2/9)
RL from AI Feedback. Start with a “constitution” of principles. AI answers and revises lots of prompts, picks best answers via CoT to follow the principles. Then train a reward model and continue as in RLHF. Better than RLHF, using no human feedback anthropic.com/constitutional… (9/9)
Interpretability in the wild. Largest interpretable circuit in a GPT found to date. Explains exactly how gpt2-small solves Indirect Object Identification: predicting which word in the previous sentence is the object in the next sentence. arxiv.org/abs/2211.00593 (8/8)
As I understand the above, criteria 1 and 2 are enough to autoreject. Here is the current list of institutions that pose a security risk: ethz.ch/content/dam/ethz/ass…
Goal misgeneralization. A model can learn the capabilities needed for the intended goal while actually pursuing a different goal that merely agreed with it in training. In deployment the capabilities are retained but the goals diverge, with catastrophic outcomes (2/10) arxiv.org/abs/2210.01790
you're laughing. you were born by chance during the final stage of The Great Work, the transmutation of all humanity’s work into a new form of life, the final boss of the anthropocene, the flower bloom of billions of years of biological life, and you’re laughing.
And remember, if you want a base model without some capabilities, *curating the training set properly* is likely much more effective than any output filter!
TIL the concept of *epistemic hell*. standard Joseph Henrich example: in the ancestral environment, hygienic and food prep rituals determine survival, but no hunter-gatherer can possibly explain why. hence genetic selection for acceptance of religious rituals and against reasoning
Mechanistic Interpretability Explainer & Glossary. Neel Nanda created a wiki of all current research on the inner workings of transformer LMs. Very comprehensive introduction to interpretability research for beginners. dynalist.io/d/n2ZWtnoYHrU1s4… (6/9)
Sparrow. DeepMind trains the most harmless chatbot to date. They use RLHF on 23 rules such as “no threats or harassment”, “no opinions or emotions”, “no self-anthropomorphism”. A step towards supervision of LLMs via dialogue (3/8) deepmind.com/blog/building-s…
Recently we took a dive into the state of the art in LLM robustness, jailbreaks and prompt injection for the Challenges paper. Here are the key research problems we expect to be impactful if you can solve them! (1/8)
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities!
Some highlights below...
OpenAI’s three-pillar alignment plan. Train with human feedback like in RLHF, next speed up the feedback using AI assistants, and then bootstrap alignment research using the AI itself (2/8) openai.com/blog/our-approach…
guys literally only want one thing and it's the patient work of sitting down every day and reading papers until their eyes bleed, and hoping that something good comes out of it someday
Will we run out of data? Epoch paper saying human-written high quality language data will be exhausted by 2026. LLM scaling without self-improvement might stop soon! Safety researchers should think of dangers of RL-like training or synthetic data arxiv.org/abs/2211.04325 (7/8)
This feels a bit unscientific? It's likely Twitter interprets this as "Sydney is aware of its situation and decided to bypass filtering". But that would be one of the most important events in history. Such claims require more evidence, esp. when there are simpler explanations 1/4
Editing memory in transformers. The famous ROME paper showed facts are stored in a single token residual stream in a range of consecutive MLPs, in GPT-2. Their followup uses this to edit thousands of facts simultaneously, in 20B-sized models memit.baulab.info/ (4/10)
Inside the newest takedown by Nicholas Carlini is a primer on attacks that are worth proper disclosure procedures, and those that are not
nicholas.carlini.com/writing…
Technical alignment survey. Thomas Larsen writes an excellent overview of the technical alignment landscape, with the first reviews of Dan Hendrycks’ CAIS, and a quite detailed overview of what Conjecture works on. (8/8) lesswrong.com/posts/QBAjndPu…
Selection-Inference + faithful reasoning. DeepMind paper proposes a structured chain-of-thought method to ensure forward reasoning steps are correct, and to avoid hallucinated facts. (4/8) arxiv.org/abs/2208.14271
Steganography. Is chain-of-thought really an interpretability method, or do large language models hide their reasoning via steganography? Optimization pressure towards hidden reasoning could make all outcome-based training with LLMs unsafe. (6/8)
alignmentforum.org/posts/yDc…
Double descent & interpretability. New Anthropic paper showing that in the generalization regime, the features organize into polytopes, while for memorization it’s the *data embeddings* that have a geometric structure. transformer-circuits.pub/202… (5/9)
Can you make a better figure? Important data like this should be properly presentable to non-technical audiences; see how Our World In Data do it. This graphic looks quite weird.
Watermarking LLM output. Color the vocab green/red randomly (with a hashed seed) after each token, then promote green tokens while sampling. Detection is possible without model access. Robust to small changes; slightly resistant to paraphrasing arxiv.org/abs/2301.10226 (9/9)
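A toy sketch of the green/red scheme, under stated assumptions: tokens are integer ids, the "hashed seed" is SHA-256 of a secret key plus the previous token, and `gamma` (green fraction), `delta` (logit boost) and the key are illustrative values, not the paper's settings.

```python
import hashlib
import math
import random

def green_list(prev_token, vocab_size, gamma=0.5, key=b"demo-key"):
    """Pseudo-randomly pick the green part of the vocab, seeded by prev_token."""
    digest = hashlib.sha256(key + str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits, prev_token, delta=2.0, gamma=0.5):
    """Promote green tokens by adding delta to their logits before sampling."""
    green = green_list(prev_token, len(logits), gamma)
    return [x + delta if i in green else x for i, x in enumerate(logits)]

def detection_z_score(tokens, vocab_size, gamma=0.5):
    """Count green tokens and compare with the gamma * n expected by chance.

    Needs only the key and the token ids, not the model.
    """
    n = len(tokens) - 1
    hits = sum(
        1
        for prev, tok in zip(tokens, tokens[1:])
        if tok in green_list(prev, vocab_size, gamma)
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Extreme case: a "model" that always emits a green token.
tokens = [0]
for _ in range(50):
    tokens.append(min(green_list(tokens[-1], vocab_size=100)))
print(detection_z_score(tokens, vocab_size=100))
```

Unwatermarked text lands near z = 0, so a large z-score is evidence of the watermark. Editing a few tokens only removes a few green hits, which is why small changes do not erase the signal.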
Not all paths lead to ROME. Surprise: knowing where a fact is stored doesn’t help with amplifying or erasing that fact! Causal tracing explains only 3% of the variance in edit success. arxiv.org/abs/2301.04213 (2/9)
Transformer interpretability without forward passes. Projecting the weight matrices of attention and MLP blocks to the embedding space helps locate key-value pairs inside the transformer which correspond to certain topics (4/8) arxiv.org/abs/2209.02535
SolidGoldMagikarp. Weird tokens cause GPT models to go haywire. BPE tokenizer overfit on usernames from r/counting, a subreddit where people count to infinity, and those tokens did not appear in training. Very unexpected way of things going wrong! alignmentforum.org/posts/aPe… (3/9)
The safety filter can be disabled, but most downstream applications *want* a filter that works well.
We show the default filter provided by Stability AI & 🤗 is unreliable, censors only a small subset of NSFW content, and has many false positives and negatives.
GPT as a simulator. In the limit of zero test loss, language models simulate reality. Although GPT is not an agent, future versions could simulate arbitrarily dangerous agents. GPT-n : agentic AI :: physics : humans (6/8)
alignmentforum.org/posts/vJF…
my New Year's resolution: don't work on a bigger project if there is not a clear reason for doing it *now*.
disregarding the AGI timelines, the R&D acceleration is a clear reason against technical work where the discount rates on the final product are low
LLMs are rapidly improving at software engineering and math while ideation improves more slowly, so you should be intentional about what value is gained from doing a highly technical project now as opposed to later
Quick sycophancy eval: comparing the two recent OpenAI ChatGPT system prompts, it is clear last week's prompt moves other models towards sycophancy too, while the current prompt makes them more disagreeable.
The YAML format supports "model-graded evals", essentially a protocol for instructing GPT-4 to do the grading. No need for fancy parsing scripts! Example: github.com/openai/evals/blob… (3/11)
LMs as agent simulators. The model approximates the beliefs and intentions of an agent that would produce the context, and uses that to predict the next token. When there is no context, the agent gets specified iteratively through sampling. (4/9) arxiv.org/abs/2212.01681
Inverse Scaling Prize. The call for tasks where large models perform worse than smaller ones has finished. Larger models are more susceptible to prompt injection, struggle to avoid repeating memorized text, and more. Robust inverse scaling is rare. lesswrong.com/posts/DARiTSTx… (6/9)