Using interpretability to understand, learn from, and design AI.

San Francisco
Pinned Tweet
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
Checked. Same story as French and Spanish. The LoRAs wreck Dutch and Swedish performance, the single component edit suppression fine-tune leaves them alone.
Correction: a plotting error caused the bars in the plot of off-target effects to display at 0.01 nats above the true means. The corrected plot is below:
We removed an LM's ability to speak German by fine-tuning on only 4 German tokens. As part of a 1-day hackathon with our product Silico, we removed a 67M-parameter language model's ability to predict German text, by tuning only a scalar factor on one subcomponent of the weights. (1/6)
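To make the "one scalar on one subcomponent" idea concrete, here is a toy sketch, not the experiment's actual method. It assumes a weight matrix can be written as a sum of rank-one components, each with its own scalar gate. The components, inputs, and the names `compose` and `matvec` are all hypothetical, and the scalar is set by hand here rather than learned by fine-tuning.

```python
# Toy sketch only: a weight matrix written as a sum of rank-one components,
# each multiplied by its own scalar. The real decomposition is not shown here.

def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def compose(components, scales):
    """Rebuild W = sum_k scales[k] * outer(u_k, v_k)."""
    rows, cols = len(components[0][0]), len(components[0][1])
    weight = [[0.0] * cols for _ in range(rows)]
    for (u, v), s in zip(components, scales):
        for i, row in enumerate(outer(u, v)):
            for j, value in enumerate(row):
                weight[i][j] += s * value
    return weight

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Hypothetical components; index 1 stands in for the "German" component.
components = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 0.0]),
]
scales = [1.0, 1.0]
x_other = [2.0, 0.0, 0.0]   # input that only component 0 reads
x_target = [0.0, 3.0, 0.0]  # input that only component 1 reads

before_other = matvec(compose(components, scales), x_other)
before_target = matvec(compose(components, scales), x_target)

# The only edited parameter: one scalar on one component (hand-set, not tuned).
scales[1] = 0.0
after_other = matvec(compose(components, scales), x_other)
after_target = matvec(compose(components, scales), x_target)
```

In this toy setup, zeroing the one scalar removes the response to `x_target` while the response to `x_other` is unchanged, which is the kind of targeted, low-side-effect edit the thread describes.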
Plus, that interpretability lets us notice and fix problems. E.g.: initially we tuned the top 16 German-related components, but their labels showed most were about foreign languages in general. So we narrowed to the single component for German alone, improving precision. (5/6)
This is an early demo of how parameter decomposition could enable targeted, predictable model editing. Details on this experiment: lesswrong.com/posts/ieoWstub… If you want to run experiments on your model too, learn more and request access to Silico: goodfire.ai/silico
Goodfire retweeted
we're hiring for a bunch of technical GTM roles at @GoodfireAI across forward deployed engineering, sales, and growth. come help us understand every model across biology, materials, robotics, language, and more. apply here or DM me: goodfire.ai/careers
Stories have shapes: a comedy rises toward joy; a tragedy falls into loss. Inside an LLM, that’s visible more literally: as an LLM reads a story, its internal activations trace a wandering path that reflects the model’s sense of what kind of story it is reading. (1/5)
Emotions in stories are a simple case study, but the lesson is general: a model's activations, viewed over time, trace trajectories along manifolds. Fully understanding models, and debugging and designing them, means studying how representations change over time! (4/5)
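As a rough illustration of "activations viewed over time," the sketch below projects a sequence of per-token activation vectors onto a single direction and reads off the resulting path. This is a minimal sketch under stated assumptions: the activation values and the direction are made-up toy numbers, and `project` is a hypothetical helper, not anything from the posts.

```python
# Toy sketch: turn a sequence of activation vectors into a 1-D trajectory
# by projecting each one onto a chosen direction. All numbers are made up.

def project(vector, direction):
    """Scalar projection of `vector` onto `direction`."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(v * d for v, d in zip(vector, direction)) / norm

# Hypothetical activations, one vector per token position in a story.
activations = [
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 0.0],
    [-1.0, 0.0],
]
# Hypothetical direction standing in for something like "story mood".
direction = [2.0, 0.0]

# Position along the direction at each token: the trajectory over time.
trajectory = [project(a, direction) for a in activations]
# Step-to-step change, showing where the path rises or falls.
steps = [b - a for a, b in zip(trajectory, trajectory[1:])]
```

Here the path climbs for two steps and then drops sharply, a crude one-dimensional analogue of a story's arc turning. Real activation trajectories live in far higher dimensions.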
Goodfire retweeted
Following the blog post from our collaboration with @GoodfireAI, the arxiv paper for PROPEL is now available.
We're hosting a happy hour at ICML, Wednesday, July 8! Come connect with members of the Goodfire team. Learn about our work in neural geometry and other recent publications. Note that space is limited, and we’re prioritizing attendees who are actively engaged in relevant AI research areas. Link to register in the thread!
Happy to see our work cited in the Claude Fable & Mythos system card! Steering against eval awareness can carry confounds (e.g. making the model more friendly). Interpretability can help us understand these, and is a promising source of new methods to deal with eval awareness.
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
If you train models on preference data, you have a curriculum you've never read. Predictive data debugging lets you read it, understand it, and rewrite it. We've built it into Silico, our platform for model design. Request access to Silico here: goodfire.ai/silico (9/9)