Member of Technical Staff at @TransluceAI. Building tools to study neural nets and their behaviors. He/him.

San Francisco
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmind/p…
37
387
1,981
338,692
When can you expect to learn a good representation with contrastive learning? In recent work, we show that multiple existing techniques can produce provably *minimax-optimal* representations, based on a surprising connection to kernel methods. 🧵 arxiv.org/abs/2210.01883
3
90
501
Life update: Excited to say I'll be starting a PhD this fall at the University of Toronto / Vector Institute!
17
2
376
Why do language models hallucinate? Here, I argue that they are "uncertain simulators": they divide probability across possible outcomes instead of acting conservatively when uncertain. I also give five high-level strategies for avoiding this mismatch. danieldjohnson.com/2023/03/2…
10
66
330
68,152
New paper: How can you tell when a model is hallucinating? Let it cheat! An expert doesn't need to cheat, so if your model learns to cheat, there must be something it doesn't know. Our general new approach for measuring uncertainty: arxiv.org/abs/2402.08733
5
63
324
62,854
Personal news: I've left Google DeepMind to work on tools for understanding AI systems at @TransluceAI! I'm excited to build open tech for understanding and anticipating new AI behaviors, and to figure out what questions we should ask to make sure they are safe to deploy.
7
11
276
25,836
We are excited to present the Graph Finite-State Automaton (GFSA) layer, which learns to add long-distance edges to graphs end-to-end based on a downstream objective! arxiv.org/abs/2007.04929 (With @numbercrunching and @hugo_larochelle. 1/9)
2
59
243
Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway!
Replying to @TransluceAI
Interestingly, when o3 is asked for details about its laptop, the reasoning summary suggests the model knows it doesn’t have a real laptop, and intends to clarify to the user that it’s “just simulating this setup.” (2/)
9
28
227
31,652
By popular demand, the Treescope pretty-printer from the Penzai neural net library can now be installed separately, and supports both JAX and PyTorch! And that's not all: Penzai itself now has less boilerplate and includes more pretrained Transformer models!
4
34
217
24,579
Happy to announce that our paper "Learning Graph Structure With A Finite-State Automaton Layer" has been accepted to NeurIPS as a spotlight!
We are excited to present the Graph Finite-State Automaton (GFSA) layer, which learns to add long-distance edges to graphs end-to-end based on a downstream objective! arxiv.org/abs/2007.04929 (With @numbercrunching and @hugo_larochelle. 1/9)
2
13
91
I'm at ICML this week, presenting our recent work on quantifying model uncertainty! Come check out our poster on Wednesday July 24, from 1:30-3pm (Hall C #1005):
New paper: How can you tell when a model is hallucinating? Let it cheat! An expert doesn't need to cheat, so if your model learns to cheat, there must be something it doesn't know. Our general new approach for measuring uncertainty: arxiv.org/abs/2402.08733
13
72
10,975
Penzai integrates seamlessly with @GoogleColab and the JAX ecosystem. It represents models as legible, editable data structures, to help researchers understand and modify them after they are trained. Built with support from @DougalMaclaurin, @dtarlow2, and @hugo_larochelle!
1
2
69
7,034
LLM-based assistants can speed up software development, but what should they do when they aren't sure what code to write? We're excited to share R-U-SURE, a drop-in system for adding uncertainty annotations to code suggestions! Read our paper here: arxiv.org/abs/2303.00732
1
17
68
19,473
I'll be at ICLR in Vienna next week, demo-ing Penzai (Tues @ Google DeepMind booth) and presenting recent work on measuring model uncertainty (Sat @ R2-FM workshop)! Want to chat about what models know, how they work, or tools to help us understand them? Please reach out!
1
9
58
15,776
Penzai's goal is to lower the barrier to entry for research on understanding pretrained neural networks and steering their behaviors, and to make it easier for researchers to quickly try out new ideas. I'm excited to see what the community can do with it!
2
2
45
4,672
Want to get started? Penzai's documentation (penzai.readthedocs.io/) includes guided tutorials that show how to visualize, analyze, and fine-tune the Gemma models in Colab.
Interpreting attention heads: penzai.readthedocs.io/en/sta…
Low-rank finetuning: penzai.readthedocs.io/en/sta…
1
4
43
5,529
Our open-source implementation of R-U-SURE is available now at github.com/google-research/r…! Try out our demo notebook here: colab.research.google.com/gi…
LLM-based assistants can speed up software development, but what should they do when they aren't sure what code to write? We're excited to share R-U-SURE, a drop-in system for adding uncertainty annotations to code suggestions! Read our paper here: arxiv.org/abs/2303.00732
1
7
24
3,493
Is "reward modeling" the only way to think about learning from human preferences? And does it make the right assumptions about how people express those preferences? Check out our new paper that reinterprets learning from human feedback as probability density estimation!
New paper by my colleagues @_ddjohnson, @pcastr, @hugo_larochelle, @ynd, and me on learning from pairwise human preferences seen through the lens of probabilistic modelling: arxiv.org/abs/2311.14115.
1
21
1,990
Stop by our poster at ICML at the Thursday 10:30 AM session, poster 628!
LLM-based assistants can speed up software development, but what should they do when they aren't sure what code to write? We're excited to share R-U-SURE, a drop-in system for adding uncertainty annotations to code suggestions! Read our paper here: arxiv.org/abs/2303.00732
2
22
5,982
At ICLR? Want to see how you can use Penzai to visualize and edit a transformer? Stop by the Google DeepMind booth between 12:30pm and 1:30pm for a demo!
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmind/p…
1
2
18
2,598
We are excited by the potential applications of our technique and hope that it can be used to build safer and more trustworthy AI systems that know what they don't know. Paper link: arxiv.org/abs/2402.08733 (Joint work with @dtarlow2, @DavidDuvenaud, and @cjmaddison!)
2
2
20
813
Check out our new paper on training language models to elicit behaviors from other language models!
🕵️New @TransluceAI paper: Eliciting Language Model Behaviors with Investigator Agents🕵️ We train investigator models to elicit behaviors in LMs (including harmful responses, hallucinations, and aberrant personalities)! arxiv.org/abs/2502.01236
14
1,091
The key point: a sample from a language model is a prediction about what *some other agent* will say, and it's hard to predict the behavior of someone who knows more than you do!
18
2,221
Look 👏 at 👏 your 👏 data 👏
To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇
17
850
Tune in for the spotlight presentation and live Q&A at 10AM EST, or check out our poster from 12PM to 2PM EST in Gather Town B1 - Spot C1! Spotlight track: neurips.cc/virtual/2020/prot… Poster link: neurips.cc/virtual/2020/prot…
We are excited to present the Graph Finite-State Automaton (GFSA) layer, which learns to add long-distance edges to graphs end-to-end based on a downstream objective! arxiv.org/abs/2007.04929 (With @numbercrunching and @hugo_larochelle. 1/9)
2
17
This means that predicted probabilities aren't enough to tell us whether statements are correct. Instead, we need to distinguish between the model's lack of knowledge (called epistemic uncertainty) and randomness in the true distribution of responses (aleatoric uncertainty).
1
2
16
1,569
Moreover, our strategy is applicable even to tasks without well-defined correct answers. We train a model on pairs of expert demonstrations in the "Frozen Lake" gridworld environment, and show that our method avoids unsafe actions under asymmetric partial observability.
1
1
13
839
Language models have pretty weird behaviors. We've made some exciting progress toward discovering and studying them!
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
1
14
1,394
I'll be at the @TransluceAI lunch social today at NeurIPS! Stop by to learn about our work on using AI-backed tools to understand AI systems 🤖🔎→🤖 Plus, I'll be hanging out in Vancouver until tomorrow afternoon, so let me know if you're around and interested in chatting!
Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…
14
1,052
Why is detecting hallucinations hard? Suppose you ask an LLM about digit 5641 of pi and score its response. It might report a small probability because it doesn't know the digit (is it 7 or 4?), or just because there are many correct answers ("That's 7" vs. "It's an odd number").
1
2
13
2,017
In fact, we prove that making calibrated predictions about paired responses is mathematically equivalent to making calibrated estimates of the error between the model's predictions and the true probabilities p(Y|X). We call this property *second-order calibration*.
1
2
12
1,002
Some previous approaches try to do this by clustering equivalent samples or rephrasing questions as multiple choice, but this can be tricky for open-ended questions with multiple correct answers. Ideally, we would like to be able to score individual answers on their own.
1
1
13
1,153
Why does this work? If the model can't fit p(Y|X) well, it can still reduce its loss by "cheating" and using Y1 to improve its guess about Y2. But if it can fit p(Y|X) exactly, Y1 provides no new information, so there is no incentive to use it.
1
1
13
864
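A toy illustration of the "cheating" argument above, as a minimal sketch rather than the paper's method. It assumes the model's knowledge can be written as a small set of weighted hypotheses about the true p(Y|X); the function name `pair_predictions` and this hypothesis-list representation are hypothetical, chosen only for the example. When the model is unsure which hypothesis holds (epistemic uncertainty), seeing Y1 changes its prediction for Y2. When the spread is in the true distribution itself (aleatoric uncertainty), Y1 changes nothing.

```python
from fractions import Fraction


def pair_predictions(hypotheses):
    """Return (marginal, conditional) for a belief over candidate distributions.

    `hypotheses` is a list of (weight, {answer: probability}) pairs: the
    model's belief about what the true p(Y|X) might be. This is an
    illustrative stand-in for a trained pair predictor.
    """
    answers = sorted({a for _, dist in hypotheses for a in dist})
    # Marginal prediction for a single response, averaging over hypotheses.
    marginal = {
        a: sum(w * dist.get(a, 0) for w, dist in hypotheses) for a in answers
    }

    def conditional(y2, y1):
        # p(Y2 = y2 | Y1 = y1), with Y1 and Y2 i.i.d. given the hypothesis.
        joint = sum(
            w * dist.get(y1, 0) * dist.get(y2, 0) for w, dist in hypotheses
        )
        return joint / marginal[y1]

    return marginal, conditional


half = Fraction(1, 2)

# Epistemic case: the model does not know whether the digit is 7 or 4,
# but an expert would always give the same (correct) digit.
marg_e, cond_e = pair_predictions([(half, {"7": 1}), (half, {"4": 1})])

# Aleatoric case: the model knows p(Y|X) exactly; experts simply phrase
# the correct answer in two different ways.
marg_a, cond_a = pair_predictions(
    [(Fraction(1), {"That's 7": half, "It's an odd number": half})]
)

print(marg_e["7"], cond_e("7", "7"))
print(marg_a["That's 7"], cond_a("That's 7", "That's 7"))
```

In both cases the marginal probability of the answer is 1/2, so the marginal alone cannot tell the two situations apart. Only in the first case does conditioning on Y1 move the prediction.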
Replying to @jacobaustin132
They also didn't ask evaluators to assess accuracy when measuring "quality" of responses for some reason nitter.app/RottenInDenmark/status…
1
11
676
I'm incredibly excited to be a part of what @TransluceAI is building, and can't wait to see what we can do! I'll also be moving to San Francisco soon. I'm looking forward to catching up with old friends and making new ones!
3
12
757
Other approaches use Bayesian posteriors or model ensembles to quantify uncertainty, but these estimates can be inaccurate if the model is misspecified. Most models are misspecified in practice, which often leads to overconfident predictions!
1
1
10
1,031
Open-source implementation of GFSA coming 🔜! Also, special shout-out to the library #JAX, without which this probably wouldn't have been possible. Computing gradients through iterative solutions to a batch of implicitly-defined linear systems has never been simpler! 8/9
1
1
9
According to the automatically-generated NeurIPS subtitles, next week I will be presenting the "Finance State of Thomason Layer", an approach for "learning structural abstractions in grass"
4
11
Our method overcomes these limitations by collecting and training on pairs of expert responses (Y1, Y2) drawn i.i.d. from the true p(Y|X). We can then quantify uncertainty by measuring how much Y2 depends on Y1 under the model.
1
1
11
932
This means a good model, i.e., one that is making good use of the information that it has, will cheat only when it doesn't know p(Y|X)! We formalize this intuition using the notion of calibration.
1
1
11
815
Using our equivalence, we derive a number of bounds on the error probability of calibrated pair predictors. In particular, we show how to estimate confidence by computing how much the model's predicted probabilities change when we pretend its own guesses are expert responses.
1
2
10
738
Want to see what's in your model? Penzai's IPython pretty-printer, Treescope, is now a separate library! It's built to let you explore models and data interactively, with inline tensor visualizations and many other features. See treescope.readthedocs.io for how to set it up!
1
1
11
738
We demonstrate this empirically on a synthetic language-modeling task: generating English descriptions of digits of pi. Despite never training on a false statement, our confidence closely tracks the hallucination rate, and outperforms other popular sample-filtering strategies.
1
1
9
661
While at Google DeepMind, I spent much of this year working on open-source tools to help researchers look at model internals. I'm excited to continue this line of work at Transluce, with the explicit mission of building understanding for the public good.
By popular demand, the Treescope pretty-printer from the Penzai neural net library can now be installed separately, and supports both JAX and PyTorch! And that's not all: Penzai itself now has less boilerplate and includes more pretrained Transformer models!
1
9
923
I've also recently released a writeup on arXiv about Penzai and Treescope, at arxiv.org/abs/2408.00211. Check it out for a discussion of the design goals of Penzai and Treescope, and please feel free to cite it if you find these libraries useful for your own research!
1
1
9
395
Replying to @hugo_larochelle
Thanks Hugo for doing so much to support academic research at Google and for making the Google Montreal team a great place to work! I can't wait to see what you do next.
9
779
This is important because today’s models do not always generalize in human-like ways, and rarely conform to expectations of what AI systems should do. Researchers are continuously discovering new emergent capabilities, idiosyncratic personality quirks, and puzzling blind spots.
1
9
344
Interesting paper showing that power-law scaling behavior can emerge from a collection of discrete learned computations!
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵: arxiv.org/abs/2303.13506
9
776
Replying to @TaliaRinger
This seems related to the notion of "options" in reinforcement learning, where one of the actions a policy can select is to transfer control to some sub-policy (until that sub-policy "returns" back)
1
7
I'm also excited to work on understanding the patterns behind model behaviors. How coherent are model personalities across contexts? When does it make sense to view LLM assistants as having intentions and goals, and how can we identify the goals that best explain their behaviors?
1
8
392
I am thankful to have had the chance to work with so many talented and creative researchers at Google. I'm especially grateful to @dtarlow2 and @hugo_larochelle, my original AI residency mentors, whose advice and support during my time at Google have helped me in so many ways.
1
8
1,218
Our cheat-corrected epistemic confidence can be estimated by scoring each sample twice, and we prove that, for a sampled prediction with confidence C, the probability that it was a hallucination is at most 1-C!
1
6
651
In fact, we can build provably-valid confidence intervals for p(Y|X) without *any* assumptions about the model or p(Y|X), as long as we can draw i.i.d. paired responses. This means known impossibility results for distribution-free binary regression do not apply to pair training.
1
1
8
593
When/where:
Penzai demo (penzai.readthedocs.io/): Tuesday 12:30-1:30 at the Google DeepMind booth
"Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs" (arxiv.org/abs/2402.08733): Saturday at the Workshop on Reliable & Responsible Foundation Models (poster)
1
8
400
And I believe the best way to reach an informed consensus about how to deploy AI systems responsibly is to build tools for scalably observing, understanding, and interacting with them. I'm especially interested in building tools that help us figure out the right questions to ask.
1
8
288
Replying to @DynamicWebPaige
Have you played Tunic? Cute exploration game with a lot of secrets and mysteries. Other favorites:
- Celeste
- Far: Lone Sails
- The Talos Principle
- The Witness
- Baba is You
- Hyper Light Drifter
- Fez
- Taiji
- Paradise Marsh
- Return of the Obra Dinn
- Hades
- Inside
7
334
I believe the AI research field is still far away from understanding what behaviors and drives exist in these models, how they emerge, and which ones we should be watching for. Without this, we may overfit to specific known risks and overlook dangerous unknown failure modes.
1
8
316
You can read more about Penzai at penzai.readthedocs.io! Penzai and Treescope are both open source, at github.com/google-deepmind/p… and github.com/google-deepmind/t…, respectively.
2
1
8
517
I'll be talking about this work during the ICML GRL+ workshop, if you'd like to hear more details or have any questions! 9/9
1
6
Agreed! I think it's interesting to think of explanations as simple predictive models of behavior, with reconstruction error measuring predictive power. If we search the right space and use more compute to explore it automatically, I think we can make a lot of progress.
2
7
427
This is an interesting research area! It seems like one of the current best ways is to draw a lot of samples and see how consistent they are. @_lorenzkuhn has a recent paper on this kind of metric: arxiv.org/abs/2302.09664
1
7
105
We can also construct bounds on the distance between the true probabilities and the model predictions. On the CIFAR-10H dataset, we find that our approach gives well-calibrated estimates of this distance, whereas previous techniques are usually overconfident.
1
1
7
582
This is still a difficult optimization problem, so we adapt two tricks from combinatorial optimization: dual decomposition, which breaks our problem into a set of message-passing subproblems, and decision diagrams, which let us solve subproblems efficiently.
1
7
201
Coming to ICML and interested in understanding models and their behaviors? Stop by Transluce's happy hour on Thursday!
Transluce is hosting an #IMCL2025 happy hour on Thursday, July 17 in Vancouver. Come meet us and learn more about our work! 🥂 lu.ma/1w854pjn
1
7
623
Overall, we are excited about incorporating user interaction into minimum-Bayes-risk objectives to mitigate harms of model hallucinations. We see our work as a step toward ML assistants that empower users by giving conservative predictions in the presence of uncertainty.
1
6
344
Our analysis builds on previous theoretical analyses of contrastive learning (HaoChen et al. 2021, Tosh et al. 2021). It is also related to more recent unifications of various self-supervised learning methods, including those by Balestriero & LeCun (2022) and Tian (2022).
1
6
Our key observation is that samples from a well-trained generative model can be interpreted as plausible goal states for the user's code! We can thus use these samples to approximate the expected utility of a suggestion, similar to sample-based minimum Bayes risk decoding.
1
6
242
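The idea of treating model samples as plausible intents can be sketched with plain sample-based minimum Bayes risk selection. This is not the R-U-SURE system: it omits the uncertainty annotations entirely, and it swaps in `difflib.SequenceMatcher.ratio()` as a stand-in for the edit-distance utility. The names `utility` and `mbr_select` and the example strings are made up for illustration.

```python
from difflib import SequenceMatcher


def utility(suggestion, intent):
    """Similarity in [0, 1]; a stand-in for an edit-distance-based utility."""
    return SequenceMatcher(None, suggestion, intent).ratio()


def mbr_select(candidates, samples):
    """Pick the candidate with the highest average utility over the samples.

    Each sample is treated as one plausible version of the code the user
    wants, so the average approximates the expected utility.
    """

    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in samples) / len(samples)

    return max(candidates, key=expected_utility)


# Hypothetical samples drawn from a code model for the same prompt.
samples = [
    "return x + 1",
    "return x + 1",
    "return x + 2",
    "return abs(x)",
]

# Use the samples themselves as the candidate suggestions.
best = mbr_select(samples, samples)
print(best)
```

The selected suggestion is the one closest on average to everything the model considers plausible, which tends to favor the consensus sample over outliers.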
I did a bit of work on learning an interpretable finite-state automaton for extracting information from a graph, which was inspired by options and can sort of be interpreted as learning a family of tail-recursive functions if you squint hard enough (arxiv.org/abs/2007.04929)
5
One way to understand the uncertainty in a language model's output is to look at its per-token probabilities. However, this can be hard to interpret and sometimes misleading, since token probabilities always depend on all previous tokens and on the model vocabulary.
1
6
361
(Joint work with @AyoubElHanchi and @cjmaddison!) Our first insight is that multiple popular contrastive learning objectives can be re-interpreted as *kernel learning* objectives: minimizing them involves building an approximation of a particular "positive-pair kernel".
1
6
Empirically, we find that R-U-SURE is better than baselines at identifying the regions that differ between model suggestions and ground truth intents from our test set. The utility of our suggestions against the ground-truth intent is also high, and improves with more samples.
1
5
183
Additionally, with weaker augmentations, the eigenfunction approximations become much worse. It's thus likely that inductive biases play a larger role in representation learning under weak augmentations, as was also discussed by Saunshi et al. (arxiv.org/abs/2202.14037).
1
4
In contrast, our system produces annotations by explicitly approximating the utility of a suggestion for a user with a particular intent. We focus on edit distance, and assume that identifying regions as uncertain makes them easier to edit, but less useful if they are correct.
1
5
229
Replying to @_dsevero
Congrats!!
2
130
Formally, our goal is to find an annotated suggestion that maximizes our edit-distance based utility metric for the (unknown) code that the user wants to write. Since we don't know the user's intent exactly, we maximize the expected value of this metric over possible intents.
1
5
201
This kernel implicitly defines a geometric embedding of the augmentations into a high-dimensional space! We can identify the most significant directions in this space using Kernel PCA, and use this to build a parameterization-agnostic representation of the data.
1
4
Overall, we hope that these connections between contrastive learning, kernel methods, PCA, and Markov chains provide a useful lens for theoretical study of self-supervised representation learning and give new insights toward building useful representations in practice.
5
Performance of graph neural networks and related models can sometimes be improved by adding higher-level abstract edges. For instance, when representing source code, we might start with the abstract syntax tree, and then add edges based on program analyses. 2/9
1
3
Great analysis, looks like GPT-3.5 may have overfit to what a "theory of mind" scenario looks like?
So about the 'Large Language Models Learned Theory-of-Mind(?)' discussion: arxiv.org/pdf/2302.08399.pdf Has ToM emerged in current LLMs? I doubt it.
1
5
1,006
Surprisingly, it turns out that performing Kernel PCA with the positive pair kernel is actually *equivalent* to identifying these Markov chain eigenfunctions! So we can expect representations built with Kernel PCA to be well suited to approximating view-invariant functions.
1
4
It's possible to average over all possible paths of the agent by iteratively solving a linear system inside the layer, which leads to the new edges for the downstream model. Gradients can be computed efficiently via implicit differentiation. 4/9
1
3
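Averaging over all paths by solving a linear system can be sketched with a generic absorbing Markov chain. This is not the GFSA layer itself: the matrices, the function name `absorption_probabilities`, and the fixed iteration count are assumptions for illustration, and the implicit-differentiation step for gradients is not shown. With transient-to-transient probabilities Q and transient-to-absorbing probabilities R, the absorption probabilities B satisfy B = R + Q B, which can be solved by repeated substitution.

```python
def absorption_probabilities(Q, R, num_iters=100):
    """Iteratively solve B = R + Q @ B for an absorbing Markov chain.

    Q[i][k]: probability of moving from transient state i to transient state k.
    R[i][j]: probability of moving from transient state i to absorbing state j.
    B[i][j]: probability of eventually being absorbed in j when starting at i,
             summed over every possible path.
    """
    n, m = len(R), len(R[0])
    B = [[0.0] * m for _ in range(n)]
    for _ in range(num_iters):
        B = [
            [R[i][j] + sum(Q[i][k] * B[k][j] for k in range(n)) for j in range(m)]
            for i in range(n)
        ]
    return B


# Two transient states that bounce between each other; each one can also
# stop in its own absorbing state with probability 0.5.
Q = [[0.0, 0.5],
     [0.5, 0.0]]
R = [[0.5, 0.0],
     [0.0, 0.5]]

B = absorption_probabilities(Q, R)
print(B)
```

Solving by hand gives B[0][0] = 2/3 and B[1][0] = 1/3, and each row sums to 1 because every walk is eventually absorbed.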
We can rewrite this property in terms of the eigenfunctions of a Markov chain over positive pairs, which reveals that approximately view-invariant functions must concentrate on eigenfunctions with eigenvalue near 1.
1
4
We next study the behavior of contrastive learning models on two synthetic tasks for which the exact positive-pair kernel can be computed. Specifically, we extract learned approximations of each eigenfunction and compare them to each of the true eigenfunctions.
1
4
Interested in taking apart neural networks and putting them back together in JAX? Penzai 0.2 simplifies the original Penzai neural network API and adds support for mutable variables, making it even easier to manipulate model components and capture intermediate values.
1
4
552
In short, this means that there's a fundamental connection between minimizers of contrastive learning objectives and functions satisfying our view invariance assumption, allowing us to build the best possible representation under that assumption!
1
4
We indeed find that we can extract good approximations of these eigenfunctions from models trained with multiple different losses and parameterizations, and that eigenvalues are related to view-invariance in a predictable way regardless of which loss is used!
1
4
We can even invert the meaning of the annotations, and use our system to identify the most useful parts of a long generated sample! This could be used to preemptively show documentation or usage examples instead of directly suggesting code.
1
4
190
An advantage of our approach is that it gives a lot of flexibility to define the utility function. We can adapt the edit distance calculation to use AST structure, annotate locations of possible insertions, and allow truncating suggestions if the uncertainty is too high.
1
4
162
Plus, Penzai 0.2 also includes modular implementations of the Gemma, Llama, Mistral, and Pythia / GPT-NeoX transformer language models out of the box! Because of Penzai's declarative design, you can directly see the differences in model architectures just by printing them out.
1
4
875
A cool thing is that this kind of approach depends on *description complexity* rather than complexity of internal circuits. A black-box explanation like "Claude 3 Opus cares about animal welfare" might still lead to good predictions even if you don't know how that's implemented!
1
2
210
Furthermore, our system is independent of the model architecture, and does not require any fine-tuning, making it applicable to any pretrained generative model of code. And an open-source implementation is coming soon!
1
1
4
266
Very cool work: unbiased automatic differentiation for programs with discrete random variables!
Differentiable programming (dP) is great: train neural networks to match anything w/ gradients! ODEs? Neural ODEs. Physics? Yes. Agent-Based models? Nope, not differentiable... or are they? Check out our new paper at NeurIPS on Stochastic dP!🧵 arxiv.org/abs/2210.08572
4
Indeed, we prove that the Kernel PCA representation is *minimax-optimal* for quadratic approximation error: it minimizes the approximation error of an adversarially-chosen target function satisfying our view-invariance assumption.
1
3
Finally, we combine the GFSA layer with larger graph network models, and train the whole system end to end on the variable misuse task. We find that the GFSA layer improves accuracy vs. hand-engineered edges or learned baselines. 7/9
2
2
Replying to @agarwl_
Congrats! Excited to see what you do next!
2
306
Update: Code is now available!
Our open-source implementation of R-U-SURE is available now at github.com/google-research/r…! Try out our demo notebook here: colab.research.google.com/gi…
2
3
689
Replying to @colin_fraser
@grok is this true
1
3
263
Modifications to a model create a copy of the model object that shares the same backend array memory! This works because all arrays are immutable in JAX.
2
167
What properties do we want this representation to have? Ideally, we'd want a single representation that allows us to estimate any function that is approximately invariant to multiple views, i.e. any function that assigns similar values to positive pairs on average.
1
3