Recently started @thinkymachines. Interested in reinforcement learning, alignment, birds, jazz music

I shared the following note with my OpenAI colleagues today: I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work. I've decided to pursue this goal at Anthropic, where I believe I can gain new perspectives and do research alongside people deeply engaged with the topics I'm most interested in. To be clear, I'm not leaving due to lack of support for alignment research at OpenAI. On the contrary, company leaders have been very committed to investing in this area. My decision is a personal one, based on how I want to focus my efforts in the next phase of my career. I joined OpenAI almost 9 years ago as part of the founding team after grad school. It's the first and only company where I've ever worked, other than an internship. It's also been quite a lot of fun. I'm grateful to Sam and Greg for recruiting me back at the beginning, and Mira and Bob for putting a lot of faith in me, bringing great opportunities and helping me successfully navigate various challenges. I'm proud of what we've all achieved together at OpenAI; building an unusual and unprecedented company with a public benefit mission. I am confident that OpenAI and the teams I was part of will continue to thrive without me. Post-training is in good hands and has a deep bench of amazing talent. I get too much credit for ChatGPT -- Barret has done an incredible job building the team into the incredibly competent operation it is now, with Liam, Luke, and others. I've been heartened to see the alignment team coming together with some promising projects. With leadership from Mia, Boaz and others, I believe the team is in very capable hands. I'm incredibly grateful for the opportunity to participate in such an important part of history and I'm proud of what we've achieved together. I'll still be rooting for you all, even while working elsewhere.
177
390
5,205
1,334,094
Confirming that I left Anthropic last week. Leaving wasn't easy because I enjoyed the stimulating research environment and the kind and talented people I was working with, but I decided to go with another opportunity that I found extremely compelling. I'll share more details in the coming weeks. Thanks to Jared, Jan, Dario and others for the support during my time at Anthropic, and I wish them all the best.
84
80
2,824
445,825
Certain software skills are exceptionally useful for machine learning. In a previous era, it was GPU programming. Now in the era of pretrained models, it's front-end development -- to quickly whip up a UI to collect a fine-tuning or eval dataset.
45
164
1,281
Tinker provides an abstraction layer that is the right one for post-training R&D -- it's the infrastructure I've always wanted. I'm excited to see what people build with it. "Civilization advances by extending the number of important operations which we can perform without thinking of them" -Whitehead
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
49
96
1,253
187,253
Excited to build a new AI research lab with some of my favorite former colleagues and some great new ones. Looking forward to sharing more in the coming weeks.
Today, we are excited to announce Thinking Machines Lab (thinkingmachines.ai/), an artificial intelligence research and product company. We are scientists, engineers, and builders behind some of the most widely used AI products and libraries, including ChatGPT, Character.ai, PyTorch, and Mistral. Our mission is to make artificial intelligence work for you by building a future where everyone has access to the knowledge and tools to make AI serve their unique needs. We are committed to open science through publications and code releases, while focusing on human-AI collaboration that serves diverse domains. Our approach embraces co-design of research and products to enable learning from real-world deployment and rapid iteration. This work requires three core foundations: state-of-the-art model intelligence, high-quality infrastructure, and advanced multimodal capabilities. We are committed to building models at the frontier of capabilities to deliver on this promise. If you’re interested in joining our team, consider applying here: 6wajk07p.paperform.co/
41
47
1,191
113,373
Replying to @amasad @DavidSacks
Nope, we don't know how to train models to reason about controversial topics from first principles; we can only train them to reason on tasks like math calculations and puzzles where there's an objective ground truth answer. On general tasks, we only know how to train them to imitate humans or maximize human approval. Nowadays post-training / alignment boosts benchmark scores, e.g. see qwenlm.github.io/blog/qwen2.…
25
75
949
105,672
Really happy to see people reproducing the result that LoRA rank=1 closely matches full fine-tuning on many RL fine-tuning problems. Here are a couple nice ones: nitter.app/ben_burtenshaw/status/…
much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/blob/…
13
87
942
126,893
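For readers unfamiliar with what rank=1 means in the posts above, here is a minimal sketch of a LoRA-style layer. It assumes the commonly used formulation in which a frozen weight W is augmented by a low-rank product scaled by alpha / r. The layer size and the names `lora_forward`, `d_in`, and `d_out` are illustrative assumptions, not taken from the linked script, and the sketch only counts trainable parameters; it makes no attempt to reproduce the reported memory savings.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Hypothetical layer size; rank 1 as discussed in the posts above.
d_in, d_out, r = 4096, 4096, 1
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.02      # trainable, shape (r, d_in)
B = np.zeros((d_out, r))                       # trainable, zero-init so the update starts at 0

x = rng.standard_normal((2, d_in))
y = lora_forward(x, W, A, B)

full_params = W.size            # parameters updated by full fine-tuning of this layer
lora_params = A.size + B.size   # parameters updated by the rank-1 adapter
print(full_params, lora_params)
```

With rank 1, the adapter for this layer trains `d_in + d_out` numbers instead of `d_in * d_out`, which is where the reduction in optimizer and gradient memory would come from.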
There are some intriguing similarities between the r1 chains of thought and the o1-preview CoTs shared in papers and blog posts (eg openai.com/index/learning-to…). In particular, note the heavy use of the words "wait" and "alternatively" as transition words for error correction and double-checking.
36
37
721
158,453
For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to prevent the act (like Claude did here), (2) just refuse to help (in which case the user might be able to jailbreak/manipulate the model to help using different queries), (3) always comply with the user's request. (2) and (3) are reasonable, but I bet your preferred approach will also have some undesirable edge cases -- you'll just have to bite a different bullet. Knee-jerk criticism incentivizes (1) less transparency -- companies don't perform or talk about evals that present the model with adversarially-designed situations (2) something like "Copenhagen Interpretation of Ethics", where you get blamed for edge-case model behaviors only if you observe or discuss them.
119
40
704
213,073
A compelling intuition is that deep learning does approximate Solomonoff induction, finding a mixture of the programs that explain the data, weighted by complexity. Finding a more precise version of this claim that's actually true would help us understand why deep learning works so well. There are a couple recent papers studying how NNs solve algorithmic tasks, which seem like exciting progress in this direction. - arxiv.org/abs/2309.02390 - develops a theory around when NN training learns a "memorizing" vs "generalizing" solution, which depends on each solution's "efficiency" -- how much param norm is needed to get correct & confident outputs. This theory predicts grokking phenomena - arxiv.org/abs/2310.16028 - transformers can't represent Turing machines, but they can represent a smaller class of computations, described by RASP programs. This paper finds that indeed, if data is generated by a RASP-L program, the transformer will learn exactly the right function.
17
90
659
230,749
@barret_zoph and I recently gave a talk at Stanford on post-training and our experience working together on ChatGPT. Unfortunately the talk wasn't recorded, but here are the slides: docs.google.com/presentation…. (If you have a recording, please let me know!)
10
78
637
83,815
We're happy to support the Human Centered LLMs course, on topics close to our hearts. We'd like to support more classes with free credits for students to use on assignments and projects. If you're an instructor interested in using Tinker in your course, please reach out to tinker@thinkingmachines.ai.
Thanks @thinkymachines for supporting Tinker access for our CS329x students on Homework 2 😉
16
57
634
175,956
Happy to share a new paper! Designing model behavior is hard -- desirable values often pull in opposite directions. Jifan's approach systematically generates scenarios where values conflict, helping us see where specs are missing coverage and how different models balance tradeoffs.
New research paper with Anthropic and Thinking Machines AI companies use model specifications to define desirable behaviors during training. Are model specs clearly expressing what we want models to do? And do different frontier models have different personalities? We generated thousands of scenarios to find out. 🧵
13
45
620
113,724
Now that another LM product is getting flak, I can say this without sounding too self-serving: Alignment -- controlling a model's behavior and values -- is still a pretty young discipline. Annoying refusals or hyper-wokeness are usually bugs rather than features
26
53
518
126,785
Big fan of Jeremy’s work on optimization—great to see his first Thinking Machines post!
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices. thinkingmachines.ai/blog/mod… We explore a fundamental understanding of the geometry of neural network optimization.
11
27
511
64,138
I'm more annoyed at whoever named us homo sapiens sapiens
Thinking vs think vs thinking-think
33
7
460
79,324
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents (arxiv.org/abs/2510.16255). Auditing agents search through training datasets and query the model being trained; using these tools they can detect various existing fine-tuning attacks, with a low false-positive rate. I advised this project through the MATS program. I've been impressed by the organization of the program and the caliber of people involved.
10
50
464
90,281
Great to see an open source backend in the works for the Tinker API. If Tinker is going to power open science and open software, it shouldn’t depend on a single proprietary implementation.
The Tinker API recently released by Thinking Machines will have a big impact on how people think about post-training and inference systems. To allow more people to experiment with Tinker like systems and run it on their own hardware, we started SkyRL tx 🧸, an open source project with the goal of implementing the Tinker API, see our blog post novasky-ai.notion.site/skyrl…. We welcome contributions, looking forward to working with the open source community 🚀
1
23
375
52,434
Whether to collect preferences ("do you prefer response A or B?") from the same person who wrote the prompt, or a different person, is important and understudied. Highlighted this question in a recent talk docs.google.com/presentation…. Sycophancy probably results when you have the same person doing the prompting and labeling, especially when the user does both.
This is serious, and we should make sure to prevent sycophantism as much as possible... Related: have we tried using other humans' feedback for RLHF instead of the original prompter's? This might somewhat help with debiasing 🤔
13
34
374
70,222
I was happy to see the second version of the OpenAI Model Spec released last week. Sharing my notes:
- One notable change is that each section is labeled with an authority level, from "platform" (can't be overridden by the user or developer) to "guideline" (can be easily overridden). This seems like a nice conceptual simplification of the notion of "defaults" in the previous version, unifying the authority levels of the spec itself with the levels of different messages.
- A couple lines are refreshingly honest. The objective "Maintain OpenAI's license to operate by protecting it from legal and reputational harm", and "[why chains of thought hidden] ... as well as for competitive reasons." This is the kind of thing that'd usually get watered down by comms/legal/policy teams at a typical company.
- The spec starts to cover a couple topics that weren't present before, such as multimodality (eg using accents, avoiding premature warnings) and agents (with a discussion of what it means for an agent to overstep when pursuing user-defined goals).
- There's a new untrusted_text feature, which presumably means there'll be an API feature for quoted text, where it's delimited by special tokens rather than leaving the developer to handle quoting and the model to interpret the quoting. This is useful for protecting against prompt injection.
- In a couple places, a point from the previous spec is derived more from first principles in this one. The most controversial part of the previous spec was "don't try to change anyone's mind", wrt users having false beliefs like "the earth is flat". Now this is justified as a special case of "highlight possible misalignments", following from "assume the user's long-term goals include learning, self-improvement, and truth-seeking".
- This is a subtle and debatable point, but I like the emphasis on user freedom ("intellectual freedom" is used a few times), as opposed to more of a cost benefit analysis. It's like "maximize user freedom subject to constraints" as opposed to "do cost benefit analysis to enable beneficial use cases and prevent harmful ones". The latter would give the platform too much moral authority.
- Detail added in various places, e.g. more style guidelines about conversational behavior, more detail about privileged information in developer messages.
- Still no erotica
15
21
357
70,550
Handy trick: if you say something dumb, follow with "that was just a temperature=1 sample, don't take it seriously"
12
30
334
jack-o-lora
16
14
408
93,241
The Chinese translation of Artificial Intelligence, 人工智能, has a curious visual resemblance to the letters AI
9
25
312
Glad to finally release this, as it includes a bunch of directions I'm excited about: - RL with a reward function defined by human judgements - language models using tools (a web browser) - making it easier for humans to rate the AI's output (AI cites its sources)
We trained a research version of GPT-3 that can search the web, synthesize information, and cite its sources to provide more accurate answers to questions. openai.com/blog/improving-fa…
1
40
295
I'd like to see some research on where the political and moral ideologies of RLHF'd language models come from. Make some questionnaires that measure a model's ideology. Create a variety of models with few-shot prompting, SFT, and RL; look at the ideology at each stage and how it depends on the dataset composition. Also, it'd be interesting to look at data from crowdworkers and what values it reflects -- it might not reflect the workers' own values, but the values they expect the employer to have.
20
19
261
83,358
A research project related to sycophancy: define explicit features like "does the response agree with the user" as in arxiv.org/abs/2310.13548, and then construct a preference function that subtracts out their effect, as in arxiv.org/abs/2404.04475. I.e., remove some bad causal contributors to the preferences
8
20
278
43,867
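The sycophancy project idea above (define an explicit feature such as "does the response agree with the user", then build a preference function with that feature's effect subtracted out) can be sketched roughly as follows. This is a toy illustration on synthetic data, not the method of either cited paper: the data-generating weights, the feature names `quality_diff` and `agree_diff`, and the helper `fit_logistic` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain gradient descent on the mean logistic loss; returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
n = 5000

# Synthetic comparisons between responses A and B (all values are made up).
quality_diff = rng.standard_normal(n)          # stand-in for legitimate quality of A minus B
agree_diff = rng.choice([-1.0, 0.0, 1.0], n)   # does A agree with the user more than B does?
true_logit = 1.0 * quality_diff + 1.5 * agree_diff  # simulated raters reward agreement
labels = (rng.random(n) < sigmoid(true_logit)).astype(float)  # 1 if A was preferred

# Step 1: fit a preference model that includes the explicit sycophancy feature.
X = np.column_stack([quality_diff, agree_diff])
w = fit_logistic(X, labels)

# Step 2: subtract out the feature's effect by zeroing its weight.
w_debiased = w.copy()
w_debiased[1] = 0.0

def debiased_pref(q, a):
    """P(A preferred over B) with the agreement contribution removed."""
    return sigmoid(np.array([q, a]) @ w_debiased)

print("learned weights:", w)
print("debiased P(A > B) when only agreement differs:", debiased_pref(0.0, 1.0))
```

In this toy setup, two responses that differ only in agreement with the user end up with a debiased preference of 0.5, while the fitted weight on agreement shows how much the simulated raters were rewarding it.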
A series of 4-month internships at companies and academic research groups could be a good replacement for an undergrad degree. Students would still go through coursework (perhaps online) but only as needed for job and interview prep.
Replying to @sama
The list could go on for a long time, but the point is: What a time to start an alternative to college! The world really needs it.
12
32
243
Congrats on the launch! Doing things in the physical world is underrated by AI people.
3
8
257
37,028
This morning a couple local kids rang my doorbell and ran away. Glad kids are still playing outside and not spending all day on homework and roblox
4
3
224
Replying to @sainingxie
Wow, you must've been the first or second person to take that interview. (Followed by MANY others.) Glad you kept a record of it!
1
223
12,878
Coming soon to your favorite word processor Ctrl-alt-V: "paste and paraphrase" also, "paste and match writing style"
10
15
208
37,765
Replying to @emollick
We'll post some release notes in a day or two. We were just a bit uncoordinated about getting everything ready at once, and we didn't want to further delay getting the new model out to developers.
5
4
152
33,007
Strange that "how does the brain implement backprop?" doesn't get more attention in neurosci. (Some exceptions, e.g. Tim Lillicrap's work). I'm certain that the brain does it: (1) learning with gradients is much faster (2) backprop is the only efficient way to compute gradients
18
9
128
Replying to @sama
❤️
1
5
124
20,156
IMO, language model consciousness can be studied experimentally, and this would be a fruitful research direction. For example, study *conscious access*: which internal variables (activations, attentions, etc.) can a LM learn to report on in its text output?
13
14
122
Replying to @DavidSKrueger
That's inconsistent with my recollection of Greg's views, and it doesn't sound like something Greg would say even if he did disagree with other people on the team
5
2
118
23,486
"Trust region utilitarianism": there is a sensible utility function to maximize, but it's only valid locally around the current state of the world, where the intuitions that produced it are grounded. "Repugnant conclusion" is outside trust region -- not a problem
7
7
113
37,467
That said, these public outcries are important for spurring us to solve these problems and develop better alignment tech
4
3
110
31,402
I've been enjoying @RichardMCNgo's sci-fi writing at narrativeark dot xyz. It's a rare feat to combine these three properties: (1) about post-AGI worlds (2) plausible (3) actually fun to read.
2
5
107
30,743
Stumbled upon this charming short story, "Someday", by Isaac Asimov: nyc3.digitaloceanspaces.com/…. Features a language model called Bard, which the boys fine-tune on some recent data discussing itself and other LMs...
8
17
86
27,284
Got access to @Cruise driverless ride service today -- flawless pickup + 30 min drive + dropoff. A bit slow at intersections, but still very impressive!
2
5
87
Replying to @NickADobos
currently we don't show max_tokens to the model, but we plan to (as described in the model spec). we do think that laziness is partly caused by the model being afraid to run out of tokens, as it gets penalized for that during training
2
2
80
5,668
15 character limit
4
75
4,374
Replying to @natolambert
I think the term was coined (or popularized) by @bobmcgrewai in the early days of ChatGPT. The team doing ChatGPT fine-tuning was previously called the RL team for historical reasons, and Bob suggested renaming it to Post-Training to reduce confusion. Of course, it's a natural name, so it was probably used independently by others.
2
1
72
5,316
Replying to @giffmana
We'll add back that section later -- probably within a couple weeks. We weren't finished writing it at launch time
3
65
11,233
Challenge accepted
1
3
56
6,551
Even if I've tested a result extensively, it's hard to know how well it'll generalize to different experimental setups and software stacks
2
58
7,874
Haven't seen any discussion about how CO2 levels inside masks are very high -- much worse than stuffy meeting rooms. 2000 ppm in cloth mask, paper, or valved KN95: aaqr.org/articles/aaqr-20-07…, 25000 ppm (!) in KN95 bmcinfectdis.biomedcentral.c… Am I missing something?
10
10
52
Replying to @neilbband
Great paper! IMO incentivizing calibrated long-form outputs is one of the important open problems of the field. Decision-theoretic lens seems right, and the log-loss-on-related-qa-pair objective seems like a good approximation.
1
4
48
11,075
Consciousness is probably a confused mixture of various concepts (from "what information is accessible to verbal system" to "who has moral patienthood") but it should be possible to pose some well-defined problems and chip away at some of the mysteriousness.
2
5
42
An ML modeling problem that occurred to me while driving (maybe a good interview question): describe how to design a speech recognition system that preferentially decodes entities that are nearby (say, within 50 miles).
4
1
43
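One possible answer to the speech recognition question above, offered only as a sketch and not necessarily the design the author had in mind: rescore the recognizer's n-best hypotheses with a location prior, adding a log-score bonus to hypotheses whose entity lies within the radius. The function names, the `boost` value, the example hypotheses, and the approximate coordinates are all hypothetical.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    earth_radius_miles = 3958.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def rescore(hypotheses, entity_locations, user_loc, radius_miles=50.0, boost=2.0):
    """Pick the best hypothesis after adding a log-score bonus for nearby entities.

    hypotheses: list of (text, log_prob, entity_or_None) from the recognizer.
    entity_locations: dict mapping entity name -> (lat, lon).
    """
    rescored = []
    for text, log_prob, entity in hypotheses:
        bonus = 0.0
        if entity in entity_locations:
            lat, lon = entity_locations[entity]
            if haversine_miles(user_loc[0], user_loc[1], lat, lon) <= radius_miles:
                bonus = boost  # acts like a log-prior favoring nearby entities
        rescored.append((log_prob + bonus, text))
    return max(rescored)[1]

# Hypothetical example: a user near San Francisco; coordinates are approximate.
user_loc = (37.77, -122.42)
entity_locations = {"Marin": (38.08, -122.76), "Maryland": (39.05, -76.64)}
hypotheses = [
    ("drive to Maryland", -4.0, "Maryland"),  # slightly preferred acoustically
    ("drive to Marin", -4.2, "Marin"),        # nearby, so it receives the bonus
]
best = rescore(hypotheses, entity_locations, user_loc)
print(best)
```

A fuller design would fold the prior into decoding itself (for example by biasing the language model toward a geographically filtered entity list) rather than rescoring afterward; the rescoring version is used here because it is the shortest to show.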
Replying to @QuintinPope5
The reason SI is compelling is (1) a NN forward pass is basically a program (2) SI is the upper limit of what you can do by searching over programs, (3) no other inductive bias / prior in ML comes as close to describing NNs' ability to learn patterns/programs. Understanding NNs as GPs is useful, but AFAICT the existing theory doesn't tell you why NNs correspond to *good* kernels. I'd love to see a theory that shows that for deep NNs, the NNGP/NTK kernel corresponds to an interesting prior that gets better with depth. What would it mean to show that it's a good prior? In particular, you could imagine a result showing that for infinite-width transformers, the prior puts a certain non-negligible weight on all size-d RASP programs -- I'd consider that a convincing result for the SI analogy!
3
4
45
7,330
Hi Nathan, the slides don't have that much content, but here they are: drive.google.com/file/d/1hEa… I didn't talk much about tuning or PPO; more about general methodology and principles.
1
6
43
5,974
Replying to @NoamShazeer
Great post. I do the same :) github.com/joschu/jax-exp/bl…
1
43
9,722
Replying to @peterwildeford
Yup, there's quite a lot to figure out. I see model specs as mostly a kind of applied morality, like law but with very different details. Though it also opens up many new moral questions.
2
29
2,145
Replying to @unixpickle
it represents the forward and backward pass
1
40
1,240
More generally, look at all the phenomenology that neuroscientists have found (e.g. Gazzaniga's split brain experiments, lesion studies) and set up analogous experiments in language models.
2
2
36
Replying to @ctjlewis
Feel free to dm me, can try to help
3
35
6,293
Actually 2 days ago, not last week :)
1
1
35
16,930
Replying to @hendrycks
I reread it recently and was struck by how well the overall taxonomy/framework you put together has aged.
1
1
29
3,048
Getting this to work was a challenge, between human data collection and ml/infra challenges (PPO on 175b param language models). But the underlying method is simple conceptually: train browsing & answering end-to-end with behavior cloning + RL
1
28
Replying to @archit_sharma97
Not necessarily -- success rate always goes up with the number of the attacker's queries. And if you consider a generalized form of jailbreaking, which consists of giving the model a deceptive justification, it seems nearly impossible to prevent a determined enough attacker.
2
1
22
4,390
something that the world needs: better tooling for monorepos that contain many python packages with internal (within-monorepo) dependencies. pip doesn't handle this; bazel is more powerful but is burdensome
1
1
22
Replying to @RichardMCNgo
Agree with some of the other replies that I'd feel put on the spot with these qs. I like to ask "have you enjoyed any good {food, books, music} recently?" Specific but accessible to everyone and easy to answer.
2
20
It seems like Waymo has really scaled up its SF presence recently -- I see their white Jaguars every 5 mins.
2
21
This release is just the first step -- stay tuned for future versions with higher-quality output and broader capabilities. My team has some job openings, apply here: boards.greenhouse.io/openai/…
1
17
Musical training should put more emphasis on teaching chord changes and how to play over them instead of the motor skill of memorizing classical pieces. These skills are usually only taught with jazz, but they're useful for all genres.
2
16
Replying to @rm_rafailov
Over-optimization might've amplified some of the woke tendencies a bit. Also, it seems more like they chose an overly simplistic prompt, because they didn't have a good way to make the model follow a more nuanced policy
1
16
2,246
Replying to @jconorgrogan
true, this is a poorly conceived prompt
3
16
2,412
Replying to @sama @joannejang
+ Jason Wolfe (not sure if on twitter)
1
14
5,499
Replying to @Liv_Boeree
That's a caricature of the situation 3 years ago when you had two main clusters of people thinking about AI's impacts; one concerned with social justice issues, and the other thinking about x-risk and long-term issues. Now that LMs are so practically and commercially useful, a much bigger and broader set of people is working on them, and most aren't affiliated with those two early communities.
1
15
2,856
Replying to @DavidDuvenaud
Great article -- really resonates. I gave a talk about an idea for a (small) mitigation: require AIs to ask humans for permission, and make sure the humans understand what they're approving: piped.video/watch?v=1h47Ds6a…. Related to your suggestion "Regulatory frameworks..."
1
11
1,500
Replying to @jacobmbuckman
Props on the self-critical footnote. There's a lot of folk knowledge about which results are BS, and experienced researchers usually smell it quickly. It'd be nice if there were a venue, or an incentive, to publicize that knowledge
13
Replying to @lacker
Yes, actually that's what I should've said. The true utility function could involve concepts that we have no hope of understanding right now
1
12
712
For sci-fi with interesting concepts, a few I've enjoyed are Permutation City, There Is No Anti-Memetics Division, Rainbow's End
11
653
Tidal volume (inhaled per breath) = .5L. Mask volume = 0.1L (guess). So you'd still get an effective concentration of 25000 PPM / (.5/.1) = 5000 PPM CO2
1
8
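The back-of-the-envelope estimate above can be written out as a short calculation. This sketch uses exactly the post's numbers (0.5 L tidal volume, a guessed 0.1 L mask volume, 25000 ppm inside the mask) and, like the post, ignores the CO2 already present in ambient air; the function name `effective_co2_ppm` is made up for illustration.

```python
def effective_co2_ppm(mask_ppm, tidal_volume_l, mask_volume_l):
    """Dilute the CO2 held in the mask's volume across one full breath.

    Follows the post's formula: mask_ppm / (tidal_volume / mask_volume).
    Ambient CO2 in the fresh air drawn in is ignored, as in the post.
    """
    dilution_factor = tidal_volume_l / mask_volume_l
    return mask_ppm / dilution_factor

estimate = effective_co2_ppm(mask_ppm=25000, tidal_volume_l=0.5, mask_volume_l=0.1)
print(round(estimate))
```

The result scales linearly with the guessed mask volume, so halving that guess to 0.05 L would halve the estimate.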
- slippery slopes are real, but "AI shall not autonomously make any HTTP requests" was never going to be a hard-and-fast rule - we're only making GET requests by following links; we're going to be much more careful with POST requests
1
10
Replying to @EpochAIResearch
excellent explanations!
10
749
Replying to @hardmaru
I have this ability but only for attractive women. I've generally had to hide it, because "I caught a glimpse of you at a party a couple years ago" or "I saw your profile on Hinge a few years back" might come off as creepy.
2
9
Replying to @yaringal
I like that paper. The auditing agent isn't "pointwise", so it actually gets around this limitation. It can catch some cipher based attacks by inferring the cipher and querying the trained model with cipher-encoded text. (That said, it has imperfect accuracy, and we haven't tried very hard to attack it, so I wouldn't claim the problem is solved)
1
9
3,153
Replying to @michael_nielsen
A relevant idea from Vitalik: that coordination can be good and bad, so as a mechanism designer, you want to control what sizes of groups are able to coordinate/collude vitalik.eth.limo/general/202…
8
1,246
A couple replies pointed out that the situation isn't as bad as these numbers suggest -- when you inhale, you mostly get fresh air, not the air lingering in the mask. Still the situation seems worrying for KN95.
1
7
Replying to @jkcarlsmith
Great post, and glad that you're tackling the problem of how to design model specs!
8
1,298
Replying to @andrewgwils
looks very relevant! will take a look
1
7
5,058
Beautiful footage, heartwarming parental teamwork piped.video/watch?v=Y04R9ZrS…
1
1
7
Right, there would have to be a two-step approx.: 1. deep learning ≈ Bayesian inference over time-limited programs 2. Bayesian inference over time-lim programs ≈ SI It's still useful to talk about SI because there's a theory showing it's ideal, vs less theory for speed prior
3
7
1,579
This theory explains bell curve meme format slatestarcodex.com/2014/04/2…
1
7
You can of course define a probability distribution over formalizations of x, but often the final probability depends more on your distribution over formalizations than on your actual beliefs about the event in question
1
7
867
I don't think we tested your multiple choice datasets, so I'm not sure we'd catch this particular attack, which is very subtle.
6
2,196
Replying to @danfaggella
pretty clarifying list ... I think good and likely outcomes involve a combination of these -- - governments and model-provider-oligopoly enforce regulations to restrict the agency of AGIs and speed of development (gatekeeper, 1984, enslaved God) - after spectacular progress in science and philosophy, allowing us to understand consciousness, we're in a better position to understand what would be a worthy successor (Descendents) - people who want to live in a more traditional way are empowered to do so through strong property rights. (reversion, libertarian utopia). Other people will just want comfort and happiness and should be provided for (egalitarian utopia)
1
6
1,346
Replying to @shakoistsLog
On the other hand, often someone asks you for p(x), but x is an imprecise sentence that can be interpreted/formalized in multiple ways. See this a lot in discussions around AI, with timelines & p(doom).
1
5
1,362