Barret Zoph · Sep 26, 2024 · 12:10 AM UTC

Barret Zoph

@barret_zoph

26 Sep 2024

I posted this note to OpenAI. Hey everybody, I have decided to leave OpenAI. This was a very difficult decision as I have had such an incredible time at OpenAI. I got to join right before ChatGPT and helped build the post-training team from scratch with John Schulman and others. I feel so grateful to have gotten the opportunity to run the post-training team and help build and scale ChatGPT to where it is today. Right now feels like a natural point for me to explore new opportunities outside of OpenAI. This is a personal decision based on how I want to evolve the next phase of my career. I am very grateful for all the opportunities OpenAI has given me and all the support I have gotten from OpenAI leadership such as Sam and Greg. I am in particular grateful for everything Bob has done and for being an excellent manager and colleague to me over my career at OpenAI. The post-training team has many many talented leaders and is being left in good hands. OpenAI is doing and will continue to do incredible work and I am very optimistic about the future trajectory of the company and will be rooting everybody on.

155

168

3,399

1,089,134

Barret Zoph · Sep 6, 2022 · 3:19 PM UTC

Barret Zoph

@barret_zoph

6 Sep 2022

After 6 years at Google Brain I am excited to announce that I joined OpenAI! Very grateful for all the amazing collaborators and friends I have made at Google over the years. Could not be more excited to continue to help push AI progress and for the new adventures ahead.

1,517

Barret Zoph · Jun 6, 2023 · 2:59 AM UTC

Barret Zoph

@barret_zoph

6 Jun 2023

Our team at OpenAI is hiring! We're looking for engineers/researchers who do rigorous and thoughtful work understanding and evaluating LLMs like ChatGPT. If you're interested, please apply online and DM me with work that you've done!

704

625,006

Barret Zoph · Jan 12, 2021 · 3:20 AM UTC

Barret Zoph

@barret_zoph

12 Jan 2021

Introducing Switch Transformer, a simplified sparse architecture for scaling to trillion parameter language models. Switch Transformers yield 4-7x speedups over strong Transformer T5 models w/ the same computational resources. Paper: arxiv.org/abs/2101.03961

132

646

Barret Zoph · Nov 19, 2019 · 2:47 AM UTC

Barret Zoph

@barret_zoph

19 Nov 2019

*New paper* RandAugment: a new data augmentation. Better & simpler than AutoAugment. Main idea is to select transformations at random, and tune their magnitude. It achieves 85.0% top-1 on ImageNet. Paper: arxiv.org/abs/1909.13719 Code: git.io/Jeopl

143

566
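The main idea in the RandAugment tweet above (pick transformations at random, tune one shared magnitude) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the four toy transforms, the list-of-rows image format, and the names `rand_augment`, `num_ops`, and `magnitude` are all assumptions made for the example.

```python
import random

# Toy transforms on a grayscale image stored as a list of rows of floats in [0, 1].
# Each takes a magnitude in [0, 1]; they are stand-ins for a real op set.
def brighten(img, mag):
    return [[min(1.0, p + 0.5 * mag) for p in row] for row in img]

def darken(img, mag):
    return [[max(0.0, p - 0.5 * mag) for p in row] for row in img]

def roll_right(img, mag):
    k = int(round(mag * (len(img[0]) - 1)))  # circular shift by up to width-1 pixels
    return [(row[-k:] + row[:-k]) if k else list(row) for row in img]

def identity(img, mag):
    return [list(row) for row in img]

OPS = [brighten, darken, roll_right, identity]

def rand_augment(img, num_ops=2, magnitude=0.5, rng=random):
    """Apply `num_ops` uniformly sampled transforms, all at one shared magnitude."""
    for op in rng.choices(OPS, k=num_ops):  # sampled with replacement
        img = op(img, magnitude)
    return img

image = [[0.0, 0.25, 0.5, 0.75], [1.0, 0.5, 0.25, 0.0]]
augmented = rand_augment(image, num_ops=2, magnitude=0.4, rng=random.Random(0))
```

With this framing the whole augmentation policy reduces to two numbers (`num_ops` and `magnitude`), which is what makes it cheap to tune with a plain grid search.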

Barret Zoph · Dec 15, 2020 · 8:08 PM UTC

Barret Zoph

@barret_zoph

15 Dec 2020

Can simply copying and pasting objects from one image to another be used to create more data to improve state-of-the-art instance segmentation? Yes! With Copy&Paste, we achieve 57.3 box AP and 49.1 mask AP on COCO. This is SoTA wrt @paperswithcode arxiv.org/abs/2012.07177

482
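The Copy&Paste idea in the tweet above can be sketched as a mask-based blend: pixels under a source object's mask overwrite the destination image, and existing instance masks lose whatever the pasted object covers. This is a minimal sketch under stated assumptions: images are equally sized lists of rows, masks hold 0/1, and the function name `copy_paste` is hypothetical. Any rescaling or flipping of the images before pasting is left out.

```python
def copy_paste(src_img, src_mask, dst_img, dst_masks):
    """Paste the pixels of one source object onto a destination image.

    Returns the new image and the updated list of instance masks.
    """
    h, w = len(dst_img), len(dst_img[0])
    out_img = [
        [src_img[y][x] if src_mask[y][x] else dst_img[y][x] for x in range(w)]
        for y in range(h)
    ]
    # Existing objects lose the pixels that the pasted object now covers.
    out_masks = [
        [[m[y][x] * (1 - src_mask[y][x]) for x in range(w)] for y in range(h)]
        for m in dst_masks
    ]
    out_masks.append([list(row) for row in src_mask])  # the pasted instance
    return out_img, out_masks

new_img, new_masks = copy_paste(
    src_img=[[9, 9], [9, 9]],
    src_mask=[[1, 0], [0, 0]],
    dst_img=[[1, 2], [3, 4]],
    dst_masks=[[[1, 1], [0, 0]]],
)
```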

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

Revisiting ResNets: Improved Training and Scaling Strategies. Our recent work that applies modern training and scaling techniques to the 2015 ResNet. We find ResNets outperform some recent state-of-the-art architectures. ResNets are remarkably durable! arxiv.org/abs/2103.07579

343

Barret Zoph · Sep 1, 2021 · 3:17 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

How do we combine knowledge from multiple labeled and unlabeled datasets to train a great general model? Multi-Task Self-Training (MuST) trains specialized teachers on labeled data, which then label unlabeled data to train a single general model. arxiv.org/abs/2108.11353

336

Barret Zoph · Nov 22, 2023 · 11:40 PM UTC

Barret Zoph

@barret_zoph

22 Nov 2023

What an incredible company OpenAI is to work at. I have never seen so many people so committed to the mission of the company and band together when things go wrong. Huge props to the leadership team for navigating these incredibly difficult times.

301

108,100

Barret Zoph · Jul 15, 2025 · 9:10 PM UTC

Barret Zoph

@barret_zoph

15 Jul 2025

Super excited to be part of this incredible team and company. Please reach out if you are interested in joining!

Mira Murati

@miramurati

15 Jul 2025

Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're excited that in the next couple months we’ll be able to share our first product, which will include a significant open source component and be useful for researchers and startups developing custom models. Soon, we’ll also share our best science to help the research community better understand frontier AI systems. To accelerate our progress, we’re happy to confirm that we’ve raised $2B led by a16z with participation from NVIDIA, Accel, ServiceNow, CISCO, AMD, Jane Street and more who share our mission. We’re always looking for extraordinary talent that learns by doing, turning research into useful things. We believe AI should serve as an extension of individual agency and, in the spirit of freedom, be distributed as widely and equitably as possible. We hope this vision resonates with those who share our commitment to advancing the field. If so, join us. thinkingmachines.paperform.c…

274

47,438

Barret Zoph · Dec 5, 2022 · 7:55 AM UTC

Barret Zoph

@barret_zoph

5 Dec 2022

What a fun first few months at OpenAI it's been :)

Sam Altman

@sama

5 Dec 2022

ChatGPT launched on wednesday. today it crossed 1 million users!

269

Barret Zoph · Sep 7, 2022 · 2:58 PM UTC

Barret Zoph

@barret_zoph

7 Sep 2022

Want to learn more about how sparse expert models (e.g. MoEs, Switch Transformers, Hash Layers) work and their recent research advancements? Check out our recent review paper arxiv.org/abs/2209.01667

256

Barret Zoph · Sep 10, 2025 · 6:25 PM UTC

Barret Zoph

@barret_zoph

10 Sep 2025

Excited to share our first blog post -- one of many to follow!

Thinking Machines

@thinkymachines

10 Sep 2025

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/def…

260

34,964

Barret Zoph · Oct 1, 2025 · 6:12 PM UTC

Barret Zoph

@barret_zoph

1 Oct 2025

Excited to release Tinker and see what the community uses it for.

Thinking Machines

@thinkymachines

1 Oct 2025

Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker

198

38,933

Barret Zoph · Apr 2, 2022 · 4:26 PM UTC

Barret Zoph

@barret_zoph

2 Apr 2022

Really enjoyed the Instruct-GPT paper. Impressed by the results: 100x smaller models w/ same quality by updating models on the data distribution you care about. Data is often overlooked & such a powerful tool -- smaller models for the same quality, which saves a lot at inference.

185

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Lots of great work coming out on LLMs generating + understanding code (Codex, Scratch Pad, MBPP/MathQA, etc...). The AlphaCode paper by DeepMind is quite impressive --- ranking ~50th percentile in competitive programming competitions w/ 5000+ participants. A 🧵below:

174

Barret Zoph · Sep 24, 2024 · 7:37 PM UTC

Barret Zoph

@barret_zoph

24 Sep 2024

Super excited this is rolling out! Real time speech to speech will be a powerful feature -- I am very bullish on multi-modal being a core component of AI products. This was a great collaboration with post-training (h/t to @kirillov_a_n & @shuchaobi + team on post-training) and other teams across OpenAI to make this happen.

OpenAI

@OpenAI

24 Sep 2024

Advanced Voice is rolling out to all Plus and Team users in the ChatGPT app over the course of the week. While you’ve been patiently waiting, we’ve added Custom Instructions, Memory, five new voices, and improved accents. It can also say “Sorry I’m late” in over 50 languages.

159

40,911

Barret Zoph · Feb 21, 2022 · 5:02 PM UTC

Barret Zoph

@barret_zoph

21 Feb 2022

Interested in using sparse expert models, but find they are unstable, hard to design or don’t fine-tune well? We address these key issues and train a 269B param MoE model (w/ FLOPs of a 32B dense model) that improves SOTA on NLP benchmarks like SuperGLUE. arxiv.org/abs/2202.08906

158

Barret Zoph · Sep 11, 2025 · 9:01 PM UTC

Barret Zoph

@barret_zoph

11 Sep 2025

Excited to be supporting this, please reach out if you are interested

Woosuk Kwon

@woosuk_k

11 Sep 2025

At Thinking Machines, our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested, please DM me or @barret_zoph! Here are some example roles / projects: * Distributed inference engineer to support large-scale models on Blackwell GPUs * PyTorch & model optimization engineer to support & optimize latest OSS models * MLSys generalist for various aspects of vLLM

136

30,257

Barret Zoph · Sep 30, 2025 · 4:48 PM UTC

Barret Zoph

@barret_zoph

30 Sep 2025

Exciting mission with a great team! With the progress of AI, now is the right time to start approaching these problems!

Liam Fedus

@LiamFedus

30 Sep 2025

Today, @ekindogus and I are excited to introduce @periodiclabs. Our goal is to create an AI scientist. Science works by conjecturing how the world might be, running experiments, and learning from the results. Intelligence is necessary, but not sufficient. New knowledge is created when ideas are found to be consistent with reality. And so, at Periodic, we are building AI scientists and the autonomous laboratories for them to operate. Until now, scientific AI advances have come from models trained on the internet. But despite its vastness — it’s still finite (estimates are ~10T text tokens where one English word may be 1-2 tokens). And in recent years the best frontier AI models have fully exhausted it. Researchers seek better use of this data, but as any scientist knows: though re-reading a textbook may give new insights, they eventually need to try their idea to see if it holds. Autonomous labs are central to our strategy. They provide huge amounts of high-quality data (each experiment can produce GBs of data!) that exists nowhere else. They generate valuable negative results which are seldom published. But most importantly, they give our AI scientists the tools to act. We’re starting in the physical sciences. Technological progress is limited by our ability to design the physical world. We’re starting here because experiments have high signal-to-noise and are (relatively) fast, physical simulations effectively model many systems, but more broadly, physics is a verifiable environment. AI has progressed fastest in domains with data and verifiable results - for example, in math and code. Here, nature is the RL environment. One of our goals is to discover superconductors that work at higher temperatures than today's materials. Significant advances could help us create next-generation transportation and build power grids with minimal losses. 
But this is just one example — if we can automate materials design, we have the potential to accelerate Moore’s Law, space travel, and nuclear fusion. We’re also working to deploy our solutions with industry. As an example, we're helping a semiconductor manufacturer that is facing issues with heat dissipation on their chips. We’re training custom agents for their engineers and researchers to make sense of their experimental data in order to iterate faster. Our founding team co-created ChatGPT, DeepMind’s GNoME, OpenAI’s Operator (now Agent), the neural attention mechanism, MatterGen; have scaled autonomous physics labs; and have contributed to some of the most important materials discoveries of the last decade. We’ve come together to scale up and reimagine how science is done. We’re fortunate to be backed by investors who share our vision, including @a16z who led our $300M round, as well as @Felicis, DST Global, NVentures (NVIDIA’s venture capital arm), @Accel and individuals including @JeffBezos , @eladgil , @ericschmidt, and @JeffDean. Their support will help us grow our team, scale our labs, and develop the first generation of AI scientists.

128

27,155

Barret Zoph · Dec 7, 2021 · 6:41 PM UTC

Barret Zoph

@barret_zoph

7 Dec 2021

Our new sparse model (SS-MoE) achieved SOTA on SuperGLUE (super.gluebenchmark.com/lead…)! Excited to see sparsity pushing state-of-the-art! This new work builds heavily on our prior work on Switch Transformer: arxiv.org/abs/2101.03961 Paper and more details to come soon!

SuperGLUE Benchmark

SuperGLUE is a new benchmark styled after original GLUE benchmark with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard.

super.gluebenchmark.com

112

Barret Zoph · Nov 20, 2023 · 1:16 PM UTC

Barret Zoph

@barret_zoph

20 Nov 2023

❤️

Ilya Sutskever

@ilyasut

20 Nov 2023

I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.

101

17,589

Barret Zoph · Jul 14, 2020 · 4:16 PM UTC

Barret Zoph

@barret_zoph

14 Jul 2020

Models and checkpoints are now open sourced for my recent work: "Rethinking Pre-training and Self-training". Paper link: arxiv.org/abs/2006.06882 Code Link: bit.ly/3j5sVAn. On COCO we achieve 54.3 AP and on Pascal Segmentation 90.5 mIOU!

109

Barret Zoph · Oct 27, 2025 · 5:10 PM UTC

Barret Zoph

@barret_zoph

27 Oct 2025

Great post on on-policy distillation that people should check out!

Thinking Machines

@thinkymachines

27 Oct 2025

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…

118

50,661

Barret Zoph · Dec 7, 2022 · 8:07 AM UTC

Barret Zoph

@barret_zoph

7 Dec 2022

Intersecting cutting edge AI research w/ products is an incredibly exciting area to work on. Products are the ultimate test set :)

Barret Zoph · Jan 22, 2021 · 11:38 PM UTC

Barret Zoph

@barret_zoph

22 Jan 2021

Great video summary of some of my recent work! Thanks @ykilcher!

Yannic Kilcher 🇸🇨

@ykilcher

22 Jan 2021

A bit late to the party, but 💃NEW VIDEO🕺 on Switch Transformers by @GoogleAI. Hard Routing, selective dropout, mixed precision & more to achieve a 🔥ONE TRILLION parameters🔥 language model. Watch to learn how it's done🧙💪 piped.video/iAR8LkkMMIM @LiamFedus @barret_zoph

Barret Zoph · Dec 13, 2020 · 3:34 AM UTC

Barret Zoph

@barret_zoph

13 Dec 2020

Super interesting work! Excited to see the future of attention models in computer vision.

Lucas Beyer (bl16)

@giffmana

4 Dec 2020

If you haven't read our latest ImageNet SOTA work "Vision Transformers (ViT)" yet, shame on you. But! There's hope! Here's the corresponding blogpost which is a nice tl;dr: ai.googleblog.com/2020/12/tr…

Barret Zoph · Jun 6, 2023 · 2:59 AM UTC

Barret Zoph

@barret_zoph

6 Jun 2023

We are looking for people to understand, improve and combine a variety of evaluation signals (e.g. automated and human), build eval infra (e.g. visualizations, testing) and do ML research on better eval methods.

18,308

Barret Zoph · Feb 14, 2024 · 7:44 AM UTC

Barret Zoph

@barret_zoph

14 Feb 2024

Pleasure working with you -- learned quite a lot! Excited for what you do next.

Andrej Karpathy

@karpathy

14 Feb 2024

Hi everyone yes, I left OpenAI yesterday. First of all nothing "happened" and it’s not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been really great - the team is really strong, the people are wonderful, and the roadmap is very exciting, and I think we all have a lot to look forward to. My immediate plan is to work on my personal projects and see what happens. Those of you who’ve followed me for a while may have a sense for what that might look like ;) Cheers

22,573

Barret Zoph · Apr 15, 2022 · 11:09 PM UTC

Barret Zoph

@barret_zoph

15 Apr 2022

Replying to @jacobandreas @jacobaustin132 @_jasonwei

Yes I have also found this for math. If you append "I am a math tutor" it starts to answer with higher accuracy.

Barret Zoph · Apr 13, 2022 · 11:30 PM UTC

Barret Zoph

@barret_zoph

13 Apr 2022

Yes --- I think spending more time thinking about what to work on vs actually working on the thing is hugely important

Jason Wei

@_jasonwei

13 Apr 2022

The best meta- advice I've gotten is from @barret_zoph. It took me a year to begin to understand it. It went something like: Notice that many researchers work hard. Yet some are far more successful. This means the project you choose defines the upper-bound for your success.

Barret Zoph · Dec 1, 2019 · 6:18 PM UTC

Barret Zoph

@barret_zoph

1 Dec 2019

Slides and video of my talk at the Neural Architects workshop at ICCV this year! neuralarchitects.org/

Barret Zoph · Jun 6, 2023 · 2:59 AM UTC

Barret Zoph

@barret_zoph

6 Jun 2023

Research engineer role: openai.com/careers/research-… Research scientist role: openai.com/careers/research-…

Research Engineer

Research · San Francisco · FullTime

openai.com

21,596

Barret Zoph · Jun 12, 2022 · 8:53 AM UTC

Barret Zoph

@barret_zoph

12 Jun 2022

Exciting to see sparse MoE models being 10x more calibrated than their dense LM counterparts. Better model calibration is a key research direction for better understanding what models do vs don't know.

Jascha Sohl-Dickstein

@jaschasd

10 Jun 2022

Replying to @jaschasd

Overall, sparse models perform as well as dense models which use ~2x more inference cost, but they are as well calibrated as dense models using ~10x more inference compute.

Barret Zoph · Feb 24, 2020 · 9:00 PM UTC

Barret Zoph

@barret_zoph

24 Feb 2020

My talk at the 2019 ICCV Neural Architects workshop is available online! piped.video/watch?v=O5Rrv6Bv…

Barret Zoph · Feb 19, 2021 · 9:13 PM UTC

Barret Zoph

@barret_zoph

19 Feb 2021

Nice work from @IrwanBello on his paper “LambdaNetworks: Modeling Long-Range Interactions without Attention” An interesting scalable alternative to self-attention with strong empirical results in computer vision! Link: arxiv.org/abs/2102.08602

Barret Zoph · Mar 24, 2021 · 7:59 PM UTC

Barret Zoph

@barret_zoph

24 Mar 2021

Code + checkpoints for the ResNet-RS paper are available!

Irwan Bello

@IrwanBello

24 Mar 2021

Training code and checkpoints here! github.com/tensorflow/tpu/tr…

Barret Zoph · May 5, 2021 · 3:17 AM UTC

Barret Zoph

@barret_zoph

5 May 2021

Great blogpost on our recent ResNet-RS work!

Aman Arora @amaarora

4 May 2021

Super excited to present my latest blog post on ResNet-RS - "Revisiting ResNets: Improved Training and Scaling Strategies". bit.ly/2QT3yIU I also share code implementation in PyTorch using TIMM & more! 1/3

Barret Zoph · Apr 16, 2022 · 7:16 PM UTC

Barret Zoph

@barret_zoph

16 Apr 2022

Yes +1. I remember studying parts of the Feynman lectures which showed me how much more clear my thought process could be. When reading his description of simple algebra and complex numbers I thought "wow I really am not thinking clearly enough": feynmanlectures.caltech.edu/…

Andrej Karpathy

@karpathy

16 Apr 2022

Looking back, my most valuable college classes were physics, but for general problem solving intuitions alone: - modeling systems with increasingly more complex terms - extrapolating variables to check behaviors at limits - pursuit of the simplest most powerful solutions ...

Barret Zoph · Jun 6, 2023 · 3:00 AM UTC

Barret Zoph

@barret_zoph

6 Jun 2023

Come work w/ @hwchung27 and @_jasonwei on this!

16,713

Barret Zoph · Feb 28, 2022 · 5:12 PM UTC

Barret Zoph

@barret_zoph

28 Feb 2022

I really like the "tcolorbox" package in LaTeX for research papers. It is a great feature for having nice looking summaries for sections or putting theorems. I enjoyed using it throughout my most recent work!

Barret Zoph · May 6, 2022 · 8:28 PM UTC

Barret Zoph

@barret_zoph

6 May 2022

AI progress has continually exceeded my expectations since I first started working in the space in 2015. The saying that people overestimate what they can do in a short amount of time and underestimate what can be achieved in longer periods of time definitely resonates w/ me.

Roman Ring @Inoryy

6 May 2022

10 yrs ago @karpathy wrote a blog post on the outlook of AI: karpathy.github.io/2012/10/2… in which he describes how difficult it would be for an AI to understand a given photo, concluding "we are very, very far and this depresses me." Today, our Flamingo steps up to the challenge.

Barret Zoph · Jun 15, 2022 · 12:36 AM UTC

Barret Zoph

@barret_zoph

15 Jun 2022

Very excited to be able to release these sparse checkpoints to the research community!

Liam Fedus

@LiamFedus

14 Jun 2022

Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced! github.com/google-research/t… All thanks to the efforts of James Lee-Thorp, @ada_rob, and @hwchung27

Barret Zoph · Jun 10, 2022 · 5:46 PM UTC

Barret Zoph

@barret_zoph

10 Jun 2022

It was a pleasure to be part of this effort! Very bullish on the impact this will have for the future of LLMs. Also very impressed with the leadership for this project --- coordinating all of this to happen is nothing short of incredible!

Jascha Sohl-Dickstein

@jaschasd

10 Jun 2022

After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the github.com/google/BIG-bench paper is now live: arxiv.org/abs/2206.04615. BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.

Barret Zoph · Nov 22, 2019 · 9:12 PM UTC

Barret Zoph

@barret_zoph

22 Nov 2019

This is a great description of RandAugment! Thanks so much.

Connor Shorten

@CShorten30

20 Nov 2019

This video explains the new RandAugment AutoML Data Augmentation algorithm from @GoogleAI, improving on previous techniques (AutoAugment/PBA) on ImageNet and dramatically reducing the search space, making AutoML for Data Aug much easier! piped.video/Zzt9i3gDueE #100DaysOfMLCode

Barret Zoph · Apr 4, 2022 · 4:34 PM UTC

Barret Zoph

@barret_zoph

4 Apr 2022

Enjoyed The Pile dataset paper -- very thorough! Data is often overlooked and given the amount of money/time that goes into training these language models, this aspect should be taken seriously. arxiv.org/abs/2101.00027

Barret Zoph · Jan 12, 2021 · 3:20 AM UTC

Barret Zoph

@barret_zoph

12 Jan 2021

Switch Transformers introduce sparsity by sending different tokens to different weights. We simplify MoE models by routing to the top expert only, which saves computation + communication costs. We also introduce training techniques for training huge models in lower precision!
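The top-expert routing described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: tokens are plain lists of floats, the router is one weight vector per expert, the experts are arbitrary callables, and `switch_layer` is a hypothetical name. Capacity limits and any load-balancing terms are left out.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(tokens, router_weights, experts):
    """Route each token to its single highest-probability expert.

    tokens:          list of token vectors (lists of floats)
    router_weights:  one weight vector per expert, same length as a token
    experts:         list of callables, each mapping a token vector to a vector
    """
    outputs, assignments = [], []
    for tok in tokens:
        logits = [sum(w * x for w, x in zip(wvec, tok)) for wvec in router_weights]
        probs = softmax(logits)
        best = max(range(len(experts)), key=probs.__getitem__)  # top-1 expert
        # Scale the chosen expert's output by its gate probability, so the
        # router would still receive a training signal in a differentiable setup.
        outputs.append([probs[best] * y for y in experts[best](tok)])
        assignments.append(best)
    return outputs, assignments

experts = [lambda t: [2 * x for x in t], lambda t: [x + 1 for x in t]]
outputs, assignments = switch_layer(
    tokens=[[1.0, 0.0], [0.0, 1.0]],
    router_weights=[[5.0, 0.0], [0.0, 5.0]],
    experts=experts,
)
```

Because only one expert runs per token, the per-token compute is that of a single feed-forward block plus the small router product, however many experts exist.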

Barret Zoph · Sep 7, 2021 · 11:06 PM UTC

Barret Zoph

@barret_zoph

7 Sep 2021

Nice paper showing the power of simple scaling and training methods for video recognition! Follows the line of "RS" research I have done with some of these collaborators for Image Classification (arxiv.org/abs/2103.07579) and Object Detection (arxiv.org/abs/2107.00057).

Irwan Bello

@IrwanBello

7 Sep 2021

Wondering how simple 3D-ResNets perform on video recognition given all the recent architecture craze? In Revisiting 3D ResNets for Video Recognition, we study the impact of improved training and scaling methods on 3D ResNets. arxiv.org/abs/2109.01696

Barret Zoph · Sep 1, 2021 · 3:18 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

In prior work, we showed generating labels from a teacher model can be more flexible than pre-training. arxiv.org/abs/2006.06882 MuST is a natural extension where now we generate labels from multiple different teachers on various tasks to learn a general pre-trained model.

Barret Zoph · Apr 21, 2022 · 11:23 AM UTC

Barret Zoph

@barret_zoph

21 Apr 2022

Really fun chatting! Thanks for having us on.

Yannic Kilcher 🇸🇨

@ykilcher

21 Apr 2022

New interview with Barret Zoph (@barret_zoph) and William Fedus (@LiamFedus) of Google Brain on Sparse Expert Models. We talk about Switch Transformers, GLAM, information routing, distributed systems, and how to scale to TRILLIONS of parameters. Watch now: piped.video/ccBMRryxGog

Barret Zoph · Jun 1, 2022 · 6:26 PM UTC

Barret Zoph

@barret_zoph

1 Jun 2022

To find these interesting prompts, should we be looking at the pre-training data? Is "step by step" mentioned the most frequently in documents when an explanation comes next? Automatic prompt discovery from inspecting the pre-training data feels promising.

Jason Wei

@_jasonwei

25 May 2022

Big language models can generate their own chain of thought, even without few-shot exemplars. Just add "Let's think step by step". Look me in the eye and tell me you don't like big language models. arxiv.org/abs/2205.11916

Barret Zoph · Jan 13, 2021 · 10:52 PM UTC

Barret Zoph

@barret_zoph

13 Jan 2021

Wow that is a very strong imagenet result! Cool to see further progress being made in semi-supervised methods for computer vision!

Quoc Le

@quocleix

13 Jan 2021

Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-) This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2. More details here: arxiv.org/abs/2003.10580

Barret Zoph · Jan 12, 2021 · 3:20 AM UTC

Barret Zoph

@barret_zoph

12 Jan 2021

Switch Transformers are also found to be strong multi-task learners. On multilingual language modeling (mT5) we outperform T5 models across 101 languages w/ a 5x speedup.

Barret Zoph · Jun 28, 2020 · 6:31 PM UTC

Barret Zoph

@barret_zoph

28 Jun 2020

Thanks for the nice article on our recent work!

Aakash Kumar Nain

@A_K_Nain

28 Jun 2020

As promised, here is my new blogpost explaining the latest research from Google Research and Brain team. I liked this paper a lot because instead of building models with billions of params, it focuses on fundamental aspects. medium.com/@nainaakash012/re…

Barret Zoph · Jan 12, 2021 · 3:20 AM UTC

Barret Zoph

@barret_zoph

12 Jan 2021

We find we can distill some of the performance improvements from our sparse Switch Transformers into dense variants (w/ the same FLOPs per token)

Barret Zoph · Apr 2, 2022 · 4:26 PM UTC

Barret Zoph

@barret_zoph

2 Apr 2022

I would be surprised if a modeling improvement could yield a 10x smaller model for a fixed quality. For data this is not the case and often the opposite feeling --- surprising if you couldn't reduce model size by 10x.

Barret Zoph · Oct 15, 2021 · 7:32 PM UTC

Barret Zoph

@barret_zoph

15 Oct 2021

Excited to be giving it! Thanks for the invite.

KUIS AI @KuisAICenter

14 Oct 2021

📢 Next Wed at 5 pm, we’ll have (@barret_zoph) from Google Brain who will talk about the use of sparsity for large Transformer models: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" zoom info: ai-info@ku.edu.tr or just DM!

Barret Zoph · Jan 14, 2022 · 3:39 AM UTC

Barret Zoph

@barret_zoph

14 Jan 2022

Very useful LaTeX trick!

elvis

@omarsar0

13 Jan 2022

Nice and beautiful examples of how to produce annotated equations using LaTeX. 🤯 github.com/synercys/annotate…

Barret Zoph · Apr 20, 2022 · 3:13 PM UTC

Barret Zoph

@barret_zoph

20 Apr 2022

Thanks @jeremiecharris for having me on your podcast! Super fun chatting about mixture-of-expert models and how they fit into the current large language model landscape. Podcast: bit.ly/3vpsCr2

Barret Zoph · Sep 7, 2022 · 2:58 PM UTC

Barret Zoph

@barret_zoph

7 Sep 2022

Sparse expert models are becoming increasingly relevant as they are now being used across many domains (NLP, speech, vision, multi-modality) w/ very strong results. Right now sparse expert models hold SOTA on various benchmarks (e.g. ST-MoE on SuperGLUE, ANLI, ARC, etc…)

Barret Zoph · Jan 12, 2021 · 3:20 AM UTC

Barret Zoph

@barret_zoph

12 Jan 2021

How do Switch Transformers scale? Keeping the floating point operations per token fixed, increasing the number of sparse parameters by adding more experts significantly improves performance
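The scaling claim above (more experts add parameters while per-token compute stays roughly fixed) comes down to simple counting. The sketch below is back-of-the-envelope arithmetic for a single feed-forward layer; the dimensions 768 and 3072 and the function names are illustrative assumptions, and biases are ignored.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (d_model x d_ff and d_ff x d_model); biases ignored.
    return 2 * d_model * d_ff

def switch_ffn_stats(d_model, d_ff, num_experts):
    per_expert = ffn_params(d_model, d_ff)
    return {
        "total_params": num_experts * per_expert,  # grows with expert count
        "active_params_per_token": per_expert,     # one expert runs per token
        "router_params": d_model * num_experts,    # small routing matrix
    }

for n in (1, 8, 64):
    print(n, switch_ffn_stats(d_model=768, d_ff=3072, num_experts=n))
```

The only per-token cost that grows with the expert count here is the router product, which is small next to the expert itself.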

Barret Zoph · Apr 2, 2022 · 12:45 AM UTC

Barret Zoph

@barret_zoph

2 Apr 2022

Yes this is a very important principle to keep in mind --- even when doing a single research project. It's often hard to find the right experimentation scale such that the "smaller" scale ideas have a higher probability of working at a "larger scale".

Andrej Karpathy

@karpathy

1 Apr 2022

Just making sure everyone read “The Bitter Lesson”, as it is one of the best compact pieces of insight into nature of progress in AI. Good habit to keep checking ideas on whether they pass the bitter lesson gut check incompleteideas.net/IncIdeas…

Barret Zoph · Jun 18, 2020 · 8:23 PM UTC

Barret Zoph

@barret_zoph

18 Jun 2020

Fantastic video on some of our recent work! Really great job @CShorten30.

Connor Shorten

@CShorten30

18 Jun 2020

"Rethinking Pre-training and Self-Training" from researchers @GoogleAI shows we get better results from self-training than either supervised or self-supervised pre-training. Demonstrated on Object Detection and Semantic Segmentation! piped.video/QSjMLGA7e2o #100DaysOfMLCode

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

We highlight the importance of disentangling the training methods and architectural components when making comparisons across architectures

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

The modern training techniques (data augmentation, label smoothing, etc…) lead to strong representations that rival sota self-supervised learning methods (e.g. SimCLR) on a bunch of vision tasks

Barret Zoph · Dec 15, 2020 · 8:08 PM UTC

Barret Zoph

@barret_zoph

15 Dec 2020

Copy-Paste greatly improves data efficiency (even on top of a strong augmentation baseline of aggressive scale jittering!) Data efficiency is critical for instance segmentation as it's much more expensive compared to object detection and image classification.

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

We study scaling strategies for vision models and observe that the best scaling strategy heavily depends on the training setup. When overfitting can occur (e.g. 350 epochs on ImageNet) scaling depth is best. In settings with larger datasets/fewer epochs width scaling is preferred.

Barret Zoph · Aug 6, 2023 · 6:34 PM UTC

Barret Zoph

@barret_zoph

6 Aug 2023

Replying to @giffmana

The T5 paper did something very similar right? Do the normal warmup, decay by 1/sqrt(step), then linearly decay by last 10% of training.

2,468
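The schedule described in the reply above (warmup, decay by 1/sqrt(step), then a linear decay over the last 10% of training) can be written as one piecewise function. This is a sketch under stated assumptions: "normal warmup" is read as a linear ramp, and `peak_lr`, `warmup_steps`, and `final_frac` are hypothetical parameter names with illustrative defaults, not values taken from any paper.

```python
import math

def lr_schedule(step, total_steps, peak_lr=0.01, warmup_steps=10_000, final_frac=0.1):
    """Linear warmup, inverse-square-root decay, then a linear ramp to zero.

    Assumes 0 < warmup_steps < decay_start < total_steps.
    """
    decay_start = total_steps - round(total_steps * final_frac)

    def rsqrt(s):
        # Equals peak_lr at the end of warmup, then falls off as 1/sqrt(step).
        return peak_lr * math.sqrt(warmup_steps / s)

    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    if step < decay_start:
        return rsqrt(step)
    # Final phase: fall linearly from the rsqrt value at decay_start to 0.
    remaining = (total_steps - step) / (total_steps - decay_start)
    return rsqrt(decay_start) * max(0.0, remaining)

# Small example: 100 total steps, 10 warmup steps, peak learning rate 1.0.
curve = [lr_schedule(s, total_steps=100, peak_lr=1.0, warmup_steps=10) for s in range(101)]
```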

Barret Zoph · Sep 30, 2021 · 8:55 PM UTC

Barret Zoph

@barret_zoph

30 Sep 2021

Happy to see our work on ResNet-RS made it to NeurIPS!

Irwan Bello

@IrwanBello

29 Sep 2021

To appear #NeurIPS2021 as a spotlight - congrats team

Barret Zoph · Dec 15, 2020 · 8:08 PM UTC

Barret Zoph

@barret_zoph

15 Dec 2020

LVIS dataset was created to make progress on long-tail visual recognition. We outperform the ECCV 2020 challenge winner on LVIS by +3.6 mask AP on rare objects (and our baseline by +6.1 AP)

Barret Zoph · Sep 1, 2021 · 3:17 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

Example of MuST:

Step 1: Train three models: NYU Depth, COCO Detection, Pascal Segmentation

Step 2: Generate pseudo labels for depth estimation, detection and segmentation on all labeled / unlabeled images

Step 3: Train new model on the combined human + pseudo labeled images
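The three steps can be sketched in a few lines of Python. This is a toy illustration only: the teachers are stand-in functions rather than trained models, the names `combine_human_and_pseudo_labels`, `teachers`, and `dataset` are hypothetical, and keeping a human label whenever one exists is an assumption about how the "human + pseudo" combination is done.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, object]


def combine_human_and_pseudo_labels(
    images: List[str],
    human_labels: List[Labels],
    teachers: Dict[str, Callable[[str], object]],
) -> List[Tuple[str, Labels]]:
    """Step 2: label every image for every task, filling gaps with teacher output."""
    combined = []
    for image, labels in zip(images, human_labels):
        merged: Labels = {}
        for task, teacher in teachers.items():
            # Assumption: keep a human label when one exists, else pseudo-label.
            merged[task] = labels[task] if task in labels else teacher(image)
        combined.append((image, merged))
    return combined


# Step 1 stand-ins: toy "teachers" in place of trained depth/detection/segmentation models.
teachers = {
    "depth": lambda image: f"pseudo-depth({image})",
    "detection": lambda image: f"pseudo-boxes({image})",
    "segmentation": lambda image: f"pseudo-mask({image})",
}
images = ["img_a", "img_b"]
human_labels = [{"detection": "human-boxes(img_a)"}, {}]  # img_b is unlabeled

# Step 3 would train one multi-task student on this combined dataset.
dataset = combine_human_and_pseudo_labels(images, human_labels, teachers)
```

After this runs, every image carries a label for all three tasks, which is what lets a single student model be trained on all of them at once.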

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Surprising to see how performance scales smoothly when the model goes from generating 1 solution all the way up to 1M solutions

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Exciting to see more encoder-decoder models (e.g. T5, T0, Switch Transformer, ST-MoE). Liked the dual loss pre-training strategy: use MLM on the encoder and a simple autoregressive LM on the decoder.

Barret Zoph · Apr 14, 2022 · 2:02 AM UTC

Barret Zoph

@barret_zoph

14 Apr 2022

Super excited to see the co-evolution of game design with these types of models. Open world games that could automatically generate new environments based on what the player has enjoyed so far would be so cool --- I often felt games got stale due to a lack of new environments.

Greg Brockman

@gdb

14 Apr 2022

DALL-E 2 applied to generating assets for game development:

Barret Zoph · Nov 23, 2022 · 11:33 PM UTC

Barret Zoph

@barret_zoph

23 Nov 2022

Awesome startup w/ awesome founders! Excited to see the future of the AI x Legal space. (Disclosure: I invested)

Barret Zoph · Aug 14, 2021 · 3:38 AM UTC

Barret Zoph

@barret_zoph

14 Aug 2021

Looking forward to giving this talk!

F. Güney @ftm_guney

12 Aug 2021

great talks lining up in September @KuisAICenter including @DeqingSun @jponttuset @barret_zoph, looking forward to all of them!

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Surprised the 41B model was only better than the 9B model once it could generate 1k+ samples. Wonder how results for different model sizes change as a function of the pre-training and fine-tuning dataset size.

Barret Zoph · Apr 4, 2022 · 11:45 PM UTC

Barret Zoph

@barret_zoph

4 Apr 2022

Impressive results w/ the continued scale of large LMs. On certain tasks there were large discontinuous performance improvements not predicted by scaling curves. Great leadership / coordination on this project to make it happen --- nice work team!

Google AI

@GoogleAI

4 Apr 2022

Introducing the 540 billion parameter Pathways Language Model. Trained on two Cloud #TPU v4 pods, it achieves state-of-the-art performance on benchmarks and shows exciting capabilities like mathematical reasoning, code writing, and even explaining jokes. goo.gle/3j6eMnK

Barret Zoph · Sep 1, 2021 · 3:17 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

When using only ImageNet images, MuST significantly outperforms both supervised and self-supervised representations across many tasks.

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

Hope these revamped ResNets can serve as baselines for future architectural and training method comparisons!

Barret Zoph · Jan 15, 2022 · 7:45 PM UTC

Barret Zoph

@barret_zoph

15 Jan 2022

Nice summary of a lot of the great work done by Google Research in the past year.

Jeff Dean

@JeffDean

11 Jan 2022

As in past years, I've spent part of the holiday break summarizing much of the work we've done in @GoogleResearch over the last year. On behalf of @Google's research community, I'm delighted to share this writeup (this year grouped into five themes). ai.googleblog.com/2022/01/go…

Barret Zoph · Sep 1, 2021 · 3:17 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

We observed that adding more pseudo labels to each image leads to better representations! So don’t just use classification and depth estimation labels, include segmentation and others too.

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Exciting research ahead on not requiring huge amounts of generated samples -- seems this should be possible. Many applications of LLMs require generating lots of samples and even using discriminator models to further filter generated outputs (e.g. Lamda, OpenAI Verifiers).

Barret Zoph · Sep 1, 2021 · 3:18 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

What if I already trained my checkpoint? No problem! You can simply continue training your checkpoint with MuST for a few iterations and observe improvements! Results combining MuST with an ALIGN checkpoint.

Barret Zoph · Jun 18, 2022 · 7:33 PM UTC

Barret Zoph

@barret_zoph

18 Jun 2022

This really hits home --- the amount of hand-holding for experiments and models can be quite frustrating. You would think that this area would have more progress given these are the issues people training the models are having :)

rohan anil

@_arohan_

18 Jun 2022

The AGI I want is one that realizes I made a dumb mistake with batch size which makes it OOM on a supercomputer and tries a smaller one for me - while I am sleeping so I don’t have to babysit the models and increases the throughput in experimentation!

Barret Zoph · May 15, 2024 · 4:06 AM UTC

Barret Zoph

@barret_zoph

15 May 2024

Replying to @JeffDean @miramurati @markchen90

Thanks @JeffDean!

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Interesting how the validation loss isn't correlated with the solve rate. Other tasks like dialogue (e.g. Lamda) seem to correlate much better with human evals. Probably due to the one-to-many nature of coding tasks relative to dialogue, as the authors point out.

Barret Zoph · Apr 4, 2022 · 4:34 PM UTC

Barret Zoph

@barret_zoph

4 Apr 2022

Wouldn't be surprised if some of the most impactful papers in the language modeling space in the next few years come from pure dataset research.

Barret Zoph · Dec 15, 2020 · 8:08 PM UTC

Barret Zoph

@barret_zoph

15 Dec 2020

Authors: Golnaz Ghiasi, @YinCui1, @AravSrinivas, @RuiQian3, @TsungYiLin1, @ekindogus, @quocleix, @barret_zoph

Barret Zoph · Mar 16, 2021 · 4:30 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

Nice summary of our recent work!

Andrey Lukyanenko @AndLukyane

16 Mar 2021

My review of the paper "Revisiting ResNets: Improved Training and Scaling Strategies". It seems that we have a new SOTA for CV tasks. Looking forward to a PyTorch version! andlukyane.com/blog/paper-re…

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

In a large scale semi-supervised learning setup we obtain 5.5x speedups over Noisy Student EfficientNets.

Barret Zoph · May 9, 2022 · 7:56 PM UTC

Barret Zoph

@barret_zoph

9 May 2022

Also seems the 41B model wasn't "compute Pareto optimal" --- for a given TPU budget it's almost always better to use the 9B model

Barret Zoph · Apr 15, 2022 · 6:24 PM UTC

Barret Zoph

@barret_zoph

15 Apr 2022

Replying to @_arohan_ @borisdayma

Yea +1 also to the power of these GLU/GELU FFN variants (like in arxiv.org/abs/2002.05202). These work very well.
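For readers unfamiliar with these variants, here is a minimal pure-Python sketch of one of them, the GELU-gated feed-forward layer (GEGLU) from the linked paper, written without bias terms. The helper names `gelu`, `matvec`, and `ffn_geglu` and the list-based matrices are illustrative choices for this sketch, not code from the paper or from any library.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # rows index inputs, columns index outputs


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(x: Vector, w: Matrix) -> Vector:
    # Row vector times matrix: out[j] = sum_i x[i] * w[i][j].
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def ffn_geglu(x: Vector, w: Matrix, v: Matrix, w2: Matrix) -> Vector:
    # Gate the linear branch x @ v elementwise with GELU(x @ w), then project out.
    gate = [gelu(h) for h in matvec(x, w)]
    hidden = [g * u for g, u in zip(gate, matvec(x, v))]
    return matvec(hidden, w2)
```

Compared with a standard two-matrix FFN, the gated form uses three weight matrices, so the hidden width is usually reduced to keep the parameter count comparable.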

Barret Zoph · Mar 16, 2021 · 4:07 PM UTC

Barret Zoph

@barret_zoph

16 Mar 2021

We design a Pareto curve of 11 different ResNet models named ResNet-RS by scaling the image size along with different network depths. We obtain 1.7-2.7x speedups over EfficientNets on ImageNet.

Barret Zoph · Sep 1, 2021 · 3:18 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

How do MuST representations compare to those trained with standard multi-task learning across datasets and tasks? MuST improves over multi-task training across all tasks!

Barret Zoph · Sep 1, 2021 · 3:18 PM UTC

Barret Zoph

@barret_zoph

1 Sep 2021

We studied MuST on a suite of different tasks and datasets. Training Datasets: Specialized teacher models are trained on these datasets and then used to produce pseudo labels. Evaluation Datasets: Datasets that the models are fine-tuned on.

Barret Zoph · Sep 7, 2022 · 2:58 PM UTC

Barret Zoph

@barret_zoph

7 Sep 2022

We dive into the tradeoffs of using sparse expert models versus standard dense models. We hope this review can help increase their adoption, as they are working quite well and lots of excellent research has been done on them!

Barret Zoph · Feb 21, 2022 · 5:02 PM UTC

Barret Zoph

@barret_zoph

21 Feb 2022

We finally combine our improvements and train a sparse model with 269B parameters (FLOP matched to a 32B dense model). This model achieves SOTA on a wide range of NLP tasks: SuperGLUE, XSum, CNN-DM, ANLI R3, ARC-Easy/Challenge, CB WebQA, CB NatQA.

Barret Zoph · Jun 15, 2022 · 12:33 AM UTC

Barret Zoph

@barret_zoph

15 Jun 2022

Great thread describing some of the approaches for getting models to perform well on tasks we care about!

Shayne Longpre

@ShayneRedford

14 Jun 2022

📢 A 🧵on the future of NLP model inputs. What are the options and where are we going? 🔭 1. Task-specific finetuning (FT) 2. Zero-shot prompting 3. Few-shot prompting 4. Chain of thought (CoT) 5. Parameter-efficient finetuning (PEFT) 6. Dialog [1/]

Barret Zoph · Feb 21, 2022 · 5:02 PM UTC

Barret Zoph

@barret_zoph

21 Feb 2022

We study the fine-tuning of sparse vs dense models. The optimal batch sizes and learning rates for sparse vs dense models are very different. In certain scenarios, wrong values masked any of the pre-training performance improvements of sparse models over the dense models.