@openai Past: - CTO & Co-Founder Thinking Machines Lab (@thinkymachines) - VP Research (Post-Training) @openai - Research Scientist at Google Brain

San Francisco, CA
I posted this note to OpenAI. Hey everybody, I have decided to leave OpenAI. This was a very difficult decision as I have had such an incredible time at OpenAI. I got to join right before ChatGPT and helped build the post-training team from scratch with John Schulman and others. I feel so grateful to have gotten the opportunity to run the post-training team and help build and scale ChatGPT to where it is today. Right now feels like a natural point for me to explore new opportunities outside of OpenAI. This is a personal decision based on how I want to evolve the next phase of my career.
 I am very grateful for all the opportunities OpenAI has given me and all the support I have gotten from OpenAI leadership such as Sam and Greg. I am in particular grateful for everything Bob has done and for being an excellent manager and colleague to me over my career at OpenAI. The post-training team has many many talented leaders and is being left in good hands. OpenAI is doing and will continue to do incredible work and I am very optimistic about the future trajectory of the company and will be rooting everybody on.
155
168
3,399
1,089,134
After 6 years at Google Brain I am excited to announce that I joined OpenAI! Very grateful for all the amazing collaborators and friends I have made at Google over the years. Could not be more excited to continue to help push AI progress and for the new adventures ahead.
57
42
1,517
Our team at OpenAI is hiring! We're looking for engineers/researchers who do rigorous and thoughtful work understanding and evaluating LLMs like ChatGPT. If you're interested, please apply online and DM me with work that you've done!
38
97
704
625,006
Introducing Switch Transformer, a simplified sparse architecture for scaling to trillion parameter language models. Switch Transformers yield 4-7x speedups over strong Transformer T5 models w/ the same computational resources. Paper: arxiv.org/abs/2101.03961
3
132
646
*New paper* RandAugment: a new data augmentation. Better & simpler than AutoAugment. Main idea is to select transformations at random, and tune their magnitude. It achieves 85.0% top-1 on ImageNet. Paper: arxiv.org/abs/1909.13719 Code: git.io/Jeopl
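The main idea in the post above (pick transformations at random, share one tunable magnitude) can be sketched as follows. This is a minimal illustration, not the released implementation: the four toy operations, the 0-10 magnitude scale, and the names `TOY_OPS` and `rand_augment` are assumptions made for this example, and images are plain nested lists of floats in [0, 1].

```python
import random


def identity(img, magnitude):
    # Leave the image unchanged (returns a copy).
    return [row[:] for row in img]


def brightness(img, magnitude):
    # Brighten by a factor that grows with magnitude, clipped to 1.0.
    factor = 1.0 + magnitude / 10.0
    return [[min(1.0, p * factor) for p in row] for row in img]


def solarize(img, magnitude):
    # Invert pixels at or above a threshold that drops as magnitude grows.
    threshold = 1.0 - magnitude / 10.0
    return [[1.0 - p if p >= threshold else p for p in row] for row in img]


def translate_x(img, magnitude):
    # Shift each row right by int(magnitude) pixels, filling with zeros.
    shift = min(int(magnitude), len(img[0]))
    return [[0.0] * shift + row[:len(row) - shift] for row in img]


# Hypothetical op list for illustration; the paper uses a larger set.
TOY_OPS = [identity, brightness, solarize, translate_x]


def rand_augment(img, num_ops, magnitude, rng=random):
    # Apply num_ops uniformly sampled ops, all at the same magnitude.
    for _ in range(num_ops):
        op = rng.choice(TOY_OPS)
        img = op(img, magnitude)
    return img
```

The search space reduces to two numbers, `num_ops` and `magnitude`, which is what makes a simple grid search feasible in place of a learned augmentation policy.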
2
143
566
Can simply copying and pasting objects from one image to another be used to create more data to improve state-of-the-art instance segmentation? Yes! With Copy&Paste, we achieve 57.3 box AP and 49.1 mask AP on COCO. This is SoTA wrt @paperswithcode arxiv.org/abs/2012.07177
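As a rough sketch of the copy-and-paste idea described above: pixels under a source object's mask overwrite the destination image, and destination masks lose whatever the pasted object now covers. The function name `copy_paste`, the nested-list image format, and the assumption that both images share the same size are choices made for this example only; the random scale jittering used in the paper is left out.

```python
def copy_paste(src_img, src_mask, dst_img, dst_masks):
    """Paste the masked object from src_img onto dst_img.

    src_img, dst_img: H x W nested lists of pixel values.
    src_mask: H x W nested list of booleans marking the object to paste.
    dst_masks: list of H x W boolean masks for objects already in dst_img.
    Returns the new image and the updated list of instance masks.
    """
    h, w = len(dst_img), len(dst_img[0])
    # Take source pixels where the mask is set, destination pixels elsewhere.
    out_img = [
        [src_img[y][x] if src_mask[y][x] else dst_img[y][x] for x in range(w)]
        for y in range(h)
    ]
    # Existing objects are occluded wherever the pasted object lands.
    out_masks = [
        [[m[y][x] and not src_mask[y][x] for x in range(w)] for y in range(h)]
        for m in dst_masks
    ]
    # The pasted object becomes a new instance.
    out_masks.append([row[:] for row in src_mask])
    return out_img, out_masks
```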
9
92
482
Revisiting ResNets: Improved Training and Scaling Strategies. Our recent work applies modern training and scaling techniques to the 2015 ResNet. We find ResNets outperform some recent state-of-the-art architectures. ResNets are remarkably durable! arxiv.org/abs/2103.07579
5
65
343
How do we combine knowledge from multiple labeled and unlabeled datasets to train a great general model? Multi-Task Self-Training (MuST) trains specialized teachers on labeled data, which then label unlabeled data to train a single general model. arxiv.org/abs/2108.11353
5
82
336
What an incredible company OpenAI is to work at. I have never seen so many people so committed to the mission of the company and band together when things go wrong. Huge props to the leadership team for navigating these incredibly difficult times.
14
6
301
108,100
Super excited to be part of this incredible team and company. Please reach out if you are interested in joining!
Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're excited that in the next couple months we’ll be able to share our first product, which will include a significant open source component and be useful for researchers and startups developing custom models. Soon, we’ll also share our best science to help the research community better understand frontier AI systems. To accelerate our progress, we’re happy to confirm that we’ve raised $2B led by a16z with participation from NVIDIA, Accel, ServiceNow, CISCO, AMD, Jane Street and more who share our mission. We’re always looking for extraordinary talent that learns by doing, turning research into useful things. We believe AI should serve as an extension of individual agency and, in the spirit of freedom, be distributed as widely and equitably as possible.  We hope this vision resonates with those who share our commitment to advancing the field. If so, join us. thinkingmachines.paperform.c…
14
12
274
47,438
What a fun first few months at OpenAI it's been :)
ChatGPT launched on wednesday. today it crossed 1 million users!
4
4
269
Want to learn more about how sparse expert models (e.g. MoEs, Switch Transformers, Hash Layers) work and their recent research advancements? Check out our recent review paper arxiv.org/abs/2209.01667
3
56
256
Excited to share our first blog post -- one of many to follow!
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/def…
7
8
260
34,964
Excited to release Tinker and see what the community uses it for.
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
5
9
198
38,933
Really enjoyed the Instruct-GPT paper. Impressed by the results: 100x smaller models w/ same quality by updating models on the data distribution you care about. Data is often overlooked & such a powerful tool -- smaller models for the same quality, which saves a lot at inference.
5
20
185
Lots of great work coming out on LLMs generating + understanding code (Codex, Scratch Pad, MBPP/MathQA, etc.). The AlphaCode paper by DeepMind is quite impressive, ranking around the 50th percentile in programming competitions w/ 5000+ participants. A 🧵 below:
2
29
174
Super excited this is rolling out! Real time speech to speech will be a powerful feature -- I am very bullish on multi-modal being a core component of AI products. This was a great collaboration with post-training (h/t to @kirillov_a_n & @shuchaobi + team on post-training) and other teams across OpenAI to make this happen.
Advanced Voice is rolling out to all Plus and Team users in the ChatGPT app over the course of the week. While you’ve been patiently waiting, we’ve added Custom Instructions, Memory, five new voices, and improved accents. It can also say “Sorry I’m late” in over 50 languages.
8
5
159
40,911
Interested in using sparse expert models, but find they are unstable, hard to design or don’t fine-tune well? We address these key issues and train a 269B param MoE model (w/ FLOPs of a 32B dense model) that improves SOTA on NLP benchmarks like SuperGLUE. arxiv.org/abs/2202.08906
5
32
158
Excited to be supporting this, please reach out if you are interested
At Thinking Machines, our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested, please DM me or @barret_zoph! Here are some example roles / projects: * Distributed inference engineer to support large-scale models on Blackwell GPUs * PyTorch & model optimization engineer to support & optimize latest OSS models * MLSys generalist for various aspects of vLLM
6
7
136
30,257
Exciting mission with a great team! With the progress of AI, now is the right time to start approaching these problems!
Today, @ekindogus and I are excited to introduce @periodiclabs. Our goal is to create an AI scientist. Science works by conjecturing how the world might be, running experiments, and learning from the results. Intelligence is necessary, but not sufficient. New knowledge is created when ideas are found to be consistent with reality. And so, at Periodic, we are building AI scientists and the autonomous laboratories for them to operate. Until now, scientific AI advances have come from models trained on the internet. But despite its vastness — it’s still finite (estimates are ~10T text tokens where one English word may be 1-2 tokens). And in recent years the best frontier AI models have fully exhausted it. Researchers seek better use of this data, but as any scientist knows: though re-reading a textbook may give new insights, they eventually need to try their idea to see if it holds. Autonomous labs are central to our strategy. They provide huge amounts of high-quality data (each experiment can produce GBs of data!) that exists nowhere else. They generate valuable negative results which are seldom published. But most importantly, they give our AI scientists the tools to act. We’re starting in the physical sciences. Technological progress is limited by our ability to design the physical world. We’re starting here because experiments have high signal-to-noise and are (relatively) fast, physical simulations effectively model many systems, but more broadly, physics is a verifiable environment. AI has progressed fastest in domains with data and verifiable results - for example, in math and code. Here, nature is the RL environment. One of our goals is to discover superconductors that work at higher temperatures than today's materials. Significant advances could help us create next-generation transportation and build power grids with minimal losses. 
But this is just one example — if we can automate materials design, we have the potential to accelerate Moore’s Law, space travel, and nuclear fusion. We’re also working to deploy our solutions with industry. As an example, we're helping a semiconductor manufacturer that is facing issues with heat dissipation on their chips. We’re training custom agents for their engineers and researchers to make sense of their experimental data in order to iterate faster. Our founding team co-created ChatGPT, DeepMind’s GNoME, OpenAI’s Operator (now Agent), the neural attention mechanism, MatterGen; have scaled autonomous physics labs; and have contributed to some of the most important materials discoveries of the last decade. We’ve come together to scale up and reimagine how science is done. We’re fortunate to be backed by investors who share our vision, including @a16z who led our $300M round, as well as @Felicis, DST Global, NVentures (NVIDIA’s venture capital arm), @Accel and individuals including @JeffBezos , @eladgil , @ericschmidt, and @JeffDean. Their support will help us grow our team, scale our labs, and develop the first generation of AI scientists.
2
8
128
27,155
Our new sparse model (SS-MoE) achieved SOTA on SuperGLUE (super.gluebenchmark.com/lead…)! Excited to see sparsity pushing state-of-the-art! This new work builds heavily on our prior work on Switch Transformer: arxiv.org/abs/2101.03961 Paper and more details to come soon!
3
17
112
❤️
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
1
3
101
17,589
Models and checkpoints are now open sourced for my recent work: "Rethinking Pre-training and Self-training". Paper link: arxiv.org/abs/2006.06882 Code Link: bit.ly/3j5sVAn. On COCO we achieve 54.3 AP and on Pascal Segmentation 90.5 mIOU!
1
23
109
Great post on on-policy distillation that people should check out!
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…
9
4
118
50,661
Intersecting cutting edge AI research w/ products is an incredibly exciting area to work on. Products are the ultimate test set :)
3
6
82
Great video summary of some of my recent work! Thanks @ykilcher!
A bit late to the party, but 💃NEW VIDEO🕺 on Switch Transformers by @GoogleAI. Hard Routing, selective dropout, mixed precision & more to achieve a 🔥ONE TRILLION parameters🔥 language model. Watch to learn how it's done🧙💪 piped.video/iAR8LkkMMIM @LiamFedus @barret_zoph
6
82
Super interesting work! Excited to see the future of attention models in computer vision.
If you haven't read our latest ImageNet SOTA work "Vision Transformers (ViT)" yet, shame on you. But! There's hope! Here's the corresponding blogpost which is a nice tl;dr: ai.googleblog.com/2020/12/tr…
1
7
65
We are looking for people to understand, improve and combine a variety of evaluation signals (e.g. automated and human), build eval infra (e.g. visualizations, testing) and do ML research on better eval methods.
2
1
61
18,308
Pleasure working with you -- learned quite a lot! Excited for what you do next.
Hi everyone yes, I left OpenAI yesterday. First of all nothing "happened" and it’s not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been really great - the team is really strong, the people are wonderful, and the roadmap is very exciting, and I think we all have a lot to look forward to. My immediate plan is to work on my personal projects and see what happens. Those of you who’ve followed me for a while may have a sense for what that might look like ;) Cheers
1
57
22,573
Yes I have also found this for math. If you append "I am a math tutor" it starts to answer with higher accuracy.
1
1
55
Yes --- I think spending more time thinking about what to work on vs actually working on the thing is hugely important
The best meta- advice I've gotten is from @barret_zoph. It took me a year to begin to understand it. It went something like: Notice that many researchers work hard. Yet some are far more successful. This means the project you choose defines the upper-bound for your success.
2
2
55
Slides and video of my talk at the Neural Architects workshop at ICCV this year! neuralarchitects.org/
17
47
Exciting to see sparse MoE models being as well calibrated as dense LM counterparts that use ~10x more inference compute. Better model calibration is a key research direction for better understanding what models do vs don't know.
Replying to @jaschasd
Overall, sparse models perform as well as dense models which use ~2x more inference cost, but they are as well calibrated as dense models using ~10x more inference compute.
2
3
41
My talk at the 2019 ICCV Neural Architects workshop is available online! piped.video/watch?v=O5Rrv6Bv…
1
11
40
Nice work from @IrwanBello on his paper “LambdaNetworks: Modeling Long-Range Interactions without Attention” An interesting scalable alternative to self-attention with strong empirical results in computer vision! Link: arxiv.org/abs/2102.08602
1
4
33
Code + checkpoints for the ResNet-RS paper are available!
3
35
Great blogpost on our recent ResNet-RS work!
Super excited to present my latest blog post on ResNet-RS - "Revisiting ResNets: Improved Training and Scaling Strategies". bit.ly/2QT3yIU I also share code implementation in PyTorch using TIMM & more! 1/3
5
31
Yes +1. I remember studying parts of the Feynman lectures which showed me how much more clear my thought process could be. When reading his description of simple algebra and complex numbers I thought "wow I really am not thinking clearly enough": feynmanlectures.caltech.edu/…
Looking back, my most valuable college classes were physics, but for general problem solving intuitions alone: - modeling systems with increasingly more complex terms - extrapolating variables to check behaviors at limits - pursuit of the simplest most powerful solutions ...
3
2
31
Come work w/ @hwchung27 and @_jasonwei on this!
3
28
16,713
I really like the "tcolorbox" package in LaTeX for research papers. It is a great feature for having nice looking summaries for sections or putting theorems. I enjoyed using it throughout my most recent work!
1
27
AI progress has continually exceeded my expectations since I first started working in the space in 2015. The saying that people overestimate what they can do in a short amount of time and underestimate what can be achieved in longer periods of time definitely resonates w/ me.
10 yrs ago @karpathy wrote a blog post on the outlook of AI: karpathy.github.io/2012/10/2… in which he describes how difficult it would be for an AI to understand a given photo, concluding "we are very, very far and this depresses me." Today, our Flamingo steps up to the challenge.
1
1
26
Very excited to be able to release these sparse checkpoints to the research community!
Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced! github.com/google-research/t… All thanks to the efforts of James Lee-Thorp, @ada_rob, and @hwchung27
2
1
26
It was a pleasure to be part of this effort! Very bullish on the impact this will have for the future of LLMs. Also very impressed with the leadership for this project --- coordinating all of this to happen is nothing short of incredible!
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the github.com/google/BIG-bench paper is now live: arxiv.org/abs/2206.04615. BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
1
3
25
This is a great description of RandAugment! Thanks so much.
This video explains the new RandAugment AutoML Data Augmentation algorithm from @GoogleAI, improving on previous techniques (AutoAugment/PBA) on ImageNet and dramatically reducing the search space, making AutoML for Data Aug much easier! piped.video/Zzt9i3gDueE #100DaysOfMLCode
1
5
23
Enjoyed The Pile dataset paper -- very thorough! Data is often overlooked and given the amount of money/time that goes into training these language models, this aspect should be taken seriously. arxiv.org/abs/2101.00027
2
1
21
Switch Transformers introduce sparsity by sending different tokens to different weights. We simplify MoE models by routing to the top expert only, which saves computation + communication costs. We also introduce training techniques for training huge models in lower precision!
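Routing each token to only its top expert can be sketched in a few lines. This is a simplified illustration in plain Python: `switch_route`, the list-based router weights, and the toy experts are names invented for this example, and the sketch omits expert capacity limits, the load-balancing loss, and the lower-precision training techniques.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(tokens, router_weights, experts):
    """Send every token to its single highest-probability expert.

    tokens: list of vectors (lists of floats).
    router_weights: one weight vector per expert, same length as a token.
    experts: list of callables mapping a vector to a vector.
    Returns (outputs, assignments).
    """
    outputs, assignments = [], []
    for tok in tokens:
        # Router logits: one dot product per expert.
        logits = [sum(w * x for w, x in zip(row, tok)) for row in router_weights]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Only the chosen expert runs; its output is scaled by the gate value.
        expert_out = experts[best](tok)
        outputs.append([probs[best] * v for v in expert_out])
        assignments.append(best)
    return outputs, assignments
```

Because only one expert runs per token, adding experts grows the parameter count while the compute per token stays roughly fixed.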
1
2
18
Nice paper showing the power of simple scaling and training methods for video recognition! Follows the line of "RS" research I have done with some of these collaborators for Image Classification (arxiv.org/abs/2103.07579) and Object Detection (arxiv.org/abs/2107.00057).
Wondering how simple 3D-ResNets perform on video recognition given all the recent architecture craze? In Revisiting 3D ResNets for Video Recognition, we study the impact of improved training and scaling methods on 3D ResNets. arxiv.org/abs/2109.01696
1
3
17
In prior work, we showed generating labels from a teacher model can be more flexible than pre-training. arxiv.org/abs/2006.06882 MuST is a natural extension where now we generate labels from multiple different teachers on various tasks to learn a general pre-trained model.
1
17
Really fun chatting! Thanks for having us on.
New interview with Barret Zoph (@barret_zoph) and William Fedus (@LiamFedus) of Google Brain on Sparse Expert Models. We talk about Switch Transformers, GLAM, information routing, distributed systems, and how to scale to TRILLIONS of parameters. Watch now: piped.video/ccBMRryxGog
1
1
17
To find these interesting prompts, should we be looking at the pre-training data? Is "step by step" mentioned the most frequently in documents when an explanation comes next? Automatic prompt discovery from inspecting the pre-training data feels promising.
Big language models can generate their own chain of thought, even without few-shot exemplars. Just add "Let's think step by step". Look me in the eye and tell me you don't like big language models. arxiv.org/abs/2205.11916
2
1
14
Wow that is a very strong imagenet result! Cool to see further progress being made in semi-supervised methods for computer vision!
Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-) This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2. More details here: arxiv.org/abs/2003.10580
1
15
Switch Transformers are also found to be strong multi-task learners. On multilingual language modeling (mT5) we outperform T5 models across 101 languages w/ a 5x speedup.
13
Thanks for the nice article on our recent work!
As promised, here is my new blogpost explaining the latest research from Google Research and Brain team. I liked this paper a lot because instead of building models with billions of params, it focuses on fundamental aspects. medium.com/@nainaakash012/re…
1
14
We find we can distill some of the performance improvements from our sparse Switch Transformers into dense variants (w/ the same FLOPs per token)
2
13
I would be surprised if a modeling improvement could yield a 10x smaller model for a fixed quality. For data this is not the case; often the feeling is the opposite, and it would be surprising if you couldn't reduce model size by 10x.
1
14
Excited to be giving it! Thanks for the invite.
📢 Next Wed at 5 pm, we’ll have (@barret_zoph) from Google Brain who will talk about the use of sparsity for large Transformer models: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" zoom info: ai-info@ku.edu.tr or just DM!
2
12
Very useful LaTeX trick!
Nice and beautiful examples of how to produce annotated equations using LaTeX. 🤯 github.com/synercys/annotate…
3
12
Thanks @jeremiecharris for having me on your podcast! Super fun chatting about mixture-of-expert models and how they fit into the current large language model landscape. Podcast: bit.ly/3vpsCr2
1
3
12
Sparse expert models are becoming increasingly relevant as they are now being used across many domains (NLP, speech, vision, multi-modality) w/ very strong results. Right now sparse expert models hold SOTA on various benchmarks (e.g. ST-MoE on SuperGLUE, ANLI, ARC, etc.).
1
10
How do Switch Transformers scale? Keeping the floating point operations per token fixed, increasing the number of sparse parameters by adding more experts significantly improves performance
1
1
11
Yes this is a very important principle to keep in mind --- even when doing a single research project. It's often hard to find the right experimentation scale such that the "smaller" scale ideas have a higher probability of working at a "larger scale".
Just making sure everyone read “The Bitter Lesson”, as it is one of the best compact pieces of insight into the nature of progress in AI. Good habit to keep checking ideas on whether they pass the bitter lesson gut check incompleteideas.net/IncIdeas…
12
Fantastic video on some of our recent work! Really great job @CShorten30.
"Rethinking Pre-training and Self-Training" from researchers @GoogleAI shows we get better results from self-training than either supervised or self-supervised pre-training. Demonstrated on Object Detection and Semantic Segmentation! piped.video/QSjMLGA7e2o #100DaysOfMLCode
12
We highlight the importance of disentangling the training methods and architectural components when making comparisons across architectures
2
10
The modern training techniques (data augmentation, label smoothing, etc…) lead to strong representations that rival sota self-supervised learning methods (e.g. SimCLR) on a bunch of vision tasks
2
1
11
Copy-Paste greatly improves data efficiency (even on top of a strong augmentation baseline of aggressive scale jittering!). Data efficiency is critical for instance segmentation as it's much more expensive compared to object detection and image classification.
1
11
We study scaling strategies for vision models and observe that the best scaling strategy heavily depends on the training setup. When overfitting can occur (e.g. 350 epochs on ImageNet) scaling depth is best. In settings with larger datasets/fewer epochs width scaling is preferred.
1
10
Replying to @giffmana
The T5 paper did something very similar, right? Do the normal warmup, decay by 1/sqrt(step), then linearly decay over the last 10% of training.
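The schedule described in this reply can be written out as a small function. This is a sketch under stated assumptions: "normal warmup" is taken here to mean holding the rate constant for `warmup_steps` and then following 1/sqrt(step), which is one common form of inverse-square-root schedule; the function name, the parameter names, and the choice to decay linearly to zero are illustrative, and whether this matches the T5 paper is the open question the reply itself asks.

```python
import math


def lr_schedule(step, total_steps, warmup_steps=10_000, final_fraction=0.1):
    """Warmup, inverse-square-root decay, then linear decay to zero.

    The rate is 1/sqrt(max(step, warmup_steps)) until the final
    `final_fraction` of training, then falls linearly to 0 at total_steps.
    """
    decay_start = total_steps - int(total_steps * final_fraction)
    if step <= decay_start:
        # Constant during warmup, then proportional to 1/sqrt(step).
        return 1.0 / math.sqrt(max(step, warmup_steps))
    if step >= total_steps:
        return 0.0
    # Linear ramp from the rate at decay_start down to zero.
    start_lr = 1.0 / math.sqrt(max(decay_start, warmup_steps))
    remaining = (total_steps - step) / (total_steps - decay_start)
    return start_lr * remaining
```

For example, with `total_steps=1000` and `warmup_steps=100`, the linear phase begins at step 900 and the rate reaches zero at step 1000.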
1
10
2,468
Happy to see our work on ResNet-RS made it to NeurIPS!
To appear #NeurIPS2021 as a spotlight - congrats team
10
LVIS dataset was created to make progress on long-tail visual recognition. We outperform the ECCV 2020 challenge winner on LVIS by +3.6 mask AP on rare objects (and our baseline by +6.1 AP)
1
10
Example of MuST:
Step 1: Train three models: NYU Depth, COCO Detection, Pascal Segmentation
Step 2: Generate pseudo labels for depth estimation, detection and segmentation on all labeled / unlabeled images
Step 3: Train new model on the combined human + pseudo labeled images
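The labeling step in this example can be sketched as a small data-preparation function. This is an illustration only: `build_must_dataset`, the dictionary layout, and the rule of preferring a human label whenever one exists for an image are assumptions made for the sketch, and the teachers here are arbitrary callables standing in for trained models.

```python
def build_must_dataset(images, human_labels, teachers):
    """Combine human labels and teacher pseudo labels per image.

    images: list of image ids.
    human_labels: {task: {image_id: label}} for images labeled by people.
    teachers: {task: callable(image_id) -> pseudo label}.
    Returns a list of (image_id, {task: label}) training examples.
    """
    dataset = []
    for image in images:
        targets = {}
        for task, teacher in teachers.items():
            human = human_labels.get(task, {})
            # Assumption: keep the human label if present, else pseudo-label.
            targets[task] = human[image] if image in human else teacher(image)
        dataset.append((image, targets))
    return dataset
```

Every image ends up with a target for every task, which is what lets a single model train on all tasks at once.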
1
9
Surprising to see how performance scales smoothly when the model goes from generating 1 solution all the way up to 1M solutions
1
9
Exciting to see more encoder-decoder models (e.g. T5, T0, Switch Transformer, ST-MoE). Liked the dual loss pre-training strategy: use MLM on encoder and simple autoregressive LM on decoder.
1
9
Super excited to see the co-evolution of game design with these types of models. Open world games that could automatically generate new environments based on what the player has enjoyed so far would be so cool --- I often felt games got stale due to a lack of new environments.
DALL-E 2 applied to generating assets for game development:
9
Awesome startup w/ awesome founders! Excited to see the future of the AI x Legal space. (Disclosure: I invested)
1
8
Looking forward to giving this talk!
great talks lining up in September @KuisAICenter including @DeqingSun @jponttuset @barret_zoph, looking forward to all of them!
9
Surprised the 41B model was only better than the 9B model once it could generate 1k+ samples. Wonder how results for different model sizes change as a function of the pre-training and fine-tuning dataset size.
1
1
8
Impressive results w/ the continued scale of large LMs. On certain tasks there were large discontinuous performance improvements not predicted by scaling curves. Great leadership / coordination on this project to make it happen, nice work team!
Introducing the 540 billion parameter Pathways Language Model. Trained on two Cloud #TPU v4 pods, it achieves state-of-the-art performance on benchmarks and shows exciting capabilities like mathematical reasoning, code writing, and even explaining jokes. goo.gle/3j6eMnK
9
When using only ImageNet images, MuST significantly outperforms both supervised and self-supervised representations across many tasks.
1
9
Hope these revamped ResNets can serve as baselines for future architectural and training method comparisons!
8
Nice summary of a lot of the great work done by Google Research in the past year.
As in past years, I've spent part of the holiday break summarizing much of the work we've done in @GoogleResearch over the last year. On behalf of @Google's research community, I'm delighted to share this writeup (this year grouped into five themes). ai.googleblog.com/2022/01/go…
8
We observed that adding more pseudo labels to each image leads to better representations! So don’t just use classification and depth estimation labels, include segmentation and others too.
1
8
Exciting research ahead to not require generating huge amounts of samples -- seems this should be possible. Many applications of LLMs require generating lots of samples and even using discriminator models to further filter generated outputs (e.g. Lamda, OpenAI Verifiers).
1
1
7
What if I already trained my checkpoint? No problem! You can simply continue training your checkpoint with MuST for a few iterations and observe improvements! Results combining MuST with an ALIGN checkpoint.
2
8
This really hits home: the amount of hand holding for experiments and models can be quite frustrating. You would think that this area would have more progress given these are the issues people training the models are having :)
The AGI I want is one that realizes I made a dumb mistake with batch size which makes it OOM on a supercomputer and tries a smaller one for me - while I am sleeping so I don’t have to babysit the models and increases the throughput in experimentation!
1
7
Interesting how the validation loss isn't correlated with the solve rate. Other tasks like dialogue (e.g. Lamda) seem to correlate much better with human evals. Probably due to the one-to-many nature of coding tasks relative to dialogue, as the authors point out.
2
7
Wouldn't be surprised if some of the most impactful papers in the language modeling space in the next few years come from pure dataset research
4
7
Nice summary of our recent work!
My review of the paper "Revisiting ResNets: Improved Training and Scaling Strategies". It seems that we have a new SOTA for CV tasks. Looking forward to a PyTorch version! andlukyane.com/blog/paper-re…
7
In a large scale semi-supervised learning setup we obtain 5.5x speedups over Noisy Student EfficientNets.
1
6
Also seems the 41B model wasn't "compute Pareto optimal": for a given TPU budget it's almost always better to use the 9B model.
2
1
7
Yea +1 also to the power of these GLU/GELU FFN variants (like in arxiv.org/abs/2002.05202). These work very well.
6
We design a Pareto curve of 11 different ResNet models named ResNet-RS by scaling the image size along with different network depths. We obtain 1.7-2.7x speedups over EfficientNets on ImageNet.
1
5
How do MuST representations compare to those trained with standard multi-task learning across datasets and tasks? MuST improves over multi-task training across all tasks!
2
6
We studied MuST on a suite of different tasks and datasets. Training Datasets: Specialized teacher models trained on these datasets, which are used to produce pseudo labels. Evaluation Datasets: Datasets models are fine-tuned on.
1
6
We dive into the tradeoffs of using sparse expert models versus standard dense models. We hope this review can help to increase adoption of them, as they are working quite well and lots of excellent research has been done on them!
4
We finally combine our improvements and train a sparse model with 269B parameters (FLOP matched to a 32B dense model). This model achieves SOTA on a wide range of NLP tasks: SuperGLUE, XSum, CNN-DM, ANLI R3, ARC-Easy/Challenge, CB WebQA, CB NatQA.
5
Great thread describing some of the approaches for getting models to perform well on tasks we care about!
📢 A 🧵on the future of NLP model inputs. What are the options and where are we going? 🔭 1. Task-specific finetuning (FT) 2. Zero-shot prompting 3. Few-shot prompting 4. Chain of thought (CoT) 5. Parameter-efficient finetuning (PEFT) 6. Dialog [1/]
1
5
We study the fine-tuning of sparse vs dense models. The optimal batch sizes and learning rates for sparse vs dense models are very different. In certain scenarios the wrong values masked all of the pre-training performance improvements of sparse models over the dense models.
1
5