the surreal feeling of someone posting a paper internally, a new hire asking “should we try it?” and having to answer “yeah we call it flabberblanung internally. we’ve been using it as our default for about 8 months. here’s the code.” over and over and over and over.
12
11
355
108,296
We are moving incredibly fast. Come light up GPUs with us.
Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're excited that in the next couple months we’ll be able to share our first product, which will include a significant open source component and be useful for researchers and startups developing custom models. Soon, we’ll also share our best science to help the research community better understand frontier AI systems. To accelerate our progress, we’re happy to confirm that we’ve raised $2B led by a16z with participation from NVIDIA, Accel, ServiceNow, CISCO, AMD, Jane Street and more who share our mission. We’re always looking for extraordinary talent that learns by doing, turning research into useful things. We believe AI should serve as an extension of individual agency and, in the spirit of freedom, be distributed as widely and equitably as possible.  We hope this vision resonates with those who share our commitment to advancing the field. If so, join us. thinkingmachines.paperform.c…
12
12
344
40,350
Super proud to have worked on this with @suchenzang, @NamanGoyal21 and many others.
Today Meta AI is sharing OPT-175B, the first 175-billion-parameter language model to be made available to the broader AI research community. OPT-175B can generate creative text on a vast range of topics. Learn more & request access: ai.facebook.com/blog/democra…
14
36
324
Happy to announce today is my first day @character_ai.
25
4
302
57,910
Some teams use sweeps, heuristics, or scaling laws to determine their training LR. At Character, we just have Noam Shazeer dial it to the right value.
16
27
302
169,927
Replying to @srush_nlp
I find people unfamiliar with scaling are shocked by this:
17
27
282
Huge release from 🤗Transformers, including all the OPT models up to 30B parameters! You can even run OPT models in Colab now!
Last week @MetaAI publicly released huge LMs, with up to ☄️30B parameters. Great win for Open-Source🎉 These checkpoints are now in 🤗transformers! But how to use such big checkpoints? Introducing Accelerate and ⚡️BIG MODEL INFERENCE⚡️ Load & USE the 30B model in colab (!)⬇️
2
32
251
Really excited to be sharing this with everyone today. Blog post below, paper here: arxiv.org/abs/2004.13637
Today we’re announcing that Facebook AI has built and open-sourced Blender, the largest-ever open-domain chatbot. It outperforms others in terms of engagement and also feels more human, according to human evaluators. ai.facebook.com/blog/state-o…
5
40
193
I'm looking for a PhD intern to join me summer 2021 at FAIR NY to work on Conversational AI. Interests include chit chat, task oriented dialogue, large-scale modeling, and evaluation. Apply at facebook.com/careers/jobs/19… and send me a heads up.
4
29
174
I have a PhD now. So now when people ask me "is it Stephen with a V or a PH?" I can reply, "It's Stephen with a PhD."
3
15
133
Terrible day. A lot of great colleagues gone.
8
4
143
I once trained hyperbolic (Poincaré) networks with Riemannian SGD and HogWild. Your optimization stack does not scare me.
6
4
142
12,996
One of my biggest lessons from OPT is that engineering risk and research risk are multiplicative, and research risk can be easier to reduce (simplify to a known baseline).
3
9
125
Replying to @andrew_n_carr
I have bad news
2
105
I have absolutely no idea how BPE-based LMs learn how to rhyme in English — a language with notoriously awful spelling, an absurd number of vowels, and constantly changing pronunciation; let alone with a tokenizer that actively fights morphology. Incredible.
11
4
109
26,266
The vibes are so good at Thinky.
3
1
107
15,793
Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.
4
3
70
168,195
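A minimal sketch of that kind of calculator, assuming only the power-law form N_opt ∝ C^a for the compute-optimal parameter count. The exponent values in the usage lines are placeholders for illustration, not the paper's fitted numbers or its error bars; plug in whichever bounds you want to compare.

```python
def extrapolate_params(n_ref: float, compute_ratio: float, exponent: float) -> float:
    """Extrapolate an optimal parameter count under N_opt = n_ref * ratio**exponent.

    n_ref is the optimum at some reference compute budget, and compute_ratio
    is how many times larger the target budget is.
    """
    return n_ref * compute_ratio ** exponent


def prediction_spread(compute_ratio: float, exp_lo: float, exp_hi: float) -> float:
    """Ratio between the predictions made with the high and low exponents."""
    return compute_ratio ** (exp_hi - exp_lo)


if __name__ == "__main__":
    # Placeholder exponent bounds (assumptions, not values from the paper).
    exp_lo, exp_hi = 0.45, 0.55
    for orders in (1, 2, 4):
        ratio = 10.0 ** orders
        spread = prediction_spread(ratio, exp_lo, exp_hi)
        print(f"{orders} orders of magnitude out: predictions differ by {spread:.2f}x")
```

The spread grows as a power of the extrapolation distance, so a small gap between exponent bounds compounds quickly once you predict several orders of magnitude out.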
Why are so many LLM critics fixated on what can be done with a single generation? The really interesting stuff is going to come from allowing for O(n) or O(n log n) generations.
9
2
64
12,722
Character is a blast, and some of the most talented people I've ever worked with. We're hiring too!
Announcing our Series A and our new AI model, C1.2! blog.character.ai/character-…
3
1
66
36,727
If you’re planning on being an employee, focus on places where you’re “Member of the Technical Staff” and no other distinctions exist. They have a healthier view of the constant flexibility of necessary skills.
1
59
8,059
One thing I really like about ARR as a reviewer is that I got to read two resubmissions of drafts I previously rejected. The updated manuscripts clearly took my prior comments into account, and the results were significantly improved papers.
1
4
57
Since everyone is piling on Chinchilla again, here’s a simple experiment you can run at home. Train any sized model you want with a token/param ratio of 20, then a double sized model for half as many steps, and a half sized model for double steps. Observe loss curves.
6
4
53
13,407
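The three runs in that experiment can be laid out with a few lines of arithmetic. This is a sketch under two assumptions that are not in the tweet: training compute is approximated as 6 * params * tokens, and batch size is fixed, so halving or doubling steps halves or doubles tokens. The function names and the 125M base size are hypothetical.

```python
from typing import Dict, List


def approx_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs with the common 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def experiment_configs(base_params: float, ratio: float = 20.0) -> List[Dict]:
    """Build the base, double-size/half-steps, and half-size/double-steps runs."""
    base_tokens = ratio * base_params
    runs = [
        ("base", base_params, base_tokens),
        ("double size, half steps", 2 * base_params, base_tokens / 2),
        ("half size, double steps", base_params / 2, 2 * base_tokens),
    ]
    return [
        {
            "name": name,
            "params": n,
            "tokens": d,
            "tokens_per_param": d / n,
            "flops": approx_flops(n, d),
        }
        for name, n, d in runs
    ]


if __name__ == "__main__":
    for cfg in experiment_configs(125e6):
        print(
            f"{cfg['name']:<26} tokens/param={cfg['tokens_per_param']:>5.1f} "
            f"flops={cfg['flops']:.3e}"
        )
```

Under these assumptions all three runs cost the same compute, while the token/param ratio moves from 20 to 5 and 80, which is what makes the loss curves comparable.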
Training a particularly precarious model right now. Making a sacrifice to the god of NaNs.
2
3
56
I took notes on yesterday's #naacl2016 deep learning panel and decided to post them online: cs.utexas.edu/~roller/naacl2…
31
49
Replying to @NRO
@NRO @powerlineUS how to lie with a y-axis, from people who believe others are lying with a y-axis
2
4
47
Germans this time of year:
22
36
I feel like multiple papers are rediscovering some of the premises of arxiv.org/abs/2012.14983
Interesting paper that indicates that LLMs do have information on truths even when their output indicates otherwise. They also propose a new method that improved LLAMA 7B’s truthfulness from 32% to 65%! arxiv.org/abs/2306.03341 1/🧵
1
46
11,454
there are two lessons here for those who are paying attention
1
45
10,404
One of the most wonderful things about being a senior researcher is pointing junior people to problems and knowing they’ll do *much* better than if you tried yourself.
39
Replying to @twiecki
90% of its benefit is as documentation, and the other 10% is discouraging people from input/return values that are Tuple[str, Dict, Tuple[str, float]] madness.
34
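To make the point concrete, here is a small sketch, assuming the thread is about Python type hints. The `score` function and its field names are hypothetical, invented only to show how naming the pieces replaces a nested tuple signature.

```python
from typing import Dict, NamedTuple

# The opaque alternative would be a signature like:
#   def score(text: str) -> Tuple[str, Dict[str, int], Tuple[str, float]]
# which forces every caller to remember what each position means.


class Prediction(NamedTuple):
    label: str
    confidence: float


class ScoreResult(NamedTuple):
    normalized_text: str
    token_counts: Dict[str, int]
    best: Prediction


def score(text: str) -> ScoreResult:
    """Toy scorer: the most frequent token and its share of all tokens."""
    normalized = text.strip().lower()
    counts: Dict[str, int] = {}
    for tok in normalized.split():
        counts[tok] = counts.get(tok, 0) + 1
    if not counts:
        return ScoreResult(normalized, counts, Prediction("", 0.0))
    top = max(counts, key=counts.get)
    return ScoreResult(normalized, counts, Prediction(top, counts[top] / sum(counts.values())))
```

Callers then write `score(text).best.confidence` instead of `score(text)[2][1]`, so the annotation doubles as documentation.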
ray on slurm on kubernetes on hypervisor on borg
5
33
2,227
My favorite thing about this work is how much it manages to replace prior specialized architectures with clever multitask objectives.
🚨 New work 🚨 SeeKeR: An open source search-augmented language model - uses a search engine to stay up-to-date - hallucinates less & is more topical than GPT2 or GPT3, with fewer parameters - applied to dialogue, superior to BlenderBot 2 Read more here: parl.ai/projects/seeker
4
31
Which part of RLHF is most important?
10% RLHF combined
7% RL
47% HF
37% F
356 votes • Final results
11
1
33
31,684
Replying to @typedfemale
i deleted the first version of this tweet where i actually leaked our internal name for today’s paper. the name is so much better than the paper’s. it’s all a distraction anyway
1
31
7,678
BlenderBot is insanely fun to talk to. Talk to it at blenderbot.ai.
(1/4) Meet BlenderBot 3, the first publicly available 175B-parameter chatbot with model weights, code & datasets. It can chat about nearly any topic & is designed to learn & improve by conversing with people in the real world. Try the interactive demo: bit.ly/3Pf2s2t
1
3
28
to vote myself: i thought HF until Anthropic’s Constitutional AI paper. Now I lean towards just F.
4
1
29
2,297
Replying to @ezyang
An entire class of PR review comments is eliminated, saving time and energy. No one is happy with what the autoformatter does to their code, but everyone is happy with what it does to others’.
1
25
this is a story about how closed source becomes open source
2
25
4,921
Also all modern scaling strategies are highly synchronous. Which means one bad node can tank the entire system. I would love our future researchers to be thinking about this.
2
25
Replying to @cHHillee
Are we like two papers away from “all you need is gabor filters and an SVM”?
1
26
NeuCAIR Workshop @iclr_conf (May 7, 2021) is excited to have a day discussing topics broadly related to Neural #ConversationalAI, with applications in task-oriented dialogue, chitchat, healthcare, and education. For more information and schedule: sites.google.com/view/neucai…
1
8
26
I'm also unreasonably obsessed with our logbook github.com/facebookresearch/…
1
27
Pro tip: make your training script's output always human readable first, and have it *also* dump raw logs as a separate, structured json file. If you find yourself writing regexes to parse a stdout file, you've missed the opportunity to just dump it in the first place.
2
1
23
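One way to follow that tip, as a minimal sketch: print a readable line to stdout and append the same record to a JSON Lines file. The helper names, metric keys, and file name here are hypothetical, not from any particular training codebase.

```python
import json
import time
from typing import Dict, List


def log_step(step: int, metrics: Dict[str, float], json_path: str) -> None:
    """Print a human-readable line and append the raw record as one JSON line."""
    pretty = " | ".join(f"{name} {value:.4g}" for name, value in metrics.items())
    print(f"step {step:>7d} | {pretty}")

    record = {"step": step, "time": time.time(), **metrics}
    with open(json_path, "a") as f:
        f.write(json.dumps(record) + "\n")


def read_log(json_path: str) -> List[dict]:
    """Load the structured log back without any regex parsing."""
    with open(json_path) as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    log_step(100, {"loss": 2.913, "lr": 3e-4}, "train_log.jsonl")
    print(read_log("train_log.jsonl")[-1]["loss"])
```

Writing one JSON object per line keeps the file appendable during training and easy to load into a dataframe or plotting script afterwards.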
Replying to @timgill924
Wtf is wrong with you
1
21
@character_ai is hiring in nyc too!
Move to NYC! We have bagels and culture and public transportation and @srush_nlp and bagels!
1
1
21
4,376
Replying to @Thom_Wolf
After that, repeat the exercise with a corpus of Korean, Hindi, or some other language with a relatively underrepresented writing system.
17
3,200
NeuCAIR Workshop @iclr_conf (May 7, 2021) solicits novel contributions that relate broadly to Neural #ConversationalAI, with applications in task-oriented dialogue, chitchat, healthcare, and education. New submission deadline: 11:59pm Mar 4 AOE. sites.google.com/view/neucai… 1/4
1
6
23
Saw some h100s irl today. The hot aisle has a different vibe with them.
20
3,917
Replying to @chrmanning
I kinda wish groups did a *secret* held out test set and told no one about it and then 6 months later released with a message “surprise! time to see who’s overfitting!”
1
19
4,816
You’re right, we should have used another name. We are changing it.
3
20
let’s just say it might change your view of the paper some.
1
21
6,484
Do you ever read the wikipedia entries on extremely basic concepts (for me today, “food”) just to see how such things are defined precisely?
5
1
21
3,565
Totality was worth the journey.
1
20
5,210
Second biggest confounder in my mind is fp16. I regret not having bfloat ready in time (falls into eng risk bucket).
1
20
At @character_ai we call them moé models.
1
2
17
3,395
Congrats to @StasBekman and the entire rest of the team. Y’all keep doing the amazing work.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100%
1
19
Replying to @tomgoldsteincs
Perhaps another noteworthy number is that we’ve grown something like 6 orders of magnitude in training flops but cost only 4 orders of magnitude (mildly less really). I can hold out until at least 2032 before I need all of earth’s electricity ;)
17
wrap it in a loop
Replying to @ylecun
5- they have limited working memory 6- they execute a fixed number of computational steps per generated token 7- hence they are very far from Turing complete 8- Auto-regressive generation is an exponentially-divergent diffusion process, hence not controllable. 3/
4
19
5,124
Nothing like a good re-org.
1
18
4,150
Replying to @XENOWHITEx
We own our full stack end to end
1
18
751
Character is now in your pocket.
The same Characters you love, now in the palm of your hand. Download the official #CharacterAI Mobile App for 𝗙𝗥𝗘𝗘 on iOS and Android. 𝗶𝗢𝗦: bit.ly/downloadcai_ios 𝗔𝗻𝗱𝗿𝗼𝗶𝗱: bit.ly/downloadcai_and
1
17
2,198
A really fascinating thing about deploying LLMs at scale is that you can mutually confuse training-time bugs and serving-time bugs: they appear indistinguishable to the users providing feedback.
1
1
16
3,125
Replying to @timgill924
No, you’re in real life and grad students are real people. Don’t tweet things that exacerbate their perpetual existential dread.
1
16
Replying to @O42n2 @NVIDIAAI
that’s a funny definition of open source. they distribute binary kernels
4
15
2,286
Replying to @main_horse
let’s say there is up to an order of magnitude delta in some of the predictions
1
16
3,801
Shout out to the guy who brought all the poll workers coffee and donuts while I was voting.
14
The undocumented XID errors just taste better. More fresh.
1
1
16
4,931
Submitted my dissertation to my committee. Defense is scheduled in 3 weeks. No idea what to do until then.
4
12
So are there a bunch of conservative Canadians now like, "that's it. I'm moving to America"?
1
7
13
I love that LMs trained on internet comments will produce a helpful sounding answer with a link to a supporting YouTube video, and it's just rickrolling you.
14
1,330
Replying to @arankomatsuzaki
I've done the comparison a few times on BlenderBot1 (2.7B params). I never got clear, conclusive results. I decided the additional engineering overhead of being able to incorporate those states wasn't worth it, so now @parlai_parley always resets.
1
14
I suppose I need to make it official... *Updates twitter profile to say "PhD Candidate"*
1
9
Replying to @srush_nlp
arthur szlam had a nice paper from gpt2 days that found they were distinguishable only by a strictly more powerful model
3
14
1,885
Replying to @typedfemale
ngl this is how i view that trick after having tried a dozen other things

[GIF: Ol' Reliable (SpongeBob)]

14
2,703
Replying to @tallinzen
Das Leben der Anderen
1
10
Replying to @srush_nlp
I imagine the RNN knows roughly what needs to be placed but can’t do it with fidelity. In phonebook lookup, the RNN might output 10 random digits, all with relatively low entropy: good ppl bc it eliminated most of the vocab, but terrible acc. It’s like we forgot luong et al (2015)…
1
1
12
1,610
hole: why would the clear leader correlate pricing with marginal costs? either overcharge for gpt4 (unique capabilities) or more likely overcharge for gpt3.5-turbo (mass usage)
13
822
I really appreciate @yoavgo noting that pretraining on code + SFT + RLHF (probably) makes the Octopus assumptions invalid. I feel we have been having the same stale, unprovable arguments for 2+ years and this is something new.
1
13
14,411
I had a model of size X without embeddings. I plugged the number into their equation and rounded to a reasonable human number, trained, and got better ppl than with the smaller learning rate I had been using for 6 months. Actually forced me to go redo my baseline.
1
13
Replying to @mike64_t
i bought clear protractors for my entire team
1
10
1,998
Replying to @yoavgo
FP16 is about 3x faster if you have the right hardware. Adaptive batching is about 3x faster. HF tokenizers is critical.
3
3
12
Tbh building a flops calculator is a pretty good homework assignment…
1
11
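A possible starting point for that homework, as a sketch built on rules of thumb rather than exact accounting: roughly 12 * layers * d_model^2 parameters for the transformer blocks plus vocab * d_model for embeddings, and about 6 FLOPs per parameter per training token. The model shape, GPU count, peak throughput, and utilization in the usage lines are illustrative assumptions.

```python
def param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough decoder-only parameter count: blocks plus token embeddings."""
    return 12 * n_layers * d_model**2 + vocab_size * d_model


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def training_days(
    total_flops: float, n_gpus: int, peak_flops_per_gpu: float, utilization: float
) -> float:
    """Wall-clock days at a given hardware utilization (fraction of peak)."""
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
    return seconds / 86400.0


if __name__ == "__main__":
    # Illustrative shape and cluster, chosen only to exercise the functions.
    n = param_count(n_layers=96, d_model=12288, vocab_size=50272)
    c = training_flops(n, n_tokens=300e9)
    days = training_days(c, n_gpus=1024, peak_flops_per_gpu=312e12, utilization=0.4)
    print(f"params={n / 1e9:.1f}B flops={c:.2e} days={days:.1f}")
```

A fuller version would also count attention FLOPs that scale with sequence length and any activation recomputation, which this sketch ignores.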
I procrastinated on my dissertation by making a death clock which counts down until the University dissertation deadline ☠️💣
1
11
Replying to @kroscoo
I think so. Count based remains easy to understand and can give an intuition base for when you start modeling it as more complex functions
1
11
Consider that a huge number of the poems these models are trained on were written before/during the Great Vowel Shift (much to the dismay of any student who has studied Shakespeare!)
12
1,451
Replying to @sjmielke
We haven’t had any new ideas in ML since the 80s. It’s the re in research
1
11
I took some notes on the Q&A session from the Senate Subcommittee “Dawn of #AI” hearing. cs.utexas.edu/~roller/201611…
10
12
Replying to @typedfemale
hey, we worked really hard on those learning rates!
1
12
2,174
Replying to @typedfemale
i’m pretty sure you called it something else 6-18 months ago.
12
2,962
Never forgiving the Bay for the time the movie theater told me I needed to leave my stuff in my car and then didn’t let me see the movie when I didn’t have a car.
2
11
3,850
Perhaps the most noteworthy thing of the Apple Vision Pro announcement is how many of its devs are tweeting about it. That’s pretty unusual for Apple.
10
1,404
A culture of code review in research has been one of @parlai_parley’s greatest assets.
My first pull request was approved at DeepMind today 🥳 but seriously code review is so amazing for learning. Why does no one in academia do it?
1
10
That said, I’m fairly certain initialization was the biggest confounder. For example, I’ve been able to partially ablate gelu vs relu since and that seems to follow small scale.
1
11
After that, consider the confidence intervals and reflect on how accurately we may or may not be predicting this 2 orders of magnitude out.
Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.
10
5,809
I had never previously experienced THAT level of engineering risk
11