Stephen Roller · Apr 29, 2024 · 3:51 AM UTC

Stephen Roller

@stephenroller

29 Apr 2024

the surreal feeling of someone posting a paper internally, a new hire asking should we try it? and having to answer “yeah we call it flabberblanung internally. we’ve been using it as our default for about 8 months. here’s the code.” over and over and over and over.

355

108,296

Stephen Roller · Jul 15, 2025 · 5:10 PM UTC

Stephen Roller

@stephenroller

15 Jul 2025

We are moving incredibly fast. Come light up GPUs with us.

Mira Murati

@miramurati

15 Jul 2025

Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're excited that in the next couple months we’ll be able to share our first product, which will include a significant open source component and be useful for researchers and startups developing custom models. Soon, we’ll also share our best science to help the research community better understand frontier AI systems. To accelerate our progress, we’re happy to confirm that we’ve raised $2B led by a16z with participation from NVIDIA, Accel, ServiceNow, CISCO, AMD, Jane Street and more who share our mission. We’re always looking for extraordinary talent that learns by doing, turning research into useful things. We believe AI should serve as an extension of individual agency and, in the spirit of freedom, be distributed as widely and equitably as possible. We hope this vision resonates with those who share our commitment to advancing the field. If so, join us. thinkingmachines.paperform.c…

344

40,350

Stephen Roller · May 3, 2022 · 2:02 PM UTC

Stephen Roller

@stephenroller

3 May 2022

Super proud to have worked on this with @suchenzang, @NamanGoyal21 and many others.

AI at Meta

@AIatMeta

3 May 2022

Today Meta AI is sharing OPT-175B, the first 175-billion-parameter language model to be made available to the broader AI research community. OPT-175B can generate creative text on a vast range of topics. Learn more & request access: ai.facebook.com/blog/democra…

324

Stephen Roller · Jan 30, 2023 · 9:36 PM UTC

Stephen Roller

@stephenroller

30 Jan 2023

Happy to announce today is my first day @character_ai.

302

57,910

Stephen Roller · Jun 14, 2024 · 2:09 AM UTC

Stephen Roller

@stephenroller

14 Jun 2024

Some teams use sweeps, heuristics, or scaling laws to determine their training LR. At Character, we just have Noam Shazeer dial it to the right value.

302

169,927

Stephen Roller · Oct 12, 2022 · 12:31 AM UTC

Stephen Roller

@stephenroller

12 Oct 2022

Replying to @srush_nlp

I find people unfamiliar with scaling are shocked by this:

282

Stephen Roller · May 12, 2022 · 4:14 PM UTC

Stephen Roller

@stephenroller

12 May 2022

Huge release from 🤗Transformers, including all the OPT models up to 30B parameters! You can even run OPT models in Colab now!

Hugging Face

@huggingface

12 May 2022

Last week @MetaAI publicly released huge LMs, with up to ☄️30B parameters. Great win for Open-Source🎉 These checkpoints are now in 🤗transformers! But how to use such big checkpoints? Introducing Accelerate and ⚡️BIG MODEL INFERENCE⚡️ Load & USE the 30B model in colab (!)⬇️

We load the checkpoint that is saved on disk and we dispatch it to the devices. At no point is the checkpoint fully loaded in RAM; only parts of it to be dispatched to each device.

We load it as float16 so that we may load more layers at a time on each device for a faster execution time.

251

Stephen Roller · Apr 29, 2020 · 3:06 PM UTC

Stephen Roller

@stephenroller

29 Apr 2020

Really excited to be sharing this with everyone today. Blog post below, paper here: arxiv.org/abs/2004.13637

Recipes for building an open-domain chatbot

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they...

arxiv.org

AI at Meta

@AIatMeta

29 Apr 2020

Today we’re announcing that Facebook AI has built and open-sourced Blender, the largest-ever open-domain chatbot. It outperforms others in terms of engagement and also feels more human, according to human evaluators. ai.facebook.com/blog/state-o…

193

Stephen Roller · Nov 13, 2020 · 9:57 PM UTC

Stephen Roller

@stephenroller

13 Nov 2020

I'm looking for a PhD intern to join me summer 2021 at FAIR NY to work on Conversational AI. Interests include chit chat, task oriented dialogue, large-scale modeling, and evaluation. Apply at facebook.com/careers/jobs/19… and send me a heads up.

174

Stephen Roller · Jan 10, 2020 · 8:20 PM UTC

Stephen Roller

@stephenroller

10 Jan 2020

Mixout (arxiv.org/abs/1909.11299) is a cool way to regularize your large neural network. I quickly wrote an implementation in pytorch that works with (most) arbitrary nn.Modules: gist.github.com/stephenrolle…

Mixout: Effective Regularization to Finetune Large-scale...

In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus....

arxiv.org

165

Stephen Roller · Apr 28, 2017 · 8:55 PM UTC

Stephen Roller

@stephenroller

28 Apr 2017

I have a PhD now. So now when people ask me "is it Stephen with a V or a PH?" I can reply, "It's Stephen with a PhD."

133

Stephen Roller · Nov 9, 2022 · 9:55 PM UTC

Stephen Roller

@stephenroller

9 Nov 2022

Terrible day. A lot of great colleagues gone.

143

Stephen Roller · May 13, 2025 · 12:49 PM UTC

Stephen Roller

@stephenroller

13 May 2025

I once trained hyperbolic (Poincaré) networks with Riemannian SGD and HogWild. Your optimization stack does not scare me.

142

12,996

Stephen Roller · May 4, 2022 · 4:53 AM UTC

Stephen Roller

@stephenroller

4 May 2022

One of my biggest lessons from OPT is that engineering risk and research risk are multiplicative, and research risk can be easier to reduce (simplify to a known baseline).

125

Stephen Roller · Dec 23, 2020 · 11:23 PM UTC

Stephen Roller

@stephenroller

23 Dec 2020

Replying to @andrew_n_carr

I have bad news

105

Stephen Roller · Mar 22, 2023 · 3:44 AM UTC

Stephen Roller

@stephenroller

22 Mar 2023

I have absolutely no idea how BPE-based LMs learn how to rhyme in English — a language with notoriously awful spelling, an absurd number of vowels, and constantly changing pronunciation; let alone with a tokenizer that actively fights morphology. Incredible.

109

26,266

Stephen Roller · Mar 8, 2025 · 5:20 PM UTC

Stephen Roller

@stephenroller

8 Mar 2025

The vibes are so good at Thinky.

107

15,793

Stephen Roller · Jan 21, 2023 · 1:16 AM UTC

Stephen Roller

@stephenroller

21 Jan 2023

Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.

168,195

Stephen Roller · Mar 26, 2023 · 5:35 PM UTC

Stephen Roller

@stephenroller

26 Mar 2023

Why are so many LLM critics fixated on what can be done with a single generation? The really interesting stuff is going to come from allowing for O(n) or O(n log n) generations.

12,722

Stephen Roller · Mar 24, 2023 · 4:23 PM UTC

Stephen Roller

@stephenroller

24 Mar 2023

Character is a blast, and some of the most talented people I've ever worked with. We're hiring too!

Character.AI

@character_ai

24 Mar 2023

Announcing our Series A and our new AI model, C1.2! blog.character.ai/character-…

36,727

Stephen Roller · Feb 22, 2023 · 6:14 AM UTC

Stephen Roller

@stephenroller

22 Feb 2023

If you’re planning on being an employee, focus on places where everyone is a “Member of the Technical Staff” and no other title distinctions exist. They have a healthier view of the constant flexibility of necessary skills.

8,059

Stephen Roller · Dec 28, 2021 · 12:50 AM UTC

Stephen Roller

@stephenroller

28 Dec 2021

One thing I really like about ARR as a reviewer is that I got to read two resubmissions of drafts I previously rejected. The updated manuscripts clearly took my prior comments into account, and the result was significantly improved papers.

Stephen Roller · Apr 21, 2024 · 6:24 AM UTC

Stephen Roller

@stephenroller

21 Apr 2024

Since everyone is piling on Chinchilla again, here’s a simple experiment you can run at home. Train any sized model you want with a token/param ratio of 20, then a double sized model for half as many steps, and a half sized model for double steps. Observe loss curves.

13,407
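
A minimal sketch of the three runs described above. It assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, and that batch size is held fixed so tokens scale with steps; neither assumption is stated in the tweet. The base model size and the names `Run` and `make_runs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    params: int  # model parameters N
    tokens: int  # training tokens D

    @property
    def ratio(self) -> float:
        """Tokens per parameter (D / N)."""
        return self.tokens / self.params

    @property
    def flops(self) -> float:
        # Rule-of-thumb assumption: C ~= 6 * N * D
        return 6.0 * self.params * self.tokens


def make_runs(base_params: int, tokens_per_param: int = 20) -> list[Run]:
    """Build the base run plus the 2x-size/half-steps and half-size/2x-steps runs."""
    base_tokens = base_params * tokens_per_param
    return [
        Run("base", base_params, base_tokens),
        Run("double size, half steps", 2 * base_params, base_tokens // 2),
        Run("half size, double steps", base_params // 2, 2 * base_tokens),
    ]


if __name__ == "__main__":
    # 100M parameters is an arbitrary illustrative base size.
    for run in make_runs(base_params=100_000_000):
        print(
            f"{run.name:26s} N={run.params:.2e} D={run.tokens:.2e} "
            f"D/N={run.ratio:5.1f} FLOPs={run.flops:.2e}"
        )
```

Under the 6ND approximation all three runs cost the same compute, while their token/param ratios are 20, 5, and 80, so any difference in the loss curves comes from how the budget is split rather than from its size.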

Stephen Roller · Jul 27, 2021 · 12:21 AM UTC

Stephen Roller

@stephenroller

27 Jul 2021

Training a particularly precarious model right now. Making a sacrifice to the god of NaNs.

Stephen Roller · Jun 15, 2016 · 4:25 PM UTC

Stephen Roller

@stephenroller

15 Jun 2016

I took notes on yesterday's #naacl2016 deep learning panel and decided to post them online: cs.utexas.edu/~roller/naacl2…

Stephen Roller · Dec 14, 2015 · 10:15 PM UTC

Stephen Roller

@stephenroller

14 Dec 2015

Replying to @NRO

@NRO @powerlineUS how to lie with a y-axis, from people who believe others are lying with a y-axis

Stephen Roller · May 22, 2015 · 8:47 PM UTC

Stephen Roller

@stephenroller

22 May 2015

Germans this time of year:

Stephen Roller · Jun 13, 2023 · 4:07 AM UTC

Stephen Roller

@stephenroller

13 Jun 2023

I feel like multiple papers are rediscovering some of the premises of arxiv.org/abs/2012.14983

Reducing conversational agents' overconfidence through...

While improving neural dialogue agents' factual accuracy is the object of much research, another important aspect of communication, less studied in the setting of neural dialogue, is transparency...

arxiv.org

ikka

@Shahules786

12 Jun 2023

Interesting paper that indicates that LLMs do have information on truths even when their output indicates otherwise. They also propose a new method that improved LLAMA 7B’s truthfulness from 32% to 65%! arxiv.org/abs/2306.03341 1/🧵

11,454

Stephen Roller · Apr 29, 2024 · 3:54 AM UTC

Stephen Roller

@stephenroller

29 Apr 2024

there are two lessons here for those who are paying attention

10,404

Stephen Roller · Oct 12, 2022 · 3:00 AM UTC

Stephen Roller

@stephenroller

12 Oct 2022

One of the most wonderful things about being a senior researcher is pointing junior people to problems and knowing they’ll do *much* better than if you tried yourself.

Stephen Roller · Nov 25, 2020 · 2:06 PM UTC

Stephen Roller

@stephenroller

25 Nov 2020

Replying to @twiecki

90% of its benefit is as documentation, and the other 10% is discouraging people from input/return values that are Tuple[str, Dict, Tuple[str, float]] madness.
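
A small illustration of the point, assuming the reply is about Python type annotations. The function and class names (`summarize_opaque`, `Summary`, `TopWord`) are hypothetical. The first signature is legal but tells the caller little; the dataclass version carries the same data with named fields.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def summarize_opaque(text: str) -> Tuple[str, Dict[str, int], Tuple[str, float]]:
    """Nested-tuple return: the caller has to remember what each slot means."""
    counts: Dict[str, int] = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    # Assumes text contains at least one word; max() raises on empty input.
    top = max(counts, key=counts.get)
    return text, counts, (top, counts[top] / sum(counts.values()))


@dataclass(frozen=True)
class TopWord:
    word: str
    frequency: float  # share of all words in the text


@dataclass(frozen=True)
class Summary:
    text: str
    counts: Dict[str, int]
    top: TopWord


def summarize(text: str) -> Summary:
    """Same computation, but the return type documents itself."""
    _, counts, (word, frequency) = summarize_opaque(text)
    return Summary(text=text, counts=counts, top=TopWord(word, frequency))


if __name__ == "__main__":
    result = summarize("to be or not to be")
    print(result.top.word, result.top.frequency)
```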

Stephen Roller · Mar 26, 2025 · 7:56 PM UTC

Stephen Roller

@stephenroller

26 Mar 2025

ray on slurm on kubernetes on hypervisor on borg

2,227

Stephen Roller · Mar 25, 2022 · 1:41 PM UTC

Stephen Roller

@stephenroller

25 Mar 2022

My favorite thing about this work is how much it manages to replace prior specialized architectures with clever multitask objectives.

Jason Weston

@jaseweston

25 Mar 2022

🚨 New work 🚨 SeeKeR: An open source search-augmented language model - uses a search engine to stay up-to-date - hallucinates less & is more topical than GPT2 or GPT3, with less parameters - applied to dialogue, superior to BlenderBot 2 Read more here: parl.ai/projects/seeker

Stephen Roller · Feb 25, 2023 · 4:22 AM UTC

Stephen Roller

@stephenroller

25 Feb 2023

Which part of RLHF is most important?

10% RLHF combined

7% RL

47% HF

37% F

356 votes • Final results

31,684

Stephen Roller · Apr 29, 2024 · 4:33 AM UTC

Stephen Roller

@stephenroller

29 Apr 2024

Replying to @typedfemale

i deleted the first version of this tweet where i actually leaked our internal name for today’s paper. the name is too much better than the paper’s. it’s all a distraction anyway

7,678

Stephen Roller · Aug 5, 2022 · 3:44 PM UTC

Stephen Roller

@stephenroller

5 Aug 2022

BlenderBot is insanely fun to talk to. Talk to it at blenderbot.ai.

AI at Meta

@AIatMeta

5 Aug 2022

(1/4) Meet BlenderBot 3, the first publicly available 175B-parameter chatbot with model weights, code & datasets. It can chat about nearly any topic & is designed to learn & improve by conversing with people in the real world. Try the interactive demo: bit.ly/3Pf2s2t

Stephen Roller · Feb 25, 2023 · 4:37 AM UTC

Stephen Roller

@stephenroller

25 Feb 2023

to vote myself: i thought HF until Anthropic’s Constitutional AI paper. Now I lean towards just F.

2,297

Stephen Roller · May 13, 2021 · 6:44 PM UTC

Stephen Roller

@stephenroller

13 May 2021

Replying to @ezyang

An entire class of PR review comments is eliminated, saving time and energy. No one is happy with what the autoformatter does to their code, but everyone is happy with what it does to others’.

Stephen Roller · Apr 29, 2024 · 4:10 AM UTC

Stephen Roller

@stephenroller

29 Apr 2024

this is a story about how closed source becomes open source

4,921

Stephen Roller · Oct 12, 2022 · 1:45 AM UTC

Stephen Roller

@stephenroller

12 Oct 2022

Replying to @stephenroller @srush_nlp

Also all modern scaling strategies are highly synchronous. Which means one bad node can tank the entire system. I would love our future researchers to be thinking about this.

Stephen Roller · Oct 7, 2021 · 10:58 AM UTC

Stephen Roller

@stephenroller

7 Oct 2021

Replying to @cHHillee

Are we like two papers away from “all you need is gabor filters and an SVM”?

Stephen Roller · Apr 28, 2021 · 7:27 PM UTC

Stephen Roller

@stephenroller

28 Apr 2021

NeuCAIR Workshop @iclr_conf (May 7, 2021) is excited to have a day discussing topics broadly related to Neural #ConversationalAI, with applications in task-oriented dialogue, chitchat, healthcare, and education. For more information and schedule: sites.google.com/view/neucai…

NeuCAIR @ ICLR-21

Joining workshop On the day of workshop, please enter sessions through this link: https://iclr.cc/virtual/2021/workshop/2133

sites.google.com

Stephen Roller · May 3, 2022 · 2:21 PM UTC

Stephen Roller

@stephenroller

3 May 2022

Replying to @stephenroller @suchenzang @NamanGoyal21

I'm also unreasonably obsessed with our logbook github.com/facebookresearch/…

Stephen Roller · Aug 27, 2024 · 3:31 PM UTC

Stephen Roller

@stephenroller

27 Aug 2024

Replying to @yoavgo @annargrs

6,510

Stephen Roller · Jul 29, 2021 · 1:32 PM UTC

Stephen Roller

@stephenroller

29 Jul 2021

Pro tip: make your training script's output always human readable first, and have it *also* dump raw logs as a separate, structured JSON file. If you find yourself writing regexes to parse a stdout file, you've missed the opportunity to just dump it in the first place.
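
One way this tip might look in practice, as a sketch only. The `DualLogger` class, the `train_log.jsonl` file name, and the choice of JSON Lines (one JSON object per line) are assumptions for illustration, not something the tweet specifies.

```python
import json
import sys
import time


class DualLogger:
    """Print a readable line for humans and append a JSON record for machines."""

    def __init__(self, jsonl_path, stream=None):
        self._file = open(jsonl_path, "a", encoding="utf-8")
        self._stream = stream  # None means "whatever sys.stdout is at log time"

    def log(self, step, **metrics):
        # Human-readable line: short, aligned, rounded.
        parts = []
        for key, value in metrics.items():
            if isinstance(value, float):
                parts.append(f"{key}={value:.4g}")
            else:
                parts.append(f"{key}={value}")
        out = self._stream if self._stream is not None else sys.stdout
        print(f"[step {step:>6}] " + "  ".join(parts), file=out)

        # Structured record: full precision, no parsing needed later.
        record = {"step": step, "time": time.time(), **metrics}
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()

    def close(self):
        self._file.close()


if __name__ == "__main__":
    logger = DualLogger("train_log.jsonl")
    for step in range(1, 4):
        fake_loss = 1.0 / step  # stand-in for a real training loss
        logger.log(step, loss=fake_loss, lr=3e-4)
    logger.close()
```

Reading the metrics back is then a `json.loads` per line rather than a regex over stdout.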

Stephen Roller · Jul 25, 2022 · 2:23 AM UTC

Stephen Roller

@stephenroller

25 Jul 2022

Replying to @timgill924

Wtf is wrong with you

Stephen Roller · Mar 29, 2024 · 1:14 AM UTC

Stephen Roller

@stephenroller

29 Mar 2024

@character_ai is hiring in nyc too!

Jonathan Frankle

@jefrankle

28 Mar 2024

Move to NYC! We have bagels and culture and public transportation and @srush_nlp and bagels!

4,376

Stephen Roller · Sep 10, 2023 · 12:01 PM UTC

Stephen Roller

@stephenroller

10 Sep 2023

Replying to @Thom_Wolf

After that, repeat the exercise with a corpus of Korean, Hindi, or some other language with a relatively underrepresented writing system.

3,200

Stephen Roller · Feb 25, 2021 · 3:09 AM UTC

Stephen Roller

@stephenroller

25 Feb 2021

NeuCAIR Workshop @iclr_conf (May 7, 2021) solicits novel contributions that relate broadly to Neural #ConversationalAI, with applications in task-oriented dialogue, chitchat, healthcare, and education. New submission deadline: 11:59pm Mar 4 AOE. sites.google.com/view/neucai… 1/4

NeuCAIR @ ICLR-21

Joining workshop On the day of workshop, please enter sessions through this link: https://iclr.cc/virtual/2021/workshop/2133

sites.google.com

Stephen Roller · Sep 23, 2023 · 4:32 AM UTC

Stephen Roller

@stephenroller

23 Sep 2023

Saw some h100s irl today. The hot aisle has a different vibe with them.

3,917

Stephen Roller · Jul 11, 2024 · 11:39 PM UTC

Stephen Roller

@stephenroller

11 Jul 2024

Replying to @chrmanning

I kinda wish groups did a *secret* held out test set and told no one about it and then 6 months later released with a message “surprise! time to see who’s overfitting!”

4,816

Stephen Roller · May 3, 2022 · 1:37 AM UTC

Stephen Roller

@stephenroller

3 May 2022

Replying to @MasterScrat @arankomatsuzaki @MetaAI

First thing in AM

Stephen Roller · May 7, 2020 · 7:00 PM UTC

Stephen Roller

@stephenroller

7 May 2020

Replying to @hardmaru @blender_org

You’re right, we should have used another name. We are changing it.

Stephen Roller · Jan 21, 2023 · 2:57 AM UTC

Stephen Roller

@stephenroller

21 Jan 2023

let’s just say it might change your view of the paper some.

6,484

Stephen Roller · Apr 3, 2023 · 11:25 PM UTC

Stephen Roller

@stephenroller

3 Apr 2023

2,399

Stephen Roller · Apr 15, 2023 · 2:16 AM UTC

Stephen Roller

@stephenroller

15 Apr 2023

Do you ever read the wikipedia entries on extremely basic concepts (for me today, “food”) just to see how such things are defined precisely?

3,565

Stephen Roller · Apr 8, 2024 · 7:33 PM UTC

Stephen Roller

@stephenroller

8 Apr 2024

Totality was worth the journey.

5,210

Stephen Roller · May 4, 2022 · 5:23 AM UTC

Stephen Roller

@stephenroller

4 May 2022

Second biggest confounder in my mind is fp16. I regret not having bfloat ready in time (falls into eng risk bucket).

Stephen Roller · Jan 31, 2024 · 5:53 AM UTC

Stephen Roller

@stephenroller

31 Jan 2024

At @character_ai we call them moé models.

3,395

Stephen Roller · Jun 29, 2022 · 1:39 AM UTC

Stephen Roller

@stephenroller

29 Jun 2022

Congrats to @StasBekman and the entire rest of the team. Y’all keep doing the amazing work.

BigScience Large Model Training @BigScienceLLM

28 Jun 2022

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100%

Stephen Roller · Jul 5, 2022 · 11:01 PM UTC

Stephen Roller

@stephenroller

5 Jul 2022

Replying to @tomgoldsteincs

Perhaps another noteworthy number is that we’ve grown something like 6 orders of magnitude in training flops, but cost has grown only 4 orders of magnitude (mildly less really). I can hold out until at least 2032 before I need all of earth’s electricity ;)

Stephen Roller · Mar 27, 2023 · 4:20 PM UTC

Stephen Roller

@stephenroller

27 Mar 2023

wrap it in a loop

Yann LeCun

@ylecun

26 Mar 2023

Replying to @ylecun

5- they have limited working memory 6- they execute a fixed number of computational steps per generated token 7- hence they are very far from Turing complete 8- Auto-regressive generation is a exponentially-divergent diffusion process, hence not controllable. 3/

5,124

Stephen Roller · Apr 21, 2023 · 3:58 AM UTC

Stephen Roller

@stephenroller

21 Apr 2023

Nothing like a good re-org.

4,150

Stephen Roller · Oct 14, 2021 · 12:38 PM UTC

Stephen Roller

@stephenroller

14 Oct 2021

Excited to announce our new E2E User Simulators for Task Oriented Dialogue with @moyapchen and @pacrook. Now part of @parlai_parley! arxiv.org/abs/2110.06905

Teaching Models new APIs: Domain-Agnostic Simulators for Task...

We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can...

arxiv.org

Stephen Roller · Mar 24, 2023 · 6:28 PM UTC

Stephen Roller

@stephenroller

24 Mar 2023

Replying to @XENOWHITEx

We own our full stack end to end

751

Stephen Roller · May 24, 2023 · 5:50 PM UTC

Stephen Roller

@stephenroller

24 May 2023

Character is now in your pocket.

Character.AI

@character_ai

24 May 2023

The same Characters you love, now in the palm of your hand. Download the official #CharacterAI Mobile App for 𝗙𝗥𝗘𝗘 on iOS and Android. 𝗶𝗢𝗦: bit.ly/downloadcai_ios 𝗔𝗻𝗱𝗿𝗼𝗶𝗱: bit.ly/downloadcai_and

2,198

Stephen Roller · May 24, 2024 · 3:29 AM UTC

Stephen Roller

@stephenroller

24 May 2024

A really fascinating thing about deploying LLMs at scale is that you can mutually confuse training-time bugs and serving-time bugs: they appear indistinguishable to the users providing feedback.

3,125

Stephen Roller · Jul 25, 2022 · 2:51 AM UTC

Stephen Roller

@stephenroller

25 Jul 2022

Replying to @timgill924

No, you’re in real life and grad students are real people. Don’t tweet things that exacerbate their perpetual existential dread.

Stephen Roller · Mar 6, 2023 · 6:45 PM UTC

Stephen Roller

@stephenroller

6 Mar 2023

Replying to @O42n2 @NVIDIAAI

that’s a funny definition of open source. they distribute binary kernels

2,286

Stephen Roller · Jan 21, 2023 · 3:35 AM UTC

Stephen Roller

@stephenroller

21 Jan 2023

Replying to @main_horse

let’s say there is up to an order of magnitude delta in some of the predictions

3,801

Stephen Roller · Nov 3, 2020 · 4:43 PM UTC

Stephen Roller

@stephenroller

3 Nov 2020

Shout out to the guy who brought all the poll workers coffee and donuts while I was voting.

Stephen Roller · Mar 8, 2023 · 11:21 PM UTC

Stephen Roller

@stephenroller

8 Mar 2023

The undocumented XID errors just taste better. More fresh.

4,931

Stephen Roller · Apr 7, 2017 · 7:21 PM UTC

Stephen Roller

@stephenroller

7 Apr 2017

Submitted my dissertation to my committee. Defense is scheduled in 3 weeks. No idea what to do until then.

Stephen Roller · Oct 20, 2015 · 5:18 AM UTC

Stephen Roller

@stephenroller

20 Oct 2015

So are there a bunch of conservative Canadians now like, "that's it. I'm moving to America"?

Stephen Roller · Feb 18, 2023 · 6:37 PM UTC

Stephen Roller

@stephenroller

18 Feb 2023

I love that LMs trained on internet comments will produce a helpful sounding answer with a link to a supporting YouTube video, and it's just rickrolling you.

1,330

Stephen Roller · May 31, 2022 · 3:39 PM UTC

Stephen Roller

@stephenroller

31 May 2022

Replying to @arankomatsuzaki

I've done the comparison a few times on BlenderBot1 (2.7B params). I never got clear, conclusive results. I decided the additional engineering overhead of being able to incorporate those states wasn't worth it, so now @parlai_parley always resets.

Stephen Roller · Sep 7, 2016 · 9:07 PM UTC

Stephen Roller

@stephenroller

7 Sep 2016

I suppose I need to make it official... *Updates twitter profile to say "PhD Candidate"*

Stephen Roller · Jun 16, 2023 · 2:51 PM UTC

Stephen Roller

@stephenroller

16 Jun 2023

Replying to @srush_nlp

arthur szlam had a nice paper from gpt2 days that found they were distinguishable only by a strictly more powerful model

1,885

Stephen Roller · Jan 16, 2023 · 1:55 AM UTC

Stephen Roller

@stephenroller

16 Jan 2023

Replying to @typedfemale

ngl this is how i view that trick after having tried a dozen other things

ALT Ol Reliable Spongebob GIF

2,703

Stephen Roller · Jan 9, 2021 · 5:30 PM UTC

Stephen Roller

@stephenroller

9 Jan 2021

Replying to @tallinzen

Das Leben der Anderen

Stephen Roller · Mar 1, 2024 · 9:00 PM UTC

Stephen Roller

@stephenroller

1 Mar 2024

Replying to @srush_nlp

I imagine the RNN knows roughly what needs to be placed but can’t do it with fidelity. In phonebook lookup, the RNN might output 10 random digits, all with relatively low entropy: good ppl because it eliminated most of the vocab, but terrible acc. It’s like we forgot Luong et al. (2015)…

1,610

Stephen Roller · Mar 15, 2023 · 2:51 AM UTC

Stephen Roller

@stephenroller

15 Mar 2023

hole: why would the clear leader correlate pricing with marginal costs? either overcharge for gpt4 (unique capabilities) or more likely overcharge for gpt3.5-turbo (mass usage)

822

Stephen Roller · Dec 28, 2022 · 1:04 AM UTC

Stephen Roller

@stephenroller

28 Dec 2022

I really appreciate @yoavgo noting that pretraining on code + SFT + RLHF (probably) makes the Octopus assumptions invalid. I feel we have been having the same stale, unprovable arguments for 2+ years and this is something new.

14,411

Stephen Roller · Nov 28, 2020 · 7:59 PM UTC

Stephen Roller

@stephenroller

28 Nov 2020

Replying to @morgymcg @Tim_Dettmers

I had a model of size X without embeddings. I plugged the number into their equation and rounded to a reasonable human number, trained, and got better ppl than with the smaller learning rate I had been using for 6 months. It actually forced me to go redo my baseline.

Stephen Roller · May 5, 2024 · 12:11 AM UTC

Stephen Roller

@stephenroller

5 May 2024

Replying to @mike64_t

i bought clear protractors for my entire team

1,998

Stephen Roller · Jul 16, 2020 · 4:15 PM UTC

Stephen Roller

@stephenroller

16 Jul 2020

Replying to @yoavgo

FP16 is about 3x faster if you have the right hardware. Adaptive batching is about 3x faster. HF tokenizers is critical.

Stephen Roller · Oct 12, 2022 · 12:59 AM UTC

Stephen Roller

@stephenroller

12 Oct 2022

Replying to @stephenroller @srush_nlp

Tbh building a flops calculator is a pretty good homework assignment…
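
A starting point for that homework, as a rough sketch. It relies on two widely used approximations that are assumptions here, not part of the thread: non-embedding parameters ≈ 12 × layers × d_model² (attention plus a 4x-wide MLP), and training compute ≈ 6 × parameters × tokens. The hardware numbers in the usage example are illustrative placeholders.

```python
def transformer_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameters: 12 * L * d^2.

    Assumes 4*d^2 for the attention projections and 8*d^2 for an MLP
    with a hidden size of 4*d, per layer.
    """
    return 12 * n_layers * d_model ** 2


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def training_days(
    flops: float,
    n_devices: int,
    peak_flops_per_device: float,
    utilization: float,
) -> float:
    """Wall-clock days at a given fraction of peak throughput."""
    seconds = flops / (n_devices * peak_flops_per_device * utilization)
    return seconds / 86_400


if __name__ == "__main__":
    # Illustrative shape: 96 layers, d_model = 12288, 300B tokens.
    n = transformer_params(n_layers=96, d_model=12288)
    c = training_flops(n, 300e9)
    # Placeholder cluster: 1000 devices, 312e12 peak FLOP/s each, 40% utilization.
    days = training_days(c, n_devices=1000, peak_flops_per_device=312e12, utilization=0.4)
    print(f"params ~ {n:.3e}, train FLOPs ~ {c:.3e}, ~{days:.0f} days")
```

Varying the utilization or device count is a quick way to see how sensitive the wall-clock estimate is to each input.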

Stephen Roller · Feb 10, 2017 · 9:15 PM UTC

Stephen Roller

@stephenroller

10 Feb 2017

I procrastinated on my dissertation by making a death clock which counts down until the University dissertation deadline ☠️💣

Stephen Roller · Nov 2, 2022 · 12:51 PM UTC

Stephen Roller

@stephenroller

2 Nov 2022

Replying to @kroscoo

I think so. Count based remains easy to understand and can give an intuition base for when you start modeling it as more complex functions

Stephen Roller · Mar 22, 2023 · 3:54 AM UTC

Stephen Roller

@stephenroller

22 Mar 2023

Consider that a huge number of the poems these models are trained on were written before/during the Great Vowel Shift (much to the dismay of any student who has studied Shakespeare!)

1,451

Stephen Roller · Jun 25, 2021 · 8:52 PM UTC

Stephen Roller

@stephenroller

25 Jun 2021

Replying to @sjmielke

We haven’t had any new ideas in ML since the 80s. It’s the re in research

Stephen Roller · Nov 30, 2016 · 9:34 PM UTC

Stephen Roller

@stephenroller

30 Nov 2016

I took some notes on the Q&A session from the Senate Dawn of #AI subcommittee hearing. cs.utexas.edu/~roller/201611…

Stephen Roller · Jan 16, 2023 · 1:52 AM UTC

Stephen Roller

@stephenroller

16 Jan 2023

Replying to @typedfemale

hey, we worked really hard on those learning rates!

2,174

Stephen Roller · Apr 29, 2024 · 4:35 AM UTC

Stephen Roller

@stephenroller

29 Apr 2024

Replying to @typedfemale

i’m pretty sure you called it something else 6-18 months ago.

2,962

Stephen Roller · Jun 11, 2023 · 1:44 AM UTC

Stephen Roller

@stephenroller

11 Jun 2023

Never forgiving the Bay for the time the movie theater told me I needed to leave my stuff in my car and then didn’t let me see the movie when I didn’t have a car.

3,850

Stephen Roller · Jun 6, 2023 · 1:03 AM UTC

Stephen Roller

@stephenroller

6 Jun 2023

Perhaps the most noteworthy thing of the Apple Vision Pro announcement is how many of its devs are tweeting about it. That’s pretty unusual for Apple.

1,404

Stephen Roller · Jun 24, 2021 · 9:55 PM UTC

Stephen Roller

@stephenroller

24 Jun 2021

A culture of code review in research has been one of @parlai_parley’s greatest assets.

Hannah Sheahan @hannahsheahan

24 Jun 2021

My first pull request was approved at DeepMind today 🥳 but seriously code review is so amazing for learning. Why does no one in academia do it?

Stephen Roller · May 4, 2022 · 5:22 AM UTC

Stephen Roller

@stephenroller

4 May 2022

That said, I’m fairly certain initialization was the biggest confounder. For example, I’ve been able to partially ablate gelu vs relu since and that seems to follow small scale.

Stephen Roller · Apr 21, 2024 · 7:01 AM UTC

Stephen Roller

@stephenroller

21 Apr 2024

After that, consider the confidence intervals and reflect on how accurately we may or may not be predicting this 2 orders of magnitude out.

Stephen Roller

@stephenroller

21 Jan 2023

Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.

5,809

Stephen Roller · May 4, 2022 · 5:09 AM UTC

Stephen Roller

@stephenroller

4 May 2022

I had never previously experienced THAT level of engineering risk