David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)

ALT a very spiky training loss curve for marin 32b

Percy Liang

@percyliang

22 May 2025

Marin 32B training crossed 1.5 trillion tokens today...

103

1,031

307,298

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on @StanfordCRFM. Levanter is designed to be legible, scalable and reproducible. crfm.stanford.edu/2023/06/16…

392

142,652

David Hall · Feb 7, 2022 · 5:00 PM UTC

David Hall @dlwh

7 Feb 2022

I'm excited to announce I'm joining Stanford CRFM (crfm.stanford.edu) to lead the engineering effort to improve the accessibility of foundation models. Really looking forward to working with @percyliang and everyone else there!

183

David Hall · Mar 13, 2024 · 5:31 PM UTC

David Hall @dlwh

13 Mar 2024

Last summer we announced the Sophia optimizer, a successor to Adam that can achieve up to 2x gains over Adam. We’ve now merged mainline support into Levanter! Check out @tengyuma’s original thread for how Sophia works: nitter.app/tengyuma/status/166141… @HongLiu9903 github.com/stanford-crfm/lev…

Tengyu Ma

@tengyuma

24 May 2023

Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️

146

42,567

David Hall · Mar 26, 2024 · 9:56 PM UTC

David Hall @dlwh

26 Mar 2024

I like to talk about Levanter’s performance, reproducibility, and scalability, but it’s also portable! So portable you can even switch from TPU to GPU in the middle of a run, and then switch back again! github.com/stanford-crfm/lev…

ALT Graph showing a training run being transferred from GPU to TPU and the curves staying consistent

139

51,770

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

(If your attention span has been irreparably damaged by chronic scrolling, the answer is QK Norm, but the story is the fun part! Or here's the report version of this thread: api.wandb.ai/links/marin-com…)

Marin 32B Work In Progress

links-cdn.wandb.ai

140

12,615

David Hall · May 8, 2023 · 5:55 PM UTC

David Hall @dlwh

8 May 2023

For fun, I've been working with @achewood to create an LLM-backed advice bot to celebrate the relaunch of my favorite web comic.

How do you feel about artificial intelligence bots writing advice columns? —David
Dear David,
I don’t. Any place on the Internet where you can type a question and a robot will answer it is kind of insulting to intelligence, I think. I do not consider myself a bot and I will report you to America Online, where I am a Platinum Power User.

100

12,299

David Hall · Jun 26, 2025 · 5:04 PM UTC

David Hall @dlwh

26 Jun 2025

And now it's looking great! So, more norms good. Also, and more importantly, you can totally just change something major mid-run and it'll be okay. Or, as the meme goes, you can just do things.

ALT old loss curve and new loss curve continuing the old one, showing the 32B being nice and stable with QK Norm

117

8,330

David Hall · Mar 27, 2024 · 6:32 PM UTC

David Hall @dlwh

27 Mar 2024

FP8 support has landed in Levanter (and Haliax)! On H100, you can now get a >40%(!) throughput improvement by flipping a flag. Just add `trainer.fp8: true` to your config and you’re good to go! H100s not included. github.com/stanford-crfm/lev…

ALT Plot showing throughput improvement for GPT2 6.7B and Llama 2 13B on 8xh100. 42% and 35% respectively.

23,750

David Hall · Apr 22, 2024 · 5:53 AM UTC

David Hall @dlwh

22 Apr 2024

I made this little project for turning jax's jitted functions back into python code, mostly so you can minimize bug reports etc. It's horrible and hacky but maybe someone will find it useful. github.com/dlwh/jax_sourcero…

GitHub - dlwh/jax_sourceror: Turn jitted jax functions back into python source code

github.com

6,696

David Hall · Jun 26, 2025 · 5:04 PM UTC

David Hall @dlwh

26 Jun 2025

Anyway, here's the report! api.wandb.ai/links/marin-com…

Marin 32B Work In Progress

links-cdn.wandb.ai

6,804

David Hall · May 21, 2025 · 8:59 PM UTC

David Hall @dlwh

21 May 2025

Come read about all the mistakes I made along the way to beating Llama 3.1 8B on 14/19 benchmarks. We trained from scratch, made plenty of wrong turns, and learned a lot.

Percy Liang

@percyliang

21 May 2025

For a rare look into how LLMs are really built, check out @dlwh's retrospective on how we trained the Marin 8B model from scratch (and outperformed Llama 3.1 8B base). It’s an honest account with all the revelations and mistakes we made along our journey. Papers are forced to hide the mess, but the real science happens in the process. marin.readthedocs.io/en/late…

10,365

David Hall · May 19, 2025 · 6:17 PM UTC

David Hall @dlwh

19 May 2025

Super excited Marin is finally out! Come see what we've been building! Code/platform for training fully reproducible models end-to-end, from data to evals. Plus a new high quality 8B base model, fully documented from start to finish.

Percy Liang

@percyliang

19 May 2025

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

9,501

David Hall · Jun 6, 2014 · 5:48 AM UTC

David Hall @dlwh

6 Jun 2014

We've released our new GPU-based natural language parser, Puck. It can parse over half a million words per minute. github.com/dlwh/puck

David Hall · Jun 26, 2025 · 5:09 PM UTC

David Hall @dlwh

26 Jun 2025

Thanks in particular to @soldni who told me about the OLMo spike-skipping logic. I think things were too unstable for us to benefit sufficiently from it, but we left it on too.

6,778

David Hall · Jul 30, 2025 · 8:50 PM UTC

David Hall @dlwh

30 Jul 2025

So first congrats to Arcee on the release. On the other hand, look how Marin is doing on this leaderboard we didn't even know about??

Lucas Atkins

@latkins

29 Jul 2025

Replying to @latkins

Our preview model actually tied at #2 for a while on the @yupp_ai leaderboard, when filtered for 2-5 turns. It has since gone further down, but I do think this speaks to the charm that this model has, which we haven't quite figured out how to evaluate.

6,345

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

**BUT** we aren't going to start from scratch. That is not the patented Marin Tootsie Process™️ way. No flop left behind. Add QK Norm, warmstart, keep the optimizer states, just rewarmup the learning rate. (Worst case, it blows up and we eventually throw it out.)

7,009
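
To make the warmstart recipe above concrete, here is a minimal sketch, assuming a toy dict-based checkpoint. The function names, the `q_norm_gain` parameter, the step numbers, and the peak learning rate are all hypothetical illustrations, not Levanter's or Marin's actual code. It keeps the old weights and optimizer moments, starts the newly added parameters with zeroed moments, and ramps the learning rate linearly back up from zero.

```python
def rewarmed_lr(step, restart_step, peak_lr, rewarmup_steps):
    """Learning rate after a warmstart: linear ramp from 0 back to peak_lr."""
    steps_since_restart = step - restart_step
    if steps_since_restart < 0:
        raise ValueError("step is before the warmstart")
    if steps_since_restart >= rewarmup_steps:
        return peak_lr
    return peak_lr * steps_since_restart / rewarmup_steps


def warmstart(checkpoint, new_params):
    """Carry over weights and optimizer state, adding freshly initialized params."""
    params = {**checkpoint["params"], **new_params}
    opt_state = dict(checkpoint["opt_state"])
    for name, value in new_params.items():
        # New parameters have no history, so their moments start at zero.
        opt_state[name] = {"m": [0.0] * len(value), "v": [0.0] * len(value)}
    return {"params": params, "opt_state": opt_state, "step": checkpoint["step"]}


# Toy checkpoint with hypothetical values.
ckpt = {
    "params": {"wq": [0.1, 0.2]},
    "opt_state": {"wq": {"m": [0.01, 0.0], "v": [0.001, 0.002]}},
    "step": 1000,
}
state = warmstart(ckpt, {"q_norm_gain": [1.0, 1.0]})
lrs = [rewarmed_lr(s, restart_step=1000, peak_lr=3e-4, rewarmup_steps=100)
       for s in (1000, 1050, 1100)]
```

The existing moments for `wq` survive untouched, which is the "keep the optimizer states" part. Only the learning rate schedule restarts.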

David Hall · Mar 18, 2024 · 5:30 PM UTC

David Hall @dlwh

18 Mar 2024

@StanfordCRFM and the @NVIDIA JAX Team have worked together to integrate TransformerEngine into our foundation model training framework, Levanter. The result? Levanter is now significantly faster on GPUs, with up to 50% more tokens per second! github.com/stanford-crfm/lev… @itsvadams

Model Flops Utilization for Levanter with and without TransformerEngine's Fused Attention Kernel.

At 1.5B scale, we get a 50% improvement in throughput. The improvement falls off at larger scales, but is still substantial.

11,126

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

But then it happened: a Bad Spike. A spike where the loss didn't recover to the same plateau. Everyone has told us this is Bad News. In absolute terms, the loss spike was nothing. But it just didn't settle back. I dunno why. (Same y axis as the previous "fine" spike.)

11,833

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

TL;DR: The warmstart QK Norm run caught up real quick (about 200 steps, 6.5B tokens), overshooting (due to warmup) before settling in at just a bit better.

ALT QKNorm+Warmstart catching up in about 200 steps

7,090
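
As a quick sanity check on "about 200 steps, 6.5B tokens", here is the arithmetic, assuming the catch-up run used the 32Mi-token batch size mentioned elsewhere in this thread (an assumption; this tweet itself does not state the batch size). The helper names are made up for the example.

```python
MIB = 2 ** 20  # "Mi" = 1,048,576


def tokens_processed(steps, batch_tokens):
    """Total tokens seen after a number of optimizer steps."""
    return steps * batch_tokens


def steps_for_tokens(tokens, batch_tokens):
    """Number of steps needed to see a given token count."""
    return tokens / batch_tokens


batch_tokens = 32 * MIB
tokens_at_200_steps = tokens_processed(200, batch_tokens)       # 6,710,886,400
steps_at_6_5b = round(steps_for_tokens(6.5e9, batch_tokens), 1)  # 193.7
```

So 6.5B tokens at that batch size is roughly 194 steps, which matches "about 200".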

David Hall · Jun 26, 2025 · 5:38 PM UTC

David Hall @dlwh

26 Jun 2025

Oh and if you made it this far, you should come hang out in our discord: marin.community/ for the link

Marin

marin.community

7,228

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

We did notice that update spikes always preceded loss spikes, similar to what arxiv.org/abs/2304.13013 found. So we were pretty hopeful about update clipping. (Updates can go to 0 b/c of our OLMo2 style update skipping.)

ALT Plot of training loss and update norms, showing that update norms would spike before loss spikes

9,013
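
For readers unfamiliar with update clipping, here is one possible form of it, as a sketch only: the optimizer's update (not the raw gradient) is rescaled whenever its norm exceeds a multiple of a running average of recent update norms. The function name, `max_ratio`, and `decay` are hypothetical choices for illustration. This is not the scheme from the cited paper and not necessarily what the Marin run used.

```python
import math


def clip_update(update, running_norm, max_ratio=2.0, decay=0.99):
    """Rescale an update whose norm far exceeds the recent average.

    Returns the (possibly rescaled) update and the new running norm.
    """
    norm = math.sqrt(sum(u * u for u in update))
    if running_norm <= 0.0:
        # No history yet: accept the update and seed the running average.
        return list(update), norm
    limit = max_ratio * running_norm
    if norm > limit:
        update = [u * limit / norm for u in update]
        norm = limit
    new_running = decay * running_norm + (1.0 - decay) * norm
    return list(update), new_running


# An update of norm 5 against a running norm of 1 is scaled down to norm 2.
clipped, running = clip_update([3.0, 4.0], running_norm=1.0)
```

Because the running average is updated with the clipped norm, a single spike cannot drag the threshold up with it.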

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

Now, look, we knew QK Norm was a good idea. We just thought it wasn't a **necessary** idea, not for us. We were different. Anyway, let's fix it.

7,796

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

So, pretty good right? It's running way ahead. To be clear, the training data is different (mostly Nemotron-CC instead of mostly DCLM), and the batch size is much, much larger, at 32Mi tokens instead of 4-12Mi.

9,369

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

Aside: The Muon run was still warming up its Adam params here so the loss was lower. Then it decided to go to space. Again, I'm sure insufficient tuning. Also, the Muon run did its bad shift a little later? Might be worth investigating.

ALT Loss curve showing Muon looking great then deciding it would be gradient ascent.

8,084

David Hall · Mar 22, 2024 · 5:04 PM UTC

David Hall @dlwh

22 Mar 2024

Earlier, I shared some throughput numbers for Levanter with @NVIDIA’s TransformerEngine. Today, I wanted to share some scaling experiments we conducted with Radium Cloud. We ran 1.5 and 7B experiments on up to 64 A100s. TL;DR: Linear scaling!

ALT line plot showing linear scaling for 1.5B and 7B models

19,204

David Hall · Feb 17, 2022 · 6:24 AM UTC

David Hall @dlwh

17 Feb 2022

How have I not seen the PFSC "sick of graphs" comic even once in the last two years?

reporter interviewing person: "tell me what precautions your family has taken this epidemic" person: "i don't, i don't care". reporter: "the wave of indifference continues to proliferate in affected areas". anchor: "thank you colleen. miles what can you tell us about this widespread lack of interest". standing in front of a graph with x-axis "sick of graphs" and y-axis "who cares", miles says "i just, i don't" and drops the microphone.

David Hall · Jun 26, 2025 · 8:31 PM UTC

David Hall @dlwh

26 Jun 2025

another PS, since the tweet is Doing Numbers: The compute is very generously sponsored by @googlecloud TPU Research Cloud. JAX+TPU determinism has been critical for testing at this scale. Also >50% MFU without breaking too much of a sweat.

4,484

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

It was time to do what everyone else has learned but we were too proud, too foolish to try. (After all, the 22b and 70b trials were buttery smooth! Eval losses were ahead of schedule!) It was time to add QK Norm.

7,217
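
For anyone who hasn't met QK Norm: the idea is to normalize the query and key vectors of each attention head before taking their dot product, so the attention logits cannot grow without bound as the weights grow. Below is a minimal pure-Python sketch for a single query/key pair. It assumes an RMSNorm-style normalization with the learned gain fixed at 1.0; real implementations differ in details (LayerNorm vs. RMSNorm, per-head gains), and this is not Levanter's code.

```python
import math


def rms_norm(vec, gain=1.0, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply a gain."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [gain * x / rms for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Deliberately large activations, standing in for a head that has drifted.
q = [100.0, -50.0, 25.0, 10.0]
k = [80.0, -40.0, 30.0, 5.0]
raw = attention_logit(q, k, qk_norm=False)    # scales with |q| * |k|
normed = attention_logit(q, k, qk_norm=True)  # bounded by sqrt(head_dim)
```

With unit gain, each normalized vector has norm at most sqrt(head_dim), so the scaled logit is bounded by sqrt(head_dim) in magnitude no matter how large the raw activations get.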

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

We also thought we were okay because the model was massively outperforming our other test models on a flop for flop basis. Recall that the 8b run (the gray top lines) was ultimately on par with Llama 3.1. Orange is the 32B.

ALT various eval losses for several test runs we did. the 32B run is doing way better.

10,169

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

Some told us we were already doomed, some were a bit more hedgy. Some trusted people privately told us that actually things were probably fine.

11,604

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

Nevertheless, we tried some interventions: tightening the grad norm clip, loss and grad outlier skipping, update clipping, etc. Nothing seemed to make a huge difference.

9,103

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

Ideally there wouldn't be spikes of course. But many of the people we talked to (and our own experience) suggested that if the model recovered quickly and it didn't really change the trajectory, it was fine. And things did recover pretty quickly, in terms of step count.

ALT Loss spike that recovered pretty quickly

11,372

David Hall · Jun 26, 2025 · 5:03 PM UTC

David Hall @dlwh

26 Jun 2025

So we tried some stuff. We tried skipping the problematic step. We tried Muon (which looked great until it didn't... Need to spend more time with it at small scales.) We could have tried some other stuff, but it was time to take drastic action to end the spikes.

ALT A few attempts at saving the 32B: Muon, more clipping. Didn't work

7,945

David Hall · Jun 26, 2023 · 8:01 PM UTC

David Hall @dlwh

26 Jun 2023

We've released v1.0.1 of our scalable named tensor library Haliax, now available as its own package on PyPI, with minimal deps beyond Jax and Equinox! pypi.org/project/haliax/

haliax

Named Tensors for Legible Deep Learning in JAX

pypi.org

10,107

David Hall · Oct 24, 2023 · 7:36 PM UTC

David Hall @dlwh

24 Oct 2023

In addition to the Haliax release, we released 1.1 of Levanter, including support for Llama models, pure-JAX Flash Attention impl, and preliminary HF-PEFT-compatible LoRA support. As part of the release, we put together a tutorial on reproducing Alpaca. levanter.readthedocs.io/en/l…

9,923

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

With Levanter, we also introduce Haliax, a new named tensor module that makes deep learning code easier to read, understand, and compose. Named tensors are a more intuitive abstraction than the usual positional axes. You can learn more about Haliax here: colab.research.google.com/dr…

ALT Named Attention Snippet in Haliax (see blogpost for full snippet)

ALT positional attention in minGPT (see blogpost for full snippet)

5,284

David Hall · Jan 24, 2024 · 5:00 PM UTC

David Hall @dlwh

24 Jan 2024

Levanter isn’t just for pre-training! You can get the same benefits of scalability, legibility, and reproducibility when fine-tuning as well! We wrote a tutorial on how to replicate Alpaca. levanter.readthedocs.io/en/l…

28,730

David Hall · Sep 4, 2025 · 10:31 PM UTC

David Hall @dlwh

4 Sep 2025

Please participate in the Marin speedrun! If you can write up some code for an optimizer we can run it and create a cool scaling ladder plot comparing to baselines!

ALT scaling law plot of Adam vs other optimizers

6,323

David Hall · Apr 11, 2024 · 3:34 AM UTC

David Hall @dlwh

11 Apr 2024

Lots of great libraries here, not least of which is @PatrickKidger’s Equinox, which is one of the core libraries Levanter is built on. github.com/patrick-kidger/eq…

GitHub - patrick-kidger/equinox: Elegant easy-to-use neural networks + scientific computing in JAX....

Elegant easy-to-use neural networks + scientific computing in JAX. https://docs.kidger.site/equinox/ - patrick-kidger/equinox

github.com

Ivan Zhou

@ivanzhouyq

11 Apr 2024

Levanter from @StanfordCRFM has earned recognition at #GoogleCloudNext as a popular #Jax repository to build Foundation Models 🙌 There have been many great improvements to Levanter in the past few months led by @dlwh, particularly in achieving impressive MFU numbers on both TPU and GPU, and support fine-tuning on popular architectures 🚀

6,686

David Hall · Mar 12, 2024 · 4:25 PM UTC

David Hall @dlwh

12 Mar 2024

When we released our foundation model training framework Levanter, we got a lot of requests for LoRA. So we added support! Our implementation works with all Levanter models and produces checkpoints that work with Hugging Face’s PEFT library.

ALT terminal of me typing `python -m levanter.main.lora_lm --config configs/lora_llama2.yaml --data.id math-ai/AutoMathText` and then Levanter doing its thing (through loading weights). Sped up 4x

17,208

David Hall · Jan 26, 2024 · 7:45 PM UTC

David Hall @dlwh

26 Jan 2024

@_jasonw_sy recently added Grouped Query Attention to Levanter’s Llama implementation. GQA is used in the higher parameter count Llama 2 configurations, meaning Levanter now supports the full suite of Llama models! github.com/stanford-crfm/lev…

GitHub - marin-community/levanter: Legible, Scalable, Reproducible Foundation Models with Named...

github.com

13,480

David Hall · Jul 9, 2023 · 6:13 AM UTC

David Hall @dlwh

9 Jul 2023

Saturday night tutorial! This time we're doing Tensor Parallelism in JAX/Haliax/Levanter in just 5 lines of code colab.research.google.com/dr…

These lines of code:

dp_axis_mapping = {"batch": "data"}
fsdp_axis_mapping = {"embed": "data"}
tp_mapping = {"mlp": "model", "head": "model"}
compute_axis_mapping = {**dp_axis_mapping, **tp_mapping}
param_axis_mapping = {**fsdp_axis_mapping, **tp_mapping}

4,364

David Hall · May 2, 2023 · 6:58 PM UTC

David Hall @dlwh

2 May 2023

Come work with me and other really great people at @StanfordCRFM! Lots of cool projects involving training and evaluating foundation models, while still getting to do it all in the open!

Percy Liang

@percyliang

2 May 2023

Interested in building and benchmarking LLMs and other foundation models in a vibrant academic setting? @StanfordCRFM is hiring research engineers! careersearch.stanford.edu/jo… Here are some things that you could be a part of:

3,875

David Hall · Oct 11, 2021 · 6:20 PM UTC

David Hall @dlwh

11 Oct 2021

Replying to @mark_riedl

when i was at MS at a training, a long timer got up and said that at MS, if anything is worth doing there were at least 5 different teams working on it already, and you'd only ever be able to find 4 of them

David Hall · Dec 15, 2015 · 11:22 PM UTC

David Hall @dlwh

15 Dec 2015

#NLProc people: we're hiring to build the next generation of conversational AI interfaces semanticmachines.com/

Microsoft Research – Emerging Technology, Computer, & Software Research

Explore research at Microsoft, a site featuring the impact of research along with publications, products, downloads, and research careers.

microsoft.com

David Hall · Mar 22, 2024 · 3:21 PM UTC

David Hall @dlwh

22 Mar 2024

Replying to @srush_nlp

This is not the ELI5 answer (@gallabytes and others have nailed it) but if you're interested, there's a recent-ish performance guide for TPU that I think explains the perf properties quite well jax.readthedocs.io/en/latest…

2,827

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Thanks to JAX, Levanter also offers perfect bitwise reproducibility, meaning the same run (same hardware) = same output, every time, even with preemption/restarts.

ALT wandb training curve showing 10 runs with the exact same loss curves

3,043

David Hall · Sep 30, 2023 · 12:11 AM UTC

David Hall @dlwh

30 Sep 2023

How to talk to your kid about LLM-induced x-risk

ALT Cover of the book “Llama Destroys the World”

1,123

David Hall · Sep 27, 2011 · 10:15 PM UTC

David Hall @dlwh

27 Sep 2011

Yo dawg, I heard you like Bayes' rule, so I put a prior on your priors so you can be uncertain about your uncertainty.

David Hall · Mar 19, 2024 · 4:00 PM UTC

David Hall @dlwh

19 Mar 2024

Built with Levanter!

Vaibhav (VB) Srivastav

@reach_vb

19 Mar 2024

Anticipatory Music Transformer by @StanfordCRFM 🎶
> A foundation model for symbolic music.
> Supports generating accompaniments (enrich music) and infill (fill in musical details).
> 780 Million parameters, trained for 800 Thousand steps.
> Trained on Lakh, MetaMIDI and Transcripts of Audio.
> Apache 2.0 Licensed! 🔥
In the video - you see/hear the accompaniment generated by the Anticipatory Music Transformer model for the Tonal input to Dua Lipa's - Levitating song. The music generation space is definitely popping! ⚡

7,634

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Levanter has useful features like live data visualization during training, and cached distributed on-demand preprocessing with Ray. Just specify your HF Dataset or data URLs, and start training!

2,251

David Hall · May 19, 2025 · 6:35 PM UTC

David Hall @dlwh

19 May 2025

Oh right, this kinda got lost in all the platform launch and "we built a pretty good model" stuff. We have WSD-S cooled-down 8B checkpoints every 83B tokens or so for all(?) your scaling law and emergence needs.

Will Held @WilliamBarrHeld

19 May 2025

Replying to @WilliamBarrHeld

Last August, I chatted with @dlwh about the need for an open-source set of scaling law checkpoints! Since then, I was lucky to play a (small) role in building Marin-8B. Check out the model (including intermediate checkpoints) here: huggingface.co/marin-communi…

2,031

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

At @StanfordCRFM, we’ve used Levanter to help scale new techniques like:
* Sophia: nitter.app/tengyuma/status/166141…
* Backpacks: nitter.app/johnhewtt/status/16632…
* Anticipatory Music Transformers: nitter.app/jwthickstun/status/166… (co-released today!)

John Thickstun @jwthickstun

16 Jun 2023

We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog: crfm.stanford.edu/2023/06/16… 🧵👇

2,827

David Hall · Jun 26, 2025 · 7:33 PM UTC

David Hall @dlwh

26 Jun 2025

Replying to @yassineyousfi_

this is the way

1,880

David Hall · Feb 23, 2018 · 6:48 PM UTC

David Hall @dlwh

23 Feb 2018

@propensive I just wanted to say thanks for Magnolia. We switched some stuff over from shapeless (PureConfig derivation) and it sped up compile times by like 90% (seriously)

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Levanter is still evolving, but we hope it will be useful to the community for training foundation models with JAX and TPUs (and GPUs too!). Check it out on GitHub (github.com/stanford-crfm/lev…)

GitHub - marin-community/levanter: Legible, Scalable, Reproducible Foundation Models with Named...

github.com

1,613

David Hall · Sep 4, 2025 · 8:46 PM UTC

David Hall @dlwh

4 Sep 2025

Great work from @wen_kaiyue! Tons of work on top of Marin.

Kaiyue Wen

@wen_kaiyue

4 Sep 2025

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

1,490

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Levanter and Haliax’s named tensors enable more than just legibility. They also enable scalability: FSDP and Tensor Parallelism can be added to your training loop with about 10 lines of code without modifying any of the model code. colab.research.google.com/dr…

Scaling Transformers in Haliax

Colab notebook

colab.research.google.com

2,804

David Hall · Jun 26, 2025 · 7:01 PM UTC

David Hall @dlwh

26 Jun 2025

Replying to @AdamZweiger

You do want to put the best data at the end, I think that’s pretty settled now. It would be good to run an experiment with fully preregistered config probably for the 8B scale. The nemotron data does seem to be dramatically better in terms of loss.

2,372

David Hall · Jun 26, 2025 · 5:32 PM UTC

David Hall @dlwh

26 Jun 2025

Replying to @Adi_kmt

We will do what is typically called “mid training” these days, which is where you put your best pre-training data at the end during the cooldown.

1,915

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Despite the non-invasive approach to parallelism, Levanter’s pretty fast too! We can achieve up to 54% Model Flop Utilization (MFU) on a TPU V3-256, which puts it in the ballpark of performance-focused libraries like Google’s MaxText, MosaicML, and Megatron.

2,342
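
For context on what an MFU figure like the one above means: Model Flop Utilization is the ratio of the FLOPs a training run usefully performs to the hardware's theoretical peak. A common back-of-the-envelope estimate uses roughly 6 FLOPs per parameter per token for a forward plus backward pass, ignoring attention FLOPs. The sketch below uses that approximation. The function name and every number in the example are hypothetical, chosen only to exercise the formula, and are not Levanter's measurements.

```python
def model_flops_utilization(num_params, tokens_per_second, peak_flops_per_second):
    """MFU under the ~6 FLOPs per parameter per token approximation."""
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical: a 1.5B-parameter model at 100k tokens/s on 2 PFLOP/s of peak compute.
mfu = model_flops_utilization(
    num_params=1.5e9,
    tokens_per_second=1.0e5,
    peak_flops_per_second=2.0e15,
)  # 0.45, i.e. 45% MFU
```

The 6N approximation undercounts for long sequences, where attention adds a non-trivial number of FLOPs, so reported MFU numbers usually come from a more careful count.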

David Hall · Oct 24, 2023 · 7:33 PM UTC

David Hall @dlwh

24 Oct 2023

We released version 1.2 of our named tensor library Haliax. In addition to improved docs (haliax.readthedocs.io) and convolutional layers, we added a new einops-style rearrange that works with names or positional axes that I particularly like!

ALT
# split into patches
hax.rearrange(x, "N C (ph H) (pw W) -> N C (P: ph pw) H W", ph=8, pw=8)
# order agnostic
hax.rearrange(x, "{(H: ph H) (W: pw W)} -> ... (P: ph pw) H W", ph=8, pw=8)

587

David Hall · Aug 12, 2019 · 3:31 AM UTC

David Hall @dlwh

12 Aug 2019

Way, way overdue, but I'm pleased to announce the release of Scala Breeze 1.0. github.com/scalanlp/breeze Available for 2.11, 2.12, and 2.13

GitHub - scalanlp/breeze: Breeze is/was a numerical processing library for Scala.

github.com

David Hall · Apr 10, 2024 · 1:04 AM UTC

David Hall @dlwh

10 Apr 2024

Replying to @yaroslavvb

Can’t wait for my DBRX platinum rewards card. 2x a100-hours on dining purchases

631

David Hall · Oct 5, 2021 · 6:21 AM UTC

David Hall @dlwh

5 Oct 2021

I just released Scala Breeze 2.0 with support for Scala 2.12, 2.13, and 3.0. A lot of stuff under the hood changed, but users should hopefully not notice much change, except that it should be somewhat faster. (Compile times w/ Scala 3 are pretty snappy!)

David Hall · May 8, 2023 · 6:58 PM UTC

David Hall @dlwh

8 May 2023

Ray,

I think my model's learning rate is too low, but if I increase it, it gets unstable. What should I do? —David

—

Dear David,

Models only learn through pain, David. Make it more painful. That’s probably just a metaphor for adding more algorithms, etc.

From RayBot.help

631

David Hall · Sep 4, 2025 · 8:45 PM UTC

David Hall @dlwh

4 Sep 2025

Replying to @BlancheMinerva @eliebakouch @wen_kaiyue @tengyuma @percyliang

Thanks Stella! Yeah for Marin I see papers as almost a byproduct of the work: milestones and offshoots of the day to day building and experimentation. More to come, to be sure!

2,979

David Hall · Jul 30, 2025 · 8:57 PM UTC

David Hall @dlwh

30 Jul 2025

Any leaderboard where we're tied with Claude Sonnet and beating Gemini/4o is a good one! (This is of course really a sign that the leaderboard should be taken with quite a lot of salt...)

574

David Hall · Mar 18, 2024 · 5:30 PM UTC

David Hall @dlwh

18 Mar 2024

What's more, the @nvidia JAX team now provides an official Docker container in their JAX Toolbox, which comes with Levanter, TE and an optimized JAX environment preinstalled. github.com/NVIDIA/JAX-Toolbo… @mjsMLP @santosh_bhavanii @tangmaxin

GitHub - NVIDIA/JAX-Toolbox: JAX-Toolbox

github.com

567

David Hall · May 21, 2025 · 9:03 PM UTC

David Hall @dlwh

21 May 2025

32B's in the works! So far a lot fewer mistakes. I do learn sometimes. wandb.ai/marin-community/mar… marin.community

ALT log/log plot of the loss curve of the 32B run's eval loss on paloma, showing a nice looking curve.

1,612

David Hall · Jun 14, 2022 · 4:40 AM UTC

David Hall @dlwh

14 Jun 2022

I wish there were a statically typed autodiff language with named tensors, like Dex, but maybe a little less pure and with much less scary syntax. Jax with xmap makes it feel like it's so close, but it's so far away

David Hall · Apr 3, 2024 · 6:10 PM UTC

David Hall @dlwh

3 Apr 2024

You can train speech foundation models in Levanter thanks to @WilliamBarrHeld!

Will Held @WilliamBarrHeld

3 Apr 2024

Foundation Models are not just Large Language Models! This past month, I added support for Audio into Levanter - @StanfordCRFM's framework for Foundation Model training. Learn how to train Whisper on your own data from scratch here: levanter.readthedocs.io/en/l…

1,613

David Hall · Mar 11, 2025 · 10:41 PM UTC

David Hall @dlwh

11 Mar 2025

I think @eraznafre may be the most underrated JAX LLM person out there. EasyDeL is worth watching

Erfanzar

@eraznafre

11 Mar 2025

EasyDeL v0.1.0 released! The new inference engine efficiently processes 65,536–131,072 tokens (on v4-8), making extended context windows practical for real-world applications, with fast response times and easy scalability. Learn more: shorturl.at/iL9bx

1,332

David Hall · May 23, 2025 · 7:23 PM UTC

David Hall @dlwh

23 May 2025

Since some folks flagged loss spikes: at ~1.5T tokens Olmo 32B (NeoX tokenizer) saw c4_en loss ≈ 2.43; we’re at 2.28 (Llama 3 tokenizer) with BPB 0.702. A simple regression predicts Olmo’s BPB ≈ 0.727 ± 0.0005. I think we're okay.

Percy Liang

@percyliang

22 May 2025

Marin 32B training crossed 1.5 trillion tokens today...

1,690
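
Why compare bits per byte (BPB) instead of raw loss? Per-token losses from different tokenizers are not directly comparable, because each tokenizer splits the same text into a different number of tokens. BPB normalizes by the bytes of text instead. The sketch below shows the conversion, assuming the reported loss is a per-token cross-entropy in nats (the usual convention, but an assumption here). The function names are made up, and the bytes-per-token figure is merely what the two numbers in the tweet imply, not a measured value.

```python
import math


def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert a per-token cross-entropy (in nats) to bits per byte."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token


def implied_bytes_per_token(loss_nats_per_token, bpb):
    """Invert the conversion: average bytes per token implied by a loss/BPB pair."""
    return loss_nats_per_token / math.log(2) / bpb


# A loss of 2.28 with BPB 0.702 implies roughly 4.7 bytes per token on this eval set.
marin_bytes_per_token = implied_bytes_per_token(2.28, 0.702)
```

A tokenizer that packs more bytes into each token will show a higher per-token loss for the same quality of model, which is why the raw 2.43 vs. 2.28 comparison needs this correction.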

David Hall · Sep 12, 2024 · 12:01 AM UTC

David Hall @dlwh

12 Sep 2024

@ztellman's series (in particular this one) crystallized for me when we should expect current LLM dev tools (aider, cursor) to help: it's when the change has a short explanation, even if the diff is huge. If it's boilerplate or otherwise needs limited context, no big deal.

zach @ztellman

9 Sep 2024

Replying to @ztellman

For the next few weeks, we'll be exploring both sides of this coin. To start, we'll look at the simplicity of keeping things apart: explaining.software/archive/…

1,159

David Hall · Jul 26, 2024 · 5:40 PM UTC

David Hall @dlwh

26 Jul 2024

Really pleased to have @WilliamBarrHeld's multimodal audio work in Levanter!

Will Held @WilliamBarrHeld

26 Jul 2024

Replying to @WilliamBarrHeld @EllaMinzhiLi @michaelryan207 @shi_weiyan @StevenyzZhang

Also, huge thanks to @dlwh for helping code review many of my audio additions to Levanter! This project would not have been possible without Levanter - especially since it enabled us to use the awesome @GoogleCloud TPU Research Cloud.

1,061

David Hall · Mar 26, 2024 · 9:56 PM UTC

David Hall @dlwh

26 Mar 2024

With mid-run portability, Levanter will pick up optimization exactly where it left off. It's as easy as copying the checkpoint and adding a flag to the resumed run. @itvadams made a tutorial here: levanter.readthedocs.io/en/l…

934

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

Huge thanks to all the help we (@ivanzhouyq @percyliang) got along the way, including @froystig, @sholtodouglas, Skye Wanderman-Milne, Yifan Mai, @siddkaramcheti, and to our earliest testers @jwthickstun, @johnhewtt, @sidilu_pluslab

1,417

David Hall · May 22, 2025 · 6:24 PM UTC

David Hall @dlwh

22 May 2025

This will go on the dust jacket.

Austin Huang

@austinvhuang

22 May 2025

Refreshingly unpretentious🙂

1,393

David Hall · Jan 19, 2023 · 6:26 AM UTC

David Hall @dlwh

19 Jan 2023

Glad this is seeing the light of day

Salar Rahmanian

@SalarRahmanian

18 Jan 2023

#Microsoft released #scala bindings for #PyTorch #python 🎉😎 github.com/microsoft/scala_t…

950

David Hall · Mar 26, 2024 · 9:56 PM UTC

David Hall @dlwh

26 Mar 2024

Huge thanks to JAX and TensorStore for making this look so easy!

1,671

David Hall · May 2, 2022 · 6:49 PM UTC

David Hall @dlwh

2 May 2022

Replying to @srush_nlp @BenevOrang

There seem to be broadly two groups of EA people. Those who are focused on treatments that have short-term measurable outcomes, and those who are really, really concerned with tail risk (and also buy things like Roko's Basilisk...)

David Hall · Jun 26, 2025 · 8:49 PM UTC

David Hall @dlwh

26 Jun 2025

Replying to @capetorch

We did automatically skip updates on spike steps using a moving window based thing that @soldni pointed me to from OLMo. So... sort of? But the spikes would come anyway. I'm pretty sure it's something before the loss spike itself, and we have evidence it's not always just data

1,676
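
A rough sketch of what "skip updates on spike steps using a moving window" could look like. This is not the OLMo or Levanter implementation; the class name, window size, and sigma threshold are all hypothetical choices made for illustration:

```python
from collections import deque
import statistics


class SpikeSkipper:
    """Flag steps whose loss sits far above a moving window of recent losses."""

    def __init__(self, window: int = 128, threshold_sigmas: float = 6.0,
                 min_history: int = 16):
        self.history = deque(maxlen=window)  # recent non-spike losses
        self.threshold_sigmas = threshold_sigmas
        self.min_history = min_history

    def should_skip(self, loss: float) -> bool:
        # Only judge once there is enough history for a stable estimate.
        if len(self.history) >= self.min_history:
            mean = statistics.fmean(self.history)
            std = max(statistics.pstdev(self.history), 1e-8)
            if loss > mean + self.threshold_sigmas * std:
                # Spike: skip the update and keep it out of the window
                # so it does not inflate later thresholds.
                return True
        self.history.append(loss)
        return False


skipper = SpikeSkipper(window=32)
for step in range(32):
    skipper.should_skip(2.30 + 0.01 * (step % 3))  # ordinary, slightly noisy losses
print(skipper.should_skip(5.0))   # far above the window, so flagged
print(skipper.should_skip(2.31))  # back to normal, so not flagged
```

In a training loop the optimizer update would be applied only when `should_skip` returns `False`. As the tweet notes, this treats the symptom: the spikes can still arrive if the cause sits upstream of the loss.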

David Hall · Mar 27, 2024 · 1:37 AM UTC

David Hall @dlwh

27 Mar 2024

Replying to @srush_nlp @StanfordCRFM

Thanks Sasha, that’s very kind

634

David Hall · Jun 16, 2023 · 3:59 PM UTC

David Hall @dlwh

16 Jun 2023

And an extra thanks to the Google TPU Research Cloud for generously providing us access to the TPUs!

1,232

David Hall · Oct 11, 2021 · 6:26 PM UTC

David Hall @dlwh

11 Oct 2021

Replying to @mark_riedl

i meant the former, but probably both

David Hall · May 29, 2025 · 11:14 PM UTC

David Hall @dlwh

29 May 2025

Replying to @percyliang

When do we put this in Marin :-)

348

David Hall · Jun 19, 2023 · 4:28 PM UTC

David Hall @dlwh

19 Jun 2023

How cool would it be to be this much of a polymath? Cleaning up my Semantic Scholar page... lot of David Halls out there.

ALT bunch of non-NLP (my speciality) papers across a whole bunch of different areas attributed to me that aren't by me

958

David Hall · Dec 7, 2022 · 5:44 PM UTC

David Hall @dlwh

7 Dec 2022

Replying to @nsaphra

how big is big? Mistral has 125M and 345M checkpoints here huggingface.co/stanford-crfm There's also 1.5B and we have a 6.7B in progress

stanford-crfm (Stanford CRFM)

Org profile for Stanford CRFM on Hugging Face, the AI community building the future.

huggingface.co

David Hall · Apr 4, 2022 · 10:27 PM UTC

David Hall @dlwh

4 Apr 2022

Replying to @yoavgo

As others have said it's basically cogsci, but SymSys started with a thesis/viewpoint (~Hofstadter Godel Escher Bach). By the time I was there (04-08), that thesis had been largely forgotten, and it's more "CS with stuff other than engineering fundamentals"

David Hall · Mar 27, 2024 · 6:32 PM UTC

David Hall @dlwh

27 Mar 2024

I honestly didn’t really believe it, but it’s true. This is literally the first thing I tried (looking at what’s in Flax and TransformerEngine), so probably we can get more wins. I especially didn’t believe it would be stable, but it is!

ALT plot showing fp8 is stable training a gpt6.7B for 10k steps

395

David Hall · May 22, 2025 · 3:55 AM UTC

David Hall @dlwh

22 May 2025

Replying to @BrandoHablando

Probably that you can make aggressive changes and it’s fine.

David Hall · Oct 26, 2016 · 3:22 PM UTC

David Hall @dlwh

26 Oct 2016

And we're hiring!

Sasha Rush

@srush_nlp

25 Oct 2016

Semantic Machines has such an impressive team for a small company. Their about page has so many interesting researchers.

David Hall · Aug 9, 2022 · 12:22 AM UTC

David Hall @dlwh

9 Aug 2022

When programming in Python, every time I take ≥2 steps off the golden path in some library, everything breaks. Either I'm cursed or I don't understand how anyone does anything

David Hall · Mar 26, 2024 · 9:56 PM UTC

David Hall @dlwh

26 Mar 2024

Oh, and performance is portable too! Levanter can achieve model flop utilization (MFU) in the 50-55% range (sometimes a bit higher) on both TPU and GPU, so we have you covered on both platforms.

1,227
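
For reference, MFU is the ratio of the FLOPs a run actually spends on the model to the hardware's theoretical peak. A minimal sketch, assuming the common approximation of 6 FLOPs per parameter per token for a forward plus backward pass (attention FLOPs ignored); the model size, throughput, and peak figures below are made-up inputs, not Levanter measurements:

```python
def model_flops_utilization(n_params: float, tokens_per_second: float,
                            peak_flops_per_second: float) -> float:
    """Estimate MFU from throughput using the 6 * N FLOPs-per-token rule of thumb."""
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: 7B parameters, 220k tokens/s, 64 accelerators
# each with an assumed peak of 275 TFLOP/s.
example_mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=220_000,
    peak_flops_per_second=64 * 275e12,
)
print(f"MFU: {example_mfu:.1%}")
```

Because the denominator is the vendor's peak number, the same formula applies on TPU and GPU; only `peak_flops_per_second` changes.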

David Hall · Sep 4, 2025 · 10:31 PM UTC

David Hall @dlwh

4 Sep 2025

marin.community/speedrun/ is the dashboard where you can see the ladder plots and explore some other visualizations @williambarrheld and @nikilravi cooked up

732

David Hall · Sep 4, 2025 · 10:31 PM UTC

David Hall @dlwh

4 Sep 2025

Example code here: github.com/marin-community/m…

588

David Hall · Feb 16, 2018 · 1:22 AM UTC

David Hall @dlwh

16 Feb 2018

Replying to @praxilogical @jasonpjason

Seize the means of reproduction?

David Hall · Sep 26, 2013 · 9:57 PM UTC

David Hall @dlwh

26 Sep 2013

My cat left me some ascii art in my terminal