Member of Technical Staff @ Open Athena. Creator of Levanter and Marin. Previously Research Engineering @StanfordCRFM, co-founder at Semantic Machines ⟶ MSFT.

Berkeley, CA
So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
Marin 32B training crossed 1.5 trillion tokens today...
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on @StanfordCRFM. Levanter is designed to be legible, scalable and reproducible. crfm.stanford.edu/2023/06/16…
I'm excited to announce I'm joining Stanford CRFM (crfm.stanford.edu) to lead the engineering effort to improve the accessibility of foundation models. Really looking forward to working with @percyliang and everyone else there!
Last summer we announced the Sophia optimizer, a successor to Adam that can achieve up to 2x gains over Adam. We’ve now merged mainline support into Levanter! Check out @tengyuma’s original thread for how Sophia works: nitter.app/tengyuma/status/166141… @HongLiu9903 github.com/stanford-crfm/lev…
Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
I like to talk about Levanter’s performance, reproducibility, and scalability, but it’s also portable! So portable you can even switch from TPU to GPU in the middle of a run, and then switch back again! github.com/stanford-crfm/lev…
(If your attention span has been irreparably damaged by chronic scrolling, the answer is QK Norm, but the story is the fun part! Or here's the report version of this thread: api.wandb.ai/links/marin-com…)
For fun, I've been working with @achewood to create an LLM-backed advice bot to celebrate the relaunch of my favorite web comic.
And now it's looking great! So, more norms good. Also, and more importantly, you can totally just change something major mid-run and it'll be okay. Or, as the meme goes, you can just do things.
FP8 support has landed in Levanter (and Haliax)! On H100, you can now get a >40%(!) throughput improvement by flipping a flag. Just add `trainer.fp8: true` to your config and you’re good to go! H100s not included. github.com/stanford-crfm/lev…
I made this little project for turning jax's jitted functions back into python code, mostly so you can minimize bug reports etc. It's horrible and hacky but maybe someone will find it useful. github.com/dlwh/jax_sourcero…
Come read about all the mistakes I made along the way to beating Llama 3.1 8B on 14/19 benchmarks. We trained from scratch, made plenty of wrong turns, and learned a lot.
For a rare look into how LLMs are really built, check out @dlwh's retrospective on how we trained the Marin 8B model from scratch (and outperformed Llama 3.1 8B base). It’s an honest account with all the revelations and mistakes we made along our journey. Papers are forced to hide the mess, but the real science happens in the process. marin.readthedocs.io/en/late…
Super excited Marin is finally out! Come see what we've been building! Code/platform for training fully reproducible models end-to-end, from data to evals. Plus a new high quality 8B base model, fully documented from start to finish.
What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:
We've released our new GPU-based natural language parser, Puck. It can parse over half a million words per minute. github.com/dlwh/puck
Thanks in particular to @soldni who told me about the OLMo spike-skipping logic. I think things were too unstable for us to benefit sufficiently from it, but we left it on too.
So first congrats to Arcee on the release. On the other hand, look how Marin is doing on this leaderboard we didn't even know about??
Replying to @latkins
Our preview model actually tied at #2 for a while on the @yupp_ai leaderboard, when filtered for 2-5 turns. It has since gone further down, but I do think this speaks to the charm that this model has, which we haven't quite figured out how to evaluate.
**BUT** we aren't going to start from scratch. That is not the patented Marin Tootsie Process™️ way. No flop left behind. Add QK Norm, warmstart, keep the optimizer states, just rewarmup the learning rate. (Worst case, it blows up and we eventually throw it out.)
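For readers who have not met QK Norm: the general idea is to normalize the query and key vectors before taking their dot product, so the attention logits cannot grow without bound as activations drift. The sketch below is a minimal pure-Python illustration of that idea for a single query/key pair. It is not the Marin or Levanter implementation; the RMS normalization without a learned gain, the function names, and the activation values are all assumptions made for illustration.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale vec so its root-mean-square is about 1 (learned gain omitted)."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in vec]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if use_qk_norm:
        # QK Norm: normalize both vectors before the dot product.
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Hypothetical activations whose magnitude has drifted upward during training.
q = [50.0, -80.0, 120.0, 30.0]
k = [60.0, -70.0, 110.0, 20.0]

raw_logit = attention_logit(q, k, use_qk_norm=False)
normed_logit = attention_logit(q, k, use_qk_norm=True)
```

With normalization each vector has length at most sqrt(d), so the scaled logit is bounded by sqrt(d) in magnitude no matter how large the raw activations get, which is one common explanation for why it helps with stability.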
@StanfordCRFM and the @NVIDIA JAX Team have worked together to integrate TransformerEngine into our foundation model training framework, Levanter. The result? Levanter is now significantly faster on GPUs, with up to 50% more tokens per second! github.com/stanford-crfm/lev… @itsvadams
But then it happened: a Bad Spike. A spike where the loss didn't recover to the same plateau. Everyone has told us this is Bad News. In absolute terms, the loss spike was nothing. But it just didn't settle back. I dunno why. (Same y axis as the previous "fine" spike.)
TL;DR: The warmstarted QK Norm run caught up real quick (about 200 steps, 6.5B tokens), overshooting (due to warmup) before settling in at just a bit better.
Oh and if you made it this far, you should come hang out in our discord: marin.community/ for the link
We did notice that update spikes always preceded loss spikes, similar to what arxiv.org/abs/2304.13013 found. So we were pretty hopeful about update clipping. (Updates can go to 0 b/c of our OLMo2 style update skipping.)
Now, look, we knew QK Norm was a good idea. We just thought it wasn't a **necessary** idea, not for us. We were different. Anyway, let's fix it.
So, pretty good right? It's running way ahead. To be clear, the training data is different (mostly Nemotron-CC instead of mostly DCLM), and the batch size is much, much larger, at 32Mi tokens instead of 4-12Mi.
Aside: The Muon run was still warming up its Adam params here so the loss was lower. Then it decided to go to space. Again, I'm sure insufficient tuning. Also, the Muon run did its bad shift a little later? Might be worth investigating.
Earlier, I shared some throughput numbers for Levanter with @NVIDIA’s TransformerEngine. Today, I wanted to share some scaling experiments we conducted with Radium Cloud. We ran 1.5 and 7B experiments on up to 64 A100s. TL;DR: Linear scaling!
How have I not seen the PFSC "sick of graphs" comic even once in the last two years?
another PS, since the tweet is Doing Numbers: The compute is very generously sponsored by @googlecloud TPU Research Cloud. JAX+TPU determinism has been critical for testing at this scale. Also >50% MFU without breaking too much of a sweat.
It was time to do what everyone else has learned but we were too proud, too foolish to try. (After all, the 22b and 70b trials were buttery smooth! Eval losses were ahead of schedule!) It was time to add QK Norm.
We also thought we were okay because the model was massively outperforming our other test models on a flop for flop basis. Recall that the 8b run (the gray top lines) was ultimately on par with Llama 3.1. Orange is the 32B.
Some told us we were already doomed, some were a bit more hedgy. Some trusted people privately told us that actually things were probably fine.
Nevertheless, we tried some interventions: tightening the grad norm clip, loss and grad outlier skipping, update clipping, etc. Nothing seemed to make a huge difference.
Ideally there wouldn't be spikes of course. But many of the people we talked to (and our own experience) suggested that if the model recovered quickly and it didn't really change the trajectory, it was fine. And things did recover pretty quickly, in terms of step count.
So we tried some stuff. We tried skipping the problematic step. We tried Muon (which looked great until it didn't... Need to spend more time with it at small scales.) We could have tried some other stuff, but it was time to take drastic action to end the spikes.
We've released v1.0.1 of our scalable named tensor library Haliax, now available as its own package on PyPI, with minimal deps beyond Jax and Equinox! pypi.org/project/haliax/
In addition to the Haliax release, we released 1.1 of Levanter, including support for Llama models, pure-JAX Flash Attention impl, and preliminary HF-PEFT-compatible LoRA support. As part of the release, we put together a tutorial on reproducing Alpaca. levanter.readthedocs.io/en/l…
With Levanter, we also introduce Haliax, a new named tensor module that makes deep learning code easier to read, understand, and compose. Named tensors are a more intuitive abstraction than the usual positional axes. You can learn more about Haliax here: colab.research.google.com/dr…
Levanter isn’t just for pre-training! You can get the same benefits of scalability, legibility, and reproducibility when fine-tuning as well! We wrote a tutorial on how to replicate Alpaca. levanter.readthedocs.io/en/l…
Please participate in the Marin speedrun! If you can write up some code for an optimizer we can run it and create a cool scaling ladder plot comparing to baselines!
Lots of great libraries here, not least of which is @PatrickKidger’s Equinox, which is one of the core libraries Levanter is built on. github.com/patrick-kidger/eq…
Levanter from @StanfordCRFM has earned recognition at #GoogleCloudNext as a popular #Jax repository to build Foundation Models 🙌 There have been many great improvements to Levanter in the past few months led by @dlwh, particularly in achieving impressive MFU numbers on both TPU and GPU, and support fine-tuning on popular architectures 🚀
When we released our foundation model training framework Levanter, we got a lot of requests for LoRA. So we added support! Our implementation works with all Levanter models and produces checkpoints that work with Hugging Face’s PEFT library.

ALT terminal of me typing `python -m levanter.main.lora_lm --config configs/lora_llama2.yaml --data.id math-ai/AutoMathText` and then Levanter doing its thing (through loading weights). Sped up 4x
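As a rough picture of what a LoRA layer computes: the frozen weight is left alone and a low-rank product, scaled by alpha / rank, is added on top. The pure-Python sketch below illustrates that general formulation only. It is not Levanter's or PEFT's code, and the helper names, shapes, and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, lora_a, lora_b, alpha):
    """Compute x @ w + (alpha / rank) * (x @ lora_a @ lora_b).

    Shapes: x [n, d_in], w [d_in, d_out], lora_a [d_in, rank], lora_b [rank, d_out].
    """
    rank = len(lora_a[0])
    base = matmul(x, w)  # frozen pretrained path
    delta = matmul(matmul(x, lora_a), lora_b)  # trainable low-rank path
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


# Hypothetical tiny example: d_in = d_out = 2, rank = 1.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0], [1.0]]

# With lora_b at zero (a common initialization) the layer matches the base model.
at_init = lora_forward(x, w, lora_a, [[0.0, 0.0]], alpha=2.0)
# After training, lora_b is nonzero and shifts the output.
adapted = lora_forward(x, w, lora_a, [[0.5, -0.5]], alpha=2.0)
```

Only `lora_a` and `lora_b` would be trained, which is why the resulting adapter checkpoints are small compared with the full model.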

@_jasonw_sy recently added Grouped Query Attention to Levanter’s Llama implementation. GQA is used in the higher parameter count Llama 2 configurations, meaning Levanter now supports the full suite of Llama models! github.com/stanford-crfm/lev…
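In Grouped Query Attention, several query heads share one key/value head. A minimal sketch of the head bookkeeping, assuming contiguous groups of query heads (the function name and the head counts are hypothetical, and this is not Levanter's code):

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head shared by this query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

Setting the number of key/value heads equal to the number of query heads recovers ordinary multi-head attention, and setting it to 1 gives multi-query attention.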
Saturday night tutorial! This time we're doing Tensor Parallelism in JAX/Haliax/Levanter in just 5 lines of code colab.research.google.com/dr…
Come work with me and other really great people at @StanfordCRFM! Lots of cool projects involving training and evaluating foundation models, while still getting to do it all in the open!
Interested in building and benchmarking LLMs and other foundation models in a vibrant academic setting? @StanfordCRFM is hiring research engineers! careersearch.stanford.edu/jo… Here are some things that you could be a part of:
Replying to @mark_riedl
when i was at MS at a training, a long timer got up and said that at MS, if anything is worth doing there were at least 5 different teams working on it already, and you'd only ever be able to find 4 of them
Replying to @srush_nlp
This is not the ELI5 answer (@gallabytes and others have nailed it) but if you're interested, there's a recent-ish performance guide for TPU that I think explains the perf properties quite well jax.readthedocs.io/en/latest…
Thanks to JAX, Levanter also offers perfect bitwise reproducibility, meaning the same run (same hardware) = same output, every time, even with preemption/restarts.
How to talk to your kid about LLM-induced x-risk
Yo dawg, I heard you like Bayes' rule, so I put a prior on your priors so you can be uncertain about your uncertainty.
Built with Levanter!
Anticipatory Music Transformer by @StanfordCRFM 🎶 > A foundation model for symbolic music. > Supports generating accompaniments (enrich music) and infill (fill in musical details). > 780 Million parameters, trained for 800 Thousand steps. > Trained on Lakh, MetaMIDI and Transcripts of Audio. > Apache 2.0 Licensed! 🔥 In the video - you see/ hear the accompaniment generated by the Anticipatory Music Transformer model for the Tonal input to Dua Lipa's - Levitating song. The music generation space is definitely popping! ⚡
Levanter has useful features like live data visualization during training, and cached distributed on-demand preprocessing with Ray. Just specify your HF Dataset or data URLs, and start training!
Oh right, this kinda got lost in all the platform launch and "we built a pretty good model" stuff. We have WSD-S cooled-down 8B checkpoints every 83B tokens or so for all(?) your scaling law and emergence needs.
Replying to @WilliamBarrHeld
Last August, I chatted with @dlwh about the need for an open-source set of scaling law checkpoints! Since then, I was lucky to play a (small) role in building Marin-8B. Check out the model (including intermediate checkpoints) here: huggingface.co/marin-communi…
At @StanfordCRFM, we’ve used Levanter to help scale new techniques like: * Sophia: nitter.app/tengyuma/status/166141… * Backpacks: nitter.app/johnhewtt/status/16632… * Anticipatory Music Transformers: nitter.app/jwthickstun/status/166… (co-released today!)
We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog: crfm.stanford.edu/2023/06/16… 🧵👇
Replying to @yassineyousfi_
this is the way
@propensive I just wanted to say thanks for Magnolia. We switched some stuff over from shapeless (PureConfig derivation) and it sped up compile times by like 90% (seriously)
Levanter is still evolving, but we hope it will be useful to the community for training foundation models with JAX and TPUs (and GPUs too!). Check it out on GitHub (github.com/stanford-crfm/lev…)
Great work from @wen_kaiyue! Tons of work on top of Marin.
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
Levanter and Haliax’s named tensors enable more than just legibility. They also enable scalability: FSDP and Tensor Parallelism can be added to your training loop with about 10 lines of code without modifying any of the model code. colab.research.google.com/dr…
Replying to @AdamZweiger
You do want to put the best data at the end, I think that’s pretty settled now. It would be good to run an experiment with fully preregistered config probably for the 8B scale. The nemotron data does seem to be dramatically better in terms of loss.
Replying to @Adi_kmt
We will do what is typically called “mid training” these days, which is where you put your best pre-training data at the end during the cooldown.
Despite the non-invasive approach to parallelism, Levanter’s pretty fast too! We can achieve up to 54% Model Flop Utilization (MFU) on a TPU V3-256, which puts it in the ballpark of performance-focused libraries like Google’s MaxText, MosaicML, and Megatron.
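For anyone unsure how an MFU figure like this is usually derived, here is a back-of-the-envelope sketch. It assumes the common approximation of about 6 FLOPs per parameter per training token (forward plus backward, ignoring attention FLOPs), which may differ from how Levanter accounts for it. The parameter count, throughput, and peak FLOP/s below are hypothetical numbers chosen for easy arithmetic, not measurements.

```python
def model_flop_utilization(num_params, tokens_per_second, peak_flops_per_second):
    """Estimate MFU as achieved model FLOP/s over the hardware's peak FLOP/s.

    Assumes roughly 6 FLOPs per parameter per training token.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: 1B parameters, 50k tokens/s, on hardware with 6e14 peak FLOP/s.
mfu = model_flop_utilization(
    num_params=1.0e9,
    tokens_per_second=50_000.0,
    peak_flops_per_second=6.0e14,
)
```

Here the run achieves 3e14 model FLOP/s against a 6e14 peak, so `mfu` comes out to 0.5, i.e. 50%.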
We released version 1.2 of our named tensor library Haliax. In addition to improved docs (haliax.readthedocs.io) and convolutional layers, we added a new einops-style rearrange that works with names or positional axes that I particularly like!
Replying to @yaroslavvb
Can’t wait for my DBRX platinum rewards card. 2x a100-hours on dining purchases
I just released Scala Breeze 2.0 with support for Scala 2.12, 2.13, and 3.0. A lot of stuff under the hood changed, but users should hopefully not notice much change, except that it should be somewhat faster. (Compile times w/ Scala 3 are pretty snappy!)
Thanks Stella! Yeah for Marin I see papers as almost a byproduct of the work: milestones and offshoots of the day to day building and experimentation. More to come, to be sure!
Any leader board where we're tied with Claude Sonnet and beating Gemini/4o is a good one! (This is of course really a sign that the leader board should be taken with quite a lot of salt...)
What's more, the @nvidia JAX team now provides an official Docker container in their JAX Toolbox, which comes with Levanter, TE and an optimized JAX environment preinstalled. github.com/NVIDIA/JAX-Toolbo… @mjsMLP @santosh_bhavanii @tangmaxin
32B's in the works! So far a lot fewer mistakes. I do learn sometimes. wandb.ai/marin-community/mar… marin.community
I wish there were a statically typed autodiff language with named tensors, like Dex, but maybe a little less pure and with much less scary syntax. Jax with xmap makes it feel like it's so close, but it's so far away
You can train speech foundation models in Levanter thanks to @WilliamBarrHeld!
Foundation Models are not just Large Language Models! This past month, I added support for Audio into Levanter - @StanfordCRFM's framework for Foundation Model training. Learn how to train Whisper on your own data from scratch here: levanter.readthedocs.io/en/l…
I think @eraznafre may be the most underrated JAX LLM person out there. EasyDeL is worth watching
EasyDeL v0.1.0 released new inference engine efficiently processes 65,536–131,072 tokens (on v4-8), making extended context windows practical for real-world applications—with fast response times and easy scalability. Learn more: shorturl.at/iL9bx
Since some folks flagged loss spikes: at ~1.5T tokens Olmo 32B (NeoX tokenizer) saw c4_en loss ≈ 2.43; we’re at 2.28 (Llama 3 tokenizer) with BPB 0.702. A simple regression predicts Olmo’s BPB ≈ 0.727 ± 0.0005. I think we're okay.
Marin 32B training crossed 1.5 trillion tokens today...
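The comparison above switches from per-token loss to bits per byte (BPB) because per-token losses are not directly comparable across tokenizers. A minimal sketch of the conversion, assuming the loss is a mean cross-entropy in nats per token; the function name and the input values are hypothetical and are not the measurements quoted above.

```python
import math


def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert a per-token cross-entropy (in nats) to bits per byte of text."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token


# Hypothetical values: a loss of 2.0 nats/token on text averaging 4 bytes/token.
example_bpb = bits_per_byte(loss_nats_per_token=2.0, bytes_per_token=4.0)
```

A tokenizer that packs more bytes into each token tends to show a higher per-token loss for the same quality of model, and dividing by bytes per token is what removes that effect.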
@ztellman's series (in particular this one) crystallized for me when we should expect current LLM dev tools (aider, cursor) to help: it's when the change has a short explanation, even if the diff is huge. If it's boilerplate or otherwise needs limited context, no big deal.
Replying to @ztellman
For the next few weeks, we'll be exploring both sides of this coin. To start, we'll look at the simplicity of keeping things apart: explaining.software/archive/…
Really pleased to have @WilliamBarrHeld's multimodal audio work in Levanter!
Also, huge thanks to @dlwh for helping code review many of my audio additions to Levanter! This project would not have been possible without Levanter - especially since it enabled us to use the awesome @GoogleCloud TPU Research Cloud.
With mid-run portability, Levanter will pick up optimization exactly where it left off. It's as easy as copying the checkpoint and adding a flag to the resumed run. @itvadams made a tutorial here: levanter.readthedocs.io/en/l…
Huge thanks to all the help we (@ivanzhouyq @percyliang) got along the way, including @froystig, @sholtodouglas, Skye Wanderman-Milne, Yifan Mai, @siddkaramcheti, and to our earliest testers @jwthickstun, @johnhewtt, @sidilu_pluslab
This will go on the dust jacket.
Refreshingly unpretentious🙂
Huge thanks to JAX and TensorStore for making this look so easy!
There seem to be broadly two groups of EA people. Those who are focused on treatments that have short-term measurable outcomes, and those who are really, really concerned with tail risk (and also buy things like Roko's Basilisk...)
Replying to @capetorch
We did automatically skip updates on spike steps using a moving window based thing that @soldni pointed me to from OLMo. So... sort of? But the spikes would come anyway. I'm pretty sure it's something before the loss spike itself, and we have evidence it's not always just data
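A moving-window skip rule of the kind described here can be sketched roughly as follows: keep a window of recent losses and skip the update whenever the new loss sits far above the window's mean. This is an illustration of the general idea only, not the OLMo or Marin code; the class name, window size, threshold, and minimum-history values are all assumptions.

```python
from collections import deque
import statistics


class SpikeSkipper:
    """Flag steps whose loss is an outlier relative to a moving window."""

    def __init__(self, window=128, num_stddevs=6.0, min_history=16):
        self.history = deque(maxlen=window)
        self.num_stddevs = num_stddevs
        self.min_history = min_history

    def should_skip(self, loss):
        """Return True if the optimizer update for this step should be skipped."""
        if len(self.history) >= self.min_history:
            mean = statistics.fmean(self.history)
            # Floor the spread so a perfectly flat window does not flag tiny noise.
            std = max(statistics.pstdev(self.history), 1e-8)
            if loss > mean + self.num_stddevs * std:
                # Outlier: skip the update and keep it out of the window.
                return True
        self.history.append(loss)
        return False


# Hypothetical loss trace: steady values, one spike, then back to normal.
skipper = SpikeSkipper(window=32, num_stddevs=6.0, min_history=8)
losses = [2.00 + 0.01 * (i % 3) for i in range(20)] + [5.0, 2.01]
decisions = [skipper.should_skip(x) for x in losses]
```

Note that a rule like this only reacts once the loss has already jumped, which fits the observation above that the underlying problem starts before the loss spike itself.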
Thanks Sasha, that’s very kind
And an extra thanks to the Google TPU Research Cloud for generously providing us access to the TPUs!
Replying to @mark_riedl
i meant the former, but probably both
Replying to @percyliang
When do we put this in Marin :-)
How cool would it be to be this much of a polymath? Cleaning up my Semantic Scholar page... lot of David Halls out there.
Replying to @nsaphra
how big is big? Mistral has 125M and 345M checkpoints here huggingface.co/stanford-crfm There's also 1.5B and we have a 6.7B in progress
Replying to @yoavgo
As others have said it's basically cogsci, but SymSys started with a thesis/viewpoint (~Hofstadter's Gödel, Escher, Bach). By the time I was there (04-08), that thesis had been largely forgotten, and it's more "CS with stuff other than engineering fundamentals"
I honestly didn’t really believe it, but it’s true. This is literally the first thing I tried (looking at what’s in Flax and TransformerEngine), so probably we can get more wins. I especially didn’t believe it would be stable, but it is!
Replying to @BrandoHablando
Probably that you can make aggressive changes and it’s fine.
And we're hiring!
Semantic Machines has such an impressive team for a small company. Their about page has so many interesting researchers.
When programming in Python, every time I take ≥2 steps off the golden path in some library, everything breaks. Either I'm cursed or I don't understand how anyone does anything
Oh, and performance is portable too! Levanter can achieve model flop utilization (MFU) in the 50-55% range (sometimes a bit higher) on both TPU and GPU, so we have you covered on both platforms.
marin.community/speedrun/ is the dashboard where you can see the ladder plots and explore some other visualizations @williambarrheld and @nikilravi cooked up
Example code here: github.com/marin-community/m…
Seize the means of reproduction?
My cat left me some ascii art in my terminal