professor of computer science @Stanford @stanfordnlp, co-founder of @togethercompute, creator of marin.community, co-founder of @simile_ai, pianist

Stanford, CA
What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:
66
224
1,228
209,444
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
46
569
4,921
679,345
While we celebrate @deepseek_ai's release of open-weight models that we can all play with at home, just a friendly reminder that they are not *open-source*; there’s no training / data processing code, and hardly any information about the data.
238
426
4,634
776,375
You spend $1B training a model A. Someone on your team leaves and launches their own model API B. You're suspicious. Was B derived (e.g., fine-tuned) from A? But you only have blackbox access to B... With our paper, you can still tell with strong statistical guarantees (p-values < 1e-8). Idea: test for independence of A's training data order with likelihoods under B. There are crazy amounts of metadata about the training process baked into the model that can't be washed out, like a palimpsest...
🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨
54
209
2,433
381,013
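The independence-test idea above can be sketched roughly as follows. This is a simplified illustration, not the paper's actual statistic: it assumes you already have, for each training example, its position in A's shuffled training order and its log-likelihood under B, and it uses a Spearman rank correlation with a permutation test. The function names and the choice of test are assumptions made for illustration.

```python
import random


def ranks(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for rx and ry
    return cov / var


def permutation_p_value(train_positions, loglik_under_b,
                        num_permutations=1000, seed=0):
    """Test the null hypothesis that B's likelihoods are independent of
    A's training order. Under the null, every ordering is equally likely,
    so we compare the observed correlation to correlations under random
    re-shufflings of the order."""
    rng = random.Random(seed)
    observed = abs(spearman(train_positions, loglik_under_b))
    shuffled = list(train_positions)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if abs(spearman(shuffled, loglik_under_b)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (num_permutations + 1)
```

A permutation test with N shuffles cannot report a p-value below 1/(N+1), so reaching values like 1e-8 would need an exact or analytic test rather than this sampling sketch.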
We should call models like Llama 3, Mixtral, etc. “open-weight models”, not “open-source models”. For a model to be open-source, the code and training data need to be public (good examples: GPT-J, OLMo, RedPajama, StarCoder, K2, etc.). Weights are like an exe file, which would be ridiculous to call open-source.
42
288
1,963
260,974
📣 CRFM announces PubMedGPT, a new 2.7B language model that achieves a new SOTA on the US medical licensing exam. The recipe is simple: a standard Transformer trained from scratch on PubMed (from The Pile) using @mosaicml on the MosaicML Cloud, then fine-tuned for the QA task.
40
317
1,472
426,664
Writing on a whiteboard can make it easier for students to follow compared to slides (especially for math). During the pandemic, I added a feature to sfig (my Javascript slides library) to allow me to reveal parts of a slide using the mouse as if I were writing on a whiteboard:
10
67
1,134
Myth: open foundation models are antithetical to AI safety. Fact: open foundation models are critical for AI safety. Here are three reasons why:
26
250
1,106
425,613
I worry about language models being trained on test sets. Recently, we emailed support@openai.com to opt out of having our (test) data be used to improve models. This isn't enough though: others running evals could still inadvertently contribute those test sets to training.
37
100
937
291,860
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
73
81
925
I miss the days when we evaluated algorithms rather than models. Rather than "how well does model M do?", it should be "given data D and compute C, how well does running algorithm A on D with C do?" I don't think we can get scientific clarity unless we do the latter.
20
90
782
56,499
Language models are becoming the foundation of language technologies, but when do they work or don’t work? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements:
13
193
755
This year, I have 4 exceptional students on the academic job market, and they couldn’t be more different, with research spanning AI policy, robotics, NLP, and HCI. Here’s a brief summary of their research, along with one representative work each:
7
43
672
122,651
We did a very careful study of 10 optimizers with no horse in the race. Despite all the excitement about Muon, Mars, Kron, Soap, etc., at the end of the day, if you tune the hyperparameters rigorously and scale up, the speedup over AdamW diminishes to only 10% :-( Experiments are made possible by Marin (github.com/marin-community/m…); anyone developing new optimizers: please come try your method on this benchmark!
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
16
80
686
181,591
-2016 (classic era): focus on data efficiency 2017-2025 (pretraining era): focus on compute efficiency 2026-: focus on data efficiency (again) The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design decisions, which will be exciting!
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute
14
68
613
103,907
Meta's release of OPT is an exciting step towards opening new opportunities for research. In general, we can think of stronger release as enabling researchers to tackle deeper questions. There are different levels of strength:
3
74
568
So why are DeepSeek models so good? To their credit, the papers have more detail than most frontier model papers, but it’s hard to tell which details matter. And data is the big missing piece, which we know to be the most important factor that determines model quality.
9
36
549
62,999
True open-source allows us to study and modify artifacts. We can study the DeepSeek papers (which are nicely written, but still omit details), and we can fine-tune their models, but one cannot understand or modify them at a deep level.
9
28
535
65,034
ChatGPT is reactive: user says X, ChatGPT responds with Y. Risks exist but are bounded. Soon it will be tempting to have proactive systems - an assistant that will answer emails for you, take actions on your behalf, etc. Risks will then be much higher.
22
69
543
116,977
Many "open" language models only come with released weights. In software, this is analogous to releasing a binary without code (you wouldn't call this open-source). To get the full benefits of transparency, you need the training data. GPT-J, GPT-NeoX, BLOOM, RedPajama do this.
11
79
523
88,221
Announcing Holistic Evaluation of Language Models (HELM) v0.2.0 with updated results on the new @OpenAI, @AI21Labs, and @CohereAI models. HELM now evaluates 34 prominent language models in a standardized way on 42 scenarios x 7 metrics.
4
88
547
154,209
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
20
88
595
127,413
For a rare look into how LLMs are really built, check out @dlwh's retrospective on how we trained the Marin 8B model from scratch (and outperformed Llama 3.1 8B base). It’s an honest account with all the revelations and mistakes we made along our journey. Papers are forced to hide the mess, but the real science happens in the process. marin.readthedocs.io/en/late…
2
70
498
56,362
I have 6 fantastic students and post-docs who are on the academic job market this year. Here is a short thread summarizing their work along with one representative paper:
11
56
492
150,138
There are legitimate and scientifically valuable reasons to train a language model on toxic text, but the deployment of GPT-4chan lacks them. AI researchers: please look at this statement and see what you think: forms.gle/ikiYE6ArLpWYz7aDA
68
125
454
When will the original GPT-3 model (davinci) be old enough that its weights can be safely released? It would be very useful for science and poses no additional risks (since open models will catch up anyway). In general, all models should expire and be released eventually.
17
35
485
94,465
My TEDAI talk from Oct 2023 is now live: go.ted.com/percyliang It was a hard talk to give: 1. I memorized it - felt more like giving a piano recital than an academic talk. 2. I wanted it to be timeless despite AI changing fast…still ok after 3 months. Here’s what I said:
18
79
477
100,194
No matter how good LMs get at writing, I will always want to write some things from scratch - for the same reason that I sometimes grow my own tomatoes, make my own granola, learn to play a Chopin etude...not because it's better, but because of the sheer joy of creation.
12
36
452
77,744
Vision took autoregressive Transformers from NLP. Now, NLP takes diffusion from vision. What will be the dominant paradigm in 5 years? Excited by the wide open space of possibilities that diffusion unlocks.
arxiv.org/abs/2205.14217 We propose Diffusion-LM, a non-autoregressive language model based on continuous diffusions. It enables complex controllable generation. We can steer the LM to generate text with desired syntax structure ( [S [NP...VP…]]) and semantic content (name=Coupa)
3
81
446
I have 4 incredible students/post-docs on the academic job market this year. As per tradition, I'll attempt to summarize their research + one representative paper:
3
29
437
160,918
Open AI means AI that is open.
64
27
436
79,054
Lack of transparency/full access to capable instruct models like GPT 3.5 has limited academic research in this important space. We make one small step with Alpaca (LLaMA 7B + self-instruct text-davinci-003), which is reasonably capable and dead simple:
Instruction-following models are now ubiquitous, but API-only access limits research. Today, we’re releasing info on Alpaca (solely for research use), a small but capable 7B model based on LLaMA that often behaves like OpenAI’s text-davinci-003. Demo: crfm.stanford.edu/alpaca/
11
80
428
147,171
For trying to understand LMs deeply, @AiEleuther’s Pythia has been an invaluable resource: 16 LMs (70M to 12B parameters) trained on the same data (The Pile) in the same order, with intermediate checkpoints. It’s been two years and it’s time for a refresh.
8
47
446
59,136
So the only answer I can give is because they have a very strong team.
12
10
401
54,431
I am excited to be part of 7 NeurIPS papers on understanding and improving foundation models. We...
3
41
417
2nd-order optimization has been around for 300+ years...we got it to scale for LLMs (it's surprisingly simple: use the diagonal + clip). Results are promising (2x faster than Adam, which halves your $$$). A shining example of why students should still take optimization courses!
Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
19
59
409
91,778
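The "use the diagonal + clip" recipe can be sketched as a single update step. This is a minimal illustration in plain Python, not the reference Sophia implementation: the function name `sophia_like_step`, the hyperparameter defaults, and the caller-supplied `hess_diag` (an estimate of the diagonal of the Hessian) are assumptions, and the way the paper estimates that diagonal is not reproduced here.

```python
def clip(x, bound):
    """Clamp x to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def sophia_like_step(params, grads, hess_diag, state,
                     lr=0.1, beta1=0.9, beta2=0.99, gamma=1.0, eps=1e-12):
    """One diagonal second-order step with per-coordinate clipping.

    params, grads, hess_diag: lists of floats of equal length.
    state: dict holding the running averages between calls.
    """
    m = state.setdefault("m", [0.0] * len(params))  # EMA of gradients
    h = state.setdefault("h", [0.0] * len(params))  # EMA of diagonal Hessian
    new_params = []
    for i, (p, g, hd) in enumerate(zip(params, grads, hess_diag)):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        h[i] = beta2 * h[i] + (1 - beta2) * hd
        # Precondition by the curvature estimate; max() guards against
        # tiny or negative curvature, and clip() bounds the step size.
        update = clip(m[i] / max(gamma * h[i], eps), 1.0)
        new_params.append(p - lr * update)
    return new_params
```

Because of the clip, no coordinate moves by more than `lr` in one step, even when the curvature estimate is near zero or negative; in well-behaved regions the step behaves like a Newton step restricted to the diagonal.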
model = learn(data) Synthetic data is great, but it’s not data. It’s an intermediate quantity created by learn(). Data is created by people and has privacy and copyright considerations. Synthetic “data” does not - it’s internal to learn().
28
47
400
63,541
Suppose someone uploads a new SOTA model on HF claiming they trained it from scratch (as opposed to just taking an existing model and fine-tuning it) - can we fact check this just given the weights? The answer is yes. Not only that, even if parts of existing models were taken, or if attempts to obfuscate by permuting or even retraining, the answer is still yes. You can even localize the correlation down to which hidden unit. It's interesting how much you can reverse engineer just from weights. Despite tons of training that completely overhaul the semantic behavior of the model, the imprints of initialization seem to linger and never vanish...
🧵 1/ The rise of open-weight LLMs and platforms like HuggingFace raises interesting questions about the relationships between such models. Given a pair of models (i.e. Llama 1 vs Vicuna or Llama 3 vs Llama 2) what can we say about whether they were trained independently?
14
47
408
74,635
Having a hard time keeping track of all the foundation models, upstream datasets, and downstream products that come out every day? We built ecosystem graphs to monitor these assets: crfm.stanford.edu/ecosystem-…
5
66
376
65,085
While instruction tuning is clearly necessary for producing usable interfaces like ChatGPT, the "magic" of language models comes from self-supervised learning on broad data, which enables emergent behavior like in-context learning and chain-of-thought.
9
49
366
165,019
1/🧵How do we know if AI is actually ready for healthcare? We built a benchmark, MedHELM, that tests LMs on real clinical tasks instead of just medical exams. #AIinHealthcare Blog, GitHub, and link to leaderboard in thread!
9
69
344
60,367
I just discovered this account I made 11 years ago. So how does one use these Twitters?
21
12
328
What is the analogue of next-token prediction for reinforcement learning? To get true generality, you want to be able to convert everything in the world to an environment+reward for training.
22
37
322
72,058
One thing I really like about language models is that they are stateless (they are functional programs of type text -> text). This allows us to share prompts (essentially currying the LM) and reproduce results.
10
69
304
104,570
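The "currying" view of prompts can be illustrated with a toy example. The `toy_lm` function below is a hypothetical, deterministic stand-in for a language model (it just reverses the last line of its input); the point is only the shape of the types, text -> text.

```python
from typing import Callable

LM = Callable[[str], str]  # a stateless LM is a function from text to text


def toy_lm(text: str) -> str:
    """Hypothetical stand-in for an LM: reverses the last line of the input."""
    return text.splitlines()[-1][::-1]


def with_prompt(lm: LM, prompt: str) -> LM:
    """Fix the prompt once; the result is again a text -> text function."""
    return lambda user_text: lm(prompt + user_text)


# Sharing `prompt` plus the underlying model is enough to share this function.
reverser = with_prompt(toy_lm, "Reverse the following line.\n")
```

With a real model, the same call returns the same output only if decoding is deterministic (for example, greedy decoding or a fixed seed) and the served model does not change.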
Assignment 1 (get basic pipeline working): implement BPE tokenizer, Transformer architecture, Adam optimizer, train models on TinyStories and OpenWebText. Only PyTorch primitives are allowed (can’t just call torch.nn.Transformer or even torch.nn.Linear). github.com/stanford-cs336/as…
3
23
312
34,640
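As a rough idea of what the BPE tokenizer part of Assignment 1 involves, here is a minimal byte-level BPE training loop. It is a sketch under simplifying assumptions (no pre-tokenization, no special tokens, quadratic-time merging, an arbitrary tie-breaking rule), not the assignment's specification or a reference solution; `train_bpe` is an illustrative name.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Learn up to `num_merges` BPE merges over the UTF-8 bytes of `text`.

    Returns (merges, tokens): the learned merge pairs in order, and the
    text re-tokenized with those merges applied.
    """
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the lexicographically
        # larger pair (an arbitrary but deterministic choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens
```

For example, `train_bpe("abababc", 1)` learns the single merge `(b"a", b"b")` and re-tokenizes the text as `[b"ab", b"ab", b"ab", b"c"]`.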
Marin 32B training crossed 1.5 trillion tokens today...
9
15
298
316,317
When people say GPT-3, do they mean the original GPT-3 or InstructGPT? And which version? It makes a huge difference, so it'd be nice to explicitly specify davinci, text-davinci-002, etc. when making a claim about GPT-3.
19
15
277
Now that we have a frontier model that's open-weight (not open-source), it's time to go back to all those ambitious use cases where open-weight models failed to deliver (agents) and try again, so we can have reproducible science and not worry about API models getting deprecated.
7
27
276
41,276
HELM v0.4.0 is out! 1) We have a new frontend (thanks to community contribution from Mike Lay). 2) We have added Mistral 7B, which really is punching above its weight (see crfm.stanford.edu/helm/v0.4.…), rivaling models an order of magnitude larger on the 16 core scenarios:
7
41
266
209,305
LM agents are consequential for cybersecurity, both for offense (cyberrisk) and defense (penetration testing). To measure these capabilities, we are excited to release Cybench, a new cybersecurity benchmark consisting of 40 professional Capture the Flag (CTF) tasks:
9
53
256
90,753
LM APIs are fickle, hurting reproducibility (I was really hoping that text-davinci-003 was going to stick around for a while, given the number of papers using it). Researchers should seriously use open models (especially as they are getting better now!)
GPT-4 API is now available to all paying OpenAI API customers. GPT-3.5 Turbo, DALL·E, and Whisper APIs are also now generally available, and we’re announcing a deprecation plan for some of our older models, which will retire beginning of 2024: openai.com/blog/gpt-4-api-ge…
7
42
255
68,878
1/ Benchmarks clearly have had a huge impact in AI, but I think everyone agrees that they ought to be better. How should we improve them? It depends on which of the two goals you're after:
9
38
259
I want to thank each of my 113 co-authors for their incredible work - I learned so much from all of you, @StanfordHAI for providing the rich interdisciplinary environment that made this possible, and everyone who took the time to read this and give valuable feedback!
NEW: This comprehensive report investigates foundation models (e.g. BERT, GPT-3), which are engendering a paradigm shift in AI. 100+ scholars across 10 departments at Stanford scrutinize their capabilities, applications, and societal consequences. bit.ly/3xZPFYK
3
29
252
Llama 2 was trained on 2.4T tokens. RedPajama-Data-v2 has 30T tokens. But of course the data is of varying quality, so we include 40+ quality signals. Open research problem: how do you automatically select data for pretraining LMs? Data-centric AI folks: have a field day!
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama-d…
1
39
247
60,590
The goal is simple: a robust, scalable, easy-to-use, and blazing fast endpoint for open models like Llama 2, Mistral, etc. The implementation is anything but. Super impressed with the team for making this happen! And we're not done yet...if you're interested, come talk to us.
Announcing the fastest inference available anywhere. We released FlashAttention-2, Flash-Decoding, and Medusa as open source. Our team combined these techniques with our own optimizations and we are excited to announce the Together Inference Engine. together.ai/blog/together-in…
5
35
247
95,447
As capabilities of foundation models are waxing, *transparency* is waning. How do we quantify transparency? We introduce the Foundation Models Transparency Index (FMTI), evaluating 10 foundation model developers on 100 indicators. crfm.stanford.edu/fmti/
11
68
238
54,823
I don’t get the “is scale all you need?” debate. Here’s how I see it: accuracy = resources * (accuracy / resources), where - resources is how much you’ve scaled up data|compute, and - accuracy / resources is the data|compute efficiency of your method.
10
23
233
41,743
Executable papers on CodaLab Worksheets are now linked from paperswithcode.com pages thanks to a collaboration with @paperswithcode! For example: paperswithcode.com/paper/noi…
1
41
220
Foundation models (e.g., GPT-3) demonstrate emergence, where small models perform as well as random guessing on some task (e.g., addition), but large models obtain non-trivial error rates. Is there a much simpler learning problem that also exhibits emergence?
12
17
217
Most leaderboards just give you scores, leaving one wondering: what does 76.8% mean? In HELM, we are committed to full transparency, meaning clicking on a score will reveal the full set of instances, and you can even inspect the exact prompt (which we know makes a big difference). Check it out at crfm.stanford.edu/helm!
6
30
223
34,421
Position: When a foundation model developer reports a test score, they should report the corresponding train-test overlap. Does this happen? Based on public documentation, only 9/30 language models have train-test overlap for the test sets they report on (or have open data).
9
43
213
31,003
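One common proxy for train-test overlap is shared word n-grams between test instances and the training corpus. The sketch below computes the fraction of test instances that share at least one n-gram with the training data; the choice of n, the whitespace tokenization, and the function names are assumptions for illustration, not the definition used in the position above.

```python
def ngrams(text: str, n: int):
    """Set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def train_test_overlap(train_docs, test_instances, n=8):
    """Fraction of test instances sharing at least one n-gram with training."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = [bool(ngrams(t, n) & train_ngrams) for t in test_instances]
    return sum(flagged) / len(flagged) if flagged else 0.0
```

This kind of check needs access to the training data, so in practice only the model developer (or anyone, for open-data models) can run it.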
Open or closed foundation models? This is one of the most important, contentious questions in AI today. Important because it will determine structurally how AI will be developed, and contentious because we don’t have a shared framework. We offer guidance on this in a new paper:
5
40
207
34,422
HELM Lite v1.2.0 is out! Datasets: NarrativeQA, NaturalQA, OpenbookQA, MMLU, MATH, GSM8K, LegalBench, MedQA, WMT14 Results (we still need to add Claude 3, which requires more prompt finagling): crfm.stanford.edu/helm/lite/…
8
38
202
63,639
MMLU is the standard LM evaluation but model developers (i) use different prompting strategies and (ii) often do not release prompts. 3rd-party researchers often obtain lower scores 🤯 📢 HELM MMLU uses simple, standardized prompts, resulting in fair, reproducible comparisons of models:
11
28
202
40,009
As expected, lots of new models in the last few weeks. We're tracking them (along with datasets and applications) in the ecosystem graphs: crfm.stanford.edu/ecosystem-…
4
48
196
33,161
How closely can LM agents simulate people? We interview person P for 2 hours and prompt an LM with the transcript, yielding an agent P'. We find that P and P' behave similarly on a number of surveys and experiments. Very excited about the applications; this also forces us to think about the ethics and what uniquely defines a human being.
Simulating human behavior with AI agents promises a testbed for policy and the social sciences. We interviewed 1,000 people for two hours each to create generative agents of them. These agents replicate their source individuals’ attitudes and behaviors. 🧵arxiv.org/abs/2411.10109
7
35
199
26,670
Lots of recent work on improving *absolute capabilities* with test-time compute (o1, r1, etc.). We are instead interested in *efficiency* (capabilities per budget). See what you can do on test-time scaling with just *1K* (carefully chosen) examples:
DeepSeek r1 is exciting but misses OpenAI’s test-time scaling plot and needs lots of data. We introduce s1 reproducing o1-preview scaling & performance with just 1K samples & a simple test-time intervention. 📜arxiv.org/abs/2501.19393
8
23
192
36,904
Assignment 2 (make GPUs go brrrr): implement Flash Attention 2 in Triton, distributed data parallel + optimizer sharding. github.com/stanford-cs336/as…
1
6
183
26,749
How are OpenAI, Scale, NVIDIA, Softbank, Disney, Google, AMD, Coreweave dependent on each other? Our new AI supply chains tool tracks and visualizes the relationships between companies as they evolve using SEC filings and news. Explore and see if you notice any patterns or surprises!
In the AI ecosystem, who supplies the data? the compute? the models? We just released a new tool on the AI Supply Chain. Our dataset reveals how AI models, data, compute, capital, and even talent change hands. Here’s why you should care 👇
17
29
187
47,490
Replying to @sama
This makes a lot of sense from a product perspective. From a research and developer perspective though, having everything encapsulated moves us away from being able to understand how things work underneath the hood. We used to have an endpoint that corresponded to an autoregressive probabilistic model over tokens, and now we will have a magical box.
13
5
170
20,325
What if whenever an API model is deprecated (presumably because it's not relevant commercially), its model weights are released so that researchers can continue to do reproducible science?
8
13
163
32,401
The two most surprising things to me were that the trained Transformer could exploit sparsity like LASSO and that it exhibits double descent. How on earth is the Transformer encoding these algorithmic properties, and how did it just acquire them through training?
LLMs can do in-context learning, but are they "learning" new tasks or just retrieving ones seen during training? w/ @shivamg_13, @percyliang, & Greg Valiant we study a simpler Q: Can we train Transformers to learn simple function classes in-context? 🧵 arxiv.org/abs/2208.01066
2
29
168
The secret reason for Marin doing open development is that all the research state (usually in heads, private Slacks or docs) gets written down in the open, which means that LM agents can now concretely contribute to and advance the science and development of LMs.
4
9
172
20,162
...where I will attempt to compress all of my students' work on robust ML in the last 3 years into 40 minutes. We'll see how that goes.
1/ 📢 Registration now open for Percy Liang's (@percyliang) seminar this Thursday, Oct 29 from 12 pm to 1.30 pm Eastern Time! 👇🏾 Register here: us02web.zoom.us/webinar/regi… #TrustML #MachineLearning #ArtificialIntelligence #DeepLearning
1
17
163
gpt-oss-120b is the top open-weight model (with Kimi K2 right on its tail) for capabilities (HELM capabilities v1.11):
15
11
165
24,471
Holistic Evaluation of Language Models (HELM) v0.2.2 is updated with results from @CohereAI's command models and @Aleph__Alpha's Luminous models. Models are definitely getting better on average, but improvements are uneven. crfm.stanford.edu/helm/v0.2.…
6
37
161
34,281
When choosing a benchmark, you can have 2 of the 3 properties: 1. Realistic 2. Difficult 3. Quick evaluation Chatbot Arena is 1 + 3, HLE is 2 + 3. Our work (UQ) is 1 + 2, and for 3, evaluation happens over time with help from the community.
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
3
19
164
19,447
Assignment 4 (data): convert Common Crawl HTML to text, filter filter filter (quality, harmful content, PII), deduplication. This is the grunt work that doesn’t get enough appreciation. github.com/stanford-cs336/as…
2
6
158
22,198
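The deduplication step of Assignment 4 can be illustrated with the simplest variant, exact deduplication by hashing normalized text. This is only a sketch: the normalization rule is an assumption, and near-duplicate detection (for example MinHash-style fuzzy matching) is a separate and more involved technique not shown here.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing by content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Storing fixed-size digests instead of full documents keeps memory roughly proportional to the number of unique documents rather than their total length.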
We just updated *ecosystem graphs* with the latest datasets, models, and products: crfm.stanford.edu/ecosystem-…
6
36
149
45,606
Why this confusion? First, because our standards for openness in AI are so low. The status quo for frontier models is API access, so we cheer when we can get our hands on weights.
1
6
143
17,451
Until now, HELM has evaluated LMs on short responses, where evaluation is simple. We now introduce HELM Instruct, which evaluates open-ended instruction following. We evaluate 4 models on 7 scenarios using 4 evaluators against 5 criteria:
4
36
148
39,008
In HELM, we evaluated language models. Now, we evaluate organizations that build language models. Just like model evaluations incentivize improvement in model quality, we hope that these evaluations will incentivize improvement in development and deployment practices.
6
37
140
28,244
First, open models enable a tremendous amount of (badly needed) safety research, which requires full access to model weights (ideally with training data). API access is insufficient.
2
8
139
12,122
My favorite detail about @nelsonfliu's evaluation of generative search engines is he takes queries from Reddit ELI5 as soon as they are posted and evaluates them in real time. This ensures the test set was not trained on (or retrieved from). nitter.app/nelsonfliu/status/1649…
3
15
146
40,436
Interested in building and benchmarking LLMs and other foundation models in a vibrant academic setting? @StanfordCRFM is hiring research engineers! careersearch.stanford.edu/jo… Here are some things that you could be a part of:
2
38
141
52,727
Announcing HELM lite v1.0.0, a revamp of the HELM classic benchmark, built on the same modular HELM framework. New scenarios: LegalBench (law), MedQA (medicine), WMT2014 (machine translation) New models: GPT-4, Claude, PaLM 2, Mixtral, Yi crfm.stanford.edu/2023/12/19…
5
27
150
31,438
A good LM + naive tree search => new kernels that outperform PyTorch implementations...so much more to do.
✨ New blog post 👀: We have some very fast AI-generated kernels generated with a simple test-time only search. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch. (1/6) [🔗 link in final post]
2
13
144
20,422
This is the dream: having a system whose action space is universal (at least in the world of bits). And with foundation models, it is actually possible now to produce sane predictions in that huge action space. Some interesting challenges:
2
16
142
The term "foundation model" and its motivation unfortunately continues to be misunderstood. We wrote a blog post last year (see "Naming" section of crfm.stanford.edu/2021/10/18…) which aims to explain our thought process. Some selected quotes from the post:
3
18
137
But for open science, we really need open-source models. How do you interpret test accuracies without knowledge of the training data? How can we understand model capabilities without knowing what the sources are? While open-weight models are hugely enabling, we also risk building our entire field on unclear foundations.
2
11
137
19,385
Excited to see what kind of methods the community will come up with to address these realistic shifts in the wild! Also, if you are working on a real-world application and encounter distributional shifts, come talk to us!
We're excited to announce WILDS, a benchmark of in-the-wild distribution shifts with 7 datasets across diverse data modalities and real-world applications. Website: wilds.stanford.edu Paper: arxiv.org/abs/2012.07421 Github: github.com/p-lambda/wilds Thread below. (1/12)
2
8
139
We ran Llama 4 Maverick through some HELM benchmarks. It is 1st on HELM capabilities (MMLU-Pro, GPQA, IFEval, WildBench, Omni-MATH), but… crfm.stanford.edu/helm/capab…
5
15
138
29,455
2021: let's increase model size! 2023: let's increase FLOPs! 2025: let's increase ???! Shouldn't FLOPs be in the denominator rather than the numerator? Numerator should be some measure of capability+safety. We need better evals to capture this!
7
9
131
22,236
These powerful foundation models will be deployed to billions of people soon, which means there will be economic incentives for bad actors to start messing around. So we better figure out security for foundation models soon.
3
15
127
19,481
A better solution would be to have all the LM providers agree on a common repository of examples that should be excluded from any training run.
5
3
121
21,943
AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we are excited to release BountyBench, the first framework to capture offensive & defensive cyber-capabilities in evolving real-world systems.
4
27
131
15,651
HELM MMLU v1.5.0 is out. crfm.stanford.edu/helm/mmlu/… Claude 3.5 Sonnet takes the top position.
1
24
126
14,838
What is the largest fully reproducible language model? That is, where I can get the data and code and run a sequence of commands that deterministically produces the exact model?
6
4
128