Percy Liang · May 19, 2025 · 6:08 PM UTC

Percy Liang

@percyliang

19 May 2025

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

224

1,228

209,444

Percy Liang · Jun 18, 2025 · 10:04 PM UTC

Percy Liang

@percyliang

18 Jun 2025

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

569

4,921

679,345

Percy Liang · Jan 26, 2025 · 7:22 AM UTC

Percy Liang

@percyliang

26 Jan 2025

While we celebrate @deepseek_ai 's release of open-weight models that we can all play with at home, just a friendly reminder that they are not *open-source*; there’s no training / data processing code, and hardly any information about the data.

238

426

4,634

776,375

Percy Liang · Oct 24, 2025 · 6:42 AM UTC

Percy Liang

@percyliang

24 Oct 2025

You spend $1B training a model A. Someone on your team leaves and launches their own model API B. You're suspicious. Was B derived (e.g., fine-tuned) from A? But you only have blackbox access to B... With our paper, you can still tell with strong statistical guarantees (p-values < 1e-8). Idea: test for independence of A's training data order with likelihoods under B. There are crazy amounts of metadata about the training process baked into the model that can't be washed out, like a palimpsest...

Sally Zhu

@SallyHZhu

23 Oct 2025

🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨

209

2,433

381,013
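The idea in the tweet above (testing whether A's training data order is independent of likelihoods under B) can be illustrated with a rank-correlation permutation test. This is a minimal sketch under stated assumptions, not the paper's actual test statistic: the function names `spearman` and `permutation_p_value` are hypothetical, and the per-example log-likelihoods under B are assumed to have been collected already.

```python
import random


def rank(values):
    """Return 0-based ranks of the values (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation between two equal-length sequences (n >= 2)."""
    rx, ry = rank(xs), rank(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for rx and ry
    return cov / var


def permutation_p_value(positions, loglik, trials=2000, seed=0):
    """Estimate how often shuffled likelihoods correlate with the training
    order at least as strongly as the observed ones (two-sided).

    positions: index of each example in A's (shuffled) training order.
    loglik:    log-likelihood of the same example under model B (assumed given).
    """
    rng = random.Random(seed)
    observed = abs(spearman(positions, loglik))
    shuffled = list(loglik)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(spearman(positions, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

A small p-value suggests that B's likelihoods depend on A's training order, which would be unexpected if B had been trained independently. Note that a permutation estimate with a few thousand trials cannot resolve p-values as small as 1e-8; reaching that range would need an analytic approximation or far more trials.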

Percy Liang · Jun 11, 2024 · 10:05 PM UTC

Percy Liang

@percyliang

11 Jun 2024

We should call models like Llama 3, Mixtral, etc. “open-weight models”, not “open-source models”. For a model to be open-source, the code and training data need to be public (good examples: GPT-J, OLMo, RedPajama, StarCoder, K2, etc.). Weights are like an exe file, which would be ridiculous to call open-source.

288

1,963

260,974

Percy Liang · Dec 15, 2022 · 7:17 PM UTC

Percy Liang

@percyliang

15 Dec 2022

📣 CRFM announces PubMedGPT, a new 2.7B language model that achieves a new SOTA on the US medical licensing exam. The recipe is simple: a standard Transformer trained from scratch on PubMed (from The Pile) using @mosaicml on the MosaicML Cloud, then fine-tuned for the QA task.

317

1,472

426,664

Percy Liang · Oct 23, 2022 · 2:33 PM UTC

Percy Liang

@percyliang

23 Oct 2022

Writing on a whiteboard can make it easier for students to follow compared to slides (especially for math). During the pandemic, I added a feature to sfig (my Javascript slides library) to allow me to reveal parts of a slide using the mouse as if I were writing on a whiteboard:

1,134

Percy Liang · Nov 3, 2023 · 6:59 PM UTC

Percy Liang

@percyliang

3 Nov 2023

Myth: open foundation models are antithetical to AI safety. Fact: open foundation models are critical for AI safety. Here are three reasons why:

250

1,106

425,613

Percy Liang · Jan 29, 2023 · 7:12 AM UTC

Percy Liang

@percyliang

29 Jan 2023

I worry about language models being trained on test sets. Recently, we emailed support@openai.com to opt out of having our (test) data be used to improve models. This isn't enough though: others running evals could still inadvertently contribute those test sets to training.

100

937

291,860

Percy Liang · Dec 7, 2022 · 6:55 AM UTC

Percy Liang

@percyliang

7 Dec 2022

RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?

925

Percy Liang · Dec 6, 2024 · 7:23 PM UTC

Percy Liang

@percyliang

6 Dec 2024

I miss the days when we evaluated algorithms rather than models. Rather than "how well does model M do?", it should be "given data D and compute C, how well does running algorithm A on D with C do?" I don't think we can get scientific clarity unless we do the latter.

782

56,499

Percy Liang · Nov 17, 2022 · 4:04 PM UTC

Percy Liang

@percyliang

17 Nov 2022

Language models are becoming the foundation of language technologies, but when do they work or don’t work? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements:

193

755

Percy Liang · Nov 25, 2024 · 7:55 PM UTC

Percy Liang

@percyliang

25 Nov 2024

This year, I have 4 exceptional students on the academic job market, and they couldn’t be more different, with research spanning AI policy, robotics, NLP, and HCI. Here’s a brief summary of their research, along with one representative work each:

672

122,651

Percy Liang · Sep 4, 2025 · 4:59 PM UTC

Percy Liang

@percyliang

4 Sep 2025

We did a very careful study of 10 optimizers with no horse in the race. Despite all the excitement about Muon, Mars, Kron, Soap, etc., at the end of the day, if you tune the hyperparameters rigorously and scale up, the speedup over AdamW diminishes to only 10% :-( Experiments are made possible by Marin (github.com/marin-community/m…); anyone developing new optimizers: please come try your method on this benchmark!

Fantastic Pretraining Optimizers And Where to Find them · Issue #1290 · marin-community/marin

Description We evaluate a suite of optimizers on Transformer-style language models (130 M–1.2 B parameters) trained on up to 16× Chinchilla-optimal data. The goal is to quantify real speedups under...

github.com

Kaiyue Wen

@wen_kaiyue

4 Sep 2025

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

686

181,591

Percy Liang · Sep 19, 2025 · 6:09 PM UTC

Percy Liang

@percyliang

19 Sep 2025

-2016 (classic era): focus on data efficiency 2017-2025 (pretraining era): focus on compute efficiency 2026-: focus on data efficiency (again) The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design decisions, which will be exciting!

Suhas Kotha @kothasuhas

19 Sep 2025

Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute

613

103,907

Percy Liang · May 3, 2022 · 11:08 PM UTC

Percy Liang

@percyliang

3 May 2022

Meta's release of OPT is an exciting step towards opening new opportunities for research. In general, we can think of stronger release as enabling researchers to tackle deeper questions. There are different levels of strength:

568

Percy Liang · Jan 26, 2025 · 7:22 AM UTC

Percy Liang

@percyliang

26 Jan 2025

So why are DeepSeek models so good? To their credit, the papers have more detail than most frontier model papers, but it’s hard to tell which details matter. And data is the big missing piece, which we know to be the most important factor that determines model quality.

549

62,999

Percy Liang · Jan 26, 2025 · 7:22 AM UTC

Percy Liang

@percyliang

26 Jan 2025

True open-source allows us to study and modify artifacts. We can study the DeepSeek papers (which are nicely written, but still omit details), and we can fine-tune their models, but one cannot understand or modify them at a deep level.

535

65,034

Percy Liang · Feb 27, 2023 · 6:08 AM UTC

Percy Liang

@percyliang

27 Feb 2023

ChatGPT is reactive: user says X, ChatGPT responds with Y. Risks exist but are bounded. Soon it will be tempting to have proactive systems - an assistant that will answer emails for you, take actions on your behalf, etc. Risks will then be much higher.

543

116,977

Percy Liang · Oct 1, 2023 · 7:11 PM UTC

Percy Liang

@percyliang

1 Oct 2023

Many "open" language models only come with released weights. In software, this is analogous to releasing a binary without code (you wouldn't call this open-source). To get the full benefits of transparency, you need the training data. GPT-J, GPT-NeoX, BLOOM, RedPajama do this.

523

88,221

Percy Liang · Jan 3, 2023 · 8:31 PM UTC

Percy Liang

@percyliang

3 Jan 2023

Announcing Holistic Evaluation of Language Models (HELM) v0.2.0 with updated results on the new @OpenAI, @AI21Labs, and @CohereAI models. HELM now evaluates 34 prominent language models in a standardized way on 42 scenarios x 7 metrics.

547

154,209

Percy Liang · Oct 29, 2025 · 3:48 PM UTC

Percy Liang

@percyliang

29 Oct 2025

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

595

127,413

Percy Liang · May 21, 2025 · 4:07 PM UTC

Percy Liang

@percyliang

21 May 2025

For a rare look into how LLMs are really built, check out @dlwh's retrospective on how we trained the Marin 8B model from scratch (and outperformed Llama 3.1 8B base). It’s an honest account with all the revelations and mistakes we made along our journey. Papers are forced to hide the mess, but the real science happens in the process. marin.readthedocs.io/en/late…

498

56,362

Percy Liang · Jan 11, 2023 · 8:50 PM UTC

Percy Liang

@percyliang

11 Jan 2023

I have 6 fantastic students and post-docs who are on the academic job market this year. Here is a short thread summarizing their work along with one representative paper:

492

150,138

Percy Liang · Jun 21, 2022 · 5:49 PM UTC

Percy Liang

@percyliang

21 Jun 2022

There are legitimate and scientifically valuable reasons to train a language model on toxic text, but the deployment of GPT-4chan lacks them. AI researchers: please look at this statement and see what you think: forms.gle/ikiYE6ArLpWYz7aDA

Condemning the deployment of GPT-4chan

Large language models, and more generally foundation models, are powerful technologies that carry a potential risk of significant harm. Unfortunately, we, the AI community, currently lack community...

docs.google.com

125

454

Percy Liang · Feb 28, 2023 · 4:48 PM UTC

Percy Liang

@percyliang

28 Feb 2023

When will the original GPT-3 model (davinci) be old enough that its weights can be safely released? It would be very useful for science and poses no additional risks (since open models will catch up anyway). In general, all models should expire and be released eventually.

485

94,465

Percy Liang · Jan 20, 2024 · 6:56 PM UTC

Percy Liang

@percyliang

20 Jan 2024

My TEDAI talk from Oct 2023 is now live: go.ted.com/percyliang It was a hard talk to give: 1. I memorized it - felt more like giving a piano recital than an academic talk. 2. I wanted it to be timeless despite AI changing fast…still ok after 3 months. Here’s what I said:

A new way to build AI, openly

Today's AI is trained on the work of artists and writers without attribution, its core values decided by a privileged few. What if the future of AI was more open and democratic? Researcher Percy...

ted.com

477

100,194

Percy Liang · May 21, 2023 · 6:32 PM UTC

Percy Liang

@percyliang

21 May 2023

No matter how good LMs get at writing, I will always want to write some things from scratch - for the same reason that I sometimes grow my own tomatoes, make my own granola, learn to play a Chopin etude...not because it's better, but because of the sheer joy of creation.

452

77,744

Percy Liang · Jun 7, 2022 · 1:02 AM UTC

Percy Liang

@percyliang

7 Jun 2022

Vision took autoregressive Transformers from NLP. Now, NLP takes diffusion from vision. What will be the dominant paradigm in 5 years? Excited by the wide open space of possibilities that diffusion unlocks.

Xiang Lisa Li @XiangLisaLi2

7 Jun 2022

arxiv.org/abs/2205.14217 We propose Diffusion-LM, a non-autoregressive language model based on continuous diffusions. It enables complex controllable generation. We can steer the LM to generate text with desired syntax structure ( [S [NP...VP…]]) and semantic content (name=Coupa)

446

Percy Liang · Dec 12, 2023 · 7:40 AM UTC

Percy Liang

@percyliang

12 Dec 2023

I have 4 incredible students/post-docs on the academic job market this year. As per tradition, I'll attempt to summarize their research + one representative paper:

437

160,918

Percy Liang · Oct 29, 2025 · 1:57 AM UTC

Percy Liang

@percyliang

29 Oct 2025

Open AI means AI that is open.

436

79,054

Percy Liang · Mar 13, 2023 · 4:36 PM UTC

Percy Liang

@percyliang

13 Mar 2023

Lack of transparency/full access to capable instruct models like GPT 3.5 has limited academic research in this important space. We make one small step with Alpaca (LLaMA 7B + self-instruct text-davinci-003), which is reasonably capable and dead simple:

Tatsunori Hashimoto @tatsu_hashimoto

13 Mar 2023

Instruction-following models are now ubiquitous, but API-only access limits research. Today, we’re releasing info on Alpaca (solely for research use), a small but capable 7B model based on LLaMA that often behaves like OpenAI’s text-davinci-003. Demo: crfm.stanford.edu/alpaca/

428

147,171

Percy Liang · May 30, 2025 · 3:36 PM UTC

Percy Liang

@percyliang

30 May 2025

For trying to understand LMs deeply, @AiEleuther’s Pythia has been an invaluable resource: 16 LMs (70M to 12B parameters) trained on the same data (The Pile) in the same order, with intermediate checkpoints. It’s been two years and it’s time for a refresh.

446

59,136

Percy Liang · Jan 26, 2025 · 7:22 AM UTC

Percy Liang

@percyliang

26 Jan 2025

So the only answer I can give is because they have a very strong team.

401

54,431

Percy Liang · Dec 1, 2022 · 6:31 AM UTC

Percy Liang

@percyliang

1 Dec 2022

I am excited to be part of 7 NeurIPS papers on understanding and improving foundation models. We...

417

Percy Liang · May 24, 2023 · 5:18 PM UTC

Percy Liang

@percyliang

24 May 2023

2nd-order optimization has been around for 300+ years...we got it to scale for LLMs (it's surprisingly simple: use the diagonal + clip). Results are promising (2x faster than Adam, which halves your $$$). A shining example of why students should still take optimization courses!

Tengyu Ma

@tengyuma

24 May 2023

Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️

409

91,778
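The "diagonal + clip" recipe mentioned above can be illustrated with a single simplified update step. This is a rough sketch, not the Sophia implementation: the name `sophia_like_step`, the hyperparameter names and defaults (`lr`, `beta1`, `gamma`, `eps`), and the assumption that the caller supplies a diagonal Hessian estimate `hess_diag` are all illustrative choices, and weight decay is left out.

```python
def sophia_like_step(theta, grad, m, hess_diag,
                     lr=0.1, beta1=0.9, gamma=0.01, eps=1e-12):
    """One per-coordinate update: momentum divided by a diagonal curvature
    estimate, with the resulting step clipped to [-1, 1].

    theta, grad, m, hess_diag are equal-length lists of floats.
    Returns the updated (theta, m).
    """
    new_theta, new_m = [], []
    for t, g, m_i, h in zip(theta, grad, m, hess_diag):
        # Exponential moving average of the gradient (momentum).
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Precondition by the diagonal curvature; eps guards against
        # tiny or negative curvature estimates.
        ratio = m_i / max(gamma * h, eps)
        # Clipping bounds the per-coordinate step size.
        step = max(-1.0, min(1.0, ratio))
        new_theta.append(t - lr * step)
        new_m.append(m_i)
    return new_theta, new_m
```

When the curvature estimate is near zero the ratio blows up and the clip takes over, so the step has magnitude `lr` in the direction of the momentum; when curvature is large the step shrinks accordingly.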

Percy Liang · Apr 29, 2024 · 3:39 AM UTC

Percy Liang

@percyliang

29 Apr 2024

model = learn(data) Synthetic data is great, but it’s not data. It’s an intermediate quantity created by learn(). Data is created by people and has privacy and copyright considerations. Synthetic “data” does not - it’s internal to learn().

400

63,541

Percy Liang · Feb 20, 2025 · 9:27 PM UTC

Percy Liang

@percyliang

20 Feb 2025

Suppose someone uploads a new SOTA model on HF claiming they trained it from scratch (as opposed to just taking an existing model and fine-tuning it) - can we fact check this just given the weights? The answer is yes. Not only that, even if parts of existing models were taken, or if attempts to obfuscate by permuting or even retraining, the answer is still yes. You can even localize the correlation down to which hidden unit. It's interesting how much you can reverse engineer just from weights. Despite tons of training that completely overhaul the semantic behavior of the model, the imprints of initialization seem to linger and never vanish...

Ahmed Ahmed

@AhmedSQRD

20 Feb 2025

🧵 1/ The rise of open-weight LLMs and platforms like HuggingFace raises interesting questions about the relationships between such models. Given a pair of models (i.e. Llama 1 vs Vicuna or Llama 3 vs Llama 2) what can we say about whether they were trained independently?

408

74,635

Percy Liang · Oct 12, 2023 · 6:43 AM UTC

Percy Liang

@percyliang

12 Oct 2023

Having a hard time keeping track of all the foundation models, upstream datasets, and downstream products that come out every day? We built ecosystem graphs to monitor these assets: crfm.stanford.edu/ecosystem-…

376

65,085

Percy Liang · Feb 6, 2023 · 12:27 AM UTC

Percy Liang

@percyliang

6 Feb 2023

While instruction tuning is clearly necessary for producing usable interfaces like ChatGPT, the "magic" of language models comes from self-supervised learning on broad data, which enables emergent behavior like in-context learning and chain-of-thought.

366

165,019

Percy Liang · Jun 18, 2025 · 10:04 PM UTC

Percy Liang

@percyliang

18 Jun 2025

You can find all our lectures on YouTube (thanks to @StanfordOnline): piped.video/playlist?list=PL… and the assignments on the course website so you can do it yourself at home: stanford-cs336.github.io/spr…

Stanford CS336 Language Modeling from Scratch I 2025

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpo...

youtube.com

377

35,041

Percy Liang · Feb 28, 2025 · 8:58 PM UTC

Percy Liang

@percyliang

28 Feb 2025

1/🧵How do we know if AI is actually ready for healthcare? We built a benchmark, MedHELM, that tests LMs on real clinical tasks instead of just medical exams. #AIinHealthcare Blog, GitHub, and link to leaderboard in thread!

344

60,367

Percy Liang · Oct 24, 2020 · 4:23 AM UTC

Percy Liang

@percyliang

24 Oct 2020

I just discovered this account I made 11 years ago. So how does one use these Twitters?

328

Percy Liang · Feb 28, 2025 · 4:50 AM UTC

Percy Liang

@percyliang

28 Feb 2025

What is the analogue of next-token prediction for reinforcement learning? To get true generality, you want to be able to convert everything in the world to an environment+reward for training.

322

72,058

Percy Liang · Feb 12, 2023 · 5:41 PM UTC

Percy Liang

@percyliang

12 Feb 2023

One thing I really like about language models is that they are stateless (they are functional programs of type text -> text). This allows us to share prompts (essentially currying the LM) and reproduce results.

304

104,570
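The "currying" view above can be sketched with `functools.partial`. This is a toy illustration under assumptions: `toy_lm` is a hypothetical stand-in for a deterministic LM call (real APIs are only reproducible in this sense when decoding is deterministic), and `complete` is an illustrative helper, not any library's API.

```python
from functools import partial


def toy_lm(text: str) -> str:
    """Hypothetical stand-in for an LM: a pure function of type text -> text."""
    return text.upper()


def complete(lm, prompt: str, user_input: str) -> str:
    """Apply the LM to a fixed prompt followed by the user's input."""
    return lm(prompt + user_input)


# Fixing the prompt yields a new text -> text function that can be shared
# and re-run by anyone with access to the same LM.
translate = partial(complete, toy_lm, "Translate to French: ")
```

Because nothing is stored between calls, `translate("hello")` returns the same output every time, which is what makes a shared prompt a reproducible artifact.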

Percy Liang · Jun 18, 2025 · 10:04 PM UTC

Percy Liang

@percyliang

18 Jun 2025

Assignment 1 (get basic pipeline working): implement BPE tokenizer, Transformer architecture, Adam optimizer, train models on TinyStories and OpenWebText. Only PyTorch primitives are allowed (can’t just call torch.nn.Transformer or even torch.nn.Linear). github.com/stanford-cs336/as…

312

34,640

Percy Liang · May 22, 2025 · 11:30 PM UTC

Percy Liang

@percyliang

22 May 2025

Marin 32B training crossed 1.5 trillion tokens today...

298

316,317

Percy Liang · Oct 21, 2022 · 6:24 AM UTC

Percy Liang

@percyliang

21 Oct 2022

When people say GPT-3, do they mean the original GPT-3 or InstructGPT? And which version? It makes a huge difference, so it'd be nice to explicitly specify davinci, text-davinci-002, etc. when making a claim about GPT-3.

277

Percy Liang · Jul 25, 2024 · 4:01 AM UTC

Percy Liang

@percyliang

25 Jul 2024

Now that we have a frontier model that's open-weight (not open-source), it's time to go back to all those ambitious use cases where open-weight models failed to deliver (agents) and try again, so we can have reproducible science and not worry about API models getting deprecated.

276

41,276

Percy Liang · Nov 21, 2023 · 8:09 AM UTC

Percy Liang

@percyliang

21 Nov 2023

HELM v0.4.0 is out! 1) We have a new frontend (thanks to community contribution from Mike Lay). 2) We have added Mistral 7B, which really is punching above its weight (see crfm.stanford.edu/helm/v0.4.…), rivaling models an order of magnitude larger on the 16 core scenarios:

266

209,305

Percy Liang · Aug 20, 2024 · 1:53 PM UTC

Percy Liang

@percyliang

20 Aug 2024

LM agents are consequential for cybersecurity, both for offense (cyberrisk) and defense (penetration testing). To measure these capabilities, we are excited to release Cybench, a new cybersecurity benchmark consisting of 40 professional Capture the Flag (CTF) tasks:

256

90,753

Percy Liang · Jul 10, 2023 · 3:54 AM UTC

Percy Liang

@percyliang

10 Jul 2023

LM APIs are fickle, hurting reproducibility (I was really hoping that text-davinci-003 was going to stick around for a while, given the number of papers using it). Researchers should seriously use open models (especially as they are getting better now!)

OpenAI

@OpenAI

6 Jul 2023

GPT-4 API is now available to all paying OpenAI API customers. GPT-3.5 Turbo, DALL·E, and Whisper APIs are also now generally available, and we’re announcing a deprecation plan for some of our older models, which will retire beginning of 2024: openai.com/blog/gpt-4-api-ge…

255

68,878

Percy Liang · Apr 26, 2022 · 6:27 AM UTC

Percy Liang

@percyliang

26 Apr 2022

1/ Benchmarks clearly have had a huge impact in AI, but I think everyone agrees that they ought to be better. How should we improve them? It depends on which of the two goals you're after:

259

Percy Liang · Aug 18, 2021 · 5:37 PM UTC

Percy Liang

@percyliang

18 Aug 2021

I want to thank each of my 113 co-authors for their incredible work - I learned so much from all of you, @StanfordHAI for providing the rich interdisciplinary environment that made this possible, and everyone who took the time to read this and give valuable feedback!

Stanford HAI

@StanfordHAI

18 Aug 2021

NEW: This comprehensive report investigates foundation models (e.g. BERT, GPT-3), which are engendering a paradigm shift in AI. 100+ scholars across 10 departments at Stanford scrutinize their capabilities, applications, and societal consequences. bit.ly/3xZPFYK

252

Percy Liang · Oct 30, 2023 · 8:51 PM UTC

Percy Liang

@percyliang

30 Oct 2023

Llama 2 was trained on 2.4T tokens. RedPajama-Data-v2 has 30T tokens. But of course the data is of varying quality, so we include 40+ quality signals. Open research problem: how do you automatically select data for pretraining LMs? Data-centric AI folks: have a field day!

Together AI

@togethercompute

30 Oct 2023

We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama-d…

247

60,590

Percy Liang · Nov 13, 2023 · 4:18 PM UTC

Percy Liang

@percyliang

13 Nov 2023

The goal is simple: a robust, scalable, easy-to-use, and blazing fast endpoint for open models like Llama 2, Mistral, etc. The implementation is anything but. Super impressed with the team for making this happen! And we're not done yet...if you're interested, come talk to us.

Together AI

@togethercompute

13 Nov 2023

Announcing the fastest inference available anywhere. We released FlashAttention-2, Flash-Decoding, and Medusa as open source. Our team combined these techniques with our own optimizations and we are excited to announce the Together Inference Engine. together.ai/blog/together-in…

247

95,447

Percy Liang · Oct 18, 2023 · 4:45 PM UTC

Percy Liang

@percyliang

18 Oct 2023

As capabilities of foundation models are waxing, *transparency* is waning. How do we quantify transparency? We introduce the Foundation Models Transparency Index (FMTI), evaluating 10 foundation model developers on 100 indicators. crfm.stanford.edu/fmti/

238

54,823

Percy Liang · Jan 24, 2025 · 7:01 AM UTC

Percy Liang

@percyliang

24 Jan 2025

I don’t get the “is scale all you need?” debate. Here’s how I see it: accuracy = resources * (accuracy / resources), where - resources is how much you’ve scaled up data|compute, and - accuracy / resources is the data|compute efficiency of your method.

233

41,743

Percy Liang · Jan 28, 2021 · 5:18 AM UTC

Percy Liang

@percyliang

28 Jan 2021

Executable papers on CodaLab Worksheets are now linked from paperswithcode.com pages thanks to a collaboration with @paperswithcode! For example: paperswithcode.com/paper/noi…

220

Percy Liang · May 22, 2022 · 9:23 PM UTC

Percy Liang

@percyliang

22 May 2022

Foundation models (e.g., GPT-3) demonstrate emergence, where small models perform as well as random guessing on some task (e.g., addition), but large models obtain non-trivial error rates. Is there a much simpler learning problem that also exhibits emergence?

217

Percy Liang · Apr 20, 2024 · 8:12 PM UTC

Percy Liang

@percyliang

20 Apr 2024

Most leaderboards just give you scores, leaving one wondering: what does 76.8% mean? In HELM, we are committed to full transparency, meaning clicking on a score will reveal the full set of instances, and you can even inspect the exact prompt (which we know makes a big difference). Check it out at crfm.stanford.edu/helm!

223

34,421

Percy Liang · Oct 14, 2024 · 4:53 AM UTC

Percy Liang

@percyliang

14 Oct 2024

Position: When a foundation model developer reports a test score, they should report the corresponding train-test overlap. Does this happen? Based on public documentation, only 9/30 language models have train-test overlap for the test sets they report on (or have open data).

213

31,003
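One common way to estimate train-test overlap is to check whether test instances share long n-grams with the training corpus. The sketch below is an assumption-laden illustration, not the method from the paper: the function names are hypothetical, tokenization is plain whitespace splitting, and the default `n=13` is only one conventional choice.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def train_test_overlap(train_docs, test_instances, n=13):
    """Fraction of test instances sharing at least one n-gram with training data."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    if not test_instances:
        return 0.0
    flagged = [t for t in test_instances if ngrams(t, n) & train_ngrams]
    return len(flagged) / len(test_instances)
```

Reporting a number like this next to each test score requires access to the training data, which is why the tweet treats open data as an acceptable alternative.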

Percy Liang · Feb 28, 2024 · 12:13 AM UTC

Percy Liang

@percyliang

28 Feb 2024

Open or closed foundation models? This is one of the most important, contentious questions in AI today. Important because it will determine structurally how AI will be developed, and contentious because we don’t have a shared framework. We offer guidance on this in a new paper:

207

34,422

Percy Liang · Apr 26, 2024 · 5:05 AM UTC

Percy Liang

@percyliang

26 Apr 2024

HELM Lite v1.2.0 is out! Datasets: NarrativeQA, NaturalQA, OpenbookQA, MMLU, MATH, GSM8K, LegalBench, MedQA, WMT14 Results (we still need to add Claude 3, which requires more prompt finagling): crfm.stanford.edu/helm/lite/…

202

63,639

Percy Liang · May 2, 2024 · 3:44 AM UTC

Percy Liang

@percyliang

2 May 2024

MMLU is the standard LM evaluation but model developers (i) use different prompting strategies and (ii) often do not release prompts. 3rd-party researchers often obtain lower scores 🤯 📢 HELM MMLU uses simple, standardized prompts, resulting in fair, reproducible comparisons of models:

202

40,009

Percy Liang · Mar 29, 2024 · 4:59 PM UTC

Percy Liang

@percyliang

29 Mar 2024

As expected, lots of new models in the last few weeks. We're tracking them (along with datasets and applications) in the ecosystem graphs: crfm.stanford.edu/ecosystem-…

196

33,161

Percy Liang · Nov 18, 2024 · 5:04 PM UTC

Percy Liang

@percyliang

18 Nov 2024

How closely can LM agents simulate people? We interview person P for 2 hours and prompt an LM with the transcript, yielding an agent P'. We find that P and P' behave similarly on a number of surveys and experiments. Very excited about the applications; this also forces us to think about the ethics and what uniquely defines a human being.

Joon Sung Park @joon_s_pk

18 Nov 2024

Simulating human behavior with AI agents promises a testbed for policy and the social sciences. We interviewed 1,000 people for two hours each to create generative agents of them. These agents replicate their source individuals’ attitudes and behaviors. 🧵arxiv.org/abs/2411.10109

199

26,670

Percy Liang · Feb 3, 2025 · 7:02 PM UTC

Percy Liang

@percyliang

3 Feb 2025

Lots of recent work on improving *absolute capabilities* with test-time compute (o1, r1, etc.). We are instead interested in *efficiency* (capabilities per budget). See what you can do on test-time scaling with just *1K* (carefully chosen) examples:

Niklas Muennighoff @Muennighoff

3 Feb 2025

DeepSeek r1 is exciting but misses OpenAI’s test-time scaling plot and needs lots of data. We introduce s1 reproducing o1-preview scaling & performance with just 1K samples & a simple test-time intervention. 📜arxiv.org/abs/2501.19393

192

36,904

Percy Liang · Jun 18, 2025 · 10:04 PM UTC

Percy Liang

@percyliang

18 Jun 2025

Assignment 2 (make GPUs go brrrr): implement Flash Attention 2 in Triton, distributed data parallel + optimizer sharding. github.com/stanford-cs336/as…

183

26,749

Percy Liang · Nov 11, 2025 · 4:17 PM UTC

Percy Liang

@percyliang

11 Nov 2025

How are OpenAI, Scale, NVIDIA, Softbank, Disney, Google, AMD, Coreweave dependent on each other? Our new AI supply chains tool tracks and visualizes the relationships between companies as they evolve using SEC filings and news. Explore and see if you notice any patterns or surprises!

Sarah Cen

@cen_sarah

10 Nov 2025

In the AI ecosystem, who supplies the data? the compute? the models? We just released a new tool on the AI Supply Chain. Our dataset reveals how AI models, data, compute, capital, and even talent change hands. Here’s why you should care 👇

187

47,490

Percy Liang · Feb 13, 2025 · 6:34 PM UTC

Percy Liang

@percyliang

13 Feb 2025

Replying to @sama

This makes a lot of sense from a product perspective. From a research and developer perspective though, having everything encapsulated moves us away from being able to understand how things work underneath the hood. We used to have an endpoint that corresponded to an autoregressive probabilistic model over tokens, and now we will have a magical box.

170

20,325

Percy Liang · Dec 5, 2023 · 5:29 AM UTC

Percy Liang

@percyliang

5 Dec 2023

What if whenever an API model is deprecated (presumably because it's not relevant commercially), its model weights are released so that researchers can continue to do reproducible science?

163

32,401

Percy Liang · Aug 5, 2022 · 8:29 PM UTC

Percy Liang

@percyliang

5 Aug 2022

The two most surprising things to me were that the trained Transformer could exploit sparsity like LASSO and that it exhibits double descent. How on earth is the Transformer encoding these algorithmic properties, and how did it just acquire them through training?

Dimitris Tsipras @tsiprasd

4 Aug 2022

LLMs can do in-context learning, but are they "learning" new tasks or just retrieving ones seen during training? w/ @shivamg_13, @percyliang, & Greg Valiant we study a simpler Q: Can we train Transformers to learn simple function classes in-context? 🧵 arxiv.org/abs/2208.01066

168

Percy Liang · May 23, 2025 · 3:48 PM UTC

Percy Liang

@percyliang

23 May 2025

The secret reason for Marin doing open development is that all the research state (usually in heads, private Slacks or docs) gets written down in the open, which means that LM agents can now concretely contribute to and advance the science and development of LMs.

172

20,162

Percy Liang · Oct 27, 2020 · 1:20 AM UTC

Percy Liang

@percyliang

27 Oct 2020

...where I will attempt to compress all of my students' work on robust ML in the last 3 years into 40 minutes. We'll see how that goes.

Trustworthy ML Initiative (TrustML)@trustworthy_ml

26 Oct 2020

1/ 📢 Registration now open for Percy Liang's (@percyliang) seminar this Thursday, Oct 29 from 12 pm to 1.30 pm Eastern Time! 👇🏾 Register here: us02web.zoom.us/webinar/regi… #TrustML #MachineLearning #ArtificialIntelligence #DeepLearning

163

Percy Liang · Aug 6, 2025 · 5:35 PM UTC

Percy Liang

@percyliang

6 Aug 2025

gpt-oss-120b is the top open-weight model (with Kimi K2 right on its tail) for capabilities (HELM capabilities v1.11):

165

24,471

Percy Liang · Mar 21, 2023 · 5:51 PM UTC

Percy Liang

@percyliang

21 Mar 2023

Holistic Evaluation of Language Models (HELM) v0.2.2 is updated with results from @CohereAI's command models and @Aleph__Alpha's Luminous models. Models are definitely getting better on average, but improvements are uneven. crfm.stanford.edu/helm/v0.2.…

161

34,281

Percy Liang · Aug 26, 2025 · 6:52 PM UTC

Percy Liang

@percyliang

26 Aug 2025

When choosing a benchmark, you can have 2 of the 3 properties: 1. Realistic 2. Difficult 3. Quick evaluation. Chatbot Arena is 1 + 3, HLE is 2 + 3. Our work (UQ) is 1 + 2, and for 3, evaluation happens over time with help from the community.

Ken Liu

@kenziyuliu

26 Aug 2025

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:

164

19,447

Percy Liang · Jun 18, 2025 · 10:04 PM UTC

Percy Liang

@percyliang

18 Jun 2025

Assignment 4 (data): convert Common Crawl HTML to text, filter filter filter (quality, harmful content, PII), deduplication. This is the grunt work that doesn’t get enough appreciation. github.com/stanford-cs336/as…

158

22,198

Percy Liang · Feb 7, 2024 · 1:26 AM UTC

Percy Liang

@percyliang

7 Feb 2024

We just updated *ecosystem graphs* with the latest datasets, models, and products: crfm.stanford.edu/ecosystem-…

149

45,606

Percy Liang · Dec 15, 2022 · 7:17 PM UTC

Percy Liang

@percyliang

15 Dec 2022

Blog: crfm.stanford.edu/2022/12/15… GitHub: github.com/stanford-crfm/pub… Model: huggingface.co/stanford-crfm…

152

15,124

Percy Liang · Jun 11, 2024 · 10:05 PM UTC

Percy Liang

@percyliang

11 Jun 2024

Why this confusion? First, because our standards for openness in AI are so low. The status quo for frontier models is API access, so we cheer when we can get our hands on weights.

143

17,451

Percy Liang · Feb 19, 2024 · 5:36 PM UTC

Percy Liang

@percyliang

19 Feb 2024

Until now, HELM has evaluated LMs on short responses, where evaluation is simple. We now introduce HELM Instruct, which evaluates open-ended instruction following. We evaluate 4 models on 7 scenarios using 4 evaluators against 5 criteria:

148

39,008

Percy Liang · Jun 16, 2023 · 7:55 AM UTC

Percy Liang

@percyliang

16 Jun 2023

In HELM, we evaluated language models. Now, we evaluate organizations that build language models. Just like model evaluations incentivize improvement in model quality, we hope that these evaluations will incentivize improvement in development and deployment practices.

140

28,244

Percy Liang · Nov 3, 2023 · 6:59 PM UTC

Percy Liang

@percyliang

3 Nov 2023

First, open models enable a tremendous amount of (badly needed) safety research, which requires full access to model weights (ideally with training data). API access is insufficient.

139

12,122

Percy Liang · Apr 22, 2023 · 8:23 PM UTC

Percy Liang

@percyliang

22 Apr 2023

My favorite detail about @nelsonfliu's evaluation of generative search engines is he takes queries from Reddit ELI5 as soon as they are posted and evaluates them in real time. This ensures the test set was not trained on (or retrieved from). nitter.app/nelsonfliu/status/1649…

146

40,436

Percy Liang · May 2, 2023 · 5:58 PM UTC

Percy Liang

@percyliang

2 May 2023

Interested in building and benchmarking LLMs and other foundation models in a vibrant academic setting? @StanfordCRFM is hiring research engineers! careersearch.stanford.edu/jo… Here are some things that you could be a part of:

141

52,727

Percy Liang · Dec 19, 2023 · 11:01 PM UTC

Percy Liang

@percyliang

19 Dec 2023

Announcing HELM lite v1.0.0, a revamp of the HELM classic benchmark, built on the same modular HELM framework. New scenarios: LegalBench (law), MedQA (medicine), WMT2014 (machine translation) New models: GPT-4, Claude, PaLM 2, Mixtral, Yi crfm.stanford.edu/2023/12/19…

150

31,438

Percy Liang · May 29, 2025 · 9:04 PM UTC

Percy Liang

@percyliang

29 May 2025

A good LM + naive tree search => new kernels that outperform PyTorch implementations...so much more to do.

Anne Ouyang

@anneouyang

29 May 2025

✨ New blog post 👀: We have some very fast AI-generated kernels generated with a simple test-time only search. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch. (1/6) [🔗 link in final post]

144

20,422

Percy Liang · Sep 15, 2022 · 5:17 AM UTC

Percy Liang

@percyliang

15 Sep 2022

This is the dream: having a system whose action space is universal (at least in the world of bits). And with foundation models, it is actually possible now to produce sane predictions in that huge action space. Some interesting challenges:

This tweet is unavailable

142

Percy Liang · Jun 30, 2022 · 11:43 PM UTC

Percy Liang

@percyliang

30 Jun 2022

The term "foundation model" and its motivation unfortunately continue to be misunderstood. We wrote a blog post last year (see "Naming" section of crfm.stanford.edu/2021/10/18…) which aims to explain our thought process. Some selected quotes from the post:

137

Percy Liang · Jun 11, 2024 · 10:05 PM UTC

Percy Liang

@percyliang

11 Jun 2024

But for open science, we really need open-source models. How do you interpret test accuracies without knowledge of the training data? How can we understand model capabilities without knowing what the sources are? While open-weight models are hugely enabling, we also risk building our entire field on unclear foundations.

137

19,385

Percy Liang · Dec 15, 2020 · 8:42 PM UTC

Percy Liang

@percyliang

15 Dec 2020

Excited to see what kind of methods the community will come up with to address these realistic shifts in the wild! Also, if you are working on a real-world application and encounter distributional shifts, come talk to us!

Shiori Sagawa @shiorisagawa

15 Dec 2020

We're excited to announce WILDS, a benchmark of in-the-wild distribution shifts with 7 datasets across diverse data modalities and real-world applications. Website: wilds.stanford.edu Paper: arxiv.org/abs/2012.07421 Github: github.com/p-lambda/wilds Thread below. (1/12)

139

Percy Liang · Apr 8, 2025 · 5:11 AM UTC

Percy Liang

@percyliang

8 Apr 2025

We ran Llama 4 Maverick through some HELM benchmarks. It is 1st on HELM capabilities (MMLU-Pro, GPQA, IFEval, WildBench, Omni-MATH), but… crfm.stanford.edu/helm/capab…

138

29,455

Percy Liang · Nov 4, 2023 · 5:25 PM UTC

Percy Liang

@percyliang

4 Nov 2023

2021: let's increase model size! 2023: let's increase FLOPs! 2025: let's increase ???! Shouldn't FLOPs be in the denominator rather than the numerator? Numerator should be some measure of capability+safety. We need better evals to capture this!

131

22,236

Percy Liang · Feb 27, 2023 · 6:08 AM UTC

Percy Liang

@percyliang

27 Feb 2023

These powerful foundation models will be deployed to billions of people soon, which means there will be economic incentives for bad actors to start messing around. So we better figure out security for foundation models soon.

127

19,481

Percy Liang · Jan 29, 2023 · 7:12 AM UTC

Percy Liang

@percyliang

29 Jan 2023

A better solution would be to have all the LM providers agree on a common repository of examples that should be excluded from any training run.

121

21,943

Percy Liang · May 22, 2025 · 2:02 PM UTC

Percy Liang

@percyliang

22 May 2025

AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we are excited to release BountyBench, the first framework to capture offensive & defensive cyber-capabilities in evolving real-world systems.

131

15,651

Percy Liang · Jun 26, 2024 · 2:35 PM UTC

Percy Liang

@percyliang

26 Jun 2024

HELM MMLU v1.5.0 is out. crfm.stanford.edu/helm/mmlu/… Claude 3.5 Sonnet takes the top position.

126

14,838

Percy Liang · Oct 27, 2022 · 3:43 AM UTC

Percy Liang

@percyliang

27 Oct 2022

What is the largest fully reproducible language model? That is, where I can get the data and code and run a sequence of commands that deterministically produces the exact model?

128