Adam Zweiger · Feb 19, 2026 · 4:02 PM UTC

Adam Zweiger

Pinned Tweet

Adam Zweiger

@AdamZweiger

Feb 19

We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.

148

943

131,116

Adam Zweiger · Aug 5, 2025 · 6:31 PM UTC

Adam Zweiger

@AdamZweiger

5 Aug 2025

Here are all the architecture tricks used by gpt-oss:
- Attention sinks: for each attention head, have a learned scalar such that softmax(qk) becomes softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
- Alternate between 128-sliding window layers and dense layers. This is the same as GPT-3.
- Top-k MoE, with no shared expert and k=4. Standard load balancing loss.
- Really nonstandard swiglu: uses σ(αx) where α=1.702, which makes the silu approximate a gelu. Clips gate activations to (-inf, 7] and up activations to [-7, 7]. Adds 1 to up activations so final output is (up + 1)*glu instead of up*glu, which probably helps with gradient flow.
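Two of the tricks above are compact enough to sketch in NumPy. This is only an illustration of the tweet's description, not the code in the linked `modeling_gpt_oss.py`; the function names `softmax_with_sink` and `clipped_swiglu` are hypothetical, and the single-query, single-head shapes are a simplifying assumption.

```python
import numpy as np

def softmax_with_sink(scores, sink):
    """Softmax over [a_1, ..., a_T, sink]; return only the T token weights.

    scores: (T,) attention logits for one query in one head.
    sink:   scalar logit (a learned per-head parameter in the tweet's description).
    """
    logits = np.append(scores, sink)
    logits = logits - logits.max()            # subtract max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights[:-1]                       # the sink's share is discarded

def clipped_swiglu(gate, up, alpha=1.702, limit=7.0):
    """Gated activation as described above: (up + 1) * glu."""
    gate = np.minimum(gate, limit)            # gate clipped to (-inf, 7]
    up = np.clip(up, -limit, limit)           # up clipped to [-7, 7]
    glu = gate / (1.0 + np.exp(-alpha * gate))  # gate * sigmoid(alpha * gate)
    return (up + 1.0) * glu

# When every score is low, the token weights sum to much less than 1.
low = softmax_with_sink(np.array([-5.0, -6.0, -4.0]), sink=0.0)
print(low.sum())

# A huge gate value is clipped to 7 before the activation is applied.
print(clipped_swiglu(np.array([100.0]), np.array([0.0])))
```

With the sink in the denominator, the token weights no longer have to sum to 1, which is what lets a head attend to "nothing".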

739

64,811

Adam Zweiger · Jul 25, 2025 · 2:53 AM UTC

Adam Zweiger

@AdamZweiger

25 Jul 2025

Replying to @2prime_PKU

everyone always asks who/what is adam. never how is adam

510

22,136

Adam Zweiger · Jun 14, 2025 · 12:31 AM UTC

Adam Zweiger

@AdamZweiger

14 Jun 2025

Replying to @kimmonismus

Thanks for the post, but we don't want to mislead people. As we state in our tweet and in the paper, the ARC numbers are from a small curated subset of tasks that our human-crafted TTT configuration solved correctly. These results are somewhat toy, where we train the model to do tool-calls for existing data augmentation functions - see the other experiments about incorporating information from a passage as more of a showcase of our general method.

259

6,054

Adam Zweiger · Jun 13, 2025 · 2:31 AM UTC

Adam Zweiger

@AdamZweiger

13 Jun 2025

Excited to share our new work on Self-Adapting Language Models! This is my first first-author paper and I’m grateful to be able to work with such an amazing team of collaborators: @jyo_pari @HanGuo97 @akyurekekin @yoonrkim @pulkitology

Jyo Pari

@jyo_pari

13 Jun 2025

What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.

101

14,109

Adam Zweiger · Jul 25, 2025 · 2:45 AM UTC

Adam Zweiger

@AdamZweiger

25 Jul 2025

Replying to @2prime_PKU

It's pretty sad how so many people use Adam for optimizing their models yet don’t credit him as a coauthor :(

8,784

Adam Zweiger · Jun 13, 2025 · 3:19 PM UTC

Adam Zweiger

@AdamZweiger

13 Jun 2025

Replying to @jyo_pari

A few additional notes/limitations about SEAL after seeing some reactions:
- This is **not** AGI / recursive self-improvement. It's more towards LLMs ingesting data in a more effective way. We will need more breakthroughs to overcome the core challenges of generalization, hallucination, and continual learning.
- We chose the relatively simple no-context SQuAD setup (short passage and questions) so our base model (Qwen2.5-7B) could fully "understand" the content when it was in-context and respond with a large amount of text compared to the original passage. It would be very cool to see how SEAL scales with model size and task complexity.
- Many people are finding our idea of putting self-editing in an RL loop extremely compelling (and we agree!). As a bit of a warning though, RL is not a magic wand that pushes the reward to 1 in any environment. Weight updates from minimal data can be quite brittle and hard to work with, and it's possible self-edits of the form we study are upper bounded in ability to effectively update the model.
- Thanks for all the excitement! We hope this inspires more interesting research!

2,273

Adam Zweiger · Jul 14, 2025 · 10:25 PM UTC

Adam Zweiger

@AdamZweiger

14 Jul 2025

Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles. I will be presenting with @jyo_pari at E-2702, Tuesday 11-1:30!

6,214

Adam Zweiger · Aug 5, 2025 · 6:44 PM UTC

Adam Zweiger

@AdamZweiger

5 Aug 2025

source: github.com/huggingface/trans…

transformers/src/transformers/models/gpt_oss/modeling_gpt_oss.py at main · huggingface/transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. - huggingface/transformers

github.com

5,567

Adam Zweiger · Jun 14, 2025 · 4:10 PM UTC

Adam Zweiger

@AdamZweiger

14 Jun 2025

An underrated and potentially more practical aspect of our Self-Adapting LMs paper is the potential for general pre/post-training data curation. In the paper, we focus on using the same model for both generating and learning from self-edits. In practice, I imagine a "teacher" and a "student" (potentially initialized to the same gpt-n checkpoint), where the teacher learns through RL how to best augment some training data for the student. Finally, the aggregated set of text across the original data and synthetic data is used for mid/post training.

4,438

Adam Zweiger · Jun 20, 2025 · 5:56 PM UTC

Adam Zweiger

@AdamZweiger

20 Jun 2025

Replying to @Kimi_Moonshot

REINFORCE is all you need

3,291

Adam Zweiger · Jun 17, 2025 · 8:44 PM UTC

Adam Zweiger

@AdamZweiger

17 Jun 2025

Cool work! Maybe MesaNet can be seen as trying to optimally mitigate catastrophic forgetting during online learning. In the online test-time regression perspective, we can view these linear sequence models as continually learning from each new token. MesaNet improves on DeltaNet by keeping around the G_t and H_t state matrices to solve a cumulative least-squares loss, so new information is blended with rather than overwritten onto what was learned in earlier tokens. Lots of parallels between linear sequence models and ideas in continual learning research!

Johannes Oswald @oswaldjoh

17 Jun 2025

Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of 𝗹𝗼𝗰𝗮𝗹𝗹𝘆 𝗼𝗽𝘁𝗶𝗺𝗮𝗹 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 (TTT), and combines ideas of in-context learning, test-time training and mesa-optimization.
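The contrast drawn in the reply above (one online update per token versus solving a cumulative least-squares problem from the G_t and H_t states) can be sketched as follows. This is a simplified illustration under stated assumptions: no forgetting gates, a fixed ridge term `lam`, unit-norm keys, and hypothetical function names. It is not the implementation from either paper.

```python
import numpy as np

def delta_rule_state(keys, values, beta=1.0):
    """DeltaNet-style state: one gradient step per token on 0.5 * ||S k - v||^2."""
    S = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        S = S - beta * np.outer(S @ k - v, k)
    return S

def mesa_readout(keys, values, query, lam=1e-6):
    """Mesa-style readout: solve the cumulative ridge least-squares problem."""
    d_k = keys.shape[1]
    G = np.zeros((values.shape[1], d_k))      # running sum of v k^T
    H = np.zeros((d_k, d_k))                  # running sum of k k^T
    for k, v in zip(keys, values):
        G += np.outer(v, k)
        H += np.outer(k, k)
    return G @ np.linalg.solve(H + lam * np.eye(d_k), query)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                   # hypothetical ground-truth linear map
keys = rng.normal(size=(16, 4))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys (assumption)
values = keys @ W.T
query = keys[0]                               # ask about the earliest token

S = delta_rule_state(keys, values)
delta_err = np.linalg.norm(S @ query - values[0])
mesa_err = np.linalg.norm(mesa_readout(keys, values, query) - values[0])
print(delta_err, mesa_err)                    # recall error for the first token
```

With `beta=1` and unit-norm keys, the delta rule stores the most recent pair exactly but may partially overwrite earlier ones, while the least-squares readout weighs all tokens seen so far.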

2,710

Adam Zweiger · Jun 28, 2025 · 12:30 AM UTC

Adam Zweiger

@AdamZweiger

28 Jun 2025

Replying to @jxmnop

there’s also a distinction within [ii] between linear sequence models and transformer models. In linear models, TTT refers to online-learning from each token, and deals with architectural changes: fast-weight programmers, arxiv.org/abs/2501.12352, Yu Sun’s work as you mentioned, deltanet, mesanet, etc. Then there is TTT for continual adaptation in transformers. This includes works on external memory modules (e.g. Memoir), arxiv.org/abs/2411.07279, and SEAL

Test-time regression: a unifying framework for designing sequence...

Sequence models lie at the heart of modern deep learning. However, rapid advancements have produced a diversity of seemingly unrelated architectures, such as Transformers and recurrent...

arxiv.org

2,211

Adam Zweiger · Jul 22, 2025 · 6:16 PM UTC

Adam Zweiger

@AdamZweiger

22 Jul 2025

This is crazy! It makes more sense once you hear it requires both models to have the same initialization. If you can get a method like this to work without that, it would have big implications for data poisoning. I think it's not possible, but someone should look into it more.

Owain Evans

@OwainEvans_UK

22 Jul 2025

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

1,703

Adam Zweiger · Jul 1, 2025 · 12:18 AM UTC

Adam Zweiger

@AdamZweiger

1 Jul 2025

Cool concept, but LLM self-improvement through writing its own code is way overhyped. Coding good training pipelines is mostly figured out and strongly upper bounded. LLMs writing their own training data after agentic interaction will be significantly more powerful.

Minqi Jiang

@MinqiJiang

30 Jun 2025

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement? Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!). Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔 So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record. The results were surprising—not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record. We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents’ ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants! 
While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.

1,785

Adam Zweiger · Jul 19, 2025 · 6:12 PM UTC

Adam Zweiger

@AdamZweiger

19 Jul 2025

A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?

1,457

Adam Zweiger · Sep 3, 2025 · 6:10 AM UTC

Adam Zweiger

@AdamZweiger

3 Sep 2025

this is a really fun puzzle :D many more optimizations to come!

Alfred Wahlforss

@itsalfredw

2 Sep 2025

At Listen, we love hard problems and even harder puzzles. Decode the billboard, get the next challenge. Win an all-expenses Berlin trip + Berghain guest list (you'll see why).

2,744

Adam Zweiger · Jun 14, 2025 · 12:11 AM UTC

Adam Zweiger

@AdamZweiger

14 Jun 2025

Really interesting work on memory modules with sparse updates. Reminds me a lot of MoE! Both have a top-k style router to sparsely connect only the most relevant pieces of computation (though instead of tokens to experts, here it's indices of activations to columns of the memory).

Yiming Qin @qinym710

13 Jun 2025

How can we inject new knowledge into LLMs without full retraining, forgetting, or breaking past edits? We introduce MEMOIR 📖— a scalable framework for lifelong model editing that reliably rewrites thousands of facts sequentially using a residual memory module. 🔥 🧵1/7

1,257

Adam Zweiger · Jul 28, 2025 · 1:13 PM UTC

Adam Zweiger

@AdamZweiger

28 Jul 2025

Replying to @kalomaze

We have H100/200s! Lots of labs or groups of labs have their own compute in CSAIL.

1,342

Adam Zweiger · Aug 6, 2025 · 1:05 AM UTC

Adam Zweiger

@AdamZweiger

6 Aug 2025

Replying to @wenhaocha1

hi! Let y=(up+1)*glu. up projection is often close to 0, like at initialization. Without the +1, dy/dglu = 0. With the +1, dy/dglu=1
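The gradient argument in this reply can be checked numerically. The sketch below uses scalar inputs and a central finite difference; the helper names `output` and `dy_dglu` are hypothetical and exist only for this illustration.

```python
def output(up, glu, plus_one):
    """y = (up + 1) * glu with the offset, y = up * glu without it."""
    return (up + 1.0) * glu if plus_one else up * glu

def dy_dglu(up, glu, plus_one, eps=1e-6):
    """Central finite-difference estimate of dy/dglu."""
    hi = output(up, glu + eps, plus_one)
    lo = output(up, glu - eps, plus_one)
    return (hi - lo) / (2.0 * eps)

# With up = 0 (e.g. near initialization), the gradient through glu vanishes
# without the +1 and is approximately 1 with it.
print(dy_dglu(0.0, 0.3, plus_one=False))
print(dy_dglu(0.0, 0.3, plus_one=True))
```

Analytically, dy/dglu = up + 1 with the offset and dy/dglu = up without it, which matches the two cases in the reply when up = 0.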

1,410

Adam Zweiger · Jul 2, 2025 · 4:14 PM UTC

Adam Zweiger

@AdamZweiger

2 Jul 2025

Replying to @ZeyuanAllenZhu

this seems like a lot? How many GPUs do you have access to?

4,926

Adam Zweiger · Jun 26, 2025 · 5:43 AM UTC

Adam Zweiger

@AdamZweiger

26 Jun 2025

Prediction: In 2030, >90% of training tokens for frontier models will be LLM-generated. Models are only going to get better at synthesizing new information with stuff in-context. With either SEAL or better heuristics, synth data quality will improve even further.

Thao Nguyen @thao_nguyen26

23 Jun 2025

Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689

913

Adam Zweiger · Jun 18, 2025 · 10:56 PM UTC

Adam Zweiger

@AdamZweiger

18 Jun 2025

Probably not the best idea to title your paper "frontier models can't do x". By the time it's published, that changes lol. Looks like this one was first submitted only ~20 days after o1-preview dropped, so can't fault the authors too much.

Dan Hendrycks

@hendrycks

18 Jun 2025

Apple recently published a paper showing that current AI systems lack the ability to solve puzzles that are easy for humans. Humans: 92.7% GPT-4o: 69.9% However, they didn't evaluate on any recent reasoning models. If they did, they'd find that o3 gets 96.5%, beating humans.

648

Adam Zweiger · Jun 20, 2025 · 6:13 AM UTC

Adam Zweiger

@AdamZweiger

20 Jun 2025

This class is insanely good! I’ve been looking forward to each lecture video dropping on yt like it’s a netflix show

Percy Liang

@percyliang

18 Jun 2025

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

805

Adam Zweiger · Aug 5, 2025 · 9:26 PM UTC

Adam Zweiger

@AdamZweiger

5 Aug 2025

Replying to @jxmnop

then it'll be called continual learning

368

Adam Zweiger · Jun 26, 2025 · 6:58 PM UTC

Adam Zweiger

@AdamZweiger

26 Jun 2025

Replying to @dlwh

How do you think training from scratch with the "best" configuration would compare with doing all these interventions along the way with optimization params, changing the data mix, and all the cooldowns/warmups?

2,698

Adam Zweiger · Jul 22, 2025 · 6:04 PM UTC

Adam Zweiger

@AdamZweiger

22 Jul 2025

Replying to @eshear

this only works if the models have the same initialization though! I suspect it's impossible without this

142

Adam Zweiger · Jul 8, 2025 · 7:57 PM UTC

Adam Zweiger

@AdamZweiger

8 Jul 2025

Very excited to see what they've cooked up now. My out-there guess: Use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but far fewer tokens and better expressivity for some domains

Albert Gu

@_albertgu

8 Jul 2025

I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.

729

Adam Zweiger · Jun 28, 2025 · 12:58 AM UTC

Adam Zweiger

@AdamZweiger

28 Jun 2025

If any undergrads follow me: apply for HackMIT!

HackMIT

@HackMIT

27 Jun 2025

ONE WEEK LEFT until priority applications for HackMIT 2025 close!! If you haven’t yet applied, let’s recap what HackMIT has to offer:🧵 #HackMIT #Hackathon #mit

591

Adam Zweiger · Sep 5, 2025 · 7:02 PM UTC

Adam Zweiger

@AdamZweiger

5 Sep 2025

Interesting work showing concretely why on-policy RL forgets less. It's not quite because of "sparse updates" -- only that RL maintains a smaller KL to the base model.

Jyo Pari

@jyo_pari

5 Sep 2025

For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇

1,241

Adam Zweiger · Jun 14, 2025 · 12:23 AM UTC

Adam Zweiger

@AdamZweiger

14 Jun 2025

Replying to @willccbb @inductionheads

My personal best guess of how the method could be applied in industry is more through curating training data. In the paper, we focus on having the same model generate self-edits and receive them. In practice, it's probably better to have a "teacher" model and a "student" model (potentially initialized to the same gpt-n checkpoint), where the teacher learns through RL how to best train the student, and this data is then aggregated for pre/post training. Also re OP: only Ekin went to OpenAI, and his contribution was made prior to joining

257

Adam Zweiger · Jun 15, 2025 · 12:41 AM UTC

Adam Zweiger

@AdamZweiger

15 Jun 2025

Replying to @ghostITCITM @kimmonismus

yeah! check out our website and code: jyopari.github.io/posts/seal

276

Adam Zweiger · Jul 15, 2025 · 5:29 AM UTC

Adam Zweiger

@AdamZweiger

15 Jul 2025

This is one of the highest-quality evals I've seen and it's nice to see it expanding! I love how you can view each model-problem-run datapoint

Mislav Balunović @mbalunovic

14 Jul 2025

We are launching Project Euler on MathArena to track performance of LLMs on challenging new problems at the intersection of mathematics and programming which are published every week on Project Euler website 🧵(1/6)

675

Adam Zweiger · Jun 22, 2025 · 3:16 AM UTC

Adam Zweiger

@AdamZweiger

22 Jun 2025

Replying to @kalomaze

model-based data filtering is already very common. Why not model-based data augmentation?

999

Adam Zweiger · Jul 20, 2025 · 2:30 AM UTC

Adam Zweiger

@AdamZweiger

20 Jul 2025

interesting

Dave

@dmvaldman

20 Jul 2025

A striking thing about OpenAI's IMO gold math model is how terse it is, it really tries to express itself in single tokens. Often breaking the rules of grammar and spelling to do so. They say compression is intelligence. We may be seeing a totally novel way to do compression here! Some examples:
- not divisible by3 (saves a token by not including a space "by 3")
- Let ω= circumcircle (saves a token, ω=circumcircle is 5 tokens where including a space on just one side makes it 4 tokens)
- Need show also all terms multiple of 3. (saves a token by not pluralizing "multiple")
And it marks progress using single token words like: perfect, good, full, exactly

674

Adam Zweiger · Jul 23, 2025 · 6:05 PM UTC

Adam Zweiger

@AdamZweiger

23 Jul 2025

Models that know what they know are way more useful. So, jointly reward correctness and calibration!

Mehul Damani

@MehulDamani2

23 Jul 2025

🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty -- improving both accuracy ✅ and calibration 🎯. [1/N]
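One simple way to "jointly reward correctness and calibration" is to subtract a Brier penalty on the model's stated confidence from the correctness reward. The sketch below is an assumption for illustration: the thread does not give the exact reward used in the quoted paper, and the function name `calibrated_reward` is hypothetical.

```python
def calibrated_reward(is_correct, confidence):
    """Correctness reward minus a Brier penalty on the stated confidence.

    is_correct: bool, whether the final answer matched the reference.
    confidence: float in [0, 1], the model's stated probability of being correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2

print(calibrated_reward(True, 1.0))    # 1.0: correct and fully confident
print(calibrated_reward(True, 0.5))    # 0.75: correct but hedged
print(calibrated_reward(False, 0.0))   # 0.0: wrong, and the model knew it
print(calibrated_reward(False, 1.0))   # -1.0: confidently wrong
```

Because the Brier score is a proper scoring rule, the penalty term is minimized in expectation by reporting the true probability of being correct, so a reward of this shape does not pay the model for overstating its confidence.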

1,116

Adam Zweiger · Jun 28, 2025 · 12:35 AM UTC

Adam Zweiger

@AdamZweiger

28 Jun 2025

Replying to @AdamZweiger @jxmnop

citations (hope I didn’t miss too many, these are just the ones I’ve read and found interesting!) arxiv.org/abs/2102.11174 arxiv.org/abs/2407.04620 arxiv.org/abs/2501.12352 arxiv.org/abs/2406.06484 arxiv.org/abs/2506.05233 arxiv.org/pdf/2506.07899v1 arxiv.org/abs/2411.07279 arxiv.org/abs/2506.10943

Linear Transformers Are Secretly Fast Weight Programmers

We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a ``slow" neural net learns by gradient descent to program the ``fast...

arxiv.org

135

Adam Zweiger · Jul 1, 2025 · 12:20 AM UTC

Adam Zweiger

@AdamZweiger

1 Jul 2025

The skill of coding its own training pipelines is also just a narrow subset of general coding/research ability, which should be targeted instead.

331

Adam Zweiger · Sep 3, 2025 · 3:41 PM UTC

Adam Zweiger

@AdamZweiger

3 Sep 2025

Replying to @NALC3D

it's all RNG at this point. for the first one especially there's not much to optimize after you reach a certain algorithm

165

Adam Zweiger · Jul 2, 2025 · 4:22 PM UTC

Adam Zweiger

@AdamZweiger

2 Jul 2025

Replying to @AdamZweiger @ZeyuanAllenZhu

is that like 77 days of 1024 H200s? or more like 1.7 years of 128 A6000s
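The two scenarios in this reply can be compared by converting each to GPU-hours. The sketch assumes a 365-day year, and the helper name `gpu_hours` is hypothetical. GPU-hours on different hardware are not equivalent amounts of compute, so this only checks the raw totals.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: non-leap year

def gpu_hours(num_gpus, days):
    """Total GPU-hours for num_gpus running continuously for the given days."""
    return num_gpus * days * HOURS_PER_DAY

h200_hours = gpu_hours(1024, 77)
a6000_hours = gpu_hours(128, 1.7 * DAYS_PER_YEAR)

print(f"{h200_hours:,.0f}")    # 1,892,352
print(f"{a6000_hours:,.0f}")   # 1,906,176
```

Both scenarios come to roughly 1.9 million GPU-hours.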

580

Adam Zweiger · Jul 25, 2025 · 3:35 AM UTC

Adam Zweiger

@AdamZweiger

25 Jul 2025

Replying to @demishassabis

piped.video/watch?v=4e0n7vTL…

A Million Dollars Isn't Cool...

Too bad MySpace is only worth a few pennies...This was directed b...

youtube.com

282

Adam Zweiger · Jul 8, 2025 · 8:13 PM UTC

Adam Zweiger

@AdamZweiger

8 Jul 2025

Replying to @docmilanfar @_albertgu

not really. for example with x_t and y_t being scalars, C_t is a vector here. Makes more sense to have it be a column and then do the natural inner product notation a^T b

198

Adam Zweiger · Jun 22, 2025 · 3:38 AM UTC

Adam Zweiger

@AdamZweiger

22 Jun 2025

Replying to @kalomaze

sure but if we are trying to just improve capabilities, this isn’t an issue. The main reason LLMs suck at X use case is because there wasn’t enough data on it on the internet. When you put this data in-context of a strong model, it can synthesize more data to feed back in

304