We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
Here are all the architecture tricks used by gpt-oss:
- Attention sinks: for each attention head, have a learned scalar such that softmax(qk) becomes a softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
- Alternate between 128-token sliding-window layers and dense layers. This is the same as GPT-3.
- Top-k MoE, with no shared expert and k=4. Standard load-balancing loss.
- Really nonstandard swiglu: uses σ(αx) with α=1.702, which makes the silu approximate a gelu. Clips gate activations to (-inf, 7] and up activations to [-7, 7]. Adds 1 to up activations so the final output is (up + 1)*glu instead of up*glu, which probably helps with gradient flow.
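A minimal NumPy sketch of two of the tricks above, written only from the description in this post. The function names `softmax_with_sink` and `clipped_swiglu` are hypothetical, the shapes are simplified to a single query and a single head, and this is not the reference implementation.

```python
import numpy as np

def softmax_with_sink(scores, sink):
    """Softmax over [a_1, ..., a_T, sink], returning only the T token weights.

    scores: (T,) attention logits for one query in one head.
    sink:   scalar logit (learned per head in the description above).
    """
    logits = np.append(scores, sink)
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[:-1]                       # drop the sink slot; weights may sum to < 1

def clipped_swiglu(gate, up, alpha=1.702, limit=7.0):
    """(up + 1) * glu, with glu = gate * sigmoid(alpha * gate) and clipping."""
    gate = np.minimum(gate, limit)          # gate clipped to (-inf, limit]
    up = np.clip(up, -limit, limit)         # up clipped to [-limit, limit]
    glu = gate / (1.0 + np.exp(-alpha * gate))
    return (up + 1.0) * glu

# When every score is low relative to the sink, almost no mass lands on tokens.
weights = softmax_with_sink(np.array([-5.0, -5.0, -5.0]), sink=0.0)
out = clipped_swiglu(np.array([100.0]), np.array([0.0]))
```

With all scores well below the sink logit, the token weights sum to far less than 1, which is the "attend to nothing" behaviour. In the second call the gate is clipped to 7 and the output stays nonzero even though `up` is 0.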
Replying to @2prime_PKU
everyone always asks who/what is adam. never how is adam
Replying to @kimmonismus
Thanks for the post, but we don't want to mislead people. As we state in our tweet and in the paper, the ARC numbers are from a small curated subset of tasks that our human-crafted TTT configuration solved correctly. These results are somewhat toy, where we train the model to do tool-calls for existing data augmentation functions - see the other experiments about incorporating information from a passage as more of a showcase of our general method.
Excited to share our new work on Self-Adapting Language Models! This is my first first-author paper and I’m grateful to be able to work with such an amazing team of collaborators: @jyo_pari @HanGuo97 @akyurekekin @yoonrkim @pulkitology
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
Replying to @2prime_PKU
It's pretty sad how so many people use Adam for optimizing their models yet don’t credit him as a coauthor :(
Replying to @jyo_pari
A few additional notes/limitations about SEAL after seeing some reactions:
- This is **not** AGI / recursive self-improvement. It's more towards LLMs ingesting data in a more effective way. We will need more breakthroughs to overcome the core challenges of generalization, hallucination, and continual learning.
- We chose the relatively simple no-context SQuAD setup (short passage and questions) so our base model (Qwen2.5-7B) could fully "understand" the content when it was in-context and respond with a large amount of text compared to the original passage. It would be very cool to see how SEAL scales with model size and task complexity.
- Many people are finding our idea of putting self-editing in an RL loop extremely compelling (and we agree!). As a bit of a warning though, RL is not a magic wand that pushes the reward to 1 in any environment. Weight updates from minimal data can be quite brittle and hard to work with, and it's possible self-edits of the form we study are upper bounded in their ability to effectively update the model.
- Thanks for all the excitement! We hope this inspires more interesting research!
Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles. I will be presenting with @jyo_pari at E-2702, Tuesday 11-1:30!
An underrated and potentially more practical aspect of our Self-Adapting LMs paper is the potential for general pre/post-training data curation. In the paper, we focus on using the same model for both generating and learning from self-edits. In practice, I imagine a "teacher" and a "student" (potentially initialized to the same gpt-n checkpoint), where the teacher learns through RL how to best augment some training data for the student. Finally, the aggregated set of text across the original data and synthetic data is used for mid/post training.
Replying to @Kimi_Moonshot
REINFORCE is all you need
Cool work! Maybe MesaNet can be seen as trying to optimally mitigate catastrophic forgetting during online learning. In the online test-time regression perspective, we can view these linear sequence models as continually learning from each new token. MesaNet improves on DeltaNet by keeping around the G_t and H_t state matrices to solve a cumulative least-squares loss, so new information is blended with, rather than overwritten onto, what was learned in earlier tokens. Lots of parallels between linear sequence models and ideas in continual learning research!
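A rough NumPy sketch of the contrast drawn above, under simplifying assumptions: no forget gates, G_t taken as the running sum of value-key outer products, H_t as the running sum of key-key outer products, and a small ridge term `lam` added for invertibility. The function names are hypothetical and this is not the MesaNet or DeltaNet implementation.

```python
import numpy as np

def deltanet_step(S, k, v, beta=1.0):
    """Delta rule: one gradient step on 0.5 * ||S k - v||^2 for the newest token only."""
    return S + beta * np.outer(v - S @ k, k)

def mesa_readout(keys, values, q, lam=1e-6):
    """Accumulate G and H over all tokens, then read out the least-squares fit at q."""
    d_k, d_v = keys.shape[1], values.shape[1]
    G = np.zeros((d_v, d_k))   # running sum of v_t k_t^T
    H = np.zeros((d_k, d_k))   # running sum of k_t k_t^T
    for k, v in zip(keys, values):
        G += np.outer(v, k)
        H += np.outer(k, k)
    # Equivalent to applying the ridge-regression solution G (H + lam I)^{-1} to q.
    return G @ np.linalg.solve(H + lam * np.eye(d_k), q)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # hypothetical ground-truth linear map
keys = rng.normal(size=(16, 4))
values = keys @ W.T
q = rng.normal(size=4)
mesa_out = mesa_readout(keys, values, q)

# Delta rule with a unit-norm key and beta=1 stores exactly the latest value for that key.
k_unit = keys[0] / np.linalg.norm(keys[0])
S = deltanet_step(np.zeros((3, 4)), k_unit, values[0])
```

Because G and H summarize every token seen so far, the readout fits all key-value pairs jointly, whereas each delta-rule step only corrects the error on the current token.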
Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of 𝗹𝗼𝗰𝗮𝗹𝗹𝘆 𝗼𝗽𝘁𝗶𝗺𝗮𝗹 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 (TTT), and combines ideas of in-context learning, test-time training and mesa-optimization.
Replying to @jxmnop
there’s also a distinction within [ii] between linear sequence models and transformer models. In linear models, TTT refers to online learning from each token, and deals with architectural changes: fast-weight programmers, arxiv.org/abs/2501.12352, Yu Sun’s work as you mentioned, deltanet, mesanet, etc. Then there is TTT for continual adaptation in transformers. This includes works on external memory modules (e.g. Memoir), arxiv.org/abs/2411.07279, and SEAL
This is crazy! It makes more sense once you hear it requires both models to have the same initialization. If you can get a method like this to work without that, it would have big implications for data poisoning. I think it's not possible, but someone should look into it more.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Cool concept, but LLM self-improvement through writing its own code is way overhyped. Coding good training pipelines is mostly figured out and strongly upper bounded. LLMs writing their own training data after agentic interaction will be significantly more powerful.
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement? Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!). Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔 So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record. The results were surprising—not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record. We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents’ ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants! 
While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.
A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?
this is a really fun puzzle :D many more optimizations to come!
At Listen, we love hard problems and even harder puzzles. Decode the billboard, get the next challenge. Win an all-expenses Berlin trip + Berghain guest list (you'll see why).
Really interesting work on memory modules with sparse updates. Reminds me a lot of MoE! Both have a top-k style router to sparsely connect only the most relevant pieces of computation (though instead of tokens to experts, here it's indices of activations to columns of the memory).
How can we inject new knowledge into LLMs without full retraining, forgetting, or breaking past edits? We introduce MEMOIR 📖— a scalable framework for lifelong model editing that reliably rewrites thousands of facts sequentially using a residual memory module. 🔥 🧵1/7
Replying to @kalomaze
We have H100/200s! Lots of labs or groups of labs have their own compute in CSAIL.
Replying to @wenhaocha1
hi! Let y=(up+1)*glu. up projection is often close to 0, like at initialization. Without the +1, dy/dglu = 0. With the +1, dy/dglu=1
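The gradient argument above can be checked numerically. This is a small sketch with hypothetical helper names, using a central finite difference instead of autograd, and treating `up` and `glu` as plain scalars.

```python
def swiglu_out(up, glu, add_one=True):
    """y = (up + 1) * glu when add_one is True, else y = up * glu."""
    return (up + 1.0) * glu if add_one else up * glu

def grad_wrt_glu(up, glu, add_one, eps=1e-6):
    """Central finite-difference estimate of dy/dglu."""
    hi = swiglu_out(up, glu + eps, add_one)
    lo = swiglu_out(up, glu - eps, add_one)
    return (hi - lo) / (2.0 * eps)

# With up = 0 (roughly the situation at initialization):
g_plain = grad_wrt_glu(up=0.0, glu=0.5, add_one=False)   # no gradient reaches glu
g_plus1 = grad_wrt_glu(up=0.0, glu=0.5, add_one=True)    # gradient passes through
```

In general dy/dglu equals `up` without the offset and `up + 1` with it, so the offset keeps the gate branch trainable when `up` is near zero.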
Replying to @ZeyuanAllenZhu
this seems like a lot? How many GPUs do you have access to?
Prediction: In 2030, >90% of training tokens for frontier models will be LLM-generated. Models are only going to get better at synthesizing new information with stuff in-context. With either SEAL or better heuristics, synth data quality will improve even further.
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
Probably not the best idea to title your paper "frontier models can't do x". By the time it's published that changes lol. Looks like this one was first submitted only ~20 days after o1-preview dropped, so can't fault the authors too much
Apple recently published a paper showing that current AI systems lack the ability to solve puzzles that are easy for humans. Humans: 92.7% GPT-4o: 69.9% However, they didn't evaluate on any recent reasoning models. If they did, they'd find that o3 gets 96.5%, beating humans.
This class is insanely good! I’ve been looking forward to each lecture video dropping on yt like it’s a netflix show
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
Replying to @jxmnop
then it'll be called continual learning
Replying to @dlwh
How do you think training from scratch with the "best" configuration would compare with doing all these interventions along the way with optimization params, changing the data mix, and all the cooldowns/warmups?
Replying to @eshear
this only works if the models have the same initialization though! I suspect it's impossible without this
Very excited to see what they've cooked up now. My out-there guess: Use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but far fewer tokens and better expressivity for some domains
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
If any undergrads follow me: apply for HackMIT!
ONE WEEK LEFT until priority applications for HackMIT 2025 close!! If you haven’t yet applied, let’s recap what HackMIT has to offer:🧵 #HackMIT #Hackathon #mit
Interesting work showing concretely why on-policy RL forgets less. It's not quite because of "sparse updates" -- only that RL maintains a smaller KL to the base model.
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
My personal best guess of how the method could be applied in industry is more through curating training data. In the paper, we focus on having the same model generate self-edits and receive them. In practice, it's probably better to have a "teacher" model and a "student" model (potentially initialized to the same gpt-n checkpoint), where the teacher learns through RL how to best train the student, and this data is then aggregated for pre/post training. Also re OP: only Ekin went to OpenAI, and his contribution was made prior to joining
yeah! check out our website and code: jyopari.github.io/posts/seal
This is one of the highest-quality evals I've seen and it's nice to see it expanding! I love how you can view each model-problem-run datapoint
We are launching Project Euler on MathArena to track performance of LLMs on challenging new problems at the intersection of mathematics and programming which are published every week on Project Euler website 🧵(1/6)
Replying to @kalomaze
model-based data filtering is already very common. Why not model-based data augmentation?
interesting
A striking thing about OpenAI's IMO gold math model is how terse it is, it really tries to express itself in single tokens. Often breaking the rules of grammar and spelling to do so. They say compression is intelligence. We may be seeing a totally novel way to do compression here! Some examples: not divisible by3 (saves a token by not including a space "by 3") Let ω= circumcircle (saves a token, ω=circumcircle is 5 tokens where including a space on just one side makes it 4 tokens) Need show also all terms multiple of 3. (saves a token by not pluralizing "multiple") And it marks progress using single token words like: perfect, good, full, exactly)
Models that know what they know are way more useful. So, jointly reward correctness and calibration!
🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty -- improving both accuracy ✅ and calibration 🎯. [1/N]
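One simple way to jointly reward correctness and calibration is to add a Brier-score penalty on the model's stated confidence to a 0/1 correctness reward. The sketch below illustrates that idea; the function name and the exact form of the reward are assumptions for illustration, not necessarily the definition used in the quoted paper.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    correct:    whether the final answer was right.
    confidence: the model's stated probability of being right, in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# Confidently wrong is punished hardest; admitting uncertainty when wrong is not.
r_right_sure = calibrated_reward(True, 1.0)
r_wrong_sure = calibrated_reward(False, 1.0)
r_wrong_unsure = calibrated_reward(False, 0.0)
```

Under this form, an overconfident wrong answer scores lower than a wrong answer with low stated confidence, so the policy has a reason to report uncertainty honestly.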
The skill of coding its own training pipelines is also just a narrow subset of general coding/research ability, which should be targeted instead.
Replying to @NALC3D
it's all RNG at this point. for the first one especially there's not much to optimize after you reach a certain algorithm
is that like 77 days of 1024 H200s? or more like 1.7 years of 128 A6000s
not really. for example with x_t and y_t being scalars, C_t is a vector here. Makes more sense to have it be a column and then do the natural inner product notation a^T b
Replying to @kalomaze
sure but if we are trying to just improve capabilities, this isn’t an issue. The main reason LLMs suck at X use case is because there wasn’t enough data on it on the internet. When you put this data in the context of a strong model, it can synthesize more data to be fed back in