Sparse and efficient • Deus eXperiments • 🇮🇳

We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction". A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use-cases often matter disproportionately. Prism asks and answers a simple question - "Is it possible to isolate and deploy only the capabilities that matter under the Pareto principle, and cut costs by a large margin while preserving most of the performance?" The paper presents a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.
21
39
216
22,991
this is beyond mindblowing for me. somebody built a 5 million param language model inside minecraft, trained it, equipped it with basic conversational ability. probably the best thing i have seen this entire month.
348
1,507
27,197
1,735,925
anything for the agi
221
600
12,114
570,443
tired of arguing with GPT5 pro, so i asked it to fight itself on my behalf. you guys do not understand automation as well as i do.
73
129
5,201
238,959
this playlist is gonna blow up cuz @karpathy sensei just recommended it from scratch
24
294
4,833
400,327
"I work at a secret project at xAI" "The DoD one or the hentai one?"
Announcing Grok for Government - a suite of products that make our frontier models available to United States Government customers We are especially excited about two new partnerships for our US Government partners 1) a new contract from the US Department of Defense 2) our products being available to purchase via the General Services Administration (GSA) schedule. This allows every federal government department, agency, or office, to purchase xAI products We're hiring mission driven engineers who want to join the cause
35
151
4,640
243,636
perplexity seems to be on a path to be a case study. i will not elaborate.
New: @perplexity_ai just dropped its first celeb ad starring Squid Game's Lee Jung-jae. The AI search startup takes a jab at @Google “Poogle”—with a cheeky line: “Don’t use glue” when Jung-jae asks how to make cheese stick to pizza. 💰 Mid-seven-figure buy
60
74
3,451
380,327
me one year ago, before i decided to dive deeper into LLMs. just C and sometimes python.
29
79
2,486
108,772
Replying to @melqtx
a mathematical construct for solving multi-dimensional problems for entities having a magnitude and directional representation.
16
20
2,220
95,125
My estimate of the probability of Grok 5 achieving AGI is now at 10% and rising
12
27
1,851
104,814
i got to know this isn't the first time this has happened. sammyuri and others have been building incredible things pushing redstone x minecraft to its limit. a collection. 1) a complete CNN in minecraft, this was last year
this is beyond mindblowing for me. somebody built a 5 million param language model inside minecraft, trained it, equipped it with basic conversational ability. probably the best thing i have seen this entire month.
12
78
1,802
253,753
a 17 yo genius disproves an age old mathematical conjecture and she is getting her applications rejected because she doesn't have a college degree. this can't be real. quantamagazine.org/at-17-han…
74
58
1,590
661,296
is it possible to pretrain a language model using pure reinforcement learning from scratch? random weights, no cross-entropy loss pretraining. you may have many questions in your head.
62
154
1,585
313,981
FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
10
67
1,477
157,303
never deleting this app
13
34
1,433
129,431
this is nice. something worth sharing for sure.
28
16
1,377
147,848
god bless claude code team to ship such amazing features that GLM users can enjoy.
Claude Code can now ask you interactive questions when it needs more information or when there are multiple paths forward.
28
28
1,201
100,813
it probably feels as a self-defeating thing to say for researchers but NEXT TOKEN PREDICTION IS THE MOST OPTIMAL COMPRESSION. - shannon, 1948 (a mathematical theory of communication)
Hot-take: Auto-regression sucks and is impressive as a parlor trick. Any spark of intelligence from an LLM reflects that it moved beyond, and built a factorized model with meaningful latents.
30
81
1,083
129,140
damn did karpathy pod just change the bubble burst timeline?
25
41
1,053
141,326
Announcing Bread Technologies. We’re building machines that learn like humans. We raised a $5 million seed round led by Menlo Ventures and have been building in stealth for 10 months. Today, we rise 🍞
7
8
914
283,284
how i bring the best out of claude code?
my workflow to get the best out of claude code is complete now.
9
51
854
127,679
Wanted to get better intuitions for how RL works on LLMs. So I wrote a simple script to teach Nanochat to add 5 digit numbers. I was surprised at how fast it learned. Until I looked at the model's generations and realized that it had just learned to always call the built-in Python interpreter 😂. The code I wrote is very remedial, minimal, and inefficient - I'm a professional podcaster, alright? But it might be helpful if you just want to see the basics of how REINFORCE or GRPO work. Link to gist below. Fundamentally, it's not that complicated: generate multiple trajectories per prompt. Update your model to make it more likely that it samples all the tokens in the successful trajectories.
6
9
771
139,039
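The recipe in the post above (sample several trajectories per prompt, then raise the probability of the successful ones) can be sketched on a toy problem. This is not the gist the post refers to. It is a minimal illustration that assumes a single "prompt", a softmax policy over four one-step "trajectories", a 0/1 reward, and a GRPO-style baseline that subtracts the group mean reward (the usual division by the group standard deviation is left out). All names and hyperparameters here are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def train_toy_policy(num_actions=4, correct=2, group_size=8,
                     steps=300, lr=0.5, seed=0):
    """REINFORCE with a group-mean baseline on a one-step toy task.

    Each "trajectory" is a single sampled action; reward is 1.0 when the
    action equals `correct`, else 0.0. Toy stand-in, not an LLM trainer.
    """
    rng = np.random.default_rng(seed)
    logits = np.zeros(num_actions)
    for _ in range(steps):
        probs = softmax(logits)
        # Generate a group of trajectories for the same prompt.
        actions = rng.choice(num_actions, size=group_size, p=probs)
        rewards = (actions == correct).astype(float)
        # Group-relative advantage: reward minus the group mean.
        advantages = rewards - rewards.mean()
        grad = np.zeros(num_actions)
        for action, adv in zip(actions, advantages):
            one_hot = np.zeros(num_actions)
            one_hot[action] = 1.0
            # Gradient of log pi(action) w.r.t. the logits is one_hot - probs.
            grad += adv * (one_hot - probs)
        # Gradient ascent on expected reward.
        logits += lr * grad / group_size
    return softmax(logits)

final_probs = train_toy_policy()
```

When every sample in a group gets the same reward, the advantages are all zero and the step is skipped, which is one reason group-based methods want prompts with mixed outcomes.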
we missed a banger paper in the grok4/k2 drop noise guys. these guys
> look for optimal ways to select data mixes to get max improvement on a model given a target domain.
> do multimodal validation
> show good extrapolation accuracy (testing on 1.4B and predicting on 8B)
11
71
765
71,989
hypothetically, if one wanted to research at a frontier lab in 1.5 years (hypothetically winter 2027) and wanted to know how to develop the necessary skills and credentials to do so, what would you suggest to them (asking for a friend)
10
17
751
92,894
how i bring the best out of claude code - part 2
intermediate level workflow for claude code is complete
taught claude to
>create commands,
>manage its own commands,
>extract entire session and save it like os paging,
>do multi agent deep research locally,
>analyse functions in a specific style,
>search the repo effectively,
>work with user created scripts.
post scheduled for 3.5 hours later. enjoy.
18
33
682
105,557
wow, this is exciting. both openai and google won gold medals at ICPC 2025. oai system solved 12/12 while gemini 2.5 deepthink solved 10/12. what is noteworthy is that no human team solved more than 11.
8
34
627
53,109
researchers when asked to switch from bf16 to fp16 and do loss scaling because it is way better for RL
15
19
629
41,622
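The loss scaling mentioned in the post above exists because fp16 has a narrow exponent range, so small gradients underflow to zero. A minimal NumPy sketch of the idea follows. It multiplies a stand-in gradient value directly rather than scaling a real loss, and the function name and scale factor are illustrative assumptions, not any framework's API.

```python
import numpy as np

def fp16_roundtrip_grad(true_grad, loss_scale=1.0):
    """Store a (scaled) gradient in fp16, then unscale it in fp32.

    Scaling the loss by `loss_scale` scales every gradient by the same
    factor via the chain rule; here the gradient is multiplied directly
    as a stand-in for that effect.
    """
    stored = np.float16(true_grad * loss_scale)          # fp16 storage step
    return np.float32(stored) / np.float32(loss_scale)   # unscale in fp32

tiny_grad = 1e-8  # below the smallest fp16 subnormal (about 6e-8)
without_scaling = fp16_roundtrip_grad(tiny_grad)
with_scaling = fp16_roundtrip_grad(tiny_grad, loss_scale=65536.0)
```

Without scaling the value is flushed to zero, while the scaled path recovers it to within fp16 rounding error. Real mixed-precision setups typically also adjust the scale dynamically and skip steps whose gradients overflow, which this sketch does not do.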
kimi k2 is token-for-token the strongest model on earth. dirt cheap, such high quality.
15
23
588
31,000
this paper cost 4.2 mil USD to write holy... most labs haven't reached the point of releasing models that cost that much let alone a paper that covers all the details
15
43
541
102,768
Making a list of graveyard of ideas, the ultimate nerd snipes where efforts go and die DPO-*variant SSM-transformer hybrids SAEs MCTS Diffusion for large vision models Attention-less JEPA (lecun lovers) what else?
68
22
518
100,129
Indian Reflection grift model Shivaay > if we used openai approach, it'll not be possible. so we innovated something something and pretrained 4B SoTA unbelievable model on 4x A100 80G > unbelievable? no, skill issue. we invented everything but will not share tech report.
31
18
487
62,334
A really good paper from Meta Superintelligence Labs dropped. essentially a primer on how to bootstrap tasks and create efficient multi-turn self improving envs. and uses the same intuitions which i have been sharing on here. > guided exploration > prefills as hints
6
46
500
35,763
damn looks like i was not hallucinating, why is this paper not on the timeline?
muon is nice, ever since i got it working i can notice the way the output token quality is just better? probably varied from Adam optimizer tokens would be a better way to phrase it. but these models with different optimisers are different even with the same data.
12
30
494
56,886
if "fork found in the kitchen" was a paper in language modelling
20
22
486
34,530
Replying to @realmcore_
unbelievable to me when i saw they even trained it.
4
2
474
60,632
i do not suggest kimi k2 with claude code. it would ruin your experience of both claude code and k2. claude4 is RL-ed to make best use of the prompts and env it gets inside the scaffold. k2 has a higher error rate with tools/ops as context grows. use k2 with opencode/cline.
11
22
469
56,031
Replying to @andromeda74356
the guy is brilliant.
3
2
460
77,613
this is the paper. just found it out in the wild and thought of this post.
y'all ain't ready for this token order prediction (TOP) early preprint coming next week. it's promising enough for me to get a checkmark to boost the paper a bit, let's see
9
39
469
103,147
TIL annas archive is now a data broker for LLM training
15
24
443
32,101
Several of my team members + myself are impacted by this layoff today. Welcome to connect :)
8
16
405
38,848
Hello Thermo World.
21
5
371
26,350
anybody who wishes to understand current RL algos needs to do this minimum
REINFORCE is coming back in a big way
7
12
369
25,027
banger release but also attention bros this week
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Ki… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length. Key highlights: 🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule. 🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board. 🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints. The future of agentic-oriented attention is here! 💡
5
14
371
34,234
what a bold direction by deepseek once again. they took "a picture is worth a thousand words" literally or the idea of "photographic memory" if i am to commit the crime of anthropomorphisation.
10
9
353
132,234
If the model can infer when it's incorrect, it should be allowed to backtrack with <backspace> tokens. A model trained from scratch with backspace/backtrack on the same dataset might be able to fix this on its own.
The quickest “oops” I’ve ever seen from an LLM:
22
20
332
45,510
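On the decoding side, the backtracking idea in the post above amounts to treating one special token as "delete the previous token". A minimal sketch, assuming a hypothetical `<backspace>` string token and a plain list of already-sampled tokens. Training a model to emit such a token is the hard part and is not shown.

```python
BACKSPACE = "<backspace>"  # hypothetical special token, named after the post

def apply_backspaces(tokens):
    """Resolve backspace tokens in a sampled sequence.

    Each BACKSPACE removes the most recent surviving token. A BACKSPACE
    with nothing before it is ignored.
    """
    resolved = []
    for token in tokens:
        if token == BACKSPACE:
            if resolved:
                resolved.pop()
        else:
            resolved.append(token)
    return resolved

sampled = ["2", "+", "2", "=", "5", BACKSPACE, "4"]
visible = apply_backspaces(sampled)
```

The raw sequence would stay in the model's context so it can still see its own correction, while only the resolved sequence is shown to the user.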
He used TLDR correctly. Because he himself didn't read what the paper actually says.
TL;DR: LLMs don’t reason, and LLMs with RL still don’t.
4
5
341
39,494
the state of evals tells us how early we are in developing any solid understanding of these systems
Cursor vs. Cognition have opposite takes on agent search: - Cursor: Custom embeddings trained on agent traces improve accuracy by 12.5% - Cognition: Embeddings are counterproductive, so we trained models to use grep with 8x parallel tool calls. Both have benchmarks. Hmmm..
4
18
346
24,264
thinky really refused to acknowledge the existence of llama4 thinkingmachines.ai/tinker/
10
7
316
22,958
have had many questions in DMs about starting out -> training/RL your own models. there are gaps even with the presence of nanogpt and post training notebooks out in the open. would appreciate if you can let me know of more. would do a write up to bridge the understanding.
11
16
312
42,012
"torch compile: the missing manual" you probably can't pay to get such a good resource. it is amazing, very comprehensive and seems to be growing still.
Replying to @YouJiacheng
This ended up taking a lot of text to answer, so you've got a new section in the manual. Read "What you should expect to be compilable" docs.google.com/document/d/1…
27
310
46,535
just learnt Microsoft fired faster cpython team and cancelled support for it. this is after i learnt they fired ZaZ earlier (physics of LM) and WizardLM team. wtf microsoft? @satyanadella this is unfknbelievable.
17
14
304
27,907
why delete this, does he consider this some rare deeply hidden alpha?
33
6
274
30,608
ohh, so this is what to expect today. github models leak - archive.is/2025.08.07-035308…
9
13
280
38,046
Round 2 llama2.mojo vs llama2.c on M2 pro llama2.mojo -> 850 tok/s llama2.c -> 639 tok/s thanks @tairov for suggesting runfast for llama2.c and missing flag for mojo 🙏
llama2.mojo on M2 MBP seems to be much faster than llama2.c I didn't do any optimisations from my side. llama2.mojo, tinyllama 15M: ~434 tok/s llama2.c: ~120 tok/s Trying if I need to compile with optimisations for llama2.c for Apple silicon.
17
32
268
94,088
> weekly rate limits that only apply to top 5% users selectively targeting power users. not happy at all.
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
23
2
256
26,235
i am free of my curse (momentarily). pure RL in pretrain from random weights works. now successfully scaling generally on natural language. writing everything, would share later.
19
14
256
16,318
the value of this is clearly under-communicated. what this translates to is - "anybody can learn training a model end-to-end for free" Google grants TPUs willy-nilly if you apply for research but jax ecosystem doesn't exist and...
.@StanfordCRFM's Marin project has released the first fully open model in JAX. It’s an 'open lab' sharing the entire research process - including code, data, and logs, to enable reproducibility and further innovation. developers.googleblog.com/en…
5
11
251
23,027
it is evident zuck bought into all the wrong reasons for meta to lose its oss lead, namely > deepseek used llama weights to get ahead > the llama4 failure was due to talent/data scarcity and not due to the executive level cluelessness wrong lessons and only paranoia and haphazardness follow.
New Zuck post, what a difference a few years makes: Today: "We'll need to be rigorous about mitigating these risks and careful about what we choose to open source." 2024: "Meta is committed to open source AI... and therefore a platform that will be around for the long term."
8
10
250
35,853
> On the recent IMO 2025, equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 - our pipeline correctly solved 5 out of the 6 problems the scaffold bros keep on winning, the zeroth order optimizer agent sauce is not weak at all.
Replicate IMO-Gold in less than 500 lines: gist.github.com/faabian/39d0… The prover-verifier workflow from Huang & Yang: Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline (arxiv.org/abs/2507.15855), original code at github.com/lyang36/IMO25/
2
11
245
30,583
google released diffusion LM and it has good bench scores, vibe eval thread to see if it is just a bench score gimmick or truly worth spending time with
4
6
239
42,645
Replying to @spikedoanz
beautiful but I'm afraid it's incorrect. you can't stack squares and visualise a cube. that's a very "flatland" way, it's limited by perception itself. in higher dimensions, your methods of the Cartesian plane wouldn't make sense.
3
1
227
10,621
Doomers were right. Look at how a schizo chimera model achieves AGI internally
7
11
223
16,909
deepseek r1 preview is about to be released and it is going to be so damn amazing
7
7
231
13,793
Have you said please and thank you to @deepseek_ai for launching infra in open source to this extent in just a week? the impact of which might lead to industry wide cost reduction in serving models?
4
12
226
9,052
behold 114 pages of ai generated slop that 2k people like and 4k bookmark. ultimate guide to fooling yourself probably.
Nice paper for a long read across 114 pages. "Ultimate Guide to Fine-Tuning LLMs" Some of the things they cover 📊 Fine-tuning Pipeline Outlines a seven-stage process for fine-tuning LLMs, from data preparation to deployment and maintenance. 🧠 Advanced Fine-tuning Methods Covers techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) for aligning LLMs with human preferences. 🛠️ Parameter-Efficient Fine-Tuning (PEFT) Techniques Discusses methods like LoRA, QLoRA, and adapters that enable efficient fine-tuning by updating only a subset of model parameters. 🔬 Evaluation metrics and benchmarks for assessing fine-tuned LLMs Includes perplexity, accuracy, and task-specific measures. Benchmarks like GLUE, SuperGLUE, TruthfulQA, and MMLU assess various aspects of LLM performance. Safety evaluations using frameworks like DecodingTrust are also crucial for ensuring responsible AI deployment. 💻 Explores various deployment approaches and optimization techniques to enhance LLM performance and efficiency in real-world applications. 🌐 Examines the extension of fine-tuning techniques to multimodal models and domain-specific applications in fields like medicine and finance.
8
8
221
26,347
GPT-5: lost 71% in a week. Qwen 3 Max: gained 70% in a week. How is Qwen 3 so good at trading??
4
7
230
22,401
finished nanochat d20 all the way to RL and now it is telling me google acquired microsoft for 52B USD in 2020.
pretraining nanochat is ongoing to quickly repro what already exists, then i am going to go full monkey on the repo
10
8
226
25,738
i can comfortably say the knowledge to build sota scaffolds and post training RL envs have gotten inseparable as of now. claude is bottom up agentic. kimi k2 paper is all about beautiful agentic envs. qwen3 training required over 20k envs. learning to build sota RL envs >> KLD derivations, autograd from scratch
Replying to @YagaoDirac
rl envs are the new synthetic data pipelines.
5
18
224
29,486
caffeine, L theanine and absolute disentanglement from anything that poisons your will.
What's a good nootropic stack for someone just getting into using their brain?
12
3
211
14,028
i know what i am doing tonight
6
7
219
12,553
Replying to @anpaure
a true g(r)eek tragedy fr
194
13,569
e/xperiments philosophy:
* No muh-favourite-architecture
* Stay GPU-poor, stay foolish (literally)
* Forever behind SoTA, always learning
* Everyone sleeps on smol models
* Data curation/evaluation is the MOAT
* Synthetic dataset creation is art
8
17
209
56,986
i know what im watching tonight
13
4
204
9,398
nobody starts as being a junior now. you all learn to be a lead right from the get go. start owning and caring really hard about everything small and don't fear anything. if you're passing out in 2026, you have to do it. it's a necessity now.
8
6
194
11,808
if you're really good and you do things for people. amazing things happen.
4
10
195
7,677
everyone should be dropping such questions on twt. albeit with precise details for help though. shrek comes in clutch for any B200 alpha as usual.
3
5
194
11,927
signatures to look for in ai writing -
> "it isn't just x, it is y"
> narrative-philosophical-poetic section headings "The XYZ - A Journey of ABC"
> overuse of symbolism and lofty adjectives - "stands as a testament", "plays a vital role", "underscores its importance"
> promotional language - "rich cultural heritage", "breathtaking"
> emojis 🚀🤯🤩
> "delve"
> emdashes
> overuse of hedging language
> mandatory need to conclude
> verbosity, "unnecessary code comments", newlines at end of file
add any you might know
once you see it you can't unsee it. and it's everywhere
19
9
195
21,621
interesting tidbits from openai IMO gold medal tech team podcast > alex wei's intuition was primary > more confident on my idea of optimal arguments seems to be getting confirmed - "if model think for 1500 hours, then you have to eval it for 1500 hours." > multi-agent approach, heavily leverages compute. P6 requires ability to think for much longer.
6
3
193
17,746
*clears his throat in ml* muon
I'm pregnant and looking for a baby boy name that ends with “on” Help me out before my husband suggests Dragon again 🙂
7
3
187
10,189
for those of you following pure rl shenanigans, here's the complete plan github.com/tokenbender/avata…
alright a plan that looks solid has emerged for next stage of avataRL repo (RL from zero pretrain). would be coding the first set, create a thinking md file for easy follow up and post in two hours. *fingers crossed*
18
11
187
30,333
llama2.mojo on M2 MBP seems to be much faster than llama2.c I didn't do any optimisations from my side. llama2.mojo, tinyllama 15M: ~434 tok/s llama2.c: ~120 tok/s Trying if I need to compile with optimisations for llama2.c for Apple silicon.
7
26
180
87,058
Replying to @himanshustwts
Do you really believe this is a legit model getting #3 arc agi with 4B params? it's a grift model.
2
167
24,917
sakana is probably the most hacker coded research lab. they often have higher error rate but their directions are always intriguing.
Introducing Continuous Thought Machines New Blog: sakana.ai/ctm/ Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence. Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations. Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity. Learn more about our approach: Interactive Report: pub.sakana.ai/ctm/ Full Paper: arxiv.org/abs/2505.05522 GitHub : github.com/SakanaAI/continuo…
3
14
184
7,707
this is worse than reddit bruh. you guys fucking haven't seen a book ToC in your life? god save you cuz I don't think anything else is gonna help.
mathematics for computer science from scratch.
12
1
163
13,603
i would never want to post such vids unless they look like love letters to what we live and breathe for.
2
177
11,937
the best thing to drop this entire month. this information is useful in a myriad of ways.
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
9
5
178
25,037
please check out the samples at the end of the article as well. this is just 250M params btw.
is it possible to pretrain a language model using pure reinforcement learning from scratch? random weights, no cross-entropy loss pretraining. you may have many questions in your head.
5
7
178
25,559
pretraining nanochat is ongoing to quickly repro what already exists, then i am going to go full monkey on the repo
8
2
180
49,292
why do we accept approximating distribution from data/model with forward KL (left picture)? why not work towards algos that look like (right picture)?
14
6
175
27,686
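The pictures from the post above are not reproduced here. Assuming they contrast a mass-covering fit with a mode-seeking fit, the difference between the two KL directions can be checked numerically on a small discrete example. The distributions below are made up for illustration.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

# Bimodal "data" distribution over four bins (illustrative numbers).
p = [0.48, 0.02, 0.02, 0.48]
# Two candidate approximations.
q_cover = [0.25, 0.25, 0.25, 0.25]  # spreads mass over everything
q_mode = [0.94, 0.02, 0.02, 0.02]   # commits to a single mode

forward_cover, forward_mode = kl(p, q_cover), kl(p, q_mode)  # KL(p || q)
reverse_cover, reverse_mode = kl(q_cover, p), kl(q_mode, p)  # KL(q || p)
```

Forward KL, the direction that cross-entropy training minimizes, scores `q_cover` better because it heavily penalizes putting low probability where the data has mass. Reverse KL scores `q_mode` better because it penalizes putting mass where the data has little.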
i love thinky and tinky but hear me out the "everyone can post-train now" meme is good until your 8B LoRA fails on your promised niche eval against dirt-cheap gemini flash or grok fast. why do i think tinker is more of a siren than a savior? infra is a beast that awakens every once in a while and demands blood sacrifice. but the "creative part" as karpathy frames to make it sound friendly? the creative part is more of a sadistic odious machine. i feel like i am diving in a dumpster for trinkets every time i am starting off on a new task. do not get me started on reward and env design. handcrafting was put in the dictionary after watching RL-2025-engineers trying to make things work. though, given thinky's reputation, if this is not just fine-tune as a service, they cut down infra cost by an OOM and i ONLY have to worry about data/env. then this is something that changes vertical models industry altogether. but is it?
Tinker is cool. If you're a researcher/developer, tinker dramatically simplifies LLM post-training. You retain 90% of algorithmic creative control (usually related to data, loss function, the algorithm) while tinker handles the hard parts that you usually want to touch much less often (infra, forward/backward of the LLM itself, distributed training), meaning you can do these at well below <<10% of typical complexity involved. Compared to the more common and existing paradigm of "upload your data, we'll post-train your LLM", this is imo a more clever place to "slice up" the complexity of post-training, both delegating the heavy lifting, but also keeping majority of the data/algorithmic creative control. I think the community still has to discover how and when finetuning makes sense compared to the (often strong) baseline of prompting a giant model. The early indications I've seen is that finetuning isn't so much about "stylizing" an LLM, instead, it's a lot more about narrowing the scope, and especially when you have a lot of training examples. An extreme example of scope narrowing being that of categorical classifiers, e.g.spam filters, content filters, etc. but it should be broader than that. Instead of building a giant few-shot prompts for a big LLM, it might work a lot better (and faster!) to finetune a smaller LLM specifically for your narrow task. Increasingly, production applications of LLMs are larger pipelines where a bunch of LLMs collaborate in DAGs and flows. Some of these components might work well as prompts. But a lot of it will probably work a lot better as a finetune. Tinker makes the latter trivial and should allow for an easy experimentation of what works best at any stage.
6
4
175
33,781
how many times I've shit on Krutrim in comments? apparently not enough. someday (maybe already) they're going to do a breach of licence thinking nobody will catch them too. krutrim doesn't mean artificial, it means fake.
On the left we have a billion dollar company with restrictive license, On the right we have a open source non profit company with MIT license. They blatantly copied; they didn't even have the courtesy to clone the repo and make changes.
7
13
166
11,806
> I don’t think transformers success is as much to do with attention as it is to do with discretization (tokenization) lmao
It is wild to me how little Deep Learning researchers know about basic statistical theory. Everyone acts like all to all attention is a free lunch while basic stats has shown many better ways to capture long range dependencies instead of comparing every token to each other. Ironically, I don’t think transformers success is as much to do with attention as it is to do with discretization (tokenization) which HMM used to great effect and makes modeling long range dependencies much more tractable. Dense attention is compute hungry ESPECIALLY under autoregression. Just using basic intuition a model that first zooms out, then drills down mirroring natural dependency hierarchies feels mooore right... Maybe it’s time we actually think first before wasting a quadrillion flops.
9
2
171
13,800
> It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit) Let's go guys. The stage is set. Time to dance.
Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script and in as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI. It weighs ~8,000 lines of imo quite clean code to: - Train the tokenizer using a new Rust implementation - Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics - Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use. - SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval) - RL the model optionally on GSM8K with "GRPO" - Efficient inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI. - Write a single markdown report card, summarizing and gamifying the whole thing. Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems, answer simple questions. About ~12 hours surpasses GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc. My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). 
I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved. Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
6
1
175
14,940
oh wow, this isn't going to be another namespace collision like k2 think at all.
SimpleVLA-RL (Tsinghua × Shanghai AI Lab under PRIME-RL): first clean port of R1-style RL into VLAs. Recipe: tokenized action head on OpenVLA-OFT plus veRL, outcome-only rewards, mixed-outcome dynamic sampling, Clip-Higher [0.8, 1.28], rollout T=1.6, GRPO without KL ref, parallel multi-environment rendering. Benchmarks - LIBERO avg 91.0 to 99.1; Long 86.5 to 98.5 (beats π₀ 85.2, UniVLA) - RoboTwin-1.0 39.8 to 70.4 - RoboTwin-2.0 (12 tasks, short to extra-long) 38.3 to 68.8 (π₀ 49.2, RDT 33.3) - Low-data: one-traj SFT 48.9 → 96.9; Long 17.3 → 91.7 (+74.4 pts, +430%) - Sim-to-Real (4 tasks) 17.5 → 38.5; above RDT 23.5 Findings - Generalization: RL strengthens spatial, object, and goal OOD while SFT overfits and often catastrophically forgets - Emergence: “Pushcut” where the policy learns push-based shortcuts vs grasp-move-place - Limits: needs non-zero base skill (0-traj SFT to RL fails); benefits scale with stronger priors (100 vs 1000 traj SFT) - Infra: 8×A800-80G, LR 5e-6, batch 64, 8 samples per prompt, action chunks (LIBERO 8, RT 25)
5
10
169
17,557
this is what sholto meant by "the current system is being held up by patches and duct tapes" in his recent podcast.
Me thinks I wasn't supposed to see this
4
167
17,796
Just woke up. Did I miss anything?
3
1
162
9,953
Pixtral 12B is probably going to repeat Mistral v0.1 history
magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%https://nitter.app/t.co/2UepcMHjvL%3A1337%2Fannounce&tr=http%3A%2F%https://nitter.app/t.co/NsTRgy7h8S%3A80%2Fannounce
3
15
156
18,592
it's worse guys, o3 mini knowledge cut off is oct '23 and they ask why devs prefer sonnet more?
o3 mini got released and nobody's saying anything about its cut off date am i still going to suffer with Dec '23 training cut off? no, i want to use uv, polars and all the other libraries that got released until aug '24 your RL iteration is now 3 months, update datasets.
14
12
154
23,587
Replying to @cursor_ai
woah this is literally claude code lmao
1
1
161
7,847