tokenbender · Jun 7, 2026 · 5:47 PM UTC

tokenbender

Pinned Tweet

tokenbender

@tokenbender

Jun 7

We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction". A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use-cases often matter disproportionately. Prism asks and answers a simple question - "Is it possible to isolate and deploy only capabilities that are driven by the Pareto principle and cut down costs by a huge margin while preserving most of the performance?" This paper discusses a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.

216

22,991

tokenbender · Sep 28, 2025 · 7:24 PM UTC

tokenbender

@tokenbender

28 Sep 2025

this is beyond mindblowing for me. somebody built a 5 million param language model inside minecraft, trained it, equipped it with basic conversational ability. probably the best thing i have seen this entire month.

348

1,507

27,197

1,735,925

tokenbender · Oct 5, 2025 · 5:57 PM UTC

tokenbender

@tokenbender

5 Oct 2025

anything for the agi

221

600

12,114

570,443

tokenbender · Oct 17, 2025 · 10:23 AM UTC

tokenbender

@tokenbender

17 Oct 2025

tired of arguing with GPT5 pro, so i asked it to fight itself on my behalf. you guys do not understand automation as well as i do.

129

5,201

238,959

tokenbender · Aug 6, 2024 · 8:31 PM UTC

tokenbender

@tokenbender

6 Aug 2024

this playlist is gonna blow up cuz @karpathy sensei just recommended it from scratch

294

4,833

400,327

tokenbender · Jul 14, 2025 · 4:05 PM UTC

tokenbender

@tokenbender

14 Jul 2025

"I work at a secret project at xAI“ "The DoD one or the hentai one?“

xAI

@xai

14 Jul 2025

Announcing Grok for Government - a suite of products that make our frontier models available to United States Government customers. We are especially excited about two new partnerships for our US Government partners: 1) a new contract from the US Department of Defense, 2) our products being available to purchase via the General Services Administration (GSA) schedule. This allows every federal government department, agency, or office to purchase xAI products. We're hiring mission driven engineers who want to join the cause.

151

4,640

243,636

tokenbender · Mar 18, 2025 · 5:20 PM UTC

tokenbender

@tokenbender

18 Mar 2025

perplexity seems to be on a path to be a case study. i will not elaborate.

Trishla Ostwal @trishlaostwal

17 Mar 2025

New: @perplexity_ai just dropped its first celeb ad starring Squid Game's Lee Jung-jae. The AI search startup takes a jab at @Google “Poogle”—with a cheeky line: “Don’t use glue” when Jung-jae asks how to make cheese stick to pizza. 💰 Mid-seven-figure buy

3,451

380,327

tokenbender · Jul 11, 2024 · 11:59 AM UTC

tokenbender

@tokenbender

11 Jul 2024

me one year ago, before i decided to dive deeper into LLMs. just C and sometimes python.

2,486

108,772

tokenbender · Jun 1, 2024 · 3:06 PM UTC

tokenbender

@tokenbender

1 Jun 2024

Replying to @melqtx

a mathematical construct for solving multi-dimensional problems for entities having a magnitude and directional representation.

2,220

95,125

tokenbender · Oct 18, 2025 · 7:15 AM UTC

tokenbender

@tokenbender

18 Oct 2025

Elon Musk

@elonmusk

18 Oct 2025

My estimate of the probability of Grok 5 achieving AGI is now at 10% and rising

1,851

104,814

tokenbender · Sep 29, 2025 · 2:31 PM UTC

tokenbender

@tokenbender

29 Sep 2025

i got to know this isn't the first time this has happened. sammyuri and others have been building incredible things pushing redstone x minecraft to its limit. a collection. 1) a complete CNN in minecraft, this was last year

tokenbender

@tokenbender

28 Sep 2025

1,802

253,753

tokenbender · Aug 5, 2025 · 11:17 AM UTC

tokenbender

@tokenbender

5 Aug 2025

a 17 yo genius disproves an age old mathematical conjecture and she is getting her applications rejected because she doesn't have a college degree. this can't be real. quantamagazine.org/at-17-han…

1,590

661,296

tokenbender · Aug 9, 2025 · 2:30 PM UTC

tokenbender

@tokenbender

9 Aug 2025

is it possible to pretrain a language model using pure reinforcement learning from scratch? random weights, no cross-entropy loss pretraining. you may have many questions in your head.

154

1,585

313,981

tokenbender · Oct 31, 2025 · 5:57 AM UTC

tokenbender

@tokenbender

31 Oct 2025

Rosinality @rosinality

31 Oct 2025

FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!

1,477

157,303
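A minimal sketch of the number formats behind the FP16 vs BF16 discussion above. FP16 keeps 10 explicit mantissa bits and 5 exponent bits, BF16 keeps 7 mantissa bits and 8 exponent bits, so FP16 resolves finer differences near 1.0 but underflows sooner, which is where loss scaling comes in. The helper names `to_fp16` and `to_bf16` and the scale factor 1024 are assumptions for illustration; this says nothing about how the quoted paper measures the training-inference gap.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (1 sign, 8 exponent, 7 mantissa bits) by rounding a
    float32 bit pattern to its top 16 bits, ties to even. NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision: 1 + 2^-9 survives in fp16 but rounds back to 1.0 in bf16.
x = 1 + 2**-9
print(to_fp16(x))   # 1.001953125
print(to_bf16(x))   # 1.0

# Range: a tiny gradient underflows in fp16 unless it is scaled up first.
tiny_grad = 1e-8
loss_scale = 1024.0                       # hypothetical scale factor
print(to_fp16(tiny_grad))                 # 0.0 (lost)
scaled = to_fp16(tiny_grad * loss_scale)  # representable as an fp16 subnormal
print(scaled / loss_scale)                # roughly 1e-8, recovered after unscaling
print(to_bf16(tiny_grad))                 # nonzero: bf16 shares float32's exponent range
```

The trade-off is visible in the two halves: fp16 wins on resolution, bf16 wins on range, and loss scaling is the usual workaround for the range problem.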

tokenbender · Oct 14, 2025 · 5:03 PM UTC

tokenbender

@tokenbender

14 Oct 2025

never deleting this app

1,433

129,431

tokenbender · Jan 25, 2025 · 8:37 PM UTC

tokenbender

@tokenbender

25 Jan 2025

this is nice. something worth sharing for sure.

1,377

147,848

tokenbender · Oct 17, 2025 · 8:24 PM UTC

tokenbender

@tokenbender

17 Oct 2025

god bless claude code team to ship such amazing features that GLM users can enjoy.

Thariq

@trq212

17 Oct 2025

Claude Code can now ask you interactive questions when it needs more information or when there are multiple paths forward.

1,201

100,813

tokenbender · Oct 25, 2025 · 9:42 AM UTC

tokenbender

@tokenbender

25 Oct 2025

熊师傅 weight decay 了吗 @bigeagle_xd

25 Oct 2025

sad but true ...

1,136

312,424

tokenbender · May 13, 2025 · 10:16 AM UTC

tokenbender

@tokenbender

13 May 2025

it probably feels like a self-defeating thing to say for researchers but NEXT TOKEN PREDICTION IS THE MOST OPTIMAL COMPRESSION. - shannon, 1948 (a mathematical theory of communication)

François Fleuret

@francoisfleuret

13 May 2025

Hot-take: Auto-regression sucks and is impressive as a parlor trick. Any spark of intelligence from an LLM reflects that it moved beyond, and built a factorized model with meaningful latents.

1,083

129,140
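The link between prediction and compression in the post above can be made concrete with a toy calculation. Under a model that assigns probability p to the token that actually comes next, an ideal entropy coder spends about -log2(p) bits on that token, so a better next-token predictor gives a shorter encoding. The probabilities below are made up for illustration and `code_length_bits` is a hypothetical helper, not anything from the thread.

```python
import math

def code_length_bits(probs):
    """Ideal code length in bits: sum of -log2(p) over the probabilities
    the model assigned to the tokens that actually occurred."""
    return sum(-math.log2(p) for p in probs)

# Eight tokens from a 4-symbol alphabet.
uniform_model = [0.25] * 8              # knows nothing: 2 bits per token
sharp_model = [0.9] * 6 + [0.05] * 2    # mostly right, badly wrong twice

print(code_length_bits(uniform_model))  # 16.0
print(code_length_bits(sharp_model))    # about 9.56, shorter despite two misses
```

This only shows the accounting; a practical compressor would pair the model with something like arithmetic coding to get close to these lengths.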

tokenbender · Oct 18, 2025 · 6:09 PM UTC

tokenbender

@tokenbender

18 Oct 2025

damn did karpathy pod just change the bubble burst timeline?

1,053

141,326

tokenbender · Sep 28, 2025 · 7:24 PM UTC

tokenbender

@tokenbender

28 Sep 2025

link - piped.video/watch?v=VaeI9YgE…

I built ChatGPT with Minecraft redstone!

I built a small language model in Minecraft using no command blocks...

youtube.com

1,039

176,403

tokenbender · Oct 19, 2025 · 1:17 PM UTC

tokenbender

@tokenbender

19 Oct 2025

Bread

@ai_bread

18 Oct 2025

Announcing Bread Technologies. We’re building machines that learn like humans. We raised a $5 million seed round led by Menlo Ventures and have been building in stealth for 10 months. Today, we rise 🍞

914

283,284

tokenbender · Jun 15, 2025 · 5:38 PM UTC

tokenbender

@tokenbender

15 Jun 2025

how i bring the best out of claude code?

tokenbender

@tokenbender

15 Jun 2025

my workflow to get the best out of claude code is complete now.

854

127,679

tokenbender · Oct 21, 2025 · 9:20 AM UTC

tokenbender

@tokenbender

21 Oct 2025

Dwarkesh Patel

@dwarkesh_sp

21 Oct 2025

Wanted to get better intuitions for how RL works on LLMs. So I wrote a simple script to teach Nanochat to add 5 digit numbers. I was surprised at how fast it learned. Until I looked at the model's generations and realized that it had just learned to always call the built-in Python interpreter 😂. The code I wrote is very remedial, minimal, and inefficient - I'm a professional podcaster, alright? But it might be helpful if you just want to see the basics of how REINFORCE or GRPO work. Link to gist below. Fundamentally, it's not that complicated: generate multiple trajectories per prompt. Update your model to make it more likely that it samples all the tokens in the successful trajectories.

771

139,039
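The recipe in the quoted post (sample several trajectories per prompt, then make the tokens of the successful ones more likely) can be sketched in plain Python. This is a toy illustration and not the gist the post refers to: the function names, the binary rewards, and the log-prob numbers are assumptions, and the normalization is the group-relative form commonly described for GRPO.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Advantage of each trajectory relative to its group (same prompt):
    (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def reinforce_loss(token_logprobs, advantages):
    """REINFORCE-style surrogate loss: minimizing it raises the log-prob of
    every token in trajectories with positive advantage and lowers it for
    trajectories with negative advantage."""
    total = 0.0
    for logps, adv in zip(token_logprobs, advantages):
        total += -adv * sum(logps)
    return total / len(advantages)

# Four sampled trajectories for one prompt; reward 1.0 means the answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Hypothetical per-token log-probs the policy assigned to each trajectory.
token_logprobs = [[-0.5, -0.2], [-1.0, -0.3], [-0.7, -0.9], [-0.1, -0.4]]
loss = reinforce_loss(token_logprobs, advantages)
print(advantages, loss)
```

In a real setup the log-probs would come from the model with gradients attached, and the loss would be backpropagated; here they are plain floats so the arithmetic is easy to follow.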

tokenbender · Jul 17, 2025 · 5:27 AM UTC

tokenbender

@tokenbender

17 Jul 2025

we missed a banger paper in the grok4/k2 drop noise guys. these guys
> look for optimal ways to select data mixes to get max improvement on a model given a target domain.
> do multimodal validation
> show good extrapolation accuracy (testing on 1.4B and predicting on 8B)

765

71,989

tokenbender · Oct 22, 2025 · 5:35 AM UTC

tokenbender

@tokenbender

22 Oct 2025

josh yan 👍🏼@josh1yan

21 Oct 2025

hypothetically, if one wanted to research at a frontier lab in 1.5 years (hypothetically winter 2027) and wanted to know how to develop the necessary skills and credentials to do so, what would you suggest to them (asking for a friend)

751

92,894

tokenbender · Jun 20, 2025 · 3:15 PM UTC

tokenbender

@tokenbender

20 Jun 2025

how i bring the best out of claude code - part 2

tokenbender

@tokenbender

20 Jun 2025

intermediate level workflow for claude code is complete. taught claude to
> create commands,
> manage its own commands,
> extract entire session and save it like os paging,
> do multi agent deep research locally,
> analyse functions in a specific style,
> search the repo effectively,
> work with user created scripts.
post scheduled for 3.5 hours later. enjoy.

682

105,557

tokenbender · Sep 17, 2025 · 5:48 PM UTC

tokenbender

@tokenbender

17 Sep 2025

wow, this is exciting. both openai and google won gold medals at ICPC 2025. oai system solved 12/12 while gemini 2.5 deepthink solved 10/12. what is noteworthy is that no human team solved more than 11.

627

53,109

tokenbender · Oct 31, 2025 · 2:22 PM UTC

tokenbender

@tokenbender

31 Oct 2025

researchers when asked to switch from bf16 to fp16 and do loss scaling because it is way better for RL

629

41,622

tokenbender · Jul 12, 2025 · 4:32 AM UTC

tokenbender

@tokenbender

12 Jul 2025

kimi k2 is token-for-token the strongest model on earth. dirt cheap, such high quality.

588

31,000

tokenbender · Oct 17, 2025 · 6:07 AM UTC

tokenbender

@tokenbender

17 Oct 2025

this paper cost 4.2 mil USD to write holy... most labs haven't reached the point of releasing models that cost that much let alone a paper that covers all the details

541

102,768

tokenbender · May 1, 2025 · 1:37 PM UTC

tokenbender

@tokenbender

1 May 2025

Making a list of the graveyard of ideas, the ultimate nerd snipes where efforts go to die
DPO-*variant
SSM-transformer hybrids
SAEs
MCTS
Diffusion for large vision models
Attention-less
JEPA (lecun lovers)
what else?

518

100,129

tokenbender · Jan 29, 2025 · 4:22 PM UTC

tokenbender

@tokenbender

29 Jan 2025

Indian Reflection grift model Shivaay > if we used openai approach, it'll not be possible. so we innovated something something and pretrained 4B SoTA unbelievable model on 4x A100 80G > unbelievable? no, skill issue. we invented everything but will not share tech report.

487

62,334

tokenbender · Sep 9, 2025 · 12:10 PM UTC

tokenbender

@tokenbender

9 Sep 2025

A really good paper from Meta Superintelligence Labs dropped. essentially a primer on how to bootstrap tasks and create efficient multi-turn self improving envs. and uses the same intuitions which i have been sharing on here. > guided exploration > prefills as hints

500

35,763

tokenbender · Jul 29, 2025 · 7:25 AM UTC

tokenbender

@tokenbender

29 Jul 2025

damn looks like i was not hallucinating, why is this paper not on the timeline?

tokenbender

@tokenbender

29 Jul 2025

muon is nice, ever since i got it working i can notice the way the output token quality is just better? probably varied from Adam optimizer tokens would be a better way to phrase it. but these models with different optimisers are different even with the same data.

494

56,886

tokenbender · Oct 15, 2025 · 1:27 PM UTC

tokenbender

@tokenbender

15 Oct 2025

if "fork found in the kitchen" was a paper in language modelling

486

34,530

tokenbender · Sep 28, 2025 · 7:27 PM UTC

tokenbender

@tokenbender

28 Sep 2025

Replying to @realmcore_

unbelievable to me when i saw they even trained it.

474

60,632

tokenbender · Jul 13, 2025 · 10:15 AM UTC

tokenbender

@tokenbender

13 Jul 2025

i do not suggest kimi k2 with claude code. it would ruin your experience of both claude code and k2. claude4 is RL-ed to make best use of the prompts and env it gets inside the scaffold. k2 has a higher error rate with tools/ops as context grows. use k2 with opencode/cline.

469

56,031

tokenbender · Sep 28, 2025 · 9:13 PM UTC

tokenbender

@tokenbender

28 Sep 2025

Replying to @andromeda74356

the guy is brilliant.

460

77,613

tokenbender · Aug 27, 2025 · 9:57 AM UTC

tokenbender

@tokenbender

27 Aug 2025

this is the paper. just found it out in the wild and thought of this post.

zed @zmkzmkz

21 Aug 2025

y'all ain't ready for this token order prediction (TOP) early preprint coming next week. it's promising enough for me to get a checkmark to boost the paper a bit, let's see

469

103,147

tokenbender · Jul 14, 2025 · 11:15 AM UTC

tokenbender

@tokenbender

14 Jul 2025

TIL annas archive is now a data broker for LLM training

443

32,101

tokenbender · Oct 23, 2025 · 11:39 AM UTC

tokenbender

@tokenbender

23 Oct 2025

Yuandong Tian

@tydsh

23 Oct 2025

Several of my team members + myself are impacted by this layoff today. Welcome to connect :)

405

38,848

tokenbender · Oct 30, 2025 · 8:30 AM UTC

tokenbender

@tokenbender

30 Oct 2025

Extropic

@extropic

29 Oct 2025

Hello Thermo World.

371

26,350

tokenbender · Jul 9, 2025 · 8:36 AM UTC

tokenbender

@tokenbender

9 Jul 2025

anybody who wishes to understand current RL algos needs to do this minimum

will brown

@willccbb

8 Jul 2025

REINFORCE is coming back in a big way

369

25,027

tokenbender · Oct 30, 2025 · 5:47 PM UTC

tokenbender

@tokenbender

30 Oct 2025

banger release but also attention bros this week

Kimi.ai

@Kimi_Moonshot

30 Oct 2025

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Ki… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length. Key highlights: 🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule. 🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board. 🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints. The future of agentic-oriented attention is here! 💡

371

34,234

tokenbender · Oct 20, 2025 · 8:08 AM UTC

tokenbender

@tokenbender

20 Oct 2025

what a bold direction by deepseek once again. they took "a picture is worth a thousand words" literally or the idea of "photographic memory" if i am to commit the crime of anthropomorphisation.

353

132,234

tokenbender · Aug 31, 2024 · 7:51 AM UTC

tokenbender

@tokenbender

31 Aug 2024

If the model can infer when it's incorrect, it should be allowed to backtrack with <backspace> tokens. A model trained from scratch with backspace/backtrack on the same dataset might be able to fix this on its own.

Riley Goodside

@goodside

31 Aug 2024

The quickest “oops” I’ve ever seen from an LLM:

ALT Screenshot of ChatGPT dialog Riley: how many u’s in yo? GPT: There is only one “u” in “yo”—but actually, there isn’t any!

332

45,510

tokenbender · Nov 10, 2025 · 8:32 PM UTC

tokenbender

@tokenbender

10 Nov 2025

He used TLDR correctly. Because he himself didn't read what the paper actually says.

Pedro Domingos

@pmddomingos

10 Nov 2025

TL;DR: LLMs don’t reason, and LLMs with RL still don’t.

341

39,494

tokenbender · Nov 12, 2025 · 12:16 PM UTC

tokenbender

@tokenbender

12 Nov 2025

the state of evals tells us how early we are in developing any solid understanding of these systems

Ara

@arafatkatze

12 Nov 2025

Cursor vs. Cognition have opposite takes on agent search: - Cursor: Custom embeddings trained on agent traces improve accuracy by 12.5% - Cognition: Embeddings are counterproductive, so we trained models to use grep with 8x parallel tool calls. Both have benchmarks. Hmmm..

346

24,264

tokenbender · Oct 1, 2025 · 6:25 PM UTC

tokenbender

@tokenbender

1 Oct 2025

thinky really refused to acknowledge the existence of llama4 thinkingmachines.ai/tinker/

316

22,958

tokenbender · Sep 11, 2025 · 2:37 PM UTC

tokenbender

@tokenbender

11 Sep 2025

have had many questions in DMs about starting out -> training/RL your own models. there are gaps even with the presence of nanogpt and post training notebooks out in the open. would appreciate if you can let me know of more. would do a write up to bridge the understanding.

312

42,012

tokenbender · Jul 8, 2024 · 7:40 AM UTC

tokenbender

@tokenbender

8 Jul 2024

"torch compile: the missing manual" you probably can't pay to get such a good resource. it is amazing, very comprehensive and seems to be growing still.

Edward Z. Yang @ezyang

7 Jul 2024

Replying to @YouJiacheng

This ended up taking a lot of text to answer, so you've got a new section in the manual. Read "What you should expect to be compilable" docs.google.com/document/d/1…

310

46,535

tokenbender · May 15, 2025 · 7:43 AM UTC

tokenbender

@tokenbender

15 May 2025

just learnt Microsoft fired the faster cpython team and cancelled support for it. this is after i learnt they fired ZaZ earlier (physics of LM) and WizardLM team. wtf microsoft? @satyanadella this is unfknbelievable.

304

27,907

tokenbender · Jul 31, 2024 · 10:43 AM UTC

tokenbender

@tokenbender

31 Jul 2024

why delete this, does he consider this some rare deeply hidden alpha?

274

30,608

tokenbender · Aug 7, 2025 · 4:35 AM UTC

tokenbender

@tokenbender

7 Aug 2025

ohh, so this is what to expect today. github models leak - archive.is/2025.08.07-035308…

280

38,046

tokenbender · Oct 21, 2023 · 6:12 AM UTC

tokenbender

@tokenbender

21 Oct 2023

Round 2: llama2.mojo vs llama2.c on M2 pro
llama2.mojo -> 850 tok/s
llama2.c -> 639 tok/s
thanks @tairov for suggesting runfast for llama2.c and missing flag for mojo 🙏

tokenbender

@tokenbender

20 Oct 2023

llama2.mojo on M2 MBP seems to be much faster than llama2.c. I didn't do any optimisations from my side.
llama2.mojo, tinyllama 15M: ~434 tok/s
llama2.c: ~120 tok/s
Trying to see if I need to compile with optimisations for llama2.c for Apple silicon.

268

94,088

tokenbender · Jul 28, 2025 · 6:32 PM UTC

tokenbender

@tokenbender

28 Jul 2025

> weekly rate limits that only apply to top 5% users selectively targeting power users. not happy at all.

Anthropic

@AnthropicAI

28 Jul 2025

We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.


256

26,235

tokenbender · Aug 6, 2025 · 7:34 AM UTC

tokenbender

@tokenbender

6 Aug 2025

i am free of my curse (momentarily). pure RL in pretrain from random weights works. now successfully scaling generally on natural language. writing everything, would share later.

256

16,318

tokenbender · Jul 17, 2025 · 11:56 AM UTC

tokenbender

@tokenbender

17 Jul 2025

the value of this is clearly under-communicated. what this translates to is - "anybody can learn training a model end-to-end for free" Google grants TPUs willy-nilly if you apply for research but jax ecosystem doesn't exist and...

Google AI Developers

@googleaidevs

16 Jul 2025

.@StanfordCRFM's Marin project has released the first fully open model in JAX. It’s an 'open lab' sharing the entire research process - including code, data, and logs, to enable reproducibility and further innovation. developers.googleblog.com/en…

251

23,027

tokenbender · Jul 30, 2025 · 2:29 PM UTC

tokenbender

@tokenbender

30 Jul 2025

it is evident zuck bought into all the wrong reasons for meta to lose its oss lead, namely > deepseek used llama weights to get ahead > the llama4 failure was due to talent/data scarcity and not due to the executive level cluelessness wrong lessons and only paranoia and haphazardness follow.

Nathan Lambert

@natolambert

30 Jul 2025

New Zuck post, what a difference a few years makes: Today: "We'll need to be rigorous about mitigating these risks and careful about what we choose to open source." 2024: "Meta is committed to open source AI... and therefore a platform that will be around for the long term."

250

35,853

tokenbender · Oct 15, 2025 · 9:19 AM UTC

tokenbender

@tokenbender

15 Oct 2025

> On the recent IMO 2025, equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 - our pipeline correctly solved 5 out of the 6 problems the scaffold bros keep on winning, the zeroth order optimizer agent sauce is not weak at all.

Fabian Gloeckle @FabianGloeckle

14 Oct 2025

Replicate IMO-Gold in less than 500 lines: gist.github.com/faabian/39d0… The prover-verifier workflow from Huang & Yang: Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline (arxiv.org/abs/2507.15855), original code at github.com/lyang36/IMO25/

245

30,583

tokenbender · May 20, 2025 · 8:23 PM UTC

tokenbender

@tokenbender

20 May 2025

google released diffusion LM and it has good bench scores, vibe eval thread to see if it is just a bench score gimmick or truly worth spending time with

239

42,645

tokenbender · Jan 11, 2025 · 4:46 PM UTC

tokenbender

@tokenbender

11 Jan 2025

Replying to @spikedoanz

beautiful but I'm afraid it's incorrect. you can't stack squares and visualise a cube. that's a very "flatland" way, it's limited by perception itself. in higher dimensions, your methods of the Cartesian plane wouldn't make sense.

227

10,621

tokenbender · Oct 2, 2023 · 7:30 AM UTC

tokenbender

@tokenbender

2 Oct 2023

Doomers were right. Look at how a schizo chimera model achieves AGI internally

223

16,909

tokenbender · Jan 18, 2025 · 4:56 AM UTC

tokenbender

@tokenbender

18 Jan 2025

deepseek r1 preview is about to be released and it is going to be so damn amazing

231

13,793

tokenbender · Mar 3, 2025 · 9:28 AM UTC

tokenbender

@tokenbender

3 Mar 2025

Have you said please and thank you to @deepseek_ai for launching infra in open source to this extent in just a week? the impact of which might lead to industry wide cost reduction in serving models?

226

9,052

tokenbender · Oct 22, 2024 · 8:07 AM UTC

tokenbender

@tokenbender

22 Oct 2024

behold 114 pages of ai generated slop that 2k people like and 4k bookmark. ultimate guide to fooling yourself probably.

Rohan Paul

@rohanpaul_ai

21 Oct 2024

Nice paper for a long read across 114 pages. "Ultimate Guide to Fine-Tuning LLMs" Some of the things they cover 📊 Fine-tuning Pipeline Outlines a seven-stage process for fine-tuning LLMs, from data preparation to deployment and maintenance. 🧠 Advanced Fine-tuning Methods Covers techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) for aligning LLMs with human preferences. 🛠️ Parameter-Efficient Fine-Tuning (PEFT) Techniques Discusses methods like LoRA, QLoRA, and adapters that enable efficient fine-tuning by updating only a subset of model parameters. 🔬 Evaluation metrics and benchmarks for assessing fine-tuned LLMs Includes perplexity, accuracy, and task-specific measures. Benchmarks like GLUE, SuperGLUE, TruthfulQA, and MMLU assess various aspects of LLM performance. Safety evaluations using frameworks like DecodingTrust are also crucial for ensuring responsible AI deployment. 💻 Explores various deployment approaches and optimization techniques to enhance LLM performance and efficiency in real-world applications. 🌐 Examines the extension of fine-tuning techniques to multimodal models and domain-specific applications in fields like medicine and finance.

221

26,347

tokenbender · Oct 25, 2025 · 8:36 PM UTC

tokenbender

@tokenbender

25 Oct 2025

Yuchen Jin

@Yuchenj_UW

25 Oct 2025

GPT-5: lost 71% in a week. Qwen 3 Max: gained 70% in a week. How is Qwen 3 so good at trading??

230

22,401

tokenbender · Oct 14, 2025 · 6:27 PM UTC

tokenbender

@tokenbender

14 Oct 2025

finished nanochat d20 all the way to RL and now it is telling me google acquired microsoft for 52B USD in 2020.

tokenbender

@tokenbender

13 Oct 2025

pretraining nanochat is ongoing to quickly repro what already exists, then i am going to go full monkey on the repo

226

25,738

tokenbender · Jul 24, 2025 · 3:54 PM UTC

tokenbender

@tokenbender

24 Jul 2025

i can comfortably say the knowledge to build sota scaffolds and post training RL envs have gotten inseparable as of now. claude is bottom up agentic. kimi k2 paper is all about beautiful agentic envs. qwen3 training required over 20k envs. learning to build sota RL envs >> KLD derivations, autograd from scratch

tokenbender

@tokenbender

12 Jun 2025

Replying to @YagaoDirac

rl envs are the new synthetic data pipelines.

224

29,486

tokenbender · Jun 13, 2025 · 4:04 PM UTC

tokenbender

@tokenbender

13 Jun 2025

caffeine, L theanine and absolute disentanglement from anything that poisons your will.

json

@JsonBasedman

12 Jun 2025

What's a good nootropic stack for someone just getting into using their brain?

211

14,028

tokenbender · Oct 24, 2025 · 6:16 PM UTC

tokenbender

@tokenbender

24 Oct 2025

i know what i am doing tonight

219

12,553

tokenbender · Jun 17, 2024 · 2:54 PM UTC

tokenbender

@tokenbender

17 Jun 2024

Replying to @anpaure

a true g(r)eek tragedy fr

194

13,569

tokenbender · Oct 21, 2023 · 1:30 PM UTC

tokenbender

@tokenbender

21 Oct 2023

e/xperiments philosophy:
* No muh-favourite-architecture
* Stay GPU-poor, stay foolish (literally)
* Forever behind SoTA, always learning
* Everyone sleeps on smol models
* Data curation/evaluation is the MOAT
* Synthetic dataset creation is art

209

56,986

tokenbender · Sep 26, 2025 · 4:28 PM UTC

tokenbender

@tokenbender

26 Sep 2025

i know what im watching tonight

204

9,398

tokenbender · Dec 20, 2024 · 6:43 PM UTC

tokenbender

@tokenbender

20 Dec 2024

nobody starts as being a junior now. you all learn to be a lead right from the get go. start owning and caring really hard about everything small and don't fear anything. if you're passing out in 2026, you have to do it. it's a necessity now.

194

11,808

tokenbender · Oct 5, 2024 · 4:32 PM UTC

tokenbender

@tokenbender

5 Oct 2024

if you're really good and you do things for people. amazing things happen.

195

7,677

tokenbender · Jul 14, 2025 · 10:06 AM UTC

tokenbender

@tokenbender

14 Jul 2025

everyone should be dropping such questions on twt. albeit with precise details for help though. shrek comes in clutch for any B200 alpha as usual.


194

11,927

tokenbender · Jul 23, 2025 · 7:52 AM UTC

tokenbender

@tokenbender

23 Jul 2025

signatures to look for in ai writing -
> "it isn't just x, it is y"
> narrative-philosophical-poetic section headings "The XYZ - A Journey of ABC"
> overuse of symbolism and lofty adjectives - "stands as a testament", "plays a vital role", "underscores its importance"
> promotional language - "rich cultural heritage", "breathtaking"
> emojis 🚀🤯🤩
> "delve"
> emdashes
> overuse of hedging language
> mandatory need to conclude
> verbosity, "unnecessary code comments", newlines at end of file
add any you might know

Ben (no treats)

@andersonbcdefg

23 Jul 2025

once you see it you can't unsee it. and it's everywhere

195

21,621

tokenbender · Jul 30, 2025 · 12:50 PM UTC

tokenbender

@tokenbender

30 Jul 2025

interesting tidbits from openai IMO gold medal tech team podcast > alex wei's intuition was primary > more confident on my idea of optimal arguments seems to be getting confirmed - "if model think for 1500 hours, then you have to eval it for 1500 hours." > multi-agent approach, heavily leverages compute. P6 requires ability to think for much longer.

193

17,746

tokenbender · Oct 24, 2025 · 1:50 PM UTC

tokenbender

@tokenbender

24 Oct 2025

*clears his throat in ml* muon

.@ewwwtfff

23 Oct 2025

I'm pregnant and looking for a baby boy name that ends with “on” Help me out before my husband suggests Dragon again 🙂

187

10,189

tokenbender · May 31, 2025 · 4:24 PM UTC

tokenbender

@tokenbender

31 May 2025

for those of you following pure rl shenanigans, here's the complete plan github.com/tokenbender/avata…

tokenbender

@tokenbender

31 May 2025

alright a plan that looks solid has emerged for next stage of avataRL repo (RL from zero pretrain). would be coding the first set, create a thinking md file for easy follow up and post in two hours. *fingers crossed*

187

30,333

tokenbender · Oct 20, 2023 · 8:10 PM UTC

tokenbender

@tokenbender

20 Oct 2023

180

87,058

tokenbender · Jan 29, 2025 · 4:04 PM UTC

tokenbender

@tokenbender

29 Jan 2025

Replying to @himanshustwts

Do you really believe this is a legit model getting #3 arc agi with 4B params? it's a grift model.

167

24,917

tokenbender · May 12, 2025 · 7:30 AM UTC

tokenbender

@tokenbender

12 May 2025

sakana is probably the most hacker coded research lab. they often have higher error rate but their directions are always intriguing.

Sakana AI

@SakanaAILabs

12 May 2025

Introducing Continuous Thought Machines New Blog: sakana.ai/ctm/ Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence. Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations. Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity. Learn more about our approach: Interactive Report: pub.sakana.ai/ctm/ Full Paper: arxiv.org/abs/2505.05522 GitHub : github.com/SakanaAI/continuo…

184

7,707

tokenbender · Oct 26, 2024 · 3:27 PM UTC

tokenbender

@tokenbender

26 Oct 2024

this is worse than reddit bruh. you guys fucking haven't seen a book ToC in your life? god save you cuz I don't think anything else is gonna help.

blue

@bluewmist

26 Oct 2024

mathematics for computer science from scratch.

163

13,603

tokenbender · Oct 11, 2025 · 4:13 PM UTC

tokenbender

@tokenbender

11 Oct 2025

i would never want to post such vids unless they look like love letters to what we live and breathe for.

Prime Intellect

@PrimeIntellect

11 Oct 2025

177

11,937

tokenbender · Oct 29, 2025 · 9:32 AM UTC

tokenbender

@tokenbender

29 Oct 2025

the best thing to drop this entire month. this information is useful in a myriad of ways.

GLADIA Research Lab

@GladiaLab

27 Oct 2025

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)

178

25,037

tokenbender · Aug 9, 2025 · 4:42 PM UTC

tokenbender

@tokenbender

9 Aug 2025

please check out the samples at the end of the article as well. this is just 250M params btw.

tokenbender

@tokenbender

9 Aug 2025

is it possible to pretrain a language model using pure reinforcement learning from scratch? random weights, no cross-entropy loss pretraining. you may have many questions in your head.

178

25,559

tokenbender · Oct 13, 2025 · 8:20 PM UTC

tokenbender

@tokenbender

13 Oct 2025

pretraining nanochat is ongoing to quickly repro what already exists, then i am going to go full monkey on the repo

tokenbender

@tokenbender

13 Oct 2025

yeehaw mfkers

180

49,292

tokenbender · Nov 2, 2025 · 7:14 AM UTC

tokenbender

@tokenbender

2 Nov 2025

why do we accept approximating distribution from data/model with forward KL (left picture)? why not work towards algos that look like (right picture)?
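
The two pictures are not reproduced in this capture. Assuming they show the usual contrast between mass-covering and mode-seeking fits, a minimal sketch of that contrast follows. The bimodal target `p`, the two candidate fits `q_broad` and `q_narrow`, and the grid are all illustrative choices, not taken from the tweet.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(weights):
    """Turn non-negative weights on the grid into a discrete distribution."""
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """Discrete KL(a || b); terms with a_i == 0 contribute nothing."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Illustrative grid and distributions (assumptions, not from the source).
xs = [i / 10.0 for i in range(-60, 61)]
p = normalize([0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5) for x in xs])
q_broad = normalize([normal_pdf(x, 0.0, 2.06) for x in xs])   # spreads over both modes
q_narrow = normalize([normal_pdf(x, 2.0, 0.5) for x in xs])   # sits on one mode

# Forward KL(p || q) is the quantity maximum likelihood minimizes.
forward_broad, forward_narrow = kl(p, q_broad), kl(p, q_narrow)
# Reverse KL(q || p) swaps the roles of the two distributions.
reverse_broad, reverse_narrow = kl(q_broad, p), kl(q_narrow, p)

print("forward KL: broad", forward_broad, "narrow", forward_narrow)
print("reverse KL: broad", reverse_broad, "narrow", reverse_narrow)
```

In this toy setup forward KL scores the broad fit better, because it heavily penalizes any region where `p` has mass and `q` does not, while reverse KL scores the single-mode fit better.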

175

27,686

tokenbender · Oct 2, 2025 · 9:24 AM UTC

tokenbender

@tokenbender

2 Oct 2025

i love thinky and tinky but hear me out. the "everyone can post-train now" meme is good until your 8B LoRA fails on your promised niche eval against dirt-cheap gemini flash or grok fast. why do i think tinker is more of a siren than a savior? infra is a beast that awakens every once in a while and demands blood sacrifice. but the "creative part" as karpathy frames to make it sound friendly? the creative part is more of a sadistic odious machine. i feel like i am diving in a dumpster for trinkets every time i am starting off on a new task. do not get me started on reward and env design. handcrafting was put in the dictionary after watching RL-2025-engineers trying to make things work. though, given thinky's reputation, if this is not just fine-tune as a service, they cut down infra cost by an OOM and i ONLY have to worry about data/env, then this is something that changes the vertical models industry altogether. but is it?

Andrej Karpathy

@karpathy

1 Oct 2025

Tinker is cool. If you're a researcher/developer, tinker dramatically simplifies LLM post-training. You retain 90% of algorithmic creative control (usually related to data, loss function, the algorithm) while tinker handles the hard parts that you usually want to touch much less often (infra, forward/backward of the LLM itself, distributed training), meaning you can do these at well below <<10% of typical complexity involved.
Compared to the more common and existing paradigm of "upload your data, we'll post-train your LLM", this is imo a more clever place to "slice up" the complexity of post-training, both delegating the heavy lifting, but also keeping majority of the data/algorithmic creative control.
I think the community still has to discover how and when finetuning makes sense compared to the (often strong) baseline of prompting a giant model. The early indications I've seen is that finetuning isn't so much about "stylizing" an LLM, instead, it's a lot more about narrowing the scope, and especially when you have a lot of training examples. An extreme example of scope narrowing being that of categorical classifiers, e.g. spam filters, content filters, etc. but it should be broader than that. Instead of building giant few-shot prompts for a big LLM, it might work a lot better (and faster!) to finetune a smaller LLM specifically for your narrow task.
Increasingly, production applications of LLMs are larger pipelines where a bunch of LLMs collaborate in DAGs and flows. Some of these components might work well as prompts. But a lot of it will probably work a lot better as a finetune. Tinker makes the latter trivial and should allow for an easy experimentation of what works best at any stage.

175

33,781

tokenbender · Feb 5, 2025 · 4:18 PM UTC

tokenbender

@tokenbender

5 Feb 2025

how many times I've shit on Krutrim in comments? apparently not enough. someday (maybe already) they're going to do a breach of licence thinking nobody will catch them too. krutrim doesn't mean artificial, it means fake.

Harveen Singh Chadha

@HarveenChadha

5 Feb 2025

On the left we have a billion dollar company with restrictive license, On the right we have a open source non profit company with MIT license. They blatantly copied; they didn't even have the courtesy to clone the repo and make changes.

166

11,806

tokenbender · Aug 3, 2025 · 6:05 PM UTC

tokenbender

@tokenbender

3 Aug 2025

> I don’t think transformers success is as much to do with attention as it is to do with discretization (tokenization) lmao

bubble boi

@bubbleboi

3 Aug 2025

It is wild to me how little Deep Learning researchers know about basic statistical theory. Everyone acts like all to all attention is a free lunch while basic stats has shown many better ways to capture long range dependencies instead of comparing every token to each other. Ironically, I don’t think transformers success is as much to do with attention as it is to do with discretization (tokenization) which HMM used to great effect and makes modeling long range dependencies much more tractable. Dense attention is compute hungry ESPECIALLY under autoregression. Just using basic intuition a model that first zooms out, then drills down mirroring natural dependency hierarchies feels mooore right... Maybe it’s time we actually think first before wasting a quadrillion flops.

171

13,800

tokenbender · Oct 13, 2025 · 4:43 PM UTC

tokenbender

@tokenbender

13 Oct 2025

> It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit) Let's go guys. The stage is set. Time to dance.

Andrej Karpathy

@karpathy

13 Oct 2025

Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script and in as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI. It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Efficient inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.
Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems, answer simple questions. About ~12 hours surpasses GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc.
My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved. Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.

175

14,940

tokenbender · Sep 12, 2025 · 8:44 PM UTC

tokenbender

@tokenbender

12 Sep 2025

oh wow, this isn't going to be another namespace collision like k2 think at all.

𝚐𝔪𝟾𝚡𝚡𝟾

@gm8xx8

12 Sep 2025

SimpleVLA-RL (Tsinghua × Shanghai AI Lab under PRIME-RL): first clean port of R1-style RL into VLAs.
Recipe: tokenized action head on OpenVLA-OFT plus veRL, outcome-only rewards, mixed-outcome dynamic sampling, Clip-Higher [0.8, 1.28], rollout T=1.6, GRPO without KL ref, parallel multi-environment rendering.
Benchmarks
- LIBERO avg 91.0 to 99.1; Long 86.5 to 98.5 (beats π₀ 85.2, UniVLA)
- RoboTwin-1.0 39.8 to 70.4
- RoboTwin-2.0 (12 tasks, short to extra-long) 38.3 to 68.8 (π₀ 49.2, RDT 33.3)
- Low-data: one-traj SFT 48.9 → 96.9; Long 17.3 → 91.7 (+74.4 pts, +430%)
- Sim-to-Real (4 tasks) 17.5 → 38.5; above RDT 23.5
Findings
- Generalization: RL strengthens spatial, object, and goal OOD while SFT overfits and often catastrophically forgets
- Emergence: “Pushcut” where the policy learns push-based shortcuts vs grasp-move-place
- Limits: needs non-zero base skill (0-traj SFT to RL fails); benefits scale with stronger priors (100 vs 1000 traj SFT)
- Infra: 8×A800-80G, LR 5e-6, batch 64, 8 samples per prompt, action chunks (LIBERO 8, RT 25)
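
Three pieces of the recipe above can be sketched in a few lines: outcome-only rewards turned into group-relative advantages, mixed-outcome dynamic sampling, and the asymmetric Clip-Higher range [0.8, 1.28]. This is a minimal sketch under assumptions, not the paper's implementation. The function names are hypothetical, and the mean/std normalization is the commonly described GRPO form, which the tweet does not spell out.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-relative advantages for one prompt's rollouts.

    Returns None when every rollout has the same outcome: such a group
    carries no learning signal, so mixed-outcome dynamic sampling
    (as assumed here) drops it.
    """
    spread = pstdev(rewards)
    if spread == 0.0:
        return None
    center = mean(rewards)
    return [(r - center) / spread for r in rewards]

def clipped_term(ratio, advantage, low=0.8, high=1.28):
    """PPO-style clipped surrogate for one sample with an asymmetric range.

    `ratio` is new_policy_prob / old_policy_prob. No KL penalty against a
    reference model is added, matching "GRPO without KL ref".
    """
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)

# Eight rollouts per prompt with binary success rewards (outcome-only).
mixed_group = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
solved_group = [1.0] * 8

print(group_advantages(mixed_group))
print(group_advantages(solved_group))   # dropped: all outcomes identical
print(clipped_term(1.25, 1.0))          # inside the range, left unclipped
print(clipped_term(1.5, 1.0))           # capped at the upper bound
```

The wider upper bound lets the probability of a rewarded action grow further in one update than a symmetric range around 1 would allow, while the lower bound still limits how fast a penalized action is pushed down.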

169

17,557

tokenbender · Oct 5, 2025 · 9:55 AM UTC

tokenbender

@tokenbender

5 Oct 2025

this is what sholto meant by "the current system is being held up by patches and duct tapes" in his recent podcast.

Zach Mueller

@TheZachMueller

4 Oct 2025

Me thinks I wasn't supposed to see this

167

17,796

tokenbender · Oct 22, 2025 · 5:00 AM UTC

tokenbender

@tokenbender

22 Oct 2025

Aravind Srinivas

@AravSrinivas

22 Oct 2025

Just woke up. Did I miss anything?

162

9,953

tokenbender · Sep 11, 2024 · 7:42 AM UTC

tokenbender

@tokenbender

11 Sep 2024

Pixtral 12B is probably going to repeat Mistral v0.1 history

Mistral AI

@MistralAI

11 Sep 2024

magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%https://nitter.app/t.co/2UepcMHjvL%3A1337%2Fannounce&tr=http%3A%2F%https://nitter.app/t.co/NsTRgy7h8S%3A80%2Fannounce

156

18,592

tokenbender · Jan 31, 2025 · 8:11 PM UTC

tokenbender

@tokenbender

31 Jan 2025

it's worse guys, o3 mini knowledge cut off is oct '23 and they ask why devs prefer sonnet more?

tokenbender

@tokenbender

31 Jan 2025

o3 mini got released and nobody's saying anything about its cut off date. am i still going to suffer with Dec '23 training cut off? no, i want to use uv, polars and all the other libraries that got released until aug '24. your RL iteration is now 3 months, update datasets.

154

23,587

tokenbender · Aug 7, 2025 · 9:00 PM UTC

tokenbender

@tokenbender

7 Aug 2025

Replying to @cursor_ai

woah this is literally claude code lmao

161

7,847