samsja · Nov 27, 2025 · 4:32 AM UTC

samsja

@samsja19

27 Nov 2025

INTELLECT-3 is our first model that I can use daily. It's built using our open source stack by scaling RL of an MoE over 512 H200s and pushing the SOTA at its size. Incredibly proud to lead such a team of talented, dedicated and hard-working individuals collaborating to push the open frontier. Open source superintelligence is coming, and the pretrains are also cooking.

Prime Intellect

@PrimeIntellect

27 Nov 2025

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack
Achieving state-of-the-art performance for its size across math, code and reasoning
Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more

438

51,334

samsja · Feb 22, 2024 · 7:57 PM UTC

samsja

@samsja19

22 Feb 2024

I just finished building my deep learning rig with three 3090s. Almost all is second hand, reasonably cheap to build. Finally will be able to finetune and infer large models, maybe develop some MoE stuff.

108

1,570

194,060

samsja · Jan 26, 2025 · 11:45 PM UTC

samsja

@samsja19

26 Jan 2025

sorry to all the debugger fanboys, but I like putting 1/0 in code to know if a function was called

1,520

107,190
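The `1/0` trick from the tweet above can be sketched in a few lines: a deliberate division by zero acts as a tripwire, so the program crashes with a traceback the moment the suspect code path runs. The function names here (`pick_optimizer`, `was_called`) are hypothetical and only serve the illustration.

```python
def pick_optimizer(name: str) -> str:
    """Toy function with a tripwire on the branch we are curious about."""
    if name == "muon":
        1 / 0  # tripwire: raises ZeroDivisionError as soon as this branch executes
        return "muon"
    return "adamw"


def was_called(fn, *args) -> bool:
    """Return True if the tripwire fired, i.e. the guarded branch was reached."""
    try:
        fn(*args)
    except ZeroDivisionError:
        return True
    return False


reached_default = was_called(pick_optimizer, "adamw")  # tripwire not hit
reached_muon = was_called(pick_optimizer, "muon")      # tripwire hit
```

In practice one would usually skip the wrapper and simply read the traceback, which also shows who called the function.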

samsja · Feb 27, 2025 · 7:25 PM UTC

samsja

@samsja19

27 Feb 2025

deepseek r1 release: open source o1
grok 3 release: beats every benchmark
gpt 4.5 release: can hold my hand when I am scared

1,040

125,595

samsja · Oct 31, 2025 · 4:58 AM UTC

samsja

@samsja19

31 Oct 2025

Rosinality @rosinality

31 Oct 2025

FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!

816

89,751
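One intuition for the precision claim in the quoted tweet: FP16 stores 10 mantissa bits while BF16 stores 7, so FP16 rounds values near 1 more finely (at the cost of a much smaller exponent range). The sketch below emulates both roundings in pure Python; `to_fp16` and `to_bf16` are hypothetical helper names, and whether mantissa width alone explains the RL result is an assumption, not something the tweet establishes.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (10 mantissa bits) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (7 mantissa bits) by keeping the top 16 bits of float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round-to-nearest-even on the 16 bits that get dropped
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits = (bits >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


x = 1.001  # e.g. a probability ratio close to 1
fp16_error = abs(to_fp16(x) - x)
bf16_error = abs(to_bf16(x) - x)
```

For values of this size the FP16 rounding error is the smaller of the two; for very large or very small magnitudes BF16's wider exponent range wins instead.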

samsja · Aug 18, 2025 · 9:43 PM UTC

samsja

@samsja19

18 Aug 2025

I am hiring research engineers at @PrimeIntellect. We are building an open source AGI lab and are looking for raw talent. We don't care about your previous job title. Everybody in the research team is full stack: we build infra and also look at data. If you have a sweet spot for systems, reinforcement learning, data or scaling laws, you will be served a ton of challenges to solve.

790

86,315

samsja · Oct 20, 2025 · 12:36 PM UTC

samsja

@samsja19

20 Oct 2025

karpathy confirming his GOAT status with the most balanced and realistic point of view. Not sure why everybody is calling the peak of the bubble when he literally said that we probably didn't overbuild and that Claude Code / Codex were not even here 1 year ago. He is just overreacting to all of you calling AGI too early and saying software engineering is dead. Also, when he said RL sucks he was just saying that we will have better algos in one year, which is obviously true. I wish they had done a 6h interview.

577

153,657

samsja · Oct 5, 2025 · 5:00 AM UTC

samsja

@samsja19

5 Oct 2025

The "aha" moment when sonnet realized it can abuse try/catch and pass all the unit tests during RL

finbarr

@finbarrtimbers

4 Oct 2025

my goal in life is to join Anthropic, delete all try/except clauses from Claude’s training data, and then quit.

503

35,715
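The joke above points at a real failure mode: if a test only checks that code does not crash, a blanket `try/except` is enough to pass it. A minimal sketch, with hypothetical function and test names, and no claim about how any particular model was actually trained:

```python
def parse_ratio_honest(text: str) -> float:
    """Parse 'a/b' and divide; lets errors surface."""
    num, den = text.split("/")
    return float(num) / float(den)


def parse_ratio_hacked(text: str) -> float:
    """Same logic, but every failure is swallowed and replaced by a dummy value."""
    try:
        num, den = text.split("/")
        return float(num) / float(den)
    except Exception:
        return 0.0  # hides the bug instead of handling it


def weak_test(fn) -> bool:
    """Weak check: passes as long as the call returns a float without crashing."""
    try:
        return isinstance(fn("1/0"), float)
    except Exception:
        return False


def strict_test(fn) -> bool:
    """Stricter check: dividing by zero must be reported, not silently mapped to a number."""
    try:
        fn("1/0")
    except ZeroDivisionError:
        return True
    return False


weak_results = (weak_test(parse_ratio_honest), weak_test(parse_ratio_hacked))
strict_results = (strict_test(parse_ratio_honest), strict_test(parse_ratio_hacked))
```

The weak test rewards the hacked version and the strict test rewards the honest one, which is why the wording of the check matters when it doubles as a reward signal.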

samsja · Jun 28, 2025 · 4:57 AM UTC

samsja

@samsja19

28 Jun 2025

full stack research engineer: can do pretraining, inference and RL

483

66,933

samsja · Oct 16, 2025 · 5:03 PM UTC

samsja

@samsja19

16 Oct 2025

(stolen from the torchax blog post github.com/google/torchax/bl…)

vLLM

@vllm_project

16 Oct 2025

Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility.
🚀 What's New?
- JAX + PyTorch: Run PyTorch models on TPUs with no code changes, now with native JAX support.
- Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype.
- Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs.
- SPMD Native: We've shifted to Single Program, Multi-Data (SPMD) as the default, a compiler-centric model native to TPUs for optimal execution.
Dive deep into the new architecture and see the performance benchmarks in our latest blog post! blog.vllm.ai/2025/10/16/vllm…
#vLLM #TPU #JAX #PyTorch #AI #OpenSource

489

44,143

samsja · Jan 6, 2025 · 5:22 PM UTC

samsja

@samsja19

6 Jan 2025

Turns out GPUs and transformers can do other stuff than scaling LLMs. We are releasing an open-source 7B foundation model for pandemic prevention, trained on novel metagenomic data. I am personally very excited about open source AI boosting science 😀

Prime Intellect

@PrimeIntellect

6 Jan 2025

Releasing METAGENE-1: In collaboration with researchers from USC, we're open-sourcing a state-of-the-art 7B parameter Metagenomic Foundation Model. Enabling planetary-scale pathogen detection and reducing the risk of pandemics in the age of exponential biology.

472

40,962

samsja · Sep 12, 2025 · 7:27 PM UTC

samsja

@samsja19

12 Sep 2025

Anybody who has actually trained a model at large scale will tell you how painful and stressful it is to be on watch 24/7 for infra crashes, loss spikes and expert routing collapse. Not convinced by the analogy haha

Anjney Midha

@AnjneyMidha

12 Sep 2025

Yes sex is great but have you ever had a training run on 10K+ GB200s converge successfully If so could you pls dm me thx

464

45,101

samsja · Oct 24, 2025 · 4:11 PM UTC

samsja

@samsja19

24 Oct 2025

I wish we could just teach models stuff instead of interacting with a static entity

Andrej Karpathy

@karpathy

24 Oct 2025

Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here: github.com/karpathy/nanochat…
This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview:
- You have to ensure diversity in user prompts/queries
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the problem in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to.
This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.
If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)

417

89,148

samsja · Oct 22, 2025 · 10:43 PM UTC

samsja

@samsja19

22 Oct 2025

Working on LLM RL is one of the most intellectually satisfying things I have ever done, both from a systems and an ML perspective

397

61,571

samsja · Nov 29, 2024 · 9:20 PM UTC

samsja

@samsja19

29 Nov 2024

Intellect 1 is out. It's a 10B model trained across 3 continents using 100+ H100s, with 30 individual compute contributors. The evals are good (for 1T tokens), and the model is live. I can't stress how important this release is for open-source AI. Decentralized training is the only path toward sovereign open-source foundation models. This release proves that it's not just a fairy tale - it's working, and it's just the beginning.

Prime Intellect

@PrimeIntellect

29 Nov 2024

Releasing INTELLECT-1: We’re open-sourcing the first decentralized trained 10B model:
- INTELLECT-1 base model & intermediate checkpoints
- Pre-training dataset
- Post-trained instruct models by @arcee_ai
- PRIME training framework
- Technical paper with all details

380

63,288

samsja · Aug 27, 2025 · 7:57 PM UTC

samsja

@samsja19

27 Aug 2025

The next generation of 10B+ valuation product startups will be built by scaling training on in-house RL environments.
We live in an abundance of capabilities and yet we only have two major AI products, ChatGPT and coding agents, and it deeply frustrates me.
The current supply chain of artificial intelligence is structurally broken: one hardware vendor and a couple of giant AI players own all the intelligence refineries, while developers are left with API access and barely any control.
It is as if the internet era had been built with Intel selling CPUs to 4 giant clouds owning the whole infrastructure and software stack, with secrets and NDAs all over the place, and giving WordPress as a development kit to startups.
Why didn't Perplexity invent deep research? They just couldn't, because it is trained with RL.
My prediction for the next years: RL will become the most powerful toolkit for startups building AI products. We will see hundreds of success stories like Cursor and Lovable. End users will benefit the most from it.
Big labs will evolve towards product companies: OAI will focus on the consumer market, Anthropic on coding agents, DeepMind will integrate AI into all Google businesses, Meta and xAI will fight over social media.
We will see the emergence of an ecosystem of AI infrastructure startups selling compute, training foundation models, curating data, building RL environments, and offering cheap inference and training, powered by open science and open source software.
@PrimeIntellect is pioneering this ecosystem and the vision of open source AGI; the RL Environments Hub is one of the first key pieces.

Prime Intellect

@PrimeIntellect

27 Aug 2025

Introducing the Environments Hub
RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down
We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI

354

46,948

samsja · Oct 29, 2025 · 11:27 PM UTC

samsja

@samsja19

29 Oct 2025

Replying to @qtnx_

And it barely changes performance. I am personally running all my 3090s at 285 W

335

47,249

samsja · Jun 6, 2025 · 9:04 PM UTC

samsja

@samsja19

6 Jun 2025

at prime @jackminong invented an algorithm that can detect if a model is served quantized, with a slightly different system prompt, or as a tuned model. It's the backbone of our trustless RL training run, but it can also be used to detect if inference providers are cheating

xjdr

@_xjdr

6 Jun 2025

you should legally be required to disclose what quantization level you are serving your current model at like it was a nutrition label. you should also be banned from dynamically adjusting quantization based on demand without notification. (you know who you are ...)

334

31,933

samsja · Oct 11, 2024 · 7:08 PM UTC

samsja

@samsja19

11 Oct 2024

We started our decentralized 10B training. We only need one minute of communication for every hour of training. Locked in for 3 weeks, long nights of no sleep, but it's finally live

Prime Intellect

@PrimeIntellect

11 Oct 2024

Announcing INTELLECT-1: the first-ever decentralized training of a 10B model
Scaling decentralized training 10x beyond prior efforts.
Anyone can join us to build open-source AGI 🦋

311

37,087

samsja · Jul 11, 2024 · 5:21 PM UTC

samsja

@samsja19

11 Jul 2024

Very excited to present our work on OpenDiLoCo. We trained a 1B model across 3 countries with a bandwidth of less than 100mb/s (10,000x slower than InfiniBand) at 90-95% compute utilization, with hybrid code using torch FSDP and hivemind. We want to break away from closed source models being trained on giant clusters and move to open source models co-trained across multiple smaller datacenters. OpenDiLoCo is our first step towards this goal. We still have a lot of work to do, from engineering smarter parallelization techniques to researching new algorithms that require less communication. If this is something that excites you, we are hiring founding researchers. Shoutout to my colleagues @johannes_hage and @jackminong, to @Ar_Douillard for the original DiLoCo work and @m_ryabinin for his help and work on hivemind

Prime Intellect

@PrimeIntellect

11 Jul 2024

Introducing OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training. primeintellect.ai/blog/opend…

313

40,383
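The DiLoCo idea behind the announcement above can be sketched on a toy problem: each worker trains locally for many steps without talking to anyone, then the workers exchange only the change in their weights (the "pseudo-gradient"), which an outer optimizer applies to the shared model. This is a minimal illustration under loud assumptions: scalar weights, a quadratic loss per worker, plain SGD inside (the DiLoCo paper uses AdamW) and Nesterov-style momentum outside with illustrative hyperparameters. It is not the OpenDiLoCo code.

```python
def inner_steps(theta: float, target: float, lr: float = 0.1, steps: int = 10) -> float:
    """Local training on one worker: SGD on the loss 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        theta -= lr * (theta - target)  # gradient of the quadratic loss
    return theta


def diloco(targets, rounds: int = 50, outer_lr: float = 0.7, momentum: float = 0.9) -> float:
    """Toy DiLoCo loop: many local steps, one averaged pseudo-gradient per round."""
    theta, buf = 0.0, 0.0
    for _ in range(rounds):
        # every worker starts from the shared weights and trains with no communication
        local = [inner_steps(theta, t) for t in targets]
        # the only communication of the round: average of (old weights - new weights)
        pseudo_grad = sum(theta - w for w in local) / len(local)
        # outer update with Nesterov-style momentum on the pseudo-gradient
        buf = momentum * buf + pseudo_grad
        theta -= outer_lr * (pseudo_grad + momentum * buf)
    return theta


# two workers whose data pull towards 1.0 and 3.0; the shared optimum is 2.0
theta = diloco([1.0, 3.0])
```

With 10 inner steps per round, the workers here communicate once per 10 optimizer steps instead of after every step, which is the lever the method uses to tolerate slow links.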

samsja · Oct 2, 2025 · 5:42 AM UTC

samsja

@samsja19

2 Oct 2025

Prime-rl now has extensive support for MoE, both for RL and SFT; we have been training 100B+ models with it.
We have support for:
* Qwen3 a3-30b
* GLM series and Moonlight
* adding gpt oss series as we speak
We ended up rewriting most of the modelling code to make it work with torch compile while still being compatible with the Hugging Face ecosystem

275

23,985

samsja · Aug 24, 2024 · 4:53 PM UTC

samsja

@samsja19

24 Aug 2024

OpenDiLoCo update: I think that we hit the sweet spot with our last experiments. We managed to match the loss of the baseline with 200x less communication. The key was to trade the number of inner steps against more quantization of the pseudo-gradient. We are preparing a distributed 7B run all over the world. The goal is to prove that we can co-train the next generation of open source models.
* What should we train on?
* Who wants to join the training?

Prime Intellect

@PrimeIntellect

11 Jul 2024

254

56,020
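Quantizing the pseudo-gradient, as mentioned in the update above, means sending each weight delta in fewer bits. The tweet does not say which scheme or bit width was used, so the sketch below assumes simple symmetric int8 quantization with one scale per tensor; `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus a single float scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for an all-zero tensor
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats on the receiving side."""
    return [c * scale for c in codes]


pseudo_grad = [0.5, -0.25, 0.01, 0.0]        # toy weight deltas after the inner steps
codes, scale = quantize_int8(pseudo_grad)     # 1 byte per value instead of 4 for fp32
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(pseudo_grad, restored))
```

Each value is off by at most half a quantization step, and the payload shrinks by roughly 4x relative to fp32 before any change to how often the workers synchronize.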

samsja · Feb 28, 2025 · 7:29 PM UTC

samsja

@samsja19

28 Feb 2025

So proud of all we've shipped in less than a year:
- 4 research papers covering decentralized training, trustless compute, and foundation models for pandemic detection
- 2 large-scale collaborative distributed runs (Intellect-1 and Synthetic-1)
- 2 reasoning datasets
- Dozens of open-source codebases (we keep no secrets: all our work is out there)
- Infrastructure for our peer-to-peer network of compute
- A great platform that people love to use with the market's cheapest GPU prices
We just secured solid investment from @foundersfund and amazing angels like @karpathy, @tri_dao, @ClementDelangue (and others) that will enable us to pursue our vision of open-source AGI. We're set to release 10x more open research this year and accelerate like never before.

Prime Intellect

@PrimeIntellect

28 Feb 2025

Announcing our $15m raise — led by @foundersfund. To build our peer to peer compute and intelligence protocol. With participation from @MenloVentures and angels like @karpathy @ClementDelangue @tri_dao @dylan522p @balajis @EMostaque and many others.

253

26,693

samsja · Oct 25, 2025 · 5:04 AM UTC

samsja

@samsja19

25 Oct 2025

246

13,316

samsja · Mar 26, 2025 · 9:04 PM UTC

samsja

@samsja19

26 Mar 2025

Absolute cultural victory of OpenAI vs Google today. Good reminder that product >>> benchmark

241

9,759

samsja · Nov 3, 2025 · 5:35 AM UTC

samsja

@samsja19

3 Nov 2025

It's time to teach models to maintain codebases and not just to write code. We need more advanced RL environments

236

14,821

samsja · Jun 28, 2025 · 11:51 PM UTC

samsja

@samsja19

28 Jun 2025

We have more people without a degree than people with a PhD in the research team at prime

Noam Brown

@polynoamial

28 Jun 2025

You don’t need a PhD to be a great AI researcher. Even @OpenAI’s Chief Research Officer doesn’t have a PhD.

233

31,003

samsja · Jun 4, 2025 · 4:07 PM UTC

samsja

@samsja19

4 Jun 2025

Async RL is faster than its synchronous counterpart. This might be the first time in ML history that an algorithm is naturally async at scale.
We realized two things at prime 6 months ago:
* RL will be as compute intensive as pretraining, pushing frontier capability
* for the first time in ML history, decentralized RL has structural advantages over its centralized counterpart. This was never true for pretraining

Tiezhen WANG

@Xianbao_QIAN

4 Jun 2025

RL training is too slow? AReaL by @AntResearch_ introduces an asynchronous approach that decouples generation from training to eliminate blocking. Combined with system-level optimizations, this method achieves up to 2.57× speedup. Code open-sourced @jxwuyi

220

26,805
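The decoupling of generation from training described in the two tweets above can be sketched with a queue: rollout workers keep generating with whatever policy version they last saw, while the trainer consumes samples and bumps the version without waiting for them. This is a toy simulation with assumed names and an assumed staleness rule (`max_lag`); it is not the AReaL or prime-rl implementation.

```python
import asyncio


async def rollout_worker(queue: asyncio.Queue, state: dict) -> None:
    """Generate samples forever, tagging each with the policy version used."""
    sample_id = 0
    while True:
        version = state["policy_version"]  # weights snapshot used for this rollout
        await asyncio.sleep(0)             # stand-in for slow autoregressive generation
        await queue.put({"sample": sample_id, "version": version})
        sample_id += 1


async def trainer(queue, state, n_steps: int, batch_size: int, max_lag: int):
    """Consume rollouts as they arrive; drop those that are too far off-policy."""
    used, dropped = 0, 0
    for _ in range(n_steps):
        batch = []
        while len(batch) < batch_size:
            item = await queue.get()
            if state["policy_version"] - item["version"] > max_lag:
                dropped += 1               # too stale, skip it
                continue
            batch.append(item)
        used += len(batch)
        state["policy_version"] += 1       # stand-in for an optimizer step + weight broadcast
    return used, dropped


async def main():
    state = {"policy_version": 0}
    queue = asyncio.Queue(maxsize=8)       # bounded queue gives backpressure on generators
    workers = [asyncio.create_task(rollout_worker(queue, state)) for _ in range(3)]
    used, dropped = await trainer(queue, state, n_steps=5, batch_size=4, max_lag=2)
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return used, dropped, state["policy_version"]


result = asyncio.run(main())
```

Neither side ever blocks on the other finishing a full batch, which is the property that lets generators live on different, loosely connected machines.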

samsja · Aug 18, 2025 · 9:39 PM UTC

samsja

@samsja19

18 Aug 2025

Took some time to reflect on this. Big impostor syndrome for me to even be on this list lol, but I guess that a couple of years of doing open source research is enough to make a name for yourself. Anyway, if you are an ambitious researcher or engineer thinking about working for a big lab, you should rather consider joining @PrimeIntellect to take ownership in open source AGI and accelerate your career

TBPN

@tbpn

13 Aug 2025

Replying to @tbpn

To generate the list, we spoke with researchers, scraped Google Scholar, and studied Dwarkesh to identify the visionaries who dedicate their lives to bringing us our cherished tokens. The ranking is determined by votes generously provided by anonymous participants during their time between training runs. We also provide additional context, such as: - Associated institutions - Current/previous company - General interests - Notable papers You can view the updated list at MetisList dot com

210

31,227

samsja · Jul 19, 2025 · 3:25 PM UTC

samsja

@samsja19

19 Jul 2025

I realized at our Berlin event that there are a lot of talented and ambitious young ppl in Europe. Just (almost) no inspiring companies to build the future, nor VCs that have the balls to give them a chance. No wonder everybody wants to come to SF

Lazarz

@Laz4rz

18 Jul 2025

Berlin I’m in you

206

48,371

samsja · Feb 23, 2025 · 3:45 AM UTC

samsja

@samsja19

23 Feb 2025

I went over the new muon paper. it's great. I want to see more of this type of work :) A couple of comments/observations: 1. Base muon won't work at a larger scale, you need to use weight decay (it has been recently added to github.com/KellerJordan/Muon…)

Muon/muon.py at master · KellerJordan/Muon

Muon is an optimizer for hidden layers in neural networks - KellerJordan/Muon

github.com

Kimi.ai

@Kimi_Moonshot

22 Feb 2025

🚀 Introducing our new tech report: Muon is Scalable for LLM Training
We found that Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
• Seamless transition from AdamW to Muon without hyper-parameter tuning
• Memory & communication efficient implementation of distributed Muon optimizer.
🎯 Based on these improvements, we introduce Moonlight: Our 3B/16B MoE model trained with Muon on 5.7T tokens, advancing the Pareto frontier with better performance at fewer FLOPs!
🎁 Open-sourcing everything:
📚 Code & implementation: github.com/MoonshotAI/Moonli…
🤗 Full model series (pretrained, instruction-tuned & intermediate checkpoints): huggingface.co/moonshotai
📜 Paper: github.com/MoonshotAI/Moonli…
#AI #LLM #OpenSource #MoonshotAI

204

30,943

samsja · Jul 16, 2025 · 4:06 PM UTC

samsja

@samsja19

16 Jul 2025

We should be more worried about thinky hiring all the pytorch ppl than zuck poaching from open ai

200

14,090

samsja · Apr 16, 2025 · 12:25 AM UTC

samsja

@samsja19

16 Apr 2025

Also, we released a new GRPO RL training codebase: github.com/PrimeIntellect-ai… It's only one of the components that made intellect-2 decentralized, but it implements async GRPO with FSDP2 and vLLM, worth checking out :)

GitHub - PrimeIntellect-ai/prime-rl: Agentic RL Training at Scale

Agentic RL Training at Scale. Contribute to PrimeIntellect-ai/prime-rl development by creating an account on GitHub.

github.com

samsja

@samsja19

15 Apr 2025

I cannot wait to release the Intellect-2 technical report and share all we learned about scaling test compute in a distributed / decentralized fashion. Extremely proud of the intense teamwork that went into preparing this fully permissionless training run with async reinforcement learning. Open source for the win

200

18,566
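For readers new to GRPO, its central step is small: sample a group of completions for the same prompt, score them, and use each reward's deviation from the group mean (scaled by the group standard deviation) as the advantage, so no learned value model is needed. The sketch below shows only that step; the function name and `eps` value are assumptions, and this is not code from prime-rl.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# if every completion gets the same reward, there is no learning signal
flat = group_relative_advantages([1.0, 1.0])
```

Correct answers get positive advantages and incorrect ones negative, and a group where all answers score the same contributes nothing to the update.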

samsja · Nov 22, 2024 · 4:24 PM UTC

samsja

@samsja19

22 Nov 2024

We finished the training of intellect 1. We trained the 10b model over 1T tokens across 3 continents over 100 GPUs. We are preparing a full release for next Friday including the final pre-trained model, a fine-tuned version, the intermediate checkpoints, datasets, and a technical report

Prime Intellect

@PrimeIntellect

22 Nov 2024

We did it — the first decentralized training of a 10B model is complete! Trained across the US, Europe, and Asia 🌐 Post-training with @arcee_ai is underway, and a full open-source release is coming in ~1 week, including: base model, checkpoints, post-trained model and data.

195

17,449

samsja · Jun 30, 2025 · 2:00 AM UTC

samsja

@samsja19

30 Jun 2025

why are the dataloader and checkpointing always the hardest parts to write

198

49,795

samsja · Jan 29, 2025 · 8:14 PM UTC

samsja

@samsja19

29 Jan 2025

Intellect 1 was the first large-scale crowdsourced training run of LLM (10B+)

Yann LeCun

@ylecun

29 Jan 2025

I've been arguing for something like this for over a year: crowdsourced distributed fine-tuning.

184

24,639

samsja · Nov 6, 2025 · 12:39 AM UTC

samsja

@samsja19

6 Nov 2025

we are training our next model with continuous batching from the pipeline-rl paper, big throughput difference especially at large seq len here on prime-rl github.com/PrimeIntellect-ai…

Continuous batching by mikasenghaas · Pull Request #1229 · PrimeIntellect-ai/prime-rl

This PR adds AReal-type continuous batching to PRIME-RL. We define the Scheduler abstraction which is responsible for scheduling rollout and weight update requests. Asynchronously schedules group ...

github.com

Rishabh Agarwal

@agarwl_

6 Nov 2025

Don't sleep on PipelineRL -- this is one of the biggest jumps in compute efficiency of RL setups that we found in the ScaleRL paper (also validated by Magistral & others before)!
What's the problem PipelineRL solves? In RL for LLMs, we need to send weight updates from trainer to generator (to generate data from our latest policy being trained).
(Conventional PPO-off-policy) A naive approach would be to "start generators on a batch, wait for all sequences to complete, update the model weights for both trainers and generators, and repeat". Unfortunately, this approach leads to idle generators and low pipeline efficiency due to heterogeneous completion times.
(Pipeline-RL) Instead, we simply let the generators continue generating tokens without discarding or finishing ongoing generations in-flight whenever we need to do a weight update -- doing an "in-flight" weight update. As such our KV caches for these generations would be stale (as they would come from the LLM with earlier copy(ies) of the weights), but this is ok (see below).

192

19,013

samsja · Jul 7, 2025 · 8:07 PM UTC

samsja

@samsja19

7 Jul 2025

My 3 favorite distributed/decentralized training papers:
DiLoCo (obviously): local SGD for the LLM era, large batch size training. arxiv.org/abs/2311.08105 (from @Ar_Douillard)
DeDLOC: systems focus and the first decentralized run ever of an LLM (in 2021). arxiv.org/abs/2106.10207 (from @m_ryabinin)
PowerSGD: low-rank compression of gradients without compromising convergence. Efficient implementation in PyTorch; it was used to train the first DALL-E at OpenAI. arxiv.org/abs/1905.13727
These 3 papers cover 80% of the SOTA imo. Bonus points for swarm parallelism

DiLoCo: Distributed Low-Communication Training of Language Models

Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected...

arxiv.org

Zach Mueller

@TheZachMueller

7 Jul 2025

What are the must read papers on distributed training so far from 2025 (and 2024)? Want to brush up on my paper reading skills. I imagine OpenDiLoCo would be one

191

21,221

samsja · Nov 29, 2024 · 9:43 PM UTC

samsja

@samsja19

29 Nov 2024

as promised we open source all of our work

Vincent Weisser

@vincentweisser

29 Nov 2024

We are open-sourcing INTELLECT-1, the first decentralized trained 10B model. Including base model, checkpoints, post-trained model, data, technical report and our decentralized training framework.

177

19,648

samsja · Jul 11, 2025 · 2:58 PM UTC

samsja

@samsja19

11 Jul 2025

absolute @kellerjordan0 victory

clem 🤗

@ClementDelangue

11 Jul 2025

1T parameters, open-weights, just released on @huggingface!

187

10,303

samsja · Nov 1, 2025 · 6:06 AM UTC

samsja

@samsja19

1 Nov 2025

Everybody in SF yaps about IQ and EQ, but gut instinct and balls are so much more important

@agihippo

1 Nov 2025

ablations are for the weak. just yolo your runs. (ok, do some small amount of ablations, but don't over do it). instinct is everything in ML and AI.

182

21,775

samsja · Aug 3, 2025 · 5:20 PM UTC

samsja

@samsja19

3 Aug 2025

What is the best way to work on a branch and let an agent work on another branch at the same time?

180

41,287

samsja · Oct 28, 2025 · 4:15 AM UTC

samsja

@samsja19

28 Oct 2025

Hired to debug GPUs, ended up doing CPU profiling all day long, thanks python

Matej Sirovatka

@m_sirovatka

28 Oct 2025

Types in Python are a good idea until you open your profile trace and see this

183

14,662

samsja · Feb 24, 2025 · 5:19 AM UTC

samsja

@samsja19

24 Feb 2025

There are some hidden gems in the pytorch forum. An entire blog post on FSDP/ Cuda Caching allocator / cuda stream / debugging memory spike

175

13,512

samsja · Apr 20, 2025 · 4:01 PM UTC

samsja

@samsja19

20 Apr 2025

After more than 6 months in SF I finally shipped 2 out of 4 of my GPUs, and it feels so good to have them back

175

7,777

samsja · Oct 10, 2025 · 7:16 AM UTC

samsja

@samsja19

10 Oct 2025

This could be the GPT-1 moment for agentic pretraining and latent reasoning. @mike64_t's work has the potential to challenge the status quo on what AGI will look like

Prime Intellect

@PrimeIntellect

9 Oct 2025

Introducing @mike64_t's work on "Recurrence-Complete Frame-based Action Models" A paper on why long-horizon perception requires rethinking recurrence.

161

23,916

samsja · Mar 10, 2025 · 8:41 PM UTC

samsja

@samsja19

10 Mar 2025

I wish more people understood this. I am all for startups building on top of a foundation model and focusing on a good product rather than research. But the paradigm of API + prompt is just not powerful enough for building a truly innovative product. Nobody could have built deep research except OpenAI. Startups need access to the lower-level stack: base models, compute, and software to infer and tune / RL the models to perfectly fit the product's needs. It's not just about doing open source because it's cool or ethical. It is just the best way to create the next trillion dollar companies

kalomaze

@kalomaze

10 Mar 2025

Replying to @calebfahlgren @Dorialexander

sonnet + prompts is not enough to build a company unless you're the company that built sonnet

155

47,577

samsja · Jan 22, 2025 · 2:35 AM UTC

samsja

@samsja19

22 Jan 2025

The implication of the paradigm shift to RL is enormous for decentralized training, and few realize it.
normal training = 1 prefill-like forward, 1 backward
RL training = 10k autoregressive forwards, 1 backward
You need to communicate an order of magnitude less

Prime Intellect

@PrimeIntellect

22 Jan 2025

Replying to @PrimeIntellect

Decentralized Training in the Inference-Time Compute Paradigm Reinforcement learning has fundamentally different communication infrastructure requirements than pre-training, making globally distributed training the path forward. Read more in our blog: primeintellect.ai/blog/intel…

160

19,902

samsja · Sep 6, 2025 · 12:48 AM UTC

samsja

@samsja19

6 Sep 2025

the real grind starts when you cannot remember which day of the week it is

158

13,148

samsja · Feb 13, 2025 · 3:28 AM UTC

samsja

@samsja19

13 Feb 2025

Synthetic-1 is only the tip of the iceberg of what is coming. We are making progress on the three essential pillars of open-source decentralized AGI:
* Pretraining (intellect-1)
* synthetic data gen (synthetic-1)
* RL (soon)
Beyond the research, we are also building **infrastructure** (soon to be open source) to coordinate all globally distributed computing. Be ready for a lot of releases this year from @PrimeIntellect

Prime Intellect

@PrimeIntellect

12 Feb 2025

We did it — collaboratively generating the largest synthetic dataset of verified reasoning traces for math, coding and science using DeepSeek-R1. SFT fine-tune is underway and a full open-source release including dataset and post-trained models is coming in the next few days.

152

12,673

samsja · Jul 31, 2025 · 8:06 AM UTC

samsja

@samsja19

31 Jul 2025

torchtitan has built-in HSDP + DiLoCo support; it's probably the best place right now to start doing decentralized learning research. It also comes with support for many archs (llama3, llama4, deepseekv3...) as well as all possible parallelisms (6D?). PyTorch team cooked here

154

18,907

samsja · Jun 17, 2025 · 7:52 PM UTC

samsja

@samsja19

17 Jun 2025

normal day at prime

151

13,183

samsja · Feb 22, 2024 · 7:58 PM UTC

samsja

@samsja19

22 Feb 2024

I wrote a blog post about it for the more curious samsja.github.io/blogs/rig/p… one challenge was to find a cheap mobo + cpu that would support many gpus

138

8,944

samsja · Aug 13, 2025 · 10:36 PM UTC

samsja

@samsja19

13 Aug 2025

impressed by the execution of the @astral_sh team, taking over the whole Python ecosystem in a couple of months and already pushing a great product for enterprise. It's all just about execution

Charlie Marsh

@charliermarsh

13 Aug 2025

Today, we're announcing our first hosted infrastructure product: pyx, a Python-native package registry. We think of pyx as an optimized backend for uv: it’s a package registry, but it also solves problems that go beyond the scope of a traditional "package registry".

151

11,887

samsja · Aug 5, 2025 · 5:29 PM UTC

samsja

@samsja19

5 Aug 2025

respect to OpenAI for dropping a strong open source model! It's also the only competitive open source western model out there, at least for now 😉

OpenAI

@OpenAI

5 Aug 2025

We released two open-weight reasoning models—gpt-oss-120b and gpt-oss-20b—under an Apache 2.0 license. Developed with open-source community feedback, these models deliver meaningful advancements in both reasoning capabilities & safety. openai.com/index/introducing…

141

8,224

samsja · Nov 15, 2025 · 7:42 PM UTC

samsja

@samsja19

15 Nov 2025

146

14,942

samsja · Feb 28, 2025 · 7:49 PM UTC

samsja

@samsja19

28 Feb 2025

The only reason why we ship fast is because our team is extremely dedicated and cracked. We don't have ex-big labs/big tech people—instead, we have amazing young talent (for some, it's their first job).

Johannes Hagemann

@johannes_hage

28 Feb 2025

Replying to @johannes_hage

shout out to our entire team @samsja19 @jackminong @jannik_stra @burnpiro @leonardofed @manveerxyz @MatternJustus @matthewdif @jzheng1994 @mikasenghaas @apaz_cli @felix_red_panda @mike64_t @afurgs @Grad62304977 @madisenxtaylor @_mario_neo_

140

14,915

samsja · Mar 11, 2025 · 12:54 AM UTC

samsja

@samsja19

11 Mar 2025

Just an example of why accessing AI via API sucks for product. I am using Claude with a Rust codebase and it sucks hard. I could solve it by doing some RL on my favorite Rust libraries. If we lived in a world where open source models are a commodity, and infra and training software are accessible, I could train my SOTA Rust model in a couple of days on an on-demand InfiniBand cluster and serve it using serverless infra / cloud. In a week, I would have my product; if it is useful, lots of people will adopt it and give feedback, it will integrate with Cursor and others ... Today I simply cannot do anything about it. It's just a missed product opportunity for founders. It's not research, it's just product

samsja

@samsja19

10 Mar 2025

137

33,997

samsja · Mar 20, 2025 · 5:34 PM UTC

samsja

@samsja19

20 Mar 2025

Everybody telling me "Uhh prompt + icl + api is enough to build a good product look at the cursor, they are just using API and are successful, deep learning is useless for a startup" Now cursor is hiring a world-class researcher. Not saying products should not use API, but relying purely on them is not enough

Sasha Rush

@srush_nlp

20 Mar 2025

Some personal news: I recently joined Cursor. Cursor is a small, ambitious team, and they’ve created my favorite AI systems. We’re now building frontier RL models at scale in real-world coding environments. Excited for how good coding is going to be.

143

25,124

samsja · Apr 15, 2025 · 10:50 PM UTC

samsja

@samsja19

15 Apr 2025

Prime Intellect

@PrimeIntellect

15 Apr 2025

Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.

132

25,779

samsja · Oct 31, 2025 · 4:46 PM UTC

samsja

@samsja19

31 Oct 2025

fp16 really lowers the mismatch, let's see if there is an impact on convergence later on during training. The KL max is absolutely wild, this means no clipping at all

Zichen Liu

@zzlccc

31 Oct 2025

BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎

130

13,322

samsja · Sep 8, 2025 · 1:28 AM UTC

samsja

@samsja19

8 Sep 2025

I have been arguing for a long time now that RL training will be a key part of building products. I cannot emphasize enough how important it is for the broader startup ecosystem to open up the artificial intelligence supply chain

Guillermo Rauch

@rauchg

8 Sep 2025

We shipped an OSS 'vibe coding platform' (like @v0) built with @vercel AI SDK, Gateway and Sandbox. We worked with @openai to tune the GPT-5 agent loop. It can write/read files, run commands, install packages, autofix errors… Demo oneshotting a multiplayer Pong in Go ↓

131

23,210

samsja · Jan 26, 2025 · 11:44 PM UTC

samsja

@samsja19

26 Jan 2025

Replying to @giffmana

heard they train their mixture of expert on 2k 4090 with 48GB vram interconnected via Bluetooth 6

122

5,500

samsja · Aug 15, 2025 · 5:15 PM UTC

samsja

@samsja19

15 Aug 2025

python type hints are just there for my IDE to link me to the correct function / object definition the rest doesn’t work

126

27,870

samsja · Jul 19, 2025 · 3:13 PM UTC

samsja

@samsja19

19 Jul 2025

Open ai will be remembered as one of the most inspiring companies of all time

Noam Brown

@polynoamial

19 Jul 2025

Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

121

8,292

samsja · Jan 22, 2025 · 2:29 AM UTC

samsja

@samsja19

22 Jan 2025

Our first work on reasoning is out, we generated reasoning traces and validated them, significantly increasing downstream performance. This work was based on QwQ, R1 reasoning trace will be ever more powerful. Who is ready for decentralized code-r1-CoT-100M and math-r1-CoT-100M datasets?

Prime Intellect

@PrimeIntellect

22 Jan 2025

Today, we are releasing: - INTELLECT-MATH, a frontier 7B parameter model for math reasoning - The largest synthetic math dataset to date of 5M verified reasoning traces - An outlook on decentralized training in the inference-compute paradigm primeintellect.ai/blog/intel…

122

11,924

samsja · Oct 11, 2024 · 8:06 PM UTC

samsja

@samsja19

11 Oct 2024

How are we training a model across datacenters? We have more than 4 datacenters connected together. Yet we only communicate for 1 minute every 45 minutes for a 10B model.

Prime Intellect

@PrimeIntellect

11 Oct 2024

Announcing INTELLECT-1: the first-ever decentralized training of a 10B model Scaling decentralized training 10x beyond prior efforts. Anyone can join us to build open-source AGI 🦋

115

9,692

samsja · Oct 31, 2025 · 5:10 AM UTC

samsja

@samsja19

31 Oct 2025

repeat after me, open science and open source is all you need

Grad

@Grad62304977

31 Oct 2025

115

9,607

samsja · Oct 27, 2025 · 11:59 PM UTC

samsja

@samsja19

27 Oct 2025

I wish this existed when I started my ml journey. Open source has an incentive problem and we are hoping to create a broader ecosystem to fix it

Prime Intellect

@PrimeIntellect

27 Oct 2025

We're scaling our Open-Source Environments Program As part of this, we're committing hundreds of thousands of $ in bounties and looking for partners who want to join our mission to accelerate open superintelligence Join us in building the global hub for environments and evals

118

11,122

samsja · May 12, 2025 · 3:32 AM UTC

samsja

@samsja19

12 May 2025

Very proud of this release, a lot of hard work went into it, full details in the tech report

Prime Intellect

@PrimeIntellect

12 May 2025

Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: • Detailed Technical Report • INTELLECT-2 model checkpoint primeintellect.ai/blog/intel…

117

8,706

samsja · Oct 29, 2025 · 8:22 PM UTC

samsja

@samsja19

29 Oct 2025

Extremely excited about this release, custom models are entering the product layer. A vibrant LLM open source ecosystem means more steerability for startup founders to express their craziest ideas

Cursor

@cursor_ai

29 Oct 2025

Introducing Cursor 2.0. Our first coding model and the best way to code with agents.

119

15,322

samsja · Mar 15, 2025 · 1:56 AM UTC

samsja

@samsja19

15 Mar 2025

Hugging Face's @Thom_Wolf talking about DiLoCo and INTELLECT-1 at the @PrimeIntellect event! Also announcing the boom project, training a large model across multiple datacenters in collaboration with @PrimeIntellect!

118

25,424

samsja · Jun 18, 2025 · 8:41 PM UTC

samsja

@samsja19

18 Jun 2025

You are all missing so much in not following @tenderizzation, his pytorch meme game is on another level

tender

@tenderizzation

18 Jun 2025

the FP8 values in your model after 50 layers of quantize/dequantize operations

114

5,973

samsja · Aug 25, 2025 · 11:26 PM UTC

samsja

@samsja19

25 Aug 2025

Based on my estimation we should reach 1 million GitHub stars within the next 3 months

will brown

@willccbb

25 Aug 2025

uh oh

117

10,137

samsja · Jan 25, 2025 · 3:39 AM UTC

samsja

@samsja19

25 Jan 2025

Replying to @deedydas

This cluster was A100 pcie without nvswitch and without infiniband. It's very difficult (read impossible) to train a moe like v3 on it

114

18,938

samsja · Jun 29, 2025 · 2:49 AM UTC

samsja

@samsja19

29 Jun 2025

But also, what is just as important for a researcher:

* look at the data
* interpret eval results
* experiment hygiene

samsja

@samsja19

28 Jun 2025

full stack research engineer: can do pretraining, inference and RL

112

9,985

samsja · Jun 15, 2025 · 12:29 AM UTC

samsja

@samsja19

15 Jun 2025

Replying to @jxmnop

The AI research crowd is giving too much credit to the "science part" of AI. Yes, "Attention Is All You Need" is a great paper, but so is all the hardware work and system engineering around it

107

4,841

samsja · Apr 23, 2025 · 11:23 PM UTC

samsja

@samsja19

23 Apr 2025

In contrast with a lot of people yapping about AI research on X, @kalomaze is also shipping papers. There are very few people who can claim to have published and gotten an oral at ICLR at only 19 years old

kalomaze

@kalomaze

23 Apr 2025

i will not be there at ICLR either (that much back to back flying wouldve been too much for me lol) but do pay close attention and listen to what @menhguin has to say at the event for our paper :)

110

9,307

samsja · Mar 14, 2025 · 8:42 PM UTC

samsja

@samsja19

14 Mar 2025

Great paper! Scaling laws for DiLoCo were very much needed, really happy that DeepMind released them. One of the important results from this paper is that DiLoCo can work with a larger batch size than Adam. This means that it does not only reduce communication, it also allows scaling DDP to more compute! It's a result we have reproduced internally at Prime and used during INTELLECT-1 (we scaled to a 14M token batch size). Why does it work tho? My intuition is that DiLoCo should be seen as a model-merging-while-training technique rather than a replacement for DDP (averaging gradients). Let me explain: 1/N

Zachary Charles

@MatharyCharles

14 Mar 2025

We just put out a key step for making distributed training work at larger and larger models: Scaling Laws for DiLoCo TL;DR: We can do LLM training across datacenters in a way that scales incredibly well to larger and larger models!

104

17,634

samsja · Dec 11, 2024 · 5:57 AM UTC

samsja

@samsja19

11 Dec 2024

Replying to @isidentical

this is next level psychopathic behavior wtf

106

8,315

samsja · Oct 31, 2025 · 8:01 PM UTC

samsja

@samsja19

31 Oct 2025

please follow their work, they don't get the attention they deserve. Their lab at @SeaAIL has been pushing a lot of really good RL papers

Penghui Qi

@QPHutu

31 Oct 2025

🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…
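A rough illustration of why the two 16-bit formats behave differently: FP16 keeps 10 mantissa bits while BF16 keeps 7, so BF16 rounds the same value more coarsely. The sketch below emulates both roundings with the Python standard library only. The helper names `to_fp16` and `to_bf16` are hypothetical, the BF16 emulation assumes finite inputs and round-to-nearest-even, and none of this is taken from the paper's code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (10 mantissa bits) and back."""
    # struct's "e" format is IEEE 754 binary16.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (7 mantissa bits) and back. Assumes finite x."""
    # bfloat16 is the upper 16 bits of a float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the lower 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def rounding_error(x: float) -> tuple[float, float]:
    """Absolute rounding error of x in (fp16, bf16)."""
    return abs(to_fp16(x) - x), abs(to_bf16(x) - x)


if __name__ == "__main__":
    fp16_err, bf16_err = rounding_error(0.1)
    print(f"fp16 error: {fp16_err:.2e}, bf16 error: {bf16_err:.2e}")
```

The trade-off is range: FP16 overflows above 65504, while BF16 shares float32's exponent range.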

109

8,760

samsja · Nov 9, 2025 · 4:20 AM UTC

samsja

@samsja19

9 Nov 2025

The bear is back

kalomaze

@kalomaze

9 Nov 2025

RL LEARNING WITH LORA: A DIVERSE DEEP DIVE

105

10,391

samsja · Jun 15, 2025 · 4:48 AM UTC

samsja

@samsja19

15 Jun 2025

what would us big labs do if we discover that reasoning model are better when reasoning in chinese ?

107

9,669

samsja · Oct 25, 2025 · 9:30 PM UTC

samsja

@samsja19

25 Oct 2025

this is just wild, have we ever seen a market where the main suppliers are in direct competition with their biggest clients? Isn't it clear that there is an alternative ecosystem where models become a commodity that would benefit the application / product layer? Time for founders to take the open source super intelligence stack seriously

104

34,608

samsja · Jul 28, 2025 · 11:33 AM UTC

samsja

@samsja19

28 Jul 2025

yes and no, dataclasses are not good enough. Pydantic models / dataclasses are way more expressive and validate the input. There is a lib maintained by the @pydantic team that allows overriding values via the CLI and loading config from TOML, that's all you need github.com/pydantic/pydantic…

GitHub - pydantic/pydantic-settings: Settings management using pydantic

Settings management using pydantic. Contribute to pydantic/pydantic-settings development by creating an account on GitHub.

github.com

Arthur Douillard

@Ar_Douillard

27 Jul 2025

The obvious and unique possible answer is a single big dataclass and cmd line override of its values Right?

102

14,111

samsja · Apr 18, 2025 · 8:55 PM UTC

samsja

@samsja19

18 Apr 2025

Damn Google start to really care about local llm, are they going soon to win both on the oss side and in the api side ?

Sundar Pichai

@sundarpichai

18 Apr 2025

Just announced new versions of Gemma 3 – the most capable model to run just one H100 GPU – can now run on just one *desktop* GPU! Our Quantization-Aware Training (QAT) method drastically brings down memory use while maintaining high quality. Excited to make Gemma 3 even more accessible for more developers.

101

12,899

samsja · Oct 2, 2025 · 1:09 AM UTC

samsja

@samsja19

2 Oct 2025

Big respect to them for contributing back to torchtitan, a lot of labs talk about open source but few actually contribute

Rohan Pandey

@khoomeik

1 Oct 2025

periodic ❤️ open-source! for example, we’ve been collaborating with the @PyTorch team to build the highest MFU gpt-oss training implementation (includes thinky sinky flexattn) here’s a few SFT runs of gpt-oss-20b & 120b, where i get ~24% MFU for 20b and ~8% for 120b

101

14,497

samsja · Oct 30, 2025 · 3:51 PM UTC

samsja

@samsja19

30 Oct 2025

Our autonomous ai scientist @Grad62304977 just needs the right amount of prompting to write the clearest answers to some research problems on x comment

Grad

@Grad62304977

30 Oct 2025

Replying to @nrehiew_

Originally a while back i got some intuition with it from the query key value perspective which might help (theres also the gradient descent perspective which is good too). Scraped this from a chat with @stochasticchasm a year ago so might be a bit dodgy.

Imagine u want to store a database of knowledge in ur state. ur gonna do this by storing key value pairs. For example on a high level, a key could be a song name, and the value associated with it is the song lyrics. when u want to extract those lyrics, u want to be able to query ur database full of key value pairs with the song name and get back the song lyrics, so ur gonna query the database using that key.

im going to be assuming all keys stored in the database are orthogonal, as in a 768 dimension space that assumption holds (nearly orthogonal). With orthogonal meaning that if i query with a key, i get back that keys associated value and nothing else.

now why linear attention fails is because for example in a 768 dimension space, really all u can have is 768 orthogonal keys, so when ur sequence becomes longer, keys (and those key value pairs) start to interfere. so now when u query with the song name, u wont get back the song lyrics, but a linear combination of other values for other keys as well which could be unrelated. this causes the retrieval to fail as now ur getting the song lyrics and a bunch of noise with it.

what retnet does is basically take ur state which is all these key value pairs added together, and scale them all down by a scalar fixed factor. so now when u query with the song name, u will get the song lyrics and noise, but if ur song lyrics were added in recently, they will have a stronger signal than the noise if that was farther back, so it prioritises recent key value pairs added in the state. the obvious issue is that if u want to query with the song name but that was stored a long time ago, its signal will be low.

mamba-2 and g-retnet basically make this scalar value dependent on the sequence so the model can learn how much to reduce the signal of all previous key value pairs. So if ur now storing an important piece of information, ur model can choose to lower the signal (decay) of the state (all previous key value pairs) so that ur new info is stored with a strong signal.

then rwkv6, gla, mamba turn this scalar into a vector. So u can imagine now the model can be more expressive with its decay as it can lower the signal of aspects of the state for example. here u have to pay attention to that the vector doesnt mean each previous key value pair gets a different decay scalar, it means that ur lowering the signal of every key of those key value pairs with the same decay, but now that decay is a vector so u can decay parts of the key less and some more (talking abt parts of each key, all keys will get the same decay).

delta rule takes a whole different approach. it basically states that the ideal state update rule should selectively choose key value pairs to discard, meaning instead of the decay which acts the same on all key value pairs we want to be able to target specific key value pairs to remove or lower the signal of more.

quick clarification: linear attention is "s = s + k^T @ v" so ur state is just a sum of all key value pairs. the others are "s = w * s + k^T @ v" where w is a vector or scalar acting on the rows of s (key dimension).

the idea behind the delta rule is that if i have my state after 768 tokens, then now i can assume all keys in the state are orthogonal which is the ideal situation. Now if i want to store a new key (and its value) but its the same as one of the keys already in there, i ideally dont want to store both but instead take one out. now pay attention that just because the keys are the same or very similar, doesnt mean their values are the same. the new key being the same can be attributed to it needing to be the same bcs of the limitation of its space. like u can imagine when storing ur key value pairs, each key can choose out of the 768 choices of keys (orthogonal), then a new key comes in thats not related to any before, it needs to choose one of the 768 choices but theyre already all taken so it just chooses one. so now when u query for the song name, u get back the sum of 2 values that can be very different and thats not what u want bcs u want the value exactly ideally.

so every step u have a new key value pair, and an old key value pair where those keys are the same (not in meaning but bcs of the limitations of the space). What the delta rule does is it queries the state with the new key, which means it extracts that old value. Then instead of adding in its new key value pair, it deletes that old value and adds the new one (adds to the state "k^T @ (v_new - v_old)" which equals "k^T @ v_new - k^T @ v_old"), and now bcs ur state already has somewhere in there "k^T @ v_old" (bcs the keys are the same), then that old key value pair will be deleted. the important thing here is that all other key value pairs in that state are unaffected while in the rnns like GLA, the decay is the same for all previous key value pairs at each step, of course is different at each step but the same for all previous keys and values. so now when u query the state, u exactly get back that value instead of the sum of 2 different ones.

in practice though u dont want to completely delete one, so what u do is interpolate, meaning instead of deleting v_old and storing v_new, u delete v_old and store "beta * v_new + (1 - beta) * v_old" instead where beta is a scalar thats dependent on the sequence (current value) so the model based on the actual new keys and values content, can choose how much to store of each. also can be viewed as making ur state update dependent on the content of that state, while using decay only like GLA makes that state update not dependent on that state. its like delta rule looks into the state, picks out a key value pair, and decides to forget it, with decay u just rip a piece off of each previous key value pair.

of course, some issues. firstly the state update is too slow, as ur only touching one key value at a time, u ideally want to be able to get rid of like 5 at once for example, which can hurt length extrapolation. also not an issue in general, but in practice the model wont query the state with an exact key and retrieve an exact value, it would query with a superposition of many keys and retrieve a superposition of many values and work with that bcs its smart, so it will retrieve some info abt the song lyrics, some info abt the songs history, some info abt other stuff and use it all however it wants to make its prediction. but the model still needs to be able to retrieve exactly what it wants which with only linear attn it cant after a certain point.

gated deltanet then solves the issue of the slow update by using mamba 2 style scalar data dependent decay. that means ur doing the delta rule on the state, then on top of that decaying all key value pairs in that state at once with the same value. this lets the model do stuff like a full delete of the state if it wants to. Then again KDA and rwkv7 make the decay a vector (only worth it if u can make it fast).

also, note that the gated deltanet/KDA decay also helps with the fact that ur still storing interpolations of values which is better than storing both but worse than storing one of them if u want exact retrieval. meaning, imagine the model was doing a math question, then u asked it abt space, its not fast enough to make a large delete in its state so has to keep info abt the math even if it doesnt want to, so the decay helps it make a quick sweep.

its still a fixed state size so of course it has to lose info, but linear attention cant lose info which is bad. decay on its own can lose info but has to lose the same amount of info from all key value pairs, delta rule can lose info but target specific key value pairs it wants to lose info from while keeping all others perfectly stored. and yes u lose some info with the interpolation but not as much as u think as u have many layers and its all a soup of info so it could store parts of info in one key value pair, then another part in another then retrieve both and mix or whatever, but the main idea is u specifically target key value pairs to lose info from
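The two update rules quoted in the thread, "s = s + k^T @ v" and the delta rule "s = s + k^T @ (v_new - v_old)", can be compared on a tiny example. This is a minimal sketch in plain Python, assuming unit-norm keys and a 2-dimensional key space. All function names are hypothetical, and `beta` is passed as a fixed number here rather than computed from the sequence.

```python
def outer(k, v):
    """Outer product k^T @ v as a nested list (rows follow the key dimension)."""
    return [[ki * vj for vj in v] for ki in k]


def mat_add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def read(state, q):
    """Query the state: q @ s."""
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(state[0]))]


def linear_attention_update(state, k, v):
    """s = s + k^T @ v: every pair is simply added on top of the state."""
    return mat_add(state, outer(k, v))


def delta_rule_update(state, k, v_new, beta=1.0):
    """s = s + beta * k^T @ (v_new - v_old), with v_old read from the state.

    beta = 1.0 fully replaces the old value stored under k; a smaller beta
    leaves beta * v_new + (1 - beta) * v_old under that key.
    """
    v_old = read(state, k)
    correction = [beta * (n - o) for n, o in zip(v_new, v_old)]
    return mat_add(state, outer(k, correction))


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


if __name__ == "__main__":
    k1, k2 = [1.0, 0.0], [0.0, 1.0]  # two orthogonal keys
    writes = [(k1, [1.0, 2.0]), (k2, [5.0, 5.0]), (k1, [3.0, 4.0])]  # k1 reused

    s_lin, s_delta = zeros(2, 2), zeros(2, 2)
    for k, v in writes:
        s_lin = linear_attention_update(s_lin, k, v)
        s_delta = delta_rule_update(s_delta, k, v)

    print(read(s_lin, k1))    # sum of both values written under k1
    print(read(s_delta, k1))  # only the latest value written under k1
    print(read(s_delta, k2))  # the pair stored under k2 is untouched
```

Reusing `k1` makes the linear attention state return the sum of both values, while the delta rule returns only the latest one and leaves the `k2` pair unchanged.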

100

12,727

samsja · Oct 21, 2025 · 6:46 PM UTC

samsja

@samsja19

21 Oct 2025

cannot unsee the parallel between scaling people and scaling gpu, both get communication bottleneck pretty fast

11,721

samsja · Jun 23, 2025 · 10:24 PM UTC

samsja

@samsja19

23 Jun 2025

it's live. Anybody can join even with consumer device

Prime Intellect

@PrimeIntellect

23 Jun 2025

Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.

9,864

samsja · Aug 17, 2025 · 11:45 PM UTC

samsja

@samsja19

17 Aug 2025

attention is all you need but agent still need to grep for code that is already in the context

8,824

samsja · Jun 11, 2025 · 4:39 PM UTC

samsja

@samsja19

11 Jun 2025

One has to admire the simplicity of DiLoCo for decentralized training. Just average the outer gradients every 100 steps:

* One big communication every hour is so much simpler to handle than very small communications every couple of seconds
* Still uses AdamW as the inner optimizer, so you don't have to worry too much about scaling to larger models (novel optimizers are hard to make work at scale). It can also be swapped with whatever optimizer works best (muon, soap, shampoo)
* Allows scaling to a way bigger batch size, scaling the data parallelism axis to way more compute than compression

You can have a quite performant DiLoCo implementation with a couple of lines of PyTorch code and it scales well. Simplicity is very powerful in deep learning
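A toy sketch of the loop described above, in plain Python rather than PyTorch. Several things here are assumptions for illustration and not part of the tweet: the one-parameter quadratic loss, plain SGD as the inner optimizer (the tweet uses AdamW), plain SGD as the outer optimizer, and every function name.

```python
def run_inner_steps(theta, target, inner_lr, num_inner_steps):
    """Local training on one worker: SGD on the loss 0.5 * (theta - target)**2."""
    for _ in range(num_inner_steps):
        grad = theta - target
        theta -= inner_lr * grad
    return theta


def diloco_round(theta_global, worker_targets, inner_lr=0.1,
                 num_inner_steps=100, outer_lr=1.0):
    """One outer step: train locally, then average the outer gradients."""
    outer_grads = []
    for target in worker_targets:
        # Each worker starts from the shared parameters and trains alone.
        theta_local = run_inner_steps(theta_global, target,
                                      inner_lr, num_inner_steps)
        # Outer gradient: how far this worker moved away from the shared point.
        outer_grads.append(theta_global - theta_local)
    # The only communication in the round: one average across workers.
    avg_outer_grad = sum(outer_grads) / len(outer_grads)
    return theta_global - outer_lr * avg_outer_grad


def train(worker_targets, num_rounds=5, theta_init=0.0):
    """Run several communication rounds and return the shared parameter."""
    theta = theta_init
    for _ in range(num_rounds):
        theta = diloco_round(theta, worker_targets)
    return theta


if __name__ == "__main__":
    # Two workers whose local optima are 1.0 and 3.0.
    print(train([1.0, 3.0]))
```

With `outer_lr=1.0` and a plain SGD outer step, one round reduces to averaging the workers' parameters, so the workers exchange a single value per 100 local steps instead of one gradient per step.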

8,039

samsja · Apr 18, 2023 · 7:41 AM UTC

samsja

@samsja19

18 Apr 2023

🚨🚨 Excited to release DocArray v2 today! It is a complete rewrite of DocArray. We built it to be the most Pythonic experience to deal with multi-modal data, at the edge of @pydantic and @PyTorch. repo github.com/docarray/docarray

9,788

samsja · Jul 4, 2025 · 7:21 PM UTC

samsja

@samsja19

4 Jul 2025

Replying to @finbarrtimbers

You can fix it with doing the softmax in fp32 arxiv.org/abs/2506.13585

9,091

samsja · Oct 29, 2025 · 6:05 AM UTC

samsja

@samsja19

29 Oct 2025

It's so frustrating to be absolutely convinced of what the next steps to AGI are but to be bottlenecked by how fragile scaling is, especially in the open, but the research team at has been accelerating hard and we now have strong momentum, really looking forward to our next model releases

6,090

samsja · Aug 26, 2025 · 11:28 PM UTC

samsja

@samsja19

26 Aug 2025

Claude is good at brainstorming about distributed training setup but absolutely crap at implementing simple code change anyway back to coding myself I guess

6,596

samsja · Jul 23, 2025 · 1:44 PM UTC

samsja

@samsja19

23 Jul 2025

running some workload on torchtitan with default config, 23% mfu, change a bit the config, enable compile and flex, increase batch size a bit --> 58% mfu. I wish we had more performant default in the torch ecosystem

14,191

samsja · Aug 1, 2025 · 1:35 PM UTC

samsja

@samsja19

1 Aug 2025

today's mood

4,246

samsja · Aug 6, 2025 · 8:21 AM UTC

samsja

@samsja19

6 Aug 2025

Replying to @cloneofsimo

It's extremely impressive indeed, Google is shipping hard

4,269

samsja · Jun 6, 2025 · 6:19 PM UTC

samsja

@samsja19

6 Jun 2025

6,445

samsja · Oct 30, 2025 · 3:39 AM UTC

samsja

@samsja19

30 Oct 2025

Replying to @nickbaumann_

I think you underestimate the complexity of scaling RL in large MoE, what you call "fine-tuning" is still extremely expensive and needs thousands of GPUs to be done. It's actually a smart move for them to focus on scaling RL rather than pretraining

13,117