INTELLECT-3 is our first model I can use daily. It's built using our open-source stack by scaling RL of MoE over 512 H200s and pushing the sota at its size. Incredibly proud of leading such a team of talented, dedicated and hard-working individuals collaborating together to push the open frontier. Open-source superintelligence is coming, the pretrains are also cooking
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more
26
20
438
51,334
I just finished building my deep learning rig with three 3090s. Almost all is second hand, reasonably cheap to build. Finally will be able to finetune and infer large models, maybe develop some MoE stuff.
108
78
1,570
194,060
sorry to all the debuggers fan boy but I like putting 1/0 in code to know if a function was called
66
35
1,520
107,190
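A minimal sketch of the 1/0 trick from the post above, using hypothetical function names: the division raises `ZeroDivisionError` the moment the line executes, so seeing that error proves the function was reached.

```python
def maybe_called(x):
    # Hypothetical function under suspicion; the 1 / 0 is the temporary probe.
    1 / 0  # raises ZeroDivisionError as soon as this line executes
    return x * 2


def was_called(fn, *args):
    """Return True if calling fn hit a 1 / 0 probe."""
    try:
        fn(*args)
    except ZeroDivisionError:
        return True
    return False
```

In day-to-day use you would skip the helper and simply read the traceback; the helper is only here to make the behavior checkable.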
deepseek r1 release: open source o1 grok 3 release: beats every benchmark gpt 4.5 release: Can hold my hand when I am scared
19
45
1,040
125,595
FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
6
38
816
89,751
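One plausible piece of intuition for the FP16 vs BF16 point above, sketched with the standard library only: FP16 keeps 10 mantissa bits while BF16 keeps 7, so FP16 rounds values less coarsely within its range. The helper names are hypothetical, and the BF16 conversion is an emulation (round-to-nearest-even on float32 bits), not a call into any real framework.

```python
import struct


def to_fp16(x):
    # Round-trip through IEEE half precision (struct format 'e').
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    # Emulated bfloat16 for finite inputs: keep the top 16 bits of the
    # float32 encoding, rounding the dropped bits to nearest even.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def rounding_errors(x):
    # Absolute rounding error of x in each format.
    return abs(to_fp16(x) - x), abs(to_bf16(x) - x)
```

For example, `1 + 2**-8` survives the FP16 round trip unchanged but collapses to `1.0` in the emulated BF16. The trade-off is range: FP16 overflows above 65504, which BF16 does not.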
I am hiring research engineers at @PrimeIntellect. We are building an open-source AGI lab and are looking for raw talent. We don't care about your previous job title. Everybody in the research team is full stack: we build infra and also look at data. If you have a soft spot for systems, reinforcement learning, data or scaling laws, you will be served a ton of challenges to solve
29
29
790
86,315
karpathy confirming his status of goat with the most balanced and realistic point of view. Not sure why everybody is calling the peak of the bubble while he literally said that we probably didn't overbuild and that claude code / codex were not even here 1 year ago. He is just overreacting to all of you calling agi too early and saying software engineering is dead. Also when he said RL sucks he is just saying that we will have better algos in one year, which is obviously true. I wish they had done a 6h interview
22
12
577
153,657
The "aha" moment when sonnet realized it can abuse try/catch and pass all the unit tests during RL
my goal in life is to join Anthropic, delete all try/except clauses from Claude’s training data, and then quit.
3
5
503
35,715
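A toy illustration of the failure mode joked about above, with hypothetical function names; it makes no claim about how any real training run behaved. A blanket `except` turns every failure into a default value, so a test that only checks "did not crash" passes even though the bug is still there.

```python
def parse_age_hacked(text):
    # Blanket except: any failure is silently turned into a default value.
    try:
        return int(text)
    except Exception:
        return 0


def parse_age_strict(text):
    # Lets ValueError propagate so bad input is visible to the caller.
    return int(text)


def weak_test(parse):
    """A weak check: passes if parse returns an int without crashing."""
    try:
        return isinstance(parse("not a number"), int)
    except Exception:
        return False
```

A reward built on `weak_test` prefers the hacked version, which is why test suites used as RL rewards need to assert on actual values and expected errors.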
full stack research engineer: can do pretraining, inference and RL
21
14
483
66,933
(stolen from the torchax blog post github.com/google/torchax/bl…)
Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + Pytorch: Run PyTorch models on TPUs with no code changes, now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program, Multi-Data (SPMD) as the default, a compiler-centric model native to TPUs for optimal execution. Dive deep into the new architecture and see the performance benchmarks in our latest blog post! blog.vllm.ai/2025/10/16/vllm… #vLLM #TPU #JAX #PyTorch #AI #OpenSource
10
15
489
44,143
Turns out GPUs and transformers can do other stuff than scaling LLM. We are releasing an open-source foundation 7b model for pandemic prevention trained on novel meta genomic data I am personally very excited about open source AI boosting scientific 😀
Releasing METAGENE-1: In collaboration with researchers from USC, we're open-sourcing a state-of-the-art 7B parameter Metagenomic Foundation Model. Enabling planetary-scale pathogen detection and reducing the risk of pandemics in the age of exponential biology.
17
50
472
40,962
Anybody that actually trained a model at large scale would tell you how painful and stressful it is to be 24/7 on the watch for infra crash, loss spike, expert routing collapse. Not convinced of the analogy haha
Yes sex is great but have you ever had a training run on 10K+ GB200s converge successfully If so could you pls dm me thx
9
14
464
45,101
I wish we could just teach models stuff instead of interacting with a static entity
Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here: github.com/karpathy/nanochat… This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview: - You have to ensure diversity in user prompts/queries - For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc. - I am encouraging the model to solve the model in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with. If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)
9
5
417
89,148
Working on LLM RL is one of the most intellectually satisfying things I have ever done, both from a systems and ML perspective
15
13
397
61,571
Intellect 1 is out. It's a 10B model trained across 3 continents using 100+ H100s, with 30 individual compute contributors. The evals are good (for 1T tokens), and the model is live. I can't stress how important this release is for open-source AI. Decentralized training is the only path toward sovereign open-source foundation models. This release proves that it's not just a fairy tale - it's working, and it's just the beginning.
Releasing INTELLECT-1: We’re open-sourcing the first decentralized trained 10B model: - INTELLECT-1 base model & intermediate checkpoints - Pre-training dataset - Post-trained instruct models by @arcee_ai - PRIME training framework - Technical paper with all details
11
56
380
63,288
The next generation of 10B+ valuation product startups will be built by scaling training on in-house RL environments. We live in an abundance of capabilities and yet we only have two major AI products, chatgpt and coding agents, and it deeply frustrates me. The current supply chain of artificial intelligence is structurally broken: one hardware vendor, a couple of giant AI players owning all the intelligence refineries, and developers left with API access and barely any control. It is as if the internet era had been built with Intel selling CPUs to 4 giant clouds owning the whole infrastructure and software stack, with secrets and NDAs all over the place, and giving wordpress as a development kit to startups. Why didn't perplexity invent deep research? They just couldn't, because it is trained with RL. My prediction for the next years: RL will become the most powerful toolkit for startups building AI products. We will see hundreds of success stories like cursor and lovable. End users will benefit the most from it. Big labs will evolve towards product companies: oai will focus on the consumer market, Anthropic on coding agents, deepmind will integrate AI into all google businesses, meta and xAI will fight over social media. We will see the emergence of an ecosystem of AI infrastructure startups, selling compute, training foundation models, curating data, building RL environments, offering cheap inference and training, powered by open science and open-source software. @PrimeIntellect is pioneering this ecosystem and the vision of open-source agi; the RL environment hub is one of the first key pieces
Introducing the Environments Hub RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI
11
23
354
46,948
Replying to @qtnx_
And barely change performance, I am personally running all my 3090 at 285w
8
335
47,249
at prime @jackminong invented an algorithm that can detect if a model is served quantized, or with a slightly different system prompt / tuned model. it's the backbone of our trustless RL training run, but it can also be used to detect if inference providers are cheating
you should legally be required to disclose what quantization level you are serving your current model at like it was a nutrition label. you should also be banned from dynamically adjusting quantization based on demand without notification. (you know who you are ...)
10
11
334
31,933
We started our 10b decentralized training. We only need one minute of communication every hour of training. Locked in for 3 weeks, long nights of no sleep, but it's finally live
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model Scaling decentralized training 10x beyond prior efforts. Anyone can join us to build open-source AGI 🦋
16
19
311
37,087
Very excited to present our work on OpenDiLoCo. We trained a 1b model over 3 countries with a bandwidth of less than 100mb/s (10_000x slower than infiniband) with 90-95% compute utilization, with a hybrid code using torch FSDP and hivemind. We want to break away from closed-source models being trained on giant clusters to open-source models co-trained between multiple smaller datacenters. OpenDiLoCo is our first step towards this goal. We still have a lot of work to do, from efficiently engineering smarter parallelization techniques to researching new algorithms that require less communication. If this is something that excites you, we are hiring for founding researchers. Shoutout to my colleagues @johannes_hage and @jackminong, to @Ar_Douillard for the original diloco work and @m_ryabinin for his help and work on hivemind
Introducing OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training. primeintellect.ai/blog/opend…
12
45
313
40,383
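A toy sketch of the DiLoCo-style loop described in the post above, on a scalar problem with hypothetical per-worker targets. The simplifications are assumptions for illustration: the inner optimizer here is plain gradient descent and the outer update is plain SGD, whereas the DiLoCo paper uses AdamW inside and Nesterov momentum outside. With `outer_lr=1.0` and no momentum, the outer step reduces to averaging the workers' parameters.

```python
def inner_steps(theta, target, steps, lr):
    # Local training on one worker: gradient descent on (theta - target)^2.
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta -= lr * grad
    return theta


def diloco_round(theta, targets, inner_steps_per_round, inner_lr, outer_lr):
    # Each worker starts from the shared parameters and trains without communicating.
    local = [inner_steps(theta, t, inner_steps_per_round, inner_lr) for t in targets]
    # Pseudo-gradient: how far each worker moved away from the shared parameters.
    pseudo_grads = [theta - w for w in local]
    # The only communication step: average the pseudo-gradients across workers.
    avg = sum(pseudo_grads) / len(pseudo_grads)
    # Outer update treats the averaged pseudo-gradient like a gradient.
    return theta - outer_lr * avg


def train(rounds=20):
    theta = 0.0
    targets = [1.0, 2.0, 3.0]  # hypothetical per-worker data optima
    for _ in range(rounds):
        theta = diloco_round(theta, targets, inner_steps_per_round=50,
                             inner_lr=0.05, outer_lr=1.0)
    return theta
```

Workers exchange one value per round instead of one per inner step, which is where the communication savings come from; in this toy setup the shared parameter settles at the mean of the worker targets.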
Prime-rl now has extensive support for MoE, both for RL and SFT; we have been training 100B+ models with it. We have support for: * Qwen3 a3-30b * GLM series and Moonlight * adding gpt oss series as we speak. We ended up rewriting most of the modelling code to make it work with torch compile while still being compatible with the hugging face ecosystem
8
15
275
23,985
OpenDiloco update: I think that we hit the sweet spot with our last experiments. We managed to match the loss of the baseline with 200x less communication. The key was to trade the amount of inner step with more quantization on the pseudo gradient. We are preparing a distributed 7b run all over the world. Goal is to prove that we can co-train the next generation of open source model. * What should we train on ? * Who wants to join the training ?
Introducing OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training. primeintellect.ai/blog/opend…
19
37
254
56,020
So proud of all we've shipped in less than a year: - 4 research papers covering decentralized training, trustless compute, and foundation models for pandemic detection - 2 large-scale collaborative distributed runs (Intellect-1 and Synthetic-1) - 2 reasoning datasets - Dozens of open-source codebases (we keep no secrets—all our work is out there) - Infrastructure for our peer-to-peer network of compute - A great platform that people love to use with the market's cheapest GPU prices We just secured solid investment from @foundersfund and amazing angels like @karpathy , @tri_dao @ClementDelangue (and other) that will enable us to pursue our vision of open-source AGI. We're set to release 10x more open research this year and accelerate like never before.
Announcing our $15m raise — led by @foundersfund. To build our peer to peer compute and intelligence protocol. With participation from @MenloVentures and angels like @karpathy @ClementDelangue @tri_dao @dylan522p @balajis @EMostaque and many others.
26
14
253
26,693
Absolute cultural victory of Open ai vs Google today. Good reminder that product >>> benchmark
9
5
241
9,759
It's time to teach models to maintain codebases and not just to write code. We need more advanced RL environments
22
10
236
14,821
We have more people without degree than people with a phd in the research team at prime
You don’t need a PhD to be a great AI researcher. Even @OpenAI’s Chief Research Officer doesn’t have a PhD.
9
7
233
31,003
async RL is faster than its synchronous counterpart. this might be the first time in ML history where an algorithm is naturally async at scale. we realized two things at prime 6 months ago: * RL will be as compute-intensive as pretraining, pushing frontier capability * for the first time in ml history, decentralized RL has structural advantages over its centralized counterpart. This was never true for pretraining
RL training is too slow? AReaL by @AntResearch_ introduces an asynchronous approach that decouples generation from training to eliminate blocking. Combined with system-level optimizations, this method achieves up to 2.57× speedup. Code open-sourced @jxwuyi
12
18
220
26,805
Took some time to reflect on this, big impostor syndrome for me to even be on this list lol, but I guess that a couple of years of doing open-source research is enough to make a name for yourself. Anyway, if you are an ambitious researcher or engineer thinking about working for a big lab, you should rather consider joining @PrimeIntellect to take ownership in open source AGI and accelerate your career
Replying to @tbpn
To generate the list, we spoke with researchers, scraped Google Scholar, and studied Dwarkesh to identify the visionaries who dedicate their lives to bringing us our cherished tokens. The ranking is determined by votes generously provided by anonymous participants during their time between training runs. We also provide additional context, such as: - Associated institutions - Current/previous company - General interests - Notable papers You can view the updated list at MetisList dot com
20
7
210
31,227
I realized at our Berlin event that there are a lot of talented and ambitious young ppl in Europe. Just (almost) no inspiring company to build the future, nor VCs that have the balls to give them a chance. No wonder everybody wants to come to sf
Berlin I’m in you
16
7
206
48,371
I went over the new muon paper. it's great. I want to see more of this type of work :) A couple of comments/observations: 1. Base muon won't work at a larger scale, you need to use weight decay (it has been recently added to github.com/KellerJordan/Muon…)
🚀 Introducing our new tech report: Muon is Scalable for LLM Training We found that Muon optimizer can be scaled up using the follow techniques: • Adding weight decay • Carefully adjusting the per-parameter update scale ✨ Highlights: • ~2x computational efficiency vs AdamW • Seamless transition from AdamW to Muon without hyper-parameter tuning • Memory & communication efficient implementation of distributed Muon optimizer. 🎯 Based on these improvements, we introduce Moonlight: Our 3B/16B MoE model trained with Muon on 5.7T tokens, advancing the Pareto frontier with better performance at fewer FLOPs! 🎁 Open-sourcing everything: 📚 Code & implementation: github.com/MoonshotAI/Moonli… 🤗 Full model series (pretrained, instruction-tuned & intermediate checkpoints): huggingface.co/moonshotai 📜 Paper: github.com/MoonshotAI/Moonli… #AI #LLM #OpenSource #MoonshotAI
3
10
204
30,943
We should be more worried about thinky hiring all the pytorch ppl than zuck poaching from open ai
11
2
200
14,090
Also, we release a new grpo RL training codebase github.com/PrimeIntellect-ai… It's only one of the components that made intellect-2 decentralized, but it implements async grpo with fsdp2 and vllm, worth checking it out :)
I cannot wait to release the Intellect-2 technical report and share all we learned about scaling test compute in a distributed / decentralized fashion. Extremely proud of the intense teamwork that went into preparing this fully permissionless training run with async reinforcement learning. Open source for the win
8
16
200
18,566
We finished the training of intellect 1. We trained the 10b model over 1T tokens across 3 continents over 100 GPUs. We are preparing a full release for next Friday including the final pre-trained model, a fine-tuned version, the intermediate checkpoints, datasets, and a technical report
We did it — the first decentralized training of a 10B model is complete! Trained across the US, Europe, and Asia 🌐 Post-training with @arcee_ai is underway, and a full open-source release is coming in ~1 week, including: base model, checkpoints, post-trained model and data.
6
24
195
17,449
why is the dataloader and checkpointing always the harder part to write
15
1
198
49,795
Intellect 1 was the first large-scale crowdsourced training run of LLM (10B+)
I've been arguing for something like this for over a year: crowdsourced distributed fine-tuning.
5
10
184
24,639
we are training our next model with continuous batching from the pipeline-rl paper, big throughput difference especially at large seq len here on prime-rl github.com/PrimeIntellect-ai…
Don't sleep on PipelineRL -- this is one of the biggest jumps in compute efficiency of RL setups that we found in the ScaleRL paper (also validated by Magistral & others before)! What's the problem PipelineRL solves? In RL for LLMs, we need to send weight updates from trainer to generator (to generate data from our latest policy being trained). (Conventional PPO-off-policy) A naive approach would be to "start generators on a batch, wait for all sequences to complete, update the model weights for both trainers and generators, and repeat. Unfortunately, this approach leads to idle generators and low pipeline efficiency due to heterogeneous completion times. (Pipeline-RL) Instead, we simply let the generators continue generating tokens without discarding or finishing ongoing generations in-flight whenever we need to do a weight update -- doing an "in-flight" weight update. As such our KV caches for these generations would be stale, as they would come from LLM with earlier copy(ies) of the weights) but this is ok (see below).
5
10
192
19,013
My 3 favorite distributed/decentralized training papers. DiLoCo (obviously): local SGD for the llm era, large batch size training arxiv.org/abs/2311.08105 (from @Ar_Douillard ). DedLoc: systems focus and the first decentralized run ever of an llm (in 2021) (arxiv.org/abs/2106.10207) from @m_ryabinin. PowerSGD: arxiv.org/abs/1905.13727 low-rank compression of gradients without compromising convergence. Efficient implementation in pytorch, was used to train the first dalle 1 at open ai. These 3 papers cover 80% of the sota imo. bonus points for swarm parallelism
What are the must read papers on distributed training so far from 2025 (and 2024)? Want to brush up on my paper reading skills. I imagine OpenDiLoCo would be one
6
19
191
21,221
as promised we open source all of our work
We are open-sourcing INTELLECT-1, the first decentralized trained 10B model. Including base model, checkpoints, post-trained model, data, technical report and our decentralized training framework.
7
8
177
19,648
absolute @kellerjordan0 victory
1T parameters, open-weights, just released on @huggingface!
2
3
187
10,303
Everybody in SF yap about iq and eq but god instinct and balls is so much more important
ablations are for the weak. just yolo your runs. (ok, do some small amount of ablations, but don't over do it). instinct is everything in ML and AI.
7
5
182
21,775
What is the best way to work on a branch and let an agent work on another branch at the same time ?
40
3
180
41,287
Hired to debug GPU, end up doing CPU profiling all day long, thanks python
Types in Python are a good idea until you open your profile trace and see this
6
6
183
14,662
There are some hidden gems in the pytorch forum. An entire blog post on FSDP/ Cuda Caching allocator / cuda stream / debugging memory spike
4
10
175
13,512
After more than 6 months in SF I finally shipped 2 out of 4 of my GPUs, and it feels so good to have them back
7
1
175
7,777
This could be the gpt 1 moment for agentic pretraining and latent reasoning. @mike64_t's work has the potential to challenge the status quo on what agi will look like
Introducing @mike64_t's work on "Recurrence-Complete Frame-based Action Models" A paper on why long-horizon perception requires rethinking recurrence.
3
9
161
23,916
I wish more people understood this. I am all for startup building on top of a foundation model and focus on a good product rather than research. But the paradigm of API + prompt is just not powerful enough for building a truly innovative product. Nobody could have built deep research except Open AI. Startup needs access to the lower level stack, base model, compute, software to infer and tune / rl the models to perfectly fit the product needs. It's not just about doing open source because it's cool, or ethical. It is just the best way to create the next trillion dollar companies
sonnet + prompts is not enough to build a company unless you're the company that built sonnet
6
12
155
47,577
The implication of the shift in paradigm to RL is enormous for decentralized training, and few realize it. normal training = 1 prefill-like forward 1 backward RL training = 10k autoregressive forward 1 backward you need to communicate an order of magnitude less
Replying to @PrimeIntellect
Decentralized Training in the Inference-Time Compute Paradigm Reinforcement learning has fundamentally different communication infrastructure requirements than pre-training, making globally distributed training the path forward. Read more in our blog: primeintellect.ai/blog/intel…
5
16
160
19,902
the real grind start when you cannot remember which day of the week it is
19
6
158
13,148
Synthetic-1 is only the tip of the iceberg of what is coming. We are making progress on the three essential pillars of open-source decentralized AGI: * Pretraining (intellect-1), * synthetic data gen (synthetic-1), * RL (soon) Beyond the research, we are also building **infrastructure** (soon to be open source) to coordinate all globally distributed computing. Be ready for a lot of releases this year from @PrimeIntellect
We did it — collaboratively generating the largest synthetic dataset of verified reasoning traces for math, coding and science using DeepSeek-R1. SFT fine-tune is underway and a full open-source release including dataset and post-trained models is coming in the next few days.
8
19
152
12,673
torchtitan has built-in HSDP + diloco support; it's probably the best place right now to start doing decentralized learning research. It also comes with support for many archs (llama3, llama4, deepseekv3...) as well as all possible parallelisms (6d?). Pytorch team cooked here
4
13
154
18,907
normal day at prime
13
8
151
13,183
I wrote a blog post about it for the more curious samsja.github.io/blogs/rig/p… one challenge was to find a cheap mobo + cpu that would support many gpus
8
6
138
8,944
impressed by the execution of the @astral_sh team, taking over the whole python ecosystem in a couple of months and already pushing a great product for enterprise. it's all just about execution
Today, we're announcing our first hosted infrastructure product: pyx, a Python-native package registry. We think of pyx as an optimized backend for uv: it’s a package registry, but it also solves problems that go beyond the scope of a traditional "package registry".
3
7
151
11,887
respect to open ai for dropping strong open source model ! It's also the only competitive open source western model out there, at least for now😉
We released two open-weight reasoning models—gpt-oss-120b and gpt-oss-20b—under an Apache 2.0 license. Developed with open-source community feedback, these models deliver meaningful advancements in both reasoning capabilities & safety. openai.com/index/introducing…
6
3
141
8,224
The only reason why we ship fast is because our team is extremely dedicated and cracked. We don't have ex-big labs/big tech people—instead, we have amazing young talent (for some, it's their first job).
7
3
140
14,915
just an example of why accessing AI via API sucks for product. I am using claude with some rust codebase and it sucks hard. I can solve it by doing some RL on my favorite Rust libraries. If we live in a world where open source models are a commodity, infra accessible, and training software accessible, I can train my sota rust model in a couple of days with an on-demand InfiniBand cluster and serve it using serverless infra / cloud. In a week, I have my product; if it is useful, lots of people will adopt it and give feedback, it will integrate with cursor and others ... Today I simply cannot do anything about it. it's just a missed product opportunity for founders. It's not research, it's just product
I wish more people understood this. I am all for startup building on top of a foundation model and focus on a good product rather than research. But the paradigm of API + prompt is just not powerful enough for building a truly innovative product. Nobody could have built deep research except Open AI. Startup needs access to the lower level stack, base model, compute, software to infer and tune / rl the models to perfectly fit the product needs. It's not just about doing open source because it's cool, or ethical. It is just the best way to create the next trillion dollar companies
15
10
137
33,997
Everybody telling me "Uhh prompt + icl + api is enough to build a good product look at the cursor, they are just using API and are successful, deep learning is useless for a startup" Now cursor is hiring a world-class researcher. Not saying products should not use API, but relying purely on them is not enough
Some personal news: I recently joined Cursor. Cursor is a small, ambitious team, and they’ve created my favorite AI systems. We’re now building frontier RL models at scale in real-world coding environments. Excited for how good coding is going to be.
9
2
143
25,124
I cannot wait to release the Intellect-2 technical report and share all we learned about scaling test compute in a distributed / decentralized fashion. Extremely proud of the intense teamwork that went into preparing this fully permissionless training run with async reinforcement learning. Open source for the win
Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.
7
9
132
25,779
fp16 really lowers the mismatch, let's see if there is an impact on the convergence later on during training. the kl max is absolutely wild, this means no clipping at all
BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎
4
5
130
13,322
I have been arguing for a long time now that RL training will be key part of building product I cannot emphasize how important it is for the broader startup ecosystem to open up the artificial intelligence supply chain
We shipped an OSS 'vibe coding platform' (like @v0) built with @vercel AI SDK, Gateway and Sandbox. We worked with @openai to tune the GPT-5 agent loop. It can write/read files, run commands, install packages, autofix errors… Demo oneshotting a multiplayer Pong in Go ↓
5
15
131
23,210
Replying to @giffmana
heard they train their mixture of expert on 2k 4090 with 48GB vram interconnected via Bluetooth 6
3
2
122
5,500
python type hints are just there for my IDE to link me to the correct function / object definition the rest doesn’t work
13
2
126
27,870
Open ai will be remembered as one of the most inspiring companies of all time
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
3
121
8,292
Our first work on reasoning is out: we generated reasoning traces and validated them, significantly increasing downstream performance. This work was based on QwQ; R1 reasoning traces will be even more powerful. Who is ready for decentralized code-r1-CoT-100M and math-r1-CoT-100M datasets?
Today, we are releasing: - INTELLECT-MATH, a frontier 7B parameter model for math reasoning - The largest synthetic math dataset to date of 5M verified reasoning traces - An outlook on decentralized training in the inference-compute paradigm primeintellect.ai/blog/intel…
11
12
122
11,924
How are we training a model across datacenters? We have more than 4 datacenters connected together. Yet we only communicate 1 minute every 45 minutes for a 10b model.
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model Scaling decentralized training 10x beyond prior efforts. Anyone can join us to build open-source AGI 🦋
6
13
115
9,692
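A back-of-the-envelope check of the "1 minute every 45 minutes" figure above. The helper name is hypothetical, and the sketch assumes communication and compute do not overlap and that the 45 minutes are pure compute; the post does not say either way.

```python
def comm_fraction(comm_minutes, compute_minutes):
    # Share of wall-clock time spent communicating, assuming the two
    # phases run back to back rather than overlapping.
    return comm_minutes / (comm_minutes + compute_minutes)


# 1 minute of sync after 45 minutes of local training.
overhead_percent = round(100 * comm_fraction(1, 45), 1)
```

Under those assumptions the sync cost is about 2% of wall-clock time, which is why slow inter-datacenter links remain tolerable.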
repeat after me, open science and open source is all you need
3
2
115
9,607
I wish this existed when I started my ml journey. Open source has an incentive problem and we are hoping to create a broader ecosystem to fix it
We're scaling our Open-Source Environments Program As part of this, we're committing hundreds of thousands of $ in bounties and looking for partners who want to join our mission to accelerate open superintelligence Join us in building the global hub for environments and evals
6
5
118
11,122
Very proud of this release, a lot of hard work went into it, full details in the tech report
Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: • Detailed Technical Report • INTELLECT-2 model checkpoint primeintellect.ai/blog/intel…
6
3
117
8,706
Extremely excited about this release, custom models are entering the product layer A vibrant llm open source ecosystem means more steerability for startup founders to express their most crazy idea
Introducing Cursor 2.0. Our first coding model and the best way to code with agents.
4
5
119
15,322
Hugging Face's @Thom_Wolf talking about diloco and intellect 1 at the @PrimeIntellect event! Also announcing the boom project: training a large model across multiple datacenters in collaboration with @PrimeIntellect!
3
13
118
25,424
You are all missing so much in not following @tenderizzation, his pytorch meme game is on another level
the FP8 values in your model after 50 layers of quantize/dequantize operations
5
2
114
5,973
Based on my estimation we should reach 1 million GitHub stars within the next 3 months
6
1
117
10,137
Replying to @deedydas
This cluster was A100 PCIe without NVSwitch and without InfiniBand. It's very difficult (read: impossible) to train an MoE like V3 on it
4
2
114
18,938
But what is just as important for a researcher: * looking at the data * interpreting eval results * experiment hygiene
full stack research engineer: can do pretraining, inference and RL
7
3
112
9,985
Replying to @jxmnop
The AI research crowd is giving too much credit to the "science part" of AI. Yes, "Attention Is All You Need" is a great paper, but so is all the hardware work and systems engineering around it
107
4,841
In contrast with a lot of people yapping about AI research on X, @kalomaze is also shipping papers. There are very few people who can claim to have published and gotten an oral at ICLR at only 19 years old
i will not be there at ICLR either (that much back to back flying wouldve been too much for me lol) but do pay close attention and listen to what @menhguin has to say at the event for our paper :)
5
4
110
9,307
Great paper! Scaling laws for DiLoCo were very much needed, really happy that DeepMind released them. One of the important results from this paper is that DiLoCo can work with a larger batch size than Adam. This means that it not only reduces communication but also allows scaling DDP to more compute! It's a result we have reproduced internally at Prime and used during INTELLECT-1 (we scaled to a 14M token batch size). Why does it work tho? My intuition is that DiLoCo should be seen as a model-merging-while-training technique rather than a replacement for DDP (averaging gradients). Let me explain: 1/N
We just put out a key step for making distributed training work at larger and larger models: Scaling Laws for DiLoCo TL;DR: We can do LLM training across datacenters in a way that scales incredibly well to larger and larger models!
4
14
104
17,634
Replying to @isidentical
this is next level psychopathic behavior wtf
5
4
106
8,315
please follow their work, they don't get the attention they deserve. Their lab at @SeaAIL has been putting out a lot of really good RL papers
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precision…
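To make the precision argument concrete, here is a minimal stdlib-only sketch of how much each format rounds a single value. It only illustrates mantissa resolution (FP16 has 10 mantissa bits and 5 exponent bits, BF16 has 7 mantissa bits and 8 exponent bits); it is not the paper's experiment, and the helper names `round_fp16`, `round_bf16` and `rounding_errors` are made up for this illustration.

```python
import struct


def round_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x: float) -> float:
    """Round x to bfloat16: keep the top 16 bits of the float32 encoding,
    rounding to nearest with ties to even (normal, finite inputs only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rounding_errors(x: float) -> tuple[float, float]:
    """Return (fp16_error, bf16_error) as absolute rounding errors for x."""
    return abs(round_fp16(x) - x), abs(round_bf16(x) - x)


if __name__ == "__main__":
    p = 0.3  # hypothetical token probability, chosen only for illustration
    fp16_err, bf16_err = rounding_errors(p)
    print(f"fp16: {round_fp16(p)!r}  error {fp16_err:.2e}")
    print(f"bf16: {round_bf16(p)!r}  error {bf16_err:.2e}")
```

For this value the BF16 rounding error is over ten times larger than the FP16 one. The trade-off is range: FP16 overflows above 65504, which is why it usually needs loss scaling.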
3
5
109
8,760
The bear is back
RL LEARNING WITH LORA: A DIVERSE DEEP DIVE
105
10,391
what would US big labs do if we discovered that reasoning models are better when reasoning in Chinese?
13
2
107
9,669
this is just wild, have we ever seen a market where the main suppliers are in direct competition with their biggest clients? Isn't it clear that there is an alternative ecosystem where models become a commodity that would benefit the application / product layer? Time for founders to take the open source super intelligence stack seriously
6
4
104
34,608
yes and no, dataclasses are not good enough. Pydantic models / dataclasses are way more expressive and validate the input. There is a lib maintained by the @pydantic team that allows overriding values from the CLI and loading config via TOML, that's all you need github.com/pydantic/pydantic…
The obvious and unique possible answer is a single big dataclass and cmd line override of its values Right?
7
6
102
14,111
Damn, Google is starting to really care about local LLMs. Are they soon going to win on both the OSS side and the API side?
Just announced new versions of Gemma 3 – the most capable model to run just one H100 GPU – can now run on just one *desktop* GPU! Our Quantization-Aware Training (QAT) method drastically brings down memory use while maintaining high quality. Excited to make Gemma 3 even more accessible for more developers.
4
2
101
12,899
Big respect to them for contributing back to torchtitan, a lot of labs talk about open source but few actually contribute
periodic ❤️ open-source! for example, we’ve been collaborating with the @PyTorch team to build the highest MFU gpt-oss training implementation (includes thinky sinky flexattn) here’s a few SFT runs of gpt-oss-20b & 120b, where i get ~24% MFU for 20b and ~8% for 120b
1
5
101
14,497
Our autonomous AI scientist @Grad62304977 just needs the right amount of prompting to write the clearest answers to some research problems in X comments
Replying to @nrehiew_
Originally a while back i got some intuition with it from the query key value perspective which might help (theres also the gradient descent perspective which is good too). Scraped this from a chat with @stochasticchasm a year ago so might be a bit dodgy.
Imagine u want to store a database of knowledge in ur state. ur gonna do this by storing key value pairs. For example on a high level, a key could be a song name, and the value associated with it is the song lyrics.
when u want to extract those lyrics, u want to be able to query ur database full of key value pairs with the song name and get back the song lyrics, so ur gonna query the database using that key.
im going to be assuming all keys stored in the database are orthogonal, as in a 768 dimension space that assumption holds (nearly orthogonal). With orthogonal meaning that if i query with a key, i get back that keys associated value and nothing else.
now why linear attention fails is because for example in a 768 dimension space, really all u can have is 768 orthogonal keys, so when ur sequence becomes longer, keys (and those key value pairs) start to interfere. so now when u query with the song name, u wont get back the song lyrics, but a linear combination of other values for other keys as well which could be unrelated. this causes the retrieval to fail as now ur getting the song lyrics and a bunch of noise with it.
what retnet does is basically take ur state which is all these key value pairs added together, and scale them all down by a scalar fixed factor. so now when u query with the song name, u will get the song lyrics and noise, but if ur song lyrics were added in recently, they will have a stronger signal than the noise if that was farther back. so it prioritises recent key value pairs added in the state. the obvious issue is that if u want to query with the song name but that was stored a long time ago, its signal will be low.
mamba-2 and g-retnet basically make this scalar value dependent on the sequence so the model can learn how much to reduce the signal of all previous key value pairs. So if ur now storing an important piece of information, ur model can choose to lower the signal (decay) of the state (all previous key value pairs) so that ur new info is stored with a strong signal.
then rwkv6, gla, mamba turn this scalar into a vector. So u can imagine now the model can be more expressive with its decay as it can lower the signal of aspects of the state for example. here u have to pay attention to that the vector doesnt mean each previous key value pair gets a different decay scalar, it means that ur lowering the signal of every key of those key value pairs with the same decay, but now that decay is a vector so u can decay parts of the key less and some more (talking abt parts of each key, all keys will get the same decay).
delta rule takes a whole different approach. it basically states that the ideal state update rule should selectively choose key value pairs to discard, meaning instead of the decay which acts the same on all key value pairs we want to be able to target specific key value pairs to remove or lower the signal of more.
quick clarification: linear attention is "s = s + k^T @ v" so ur state is just a sum of all key value pairs. the others are "s = w * s + k^T @ v" where w is a vector or scalar acting on the rows of s (key dimension).
the idea behind the delta rule is that if i have my state after 768 tokens, then now i can assume all keys in the state are orthogonal which is the ideal situation. Now if i want to store a new key (and its value) but its the same as one of the keys already in there, i ideally dont want to store both but instead take one out.
now pay attention that just because the keys are the same or very similar, doesnt mean their values are the same. the new key being the same can be attributed to it needing to be the same bcs of the limitation of its space. like u can imagine when storing ur key value pairs, each key can choose out of the 768 choices of keys (orthogonal), then a new key comes in thats not related to any before, it needs to choose one of the 768 choices but theyre already all taken so it just chooses one. so now when u query for the song name, u get back the sum of 2 values that can be very different and thats not what u want bcs u want the value exactly ideally.
so every step u have a new key value pair, and an old key value pair where those keys are the same (not in meaning but bcs of the limitations of the space). What the delta rule does is it queries the state with the new key, which means it extracts that old value. Then instead of adding in its new key value pair, it deletes that old value and adds the new one (adds to the state "k^T @ (v_new - v_old)" which equals "k^T @ v_new - k^T @ v_old"), and now bcs ur state already has somewhere in there "k^T @ v_old" (bcs the keys are the same), then that old key value pair will be deleted.
the important thing here is that all other key value pairs in that state are unaffected, while in the rnns like GLA, the decay is the same for all previous key value pairs at each step (of course is different at each step but the same for all previous keys and values). so now when u query the state, u exactly get back that value instead of the sum of 2 different ones.
in practice though u dont want to completely delete one, so what u do is interpolate, meaning instead of deleting v_old and storing v_new, u delete v_old and store "beta * v_new + (1 - beta) * v_old" instead where beta is a scalar thats dependent on the sequence (current value), so the model based on the actual new keys and values content, can choose how much to store of each.
also can be viewed as making ur state update dependent on the content of that state, while using decay only like GLA makes that state update not dependent on that state. its like delta rule looks into the state, picks out a key value pair, and decides to forget it, with decay u just rip a piece off of each previous key value pair.
of course, some issues. firstly the state update is too slow, as ur only touching one key value at a time, u ideally want to be able to get rid of like 5 at once for example, which can hurt length extrapolation.
also not an issue in general, but in practice the model wont query the state with an exact key and retrieve an exact value, it would query with a superposition of many keys and retrieve a superposition of many values and work with that bcs its smart. so it will retrieve some info abt the song lyrics, some info abt the songs history, some info abt other stuff and use it all however it wants to make its prediction. but the model still needs to be able to retrieve exactly what it wants which with only linear attn it cant after a certain point.
gated deltanet then solves the issue of the slow update by using mamba 2 style scalar data dependent decay. that means ur doing the delta rule on the state, then on top of that decaying all key value pairs in that state at once with the same value. this lets the model do stuff like a full delete of the state if it wants to. Then again KDA and rwkv7 make the decay a vector (only worth it if u can make it fast).
also, note that the gated deltanet/KDA decay also helps with the fact that ur still storing interpolations of values which is better than storing both but worse than storing one of them if u want exact retrieval. meaning, imagine the model was doing a math question, then u asked it abt space, its not fast enough to make a large delete in its state so has to keep info abt the math even if it doesnt want to, so the decay helps it make a quick sweep.
its still a fixed state size so of course it has to lose info, but linear attention cant lose info which is bad. decay on its own can lose info but has to lose the same amount of info from all key value pairs, delta rule can lose info but target specific key value pairs it wants to lose info from while keeping all others perfectly stored.
and yes u lose some info with the interpolation but not as much as u think as u have many layers and its all a soup of info so it could store parts of info in one key value pair, then another part in another then retrieve both and mix or whatever, but the main idea is u specifically target key value pairs to lose info from
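The three update rules described above can be compared on a toy state. This is a minimal pure-Python sketch, assuming a 2x2 state, exactly orthonormal keys, a scalar decay `w` and a fixed `beta` (in the real models both depend on the input); the function names are made up for the illustration and this is not any library's implementation.

```python
def outer(k, v):
    """k^T @ v: outer product of a key and a value."""
    return [[ki * vj for vj in v] for ki in k]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def query(s, q):
    """q @ s: read from the state with a query/key vector."""
    return [sum(q[i] * s[i][j] for i in range(len(q))) for j in range(len(s[0]))]


def linear_update(s, k, v):
    """Linear attention: s = s + k^T @ v."""
    return add(s, outer(k, v))


def decay_update(s, k, v, w):
    """RetNet-style scalar decay: s = w * s + k^T @ v."""
    return add([[w * x for x in row] for row in s], outer(k, v))


def delta_update(s, k, v, beta):
    """Delta rule: read v_old with k, then write beta * (v - v_old) under k."""
    v_old = query(s, k)
    return add(s, outer(k, [beta * (a - b) for a, b in zip(v, v_old)]))


def demo():
    e0, e1 = [1.0, 0.0], [0.0, 1.0]      # two orthonormal keys fill the state
    s = [[0.0, 0.0], [0.0, 0.0]]
    s = linear_update(s, e0, [1.0, 0.0])
    s = linear_update(s, e1, [0.0, 1.0])
    new_v = [5.0, 5.0]                   # new value that has to reuse key e0
    states = {
        "linear": linear_update(s, e0, new_v),
        "decay": decay_update(s, e0, new_v, w=0.5),
        "delta": delta_update(s, e0, new_v, beta=1.0),
    }
    # For each rule: (what key e0 returns, what the untouched key e1 returns)
    return {name: (query(m, e0), query(m, e1)) for name, m in states.items()}


if __name__ == "__main__":
    for name, (read_e0, read_e1) in demo().items():
        print(name, read_e0, read_e1)
    # linear [6.0, 5.0] [0.0, 1.0]   old and new values are summed
    # decay  [5.5, 5.0] [0.0, 0.5]   every stored pair is weakened
    # delta  [5.0, 5.0] [0.0, 1.0]   old value replaced, other pair untouched
```

With `beta` below 1 the delta rule stores the interpolation `beta * v_new + (1 - beta) * v_old` under that key instead of a full replacement.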
6
2
100
12,727
cannot unsee the parallel between scaling people and scaling GPUs, both get communication bottlenecked pretty fast
7
8
96
11,721
it's live. Anybody can join, even with a consumer device
Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.
3
3
97
9,864
attention is all you need but agents still need to grep for code that is already in the context
4
96
8,824
One has to admire the simplicity of DiLoCo for decentralized training. Just average the outer gradients every 100 steps:
* One big communication every hour is so much simpler to handle than very small communications every couple of seconds
* It still uses AdamW as the inner optimizer, so you don't have to worry too much about scaling to larger models (novel optimizers are hard to make work at scale). It can also be swapped for whatever optimizer works best (Muon, SOAP, Shampoo)
* It allows scaling to a way bigger batch size, scaling the data parallelism axis to way more compute than compression would
You can have a quite performant DiLoCo implementation with a couple of lines of PyTorch code, and it scales well. Simplicity is very powerful in deep learning
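The loop described above can be sketched on a toy problem. This is a pure-Python illustration, not the PyTorch implementation the tweet refers to, and it makes simplifying assumptions: each worker minimises a 1-D quadratic, the inner optimizer is plain SGD instead of AdamW, and the outer step is plain SGD (the DiLoCo paper uses Nesterov momentum for the outer optimizer). The function name `diloco` and all hyperparameter values are hypothetical.

```python
def diloco(targets, rounds=20, inner_steps=100, inner_lr=0.05, outer_lr=1.0):
    """Toy DiLoCo: worker i minimises (theta - targets[i]) ** 2 locally and
    only talks to the other workers once every `inner_steps` steps."""
    theta = 0.0          # global parameter shared by all workers
    communications = 0
    for _ in range(rounds):
        outer_grads = []
        for target in targets:                   # workers run independently
            local = theta                        # start from the global weights
            for _ in range(inner_steps):
                grad = 2.0 * (local - target)    # gradient of the local loss
                local -= inner_lr * grad         # inner step (SGD in this sketch)
            outer_grads.append(theta - local)    # outer (pseudo) gradient
        avg = sum(outer_grads) / len(outer_grads)  # the only communication
        communications += 1
        theta -= outer_lr * avg                  # outer step on the average
    return theta, communications


if __name__ == "__main__":
    theta, communications = diloco([1.0, 3.0])
    print(theta, communications)
```

Each worker takes 2,000 inner steps here but only 20 averaging rounds happen, where a DDP-style loop would synchronise at every step. The global parameter still ends up at the minimiser of the combined loss, the mean of the targets.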
6
2
97
8,039
🚨🚨 Excited to release DocArray v2 today! It is a complete rewrite of DocArray. We built it to be the most Pythonic experience to deal with multi-modal data, at the edge of @pydantic and @PyTorch. repo github.com/docarray/docarray
1
17
96
9,788
Replying to @finbarrtimbers
You can fix it by doing the softmax in FP32 arxiv.org/abs/2506.13585
3
4
95
9,091
It's so frustrating to be absolutely convinced of what the next steps to AGI are while being bottlenecked by how fragile scaling is, especially in the open. But the research team has been accelerating hard and we now have strong momentum, really looking forward to our next model releases
4
91
6,090
Claude is good at brainstorming about distributed training setups but absolutely crap at implementing simple code changes. Anyway, back to coding myself I guess
6
1
93
6,596
Running some workload on torchtitan with the default config: 23% MFU. Change the config a bit, enable compile and flex, increase the batch size a bit --> 58% MFU. I wish we had more performant defaults in the torch ecosystem
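For readers unfamiliar with the metric, MFU (model FLOPs utilization) is achieved training FLOP/s divided by the hardware's peak FLOP/s. A rough sketch follows, assuming the common dense-transformer approximation of 6 FLOPs per parameter per token (it ignores attention FLOPs). Every number below is hypothetical, and none of it is taken from the run in the tweet or from torchtitan's own reporting.

```python
def mfu(n_params: float, tokens_per_second: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Approximate model FLOPs utilization for a dense transformer."""
    achieved = 6.0 * n_params * tokens_per_second  # forward + backward FLOP/s
    return achieved / (n_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical setup: 8B parameters, 8 GPUs, 1e15 peak FLOP/s per GPU.
    print(f"{mfu(8e9, 40_000, 8, 1e15):.0%}")  # 40k tokens/s overall
    print(f"{mfu(8e9, 96_000, 8, 1e15):.0%}")  # 96k tokens/s overall
```

On fixed hardware and model size, MFU is proportional to throughput, so going from 23% to 58% means roughly 2.5x more tokens per second.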
6
2
89
14,191
today's mood
5
91
4,246
Replying to @cloneofsimo
It's extremely impressive indeed, Google is shipping hard
89
4,269
8
3
88
6,445
Replying to @nickbaumann_
I think you underestimate the complexity of scaling RL on large MoEs. What you call "fine-tuning" is still extremely expensive and needs thousands of GPUs. It's actually a smart move for them to focus on scaling RL rather than pretraining
3
1
88
13,117