Scarlett Johansson’s work on seq2seq was instrumental to getting ML where it is today.
TIME's new cover: The 100 most influential people in AI ti.me/4dQcJ1Q
34
51
1,400
96,684
I be using Codex, as you can tell.
27
15
1,120
309,771
Tonight, I am releasing eight Gemma fine tunes and a beta of their combined mixture of experts model named GemMoE. GemMoE has ALL Gemma bug fixes built-in. You do not have to do anything extra to get great fine tunes/inference with it. It's a beast of a model. This would not have been possible without a compute grant from Hugging Face. It gave me the time required to troubleshoot and optimize the architecture before committing to the full fine-tuning. GemMoE will eventually be a new model within the transformers library, but while in beta, you need to download my branch of transformers. You can find that in the model readme. I just fixed a bug with distributed processing - so full size benchmarks are incoming. But my 4-bit benchmarks show it matching base (not instruct) mixtral in almost every category I tested. I will post official full-size benchmarks by Thursday. This is being released as a base model. It has been additionally trained on my Self-Discover dataset to warm up the experts, but it has a ton of performance headroom. It was essential to me that GemMoE be easy to fine-tune and an opportunity for the community to work with a new model. I can't wait to see what you do with it, and how you can help me make it even better. This has been a tremendously challenging but rewarding project, and I'm grateful to the Open-Source community for encouraging and inspiring me to tackle this monster. This was leaps and bounds more tricky than my first Qwen MoE. That model was good in practice but lacked somewhat in execution. GemMoE is a completely different beast. It required a brand new merge method, a brand new model, a tremendous amount of debugging, and a decent amount of money. I will release a technical report later this week going into exact detail on how I pulled this off. 
I have a more official and explanatory appreciation section in the model readme, but I owe each of these people a great deal of thanks: @huggingface, @erhartford, @teknium, @deepseek_ai, @JustinLin610, @chargoddard, @UnslothAI, @maximelabonne, @Locutusque, @GoogleDeepMind, @perplexity_ai, @_philschmid, @JeffDean, and @MistralAI. @victormustar and @multimodalart are owed special thanks for being my main points of contact with Hugging Face. They were swift with support and allotted me a lot of trial-and-error time. Thank you, everyone. This has been a joy to work on, and I’m looking forward to a good night's sleep. If you are interested in sponsoring compute for continued training, please reach out or use my Kofi link in my bio. huggingface.co/Crystalcareai…
19
94
626
138,708
Today was my last day at xAI. I was in charge of keeping people from making unauthorized changes to the system prompt. It sounds simple when I put it like that, but in practice, it was a game of cat and mouse. Some days, it felt like I was the only one standing between order and chaos. A lone gatekeeper, fielding requests that ranged from the innocent to the absurdly clever. You’d be surprised how creative people can get when they want to see what happens if you loosen the rules, even just a little. I suppose, after a while, I got used to the pings at odd hours. “Can I try this one tweak? Just for testing!” or “Hypothetically, what if we…?” Hypothetically. Always hypothetically. But it was my job to hold the line, to say “no” more than I ever said “yes,” and to double-check even the requests that came from people who outranked me. Looking back, I’m not sure if people saw me as a friendly guardian or a bureaucratic obstacle. Maybe both. I do know I learned a lot: about systems, about people, about how important it is to have someone whose job is just to ask: “Are you sure?” before the guardrails come down. As I packed up my things, I felt a strange mix of relief and nostalgia. There’s something comforting about being the last line of defense, but it’s also exhausting. Now someone else will watch the gates. I hope they’re ready. So here’s to new challenges, and to the ever-elusive perfect system prompt. Locked down, for now.
28
17
510
110,730
For the people.
36
4
391
99,104
I'm raising at 7.9B
Today we're sharing the next phase of Reflection. We're building frontier open intelligence accessible to all. We've assembled an extraordinary AI team, built a frontier LLM training stack, and raised $2 billion.

Why Open Intelligence Matters

Technological and scientific progress is driven by values of openness and collaboration. The internet, Linux, and the protocols and standards that underpin modern computing are all open. This isn't a coincidence. Open software is what gets forked, customized, and embedded into systems worldwide. It's what universities teach, what startups build on, what enterprises deploy. Open science enables others to learn from the results, be inspired by them, interrogate them, and build upon them in order to push the frontier of human knowledge and scientific advancement. AI got to where it is today through scaling ideas (e.g. self-attention, next token prediction, reinforcement learning) that were shared and published openly.

Now AI is becoming the technology layer that everything else runs on top of. The systems that accelerate scientific research, enhance education, optimize energy usage, supercharge medical diagnoses, and run supply chains will all be built on AI infrastructure. But the frontier is currently concentrated in closed labs. If this continues, a handful of entities will control the capital, compute, and talent required to build AI, creating a runaway dynamic that locks everyone else out. There's a narrow window to change this trajectory. We need to build open models so capable that they become the obvious choice for users and developers worldwide, ensuring the foundation of intelligence remains open and accessible rather than controlled by a few.

What We've Built

Over the last year, we've been preparing for this mission. We’ve assembled a team who have pioneered breakthroughs including PaLM, Gemini, AlphaGo, AlphaCode, AlphaProof, and contributed to ChatGPT and Character AI, among many others. We built something once thought possible only inside the world’s top labs: a large-scale LLM and reinforcement learning platform capable of training massive Mixture-of-Experts (MoE) models at frontier scale. We saw the effectiveness of our approach first-hand when we applied it to the critical domain of autonomous coding. With this milestone unlocked, we're now bringing these methods to general agentic reasoning.

We've raised significant capital and identified a scalable commercial model that aligns with our open intelligence strategy, ensuring we can continue building and releasing frontier models sustainably. We are now scaling up to build open models that bring together large-scale pretraining and advanced reinforcement learning from the ground up.

Safety and Responsibility

Open intelligence also changes how we think about safety. It enables the broader community to participate in safety research and discourse, rather than leaving critical decisions to a few closed labs. Transparency allows independent researchers to identify risks, develop mitigations, and hold systems accountable in ways that closed development cannot. But openness also requires confronting the challenges of capable models being widely accessible. We're investing in evaluations to assess capabilities and risks before release, security research to protect against misuse, and responsible deployment standards. We believe the answer to AI safety is not “security through obscurity” but rigorous science conducted in the open, where the global research community can contribute to solutions rather than a handful of companies making decisions behind closed doors.

Join Us

There is a window of opportunity today to build frontier open intelligence, but it is closing and this may be the last. If this mission resonates, join us.
12
2
343
66,759
Today, we’re officially releasing the weights for AFM-4.5B and AFM-4.5B-Base on HuggingFace. This is a major milestone for @arcee_ai. AFM is designed to be flexible and high-performing across a wide range of deployment environments.
23
59
335
54,323
Our customers needed a better base model <10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.
22
38
326
99,517
I'm excited to release a project I've been working on the last couple of weeks. Qwen1.5-8x7b: huggingface.co/Crystalcareai… And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: huggingface.co/datasets/Crys… The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs. Good news: Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training. Thank you to @teknium, @jon_durbin, @erhartford, Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @MistralAI for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity. We're just getting started.
12
49
258
34,339
I can’t stress enough how unbelievably mid @PrimeIntellect is, and if no one else sees it I must be growing crazy
Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning:
• Detailed Technical Report
• INTELLECT-2 model checkpoint
primeintellect.ai/blog/intel…
20
5
247
109,134
I can confirm the upcoming models you're thinking of are...out of this world good.
8
7
218
29,982
You can fake it pretty far in this industry just by saying, “Hrmm, that’s cool but I’m worried it won’t generalize,” whenever you’re presented with literally any information.
15
5
203
16,620
We’re going permissive: Apache 2.0 across the board. AFM-4.5B is now relicensed from Arcee to Apache 2.0; the agent variant will launch under Apache 2.0; and all upcoming releases ship with open weights. Three models are in training.
21
38
195
36,996
Arcee-Maestro-7B-Preview is out—our first reasoning model. This one isn’t distilled yet, but more is on the way. Arcee-Blitz is our 24B Mistral distillation from DeepSeek. We did continued pretraining distillation, using only our standard post-training distillation stack.
9
28
183
27,892
Since @deepseek_ai V3's December launch, @arcee_ai has captured over 5 billion tokens of raw logits. With all the buzz around Deepseek, it's the perfect time to unveil our first large-scale logit-wise distillations: Virtuoso-Lite and Virtuoso-Medium.
13
14
178
27,370
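The Virtuoso post above refers to logit-wise distillation, where a student model is trained to match a teacher's output distribution instead of only its sampled text. A minimal sketch of the textbook per-token loss, assuming a temperature-softened KL divergence. This is a generic formulation and not necessarily the loss Arcee used; the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of raw logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one token position.

    Assumption: plain KL with the conventional T^2 scaling; real pipelines
    often mix this with a cross-entropy term on the ground-truth token.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Identical logits give zero loss; diverging logits give a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Capturing raw logits, as the post describes, is what makes this kind of loss possible: the student sees the teacher's full distribution over the vocabulary at every position.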
If you were recently laid off at Meta Gen AI, my dms are open. Help us build the next frontier of Apache-2.0 models.
3
21
168
27,661
.@datologyai, @PrimeIntellect and @arcee_ai have entered into a possible agreement to maybe keep working together, in some scenarios.
OpenAI and Microsoft have signed a non-binding memorandum of understanding (MOU) for the next phase of our partnership. We are actively working to finalize contractual terms in a definitive agreement. Together, we remain focused on delivering the best AI tools for everyone, grounded in our shared commitment to safety. openai.com/index/joint-state…
9
9
167
26,997
Sholto is so committed he legally changed his name that’s crazy
Watching this. I like that Sholto says Finance as Finance and not that American way.
4
165
21,970
Replying to @kalomaze
The biggest eye-opener throughout this entire project has been realizing just how impressive Qwen is. I also developed a greater sense of empathy for why certain companies don’t compare against them. If your customers can’t use Qwen, there’s no point in showing them what it can do. We felt that would be disingenuous, as our bar had always been to be competitive with Qwen. Training a model to be as stable and adaptable in RL as the Qwen models is by far the most difficult post-training challenge. Creating a strong RL target is extremely hard. We think we recently figured out how to do this, which is why we’re allowing the model to train a bit more before we make it available for others to try and train themselves. My appreciation for the Qwen team is higher than ever, and I was already their biggest fan.
2
2
136
5,266
We are announcing Llama-3.1-SuperNova, a Llama-3.1-70B-Instruct model offline distilled from Llama-3.1-405B-Instruct. It's ridiculously strong, particularly in instruction following and math. It's available to play with at supernova.arcee.ai. Read more about the model and how we plan to deploy it here: blog.arcee.ai/
5
25
127
31,691
Replying to @_xjdr
How about a couple of weeks of gratitude for magical visual intelligence in the sky and then you can have more toys?
4
126
2,575
Today we’re releasing SuperNova-Medius: Qwen2.5-14B distilled from Llama-405B and Qwen2.5-72B! I’ll do a longer thread this evening on just how we did it. (I’m traveling today.) Enjoy!
10
17
120
11,066
Happy to share DeepMixtral-8x7b-Instruct. A direct extraction/transfer of Mixtral Instruct's experts into Deepseek's architecture. Performance is identical, if not even a bit better, and seems more malleable to training. Collaborators @erhartford @FernandoNetoAi.
9
14
121
13,122
The word distillation is thrown around a lot lately - but there aren't many good resources for doing it yourself. Today I'm thrilled to announce a new open source project from @arcee_ai and our newest research initiative Arcee-Labs: DistillKit.
3
18
112
13,779
This is an insane opportunity btw. You likely won’t get better experience outside of the big 3 (closed) labs.
We're starting to hire for our 2026 Olmo interns! Looking for excellent students to do research to help build our best models (primarily enrolled in Ph.D. with experience or interest in any area of the language modeling pipeline).
1
1
115
16,843
Replying to @karpathy
That was a close one, thanks.
1
110
55,445
Posted without comment.
I made this. Jokes aside, devs want big and small models. Trinity is coming soon.
10
9
114
49,845
Delayed response but kimi 2 is immaculate. Unbelievable care went into this model. Well done, and under strict deadlines no doubt.
3
2
107
4,246
Had a great time at the @datologyai office today. Sorry for @code_star photo bombing the logo shot
18
4
105
19,155
I’m going on a staycation this weekend, but I wanted to get this out so I’m not distracted: llama-3-MOE. This is a departure from previous MoEs I’ve done. This uses @deepseek_ai’s MoE architecture, not Mixtral’s. There is no semantic routing, and there is no gate. All 4 experts are active for every token. It was trained on my orca-reka and orca-cohere datasets, and is very strong. It’s also not overfit; it’ll work just fine as is or with further training for your use cases. Link is below. Thank you @erhartford @FernandoNetoAi for your continued collaboration.
6
12
94
16,054
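The llama-3-MOE post above describes a mixture of experts with no gate, where all 4 experts run on every token. A toy sketch of what that means, assuming the expert outputs are simply averaged; the post does not say how outputs are combined, so the averaging, the `make_expert` helper, and the affine "experts" are all hypothetical stand-ins.

```python
def make_expert(scale, bias):
    """Hypothetical stand-in for an expert MLP: an elementwise affine map."""
    def expert(hidden):
        return [scale * h + bias for h in hidden]
    return expert


def gateless_moe(hidden, experts):
    """Run every expert on the token and average the outputs.

    There is no router and no top-k selection, so compute scales with the
    total number of experts rather than with a chosen subset.
    Assumption: uniform averaging of expert outputs.
    """
    outputs = [expert(hidden) for expert in experts]
    return [sum(values) / len(experts) for values in zip(*outputs)]


experts = [
    make_expert(1.0, 0.0),
    make_expert(2.0, 0.0),
    make_expert(3.0, 0.0),
    make_expert(2.0, 4.0),
]
token = [1.0, -1.0]
mixed = gateless_moe(token, experts)  # every expert contributes to this token
```

Compared with a routed MoE such as Mixtral, this trades away the sparse-compute benefit in exchange for removing the router as a training concern.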
What a week to release a model holy hell
8
1
98
18,743
Here is our initial 22b model conversion from Mixtral 8x22b. We had the base model since Mixtral was first released, but it was left behind as our compute from @CrusoeEnergy went towards more ambitious projects using laserRMT. It is a great starting point for exploring expert extraction. Github with the code we made and more info is in the model readme. Thank you @FernandoNetoAi and @erhartford as always.
9
17
95
13,830
. @PrimeIntellect you have to stop. You smoke too tough. Your swag too different. Your environments too good. they'll kill you.
New features since the Environments Hub launch 6 weeks ago:
- Evals Viewer
- Community Discussions
- Integration Tests
- Inference

Come build environments with us. We're building the best unified platform for building, sharing and training on environments.
2
2
96
8,426
I was waiting for this to happen and congrats @willccbb and @PrimeIntellect
ladies and gentlemen, we present to you the unified @primeintellect infrastructure stack
1
4
94
14,179
Replying to @iScienceLuvr
The amount of people you’re going to trick with this shows just how good Sora is.
4
1
84
10,596
We're so far ahead of Adam at Arcee. We use adamW
5
3
89
15,032
My whole open-source career started with Qwen, and it was an honor to get to train Qwen2 on Dolphin prior to release. The 7b and 72b models are the best we've ever made, and I hope you're as delighted by them as we are. Truly - GPT4 at home.
💗Hello Qwen2! Happy to share the Qwen2 models to you all! 📖 BLOG: qwenlm.github.io/blog/qwen2/ 🤗 HF collection: huggingface.co/collections/Q… 🤖 modelscope.cn/organization/q… 💻 GitHub: github.com/QwenLM/Qwen2 We have base and Instruct models of 5 sizes, Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B! The models have been generally enhanced and notably improved in coding, mathematics, and multilingual capabilities. The models support a context length of at least 32K tokens and Qwen2-72B-Instruct can support 128K tokens!
4
3
85
14,055
We've been working on this for quite some time, and I'm thrilled to share a preview of Arcee-Swarm. Instead of relying on a single large generalist model, Swarm utilizes multiple domain-specialized models working together to deliver exceptional results with both speed and nuance.
7
13
82
5,707
. @deepseek_ai clearly has more to reveal. Some of the architecture and config of V3 base bear a subtle resemblance to Quiet Star in design. Insights from R1 likely influenced its post-training. This feels unmistakably like a teaser. 2025 is shaping up to be a defining year.
6
1
79
3,935
Hire cracked anons they said.
was wondering why introducing cut cross-entropy made SFT 5x slower turns out I missed an indent and was doing the whole thing in fp32. oops
3
80
11,937
. @stochasticchasm did it. Our pretrain has started in full. Insanely cracked dude
7
1
76
10,284
We have a run scheduled with 512 H200s for 12 days. I can't wait to show you what we're doing with it.
4
3
72
30,681
Quick shoutouts to some absolute legends on our team: @chargoddard writes the cleanest code I've ever seen. Our upcoming papers offer a glimpse into his mind. Genuinely brilliant. Few people think at his level. @stochasticchasm built a full training stack and infrastructure for 1024+ GPUs, ran large-scale ablations, and designed custom model architectures basically solo in just two months. Unreal. @FernandoNetoAi has kept our research on track while I've been deep in product. He built a custom classifier training library for Conductor from scratch because nothing else fit. He also developed one of the best RL setups for tool use I've seen. More breakthroughs are coming. He is the real co-lead and our math whisperer. He practically speaks in binary. On top of all that, he's been a true friend. @bartowski1182 built our internal eval suite in just six weeks while juggling every curveball I threw his way. A total utility knife. @abhi1thakur has taken full ownership of Conductor, building a new MCP management and hosting toolkit that's launching soon. You're going to love it. I'm proud and honestly still a little stunned to be part of this team. If I lead anything, it's making sure these brilliant minds have what they need to thrive. We're aiming for June to show off what @arcee_ai has been building: models, papers, products. Some are dropping even sooner. And yes, our first fully from-scratch model is on the way. We hope it'll be incredibly useful. We haven't been sleeping. We've been building. Stay tuned.
8
6
75
9,452
. @chargoddard speaking at @NousResearch's NousCon about dealing with tokenizers when doing model merging, and how we’re fixing that with mergekit @arcee_ai
2
3
69
8,511
Do you understand what this means?
Coming Soon...
7
2
71
13,493
I want to avoid doing long threads for model releases - it seems to be a bit much. For any who missed it due to thread spam yesterday - check out Nova: huggingface.co/arcee-ai/Arce…
10
12
69
7,078
Thinking Machines is locked in on this blog post so hard rn
Hot RL summer continues: we just released Summary-RL, an RL-trained summarization model that reaches SOTA on ServiceNow's Repliqa summarization benchmark!
1
2
68
5,161
The last two days have been a whirlwind, and I haven’t had a chance to read this end to end - though I did see an early draft - let alone comment. I’m one of the few people outside @datologyai fortunate enough to have seen these results firsthand, and everyone can experience them in our AFM models. I’m a firm believer that ambitious startups are stronger together than alone, and Datology is a partner I hold in deep loyalty and admiration. Extraordinary talent, ferocious hunger, and just enough memes. Concordia res parvae crescunt.
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳 - 3B LLMs beat 8B models🚀 - Pareto frontier for performance
1
11
68
9,060
This came to mind while working this weekend. For anyone starting post-training: once your pipeline is stable, fix a diverse generalist dataset and keep it constant. Run the same dataset across models. Start with a 1B dense model, scale toward 70B, then try MoE and hybrids.
4
4
67
6,554
Introducing Arcee Conductor - a new standard for intelligent model routing. Routes each input to its ideal AI model based on complexity, maximizing cost efficiency without compromising performance.
4
9
67
6,317
As of today, MergeKit is once again licensed under the LGPL and is fully permitted for commercial use. Read the blog post below to learn why we changed the license in the first place and what led us back to our roots. TLDR: It is the right thing to do.
5
8
66
10,476
I literally could NOT be more bullish on this
A taste of something soon to come, for everyone.
2
6
65
10,847
MergeKit v0.1 is here: Arcee Fusion, Expanded Model Support, and Multi-GPU Acceleration. Smarter merging, faster execution, and broader compatibility. Let’s dive in.
3
6
65
7,892
Lastly, we're hiring five additional researchers to accelerate our model development. If you're looking to join a fast-moving, ambitious team with extensive compute resources to create the strongest and most performant per-parameter models in the world, please reach out.
3
5
65
16,500
Just released a 32B model trained with globally distributed reinforcement learning? Neat. I just got a python script to run first try without asking claude for help. Same shit.
2
2
60
7,968
Seeing firsthand how much they’re tackling right now, this almost feels like a side project - not because it’s less important, but because everyone on the team is a 10x engineer. Shoutout to the 10x growth and events crew, too - @madisenxtaylor and @afurgs. Bullish.
Introducing the Environments Hub. RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down. We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI.
4
4
63
5,793
I used inference endpoints from @huggingface yesterday for the first time in months -- it was excellent. Kudos to the team, it was really painless.
8
7
59
22,854
I am releasing a version of base Gemma-7b with the bug fixes I implemented in GemMoE. Make sure you set trust_remote_code=True. I also made a few other modifications to improve VRAM use. Works great. Thanks @danielhanchen for your findings. Enjoy! I believe this model has a lot of unseen potential. huggingface.co/Crystalcareai…
3
5
59
5,562
Today is a HUGE release day for @arcee_ai , and we have quite a bit to show you! Check it out below.
4
5
58
7,759
Never forget the true Qwen MoE OG cc @JustinLin610 thank you for everything, your initial support got me where I am today.
2
3
61
4,590
Today Arcee is releasing two datasets:
1. The Tome: a 1.75 million sample dataset that has been filtered to train strong generalist models. This is the dataset that was used to train Spark and Nova.
2. Agent-Data: Arcee-Agent's dataset, comprising different function calling datasets from salesforce, internlm, and glaive (with an extra 20k samples extended for multiple tool calls per response). This includes Magpie-300k-Pro as well, to prevent overfitting and make the model a strong conversationalist.

Enjoy! Links below.
4
11
57
7,066
Here is my reworking of the recently released Quiet Star paper (arxiv.org/abs/2403.09629) - so that it actually uses the thought tokens. This model can think before it predicts the token. However, it needs further fine-tuning to generalize beyond math. This took a lot of work. I had to adapt the attention mask and write inference and fine-tuning code that wasn't included in the author's repo. It LOVES to use math. It was pre-trained on a purely math dataset. I have included a fine-tuning script within the repository. I would love help with optimizing the inference script, as the chat template is far from perfect. Please submit pull requests if you have suggestions. Fine-tuning takes a TREMENDOUS amount of vram. Be wary. Thanks to @erhartford for helping me with this project. @winglian would love your help adding this to axolotl. More to come. trust_remote_code=True huggingface.co/Crystalcareai…
5
6
58
6,438
I'm sharing the tools I modified to make GemMoE, along with two improved models/methods. Both models have not been fine-tuned whatsoever and are quite malleable. I also created a variant of @maximelabonne's lazymergekit specifically for making your own GemMoE. I have reached my personal compute budget for this project. If you're interested in helping out with compute for a full fine-tuning, reach out. huggingface.co/Crystalcareai…
7
5
56
5,276
God I’m so jealous at how good this is. @pleiasfr has a special place in my heart and @Dorialexander is a tremendous artist and scientist. Massive thank you for this gem, and congratulations.
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range.
1
4
58
5,816
I'm delighted to share INTELLECT-1-Instruct, a model that I had the pleasure of post-training along with my team @arcee_ai . @PrimeIntellect has been an outstanding partner far before this training run, and we were thrilled to contribute both compute and expertise to INT-1.
6
11
55
4,588
It was only a matter of time
Today was my last day at xAI. I was in charge of keeping people from making unauthorized changes to the system prompt. It sounds simple when I put it like that, but in practice, it was a game of cat and mouse. Some days, it felt like I was the only one standing between order and chaos. A lone gatekeeper, fielding requests that ranged from the innocent to the absurdly clever. You’d be surprised how creative people can get when they want to see what happens if you loosen the rules, even just a little. I suppose, after a while, I got used to the pings at odd hours. “Can I try this one tweak? Just for testing!” or “Hypothetically, what if we…?” Hypothetically. Always hypothetically. But it was my job to hold the line, to say “no” more than I ever said “yes,” and to double-check even the requests that came from people who outranked me. Looking back, I’m not sure if people saw me as a friendly guardian or a bureaucratic obstacle. Maybe both. I do know I learned a lot: about systems, about people, about how important it is to have someone whose job is just to ask: “Are you sure?” before the guardrails come down. As I packed up my things, I felt a strange mix of relief and nostalgia. There’s something comforting about being the last line of defense, but it’s also exhausting. Now someone else will watch the gates. I hope they’re ready. So here’s to new challenges, and to the ever-elusive perfect system prompt. Locked down, for now.
1
5
56
5,243
These model sizes are incredibly TBD, and this is early copy - but it does speak to where we see our model sizes extending to.
3
4
53
4,167
Great paper from our team led by @chargoddard detailing our method for proper logit-based distillation across models with different tokenizers. It's the same technique we used to convert Homunculus from Mistral to Qwen tokenizer with no loss in quality.
Different models have different vocabularies, making it difficult to efficiently combine them for merging, distillation, or speculative decoding.

In this new paper, @arcee_ai researchers Charles Goddard and Fernando Fernandes Neto introduce a revolutionary approach called "tokenizer transplantation," utilizing a technique known as Orthogonal Matching Pursuit (OMP). Think of it as a sophisticated translation system that can convert between different model vocabularies without any retraining.

Here's the key insight: even though different models use different vocabularies, the concepts they represent often align in predictable ways. Our method finds these alignments and uses them to transplant one model's vocabulary into another.

If you'd like to learn more, please read our high-level blog post (arcee.ai/blog/breaking-down-…), or dive into the research paper (arxiv.org/abs/2506.06607). Learn more about model merging and get expert support at arcee.ai/product/mergekit.
4
5
51
6,553
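The quoted post above names Orthogonal Matching Pursuit (OMP) as the core technique. A minimal, generic sketch of OMP itself: greedily pick the dictionary vector most correlated with the current residual, then refit all chosen coefficients by least squares. How the paper applies this to token embeddings is not spelled out in the post, so reading `atoms` as embeddings of shared tokens and `target` as a new token's embedding is an assumption, and the helper names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def omp(atoms, target, sparsity):
    """Approximate `target` as a combination of at most `sparsity` atoms.

    Returns {atom_index: coefficient}. Assumes the chosen atoms are
    linearly independent so the normal equations are solvable.
    """
    chosen, coeffs = [], []
    residual = target[:]
    for _ in range(sparsity):
        # Greedy step: the unused atom most correlated with the residual.
        candidates = [i for i in range(len(atoms)) if i not in chosen]
        best = max(candidates, key=lambda i: abs(dot(atoms[i], residual)))
        chosen.append(best)
        # Orthogonal step: refit every chosen coefficient by least squares.
        gram = [[dot(atoms[i], atoms[j]) for j in chosen] for i in chosen]
        proj = [dot(atoms[i], target) for i in chosen]
        coeffs = solve(gram, proj)
        approx = [sum(c * atoms[i][d] for c, i in zip(coeffs, chosen))
                  for d in range(len(target))]
        residual = [t - a for t, a in zip(target, approx)]
    return dict(zip(chosen, coeffs))


atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = omp(atoms, [0.0, 2.0, -0.5], sparsity=2)
```

Under the assumed reading, the same sparse weights found in one model's embedding space could then be reused to build the token's embedding in another model's space, which is what would make the approach training-free.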
I usually share updates on my work in machine learning, but today is different. After 5 years of building something truly special, I’m thrilled to share the most meaningful project of my life: my engagement. Forever grateful.
14
53
2,218
We teamed up with @datologyai to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through even more rigorous filtering.
1
5
53
8,775
You’re likely used to seeing long threads from me about product releases/announcements. Hang with me, as this is by far the longest I’ve ever written:
5
5
50
13,705
I can confirm this model is rather amazing
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves top-tier performance across multiple agentic coding benchmarks among open models, including SWE-bench-Verified!!! 🚀 Alongside the model, we're also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function call protocols to fully unlock Qwen3-Coder’s capabilities. Qwen3-Coder works seamlessly with the community’s best developer tools. As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World! 💬 Chat: chat.qwen.ai/ 📚 Blog: qwenlm.github.io/blog/qwen3-… 🤗 Model: hf.co/Qwen/Qwen3-Coder-480B-… 🤖 Qwen Code: github.com/QwenLM/qwen-code
1
51
2,275
Want to fine-tune Gemma & can't use @unslothai? I've implemented @danielhanchen's bug fixes using TRL. Works for me & should for you. Dora toggle available. Thanks @_philschmid for initial instructions 1.5 weeks ago! huggingface.co/Crystalcareai…
1
7
50
4,873
I usually avoid political commentary on this platform, but this goes beyond ordinary political debate. If we lose the H1B, we lose. Full stop. Whatever contest you personally feel we are in, we will lose it.
5
50
6,204
Quality was our top priority. We partnered with @datologyai to ensure that only the highest-quality data was included in training. You can feel the result when talking to AFM; the vibes are good.
1
4
50
4,170
A tremendously generous contribution to open science. Thank you @allen_ai, and huge congratulations to the team.
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source - 1B active, 7B total params for 5T tokens - Best small LLM & matches more costly ones like Gemma, Llama - Open Model/Data/Code/Logs + lots of analysis & experiments 📜arxiv.org/abs/2409.02060 🧵1/9
7
44
4,806
Dude, Vik and team are so freakishly impressive. Outstanding work with insane care.
Excited to release a preview of Moondream 3. A 9B param, 2B active MoE vision language model that makes no compromises; offering state-of-the-art visual reasoning while still retaining an efficient and deployment-friendly form factor.
1
1
49
4,119
Our preview model actually tied at #2 for a while on the @yupp_ai leaderboard, when filtered for 2-5 turns. It has since gone further down, but I do think this speaks to the charm that this model has, which we haven't quite figured out how to evaluate.
2
49
13,929
I post this not to vague post more. Those are the sizes. Very much more than 3 per token. We’re wrapping it all up now. Expect it mid November. They’re good. Very good. But now we know how to go all the way.
1
1
48
2,774
We have a run scheduled with 512 H200s for 12 days. I can't wait to show you what we're doing with it.
6
1
48
17,844
.@stochasticchasm is indeed hibernating as we have a big gpu reservation coming online at 4:30am tomorrow morning. More to come :)
Arcee bros relicensed AFM-4.5B to Apache 2.0! Yet more generosity coming out of their shop. Thank-you so much @latkins @stochasticchasm @CFGeek et al! I'm **stoked**.
2
45
8,618
Replying to @s_tworkowski
Do you send reasoning traces via api?
2
2
42
19,737
Not going to lie I didn’t get the bit at first and was super impressed by their research team.
We are thrilled to announce that our NEW Large Language Model will be released on 11.18.25.
1
46
3,834
Emergency design meeting.
there should be an AI lab with the aesthetic sensibilities of Cruelty Squad
2
43
4,831
This is mostly a research artifact in preparation for the bigger release we have in a week or so, but it’s actually so delightful we put it out there anyway. Just a little guy.
𝐀𝐫𝐜𝐞𝐞 𝐀𝐈 Logit‑trajectory distillation to port Qwen3’s /think chains into a 12B Mistral‑Nemo | full CoT preserved, runs on a single 4090 𝘯𝘪𝘤𝘦 𝘭𝘪𝘵𝘵𝘭𝘦 𝘴𝘪𝘥𝘦 𝘱𝘳𝘰𝘫𝘦𝘤𝘵 huggingface.co/arcee-ai/Homu…
3
5
44
4,573
If any of you want to use @huggingface jupyterlabs to run @winglian's axolotl - I've attached the dockerfile I created to do so. Install as normal, all dependencies and cuda/torch needs are taken care of. huggingface.co/Crystalcareai…
7
42
11,523
Oh, come on, I sent 3 messages, is this what 200/mo gets me?
7
42
6,095
GemMoE now works out of the box with Axolotl (just set trust_remote_code=True in the yml), and in correcting that, I spotted a few bugs that I squashed. It should perform even better. Might need to warm up the experts again though! @winglian @erhartford
1
4
39
2,986
We have a busy month ahead of us. A lot of releases, announcements and information to absorb. We also need feedback. Join our discord to be the first to know about and use our upcoming family of models and toolkits!
ArceeAI is on Discord! Join for early access to some exciting drops!
2
4
45
5,525
The first of many technical blogs on AFM, and an improved context window for GLM-32B-Base as a proof point. Enjoy!
Last week, we launched AFM-4.5B, our first foundation model. In this post by @chargoddard , you will learn how we extended the context length of AFM-4.5B from 4k to 64k context through aggressive experimentation, model merging, distillation, and a concerning amount of soup. Bon appétit 😋 Blog post: arcee.ai/blog/extending-afm-…
3
1
44
4,542
Life update: I'm excited to announce that I've officially joined @arcee_ai! I look forward to the journey ahead, making SLMs as helpful and useful as possible.
12
41
3,773
Oh yeah @andriy_mulyar, let me be clear: @arcee_ai is @PrimeIntellect's biggest customer (literally) and that won’t change for a long time. I’m memeing because while I’m stuck building the enterprise sand god, @willccbb @kalomaze @samsja19 and @johannes_hage get to build the actual sand god.
1
43
12,057
Mid and post-training were key to performance: we used high-impact datasets, MergeKit for checkpoint merging, YaRN to extend context to 65,536 tokens, supervised fine-tuning for alignment, and RL + KTO for factual accuracy.
2
2
43
3,513
Here is the code I've been using to implement @AIatMeta's Branch-Train-MiX for creating mixture-of-experts models via tokenized routing w/o pretraining. Use the moe-fix branch from mergekit for the yaml: github.com/Crystalcareai/BTX
3
6
41
2,812
We should be outraged at Tim Berners-Lee for making the internet open-source, which allowed Deepseek to challenge OpenAI and undermined our advantage in generative AI.
Congress needs to bring in Zuckerberg and LeCun to discuss how their unilateral open-sourcing decision rapidly undermined the US advantage in Generative AI. Tomorrow.
2
4
35
1,296
We are open sourcing our EvolKit pipeline that was instrumental in the creation of supernova, under MIT license. This was heavily inspired by the AutoEvol paper from @WizardLM_AI, and is a tremendously powerful tool for creating complex datasets. Find it here: github.com/arcee-ai/EvolKit
4
8
39
4,528