Life-long learner, hacker, and thinker. Prev: PhD @Princeton, researcher @togethercompute @GoogleDeepMind @MSFTResearch @citsecurities.

Princeton, USA
Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. 🧵👇
28
205
1,159
482,017
LLMs can make their own tools just like humans🤖! Thrilled to share my intern work @Google. We introduced a closed-loop framework to let LLMs make and utilize reusable new tools🛠️ (implemented as programs). Paper: arxiv.org/abs/2305.17126 More details⬇️
22
170
927
212,007
How much value does your fine-tuning add? Believe it or not, just 1 bit 🤏 Thrilled to unveil BitDelta, a super simple yet effective method for compressing fine-tuning deltas into a single bit while barely touching performance. This approach slashes storage and GPU memory demands by over 10x for numerous fine-tunes on the same base model—a frequent practice, and enables efficient multi-tenant serving.
5
100
547
104,090
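To make the "1 bit" idea concrete, here is a minimal sketch of compressing a fine-tuning delta to one sign bit per weight plus a single shared scale. The single-scale-per-matrix choice, the function names, and the toy weights are all assumptions for illustration; the post itself only states that deltas are compressed to 1 bit.

```python
# Sketch only: 1-bit delta compression with one shared scale (an assumption,
# not a statement of how the released method is implemented).

def compress_delta(base, finetuned):
    """Return (signs, scale) approximating the delta finetuned - base."""
    deltas = [f - b for b, f in zip(base, finetuned)]
    # One bit per weight: the sign of the delta (zero is treated as positive).
    signs = [1 if d >= 0 else -1 for d in deltas]
    # For fixed signs, the mean absolute delta is the scale with the lowest L2 error.
    scale = sum(abs(d) for d in deltas) / len(deltas)
    return signs, scale

def reconstruct(base, signs, scale):
    """Rebuild approximate fine-tuned weights from the base and the 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]

# Toy weights, made up for illustration.
base = [0.50, -1.20, 0.30, 2.00]
finetuned = [0.52, -1.23, 0.31, 1.96]

signs, scale = compress_delta(base, finetuned)
approx = reconstruct(base, signs, scale)
print(signs, round(scale, 4), [round(a, 4) for a in approx])
```

In a multi-tenant setup, the base weights would be stored once and each fine-tune would only add its sign bits and scale, which is where the storage saving would come from.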
Excited to see Meta's new paper! Been pondering this exact idea for the past 6 months but never had the resources to verify it properly - and now it's confirmed! Here are some thoughts and ideas I've been exploring that weren't covered in the paper: The key insight, from my perspective, is that tokens are essentially byte groups that a small byte-level language model can predict with high confidence. When the model's confidence drops, we need to pack these byte groups into tokens for a larger model to comprehend and make predictions. This can be viewed as a form of speculative decoding without rejection, where the large model doesn't need to process the entire KV cache from the draft model - it just uses cross attention to summarize each KV group. Following this line of thinking, one fascinating direction I've been considering is a hierarchical version: tokens are confident byte groups, phrases are confident token groups, and sentences are confident phrase groups... And o1 is essentially a beam search over a certain level of this hierarchy 😄 Another observation: this approach should be particularly effective for video generation, where many continuous video segments should be predictable by a small model, with only key frames requiring a larger model's predictions. Lots of interesting directions to explore here - if you're curious to discuss further, feel free to DM me! ai.meta.com/research/publica…
9
57
535
38,665
Llama 3: Better data is all you need
14
100
503
71,774
Exciting times with the new Mixtral model from @MistralAI! It’s evident that they’ve fine-tuned the Mistral 7B model to an impressive 8x. The significant correlation between the weights of the two models is a testament to the successful reuse of models. This approach could empower the OSS community with its own robust MoE! P.S. Shoutout to the dedicated folks already making strides in this area, including @far__el. Here’s to hoping we see open-source quality akin to GPT-4 soon!
13
52
456
221,667
Just wrote a script to further investigate how the corpus used to train the gpt4o tokenizer is polluted by Internet scams. The results are quite interesting... 🤦‍♂️🤦‍♂️🤦‍♂️ gist.github.com/ctlllll/4451…
"why was the gpt-4o demo so horny?"
42
99
446
308,663
Transformer is a GOOD graph learner when incorporating suitable graph structural information! Thrilled to share our new work: Transformer + simple but effective encodings = GNN achieving new SOTA on OGB datasets, ZINC! Paper: arxiv.org/abs/2106.05234 Code: github.com/microsoft/Graphor…
6
65
373
Just finished reading the Gemini 1.5 report and I'm blown away by the depth of information shared in such a competitive environment! 🤯 Most surprising was the revelation about their optimizer - they didn't just use Adam! Optimization is still alive and kicking! Kudos to the team for their great work! 👏
6
41
348
73,727
Just grasped the true significance (not just bc it's submitted by Wenfeng) of this work after reading @SonglinYang4 's explanation. The breakthrough isn't hybrid attention (studied years ago), but the ingenious kernel that delivers real-world speedups for dynamic sparse attention. As someone who worked on efficient transformers in undergrad, I had the impression that combining "efficient attentions" (linear, sparse, conv, block-structured), which theoretically would be faster, had the potential to replace full attention but was practically slower. But Deepseek's solution is different: By having each query group of a token attend to the same KV block, they can really reduce the memory movement and achieve FlashAttention-like memory efficiency. This matters enormously for reasoning models that output long thinking processes (10k+ tokens). The efficient dynamic sparse kernel dramatically speeds up both training and inference for such models. What a brilliant example of algorithm-system co-design!
🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs—without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning. 📖 For more details, check out our paper here: arxiv.org/abs/2502.11089
3
39
340
43,406
o1's chain of thought contains a lot of verbal expressions like 'Hmm', 'But how?', etc. Are they using lecture recordings to train this model...
19
11
312
34,643
While Gemini is the talk of the town, let’s not overlook Google’s simultaneous release: the TPU v5p. Interestingly, Google appears to prioritize the enhancement of HBM bandwidth over FLOPS, with a 2.3x increase for bandwidth and 1.7x for bf16 FLOPS compared to TPU v4. In contrast, the H100 boosts FLOPS by 3x and bandwidth by 2x compared to the A100. Given the lengthy cycle of hardware development and the fact that these architectural decisions were made years ago, could this be interpreted as Google’s strategic bet on the future? Or is it merely a case of my overfitting? When it comes to LLM inference, it seems bandwidth is the priority over FLOPS.
5
41
309
47,487
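The comparison above boils down to one ratio: how the FLOPs-per-byte balance point moves from one hardware generation to the next. A quick sketch using only the scaling factors quoted in the post (the helper name is made up):

```python
# Generation-over-generation scaling factors quoted in the post above.
chips = {
    "TPU v4 -> v5p": {"flops": 1.7, "bandwidth": 2.3},
    "A100 -> H100": {"flops": 3.0, "bandwidth": 2.0},
}

def intensity_shift(flops_gain, bandwidth_gain):
    """Factor by which the FLOPs-per-byte balance point moves between generations."""
    return flops_gain / bandwidth_gain

shifts = {
    name: intensity_shift(gains["flops"], gains["bandwidth"])
    for name, gains in chips.items()
}

for name, shift in shifts.items():
    direction = "more bandwidth per FLOP" if shift < 1 else "more FLOPs per byte"
    print(f"{name}: balance point x{shift:.2f} ({direction})")
```

A factor below 1 means the newer chip needs less arithmetic per byte moved to stay busy, which favors memory-bound workloads such as small-batch decoding.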
Exciting news for those who want to experiment with Mixture of Experts (MoE) models but find training and fine-tuning too expensive! With @myshell_ai, we are thrilled to introduce JetMoE, a Llama-2-level model trained for under $0.1 million. With 8B total and 2.2B active parameters, JetMoE can be tuned using most academic GPUs while delivering competitive performance. More⬇️
9
50
307
46,110
Toy project⬇️ LLM evaluation is extremely tricky:
- Human annotation is slow and costly 😩
- LLM judges partially solve the speed problem, but are still expensive 💸
- Classic benchmarks are usually not very informative 🤷‍♂️
Is there a way to get informative evaluation results without human or LLM judges? 💡 Motivated by this question, I tested the (Spearman) correlation between the benchmarks from the Open LLM Leaderboard by @huggingface 🤗 and the Elo score from Chatbot Arena by @lmsysorg. Interestingly, many humanities benchmarks from the MMLU dataset by @hendrycks 📊 show a high correlation (>83%) with the Elo score, while benchmarks like college mathematics, truthfulqa, and abstract algebra seem to be less informative 😕 (advanced math is useless for pleasing human judges lol 😆).
One step further, I used LASSO regression to select the most informative benchmarks for the Elo score. The result suggests that US foreign policy, sociology, high school US history, marketing (?!), high school psychology, and high school government and politics 🗳️ are the most informative benchmarks for the Elo score. Combining these benchmarks achieves a 94% Spearman correlation with the Elo score (on a hold-out test set)! Code is available: github.com/ctlllll/understan…
Remark: This is merely a 2-hour toy project meant to help me understand the benchmarks; comments and more insights are more than welcome!
Acknowledgements:
- I remember someone on Twitter ran a similar test and concluded that MMLU is the most informative benchmark. I couldn't find the post, but feel free to reply and I'll tag it :)
- I'm super grateful for the efforts to build better evaluation in the open-source community; more attention should be paid to this area.
- The project is supported by the @myshell_ai open-source grant.
13
29
248
59,814
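For readers unfamiliar with the metric used above, here is a small self-contained sketch of Spearman correlation (Pearson correlation computed on ranks). The scores are made up for five hypothetical models and are not real leaderboard or Arena data; the linked repo may well compute this differently.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Made-up numbers for five hypothetical models (illustration only).
benchmark_acc = [0.41, 0.55, 0.62, 0.70, 0.66]
arena_elo = [980, 1050, 1130, 1180, 1120]

rho = spearman(benchmark_acc, arena_elo)
print(f"Spearman correlation: {rho:.2f}")
```

Because only the ordering matters, a benchmark can correlate highly with Elo even when its raw scores live on a completely different scale.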
If training's got you in a stew, take a REST and speed right through! 😎 Thrilled to introduce Retrieval-Based Speculative Decoding (REST), a plug-and-play method for accelerating language model decoding. 👇
5
32
210
59,590
Life update: Following my recent graduation, I've joined the Bytedance Seed Edge team to pursue this research direction further. Although this post was written last year, my conviction in this approach has only strengthened (many ideas here echo compelling recent writings from the legendary Rich Sutton and Shunyu, such as the need for rewards to help models evolve beyond the classic finite-context learning paradigm). In the near term, my focus will be on making individual agents evolvable, with the next phase involving connecting and scaling these evolvable agents. I'm incredibly excited about the potential achievements in this direction and welcome connections, discussions, or collaboration. I'll also be attending ICLR next week; please send a DM if you'd like to chat. Welcome to the second half of AGI 😉
6
8
211
21,256
Everyone is talking about the incredible LLM inference speed that @GroqInc chips achieve, but few notice the cost, especially if you want to replace your GPU/TPU stack with it. Long story short, it requires hundreds of chips to serve a single LLM since each chip only has ~200MB of SRAM. Wondering how they manage to offer such a low price for their public API... A few good refs: nitter.app/DZhang50/status/… nitter.app/felix_red_panda/… nitter.app/cHHillee/status/…
Before people sell all their GPUs to go buy Groq hardware, I'd recommend answering two questions: 1. What is the cost of the system you're purchasing? 2. How many users can you serve at 500 tok/s+? Hint: Very high, and not many
13
29
193
55,200
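The "hundreds of chips" claim follows from simple division. A back-of-the-envelope sketch, using the ~200MB-per-chip figure from the post; the model sizes and precisions below are hypothetical, and the estimate ignores activations and KV cache, so it is only a lower bound:

```python
SRAM_PER_CHIP_BYTES = 200_000_000  # ~200MB per chip, as quoted in the post above

def chips_needed(num_params, bytes_per_param):
    """Lower bound on chips needed just to hold the weights in on-chip SRAM."""
    total_bytes = num_params * bytes_per_param
    # Integer ceiling division.
    return (total_bytes + SRAM_PER_CHIP_BYTES - 1) // SRAM_PER_CHIP_BYTES

# Hypothetical model sizes and precisions, chosen only for illustration.
for num_params in (7_000_000_000, 70_000_000_000):
    for bytes_per_param in (1, 2):
        n = chips_needed(num_params, bytes_per_param)
        print(f"{num_params // 10**9}B params @ {bytes_per_param} byte(s)/param: >= {n} chips")
```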
What Makes Convolutional Models Great on Long Sequence Modeling? Thrilled to share our new findings on what basic principles can make *global* convolutional models (like the incredible S4 model) super powerful in sequence modeling! arxiv.org/abs/2210.09298 🧵🧵🧵
7
30
176
Wow, a GPT4-level model cheaper than Claude Haiku! We do have a large room to improve in terms of inference efficiency 😝
Replying to @deepseek_ai
> Chat with DeepSeek-V2: chat.deepseek.com > Access pay-as-you-go DeepSeek-V2 APIs with unbeatable price: platform.deepseek.com > DeepSeek-V2 is fully open-source and free for commercial use: huggingface.co/deepseek-ai
3
22
178
36,663
Wow, Medusa can be used for pre-training and leads to a better and faster generation! 😍
Meta presents Better & Faster Large Language Models via Multi-token Prediction - training language models to predict multiple future tokens at once results in higher sample efficiency - up to 3x faster at inference arxiv.org/abs/2404.19737
4
13
178
32,247
Since the launch of Medusa, we’ve been thrilled to see its adoption in TensorRT, TGI, and numerous open-source projects and companies. Today, we’re unveiling a technical report with fresh features! This includes the Medusa-2 recipe for full-model tuning, self-distillation for integrating Medusa into any fine-tuned LLM, and more acceleration techniques. The latest results reveal a 2.2-3.6x speed boost over the original model across various LLMs 🚀 Report: arxiv.org/abs/2401.10774 Code: github.com/FasterDecoding/Me… Highlights of the new features: - Medusa-2 Training Recipe: While training the entire model with Medusa heads can potentially enhance their prediction ability and speed, naively adding Medusa heads can distort the pre-trained model’s learned features, impacting performance. To address this, we’ve designed a learning schedule inspired by @ananyaku’s LPFT paper. The resulting recipe supports training next-gen models with built-in inference acceleration without compromising performance. - Self-Distillation: Training Medusa heads typically requires access to the specific instruction-tuning dataset the model uses, but this data isn't always available. Furthermore, post-RLHF models often exhibit a significantly altered output distribution compared to the original training data. These factors underscore the need for a dataset-independent method in Medusa training. Our solution is a self-distillation technique, where the model itself generates the training dataset, effectively resolving this challenge. Shout out to my excellent collaborators @yli3521, @ZhengyangGeng, @Hongwu_Peng, @jasondeanlee Deming, and @tri_dao! Acknowledgement: Thanks @narsilou, @joao_gante for the TGI integration, and Kaiyu Xie, Eddie Wang, and Xiaowei Shi for the TensorRT LLM integration. Thanks @togethercompute, @myshell_ai, @chai_research for sponsoring compute, API credits, and open-source bounty.
2
30
156
38,490
As an RL newbie I came across a very similar idea recently and was shocked to see that this natural idea (using the loss reduction over the ground truth answer when adding CoT before it) was (to my limited RL knowledge) only covered by the following and another very recent paper (arxiv.org/abs/2503.19618) lol. Did I miss any literature on this, or does it simply not work well?
Underrated paper and idea on using RL losses on non-verifiable domains, in this case the perplexity of the next chapter of a book.
12
11
149
35,666
Medusa revisits an underrated gem from the "Blockwise Parallel Decoding" paper, which dates back to the invention of the Transformer model: rather than pulling in an entirely new draft model to predict subsequent tokens, why not simply extend the original model itself? This is where the "Medusa heads" come in. These additional decoding heads seamlessly integrate with the original model, producing blocks of tokens at each generative juncture.
4
23
147
32,156
This work couldn't have been done without amazing collaborators @yli3521 @ZhengyangGeng @Hongwu_Peng and @tri_dao Blog: sites.google.com/view/medusa… Code: github.com/FasterDecoding/Me… Special thanks to @togethercompute for their generous support for this project! together.ai/blog/medusa This is just the beginning. We are looking forward to extending Medusa to broader settings and welcome everyone's contributions!
11
18
137
20,649
Happy to see OpenAI finally open-sourcing 📢. I ran a quick probe on GPT-OSS-20B—160 high-temperature completions prompted by the nine most common English words—and two things stood out: 1. ≈ 85-90 % of output is programming-related. 2. Within that slice, the model leans hard on code explanations, tutorials, and technical docs. That pattern looks a lot like “Textbooks Are All You Need”-style curation—heavy instructional corpora, less narrative variety. My probing is preliminary (single-token prompts, 16 samples each) and loosely inspired by membership-inference attacks, but even this shallow dive already shows a strong dev-centric training bias. Full notebook, outputs, and some simple analysis here → github.com/ctlllll/gpt-oss-r…
4
8
142
18,708
How interesting...
🚀 Day 2 of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference. ✅ Efficient and optimized all-to-all communication ✅ Both intranode and internode support with NVLink and RDMA ✅ High-throughput kernels for training and inference prefilling ✅ Low-latency kernels for inference decoding ✅ Native FP8 dispatch support ✅ Flexible GPU resource control for computation-communication overlapping 🔗 GitHub: github.com/deepseek-ai/DeepE…
1
6
116
25,793
Weekend reflection: The end-game (?) of LLM serving. I've been thinking about the state of LLM serving recently, and here is what I think it could look like in the near future, given a latency requirement.
For large players with high traffic volumes:
1. Ensure batches are large enough to avoid being memory-bound.
2. Use Tensor Parallelism (TP) to cut down latency.
3. Adopt micro-pipelining to further minimize communication overhead in TP.
Result: Decoding costs approach those of prefilling.
For smaller providers whose query rate doesn't fully utilize the GPUs:
1. Still opt for large batch sizes to maximize efficiency.
2. Leverage speculative decoding to squeeze the available FLOPs.
3. Consider tensor parallelism and micro-pipelining if the latency requirement is not met.
No guarantee that my understanding is legit, please correct me if I'm wrong🙏
3
12
111
38,837
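The "large enough batch" point above can be sketched with a simple roofline-style estimate: a decode step is limited either by reading the weights once or by the matmul FLOPs, whichever takes longer. The hardware numbers and model size below are round hypothetical values, not any specific chip or model, and the estimate ignores KV-cache traffic and communication.

```python
def decode_step_time(num_params, batch_size, bytes_per_param, peak_flops, mem_bandwidth):
    """Rough time for one decode step: max of weight-read time and compute time."""
    mem_time = num_params * bytes_per_param / mem_bandwidth   # weights are read once per step
    compute_time = 2 * num_params * batch_size / peak_flops   # ~2 FLOPs per parameter per token
    return max(mem_time, compute_time)

def critical_batch_size(bytes_per_param, peak_flops, mem_bandwidth):
    """Batch size at which a decode step stops being memory-bound."""
    return peak_flops * bytes_per_param / (2 * mem_bandwidth)

# Hypothetical accelerator and model (illustration only).
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 2e12   # bytes/s
NUM_PARAMS = 7e9
BYTES_PER_PARAM = 2

batch_star = critical_batch_size(BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)
print(f"critical batch size ~ {batch_star:.0f}")
for batch in (1, 100, 1000):
    t = decode_step_time(NUM_PARAMS, batch, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch {batch:>4}: {t * 1e3:.1f} ms per step")
```

Below the critical batch size, adding sequences to the batch is nearly free in this model, which is also the slack that speculative decoding tries to use when traffic is too low to fill the batch.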
Remember what we saw in the Gemini 1.5 report? 😏
@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI
2
11
111
17,283
Subpopulation shift is a ubiquitous component of natural distribution shift. We propose a general theoretical framework of learning under subpopulation shift based on label propagation. And our insights can help to improve domain adaptation algorithms. arxiv.org/abs/2102.11203
1
14
103
Want to accelerate your Graph NN training? Not satisfied with the acceleration by normalization scheme from other domains? Come and try our GraphNorm, normalization specially designed for GNNs with new insights on GNN optimization! arxiv.org/abs/2009.03294 github.com/lsj2408/GraphNorm
1
20
103
We've heard many complaints about the high GPU memory requirements for training Medusa heads on models with large vocabularies. This is no longer an issue, thanks to the amazing Liger kernel developed by the LinkedIn team (@hsu_byron and team). The Liger kernel cleverly fuses the kernels and avoids the need to materialize full logits, significantly reducing memory usage and improving speed. This breakthrough allows researchers and developers to work with larger models more efficiently. Check out this example of training Medusa heads with Liger: github.com/linkedin/Liger-Ke…. With num_head = 5, stage = 1, Liger Kernel can scale to 8K seq len, while HF can only do 1K, with +30% throughput.
2
16
100
12,198
Sparsity is a profound idea for enhancing the efficiency of large models (e.g., MoE). Our new research demonstrates that we can achieve high activation sparsity (finer granularity than MoE) without additional training, thereby significantly boosting inference efficiency.
Your LLM may be sparser than you thought! Excited to announce TEAL, a simple training-free method that achieves up to 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of up to 1.53x-1.8x!
11
97
11,994
Want to learn more about how to compress your fine-tuning deltas into merely 1 bit? -- Our blog is out! fasterdecoding.github.io/Bit… together.ai/blog/bitdelta pli.princeton.edu/blog/2024/… Thx @togethercompute @PrincetonPLI for cross-posting :) PS: the attached demo is included in our GitHub repo.
1
18
89
22,786
LLM inference is inefficient because of its memory-bound nature. Speculative decoding cleverly uses an additional lightweight draft model to predict a few tokens ahead, allowing the large model to process them together. This reduces the number of memory accesses and thus speeds up the inference. See nitter.app/karpathy/status/… for a nice explanation from legendary @karpathy
Speculative execution for LLMs is an excellent inference-time optimization. It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might think). This unintuitive fact is because sampling is heavily memory bound: most of the "work" is not doing compute, it is reading in the weights of the transformer from VRAM into on-chip cache for processing. So if you're going to do all that work of reading in all those weights, you might as well apply them to a whole batch of input vectors. I went into more detail in an earlier thread: nitter.app/karpathy/status/… The reason we can't naively use this fact to sample in chunks of K tokens at a time is that every N-th token depends on what token we sample at step N-1. There is a serial dependency, so the baseline implementation just goes one by one left to right. Now the clever idea is to use a small and cheap draft model to first generate a candidate sequence of K tokens - a "draft". Then we feed all of these together through the big model in a batch. This is almost as fast as feeding in just one token, per the above. Then we go from left to right over the logits predicted by the model and sample tokens. Any sample that agrees with the draft allows us to immediately skip forward to the next token. If there is a disagreement then we throw the draft away and eat the cost of doing some throwaway work (sampling the draft and the forward passing for all the later tokens). The reason this works in practice is that most of the time the draft tokens get accepted, because they are easy, so even a much smaller draft model gets them. As these easy tokens get accepted, we skip through those parts in leaps. The hard tokens where the big model disagrees "fall back" to original speed, but actually a bit slower because of all the extra work.
So TLDR: this one weird trick works because LLMs are memory bound at inference time, in the "batch size 1" setting of sampling a single sequence of interest, that a large fraction of "local LLM" use cases fall into. And because most tokens are "easy". References arxiv.org/abs/2302.01318 arxiv.org/abs/1811.03115 arxiv.org/abs/2211.17192
2
7
95
11,596
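The draft-then-verify loop described above can be sketched in a few lines. This is a greedy toy version under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the big and small models (each maps a token list to the next token), and the per-position calls to `target_next` stand in for what a real model would compute in one batched forward pass.

```python
def greedy_decode(target_next, prompt, num_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1) Draft k tokens cheaply, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) One (conceptually batched) target pass over all k + 1 positions.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest prefix where draft and target agree,
        #    then append the target's own token at the first disagreement.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(draft[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:goal], target_passes

# Toy "models": the target counts upward mod 10; the draft is wrong after a 2.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10

baseline = greedy_decode(target_next, [0], 12)
output, passes = speculative_decode(target_next, draft_next, [0], 12, k=4)
print(output == baseline, passes)
```

The output matches plain greedy decoding token for token; the only thing that changes is how many times the expensive model has to be invoked.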
🔥 Fired up after today's OpenAI release, I'm thrilled to share that I'll be joining the VisionPro team @Apple as a summer intern next year! Looking forward to diving into the future of XR+AI. (hopefully I can make VisionPro lighter, below is one of my proposals) 🌟
1
89
12,652
My first CVPR 🤗
2
1
90
16,967
The SGConv paper from my last memorable internship @MSFTResearch has been accepted by ICLR, and I will start a new journey @GoogleAI as a part-time student researcher next week! Ready to welcome the Lunar New Year🤗
1
7
80
15,391
Just flew back home after a 10+ hour flight. Sharing some in-flight thoughts on AI ✈️ Q: What assumptions should we make when imagining future AI applications? TL;DR: 1. Model inference will become nearly free 2. AI will solve most objectively verifiable problems 3. Personal AI models will diverge on subjective issues We often take many things for granted because we're so accustomed to them. For instance, we assume phones can easily connect to the Internet, or that we can get real-time traffic updates while driving. But these were once groundbreaking innovations. If we can enter this state of familiarity early, or make the right assumptions, we might be able to gain an advantage in future developments. Here are three assumptions I believe will eventually prove true in AI progress. I hope they provide some inspiration and welcome discussion: 1. Model inference will become nearly free The cost reduction of AI models is following a deterministic trajectory. We've already witnessed more than a 10x reduction in inference costs over the past few months, and this trend is set to continue as we refine our techniques in both model training and inference. Think about how web search evolved. Initially, online searches were limited and sometimes charged per query. Now, we can perform countless Google searches daily without direct costs. This shift enabled new business models like search advertising. Similarly, as AI inference costs plummet, we'll see a proliferation of AI-powered applications and innovative business models. 2. AI will solve most objectively verifiable problems This might sound like a bold claim, but it's logically sound. The key logic is that "verifying a problem is simpler than solving it." As long as we have initial models with sufficient verification abilities, they should be able to bootstrap themselves to become more powerful (the AlphaGo series is a good example of this). 
Of course, realizing this simple logic is no easy task, but hopefully, it will become a reality in the near future. 3. Personal AI models will diverge on subjective issues For problems where there isn't (and shouldn't be) a consensus, AI models will likely diverge. As the effective context of models grows longer and has more long-term impact on outputs, models will eventually develop unique "opinions." Imagine having a personal AI assistant that, over time, develops a deep understanding of your unique worldview, tastes, and values. When asked about the best way to spend a weekend, one person's AI might suggest an action-packed outdoor adventure, while another's might recommend a quiet day of reading and reflection - each perfectly tailored to their user's preferences. Looking forward to more "assumptions"!🤔💬
2
5
75
10,687
Interesting to see that the output price is 5x the input price, which seems to suggest that latency is a big issue for those larger models, so only a smaller batch size can be used. If this is the case, speculative decoding should be pretty helpful?
8
8
72
13,129
Unlike a separate draft model, Medusa heads can be trained in conjunction with the original model, which remains frozen during training. This method allows for fine-tuning large models on a single GPU, taking advantage of the powerful base model's learned representations. On their own, Medusa heads don't quite double decoding speed. But here's the twist: When we pair them with a tree-based attention mechanism, we can verify several candidates generated by Medusa heads in parallel. Together, these two techniques allow us to achieve a 2x speedup over the original model.
3
3
72
9,631
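One common way to verify several candidate continuations in a single pass is an attention mask in which each candidate token can see only itself and its ancestors in the candidate tree. The sketch below builds such a mask; the tree layout and the `parents` encoding are hypothetical, and this is not claimed to be the exact layout used in the Medusa code.

```python
def tree_attention_mask(parents):
    """Build a boolean mask for a candidate tree.

    parents[i] is the index of node i's parent, or -1 for a root.
    mask[i][j] is True when node i may attend to node j (itself or an ancestor).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

# Hypothetical tree: two guesses for the next token (nodes 0 and 1),
# each followed by two guesses for the token after it (nodes 2-5).
parents = [-1, -1, 0, 0, 1, 1]
mask = tree_attention_mask(parents)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In a real model every candidate would also attend to the already-accepted prefix; the mask only keeps sibling branches from seeing each other, so all branches can share one forward pass.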
Free idea: This story suggests that we may need to use the SFT (chat) corpus to build the tokenizer rather than relying on the pretraining corpus alone🤣 @OpenAI
Just wrote a script to further investigate how the corpus used to train the gpt4o tokenizer is polluted by Internet scams. The results are quite interesting... 🤦‍♂️🤦‍♂️🤦‍♂️ gist.github.com/ctlllll/4451…
7
3
64
19,170
Replying to @inflectionAI
Wonder where the FLOPs number comes from...
6
58
18,129
Just delved into Google's report on Gemma2. Initial takeaway: many tricks focus on training stability, particularly suitable for low-precision scenarios, e.g., logit soft-capping and sandwich layer normalization. Does this hint at int8 training being crucial? 🤔 Link: goo.gle/gemma2report
1
5
59
9,620
I'll be at COLM next week. DM if you want to chat (about inference efficiency, new architecture ideas, reasoning, startups, or anything LLM) 😜
2
1
60
7,265
LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. The vLLM project is a must-read for this direction, and now they have just released the paper🤩
PagedAttention's paper is out! Check it out to learn more!
1
6
61
8,738
Three papers got accepted by #NeurIPS2021. Excited but also introspective: is getting accepted less important nowadays, given 2600+ papers in the conference? Only good work leaves a footprint. Still, glad to share our work, and thanks to my amazing collaborators :)
1
57
DBRX sounds like last year... The field just evolves so fast🤪
🚀🌟🚀Excited to announce Samba-CoE v0.2, which outperforms DBRX by @DbrxMosaicAI and @databricks, Mixtral-8x7B from @MistralAI, and Grok-1 by @grok at a breakneck speed of 330 tokens/s. These breakthrough speeds were achieved without sacrificing precision and only on 8 sockets, showcasing the true capabilities of dataflow! Why would you buy 576 sockets and go to 8 bits when you can run using 16 bits and just 8 sockets? Try out the model and check out the speed here - coe-1.cloud.snova.ai/. We are also providing a sneak peek of our next model, Samba-CoE v0.3, available soon with our partners at @LeptonAI. Read more about this announcement at sambanova.ai/blog/accurate-m…
4
5
56
12,361
It's fascinating to see how OpenAI decomposes complicated tasks by examining the author list. 👀 As the field matures, most of the work is focused on building large systems and ensuring they work together seamlessly. openai.com/gpt-4o-contributi…
3
5
55
12,519
I'll be at NeurIPS next week (Dec 11-15). Please DM me if you would like to chat about anything related to large models, especially efficient inference😋
1
53
8,360
JetMoE: 4 folks w/ $0.1M ship a (near-?)SOTA 8B MoE model within 1 month (with the help of @myshell_ai) 🤪
Mistral: ~30 folks w/ ~$500M ship a near-frontier model in nine months. Reka: ~25 folks w/ $60M ship a frontier model in four months. Zyphra: 7 folks w/ $11M (none a frontier lab alum) ship a near-SOTA 7B within a couple months. The frontier is shaping up to be pretty crowded!
9
52
9,757
However, introducing the draft model also brings in a lot of pain. First, it's non-trivial to find a good (both fast and accurate) draft model (as @ggerganov mentioned in nitter.app/ggerganov/status…). Also, the draft model adds extra complexity to the pipeline, making it harder to implement and a disaster for distributed inference.
Replying to @ggerganov
Meta should have released a couple of (1B and 3B) drafter models with the Code Llama release. Is it too late for them to train them, or do we have to wait for v2 🤔
1
2
51
16,013
Heading to NeurIPS next Tuesday! 🎯 Been deep into evolvable agents, adaptive computation, and system-model codesign lately, but down to geek out about any cool AI stuff. Also open to startup convos (consulting/investing). Hit me up if you wanna connect! 🤝
1
2
50
4,818
Thx for sharing!
SnapKV LLM Knows What You are Looking for Before Generation Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV
3
8
45
9,815
Super excited to announce our Graphormer wins FIRST place in the OGB Large-Scale Challenge PCQM4M-LSC at KDD Cup 2021 @kdd_news !! Attention is all you need even in graph representation learning? Yes, and no. Transformer + graph structural encodings = powerful GNN!
2
8
42
Thank you for your nice words, Gemini :)
2
1
43
7,284
Open-source 30T corpus🤩
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama-d…
1
4
39
9,076
Do existing pruning methods really exploit the information in the data? Do the architectures of the pruned networks really matter for performance? We propose sanity checks on pruning methods and find that a large share of existing methods rely on neither! arxiv.org/abs/2009.11094
2
7
39
Will be at ICML next week!! Quite a last-minute decision, but I really miss the in-person conference :) Looking forward to meeting old&new friends here!! Pls ping me if u want to meet up :p
1
39
Many have speculated about Ilya's hidden alpha for the next scaling axis. Here's my proposal: the number of evolvable agents to more efficiently extend the frontier of AI-generated new knowledge. Could this align with some of Ilya's ideas? 🤔 My article on this: nitter.app/tianle_cai/status/1848…
Yes, we'll be scaling compute use, not data or model. But how? "Agents?" I feel Ilya is hiding alpha, his specific answer to his own "scaling what?" puzzle. My guess: compute per hard token in training. Branching reasoning chains, then distilling into latent multi-hop reasoning.
2
32
8,573
When will @grok be open-sourced?🧐
3
35
8,791
Superunaligned😰
2
36
9,611
Separating the two phases of LLM generation helps understand their computation characteristics and introduces optimization opportunities. Glad to see an excellent explanation of an underestimated optimization on this. PS: SplitFuse from @MSFTDeepSpeed might also be of interest.
Most OSS inference frameworks combine pre-filling (prompt tokens) with decoding (generation tokens) per device/process. But separating the stages should give you better perf! (1/14)
3
37
8,916
Nice figure, but it seems a little misleading: Memory wall refers to the trend that memory bandwidth (instead of the required memory or model size) grows slower than computation, which causes the LLM inference to be memory-bound. Here is an in-depth explanation by Claude 3 Opus: "The image shows the compute and memory capacity trends for AI models over time, with the "Memory Wall" line representing the memory limitation compared to compute capacity. However, the concept of the "Memory Wall" is misrepresented here. The "Memory Wall" typically refers to the growing disparity between processor speeds and memory speeds, where memory access times have not kept pace with the rapid increases in processor performance. This can lead to a bottleneck where the processor is idle, waiting for data from memory. In the context of AI models, the chart seems to suggest that the "Memory Wall" is a hard limit on the memory capacity of these models. However, this is not an accurate representation of the memory wall problem. The memory wall issue is more about the speed and bandwidth of memory access relative to processing power, rather than a hard limit on total memory capacity. While there are certainly limitations to the amount of memory that can be physically packaged with a processor, the chart's depiction of the memory wall as a linear limit on total memory capacity is misleading. In reality, techniques such as caching, prefetching, and memory hierarchy design can help mitigate the impact of the memory wall to some extent, even as the disparity between processor and memory speeds continues to grow."
This is what NVIDIA's Blackwell GPUs solve for
Transformer sizes: 240x/2 years
GPU memory: 2x/2 years
What the memory wall means:
3
3
35
8,957
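A back-of-the-envelope sketch of why decoding is memory-bound, following the bandwidth reading of the memory wall above. All hardware and model numbers are illustrative assumptions, not the specs of any real GPU, and the estimate ignores the KV cache and activations: each decode step reads every weight once and spends roughly 2 FLOPs per parameter per sequence.

```python
def decode_step_times(n_params: float, bytes_per_param: float,
                      peak_flops: float, mem_bandwidth: float,
                      batch_size: int = 1) -> tuple:
    """Return (compute_seconds, memory_seconds) for one decode step."""
    flops = 2.0 * n_params * batch_size        # one multiply-add per weight per sequence
    bytes_moved = n_params * bytes_per_param   # weights are read once, shared by the batch
    return flops / peak_flops, bytes_moved / mem_bandwidth


# Assumed, illustrative numbers: 7B parameters in fp16, 1e15 FLOP/s, 3e12 B/s.
N_PARAMS, BYTES, PEAK, BANDWIDTH = 7e9, 2, 1e15, 3e12

compute_t, memory_t = decode_step_times(N_PARAMS, BYTES, PEAK, BANDWIDTH)
print(f"batch 1: compute {compute_t * 1e3:.3f} ms, memory {memory_t * 1e3:.3f} ms")

big_compute_t, big_memory_t = decode_step_times(N_PARAMS, BYTES, PEAK, BANDWIDTH,
                                                batch_size=1024)
print(f"batch 1024: compute {big_compute_t * 1e3:.3f} ms, memory {big_memory_t * 1e3:.3f} ms")
```

Under these assumptions the weight traffic dominates at batch size 1, and only a large batch pushes the step toward being compute-bound. That is the sense in which bandwidth, rather than capacity, is the wall.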
Glad to see Medusa keep advancing cutting-edge models and go beyond language 😍
New ultra-fast ‘multi-head’ speech recognition model drop from @_aiOla, beats OpenAI Whisper. Officially dubbed Whisper-Medusa, the model builds on Whisper but uses a novel “multi-head attention” architecture that predicts far more tokens at a time So they seem to have added more attention heads on top of whisper. They claim the same accuracy but 50% faster. Their demo does one text in 1.9s while "baseline" whisper does it in 4s. Code and weights opensource under MIT. they have started with a 10-head model but will soon expand to a larger 20-head version capable of predicting 20 tokens at a time, leading to faster recognition and transcription without any loss of accuracy.
1
6
35
5,856
Our ICML paper (arxiv.org/abs/2102.11203) provided theoretical evidence on the effectiveness of consistency-based methods like AdaMatch, FixMatch, etc. with empirical results for the adaptation of FixMatch on domain adaptation tasks : p
This video explains AdaMatch! 🐳 • Interesting unification of Semi-Supervised Learning and Domain Adaptation • Extensions to FixMatch: BatchNorm use, Confidence threshold, Distribution Alignment • Results on DomainNet piped.video/ORufPOY8H14 #100DaysOfMLCode
1
15
33
Replying to @capetorch
The reason it may look strange is that people are focusing on decoding speed. But an even larger proportion of computation is spent on processing the input which is way faster than 300 tok/sec. I made the same mistake before and luckily it was corrected by @jiayq and friends @LeptonAI.
3
35
10,958
Thank you to the incredible team @LeptonAI for harnessing the potential of Medusa in practical LLM applications and offering a variety of finely-tuned Medusa heads to the community!
Medusa is probably one of the most elegant accelerated inference solutions we have seen over the last year. It runs complementary to other numerical ones (like int8/fp8, compilation etc) and gives something around ~2x performance gain in practice. (1/x) huggingface.co/papers/2401.1…
2
34
7,011
Glad to see Medusa helps the💯B Command R+ model @cohere run in💯 tok/sec 😍
Text Generation Inference v2.0.0 is the fastest open-source implementation of Cohere Command R+! Command R+ is the best open-weights model. Leveraging the power of Medusa heads, TGI achieves unparalleled speeds with a latency as low as 9ms per token for a 104B model!
1
2
36
7,574
Thrilled to share our new work on certifying L_inf adversarial robustness! New architecture with inherent robustness and a tailored training pipeline. Achieving SOTA performance on several benchmarks! paper: arxiv.org/abs/2102.05363 code: github.com/zbh2047/L_inf-dis…
1
1
34
Low precision seems to be the way to keep Moore's law going?
6
31
7,886
Wait, I don't need to learn CUDA now? 😍
🚀Introducing Mirage, a superoptimizer that automatically discovers highly-optimized GPU implementations for LLMs (and beyond). For certain attention operators, the fastest programs found by Mirage are 2x faster than existing expert-designed implementations such as FlashAttention and FlashDecoding. Code+Demo: github.com/mirage-project/mi… Paper: cs.cmu.edu/~zhihaoj2/papers/…
3
28
7,887
I'll be there tomorrow presenting our work on accelerating diffusion model inference with quality-preserving 4-bit quantization. Come say hi! 😉
🚀 How to run 12B FLUX.1 on your local laptop with 2-3× speedup? Come check out our #SVDQuant (#ICLR2025 Spotlight) poster session! 🎉 🗓️ When: Friday, Apr 25, 10–12:30 (Singapore time) 📍 Where: Hall 3 + Hall 2B, Poster 169 📌 Poster: tinyurl.com/poster-svdquant 🎮 Demo: svdquant.mit.edu 💻 Code: github.com/mit-han-lab/nunch… 🌐 Project: hanlab.mit.edu/projects/svdq… @syn7xavier @ZhekaiZhang @tianle_cai @xiuyu_l @jerry_gjx @xieenze_jr @chenlin_meng @junyanz89 @songhan_mit
2
28
2,802
JetMoE utilizes a novel MoE architecture (a variant of arxiv.org/abs/2306.04640) that splits both the MLP and attention layers into multiple experts, resulting in a highly efficient and performant model. Despite having only 2.2B active parameters, JetMoE outperforms Llama-2 7B and Llama 13B.
1
9
31
3,584
don't have enough time to dig deeper, but in high-dimension the expected corr between two random vectors should be ~0, so the corr in mixtral is quite significant I think
3
30
4,824
Interesting @xai release when people are waiting for their Sora generations 👀 "Aurora is an autoregressive mixture-of-experts network trained to predict the next token from interleaved text and image data." So does this mean it's natively multimodal? Also interesting to see they make autoregressive image generation work that well. x.ai/blog/grok-image-generat…
1
29
4,662
A simple architecture modification to make LLMs better in-context learners:
- Removes the sentence length constraint
- Linear scaling
- Invariant to demo permutation
An old course project from @danqi_chen 's amazing NLP class (shorturl.at/eALU2) recently got accepted into @ESFoMo .🔥
1
2
30
6,589
When using GPT-4 as Tool Maker and GPT-3.5 Turbo as Tool User, our method can achieve performance on par with GPT-4 with GPT-3.5 Turbo's low cost and fast speed. We validated our method on several complex reasoning tasks from BigBench and real-world applications.
2
3
28
4,886
Thanks Nicolas and the HF team for boosting the open-source community! Glad to see Medusa helps to deliver blazingly fast inference! PR has been merged, you can try it yourself ;)
Mixtral at 142 tok/s (latency)... Medusa models can really speed up inference, today we're releasing models & upgrades to the training script so you can adapt on your own finetune huggingface.co/text-generati… Training recipe: github.com/FasterDecoding/Me… TGI:latest works out of the box
1
2
27
13,605
Some clarifications:
- The correlation here is the cosine similarity of the attention *weights*, not the intermediate hidden representation.
- The correlation is significant because, in high dimensions, the expected correlation of two random vectors should be around zero (verified in the figure, which shows the correlation of a pair of weight matrices between Mixtral and Llama-2 7B).
Wrote on the flight to NOLA👀 Ping me if you want to chat more 👀
Exciting times with the new Mixtral model from @MistralAI! It’s evident that they’ve fine-tuned the Mistral 7B model to an impressive 8x. The significant correlation between the weights of the two models is a testament to the successful reuse of models. This approach could empower the OSS community with its own robust MoE! P.S. Shoutout to the dedicated folks already making strides in this area, including @far__el. Here’s to hoping we see open-source quality akin to GPT-4 soon!
2
2
26
6,765
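The "expected correlation around zero" point can be checked with a small standard-library sketch. The dimensions, trial counts, and Gaussian sampling are assumptions chosen for illustration; this does not touch any real model weights.

```python
import math
import random


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, trials, rng):
    """Average |cosine similarity| between pairs of independent Gaussian vectors."""
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine_similarity(u, v))
    return total / trials


rng = random.Random(0)
low_dim = mean_abs_cosine(dim=2, trials=200, rng=rng)
high_dim = mean_abs_cosine(dim=4096, trials=20, rng=rng)
print(f"dim=2: {low_dim:.3f}  dim=4096: {high_dim:.3f}")
```

For independent random vectors the typical magnitude shrinks roughly like 1/sqrt(dim), so a clearly nonzero similarity between flattened weight matrices is unlikely to be a coincidence.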
Diffusion models are compute intensive and we use 4-bit quantization (over both weights and activations!) to unlock 4x more FLOPs with 4-bit TensorCore -> 4x memory reduction and 3x speedup 🤩 With B200's 9P 4-bit FLOPs (4.5x H100 fp16), the future will be even more exciting 😏
🚀 The 4-bit era has arrived! Meet #SVDQuant, our new W4A4 quantization paradigm for diffusion models. Now, 12B FLUX can run on a 16GB 4090 laptop without offloading—with 3x speedups over W4A16 models (like NF4) while maintaining top-tier image quality.  #AI #Quantization. 1/7
5
23
4,126
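To make the "4-bit, 4x memory reduction" arithmetic concrete, here is a minimal sketch of plain symmetric round-to-nearest 4-bit quantization. This is not the SVDQuant algorithm, only the basic idea it builds on; the function names and sample values are hypothetical.

```python
def quantize_int4(values):
    """Map floats to signed 4-bit codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    return [c * scale for c in codes]


weights = [0.9, -0.35, 0.02, -1.4, 0.7]   # made-up values for illustration
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# 16-bit floats -> 4-bit codes: 4x fewer bits per value (ignoring the shared scale).
compression = 16 / 4
print(codes, f"max error {max_error:.4f}", f"{compression:.0f}x smaller")
```

The rounding error per value is bounded by half the scale, which is why a single large outlier hurts: it inflates the scale for every other value in the group.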
As expected, ~0 haha
25
2,512
Honored that Graphormer was highlighted by @satyanadella at Microsoft Ignite 2023🤩🤩
1
1
24
3,123
8x22b👀👀
magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%https://nitter.app/t.co/OdtBUsbeV5%3A1337%2Fannounce
1
22
3,384
The release of Llama2 is a big moment for OSS! 🔥 After a skim through the tech report, I found several secret sauces behind the success: - Better data. - More efficient (architecture for) inference. And more👇 ai.meta.com/research/publica…
1
2
23
5,798
Excited to have my second in-person #NeurIPS three years after the first one! Can't wait to meet old&new friends here🔥Feel free to ping me if you would like to chat on ML methodologies (Transformers, learning paradigms, efficient training etc.)
1
22
I have spoken 😏
I guess the crazy $200/month subscription is also for Sora access? Then it makes more sense, we will see 👀
1
22
2,120
Replying to @tianle_cai @Google
Tools can boost the productivity of LLMs, but what if there isn't a suitable tool?😰-- Let LLMs build their own🤩 We introduce "LLMs As Tool Maker": one LLM serves as Tool Maker👩🏻‍🎓 to make new tools🔨, another LLM serves as Tool User👨🏻‍🔧 to solve new problems with the tool.
1
20
5,873
So, what exactly is dynamic caching (in today's Apple release)? How does it compare to NV GPUs? I got super confused by Apple's presentation here. Sounds like some remarkable technical improvement, but the description is just so vague...
3
1
20
4,452
Would like to see 1) smaller-scale dense models runnable on edge devices and 2) larger-scale MoE models trained with this 15T corpus. Those might be more useful than the current architecture choices...
Llama 3: Better data is all you need
1
20
3,214
Amazing progress on long-context LM!
Very proud to introduce two of our recent long-context works: HELMET (best long-context benchmark imo): shorturl.at/JnBHD ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): shorturl.at/XQV7a Here is a story of how we arrived there
1
19
2,877
Congrats Greg!
2
858
I guess the crazy $200/month subscription is also for Sora access? Then it makes more sense, we will see 👀
1
17
3,780
Besides the similarity between Mixtral 8x7B and Mistral 7B, here are some other sanity checks I would like to try if I weren't on my way to NeurIPS✈️:
- The performance when using a fixed expert.
- Similarity between different experts.
- Performance drop after quantization.
Exciting times with the new Mixtral model from @MistralAI! It’s evident that they’ve fine-tuned the Mistral 7B model to an impressive 8x. The significant correlation between the weights of the two models is a testament to the successful reuse of models. This approach could empower the OSS community with its own robust MoE! P.S. Shoutout to the dedicated folks already making strides in this area, including @far__el. Here’s to hoping we see open-source quality akin to GPT-4 soon!
1
1
18
3,481
Joint work with the amazing team at Google Deepmind: Xuezhi, @tengyuma , @xinyun_chen_ , @denny_zhou
1
18
3,554
Replying to @bozavlado
Seems like they're trying to keep it low-key, but the phrase "higher-order preconditioned method" immediately brought to my mind the incredible paper "Scalable Second Order Optimization for Deep Learning" by @_arohan_ et al. arxiv.org/abs/2002.09018
2
4
18
1,949