Tianle Cai · Sep 11, 2023 · 4:34 PM UTC

Tianle Cai

Pinned Tweet

Tianle Cai

@tianle_cai

11 Sep 2023

Ever wanted to make your LLM inference go brrrrr but got stuck implementing speculative decoding and finding a suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. 🧵👇

205

1,159

482,017

Tianle Cai · May 29, 2023 · 1:05 AM UTC

Tianle Cai

@tianle_cai

29 May 2023

LLMs can make their own tools just like humans🤖! Thrilled to share my intern work @Google. We introduced a closed-loop framework to let LLMs make and utilize reusable new tools🛠️ (implemented as programs). Paper: arxiv.org/abs/2305.17126 More details⬇️

170

927

212,007

Tianle Cai · Feb 16, 2024 · 3:15 AM UTC

Tianle Cai

@tianle_cai

16 Feb 2024

How much value does your fine-tuning add? Believe it or not, just 1 bit 🤏 Thrilled to unveil BitDelta, a super simple yet effective method for compressing fine-tuning deltas into a single bit while barely touching performance. This approach slashes storage and GPU memory demands by over 10x for numerous fine-tunes on the same base model—a frequent practice, and enables efficient multi-tenant serving.

100

547

104,090
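A minimal sketch of the idea in the BitDelta post above. It rests on an assumption the tweet does not spell out: each weight delta is replaced by its sign times one shared scale, taken here as the mean absolute delta. The function names and toy weights are hypothetical, and the actual method may differ in details such as how the scale is tuned.

```python
def compress_delta(base, fine):
    """Approximate (fine - base) with one sign bit per weight plus one shared scale."""
    if len(base) != len(fine) or not base:
        raise ValueError("base and fine must be non-empty and the same length")
    delta = [f - b for f, b in zip(fine, base)]
    # Assumption: the shared scale is the mean absolute delta.
    scale = sum(abs(d) for d in delta) / len(delta)
    signs = [1 if d >= 0 else -1 for d in delta]
    return signs, scale


def reconstruct(base, signs, scale):
    """Rebuild approximate fine-tuned weights from the base weights and the 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]


# Hypothetical toy weights, for illustration only.
base_weights = [0.0, 1.0, 2.0, 3.0]
fine_weights = [0.5, 0.5, 2.25, 2.75]
signs, scale = compress_delta(base_weights, fine_weights)
approx_weights = reconstruct(base_weights, signs, scale)
print(signs, scale, approx_weights)
```

Under this scheme, many fine-tunes of the same base model need only the shared base weights plus a sign vector and a scale each, which is where the storage saving would come from.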

Tianle Cai · Dec 13, 2024 · 6:10 PM UTC

Tianle Cai

@tianle_cai

13 Dec 2024

Excited to see Meta's new paper! Been pondering this exact idea for the past 6 months but never had the resources to verify it properly - and now it's confirmed! Here are some thoughts and ideas I've been exploring that weren't covered in the paper: The key insight, from my perspective, is that tokens are essentially byte groups that a small byte-level language model can predict with high confidence. When the model's confidence drops, we need to pack these byte groups into tokens for a larger model to comprehend and make predictions. This can be viewed as a form of speculative decoding without rejection, where the large model doesn't need to process the entire KV cache from the draft model - it just uses cross attention to summarize each KV group. Following this line of thinking, one fascinating direction I've been considering is a hierarchical version: tokens are confident byte groups, phrases are confident token groups, and sentences are confident phrase groups... And o1 is essentially a beam search over a certain level of this hierarchy 😄 Another observation: this approach should be particularly effective for video generation, where many continuous video segments should be predictable by a small model, with only key frames requiring a larger model's predictions. Lots of interesting directions to explore here - if you're curious to discuss further, feel free to DM me! ai.meta.com/research/publica…

535

38,665

Tianle Cai · Apr 18, 2024 · 4:16 PM UTC

Tianle Cai

@tianle_cai

18 Apr 2024

Llama 3: Better data is all you need

100

503

71,774

Tianle Cai · Dec 11, 2023 · 12:29 PM UTC

Tianle Cai

@tianle_cai

11 Dec 2023

Exciting times with the new Mixtral model from @MistralAI! It’s evident that they’ve fine-tuned the Mistral 7B model to an impressive 8x. The significant correlation between the weights of the two models is a testament to the successful reuse of models. This approach could empower the OSS community with its own robust MoE! P.S. Shoutout to the dedicated folks already making strides in this area, including @far__el. Here’s to hoping we see open-source quality akin to GPT-4 soon!

456

221,667

Tianle Cai · May 13, 2024 · 7:59 PM UTC

Tianle Cai

@tianle_cai

13 May 2024

Just wrote a script to further investigate how the corpus used to train the gpt4o tokenizer is polluted by Internet scams. The results are quite interesting... 🤦‍♂️🤦‍♂️🤦‍♂️ gist.github.com/ctlllll/4451…

main @main_horse

13 May 2024

"why was the gpt-4o demo so horny?"

446

308,663

Tianle Cai · Jun 12, 2021 · 4:28 PM UTC

Tianle Cai

@tianle_cai

12 Jun 2021

Transformer is a GOOD graph learner when incorporating suitable graph structural information! Thrilled to share our new work: Transformer + simple but effective encodings = GNN achieving new SOTA on OGB datasets, ZINC! Paper: arxiv.org/abs/2106.05234 Code: github.com/microsoft/Graphor…

373

Tianle Cai · May 20, 2024 · 2:53 AM UTC

Tianle Cai

@tianle_cai

20 May 2024

Just finished reading the Gemini 1.5 report and I'm blown away by the depth of information shared in such a competitive environment! 🤯 Most surprising was the revelation about their optimizer - they didn't just use Adam! Optimization is still alive and kicking! Kudos to the team for their great work! 👏

348

73,727

Tianle Cai · Feb 20, 2025 · 4:14 PM UTC

Tianle Cai

@tianle_cai

20 Feb 2025

Just grasped the true significance (not just bc it's submitted by Wenfeng) of this work after reading @SonglinYang4 's explanation. The breakthrough isn't hybrid attention (studied years ago), but the ingenious kernel that delivers real-world speedups for dynamic sparse attention. As someone who worked on efficient transformers in undergrad, I had the impression that combining "efficient attentions" (linear, sparse, conv, block-structured), which theoretically would be faster, had the potential to replace full attention but was practically slower. But DeepSeek's solution is different: By having each query group of a token attend to the same KV block, they can really reduce the memory movement and achieve FlashAttention-like memory efficiency. This matters enormously for reasoning models that output long thinking processes (10k+ tokens). The efficient dynamic sparse kernel dramatically speeds up both training and inference for such models. What a brilliant example of algorithm-system co-design!

DeepSeek

@deepseek_ai

18 Feb 2025

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs—without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning. 📖 For more details, check out our paper here: arxiv.org/abs/2502.11089

340

43,406

Tianle Cai · Sep 12, 2024 · 5:32 PM UTC

Tianle Cai

@tianle_cai

12 Sep 2024

o1's chain of thought contains a lot of verbal expressions like 'Hmm', 'But how?', etc. Are they using lecture recordings to train this model...

312

34,643

Tianle Cai · Dec 6, 2023 · 5:19 PM UTC

Tianle Cai

@tianle_cai

6 Dec 2023

While Gemini is the talk of the town, let’s not overlook Google’s simultaneous release: the TPU v5p. Interestingly, Google appears to prioritize the enhancement of HBM bandwidth over FLOPS, with a 2.3x increase for bandwidth and 1.7x for bf16 FLOPS compared to TPU v4. In contrast, the H100 boosts FLOPS by 3x and bandwidth by 2x compared to the A100. Given the lengthy cycle of hardware development and the fact that these architectural decisions were made years ago, could this be interpreted as Google’s strategic bet on the future? Or is it merely a case of my overfitting? When it comes to LLM inference, it seems bandwidth is the priority over FLOPS.

309

47,487
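The comparison in the post above comes down to one ratio: how much peak compute each generation gains relative to memory bandwidth. The sketch below uses only the four multipliers quoted in the tweet; the function name is hypothetical.

```python
def flops_per_byte_change(flops_gain, bandwidth_gain):
    """Relative change in peak FLOPS available per byte of HBM bandwidth."""
    return flops_gain / bandwidth_gain


# Generation-over-generation multipliers as quoted in the post.
tpu = flops_per_byte_change(flops_gain=1.7, bandwidth_gain=2.3)  # TPU v4 -> TPU v5p
gpu = flops_per_byte_change(flops_gain=3.0, bandwidth_gain=2.0)  # A100 -> H100

print(f"TPU v4 -> v5p: {tpu:.2f}x FLOPS per byte of bandwidth")
print(f"A100 -> H100:  {gpu:.2f}x FLOPS per byte of bandwidth")
```

A value below 1 means bandwidth grew faster than compute, which favors memory-bound workloads such as small-batch LLM decoding; a value above 1 means the opposite.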

Tianle Cai · Apr 4, 2024 · 4:25 PM UTC

Tianle Cai

@tianle_cai

4 Apr 2024

Exciting news for those who want to experiment with Mixture of Experts (MoE) models but find training and fine-tuning too expensive! With @myshell_ai, we are thrilled to introduce JetMoE, a Llama-2-level model trained for under $0.1 million. With 8B total and 2.2B active parameters, JetMoE can be tuned using most academic GPUs while delivering competitive performance. More⬇️

307

46,110

Tianle Cai · Jan 11, 2024 · 1:47 AM UTC

Tianle Cai

@tianle_cai

11 Jan 2024

Toy project⬇️

LLM evaluation is extremely tricky:
- Human annotation is slow and costly 😩
- LLM judges partially solve the speed problem, but are still expensive 💸
- Classic benchmarks are usually not very informative 🤷‍♂️

Is there a way to get informative evaluation results without human or LLM judges? 💡

Motivated by this question, I tested the (Spearman) correlation between the benchmarks from the Open LLM Leaderboard by @huggingface 🤗 and the Elo score from Chatbot Arena by @lmsysorg. Interestingly, many humanities benchmarks from the MMLU dataset by @hendrycks 📊 show a high correlation (>83%) with the Elo score, while benchmarks like college mathematics, truthfulqa, and abstract algebra seem to be less informative 😕 (advanced math is useless for pleasing human judges lol 😆).

One step further, I used LASSO regression to select the most informative benchmarks for the Elo score. The result suggests that US foreign policy, sociology, high school US history, marketing (?!), high school psychology, and high school government and politics 🗳️ are the most informative. Combined, these benchmarks achieve a 94% Spearman correlation with the Elo score (on a hold-out test set)!

Code is available: github.com/ctlllll/understan…

Remark: This is merely a 2-hour toy project meant to help me understand the benchmarks; comments and more insights are more than welcome!

Acknowledgements:
- I remember someone on Twitter ran a similar test and concluded that MMLU is the most informative benchmark. I couldn't find the post, but feel free to reply and I'll tag it :)
- I'm super grateful for the open-source community's efforts to build better evaluation; more attention should be paid to this area.
- The project is supported by the @myshell_ai open-source grant.

248

59,814
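For readers unfamiliar with the Spearman correlation used in the post above, here is a small dependency-free sketch: rank both lists (ties share the mean rank), then take the Pearson correlation of the ranks. The scores below are made-up placeholders, not leaderboard data, and this is not the code from the linked repository.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model numbers, for illustration only.
benchmark_accuracy = [0.45, 0.62, 0.58, 0.71, 0.80]
arena_elo = [1010, 1100, 1085, 1120, 1250]
print(spearman(benchmark_accuracy, arena_elo))
```

Because only ranks matter, a benchmark can correlate highly with Elo even when the two scales are unrelated, which is why it suits comparing accuracy against Arena ratings.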

Tianle Cai · Nov 15, 2023 · 3:04 AM UTC

Tianle Cai

@tianle_cai

15 Nov 2023

If training's got you in a stew, take a REST and speed right through! 😎 Thrilled to introduce Retrieval-Based Speculative Decoding (REST), a plug-and-play method for accelerating language model decoding. 👇

210

59,590
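The post above does not describe how REST builds its drafts, so the following is only a guess at the general shape of retrieval-based drafting: look up the longest suffix of the current context in a token datastore and propose whatever followed it there as the draft. All names, parameters, and the toy corpus are hypothetical.

```python
def retrieve_draft(datastore, context, max_suffix=3, draft_len=3):
    """Return up to draft_len tokens that followed the longest matching context suffix."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        # Stop early enough that at least one continuation token exists.
        for start in range(len(datastore) - n):
            if datastore[start:start + n] == suffix:
                return datastore[start + n:start + n + draft_len]
    return []  # no match: fall back to ordinary decoding


# Hypothetical toy corpus and context.
datastore = "the cat sat on the mat and the cat ran".split()
context = "I saw the cat".split()
draft = retrieve_draft(datastore, context)
print(draft)
```

The draft would then be verified by the large model in a single pass, as in ordinary speculative decoding, so no draft model has to be trained.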

Tianle Cai · Apr 17, 2025 · 2:44 PM UTC

Tianle Cai

@tianle_cai

17 Apr 2025

Life update: Following my recent graduation, I've joined the Bytedance Seed Edge team to pursue this research direction further. Although this post was written last year, my conviction in this approach has only strengthened (many ideas here echo compelling recent writings from the legendary Rich Sutton and Shunyu, such as the need for rewards to help models evolve beyond the classic finite-context learning paradigm). In the near term, my focus will be on making individual agents evolvable, with the next phase involving connecting and scaling these evolvable agents. I'm incredibly excited about the potential achievements in this direction and welcome connections, discussions, or collaboration. I'll also be attending ICLR next week; please send a DM if you'd like to chat. Welcome to the second half of AGI 😉

Tianle Cai

@tianle_cai

20 Oct 2024

x.com/i/article/184807255972…

The Missing Part of AGI

This is a record of my (Tianle's) recent thoughts about the current state of AI development. TL;DR: One plausible path to achieving Artificial General Intelligence (AGI) is to allow AIs to evolve

211

21,256

Tianle Cai · Feb 20, 2024 · 3:21 AM UTC

Tianle Cai

@tianle_cai

20 Feb 2024

Everyone is talking about the incredible LLM inference speed that @GroqInc chips achieve, but few notice the cost, especially if you want to replace your GPU/TPU stack with them. Long story short, it requires hundreds of chips to serve a single LLM since each chip has only ~200MB of SRAM. Wondering how they manage to offer such a low price for their public API... A few good refs: nitter.app/DZhang50/status/… nitter.app/felix_red_panda/… nitter.app/cHHillee/status/…

Horace He

@cHHillee

19 Feb 2024

Before people sell all their GPUs to go buy Groq hardware, I'd recommend answering two questions: 1. What is the cost of the system you're purchasing? 2. How many users can you serve at 500 tok/s+? Hint: Very high, and not many

193

55,200
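The "hundreds of chips" claim in the post above follows from simple division. The sketch below takes the ~200MB SRAM figure from the tweet; the model size and precision are hypothetical inputs, and the estimate ignores KV cache, activations, and any replication, so it is a lower bound at best.

```python
def chips_needed(num_params, bytes_per_param, sram_bytes_per_chip):
    """Minimum number of chips needed to hold the weights alone in on-chip SRAM."""
    weight_bytes = num_params * bytes_per_param
    return -(-weight_bytes // sram_bytes_per_chip)  # ceiling division on integers


SRAM_PER_CHIP = 200_000_000  # ~200MB, as quoted in the post

# Hypothetical example: a 70B-parameter model stored at 2 bytes per parameter.
chips = chips_needed(70_000_000_000, 2, SRAM_PER_CHIP)
print(chips)
```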

Tianle Cai · Oct 18, 2022 · 2:07 AM UTC

Tianle Cai

@tianle_cai

18 Oct 2022

What Makes Convolutional Models Great on Long Sequence Modeling? Thrilled to share our new findings on what basic principles can make *global* convolutional models (like the incredible S4 model) super powerful in sequence modeling! arxiv.org/abs/2210.09298 🧵🧵🧵

176

Tianle Cai · May 6, 2024 · 5:00 PM UTC

Tianle Cai

@tianle_cai

6 May 2024

Wow, a GPT4-level model cheaper than Claude Haiku! We do have a lot of room to improve in terms of inference efficiency 😝

DeepSeek

@deepseek_ai

6 May 2024

Replying to @deepseek_ai

> Chat with DeepSeek-V2: chat.deepseek.com > Access pay-as-you-go DeepSeek-V2 APIs with unbeatable price: platform.deepseek.com > DeepSeek-V2 is fully open-source and free for commercial use: huggingface.co/deepseek-ai

178

36,663

Tianle Cai · May 1, 2024 · 2:40 AM UTC

Tianle Cai

@tianle_cai

1 May 2024

Wow, Medusa can be used for pre-training and leads to better and faster generation! 😍

Aran Komatsuzaki

@arankomatsuzaki

1 May 2024

Meta presents Better & Faster Large Language Models via Multi-token Prediction - training language models to predict multiple future tokens at once results in higher sample efficiency - up to 3x faster at inference arxiv.org/abs/2404.19737

178

32,247

Tianle Cai · Jan 22, 2024 · 3:23 PM UTC

Tianle Cai

@tianle_cai

22 Jan 2024

Since the launch of Medusa, we’ve been thrilled to see its adoption in TensorRT, TGI, and numerous open-source projects and companies. Today, we’re unveiling a technical report with fresh features! This includes the Medusa-2 recipe for full-model tuning, self-distillation for integrating Medusa into any fine-tuned LLM, and more acceleration techniques. The latest results reveal a 2.2-3.6x speed boost over the original model across various LLMs 🚀

Report: arxiv.org/abs/2401.10774 Code: github.com/FasterDecoding/Me…

Highlights of the new features:
- Medusa-2 Training Recipe: While training the entire model with Medusa heads can potentially enhance their prediction ability and speed, naively adding Medusa heads can distort the pre-trained model’s learned features, impacting performance. To address this, we’ve designed a learning schedule inspired by @ananyaku’s LPFT paper. The resulting recipe supports training next-gen models with built-in inference acceleration without compromising performance.
- Self-Distillation: Training Medusa heads typically requires access to the specific instruction-tuning dataset the model uses, but this data isn't always available. Furthermore, post-RLHF models often exhibit a significantly altered output distribution compared to the original training data. These factors underscore the need for a dataset-independent method in Medusa training. Our solution is a self-distillation technique, where the model itself generates the training dataset, effectively resolving this challenge.

Shout out to my excellent collaborators @yli3521, @ZhengyangGeng, @Hongwu_Peng, @jasondeanlee, Deming, and @tri_dao!

Acknowledgement: Thanks @narsilou, @joao_gante for the TGI integration, and Kaiyu Xie, Eddie Wang, and Xiaowei Shi for the TensorRT LLM integration. Thanks @togethercompute, @myshell_ai, @chai_research for sponsoring compute, API credits, and open-source bounty.

156

38,490

Tianle Cai · Apr 8, 2025 · 3:55 PM UTC

Tianle Cai

@tianle_cai

8 Apr 2025

As an RL newbie I came across a very similar idea recently and was shocked to see that this natural idea (using the loss reduction over the ground truth answer when adding CoT before it) was (to my limited RL knowledge) only covered by the following and another very recent paper (arxiv.org/abs/2503.19618) lol. Did I miss any literature on this, or does it simply not work well?

Beyond Verifiable Rewards: Scaling Reinforcement Learning for...

We propose to scale RL to unverifiable data with a novel algorithm JEPO (Jensen's Evidence lower bound Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable...

arxiv.org

Nathan Lambert

@natolambert

8 Apr 2025

Underrated paper and idea on using RL losses on non-verifiable domains, in this case the perplexity of the next chapter of a book.

149

35,666

Tianle Cai · Sep 11, 2023 · 4:37 PM UTC

Tianle Cai

@tianle_cai

11 Sep 2023

Replying to @tianle_cai @ggerganov

Medusa revisits an underrated gem from the "Blockwise Parallel Decoding" paper, dating back to the invention of the Transformer model: rather than pulling in an entirely new draft model to predict subsequent tokens, why not simply extend the original model itself? This is where the "Medusa heads" come in. These additional decoding heads seamlessly integrate with the original model, producing blocks of tokens at each generative juncture.

147

32,156

Tianle Cai · Sep 11, 2023 · 4:41 PM UTC

Tianle Cai

@tianle_cai

11 Sep 2023

Replying to @tianle_cai @ggerganov

This work couldn't have been done without amazing collaborators @yli3521 @ZhengyangGeng @Hongwu_Peng and @tri_dao Blog: sites.google.com/view/medusa… Code: github.com/FasterDecoding/Me… Special thanks @togethercompute for their generous support for this project! together.ai/blog/medusa This is just the beginning. We are looking forward to extending Medusa to broader settings and welcome everyone's contributions!

Homepage

Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao (* Equal contribution)

sites.google.com

137

20,649

Tianle Cai · Aug 6, 2025 · 1:09 PM UTC

Tianle Cai

@tianle_cai

6 Aug 2025

Happy to see OpenAI finally open-sourcing 📢. I ran a quick probe on GPT-OSS-20B—160 high-temperature completions prompted by the nine most common English words—and two things stood out: 1. ≈ 85-90 % of output is programming-related. 2. Within that slice, the model leans hard on code explanations, tutorials, and technical docs. That pattern looks a lot like “Textbooks Are All You Need”-style curation—heavy instructional corpora, less narrative variety. My probing is preliminary (single-token prompts, 16 samples each) and loosely inspired by membership-inference attacks, but even this shallow dive already shows a strong dev-centric training bias. Full notebook, outputs, and some simple analysis here → github.com/ctlllll/gpt-oss-r…

GitHub - ctlllll/gpt-oss-reverse-engineering

github.com

142

18,708

Tianle Cai · Feb 25, 2025 · 2:31 AM UTC

Tianle Cai

@tianle_cai

25 Feb 2025

How interesting...

DeepSeek

@deepseek_ai

25 Feb 2025

🚀 Day 2 of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference. ✅ Efficient and optimized all-to-all communication ✅ Both intranode and internode support with NVLink and RDMA ✅ High-throughput kernels for training and inference prefilling ✅ Low-latency kernels for inference decoding ✅ Native FP8 dispatch support ✅ Flexible GPU resource control for computation-communication overlapping 🔗 GitHub: github.com/deepseek-ai/DeepE…

116

25,793

Tianle Cai · Feb 4, 2024 · 4:25 AM UTC

Tianle Cai

@tianle_cai

4 Feb 2024

Weekend reflection: The end-game (?) of LLM serving

I've been thinking about the state of LLM serving recently, and here is what I think it could look like in the near future, given a latency requirement.

For large players with high traffic volumes:
1. Ensure batches are large enough to avoid being memory-bound.
2. Use Tensor Parallelism (TP) to cut down latency.
3. Adopt micro-pipelining to further minimize communication overhead in TP.
Result: Decoding costs approach those of prefilling.

For smaller providers whose query rate doesn't fully utilize GPU capabilities:
1. Still opt for large batch sizes to maximize efficiency.
2. Leverage speculative decoding to squeeze out available FLOPs.
3. Consider tensor parallelism and micro-pipelining if the latency requirement isn't met.

No guarantee that my understanding is legit, please correct me if I'm wrong🙏

111

38,837

Tianle Cai · Aug 1, 2024 · 9:31 PM UTC

Tianle Cai

@tianle_cai

1 Aug 2024

Remember what we saw in the Gemini 1.5 report? 😏

MLCommons @MLCommons

1 Aug 2024

@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI

111

17,283

Tianle Cai · Feb 23, 2021 · 3:13 AM UTC

Tianle Cai

@tianle_cai

23 Feb 2021

Subpopulation shift is a ubiquitous component of natural distribution shift. We propose a general theoretical framework of learning under subpopulation shift based on label propagation. And our insights can help to improve domain adaptation algorithms. arxiv.org/abs/2102.11203

103

Tianle Cai · Sep 24, 2020 · 3:14 PM UTC

Tianle Cai

@tianle_cai

24 Sep 2020

Want to accelerate your Graph NN training? Not satisfied with the acceleration from normalization schemes borrowed from other domains? Come and try our GraphNorm, a normalization specially designed for GNNs with new insights on GNN optimization! arxiv.org/abs/2009.03294 github.com/lsj2408/GraphNorm

103

Tianle Cai · May 11, 2021 · 2:21 AM UTC

Tianle Cai

@tianle_cai

11 May 2021

Late updates: Three papers will appear at #ICML2021! Theory of Label Propagation for Domain Adaptation arxiv.org/abs/2102.11203 GraphNorm for Accelerating GNN Training arxiv.org/abs/2009.03294 L_inf-dist Net for Certifying L_inf robustness arxiv.org/abs/2102.05363

A Theory of Label Propagation for Subpopulation Shift

One of the central problems in machine learning is domain adaptation. Unlike past theoretical work, we consider a new model for subpopulation shift in the input or representation space. In this...

arxiv.org

101

Tianle Cai · Aug 26, 2024 · 11:27 PM UTC

Tianle Cai

@tianle_cai

26 Aug 2024

We've heard many complaints about the high GPU memory requirements for training Medusa heads on models with large vocabularies. This is no longer an issue, thanks to the amazing Liger kernel developed by the LinkedIn team (@hsu_byron and team). The Liger kernel cleverly fuses the kernels and avoids the need to materialize full logits, significantly reducing memory usage and improving speed. This breakthrough allows researchers and developers to work with larger models more efficiently. Check out this example of training Medusa heads with Liger: github.com/linkedin/Liger-Ke…. With num_head = 5, stage = 1, Liger Kernel can scale to 8K seq len, while HF can only do 1K, with +30% throughput.

100

12,198

Tianle Cai · Aug 30, 2024 · 4:34 PM UTC

Tianle Cai

@tianle_cai

30 Aug 2024

Sparsity is a profound idea for enhancing the efficiency of large models (e.g., MoE). Our new research demonstrates that we can achieve high activation sparsity (finer granularity than MoE) without additional training, thereby significantly boosting inference efficiency.

James Liu @JamesLiuID

30 Aug 2024

Your LLM may be sparser than you thought! Excited to announce TEAL, a simple training-free method that achieves up to 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of up to 1.53x-1.8x!

11,994
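The quoted post calls TEAL a training-free way to reach a target activation sparsity but gives no mechanism. One plausible reading, assumed here and not confirmed by the tweet, is magnitude thresholding: zero the smallest-magnitude activations so a chosen fraction becomes zero. The function name and values are hypothetical.

```python
def sparsify(activations, sparsity):
    """Zero the smallest-magnitude entries so about `sparsity` of them become zero."""
    k = int(len(activations) * sparsity)  # number of entries to drop
    if k == 0:
        return list(activations)
    # Threshold is the k-th smallest magnitude; ties at the threshold are also dropped.
    threshold = sorted(abs(a) for a in activations)[k - 1]
    return [0.0 if abs(a) <= threshold else a for a in activations]


# Hypothetical activation vector, for illustration only.
acts = [0.05, -1.2, 0.3, -0.01, 2.0, -0.4, 0.02, 0.9]
sparse_acts = sparsify(acts, sparsity=0.5)
print(sparse_acts)
```

Zeroed activations let a kernel skip loading the matching weight columns, which would be the source of any speedup in memory-bound decoding.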

Tianle Cai · Feb 21, 2024 · 4:55 PM UTC

Tianle Cai

@tianle_cai

21 Feb 2024

Want to learn more about how to compress your fine-tuning deltas into merely 1 bit? -- Our blog is out! fasterdecoding.github.io/Bit… together.ai/blog/bitdelta pli.princeton.edu/blog/2024/… Thx @togethercompute @PrincetonPLI for cross-posting :) GitHub PS: the attached demo is included in our github.

Tianle Cai

@tianle_cai

16 Feb 2024

22,786

Tianle Cai · Sep 11, 2023 · 4:35 PM UTC

Tianle Cai

@tianle_cai

11 Sep 2023

LLM inference is inefficient because of its memory-bound nature. Speculative decoding cleverly uses an additional lightweight draft model to predict a few tokens ahead, allowing the large model to process them together. This reduces the number of memory accesses and thus speeds up the inference. See nitter.app/karpathy/status/… for a nice explanation from legendary @karpathy

Andrej Karpathy

@karpathy

31 Aug 2023

Speculative execution for LLMs is an excellent inference-time optimization. It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might think).

This unintuitive fact is because sampling is heavily memory bound: most of the "work" is not doing compute, it is reading in the weights of the transformer from VRAM into on-chip cache for processing. So if you're going to do all that work of reading in all those weights, you might as well apply them to a whole batch of input vectors. I went into more detail in an earlier thread: nitter.app/karpathy/status/…

The reason we can't naively use this fact to sample in chunks of K tokens at a time is that every N-th token depends on what token we sample at step N-1. There is a serial dependency, so the baseline implementation just goes one by one left to right.

Now the clever idea is to use a small and cheap draft model to first generate a candidate sequence of K tokens - a "draft". Then we feed all of these together through the big model in a batch. This is almost as fast as feeding in just one token, per the above. Then we go from left to right over the logits predicted by the model and sample tokens. Any sample that agrees with the draft allows us to immediately skip forward to the next token. If there is a disagreement then we throw the draft away and eat the cost of doing some throwaway work (sampling the draft and the forward passing for all the later tokens).

The reason this works in practice is that most of the time the draft tokens get accepted, because they are easy, so even a much smaller draft model gets them. As these easy tokens get accepted, we skip through those parts in leaps. The hard tokens where the big model disagrees "fall back" to original speed, but actually a bit slower because of all the extra work.

So TLDR: this one weird trick works because LLMs are memory bound at inference time, in the "batch size 1" setting of sampling a single sequence of interest, that a large fraction of "local LLM" use cases fall into. And because most tokens are "easy".

References arxiv.org/abs/2302.01318 arxiv.org/abs/1811.03115 arxiv.org/abs/2211.17192

11,596
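The draft-then-verify loop described in the quoted thread can be sketched with toy models. This version makes simplifying assumptions: decoding is greedy rather than sampled, both "models" are hypothetical stand-in functions that map a token list to the next token, and the big model's single batched pass is simulated by several calls that are counted as one pass.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One batched target pass scores every draft position at once
        #    (simulated here by k + 1 calls that count as a single pass).
        target_passes += 1
        predictions = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest draft prefix that the target agrees with.
        accepted = 0
        while accepted < k and draft[accepted] == predictions[accepted]:
            accepted += 1
        # 4. Keep the accepted tokens plus the target's own next token, which
        #    either corrects the first mismatch or extends a fully accepted draft.
        tokens.extend(draft[:accepted])
        tokens.append(predictions[accepted])
    return tokens[:goal], target_passes


def toy_target(seq):
    """Stand-in for the large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Stand-in for the draft model: same as the target, but wrong after a 2."""
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10


def greedy_baseline(prompt, num_tokens):
    """Ordinary one-token-at-a-time decoding with the target model."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(toy_target(tokens))
    return tokens


output, passes = speculative_decode(toy_target, toy_draft, [0], num_tokens=12)
print(output, passes)
```

The output matches plain greedy decoding with the target model, while the number of target passes is smaller than the number of generated tokens whenever the draft is mostly right.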

Tianle Cai · Nov 6, 2023 · 9:48 PM UTC

Tianle Cai

@tianle_cai

6 Nov 2023

🔥 Fired up after today's OpenAI release, I'm thrilled to share that I'll be joining the VisionPro team @Apple as a summer intern next year! Looking forward to diving into the future of XR+AI. (hopefully I can make VisionPro lighter, below is one of my proposals) 🌟

12,652

Tianle Cai · Jun 16, 2024 · 9:22 PM UTC

Tianle Cai

@tianle_cai

16 Jun 2024

My first CVPR 🤗

16,967

Tianle Cai · Jan 21, 2023 · 12:57 AM UTC

Tianle Cai

@tianle_cai

21 Jan 2023

The SGConv paper from my last memorable internship @MSFTResearch has been accepted by ICLR, and I will start a new journey @GoogleAI as a part-time student researcher next week! Ready to welcome the Lunar New Year🤗

Tianle Cai

@tianle_cai

18 Oct 2022

15,391

Tianle Cai · Aug 20, 2024 · 1:56 AM UTC

Tianle Cai

@tianle_cai

20 Aug 2024

Just flew back home after a 10+ hour flight. Sharing some in-flight thoughts on AI ✈️ Q: What assumptions should we make when imagining future AI applications? TL;DR: 1. Model inference will become nearly free 2. AI will solve most objectively verifiable problems 3. Personal AI models will diverge on subjective issues We often take many things for granted because we're so accustomed to them. For instance, we assume phones can easily connect to the Internet, or that we can get real-time traffic updates while driving. But these were once groundbreaking innovations. If we can enter this state of familiarity early, or make the right assumptions, we might be able to gain an advantage in future developments. Here are three assumptions I believe will eventually prove true in AI progress. I hope they provide some inspiration and welcome discussion: 1. Model inference will become nearly free The cost reduction of AI models is following a deterministic trajectory. We've already witnessed more than a 10x reduction in inference costs over the past few months, and this trend is set to continue as we refine our techniques in both model training and inference. Think about how web search evolved. Initially, online searches were limited and sometimes charged per query. Now, we can perform countless Google searches daily without direct costs. This shift enabled new business models like search advertising. Similarly, as AI inference costs plummet, we'll see a proliferation of AI-powered applications and innovative business models. 2. AI will solve most objectively verifiable problems This might sound like a bold claim, but it's logically sound. The key logic is that "verifying a problem is simpler than solving it." As long as we have initial models with sufficient verification abilities, they should be able to bootstrap themselves to become more powerful (the AlphaGo series is a good example of this). 
Of course, realizing this simple logic is no easy task, but hopefully, it will become a reality in the near future. 3. Personal AI models will diverge on subjective issues For problems where there isn't (and shouldn't be) a consensus, AI models will likely diverge. As the effective context of models grows longer and has more long-term impact on outputs, models will eventually develop unique "opinions." Imagine having a personal AI assistant that, over time, develops a deep understanding of your unique worldview, tastes, and values. When asked about the best way to spend a weekend, one person's AI might suggest an action-packed outdoor adventure, while another's might recommend a quiet day of reading and reflection - each perfectly tailored to their user's preferences. Looking forward to more "assumptions"!🤔💬

10,687

Tianle Cai · Mar 4, 2024 · 2:24 PM UTC

Tianle Cai

@tianle_cai

4 Mar 2024

Interesting to see that the output price is 5x the input price, which seems to suggest that latency is a big issue for those larger models, so only a smaller batch size can be used. If this is the case, speculative decoding should be pretty helpful?

13,129

Tianle Cai · Sep 11, 2023 · 4:38 PM UTC

Tianle Cai

@tianle_cai

11 Sep 2023

Replying to @tianle_cai @ggerganov

Unlike the draft model, Medusa heads can be trained in conjunction with the original model, which remains frozen during training. This method allows for fine-tuning large models on a single GPU, taking advantage of the powerful base model's learned representations. On their own, Medusa heads don't quite hit the mark of doubling processing speed. But here's the twist: When we pair this with a tree-based attention mechanism, we can verify several candidates generated by Medusa heads in parallel. These two techniques allow us to achieve a 2x speedup over the original model.

9,631
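One way to picture the tree-based attention mentioned in the post above: candidate continuations that share a prefix form a tree, and each candidate token may attend only to itself and its ancestors, so all branches can be checked in one forward pass. The sketch below builds such a mask from parent pointers. The representation and names are assumptions for illustration, not the Medusa implementation, and attention to the shared prompt is left out.

```python
def tree_attention_mask(parents):
    """Build mask[i][j] = True when candidate node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 if node i directly
    follows the already-generated context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk from the node up to the root of its branch
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical tree: nodes 0 and 1 are two guesses for the next token;
# nodes 2 and 3 are guesses following node 0; node 4 follows node 1.
parents = [-1, -1, 0, 0, 4 - 3]
mask = tree_attention_mask(parents)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row marks exactly one root-to-node path, so tokens on different branches never see each other even though they sit in the same batch.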

Tianle Cai · May 13, 2024 · 8:31 PM UTC

Tianle Cai

@tianle_cai

13 May 2024

Free idea: This story suggests that we may need to use the SFT (chat) corpus to build the tokenizer rather than relying on the pretraining corpus alone🤣 @OpenAI

Tianle Cai

@tianle_cai

13 May 2024

19,170

Tianle Cai · Feb 16, 2024 · 3:23 AM UTC

Tianle Cai

@tianle_cai

16 Feb 2024

Paper: arxiv.org/abs/2402.10193 Code: github.com/FasterDecoding/Bi… Blog is on the way :) With amazing collaborators @JamesLiuID (super talented undergrad from MIT!), Kai, @jasondeanlee, @songhan_mit, and @tri_dao! Thanks @togethercompute for sponsoring GPUs and @myshell_ai for OpenAI credits! I'm grateful for these efforts to empower our open-source community.

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of...

arxiv.org

7,645

Tianle Cai · Mar 7, 2024 · 2:57 PM UTC

Tianle Cai

@tianle_cai

7 Mar 2024

Replying to @inflectionAI

Wonder where the FLOPs number comes from...

18,129

Tianle Cai · Jun 27, 2024 · 10:10 PM UTC

Tianle Cai

@tianle_cai

27 Jun 2024

Just delved into Google's report on Gemma2. Initial takeaway: many tricks focus on training stability and are particularly suitable for low-precision scenarios, e.g., logit soft-capping and sandwich layer normalization. Does this hint at int8 training being crucial? 🤔 Link: goo.gle/gemma2report
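
For readers unfamiliar with logit soft-capping, here is a minimal sketch of the usual formulation, `cap * tanh(logits / cap)`, which squashes values into a bounded range while staying close to the identity for small inputs. The function name `soft_cap` and the cap value of 30.0 are illustrative assumptions, not taken from the tweet.

```python
import math

def soft_cap(logits, cap):
    """Smoothly bound each logit to the range [-cap, cap] using tanh."""
    return [cap * math.tanh(x / cap) for x in logits]

# Small logits pass through almost unchanged; large ones saturate near the cap.
capped = soft_cap([0.0, 1.0, 100.0, -100.0], cap=30.0)
print(capped)
```

Bounding the magnitude of logits is one plausible reason such a trick helps when the numeric range is limited, though that link is speculation in the same spirit as the tweet.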

9,620

Tianle Cai · Oct 3, 2024 · 6:44 PM UTC

Tianle Cai

@tianle_cai

3 Oct 2024

I'll be at COLM next week. DM if you want to chat (about inference efficiency, new architecture ideas, reasoning, startups, or anything LLM) 😜

7,265

Tianle Cai · Sep 14, 2023 · 1:50 AM UTC

Tianle Cai

@tianle_cai

14 Sep 2023

LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper🤩

Zhuohan Li

@zhuohan123

14 Sep 2023

PagedAttention's paper is out! Check it out to learn more!

8,738

Tianle Cai · Sep 29, 2021 · 5:36 PM UTC

Tianle Cai

@tianle_cai

29 Sep 2021

Three papers got accepted by #NeurIPS2021, excited but also introspective: is getting accepted less important nowadays, given 2600+ papers at the conference? Only good work leaves a footprint. Still, glad to share our work, and thanks to my amazing collaborators :)

Tianle Cai · Mar 28, 2024 · 7:45 PM UTC

Tianle Cai

@tianle_cai

28 Mar 2024

DBRX sounds like last year... The field just evolves so fast🤪

SambaNova

@SambaNovaAI

28 Mar 2024

🚀🌟🚀Excited to announce Samba-CoE v0.2, which outperforms DBRX by @DbrxMosaicAI and @databricks, Mixtral-8x7B from @MistralAI, and Grok-1 by @grok at a breakneck speed of 330 tokens/s. These breakthrough speeds were achieved without sacrificing precision and only on 8 sockets, showcasing the true capabilities of dataflow! Why would you buy 576 sockets and go to 8 bits when you can run using 16 bits and just 8 sockets? Try out the model and check out the speed here - coe-1.cloud.snova.ai/. We are also providing a sneak peek of our next model, Samba-CoE v0.3, available soon with our partners at @LeptonAI. Read more about this announcement at sambanova.ai/blog/accurate-m…

12,361

Tianle Cai · May 13, 2024 · 6:56 PM UTC

Tianle Cai

@tianle_cai

13 May 2024

It's fascinating to see how OpenAI decomposes complicated tasks by examining the author list. 👀 As the field matures, most of the work is focused on building large systems and ensuring they work together seamlessly. openai.com/gpt-4o-contributi…

12,519

Tianle Cai · Dec 4, 2023 · 11:33 PM UTC

Tianle Cai

@tianle_cai

4 Dec 2023

I'll be at NeurIPS next week (Dec 11-15). Please DM me if you would like to chat about anything related to large models, especially efficient inference😋

8,360

Tianle Cai · Apr 18, 2024 · 12:23 AM UTC

Tianle Cai

@tianle_cai

18 Apr 2024

JetMoE: 4 folks w/ $0.1M ship a (near-?)SOTA 8B MoE model within 1 month (with the help of @myshell_ai) 🤪

George

@georgejrjrjr

17 Apr 2024

Mistral: ~30 folks w/ ~$500M ship a near-frontier model in nine months. Reka: ~25 folks w/ $60M ship a frontier model in four months. Zyphra: 7 folks w/ $11M (none a frontier lab alum) ship a near-SOTA 7B within a couple months. The frontier is shaping up to be pretty crowded!

9,757

Tianle Cai · Sep 11, 2023 · 4:36 PM UTC

Tianle Cai

@tianle_cai

11 Sep 2023

However, introducing the draft model also brings in a lot of pain. First, it's non-trivial to find a good (both fast and accurate) draft model (as @ggerganov mentioned in nitter.app/ggerganov/status…). Also, the draft model adds extra complexity to the pipeline, making it harder to implement and a disaster for distributed inference.

Georgi Gerganov

@ggerganov

31 Aug 2023

Replying to @ggerganov

Meta should have released a couple of (1B and 3B) drafter models with the Code Llama release. Is it too late for them to train them, or do we have to wait for v2 🤔

16,013

Tianle Cai · Dec 3, 2024 · 8:53 PM UTC

Tianle Cai

@tianle_cai

3 Dec 2024

Heading to NeurIPS next Tuesday! 🎯 Been deep into evolvable agents, adaptive computation, and system-model codesign lately, but down to geek out about any cool AI stuff. Also open to startup convos (consulting/investing). Hit me up if you wanna connect! 🤝

4,818

Tianle Cai · Apr 24, 2024 · 3:00 AM UTC

Tianle Cai

@tianle_cai

24 Apr 2024

Thx for sharing!

@_akhaliq

24 Apr 2024

SnapKV LLM Knows What You are Looking for Before Generation Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV

9,815

Tianle Cai · Jun 16, 2021 · 10:15 PM UTC

Tianle Cai

@tianle_cai

16 Jun 2021

Super excited to announce our Graphormer wins FIRST place in the OGB Large-Scale Challenge PCQM4M-LSC at KDD Cup 2021 @kdd_news !! Attention is all you need even in graph representation learning? Yes, and no. Transformer + graph structural encodings = powerful GNN!

Tianle Cai · Feb 28, 2024 · 7:17 PM UTC

Tianle Cai

@tianle_cai

28 Feb 2024

Thank you for your nice words, Gemini :)

7,284

Tianle Cai · Oct 20, 2024 · 6:49 PM UTC

Tianle Cai

@tianle_cai

20 Oct 2024

x.com/i/article/184807255972…

The Missing Part of AGI

This is a record of my (Tianle's) recent thoughts about the current state of AI development. TL;DR: One plausible path to achieving Artificial General Intelligence (AGI) is to allow AIs to evolve

28,557

Tianle Cai · Oct 30, 2023 · 5:52 PM UTC

Tianle Cai

@tianle_cai

30 Oct 2023

Open-source 30T corpus🤩

Together AI

@togethercompute

30 Oct 2023

We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama-d…

9,076

Tianle Cai · Sep 24, 2020 · 1:16 AM UTC

Tianle Cai

@tianle_cai

24 Sep 2020

Do existing pruning methods really exploit the info from data? Do the architectures of the pruned networks really matter for the performance? We propose sanity checks on pruning methods and find that a large share of existing methods do not rely on these! arxiv.org/abs/2009.11094

Tianle Cai · Jul 17, 2022 · 6:24 PM UTC

Tianle Cai

@tianle_cai

17 Jul 2022

Will be at ICML next week!! Quite a last-minute decision, but I really miss the in-person conference :) Looking forward to meeting old&new friends here!! Pls ping me if u want to meet up :p

Tianle Cai · Dec 14, 2024 · 3:28 PM UTC

Tianle Cai

@tianle_cai

14 Dec 2024

Many have speculated about Ilya's hidden alpha for the next scaling axis. Here's my proposal: the number of evolvable agents to more efficiently extend the frontier of AI-generated new knowledge. Could this align with some of Ilya's ideas? 🤔 My article on this: nitter.app/tianle_cai/status/1848…

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxesTex

14 Dec 2024

Yes, we'll be scaling compute use, not data or model. But how? "Agents?" I feel Ilya is hiding alpha, his specific answer to his own "scaling what?" puzzle. My guess: compute per hard token in training. Branching reasoning chains, then distilling into latent multi-hop reasoning.

8,573

Tianle Cai · Mar 15, 2024 · 3:20 AM UTC

Tianle Cai

@tianle_cai

15 Mar 2024

When will @grok be open-sourced?🧐

8,791

Tianle Cai · May 15, 2024 · 5:11 AM UTC

Tianle Cai

@tianle_cai

15 May 2024

Superunaligned😰

Jan Leike

@janleike

15 May 2024

I resigned

9,611

Tianle Cai · Dec 18, 2023 · 1:39 AM UTC

Tianle Cai

@tianle_cai

18 Dec 2023

Separating the two phases of LLM generation helps understand their computation characteristics and introduces optimization opportunities. Glad to see an excellent explanation of an underestimated optimization on this. PS: SplitFuse from @MSFTDeepSpeed might also be of interest.

Aman Sanger

@amanrsanger

17 Dec 2023

Most OSS inference frameworks combine pre-filling (prompt tokens) with decoding (generation tokens) per device/process. But separating the stages should give you better perf! (1/14)

8,916

Tianle Cai · Mar 20, 2024 · 11:05 PM UTC

Tianle Cai

@tianle_cai

20 Mar 2024

Nice figure, but it seems a little misleading: the memory wall refers to the trend that memory bandwidth (rather than the required memory or model size) grows more slowly than computation, which causes LLM inference to be memory-bound. Here is an in-depth explanation by Claude 3 Opus:

"The image shows the compute and memory capacity trends for AI models over time, with the "Memory Wall" line representing the memory limitation compared to compute capacity. However, the concept of the "Memory Wall" is misrepresented here.

The "Memory Wall" typically refers to the growing disparity between processor speeds and memory speeds, where memory access times have not kept pace with the rapid increases in processor performance. This can lead to a bottleneck where the processor is idle, waiting for data from memory.

In the context of AI models, the chart seems to suggest that the "Memory Wall" is a hard limit on the memory capacity of these models. However, this is not an accurate representation of the memory wall problem. The memory wall issue is more about the speed and bandwidth of memory access relative to processing power, rather than a hard limit on total memory capacity.

While there are certainly limitations to the amount of memory that can be physically packaged with a processor, the chart's depiction of the memory wall as a linear limit on total memory capacity is misleading. In reality, techniques such as caching, prefetching, and memory hierarchy design can help mitigate the impact of the memory wall to some extent, even as the disparity between processor and memory speeds continues to grow."
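
To make "memory-bound" concrete, here is a back-of-the-envelope roofline check. It assumes a dense layer whose weights are read from memory once per decoding step and used for one multiply-add (2 FLOPs) per sequence in the batch, and it ignores KV cache and activation traffic. The hardware numbers (1e15 FLOP/s peak, 3e12 bytes/s bandwidth) are made-up round figures for illustration, not the specs of any particular GPU.

```python
def decode_arithmetic_intensity(batch_size, bytes_per_weight=2):
    """FLOPs performed per byte of weights read in one decoding step."""
    # Each weight contributes one multiply-add (2 FLOPs) per sequence in the batch.
    return 2 * batch_size / bytes_per_weight

def is_memory_bound(batch_size, peak_flops, mem_bandwidth, bytes_per_weight=2):
    """True if the step is limited by memory bandwidth rather than compute."""
    # The ridge point is the intensity at which compute and bandwidth balance.
    ridge = peak_flops / mem_bandwidth
    return decode_arithmetic_intensity(batch_size, bytes_per_weight) < ridge

# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
print(is_memory_bound(1, 1e15, 3e12))    # True
print(is_memory_bound(512, 1e15, 3e12))  # False
```

Under these assumptions, small-batch decoding sits far below the ridge point, so faster compute alone would not help; only more bandwidth or more work per byte (larger batches, speculative decoding) would.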

Chief AI Officer @chiefaioffice

20 Mar 2024

This is what NVIDIA's Blackwell GPUs solve for.
Transformer sizes: 240x/2 years
GPU memory: 2x/2 years
What the memory wall means:

8,957

Tianle Cai · Aug 2, 2024 · 9:11 PM UTC

Tianle Cai

@tianle_cai

2 Aug 2024

Glad to see Medusa keep advancing cutting-edge models and go beyond language 😍

Rohan Paul

@rohanpaul_ai

2 Aug 2024

New ultra-fast ‘multi-head’ speech recognition model drop from @_aiOla, beats OpenAI Whisper. Officially dubbed Whisper-Medusa, the model builds on Whisper but uses a novel “multi-head attention” architecture that predicts far more tokens at a time. So they seem to have added more attention heads on top of Whisper. They claim the same accuracy but 50% faster. Their demo does one text in 1.9s while "baseline" Whisper does it in 4s. Code and weights open source under MIT. They have started with a 10-head model but will soon expand to a larger 20-head version capable of predicting 20 tokens at a time, leading to faster recognition and transcription without any loss of accuracy.

5,856

Tianle Cai · Jun 18, 2021 · 3:47 AM UTC

Tianle Cai

@tianle_cai

18 Jun 2021

Our ICML paper (arxiv.org/abs/2102.11203) provided theoretical evidence on the effectiveness of consistency-based methods like AdaMatch, FixMatch, etc., with empirical results for the adaptation of FixMatch on domain adaptation tasks :p

Connor Shorten

@CShorten30

17 Jun 2021

This video explains AdaMatch! 🐳 • Interesting unification of Semi-Supervised Learning and Domain Adaptation • Extensions to FixMatch: BatchNorm use, Confidence threshold, Distribution Alignment • Results on DomainNet piped.video/ORufPOY8H14 #100DaysOfMLCode

Tianle Cai · Jul 26, 2024 · 9:01 PM UTC

Tianle Cai

@tianle_cai

26 Jul 2024

Replying to @capetorch

The reason it may look strange is that people are focusing on decoding speed. But an even larger proportion of computation is spent on processing the input which is way faster than 300 tok/sec. I made the same mistake before and luckily it was corrected by @jiayq and friends @LeptonAI.

10,958

Tianle Cai · Jan 22, 2024 · 4:35 PM UTC

Tianle Cai

@tianle_cai

22 Jan 2024

Thank you to the incredible team @LeptonAI for harnessing the potential of Medusa in practical LLM applications and offering a variety of finely-tuned Medusa heads to the community!

Yangqing Jia

@jiayq

22 Jan 2024

Medusa is probably one of the most elegant accelerated inference solutions we have seen over the last year. It runs complementary to other numerical ones (like int8/fp8, compilation etc) and gives something around ~2x performance gain in practice. (1/x) huggingface.co/papers/2401.1…

7,011

Tianle Cai · Apr 12, 2024 · 7:53 PM UTC

Tianle Cai

@tianle_cai

12 Apr 2024

Glad to see Medusa helps the💯B Command R+ model @cohere run in💯 tok/sec 😍

OlivierD @OlivierDehaene

12 Apr 2024

Text Generation Inference v2.0.0 is the fastest open-source implementation of Cohere Command R+! Command R+ is the best open-weights model. Leveraging the power of Medusa heads, TGI achieves unparalleled speeds with a latency as low as 9ms per token for a 104B model!

7,574

Tianle Cai · Feb 12, 2021 · 2:04 AM UTC

Tianle Cai

@tianle_cai

12 Feb 2021

Thrilled to share our new work on certifying L_inf adversarial robustness! New architecture with inherent robustness and a tailored training pipeline. Achieving SOTA performance on several benchmarks! paper: arxiv.org/abs/2102.05363 code: github.com/zbh2047/L_inf-dis…

Tianle Cai · Mar 18, 2024 · 9:32 PM UTC

Tianle Cai

@tianle_cai

18 Mar 2024

Low precision seems to be the way to keep Moore's law going?

7,886

Tianle Cai · May 9, 2024 · 6:06 PM UTC

Tianle Cai

@tianle_cai

9 May 2024

Wait, I don't need to learn CUDA now? 😍

Zhihao Jia

@JiaZhihao

9 May 2024

🚀Introducing Mirage, a superoptimizer that automatically discovers highly-optimized GPU implementations for LLMs (and beyond). For certain attention operators, the fastest programs found by Mirage are 2x faster than existing expert-designed implementations such as FlashAttention and FlashDecoding. Code+Demo: github.com/mirage-project/mi… Paper: cs.cmu.edu/~zhihaoj2/papers/…

7,887

Tianle Cai · Apr 24, 2025 · 5:09 AM UTC

Tianle Cai

@tianle_cai

24 Apr 2025

I'll be there tomorrow presenting our work on accelerating diffusion model inference with quality-preserving 4-bit quantization. Come say hi! 😉

Muyang Li

@lmxyy1999

24 Apr 2025

🚀 How to run 12B FLUX.1 on your local laptop with 2-3× speedup? Come check out our #SVDQuant (#ICLR2025 Spotlight) poster session! 🎉 🗓️ When: Friday, Apr 25, 10–12:30 (Singapore time) 📍 Where: Hall 3 + Hall 2B, Poster 169 📌 Poster: tinyurl.com/poster-svdquant 🎮 Demo: svdquant.mit.edu 💻 Code: github.com/mit-han-lab/nunch… 🌐 Project: hanlab.mit.edu/projects/svdq… @syn7xavier @ZhekaiZhang @tianle_cai @xiuyu_l @jerry_gjx @xieenze_jr @chenlin_meng @junyanz89 @songhan_mit

2,802

Tianle Cai · Apr 4, 2024 · 4:27 PM UTC

Tianle Cai

@tianle_cai

4 Apr 2024

Replying to @tianle_cai @myshell_ai

JetMoE utilizes a novel MoE architecture (a variant of arxiv.org/abs/2306.04640) that splits both the MLP and attention layers into multiple experts, resulting in a highly efficient and performant model. Despite having only 2.2B active parameters, JetMoE outperforms Llama-2 7B and Llama 13B.
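
As a generic illustration of why an MoE model has far fewer active parameters than total parameters, here is a minimal top-k routing sketch in plain Python. It is not JetMoE's actual routing code: the names `top_k_route` and `moe_forward`, the four scalar "experts", and the choice of k=2 are all assumptions made for the example.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k experts with the largest router logits and softmax over them."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))

# Four toy experts that just scale their input; only two run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward(1.0, experts, router_logits=[0.1, 2.0, 0.3, 2.0]))  # 3.0
```

Only the experts picked by the router are evaluated for a given token, so compute per token scales with k rather than with the total number of experts.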

ModuleFormer: Modularity Emerges from Mixture-of-Experts

Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is also difficult to expand their knowledge beyond pre-training...

arxiv.org

3,584

Tianle Cai · Dec 11, 2023 · 1:31 PM UTC

Tianle Cai

@tianle_cai

11 Dec 2023

Replying to @main_horse @MistralAI

don't have enough time to dig deeper, but in high-dimension the expected corr between two random vectors should be ~0, so the corr in mixtral is quite significant I think

4,824

Tianle Cai · Dec 9, 2024 · 10:23 PM UTC

Tianle Cai

@tianle_cai

9 Dec 2024

Interesting @xai release when people are waiting for their Sora generations 👀 "Aurora is an autoregressive mixture-of-experts network trained to predict the next token from interleaved text and image data." So does this mean it's natively multimodal? Also interesting to see they make autoregressive image generation work that well. x.ai/blog/grok-image-generat…

4,662

Tianle Cai · Jul 7, 2023 · 1:51 AM UTC

Tianle Cai

@tianle_cai

7 Jul 2023

Simple architecture modification to make LLMs better in-context learners:
- Remove sentence len constraint
- Linear scaling
- Invariant to demo permutation
An old course project from @danqi_chen 's amazing NLP class (shorturl.at/eALU2) recently got accepted into @ESFoMo .🔥

6,589

Tianle Cai · May 29, 2023 · 1:18 AM UTC

Tianle Cai

@tianle_cai

29 May 2023

When using GPT-4 as Tool Maker and GPT-3.5 Turbo as Tool User, our method can achieve performance on par with GPT-4 with GPT-3.5 Turbo's low cost and fast speed. We validated our method on several complex reasoning tasks from BigBench and real-world applications.

4,886

Tianle Cai · Feb 27, 2024 · 3:07 PM UTC

Tianle Cai

@tianle_cai

27 Feb 2024

Thanks Nicolas and the HF team for boosting the open-source community! Glad to see Medusa helps to deliver blazingly fast inference! PR has been merged, you can try it yourself ;)

Nicolas Patry @narsilou

27 Feb 2024

Mixtral at 142 tok/s (latency)... Medusa models can really speed up inference, today we're releasing models & upgrades to the training script so you can adapt on your own finetune huggingface.co/text-generati… Training recipe: github.com/FasterDecoding/Me… TGI:latest works out of the box

13,605

Tianle Cai · Dec 11, 2023 · 7:15 PM UTC

Tianle Cai

@tianle_cai

11 Dec 2023

Some clarifications:
- The correlation here is the cosine similarity of the attention *weights*, not the intermediate hidden representation.
- The correlation is significant, as in high dimensions the expected corr of two random vectors should be around zero (verified in the figure, which shows the corr of a pair of weight matrices between Mixtral and Llama-2 7B).
Wrote on the flight to NOLA👀 Ping me if you want to chat more 👀
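
The "random vectors are nearly orthogonal in high dimensions" claim is easy to check numerically. The sketch below draws pairs of independent Gaussian vectors (an assumed stand-in for unrelated flattened weight matrices) and measures their cosine similarity; the dimension 4096, the 20 trials, and the seed are arbitrary choices.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def random_vector(dim, rng):
    """Vector of independent standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 4096
sims = [cosine_similarity(random_vector(dim, rng), random_vector(dim, rng))
        for _ in range(20)]
mean_abs = sum(abs(s) for s in sims) / len(sims)
print(f"mean |cosine| over {len(sims)} random pairs: {mean_abs:.4f}")
```

For independent vectors the similarity has a typical size of roughly 1/sqrt(dim), so a clearly nonzero value between two weight matrices is hard to explain by chance.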

6,765

Tianle Cai · Nov 8, 2024 · 7:06 PM UTC

Tianle Cai

@tianle_cai

8 Nov 2024

Diffusion models are compute intensive and we use 4-bit quantization (over both weights and activations!) to unlock 4x more FLOPs with 4-bit TensorCore -> 4x memory reduction and 3x speedup 🤩 With B200's 9P 4-bit FLOPs (4.5x H100 fp16), the future will be even more exciting 😏
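
For intuition about what 4-bit quantization does to a tensor, here is a sketch of plain symmetric round-to-nearest quantization with a single scale. To be clear, this is the textbook baseline, not SVDQuant itself; the function names and sample values are made up for the example.

```python
def quantize_int4(values):
    """Map floats to integer codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, worst_error)
```

Each value now needs 4 bits instead of 16, which is where a 4x memory reduction relative to 16-bit storage would come from; the rounding error is bounded by half the scale, so a few large outliers can hurt everything else, which is the kind of problem more careful schemes try to address.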

Muyang Li

@lmxyy1999

8 Nov 2024

🚀 The 4-bit era has arrived! Meet #SVDQuant, our new W4A4 quantization paradigm for diffusion models. Now, 12B FLUX can run on a 16GB 4090 laptop without offloading—with 3x speedups over W4A16 models (like NF4) while maintaining top-tier image quality. #AI #Quantization. 1/7

4,126

Tianle Cai · Dec 11, 2023 · 4:12 PM UTC

Tianle Cai

@tianle_cai

11 Dec 2023

Replying to @mvpatel2000 @MistralAI

As expected, ~0 haha

2,512

Tianle Cai · Nov 17, 2023 · 3:18 PM UTC

Tianle Cai

@tianle_cai

17 Nov 2023

Honored that Graphormer was highlighted by @satyanadella at Microsoft Ignite 2023🤩🤩

3,123

Tianle Cai · Nov 15, 2023 · 3:04 AM UTC

Tianle Cai

@tianle_cai

15 Nov 2023

paper: arxiv.org/abs/2311.08252 blog: sites.google.com/view/rest-l… code: github.com/FasterDecoding/RE… Shout out to my amazing collaborators @zhenyuhe00 @ZexuanZhong @jasondeanlee and Di.

REST: Retrieval-Based Speculative Decoding

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that...

arxiv.org

2,477

Tianle Cai · Apr 10, 2024 · 1:32 AM UTC

Tianle Cai

@tianle_cai

10 Apr 2024

8x22b👀👀

Mistral AI

@MistralAI

10 Apr 2024

magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%https://nitter.app/t.co/OdtBUsbeV5%3A1337%2Fannounce

3,384

Tianle Cai · Jul 18, 2023 · 4:34 PM UTC

Tianle Cai

@tianle_cai

18 Jul 2023

The release of Llama2 is a Big moment for OSS! 🔥 After a skim through the tech report, found several secret sauces behind the success:
- Better data.
- More efficient (architecture for) inference.
And more👇 ai.meta.com/research/publica…

5,798

Tianle Cai · Nov 27, 2022 · 7:13 PM UTC

Tianle Cai

@tianle_cai

27 Nov 2022

Excited to have my second in-person #NeurIPS three years after the first one! Can't wait to meet old&new friends here🔥Feel free to ping me if you would like to chat on ML methodologies (Transformers, learning paradigms, efficient training etc.)

Tianle Cai · Dec 9, 2024 · 6:19 PM UTC

Tianle Cai

@tianle_cai

9 Dec 2024

I have spoken 😏

Tianle Cai

@tianle_cai

5 Dec 2024

I guess the crazy $200/month subscription is also for Sora access? Then it makes more sense, we will see 👀

2,120

Tianle Cai · Apr 4, 2024 · 4:29 PM UTC

Tianle Cai

@tianle_cai

4 Apr 2024

Learn more about JetMoE and access the model: Research: research.myshell.ai/jetmoe GitHub: github.com/myshell-ai/JetMoE Huggingface: huggingface.co/jetmoe/jetmoe… Demo: lepton.ai/playground/chat?mo… Joint work with @Yikang_Shen @Zhen4good @qinzytech @myshell_ai @LeptonAI

2,281

Tianle Cai · May 29, 2023 · 1:12 AM UTC

Tianle Cai

@tianle_cai

29 May 2023

Replying to @tianle_cai @Google

Tools can boost the productivity of LLMs, but what if there isn't a suitable tool?😰-- Let LLMs build their own🤩 We introduce "LLMs As Tool Maker": one LLM serves as Tool Maker👩🏻‍🎓 to make new tools🔨, another LLM serves as Tool User👨🏻‍🔧 to solve new problems with the tool.
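
The maker/user split can be pictured with a toy sketch in which both "LLMs" are replaced by hard-coded stubs. Nothing here calls a real model: `tool_maker` returns fixed source code, `tool_user` simply calls the resulting function, and all names are hypothetical. The point is only the control flow: a tool is made once, cached, and reused on new problems.

```python
def tool_maker(task_description):
    """Stub for the stronger model: returns source code for a reusable tool."""
    # A real tool maker would write this program from examples of the task.
    return (
        "def tool(numbers):\n"
        "    return sorted(numbers)[len(numbers) // 2]\n"
    )

def build_tool(source):
    """Turn generated source code into a callable function."""
    namespace = {}
    exec(source, namespace)
    return namespace["tool"]

def tool_user(tool, problem):
    """Stub for the cheaper model: only has to call the existing tool."""
    return tool(problem)

TOOL_CACHE = {}

def solve(task_description, problem):
    """Make the tool on first use, then reuse it for later problems."""
    if task_description not in TOOL_CACHE:
        TOOL_CACHE[task_description] = build_tool(tool_maker(task_description))
    return tool_user(TOOL_CACHE[task_description], problem)

print(solve("median of an odd-length list", [3, 1, 2]))        # 2
print(solve("median of an odd-length list", [9, 7, 8, 1, 5]))  # 7
```

The expensive step (writing the tool) happens once per task type, while every later problem only pays for the cheap call, assuming the generated tool is correct.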

5,873

Tianle Cai · Oct 31, 2023 · 7:18 AM UTC

Tianle Cai

@tianle_cai

31 Oct 2023

So, what exactly is dynamic caching (in today's Apple release)? How does it compare to NV GPUs? I got super confused by Apple's presentation here. Sounds like some remarkable technical improvement, but the description is just so vague...

4,452

Tianle Cai · Apr 18, 2024 · 6:25 PM UTC

Tianle Cai

@tianle_cai

18 Apr 2024

Would like to see 1) smaller-scale dense models runnable on edge devices and 2) larger-scale MoE models trained with this 15T corpus. Those might be more useful than the current architecture choices...

Tianle Cai

@tianle_cai

18 Apr 2024

Llama 3: Better data is all you need

3,214

Tianle Cai · Oct 4, 2024 · 5:00 PM UTC

Tianle Cai

@tianle_cai

4 Oct 2024

Amazing progress on long-context LM!

Tianyu Gao @gaotianyu1350

4 Oct 2024

Very proud to introduce two of our recent long-context works: HELMET (best long-context benchmark imo): shorturl.at/JnBHD ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): shorturl.at/XQV7a Here is a story of how we arrived there

2,877

Tianle Cai · Jul 13, 2023 · 12:35 AM UTC

Tianle Cai

@tianle_cai

13 Jul 2023

Replying to @TheGregYang @KaiyuYang4

Congrats Greg!

858

Tianle Cai · Dec 5, 2024 · 10:19 PM UTC

Tianle Cai

@tianle_cai

5 Dec 2024

I guess the crazy $200/month subscription is also for Sora access? Then it makes more sense, we will see 👀

3,780

Tianle Cai · Dec 11, 2023 · 1:21 PM UTC

Tianle Cai

@tianle_cai

11 Dec 2023

Besides the similarity between Mixtral 8x7B and Mistral 7B, here are some other sanity checks I would like to try if I weren't on my way to NeurIPS✈️:
- The performance when using a fixed expert.
- Similarity between different experts.
- Performance drop after quantization.

3,481

Tianle Cai · May 29, 2023 · 1:21 AM UTC

Tianle Cai

@tianle_cai

29 May 2023

Joint work with the amazing team at Google Deepmind: Xuezhi, @tengyuma , @xinyun_chen_ , @denny_zhou

3,554

Tianle Cai · May 20, 2024 · 6:51 PM UTC

Tianle Cai

@tianle_cai

20 May 2024

Replying to @bozavlado

Seems like they're trying to keep it low-key, but the phrase "higher-order preconditioned method" immediately brought to my mind the incredible paper "Scalable Second Order Optimization for Deep Learning" by @_arohan_ et al. arxiv.org/abs/2002.09018

Scalable Second Order Optimization for Deep Learning

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that...

arxiv.org

1,949