Woosuk Kwon · Jan 22, 2026 · 5:03 PM UTC

Woosuk Kwon

Pinned Tweet

Woosuk Kwon

@woosuk_k

Jan 22

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster. The Challenge Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked and with full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data. We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building. Why Us vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale—in research and in production. Open Source vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls. Join Us Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us. We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks. - @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

181

129

1,152

484,163

Woosuk Kwon · Sep 5, 2024 · 5:26 PM UTC

Woosuk Kwon

@woosuk_k

5 Sep 2024

Developing @vllm_project taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU. Over the past month, the vLLM community conducted an in-depth study and made key optimizations, leading to significant performance improvements. Check out our findings and latest updates!

vLLM

@vllm_project

5 Sep 2024

A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. blog.vllm.ai/2024/09/05/perf…

249

37,377

Woosuk Kwon · Nov 14, 2023 · 11:14 PM UTC

Woosuk Kwon

@woosuk_k

14 Nov 2023

We’ve just released a new blog post comparing vLLM with DeepSpeed-FastGen. While we are happy to see the open-source technology advancements from the DeepSpeed team, we’ve got different results with more extensive performance benchmarks. vLLM is actually faster than DeepSpeed in many common scenarios. Details here: blog.vllm.ai/2023/11/14/note… (written with @zhuohan123, @simon_mo_, @eqhylxx)

203

45,192

Woosuk Kwon · Jan 27, 2025 · 8:21 PM UTC

Woosuk Kwon

@woosuk_k

27 Jan 2025

As one of the fastest-growing OSS projects, vLLM inevitably accumulated some technical debts. We noticed it, and re-architected vLLM's core with careful engineering. Enjoy simpler code & higher performance with vLLM V1!

vLLM

@vllm_project

27 Jan 2025

🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.

215

31,225

Woosuk Kwon · Sep 13, 2023 · 11:51 PM UTC

Woosuk Kwon

@woosuk_k

13 Sep 2023

Exciting news! 🎉Our PagedAttention paper is now up on arXiv! Dive in to learn why it's an indispensable technique for all major LLM serving frameworks. @zhuohan123 and I will present it at @sospconf next month. Blog post: vllm.ai Paper: arxiv.org/abs/2309.06180

183

25,907

Woosuk Kwon · Aug 19, 2024 · 8:35 PM UTC

Woosuk Kwon

@woosuk_k

19 Aug 2024

This Wednesday (8/21) I will be speaking about the diverse hardware support in vLLM, with a focus on AMD GPUs and Google TPUs. Sign up to learn more about vLLM! neuralmagic.com/community-of…

178

14,434

Woosuk Kwon · Mar 18, 2025 · 6:58 PM UTC

Woosuk Kwon

@woosuk_k

18 Mar 2025

vLLM ❤️ @nvidia Dynamo

vLLM

@vllm_project

18 Mar 2025

Replying to @vllm_project

We are grateful for the trust in vLLM ❤️

149

9,772

Woosuk Kwon · Feb 21, 2025 · 7:00 PM UTC

Woosuk Kwon

@woosuk_k

21 Feb 2025

Let's make B200 go brrr 🚀 Huge thanks @nvidia for supporting us!

vLLM

@vllm_project

21 Feb 2025

We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development! Thank you @nvidia!

145

9,536

Woosuk Kwon · Jun 27, 2024 · 9:01 PM UTC

Woosuk Kwon

@woosuk_k

27 Jun 2024

Gemma 2 is also available in vLLM! 🎉github.com/vllm-project/vllm… Check out the update in the main branch and stay tuned for the next release coming soon

[Model] Add Gemma 2 by WoosukKwon · Pull Request #5908 · vllm-project/vllm

This PR adds Gemma 2, a new family of open LLMs from Google. Two major issues to note: Attention logit soft-capping: Gemma 2 models soft-cap the attention logits. This requires changes to all the ...

github.com

Google DeepMind

@GoogleDeepMind

27 Jun 2024

We're excited to unveil Gemma 2. 🛠️ Available in both 9B and 27B parameters, it delivers the best performance for its size - unlocking more possibilities for developers to build and deploy with AI. → dpmd.ai/45Q6yba

139

15,209

Woosuk Kwon · Apr 11, 2025 · 6:27 AM UTC

Woosuk Kwon

@woosuk_k

11 Apr 2025

Replying to @jxmnop

He’s a legend. Never seen anyone so focused, productive, and kind🔥

150

25,844

Woosuk Kwon · Dec 18, 2023 · 6:39 PM UTC

Woosuk Kwon

@woosuk_k

18 Dec 2023

In vLLM v0.2.6, we've introduced CUDA/HIP graph for faster model execution, and added GPTQ support (finally!). More optimizations and feature are coming... so stay tuned! github.com/vllm-project/vllm…

Release v0.2.6 · vllm-project/vllm

Major changes Fast model execution with CUDA/HIP graph W4A16 GPTQ support (thanks to @chu-tianxiang) Fix memory profiling with tensor parallelism Fix *.bin weight loading for Mixtral models What&...

github.com

119

10,484

Woosuk Kwon · Dec 15, 2023 · 10:09 PM UTC

Woosuk Kwon

@woosuk_k

15 Dec 2023

vLLM + AMD MI300X = Blazingly-fast LLM serving! 🚀🚀🚀

AMD

@AMD

15 Dec 2023

Update: Let's look at some new inference performance data on AMD Instinct MI300X community.amd.com/t5/instinc…

12,831

Woosuk Kwon · Mar 19, 2025 · 11:40 PM UTC

Woosuk Kwon

@woosuk_k

19 Mar 2025

We are super excited to host an inference night with @ollama next Thursday! See you all there!!

ollama

@ollama

19 Mar 2025

.@vllm_project and Ollama are hosting an inference night at @ycombinator San Francisco. ❤️ Let's go open source! Come meet: vLLM project leads (@simon_mo_ and @woosuk_k) Ollama maintainers startup founders / engineers RSVP required 👇👇👇

ALT vLLM and Ollama driving fast to serve you.

10,138

Woosuk Kwon · Dec 11, 2023 · 7:45 PM UTC

Woosuk Kwon

@woosuk_k

11 Dec 2023

Check out the Mistral's official inference code at vLLM! github.com/vllm-project/vllm

Zhuohan Li

@zhuohan123

11 Dec 2023

Excited to have first-hand official support of the Mixtral MoE model in vLLM from @MistralAI! Getting started with Mixtral with the latest vLLM now: github.com/vllm-project/vllm. Be sure to check their announcing blog: mistral.ai/news/mixtral-of-e… Joint with @woosuk_k @PierreStock

8,590

Woosuk Kwon · Mar 1, 2024 · 11:07 PM UTC

Woosuk Kwon

@woosuk_k

1 Mar 2024

The new vLLM release includes some optimizations for Gemma and Mixtral, and finally supports 8-bit GPTQ. Please give it a try!

Simon Mo

@simon_mo_

1 Mar 2024

vLLM v0.3.3 is released with Starcoder2 @BigCodeProject and Inferentia @awscloud support. I'm also excited about the addition of guided decoding* (JSON, regex) in server leveraging @OutlinesOSS. *experimental, the schema take some time to compile but will be cached.

7,043

Woosuk Kwon · Apr 10, 2025 · 6:18 PM UTC

Woosuk Kwon

@woosuk_k

10 Apr 2025

Huge congrats to all the @googlecloud and @RedHat_AI team members who drove this effort!

vLLM

@vllm_project

10 Apr 2025

spotted @vllm_project at @googlecloud next keynote today!

4,129

Woosuk Kwon · Mar 11, 2025 · 3:50 AM UTC

Woosuk Kwon

@woosuk_k

11 Mar 2025

Check out this great work from @0xsling0! vLLM also greatly benefits from the kernels.

Shanli Xing @shanli_xing

11 Mar 2025

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…

4,843

Woosuk Kwon · Mar 12, 2025 · 9:14 AM UTC

Woosuk Kwon

@woosuk_k

12 Mar 2025

Gemma 3 🚀🚀🚀 github.com/vllm-project/vllm…

[Model] Add support for Gemma 3 by WoosukKwon · Pull Request #14660 · vllm-project/vllm

This PR adds the support for Gemma 3, an open-source vision-language model from Google. NOTE: The PR doesn't implement the pan-and-scan pre-processing algorithm. It will be implemented by ...

github.com

3,006

Woosuk Kwon · Jul 23, 2024 · 3:23 PM UTC

Woosuk Kwon

@woosuk_k

23 Jul 2024

We've partnered with @AIatMeta to support the 405B model from Day 1. Enjoy!

vLLM

@vllm_project

23 Jul 2024

🚀 Exciting news! In partnership with @AIatMeta, vLLM officially supports Llama 3.1! 🦙✨ For Llama 3.1 405B, vLLM supports FP8 quantization on single machine and pipeline parallelism for multi-node serving. Learn more in our latest blog post: blog.vllm.ai/2024/07/23/llam…

2,974

Woosuk Kwon · Dec 9, 2024 · 9:17 PM UTC

Woosuk Kwon

@woosuk_k

9 Dec 2024

vLLM ❤️ PyTorch

vLLM

@vllm_project

9 Dec 2024

Open-source innovation is part of the vLLM’s DNA, and we love the PyTorch ecosystem! Together, let's push the boundaries of AI innovation and make it accessible to all💪

1,802

Woosuk Kwon · Jul 25, 2024 · 10:30 PM UTC

Woosuk Kwon

@woosuk_k

25 Jul 2024

Linux Foundation is home to many important open source projects like Linux and PyTorch. Today we are excited to announce that @vllm_project is joining @LFAIDataFdn, an AI-focused sub-foundation under the Linux Foundation. vLLM will keep open and trusted!

vLLM

@vllm_project

25 Jul 2024

Two exciting updates! * vLLM is already widely adopted, and we want to ensure it has open governance and longevity. We are starting to join @LFAIDataFdn! * We are doubling down in performance. Please checkout our roadmap. blog.vllm.ai/2024/07/25/lfai…

2,782

Woosuk Kwon · Apr 3, 2024 · 6:32 AM UTC

Woosuk Kwon

@woosuk_k

3 Apr 2024

We finally made a Twitter account for vLLM @vllm_project! Please follow this account for the latest updates on vLLM!

Cade Daniel 🇺🇸

@cdnamz

3 Apr 2024

vLLM made a Twitter! Go give them a follow @vllm_project And fun vLLM meetup btw! Thanks for hosting @Roblox

3,977

Woosuk Kwon · Dec 14, 2023 · 8:31 AM UTC

Woosuk Kwon

@woosuk_k

14 Dec 2023

We've just released v0.2.5 which includes this performance improvement (contributed by Antoni at @anyscalecompute). Please try it out!

Matt Shumer

@mattshumer_

14 Dec 2023

Looks like Mixtral on vLLM is about to get a LOT faster github.com/vllm-project/vllm…

73,906

Woosuk Kwon · Dec 18, 2023 · 10:01 PM UTC

Woosuk Kwon

@woosuk_k

18 Dec 2023

Great news! Phi-2 also works with vLLM and greatly benefits from our newest integration with CUDA graphs. Give it a try on vLLM!

clem 🤗

@ClementDelangue

18 Dec 2023

Phi-2 by @MicrosoftAI is now the #1 trending model on @huggingface (hf.co/models). 2024 will be the year of smoll AI models!

3,052

Woosuk Kwon · Mar 3, 2025 · 11:26 PM UTC

Woosuk Kwon

@woosuk_k

3 Mar 2025

@vllm_project just reached 40K GitHub stars! 🚀 It's incredible to see how the project has grown.

Harry Mellor @hmellor_

3 Mar 2025

The @vllm_project has just hit the 40k star milestone on GitHub! Congratulations to all involved!

1,057

Woosuk Kwon · Aug 24, 2024 · 6:12 PM UTC

Woosuk Kwon

@woosuk_k

24 Aug 2024

Always happy to see this kind of open source release. Thanks for the huge contributions to the community!

Byron Hsu

@hsu_byron

23 Aug 2024

(1/n) Training LLMs can be hindered by out-of-memory, scaling batch size, and seq length. Add one line to boost multi-GPU training throughput by 20% and reduce memory usage by 60%. Introducing Liger-Kernel: Efficient Triton Kernels for LLM Training. github.com/linkedin/Liger-Ke…

2,212

Woosuk Kwon · Jan 28, 2024 · 1:30 AM UTC

Woosuk Kwon

@woosuk_k

28 Jan 2024

Thanks for your continuous contribution to vLLM! Really happy to have you in our community :)

Casper Hansen

@casper_hansen_

27 Jan 2024

AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗

4,693

Woosuk Kwon · Apr 8, 2024 · 10:57 PM UTC

Woosuk Kwon

@woosuk_k

8 Apr 2024

We are still actively looking for new reviewers! Please feel free to join us!

vLLM

@vllm_project

8 Apr 2024

Replying to @vllm_project

We are doubling our committer base for vLLM to ensure it is best-in-class and a truly community effort. This is just a start. Let's welcome @KaichaoYou, @pcmoritz, @nickhill33, @rogerw0108, @cdnamz, @robertshaw21 as committers and thank you for your great work! 👏

1,588

Woosuk Kwon · Feb 13, 2025 · 4:46 AM UTC

Woosuk Kwon

@woosuk_k

13 Feb 2025

Congrats! vLLM wouldn't be possible without you.

Roger Wang

@rogerw0108

13 Feb 2025

Robert and I started contributing to vLLM around the same time and today is my turn. Back then vLLM had only about 30 contributors. One year later, today the project has received contributions from 800+ community members! and we're just getting started github.com/vllm-project/vllm…

1,399

Woosuk Kwon · Sep 5, 2023 · 5:16 PM UTC

Woosuk Kwon

@woosuk_k

5 Sep 2023

Please come see us at Ray Summit! @zhuohan123 and I will talk about vLLM, a state-of-the-art open-source LLM inference engine, which actually uses #Ray for distributed inference. Don't miss it!

Robert Nishihara

@robertnishihara

5 Sep 2023

Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in Firefly 🧑‍💻 @jeffreyhuber on the Chroma vector DB 🧑‍🏫 @weights_biases on LLM observability 🧑‍🎓 @Uber @Airbnb @LinkedIn on their LLMs products 🧑‍🌾 @awscloud on Inferentia and Trainium 🧑‍💼 @googlecloud on LLMs on TPUs This is an unbelievable list. You'll also hear the nitty-gritty details of how AI gets done at @Spotify @NianticLabs @Instacart @Pinterest @Samsara @DoorDash @netflix @AntGroup @InstabaseInc @SnorkelAI @NetEaseGames_EN @LockheedMartin @clarihq and many more. On top of all that, we're running a full day of hands-on trainings where you'll go through the motions and actually build the following 🖥️ ✅ RAG versus fine-tuning ✅ Running LLMs in production ✅ Building products around stable-diffusion models ✅ Delivering AI applications at scale raysummit.anyscale.com/

1,648

Woosuk Kwon · Feb 5, 2024 · 9:26 PM UTC

Woosuk Kwon

@woosuk_k

5 Feb 2024

Super cool! This seems like the ultimate library everyone has wanted to have. Great work @ye_combinator !!

Zihao Ye @ye_combinator

5 Feb 2024

(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include: - Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page Table, Ragged Tensor, etc.) for both single-request and batch-serving scenarios. - Optimized Shared-Prefix Batch Decoding: 31x faster than vLLM's Page Attention implementation for long prompt large batch decoding. - Efficient Attention for Compressed KV-Cache: optimized grouped-query attention with Tensor Cores (3x faster than vLLM's GQA), fused-RoPE attention, and high-performance quantized attention. Check our blog and code at: 1. flashinfer.ai/2024/02/02/int… 2. github.com/flashinfer-ai/fla…

1,892

Woosuk Kwon · Jun 20, 2023 · 7:20 PM UTC

Woosuk Kwon

@woosuk_k

20 Jun 2023

Make LLM serving easy and fast with vLLM!

Zhuohan Li

@zhuohan123

20 Jun 2023

🌟 Thrilled to introduce vLLM with @woosuk_k! 🚀 vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena. Github: github.com/vllm-project/vllm Blog: vllm.ai/

4,333

Woosuk Kwon · Sep 9, 2024 · 11:45 PM UTC

Woosuk Kwon

@woosuk_k

9 Sep 2024

The vLLM meetup with NVIDIA is happening now! Join us and learn about our latest updates!

NVIDIA AI Developer

@NVIDIAAIDev

19 Aug 2024

Dive into the latest in #AI with expert talks, updates, and networking. Join us for the @vllm_project & NVIDIA Triton User Meetup. 📅 Mon, Sept 9, 4-9 PM 📍Gallery 308, Fort Mason, SF. Secure your spot now ➡️ nvda.ws/3X9GIvR ✨

1,304

Woosuk Kwon · Mar 15, 2024 · 2:22 AM UTC

Woosuk Kwon

@woosuk_k

15 Mar 2024

Our 3rd meetup is happening on April 2nd. Come join us!

Simon Mo

@simon_mo_

15 Mar 2024

The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will be discussing feature updates and hear from you! We thank @Roblox for hosting the event! robloxandvllmmeetup2024.spla…

2,415

Woosuk Kwon · Sep 3, 2023 · 3:24 AM UTC

Woosuk Kwon

@woosuk_k

3 Sep 2023

Replying to @chrissuperagent

Thanks for reporting the bug! The paged kernel is now fixed by github.com/vllm-project/vllm…

[BugFix] Fix NaN errors in paged attention kernel by WoosukKwon · Pull Request #936 · vllm-projec...

Fixes #641 This PR fixes the paged attention kernel. Currently, the kernel computes attn_weight * value for all tokens in a value block, even if some of them are not included in the context. It is ...

github.com

738

Woosuk Kwon · Dec 10, 2024 · 8:52 PM UTC

Woosuk Kwon

@woosuk_k

10 Dec 2024

I’ll be at #NeurIPS 2024 this week and happy to chat about LLM inference. Feel free to reach out!

1,195

Woosuk Kwon · May 14, 2024 · 10:01 PM UTC

Woosuk Kwon

@woosuk_k

14 May 2024

Join us for our 4th meetup on June 11th. We hope to see you there!

vLLM

@vllm_project

14 May 2024

We are holding the 4th vLLM meetup at @Cloudflare with @bentomlai on June 11. Join us to discuss what's next in production LLM serving! Register at lu.ma/agivllm

1,683

Woosuk Kwon · Feb 6, 2025 · 9:17 PM UTC

Woosuk Kwon

@woosuk_k

6 Feb 2025

Multimodal LLM is now a first-class citizen in vLLM! Fantastic talk by @rogerw0108 🔥

Red Hat AI

@RedHat_AI

6 Feb 2025

[vLLM Office Hours #19] Multimodal LLMs With vLLM v1 nitter.app/i/broadcasts/1BRJjmzkB…

969

Woosuk Kwon · Sep 11, 2023 · 5:00 PM UTC

Woosuk Kwon

@woosuk_k

11 Sep 2023

Excited to see that our paged attention algorithm is adopted! Looking forward to it!

NVIDIA AI Developer

@NVIDIAAIDev

8 Sep 2023

Just announced - NVIDIA TensorRT-LLM supercharges large language model #inference on NVIDIA H100 Tensor Core GPUs. #LLM nvda.ws/3ZcmiC3

2,156

Woosuk Kwon · Jun 29, 2023 · 5:17 PM UTC

Woosuk Kwon

@woosuk_k

29 Jun 2023

Getting and managing GPUs on cloud has grown increasingly challenging nowadays. Check out our latest blog post with @skypilot_org to discover how SkyPilot is making the development and deployment of vLLM easier!

SkyPilot

@skypilot_org

29 Jun 2023

UC Berkeley's vLLM + SkyPilot speeds up LLM serving by 24x 🤩 Our user blog post on how SkyPilot combated GPU availability for #vLLM, allowing them to focus on AI and not infra. (Also includes a 1-click guide to run it on your own cloud account!) blog.skypilot.co/serving-llm…

931

Woosuk Kwon · Aug 9, 2024 · 1:57 AM UTC

Woosuk Kwon

@woosuk_k

9 Aug 2024

The next vLLM Meetup will be hosted with NVIDIA! Please join us on September 9th!

vLLM

@vllm_project

8 Aug 2024

Join us on Monday, September 9 at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with NVIDIA Triton team. lu.ma/87q3nvnh

983

Woosuk Kwon · Feb 26, 2025 · 2:43 AM UTC

Woosuk Kwon

@woosuk_k

26 Feb 2025

Congrats! 🎉 Well deserved. And thanks for contributing it to @vllm_project! 🥰

Kimi.ai

@Kimi_Moonshot

26 Feb 2025

To bring Mooncake's novel KVCache-centric architecture to the open-source community, the Mooncake and vLLM teams have collaborated on a multi-stage roadmap. This integration will introduce P/D disaggregation and a global KVCache design to vLLM. Check out the design doc and join our discussion in the feat-prefill-disaggregation channel on vLLM's Slack, and explore the initial PR here: github.com/vllm-project/vllm… We look forward to the community's feedback and contributions!

1,143

Woosuk Kwon · Jan 30, 2025 · 5:54 PM UTC

Woosuk Kwon

@woosuk_k

30 Jan 2025

Awesome work by @KaichaoYou making vLLM integrate smoothly with RLHF frameworks! 🔥

Costa Huang

@vwxyzjn

30 Jan 2025

Replying to @vwxyzjn

Finally, I want to give a special thanks to the @vllm_project team (@KaichaoYou @woosuk_k @simon_mo_ @zhuohan123) for their invaluable support in debugging NCCL weight transfer issues. They made our 70 RLVR weight transfer 45x faster and 405B RLVR even possible! See github.com/vllm-project/vllm…

900

Woosuk Kwon · Feb 5, 2025 · 9:37 AM UTC

Woosuk Kwon

@woosuk_k

5 Feb 2025

Replying to @cHHillee @tri_dao @typedfemale

Cascade attention by @ye_combinator et al. also leveraged the same idea: flashinfer.ai/2024/02/02/cas… Highly recommend reading it!

Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding

Many LLM inference tasks involves multiple independent text generation from a shared prefix (prompt), e.g. Self-Consistency, Tree of Thoughts and Skeleton-of-thought. Serving LLMs with common prefix...

flashinfer.ai

677

Woosuk Kwon · Feb 11, 2025 · 6:48 PM UTC

Woosuk Kwon

@woosuk_k

11 Feb 2025

The next vLLM meetup is hosted by @Meta on Feb 27th! See you at Menlo Park!

vLLM

@vllm_project

11 Feb 2025

We are excited to invite you to our Menlo Park meetup with @Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and committer @CodyHaoYu will share updates from the v0.7.x series of releases. lu.ma/h7g3kuj9

687

Woosuk Kwon · Feb 27, 2025 · 7:40 AM UTC

Woosuk Kwon

@woosuk_k

27 Feb 2025

Thank you @deepseek_ai !

vLLM

@vllm_project

27 Feb 2025

Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16%! We expect more improvements in the coming days as we continue to optimize the code path. github.com/vllm-project/vllm…

1,319

Woosuk Kwon · Sep 22, 2023 · 7:46 PM UTC

Woosuk Kwon

@woosuk_k

22 Sep 2023

Please join our first meetup and share your experience with vLLM!

Zhuohan Li

@zhuohan123

22 Sep 2023

We are excited to announce the first vLLM Bay Area meetup at 6pm on 10/5 (Thu)! Please find the event details and RSVP at: lu.ma/first-vllm-meetup. The vLLM team will give a deep dive of vLLM and show the future roadmap. We will also have vLLM users and contributors share their experiences. Thank @a16z for providing the venue and supporting open source.

1,406

Woosuk Kwon · Apr 16, 2024 · 4:16 PM UTC

Woosuk Kwon

@woosuk_k

16 Apr 2024

Congrats!!!

Zhanghao Wu

@Michaelvll1

16 Apr 2024

I am honored to share that our recent paper won the Outstanding Paper Award in NSDI’24! The paper explores the policy design of our SkyPilot managed spot for @skypilot_org: Can’t Be Late: Optimizing Spot Instance Savings under Deadlines It would not be possible, if it were not with the fantastic folks and advisors: @infwinston @ziming_mao @zongheng_yang Eric Friedman, Scott Shenker and Ion Stoica; and the whole SkyPilot team @skypilot_org

1,447

Woosuk Kwon · Sep 12, 2024 · 3:30 AM UTC

Woosuk Kwon

@woosuk_k

12 Sep 2024

Try out Pixtral with vLLM!

vLLM

@vllm_project

12 Sep 2024

🖼️ pip install -U vLLM vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens 16384 github.com/vllm-project/vllm…

982

Woosuk Kwon · Jun 22, 2023 · 8:33 PM UTC

Woosuk Kwon

@woosuk_k

22 Jun 2023

Check out this wonderful article from @anyscalecompute ! It shows vLLM delivers 23x speedup on OPT-13B!

Cade Daniel 🇺🇸

@cdnamz

22 Jun 2023

I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at anyscale.com/blog/continuous…

1,709

Woosuk Kwon · Dec 18, 2024 · 9:04 PM UTC

Woosuk Kwon

@woosuk_k

18 Dec 2024

Try Bamba with vLLM!

Raghu Ganti

@RaghuGanti

18 Dec 2024

🚀 Exciting News! 🚀 In a joint effort between IBM Research, Princeton, CMU, and UIUC, we are thrilled to announce the release of our high-performing hybrid Mamba2 model! This model is trained entirely on open datasets, and we’re releasing intermediate and final checkpoints to enable community experimentation. 🔗 Read more: huggingface.co/blog/bamba Key Takeaways ⚡ Inference Efficiency The Bamba-9B model delivers significant improvements in throughput and latency, enhancing real-time application performance. Benchmarking with vLLM against Llama 3.1 8B for long contexts shows: 🔹 2.5x throughput improvement 🔹 2x lower latency And this is just the beginning – further optimizations are on the way! 🏆 Competitive Benchmarks Bamba-9B performs competitively with state-of-the-art transformer models like Meta Llama 3.1 8B. It matches average benchmark performance (excluding math and MMLU tasks), with clear opportunities to close gaps through extended training and math-focused datasets. 🤝 Open Collaboration Developed entirely with open data, this effort emphasizes transparency and reproducibility, strengthening the foundations of the open-source AI community. 📂 For details, access to the model, and resources, check out the Bamba GitHub repository: github.com/foundation-model-… Let’s collaborate, experiment, and innovate together! 🔍✨ @tri_dao @_albertgu @MinjiaZhang -- it is a great collaboration and look forward to continuing to work with you.

1,295

Woosuk Kwon · Jul 3, 2024 · 10:59 PM UTC

Woosuk Kwon

@woosuk_k

3 Jul 2024

Check out the improved multi-modality support in @vllm_project by @rogerw0108!

Roger Wang

@rogerw0108

3 Jul 2024

More exciting news around multi-modality in the upcoming v0.5.1 @vllm_project release! With a much simpler interface, vLLM will now support dynamic image embedding shapes for models such as LLaVA-NeXT and Phi-3-Vision!

939

Woosuk Kwon · Jul 15, 2024 · 6:01 PM UTC

Woosuk Kwon

@woosuk_k

15 Jul 2024

We will have the 5th vLLM meetup with AWS next Wednesday! Please join us and learn more about our recent progress!

vLLM

@vllm_project

15 Jul 2024

We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress in FP8, pipeline parallel, and various work on perf. The space is limited to 150 so plz register ASAP! lu.ma/lp0gyjqr

1,204

Woosuk Kwon · Feb 27, 2025 · 6:45 PM UTC

Woosuk Kwon

@woosuk_k

27 Feb 2025

Our meetup with @Meta happens this evening! See you at Menlo Park.

vLLM

@vllm_project

11 Feb 2025

1,292

Woosuk Kwon · Jan 16, 2024 · 6:16 AM UTC

Woosuk Kwon

@woosuk_k

16 Jan 2024

Please check out our second meetup on Jan 31st!!

Simon Mo

@simon_mo_

15 Jan 2024

We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us. lu.ma/ygxbpzhl

1,575

Woosuk Kwon · Jan 27, 2025 · 8:30 PM UTC

Woosuk Kwon

@woosuk_k

27 Jan 2025

Replying to @casper_hansen_

Thanks for sharing! To clarify: we didn't start from scratch - we reimplemented core components like the scheduler while keeping most of vLLM's existing codebase.

552

Woosuk Kwon · Apr 1, 2024 · 10:10 PM UTC

Woosuk Kwon

@woosuk_k

1 Apr 2024

Replying to @HamelHusain

Why not use vLLM? :)

633

Woosuk Kwon · Dec 13, 2023 · 10:07 PM UTC

Woosuk Kwon

@woosuk_k

13 Dec 2023

Congratulations to everyone in the second batch! 🔥🔥🔥

Matt Bornstein

@BornsteinMatt

13 Dec 2023

We're announcing the second batch of @a16z open source AI grants today This cohort focuses on: ▶️ tools for LLM training/ hosting/ evals ▶️ visual AI models & communities Thank you to the grantees for your contributions! More info in the linked post a16z.com/announcing-our-late…

798

Woosuk Kwon · Mar 15, 2025 · 4:29 AM UTC

Woosuk Kwon

@woosuk_k

15 Mar 2025

@verl_project now uses vLLM V1! Upgrade veRL to enjoy the faster and more stable speed :)

Haibin @eric_haibin_lin

14 Mar 2025

Recent updates on @verl_project (RL lib for LLMs): Engine: - Megatron qwen & GRPO support, v0.11 upgrade - vllm v0.7 integration with v1 mode - experimental sglang integration Algorithm & recipes: - vision language reasoning with qwen2.5-vl - PRIME, RLOO, remax, math-verify rewards, etc Docs: - tutorial for distributed training setup and debugging - programming model tutorial Hardware: - experimental AMD support And many awesome community projects such as code-R1, Easy-R1, Search-R1, RAGEN, etc. Big thank you to the community! Working on multi-turn & environment/tool supports. Stay tuned...

629

Woosuk Kwon · Mar 7, 2025 · 8:31 PM UTC

Woosuk Kwon

@woosuk_k

7 Mar 2025

@vllm_project goes brrr on 5080 🚀

Roger Wang

@rogerw0108

7 Mar 2025

Not every 5080 is signed by the one and only @ia_buck - thank you and @nvidia so much for letting me test out getting @vllm_project running on Blackwell! GPUs Go Brrr from home! Try it out yourself with instructions here! github.com/vllm-project/vllm…

266

Woosuk Kwon · Aug 6, 2024 · 4:11 PM UTC

Woosuk Kwon

@woosuk_k

6 Aug 2024

Welcome!

Red Hat AI

@RedHat_AI

6 Aug 2024

🎉 Exciting news! Tyler Smith, one of our many talented engineers, is now Neural Magic's 3rd vLLM project committer! Check out Tyler's contributions: github.com/tlrmchlsmth. We’re proud to be a leading contributor to @vllm_project. 🚀 Cheers to Tyler and the team!

753

Woosuk Kwon · Nov 15, 2023 · 12:15 AM UTC

Woosuk Kwon

@woosuk_k

15 Nov 2023

For those who cannot access the link, please try this one instead: vllm-project.github.io/2023/…

Notes on vLLM v.s. DeepSpeed-FastGen

TL;DR:

vllm-project.github.io

1,319

Woosuk Kwon · Jan 1, 2025 · 1:57 PM UTC

Woosuk Kwon

@woosuk_k

1 Jan 2025

Replying to @rogerw0108

Absolutely love what you did in 2024 Let's keep killing it! 🔥

257

Woosuk Kwon · Apr 30, 2024 · 8:17 PM UTC

Woosuk Kwon

@woosuk_k

30 Apr 2024

Check out @luo_michael1234's awesome work on picking the best LoRA adapters to create crisp images!

Michael Luo

@michaelzluo

30 Apr 2024

[1/5] Introducing Stylus 🖌️ - an #AI tool that automatically finds and adds the best adapters (LoRAs, Textual Inversions, Hypernetworks) to #StableDiffusion based on your prompt. 🗞️ Paper: arxiv.org/abs/2404.18928 🌎 Project Page: stylus-diffusion.github.io

ALT Stable Diffusion, LoRA, Textual Inversion, Hypernetwork, Retrieval Augmented Generation, CivitAI, Huggingface

979

Woosuk Kwon · Mar 21, 2025 · 3:27 PM UTC

Woosuk Kwon

@woosuk_k

21 Mar 2025

Replying to @natolambert

It's coming to vLLM as well! 🔥 github.com/vllm-project/vllm…

[Model] Add Qwen3 and Qwen3MoE by YamPengLi · Pull Request #15289 · vllm-project/vllm

Description Recently, I have submitted a pull request to Hugging Face Transformers containing the implementation of the Qwen3 and Qwen3MoE model. I would also like to contribute these new modelsto ...

github.com

283

Woosuk Kwon · Dec 11, 2023 · 8:31 PM UTC

Woosuk Kwon

@woosuk_k

11 Dec 2023

Replying to @andrey_cheptsov

Yes! `pip install vllm megablocks` is all you need. 🚀🚀🚀

129

Woosuk Kwon · Mar 19, 2025 · 10:24 PM UTC

Woosuk Kwon

@woosuk_k

19 Mar 2025

@simon_mo_ and I are at #GTC2025 this afternoon! Please let us know if you want to chat about @vllm_project and LLM inference!

119

Woosuk Kwon · May 16, 2025 · 2:08 AM UTC

Woosuk Kwon

@woosuk_k

16 May 2025

Replying to @ye_combinator @yi_xin_dong @lmsysorg

Congrats!

194

Woosuk Kwon · Jun 21, 2023 · 12:52 AM UTC

Woosuk Kwon

@woosuk_k

21 Jun 2023

Replying to @Teknium @teknium @zhuohan123

You are right. The majority of the vLLM’s speedup comes by batching more prompts every run. However, you can get some speedup even in a single user env, because vLLM also includes some other optimizations orthogonal to PagedAttention.

Woosuk Kwon · Dec 6, 2023 · 1:34 AM UTC

Woosuk Kwon

@woosuk_k

6 Dec 2023

Amazing!! 🤣🤣

Rajko Radovanović

@rajko_rad

5 Dec 2023

Image conditioning on @ideogram_ai is awesome! 3D render of re-imagined vLLM logo below :) cc @zhuohan123 @woosuk_k

418

Woosuk Kwon · Jun 20, 2023 · 8:32 PM UTC

Woosuk Kwon

@woosuk_k

20 Jun 2023

Replying to @netapy @zhuohan123 @lmsysorg

@netapy Thanks for your interest and great question! Currently, we don't compile the models. We're currently exploring if torch.compile is suitable for us, or if we can enhance performance by optimizing the code for individual models.

142

Woosuk Kwon · Jul 30, 2024 · 9:07 PM UTC

Woosuk Kwon

@woosuk_k

30 Jul 2024

Replying to @vsreekanti @RunLLM @vllm_project

It's so useful! It not only covers the docs, but also links to our Github issues! Thanks for adding it to vLLM!

211

Woosuk Kwon · Jul 7, 2024 · 2:18 AM UTC

Woosuk Kwon

@woosuk_k

7 Jul 2024

Replying to @natolambert

@natolambert Could you please share more about the error you faced with? FlashInfer is actually required for Gemma2. You can install it with `pip install github.com/flashinfer-ai/fla…`

216

Woosuk Kwon · Mar 4, 2025 · 7:50 PM UTC

Woosuk Kwon

@woosuk_k

4 Mar 2025

Replying to @cHHillee @thinkymachines

Congrats!!

684

Woosuk Kwon · Aug 4, 2024 · 10:06 PM UTC

Woosuk Kwon

@woosuk_k

4 Aug 2024

Replying to @natolambert @mgoin_ @vllm_project

You can pip install the nightly version! export VLLM_VERSION=0.5.3.post1 export VLLM_COMMIT=16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8 pip install vllm-wheels.s3.us-west-2.ama…${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl

205

Woosuk Kwon · Jan 28, 2025 · 2:19 AM UTC

Woosuk Kwon

@woosuk_k

28 Jan 2025

Replying to @StasBekman @SnowflakeDB

Congrats!

202

Woosuk Kwon · Jan 29, 2025 · 3:33 AM UTC

Woosuk Kwon

@woosuk_k

29 Jan 2025

Replying to @michaelzluo

Thanks! 😃 Great ideas. Will definitely explore them!

105

Woosuk Kwon · Mar 21, 2025 · 4:02 AM UTC

Woosuk Kwon

@woosuk_k

21 Mar 2025

Replying to @jackminong @eric_haibin_lin @vllm_project

Would you please provide a reproducible example? We really want to investigate this. Thanks!

280

Woosuk Kwon · Mar 19, 2025 · 8:28 PM UTC

Woosuk Kwon

@woosuk_k

19 Mar 2025

Replying to @casper_hansen_

Thanks for bringing it up. We are working on it!

157

Woosuk Kwon · Jan 30, 2025 · 5:00 PM UTC

Woosuk Kwon

@woosuk_k

30 Jan 2025

Replying to @natolambert @vllm_project

Amazing work!!

Woosuk Kwon · Sep 20, 2024 · 11:04 AM UTC

Woosuk Kwon

@woosuk_k

20 Sep 2024

Replying to @KaichaoYou

Congrats!!!

310

Woosuk Kwon · Jul 30, 2024 · 9:09 PM UTC

Woosuk Kwon

@woosuk_k

30 Jul 2024

Replying to @sewon__min @uwcse @UCBerkeley @berkeley_ai @BerkeleyNLP @allen_ai

Congrats!

1,454

Woosuk Kwon · Apr 24, 2024 · 11:09 PM UTC

Woosuk Kwon

@woosuk_k

24 Apr 2024

Replying to @ziming_mao @anuragk_

Amazing!! Congrats!! 👊👊

262