Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge

Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize.

The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us

vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users.

Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source

vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us

Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team
181
129
1,152
484,163
Developing @vllm_project taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU. Over the past month, the vLLM community conducted an in-depth study and made key optimizations, leading to significant performance improvements. Check out our findings and latest updates!
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. blog.vllm.ai/2024/09/05/perf…
7
21
249
37,377
We’ve just released a new blog post comparing vLLM with DeepSpeed-FastGen. While we are happy to see the open-source technology advancements from the DeepSpeed team, we’ve got different results with more extensive performance benchmarks. vLLM is actually faster than DeepSpeed in many common scenarios. Details here: blog.vllm.ai/2023/11/14/note… (written with @zhuohan123, @simon_mo_, @eqhylxx)
3
30
203
45,192
As one of the fastest-growing OSS projects, vLLM inevitably accumulated some technical debt. We noticed it and re-architected vLLM's core with careful engineering. Enjoy simpler code & higher performance with vLLM V1!
🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
2
16
215
31,225
Exciting news! 🎉Our PagedAttention paper is now up on arXiv! Dive in to learn why it's an indispensable technique for all major LLM serving frameworks. @zhuohan123 and I will present it at @sospconf next month. Blog post: vllm.ai Paper: arxiv.org/abs/2309.06180
2
32
183
25,907
This Wednesday (8/21) I will be speaking about the diverse hardware support in vLLM, with a focus on AMD GPUs and Google TPUs. Sign up to learn more about vLLM! neuralmagic.com/community-of…
1
16
178
14,434
vLLM ❤️ @nvidia Dynamo
Replying to @vllm_project
We are grateful for the trust in vLLM ❤️
1
4
149
9,772
Let's make B200 go brrr 🚀 Huge thanks @nvidia for supporting us!
We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development! Thank you @nvidia!
3
6
145
9,536
Gemma 2 is also available in vLLM! 🎉 github.com/vllm-project/vllm… Check out the update in the main branch and stay tuned for the next release coming soon
We're excited to unveil Gemma 2. 🛠️ Available in both 9B and 27B parameters, it delivers the best performance for its size - unlocking more possibilities for developers to build and deploy with AI. → dpmd.ai/45Q6yba
3
20
139
15,209
Replying to @jxmnop
He’s a legend. Never seen anyone so focused, productive, and kind🔥
1
2
150
25,844
vLLM + AMD MI300X = Blazingly-fast LLM serving! 🚀🚀🚀
Update: Let's look at some new inference performance data on AMD Instinct MI300X community.amd.com/t5/instinc…
7
83
12,831
We are super excited to host an inference night with @ollama next Thursday! See you all there!!
.@vllm_project and Ollama are hosting an inference night at @ycombinator San Francisco. ❤️ Let's go open source!
Come meet:
vLLM project leads (@simon_mo_ and @woosuk_k)
Ollama maintainers
startup founders / engineers
RSVP required 👇👇👇
3
4
67
10,138
Check out Mistral's official inference code in vLLM! github.com/vllm-project/vllm
Excited to have first-hand official support of the Mixtral MoE model in vLLM from @MistralAI! Get started with Mixtral on the latest vLLM now: github.com/vllm-project/vllm. Be sure to check out their announcement blog post: mistral.ai/news/mixtral-of-e… Joint with @woosuk_k @PierreStock
4
4
63
8,590
The new vLLM release includes some optimizations for Gemma and Mixtral, and finally supports 8-bit GPTQ. Please give it a try!
vLLM v0.3.3 is released with Starcoder2 @BigCodeProject and Inferentia @awscloud support. I'm also excited about the addition of guided decoding* (JSON, regex) in the server, leveraging @OutlinesOSS. *experimental; the schema takes some time to compile but will be cached.
7
65
7,043
Huge congrats to all the @googlecloud and @RedHat_AI team members who drove this effort!
spotted @vllm_project at @googlecloud next keynote today!
3
62
4,129
Check out this great work from @0xsling0! vLLM also greatly benefits from the kernels.
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
4
59
4,843
We've partnered with @AIatMeta to support the 405B model from Day 1. Enjoy!
🚀 Exciting news! In partnership with @AIatMeta, vLLM officially supports Llama 3.1! 🦙✨ For Llama 3.1 405B, vLLM supports FP8 quantization on single machine and pipeline parallelism for multi-node serving. Learn more in our latest blog post: blog.vllm.ai/2024/07/23/llam…
2
48
2,974
vLLM ❤️ PyTorch
Open-source innovation is part of vLLM’s DNA, and we love the PyTorch ecosystem! Together, let's push the boundaries of AI innovation and make it accessible to all💪
3
46
1,802
The Linux Foundation is home to many important open source projects like Linux and PyTorch. Today we are excited to announce that @vllm_project is joining @LFAIDataFdn, an AI-focused sub-foundation under the Linux Foundation. vLLM will stay open and trusted!
Two exciting updates!
* vLLM is already widely adopted, and we want to ensure it has open governance and longevity. We are starting to join @LFAIDataFdn!
* We are doubling down on performance. Please check out our roadmap. blog.vllm.ai/2024/07/25/lfai…
3
3
44
2,782
We finally made a Twitter account for vLLM @vllm_project! Please follow this account for the latest updates on vLLM!
vLLM made a Twitter! Go give them a follow @vllm_project And fun vLLM meetup btw! Thanks for hosting @Roblox
2
5
39
3,977
We've just released v0.2.5 which includes this performance improvement (contributed by Antoni at @anyscalecompute). Please try it out!
Looks like Mixtral on vLLM is about to get a LOT faster github.com/vllm-project/vllm…
3
38
73,906
Great news! Phi-2 also works with vLLM and greatly benefits from our newest integration with CUDA graphs. Give it a try on vLLM!
Phi-2 by @MicrosoftAI is now the #1 trending model on @huggingface (hf.co/models). 2024 will be the year of smoll AI models!
3
34
3,052
@vllm_project just reached 40K GitHub stars! 🚀 It's incredible to see how the project has grown.
The @vllm_project has just hit the 40k star milestone on GitHub! Congratulations to all involved!
2
2
32
1,057
Always happy to see this kind of open source release. Thanks for the huge contributions to the community!
(1/n) Training LLMs can be hindered by out-of-memory, scaling batch size, and seq length. Add one line to boost multi-GPU training throughput by 20% and reduce memory usage by 60%. Introducing Liger-Kernel: Efficient Triton Kernels for LLM Training. github.com/linkedin/Liger-Ke…
2
30
2,212
Thanks for your continuous contribution to vLLM! Really happy to have you in our community :)
AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗
3
27
4,693
We are still actively looking for new reviewers! Please feel free to join us!
Replying to @vllm_project
We are doubling our committer base for vLLM to ensure it is best-in-class and a truly community effort. This is just a start. Let's welcome @KaichaoYou, @pcmoritz, @nickhill33, @rogerw0108, @cdnamz, @robertshaw21 as committers and thank you for your great work! 👏
26
1,588
Congrats! vLLM wouldn't be possible without you.
Robert and I started contributing to vLLM around the same time and today is my turn. Back then vLLM had only about 30 contributors. One year later, the project has received contributions from 800+ community members, and we're just getting started! github.com/vllm-project/vllm…
1
25
1,399
Please come see us at Ray Summit! @zhuohan123 and I will talk about vLLM, a state-of-the-art open-source LLM inference engine, which actually uses #Ray for distributed inference. Don't miss it!
Ray Summit this month will be 🔥🔥
🤯 ChatGPT creator @johnschulman2
🧙‍♀️ @bhorowitz on the AI landscape
🦹‍♂️ @hwchase17 on LangChain
🧑‍🚀 @jerryjliu0 on LlamaIndex
👨‍🎤 @zhuohan123 and @woosuk_k on vLLM
🧜 @zongheng_yang on SkyPilot
🧑‍🔧 @MetaAI on Llama-2
🧚‍♂️ @Adobe on Generative AI in Firefly
🧑‍💻 @jeffreyhuber on the Chroma vector DB
🧑‍🏫 @weights_biases on LLM observability
🧑‍🎓 @Uber @Airbnb @LinkedIn on their LLM products
🧑‍🌾 @awscloud on Inferentia and Trainium
🧑‍💼 @googlecloud on LLMs on TPUs
This is an unbelievable list. You'll also hear the nitty-gritty details of how AI gets done at @Spotify @NianticLabs @Instacart @Pinterest @Samsara @DoorDash @netflix @AntGroup @InstabaseInc @SnorkelAI @NetEaseGames_EN @LockheedMartin @clarihq and many more.
On top of all that, we're running a full day of hands-on trainings where you'll go through the motions and actually build the following 🖥️
✅ RAG versus fine-tuning
✅ Running LLMs in production
✅ Building products around stable-diffusion models
✅ Delivering AI applications at scale
raysummit.anyscale.com/
1
1
24
1,648
Super cool! This seems like the ultimate library everyone has wanted to have. Great work @ye_combinator !!
(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include:
- Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page Table, Ragged Tensor, etc.) for both single-request and batch-serving scenarios.
- Optimized Shared-Prefix Batch Decoding: 31x faster than vLLM's Page Attention implementation for long prompt large batch decoding.
- Efficient Attention for Compressed KV-Cache: optimized grouped-query attention with Tensor Cores (3x faster than vLLM's GQA), fused-RoPE attention, and high-performance quantized attention.
Check our blog and code at:
1. flashinfer.ai/2024/02/02/int…
2. github.com/flashinfer-ai/fla…
1
20
1,892
Make LLM serving easy and fast with vLLM!
🌟 Thrilled to introduce vLLM with @woosuk_k! 🚀 vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena. Github: github.com/vllm-project/vllm Blog: vllm.ai/
4
21
4,333
The vLLM meetup with NVIDIA is happening now! Join us and learn about our latest updates!
Dive into the latest in #AI with expert talks, updates, and networking. Join us for the @vllm_project & NVIDIA Triton User Meetup. 📅 Mon, Sept 9, 4-9 PM 📍Gallery 308, Fort Mason, SF. Secure your spot now ➡️ nvda.ws/3X9GIvR
18
1,304
Our 3rd meetup is happening on April 2nd. Come join us!
The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will discuss feature updates and hear from you! We thank @Roblox for hosting the event! robloxandvllmmeetup2024.spla…
2
1
18
2,415
I’ll be at #NeurIPS 2024 this week and happy to chat about LLM inference. Feel free to reach out!
3
17
1,195
Join us for our 4th meetup on June 11th. We hope to see you there!
We are holding the 4th vLLM meetup at @Cloudflare with @bentomlai on June 11. Join us to discuss what's next in production LLM serving! Register at lu.ma/agivllm
1
2
17
1,683
Multimodal LLM is now a first-class citizen in vLLM! Fantastic talk by @rogerw0108 🔥
[vLLM Office Hours #19] Multimodal LLMs With vLLM v1 nitter.app/i/broadcasts/1BRJjmzkB…
1
16
969
Excited to see that our paged attention algorithm is adopted! Looking forward to it!
Just announced - NVIDIA TensorRT-LLM supercharges large language model #inference on NVIDIA H100 Tensor Core GPUs. #LLM nvda.ws/3ZcmiC3
3
16
2,156
Getting and managing GPUs on cloud has grown increasingly challenging nowadays. Check out our latest blog post with @skypilot_org to discover how SkyPilot is making the development and deployment of vLLM easier!
UC Berkeley's vLLM + SkyPilot speeds up LLM serving by 24x 🤩 Our user blog post on how SkyPilot combated GPU availability for #vLLM, allowing them to focus on AI and not infra. (Also includes a 1-click guide to run it on your own cloud account!) blog.skypilot.co/serving-llm…
16
931
The next vLLM Meetup will be hosted with NVIDIA! Please join us on September 9th!
Join us on Monday, September 9 at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with NVIDIA Triton team. lu.ma/87q3nvnh
15
983
Congrats! 🎉 Well deserved. And thanks for contributing it to @vllm_project! 🥰
To bring Mooncake's novel KVCache-centric architecture to the open-source community, the Mooncake and vLLM teams have collaborated on a multi-stage roadmap. This integration will introduce P/D disaggregation and a global KVCache design to vLLM. Check out the design doc and join our discussion in the feat-prefill-disaggregation channel on vLLM's Slack, and explore the initial PR here: github.com/vllm-project/vllm… We look forward to the community's feedback and contributions!
15
1,143
Awesome work by @KaichaoYou making vLLM integrate smoothly with RLHF frameworks! 🔥
Replying to @vwxyzjn
Finally, I want to give a special thanks to the @vllm_project team (@KaichaoYou @woosuk_k @simon_mo_ @zhuohan123) for their invaluable support in debugging NCCL weight transfer issues. They made our 70B RLVR weight transfer 45x faster and 405B RLVR even possible! See github.com/vllm-project/vllm…
2
15
900
The next vLLM meetup is hosted by @Meta on Feb 27th! See you at Menlo Park!
We are excited to invite you to our Menlo Park meetup with @Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and committer @CodyHaoYu will share updates from the v0.7.x series of releases. lu.ma/h7g3kuj9
14
687
Thank you @deepseek_ai !
Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16%! We expect more improvements in the coming days as we continue to optimize the code path. github.com/vllm-project/vllm…
1
15
1,319
Please join our first meetup and share your experience with vLLM!
We are excited to announce the first vLLM Bay Area meetup at 6pm on 10/5 (Thu)! Please find the event details and RSVP at: lu.ma/first-vllm-meetup. The vLLM team will give a deep dive into vLLM and show the future roadmap. We will also have vLLM users and contributors share their experiences. Thanks to @a16z for providing the venue and supporting open source.
1
13
1,406
Congrats!!!
I am honored to share that our recent paper won the Outstanding Paper Award at NSDI’24! The paper explores the policy design of our SkyPilot managed spot for @skypilot_org: Can’t Be Late: Optimizing Spot Instance Savings under Deadlines. It would not have been possible without the fantastic folks and advisors: @infwinston @ziming_mao @zongheng_yang Eric Friedman, Scott Shenker and Ion Stoica; and the whole SkyPilot team @skypilot_org
12
1,447
Try out Pixtral with vLLM!
🖼️ pip install -U vLLM
vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens 16384
github.com/vllm-project/vllm…
13
982
Check out this wonderful article from @anyscalecompute ! It shows vLLM delivers 23x speedup on OPT-13B!
I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at anyscale.com/blog/continuous…
2
12
1,709
Try Bamba with vLLM!
🚀 Exciting News! 🚀 In a joint effort between IBM Research, Princeton, CMU, and UIUC, we are thrilled to announce the release of our high-performing hybrid Mamba2 model! This model is trained entirely on open datasets, and we’re releasing intermediate and final checkpoints to enable community experimentation.
🔗 Read more: huggingface.co/blog/bamba
Key Takeaways
⚡ Inference Efficiency
The Bamba-9B model delivers significant improvements in throughput and latency, enhancing real-time application performance. Benchmarking with vLLM against Llama 3.1 8B for long contexts shows:
🔹 2.5x throughput improvement
🔹 2x lower latency
And this is just the beginning – further optimizations are on the way!
🏆 Competitive Benchmarks
Bamba-9B performs competitively with state-of-the-art transformer models like Meta Llama 3.1 8B. It matches average benchmark performance (excluding math and MMLU tasks), with clear opportunities to close gaps through extended training and math-focused datasets.
🤝 Open Collaboration
Developed entirely with open data, this effort emphasizes transparency and reproducibility, strengthening the foundations of the open-source AI community.
📂 For details, access to the model, and resources, check out the Bamba GitHub repository: github.com/foundation-model-…
Let’s collaborate, experiment, and innovate together! 🔍✨
@tri_dao @_albertgu @MinjiaZhang -- it is a great collaboration and look forward to continuing to work with you.
3
12
1,295
Check out the improved multi-modality support in @vllm_project by @rogerw0108!
More exciting news around multi-modality in the upcoming v0.5.1 @vllm_project release! With a much simpler interface, vLLM will now support dynamic image embedding shapes for models such as LLaVA-NeXT and Phi-3-Vision!
1
13
939
We will have the 5th vLLM meetup with AWS next Wednesday! Please join us and learn more about our recent progress!
We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress in FP8, pipeline parallel, and various work on perf. The space is limited to 150 so plz register ASAP! lu.ma/lp0gyjqr
1
13
1,204
Our meetup with @Meta happens this evening! See you at Menlo Park.
We are excited to invite you to our Menlo Park meetup with @Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and committer @CodyHaoYu will share updates from the v0.7.x series of releases. lu.ma/h7g3kuj9
1
12
1,292
Please check out our second meetup on Jan 31st!!
We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us. lu.ma/ygxbpzhl
1
1
11
1,575
Replying to @casper_hansen_
Thanks for sharing! To clarify: we didn't start from scratch - we reimplemented core components like the scheduler while keeping most of vLLM's existing codebase.
10
552
Replying to @HamelHusain
Why not use vLLM? :)
2
1
10
633
Congratulations to everyone in the second batch! 🔥🔥🔥
We're announcing the second batch of @a16z open source AI grants today This cohort focuses on: ▶️ tools for LLM training/ hosting/ evals ▶️ visual AI models & communities Thank you to the grantees for your contributions! More info in the linked post a16z.com/announcing-our-late…
6
798
@verl_project now uses vLLM V1! Upgrade veRL to enjoy faster and more stable performance :)
Recent updates on @verl_project (RL lib for LLMs):
Engine:
- Megatron qwen & GRPO support, v0.11 upgrade
- vllm v0.7 integration with v1 mode
- experimental sglang integration
Algorithm & recipes:
- vision language reasoning with qwen2.5-vl
- PRIME, RLOO, remax, math-verify rewards, etc
Docs:
- tutorial for distributed training setup and debugging
- programming model tutorial
Hardware:
- experimental AMD support
And many awesome community projects such as code-R1, Easy-R1, Search-R1, RAGEN, etc. Big thank you to the community! Working on multi-turn & environment/tool supports. Stay tuned...
1
7
629
@vllm_project goes brrr on 5080 🚀
Not every 5080 is signed by the one and only @ia_buck - thank you and @nvidia so much for letting me test out getting @vllm_project running on Blackwell! GPUs Go Brrr from home! Try it out yourself with instructions here! github.com/vllm-project/vllm…
6
266
Welcome!
🎉 Exciting news! Tyler Smith, one of our many talented engineers, is now Neural Magic's 3rd vLLM project committer! Check out Tyler's contributions: github.com/tlrmchlsmth. We’re proud to be a leading contributor to @vllm_project. 🚀 Cheers to Tyler and the team!
5
753
Replying to @rogerw0108
Absolutely love what you did in 2024 Let's keep killing it! 🔥
1
5
257
Check out @luo_michael1234's awesome work on picking the best LoRA adapters to create crisp images!
[1/5] Introducing Stylus 🖌️ - an #AI tool that automatically finds and adds the best adapters (LoRAs, Textual Inversions, Hypernetworks) to #StableDiffusion based on your prompt. 🗞️ Paper: arxiv.org/abs/2404.18928 🌎 Project Page: stylus-diffusion.github.io


5
979
Replying to @andrey_cheptsov
Yes! `pip install vllm megablocks` is all you need. 🚀🚀🚀
2
129
@simon_mo_ and I are at #GTC2025 this afternoon! Please let us know if you want to chat about @vllm_project and LLM inference!
4
119
You are right. The majority of vLLM’s speedup comes from batching more prompts in every run. However, you can get some speedup even in a single-user environment, because vLLM also includes other optimizations orthogonal to PagedAttention.
3
70
Amazing!! 🤣🤣
Image conditioning on @ideogram_ai is awesome! 3D render of re-imagined vLLM logo below :) cc @zhuohan123 @woosuk_k
2
418
@netapy Thanks for your interest and great question! Currently, we don't compile the models. We're exploring whether torch.compile is suitable for us, or whether we can enhance performance by optimizing the code for individual models.
1
3
142
It's so useful! It not only covers the docs, but also links to our Github issues! Thanks for adding it to vLLM!
1
3
211
Replying to @natolambert
@natolambert Could you please share more about the error you ran into? FlashInfer is actually required for Gemma2. You can install it with `pip install github.com/flashinfer-ai/fla…`
1
2
216
You can pip install the nightly version!
export VLLM_VERSION=0.5.3.post1
export VLLM_COMMIT=16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8
pip install vllm-wheels.s3.us-west-2.ama…${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
1
2
205
Congrats!
1
2
202
Replying to @michaelzluo
Thanks! 😃 Great ideas. Will definitely explore them!
2
105
Would you please provide a reproducible example? We really want to investigate this. Thanks!
1
2
280
Replying to @casper_hansen_
Thanks for bringing it up. We are working on it!
1
2
157
Amazing work!!
1
67
Replying to @KaichaoYou
Congrats!!!
1
310
Amazing!! Congrats!! 👊👊
1
262