Motivation.
We are making incremental changes to enable Blackwell support in vLLM. This issue tracks all planned and in-progress items.
Planned or In Progress Features
The following items are either planned or currently in progress to enable vLLM support on Blackwell.
- **Enable NVFP4 Support** (illustrative serving sketch after this list)
  - (NVIDIA) Add functional support for NVFP4 kernels for linear layers
  - (NVIDIA) Add functional support for NVFP4 MoE kernels
  - (NVIDIA) Add model integration for nvidia/*-FP4 models
  - Finetune GEMM configurations for Blackwell
  - (NVIDIA) Optimize MoE for latency
  - (NVIDIA) Optimize MoE for throughput (FI: PR !1113)
  - (NVIDIA) MoE all-reduce fusion (FI: PR !1108)
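For context on how users would consume this work, here is a minimal sketch, assuming the NVFP4 kernels and nvidia/*-FP4 model integration above have landed. The checkpoint name is hypothetical and the `modelopt_fp4` quantization key is an assumption based on vLLM's ModelOpt integration, not the tracked implementation itself.

```python
# Minimal sketch: serving a hypothetical nvidia/*-FP4 checkpoint.
# Assumptions: the model name below and the "modelopt_fp4" quantization
# key (based on vLLM's ModelOpt integration); check the final docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP4",  # hypothetical NVFP4 checkpoint
    quantization="modelopt_fp4",               # assumed quantization key
)
out = llm.generate(["Hello, Blackwell!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```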
- **Optimize communication overlap ops** (toy overlap sketch after this list)
  - (NVIDIA) Enable NCCL's symmetric memory ([core] add nccl symmetric memory for all reduce, #24532)
  - (NVIDIA) Add support for GEMM + communication overlap
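The overlap idea itself is simple; below is a toy PyTorch sketch (not vLLM's implementation) that assumes an initialized NCCL process group: an asynchronous all-reduce is launched on one partial result while the next GEMM runs, so communication hides behind compute.

```python
# Toy GEMM + communication overlap (not vLLM's kernels). The async
# all-reduce proceeds on NCCL's stream while the second GEMM runs on
# the compute stream. Assumes dist.init_process_group("nccl") was done.
import torch
import torch.distributed as dist

def overlapped_step(x: torch.Tensor, w0: torch.Tensor, w1: torch.Tensor) -> torch.Tensor:
    y0 = x @ w0                                # first partial GEMM
    work = dist.all_reduce(y0, async_op=True)  # reduce y0 in the background
    y1 = x @ w1                                # overlapped second GEMM
    work.wait()                                # sync before consuming y0
    return y0 + y1
```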
- **Blackwell Attention Kernels** (backend-selection sketch after this list)
  - (NVIDIA) Integrate Cutlass MLA kernels ([NVIDIA] Add Cutlass MLA backend, #17625)
  - (NVIDIA) Integrate vLLM v1-compatible Blackwell prefill and decode GQA kernels (FI: PR !1051)
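Backend selection in vLLM goes through the `VLLM_ATTENTION_BACKEND` environment variable; the sketch below uses an assumed name for the Cutlass MLA backend from #17625, so treat the exact string as unconfirmed.

```python
# Hedged sketch: choosing an attention backend via vLLM's environment
# switch. "CUTLASS_MLA" is an assumed name for the backend added in
# #17625; check the PR for the exact string.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "CUTLASS_MLA"  # assumed backend name

from vllm import LLM
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # an MLA model, for illustration
```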
- **FP8 Blockscale GEMM and MoE** (blockwise-quantization sketch after this list)
  - (NVIDIA) FP8 blockscale GEMM
  - (NVIDIA) FP8 blockscale GEMM optimizations (Sm100 blockwise fp8 swap ab, #18564)
  - (NVIDIA) FP8 blockscale MoE
  - (NVIDIA) Latency and throughput optimizations
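For intuition, "blockscale" FP8 means each weight block (for example a DeepSeek-style 128x128 tile) carries its own scale so the narrow e4m3 range is used well. The toy sketch below shows only the numerics, not the CUTLASS kernels tracked above.

```python
# Toy block-scaled FP8 numerics (not the tracked CUTLASS kernels).
# Assumes both weight dimensions are divisible by the block size.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise(w: torch.Tensor, block: int = 128):
    n, k = w.shape
    wb = w.reshape(n // block, block, k // block, block)
    amax = wb.abs().amax(dim=(1, 3), keepdim=True)  # per-block absmax
    scale = (amax / FP8_MAX).clamp(min=1e-12)       # one scale per block
    q = (wb / scale).to(torch.float8_e4m3fn)        # quantized blocks
    return q.reshape(n, k), scale.squeeze(1).squeeze(2)
```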
- **MTP support** (Multi-Token Prediction; hedged sketch after this item)
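MTP here is DeepSeek-style Multi-Token Prediction, which vLLM surfaces through its speculative decoding configuration. The sketch below is heavily hedged: the `"deepseek_mtp"` method string and the `speculative_config` shape are assumptions, so consult the speculative decoding docs for the exact interface.

```python
# Heavily hedged sketch: serving a model's MTP head as a speculative
# decoding method. The "deepseek_mtp" method string and config shape
# are assumptions; check vLLM's speculative decoding docs.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # a model that ships MTP weights
    speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1},
)
```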
Feedback Period.
No response
CC List.
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.