jinzhen-lin (Contributor) commented Apr 18, 2025

This PR optimizes the dense marlin kernel and the moe marlin kernel.

Summary:

  • (dense marlin only) Migrated the optimization methods introduced for the moe marlin kernel in [Kernel] moe wna16 marlin kernel #14447 to the dense marlin kernel, including:
    • Reworked the workspace usage logic to remove the max_par limitation, speeding up large batches (m > 1024).
    • Simulated the m8n16k16 MMA instruction with the m16n8k16 instruction via transposition. This improves performance for m <= 8.
    • For AWQ models, fused mul(sub(quantized_weight, zero_points), scale) into fma(quantized_weight, scale, -(zero_points * scale)), where -(zero_points * scale) can be precomputed. This saves some floating-point operations (see the first sketch after this list).
    • Removed some unused kernels to reduce wheel size.
    • Split the kernel into multiple files to speed up compilation.
    • etc.
  • (moe marlin only) Optimized the index calculation logic when reading A by caching row and column information as much as possible, achieving a performance improvement of up to 10%.
  • (moe marlin only) Used the available shared memory to cache matrix A as much as possible; when the same threadblock processes the same M but different N, this reduces the global-memory reads of A (see the second sketch after this list).
  • FP8 marlin. We can now run DeepSeek with W-FP8-A-FP16.
    • Merged fp8_marlin into the gptq_marlin kernel and added block quant support.
    • Added fp8 support for moe marlin.
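
To make the two sketches referenced above concrete: first, a minimal, hedged illustration of the AWQ dequant rewrite (not the kernel's actual code; `dequant_awq` and `neg_zp_s` are illustrative names). The baseline costs a subtract plus a multiply per weight element; the rewritten form is a single fused multiply-add against a precomputed `-(zero_points * scale)` term.

```cuda
#include <cuda_fp16.h>

// Hedged sketch of the AWQ dequant rewrite, not the actual Marlin code.
// Baseline:  w = (w_q - zp) * s          -> one sub + one mul per element
// Rewritten: w = fma(w_q, s, neg_zp_s)   -> one FMA per element, where
// neg_zp_s = -(zp * s) is precomputed once per quantization group.
__device__ __forceinline__ half2 dequant_awq(half2 w_q, half2 s,
                                             half2 neg_zp_s) {
  return __hfma2(w_q, s, neg_zp_s);  // w_q * s + (-(zp * s))
}
```

Second, a minimal sketch of the shared-memory reuse idea from the moe item (tile sizes and names are hypothetical; the real kernel's staging is more involved). A threadblock that walks several N-tiles for one M-tile stages A in shared memory once and reuses it for every tile, so the global-memory traffic for A is paid only once.

```cuda
#include <cuda_fp16.h>

constexpr int TILE_M = 16;  // hypothetical tile sizes, for illustration only
constexpr int TILE_K = 64;

__global__ void reuse_a_across_n_tiles(const half* __restrict__ A, int lda,
                                       int n_tiles_per_block) {
  __shared__ half smem_a[TILE_M * TILE_K];
  bool a_loaded = false;
  for (int t = 0; t < n_tiles_per_block; ++t) {
    if (!a_loaded) {
      // A depends only on the M-tile, so stage it in shared memory once ...
      for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x) {
        int row = i / TILE_K, col = i % TILE_K;
        smem_a[i] = A[(blockIdx.y * TILE_M + row) * lda + col];
      }
      __syncthreads();
      a_loaded = true;
    }
    // ... and every N-tile then reads A from smem_a instead of global
    // memory (the B load and MMA issue are omitted in this sketch).
  }
}
```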

Signed-off-by: Jinzhen Lin <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the ci/build label Apr 18, 2025
mgoin requested a review from LucasWilkinson April 18, 2025 16:04
jinzhen-lin (Contributor, Author) commented Apr 19, 2025

dense marlin benchmark tests (on A800)

[three benchmark result images]

jinzhen-lin (Contributor, Author) commented Apr 19, 2025

moe marlin benchmark tests (on A800)

(NOTE 1: For cases where k <= 256, the optimization methods introduced in this PR were already implemented in #14447, so the performance improvement under such conditions is limited.)

(NOTE 2: The "main" baseline in the following results is inconsistent with the one posted in #14447, because after posting the benchmark results there I made several more rounds of optimizations.)

shapes of DeepSeek-V3-AWQ (with TP=8)

[benchmark result image]

shapes of Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 (with TP=1)

[benchmark result image]

shapes of Mixtral-8x7B-Instruct-v0.1-AWQ (with TP=1)

[benchmark result image]

jinzhen-lin (Contributor, Author) commented:

@mgoin @LucasWilkinson

The benchmark results are posted.

BTW, should we change the default value of VLLM_MARLIN_USE_ATOMIC_ADD to 1 now? (I'm still not sure whether it could cause some bugs, though; see #14138.)
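
For context on the tradeoff, here is a hedged sketch of what an atomic-add epilogue does (illustrative only, not vLLM's actual code): with split-k style parallelism, each block folds its partial tile straight into the output with atomicAdd instead of staging partials in a workspace and running a separate reduction kernel, at the cost of a non-deterministic float summation order.

```cuda
// Illustrative split-k epilogue (hypothetical kernel, not vLLM's):
// block (x, y) owns the y-th partial copy of the output tile and
// accumulates it into C directly.
__global__ void atomic_add_epilogue(float* __restrict__ C,
                                    const float* __restrict__ partials,
                                    int tile_elems) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < tile_elems) {
    // Accumulation order across blocks is unordered, so results can vary
    // bitwise between runs, which is the usual correctness concern.
    atomicAdd(&C[i], partials[blockIdx.y * tile_elems + i]);
  }
}
```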

mgoin (Member) left a comment

This does increase the wheel size by about 10MB to 313MB, so we should try to trim it down a bit.

[2025-04-22T07:40:33Z] #32 0.707 Wheel dist/vllm-0.8.5.dev150+gfb8563602-cp38-abi3-linux_x86_64.whl is within the allowed size (313.19 MB).

I think there may be some compiled function overlap that I uncovered during review.

Comment on lines -395 to -405
// Instantiations with is_zp_float = true (the HQQ path); not enabled, per the discussion below.
#define HQQ_GET_IF(W_TYPE, N_BLOCKS, K_BLOCKS, NUM_THREADS)              \
  __GET_IF(W_TYPE, 1, N_BLOCKS, K_BLOCKS, true, false, true, 4,          \
           NUM_THREADS, true)                                            \
  __GET_IF(W_TYPE, 1, N_BLOCKS, K_BLOCKS, false, false, true, 4,         \
           NUM_THREADS, true)                                            \
  __GET_IF(W_TYPE, 2, N_BLOCKS, K_BLOCKS, false, false, true, 4,         \
           NUM_THREADS, true)                                            \
  __GET_IF(W_TYPE, 3, N_BLOCKS, K_BLOCKS, false, false, true, 4,         \
           NUM_THREADS, true)                                            \
  __GET_IF(W_TYPE, 4, N_BLOCKS, K_BLOCKS, false, false, true, 4,         \
           NUM_THREADS, true)
Member:

Did we actually support HQQ for MoE before?

jinzhen-lin (Contributor, Author) commented Apr 23, 2025

The Marlin template supports is_zp_float = true (HQQ), but I don't enable it.
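
For reference, a hedged sketch of what the is_zp_float path computes (HQQ keeps float-valued zero-points rather than packed integer ones; `dequant_hqq` is an illustrative name, not the template's API):

```cuda
#include <cuda_fp16.h>

// Illustrative only: dequant with a float-valued zero-point (HQQ-style),
// as opposed to the packed integer zero-points of the AWQ/GPTQ paths.
__device__ __forceinline__ half dequant_hqq(half w_q, half scale, half zp_f) {
  return __hmul(__hsub(w_q, zp_f), scale);  // (w_q - zp) * s
}
```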

jinzhen-lin (Contributor, Author) commented:

@mgoin The remaining failing tests seem unrelated to this PR.


mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jinzhen-lin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label May 2, 2025
mergify bot removed the needs-rebase label May 2, 2025
mgoin (Member) commented May 3, 2025

Looks like several of the failing tests are related to the merge 😞

[2025-05-02T21:38:44Z] FAILED kernels/quantization/test_awq_marlin.py::test_fused_marlin_moe_awq[128-6-64-1024-2048-64] - RuntimeError: vllm::fused_marlin_moe() is missing value for argument 'quant_type_id'. Declaration: vllm::fused_marlin_moe(Tensor hidden_states, Tensor w1, Tensor w2, Tensor w1_scale, Tensor w2_scale, Tensor gating_output, Tensor topk_weights, Tensor topk_ids, SymInt quant_type_id, SymInt global_num_experts=-1, Tensor? expert_map=None, Tensor? g_idx1=None, Tensor? g_idx2=None, Tensor? sort_indices1=None, Tensor? sort_indices2=None, Tensor? w1_zeros=None, Tensor? w2_zeros=None, Tensor? workspace=None, bool is_k_full=True, bool inplace=False) -> Tensor

jinzhen-lin (Contributor, Author) commented May 4, 2025

> Looks like several of the failing tests are related to the merge 😞
>
> [2025-05-02T21:38:44Z] FAILED kernels/quantization/test_awq_marlin.py::test_fused_marlin_moe_awq[128-6-64-1024-2048-64] - RuntimeError: vllm::fused_marlin_moe() is missing value for argument 'quant_type_id'. Declaration: vllm::fused_marlin_moe(Tensor hidden_states, Tensor w1, Tensor w2, Tensor w1_scale, Tensor w2_scale, Tensor gating_output, Tensor topk_weights, Tensor topk_ids, SymInt quant_type_id, SymInt global_num_experts=-1, Tensor? expert_map=None, Tensor? g_idx1=None, Tensor? g_idx2=None, Tensor? sort_indices1=None, Tensor? sort_indices2=None, Tensor? w1_zeros=None, Tensor? w2_zeros=None, Tensor? workspace=None, bool is_k_full=True, bool inplace=False) -> Tensor

@mgoin The error seems to have been introduced by the rebase. Fixed now (test_awq_marlin.py contained moe test cases and should be removed; the moe marlin test cases are already in test_moe.py).

simon-mo merged commit 1d0c9d6 into vllm-project:main May 5, 2025
77 of 80 checks passed
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
robertgshaw2-redhat (Collaborator) commented May 12, 2025

Hey @jinzhen-lin @mgoin - it looks like this PR may have broken marlin utilities for a couple of integrations. See the latest nightly runs:

Would you mind taking a peek at resolving this?

jinzhen-lin (Contributor, Author) commented May 12, 2025

> Hey @jinzhen-lin @mgoin - it looks like this PR may have broken marlin for FBGEMM integration
>
> Would you mind taking a peek at resolving this?

I will fix it later.

robertgshaw2-redhat (Collaborator) commented:

> > Hey @jinzhen-lin @mgoin - it looks like this PR may have broken marlin for FBGEMM integration
> >
> > Would you mind taking a peek at resolving this?
>
> I will fix it later.

Thank you. I posted the two failures I'm seeing in the comment.

mgoin (Member) commented May 12, 2025

I've resolved most of the model issues with the above-referenced PRs #18002 and #18017.

There is one outstanding issue that it would be useful to have you take a look at, @jinzhen-lin. Regarding the weight-loading Buildkite test, there is this failing case with Mixtral w8a16, group=128, desc_act=True:

=== FAILED MODEL: gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, gptq-8bit-128g-actorder_True ===

Locally I've been able to trigger the failure with this command, which fails in the moe_wna16_marlin_gemm call:

CUDA_LAUNCH_BLOCKING=1 vllm serve TheBloke/Mixtral-8x7B-v0.1-GPTQ -tp 2 --load-format dummy --enforce-eager --revision gptq-8bit-128g-actorder_True
...
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]   File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 630, in apply
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]     return torch.ops.vllm.fused_marlin_moe(
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]   File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 168, in fused_marlin_moe
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]     intermediate_cache3 = ops.moe_wna16_marlin_gemm(
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]   File "/home/mgoin/code/vllm/vllm/_custom_ops.py", line 1401, in moe_wna16_marlin_gemm
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]     return torch.ops._moe_C.moe_wna16_marlin_gemm(
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522] 
(VllmWorker rank=0 pid=3000030) ERROR 05-12 19:38:02 [multiproc_executor.py:522] 

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Labels: ci/build, performance, quantization, ready
Projects: None yet
5 participants