[Kernel] some optimizations for dense marlin and moe marlin #16850
Conversation
moe marlin benchmark tests (on A800)

(NOTE 1: The optimization methods introduced in this PR have already been implemented in #14447 for cases where k <= 256, resulting in limited performance improvement under such conditions.)

(NOTE 2: The "main" section in the following results is inconsistent with the ones posted in https://github.com//pull/14447, because after posting the benchmark results in #14447, I made several rounds of optimizations.)

- shapes of DeepSeek-V3-AWQ (with TP=8)
- shapes of Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 (with TP=1)
- shapes of Mixtral-8x7B-Instruct-v0.1-AWQ (with TP=1)
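(Illustrative aside, not part of the original comment.) For readers unfamiliar with how per-shape kernel timings like these are usually collected, below is a minimal CUDA-event timing sketch. `dummy_kernel`, `time_kernel_ms`, and the problem size are hypothetical stand-ins, not the actual Marlin MoE kernel or the benchmark script used for this PR.

```cpp
// Minimal sketch of CUDA-event based kernel timing (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real GEMM/MoE kernel being benchmarked.
__global__ void dummy_kernel(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;
}

// Average kernel time in milliseconds over `iters` launches.
float time_kernel_ms(float* out, const float* in, int n, int iters) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  dummy_kernel<<<(n + 255) / 256, 256>>>(out, in, n);  // warmup launch
  cudaEventRecord(start);
  for (int it = 0; it < iters; ++it)
    dummy_kernel<<<(n + 255) / 256, 256>>>(out, in, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / iters;
}

int main() {
  const int n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  cudaMemset(in, 0, n * sizeof(float));
  std::printf("avg kernel time: %.3f ms\n", time_kernel_ms(out, in, n, 100));
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```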
The benchmark results are posted. BTW, should we change the default value of
This does increase the wheel size by about 10MB to 313MB, so we should try to trim down a bit.
[2025-04-22T07:40:33Z] #32 0.707 Wheel dist/vllm-0.8.5.dev150+gfb8563602-cp38-abi3-linux_x86_64.whl is within the allowed size (313.19 MB).
I think there may be some compiled function overlap that I uncovered during review.
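(Illustrative aside, not from the review.) A hedged sketch of why overlapping dispatch entries matter for binary size: in a macro-generated dispatch table of this style, each entry both adds an if-branch and pulls in one more compiled kernel variant, so trimming entries that duplicate configurations already covered elsewhere reduces the number of instantiations and shrinks the wheel. The names below (`GET_IF`, `gemm_kernel_stub`) are hypothetical, not the actual Marlin macros.

```cpp
// Hypothetical mock-up of a macro-generated kernel dispatch table.
#include <cstdio>

// Stand-in for a real CUDA kernel template; every distinct <THREAD_M, HAS_ZP>
// pair used below becomes one more compiled function in the binary.
template <int kThreadM, bool kHasZeroPoint>
void gemm_kernel_stub(int m) {
  std::printf("kernel<M=%d, zp=%d> handling m=%d\n", kThreadM,
              static_cast<int>(kHasZeroPoint), m);
}

// One dispatch entry = one if-branch + one template instantiation.
#define GET_IF(THREAD_M, HAS_ZP)                      \
  if (thread_m == (THREAD_M) && has_zp == (HAS_ZP)) { \
    gemm_kernel_stub<THREAD_M, HAS_ZP>(m);            \
    return;                                           \
  }

void dispatch(int thread_m, bool has_zp, int m) {
  GET_IF(1, true)
  GET_IF(1, false)
  GET_IF(2, false)  // removing rarely-used or duplicated entries here
  GET_IF(4, false)  // drops whole kernel variants from the build
  std::printf("no kernel compiled for this configuration\n");
}

int main() {
  dispatch(1, true, 16);
  return 0;
}
```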
#define HQQ_GET_IF(W_TYPE, N_BLOCKS, K_BLOCKS, NUM_THREADS)                  \
  __GET_IF(W_TYPE, 1, N_BLOCKS, K_BLOCKS, true, false, true, 4, NUM_THREADS, \
           true)                                                             \
  __GET_IF(W_TYPE, 1, N_BLOCKS, K_BLOCKS, false, false, true, 4,             \
           NUM_THREADS, true)                                                \
  __GET_IF(W_TYPE, 2, N_BLOCKS, K_BLOCKS, false, false, true, 4,             \
           NUM_THREADS, true)                                                \
  __GET_IF(W_TYPE, 3, N_BLOCKS, K_BLOCKS, false, false, true, 4,             \
           NUM_THREADS, true)                                                \
  __GET_IF(W_TYPE, 4, N_BLOCKS, K_BLOCKS, false, false, true, 4,             \
           NUM_THREADS, true)
Did we actually support HQQ for MoE before?
The Marlin template supports `is_zp_float = true` (HQQ), but I don't enable it.
@mgoin The remaining failed tests seem unrelated to this PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Looks like several of the failing tests are related to the merge 😞
@mgoin The error seems to have been introduced by the rebase. Fixed now (the content of
Hey @jinzhen-lin @mgoin - it looks like this PR may have broken marlin utilities for a couple of integrations. See the latest nightly runs:
Would you mind taking a peek at resolving them?
I will fix it later.
Thank you. I posted the 2 failures that I see in the comment.
I've resolved most of the model issues with the above referenced PRs #18002 and #18017. There is one outstanding issue that it would be useful to have you take a look at, @jinzhen-lin. Regarding the weight loading buildkite test, there is this failing case with mixtral w8a16 group=128 desc_act=True.
Locally, I've been able to trigger the failure with this command, which fails in the
This PR optimizes the dense marlin kernel and the moe marlin kernel.

Summary:

- Emulate the `m8n16k16` MMA instruction using the `m16n8k16` instruction via transposition. This optimizes performance for m <= 8.
- Convert `mul(sub(quantized_weight, zero_points), scale)` into `fma(quantized_weight, scale, -mul(zero_points * scale))`, where `-mul(zero_points * scale)` can be precomputed. This saves some floating-point operations.
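To make the second point concrete, here is a minimal scalar sketch of the rewrite (my own illustration, not code from the PR; the kernels apply the same identity to packed half-precision values). Since (q - z) * s = q * s + (-(z * s)), the term -(z * s) can be computed once per quantization group, and each element then needs a single fused multiply-add instead of a subtract plus a multiply.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  float q = 11.0f;  // quantized weight value
  float z = 8.0f;   // zero point
  float s = 0.02f;  // scale

  // Original form: one subtract + one multiply per element.
  float w_ref = (q - z) * s;

  // Rewritten form: precompute neg_zs = -(z * s) once per group/column,
  // then dequantize each element with a single fused multiply-add.
  float neg_zs = -(z * s);
  float w_fma = std::fmaf(q, s, neg_zs);

  std::printf("reference = %f, fma form = %f\n", w_ref, w_fma);
  return 0;
}
```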