[Model][Quantization] HQQ support through Marlin kernel expansion #9766
Conversation
Signed-off-by: ElizaWszola <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: ElizaWszola <[email protected]>
vllm/model_executor/parameter.py (outdated)

    marlin_tile_size=self.marlin_tile_size)


class HQQQweightParameter(PackedvLLMParameter):
@dsikka @mgoin @LucasWilkinson what do you guys think of inheriting from the PackedParameter to handle the details of HQQ? I think this is a reasonable approach.
Perhaps this should live inside hqq_marlin.py
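For context, a minimal sketch of the approach under discussion, assuming the loader hooks defined in vllm/model_executor/parameter.py (the reshape detail is illustrative, not the PR's exact code):

```python
import torch

from vllm.model_executor.parameter import PackedvLLMParameter


class HQQQweightParameter(PackedvLLMParameter):
    """Packed HQQ qweight; each stored row holds two quantization groups."""

    def load_merged_column_weight(self, loaded_weight: torch.Tensor,
                                  **kwargs) -> None:
        # Unfold the HQQ on-disk layout into the logical shape that the
        # generic packed-parameter loader expects, then delegate to it.
        loaded_weight = loaded_weight.reshape(-1, self.shape[self.input_dim])
        super().load_merged_column_weight(loaded_weight, **kwargs)
```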
@ElizaWszola would like to understand how we're extending this class to see if we can clean up some of the weight loading logic.
@dsikka HQQ qweights have a data format where each row corresponds to 2 groups. I first reshape the HQQ qweights (and the corresponding scales/zero points) to the actual shapes used in the qweight multiplication, and then load them. Additionally, I store shard offsets and sizes because they are needed for unpacking HQQ from 4 bits to 8 bits (I need to unpack shard by shard, so I must know where the shards are). I'll add a comment on this in the code.
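To make that layout concrete, here is a small hypothetical illustration (the helper name and axis choices are assumptions, not the PR's code): every stored row packs two groups, so each tensor is unfolded to twice the rows and half the columns before loading; the recorded shard offsets then let each shard be unpacked from 4-bit to 8-bit independently.

```python
import torch


def unfold_hqq(qweight: torch.Tensor, scales: torch.Tensor,
               zeros: torch.Tensor):
    """Unfold HQQ tensors in which every stored row packs two groups."""
    # Two groups per row -> split each row in half:
    # twice the rows, half the columns.
    qweight = qweight.reshape(qweight.shape[0] * 2, qweight.shape[1] // 2)
    scales = scales.reshape(scales.shape[0] * 2, scales.shape[1] // 2)
    zeros = zeros.reshape(zeros.shape[0] * 2, zeros.shape[1] // 2)
    return qweight, scales, zeros
```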
Thanks Eliza! Nice job avoiding any changes to llama.py. The HQQ logic is now much better encapsulated.
Questions:
- do we have any end-to-end perf benchmarks?
- do we have any lm-eval runs for correctness?
- can you add a test for weight loading? @dsikka can you point out an example
The only other piece is that we have not adopted the Kernel abstraction proposed by @LucasWilkinson. @LucasWilkinson do you think this would be appropriate?
See https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/gptq_marlin.py for an example of the Kernel abstraction.
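For reference, a rough sketch of how gptq_marlin.py uses that abstraction: the quant method describes the layer in an MPLinearLayerConfig and lets choose_mp_linear_kernel select an implementation. Field names follow that file at the time of this PR; treat the sketch as illustrative rather than exact.

```python
from vllm.model_executor.layers.quantization.kernels import (
    MPLinearLayerConfig, choose_mp_linear_kernel)


def pick_kernel(full_shape, partition_shape, quant_type, act_dtype,
                group_size, desc_act):
    """Describe the linear layer and let the framework pick a kernel."""
    cfg = MPLinearLayerConfig(
        full_weight_shape=full_shape,
        partition_weight_shape=partition_shape,
        weight_type=quant_type,
        act_type=act_dtype,
        group_size=group_size,
        zero_points=False,  # the GPTQ-Marlin path is symmetric
        has_g_idx=desc_act,
    )
    return choose_mp_linear_kernel(cfg)
```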
Signed-off-by: ElizaWszola <[email protected]>
@robertgshaw2-neuralmagic I've run some lm-eval tests; I'll add the relevant configs as soon as it's verified that the results look good. I can also add the other tests mentioned.
touching base offline about parameter testing
vllm/model_executor/parameter.py (outdated)

# Zero points and scales in HQQ must also be reshaped to their actual shapes.
class HQQZeroScaleParameter(GroupQuantScaleParameter):
Seems like we just need this class to reshape the weight before weight loading? We could just add an optional property to do this in the parent classes so that we don't have to maintain another parameter.
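One hypothetical way to realize this suggestion (not the PR's final design): give the parent parameter an optional pre-load reshape hook, so format-specific subclasses become unnecessary. A stand-in class is used here because the real GroupQuantScaleParameter API may differ.

```python
from typing import Callable, Optional

import torch


class ScaleParameter:  # stand-in for vLLM's GroupQuantScaleParameter
    def __init__(self, data: torch.Tensor,
                 reshape_fn: Optional[Callable[[torch.Tensor],
                                               torch.Tensor]] = None):
        self.data = data
        self._reshape_fn = reshape_fn  # optional pre-load hook

    def load(self, loaded_weight: torch.Tensor) -> None:
        # Apply the format-specific reshape (e.g. HQQ's two-groups-per-row
        # unfold) before the generic copy, if a hook was provided.
        if self._reshape_fn is not None:
            loaded_weight = self._reshape_fn(loaded_weight)
        self.data.copy_(loaded_weight)
```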
Is there any other parameter in the code that already gets reshaped like this?
Signed-off-by: ElizaWszola <[email protected]>
/ready
Signed-off-by: ElizaWszola <[email protected]>
Basically looks good to me, although good for @tlrmchlsmth to go through as well
@ElizaWszola When you merge latest main, could you sanity check the wheel size before landing?
lgtm -- just need to address nits and merge and then I think it's g2g
Signed-off-by: ElizaWszola <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ElizaWszola <[email protected]>
failures look related to #10456
This looks good to me!
BTW, just checked this, and the check-wheel size goes from …
Hi, I use an HQQ model with LoRA, but it raises an error.
I'm not sure how that function works. @varun-sundar-rabindranath @jeejeelee, would you have an idea?
Looks like the weights are not registered as expected in marlin.py.
I have a PR for this: #12090
This PR enables HQQ support (float zero points) in the float16 Marlin kernels. It supports 4-bit quantization with group_size 64.
unit tests:
offline inference model:
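A minimal offline-inference sketch (the model path is a placeholder for any 4-bit, group_size-64 HQQ checkpoint; vLLM is assumed to pick up the quantization from the checkpoint config):

```python
from vllm import LLM, SamplingParams

# Placeholder path: substitute any HQQ model quantized to 4 bits
# with group_size 64.
llm = LLM(model="path/to/hqq-4bit-gs64-model")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```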