[Bugfix] Fix quantization skip modules logic #13562
Conversation
Signed-off-by: Jee Jee Li <[email protected]>
```python
# BitsAndBytes
if (isinstance(quant_config, BitsAndBytesConfig)
        and quant_config.llm_int8_skip_modules):
    quant_config.llm_int8_skip_modules = [
        hf_to_vllm_mapper._map_name(module)
        for module in quant_config.llm_int8_skip_modules
    ]
# AWQ
elif (isinstance(quant_config, AWQConfig)
      and quant_config.modules_to_not_convert):
    quant_config.modules_to_not_convert = [
        hf_to_vllm_mapper._map_name(module)
        for module in quant_config.modules_to_not_convert
    ]
# TODO: Support more quantization types.
```
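For context, a minimal sketch of the renaming this diff performs, using a hypothetical stand-in for the HF-to-vLLM mapper; the class name, mapping, and module names below are illustrative only, not vLLM's actual API or the real Qwen2.5-VL values:

```python
# Hypothetical stand-in for the HF-to-vLLM weights mapper used above; it only
# shows the prefix-renaming idea applied to a skip-module list.
class SimplePrefixMapper:
    def __init__(self, orig_to_new_prefix: dict[str, str]) -> None:
        self.orig_to_new_prefix = orig_to_new_prefix

    def _map_name(self, name: str) -> str:
        # Rewrite the first matching prefix; leave non-matching names as-is.
        for orig, new in self.orig_to_new_prefix.items():
            if name.startswith(orig):
                return new + name[len(orig):]
        return name


# Illustrative skip list and mapping (assumed values for demonstration).
mapper = SimplePrefixMapper({"model.": "language_model.model."})
skip_modules = ["model.visual.merger", "lm_head"]
print([mapper._map_name(m) for m in skip_modules])
# ['language_model.model.visual.merger', 'lm_head']
```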
Maybe we should introduce a common `ignored_modules` or `ignored_prefixes` to `QuantizationConfig`, like `packed_modules_mapping` (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/base_config.py#L60-L66). Then each quant config can convert its specific `llm_int8_skip_modules`, `modules_to_not_convert`, etc. into a canonical format in `ignored_modules`. This would also allow us to generalize the `is_layer_skipped` function.
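A rough sketch of what that shared attribute might look like; the class bodies and the `is_layer_skipped` signature below are simplified assumptions, not vLLM's actual definitions:

```python
from abc import ABC
from typing import Optional


class QuantizationConfig(ABC):
    """Simplified sketch; the real base class has more abstract methods."""

    def __init__(self) -> None:
        # Already exposed by the real base class: fused-module mapping.
        self.packed_modules_mapping: dict[str, list[str]] = {}
        # Proposed: method-specific skip lists normalized into one place.
        self.ignored_modules: list[str] = []


class BitsAndBytesConfig(QuantizationConfig):
    """Sketch of one method-specific config canonicalizing its skip list."""

    def __init__(self, llm_int8_skip_modules: Optional[list[str]] = None) -> None:
        super().__init__()
        self.ignored_modules = llm_int8_skip_modules or []


def is_layer_skipped(quant_config: QuantizationConfig, prefix: str) -> bool:
    """Generalized check that works off the shared attribute."""
    return any(prefix.startswith(m) for m in quant_config.ignored_modules)


cfg = BitsAndBytesConfig(llm_int8_skip_modules=["visual", "lm_head"])
print(is_layer_skipped(cfg, "visual.merger"))        # True
print(is_layer_skipped(cfg, "model.layers.0.mlp"))   # False
```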
I'd support an implementation like this as well. The current implementation could fail to properly map module names in nested models:

```python
modules_to_not_convert = ["SubModel.A"]
SubModel.hf_to_vllm_mapper = Mapper(orig_to_new_prefix={"A": "B"})
```

Note that "SubModel.A" will not match because "SubModel.A" does not start with "A". This is a fairly minor issue, but something to keep in mind.
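A small self-contained illustration of that mismatch; the mapping logic below is a simplified assumption about prefix-based renaming:

```python
# SubModel-local mapping versus a fully qualified name from the config.
orig_to_new_prefix = {"A": "B"}
modules_to_not_convert = ["SubModel.A"]


def map_name(name: str) -> str:
    # Only rewrites names that start with a mapped prefix.
    for orig, new in orig_to_new_prefix.items():
        if name.startswith(orig):
            return new + name[len(orig):]
    return name


print([map_name(m) for m in modules_to_not_convert])
# ['SubModel.A'] -- unchanged, because "SubModel.A" does not start with "A"
```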
Another implementation could look like this:

- Add a mutable `ignored_modules` attribute to `QuantizationConfig`
- At construction time, use the method-specific constructor to populate the `ignored_modules` attribute from disk
- At initialization time, within `SupportsQuant`, use the given model prefix and mapper to update the `ignored_modules` list with the proper model-specific mapping, roughly (pseudocode): `a.ignored_modules = [prefix + hf_to_vllm_mapper[module - prefix] for module in ignored_modules]`

This has the advantage of further standardizing around the `QuantizationConfig` base, as well as supporting mapping with nested models.
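A minimal sketch of that initialize-time remapping, assuming a prefix-based mapper; the helper name and signature below are hypothetical, not part of the proposal's actual code:

```python
# Hypothetical helper: strip the submodule prefix, apply the local HF-to-vLLM
# prefix mapping, then re-attach the model prefix.
def remap_ignored_modules(ignored_modules: list[str], prefix: str,
                          orig_to_new_prefix: dict[str, str]) -> list[str]:
    remapped = []
    for module in ignored_modules:
        if not module.startswith(prefix):
            # Name does not belong to this submodule; leave it untouched.
            remapped.append(module)
            continue
        local = module[len(prefix):]
        for orig, new in orig_to_new_prefix.items():
            if local.startswith(orig):
                local = new + local[len(orig):]
                break
        remapped.append(prefix + local)
    return remapped


print(remap_ignored_modules(["SubModel.A"], "SubModel.", {"A": "B"}))
# ['SubModel.B']
```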
@jeejeelee Here's a WIP of what that might look like: #14635
@kylesayrs Can you provide an example?
```python
def _configure_packed_modules_mapping():
    """
    Pass packed_modules_mapping by reference to quant_config so that
    quant_config can properly match fused modules.
    Note that model attributes are passed by reference to quant_config,
    enabling them to be updated by model_class.__new__ (e.g. chatglm, qwen).
    """
    packed_mapping = getattr(model_class, "packed_modules_mapping", None)
    if packed_mapping is not None:
        # Pass packed_modules_mapping by reference to quant_config.
        quant_config.packed_modules_mapping = packed_mapping
    else:
        logger.warning(
            "The model class %s has not defined `packed_modules_mapping`, "
            "this may lead to incorrect mapping of quantized or ignored "
            "modules", model_class.__name__)
```
Why is this needed after we added `SupportsQuant` (#13104)? I thought getting the `packed_modules_mapping` from the model to the quant config was the main purpose of that. cc @kylesayrs
The `_configure_packed_modules_mapping` function needs to remain in place until `SupportsQuant` has been added to all applicable models.
Closed due to #14635
Motivation
Some models, such as Qwen2.5-VL, have modified their layer hierarchy compared to their original `transformers` implementation. This change causes quantization skip modules to become ineffective, leading to incorrect initialization of linear methods.
Reproduce code
TODO
- Investigate other quantization methods (e.g. AWQ)
- Optimize the implementation logic