Support INC FP8 static quantization for deepseek_v3/r1 #1907
base: main
Conversation
Converting to draft while INC changes in review
Force-pushed from 4d6bf13 to e7ed4c3
@@ -541,27 +603,14 @@ def forward(self, hidden_states):
        final_hidden_states = torch.zeros(
            (batch * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
        )
        htcore.mark_step()
why do we break the graph after the torch.zeros allocation and not before?
Thanks @dudilester, I'll move the mark_steps.
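For context, a minimal sketch of the placement change being discussed, assuming the MoE forward shown in the hunk above; the function name and surrounding structure are hypothetical, the point is only that the graph break would move before the allocation:

```python
import torch
import habana_frameworks.torch.core as htcore


def moe_forward_prologue(hidden_states: torch.Tensor, hidden_dim: int) -> torch.Tensor:
    # Hypothetical excerpt mirroring the hunk above, not the actual patch.
    batch, sequence_length, _ = hidden_states.shape

    # Break the lazy-mode graph *before* the large allocation, as the review
    # suggests, so torch.zeros starts a fresh graph segment.
    htcore.mark_step()

    final_hidden_states = torch.zeros(
        (batch * sequence_length, hidden_dim),
        dtype=hidden_states.dtype,
        device=hidden_states.device,
    )
    return final_hidden_states
```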
@@ -107,6 +108,57 @@ def _get_unpad_data(attention_mask):
    )


class GaudiDeepseekV3LinearFP8(nn.Linear):
where is this class used? why is it needed?
@dudilester The class is used by Intel Neural Compressor for dynamic requantization. I will add clarifying comments.
Is it a part of the deepseek module? where is this class used in the optimum-habana code?
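For readers following the thread, a rough sketch of what such a class could look like; `set_scale_inv_fp8` and `dequant_block_fp8_weight` appear later in this diff, while the block size, tensor shapes, and method bodies below are assumptions rather than the PR's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BLOCK = 128  # assumed block size for the DeepSeek-V3 block-quantized FP8 checkpoint


class GaudiDeepseekV3LinearFP8(nn.Linear):
    """Linear layer keeping its weight in FP8 together with per-block inverse scales.

    Subclassing nn.Linear is what would let Intel Neural Compressor patch it like
    any other Linear and requantize it with its own static scales (sketch only).
    """

    def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
        # (out_features // BLOCK, in_features // BLOCK) inverse scales from the checkpoint.
        self.scale_inv_fp8 = scale_inv_fp8

    def dequant_block_fp8_weight(self) -> torch.Tensor:
        # Simplest case: both weight dimensions are exact multiples of BLOCK.
        w = self.weight.to(torch.bfloat16)
        s = self.scale_inv_fp8.to(torch.bfloat16).repeat_interleave(BLOCK, dim=0)
        s = s.repeat_interleave(BLOCK, dim=1)
        return w * s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly and run a regular matmul in the activation dtype.
        return F.linear(x, self.dequant_block_fp8_weight().to(x.dtype), self.bias)
```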
Looks useful! #1868 just merged; if I can help with anything, please ping me.
"observer": "maxabs", | ||
"scale_method": "maxabs_hw", | ||
"allowlist": {"types": [], "names": []}, | ||
"blocklist": {"types": [], "names": ["self_attn"]}, |
Why do we block the attn?
@dudilester I blocked the self_attn layer due to some errors when handling k_b_proj. k_b_proj is used in the rotary embedding, so we need to block it from FP8 quantization with INC. I'll narrow down the blocklist to the specific ops.
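For illustration, a hedged sketch of what a narrowed config could look like, written as a Python dict to stay in one language with the other snippets; the key names mirror the JSON snippets in this thread, the op name comes from the comment above, and the exact entries INC expects should be checked against its documentation:

```python
# Hypothetical quantization config: instead of blocking all of self_attn, only
# the projection feeding the rotary embedding (k_b_proj, per the comment above)
# is kept out of FP8 quantization.
inc_quant_config = {
    "method": "HOOKS",
    "mode": "QUANTIZE",  # assumed counterpart of the MEASURE mode shown below
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["k_b_proj"]},
}
```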
@@ -0,0 +1,15 @@
{
    "method": "HOOKS",
    "mode": "MEASURE",
quant
Force-pushed from a4e6f74 to c9f2b88
def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
    self.scale_inv_fp8 = scale_inv_fp8

def dequant_block_fp8_weight(self) -> torch.Tensor:
This method was updated; please refer to https://github.com/HabanaAI/vllm-fork/blob/1b40abb7d3eb069f8bdbf6c34609b4afb0f53c54/vllm/model_executor/layers/quantization/fp8.py#L315C9-L326.
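Without inlining the vllm-fork code here, a rough sketch of the general approach for weights whose dimensions are not exact multiples of the block size; the linked implementation may differ in its details:

```python
import torch


def dequant_block_fp8_weight(weight: torch.Tensor,
                             scale_inv: torch.Tensor,
                             block: int = 128,
                             dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """Expand per-block inverse scales and trim them back to the weight shape (sketch)."""
    w = weight.to(dtype)
    # Each scale covers a block x block tile; expand, then slice so that a
    # partial tile at the end of either dimension is handled as well.
    s = scale_inv.to(dtype).repeat_interleave(block, dim=0)[: w.shape[0]]
    s = s.repeat_interleave(block, dim=1)[:, : w.shape[1]]
    return w * s
```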
@@ -182,7 +182,7 @@ python3 ../gaudi_spawn.py --hostfile=<hostfile> --use_deepspeed \

To run Moonlight-16B-A3B (a DeepSeek-V3 like model) inference on a Gaudi2 card use the following command:
```bash
PT_HPU_LAZY_MODE=1 python3 ./run_generation.py \
python3 ./run_generation.py \
Why remove the PT_HPU_LAZY_MODE=1?
Fix for FP8 layers which do not match block dimensions
Force-pushed from 9a9d187 to 75f5ddf
What does this PR do?
Support FP8 static quantization for DeepSeek V3/R1 models using Intel Neural Compressor (INC).
This feature needs changes in:
Steps for FP8 quantization
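Sketched below (as Python dicts, to stay consistent with the other snippets) is the measure-then-quantize flow implied by the configs in the diffs above: a first pass records maxabs statistics, a second pass applies static FP8 scales. The key names mirror the snippets in this PR; the QUANTIZE mode value, and anything else not shown there, is an assumption.

```python
# Step 1: measurement pass -- run calibration data through the model so INC
# records maxabs statistics for each quantizable op.
measure_config = {
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["self_attn"]},
}

# Step 2: quantization pass -- re-run generation with static FP8 scales derived
# from the measurements ("QUANTIZE" is an assumed mode name).
quant_config = {
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["self_attn"]},
}
```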