
Support INC FP8 static quantization for deepseek_v3/r1 #1907


Draft: skavulya wants to merge 6 commits into main from deepseek_v3_fp8

Conversation

@skavulya (Contributor) commented Apr 4, 2025

What does this PR do?

Support FP8 static quantization for deepseek v3/r1 models using Intel Neural Compressor (INC)

This feature depends on the companion INC changes (INC PR #2164, installed in the steps below) in addition to the optimum-habana changes in this PR.

Steps for FP8 quantization

# install OH
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git fetch origin pull/1907/head:deepseek_v3_fp8
git checkout deepseek_v3_fp8
pip install -e .
pip install git+https://github.com/HabanaAI/[email protected]
pip install blobfile tiktoken

# install INC PR with OH deepseek_v3 support
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git fetch origin pull/2164/head:oh_ds_r1
git checkout oh_ds_r1
pip uninstall neural_compressor_pt
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py develop pt

# Test FP8 quantization with the Moonlight model on 2 cards with expert parallelism: run the measurement (calibration) pass first
cd ../optimum-habana/examples/text-generation/
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=quantization_config/maxabs_measure.json python3 ../gaudi_spawn.py  --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache  --prompt "DeepSpeed is a machine learning framework"  --parallel_strategy "ep" --trust_remote_code_tokenizer

# Then run the quantization pass; the FP8 dynamic MoE op segfaults if SLICE_MAX_EXPERT > 32, so cap it at 32
SLICE_MAX_EXPERT=32 PT_HPU_LAZY_MODE=1 QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json python3 ../gaudi_spawn.py  --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache  --prompt "DeepSpeed is a machine learning framework"  --parallel_strategy "ep" --trust_remote_code_tokenizer

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@skavulya (Contributor, Author) commented Apr 5, 2025

Converting to draft while the INC changes are in review.

@@ -541,27 +603,14 @@ def forward(self, hidden_states):
final_hidden_states = torch.zeros(
(batch * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
)
htcore.mark_step()
Contributor (@dudilester):
why do we break the graph after the torch.zeros allocation and not before?

Contributor Author (@skavulya):
Thanks @dudilester, I'll move the mark_step calls.
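For context, a minimal sketch of the reordering being discussed (the helper name is hypothetical and the body is illustrative, not the PR's code; `htcore` is `habana_frameworks.torch.core` as in the diff): the lazy-mode graph break moves before the output-buffer allocation rather than after it.

```python
import torch
import habana_frameworks.torch.core as htcore

def alloc_moe_output(hidden_states: torch.Tensor, batch: int,
                     sequence_length: int, hidden_dim: int) -> torch.Tensor:
    # Break the lazy HPU graph before allocating the MoE output buffer,
    # instead of after the allocation as in the current diff.
    htcore.mark_step()
    return torch.zeros(
        (batch * sequence_length, hidden_dim),
        dtype=hidden_states.dtype,
        device=hidden_states.device,
    )
```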

@@ -107,6 +108,57 @@ def _get_unpad_data(attention_mask):
)


class GaudiDeepseekV3LinearFP8(nn.Linear):
Contributor (@dudilester):
where is this class used? why is it needed?

Contributor Author (@skavulya):
@dudilester The class is used by Intel Neural Compressor for dynamic requantization. I will add clarifying comments.
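For readers following the thread, a rough sketch of the intent (the class and method names are taken from the diff; the bodies here are illustrative, not the PR's implementation): the layer subclasses nn.Linear but keeps the checkpoint's block-quantized FP8 weight together with its inverse scales, so INC can dequantize or requantize on demand instead of only ever seeing a pre-dequantized bf16 weight.

```python
import torch
import torch.nn as nn

class GaudiDeepseekV3LinearFP8(nn.Linear):
    """Illustrative sketch: a Linear that carries the FP8 weight's
    per-block inverse scales so INC can requantize it dynamically."""

    def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
        # Inverse scales loaded from the DeepSeek-V3 checkpoint.
        self.scale_inv_fp8 = scale_inv_fp8

    def dequant_block_fp8_weight(self) -> torch.Tensor:
        # Recover a higher-precision weight from the block-quantized FP8
        # weight; see the block-wise dequantization sketch later in the thread.
        raise NotImplementedError
```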

Contributor:
Is it part of the deepseek module? Where is this class used in the optimum-habana code?

@jinyouzhi (Contributor) commented:

Looks useful! #1868 just merged; if I can help with anything, please ping me.

"observer": "maxabs",
"scale_method": "maxabs_hw",
"allowlist": {"types": [], "names": []},
"blocklist": {"types": [], "names": ["self_attn"]},
Contributor (@dudilester):
Why do we block the attn?

Contributor Author (@skavulya):
@dudilester I blocked the self_attn layer due to errors when handling k_b_proj. k_b_proj is used in the rotary embedding, so we need to block it from FP8 quantization with INC. I'll narrow the blocklist down to the specific ops.
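For example, a narrowed blocklist could target only the problematic projection rather than the whole attention module; the exact entry below is hypothetical and depends on the module names INC sees:

```json
"blocklist": {"types": [], "names": ["k_b_proj"]}
```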

@@ -0,0 +1,15 @@
{
"method": "HOOKS",
"mode": "MEASURE",
Contributor:
quant

@skavulya force-pushed the deepseek_v3_fp8 branch 2 times, most recently from a4e6f74 to c9f2b88 on April 29, 2025 06:50.
def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
self.scale_inv_fp8 = scale_inv_fp8

def dequant_block_fp8_weight(self) -> torch.Tensor:
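For context on what a block-wise dequantization like this typically does, here is a standalone sketch, assuming DeepSeek-V3-style 128x128 weight blocks with stored inverse scales (this is not the PR's exact code):

```python
import torch

def dequant_block_fp8_weight(weight_fp8: torch.Tensor,
                             scale_inv: torch.Tensor,
                             block: int = 128) -> torch.Tensor:
    # weight_fp8: (out_features, in_features) FP8 weight
    # scale_inv:  (ceil(out/block), ceil(in/block)) inverse scales, one per block
    out_f, in_f = weight_fp8.shape
    # Broadcast each block's inverse scale over its tile, then rescale.
    scales = scale_inv.repeat_interleave(block, dim=0)[:out_f]
    scales = scales.repeat_interleave(block, dim=1)[:, :in_f]
    return weight_fp8.to(torch.bfloat16) * scales.to(torch.bfloat16)
```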
@@ -182,7 +182,7 @@ python3 ../gaudi_spawn.py --hostfile=<hostfile> --use_deepspeed \

To run Moonlight-16B-A3B (a DeepSeek-V3 like model) inference on a Gaudi2 card use the following command:
```bash
-PT_HPU_LAZY_MODE=1 python3 ./run_generation.py \
+python3 ./run_generation.py \
Why remove the PT_HPU_LAZY_MODE=1?
