
Support INC FP8 static quantization for deepseek_v3/r1 #1907


Draft: skavulya wants to merge 6 commits into main from deepseek_v3_fp8

Conversation

@skavulya (Contributor) commented Apr 4, 2025

What does this PR do?

Support FP8 static quantization for deepseek v3/r1 models using Intel Neural Compressor (INC)

This feature depends on the companion INC changes (INC PR #2164, installed in the steps below) in addition to the optimum-habana changes in this PR.

Steps for FP8 quantization

# install OH
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git fetch origin pull/1907/head:deepseek_v3_fp8
git checkout deepseek_v3_fp8
pip install -e .
pip install git+https://github.com/HabanaAI/[email protected]
pip install blobfile tiktoken

# install INC PR with OH deepseek_v3 support
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git fetch origin pull/2164/head:oh_ds_r1
git checkout oh_ds_r1
pip uninstall neural_compressor_pt
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py develop pt

# Test FP8 quantization with the Moonlight model on 2 cards with expert parallelism: run the measurement (calibration) pass first
cd ../optimum-habana/examples/text-generation/
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=quantization_config/maxabs_measure.json python3 ../gaudi_spawn.py  --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache  --prompt "DeepSpeed is a machine learning framework"  --parallel_strategy "ep" --trust_remote_code_tokenizer

# Then run the quantization pass; the FP8 dynamic MoE op segfaults if SLICE_MAX_EXPERT > 32, so cap it at 32
SLICE_MAX_EXPERT=32 PT_HPU_LAZY_MODE=1 QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json python3 ../gaudi_spawn.py  --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache  --prompt "DeepSpeed is a machine learning framework"  --parallel_strategy "ep" --trust_remote_code_tokenizer

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@skavulya (Contributor, Author) commented Apr 5, 2025

Converting to draft while the INC changes are in review.

@@ -541,27 +603,14 @@ def forward(self, hidden_states):
final_hidden_states = torch.zeros(
(batch * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
)
htcore.mark_step()
Contributor (@dudilester):
why do we break the graph after the torch.zeros allocation and not before?

Contributor Author (@skavulya):
Thanks @dudilester, I'll move the mark_step calls.
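For context, a minimal sketch of the reordering being discussed (the helper name is hypothetical and the body is illustrative, not the PR's code; `htcore` is `habana_frameworks.torch.core` as in the diff): the lazy-mode graph break moves before the output-buffer allocation rather than after it.

```python
import torch
import habana_frameworks.torch.core as htcore

def alloc_moe_output(hidden_states: torch.Tensor, batch: int,
                     sequence_length: int, hidden_dim: int) -> torch.Tensor:
    # Break the lazy HPU graph before allocating the MoE output buffer,
    # instead of after the allocation as in the current diff.
    htcore.mark_step()
    return torch.zeros(
        (batch * sequence_length, hidden_dim),
        dtype=hidden_states.dtype,
        device=hidden_states.device,
    )
```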

@@ -107,6 +108,57 @@ def _get_unpad_data(attention_mask):
)


class GaudiDeepseekV3LinearFP8(nn.Linear):
Contributor (@dudilester):
where is this class used? why is it needed?

Contributor Author (@skavulya):
@dudilester The class is used by Intel Neural Compressor for dynamic requantization. I will add clarifying comments.
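For readers following the thread, a rough sketch of the intent (the class and method names are taken from the diff; the bodies here are illustrative, not the PR's implementation): the layer subclasses nn.Linear but keeps the checkpoint's block-quantized FP8 weight together with its inverse scales, so INC can dequantize or requantize on demand instead of only ever seeing a pre-dequantized bf16 weight.

```python
import torch
import torch.nn as nn

class GaudiDeepseekV3LinearFP8(nn.Linear):
    """Illustrative sketch: a Linear that carries the FP8 weight's
    per-block inverse scales so INC can requantize it dynamically."""

    def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
        # Inverse scales loaded from the DeepSeek-V3 checkpoint.
        self.scale_inv_fp8 = scale_inv_fp8

    def dequant_block_fp8_weight(self) -> torch.Tensor:
        # Recover a higher-precision weight from the block-quantized FP8
        # weight; see the block-wise dequantization sketch later in the thread.
        raise NotImplementedError
```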

Contributor:
Is it part of the deepseek module? Where is this class used in the optimum-habana code?

@jinyouzhi (Contributor) commented:

Looks useful! #1868 just merged; if I can help with anything, please ping me.

"observer": "maxabs",
"scale_method": "maxabs_hw",
"allowlist": {"types": [], "names": []},
"blocklist": {"types": [], "names": ["self_attn"]},
Contributor (@dudilester):
Why do we block the attn?

Contributor Author (@skavulya):
@dudilester I blocked the self_attn layer due to errors when handling k_b_proj. k_b_proj is used in the rotary embedding, so we need to block it from FP8 quantization with INC. I'll narrow the blocklist down to the specific ops.
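For example, a narrowed blocklist could target only the problematic projection rather than the whole attention module; the exact entry below is hypothetical and depends on the module names INC sees:

```json
"blocklist": {"types": [], "names": ["k_b_proj"]}
```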

@@ -0,0 +1,15 @@
{
"method": "HOOKS",
"mode": "MEASURE",
Contributor:
quant

@skavulya force-pushed the deepseek_v3_fp8 branch 2 times, most recently from a4e6f74 to c9f2b88 on April 29, 2025 06:50.
def set_scale_inv_fp8(self, scale_inv_fp8: torch.Tensor):
self.scale_inv_fp8 = scale_inv_fp8

def dequant_block_fp8_weight(self) -> torch.Tensor:
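For context on what a block-wise dequantization like this typically does, here is a standalone sketch, assuming DeepSeek-V3-style 128x128 weight blocks with stored inverse scales (this is not the PR's exact code):

```python
import torch

def dequant_block_fp8_weight(weight_fp8: torch.Tensor,
                             scale_inv: torch.Tensor,
                             block: int = 128) -> torch.Tensor:
    # weight_fp8: (out_features, in_features) FP8 weight
    # scale_inv:  (ceil(out/block), ceil(in/block)) inverse scales, one per block
    out_f, in_f = weight_fp8.shape
    # Broadcast each block's inverse scale over its tile, then rescale.
    scales = scale_inv.repeat_interleave(block, dim=0)[:out_f]
    scales = scales.repeat_interleave(block, dim=1)[:, :in_f]
    return weight_fp8.to(torch.bfloat16) * scales.to(torch.bfloat16)
```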
@@ -182,7 +182,7 @@ python3 ../gaudi_spawn.py --hostfile=<hostfile> --use_deepspeed \

To run Moonlight-16B-A3B (a DeepSeek-V3 like model) inference on a Gaudi2 card use the following command:
```bash
-PT_HPU_LAZY_MODE=1 python3 ./run_generation.py \
+python3 ./run_generation.py \
Why remove the PT_HPU_LAZY_MODE=1?
