
Conversation

@yingxudeng yingxudeng (Collaborator) commented Nov 6, 2025

No description provided.

@yingxudeng yingxudeng (Collaborator, Author) commented Nov 6, 2025

Summary of Pending Items for this PR

This PR is a work in progress. The following items need to be completed before it's ready for final review:

  • Compile the new libtorch_npu.so and update the related Docker images.
  • Refactor the WordEmbedding class:
    • Rename the legacy WordEmbedding implementation to free up the class name.
    • Ensure the new WordEmbedding implementation can share unified logic with MLU and GPU backends.
  • Apply the same refactoring (as point 2) to the LMHead component.
  • Investigate and fix the widespread compilation failures in the test files.
  • Validate that the changes do not break existing model inference pipelines.
  • Review and update the default values for flags like ENABLE_NATIVE_NPU and USE_NPU_TORCH (modify or remove as needed).
  • Analyze and optimize #if preprocessor directives for potential consolidation.
  • Address other minor pending items and technical debt.

@liutongxuan liutongxuan changed the title feat: enable torch_npu graph mode for Qwen-3 dense with single and multi-card TP support. feat: enable torch_npu graph mode for Qwen-3 dense with single and multi-device TP support. Nov 6, 2025
@liutongxuan liutongxuan changed the title feat: enable torch_npu graph mode for Qwen-3 dense with single and multi-device TP support. feat: enable torch_npu graph mode for Qwen-3 dense with TP support. Nov 6, 2025
@yingxudeng yingxudeng force-pushed the feat/qwen3_npu_native_main branch from 4549991 to 1a5e2f0 Compare November 6, 2025 12:23
// for npu
torch::Tensor seq_lens;
int num_heads;
int num_kv_heads;
Collaborator

We can get num_heads and num_kv_heads from the shape of query and key.

Collaborator Author

In xllm_2/xllm/core/layers/common/attention.cpp, the query and key parameters are pre-shaped before being passed in. However, different platforms may require slightly different shapes after the view operation. It might be more flexible to pass the original query and key along with the num_heads_ and num_kv_heads_ parameters, and to perform the view operation inside the batch_prefill method. This approach would provide better cross-platform compatibility and clearer parameter handling.

@yingxudeng yingxudeng force-pushed the feat/qwen3_npu_native_main branch 10 times, most recently from 34afffc to 97f924a Compare November 10, 2025 16:23
@yingxudeng yingxudeng force-pushed the feat/qwen3_npu_native_main branch from 97f924a to f37d69d Compare November 11, 2025 16:11
@yingxudeng yingxudeng (Collaborator, Author) commented:

Could you please help review this PR when you have a moment? 🙏 @yq33victor @XuZhang99

proto::CommUniqueIdList uids;
sync_master_node(master_node_addr, addr_info, uids);

CollectiveCommunicator comm(worker_global_rank, world_size, dp_size, ep_size);
Collaborator Author

todo:
Add a parameter along the lines of npu_operator_backend, and distinguish between the ATB backend and the torch backend in CollectiveCommunicator initialization and in the create_process_groups function.
When new models are added later, automatically select whether npu_operator_backend is ATB or TORCH based on model_type.


CollectiveCommunicator comm(worker_global_rank, world_size, dp_size, ep_size);
const ParallelArgs* parallel_args = comm.parallel_args();
#if defined(USE_MLU) || defined(USE_CUDA)
Collaborator Author

@yingxudeng yingxudeng Nov 27, 2025

Same consideration as above: drop these #if blocks, unify the code, and make the distinction inside the function instead.

torch::Device device_;

protected:
#if defined(USE_NPU)
Collaborator Author

@yingxudeng yingxudeng Nov 27, 2025

This is here because we are on torch 2.1.0, where the Backend class does not yet have a shutdown() method, so the unified code fails to compile. Unifying this would probably require upgrading torch, which is fairly costly, so it stays like this for now.

Collaborator

use #if TORCH_VERSION_MAJOR >= 2 && TORCH_VERSION_MINOR >= 7

// for npu
if (attn_mask.has_value()) {
attn_metadata.attn_mask = attn_mask.value();
attn_metadata.seq_lens = params.kv_seq_lens.to(torch::kCPU);
Collaborator Author

This will break ACL graph capture, because attn_metadata.seq_lens lives on the CPU.
We need to integrate Minchao's new attention implementation.

activation_params.output = output;
activation_params.act_mode = hidden_act_;
activation_params.is_gated = is_gated_;
xllm::kernel::active(activation_params);
Collaborator Author

@yq33victor A question: for functions like active, could we allocate the output buffer inside the function and then return it?
That would make the API feel more torch-like, and this code could then be unified.

fused_layernorm_params.mode = kRmsNormMode;
fused_layernorm_params.eps = eps_;

xllm::kernel::fused_layernorm(fused_layernorm_params);
Collaborator Author

@yq33victor The same consideration applies here.
