optimize embedding bag #1726
Conversation
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
Triage bot response: {
"similar_issue_id": 845,
"similar_issue_state": "closed",
"issue_owner": "daisyden",
"issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
"root_causes": [
"Discrepancies in LayerNorm implementation between CUDA and XPU.",
"Potential differences in precision or kernel behavior affecting tensor outputs.",
"Misalignment in computation leading to inconsistent gradients."
],
"suggested_solutions": [
"Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
"Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
"Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
]
}
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
This PR optimizes the embedding bag kernel by changing the kernel design to use template parameters instead of runtime branches and introducing hardware-aware vector size selection. The changes aim to improve performance by reducing instruction fetch stalls and enabling proper vectorization.
- Replaced runtime conditional branches with compile-time template parameters for `per_sample_weights` and `padding_idx`
- Changed from 2D to 1D kernel dispatch and removed BatchKernelConfig dependency
- Added hardware-aware vector size selection based on GPU occupancy calculations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Refactored kernel functor to use 1D dispatch, template parameters for branches, and lambda for weight handling |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Updated kernel launch configuration, added hardware-aware vectorization logic, and expanded macro calls for template specialization |
I have removed the BatchKernelConfig so that IGC will choose GRF mode 128 even with the SYCL assert inside the kernel at vec size = 8. We now get optimization equivalent to the previous result.
1. Remove SYCL_KERNEL_ASSERT. This changes the GRF mode from 256 to 128, but there is an existing issue #1052, so I did not remove it in this PR. We should add the NDEBUG flag later or use vec_size = 4.
2. I see instruction fetch stalls caused by the if branches, so I moved them to template parameters.
3. I also fixed the vectorization. Previously it was not actually enabled.
4. Previously we used only 256 threads per workgroup, while the workgroup size is 1024.
Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:
| Step | Time 1 | Time 2 | Note |
|---|---|---|---|
| remove sycl assert | 0.10 ms | 0.30 ms | |
| remove branching | 0.08 ms | 0.28 ms | |
| tiling | 0.087 ms | 0.22 ms | We are stalled here: `vec_t other = w_vec_[i_off];` |

When the vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see it is 9% faster. PVC does not benefit from tiling; in this case there will be 32 workgroups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which can reach 0.71ms. There is no benefit on BMG to setting a smaller workgroup size.