
Conversation

@jianyizh jianyizh commented Jun 9, 2025

I have removed the batch kernel config so that IGC will choose GRF mode 128 even with the SYCL assert inside the kernel at vec size = 8. We can now get optimization equivalent to the previous result.

1. Removing SYCL_KERNEL_ASSERT would change the GRF mode from 256 to 128, but there is an existing issue (#1052), so I did not remove it in this PR. We should add the NDEBUG flag later or use vec_size = 4.
2. I saw instruction fetch stalls caused by the if branches, so I moved them to template parameters (see the sketch after this list).
3. I also fixed the vectorization; previously it was not actually enabled.
4. Previously we only used 256 threads per work-group, even though the work-group size limit is 1024.
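
For illustration, here is a minimal sketch of point 2: moving the `per_sample_weights` / `padding_idx` branches into template parameters so every instantiation gets a branch-free inner loop. The functor name, data layout, and member names are assumptions for this sketch, not the actual code in `EmbeddingBag.h`:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Hypothetical functor: one work-group per bag, each work-item owns a strided
// set of feature columns. The two booleans are compile-time, so the generated
// kernel body contains no runtime branch on them.
template <typename scalar_t, bool kHasPerSampleWeights, bool kHasPaddingIdx>
struct EmbeddingBagSumSketch {
  void operator()(sycl::nd_item<1> item) const {
    const int64_t bag = item.get_group(0);
    const int64_t begin = offsets_[bag];
    const int64_t end = offsets_[bag + 1];
    for (int64_t d = item.get_local_id(0); d < feature_dim_;
         d += item.get_local_range(0)) {
      scalar_t acc = 0;
      for (int64_t i = begin; i < end; ++i) {
        const int64_t idx = indices_[i];
        if constexpr (kHasPaddingIdx) {        // resolved at compile time
          if (idx == padding_idx_) continue;
        }
        scalar_t w = weight_[idx * feature_dim_ + d];
        if constexpr (kHasPerSampleWeights) {  // resolved at compile time
          w *= per_sample_weights_[i];
        }
        acc += w;
      }
      output_[bag * feature_dim_ + d] = acc;
    }
  }

  const scalar_t* weight_;
  const int64_t* indices_;
  const int64_t* offsets_;
  const scalar_t* per_sample_weights_;
  scalar_t* output_;
  int64_t feature_dim_;
  int64_t padding_idx_;
};
```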

Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

|                                | PVC      | BMG     |
| ------------------------------ | -------- | ------- |
| main branch                    | 0.18 ms  | 0.43 ms |
| current change                 | 0.08 ms  | 0.23 ms |
| current change + remove assert | 0.07 ms  | 0.22 ms |

|                    | PVC      | BMG     |
| ------------------ | -------- | ------- |
| remove sycl assert | 0.10 ms  | 0.30 ms |
| remove branching   | 0.08 ms  | 0.28 ms |
| tiling             | 0.087 ms | 0.22 ms |

Note: we were stalled at `vec_t other = w_vec_[i_off];`. When the vector size is 8, the assembly was `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC]`. After the fix it becomes a single `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see it is 9% faster.
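
For reference, here is a hedged sketch of the kind of load the fix enables (the actual `vec_t` definition in this PR may differ): reading the weight row through an aligned `sycl::vec` pointer lets the compiler emit one block load, e.g. `load.ugm.d32x4` for eight halves, instead of four scalar `a64` loads.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Assumed helper, not the PR's exact code.
template <typename scalar_t, int vec_size>
using vec_t = sycl::vec<scalar_t, vec_size>;

template <typename scalar_t, int vec_size>
inline vec_t<scalar_t, vec_size> load_weight_chunk(
    const scalar_t* weight_row,  // must be vec_size * sizeof(scalar_t) aligned
    int64_t chunk_idx) {
  auto* w_vec = reinterpret_cast<const vec_t<scalar_t, vec_size>*>(weight_row);
  return w_vec[chunk_idx];  // single vector load when the alignment holds
}
```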

PVC does not benefit from tiling: in this case there would be only 32 work-groups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2 batches is still a regression. The best config on PVC is vec_size = 4 with work-group size = 512, which can reach 0.71ms. There is no benefit on BMG from setting a smaller work-group size.
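
The occupancy reasoning above could be expressed roughly as below. This is a sketch with made-up names and an illustrative heuristic, not the logic actually committed in `EmbeddingBag.cpp`:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Hypothetical heuristic: shrink the vector width (which increases the number
// of work-items and work-groups) until the device's compute units are covered.
inline int choose_vec_size(const sycl::device& dev,
                           int64_t num_bags,
                           int64_t feature_dim,
                           int64_t workgroup_size,
                           int max_vec_size) {
  const int64_t compute_units =
      dev.get_info<sycl::info::device::max_compute_units>();
  int vec_size = max_vec_size;
  while (vec_size > 1) {
    const int64_t work_items = num_bags * (feature_dim / vec_size);
    const int64_t workgroups =
        (work_items + workgroup_size - 1) / workgroup_size;
    if (workgroups >= compute_units) break;  // enough parallelism at this width
    vec_size /= 2;  // under-occupied: trade vector width for more work-groups
  }
  return vec_size;
}
```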

@jianyizh jianyizh force-pushed the jianyi/embed_bag branch from eada1a6 to b566ba6 on June 9, 2025 04:53
@jianyizh jianyizh requested review from toyxu and EikanWang June 9, 2025 06:36
@jianyizh jianyizh marked this pull request as ready for review June 9, 2025 06:36
@Copilot Copilot AI review requested due to automatic review settings June 9, 2025 06:36

@pytorchxpubot

@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result for reference only; note that each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_nn_xpu.TestNN test_LayerNorm_3d_no_affine_large_feature_cuda failed with error message: `AssertionError: Tensor-likes are not close!`

Triage bot response:

```json
{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}
```

@jianyizh jianyizh changed the title from "optimize embedding bag" to "[dont merge] optimize embedding bag" on Jun 17, 2025
@jianyizh jianyizh changed the title from "[dont merge] optimize embedding bag" back to "optimize embedding bag" on Jun 19, 2025
@jianyizh jianyizh requested a review from Copilot July 12, 2025 15:11

@jianyizh jianyizh enabled auto-merge August 13, 2025 15:07
@weishi-deng weishi-deng self-requested a review August 14, 2025 03:12
@jianyizh jianyizh requested review from Copilot and removed request for toyxu August 14, 2025 03:14

@chuanqi129 chuanqi129 disabled auto-merge August 14, 2025 03:30
@chuanqi129 chuanqi129 requested a review from Copilot August 14, 2025 22:37
@chuanqi129 chuanqi129 enabled auto-merge August 14, 2025 22:37

@Copilot Copilot AI left a comment


Pull Request Overview

This PR optimizes the embedding bag kernel by changing the kernel design to use template parameters instead of runtime branches and introducing hardware-aware vector size selection. The changes aim to improve performance by reducing instruction fetch stalls and enabling proper vectorization.

  • Replaced runtime conditional branches with compile-time template parameters for per_sample_weights and padding_idx
  • Changed from 2D to 1D kernel dispatch and removed BatchKernelConfig dependency
  • Added hardware-aware vector size selection based on GPU occupancy calculations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| ---- | ----------- |
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Refactored kernel functor to use 1D dispatch, template parameters for branches, and a lambda for weight handling |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Updated kernel launch configuration, added hardware-aware vectorization logic, and expanded macro calls for template specialization |
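
For readers unfamiliar with the "expanded macro calls for template specialization" mentioned above, the idea is roughly the dispatch below. Names are illustrative, and the real code reportedly expands the equivalent combinations via macros rather than this hand-written switch:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: map two runtime flags onto four compile-time
// specializations of the kernel functor.
template <typename scalar_t, bool kPerSample, bool kPadding>
void launch_impl(sycl::queue& q /*, kernel arguments elided */) {
  // q.parallel_for(nd_range, EmbeddingBagKernel<scalar_t, kPerSample, kPadding>{...});
  (void)q;
}

template <typename scalar_t>
void launch_embedding_bag_sum(sycl::queue& q,
                              bool has_per_sample_weights,
                              bool has_padding_idx) {
  if (has_per_sample_weights) {
    if (has_padding_idx)
      launch_impl<scalar_t, true, true>(q);
    else
      launch_impl<scalar_t, true, false>(q);
  } else {
    if (has_padding_idx)
      launch_impl<scalar_t, false, true>(q);
    else
      launch_impl<scalar_t, false, false>(q);
  }
}
```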


@chuanqi129 chuanqi129 added this pull request to the merge queue Aug 14, 2025
Merged via the queue into main with commit 5ee2a32 Aug 14, 2025
54 of 57 checks passed
@chuanqi129 chuanqi129 deleted the jianyi/embed_bag branch August 14, 2025 22:40