optimize embedding bag #1726
Conversation
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
Triage bot response: {
"similar_issue_id": 845,
"similar_issue_state": "closed",
"issue_owner": "daisyden",
"issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
"root_causes": [
"Discrepancies in LayerNorm implementation between CUDA and XPU.",
"Potential differences in precision or kernel behavior affecting tensor outputs.",
"Misalignment in computation leading to inconsistent gradients."
],
"suggested_solutions": [
"Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
"Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
"Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
]
}
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
This PR optimizes the embedding bag kernel by changing the kernel design to use template parameters instead of runtime branches and introducing hardware-aware vector size selection. The changes aim to improve performance by reducing instruction fetch stalls and enabling proper vectorization.
- Replaced runtime conditional branches with compile-time template parameters for `per_sample_weights` and `padding_idx`
- Changed from 2D to 1D kernel dispatch and removed BatchKernelConfig dependency
- Added hardware-aware vector size selection based on GPU occupancy calculations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Refactored kernel functor to use 1D dispatch, template parameters for branches, and lambda for weight handling |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Updated kernel launch configuration, added hardware-aware vectorization logic, and expanded macro calls for template specialization |
I have removed the BatchKernelConfig so that IGC will choose GRF mode 128 even with the SYCL assert inside the kernel at vec size = 8. We now get optimization equivalent to the previous result.
1. Remove SYCL_KERNEL_ASSERT. This changes the GRF mode from 256 to 128, but there is an existing issue #1052, so I did not remove it in this PR. We should add the NDEBUG flag later or use vec_size = 4.
2. I see instruction fetch stalls caused by the if branches, so I moved them to template parameters.
3. I also fixed the vectorization. Previously it was not actually enabled.
4. Previously we used only 256 threads per workgroup, while the workgroup size is 1024.
Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:
| Step | Time 1 | Time 2 | Note |
|---|---|---|---|
| remove sycl assert | 0.10 ms | 0.30 ms | |
| remove branching | 0.08 ms | 0.28 ms | |
| tiling | 0.087 ms | 0.22 ms | We are stalled here: `vec_t other = w_vec_[i_off];` |

When the vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see it is 9% faster. PVC does not benefit from tiling; in this case there will be 32 workgroups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which can reach 0.71ms. There is no benefit on BMG to setting a smaller workgroup size.