
Conversation

@avbokovoy
Contributor

This PR optimizes the group_index_select_or_add_2d_kernel kernel (USE_INDEX_SELECT==true), with a primary focus on the float type and relatively small embedding dimensions. Two changes are implemented (both are sketched right after this list):

  1. Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
  2. Switch to a logical wave size of 32 threads to reduce granularity losses.
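A minimal, purely illustrative sketch of both ideas, not the actual FBGEMM kernel: the kernel name, parameter names, and shapes below are placeholders.

// Illustrative sketch only -- not the real group_index_select_or_add_2d_kernel.
// All names, parameters, and shapes are placeholders.

// (2) Use a 32-thread logical wave: on ROCm devices with 64-lane wavefronts,
// mapping one row to 32 lanes wastes fewer lanes when the embedding
// dimension is small.
constexpr int EMULATED_WARP_SIZE = 32;

__global__ void index_select_2d_sketch(
    const float* input,      // [num_input_rows, num_cols]
    float* output,           // [num_output_rows, num_cols]
    const int64_t* indices,  // [num_output_rows], rows to gather
    int64_t num_output_rows,
    int64_t num_cols) {
  // (1) Loop-invariant values are hoisted out of the row loop so their loads
  // (and the waits on them) are issued once, not on every iteration.
  const int lane = threadIdx.x % EMULATED_WARP_SIZE;
  const int64_t wave_id =
      (static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x) /
      EMULATED_WARP_SIZE;
  const int64_t num_waves =
      static_cast<int64_t>(gridDim.x) * blockDim.x / EMULATED_WARP_SIZE;

  // Each 32-thread logical wave processes one output row at a time
  // (the USE_INDEX_SELECT == true path: gather input rows by index).
  for (int64_t row = wave_id; row < num_output_rows; row += num_waves) {
    const float* src = input + indices[row] * num_cols;
    float* dst = output + row * num_cols;
    for (int64_t col = lane; col < num_cols; col += EMULATED_WARP_SIZE) {
      dst[col] = src[col];
    }
  }
}

The real kernel also handles groups of tensors and the scatter-add path (USE_INDEX_SELECT == false), which this sketch omits.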

…ices for group_index_select_or_add_2d_kernel
@netlify

netlify bot commented Nov 3, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 799dad0
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/6909c89371bf93000860d668
😎 Deploy Preview: https://deploy-preview-5078--pytorch-fbgemm-docs.netlify.app

@meta-codesync
Contributor

meta-codesync bot commented Nov 3, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D86135611.


// The wave size is forced to be 32 on ROCm devices to reduce
// granularity losses.
constexpr int EMULATED_WARP_SIZE = 32;
Contributor


Can we ensure that EMULATED_WARP_SIZE = kWarpSize for CUDA?
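
For illustration only, something along these lines could work (kWarpSize is the existing warp-size constant; the USE_ROCM guard is an assumption about the build macros, and the actual change may differ):

#ifdef USE_ROCM
// Force a 32-thread logical wave on ROCm to reduce granularity losses.
constexpr int EMULATED_WARP_SIZE = 32;
#else
// On CUDA, keep the logical wave equal to the hardware warp size.
constexpr int EMULATED_WARP_SIZE = kWarpSize;
#endif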

Contributor


updated this in the internal diff.

Contributor Author


Done in 799dad0

Contributor


merged.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Nov 4, 2025
…h#5078)

Summary:
This PR optimizes the `group_index_select_or_add_2d_kernel` kernel (`USE_INDEX_SELECT==true`), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.


Differential Revision: D86135611

Pulled By: q10
@meta-codesync
Contributor

meta-codesync bot commented Nov 4, 2025

@q10 merged this pull request in be1b514.

Bernard-Liu pushed a commit to ROCm/FBGEMM that referenced this pull request Nov 11, 2025
…h#5080)

Summary:
Pull Request resolved: pytorch#5080

X-link: https://github.com/facebookresearch/FBGEMM/pull/2087

This PR optimizes the `group_index_select_or_add_2d_kernel` kernel (`USE_INDEX_SELECT==true`), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.

Pull Request resolved: pytorch#5078

Reviewed By: spcyppt, haoyuz

Differential Revision: D86135611

Pulled By: q10

fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
