
Conversation

@avbokovoy
Contributor

This PR optimizes the group_index_select_or_add_2d_kernel kernel (USE_INDEX_SELECT==true), with a primary focus on the float type and relatively small embedding dimensions. Two changes are implemented (both are sketched right after this list):

  1. Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
  2. Switch to a logical wave size of 32 threads to reduce granularity losses.
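A minimal, purely illustrative sketch of both ideas, not the actual FBGEMM kernel: the kernel name, parameter names, and shapes below are placeholders.

// Illustrative sketch only -- not the real group_index_select_or_add_2d_kernel.
// All names, parameters, and shapes are placeholders.

// (2) Use a 32-thread logical wave: on ROCm devices with 64-lane wavefronts,
// mapping one row to 32 lanes wastes fewer lanes when the embedding
// dimension is small.
constexpr int EMULATED_WARP_SIZE = 32;

__global__ void index_select_2d_sketch(
    const float* input,      // [num_input_rows, num_cols]
    float* output,           // [num_output_rows, num_cols]
    const int64_t* indices,  // [num_output_rows], rows to gather
    int64_t num_output_rows,
    int64_t num_cols) {
  // (1) Loop-invariant values are hoisted out of the row loop so their loads
  // (and the waits on them) are issued once, not on every iteration.
  const int lane = threadIdx.x % EMULATED_WARP_SIZE;
  const int64_t wave_id =
      (static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x) /
      EMULATED_WARP_SIZE;
  const int64_t num_waves =
      static_cast<int64_t>(gridDim.x) * blockDim.x / EMULATED_WARP_SIZE;

  // Each 32-thread logical wave processes one output row at a time
  // (the USE_INDEX_SELECT == true path: gather input rows by index).
  for (int64_t row = wave_id; row < num_output_rows; row += num_waves) {
    const float* src = input + indices[row] * num_cols;
    float* dst = output + row * num_cols;
    for (int64_t col = lane; col < num_cols; col += EMULATED_WARP_SIZE) {
      dst[col] = src[col];
    }
  }
}

The real kernel also handles groups of tensors and the scatter-add path (USE_INDEX_SELECT == false), which this sketch omits.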

…ices for group_index_select_or_add_2d_kernel
@netlify

netlify bot commented Nov 3, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 799dad0
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/6909c89371bf93000860d668
😎 Deploy Preview: https://deploy-preview-5078--pytorch-fbgemm-docs.netlify.app

@meta-codesync
Contributor

meta-codesync bot commented Nov 3, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D86135611.


// The wave size is forced to be 32 on ROCm devices to reduce
// granularity losses.
constexpr int EMULATED_WARP_SIZE = 32;
Contributor


Can we ensure that EMULATED_WARP_SIZE = kWarpSize for CUDA?
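
For illustration only, something along these lines could work (kWarpSize is the existing warp-size constant; the USE_ROCM guard is an assumption about the build macros, and the actual change may differ):

#ifdef USE_ROCM
// Force a 32-thread logical wave on ROCm to reduce granularity losses.
constexpr int EMULATED_WARP_SIZE = 32;
#else
// On CUDA, keep the logical wave equal to the hardware warp size.
constexpr int EMULATED_WARP_SIZE = kWarpSize;
#endif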

Contributor


updated this in the internal diff.

Contributor Author


Done in 799dad0

Contributor


merged.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Nov 4, 2025
…h#5078)

Summary:
This PR optimizes the `group_index_select_or_add_2d_kernel` kernel (`USE_INDEX_SELECT==true`), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.


Differential Revision: D86135611

Pulled By: q10
@meta-codesync
Contributor

meta-codesync bot commented Nov 4, 2025

@q10 merged this pull request in be1b514.

Bernard-Liu pushed a commit to ROCm/FBGEMM that referenced this pull request Nov 11, 2025
…h#5080)

Summary:
Pull Request resolved: pytorch#5080

X-link: https://github.com/facebookresearch/FBGEMM/pull/2087

This PR optimizes the `group_index_select_or_add_2d_kernel` kernel (`USE_INDEX_SELECT==true`), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Hoist the common variables out of the loop to avoid unnecessary synchronizations on memory loads (the compiler won't do this automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.

Pull Request resolved: pytorch#5078

Reviewed By: spcyppt, haoyuz

Differential Revision: D86135611

Pulled By: q10

fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
