
Conversation


q10 commented Nov 4, 2025

Summary:
This PR introduces optimizations for group_index_select_or_add_2d_kernel (the USE_INDEX_SELECT==true path), with a primary focus on the float type and relatively small embedding dimensions. Two changes are implemented (see the sketch below):

  1. Common variables are hoisted out of the loop, omitting unnecessary memory-load synchronizations (the compiler won't do this automatically).
  2. The kernel switches to a logical wave size of 32 threads to reduce granularity losses.
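
For illustration, here is a minimal sketch of both ideas on a CUDA-style gather kernel. This is not FBGEMM's actual kernel: the function and parameter names (`gather_rows_sketch`, `input`, `output`, `indices`, `num_cols`) are hypothetical, and the real kernel also handles groups of tensors and the index-add path.

```cuda
// Minimal sketch of the two optimizations; names and signature are hypothetical.
template <typename scalar_t, int LOGICAL_WAVE_SIZE = 32>
__global__ void gather_rows_sketch(
    const scalar_t* __restrict__ input,   // [num_input_rows, num_cols]
    scalar_t* __restrict__ output,        // [num_output_rows, num_cols]
    const int64_t* __restrict__ indices,  // [num_output_rows]
    int64_t num_output_rows,
    int64_t num_cols) {
  // Optimization 2: split the hardware thread block into logical waves of 32
  // threads, so each wave owns one row. With small embedding dimensions this
  // keeps lanes busy instead of spreading one short row across the whole block.
  const int lane = threadIdx.x % LOGICAL_WAVE_SIZE;
  const int wave = threadIdx.x / LOGICAL_WAVE_SIZE;
  const int waves_per_block = blockDim.x / LOGICAL_WAVE_SIZE;

  for (int64_t row = static_cast<int64_t>(blockIdx.x) * waves_per_block + wave;
       row < num_output_rows;
       row += static_cast<int64_t>(gridDim.x) * waves_per_block) {
    // Optimization 1: hoist the loop-invariant index load and address
    // arithmetic out of the per-column loop. Because `indices` is reached
    // through a pointer, the compiler cannot prove these loads are invariant,
    // so without hoisting it would re-issue them (and wait on them) on every
    // iteration.
    const int64_t src_row = indices[row];
    const scalar_t* src = input + src_row * num_cols;
    scalar_t* dst = output + row * num_cols;

    // USE_INDEX_SELECT == true path: plain gather of one row.
    for (int64_t col = lane; col < num_cols; col += LOGICAL_WAVE_SIZE) {
      dst[col] = src[col];
    }
  }
}
```

With blockDim.x = 128, for example, each block advances four rows at a time, one per 32-thread logical wave, rather than assigning all 128 threads to a single row and idling most of them when num_cols is small.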

Differential Revision: D86135611

Pulled By: q10


netlify bot commented Nov 4, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 23e13e3
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/6909930adcc3ce00088b70ee
😎 Deploy Preview: https://deploy-preview-5080--pytorch-fbgemm-docs.netlify.app

meta-cla bot added the cla signed label Nov 4, 2025

meta-codesync bot commented Nov 4, 2025

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86135611.

q10 force-pushed the export-D86135611 branch from 27c4a26 to b1f8dae on November 4, 2025 05:45
q10 pushed a commit to q10/FBGEMM that referenced this pull request Nov 4, 2025
…h#5080)

Summary:

X-link: facebookresearch/FBGEMM#2087

This PR introduces optimizations for `group_index_select_or_add_2d_kernel` (the `USE_INDEX_SELECT==true` path), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Common variables are hoisted out of the loop, omitting unnecessary memory-load synchronizations (the compiler won't do this automatically).
2) The kernel switches to a logical wave size of 32 threads to reduce granularity losses.


Differential Revision: D86135611

Pulled By: q10
meta-codesync bot commented Nov 4, 2025

@q10 merged this pull request in be1b514.

Bernard-Liu pushed a commit to ROCm/FBGEMM that referenced this pull request Nov 11, 2025
…h#5080)

Summary:
Pull Request resolved: pytorch#5080

X-link: https://github.com/facebookresearch/FBGEMM/pull/2087

This PR introduces optimizations for `group_index_select_or_add_2d_kernel` (the `USE_INDEX_SELECT==true` path), with a primary focus on the `float` type and relatively small embedding dimensions. Two changes are implemented:
1) Common variables are hoisted out of the loop, omitting unnecessary memory-load synchronizations (the compiler won't do this automatically).
2) The kernel switches to a logical wave size of 32 threads to reduce granularity losses.

Pull Request resolved: pytorch#5078

Reviewed By: spcyppt, haoyuz

Differential Revision: D86135611

Pulled By: q10

fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
