@emlin emlin commented Nov 4, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2090

With feature score eviction, TBE calls the backend to update feature score metadata separately in the forward pass.
This update is designed to be asynchronous so that it does not block the forward/backward pass. However, the CPU-blocking operation stalled the main stream, so all2all could not start immediately after get_cuda.
From a dummy profile, we can see this trace:
{F1983224804}

The set metadata operation becomes a blocker on the critical path, taking 217ms.

With this change, we can see the trace is updated to:
{F1983224830}

Overall prefetch is reduced to less than 70ms, and get_cuda is now followed by all2all immediately, with no additional waiting or stream sync.
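
To illustrate the general idea only, here is a minimal Python sketch of moving a CPU-blocking step off the critical path by handing it to a worker thread. It is not the code in this diff: every function name below is a hypothetical stand-in, the exact position of the metadata update relative to get_cuda is an assumption, and `time.sleep` merely simulates the blocking backend call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# All names below are hypothetical stand-ins, not FBGEMM APIs.

def set_feature_score_metadata(log, delay_s=0.05):
    """Simulate the CPU-blocking backend metadata update."""
    time.sleep(delay_s)
    log.append("set_metadata")

def get_cuda(log):
    """Stand-in for the get_cuda step seen in the trace."""
    log.append("get_cuda")

def all2all(log):
    """Stand-in for the all2all step seen in the trace."""
    log.append("all2all")

def prefetch_blocking(log):
    # The metadata update runs inline, so all2all has to wait for it.
    get_cuda(log)
    set_feature_score_metadata(log)
    all2all(log)

def prefetch_async(log, executor):
    # The metadata update is submitted to a worker thread, so the caller
    # proceeds to all2all right after get_cuda.
    get_cuda(log)
    future = executor.submit(set_feature_score_metadata, log)
    all2all(log)
    return future  # the caller can wait on this later if it needs to

if __name__ == "__main__":
    blocking_log = []
    prefetch_blocking(blocking_log)
    print("blocking:", blocking_log)

    async_log = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        prefetch_async(async_log, executor).result()
    print("async:   ", async_log)
```

In the asynchronous variant the simulated metadata update still completes, but it no longer sits between get_cuda and all2all.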

Differential Revision: D86013406


netlify bot commented Nov 4, 2025

Deploy Preview for pytorch-fbgemm-docs failed.

Name Link
🔨 Latest commit f716372
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690e9b93d98328000825ffab

@meta-cla meta-cla bot added the cla signed label Nov 4, 2025

meta-codesync bot commented Nov 4, 2025

@emlin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86013406.

emlin added a commit to emlin/FBGEMM that referenced this pull request Nov 4, 2025
emlin added a commit to emlin/FBGEMM that referenced this pull request Nov 7, 2025
Reviewed By: steven1327, kathyxuyy

Differential Revision: D86013406
https://www.internalfb.com/ai_infra/zoomer/profiling-run/overview?profilingRunID=1913270729575721


meta-codesync bot commented Nov 8, 2025

This pull request has been merged in ee26ed4.
