[UT]XCCL remains the default backend for XPU #1721
Conversation
Pull Request Overview
Adds a new unit test to confirm that XCCL remains the default distributed backend on XPU even after another backend is registered.
- Introduces `test_xccl_priority` to register a dummy backend and run an all-reduce call without specifying a backend.
- Leverages the existing `requires_xccl` decorator to skip if XCCL isn't available.
Comments suppressed due to low confidence (2)
test/xpu/distributed/test_c10d_xccl.py:568
- The test currently only invokes `all_reduce` but doesn't assert that the default backend is actually XCCL. Consider retrieving the process group (e.g., via `dist.distributed_c10d._get_default_group()`) and asserting its type or backend name to ensure the priority behavior is verified.

```python
dist.all_reduce(a)
```
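The priority behavior the reviewer wants asserted can be modeled with a toy, torch-free backend registry (all names below are illustrative, not the `torch.distributed` API): the default backend for a device should remain the first one registered, regardless of later registrations.

```python
# Toy model of per-device backend priority (illustrative only; this is
# NOT the torch.distributed API). The first backend registered for a
# device is treated as the default; later registrations don't displace it.

class BackendRegistry:
    def __init__(self):
        self._by_device = {}  # device name -> backend names, in order

    def register(self, name, devices):
        for dev in devices:
            self._by_device.setdefault(dev, []).append(name)

    def default_for(self, device):
        # Priority = registration order: the earliest registration wins.
        return self._by_device[device][0]


registry = BackendRegistry()
registry.register("xccl", devices=["xpu"])  # built-in default
registry.register("fake", devices=["xpu"])  # dummy backend, added later

print(registry.default_for("xpu"))  # -> xccl
```

In this model, an assertion like `assert registry.default_for("xpu") == "xccl"` after registering `"fake"` is the analogue of checking the default process group's backend name in the real test.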
test/xpu/distributed/test_c10d_xccl.py:555
- [nitpick] The test name `test_xccl_priority` is a bit generic. Consider renaming to `test_default_backend_is_xccl_when_fake_registered` for clarity on what scenario is covered.

```python
def test_xccl_priority(self):
```
```python
dist.Backend.register_backend(
    "fake",
    lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
    devices=["xpu"],
)
store = dist.FileStore(self.file_name, self.world_size)
dist.init_process_group(
    world_size=self.world_size,
    rank=self.rank,
    store=store,
)
a = torch.randn(2, device="xpu")
dist.all_reduce(a)
```
Copilot AI · Jun 5, 2025
After registering the fake backend, consider unregistering it in a finally block or teardown step to avoid side effects on other tests.
Suggested change:

```python
try:
    dist.Backend.register_backend(
        "fake",
        lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
        devices=["xpu"],
    )
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    a = torch.randn(2, device="xpu")
    dist.all_reduce(a)
finally:
    dist.Backend.unregister_backend("fake")
```
Other test cases explicitly initialize with the xccl backend, so it is safe not to unregister here.
Closing, since pytorch/pytorch#155320 has been merged.
This test is designed to verify that XCCL remains the default backend for XPU, even when other backends are also registered for XPU devices.