
Conversation

@Chao1Han Chao1Han commented Jun 5, 2025

This test verifies that XCCL remains the default backend for XPU, even when other backends are registered as optional backends for the XPU device.

Copilot AI review requested due to automatic review settings June 5, 2025 05:16

Copilot AI left a comment


Pull Request Overview

Adds a new unit test to confirm that XCCL remains the default distributed backend on XPU even after another backend is registered.

  • Introduces test_xccl_priority to register a dummy backend and run an all-reduce call without specifying a backend.
  • Leverages the existing requires_xccl decorator to skip the test if XCCL isn't available.
Comments suppressed due to low confidence (2)

test/xpu/distributed/test_c10d_xccl.py:568

  • The test currently only invokes all_reduce but doesn't assert that the default backend is actually XCCL. Consider retrieving the process group (e.g., via dist.distributed_c10d._get_default_group()) and asserting its type or backend name to ensure the priority behavior is verified; see the sketch after the quoted line below.
dist.all_reduce(a)
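
For illustration, such an assertion might look like the following minimal sketch. It uses the private _get_default_group() helper mentioned above and assumes the XCCL backend registers under the name "xccl":

# Fetch the default process group and check which backend actually backs it.
pg = dist.distributed_c10d._get_default_group()
# Assumption: the XCCL backend's name reports as "xccl" on XPU.
self.assertEqual(dist.get_backend(pg), "xccl")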

test/xpu/distributed/test_c10d_xccl.py:555

  • [nitpick] The test name test_xccl_priority is a bit generic. Consider renaming to test_default_backend_is_xccl_when_fake_registered for clarity on what scenario is covered.
def test_xccl_priority(self):

Comment on lines +556 to +568
dist.Backend.register_backend(
    "fake",
    lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
    devices=["xpu"],
)
store = dist.FileStore(self.file_name, self.world_size)
dist.init_process_group(
    world_size=self.world_size,
    rank=self.rank,
    store=store,
)
a = torch.randn(2, device="xpu")
dist.all_reduce(a)

Copilot AI Jun 5, 2025


After registering the fake backend, consider unregistering it in a finally block or teardown step to avoid side effects on other tests.

Suggested change
-dist.Backend.register_backend(
-    "fake",
-    lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
-    devices=["xpu"],
-)
-store = dist.FileStore(self.file_name, self.world_size)
-dist.init_process_group(
-    world_size=self.world_size,
-    rank=self.rank,
-    store=store,
-)
-a = torch.randn(2, device="xpu")
-dist.all_reduce(a)
+try:
+    dist.Backend.register_backend(
+        "fake",
+        lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
+        devices=["xpu"],
+    )
+    store = dist.FileStore(self.file_name, self.world_size)
+    dist.init_process_group(
+        world_size=self.world_size,
+        rank=self.rank,
+        store=store,
+    )
+    a = torch.randn(2, device="xpu")
+    dist.all_reduce(a)
+finally:
+    dist.Backend.unregister_backend("fake")

@Chao1Han Chao1Han (Contributor, Author) replied:

The other test cases explicitly init with the xccl backend, so it is safe not to unregister.
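
For context, an explicit init in those other cases presumably looks like the sketch below (exact arguments assumed). Because the backend name is passed explicitly, the leftover "fake" registration is never consulted:

store = dist.FileStore(self.file_name, self.world_size)
dist.init_process_group(
    "xccl",  # backend given explicitly, so per-device backend resolution is skipped
    world_size=self.world_size,
    rank=self.rank,
    store=store,
)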

@Chao1Han Chao1Han (Contributor, Author) commented:

Closing, since pytorch/pytorch#155320 was merged.

@Chao1Han Chao1Han closed this Jun 16, 2025
@Chao1Han Chao1Han deleted the xccl/uts branch June 16, 2025 05:20