mawong-amd

Temporarily pin numpy < 2.0.0, as NumPy 2.0 breaks ROCm PyTorch. Counterpart to vllm-project#5582

Fix XGMI 1-hop detection. The previous version had the following problems:

  1. It ignored the device_ids passed in.
  2. Each device only checked whether it was itself 1-hop XGMI connected to all other devices. Instead, each device should check that every device is 1-hop XGMI connected to every other device. This prevents the odd case where some devices are 1-hop XGMI connected to all other devices but others are not, in which case only a subset of devices would enable custom_all_reduce, resulting in deadlock.
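To illustrate the difference between the two checks, here is a minimal sketch. It is not the code from this PR: the hop-count matrix `HOPS` and both function names are hypothetical, and a real implementation would query the topology from the ROCm tooling rather than use a hard-coded table. The point is that the all-pairs check gives the same answer on every device, while the per-device check can disagree between devices.

```python
from itertools import combinations
from typing import Sequence

# Hypothetical hop counts: HOPS[a][b] is the number of XGMI hops from a to b.
# Devices 2 and 3 are 2 hops apart; every other pair is 1 hop apart.
HOPS = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 2],
    [1, 1, 2, 0],
]


def self_is_one_hop(device: int, device_ids: Sequence[int],
                    hops: Sequence[Sequence[int]]) -> bool:
    """Old-style check: only looks at links from this one device."""
    return all(hops[device][other] == 1
               for other in device_ids if other != device)


def all_pairs_one_hop(device_ids: Sequence[int],
                      hops: Sequence[Sequence[int]]) -> bool:
    """All-pairs check: every pair among device_ids must be 1 hop apart.

    Only the devices in device_ids are considered, so the result is
    identical no matter which device evaluates it.
    """
    return all(hops[a][b] == 1 and hops[b][a] == 1
               for a, b in combinations(device_ids, 2))


ids = [0, 1, 2, 3]
# Per-device results disagree: devices 0 and 1 would enable the fast path,
# devices 2 and 3 would not.
per_device = [self_is_one_hop(d, ids, HOPS) for d in ids]
# The all-pairs result is a single value shared by every device.
everyone = all_pairs_one_hop(ids, HOPS)
```

With this table, restricting the check to `device_ids = [0, 1, 2]` passes, which is why honoring the passed-in `device_ids` matters as well.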

@mawong-amd mawong-amd merged commit 3e7b0b6 into main Jun 25, 2024