-
Notifications
You must be signed in to change notification settings - Fork 473
uct/ugni: fix initialization #3080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We still strongly recommend against using UCX on Cray systems with Open MPI (we will probably never recommend it) but this fixes one glaring issue. I am tracking another bug and will likely just open an issue for that one as I don't have the time to fix ucp bugs. |
Test PASSed. |
Test FAILed. |
bot:mellanox:retest |
Test PASSed. |
jenkins env issues bot:mlx:retest |
Test PASSed. |
Updated to correct the domain_id type in one other struct. Runs cleanly with btl/uct now. |
.num_devices = -1, | ||
.initialized = 0, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note here. I removed any extraneous line setting a struct member to 0. C requires the compiler to initialize any uninitialized memory in the data sections to 0 (both public and private) so .ptag = 0 does nothing.
We were getting user complaints in Open MPI because UCX was failing to initialize when used on a Cray system. This commit fixes the issue but changing the way the CDM is calculated to match what is used in Open MPI. This eliminates the need for using Cray's PMI at all. I am taking away a bit of uniqueness from the CDM space because this might have to live side-by-side with btl/ugni. Signed-off-by: Nathan Hjelm <[email protected]>
Signed-off-by: Nathan Hjelm <[email protected]>
One more fix. @hppritcha They were not using FMA_SHARED and since they do not expose the CDM mode as a parameter they have to always set it. Without it the result is predicable:
|
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
@hjelmn - my understanding there are some failures ? how do we want to proceed ? |
@shamisp The failures are on systems without ugni from what I can see. I can't see why the internal Mellanox stuff is failing but this is working fine on our XC-40. |
@@ -389,26 +372,58 @@ ucs_status_t uct_ugni_iface_get_dev_address(uct_iface_t *tl_iface, uct_device_ad | |||
return UCS_OK; | |||
} | |||
|
|||
static int uct_ugni_next_power_of_two_inclusive (int value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe should it go to ucs as an utility ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up to @hjelmn
failures should be fixed by #3083 bot:mlx:retest |
Test FAILed. |
bot:mlx:retest |
Test PASSed. |
We were getting user complaints in Open MPI because UCX was failing to
initialize when used on a Cray system. This commit fixes the issue but
changing the way the CDM is calculated to match what is used in Open
MPI. This eliminates the need for using Cray's PMI at all. I am taking
away a bit of uniqueness from the CDM space because this might have to
live side-by-side with btl/ugni.
Signed-off-by: Nathan Hjelm [email protected]
What
Fixes UCX initialization on Cray systems.
Why ?
Support for Cray systems is broken when not using Cray PMI.
How ?
Create the CDM identifier using something other than the PMI properties. Also fix the domain_id field in the CDM struct. uint16_t was wrong and was likely being overflowed.