
uct/ugni: fix initialization #3080


Merged
merged 2 commits into from
Dec 16, 2018

Conversation

hjelmn
Contributor

@hjelmn hjelmn commented Dec 4, 2018

We were getting user complaints in Open MPI because UCX was failing to
initialize when used on a Cray system. This commit fixes the issue by
changing the way the CDM is calculated to match what is used in Open
MPI. This eliminates the need for using Cray's PMI at all. I am taking
away a bit of uniqueness from the CDM space because this might have to
live side-by-side with btl/ugni.

Signed-off-by: Nathan Hjelm [email protected]

What

Fixes UCX initialization on Cray systems.

Why?

Support for Cray systems is broken when not using Cray PMI.

How?

Create the CDM identifier using something other than the PMI properties. Also fix the domain_id field in the CDM struct. uint16_t was the wrong type and was likely overflowing.
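As an illustration of the idea (a hypothetical helper, not the PR's actual code): the CDM domain id can be derived from locally available process information instead of PMI properties, and it needs a type wider than uint16_t because ordinary PIDs already exceed 65535:

```c
#include <stdint.h>
#include <unistd.h>

/* Illustrative sketch only: build a 32-bit CDM domain id from the process id
 * plus a per-process interface counter, with no PMI involvement. A uint16_t
 * here would overflow, since PIDs routinely exceed 65535. */
static uint32_t example_cdm_domain_id(void)
{
    static uint32_t iface_count = 0;            /* bumped for each opened interface */
    uint32_t        pid         = (uint32_t)getpid();

    return (pid << 8) | (iface_count++ & 0xffu);
}
```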

@hjelmn
Contributor Author

hjelmn commented Dec 4, 2018

We still strongly recommend against using UCX on Cray systems with Open MPI (we will probably never recommend it), but this fixes one glaring issue. I am tracking another bug and will likely just open an issue for it, as I don't have the time to fix ucp bugs.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5703/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8387/ for details (Mellanox internal link).

@hppritcha
Contributor

bot:mellanox:retest

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5704/ for details.

@brminich
Contributor

brminich commented Dec 5, 2018

jenkins env issues

bot:mlx:retest

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8393/ for details (Mellanox internal link).

@hjelmn
Contributor Author

hjelmn commented Dec 5, 2018

Updated to correct the domain_id type in one other struct. Runs cleanly with btl/uct now.

.num_devices = -1,
.initialized = 0,
};
Just as a note here. I removed every extraneous line setting a struct member to 0. C requires the compiler to zero-initialize objects with static storage duration, including any members omitted from a designated initializer, so .ptag = 0 does nothing.
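A small standalone example of that rule (the struct and values are illustrative, not UCX's actual type): members omitted from a designated initializer of an object with static storage duration are guaranteed to be zero.

```c
#include <stdint.h>
#include <stdio.h>

struct example_md {
    int     num_devices;
    int     initialized;
    uint8_t ptag;                 /* never mentioned in the initializer */
};

/* Static storage duration: every member not named below is zero-initialized
 * by the compiler, so an explicit ".ptag = 0" would add nothing. */
static struct example_md md = {
    .num_devices = -1,
};

int main(void)
{
    printf("%d %d %u\n", md.num_devices, md.initialized, md.ptag); /* -1 0 0 */
    return 0;
}
```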

@hjelmn
Contributor Author

hjelmn commented Dec 5, 2018

One more fix. @hppritcha They were not using FMA_SHARED, and since they do not expose the CDM mode as a parameter they have to always set it. Without it the result is predictable:

[1544045407.678697] [nid00043:25120:0]    ugni_device.c:441  UCX  ERROR GNI_CdmAttach failed, Error status: GNI_RC_ERROR_RESOURCE
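For context, the CDM mode flags are what GNI_CdmCreate() receives before GNI_CdmAttach() is called; a rough sketch of always setting GNI_CDM_MODE_FMA_SHARED (the helper and the reduced flag set are assumptions for illustration, not the PR's actual code):

```c
#include "gni_pub.h"   /* Cray uGNI public API */

/* Sketch: create and attach a communication domain with FMA_SHARED set so
 * FMA resources can be shared with other uGNI users such as btl/ugni. */
static gni_return_t example_cdm_open(uint32_t domain_id, uint8_t ptag,
                                     uint32_t cookie, uint32_t device_id,
                                     gni_cdm_handle_t *cdm,
                                     gni_nic_handle_t *nic)
{
    uint32_t     modes = GNI_CDM_MODE_FMA_SHARED; /* | whatever other modes apply */
    uint32_t     local_addr;
    gni_return_t rc;

    rc = GNI_CdmCreate(domain_id, ptag, cookie, modes, cdm);
    if (rc != GNI_RC_SUCCESS) {
        return rc;
    }

    /* Without FMA_SHARED this attach is where GNI_RC_ERROR_RESOURCE shows up
     * once the node's dedicated FMA descriptors run out. */
    return GNI_CdmAttach(*cdm, device_id, &local_addr, nic);
}
```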

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5710/ for details.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5711/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8394/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8395/ for details (Mellanox internal link).

@shamisp
Contributor

shamisp commented Dec 13, 2018

@hjelmn - my understanding is that there are some failures? How do we want to proceed?

@hjelmn
Contributor Author

hjelmn commented Dec 13, 2018

@shamisp The failures are on systems without ugni from what I can see. I can't see why the internal Mellanox stuff is failing but this is working fine on our XC-40.

@@ -389,26 +372,58 @@ ucs_status_t uct_ugni_iface_get_dev_address(uct_iface_t *tl_iface, uct_device_ad
return UCS_OK;
}

static int uct_ugni_next_power_of_two_inclusive (int value)
@yosefe should it go to ucs as a utility?

up to @hjelmn
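For reference, a helper with that name typically rounds its argument up to the nearest power of two, returning the value unchanged when it already is one; a minimal sketch, not necessarily this PR's implementation:

```c
/* Round value up to the next power of two, inclusive: 5 -> 8, 8 -> 8, 1 -> 1. */
static int example_next_power_of_two_inclusive(int value)
{
    int result = 1;

    while (result < value) {
        result <<= 1;
    }

    return result;
}
```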

@brminich
Contributor

failures should be fixed by #3083

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8414/ for details (Mellanox internal link).

@brminich
Contributor

bot:mlx:retest

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8415/ for details (Mellanox internal link).

@shamisp shamisp merged commit 23163de into openucx:master Dec 16, 2018