Ensure we properly commit suicide if/when we lose connection to the daemon. #647

Merged: 1 commit, Jun 18, 2015

@rhc54 rhc54 commented Jun 18, 2015

A lost daemon can be reported through multiple paths, so a race condition exists in the pmix support. Our MPI layer wants to decide how to respond to the failure, so it calls down to the RTE with any abort request. The request reaches the pmix layer as a "pmix_abort" command, which requires communicating the request to the daemon, which is gone. The pmix component may not know that yet, so it keeps waiting on the communication and we hang.

This change adds a brief timer event that kicks us out of the communication if the daemon does not respond. The precise wait time is still to be determined; it is set to something short for now and can be adjusted later.
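To illustrate the idea of bounding the wait, here is a minimal sketch in C. It is not the code from this PR: the actual change uses a timer event inside the pmix component, while this sketch substitutes a plain `poll()` timeout on a socket. The names `send_abort_with_timeout`, `demo_silent_peer`, and `ABORT_ACK_TIMEOUT_MS`, the one-byte request/ack exchange, and the 100 ms value are all hypothetical and chosen only for illustration.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical value: the PR leaves the exact wait time to be tuned. */
#define ABORT_ACK_TIMEOUT_MS 100

/*
 * Hypothetical helper (not an Open MPI or PMIx API).
 * Sends a one-byte abort request on fd and waits at most timeout_ms
 * for a one-byte acknowledgement.
 * Returns 0 if an ack arrived, -1 on timeout or I/O error.
 */
static int send_abort_with_timeout(int fd, int timeout_ms)
{
    const char req = 'A';
    if (write(fd, &req, 1) != 1) {
        return -1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0) {
        /* rc == 0: nobody answered in time; rc < 0: poll failed. */
        return -1;
    }

    char ack;
    /* read() returns 0 if the peer closed the connection. */
    return (read(fd, &ack, 1) == 1) ? 0 : -1;
}

/*
 * Demo: a connected peer that never answers stands in for a daemon
 * whose loss has not been noticed yet. Returns the helper's result,
 * or -2 if the socket pair could not be created.
 */
static int demo_silent_peer(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -2;
    }
    int rc = send_abort_with_timeout(sv[0], ABORT_ACK_TIMEOUT_MS);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Without the timeout, the wait for the acknowledgement would block indefinitely, which is the hang described above. With it, the caller gets control back after a bounded delay and can terminate locally instead of waiting on a daemon that will never reply.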

@rhc54 rhc54 added this to the Open MPI v2.0.0 milestone Jun 18, 2015
@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/job/gh-ompi-master-pr/638/

rhc54 pushed a commit that referenced this pull request Jun 18, 2015
Ensure we properly commit suicide if/when we lose connection to the daemon.
@rhc54 rhc54 merged commit 5d38283 into open-mpi:master Jun 18, 2015
@rhc54 rhc54 deleted the topic/hangs branch June 23, 2015 18:33
jsquyres added a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
HCOLL: fix hang in hcoll barrier called from finalize for MXM/yalla