Cleanup grpcomm race conditions #1335
Conversation
@rhc54 Unfortunately, no. I'm getting a segv in the orted with a fairly consistent backtrace:
It looks like all the pointers passed to memcpy are bogus:
Also, can you put "Fixes #1215" in the final squashed commit message? Thanks!
once I know it actually does... 😃
@jsquyres Can you try this again and send me the debug output? I'm trying to figure out how you got into that situation as it isn't obvious.
@rhc54 Output is here: https://gist.github.com/jsquyres/3ff0b97966144ac4ffe8
On 2/1/16, 2:32 PM, "rhc54" [email protected] wrote:
@rhc54 - I saw that you are using the same recv buffer to send to self
Technically, the OBJ_RETAIN should be adequate to protect us when recirculating the buffer. I've copied the buffer now, so let's see if this works - if it does, then we know we have a problem in the OOB.
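For context, here is a minimal sketch of the two strategies being contrasted (retaining the buffer vs. copying it before recirculating). `send_to_self()` is a hypothetical placeholder, not an actual OMPI function; the OPAL buffer calls are used as I understand them:

```c
/* Sketch only - contrasting the two buffer-handling strategies.
 * send_to_self() is a hypothetical stand-in for the self-send path. */
#include "opal/dss/dss.h"

void send_to_self(opal_buffer_t *buf);  /* hypothetical placeholder */

void forward_with_retain(opal_buffer_t *buf)
{
    /* Option 1: bump the refcount so the buffer stays alive while it
     * recirculates - the completion path releases the extra reference.
     * Safe only if nothing unpacks/modifies the buffer in flight. */
    OBJ_RETAIN(buf);
    send_to_self(buf);
}

void forward_with_copy(opal_buffer_t *buf)
{
    /* Option 2: hand the self-send its own copy, so concurrent reads
     * of the original cannot see partially-consumed data */
    opal_buffer_t *copy = OBJ_NEW(opal_buffer_t);
    opal_dss.copy_payload(copy, buf);
    send_to_self(copy);  /* ownership passes to the send path */
}
```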
Now it just loops forever, endlessly spitting out those "LOOPING BRKS" debug messages.
@jsquyres Can you give this one a try? I cannot figure out why those daemons aren't getting the launch msg, and so this adds some delay in that loop to provide an opportunity for the event lib to progress the rest of the messages. I also added a failsafe so we only loop a few times and then abort.
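The shape of that fix, as described, is roughly the following. This is a sketch under stated assumptions: the loop bound and delay are invented numbers, and `launch_msg_arrived()` is a hypothetical predicate, not a real ORTE call:

```c
#include <unistd.h>
#include "opal/runtime/opal_progress.h"

#define MAX_LOOPS 10   /* invented failsafe bound */

int launch_msg_arrived(void);  /* hypothetical predicate */

int wait_for_launch_msg(void)
{
    for (int i = 0; i < MAX_LOOPS; i++) {
        if (launch_msg_arrived()) {
            return 0;
        }
        opal_progress();   /* let the event lib deliver pending messages */
        usleep(1000);      /* small delay before re-checking */
    }
    return -1;   /* failsafe tripped - caller aborts rather than spin forever */
}
```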
@rhc54 I tried the latest. It worked once (i.e., launched and ran the MPI job successfully), but the next 20-30 runs all failed with lots of output like this:
(the node displayed in the show-help message varied with each run)
@rhc54 I will try to find some time to debug brks at scale here. Probably will be sometime next week.
@hjelmn From what we have seen, it looks to me like Jeff is simply losing messages somewhere. The root cause of the problem is that downstream daemons never receive the launch message, and thus are never able to locate the job object since it doesn't get created. So I'm not sure if there is a bug in brks or not. I'm unable to replicate even on a slow ssh machine, so this may be something specific to Jeff's machine. I'm going to restore the direct component's rollup method (currently what we use in 1.10 and prior series) anyway. It may actually be faster than the other methods, but at least we know it worked.
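For readers unfamiliar with the rollup pattern mentioned here: each daemon collects the contributions of its children in the routing tree before reporting to its own parent, so traffic aggregates up the tree instead of every daemon messaging the root directly. A minimal sketch, with the tracker layout and `send_to_parent()` invented for illustration (this is not the actual grpcomm/direct code):

```c
/* Illustrative sketch of a rollup collective. */
#include "opal/dss/dss.h"

void send_to_parent(opal_buffer_t *buf);  /* hypothetical placeholder */

typedef struct {
    int reported;              /* children that have checked in so far */
    int expected;              /* number of children in the routing tree */
    opal_buffer_t *aggregate;  /* combined payload for this subtree */
} rollup_tracker_t;

static void child_contribution_recvd(rollup_tracker_t *trk,
                                     opal_buffer_t *contribution)
{
    /* fold the child's data into our running aggregate */
    opal_dss.copy_payload(trk->aggregate, contribution);

    if (++trk->reported == trk->expected) {
        /* subtree complete: forward one combined message upward, so
         * each parent receives one message per child subtree rather
         * than one per descendant daemon */
        send_to_parent(trk->aggregate);
    }
}
```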
@jsquyres Okay, I've cleaned this up - please give it a smoke test and commit if all is well. |
```diff
@@ -55,7 +55,7 @@ static int direct_register(void)
     /* make the priority adjustable so users can select
      * direct for use by apps without affecting daemons
      */
-    my_priority = 1;
+    my_priority = 100;
```
Do you want to leave some wiggle room here? I.e., if you set this to 100, then it's not possible to override it -- even the user on the command line can't change priorities to something that would guarantee that direct wouldn't be chosen.
wellll...as we both know, that isn't technically true. we set the priority to much higher values in a number of places. i picked this one because rcd is at 80, but now that we have opal_ignore'd it, i can turn down the heat
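As background for this exchange: MCA frameworks generally select the component reporting the highest priority, which is why pinning direct at 100 (with rcd at 80) effectively hard-wires the choice unless something else reports higher. A simplified sketch of that selection logic, with all types and names invented for illustration (not the actual MCA code):

```c
/* Illustrative only - highest reported priority wins. */
#include <stddef.h>

typedef struct {
    const char *name;
    int (*query)(int *priority);  /* component reports its priority */
} component_t;

const component_t *select_component(const component_t *comps, int n)
{
    const component_t *best = NULL;
    int best_pri = -1;

    for (int i = 0; i < n; i++) {
        int pri;
        if (0 == comps[i].query(&pri) && pri > best_pri) {
            best_pri = pri;   /* highest priority wins */
            best = &comps[i];
        }
    }
    return best;
}
```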
@jsquyres okay, quibbles addressed 😄
Tested and worked flawlessly in 20 out of 20 runs of 1000+ processes across 64 servers. One last quibble (which you can choose to do or not 😄 ), add "Fixes #1215" to the commit message. Otherwise, 👍
…be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them. Adjust to Jeff's quibbles. Fixes open-mpi/ompi#1215
grumble...grumble.... 😩
Revive the coll/sync component, adding a test to show it. Clean some permissions.
@jsquyres I believe this may clean up the problem you've been seeing at scale. Please give it a try and let me know what you see.