Update PMIx and integration glue #3696
Conversation
@rhc54 I opened rhc54/ompi#5 to fix a runtime failure |
THANK YOU!! I was just going to start chasing it 😄 |
@ggouaillardet I had to rebase this to get rid of that blasted "merge" commit as it was "unsigned". I left the commit messages intact. |
@bwbarrett Is there any way to tell if this had something to do with this PR? It's a failure when |
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/e86ac666b05846eaaa421e1e32a78587 |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/7a572bb1fbfed0ad825eb0975098a38c |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/f0e7a74b468a756b1d703c15375fa46b |
@rhc54 did you do something else besides the rebase? I dug a bit and found that this could be caused by a recursive lock: the global lock is initially taken at
and then it is taken again at
Note that the second lock hangs because
if it is declared as a |
@rhc54 I found some differences. If I compare my branch to the existing one, the main difference is that
Note that you squashed all the commits (e.g. the PMIx refresh from the upstream repo and the ompi glue) into one |
Yes, there is a Coverity fix in there as well, and I didn't get all the changes correct. I'll continue working on it. As I said, I had to rebase to get rid of the "unsigned" merge commit. |
bot:ompi:retest |
I just ran a hello world loop test (-np 2 on a single machine) and saw a segv - I'm investigating now.
--------------------------------------------
-------------------- Iteration 30 of 2000
[c656f6n03:86064] *** Process received signal ***
[c656f6n03:86064] Signal: Segmentation fault (11)
[c656f6n03:86064] Signal code: Address not mapped (1)
[c656f6n03:86064] Failing at address: 0x90
[c656f6n03:86064] [ 0] [0x3fffb21b0478]
[c656f6n03:86064] [ 1] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(pmix_bfrop_pack+0xec)[0x3fffb0cf94f4]
[c656f6n03:86064] [ 2] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x51d70)[0x3fffb0c51d70]
[c656f6n03:86064] [ 3] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x52480)[0x3fffb0c52480]
[c656f6n03:86064] [ 4] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x55ce4)[0x3fffb0c55ce4]
[c656f6n03:86064] [ 5] /tmp/ompi/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xcd0)[0x3fffb1b1e290]
[c656f6n03:86064] [ 6] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x1163ec)[0x3fffb0d163ec]
[c656f6n03:86064] [ 7] /lib64/libpthread.so.0(+0x8728)[0x3fffb1f28728]
[c656f6n03:86064] [ 8] /lib64/libc.so.6(clone+0x98)[0x3fffb1e5d210]
[c656f6n03:86064] *** End of error message ***
./loop-exec.sh: line 31: 86058 Killed $cmd
Loop 30 of 2000. rc=137 r= s=1497371445 e=1497371507 d=62
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX FAILURE
It looks like a race or corruption of the
Program terminated with signal 11, Segmentation fault.
#0 0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>,
cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
80 return pr->compat.ptl->send_recv(peer, bfr, cbfunc, cbdata);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.ppc64le
(gdb) bt
#0 0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>,
cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
#1 0x00003fffb0c51d70 in _send_to_server (rcd=0x3fffa4000a20) at event/pmix_event_registration.c:201
#2 0x00003fffb0c52480 in _add_hdlr (cd=0x1001a26b100, xfer=0x3fffb0b6e5b0) at event/pmix_event_registration.c:290
#3 0x00003fffb0c55ce4 in reg_event_hdlr (sd=-1, args=4, cbdata=0x1001a26b100) at event/pmix_event_registration.c:560
#4 0x00003fffb1b1e290 in event_process_active_single_queue (activeq=0x1001a263270, base=0x1001a262dd0) at event.c:1370
#5 event_process_active (base=<optimized out>) at event.c:1440
#6 opal_libevent2022_event_base_loop (base=0x1001a262dd0, flags=<optimized out>) at event.c:1644
#7 0x00003fffb0d163ec in progress_engine (obj=0x1001a262d88) at runtime/pmix_progress_threads.c:109
#8 0x00003fffb1f28728 in start_thread () from /lib64/libpthread.so.0
#9 0x00003fffb1e5d210 in clone () from /lib64/libc.so.6
(gdb) p pr->compat
$1 = {
psec = 0x3fffb0bc0120 <pmix_native_module>,
ptl = 0x80
}
(gdb) p pmix_client_globals.myserver.compat
$2 = {
psec = 0x3fffb0bc0120 <pmix_native_module>,
ptl = 0x80
}
A little context on the three threads from this core file:
|
You might try setting PMIX_MCA_ptl=tcp in your environment to see if that makes any difference. It will eliminate consideration of the usock component, which should be lower priority and ignored anyway. This would just remove one more possibility. |
Just tried that and got the same result (although it took longer to trigger). |
Progress update. I've so far isolated the corruption to be around this section of code (or around this time) in
if I check the value of
PMIX_RELEASE_THREAD(&pmix_global_lock);
/* look for a debugger attach key */
(void)strncpy(wildcard.nspace, pmix_globals.myid.nspace, PMIX_MAX_NSLEN);
wildcard.rank = PMIX_RANK_WILDCARD;
PMIX_INFO_LOAD(&ginfo, PMIX_IMMEDIATE, NULL, PMIX_BOOL);
if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_DEBUG_STOP_IN_INIT, &ginfo, 1, &val)) {
PMIX_VALUE_FREE(val, 1); // cleanup memory
/* if the value was found, then we need to wait for debugger attach here */
/* register for the debugger release notification */
PMIX_CONSTRUCT_LOCK(®lock);
PMIx_Register_event_handler(&code, 1, NULL, 0,
notification_fn, NULL, (void*)®lock);
/* wait for it to arrive */
PMIX_WAIT_THREAD(®lock);
PMIX_DESTRUCT_LOCK(®lock);
}
PMIX_INFO_DESTRUCT(&ginfo);
I do not necessarily think that this section of code is the problem. However, I do think that after the release of the
I have a number of things to look into yet today to isolate further, but I just wanted to post an update since I know a number of folks are watching this PR. |
opal/mca/pmix/pmix2x/pmix2x_client.c
Outdated
pt->valcbfunc = cbfunc;
pt->cbdata = cbdata;
pt->lock.active = false;
OPAL_PMIX2X_THREADSHIFT(pt, _fence); |
Should this be _get instead of _fence?
Yeah, I found that too - but it turns out that we don't exercise that code path. |
opal/mca/pmix/pmix2x/pmix2x_client.c
Outdated
OPAL_PMIX_WAIT_THREAD(&pt.lock);
rc = pt.status;
*val = pt.val;
OBJ_DESTRUCT(&pt); |
I've seen a crash (just one in many, many runs) where, at the very bottom of _get, we call:
if (!pt->lock.active) {
OBJ_RELEASE(pt);
}
at the exact moment that OBJ_DESTRUCT(&pt) runs here, so both destructors become active at the same time, leading to a crash in free() as two threads fight over which one gets to free the memory. Should we move these OBJ_CONSTRUCT/OBJ_DESTRUCT calls to OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE?
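For illustration, here is a minimal, generic C sketch of the OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE style of lifetime management being proposed: the caddy lives on the heap with a reference count, each thread holds its own reference, and whichever thread drops the last reference performs the single free. The type and helper names below (caddy_t, caddy_retain, and so on) are hypothetical stand-ins, not the actual OPAL object system.
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* hypothetical stand-in for the thread-shift caddy shared between
 * the caller and the callback thread */
typedef struct {
    atomic_int refcount;
    int        status;
} caddy_t;

static caddy_t *caddy_new(void) {
    caddy_t *c = calloc(1, sizeof(*c));
    atomic_init(&c->refcount, 1);      /* creator holds the first reference */
    return c;
}

static void caddy_retain(caddy_t *c) {
    atomic_fetch_add(&c->refcount, 1);
}

static void caddy_release(caddy_t *c) {
    /* only the thread that drops the last reference frees the memory */
    if (1 == atomic_fetch_sub(&c->refcount, 1)) {
        free(c);
    }
}

/* callback thread: does its work, then drops its reference */
static void *callback_thread(void *cbdata) {
    caddy_t *c = (caddy_t *)cbdata;
    c->status = 0;
    caddy_release(c);
    return NULL;
}

int main(void) {
    caddy_t *c = caddy_new();
    caddy_retain(c);                   /* extra reference handed to the callback */

    pthread_t tid;
    pthread_create(&tid, NULL, callback_thread, c);

    caddy_release(c);                  /* caller drops its own reference when done */
    pthread_join(tid, NULL);
    /* no double free: whichever release ran last performed the single free() */
    return 0;
}
With stack-allocated OBJ_CONSTRUCT/OBJ_DESTRUCT there is no such count, so the two "destroy" paths can collide exactly as described above.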
just about to push that up now 😄
@ggouaillardet Could you perhaps give us a hand? @jjhursey and I have been banging our heads on the memory corruption problem and are having difficulty tracking it down - we could use some of your "master detective" skills 😄 @jjhursey provided a very simple script for triggering it:
#!/bin/bash
loops=2000
#loops=2
TIMEOUT="timeout --preserve-status -k 32 30"
TIMEOUT=""
#timeout --preserve-status -k 32 30 ./opal_fifo
MCA_PARAMS="-host rhc001:24 -mca coll ^hcoll -mca pml ob1 -mca btl tcp,self -mca oob tcp --tag-output"
VALGRIND_OPTS=""
#VALGRIND_OPTS="valgrind --track-origins=yes"
cmd="${TIMEOUT} mpirun -np 2 ${MCA_PARAMS} ${VALGRIND_OPTS} ./hello_c"
#export PMIX_DEBUG=2
export PMIX_MCA_ptl=tcp
i=1
while [ $i -le $loops ]; do
echo "--------------------------------------------"
echo "-------------------- Iteration $i of $loops"
starttime=`date +%s`
$cmd
rc=$?
endtime=`date +%s`
echo "Loop $i of $loops. rc=$rc r=$ranks s=$starttime e=$endtime d=$((endtime - starttime))"
if [ $rc != 0 ] ; then
if [ $rc == 124 ] ; then
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX TIMEOUT - skip"
else
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX FAILURE"
fi
exit -1
fi
i=$((i + 1))
done
exit 0
It takes me a few hundred iterations for it to "explode", but that varies by machine. We run it on a single node. It fails equally well on ppc, linux/x86, and Mac - equal opportunity! The failure occurs in the messaging system, but that may be a symptom and not the root cause location. Any help would be greatly appreciated! |
Per Ralph's previous comment - the
What I have found so far is that inside
if (PMIX_SUCCESS != (rc = pmix_ptl.send_recv(pmix_client_globals.myserver, req, job_data, (void*)&cb))){
if I add code to check the value of
// check pmix_client_globals.myserver->compat.ptl --- all is good
if( (intptr_t)(pmix_client_globals.myserver->compat.ptl) < 0x1000 ) {
printf("BROKEN!!! ptr=%15p HERE 1\n", pmix_client_globals.myserver->compat.ptl);
}
usleep(100);
// check pmix_client_globals.myserver->compat.ptl --- Corrupted
if( (intptr_t)(pmix_client_globals.myserver->compat.ptl) < 0x1000 ) {
printf("BROKEN!!! ptr=%15p HERE 2\n", pmix_client_globals.myserver->compat.ptl);
}
So this is telling me that the other thread is activated to do something (trying to pin that down now) which is causing the corruption. My current theory is that the |
Hmmm... hold on a minute. Take a gander at the recv_handler function. We allocate space for the message:
/* allocate the data region */
peer->recv_msg->data = (char*)malloc(peer->recv_msg->hdr.nbytes);
memset(peer->recv_msg->data, 0, peer->recv_msg->hdr.nbytes);
/* point to it */
peer->recv_msg->rdptr = peer->recv_msg->data;
peer->recv_msg->rdbytes = peer->recv_msg->hdr.nbytes;
but then look down below where we read it:
/* continue to read the data block - we start from
* wherever we left off, which could be at the
* beginning or somewhere in the message
*/
if (PMIX_SUCCESS == (rc = read_bytes(peer->sd, &msg->rdptr, &msg->rdbytes))) {
/* we recvd all of the message */
pmix_output_verbose(2, pmix_globals.debug_output,
"RECVD COMPLETE MESSAGE FROM SERVER OF %d BYTES FOR TAG %d ON PEER SOCKET %d",
(int)peer->recv_msg->hdr.nbytes,
peer->recv_msg->hdr.tag, peer->sd);
/* post it for delivery */
PMIX_ACTIVATE_POST_MSG(peer->recv_msg);
peer->recv_msg = NULL;
return;
} else if (PMIX_ERR_RESOURCE_BUSY == rc ||
PMIX_ERR_WOULD_BLOCK == rc) {
/* exit this event and let the event lib progress */
return;
} else {
/* the remote peer closed the connection - report that condition
* and let the caller know
*/
pmix_output_verbose(2, pmix_globals.debug_output,
"ptl:base:msg_recv: peer closed connection");
goto err_close;
} In other words, we forgot to point the msg->rdptr at the allocated space! Let me try and see if that fixes it. |
Nurts - I see I was wrong as msg = peer->recv_msg, which was set. Back to the drawing board... |
But I wonder: isn't the assert a result of the race condition here? The scenario we have is:
So if B executes the callback before A can get into wait_thread, then the assert is going to hit, yes? I'm wondering if we need to lock the mutex before we call the function to ensure that we are ready for the callback. My head hurts... |
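For reference, here is a minimal pthread sketch of the lost-wakeup race being described (plain pthreads with hypothetical names, not the actual PMIX lock macros): thread A must hold the mutex and check a completion flag before waiting, so that a callback from thread B that fires early is still observed rather than lost.
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* simplified stand-in for the PMIX lock: mutex + condition + flag */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            active;   /* true while we are still waiting */
} simple_lock_t;

static simple_lock_t reglock = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, true
};

/* callback thread (B): may run before the caller ever reaches the wait */
static void *notification_fn(void *arg) {
    simple_lock_t *lk = (simple_lock_t *)arg;
    pthread_mutex_lock(&lk->mutex);
    lk->active = false;                 /* record completion under the mutex */
    pthread_cond_signal(&lk->cond);     /* harmless if nobody is waiting yet */
    pthread_mutex_unlock(&lk->mutex);
    return NULL;
}

int main(void) {
    pthread_t cb;
    pthread_create(&cb, NULL, notification_fn, &reglock);

    /* caller thread (A): take the mutex *before* checking/waiting, so a
     * callback that already fired is seen via the flag instead of relying
     * on a condition signal that nobody was around to hear */
    pthread_mutex_lock(&reglock.mutex);
    while (reglock.active) {
        pthread_cond_wait(&reglock.cond, &reglock.mutex);
    }
    pthread_mutex_unlock(&reglock.mutex);

    pthread_join(cb, NULL);
    printf("released\n");
    return 0;
}
If the waiter instead checked the flag without the mutex and only then started waiting, B could flip the flag and signal in the gap, and A would block forever (or trip an assert, as in the scenario above).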
I did notice you are using trylock in an assert, and this is in general troublesome. |
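As a small, generic C sketch of why trylock inside assert() is troublesome (not the actual PMIx code): assert() is compiled out when NDEBUG is defined, so any side effect inside it, including actually taking the lock, silently disappears in optimized builds; doing the call first and asserting on the saved result avoids that.
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void risky(void) {
    /* BAD: in an optimized build (-DNDEBUG) the assert is compiled out,
     * so the trylock never executes, the lock is never acquired, and the
     * unlock below then operates on a mutex we do not hold */
    assert(0 == pthread_mutex_trylock(&lock));
    /* ... critical section ... */
    pthread_mutex_unlock(&lock);
}

void safer(void) {
    /* better: perform the call unconditionally, then assert on (or handle)
     * the result, so behavior is identical with and without NDEBUG;
     * note pthread_mutex_trylock() returns 0 on success */
    int rc = pthread_mutex_trylock(&lock);
    assert(0 == rc);
    (void)rc;                 /* silence unused-variable warning under NDEBUG */
    /* ... critical section ... */
    pthread_mutex_unlock(&lock);
}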
@bosilca I think we could use some advice here - does it otherwise look correct? If so, the assert can be removed. I just want to ensure that we are pursuing this correctly. |
assert(0==trylock)
Assert(0 != trylock) |
👍 on @ggouaillardet suggestion. |
@ggouaillardet @jjhursey Please give this a try on your systems - it works fine on mine! |
Looks like the PRBC died for an unrelated issue - somehow lost connection. bot:ompi:retest |
Just ran this with an optimized build, and ran clean thru the test again! So we might actually have this fixed now. Will wait to hear from @jjhursey as their system was the more sensitive one. |
Much sadness - I reran it again as a debug build, and this time it failed:
|
I'm trying to run the test thru valgrind, but so far no luck - I'm on iteration 1673 and it hasn't failed. I guess valgrind is slowing things down enough that the corruption doesn't occur. Would it make sense to try things like electric fence? Anyone have a suggestion? |
I noted that neither
nor |
These should be fixed here: openpmix/openpmix#401
After pondering some more, I'm convinced that we should move this to the PMIx area where we can get more control over the test program. We are trying to "peer" thru the lens of OMPI's restrictions (e.g., we must return certain data from get, must register event handlers, etc.). I'm going to modify @jjhursey's program to run a test over there and see if I can replicate the problem. |
Please see openpmix/openpmix#403 as we track the PMIx-only tests on this issue. |
I've done some more digging, and with the latest PMIx updates I am now seeing a segfault in the OPAL pmix layer. It stems from a thread-shift caddy being overwritten by an opal_value_t. The top-level call that starts the problem is in orte/mca/ess/pmi/ess_pmi_module.c, line 330:
The "val" is a pointer to an opal_value_t - the macro will return the pointer to an allocated opal_value_t in the val address. If we look at what all the threads in the app proc are doing:
we see that thread 3, the main thread, is waiting in the PMIx_Get inside the macro, and the PMIx progress thread is quiescent. These are both correct as the value is located in the PMIx dstore. The returned value has been thread-shifted out of the PMIx thread and into the OPAL progress thread so it can access the OPAL pmix framework-level data (e.g., to convert the nspace to a local jobid). The return void* cbdata is expected to be the pmix2x_threadshift_t object, which is to be released at the end of the callback. However, as the bt shows, the object incorrectly has an opal_list_item_t class in its "super" location instead of the expected opal_object_t. Printing out the threadshift object shows:
So what has happened here is that the "val" originally passed in the macro has been overlaid on the pmix2x_threadshift_t object. My best guess is that the cbdata pointers are getting confused as we weave our way thru all these thread-shifts. |
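As a generic illustration of that failure mode (a contrived C sketch, not OMPI code; the struct and function names are hypothetical): when several callbacks share a void* cbdata convention, handing the wrong object to a callback silently overlays one type on another, which is the kind of corruption the backtrace shows.
#include <stdio.h>
#include <string.h>

/* two unrelated caddy types that both travel through void* cbdata */
typedef struct {
    int   type_tag;      /* 1 */
    void *value_ptr;
} get_caddy_t;

typedef struct {
    int   type_tag;      /* 2 */
    char  nspace[32];
    int   rank;
} event_caddy_t;

/* callback that assumes its cbdata is a get_caddy_t */
static void get_callback(void *cbdata) {
    get_caddy_t *gc = (get_caddy_t *)cbdata;
    /* if the caller actually passed an event_caddy_t, on a typical 64-bit
     * ABI this pointer-sized write lands inside nspace[]; the type pun is
     * undefined behavior and is shown only to illustrate the corruption */
    gc->value_ptr = NULL;
}

int main(void) {
    event_caddy_t ec = { .type_tag = 2, .rank = 3 };
    strncpy(ec.nspace, "ompi-job-1", sizeof(ec.nspace) - 1);

    get_callback(&ec);   /* wrong cbdata handed to the wrong callback */

    printf("nspace is now '%s' (was 'ompi-job-1')\n", ec.nspace);
    return 0;
}
The damage only becomes visible later, when some other code reads the overlaid fields, which matches the "opal_value_t overlaid on the pmix2x_threadshift_t" symptom described above.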
@jjhursey @ggouaillardet I believe we now have this fixed! See what you think |
Rats - spoke too soon. Issues in comm_spawn, working them now. |
Appear to have the event handler problems resolved - still checking on some of the dynamics. Got comm_spawn fixed at least. |
FWIW: Mellanox is aware of the problem - there is an issue with GitHub rejecting their attempt to set the test result to "passed". It is being investigated |
…d support in the opal/pmix framework to protect the framework-level structures.

This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:
* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <[email protected]>
I'm running the branch now. Looking good so far. I'm letting it run a bit longer over lunch. I'll let you know when it's done. |
MTT results:
|
This passed clean for me at 10K iterations on powerpc. I think this is good to go. Thanks @rhc54 ! |
very good - thanks! |
@rhc54 Let me know when the PR to |
No description provided.