grpcomm errors when launching on RHEL 7.2/ssh #1215
Out of curiosity - how many hosts are in that hostfile?
The remainder of the file is similar.
On 12/16/15, 9:20 PM, "rhc54" [email protected] wrote:
I can have the fix ready next week. I guess this is not a blocking issue, [...]
@annu13 That will be fine - it isn't a blocker. Just wanted to give folks some idea of when the fix might become available. Thanks for tackling it!
Perfect, thanks.
@rhc54 @annu13 I'm sorry; I'm finally getting around to testing this, and it looks like #1254 (which looks like it was closed and re-submitted/merged as #1255) did not fix the issue when running MPI jobs. Note that running non-MPI jobs, like [...]. Here's what I'm seeing with the current master head (70787d1):
# ring_c is the simple ring program from the examples/ dir
$ mpirun -np 400 --hostfile hosts --mca btl tcp,vader,self ring_c
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 304
*** Error in `orted': double free or corruption (out): 0x0000000000dc9c60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
/home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
orted[0x400906]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
orted[0x4007b9]
[pacini075:13606] *** Process received signal ***
[pacini075:13606] Signal: Aborted (6)
[pacini075:13606] Signal code: (-6)
[pacini075:13606] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7fb4d0ce4100]
[pacini075:13606] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7fb4d09495f7]
[pacini075:13606] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7fb4d094ace8]
[pacini075:13606] [ 3] /usr/lib64/libc.so.6(+0x75317)[0x7fb4d0989317]
[pacini075:13606] [ 4] /usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
[pacini075:13606] [ 5] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
[pacini075:13606] [ 6] /home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
[pacini075:13606] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
[pacini075:13606] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
[pacini075:13606] [ 9] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
[pacini075:13606] [10] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
[pacini075:13606] [11] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
[pacini075:13606] [12] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
[pacini075:13606] [13] orted[0x400906]
[pacini075:13606] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
[pacini075:13606] [15] orted[0x4007b9]
[pacini075:13606] *** End of error message ***
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:
hostname: pacini075
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
I configured Open MPI fairly simply:
$ ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-oshmem --disable-mpi-fortran
(but this run didn't even use the usnic stuff -- just plain/vanilla TCP, to ensure that usnic isn't causing the failure)
Some notes I really should have included above: [...]
@hppritcha I'm marking this a blocker for v2.0.0 because it seems like we have an important race condition at scale that needs to be solved before release (i.e., it's happening on the v2.x branch (open-mpi/ompi-release@dea4f34) as well as master (70787d1)).
I know where the problem lies and will fix it this week.
Thanks!
On 1/23/16, 7:23 AM, "rhc54" [email protected] wrote:
@rhc54 are you planning to put a fix to avoid the race condition or to [...]
@annu13 This isn't an error - it's a race condition that didn't get fixed in the last PR. So I plan to fix the race condition. As I described back in the original issue, it is possible to receive a grpcomm buffer from another daemon prior to having processed the launch message. So we need to recycle the incoming message until the launch message has been processed, so we know how to construct the collective signature.
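For readers not steeped in the ORTE internals, here is a minimal sketch of the "recycle until ready" pattern described above, written against plain libevent (which Open MPI embeds as opal_libevent2022). Every name here (pending_msg_t, launch_msg_processed, recycle) is a hypothetical illustration of the idea, not the actual ORTE code or its symbols:

```c
#include <event2/event.h>
#include <stdbool.h>

typedef struct {
    struct event_base *base;   /* the daemon's event loop */
    void *buffer;              /* grpcomm payload we cannot handle yet */
} pending_msg_t;

/* Flips to true once the launch message has been processed. */
static bool launch_msg_processed = false;

static void handle_grpcomm_msg(evutil_socket_t fd, short flags, void *cbdata);

/* Push the message back into the event loop via a short one-shot
 * timer, giving the loop a chance to process the launch message. */
static void recycle(pending_msg_t *msg)
{
    struct timeval delay = { 0, 1000 };  /* retry in 1ms */
    event_base_once(msg->base, -1, EV_TIMEOUT, handle_grpcomm_msg, msg, &delay);
}

static void handle_grpcomm_msg(evutil_socket_t fd, short flags, void *cbdata)
{
    pending_msg_t *msg = (pending_msg_t *)cbdata;
    (void)fd;
    (void)flags;

    if (!launch_msg_processed) {
        /* Too early: without the launch message we cannot build the
         * collective signature, so recycle instead of erroring out. */
        recycle(msg);
        return;
    }

    /* ... now safe to construct the collective signature and
     * process msg->buffer ... */
}
```

The key design point is that the early-arriving message is never dropped or treated as an error; it simply waits in the event loop until the state it depends on exists.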
On 1/23/16, 11:14 AM, "rhc54" [email protected] wrote:
Ahhh... I got it now. Initially I thought that was the case and started [...]
Heh -- due to a typo in the commit message, this issue didn't auto-close when 68912d0 was committed. Just to be clear: this issue is now fixed on master. PR to v2.x coming shortly.
I'm seeing odd behavior when trying to launch small MPI jobs on master (as of Sun 13 Dec 2015, after @rhc54's update to pmix 1.1.2).
Here are the specs:
./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-vt --disable-mpi-fortran
Here's what I'm launching: [...]
The hostfile contains a bunch of lines like this:
hostname slots=16
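For concreteness, a hostfile of this shape might look like the following (hostnames invented for illustration; each line names a node and the number of process slots mpirun may fill on it):

```
node001 slots=16
node002 slots=16
node003 slots=16
```

As an aside on scale (my arithmetic, not something stated in the thread): with slots=16 on every line, a -np 400 run fills exactly 400 / 16 = 25 hosts before mpirun would have to oversubscribe, which is presumably why the number of hosts in the file came up earlier in the thread.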
Sometimes that runs fine; sometimes it results in the following: [...]
FWIW, I observed this same behavior this past Thursday (i.e., before the pmix 1.1.2 update), but didn't have the time to file a proper bug report. This suggests that the problem might be unrelated to the old-vs.-new PMIX...?
Here's a gist of a failed run, but with lots of verbosity, in case it helps. Here's the command line used to launch that run: [...]