
grpcomm errors when launching on RHEL 7.2/ssh #1215


Closed
jsquyres opened this issue Dec 13, 2015 · 16 comments

@jsquyres
Member

I'm seeing odd behavior when trying to launch small MPI jobs on master (as of Sun 13 Dec 2015, after @rhc54's update to pmix 1.1.2).

Here are the specs:

  • RHEL 7.2
  • TCP BTL
  • ssh launcher (no SLURM or any other scheduler)
  • (mostly) Default master build: ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-vt --disable-mpi-fortran
    • Yes, I built with libfabric/usnic, but I'm intentionally testing with the TCP BTL to rule out a problem in the usnic BTL -- and I'm seeing the same behavior regardless of BTL selection

Here's what I'm launching:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c

The hostfile contains a bunch of lines like this: hostname slots=16

Sometimes that runs fine, sometimes it results in the following:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254  
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 241  
malloc debug: Request for 4 zeroed elements of size -1 failed (grpcomm_brks.c, 92)
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 170

FWIW, I observed this same behavior this past Thursday (i.e., before the pmix 1.1.2 update), but didn't have the time to file a proper bug report. This suggests that the problem might be unrelated to the old-vs.-new PMIX...?

Here's a gist of a failed run, but with lots of verbosity, in case it helps. Here's the command line used to launch that run:

$ mpirun \
    --mca ess_base_verbose 100 \
    --mca grpcomm_base_verbose 100 \
    --mca pmix_base_verbose 100 \
    --mca pml ob1 \
    --mca btl tcp,vader,self \
    --hostfile hosts \
    -np 40 \
    ring_c
@jsquyres jsquyres added the bug label Dec 13, 2015
@jsquyres jsquyres added this to the v2.0.0 milestone Dec 13, 2015
@rhc54
Contributor

rhc54 commented Dec 13, 2015

out of curiosity - how many hosts are in that hostfile?

@jsquyres
Member Author

$ wc hosts
  64  128 1216 hosts
$ head hosts
pacini012 slots=16
pacini013 slots=16
pacini014 slots=16
pacini015 slots=16
pacini016 slots=16
pacini017 slots=16
pacini018 slots=16
pacini019 slots=16
pacini020 slots=16
pacini021 slots=16

The remainder of the file is similar.

@rhc54 rhc54 removed their assignment Dec 17, 2015
@rhc54
Contributor

rhc54 commented Dec 17, 2015

@annu13 has identified the problem and is working on a solution.

@annu13 Can you provide some ETA?

@annu13
Contributor

annu13 commented Dec 17, 2015

I can have the fix ready next week. I guess this is not a blocking issue; let me know if it's otherwise and I will try to fix it ASAP.

@rhc54
Contributor

rhc54 commented Dec 17, 2015

@annu13 That will be fine - it isn't a blocker. Just wanted to give folks some idea of when the fix might become available.

Thanks for tackling it!

@jsquyres
Member Author

Perfect, thanks.

@annu13
Contributor

annu13 commented Dec 22, 2015

@jsquyres Could you check whether PR #1254 fixes the issue?
This fix should avoid the race condition that's causing the errors.

@jsquyres
Member Author

@rhc54 @annu13 I'm sorry; I'm finally getting around to testing this, and it looks like #1254 (which was apparently closed and re-submitted/merged as #1255) did not fix the issue when running MPI jobs. Note that running non-MPI jobs, like hostname and uptime, works fine.

Here's what I'm seeing with the current master head (70787d1):

# ring_c is the simple ring program from the examples/ dir
$ mpirun -np 400 --hostfile hosts --mca btl tcp,vader,self ring_c
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 304
*** Error in `orted': double free or corruption (out): 0x0000000000dc9c60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
/home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
orted[0x400906]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
orted[0x4007b9]
[pacini075:13606] *** Process received signal ***
[pacini075:13606] Signal: Aborted (6)
[pacini075:13606] Signal code:  (-6)
[pacini075:13606] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7fb4d0ce4100]
[pacini075:13606] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7fb4d09495f7]
[pacini075:13606] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7fb4d094ace8]
[pacini075:13606] [ 3] /usr/lib64/libc.so.6(+0x75317)[0x7fb4d0989317]
[pacini075:13606] [ 4] /usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
[pacini075:13606] [ 5] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
[pacini075:13606] [ 6] /home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
[pacini075:13606] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
[pacini075:13606] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
[pacini075:13606] [ 9] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
[pacini075:13606] [10] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
[pacini075:13606] [11] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
[pacini075:13606] [12] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
[pacini075:13606] [13] orted[0x400906]
[pacini075:13606] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
[pacini075:13606] [15] orted[0x4007b9]
[pacini075:13606] *** End of error message ***
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  pacini075

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

I configured Open MPI fairly simply:

$ ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-oshmem --disable-mpi-fortran

(but this run didn't even use the usnic stuff -- just plain/vanilla TCP, to ensure that usnic isn't causing the failure)

@jsquyres
Member Author

Some notes I really should have included above:

  1. It doesn't always segv like this when running MPI jobs -- sometimes it hangs.
  2. This is running without a resource manager (no SLURM, etc.). Just a plain hostfile (same as I pasted above: slots=16 on 25 servers).
  3. It feels like a race condition of some kind:
    • When I run with a smaller np (e.g., np=200), it usually works fine -- but still sometimes hangs.
    • np=400 always hangs or segvs.

@jsquyres
Member Author

@hppritcha I'm marking this a blocker for v2.0.0 because it seems like we have an important race condition at scale that needs to be solved before release (i.e., it's happening on the v2.x branch (open-mpi/ompi-release@dea4f34) as well as master (70787d1)).

@rhc54
Contributor

rhc54 commented Jan 23, 2016

I know where the problem lies and will fix it this week.

@jsquyres
Member Author

Thanks!

@annu13
Contributor

annu13 commented Jan 23, 2016

@rhc54 Are you planning to put in a fix to avoid the race condition, or to handle it by queuing the request so that the receiving process can handle it once it has all the wire-up information? I was planning to push a change that handles request queuing and local error-info propagation, to avoid the hang condition when one process locally completes the all-gather because of an error while other processes are waiting for its response.

@rhc54
Contributor

rhc54 commented Jan 23, 2016

@annu13 This isn't an error - it's a race condition that didn't get fixed in the last PR. So I plan to fix the race condition. As I described back in the original issue, it is possible to receive a grpcomm buffer from another daemon prior to having processed the launch message. So we need to recycle the incoming message until the launch message has been processed so we know how to construct the collective signature.
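
Here is a minimal sketch of that defer-until-ready idea. It is not the actual ORTE code -- every name below is an invented placeholder, and it uses a simple hold-and-replay queue rather than re-posting the message through the event loop as described above -- but it shows the ordering constraint: nothing from grpcomm can be acted on until the launch message has told the daemon what the job looks like.

/* Hypothetical sketch only -- not the actual ORTE fix; all names are invented.
 * Idea: if a grpcomm buffer arrives before the launch message has been
 * processed, hold it and replay it later instead of failing with "Not found". */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int sender; } msg_t;      /* stand-in for an RML buffer */

static bool launch_msg_processed = false;  /* set once the launch msg is handled */
static msg_t *pending[16];                 /* messages that arrived too early    */
static int npending = 0;

static void handle_grpcomm_msg(msg_t *msg)
{
    if (!launch_msg_processed) {
        /* Too early: we can't construct the collective signature yet,
         * so hold the message instead of erroring out. */
        pending[npending++] = msg;
        return;
    }
    printf("processing grpcomm msg from daemon %d\n", msg->sender);
}

static void launch_msg_done(void)
{
    launch_msg_processed = true;
    for (int i = 0; i < npending; i++) {   /* replay anything held back */
        handle_grpcomm_msg(pending[i]);
    }
    npending = 0;
}

int main(void)
{
    msg_t early = { .sender = 25 };
    handle_grpcomm_msg(&early);  /* arrives before the launch message: held     */
    launch_msg_done();           /* launch message processed: held msg replayed */
    return 0;
}

In the real daemon this deferral would have to happen in the grpcomm/RML receive path (the same path visible in the backtrace above), but the logic is the same.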

@annu13
Contributor

annu13 commented Jan 25, 2016

Ahhh... I got it now. Initially I thought that was the case and started coding the request-queuing piece, but then I got confused by the daemon job setup code and incorrectly concluded that the race condition would be eliminated if I ensured that the daemon doesn't enable comm until the daemon job was set up.

@jsquyres
Member Author

jsquyres commented Feb 4, 2016

Heh -- due to a typo in the commit message, this issue didn't auto-close when 68912d0 was committed.

Just to be clear: this issue is now fixed on master. PR to v2.x coming shortly.
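
For reference, GitHub only auto-closes an issue when the commit message contains one of its closing keywords followed by the issue number, e.g. a line like:

Fixes #1215

A typo in either the keyword or the number leaves the issue open.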
