
grpcomm errors when launching on RHEL 7.2/ssh #1215


Closed
jsquyres opened this issue Dec 13, 2015 · 16 comments

@jsquyres
Member

I'm seeing odd behavior when trying to launch small MPI jobs on master (as of Sun 13 Dec 2015, after @rhc54's update to pmix 1.1.2).

Here are the specs:

  • RHEL 7.2
  • TCP BTL
  • ssh launcher (no SLURM or any other scheduler)
  • (mostly) Default master build: ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-vt --disable-mpi-fortran
    • Yes, I built with libfabric/usnic, but I'm intentionally testing with the TCP BTL to rule out a problem in the usnic BTL -- and I'm seeing the same behavior regardless of BTL selection

Here's what I'm launching:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c

The hostfile contains a bunch of lines like this: hostname slots=16

Sometimes that runs fine, sometimes it results in the following:

$ mpirun --mca pml ob1 --mca btl tcp,vader,self --hostfile hosts -np 40 ring_c
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254  
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 241  
malloc debug: Request for 4 zeroed elements of size -1 failed (grpcomm_brks.c, 92)
[pacini014.arcetri.cisco.com:08929] [[4261,0],3] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 170

FWIW, I observed this same behavior this past Thursday (i.e., before the pmix 1.1.2 update), but didn't have the time to file a proper bug report. This suggests that the problem might be unrelated to the old-vs.-new PMIX...?

Here's a gist of a failed run, but with lots of verbosity, in case it helps. Here's the command line used to launch that run:

$ mpirun \
    --mca ess_base_verbose 100 \
    --mca grpcomm_base_verbose 100 \
    --mca pmix_base_verbose 100 \
    --mca pml ob1 \
    --mca btl tcp,vader,self \
    --hostfile hosts \
    -np 40 \
    ring_c
@jsquyres jsquyres added the bug label Dec 13, 2015
@jsquyres jsquyres added this to the v2.0.0 milestone Dec 13, 2015
@rhc54
Contributor

rhc54 commented Dec 13, 2015

out of curiosity - how many hosts are in that hostfile?

@jsquyres
Member Author

$ wc hosts
  64  128 1216 hosts
$ head hosts
pacini012 slots=16
pacini013 slots=16
pacini014 slots=16
pacini015 slots=16
pacini016 slots=16
pacini017 slots=16
pacini018 slots=16
pacini019 slots=16
pacini020 slots=16
pacini021 slots=16

The remainder of the file is similar.

@rhc54 rhc54 removed their assignment Dec 17, 2015
@rhc54
Contributor

rhc54 commented Dec 17, 2015

@annu13 has identified the problem and is working on a solution.

@annu13 Can you provide some ETA?

@annu13
Contributor

annu13 commented Dec 17, 2015

I can have the fix ready next week. I guess this is not a blocking issue; let me know if it's otherwise and I will try to fix it ASAP.

@rhc54
Contributor

rhc54 commented Dec 17, 2015

@annu13 That will be fine - it isn't a blocker. Just wanted to give folks some idea of when the fix might become available.

Thanks for tackling it!

@jsquyres
Member Author

Perfect, thanks.

@annu13
Contributor

annu13 commented Dec 22, 2015

@jsquyres Could you check whether PR #1254 fixes the issue?
This fix should avoid the race condition that's causing the errors.

@jsquyres
Member Author

@rhc54 @annu13 I'm sorry; I'm finally getting around to testing this, and it looks like #1254 (which was apparently closed and re-submitted/merged as #1255) did not fix the issue when running MPI jobs. Note that running non-MPI jobs, like hostname and uptime, works fine.

Here's what I'm seeing with the current master head (70787d1):

# ring_c is the simple ring program from the examples/ dir
$ mpirun -np 400 --hostfile hosts --mca btl tcp,vader,self ring_c
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 294
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_stubs.c at line 254
[pacini075.arcetri.cisco.com:13606] [[31700,0],25] ORTE_ERROR_LOG: Not found in file grpcomm_brks.c at line 304
*** Error in `orted': double free or corruption (out): 0x0000000000dc9c60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
/home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
/home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
/home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
orted[0x400906]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
orted[0x4007b9]
[pacini075:13606] *** Process received signal ***
[pacini075:13606] Signal: Aborted (6)
[pacini075:13606] Signal code:  (-6)
[pacini075:13606] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7fb4d0ce4100]
[pacini075:13606] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7fb4d09495f7]
[pacini075:13606] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7fb4d094ace8]
[pacini075:13606] [ 3] /usr/lib64/libc.so.6(+0x75317)[0x7fb4d0989317]
[pacini075:13606] [ 4] /usr/lib64/libc.so.6(+0x7cfe1)[0x7fb4d0990fe1]
[pacini075:13606] [ 5] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_free+0x1f)[0x7fb4d1ca182a]
[pacini075:13606] [ 6] /home/jsquyres/bogus/lib/libopen-rte.so.0(+0x66e59)[0x7fb4d1fc4e59]
[pacini075:13606] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x11ac)[0x7fb4ce9bc1ac]
[pacini075:13606] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_grpcomm_brks.so(+0x2ffb)[0x7fb4ce9bdffb]
[pacini075:13606] [ 9] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_complete_recv_msg+0x18a)[0x7fb4d1ffc290]
[pacini075:13606] [10] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_base_process_msg+0x12f)[0x7fb4d1ffc97a]
[pacini075:13606] [11] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x8fc)[0x7fb4d1cb50dc]
[pacini075:13606] [12] /home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x2339)[0x7fb4d1fa8366]
[pacini075:13606] [13] orted[0x400906]
[pacini075:13606] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb4d0935b15]
[pacini075:13606] [15] orted[0x4007b9]
[pacini075:13606] *** End of error message ***
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  pacini075

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

I configured Open MPI fairly simply:

$ ./configure --prefix=/home/jsquyres/bogus --with-libfabric=/home/jsquyres/bogus --with-usnic --disable-oshmem --disable-mpi-fortran

(but this run didn't even use the usnic stuff -- just plain/vanilla TCP, to ensure that usnic isn't causing the failure)

@jsquyres
Member Author

Some notes I really should have included above:

  1. It doesn't always segv like this when running MPI jobs -- sometimes it hangs.
  2. This is running without a resource manager (no SLURM, etc.). Just a plain hostfile (same as I pasted above: slots=16 on 25 servers).
  3. It feels like a race condition of some kind:
    • When I run with a smaller np (e.g., np=200), it usually works fine -- but still sometimes hangs.
    • np=400 always hangs or segvs.

@jsquyres
Member Author

@hppritcha I'm marking this a blocker for v2.0.0 because it seems like we have an important race condition at scale that needs to be solved before release (i.e., it's happening on the v2.x branch (open-mpi/ompi-release@dea4f34) as well as master (70787d1)).

@rhc54
Contributor

rhc54 commented Jan 23, 2016

I know where the problem lies and will fix it this week.

@jsquyres
Member Author

Thanks!

@annu13
Contributor

annu13 commented Jan 23, 2016

@rhc54 Are you planning to put in a fix to avoid the race condition, or to handle it by queuing the request so that the receiving process can handle it once it has all the wire-up information? I was planning to push a change that handles request queuing and local error-info propagation, to avoid the hang condition when one process locally completes the all-gather because of an error while other processes are waiting for its response.

@rhc54
Contributor

rhc54 commented Jan 23, 2016

@annu13 This isn't an error - it's a race condition that didn't get fixed in the last PR. So I plan to fix the race condition. As I described back in the original issue, it is possible to receive a grpcomm buffer from another daemon prior to having processed the launch message. So we need to recycle the incoming message until the launch message has been processed so we know how to construct the collective signature.
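
Here is a minimal sketch of that defer-until-ready idea. It is not the actual ORTE code -- every name below is an invented placeholder, and it uses a simple hold-and-replay queue rather than re-posting the message through the event loop as described above -- but it shows the ordering constraint: nothing from grpcomm can be acted on until the launch message has told the daemon what the job looks like.

/* Hypothetical sketch only -- not the actual ORTE fix; all names are invented.
 * Idea: if a grpcomm buffer arrives before the launch message has been
 * processed, hold it and replay it later instead of failing with "Not found". */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int sender; } msg_t;      /* stand-in for an RML buffer */

static bool launch_msg_processed = false;  /* set once the launch msg is handled */
static msg_t *pending[16];                 /* messages that arrived too early    */
static int npending = 0;

static void handle_grpcomm_msg(msg_t *msg)
{
    if (!launch_msg_processed) {
        /* Too early: we can't construct the collective signature yet,
         * so hold the message instead of erroring out. */
        pending[npending++] = msg;
        return;
    }
    printf("processing grpcomm msg from daemon %d\n", msg->sender);
}

static void launch_msg_done(void)
{
    launch_msg_processed = true;
    for (int i = 0; i < npending; i++) {   /* replay anything held back */
        handle_grpcomm_msg(pending[i]);
    }
    npending = 0;
}

int main(void)
{
    msg_t early = { .sender = 25 };
    handle_grpcomm_msg(&early);  /* arrives before the launch message: held     */
    launch_msg_done();           /* launch message processed: held msg replayed */
    return 0;
}

In the real daemon this deferral would have to happen in the grpcomm/RML receive path (the same path visible in the backtrace above), but the logic is the same.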

@annu13
Contributor

annu13 commented Jan 25, 2016

Ahhh... I got it now. Initially I thought that was the case and started coding the request-queuing piece, but then I got confused by the daemon job setup code and incorrectly concluded that the race condition would be eliminated if I ensured that the daemon doesn't enable comm until the daemon job was set up.

@jsquyres
Member Author

jsquyres commented Feb 4, 2016

Heh -- due to a typo in the commit message, this issue didn't auto-close when 68912d0 was committed.

Just to be clear: this issue is now fixed on master. PR to v2.x coming shortly.
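
For reference, GitHub only auto-closes an issue when the commit message contains one of its closing keywords followed by the issue number, e.g. a line like:

Fixes #1215

A typo in either the keyword or the number leaves the issue open.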
