OMPI/master failures with ext PMIx v1.2 (github) #3200

Closed
artpol84 opened this issue Mar 17, 2017 · 7 comments

@artpol84
Contributor

With the provided configuration, I'm getting the following error from each orted.

[jupiter022:23209] OPAL ERROR: Not found in file ../../../../../opal/mca/pmix/ext1x/pmix1x_client.c at line 235
[jupiter022:23209] [[45022,0],22] ORTE_ERROR_LOG: Not found in file ../../orte/util/nidmap.c at line 102
[jupiter022:23209] [[45022,0],22] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 535

It seems something is broken in the external PMIx component. I doubt the cause is #3181, but I will check and update here.

@artpol84
Contributor Author

I tentatively set v3.x as a target, but I haven't tried to reproduce with it.

@artpol84
Contributor Author

@karasevb, please check whether ompi/master is usable with pmix-v1.2.

@karasevb
Member

karasevb commented Jun 7, 2017

There is a problem with external PMIx on v3.x (master works well).
I'm investigating it.

@karasevb
Member

karasevb commented Jun 7, 2017

More details:

$ ./mpirun --debug-daemons hostname
[cn9:10998] OPAL ERROR: Not found in file /home/user/sandbox/src/ompi_v3/opal/mca/pmix/ext1x/pmix1x_client.c at line 235
[cn9:10998] [[2567,0],2] ORTE_ERROR_LOG: Not found in file /home/user/sandbox/src/ompi_v3/orte/util/nidmap.c at line 102
[cn9:10998] [[2567,0],2] ORTE_ERROR_LOG: Not found in file /home/user/sandbox/src/ompi_v3/orte/mca/ess/base/ess_base_std_orted.c at line 535
srun: error: cn9: task 1: Exited with exit code 213
srun: Terminating job step 123185.13
slurmstepd: error: *** STEP 123185.13 ON cn8 CANCELLED AT 2017-06-06T21:34:20 ***
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: cn10: task 2: Killed
srun: error: cn8: task 0: Killed
srun: error: cn23: task 3: Killed
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

$ env | grep SLURM
SLURM_NODELIST=cn[8-10,23]
SLURM_JOB_NAME=bash
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=4
SLURM_JOBID=3
SLURM_TASKS_PER_NODE=8(x4)
SLURM_JOB_ID=123185
SLURM_SUBMIT_DIR=/home/user
SLURM_JOB_NODELIST=cn[8-10,23]
SLURM_JOB_CPUS_PER_NODE=8(x4)
SLURM_CLUSTER_NAME=cn
SLURM_SUBMIT_HOST=fe
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=4

./configure \
  --prefix=/home/user/sandbox/install/ompi_master \
  --with-pmix=/home/user/sandbox/install/pmix \
  --with-libevent=/home/user/sandbox/install/libevent

@karasevb added this to the v3.0.0 milestone Jun 7, 2017

@rhc54
Contributor

rhc54 commented Jun 7, 2017

Yes, it is waiting for a PR to update it. It's the external v1.2 that remains of concern.

@bwbarrett
Member

#3677 is merged, so we should be good, correct?

@artpol84
Contributor Author

Yes
