Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI git master (592e2cc)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configured from a git clone with support for Open UCX 1.4 (downloaded from openucx.com), using the configure flags --with-cray-pmi --enable-debug --with-ucx.
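For completeness, spelled out as a single command line (the install prefixes are taken from the library paths in the backtraces below; passing the UCX prefix to --with-ucx is an assumption, since only the bare flag was noted above):

$ ./configure --prefix=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git \
      --with-cray-pmi --enable-debug \
      --with-ucx=/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4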
Please describe the system on which you are running
- Operating system/version: Cray XC40
- Computer hardware:
- Network type:
Details of the problem
Trying to run a job on that machine leads to the following errors and eventually a crash:
$ mpirun -n 2 -N 2 --oversubscribe ./mpi/one-sided/osu_get_latency
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
[nid07057:31456:0:31456] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[nid07057:31455:0:31455] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1542250961.930000] [nid07057:31456:0] ugni_device.c:137 UCX ERROR PMI_Init failed, Error status: -1
[1542250961.930037] [nid07057:31456:0] ugni_device.c:182 UCX ERROR Could not fetch PMI info.
[1542250961.930022] [nid07057:31455:0] ugni_device.c:137 UCX ERROR PMI_Init failed, Error status: -1
[1542250961.930055] [nid07057:31455:0] ugni_device.c:182 UCX ERROR Could not fetch PMI info.
0 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
1 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
2 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
3 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
4 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
5 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
6 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
7 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
8 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
9 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
10 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
11 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
0 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
1 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
2 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
3 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
4 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
5 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
6 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
7 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
8 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
9 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
10 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
11 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
12 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
13 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
14 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
15 ./mpi/one-sided/osu_get_latency() [0x401450]
16 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
17 ./mpi/one-sided/osu_get_latency() [0x401699]
===================
12 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
13 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
14 /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
15 ./mpi/one-sided/osu_get_latency() [0x401450]
16 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
17 ./mpi/one-sided/osu_get_latency() [0x401699]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 31456 on node 7057 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
This issue could be related to #5973 (as it's the same machine).
Interestingly, I seem to be unable to disable the UCX PML using --mca pml_ucx_priority 0 (assuming that is the right way to do it); it does not change the outcome. With Open MPI configured without UCX, I am able to run Open MPI applications on that machine (using --oversubscribe).
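Judging from the backtraces above, the segfault happens inside mca_pml_ucx_open, i.e. while the UCX PML component is merely being opened via mca_base_framework_components_open, before any priority-based selection takes place. If that reading is correct, lowering pml_ucx_priority would not be expected to help; excluding the component so it is never opened might. A possible (untested here) alternative using the standard MCA exclusion syntax:

$ mpirun -n 2 -N 2 --oversubscribe --mca pml ^ucx ./mpi/one-sided/osu_get_latency

If the build also contains the UCX one-sided component, it may need to be excluded the same way (--mca osc ^ucx).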
Another interesting observation: when running the Open MPI + UCX build under DDT, I get the following errors:
[1542291969.934067] [nid02068:7234 :0] sys.c:619 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291969.934601] [nid02068:7235 :0] sys.c:619 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
# OSU MPI_Get latency Test v5.3.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size Latency (us)
[1542291993.779432] [nid02068:7235 :1] ugni_udt_iface.c:119 UCX ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4
[1542291993.780001] [nid02068:7234 :0] sys.c:619 UCX ERROR shmget(size=2097152 flags=0xb80) for ucp_am_bufs failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291993.780991] [nid02068:7234 :1] ugni_udt_iface.c:119 UCX ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4
As suggested, here is the output of ipcs -l on the node:
$ aprun -n 1 ipcs -l
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
I am able to debug Open MPI applications if Open MPI was built without support for UCX.
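In case it helps with triage, re-running with the PML framework's verbose output should show exactly when the UCX component is opened and where things go wrong during MPI_Init; this is a sketch assuming the standard mca_base verbosity parameter:

$ mpirun -n 2 -N 2 --oversubscribe --mca pml_base_verbose 100 ./mpi/one-sided/osu_get_latency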