
Issues running with Open UCX 1.4 on Cray XC40 #6084

@devreal

Description

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI git master (592e2cc)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Configured with support for Open UCX 1.4 (downloaded from openucx.com) using the configure flags --with-cray-pmi --enable-debug --with-ucx.
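For completeness, the build sequence looked roughly like this (the install prefix below is a placeholder, not the exact path used on the system; the flags are the ones listed above):

$ ./configure --prefix=$HOME/opt-cray/openmpi-git \
      --with-cray-pmi --enable-debug --with-ucx
$ make -j && make install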

Please describe the system on which you are running

  • Operating system/version: Cray XC40
  • Computer hardware:
  • Network type:

Details of the problem

Trying to run a job on that machine leads to the following errors and eventually a crash:

$ mpirun -n 2 -N 2 --oversubscribe ./mpi/one-sided/osu_get_latency
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
Thu Nov 15 04:02:41 2018: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
Thu Nov 15 04:02:41 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
[nid07057:31456:0:31456] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[nid07057:31455:0:31455] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace ====
[1542250961.930000] [nid07057:31456:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542250961.930037] [nid07057:31456:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
[1542250961.930022] [nid07057:31455:0]    ugni_device.c:137  UCX  ERROR PMI_Init failed, Error status: -1
[1542250961.930055] [nid07057:31455:0]    ugni_device.c:182  UCX  ERROR Could not fetch PMI info.
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
    0  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x211a0) [0x2aaabe3c71a0]
    1  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucs.so.0(+0x213f4) [0x2aaabe3c73f4]
    2  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(init_device_list+0x90) [0x2aaabdf36ba0]
    3  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(+0x18945) [0x2aaabdf36945]
    4  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libuct.so.0(uct_md_open+0x69) [0x2aaabdf2f369]
    5  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openucx-1.4/lib/libucp.so.0(ucp_init_version+0x94b) [0x2aaabdcbde3b]
    6  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0x44b4) [0x2aaabc2074b4]
    7  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_open+0xd1) [0x2aaabc207d20]
    8  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/openmpi/mca_pml_ucx.so(+0xa888) [0x2aaabc20d888]
    9  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(+0x7377f) [0x2aaaabcfa77f]
   10  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_components_open+0x5d) [0x2aaaabcfa69c]
   11  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(+0x132d5e) [0x2aaaab0fed5e]
   12  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
   13  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
   14  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
   15  ./mpi/one-sided/osu_get_latency() [0x401450]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
   17  ./mpi/one-sided/osu_get_latency() [0x401699]
===================
   12  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libopen-pal.so.0(mca_base_framework_open+0xf3) [0x2aaaabd09e5d]
   13  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(ompi_mpi_init+0x99e) [0x2aaaab030e6e]
   14  /zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-git/lib/libmpi.so.0(MPI_Init+0x7f) [0x2aaaab07e58c]
   15  ./mpi/one-sided/osu_get_latency() [0x401450]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab5e6725]
   17  ./mpi/one-sided/osu_get_latency() [0x401699]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 31456 on node 7057 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

This issue could be related to #5973 (as it's the same machine).

Interestingly, I seem to be unable to disable the UCX PML using --mca pml_ucx_priority 0 (assuming that is the right way to do it); it does not change the outcome. With Open MPI configured without UCX, I am able to run Open MPI applications on that machine (using --oversubscribe).
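Concretely, the command I used was roughly the first one below. The priority parameter may not be the right knob; as far as I know, the standard way to take a component out of the selection is the ^ exclusion syntax (second command), which I have not verified on this system:

$ mpirun -n 2 -N 2 --oversubscribe --mca pml_ucx_priority 0 ./mpi/one-sided/osu_get_latency
$ mpirun -n 2 -N 2 --oversubscribe --mca pml ^ucx ./mpi/one-sided/osu_get_latency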

Another interesting observation: when running the Open MPI + UCX build under DDT, I get the following error:

[1542291969.934067] [nid02068:7234 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291969.934601] [nid02068:7235 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
# OSU MPI_Get latency Test v5.3.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
[1542291993.779432] [nid02068:7235 :1] ugni_udt_iface.c:119  UCX  ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4

[1542291993.780001] [nid02068:7234 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xb80) for ucp_am_bufs failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1542291993.780991] [nid02068:7234 :1] ugni_udt_iface.c:119  UCX  ERROR GNI_PostDataProbeWaitById, Error status: GNI_RC_TIMEOUT 4

As suggested, here is the output of ipcs -l on the node:

$ aprun -n 1 ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
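Those limits look generous enough that the shmget failure is surprising. One thing that might help isolate it (a sketch only; I have not tried it, and the transport names are an assumption about this UCX 1.4 build) would be to take the shared-memory transport out of the picture and restrict UCX to uGNI plus self:

# "ugni" and "self" are assumed to be valid UCX_TLS values for this UCX build
$ UCX_TLS=ugni,self mpirun -n 2 -N 2 --oversubscribe ./mpi/one-sided/osu_get_latency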

I am able to debug Open MPI applications under DDT if Open MPI is built without UCX support.
