v3.0.0rc1: XLC-13.1 odd failure running ring_c #3817

@PHHargrove

I have configured on a Linux/ppc64 system with xlc-13.1 as follows:

[path-to]/configure --prefix=[...] --enable-debug CC=xlc CXX=xlC FC=xlf \
        CFLAGS=-q64 --with-wrapper-cflags=-q64 \
        CXXFLAGS=-q64 --with-wrapper-cxxflags=-q64 \
        FCFLAGS=-q64 --with-wrapper-fcflags=-q64 --disable-oshmem-fortran

This build was previously failing due to #3811, but a patch from @ggouaillardet gets me past that and on to the next problem:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
[login2:71837] Component file data does not match filename: /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/pmix/mca_ptl_tcp (ptl / tcp) != ptl  -- ignored
[login2:71842] Component file data does not match filename: /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/pmix/mca_ptl_tcp (ptl / tcp) != ptl  -- ignored
[login2:71843] Component file data does not match filename: /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/pmix/mca_ptl_tcp (ptl / tcp) != ptl  -- ignored
[login2:71842] *** Process received signal ***
[login2:71842] Signal: Segmentation fault (11)
[login2:71842] Signal code: Address not mapped (1)
[login2:71842] Failing at address: 0xdeafbeeddeafbf35
[login2:71842] [ 0] [0xfffad2e0448]
[login2:71842] [ 1] [0x0]
[login2:71842] [ 2] /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.40(+0x108fd8)[0xfffac858fd8]
[login2:71842] [ 3] [login2:71843] *** Process received signal ***
[login2:71843] Signal: Segmentation fault (11)
[login2:71843] Signal code: Address not mapped (1)
[login2:71843] Failing at address: 0xdeafbeeddeafbf35
[login2:71843] [ 0] [0xfffaad60448]
[login2:71843] [ 1] [0x0]

Note that the failing address, 0xdeafbeeddeafbf35, is suspiciously close to the value of OPAL_OBJ_MAGIC_ID:
./opal/class/opal_object.h:#define OPAL_OBJ_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)
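
To make the similarity concrete, the distance between the failing address and the magic value can be computed directly. The sketch below copies the OPAL_OBJ_MAGIC_ID definition quoted above; the helper name offset_from_magic is made up for illustration and is not part of Open MPI. A small positive offset would be consistent with (but does not prove) a field access through a pointer whose value is the magic ID itself.

```c
#include <assert.h>
#include <stdint.h>

/* Definition as quoted from opal/class/opal_object.h above. */
#define OPAL_OBJ_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Compile-time check that the macro expands to the expected 64-bit pattern. */
static_assert(OPAL_OBJ_MAGIC_ID == 0xdeafbeeddeafbeedULL,
              "magic ID should be 0xdeafbeed repeated in both halves");

/* Hypothetical helper (not an Open MPI function): how far a faulting
 * address lies above the magic value. Unsigned arithmetic, so an address
 * below the magic value wraps around to a very large number. */
static uint64_t offset_from_magic(uint64_t fault_addr)
{
    return fault_addr - OPAL_OBJ_MAGIC_ID;
}
```

For the address in the trace, offset_from_magic(0xdeafbeeddeafbf35ULL) evaluates to 0x48 (72 bytes).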

From gdb on a core file (from a different run than the output above, so addresses may not match exactly):

Core was generated by `examples/ring_c '.
Program terminated with signal 11, Segmentation fault.
#0  0x00000fffa46fc0b4 in pmix_ptl_base_recv_handler (sd=14, flags=2, cbdata=0xfffa4798c78)
    at /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_sendrecv.c:401
401                             (NULL == peer) ? "NULL" : peer->info->nptr->nspace,
(gdb) where
#0  0x00000fffa46fc0b4 in pmix_ptl_base_recv_handler (sd=14, flags=2, cbdata=0xfffa4798c78)
    at /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_sendrecv.c:401
#1  0x00000fffa53b8fd8 in .event_persist_closure ()
   from /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.40
#2  0x00000fffa53b9354 in .event_process_active_single_queue ()
   from /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.40
#3  0x00000fffa53b9738 in .event_process_active ()
   from /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.40
#4  0x00000fffa53ba5e0 in .opal_libevent2022_event_base_loop ()
   from /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.40
#5  0x00000fffa46d1d4c in progress_engine (obj=0x100314d1ef8)
    at /gpfs-biou/phh1/OMPI/openmpi-3.0.0rc1-linux-ppc64-xlc-13.1/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/runtime/pmix_progress_threads.c:109
#6  0x000000800cdac5dc in .start_thread () from /lib64/libpthread.so.0
#7  0x000000800ccda9bc in .__clone () from /lib64/libc.so.6
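
The expression at line 401 checks only peer against NULL and then follows three more pointers (info, nptr, nspace). As a rough illustration of that access pattern, here is a minimal sketch that guards each link. The demo_* types and the function name are hypothetical stand-ins, not the real PMIx structures, and the field layout is an assumption made only so the example compiles.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the chain peer->info->nptr->nspace.
 * These are NOT the real PMIx definitions. */
typedef struct { char nspace[256]; } demo_nspace_t;
typedef struct { demo_nspace_t *nptr; } demo_info_t;
typedef struct { demo_info_t *info; } demo_peer_t;

/* Return the namespace string, or "NULL" if any link in the chain
 * is a NULL pointer. */
static const char *demo_peer_nspace(const demo_peer_t *peer)
{
    if (NULL == peer || NULL == peer->info || NULL == peer->info->nptr) {
        return "NULL";
    }
    return peer->info->nptr->nspace;
}
```

Checks like these only catch NULL links. A stale or overwritten pointer, such as one holding the magic pattern seen in the failing address, would pass them and still fault on dereference.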
