segmentation fault in openib with failover enabled #2228

Closed
@davidklaftenegger

When we compile openmpi-2.0.1 (or the nightly from last Wednesday) with --enable-btl-openib-failover, every MPI application segfaults on its first MPI communication call when using the openib BTL.

[jason0:15469] *** Process received signal ***
[jason0:15469] Signal: Segmentation fault (11)
[jason0:15469] Signal code: Address not mapped (1)
[jason0:15469] Failing at address: (nil)
0 pings 1
[jason0:15469] [ 0] /lib64/libpthread.so.0(+0x10d70)[0x7f40652b2d70]
[jason0:15469] [ 1] /usr/lib64/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x66b)[0x7f405a46cceb]
[jason0:15469] [ 2] /usr/lib64/openmpi/mca_pml_ob1.so(+0xae18)[0x7f4059e2ae18]
[jason0:15469] [ 3] /usr/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x365)[0x7f4059e2b945]
[jason0:15469] [ 4] /usr/lib/libmpi.so.20(ompi_coll_base_barrier_intra_two_procs+0xb5)[0x7f406555b0f5]
[jason0:15469] [ 5] /usr/lib/libmpi.so.20(MPI_Barrier+0xb6)[0x7f4065518576]
[jason0:15469] [ 6] ./pingtest[0x4013a3]
[jason0:15469] [ 7] /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f4064f26620]
[jason0:15469] [ 8] ./pingtest[0x400de9]
[jason0:15469] *** End of error message ***
[jason1:20431] *** Process received signal ***
[jason1:20431] Signal: Segmentation fault (11)
[jason1:20431] Signal code: Address not mapped (1)
[jason1:20431] Failing at address: (nil)
[jason1:20431] [ 0] /lib64/libpthread.so.0(+0x10d70)[0x7f6e12d28d70]
[jason1:20431] [ 1] /usr/lib64/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x66b)[0x7f6e07dcbceb]
[jason1:20431] [ 2] /usr/lib64/openmpi/mca_pml_ob1.so(+0xae18)[0x7f6e0c16fe18]
[jason1:20431] [ 3] /usr/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x365)[0x7f6e0c170945]
[jason1:20431] [ 4] /usr/lib/libmpi.so.20(ompi_coll_base_barrier_intra_two_procs+0xb5)[0x7f6e12fd10f5]
[jason1:20431] [ 5] /usr/lib/libmpi.so.20(MPI_Barrier+0xb6)[0x7f6e12f8e576]
[jason1:20431] [ 6] ./pingtest[0x4013a3]
[jason1:20431] [ 7] /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f6e1299c620]
[jason1:20431] [ 8] ./pingtest[0x400de9]
[jason1:20431] *** End of error message ***

This is a regression from openmpi-1.10.2, where this worked without incident.

When we build without --enable-btl-openib-failover, our setup seems to work again.

In case it matters, we have an mlx4 InfiniBand interconnect.
If you need any additional information, please let me know.
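
A minimal sketch of a reproducer, assuming the crash needs nothing more than the first barrier (the backtrace shows MPI_Barrier as the entry point into openib). This is not the actual pingtest source; the printed message is purely illustrative.

```c
/* Hypothetical minimal reproducer; NOT the actual pingtest source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* First communication call; in the backtrace, MPI_Barrier is the
     * frame that leads into mca_btl_openib_sendi. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Only reached if the barrier completes without crashing. */
    printf("rank %d of %d passed the barrier\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with one rank on each of two nodes, for example `mpirun -np 2 --host jason0,jason1 --mca btl openib,self ./barrier_test` (the binary name is made up here), this should exercise the same MPI_Barrier path as in the trace.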

Yours,
David
