master opal_init() failing on Absoft 32 bit MTT dlopen_test #3484

@jsquyres

Absoft's MTT runs on master have recently been failing in the dlopen_test (run during make check), inside opal_init(); see, e.g., https://mtt.open-mpi.org/index.php?do_redir=2438.

Here's a backtrace provided by @cagoelz (note the assertion failure in opal_pointer_array_add()):

Reading symbols from /home/ompitest/scratches/2017-05-05/mpi-install/8HsK/src/openmpi-master-201705050239-eb03679/ompi/debuggers/.libs/lt-dlopen_test...done.
(gdb) run
Starting program: /home/ompitest/scratches/2017-05-05/mpi-install/8HsK/src/openmpi-master-201705050239-eb03679/ompi/debuggers/.libs/lt-dlopen_test
[Thread debugging using libthread_db enabled]
lt-dlopen_test: class/opal_pointer_array.c:241: opal_pointer_array_add: Assertion `0 == (table->free_bits[__b_idx] & (1UL << __b_pos))' failed.

Program received signal SIGABRT, Aborted.
0x00110402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00110402 in __kernel_vsyscall ()
#1  0x00733b10 in raise () from /lib/libc.so.6
#2  0x00735421 in abort () from /lib/libc.so.6
#3  0x0072cf6b in __assert_fail () from /lib/libc.so.6
#4  0x00399ace in opal_pointer_array_add (table=0x482940, ptr=0x80637f0)
   at class/opal_pointer_array.c:241
#5  0x003d3e13 in register_variable (project_name=0x45b094 "opal",
   framework_name=0x45b090 "dss", component_name=0x0,
   variable_name=0x45b1a9 "buffer_threshold_size", description=0x0,
   type=MCA_BASE_VAR_TYPE_INT, enumerator=0x0, bind=0,
   flags=MCA_BASE_VAR_FLAG_SETTABLE, info_lvl=OPAL_INFO_LVL_8,
   scope=MCA_BASE_VAR_SCOPE_ALL_EQ, synonym_for=-1, storage=0x47f928)
   at mca_base_var.c:1385
#6  0x003d4649 in mca_base_var_register (project_name=0x45b094 "opal",
   framework_name=0x45b090 "dss", component_name=0x0,
   variable_name=0x45b1a9 "buffer_threshold_size", description=0x0,
   type=MCA_BASE_VAR_TYPE_INT, enumerator=0x0, bind=0,
   flags=MCA_BASE_VAR_FLAG_SETTABLE, info_lvl=OPAL_INFO_LVL_8,
   scope=MCA_BASE_VAR_SCOPE_ALL_EQ, storage=0x47f928) at mca_base_var.c:1495
#7  0x003b582f in opal_dss_register_vars () at dss/dss_open_close.c:286
#8  0x0039fe25 in opal_register_params () at runtime/opal_params.c:357
#9  0x0039ec92 in opal_init_util (pargc=0xbfffead0, pargv=0xbfffead4)
   at runtime/opal_init.c:422
#10 0x0039eeb1 in opal_init (pargc=0xbfffead0, pargv=0xbfffead4)
   at runtime/opal_init.c:513
#11 0x08048cee in main (argc=Cannot access memory at address 0x77fc) at dlopen_test.c:133
(gdb)

This is a 32-bit build, and the test invokes opal_init() explicitly. Are we invoking it incorrectly? Or is something legitimately busted in opal_init() in 32-bit builds?
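
One avenue worth checking, offered as a hypothesis rather than a confirmed diagnosis: the failed assertion tests a bit in `table->free_bits` using the mask `1UL << __b_pos`. If each `free_bits` word is 64 bits wide (an assumption; the backtrace does not show the declaration), `__b_pos` can reach 63, while `unsigned long` is only 32 bits on a typical 32-bit Linux build, and shifting by 32 or more is then undefined behavior in C. The sketch below uses hypothetical helper names (`locate_bit`, `mask64`, `mark_used`, `is_used`, `ulong_shift_is_defined`), none of which are Open MPI functions, to show a width-safe version of the same free-bit check.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption: each bitmap word is 64 bits wide, independent of the ABI. */
#define WORD_BITS 64u

/* Split a slot index into a word index and a bit position in that word. */
static inline void locate_bit(unsigned index, unsigned *word, unsigned *pos)
{
    *word = index / WORD_BITS;
    *pos  = index % WORD_BITS;
}

/* Width-safe mask: UINT64_C(1) has at least 64 bits on every ABI,
 * so positions 0..63 are all valid shift counts. */
static inline uint64_t mask64(unsigned pos)
{
    return UINT64_C(1) << pos;
}

/* Nonzero if `1UL << pos` would be well defined on this platform.
 * With a 32-bit unsigned long this is false for pos >= 32. */
static inline int ulong_shift_is_defined(unsigned pos)
{
    return pos < sizeof(unsigned long) * CHAR_BIT;
}

/* Report whether a slot is currently marked as used. */
static inline int is_used(const uint64_t *bits, unsigned index)
{
    unsigned word, pos;
    locate_bit(index, &word, &pos);
    return (bits[word] & mask64(pos)) != 0;
}

/* Mark a slot as used, first asserting that it was free. This mirrors
 * the shape of the failed check, but with a 64-bit mask. */
static inline void mark_used(uint64_t *bits, unsigned index)
{
    unsigned word, pos;
    locate_bit(index, &word, &pos);
    assert(0 == (bits[word] & mask64(pos)));
    bits[word] |= mask64(pos);
}
```

With a zero-initialized two-word array, calling `mark_used` for slots 3, 40, and 70 sets bits 3 and 40 of the first word and bit 6 of the second. Building the test with a mask helper like this in place of `1UL` would be one way to see whether the assertion still fires on the 32-bit Absoft machine.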
