Running opal_fifo test intermittently hangs on Power8  #5470

Closed
@mksully22

Description

Background information

Running the opal_fifo test intermittently hangs on Power8. Detailed debug information is provided below.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Using master branch commit level 92d8941

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --enable-debug --prefix=/usr --mandir=/usr/share/man --sysconfdir=/etc/$pkgname --enable-ipv6 --with-threads=posix --with-hwloc=/usr
make
make check
Run opal_fifo stress script:
#!/bin/bash
i=0
while :
do
        echo "Iteration: $i"
        ./test/class/opal_fifo
        ((i++))
        sleep 1
done
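
Since a hung iteration blocks the loop above forever, it may be convenient to run each iteration under a timeout so the hang is detected automatically. Below is a minimal sketch in C, assuming a POSIX system. The helper name `run_with_timeout`, its return codes, and the 50 ms polling interval are hypothetical choices for illustration and are not part of the Open MPI test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed on the monotonic clock since *start. */
static double seconds_since(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec)
         + (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Run argv[0] with the given arguments.
 * Returns the child's exit status (0-255), -1 on fork/wait failure or
 * abnormal termination, or -2 if the child was still running after
 * timeout_sec seconds (it is then killed with SIGKILL). */
static int run_with_timeout(char *const argv[], int timeout_sec)
{
    assert(argv != NULL && argv[0] != NULL);

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv);
        _exit(127);                      /* only reached if exec failed */
    }

    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        }
        if (r < 0) {
            return -1;
        }
        if (seconds_since(&start) >= (double)timeout_sec) {
            kill(pid, SIGKILL);          /* treat as hung */
            waitpid(pid, &status, 0);    /* reap the child */
            return -2;
        }
        struct timespec pause = { 0, 50 * 1000 * 1000 };  /* 50 ms */
        nanosleep(&pause, NULL);
    }
}
```

A caller could pass `{ "./test/class/opal_fifo", NULL }` with a generous timeout and stop looping on the first `-2`. To inspect the hung process with gdb instead, the `kill` would be replaced by printing the pid and leaving the process running.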

Please describe the system on which you are running

  • Ubuntu 17.10
  • Computer hardware: Power8 (160 linux CPUs)
  • Network type: Ethernet

Details of the problem

The opal_fifo test case hangs intermittently when run in a loop by the stress script shown above. While it is hung, htop shows all 8 opal_fifo worker LWPs running at 100% CPU.

Looking at the running processes/LWPs

root@p82qvirt:/home/mksully# ps -eLf | grep fifo
mksully   27439  21680  27439  0    9 16:20 pts/0    00:00:00 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27466 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27467 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27468 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27469 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27470 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27471 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27472 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27473 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
root      27648  50987  27648  0    1 16:21 pts/3    00:00:00 grep --color=auto fifo

Using gdb to collect some info on where the LWPs are:

root@p82qvirt:/home/mksully# gdb attach 27439
attach: No such file or directory.
Attaching to process 27439
[New LWP 27466]
[New LWP 27467]
[New LWP 27468]
[New LWP 27469]
[New LWP 27470]
[New LWP 27471]
[New LWP 27472]
[New LWP 27473]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8) at pthread_join.c:90
90      pthread_join.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x796665cb55c0 (LWP 27439) "opal_fifo" 0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8)
    at pthread_join.c:90
  2    Thread 0x796661f7f180 (LWP 27466) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  3    Thread 0x79666277f180 (LWP 27467) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:59
  4    Thread 0x796662f7f180 (LWP 27468) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  5    Thread 0x79666377f180 (LWP 27469) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  6    Thread 0x796663f7f180 (LWP 27470) "opal_fifo" 0x000007e12cbf1a34 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  7    Thread 0x79666577f180 (LWP 27471) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  8    Thread 0x796664f7f180 (LWP 27472) "opal_fifo" 0x000007e12cbf1a30 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  9    Thread 0x79666477f180 (LWP 27473) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
(gdb) thread apply all bt

Thread 9 (Thread 0x79666477f180 (LWP 27473)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x000007e12cbf19d8 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:127
#2  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#3  0x0000796665aa8710 in start_thread (arg=0x79666477f180) at pthread_create.c:465
#4  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 8 (Thread 0x796664f7f180 (LWP 27472)):
#0  0x000007e12cbf1a30 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:137
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796664f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 7 (Thread 0x79666577f180 (LWP 27471)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x79666577f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 6 (Thread 0x796663f7f180 (LWP 27470)):
#0  0x000007e12cbf1a34 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:137
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796663f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 5 (Thread 0x79666377f180 (LWP 27469)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x79666377f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 4 (Thread 0x796662f7f180 (LWP 27468)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x000007e12cbf1a6c in opal_read_counted_pointer (value=0x796662f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:82
#2  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:138
#3  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#4  0x0000796665aa8710 in start_thread (arg=0x796662f7f180) at pthread_create.c:465
#5  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 3 (Thread 0x79666277f180 (LWP 27467)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:59
#1  0x000007e12cbf1afc in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:159
#2  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#3  0x0000796665aa8710 in start_thread (arg=0x79666277f180) at pthread_create.c:465
#4  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 2 (Thread 0x796661f7f180 (LWP 27466)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796661f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 1 (Thread 0x796665cb55c0 (LWP 27439)):
#0  0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8) at pthread_join.c:90
#1  0x000007e12cbf2964 in main (argc=1, argv=0x7fffeeb24448) at opal_fifo.c:227
(gdb)

(gdb) disassemble /s opal_atomic_rmb
Dump of assembler code for function opal_atomic_rmb:
../../opal/include/opal/sys/gcc_builtin/atomic.h:
59      {
   0x000007e12cbf115c <+0>:     std     r31,-8(r1)
   0x000007e12cbf1160 <+4>:     stdu    r1,-48(r1)
   0x000007e12cbf1164 <+8>:     mr      r31,r1

60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
   0x000007e12cbf1168 <+12>:    lwsync

61      }
   0x000007e12cbf116c <+16>:    nop
   0x000007e12cbf1170 <+20>:    addi    r1,r31,48
   0x000007e12cbf1174 <+24>:    ld      r31,-8(r1)
   0x000007e12cbf1178 <+28>:    blr
   0x000007e12cbf117c <+32>:    .long 0x0
   0x000007e12cbf1180 <+36>:    .long 0x0
   0x000007e12cbf1184 <+40>:    .long 0x1000180
End of assembler dump.
(gdb) disassemble /s opal_fifo_pop_atomic
Dump of assembler code for function opal_fifo_pop_atomic:
../../opal/class/opal_fifo.h:
119     {
   0x000007e12cbf1948 <+0>:     addis   r2,r12,2
   0x000007e12cbf194c <+4>:     addi    r2,r2,26040
   0x000007e12cbf1950 <+8>:     mflr    r0
   0x000007e12cbf1954 <+12>:    std     r0,16(r1)
   0x000007e12cbf1958 <+16>:    std     r31,-8(r1)
   0x000007e12cbf195c <+20>:    stdu    r1,-176(r1)
   0x000007e12cbf1960 <+24>:    mr      r31,r1
   0x000007e12cbf1964 <+28>:    std     r3,40(r31)
   0x000007e12cbf1968 <+32>:    ld      r9,-28688(r13)
   0x000007e12cbf196c <+36>:    std     r9,152(r31)
   0x000007e12cbf1970 <+40>:    li      r9,0

120         opal_list_item_t *item, *next, *ghost = &fifo->opal_fifo_ghost;
   0x000007e12cbf1974 <+44>:    ld      r9,40(r31)
   0x000007e12cbf1978 <+48>:    addi    r9,r9,80
   0x000007e12cbf197c <+52>:    std     r9,56(r31)

121         opal_counted_pointer_t head, tail;
122
123         opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
   0x000007e12cbf1980 <+56>:    ld      r9,40(r31)
   0x000007e12cbf1984 <+60>:    addi    r9,r9,48
   0x000007e12cbf1988 <+64>:    std     r9,80(r31)
   0x000007e12cbf198c <+68>:    addi    r9,r31,112
   0x000007e12cbf1990 <+72>:    std     r9,88(r31)

../../opal/class/opal_lifo.h:
81          value->data.counter = addr->data.counter;
   0x000007e12cbf1994 <+76>:    ld      r9,80(r31)
   0x000007e12cbf1998 <+80>:    ld      r10,0(r9)
   0x000007e12cbf199c <+84>:    ld      r9,88(r31)
   0x000007e12cbf19a0 <+88>:    std     r10,0(r9)

82          opal_atomic_rmb ();
   0x000007e12cbf19a4 <+92>:    bl      0x7e12cbf115c <opal_atomic_rmb>

83          value->data.item = addr->data.item;
   0x000007e12cbf19a8 <+96>:    ld      r9,80(r31)
   0x000007e12cbf19ac <+100>:   ld      r10,8(r9)
   0x000007e12cbf19b0 <+104>:   ld      r9,88(r31)
   0x000007e12cbf19b4 <+108>:   std     r10,8(r9)

../../opal/class/opal_fifo.h:
126             tail.value = fifo->opal_fifo_tail.value;
   0x000007e12cbf19b8 <+112>:   ld      r9,40(r31)
   0x000007e12cbf19bc <+116>:   addi    r9,r9,64
   0x000007e12cbf19c0 <+120>:   lxvd2x  vs0,0,r9
   0x000007e12cbf19c4 <+124>:   xxswapd vs12,vs0
   0x000007e12cbf19c8 <+128>:   addi    r9,r31,128
   0x000007e12cbf19cc <+132>:   xxswapd vs0,vs12
   0x000007e12cbf19d0 <+136>:   stxvd2x vs0,0,r9

127             opal_atomic_rmb ();
   0x000007e12cbf19d4 <+140>:   bl      0x7e12cbf115c <opal_atomic_rmb>

128
129             item = (opal_list_item_t *) head.data.item;
   0x000007e12cbf19d8 <+144>:   ld      r9,120(r31)
   0x000007e12cbf19dc <+148>:   std     r9,64(r31)

130             next = (opal_list_item_t *) item->opal_list_next;
   0x000007e12cbf19e0 <+152>:   ld      r9,64(r31)
   0x000007e12cbf19e4 <+156>:   ld      r9,40(r9)
   0x000007e12cbf19e8 <+160>:   std     r9,72(r31)

131
132             if (ghost == tail.data.item && ghost == item) {
   0x000007e12cbf19ec <+164>:   ld      r9,136(r31)
   0x000007e12cbf19f0 <+168>:   ld      r10,56(r31)
   0x000007e12cbf19f4 <+172>:   cmpd    cr7,r10,r9
   0x000007e12cbf19f8 <+176>:   bne     cr7,0x7e12cbf1a14 <opal_fifo_pop_atomic+204>
   0x000007e12cbf19fc <+180>:   ld      r10,56(r31)
   0x000007e12cbf1a00 <+184>:   ld      r9,64(r31)
   0x000007e12cbf1a04 <+188>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a08 <+192>:   bne     cr7,0x7e12cbf1a14 <opal_fifo_pop_atomic+204>

133                 return NULL;
   0x000007e12cbf1a0c <+196>:   li      r9,0
   0x000007e12cbf1a10 <+200>:   b       0x7e12cbf1b38 <opal_fifo_pop_atomic+496>

134             }
135
136             /* the head or next pointer are in an inconsistent state. keep looping. */
137             if (tail.data.item != item && ghost != tail.data.item && ghost == next) {
   0x000007e12cbf1a14 <+204>:   ld      r9,136(r31)
   0x000007e12cbf1a18 <+208>:   ld      r10,64(r31)
   0x000007e12cbf1a1c <+212>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a20 <+216>:   beq     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>
   0x000007e12cbf1a24 <+220>:   ld      r9,136(r31)
   0x000007e12cbf1a28 <+224>:   ld      r10,56(r31)
   0x000007e12cbf1a2c <+228>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a30 <+232>:   beq     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>
   0x000007e12cbf1a34 <+236>:   ld      r10,56(r31)
   0x000007e12cbf1a38 <+240>:   ld      r9,72(r31)
   0x000007e12cbf1a3c <+244>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a40 <+248>:   bne     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>

138                 opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
   0x000007e12cbf1a44 <+252>:   ld      r9,40(r31)
   0x000007e12cbf1a48 <+256>:   addi    r9,r9,48
   0x000007e12cbf1a4c <+260>:   std     r9,96(r31)
   0x000007e12cbf1a50 <+264>:   addi    r9,r31,112
   0x000007e12cbf1a54 <+268>:   std     r9,104(r31)

../../opal/class/opal_lifo.h:
81          value->data.counter = addr->data.counter;
   0x000007e12cbf1a58 <+272>:   ld      r9,96(r31)
   0x000007e12cbf1a5c <+276>:   ld      r10,0(r9)
   0x000007e12cbf1a60 <+280>:   ld      r9,104(r31)
   0x000007e12cbf1a64 <+284>:   std     r10,0(r9)

82          opal_atomic_rmb ();
   0x000007e12cbf1a68 <+288>:   bl      0x7e12cbf115c <opal_atomic_rmb>

83          value->data.item = addr->data.item;
   0x000007e12cbf1a6c <+292>:   ld      r9,96(r31)
   0x000007e12cbf1a70 <+296>:   ld      r10,8(r9)
   0x000007e12cbf1a74 <+300>:   ld      r9,104(r31)
   0x000007e12cbf1a78 <+304>:   std     r10,8(r9)

../../opal/class/opal_fifo.h:
139                 continue;
   0x000007e12cbf1a7c <+308>:   b       0x7e12cbf1aa8 <opal_fifo_pop_atomic+352>

140             }
141
142             /* try popping the head */
143             if (opal_update_counted_pointer (&fifo->opal_fifo_head, &head, next)) {
   0x000007e12cbf1a80 <+312>:   ld      r9,40(r31)
   0x000007e12cbf1a84 <+316>:   addi    r9,r9,48
   0x000007e12cbf1a88 <+320>:   addi    r10,r31,112
   0x000007e12cbf1a8c <+324>:   ld      r5,72(r31)
   0x000007e12cbf1a90 <+328>:   mr      r4,r10
   0x000007e12cbf1a94 <+332>:   mr      r3,r9
   0x000007e12cbf1a98 <+336>:   bl      0x7e12cbf1710 <opal_update_counted_pointer+8>
   0x000007e12cbf1a9c <+340>:   mr      r9,r3
   0x000007e12cbf1aa0 <+344>:   cmpdi   cr7,r9,0
   0x000007e12cbf1aa4 <+348>:   bne     cr7,0x7e12cbf1aac <opal_fifo_pop_atomic+356>

126             tail.value = fifo->opal_fifo_tail.value;
   0x000007e12cbf1aa8 <+352>:   b       0x7e12cbf19b8 <opal_fifo_pop_atomic+112>

144                 break;
   0x000007e12cbf1aac <+356>:   nop

145             }
146         } while (1);
147
148         opal_atomic_wmb ();
   0x000007e12cbf1ab0 <+360>:   bl      0x7e12cbf1188 <opal_atomic_wmb>

149
150         /* check for tail and head consistency */
151         if (ghost == next) {
   0x000007e12cbf1ab4 <+364>:   ld      r10,56(r31)
   0x000007e12cbf1ab8 <+368>:   ld      r9,72(r31)
   0x000007e12cbf1abc <+372>:   cmpd    cr7,r10,r9
   0x000007e12cbf1ac0 <+376>:   bne     cr7,0x7e12cbf1b28 <opal_fifo_pop_atomic+480>

152             /* the head was just set to &fifo->opal_fifo_ghost. try to update the tail as well */
153             if (!opal_update_counted_pointer (&fifo->opal_fifo_tail, &tail, ghost)) {
   0x000007e12cbf1ac4 <+380>:   ld      r9,40(r31)
   0x000007e12cbf1ac8 <+384>:   addi    r9,r9,64
   0x000007e12cbf1acc <+388>:   addi    r10,r31,128
   0x000007e12cbf1ad0 <+392>:   ld      r5,56(r31)
   0x000007e12cbf1ad4 <+396>:   mr      r4,r10
   0x000007e12cbf1ad8 <+400>:   mr      r3,r9
   0x000007e12cbf1adc <+404>:   bl      0x7e12cbf1710 <opal_update_counted_pointer+8>
   0x000007e12cbf1ae0 <+408>:   mr      r9,r3
   0x000007e12cbf1ae4 <+412>:   xori    r9,r9,1
   0x000007e12cbf1ae8 <+416>:   clrlwi  r9,r9,24
   0x000007e12cbf1aec <+420>:   cmpdi   cr7,r9,0
   0x000007e12cbf1af0 <+424>:   beq     cr7,0x7e12cbf1b28 <opal_fifo_pop_atomic+480>

154                 /* tail was changed by a push operation. wait for the item's next pointer to be se then
155                  * update the head */
156
157                 /* wait for next pointer to be updated by push */
158                 while (ghost == item->opal_list_next) {
   0x000007e12cbf1af4 <+428>:   b       0x7e12cbf1afc <opal_fifo_pop_atomic+436>

159                     opal_atomic_rmb ();
   0x000007e12cbf1af8 <+432>:   bl      0x7e12cbf115c <opal_atomic_rmb>

158                 while (ghost == item->opal_list_next) {
   0x000007e12cbf1afc <+436>:   ld      r9,64(r31)
   0x000007e12cbf1b00 <+440>:   ld      r9,40(r9)
   0x000007e12cbf1b04 <+444>:   ld      r10,56(r31)
   0x000007e12cbf1b08 <+448>:   cmpd    cr7,r10,r9
   0x000007e12cbf1b0c <+452>:   beq     cr7,0x7e12cbf1af8 <opal_fifo_pop_atomic+432>

160                 }
161
162                 opal_atomic_rmb ();
   0x000007e12cbf1b10 <+456>:   bl      0x7e12cbf115c <opal_atomic_rmb>

163
164                 /* update the head with the real next value. note that no other thread
165                  * will be attempting to update the head until after it has been updated
166                  * with the next pointer. push will not see an empty list and other pop
167                  * operations will loop until the head is consistent. */
168                 fifo->opal_fifo_head.data.item = (opal_list_item_t *) item->opal_list_next;
   0x000007e12cbf1b14 <+460>:   ld      r9,64(r31)
   0x000007e12cbf1b18 <+464>:   ld      r10,40(r9)
   0x000007e12cbf1b1c <+468>:   ld      r9,40(r31)
   0x000007e12cbf1b20 <+472>:   std     r10,56(r9)

169                 opal_atomic_wmb ();
   0x000007e12cbf1b24 <+476>:   bl      0x7e12cbf1188 <opal_atomic_wmb>

170             }
171         }
172
173         item->opal_list_next = NULL;
   0x000007e12cbf1b28 <+480>:   ld      r9,64(r31)
   0x000007e12cbf1b2c <+484>:   li      r10,0
   0x000007e12cbf1b30 <+488>:   std     r10,40(r9)

174
175         return item;
   0x000007e12cbf1b34 <+492>:   ld      r9,64(r31)

176     }
   0x000007e12cbf1b38 <+496>:   mr      r3,r9
   0x000007e12cbf1b3c <+500>:   ld      r9,152(r31)
   0x000007e12cbf1b40 <+504>:   ld      r10,-28688(r13)
   0x000007e12cbf1b44 <+508>:   cmpld   cr7,r9,r10
   0x000007e12cbf1b48 <+512>:   li      r9,0
   0x000007e12cbf1b4c <+516>:   li      r10,0
   0x000007e12cbf1b50 <+520>:   beq     cr7,0x7e12cbf1b5c <opal_fifo_pop_atomic+532>
   0x000007e12cbf1b54 <+524>:   bl      0x7e12cbf0f00 <00000018.plt_call.__stack_chk_fail@@GLIBC_2.17>
   0x000007e12cbf1b58 <+528>:   ld      r2,24(r1)
   0x000007e12cbf1b5c <+532>:   addi    r1,r31,176
   0x000007e12cbf1b60 <+536>:   ld      r0,16(r1)
   0x000007e12cbf1b64 <+540>:   mtlr    r0
   0x000007e12cbf1b68 <+544>:   ld      r31,-8(r1)
   0x000007e12cbf1b6c <+548>:   blr
   0x000007e12cbf1b70 <+552>:   .long 0x0
   0x000007e12cbf1b74 <+556>:   .long 0x1000000
   0x000007e12cbf1b78 <+560>:   .long 0x1000180
End of assembler dump.

Note: gdb threads 2 and 4-9 are all caught in the loop at opal_fifo.h lines 126-139 (I had gdb display some of the noteworthy variable values):

(gdb)
(gdb) step
126             tail.value = fifo->opal_fifo_tail.value;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
127             opal_atomic_rmb ();
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb)
61      }
(gdb)
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
129             item = (opal_list_item_t *) head.data.item;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
130             next = (opal_list_item_t *) item->opal_list_next;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
132             if (ghost == tail.data.item && ghost == item) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
137             if (tail.data.item != item && ghost != tail.data.item && ghost == next) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
138                 opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
opal_read_counted_pointer (value=0x796661f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:81
81          value->data.counter = addr->data.counter;
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
82          opal_atomic_rmb ();
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb)
61      }
(gdb)
opal_read_counted_pointer (value=0x796661f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:83
83          value->data.item = addr->data.item;
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:139
139                 continue;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
126             tail.value = fifo->opal_fifo_tail.value;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
(gdb) info thread
  Id   Target Id         Frame
  1    Thread 0x796665cb55c0 (LWP 27439) "opal_fifo" 0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8)
    at pthread_join.c:90
  2    Thread 0x796661f7f180 (LWP 27466) "opal_fifo" 0x000007e12cbf19f0 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:132
* 3    Thread 0x79666277f180 (LWP 27467) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:158
  4    Thread 0x796662f7f180 (LWP 27468) "opal_fifo" 0x000007e12cbf1a40 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  5    Thread 0x79666377f180 (LWP 27469) "opal_fifo" opal_read_counted_pointer (value=0x79666377e590, addr=0x7fffeeb23f70)
    at ../../opal/class/opal_lifo.h:83
  6    Thread 0x796663f7f180 (LWP 27470) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  7    Thread 0x79666577f180 (LWP 27471) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  8    Thread 0x796664f7f180 (LWP 27472) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  9    Thread 0x79666477f180 (LWP 27473) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
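
To make the stepping trace easier to follow, the two checks at opal_fifo.h lines 132 and 137 can be evaluated with the values gdb displayed (`ghost`, `item`, and `next` all `0x7fffeeb23f90`, `tail.data.item` `0x7e168eb8610`). The sketch below does only that. The function names are hypothetical and the pointers are treated as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the gdb display lines in the trace. */
#define GHOST      UINT64_C(0x7fffeeb23f90)
#define TAIL_ITEM  UINT64_C(0x7e168eb8610)

/* Line 132: the "fifo is empty, return NULL" exit. */
static bool is_empty(uint64_t ghost, uint64_t tail_item, uint64_t item)
{
    return ghost == tail_item && ghost == item;
}

/* Line 137: the "inconsistent state, re-read head and retry" branch. */
static bool must_retry(uint64_t ghost, uint64_t tail_item,
                       uint64_t item, uint64_t next)
{
    return tail_item != item && ghost != tail_item && ghost == next;
}

/* With the observed values (item == next == ghost), the empty exit is
 * not taken and the retry branch is, so the loop goes around again. */
static void check_observed_state(void)
{
    assert(!is_empty(GHOST, TAIL_ITEM, GHOST));
    assert(must_retry(GHOST, TAIL_ITEM, GHOST, GHOST));
}
```

Each retry re-reads `fifo->opal_fifo_head`, which in the trace keeps returning counter 13038893 with the item equal to the ghost. Nothing changes between iterations unless some other thread updates the head or the tail.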

Note: Thread 3 is spinning a bit farther down, in the wait loop at opal_fifo.h lines 158-159:

(gdb) thread 3
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:158
158                 while (ghost == item->opal_list_next) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7e168eb8610
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038892, item = 0x7e168eb8610}, value = 0x000007e168eb86100000000000c6f52c}
28: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb) step
159                     opal_atomic_rmb ();
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7e168eb8610
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038892, item = 0x7e168eb8610}, value = 0x000007e168eb86100000000000c6f52c}
28: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb) step
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb) step
61      }
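
In the display above, the wait condition `ghost == item->opal_list_next` stays true on every pass, so Thread 3 never leaves the loop. As a debugging aid only, a bounded variant of such a wait can report that the expected update never arrived instead of spinning silently. Here is a minimal sketch, assuming C11 atomics. The helper `wait_until_changed` and its `max_spins` parameter are hypothetical and are not a proposed fix.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Poll *slot until it differs from sentinel, giving up after max_spins
 * polls. Returns true if the value changed, false if the limit was hit. */
static bool wait_until_changed(_Atomic(void *) *slot, void *sentinel,
                               size_t max_spins)
{
    assert(slot != NULL);
    for (size_t i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(slot, memory_order_acquire) != sentinel) {
            return true;
        }
    }
    return false;
}
```

A debug build could log the head and tail values when this returns false, which would capture the inconsistent state at the moment it is first detected.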

Is there any additional information that I could collect that would be helpful to diagnose the issue?
