Test failures: opal_fifo : test fix #2526

Closed
amckinstry opened this issue Dec 6, 2016 · 14 comments
Comments

@amckinstry

The opal_fifo test is failing on Debian for the git master snapshot of November 25.
The test previously passed on 2.0.1 systems. It works on most architectures, but fails on kfreebsd-i386 (kfreebsd-amd64 works) and ppc64el:

https://buildd.debian.org/status/package.php?p=openmpi&suite=experimental

See, e.g.

https://buildd.debian.org/status/fetch.php?pkg=openmpi&arch=kfreebsd-i386&ver=2.0.2%7Egit.20161225-2&stamp=1480937654

Any ideas? I'm setting up a test system to grab the logs and debug.

@jsquyres jsquyres added the bug label Dec 7, 2016
@jsquyres
Member

jsquyres commented Dec 7, 2016

@amckinstry Unfortunately I can't tell much from that detailed link -- it just shows that it failed, but not why. Can you share opal_fifo.log, and/or a backtrace from a core dump?

@amckinstry
Author

I was kinda hoping you'd seen this before, and could just point me at a patch :-)

The build systems hadn't kept opal_fifo.log, so I've needed to set up a ppc64el environment by hand.

It's hanging in opal_fifo.c:200, on pthread_join().
No errors, which doesn't help; just the hang. There should be 8 pthreads, but it's hanging on the first call to pthread_join().
2.0.1 worked, and there's been no change to opal_fifo.c. I've just repeated the run with 2.0.1 and it's fine, so nothing in the environment (outside openmpi) has changed.

In gdb, I can see the threads (8 + pthread master). They're in:

  Id   Target Id         Frame
* 1    Thread 0x3fffa7ab83a0 (LWP 19218) "opal_fifo" 0x00003fffa77797b8 in pthread_join () from /lib/powerpc64le-linux-gnu/libpthread.so.0
  2    Thread 0x3fffa751f190 (LWP 19232) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130
  3    Thread 0x3fffa6d1f190 (LWP 19233) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130
  4    Thread 0x3fffa651f190 (LWP 19234) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130
  5    Thread 0x3fffa5d1f190 (LWP 19235) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:123
  6    Thread 0x3fffa551f190 (LWP 19236) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:135
  7    Thread 0x3fffa4d1f190 (LWP 19237) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130
  8    Thread 0x3fffa451f190 (LWP 19238) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130
  9    Thread 0x3fffa3d1f190 (LWP 19239) "opal_fifo" opal_fifo_pop_atomic (fifo=<optimized out>) at ../../opal/class/opal_fifo.h:130

and 135 looks suspicious:

134	        /* the head or next pointer are in an inconsistent state. keep looping. */
135	        if (tail.data.item != item && &fifo->opal_fifo_ghost != tail.data.item &&
136	            &fifo->opal_fifo_ghost == next) {
137	            continue;
138	        }

Interrupting shows it's looping around the do{} loop in opal_fifo_pop_atomic() indefinitely.

@jsquyres
Member

jsquyres commented Dec 7, 2016

@hjelmn @bosilca Did something change with regard to atomics or other underpinnings of opal_fifo that could cause problems with optimization on some platforms? This is on v2.0.x HEAD.

@amckinstry
Author

848218.txt

Patch from Thibaut Paumard for this bug.

@amckinstry amckinstry changed the title from "Test failures: opal_fifo on Debian" to "Test failures: opal_fifo : test fix" Dec 15, 2016
@jsquyres
Member

@amckinstry I'm sorry, I don't think that patch is correct. I note that thread_test() is invoked three times in the test:

  1. Called directly from main
  2. Invoked via pthread_create()
  3. Invoked again via pthread_create()

Making thread_test() exit via pthread_exit() will cause the entire test to exit in the first call to thread_test() (because it causes the main thread to exit). I'm guessing that this just causes the test to exit early, before the failure occurs.

Plus, returning from the function invoked from pthread_create() is semantically equivalent to calling pthread_exit(), so that change resulted in no difference to the remaining two instances.
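
To illustrate the equivalence described above, here is a minimal sketch (not code from the Open MPI test; `worker_return`, `worker_exit`, and `run_and_join` are hypothetical names made up for this example). Both workers hand the same value to pthread_join(), whether they return from the start routine or call pthread_exit():

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical worker: finishes by returning from its start routine. */
void *worker_return(void *arg)
{
    (void) arg;
    return (void *) (intptr_t) 7;
}

/* Hypothetical worker: finishes by calling pthread_exit() explicitly. */
void *worker_exit(void *arg)
{
    (void) arg;
    pthread_exit((void *) (intptr_t) 7);
}

/* Run a start routine in a new thread and return the value that
 * pthread_join() reports for it, or -1 if create/join fails. */
int run_and_join(void *(*start)(void *))
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, start, NULL) != 0) {
        return -1;
    }
    if (pthread_join(tid, &result) != 0) {
        return -1;
    }
    return (int) (intptr_t) result;
}
```

Calling `run_and_join(worker_return)` and `run_and_join(worker_exit)` should both yield 7. The difference only matters when the function is called directly on the main thread, where pthread_exit() ends that thread instead of returning to the caller.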

@amckinstry
Author

Agreed, the patch is not correct. Still need to figure out the answer.

@hjelmn
Member

hjelmn commented Jan 31, 2017

There was a regression in PPC atomics. Should be fixed in the latest 2.0.2 release candidate. Please test.
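
For readers wondering why an atomics regression would show up as a hang rather than a crash: lock-free code typically retries a compare-and-swap until it succeeds, so a misbehaving primitive can leave threads spinning forever. The sketch below is only an illustration using C11 `<stdatomic.h>`, not Open MPI's own atomics or opal_fifo code; `shared_counter`, `cas_worker`, and `run_cas_counter` are hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>

enum { NUM_THREADS = 8, INCREMENTS = 10000 };

/* Counter shared by all workers (hypothetical, for illustration only). */
atomic_long shared_counter = 0;

/* Each worker bumps the counter with a compare-and-swap retry loop. */
void *cas_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < INCREMENTS; ++i) {
        long seen = atomic_load(&shared_counter);
        /* On failure the call refreshes `seen` with the current value,
         * so the loop retries until this thread's update wins. If the
         * underlying atomic never succeeded, this would spin forever. */
        while (!atomic_compare_exchange_weak(&shared_counter, &seen, seen + 1)) {
            /* retry */
        }
    }
    return NULL;
}

/* Start the workers, join them, and return the final count
 * (or -1 if a thread could not be created). */
long run_cas_counter(void)
{
    pthread_t tids[NUM_THREADS];
    int started = 0;

    atomic_store(&shared_counter, 0);
    for (int i = 0; i < NUM_THREADS; ++i) {
        if (pthread_create(&tids[i], NULL, cas_worker, NULL) != 0) {
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(tids[i], NULL);
    }
    return (started == NUM_THREADS) ? atomic_load(&shared_counter) : -1;
}
```

With working atomics, `run_cas_counter()` should return NUM_THREADS * INCREMENTS. A join that never returns, as in the gdb output earlier in this thread, is what a retry loop that cannot make progress would look like from the main thread.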

@AdamWill

I am seeing the same test - test_fifo - hang on ppc64le during attempts to build openmpi 2.1.1 for Fedora. Every other arch completes successfully, but ppc64le just hangs after test_lifo passes, and eventually the job is killed when it reaches the build system's timeout.

@opoplawski
Contributor

FWIW - This still occurs with 2.1.6rc1

@jsquyres
Member

@opoplawski Well that's disappointing. Is this also happening with 3.0.3, 3.1.3, and/or the latest nightly 4.0.1 snapshot?

@opoplawski
Contributor

I'm not seeing it with 3.1.3 or with 4.0.0 - so that's good.

@jsquyres
Member

Ok, great.

@hjelmn This implies that we're still missing an atomic fix from the v2.x branch...?

@hjelmn
Member

hjelmn commented Nov 29, 2018

Maybe. Could be an opal_fifo_t fix that is missing.

@bosilca
Member

bosilca commented May 8, 2019

I don't think anybody will backport the new atomic operations into 2.x. This ticket can be closed.

@bosilca bosilca closed this as completed May 8, 2019