IBM MTT "make check" fail #2966

Closed
jsquyres opened this issue Feb 14, 2017 · 29 comments

@jsquyres
Member

@gpaulsen @jjhursey IBM is getting an MTT make check fail: https://mtt.open-mpi.org/index.php?do_redir=2391

The error is that the opal_fifo test is failing in 32 bit POWER on master.

@jjhursey
Member

Duplicate of Issue #1893 - we need to investigate. It's intermittent so hard to pin down.

Here is the output from test-suite.log in case it helps:

=====================================================================
   Open MPI master-201702120422-81e57bb: test/class/test-suite.log
=====================================================================

# TOTAL: 10
# PASS:  9
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: opal_fifo
===============

Exhaustive atomics thread finished. Popped 168490 items. Time: 6 s 451012 us 38287 nsec/poppush
Exhaustive atomics thread finished. Popped 170672 items. Time: 6 s 498624 us 38076 nsec/poppush
Exhaustive atomics thread finished. Popped 170704 items. Time: 6 s 555723 us 38404 nsec/poppush
Exhaustive atomics thread finished. Popped 171759 items. Time: 6 s 559438 us 38189 nsec/poppush
Exhaustive atomics thread finished. Popped 167076 items. Time: 6 s 625385 us 39654 nsec/poppush
Exhaustive atomics thread finished. Popped 167749 items. Time: 6 s 661287 us 39709 nsec/poppush
Exhaustive atomics thread finished. Popped 151657 items. Time: 6 s 659123 us 43909 nsec/poppush
Exhaustive atomics thread finished. Popped 149946 items. Time: 6 s 666284 us 44457 nsec/poppush
 Failure :  fifo push/pop multi-threaded with atomics when there are insufficient items
 Failure :  fifo pop all items
SUPPORT: OMPI Test failed: opal_fifo_t (2 of 8 failed)
Single thread test. Time: 0 s 71404 us 71 nsec/poppush
Atomics thread finished. Time: 0 s 124363 us 124 nsec/poppush
Atomics thread finished. Time: 8 s 346324 us 8346 nsec/poppush
Atomics thread finished. Time: 8 s 362706 us 8362 nsec/poppush
Atomics thread finished. Time: 8 s 398374 us 8398 nsec/poppush
Atomics thread finished. Time: 8 s 403718 us 8403 nsec/poppush
Atomics thread finished. Time: 8 s 600248 us 8600 nsec/poppush
Atomics thread finished. Time: 8 s 658165 us 8658 nsec/poppush
Atomics thread finished. Time: 8 s 870449 us 8870 nsec/poppush
Atomics thread finished. Time: 8 s 859184 us 8859 nsec/poppush
All threads finished. Thread count: 8 Time: 8 s 874685 us 1109 nsec/poppush
All threads finished. Thread count: 8 Time: 6 s 671116 us 833 nsec/poppush
FAIL opal_fifo (exit status: 1)
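
For anyone checking the timings above: the nsec/poppush figure on each "Exhaustive atomics thread finished" line appears to be the elapsed time divided by the number of items popped, truncated to an integer. The Python sketch below recomputes it from a log line. The regular expression and the function name `nsec_per_poppush` are illustrative assumptions and are not part of the Open MPI test suite.

```python
import re

# Matches lines such as:
#   Exhaustive atomics thread finished. Popped 168490 items. Time: 6 s 451012 us 38287 nsec/poppush
LINE_RE = re.compile(
    r"Popped (?P<items>\d+) items\. "
    r"Time: (?P<sec>\d+) s (?P<usec>\d+) us (?P<reported>\d+) nsec/poppush"
)


def nsec_per_poppush(line):
    """Return (recomputed, reported) ns per pop/push, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # Convert "S s U us" to nanoseconds, then divide by the item count.
    total_ns = (int(m["sec"]) * 1_000_000 + int(m["usec"])) * 1000
    return total_ns // int(m["items"]), int(m["reported"])


sample = (
    "Exhaustive atomics thread finished. Popped 168490 items. "
    "Time: 6 s 451012 us 38287 nsec/poppush"
)
print(nsec_per_poppush(sample))  # prints (38287, 38287)
```

The recomputed and reported values agree for the lines in this log, which suggests the threads that popped fewer items (about 150k instead of about 170k) simply saw a higher per-operation cost over roughly the same wall time.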

@jjhursey
Member

jjhursey commented May 9, 2017

PR #3468 might be related (might fix this issue - need to check)

@jsquyres
Member Author

jsquyres commented May 9, 2017

@jjhursey Let us know if #3468 fixes the issue -- we have corresponding PRs for v2.0.x, v2.1.x, and v3.0.x.

@jjhursey
Member

jjhursey commented May 9, 2017

😞 That PR does not seem to fix the make check issue. We'll have to keep investigating.

@jjhursey
Member

Instead of failing, the test now hangs. I have to kill it manually because it would otherwise hang the MTT runs; there is no timeout mechanism in make check. Here is a link to an MTT failure:

I suspect that the change made to fix Issue #3450 turned this failure into a hang.
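
Since make check has no timeout of its own, one possible workaround is to run it under an external watchdog. The Python sketch below is a minimal illustration under assumptions: the function name `run_with_timeout` and the 30-minute limit in the usage note are hypothetical and are not part of MTT or Open MPI. It starts the command in its own session so that, on POSIX systems, the whole process group (including a hung test binary spawned by make) can be killed.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds."""
    # start_new_session=True makes the child a process-group leader,
    # so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the whole group, not just the direct child, so that
        # test binaries started by make do not linger.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return None
```

A call such as `run_with_timeout(["make", "check"], 30 * 60)` would then return `None` on a hang instead of blocking the run indefinitely.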

@jsquyres
Member Author

@bosilca @hjelmn It looks like the recent changes to opal_fifo just changed the failure mode for IBM. Can you have a look at Josh's comment (#2966 (comment))?

@bosilca
Member

bosilca commented May 11, 2017

First and foremost, the failure is on ppc64le, and I have no access to such an architecture. What is puzzling is that I was under the impression that on PPC we were using atomic load-linked/store-conditional (via OPAL_HAVE_ATOMIC_LLSC_PTR), but apparently this is not the case.

I ran 10k tests on different Intel architectures with OMPI compiled in optimized mode, and I had no failures. Not really sure how to approach this issue.

@jsquyres
Member Author

@jjhursey Looks like we're going to need some IBM help on this one...

@jjhursey
Member

Yeah, that's fine. I can investigate, but I'm not back full time yet, so it'll probably be a little while before I can get eyes on this. I just wanted to keep this issue updated.
The hang mode is helpful, though, since I can attach a debugger and poke around a bit.

@gpaulsen
Member

I'm going to ask @nysal to drive this from our side, since Josh is apparently skydiving.

@jjhursey
Member

There was a request to re-assess this ticket after PR #3661 - I did so and the problem still persists. We'll continue to investigate.

@gpaulsen
Member

I thought we'd agreed not to support 32-bit anymore in master.

@jjhursey
Member

I can reproduce this on the default 64-bit build. I think the 32-bit in the original message was an MTT config mistake on our end.

@hjelmn
Member

hjelmn commented Jul 5, 2017

It's possible the LL/SC fifo implementation has a bug. I will take a look this week and see if there is anything obvious.

@nmorey
Contributor

nmorey commented Jul 7, 2017

I also see this problem pop up once in a while with SUSE openmpi2 packaging, on ppc64le only.

@jjhursey
Member

Ref: PR #3988, Issue #3697

@jjhursey
Member

Ref: PR #2526

@AdamWill

I am seeing the same test - test_fifo - hang on ppc64le during attempts to build openmpi 2.1.1 for Fedora. Every other arch completes successfully, but ppc64le just hangs after test_lifo passes, and eventually the job is killed when it reaches the build system's timeout.

@opoplawski
Contributor

test_fifo still hangs in Fedora on ppc64le with 2.1.6rc1

@hjelmn
Member

hjelmn commented Nov 29, 2018

Ok. Will dig deeper tomorrow. I know we have this fixed on master. Need to see what could be missing.

@jjhursey
Member

@awlauria We were talking about this the other day. Is this resolved now?

We activated 'make check' for IBM's MTT runs. From last night's run it looks like PGI might still have a problem here with the v4.1.x branch.

@awlauria
Contributor

Yeah, MTT hasn't run yet for 5.0.x and master last night, so we'll see where we stand then. So far PGI on the v4.1.x branch is the only failure.

@awlauria
Contributor

#8649 may resolve this.

@jjhursey
Member

Now that #8649 is merged, are the make check issues resolved?

@awlauria
Contributor

I need to cherry-pick that back to v4 and v5, but yeah I think we're probably good here.

@awlauria
Contributor

v4.0.x: #8709
v4.1.x: #8708
v5.0.x: #8710

@awlauria
Contributor

Note that the PR for v4.0.x was rejected, so we should consider adding --disable-builtin-atomics to our v4.0.x MTT builds.

@awlauria
Contributor

awlauria commented Mar 26, 2021

Our internal (IBM) MTT has been updated to build v4.0.x with --disable-builtin-atomics

@awlauria
Contributor

Closing this; we haven't seen an MTT make check failure recently. Will re-open if it pops up again.
