Skip to content

config: re-enable GCC inline ASM check for PGI #2048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 6, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Sep 2, 2016

We disabled this support a long time ago. Probably safe to assume
whatever bug we were working around no longer exists.

Closes #2044

Signed-off-by: Nathan Hjelm [email protected]

We disabled this support a long time ago. Probably safe to assume
whatever bug we were working around no longer exists.

Closes open-mpi#2044

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove This should fix the link issue with PGI 16.x.

@PHHargrove
Copy link
Member

@hjelmn I will try that patch out on the PGI 16.x for PPC64 system when I am able.

However, I am not sure I agree with the assumption that the problem seen in 2008 is no longer an issue. So, I'd like to find time to test PGI 12, 13, 14 and 15 on x86-64 as well. IIRC, anything older then that won't support the minimum C99 feature set required by OMPI. If you are waiting for a review from me, you won't get it until that testing can be completed.

-Paul

@jjhursey
Copy link
Member

jjhursey commented Sep 2, 2016

What versions of the PGI compiler does Open MPI plan to support? It might be worth testing some older versions to make sure this ASM protection wasn't important there. If so then we should make the configure logic a bit more complete (and document which versions it applies to).

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@jjhursey If Paul is correct then we should probably say we require PGI 12 or newer. If all the known asm bugs were indeed fixed in 10.8 then this is the complete fix. We should re-open the discussion of killing the .asm files if PGI is indeed working with inline asm.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

Going to run this branch with various versions from 10.9 -> 15.10 on x86_64.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

PGI 10.9 - Pass... Well mostly. The asm tests pass but I get this when building libopen-pal.la:

pgcc-Error-Unknown switch: -pthread

Its unrelated to this PR though.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

PGI 12.10 - ASM Pass

@PHHargrove
Copy link
Member

PGI 10.9 - Pass
Well, that is a surprise (to me anyway) independent of the ASM.

Could we also get ia32 (via -m32) tested w/ as many of the available PGI versions as possible? The 64-bit CAS code for ia32 has been particularly "tough" on compiler's support for inline ASM since it uses pretty much all of the available registers. It was responsible for several of the asm bugs (PGI and otherwise) we found in various compilers when building GASNet's atomics.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove Sure. Will add that to the list. Hoping to figure out this damn -pthreads error though so I can throw the opal_lifo/opal_fifo tests at it as well. Have no idea where the switch is coming from. It isn't in the Makefile (configure correctly finds -pthread does not work).

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

PGI 12.6 - ASM Pass

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

PGI 12.6 ia32 - ASM Pass, lifo pass, fifo pass

@PHHargrove
Copy link
Member

PGI 16.7 for PPC64 (the original motivation) FAILS make check on this branch (detail below).
I suspect PGI's handing of GCC-style inline asm for PPC is less than perfect this early in its lifetime.

If all the amd64 and ia32 tests are clear, we still cannot claim "Closes #2044".
However, perhaps the configury should continue to blacklist PGI, but only for PPC64.
That is still a step out of the dark ages of 2008.

Some details:

../../../config/test-driver: line 107: 21663 Segmentation fault      (core dumped) "$@" > $log_file 2>&1
FAIL: opal_lifo
../../../config/test-driver: line 107: 21687 Segmentation fault      (core dumped) "$@" > $log_file 2>&1
FAIL: opal_fifo

Respective gdb runs on the core files:

Core was generated by `/home/ubuntu/ompi/BLD/test/class/.libs/lt-opal_lifo '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000010001ff8 in opal_lifo_pop_atomic (lifo=0x3fffdffc76b8) at ../../../opal/class/opal_lifo.h:223
223             next = (opal_list_item_t *) item->opal_list_next;
(gdb) print *item
Cannot access memory at address 0x30d34590
Core was generated by `/home/ubuntu/ompi/BLD/test/class/.libs/lt-opal_lifo '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000010001ff8 in opal_obj_new_debug (type=0x30d34590,
    file=0x10030d34540 "\355\276\257\336\355\276\257\336\320c\027\217\377?", line=256)
    at ../../../opal/class/opal_object.h:265
265         object->cls_init_file_name = file;

(gdb) where
#0  0x0000000010001ff8 in opal_obj_new_debug (type=0x30d34590,
    file=0x10030d34540 "\355\276\257\336\355\276\257\336\320c\027\217\377?", line=256)
    at ../../../opal/class/opal_object.h:265
#1  0x00003fffdffc7500 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove Could also be bugs in our inline asm. Doubtful though since it works with gcc and xlc.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

Running list:

PGI 10.9 x86_64 - Pass (asm, fifo, lifo)
PGI 10.9 ia32 - Pass (asm, fifo, lifo)
PGI 12.6 x86_64 - Pass (asm, lifo, fifo)
PGI 12.6 ia32 - Pass (asm, lifo, fifo)
PGI 12.10 x86_64 - Pass (asm, lifo, fifo)
PGI 12.10 ia32 - Pass (asm, fifo, lifo)
PGI 13.2 x86_64 - Pass (asm, fifo, lifo)
PGI 13.2 ia32 - Pass (asm, fifo, lifo)
PGI 13.7 x86_64 - Pass (asm, lifo, fifo)
PGI 13.7 ia32 - Pass (asm, lifo, fifo)
PGI 13.10 x86_64 - Pass (asm, lifo, fifo)
PGI 13.10 ia32 - Pass (asm, lifo, fifo)
PGI 14.10 x86_64 - Pass (asm, fifo, lifo)
PGI 14.10 ia32 - X - Can't test
PGI 15.10 x86_64 - Pass (asm, fifo, lifo)
PGI 15.10 ia32 - X - Can't test
PGI 16.7 ppc64 - Pass (asm), Fail (lifo, fifo)

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove I will look at the generated assembly from PGI on the ppc64 VM. Might be able to figure out why it is failing. Could be Monday before I get a chance though.

@PHHargrove
Copy link
Member

PGI 16.7 ppc64 - Pass? (asm), Fail (lifo, fifo)

"make all" finishes.
"make check" at top level gets as far as the failing lifo and fifo tests.
However, before that point the "tests/asm" directory passed all tests.

Let me know if something else is needed to resolve the "?" in that line.

@PHHargrove
Copy link
Member

I will look at the generated assembly from PGI on the ppc64 VM. Might be able to figure out why it is failing. Could be Monday before I get a chance though.

Can we verify today that you are able to login to the VM?

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove Thats all i needed to resolve the ?. VM login looks good.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove The problem could be with the ll/sc atomics. I might have a typo in the clobbers or something. Will test a couple of things and see what shakes out. If it is a PGI bug we can file it and re-enable PGI ppc64 inline asm when they fix it.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

Ok, the full suite of versions I have are tested and working. I just need to figure out ppc64 now.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

Hmm, looks like on our testbed we have PGI 15.7, 16.1, and 16.5 for x86_64. Can test those next week if needed.

@PHHargrove
Copy link
Member

@hjelmn

Heads-up that you probably need --enable-debug to reproduce the problem on ppc64:

My initial ppc64 runs were with --enable-debug.
I have now tested w/o that flag and PASS all the tests (asm, lifo, fifo).
I would have expected the opposite!

-Paul

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@PHHargrove Ok, thats interesting. didn't configure with --enable-debug. That suggests a clobber or something similar since we do -O2 when not building with --enable-debug.

@PHHargrove
Copy link
Member

If you have a licence for 16.5 then you should be able to d/l and install 16.7 (their current latest) for x86-64 as well. Registration is required for the d/l, but you don't need to provide license info.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 2, 2016

@sjeaugey Does nvidia care about old versions of PGI? Would it be ok to cut off anything older than 10.8?

@sjeaugey
Copy link
Member

sjeaugey commented Sep 2, 2016

I think 6 years is pretty old indeed. Let me check.

@PHHargrove
Copy link
Member

@hjelmn

Pretty sure I see the compiler bug.
Have a look at the following, focusing on the stdcx. and the instruction immediately before it:

0000000010002120 <opal_atomic_sc_64>:
    10002120:   f8 ff 61 f8     std     r3,-8(r1)
    10002124:   f0 ff 81 f8     std     r4,-16(r1)
    10002128:   f8 ff 61 e8     ld      r3,-8(r1)
    1000212c:   78 1b 64 7c     mr      r4,r3
    10002130:   f0 ff a1 80     lwz     r5,-16(r1)
    10002134:   ad 19 a0 7c     stdcx.  r5,0,r3
    10002138:   00 00 a0 38     li      r5,0
    1000213c:   08 00 c2 40     bne-    10002144 <opal_atomic_sc_64+0x24>
    10002140:   01 00 a5 60     ori     r5,r5,1
    10002144:   ec ff c1 90     stw     r6,-20(r1)
    10002148:   e8 ff a1 90     stw     r5,-24(r1)
    1000214c:   ea ff 61 e8     lwa     r3,-24(r1)
    10002150:   20 00 80 4e     blr

The lwz immediately before the stdcx. is a 32-bit load with zero-extension. It is loading the from the location where the 2nd arg (r4) was spilled in the prologue. So, the sc_64 operation is corrupting newval (zeroing the upper 32-bits) when setting up the asm argument(s).

This almost made me laugh (or cry) because it was shockingly familiar from my past work keeping x86(-64) asm working as PGI asm continued to fix old bugs and introduce new ones:

   * GASNETI_PGI_ASM_BUG2843
   *   Compiler suffers from "tpr 17075" in which extended asm() may load only 32 bits of
   *   a 64-bit operand at -O1 (but is OK at -O0 and -O2).
   *

It seems reasonable to assume that with -O2 there is no spill and thus no incorrect load to corrupt the value.

BTW: what's up with the unused foo arguments to the asm for both sc operations?

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Looks like a PGI asm bug to me. This:

static inline int opal_atomic_sc_64(volatile int64_t *addr, int64_t newval)
{
    int32_t ret;

    __asm__ __volatile__ ("   stdcx.  %2, 0, %1  \n\t"
                          "   li      %0,0       \n\t"
                          "   bne-    1f         \n\t"
                          "   ori     %0,%0,1    \n\t"
                          "1:"
                          : "=r" (ret)
                          : "r" (addr), "r" (newval)
                          : "cc", "memory");
    return ret;
}

Produces this:

        .align  4
        .type   opal_atomic_sc_64,@function
opal_atomic_sc_64:                      # @opal_atomic_sc_64
.Lfunc_begin19:
        .loc    6 253 0                 # ../../opal/include/opal/sys/powerpc/atomic.h:253:0
        .cfi_startproc
# BB#0:                                 # %L.entry
        #DEBUG_VALUE: opal_atomic_sc_64:addr <- [%X1+-8]
        #DEBUG_VALUE: opal_atomic_sc_64:newval <- [%X1+-16]
        std 3, -8(1)
        std 4, -16(1)
        .loc    6 264 1 prologue_end    # ../../opal/include/opal/sys/powerpc/atomic.h:264:1
.Ltmp244:
        ld 3, -8(1)
        lwz 5, -16(1)
        #APP
        stdcx. 5, 0, 3

        li 3, 0

        bne-     0, .Ltmp245

        ori 3, 3, 1

.Ltmp245:
        #NO_APP
        std 3, -24(1)
        lwa 3, -24(1)
        blr
.Ltmp246:
        .long   0
        .quad   0
.Lfunc_end19:
        .size   opal_atomic_sc_64, .Lfunc_end19-.Lfunc_begin19
        .cfi_endproc

Notice that when loading r5 with the new value it is using lwz 5, -16(1) which is load word (32-bit) and 0. It should be using a double-word load. I am trying to figure out a workaround.

@PHHargrove
Copy link
Member

@hjelmn

You are a step behind me.
I diagnosed the same code gen error (32-bit load) one comment before yours.
I have a work-around in final testing right now.
Stay tuned...

-Paul

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Hah, didn't even look :). Awesome. Two independent reviews of the generated code came to the same conclusion.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

The unused arguments... No idea what I was thinking. Was cleaning up the code now. Probably an artifact of writing the atomics that didn't hurt so didn't get cleaned up. Feel free to clean them up as part of the work around.

@hjelmn hjelmn closed this Sep 3, 2016
@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Ack. Grrr.

@hjelmn hjelmn reopened this Sep 3, 2016
@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

The Close and comment button is bugger than Comment :D.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Even more fun. I made newval an input-output operand to see what it would do and it became a byte load! lbz 5, -24(1)

@PHHargrove
Copy link
Member

OK, I'll drop the "foo".

While my tests of the work-around are running, here is a minimal test case for your amusement:

$ cat q.c
long identity_function(long input) {
    long output;
    __asm__ __volatile__ (  "mr    %0,%1"
                          : "=r" (output)
                          : "r"  (input) );
    return output;
}
$ pgcc -c q.c
$ objdump -d q.o

q.o:     file format elf64-powerpcle
Disassembly of section .text:

0000000000000000 <identity_function>:
   0:   f8 ff 61 f8     std     r3,-8(r1)
   4:   f8 ff 81 80     lwz     r4,-8(r1)
   8:   78 23 83 7c     mr      r3,r4
   c:   f0 ff 61 f8     std     r3,-16(r1)
  10:   20 00 80 4e     blr
        ...

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

You can also drop the unused *addr input operand. I have a feeling I was setting ret a different way before the li %0,0/ori %0,%0,1 code was finalized. I will clean up the sc_32 version.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Nice clean test case there. Hopefully pgi gets this resolved in 16.6 (err 16.8). The workaround will get us up an running for older versions.

Is v16 the first one that supports ppc64?

@PHHargrove
Copy link
Member

I will clean up the sc_32 version.

You can also drop the unused *addr input operand.

In the absence of the "memory" clobber, that input might be required to prevent the compiler from moving a store to the same location over the asm (though the __volatile__ is supposed to do that too). With a "memory" clobber and the __volatile__ it can certainly be removed.

At some point my changes become too large to be considered de minimis, which is a problem since I don't have a contributor's agreement. So, how about this:

  • I will post the fix (1 line change, with a few lines of preprocessor to make it conditional) here
  • You fix everything else.

Hopefully pgi gets this resolved in 16.6

I am testing 16.7 now.
If I understand correctly, the version is year.month, to the next release will be 16.9 or higher.

Is v16 the first one that supports ppc64?

I didn't know until last night that such a beast existed.
However, based on what I learned while looking for it I think 16.5 and 16.7 have been the ONLY releases for ppc64 so far, and both are identified as beta. I assume @sjeaugey will have the real answer.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 3, 2016

Ugh, looks like the workaround will have to be applied to all the _64 atomics. Oh well. Should be easy enough to do.

@PHHargrove
Copy link
Member

Ugh, looks like the workaround will have to be applied to all the _64 atomics. Oh well. Should be easy enough to do.

Yup, I just came to the same conclusion when "make check" still failed after I fixed the generated code for sc_64. In hindsight, it should have been obvious that if that trivial test case would fail then pretty much anything w/ a 64-bit integer (not pointer) input would be vulnerable to the compiler bug.

The failure of opal_lifo test is no longer a SEGV, but it is still a test failure.
The ompi_rb_tree and opal_fifo tests still SEGV.

(It seems that ompi_rb_tree has been failing since the start, but I missed it because it scrolled off my screen.)

Good news: the patch below resolves the problems with sc_64 at -O0.
All that is needed is a cast to (long long) or (void) - either one works.
Note that a cast to int64_t or intptr_t does not help at all (nor do the unsigned versions).

Handing the remaining work over to you, Nathan.
I suggest some sort of FOO64(newval) macro rather than an #if/#else/#endif in every atomic.
Have a look at the existing OPAL_ASM_ADDR() macro in the same file (also a compiler bug work-around I provided, fwiw).

-Paul

diff --git a/opal/include/opal/sys/powerpc/atomic.h b/opal/include/opal/sys/powerpc/atomic.h
index d104379..7c28399 100644
--- a/opal/include/opal/sys/powerpc/atomic.h
+++ b/opal/include/opal/sys/powerpc/atomic.h
@@ -257,7 +257,11 @@ static inline int opal_atomic_sc_64(volatile int64_t *addr, int64_t newval)
                           "   ori     %0,%0,1    \n\t"
                           "1:"
                           : "=r" (ret), "=m" (*addr), "=r" (foo)
+#if defined(__PGI) // Work-around bad-code gen at -O0
+                          : "r" (addr), "r" ((long long)newval)
+#else
                           : "r" (addr), "r" (newval)
+#endif
                           : "cc", "memory");
     return ret;
 }

@PHHargrove
Copy link
Member

@hjelmn

FYI: I am sending my test case to Portland Group.
I'll try to remember to report the assigned TPR number here when I have one.

-Paul

@hjelmn
Copy link
Member Author

hjelmn commented Sep 6, 2016

From what I understand nvidia is ok with this change. It will actually help performance when using PGI. After this is merged I will evaluate the .asm files and see if they can be killed on master. We will leave them on v2.0.x and use them if the PGI version is < 10.8.

@hjelmn
Copy link
Member Author

hjelmn commented Sep 6, 2016

:bot:mellanox:retest

@hjelmn hjelmn merged commit 08d08e6 into open-mpi:master Sep 6, 2016
@jjhursey
Copy link
Member

@hjelmn I don't think this fix was ever moved to the v2.x release series. Can you file a PR for it? I'd be happy to test/review.

@PHHargrove
Copy link
Member

Today I finally received a TPR number from PGI: 23064

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants