-
Notifications
You must be signed in to change notification settings - Fork 900
Powerpc atomics: Force usage of powerpc assembly. #8649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Powerpc atomics: Force usage of powerpc assembly. #8649
Conversation
Some data:
With C11 atomics (Currently the default) - opal_lifo/fifo results:
With powerpc specific assembly:
|
Here's results with gcc for splitting out the LL/LC as suggested in #8528, they're better but still lag the assembly:
|
f7a4239
to
3d7c509
Compare
The builtins used by default on Power have been shown to perform poorly. For the time being, force all compilers to use the inline assembly until atomic builtins catch-up. This changes the defaults for all compilers sans xl, including: gcc, clang, and pgi to use the assembly. Previously, all of the above were using C11 or the gcc builtins. Bonus: Add a configure flag to force Power machines to use the builtins/C11, depending on what is available. This will make future testing easier. Signed-off-by: Austen Lauria <[email protected]>
3d7c509
to
e3f3c5b
Compare
Spectrum MPI (IBM MPI):
OMPI:
Since the ppc assembly atomics have been tested on SMPI for the past few years, I think it's worth bringing this back to the v4.x series. |
bot:aws:retest |
bot:ibm:scale:retest |
@@ -84,6 +84,13 @@ else | |||
WANT_BRANCH_PROBABILITIES=0 | |||
fi | |||
|
|||
AC_ARG_ENABLE([builtin-atomics-for-ppc],[AS_HELP_STRING([--enable-builtin-atomics-for-ppc], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just verifying that this is correct. PPC is PowerPC which is derived from Power. Is it correct to use PPC here or should it be Power?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check was added under this architecture string:
powerpc-*|powerpc64-*|powerpcle-*|powerpc64le-*|rs6000-*|ppc-*)
So maybe 'power' would be more generic, though it does make for a slightly misleading configure name. (--enable-builtin-atomics-for-power
does that mean power-aware or powerful or Power arch). So if we move to the 'power' then I would make the option --enable-builtin-atomics-for-power-arch
.
That being said, I don't really have a preference on the naming here. Anything that is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with either. I feel like we use the term interchangeably, so I'm not sure tbh.
@awlauria Little surprised the LL/SC didn't get all the way there. The only left to look at would be the memory barriers. Maybe they are too strict? |
The fetch_and_add_64 assembly is basically on par with c11
|
Ok - new data. On a quiet node: C11 compare-and-swap is ~10% slower than the assembly:
|
@awlauria How about C11 but using LL/SC for the lifo/fifo. |
The last two tests I posted is a stand-alone test, no opal involved, so no load/store. Just a straight comparison between the direct assembly and C11 calls to compare and swap and add64. I'll attach the source later, but it's essentially a smaller stripped down opal_lifo program to help identify what exactly is slower. |
Some additional data: opal_atomic_fetch_add_64 -
c11:
opal_atomic_compare_exchange_strong_64 -
c11:
I attached the full results. I pushed up the test to get this data as a separate commit and added it to make-check, since it might be useful for the future. |
ee6f7b8
to
2084670
Compare
Similar to class/opal_lifo/fifo, but more granular to get a better idea of what is going on. Code was borrowed from those tests to make this one. Signed-off-by: Austen Lauria <[email protected]>
2084670
to
136213d
Compare
The builtins used by default on Power have been
shown to perform poorly. For the time being, force
all compilers to use the inline assembly until
atomic builtins catch-up.
This changes the defaults for all compilers sans xl, including:
gcc, clang, and pgi to use the assembly.
Previously, all of the above were using C11 or
the gcc builtins.
Bonus:
Add a configure flag to force Power machines to use
the builtins/C11, depending on what is available. This
will make future testing easier.
Signed-off-by: Austen Lauria [email protected]