Powerpc atomics: Force usage of powerpc assembly. #8649

Merged
awlauria merged 2 commits into open-mpi:master from force_ppc_assembly_atomics
Mar 22, 2021

Conversation

awlauria
Contributor

The builtins used by default on Power have been
shown to perform poorly. For the time being, force
all compilers to use the inline assembly until the
atomic builtins catch up.

This changes the default for all compilers except xl
(including gcc, clang, and pgi) to use the assembly.

Previously, all of the above were using C11 or
the gcc builtins.

Bonus:
Add a configure flag to force Power machines to use
the builtins/C11, depending on what is available. This
will make future testing easier.

Signed-off-by: Austen Lauria [email protected]

@awlauria
Contributor Author

awlauria commented Mar 18, 2021

Some data:

$. gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
OS: RHEL 8.2
Power9

40 physical cpus:
cpu		: POWER9, altivec supported
clock		: 3800.000000MHz
revision	: 2.2 (pvr 004e 1202)

timebase	: 512000000
platform	: PowerNV
model		: 8335-GTX
machine		: PowerNV 8335-GTX
firmware	: OPAL
MMU		: Radix

With C11 atomics (currently the default):

opal_lifo/fifo results:

Single thread test. Time: 0 s 71525 us 71 nsec/poppush
Single thread test. Time: 0 s 72396 us 72 nsec/poppush
Atomics thread finished. Time: 0 s 173275 us 173 nsec/poppush
Atomics thread finished. Time: 0 s 173992 us 173 nsec/poppush
Atomics thread finished. Time: 2 s 906829 us 2906 nsec/poppush
Atomics thread finished. Time: 3 s 123974 us 3123 nsec/poppush
Atomics thread finished. Time: 3 s 166354 us 3166 nsec/poppush
Atomics thread finished. Time: 3 s 186272 us 3186 nsec/poppush
Atomics thread finished. Time: 3 s 333951 us 3333 nsec/poppush
Atomics thread finished. Time: 3 s 450135 us 3450 nsec/poppush
Atomics thread finished. Time: 3 s 504011 us 3504 nsec/poppush
Atomics thread finished. Time: 3 s 501221 us 3501 nsec/poppush
Atomics thread finished. Time: 3 s 543579 us 3543 nsec/poppush
Atomics thread finished. Time: 3 s 595563 us 3595 nsec/poppush
Atomics thread finished. Time: 3 s 625169 us 3625 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 3 s 634076 us 3634 nsec/poppush
All threads finished. Thread count: 8 Time: 3 s 634178 us 454 nsec/poppush
Atomics thread finished. Time: 3 s 648736 us 3648 nsec/poppush
Atomics thread finished. Time: 3 s 660060 us 3660 nsec/poppush
Atomics thread finished. Time: 3 s 671995 us 3671 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 3 s 674218 us 3674 nsec/poppush
All threads finished. Thread count: 8 Time: 3 s 687039 us 460 nsec/poppush


Single thread test. Time: 0 s 103063 us 103 nsec/poppush
Single thread test. Time: 0 s 103234 us 103 nsec/poppush
Atomics thread finished. Time: 0 s 351569 us 351 nsec/poppush
Atomics thread finished. Time: 0 s 357985 us 357 nsec/poppush
Atomics thread finished. Time: 3 s 96329 us 3096 nsec/poppush
Atomics thread finished. Time: 3 s 88982 us 3088 nsec/poppush
Atomics thread finished. Time: 3 s 94051 us 3094 nsec/poppush
Atomics thread finished. Time: 3 s 117543 us 3117 nsec/poppush
Atomics thread finished. Time: 3 s 157315 us 3157 nsec/poppush
Atomics thread finished. Time: 3 s 165797 us 3165 nsec/poppush
Atomics thread finished. Time: 3 s 166044 us 3166 nsec/poppush
Atomics thread finished. Time: 3 s 194060 us 3194 nsec/poppush
Atomics thread finished. Time: 3 s 227445 us 3227 nsec/poppush
Atomics thread finished. Time: 3 s 261175 us 3261 nsec/poppush
Atomics thread finished. Time: 3 s 261728 us 3261 nsec/poppush
Atomics thread finished. Time: 3 s 269797 us 3269 nsec/poppush
All threads finished. Thread count: 8 Time: 3 s 286524 us 410 nsec/poppush
Atomics thread finished. Time: 3 s 302547 us 3302 nsec/poppush
Atomics thread finished. Time: 3 s 304729 us 3304 nsec/poppush
Atomics thread finished. Time: 3 s 307708 us 3307 nsec/poppush
Atomics thread finished. Time: 3 s 310381 us 3310 nsec/poppush
All threads finished. Thread count: 8 Time: 3 s 310639 us 413 nsec/poppush
Exhaustive atomics thread finished. Popped 688476 items. Time: 2 s 430590 us 3530 nsec/poppush
Exhaustive atomics thread finished. Popped 692448 items. Time: 2 s 439846 us 3523 nsec/poppush
Exhaustive atomics thread finished. Popped 693562 items. Time: 2 s 464130 us 3552 nsec/poppush
Exhaustive atomics thread finished. Popped 695684 items. Time: 2 s 507266 us 3604 nsec/poppush
Exhaustive atomics thread finished. Popped 741187 items. Time: 2 s 560753 us 3454 nsec/poppush
Exhaustive atomics thread finished. Popped 740135 items. Time: 2 s 561966 us 3461 nsec/poppush
Exhaustive atomics thread finished. Popped 739887 items. Time: 2 s 569186 us 3472 nsec/poppush
Exhaustive atomics thread finished. Popped 738936 items. Time: 2 s 569466 us 3477 nsec/poppush
Exhaustive atomics thread finished. Popped 706994 items. Time: 2 s 558795 us 3619 nsec/poppush
Exhaustive atomics thread finished. Popped 708691 items. Time: 2 s 563098 us 3616 nsec/poppush
Exhaustive atomics thread finished. Popped 714641 items. Time: 2 s 549648 us 3567 nsec/poppush
Exhaustive atomics thread finished. Popped 712414 items. Time: 2 s 558564 us 3591 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
All threads finished. Thread count: 8 Time: 2 s 578063 us 322 nsec/poppush
Exhaustive atomics thread finished. Popped 697369 items. Time: 2 s 623908 us 3762 nsec/poppush
Exhaustive atomics thread finished. Popped 696264 items. Time: 2 s 625332 us 3770 nsec/poppush
Exhaustive atomics thread finished. Popped 698603 items. Time: 2 s 632692 us 3768 nsec/poppush
Exhaustive atomics thread finished. Popped 699697 items. Time: 2 s 633203 us 3763 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
All threads finished. Thread count: 8 Time: 2 s 653402 us 331 nsec/poppush

With powerpc-specific assembly:

Single thread test. Time: 0 s 7680 us 7 nsec/poppush
Single thread test. Time: 0 s 7742 us 7 nsec/poppush
Atomics thread finished. Time: 0 s 49484 us 49 nsec/poppush
Atomics thread finished. Time: 0 s 49617 us 49 nsec/poppush
Atomics thread finished. Time: 1 s 465175 us 1465 nsec/poppush
Atomics thread finished. Time: 1 s 469782 us 1469 nsec/poppush
Atomics thread finished. Time: 1 s 477674 us 1477 nsec/poppush
Atomics thread finished. Time: 1 s 499081 us 1499 nsec/poppush
Atomics thread finished. Time: 1 s 500167 us 1500 nsec/poppush
Atomics thread finished. Time: 1 s 522126 us 1522 nsec/poppush
Atomics thread finished. Time: 1 s 526379 us 1526 nsec/poppush
Atomics thread finished. Time: 1 s 526272 us 1526 nsec/poppush
Atomics thread finished. Time: 1 s 520135 us 1520 nsec/poppush
Atomics thread finished. Time: 1 s 535248 us 1535 nsec/poppush
Atomics thread finished. Time: 1 s 547078 us 1547 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 1 s 548552 us 1548 nsec/poppush
Atomics thread finished. Time: 1 s 544524 us 1544 nsec/poppush
All threads finished. Thread count: 8 Time: 1 s 549017 us 193 nsec/poppush
Atomics thread finished. Time: 1 s 548538 us 1548 nsec/poppush
Atomics thread finished. Time: 1 s 552125 us 1552 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 1 s 553732 us 1553 nsec/poppush
All threads finished. Thread count: 8 Time: 1 s 553825 us 194 nsec/poppush


Single thread test. Time: 0 s 7190 us 7 nsec/poppush
Single thread test. Time: 0 s 7191 us 7 nsec/poppush
Atomics thread finished. Time: 0 s 57248 us 57 nsec/poppush
Atomics thread finished. Time: 0 s 57131 us 57 nsec/poppush
Atomics thread finished. Time: 0 s 522193 us 522 nsec/poppush
Atomics thread finished. Time: 0 s 524916 us 524 nsec/poppush
Atomics thread finished. Time: 0 s 525740 us 525 nsec/poppush
Atomics thread finished. Time: 0 s 526270 us 526 nsec/poppush
Atomics thread finished. Time: 0 s 527822 us 527 nsec/poppush
Atomics thread finished. Time: 0 s 529646 us 529 nsec/poppush
Atomics thread finished. Time: 0 s 530020 us 530 nsec/poppush
Atomics thread finished. Time: 0 s 530421 us 530 nsec/poppush
Atomics thread finished. Time: 0 s 530373 us 530 nsec/poppush
Atomics thread finished. Time: 0 s 530582 us 530 nsec/poppush
Atomics thread finished. Time: 0 s 529769 us 529 nsec/poppush
Atomics thread finished. Time: 0 s 529691 us 529 nsec/poppush
Atomics thread finished. Time: 0 s 530335 us 530 nsec/poppush
Atomics thread finished. Time: 0 s 531046 us 531 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 531137 us 66 nsec/poppush
Atomics thread finished. Time: 0 s 533018 us 533 nsec/poppush
Atomics thread finished. Time: 0 s 533294 us 533 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 533452 us 66 nsec/poppush
Exhaustive atomics thread finished. Popped 780703 items. Time: 0 s 465666 us 596 nsec/poppush
Exhaustive atomics thread finished. Popped 791661 items. Time: 0 s 473218 us 597 nsec/poppush
Exhaustive atomics thread finished. Popped 788101 items. Time: 0 s 473468 us 600 nsec/poppush
Exhaustive atomics thread finished. Popped 793919 items. Time: 0 s 475323 us 598 nsec/poppush
Exhaustive atomics thread finished. Popped 785586 items. Time: 0 s 475340 us 605 nsec/poppush
Exhaustive atomics thread finished. Popped 780514 items. Time: 0 s 478296 us 612 nsec/poppush
Exhaustive atomics thread finished. Popped 800197 items. Time: 0 s 477411 us 596 nsec/poppush
Exhaustive atomics thread finished. Popped 802602 items. Time: 0 s 480630 us 598 nsec/poppush
Exhaustive atomics thread finished. Popped 799563 items. Time: 0 s 481203 us 601 nsec/poppush
Exhaustive atomics thread finished. Popped 789577 items. Time: 0 s 479647 us 607 nsec/poppush
Exhaustive atomics thread finished. Popped 801583 items. Time: 0 s 482215 us 601 nsec/poppush
Exhaustive atomics thread finished. Popped 804372 items. Time: 0 s 482810 us 600 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
All threads finished. Thread count: 8 Time: 0 s 482871 us 60 nsec/poppush
Exhaustive atomics thread finished. Popped 803923 items. Time: 0 s 481683 us 599 nsec/poppush
Exhaustive atomics thread finished. Popped 802676 items. Time: 0 s 483370 us 602 nsec/poppush
Exhaustive atomics thread finished. Popped 803479 items. Time: 0 s 483833 us 602 nsec/poppush
Exhaustive atomics thread finished. Popped 806471 items. Time: 0 s 484062 us 600 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
All threads finished. Thread count: 8 Time: 0 s 484186 us 60 nsec/poppush

@awlauria awlauria requested review from hjelmn, jjhursey and nysal March 18, 2021 17:35
@awlauria
Contributor Author

awlauria commented Mar 18, 2021

@hjelmn FYI - I tried your suggestion in #8528. It helped some, but unfortunately the C11 and builtin atomics as a whole still perform worse on Power.

@awlauria
Contributor Author

awlauria commented Mar 18, 2021

Here are results with gcc for splitting out the LL/SC as suggested in #8528. They're better but still lag the assembly:

$. ./exports/bin/mpirun --np 2 ./test/class/opal_lifo
Single thread test. Time: 0 s 71625 us 71 nsec/poppush
Single thread test. Time: 0 s 71848 us 71 nsec/poppush
Atomics thread finished. Time: 0 s 71216 us 71 nsec/poppush
Atomics thread finished. Time: 0 s 71496 us 71 nsec/poppush
Atomics thread finished. Time: 1 s 798366 us 1798 nsec/poppush
Atomics thread finished. Time: 1 s 819698 us 1819 nsec/poppush
Atomics thread finished. Time: 1 s 839104 us 1839 nsec/poppush
Atomics thread finished. Time: 1 s 844612 us 1844 nsec/poppush
Atomics thread finished. Time: 1 s 870906 us 1870 nsec/poppush
Atomics thread finished. Time: 1 s 873093 us 1873 nsec/poppush
Atomics thread finished. Time: 1 s 874304 us 1874 nsec/poppush
Atomics thread finished. Time: 1 s 872914 us 1872 nsec/poppush
Atomics thread finished. Time: 1 s 875112 us 1875 nsec/poppush
Atomics thread finished. Time: 1 s 875450 us 1875 nsec/poppush
Atomics thread finished. Time: 1 s 877418 us 1877 nsec/poppush
Atomics thread finished. Time: 1 s 877671 us 1877 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 1 s 879402 us 1879 nsec/poppush
All threads finished. Thread count: 8 Time: 1 s 879496 us 234 nsec/poppush
Atomics thread finished. Time: 1 s 880396 us 1880 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)
Atomics thread finished. Time: 1 s 881439 us 1881 nsec/poppush
Atomics thread finished. Time: 1 s 881279 us 1881 nsec/poppush
All threads finished. Thread count: 8 Time: 1 s 881986 us 235 nsec/poppush

$. ./exports/bin/mpirun --np 2 ./test/class/opal_fifo
Single thread test. Time: 0 s 103040 us 103 nsec/poppush
Single thread test. Time: 0 s 103195 us 103 nsec/poppush
Atomics thread finished. Time: 0 s 57306 us 57 nsec/poppush
Atomics thread finished. Time: 0 s 59145 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 517384 us 517 nsec/poppush
Atomics thread finished. Time: 0 s 517913 us 517 nsec/poppush
Atomics thread finished. Time: 0 s 518922 us 518 nsec/poppush
Atomics thread finished. Time: 0 s 520351 us 520 nsec/poppush
Atomics thread finished. Time: 0 s 519209 us 519 nsec/poppush
Atomics thread finished. Time: 0 s 522466 us 522 nsec/poppush
Atomics thread finished. Time: 0 s 521398 us 521 nsec/poppush
Atomics thread finished. Time: 0 s 523523 us 523 nsec/poppush
Atomics thread finished. Time: 0 s 525236 us 525 nsec/poppush
Atomics thread finished. Time: 0 s 525286 us 525 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 525477 us 65 nsec/poppush
Atomics thread finished. Time: 0 s 524535 us 524 nsec/poppush
Atomics thread finished. Time: 0 s 525044 us 525 nsec/poppush
Atomics thread finished. Time: 0 s 525790 us 525 nsec/poppush
Atomics thread finished. Time: 0 s 527459 us 527 nsec/poppush
Atomics thread finished. Time: 0 s 528100 us 528 nsec/poppush
Atomics thread finished. Time: 0 s 528199 us 528 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 528428 us 66 nsec/poppush
Exhaustive atomics thread finished. Popped 825181 items. Time: 0 s 504962 us 611 nsec/poppush
Exhaustive atomics thread finished. Popped 832458 items. Time: 0 s 505069 us 606 nsec/poppush
Exhaustive atomics thread finished. Popped 834968 items. Time: 0 s 507558 us 607 nsec/poppush
Exhaustive atomics thread finished. Popped 819913 items. Time: 0 s 503599 us 614 nsec/poppush
Exhaustive atomics thread finished. Popped 831429 items. Time: 0 s 508836 us 612 nsec/poppush
Exhaustive atomics thread finished. Popped 833904 items. Time: 0 s 509649 us 611 nsec/poppush
Exhaustive atomics thread finished. Popped 840997 items. Time: 0 s 514532 us 611 nsec/poppush
Exhaustive atomics thread finished. Popped 833025 items. Time: 0 s 514683 us 617 nsec/poppush
Exhaustive atomics thread finished. Popped 836420 items. Time: 0 s 516539 us 617 nsec/poppush
Exhaustive atomics thread finished. Popped 831970 items. Time: 0 s 512628 us 616 nsec/poppush
Exhaustive atomics thread finished. Popped 840246 items. Time: 0 s 517646 us 616 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
Exhaustive atomics thread finished. Popped 827070 items. Time: 0 s 513223 us 620 nsec/poppush
Exhaustive atomics thread finished. Popped 830095 items. Time: 0 s 513331 us 618 nsec/poppush
Exhaustive atomics thread finished. Popped 837181 items. Time: 0 s 513490 us 613 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 517725 us 64 nsec/poppush
Exhaustive atomics thread finished. Popped 833757 items. Time: 0 s 513983 us 616 nsec/poppush
Exhaustive atomics thread finished. Popped 842525 items. Time: 0 s 514471 us 610 nsec/poppush
SUPPORT: OMPI Test Passed: opal_fifo_t: (8 tests)
All threads finished. Thread count: 8 Time: 0 s 514591 us 64 nsec/poppush

@awlauria awlauria force-pushed the force_ppc_assembly_atomics branch 2 times, most recently from f7a4239 to 3d7c509 Compare March 18, 2021 18:45
@awlauria awlauria force-pushed the force_ppc_assembly_atomics branch from 3d7c509 to e3f3c5b Compare March 18, 2021 18:46
@open-mpi open-mpi deleted a comment from ibm-ompi Mar 18, 2021
@awlauria
Contributor Author

Spectrum MPI (IBM MPI):

  • Has always used (and will continue to use) the ppc assembly.

OMPI:

  • v4.x.x uses the builtin atomics (but not C11) for all compilers except xl (which uses the ppc assembly). Users on Power who are not using xl will see a performance penalty.
  • v5.x.x without this patch is the same as v4.x.x, except that it uses C11 atomics for all compilers but xl.
  • v5.x.x with this patch uses the ppc assembly for all compilers going forward.

Since the ppc assembly atomics have been tested on SMPI for the past few years, I think it's worth bringing this back to the v4.x series.

@awlauria
Contributor Author

bot:aws:retest

@awlauria
Contributor Author

bot:ibm:scale:retest

@@ -84,6 +84,13 @@ else
WANT_BRANCH_PROBABILITIES=0
fi

AC_ARG_ENABLE([builtin-atomics-for-ppc],[AS_HELP_STRING([--enable-builtin-atomics-for-ppc],
Member

Just verifying that this is correct. PPC is PowerPC which is derived from Power. Is it correct to use PPC here or should it be Power?

Contributor Author

@awlauria awlauria Mar 20, 2021

Yeah, this is a good point. I'm not sure. @jjhursey @gpaulsen?

Member

The check was added under this architecture string:

        powerpc-*|powerpc64-*|powerpcle-*|powerpc64le-*|rs6000-*|ppc-*)

So maybe 'power' would be more generic, though it does make for a slightly misleading configure name (does --enable-builtin-atomics-for-power mean power-aware, powerful, or Power arch?). So if we move to 'power', I would make the option --enable-builtin-atomics-for-power-arch.

That being said, I don't really have a preference on the naming here. Anything that is clear.

Contributor Author

I'm fine with either. I feel like we use the term interchangeably, so I'm not sure tbh.

@hjelmn
Member

hjelmn commented Mar 19, 2021

@awlauria A little surprised the LL/SC didn't get all the way there. The only thing left to look at would be the memory barriers. Maybe they are too strict?

@awlauria
Contributor Author

awlauria commented Mar 20, 2021

The fetch_and_add_64 assembly is basically on par with C11.

C11:
Atomics thread finished. Time: 0 s 151403 us 151 nsec/poppush
Atomics thread finished. Time: 0 s 153382 us 153 nsec/poppush
Atomics thread finished. Time: 0 s 154249 us 154 nsec/poppush
Atomics thread finished. Time: 0 s 156055 us 156 nsec/poppush
Atomics thread finished. Time: 0 s 157360 us 157 nsec/poppush
Atomics thread finished. Time: 0 s 157692 us 157 nsec/poppush
Atomics thread finished. Time: 0 s 158606 us 158 nsec/poppush
Atomics thread finished. Time: 0 s 159209 us 159 nsec/poppush

Assembly: 
Atomics thread finished. Time: 0 s 150572 us 150 nsec/poppush
Atomics thread finished. Time: 0 s 154833 us 154 nsec/poppush
Atomics thread finished. Time: 0 s 155988 us 155 nsec/poppush
Atomics thread finished. Time: 0 s 156271 us 156 nsec/poppush
Atomics thread finished. Time: 0 s 157934 us 157 nsec/poppush
Atomics thread finished. Time: 0 s 157544 us 157 nsec/poppush
Atomics thread finished. Time: 0 s 158556 us 158 nsec/poppush
Atomics thread finished. Time: 0 s 158837 us 158 nsec/poppush

@awlauria
Contributor Author

Ok - new data. On a quiet node:

C11 compare-and-swap is ~10% slower than the assembly:

Atomics thread finished. Time: 0 s 52522 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52420 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52449 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52434 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52479 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52463 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52509 us 52 nsec/poppush
Atomics thread finished. Time: 0 s 52493 us 52 nsec/poppush
ASM Done 

Atomics thread finished. Time: 0 s 59044 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 59060 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 59026 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 59107 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 58996 us 58 nsec/poppush
Atomics thread finished. Time: 0 s 59088 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 59077 us 59 nsec/poppush
Atomics thread finished. Time: 0 s 59011 us 59 nsec/poppush
C11 Done

@hjelmn
Member

hjelmn commented Mar 20, 2021

@awlauria How about C11 but using LL/SC for the lifo/fifo?

@awlauria
Contributor Author

The last two results I posted are from a stand-alone test, no opal involved, so no load/store. It is just a straight comparison between the direct assembly and the C11 calls for compare-and-swap and add64. I'll attach the source later, but it's essentially a smaller, stripped-down opal_lifo program to help identify what exactly is slower.

@awlauria
Contributor Author

Some additional data:

opal_atomic_fetch_add_64 -
powerpc assembly :

opal_atomic_fetch_add_64() thread finished. Time: 0 s 147929 us 147 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 155078 us 155 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 155982 us 155 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 159343 us 159 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 160076 us 160 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 161567 us 161 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 161856 us 161 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 161900 us 161 nsec/per

c11:

opal_atomic_fetch_add_64() thread finished. Time: 0 s 166730 us 166 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 167665 us 167 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 175904 us 175 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 177398 us 177 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 180348 us 180 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 180940 us 180 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 181992 us 181 nsec/per
opal_atomic_fetch_add_64() thread finished. Time: 0 s 182936 us 182 nsec/per

opal_atomic_compare_exchange_strong_64 -
powerpc assembly:

opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 65242 us 65 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 65168 us 65 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 64215 us 64 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 64211 us 64 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 64219 us 64 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 65160 us 65 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 65227 us 65 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 64401 us 64 nsec/per

c11:

opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 72031 us 72 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 71249 us 71 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 72077 us 72 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 72052 us 72 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 71115 us 71 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 71121 us 71 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 72088 us 72 nsec/per
opal_atomic_compare_exchange_strong_64() thread finished. Time: 0 s 72000 us 71 nsec/per

I attached the full results.
out_ppc_assembly_gcc.txt
out_ppc_c11_gcc.txt

I pushed the test used to collect this data as a separate commit and added it to make check, since it might be useful in the future.

@awlauria awlauria force-pushed the force_ppc_assembly_atomics branch 2 times, most recently from ee6f7b8 to 2084670 Compare March 22, 2021 15:57
Similar to class/opal_lifo/fifo, but more granular to get a better
idea of what is going on. Code was borrowed from those tests to make
this one.

Signed-off-by: Austen Lauria <[email protected]>
@awlauria awlauria merged commit d440c13 into open-mpi:master Mar 22, 2021
@awlauria awlauria deleted the force_ppc_assembly_atomics branch March 22, 2021 18:20