I'm sure everyone has noticed that Apple's M-series chips are now roughly as fast as state-of-the-art x86 processors (see GMP's benchmark results). I therefore think we should implement assembly routines for them as well.
These are the current routines that should be implemented:
- Hard(ish)coded multiplication (treated in First batch of Arm assembly #1808, works as a full replacement for `mpn_mul_basecase`)
- Hardcoded squaring (treated in Add mpn_sqr and mpn_mulhigh routines for Arm #1912)
- Hardcoded high multiplication (treated in Add mpn_sqr and mpn_mulhigh routines for Arm #1912)
- Hardcoded high squaring (treated in Add mpn_sqr and mpn_mulhigh routines for Arm #1912)
- High multiplication, basecase (treated in Add mpn_sqr and mpn_mulhigh routines for Arm #1912)
- High squaring, basecase
- Hardcoded low multiplication
- Hardcoded low squaring
- Low multiplication, basecase
- Low squaring, basecase
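For readers unfamiliar with the terminology in the list above, here is a minimal portable C sketch of what full, low, and high multiplication of two n-limb operands compute. It is only a reference model, not the assembly in the linked PRs: the names `limb_t`, `ref_addmul_1`, `ref_mul_n`, `ref_mullow_n`, and `ref_mulhigh_n` are hypothetical, `unsigned __int128` is a GCC/Clang extension, and the high product is computed exactly from the full product, whereas an optimized high routine can skip most low partial products and may accept a small error in exchange.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;           /* one 64-bit limb (assumption) */
typedef unsigned __int128 dlimb_t; /* double limb, GCC/Clang extension */

/* rp[0..n) += ap[0..n) * b; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so this cannot overflow. */
        dlimb_t t = (dlimb_t) ap[i] * b + rp[i] + carry;
        rp[i] = (limb_t) t;
        carry = (limb_t) (t >> 64);
    }
    return carry;
}

/* Full product: rp[0..2n) = a * b (schoolbook "basecase"). */
void ref_mul_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < 2 * n; i++)
        rp[i] = 0;
    for (size_t j = 0; j < n; j++)
        rp[j + n] = ref_addmul_1(rp + j, ap, n, bp[j]);
}

/* Low product: rp[0..n) = (a * b) mod 2^(64n).
   Row j only needs the partial products that land below limb n. */
void ref_mullow_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;
    for (size_t j = 0; j < n; j++)
        (void) ref_addmul_1(rp + j, ap, n - j, bp[j]);
}

/* High product: rp[0..n) = floor(a * b / 2^(64n)), computed exactly
   by forming the full product in scratch[0..2n) and copying its top half. */
void ref_mulhigh_n(limb_t *rp, limb_t *scratch,
                   const limb_t *ap, const limb_t *bp, size_t n)
{
    ref_mul_n(scratch, ap, bp, n);
    for (size_t i = 0; i < n; i++)
        rp[i] = scratch[n + i];
}
```

Squaring is the special case `bp == ap` of each function; a dedicated squaring routine can compute each cross product `a[i] * a[j]` with `i != j` once and double it. The "hardcoded" variants in the list fix `n` at compile time so the loops can be fully unrolled.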
Useful links:
- https://dougallj.github.io/applecpu/firestorm.html
- https://dougallj.github.io/applecpu/firestorm-int.html
- https://dougallj.github.io/applecpu/firestorm-simd.html
- https://developer.arm.com/architectures/instruction-sets/intrinsics/
- https://developer.arm.com/documentation/ddi0602/2023-12?lang=en
- https://github.com/corsix/amx
- https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1