[AArch64] Avoid generating LDAPUR on certain cores #124274
@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes

On the CPUs listed below, we want to avoid LDAPUR for performance reasons. Add a tuning feature to disable it when using:

-mcpu=neoverse-v2
-mcpu=neoverse-v3
-mcpu=cortex-x3
-mcpu=cortex-x4
-mcpu=cortex-x925

Full diff: https://github.com/llvm/llvm-project/pull/124274.diff

5 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Features.td b/llvm/lib/Target/AArch64/AArch64Features.td
index 0a91edb4c1661b..5faf933fa4e507 100644
--- a/llvm/lib/Target/AArch64/AArch64Features.td
+++ b/llvm/lib/Target/AArch64/AArch64Features.td
@@ -809,6 +809,9 @@ def FeatureUseFixedOverScalableIfEqualCost: SubtargetFeature<"use-fixed-over-sca
"UseFixedOverScalableIfEqualCost", "true",
"Prefer fixed width loop vectorization over scalable if the cost-model assigns equal costs">;
+def FeatureAvoidLDAPUR: SubtargetFeature<"avoid-ldapur", "AvoidLDAPUR", "true",
+ "Prefer add+ldapr to offset ldapur">;
+
//===----------------------------------------------------------------------===//
// Architectures.
//
diff --git a/llvm/lib/Target/AArch64/AArch64InstrAtomics.td b/llvm/lib/Target/AArch64/AArch64InstrAtomics.td
index de94cf64c9801c..5e6db9d007a555 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrAtomics.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrAtomics.td
@@ -575,7 +575,7 @@ let Predicates = [HasRCPC3, HasNEON] in {
}
// v8.4a FEAT_LRCPC2 patterns
-let Predicates = [HasRCPC_IMMO] in {
+let Predicates = [HasRCPC_IMMO, UseLDAPUR] in {
// Load-Acquire RCpc Register unscaled loads
def : Pat<(acquiring_load<atomic_load_az_8>
(am_unscaled8 GPR64sp:$Rn, simm9:$offset)),
@@ -589,7 +589,9 @@ let Predicates = [HasRCPC_IMMO] in {
def : Pat<(acquiring_load<atomic_load_64>
(am_unscaled64 GPR64sp:$Rn, simm9:$offset)),
(LDAPURXi GPR64sp:$Rn, simm9:$offset)>;
+}
+let Predicates = [HasRCPC_IMMO] in {
// Store-Release Register unscaled stores
def : Pat<(releasing_store<atomic_store_8>
(am_unscaled8 GPR64sp:$Rn, simm9:$offset), GPR32:$val),
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index fa6385409f30c7..9d0bd44544134c 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -389,6 +389,8 @@ def NoUseScalarIncVL : Predicate<"!Subtarget->useScalarIncVL()">;
def UseSVEFPLD1R : Predicate<"!Subtarget->noSVEFPLD1R()">;
+def UseLDAPUR : Predicate<"!Subtarget->avoidLDAPUR()">;
+
def AArch64LocalRecover : SDNode<"ISD::LOCAL_RECOVER",
SDTypeProfile<1, 1, [SDTCisSameAs<0, 1>,
SDTCisInt<1>]>>;
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 0e3c4e8397f526..8a2c0442a0c0da 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -240,6 +240,7 @@ def TuneX3 : SubtargetFeature<"cortex-x3", "ARMProcFamily", "CortexX3",
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeatureUseFixedOverScalableIfEqualCost,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneX4 : SubtargetFeature<"cortex-x4", "ARMProcFamily", "CortexX4",
@@ -250,6 +251,7 @@ def TuneX4 : SubtargetFeature<"cortex-x4", "ARMProcFamily", "CortexX4",
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeatureUseFixedOverScalableIfEqualCost,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneX925 : SubtargetFeature<"cortex-x925", "ARMProcFamily",
@@ -260,6 +262,7 @@ def TuneX925 : SubtargetFeature<"cortex-x925", "ARMProcFamily",
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeatureUseFixedOverScalableIfEqualCost,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneA64FX : SubtargetFeature<"a64fx", "ARMProcFamily", "A64FX",
@@ -540,6 +543,7 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeatureUseFixedOverScalableIfEqualCost,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
@@ -549,6 +553,7 @@ def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3
FeatureFuseAdrpAdd,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneNeoverseV3AE : SubtargetFeature<"neoversev3AE", "ARMProcFamily", "NeoverseV3",
@@ -558,6 +563,7 @@ def TuneNeoverseV3AE : SubtargetFeature<"neoversev3AE", "ARMProcFamily", "Neover
FeatureFuseAdrpAdd,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
+ FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive]>;
def TuneSaphira : SubtargetFeature<"saphira", "ARMProcFamily", "Saphira",
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
index 9687ba683fb7e6..b475e68db411a4 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomic-load-rcpc_immo.ll
@@ -1,6 +1,12 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --filter-out "(?!^\s*lda.*\bsp\b)^\s*.*\bsp\b" --filter "^\s*(ld|st[^r]|swp|cas|bl|add|and|eor|orn|orr|sub|mvn|sxt|cmp|ccmp|csel|dmb)"
; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mattr=+v8.4a -mattr=+rcpc-immo -global-isel=true -global-isel-abort=2 -O0 | FileCheck %s --check-prefixes=CHECK,GISEL
-; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mattr=+v8.4a -mattr=+rcpc-immo -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mattr=+v8.4a -mattr=+rcpc-immo -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-NOAVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mattr=+v8.4a -mattr=+rcpc-immo,avoid-ldapur -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mcpu=neoverse-v2 -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mcpu=neoverse-v3 -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mcpu=cortex-x3 -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mcpu=cortex-x4 -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
+; RUN: llc %s -o - -verify-machineinstrs -mtriple=aarch64 -mcpu=cortex-x925 -global-isel=false -O1 | FileCheck %s --check-prefixes=CHECK,SDAG,SDAG-AVOIDLDAPUR
define i8 @load_atomic_i8_aligned_unordered(ptr %ptr) {
; CHECK-LABEL: load_atomic_i8_aligned_unordered:
@@ -39,8 +45,12 @@ define i8 @load_atomic_i8_aligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-LABEL: load_atomic_i8_aligned_acquire:
-; SDAG: ldapurb w0, [x0, #4]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire:
+; SDAG-AVOIDLDAPUR: add x8, x0, #4
+; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -51,8 +61,12 @@ define i8 @load_atomic_i8_aligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-LABEL: load_atomic_i8_aligned_acquire_const:
-; SDAG: ldapurb w0, [x0, #4]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_aligned_acquire_const:
+; SDAG-AVOIDLDAPUR: add x8, x0, #4
+; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -113,8 +127,12 @@ define i16 @load_atomic_i16_aligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #8
; GISEL: ldaprh w0, [x8]
;
-; SDAG-LABEL: load_atomic_i16_aligned_acquire:
-; SDAG: ldapurh w0, [x0, #8]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire:
+; SDAG-AVOIDLDAPUR: add x8, x0, #8
+; SDAG-AVOIDLDAPUR: ldaprh w0, [x8]
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 2
ret i16 %r
@@ -125,8 +143,12 @@ define i16 @load_atomic_i16_aligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #8
; GISEL: ldaprh w0, [x8]
;
-; SDAG-LABEL: load_atomic_i16_aligned_acquire_const:
-; SDAG: ldapurh w0, [x0, #8]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurh w0, [x0, #8]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_aligned_acquire_const:
+; SDAG-AVOIDLDAPUR: add x8, x0, #8
+; SDAG-AVOIDLDAPUR: ldaprh w0, [x8]
%gep = getelementptr inbounds i16, ptr %ptr, i32 4
%r = load atomic i16, ptr %gep acquire, align 2
ret i16 %r
@@ -183,16 +205,30 @@ define i32 @load_atomic_i32_aligned_monotonic_const(ptr readonly %ptr) {
}
define i32 @load_atomic_i32_aligned_acquire(ptr %ptr) {
-; CHECK-LABEL: load_atomic_i32_aligned_acquire:
-; CHECK: ldapur w0, [x0, #16]
+; GISEL-LABEL: load_atomic_i32_aligned_acquire:
+; GISEL: ldapur w0, [x0, #16]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire:
+; SDAG-AVOIDLDAPUR: add x8, x0, #16
+; SDAG-AVOIDLDAPUR: ldapr w0, [x8]
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 4
ret i32 %r
}
define i32 @load_atomic_i32_aligned_acquire_const(ptr readonly %ptr) {
-; CHECK-LABEL: load_atomic_i32_aligned_acquire_const:
-; CHECK: ldapur w0, [x0, #16]
+; GISEL-LABEL: load_atomic_i32_aligned_acquire_const:
+; GISEL: ldapur w0, [x0, #16]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapur w0, [x0, #16]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_aligned_acquire_const:
+; SDAG-AVOIDLDAPUR: add x8, x0, #16
+; SDAG-AVOIDLDAPUR: ldapr w0, [x8]
%gep = getelementptr inbounds i32, ptr %ptr, i32 4
%r = load atomic i32, ptr %gep acquire, align 4
ret i32 %r
@@ -249,16 +285,30 @@ define i64 @load_atomic_i64_aligned_monotonic_const(ptr readonly %ptr) {
}
define i64 @load_atomic_i64_aligned_acquire(ptr %ptr) {
-; CHECK-LABEL: load_atomic_i64_aligned_acquire:
-; CHECK: ldapur x0, [x0, #32]
+; GISEL-LABEL: load_atomic_i64_aligned_acquire:
+; GISEL: ldapur x0, [x0, #32]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire:
+; SDAG-AVOIDLDAPUR: add x8, x0, #32
+; SDAG-AVOIDLDAPUR: ldapr x0, [x8]
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep acquire, align 8
ret i64 %r
}
define i64 @load_atomic_i64_aligned_acquire_const(ptr readonly %ptr) {
-; CHECK-LABEL: load_atomic_i64_aligned_acquire_const:
-; CHECK: ldapur x0, [x0, #32]
+; GISEL-LABEL: load_atomic_i64_aligned_acquire_const:
+; GISEL: ldapur x0, [x0, #32]
+;
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapur x0, [x0, #32]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_aligned_acquire_const:
+; SDAG-AVOIDLDAPUR: add x8, x0, #32
+; SDAG-AVOIDLDAPUR: ldapr x0, [x8]
%gep = getelementptr inbounds i64, ptr %ptr, i32 4
%r = load atomic i64, ptr %gep acquire, align 8
ret i64 %r
@@ -387,8 +437,12 @@ define i8 @load_atomic_i8_unaligned_acquire(ptr %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-LABEL: load_atomic_i8_unaligned_acquire:
-; SDAG: ldapurb w0, [x0, #4]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire:
+; SDAG-AVOIDLDAPUR: add x8, x0, #4
+; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -399,8 +453,12 @@ define i8 @load_atomic_i8_unaligned_acquire_const(ptr readonly %ptr) {
; GISEL: add x8, x0, #4
; GISEL: ldaprb w0, [x8]
;
-; SDAG-LABEL: load_atomic_i8_unaligned_acquire_const:
-; SDAG: ldapurb w0, [x0, #4]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [x0, #4]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_unaligned_acquire_const:
+; SDAG-AVOIDLDAPUR: add x8, x0, #4
+; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
%gep = getelementptr inbounds i8, ptr %ptr, i32 4
%r = load atomic i8, ptr %gep acquire, align 1
ret i8 %r
@@ -846,9 +904,14 @@ define i8 @load_atomic_i8_from_gep() {
; GISEL: add x8, x8, #1
; GISEL: ldaprb w0, [x8]
;
-; SDAG-LABEL: load_atomic_i8_from_gep:
-; SDAG: bl init
-; SDAG: ldapurb w0, [sp, #13]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i8_from_gep:
+; SDAG-NOAVOIDLDAPUR: bl init
+; SDAG-NOAVOIDLDAPUR: ldapurb w0, [sp, #13]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i8_from_gep:
+; SDAG-AVOIDLDAPUR: bl init
+; SDAG-AVOIDLDAPUR: orr x8, x19, #0x1
+; SDAG-AVOIDLDAPUR: ldaprb w0, [x8]
%a = alloca [3 x i8]
call void @init(ptr %a)
%arrayidx = getelementptr [3 x i8], ptr %a, i64 0, i64 1
@@ -862,9 +925,14 @@ define i16 @load_atomic_i16_from_gep() {
; GISEL: add x8, x8, #2
; GISEL: ldaprh w0, [x8]
;
-; SDAG-LABEL: load_atomic_i16_from_gep:
-; SDAG: bl init
-; SDAG: ldapurh w0, [sp, #10]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i16_from_gep:
+; SDAG-NOAVOIDLDAPUR: bl init
+; SDAG-NOAVOIDLDAPUR: ldapurh w0, [sp, #10]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i16_from_gep:
+; SDAG-AVOIDLDAPUR: bl init
+; SDAG-AVOIDLDAPUR: orr x8, x19, #0x2
+; SDAG-AVOIDLDAPUR: ldaprh w0, [x8]
%a = alloca [3 x i16]
call void @init(ptr %a)
%arrayidx = getelementptr [3 x i16], ptr %a, i64 0, i64 1
@@ -877,9 +945,14 @@ define i32 @load_atomic_i32_from_gep() {
; GISEL: bl init
; GISEL: ldapur w0, [x8, #4]
;
-; SDAG-LABEL: load_atomic_i32_from_gep:
-; SDAG: bl init
-; SDAG: ldapur w0, [sp, #8]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i32_from_gep:
+; SDAG-NOAVOIDLDAPUR: bl init
+; SDAG-NOAVOIDLDAPUR: ldapur w0, [sp, #8]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i32_from_gep:
+; SDAG-AVOIDLDAPUR: bl init
+; SDAG-AVOIDLDAPUR: add x8, x19, #4
+; SDAG-AVOIDLDAPUR: ldapr w0, [x8]
%a = alloca [3 x i32]
call void @init(ptr %a)
%arrayidx = getelementptr [3 x i32], ptr %a, i64 0, i64 1
@@ -892,9 +965,14 @@ define i64 @load_atomic_i64_from_gep() {
; GISEL: bl init
; GISEL: ldapur x0, [x8, #8]
;
-; SDAG-LABEL: load_atomic_i64_from_gep:
-; SDAG: bl init
-; SDAG: ldapur x0, [sp, #16]
+; SDAG-NOAVOIDLDAPUR-LABEL: load_atomic_i64_from_gep:
+; SDAG-NOAVOIDLDAPUR: bl init
+; SDAG-NOAVOIDLDAPUR: ldapur x0, [sp, #16]
+;
+; SDAG-AVOIDLDAPUR-LABEL: load_atomic_i64_from_gep:
+; SDAG-AVOIDLDAPUR: bl init
+; SDAG-AVOIDLDAPUR: add x8, x19, #8
+; SDAG-AVOIDLDAPUR: ldapr x0, [x8]
%a = alloca [3 x i64]
call void @init(ptr %a)
%arrayidx = getelementptr [3 x i64], ptr %a, i64 0, i64 1
We also observed this ldapur behaviour.
Thanks for fixing this, LGTM.
def FeatureAvoidLDAPUR: SubtargetFeature<"avoid-ldapur", "AvoidLDAPUR", "true", |
Nit: maybe add a comment that we want to avoid this instruction for performance reasons on some cores?
Hi, sorry for being late - I acknowledge this has already been merged. Nevertheless, I think the logic should be reversed: instead of explicitly disabling the fold for the few cores above, in my opinion we should explicitly enable it only where we know it is safe to do so (and assume it is disabled by default). The penalty for getting it wrong on the affected cores outweighs the penalty for missing out on the fold on unaffected cores (in the latter case it is just one extra GP instruction), and many users/package providers compile with …
Hi. I was thinking about the same thing too and wasn't sure which way to go on it. Just to be clear, it is always safe to use these instructions - they are not incorrect, they just act slower than ldapr. The benefit of using ldapur over a register increment is relatively minor, but it is a little better in terms of codesize, performance and register pressure. Anyone using the default -march=armv8 will not use ldapur, so will not see any problems. I was thinking of it in terms of enabling the tuning feature for -mcpu=generic, for which the main problem is: when do we stop avoiding them? In 5 years? 10? Do we end up never using the instruction because some CPUs had an issue with it? There might be something we can do where we tie it to the architecture revision, though, and make -mcpu=generic + armv8.4->9.3 avoid the instructions, but anything after that use them. I will see if I can put together a patch so we can see what it looks like.
I think tying it to architecture level when tuning for generic is a reasonable approach.
Thanks very much, @davemgreen. As you and @ktkachov suggested, I think tying it to the architecture revision (when tuning for generic) makes sense.
…9.3. (#125261) As added in #124274, CPUs in this range can suffer from performance issues with ldapur. As the gain from ldar->ldapr is expected to be greater than the minor gain from ldapr->ldapur, this opts to avoid the instruction under the default -mcpu=generic when the -march is less than armv8.8 / armv9.3. I renamed AArch64Subtarget::Others to AArch64Subtarget::Generic to be clearer about what it means.
…9.3. (llvm#125261) As added in llvm#124274, CPUs in this range can suffer from performance issues with ldapur. As the gain from ldar->ldapr is expected to be greater than the minor gain from ldapr->ldapur, this opts to avoid the instruction under the default -mcpu=generic when the -march is less that armv8.8 / armv9.3. I renamed AArch64Subtarget::Others to AArch64Subtarget::Generic to be clearer what it means. (cherry picked from commit 6424abc)
…9.3. (llvm#125261) As added in llvm#124274, CPUs in this range can suffer from performance issues with ldapur. As the gain from ldar->ldapr is expected to be greater than the minor gain from ldapr->ldapur, this opts to avoid the instruction under the default -mcpu=generic when the -march is less that armv8.8 / armv9.3. I renamed AArch64Subtarget::Others to AArch64Subtarget::Generic to be clearer what it means.
On the CPUs listed below, we want to avoid LDAPUR for performance reasons. Add a tuning feature to disable them when using:
-mcpu=neoverse-v2
-mcpu=neoverse-v3
-mcpu=cortex-x3
-mcpu=cortex-x4
-mcpu=cortex-x925