[AMDGPU] Add an option to disable unsafe uses of atomic xor #69229


Open: wants to merge 1 commit into main from the amdgpu-unsafe-atomic-xor branch

Conversation

@pasaulais (Contributor) commented Oct 16, 2023

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg).

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add a -mamdgpu-fine-grained-mem option to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mamdgpu-fine-grained-mem is passed to clang, atomic xor instructions are marked with !amdgpu.atomic instruction-level metadata at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The option only affects the global and flat address spaces.

A !amdgpu.atomic metadata node is a set of key-value pairs that provide more information about the atomic operation. fine_grained is currently the only supported key. When its value is 1, the atomic operation can potentially be passed fine-grained memory pointers. When the pair is not present, the default value is 0, i.e. fine-grained memory pointers are never passed to this instruction.

%2 = atomicrmw xor ptr %0, i32 %1 seq_cst, !amdgpu.atomic !0

!0 = !{!1}
!1 = !{!"fine_grained", i32 1}

@llvmbot added the clang, backend:AMDGPU, clang:driver, clang:frontend, and clang:codegen labels on Oct 16, 2023
@pasaulais (Contributor, Author)

Hi @jansvoboda11, @arsenm, I haven't yet got a commit merged to LLVM and can't request a review through the UI. Could I get a review on this patch? Thanks in advance and apologies if I've missed a step in the submission process.

@jansvoboda11 requested a review from arsenm on October 19, 2023 15:44
@jansvoboda11 (Contributor)

The command-line changes look good to me. Adding @arsenm as a reviewer proper.

@pasaulais (Contributor, Author)

The command-line changes look good to me. Adding @arsenm as a reviewer proper.

Thanks Jan!

@llvmbot (Member) commented Oct 19, 2023

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre-Andre Saulais (pasaulais)

Changes

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg).

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add -munsafe-int-atomics and -mno-unsafe-int-atomics options to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mno-unsafe-int-atomics is passed to clang, functions are marked with the amdgpu-unsafe-int-atomics attribute (set to false) at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The options only affect the global and flat address spaces.
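
For illustration, the effect of -mno-unsafe-int-atomics on the emitted IR looks roughly like this (a minimal sketch abridged from the OpenCL test added in this patch):

; kernel compiled with -mno-unsafe-int-atomics
define amdgpu_kernel void @kern(ptr addrspace(1) %x, i32 %y) #0 {
  ; the backend expands this flat/global atomic xor into a cmpxchg loop
  %old = atomicrmw xor ptr addrspace(1) %x, i32 %y syncscope("workgroup") seq_cst, align 4
  ret void
}

attributes #0 = { "amdgpu-unsafe-int-atomics"="false" }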


Full diff: https://github.com/llvm/llvm-project/pull/69229.diff

10 Files Affected:

  • (modified) clang/include/clang/Basic/TargetInfo.h (+7)
  • (modified) clang/include/clang/Basic/TargetOptions.h (+3)
  • (modified) clang/include/clang/Driver/Options.td (+8)
  • (modified) clang/lib/Basic/TargetInfo.cpp (+1)
  • (modified) clang/lib/Basic/Targets/AMDGPU.cpp (+1)
  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+3)
  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+2)
  • (added) clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl (+55)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+19-4)
diff --git a/clang/include/clang/Basic/TargetInfo.h b/clang/include/clang/Basic/TargetInfo.h
index 9d56e97a3d4bb88..96416c4405115cd 100644
--- a/clang/include/clang/Basic/TargetInfo.h
+++ b/clang/include/clang/Basic/TargetInfo.h
@@ -253,6 +253,8 @@ class TargetInfo : public TransferrableTargetInfo,
 
   unsigned AllowAMDGPUUnsafeFPAtomics : 1;
 
+  unsigned AllowAMDGPUUnsafeIntAtomics : 1;
+
   unsigned ARMCDECoprocMask : 8;
 
   unsigned MaxOpenCLWorkGroupSize;
@@ -995,6 +997,11 @@ class TargetInfo : public TransferrableTargetInfo,
   /// allowed.
   bool allowAMDGPUUnsafeFPAtomics() const { return AllowAMDGPUUnsafeFPAtomics; }
 
+  /// Returns whether or not the AMDGPU unsafe integer atomics are allowed.
+  bool allowAMDGPUUnsafeIntAtomics() const {
+    return AllowAMDGPUUnsafeIntAtomics;
+  }
+
   /// For ARM targets returns a mask defining which coprocessors are configured
   /// as Custom Datapath.
   uint32_t getARMCDECoprocMask() const { return ARMCDECoprocMask; }
diff --git a/clang/include/clang/Basic/TargetOptions.h b/clang/include/clang/Basic/TargetOptions.h
index ba3acd029587160..4feb6f2dcd86b83 100644
--- a/clang/include/clang/Basic/TargetOptions.h
+++ b/clang/include/clang/Basic/TargetOptions.h
@@ -78,6 +78,9 @@ class TargetOptions {
   /// \brief If enabled, allow AMDGPU unsafe floating point atomics.
   bool AllowAMDGPUUnsafeFPAtomics = false;
 
+  /// \brief If enabled, allow AMDGPU unsafe integer atomics.
+  bool AllowAMDGPUUnsafeIntAtomics = true;
+
   /// \brief Enumeration value for AMDGPU code object version, which is the
   /// code object version times 100.
   enum CodeObjectVersionKind {
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index a89d6b6579f1176..265f5c23e7858ee 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -4705,6 +4705,14 @@ defm unsafe_fp_atomics : BoolOption<"m", "unsafe-fp-atomics",
           "for certain memory destinations. (AMDGPU only)">,
   NegFlag<SetFalse>>, Group<m_Group>;
 
+defm unsafe_int_atomics : BoolOption<"m", "unsafe-int-atomics",
+  TargetOpts<"AllowAMDGPUUnsafeIntAtomics">, DefaultTrue,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option],
+          "Enable generation of unsafe integer atomic instructions. May "
+          "generate more efficient code, but may give incorrect results "
+          "for certain memory destinations. (AMDGPU only)">,
+  NegFlag<SetFalse, [], [ClangOption, CC1Option]>>, Group<m_Group>;
+
 def faltivec : Flag<["-"], "faltivec">, Group<f_Group>, Flags<[NoXarchOption]>;
 def fno_altivec : Flag<["-"], "fno-altivec">, Group<f_Group>,
   Flags<[NoXarchOption]>;
diff --git a/clang/lib/Basic/TargetInfo.cpp b/clang/lib/Basic/TargetInfo.cpp
index 6cd5d618a4acaa5..06997cc54642623 100644
--- a/clang/lib/Basic/TargetInfo.cpp
+++ b/clang/lib/Basic/TargetInfo.cpp
@@ -157,6 +157,7 @@ TargetInfo::TargetInfo(const llvm::Triple &T) : Triple(T) {
   HasAArch64SVETypes = false;
   HasRISCVVTypes = false;
   AllowAMDGPUUnsafeFPAtomics = false;
+  AllowAMDGPUUnsafeIntAtomics = true;
   ARMCDECoprocMask = 0;
 
   // Default to no types using fpret.
diff --git a/clang/lib/Basic/Targets/AMDGPU.cpp b/clang/lib/Basic/Targets/AMDGPU.cpp
index 409ae32ab424215..20ad8e5c59c2b66 100644
--- a/clang/lib/Basic/Targets/AMDGPU.cpp
+++ b/clang/lib/Basic/Targets/AMDGPU.cpp
@@ -232,6 +232,7 @@ AMDGPUTargetInfo::AMDGPUTargetInfo(const llvm::Triple &Triple,
   HasFloat16 = true;
   WavefrontSize = GPUFeatures & llvm::AMDGPU::FEATURE_WAVE32 ? 32 : 64;
   AllowAMDGPUUnsafeFPAtomics = Opts.AllowAMDGPUUnsafeFPAtomics;
+  AllowAMDGPUUnsafeIntAtomics = Opts.AllowAMDGPUUnsafeIntAtomics;
 
   // Set pointer width and alignment for the generic address space.
   PointerWidth = PointerAlign = getPointerWidthV(LangAS::Default);
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index f6a614b3e4d54dd..9be15d89d3f2f46 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -409,6 +409,9 @@ void AMDGPUTargetCodeGenInfo::setTargetAttributes(
   if (M.getContext().getTargetInfo().allowAMDGPUUnsafeFPAtomics())
     F->addFnAttr("amdgpu-unsafe-fp-atomics", "true");
 
+  if (!M.getContext().getTargetInfo().allowAMDGPUUnsafeIntAtomics())
+    F->addFnAttr("amdgpu-unsafe-int-atomics", "false");
+
   if (!getABIInfo().getCodeGenOpts().EmitIEEENaNCompliantInsts)
     F->addFnAttr("amdgpu-ieee", "false");
 }
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 94c184435ae14de..be3076cdbdc4a68 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -7372,6 +7372,8 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
 
     Args.addOptInFlag(CmdArgs, options::OPT_munsafe_fp_atomics,
                       options::OPT_mno_unsafe_fp_atomics);
+    Args.addOptOutFlag(CmdArgs, options::OPT_munsafe_int_atomics,
+                       options::OPT_mno_unsafe_int_atomics);
     Args.addOptOutFlag(CmdArgs, options::OPT_mamdgpu_ieee,
                        options::OPT_mno_amdgpu_ieee);
   }
diff --git a/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl b/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl
new file mode 100644
index 000000000000000..02dfddd5791b405
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl
@@ -0,0 +1,55 @@
+// REQUIRES: amdgpu-registered-target
+//
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-DEFAULT %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm -munsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-ON %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm -mno-unsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-OFF %s
+
+// Check AMDGCN ISA generation.
+
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -munsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON-ISA,ISA-ON %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -mno-unsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON-ISA,ISA-OFF %s
+
+#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
+#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable
+
+typedef enum memory_order {
+  memory_order_relaxed = __ATOMIC_RELAXED,
+  memory_order_acquire = __ATOMIC_ACQUIRE,
+  memory_order_release = __ATOMIC_RELEASE,
+  memory_order_acq_rel = __ATOMIC_ACQ_REL,
+  memory_order_seq_cst = __ATOMIC_SEQ_CST
+} memory_order;
+
+typedef enum memory_scope {
+  memory_scope_work_item = __OPENCL_MEMORY_SCOPE_WORK_ITEM,
+  memory_scope_work_group = __OPENCL_MEMORY_SCOPE_WORK_GROUP,
+  memory_scope_device = __OPENCL_MEMORY_SCOPE_DEVICE,
+  memory_scope_all_svm_devices = __OPENCL_MEMORY_SCOPE_ALL_SVM_DEVICES,
+#if defined(cl_intel_subgroups) || defined(cl_khr_subgroups)
+  memory_scope_sub_group = __OPENCL_MEMORY_SCOPE_SUB_GROUP
+#endif
+} memory_scope;
+
+// COMMON-ISA: kern:
+// ISA-ON: flat_atomic_xor v{{[0-9]+}}, v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} glc 
+// ISA-OFF: flat_atomic_cmpswap v{{[0-9]+}},  v[{{[0-9]+}}:{{[0-9]+}}],  v[{{[0-9]+}}:{{[0-9]+}}] glc
+kernel void kern(global atomic_int *x, int y, global int *z) {
+  *z = __opencl_atomic_fetch_xor(x, y, memory_order_seq_cst, memory_scope_work_group);
+}
+
+// COMMON: define{{.*}} amdgpu_kernel void @kern{{.*}} [[ATTRS1:#[0-9]+]]
+// COMMON: atomicrmw xor ptr addrspace(1) %x, i32 %y syncscope("workgroup") seq_cst, align 4
+//
+// UNSAFE-INT-DEFAULT-NOT: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"
+// UNSAFE-INT-ON-NOT: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"
+// UNSAFE-INT-OFF: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"="false"
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 9c5b166c9652238..c379cf4dd64642f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -14975,6 +14975,14 @@ bool unsafeFPAtomicsDisabled(Function *F) {
          "true";
 }
 
+// The amdgpu-unsafe-int-atomics attribute disables generation of unsafe
+// integer atomic instructions. May generate more efficient code, but may
+// give incorrect results for certain memory destinations.
+bool unsafeIntAtomicsDisabled(Function *F) {
+  return F->getFnAttribute("amdgpu-unsafe-int-atomics").getValueAsString() ==
+         "false";
+}
+
 TargetLowering::AtomicExpansionKind
 SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
   unsigned AS = RMW->getPointerAddressSpace();
@@ -15097,6 +15105,13 @@ SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
     }
     break;
   }
+  case AtomicRMWInst::Xor: {
+    if (AMDGPU::isFlatGlobalAddrSpace(AS) &&
+        unsafeIntAtomicsDisabled(RMW->getFunction())) {
+      return AtomicExpansionKind::CmpXChg;
+    }
+    break;
+  }
   default:
     break;
   }
diff --git a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
index 92b70ad797b88ae..6cc97019d3ff5ce 100644
--- a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX908 %s
-; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX940 %s
-; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX1100 %s
+; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX908 %s
+; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX90A %s
+; RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX940 %s
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX1100 %s
 
 define float @syncscope_system(ptr %addr, float %val) #0 {
 ; GFX908-LABEL: syncscope_system:
@@ -391,4 +391,19 @@ define float @no_unsafe(ptr %addr, float %val) {
   ret float %res
 }
 
+define i32 @default_xor(ptr %addr, i32 %val) {
+  ; GCN-LABEL: default_xor:
+  ; GCN: flat_atomic_xor
+  %res = atomicrmw xor ptr %addr, i32 %val seq_cst
+  ret i32 %res
+}
+
+define i32 @no_unsafe_xor(ptr %addr, i32 %val) #1 {
+  ; GCN-LABEL: no_unsafe_xor:
+  ; GCN: flat_atomic_cmpswap
+  %res = atomicrmw xor ptr %addr, i32 %val seq_cst
+  ret i32 %res
+}
+
 attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
+attributes #1 = { "amdgpu-unsafe-int-atomics"="false" }

@pasaulais (Contributor, Author)

Ping @hdelan

@arsenm (Contributor) left a comment

I am opposed to adding any additional attributes for this. I very much want to remove the current unsafe-fp-atomics attribute. It's broken in the presence of inlining as it is, and at some point it acquired too many meanings. There should also be no reason to have separate FP and integer controls; there's nothing fundamental about the type here.

What we really need to do is have instruction-level metadata annotations, separating the FP restrictions from the communication restrictions (which may or may not be isomorphic to a different scope). I have an incomplete patch where I was working towards this.

@pasaulais (Contributor, Author)

Thanks for the review. I agree that function-level attributes are not ideal for solving this issue and instruction-level metadata would work better with things like inlining. Is the incomplete patch you mentioned something I could take on and complete?

Regarding int vs. floating-point, I'm afraid there is a need for toggling one independently of the other (or at least for special-casing operations like XOR that are not supported by PCIe 3.0). As the link I posted in the description mentions (see this comment: ROCm/ROCm#2481 (comment)), there are configurations where using FP atomics like add works whereas XOR doesn't, due to missing support in the PCIe 3.0 spec. I have reproduced this on a system with an RX 6700 XT GPU, where global_atomic_add_f32 works as expected using fine-grained allocations and global_atomic_xor doesn't.

@pasaulais requested a review from arsenm on October 30, 2023 11:11
@pasaulais (Contributor, Author)

@arsenm, could you share this unfinished patch you were working on? I could start from scratch but I don't want to duplicate the work you've already done.

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Nov 15, 2023
…f atomic builtin (#11876)

1. Fix a race condition in cmpxchg
2. Change `atomic_xor` to use a CAS loop instead of atomic builtin. This
is needed to merge this in UR
oneapi-src/unified-runtime#936 so that perf
regression can be fixed. The long-term fix is to use a compiler flag to
choose between the builtin and a safe CAS implementation, but upstream
discussions may take some time to figure out the details. See
llvm/llvm-project#69229
@pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 508e577 to 24b0e52 on December 4, 2023 18:11
@pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 24b0e52 to cd33e84 on December 4, 2023 18:12
@arsenm requested a review from yxsamliu on February 14, 2024 13:52
@arsenm (Contributor) commented Feb 14, 2024

We're gradually converging on something that looks like this, subject to bike shedding the name

@yxsamliu requested a review from b-sumner on February 14, 2024 15:21
@yxsamliu (Collaborator)

We're gradually converging on something that looks like this, subject to bike shedding the name

I did not know about this PR. It is interesting that our other discussions led to a similar solution.

I agree that per-instruction metadata is needed, and that the metadata should convey separate controls for FP atomics and for integer atomics that are unsupported across PCIe.

The reason to use per-instruction metadata instead of per-function metadata or an attribute is that a per-function attribute does not work well for inlined functions.

As for controlling the frontend's emission of this metadata, a finer-grained control (e.g. a pragma) would be desirable; however, for this patch, a clang option is probably sufficient. We could consider pragma control later.

We need to coin good, concise names for the metadata:

For unsafe FP atomics: unsafe_fp?

For integer atomics unsupported across PCIe: unsafe_pcie? fine_grained may not be suitable, since fine-grained memory accessed across XGMI does support integer atomic AND/OR/XOR ops, etc.

@b-sumner

We're aware of the concerns regarding atomics handling. We're trying to find the approach which is least distasteful to all parties. I expect we will address the concern raised here, but not necessarily in the same way.

@pasaulais (Contributor, Author)

Thanks for the comments @arsenm @yxsamliu @b-sumner.

By approaching a similar solution, do you mean MMRAs (#78569)?

If so, should I rebase/adapt my patch to the MMRA PR? Or will this PR be redundant and need closing?

@yxsamliu These concise names look good to me.

@Pierre-vh (Contributor)

Thanks for the comments @arsenm @yxsamliu @b-sumner.

By approaching a similar solution, do you mean MMRAs (#78569)?

If so, should I rebase/adapt my patch to the MMRA PR? Or will this PR be redundant and need closing?

@yxsamliu These concise names look good to me.

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?), but all of the boilerplate would be there to preserve this annotation until the backend. As long as you don't add any other atomic-lowering:<X> flag, you shouldn't have issues with compatibility relations; if you plan to add other flags like that, it may be an issue, because you could end up with two operations being considered incompatible (= they can be reordered) when they're still technically compatible.
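
Roughly, that would presumably look something like this at the IR level (a hypothetical tag name, using the prefix/suffix string-pair encoding from the MMRA patch):

%old = atomicrmw xor ptr %p, i32 %v seq_cst, !mmra !0

!0 = !{!"atomic-lowering", !"fine-grained"}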

@arsenm (Contributor) commented Feb 29, 2024

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?),

It is semantics. You get undefined behavior if you access fine-grained memory with an atomic where the instruction doesn't handle it. I don't think it quite fits into the MMRA box, since it isn't about the synchronization properties of the memory access but about where the underlying memory is.

@pasaulais (Contributor, Author)

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?),

It is semantics. You get undefined behavior if you access fine-grained memory with an atomic where the instruction doesn't handle it. I don't think it quite fits into the MMRA box, since it isn't about the synchronization properties of the memory access but about where the underlying memory is.

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

@arsenm (Contributor) commented Feb 29, 2024

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

Something like this, but the naming and direction of this PR are backwards. The default should be to assume that fine-grained memory is possible. We also have another orthogonal bit we need to track in addition to fine-grained memory. I was envisioning this as a single integer interpreted as bitfields of the two, but this uses different key:value metadata pairs. It should be named "amdgpu.no.something" to assert that the operation doesn't access fine-grained memory.
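
In other words, something along these lines, where the annotation asserts the faster assumption and unannotated atomics keep the conservative expansion (a sketch only, using a placeholder spelling of the "amdgpu.no.something" name):

; the annotation asserts this operation never touches fine-grained memory,
; so the backend may select the native atomic instruction
%old = atomicrmw xor ptr %p, i32 %v seq_cst, !amdgpu.no.fine.grained.memory !0

!0 = !{}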

arsenm added a commit to arsenm/llvm-project that referenced this pull request Mar 13, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to llvm#69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.

Could use a better name
@pasaulais (Contributor, Author)

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

Something like this, but the naming and direction of this PR are backwards. The default should be to assume that fine-grained memory is possible. We also have another orthogonal bit we need to track in addition to fine-grained memory. I was envisioning this as a single integer interpreted as bitfields of the two, but this uses different key:value metadata pairs. It should be named "amdgpu.no.something" to assert that the operation doesn't access fine-grained memory.

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

@pasaulais (Contributor, Author)

Also, this will need to be extended to other operations that can be impacted by fine-grained and remote access, like add, and, or, sub, fpadd and fpsub.

@arsenm (Contributor) commented Mar 15, 2024

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

Opting for fast-and-maybe-broken by default needs to be a frontend decision (i.e. the language can choose to add the metadata to all atomics). I believe @yxsamliu is going to be working on the frontend half, which will preserve the HIP behavior.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

I believe we need an additional piece of metadata (which I have another draft for) to express the relaxable floating-point behaviour. We can express the unsafe-fp-math option with the two combined, and can then drop the old IR attribute.

@yxsamliu (Collaborator)

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

Opting for fast-and-maybe-broken by default needs to be a frontend decision (i.e. the language can choose to add the metadata to all atomics). I believe @yxsamliu is going to be working on the frontend half, which will preserve the HIP behavior.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

I believe we need an additional piece of metadata (which I have another draft for) to express the relaxable floating-point behaviour. We can express the unsafe-fp-math option with the two combined, and can then drop the old IR attribute.

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

@arsenm (Contributor) commented Mar 15, 2024

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

Yes, I would expect something like that for consistency. The most problematic part is the denormal flushing in some address spaces, and secondarily the fixed rounding mode in other cases

@yxsamliu (Collaborator)

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

Yes, I would expect something like that for consistency. The most problematic part is the denormal flushing in some address spaces, and secondarily the fixed rounding mode in other cases

We might document that those are not supported for now. If users really need them, we can introduce more controls to support them.

@arsenm (Contributor) commented Mar 15, 2024

We might document that those are not supported for now. If users really need them, we can introduce more controls to support them.

Sure. I think it's easiest to make progress if we fix the integer cases as a first step

hdelan added a commit to hdelan/llvm that referenced this pull request Apr 22, 2024
AMDGPU reflect pass is needed to choose between safe and unsafe atomics
at the libclc level. In the long run we will delete this patch as work
is being done to ensure correct lowering of atomic instructions. See
patches:

llvm/llvm-project#85052
llvm/llvm-project#69229

This work is necessary as malloc shared atomics rely on PCIe atomics
which can have patchy and unreliable support. We therefore want to be
able to choose at compile time whether we should use safe atomics using
CAS (which PCIe should support), or if we want to rely on the
availability of the newest PCIe atomics, if malloc shared atomics are
desired.

Also changes the implementation of Or, And so that they can choose
between the safe or unsafe version based on the AMDGPU reflect value.
ldrumm pushed a commit to intel/llvm that referenced this pull request Apr 24, 2024
… AMDGPU atomics (#11467)

AMDGPU reflect pass is needed to choose between safe and unsafe atomics
at the libclc level. In the long run we will delete this patch as work
is being done to ensure correct lowering of atomic instructions. See
patches:

llvm/llvm-project#85052
llvm/llvm-project#69229

This work is necessary as malloc shared atomics rely on PCIe atomics
which can have patchy and unreliable support. Therefore, we want to be
able to choose at compile time whether we should use safe atomics using
CAS (which PCIe should support), or if we want to rely on the
availability of the newest PCIe atomics, if malloc shared atomics are
desired.

Also changes the implementation of `atomic_or`, `atomic_and` so that
they
can choose between the safe or unsafe version based on the AMDGPU
reflect value.
arsenm added a commit that referenced this pull request Jul 10, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to #69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to llvm#69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.