[AMDGPU] Add an option to disable unsafe uses of atomic xor #69229


Open: wants to merge 1 commit into main from the amdgpu-unsafe-atomic-xor branch

Conversation

@pasaulais (Contributor) commented Oct 16, 2023

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg).

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add a -mamdgpu-fine-grained-mem option to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mamdgpu-fine-grained-mem is passed to clang, atomic xor instructions are marked with !amdgpu.atomic instruction-level metadata at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The option only affects the global and flat address spaces.

A !amdgpu.atomic metadata node is a set of key-value pairs that provide more information about the atomic operation. fine_grained is currently the only supported key. When its value is 1, the atomic operation can potentially be passed fine-grained memory pointers. When the pair is not present, the default value is 0, i.e. fine-grained memory pointers are never passed to this instruction.

%2 = atomicrmw xor ptr %0, i32 %1 seq_cst, !amdgpu.atomic !0

!0 = !{!1}
!1 = !{!"fine_grained", i32 1}

@llvmbot added the clang, backend:AMDGPU, clang:driver, clang:frontend, and clang:codegen labels on Oct 16, 2023
@pasaulais (Contributor, Author)

Hi @jansvoboda11, @arsenm, I haven't yet got a commit merged to LLVM and can't request a review through the UI. Could I get a review on this patch? Thanks in advance and apologies if I've missed a step in the submission process.

@jansvoboda11 requested a review from arsenm on October 19, 2023 15:44
@jansvoboda11 (Contributor)

The command-line changes look good to me. Adding @arsenm as a reviewer proper.

@pasaulais (Contributor, Author)

The command-line changes look good to me. Adding @arsenm as a reviewer proper.

Thanks Jan!

@llvmbot (Member) commented Oct 19, 2023

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre-Andre Saulais (pasaulais)

Changes

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg).

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add -munsafe-int-atomics and -mno-unsafe-int-atomics options to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mno-unsafe-int-atomics is passed to clang, functions are marked with the amdgpu-unsafe-int-atomics attribute (set to false) at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The options only affect the global and flat address spaces.
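
For illustration, the effect of -mno-unsafe-int-atomics on the emitted IR looks roughly like this (a minimal sketch abridged from the OpenCL test added in this patch):

; kernel compiled with -mno-unsafe-int-atomics
define amdgpu_kernel void @kern(ptr addrspace(1) %x, i32 %y) #0 {
  ; the backend expands this flat/global atomic xor into a cmpxchg loop
  %old = atomicrmw xor ptr addrspace(1) %x, i32 %y syncscope("workgroup") seq_cst, align 4
  ret void
}

attributes #0 = { "amdgpu-unsafe-int-atomics"="false" }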


Full diff: https://github.com/llvm/llvm-project/pull/69229.diff

10 Files Affected:

  • (modified) clang/include/clang/Basic/TargetInfo.h (+7)
  • (modified) clang/include/clang/Basic/TargetOptions.h (+3)
  • (modified) clang/include/clang/Driver/Options.td (+8)
  • (modified) clang/lib/Basic/TargetInfo.cpp (+1)
  • (modified) clang/lib/Basic/Targets/AMDGPU.cpp (+1)
  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+3)
  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+2)
  • (added) clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl (+55)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+19-4)
diff --git a/clang/include/clang/Basic/TargetInfo.h b/clang/include/clang/Basic/TargetInfo.h
index 9d56e97a3d4bb88..96416c4405115cd 100644
--- a/clang/include/clang/Basic/TargetInfo.h
+++ b/clang/include/clang/Basic/TargetInfo.h
@@ -253,6 +253,8 @@ class TargetInfo : public TransferrableTargetInfo,
 
   unsigned AllowAMDGPUUnsafeFPAtomics : 1;
 
+  unsigned AllowAMDGPUUnsafeIntAtomics : 1;
+
   unsigned ARMCDECoprocMask : 8;
 
   unsigned MaxOpenCLWorkGroupSize;
@@ -995,6 +997,11 @@ class TargetInfo : public TransferrableTargetInfo,
   /// allowed.
   bool allowAMDGPUUnsafeFPAtomics() const { return AllowAMDGPUUnsafeFPAtomics; }
 
+  /// Returns whether or not the AMDGPU unsafe integer atomics are allowed.
+  bool allowAMDGPUUnsafeIntAtomics() const {
+    return AllowAMDGPUUnsafeIntAtomics;
+  }
+
   /// For ARM targets returns a mask defining which coprocessors are configured
   /// as Custom Datapath.
   uint32_t getARMCDECoprocMask() const { return ARMCDECoprocMask; }
diff --git a/clang/include/clang/Basic/TargetOptions.h b/clang/include/clang/Basic/TargetOptions.h
index ba3acd029587160..4feb6f2dcd86b83 100644
--- a/clang/include/clang/Basic/TargetOptions.h
+++ b/clang/include/clang/Basic/TargetOptions.h
@@ -78,6 +78,9 @@ class TargetOptions {
   /// \brief If enabled, allow AMDGPU unsafe floating point atomics.
   bool AllowAMDGPUUnsafeFPAtomics = false;
 
+  /// \brief If enabled, allow AMDGPU unsafe integer atomics.
+  bool AllowAMDGPUUnsafeIntAtomics = true;
+
   /// \brief Enumeration value for AMDGPU code object version, which is the
   /// code object version times 100.
   enum CodeObjectVersionKind {
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index a89d6b6579f1176..265f5c23e7858ee 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -4705,6 +4705,14 @@ defm unsafe_fp_atomics : BoolOption<"m", "unsafe-fp-atomics",
           "for certain memory destinations. (AMDGPU only)">,
   NegFlag<SetFalse>>, Group<m_Group>;
 
+defm unsafe_int_atomics : BoolOption<"m", "unsafe-int-atomics",
+  TargetOpts<"AllowAMDGPUUnsafeIntAtomics">, DefaultTrue,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option],
+          "Enable generation of unsafe integer atomic instructions. May "
+          "generate more efficient code, but may give incorrect results "
+          "for certain memory destinations. (AMDGPU only)">,
+  NegFlag<SetFalse, [], [ClangOption, CC1Option]>>, Group<m_Group>;
+
 def faltivec : Flag<["-"], "faltivec">, Group<f_Group>, Flags<[NoXarchOption]>;
 def fno_altivec : Flag<["-"], "fno-altivec">, Group<f_Group>,
   Flags<[NoXarchOption]>;
diff --git a/clang/lib/Basic/TargetInfo.cpp b/clang/lib/Basic/TargetInfo.cpp
index 6cd5d618a4acaa5..06997cc54642623 100644
--- a/clang/lib/Basic/TargetInfo.cpp
+++ b/clang/lib/Basic/TargetInfo.cpp
@@ -157,6 +157,7 @@ TargetInfo::TargetInfo(const llvm::Triple &T) : Triple(T) {
   HasAArch64SVETypes = false;
   HasRISCVVTypes = false;
   AllowAMDGPUUnsafeFPAtomics = false;
+  AllowAMDGPUUnsafeIntAtomics = true;
   ARMCDECoprocMask = 0;
 
   // Default to no types using fpret.
diff --git a/clang/lib/Basic/Targets/AMDGPU.cpp b/clang/lib/Basic/Targets/AMDGPU.cpp
index 409ae32ab424215..20ad8e5c59c2b66 100644
--- a/clang/lib/Basic/Targets/AMDGPU.cpp
+++ b/clang/lib/Basic/Targets/AMDGPU.cpp
@@ -232,6 +232,7 @@ AMDGPUTargetInfo::AMDGPUTargetInfo(const llvm::Triple &Triple,
   HasFloat16 = true;
   WavefrontSize = GPUFeatures & llvm::AMDGPU::FEATURE_WAVE32 ? 32 : 64;
   AllowAMDGPUUnsafeFPAtomics = Opts.AllowAMDGPUUnsafeFPAtomics;
+  AllowAMDGPUUnsafeIntAtomics = Opts.AllowAMDGPUUnsafeIntAtomics;
 
   // Set pointer width and alignment for the generic address space.
   PointerWidth = PointerAlign = getPointerWidthV(LangAS::Default);
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index f6a614b3e4d54dd..9be15d89d3f2f46 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -409,6 +409,9 @@ void AMDGPUTargetCodeGenInfo::setTargetAttributes(
   if (M.getContext().getTargetInfo().allowAMDGPUUnsafeFPAtomics())
     F->addFnAttr("amdgpu-unsafe-fp-atomics", "true");
 
+  if (!M.getContext().getTargetInfo().allowAMDGPUUnsafeIntAtomics())
+    F->addFnAttr("amdgpu-unsafe-int-atomics", "false");
+
   if (!getABIInfo().getCodeGenOpts().EmitIEEENaNCompliantInsts)
     F->addFnAttr("amdgpu-ieee", "false");
 }
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 94c184435ae14de..be3076cdbdc4a68 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -7372,6 +7372,8 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
 
     Args.addOptInFlag(CmdArgs, options::OPT_munsafe_fp_atomics,
                       options::OPT_mno_unsafe_fp_atomics);
+    Args.addOptOutFlag(CmdArgs, options::OPT_munsafe_int_atomics,
+                       options::OPT_mno_unsafe_int_atomics);
     Args.addOptOutFlag(CmdArgs, options::OPT_mamdgpu_ieee,
                        options::OPT_mno_amdgpu_ieee);
   }
diff --git a/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl b/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl
new file mode 100644
index 000000000000000..02dfddd5791b405
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/amdgpu-unsafe-int-atomics.cl
@@ -0,0 +1,55 @@
+// REQUIRES: amdgpu-registered-target
+//
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-DEFAULT %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm -munsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-ON %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -emit-llvm -mno-unsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON,UNSAFE-INT-OFF %s
+
+// Check AMDGCN ISA generation.
+
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -munsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON-ISA,ISA-ON %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -O3 -S -o - %s \
+// RUN:   -mno-unsafe-int-atomics \
+// RUN:   | FileCheck -check-prefixes=COMMON-ISA,ISA-OFF %s
+
+#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
+#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable
+
+typedef enum memory_order {
+  memory_order_relaxed = __ATOMIC_RELAXED,
+  memory_order_acquire = __ATOMIC_ACQUIRE,
+  memory_order_release = __ATOMIC_RELEASE,
+  memory_order_acq_rel = __ATOMIC_ACQ_REL,
+  memory_order_seq_cst = __ATOMIC_SEQ_CST
+} memory_order;
+
+typedef enum memory_scope {
+  memory_scope_work_item = __OPENCL_MEMORY_SCOPE_WORK_ITEM,
+  memory_scope_work_group = __OPENCL_MEMORY_SCOPE_WORK_GROUP,
+  memory_scope_device = __OPENCL_MEMORY_SCOPE_DEVICE,
+  memory_scope_all_svm_devices = __OPENCL_MEMORY_SCOPE_ALL_SVM_DEVICES,
+#if defined(cl_intel_subgroups) || defined(cl_khr_subgroups)
+  memory_scope_sub_group = __OPENCL_MEMORY_SCOPE_SUB_GROUP
+#endif
+} memory_scope;
+
+// COMMON-ISA: kern:
+// ISA-ON: flat_atomic_xor v{{[0-9]+}}, v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} glc 
+// ISA-OFF: flat_atomic_cmpswap v{{[0-9]+}},  v[{{[0-9]+}}:{{[0-9]+}}],  v[{{[0-9]+}}:{{[0-9]+}}] glc
+kernel void kern(global atomic_int *x, int y, global int *z) {
+  *z = __opencl_atomic_fetch_xor(x, y, memory_order_seq_cst, memory_scope_work_group);
+}
+
+// COMMON: define{{.*}} amdgpu_kernel void @kern{{.*}} [[ATTRS1:#[0-9]+]]
+// COMMON: atomicrmw xor ptr addrspace(1) %x, i32 %y syncscope("workgroup") seq_cst, align 4
+//
+// UNSAFE-INT-DEFAULT-NOT: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"
+// UNSAFE-INT-ON-NOT: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"
+// UNSAFE-INT-OFF: attributes [[ATTRS1]] = {{.*}} "amdgpu-unsafe-int-atomics"="false"
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 9c5b166c9652238..c379cf4dd64642f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -14975,6 +14975,14 @@ bool unsafeFPAtomicsDisabled(Function *F) {
          "true";
 }
 
+// The amdgpu-unsafe-int-atomics attribute disables generation of unsafe
+// integer atomic instructions. May generate more efficient code, but may
+// give incorrect results for certain memory destinations.
+bool unsafeIntAtomicsDisabled(Function *F) {
+  return F->getFnAttribute("amdgpu-unsafe-int-atomics").getValueAsString() ==
+         "false";
+}
+
 TargetLowering::AtomicExpansionKind
 SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
   unsigned AS = RMW->getPointerAddressSpace();
@@ -15097,6 +15105,13 @@ SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
     }
     break;
   }
+  case AtomicRMWInst::Xor: {
+    if (AMDGPU::isFlatGlobalAddrSpace(AS) &&
+        unsafeIntAtomicsDisabled(RMW->getFunction())) {
+      return AtomicExpansionKind::CmpXChg;
+    }
+    break;
+  }
   default:
     break;
   }
diff --git a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
index 92b70ad797b88ae..6cc97019d3ff5ce 100644
--- a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX908 %s
-; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX940 %s
-; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX1100 %s
+; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX908 %s
+; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX90A %s
+; RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX940 %s
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX1100 %s
 
 define float @syncscope_system(ptr %addr, float %val) #0 {
 ; GFX908-LABEL: syncscope_system:
@@ -391,4 +391,19 @@ define float @no_unsafe(ptr %addr, float %val) {
   ret float %res
 }
 
+define i32 @default_xor(ptr %addr, i32 %val) {
+  ; GCN-LABEL: default_xor:
+  ; GCN: flat_atomic_xor
+  %res = atomicrmw xor ptr %addr, i32 %val seq_cst
+  ret i32 %res
+}
+
+define i32 @no_unsafe_xor(ptr %addr, i32 %val) #1 {
+  ; GCN-LABEL: no_unsafe_xor:
+  ; GCN: flat_atomic_cmpswap
+  %res = atomicrmw xor ptr %addr, i32 %val seq_cst
+  ret i32 %res
+}
+
 attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
+attributes #1 = { "amdgpu-unsafe-int-atomics"="false" }

@pasaulais (Contributor, Author)

Ping @hdelan

@arsenm (Contributor) left a comment

I am opposed to adding any additional attributes for this. I very much want to remove the current unsafe-fp-atomics attribute. It's broken in the presence of inlining as it is, and at some point it acquired too many meanings. There should also be no reason to have separate FP and integer controls; there's nothing fundamental about the type here.

What we really need to do is have instruction-level metadata annotations, separating the FP restrictions from the communication restrictions (which may or may not be isomorphic to a different scope). I have an incomplete patch where I was working towards this.

@pasaulais (Contributor, Author)

Thanks for the review. I agree that function-level attributes are not ideal for solving this issue and instruction-level metadata would work better with things like inlining. Is the incomplete patch you mentioned something I could take on and complete?

Regarding int vs. floating-point, I'm afraid there is a need for toggling one independently of the other (or at least for special-casing operations like XOR that are not supported by PCIe 3.0). As the link I posted in the description mentions (see this comment: ROCm/ROCm#2481 (comment)), there are configurations where using FP atomics like add works whereas XOR doesn't, due to missing support in the PCIe 3.0 spec. I have reproduced this on a system with an RX 6700 XT GPU, where global_atomic_add_f32 works as expected using fine-grained allocations and global_atomic_xor doesn't.

@pasaulais requested a review from arsenm on October 30, 2023 11:11
@pasaulais (Contributor, Author)

@arsenm, could you share this unfinished patch you were working on? I could start from scratch but I don't want to duplicate the work you've already done.

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Nov 15, 2023
…f atomic builtin (#11876)

1. Fix a race condition in cmpxchg
2. Change `atomic_xor` to use a CAS loop instead of atomic builtin. This
is needed to merge this in UR
oneapi-src/unified-runtime#936 so that perf
regression can be fixed. The long-term fix is to use a compiler flag to
choose between the builtin and a safe CAS implementation, but upstream
discussions may take some time to figure out the details. See
llvm/llvm-project#69229
@pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 508e577 to 24b0e52 on December 4, 2023 18:11
@pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 24b0e52 to cd33e84 on December 4, 2023 18:12
@arsenm requested a review from yxsamliu on February 14, 2024 13:52
@arsenm (Contributor) commented Feb 14, 2024

We're gradually converging on something that looks like this, subject to bike shedding the name

@yxsamliu requested a review from b-sumner on February 14, 2024 15:21
@yxsamliu (Collaborator)

We're gradually converging on something that looks like this, subject to bike shedding the name

I did not know about this PR. It is interesting that our other discussions led to a similar solution.

I agree that per-instruction metadata is needed, and that the metadata should convey separate controls for FP atomics and for integer atomics that are unsupported across PCIe.

The reason to use per-instruction metadata instead of per-function metadata or an attribute is that a per-function attribute does not work well for inlined functions.

As for controlling the frontend's emission of this metadata, a finer-grained control (e.g. a pragma) would be desirable; however, for this patch, a clang option is probably sufficient. We could consider pragma control later.

We need to coin good, concise names for the metadata:

For unsafe FP atomics: unsafe_fp?

For integer atomics unsupported across PCIe: unsafe_pcie? fine_grained may not be suitable, since fine-grained memory accessed across XGMI does support integer atomic AND/OR/XOR ops, etc.

@b-sumner

We're aware of the concerns regarding atomics handling. We're trying to find the approach which is least distasteful to all parties. I expect we will address the concern raised here, but not necessarily in the same way.

@pasaulais (Contributor, Author)

Thanks for the comments @arsenm @yxsamliu @b-sumner.

By approaching a similar solution, do you mean MMRAs (#78569)?

If so, should I rebase/adapt my patch to the MMRA PR? Or will this PR be redundant and need closing?

@yxsamliu These concise names look good to me.

@Pierre-vh (Contributor)

Thanks for the comments @arsenm @yxsamliu @b-sumner.

By approaching a similar solution, do you mean MMRAs (#78569)?

If so, should I rebase/adapt my patch to the MMRA PR? Or will this PR be redundant and need closing?

@yxsamliu These concise names look good to me.

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?), but all of the boilerplate would be there to preserve this annotation until the backend. As long as you don't add any other atomic-lowering:<X> flag, you shouldn't have issues with compatibility relations; if you plan to add other flags like that, it may be an issue, because you could end up with two operations being considered incompatible (= they can be reordered) when they're still technically compatible.
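
Roughly, that would presumably look something like this at the IR level (a hypothetical tag name, using the prefix/suffix string-pair encoding from the MMRA patch):

%old = atomicrmw xor ptr %p, i32 %v seq_cst, !mmra !0

!0 = !{!"atomic-lowering", !"fine-grained"}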

@arsenm (Contributor) commented Feb 29, 2024

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?),

It is semantics. You get undefined behavior if you access fine-grained memory with an atomic where the instruction doesn't handle it. I don't think it quite fits into the MMRA box, since it isn't about the synchronization properties of the memory access but about where the underlying memory is.

@pasaulais (Contributor, Author)

In this case, MMRAs would only help in the sense that you won't need any new attributes and can just add an MMRA such as atomic-lowering:fine-grained. It's not really what MMRAs were made for (because this attribute doesn't affect semantics, just lowering style I think?),

It is semantics. You get undefined behavior if you access fine-grained memory with an atomic where the instruction doesn't handle it. I don't think it quite fits into the MMRA box, since it isn't about the synchronization properties of the memory access but about where the underlying memory is.

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

@arsenm (Contributor) commented Feb 29, 2024

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

Something like this, but the naming and direction of this PR are backwards. The default should be to assume that fine-grained memory is possible. We also have another orthogonal bit we need to track in addition to fine-grained memory. I was envisioning this as a single integer interpreted as bitfields of the two, but this uses different key:value metadata pairs. It should be named "amdgpu.no.something" to assert that the operation doesn't access fine-grained memory.
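
In other words, something along these lines, where the annotation asserts the faster assumption and unannotated atomics keep the conservative expansion (a sketch only, using a placeholder spelling of the "amdgpu.no.something" name):

; the annotation asserts this operation never touches fine-grained memory,
; so the backend may select the native atomic instruction
%old = atomicrmw xor ptr %p, i32 %v seq_cst, !amdgpu.no.fine.grained.memory !0

!0 = !{}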

arsenm added a commit to arsenm/llvm-project that referenced this pull request Mar 13, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to llvm#69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.

Could use a better name
@pasaulais (Contributor, Author)

@arsenm That makes sense, I don't think MMRA fits the fine-grained use case either. Does that mean we can stick with the approach from this PR? @b-sumner mentioned there was another similar approach being worked on.

Something like this, but the naming and direction of this PR are backwards. The default should be to assume that fine-grained memory is possible. We also have another orthogonal bit we need to track in addition to fine-grained memory. I was envisioning this as a single integer interpreted as bitfields of the two, but this uses different key:value metadata pairs. It should be named "amdgpu.no.something" to assert that the operation doesn't access fine-grained memory.

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

@pasaulais (Contributor, Author)

Also, this will need to be extended to other operations that can be impacted by fine-grained and remote access, like add, and, or, sub, fpadd and fpsub.

@arsenm (Contributor) commented Mar 15, 2024

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

Opting for fast-and-maybe-broken by default needs to be a frontend decision (i.e. the language can choose to add the metadata to all atomics). I believe @yxsamliu is going to be working on the frontend half, which will preserve the HIP behavior.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

I believe we need an additional piece of metadata (which I have another draft for) to express the relaxable floating-point behaviour. We can express the unsafe-fp-math option with the two combined, and can then drop the old IR attribute.

@yxsamliu (Collaborator)

@arsenm I agree that the default should be assuming fine-grained is possible. My thinking behind the original naming and direction was not wanting to introduce an unexpected performance regression by default. I'm happy for both to be changed, and this patch being rebased on top of #85052 once it is merged.

Opting for fast-and-maybe-broken by default needs to be a frontend decision (i.e. the language can choose to add the metadata to all atomics). I believe @yxsamliu is going to be working on the frontend half, which will preserve the HIP behavior.

One outstanding question for me, although outside of the scope of this PR, is how the original 'no-unsafe-fp' option will fit into the new metadata node in terms of constraints. Would it imply a new constraint or be covered by no_fine_grained and no_remote?

I believe we need an additional piece of metadata (which I have another draft for) to express the relaxable floating-point behaviour. We can express the unsafe-fp-math option with the two combined, and can then drop the old IR attribute.

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

@arsenm (Contributor) commented Mar 15, 2024

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

Yes, I would expect something like that for consistency. The most problematic part is the denormal flushing in some address spaces, and secondarily the fixed rounding mode in other cases

@yxsamliu (Collaborator)

Will the metadata for unsafe-fp-atomics also be controlled by the pragma that controls the no-fine-grained and no-remote metadata? e.g. something like

#pragma clang atomics begin no-fine-grained(on) no-remote(on) unsafe-fp(on)

Yes, I would expect something like that for consistency. The most problematic part is the denormal flushing in some address spaces, and secondarily the fixed rounding mode in other cases

We might document that those are not supported for now. If users really need them, we can introduce more controls to support them.

@arsenm (Contributor) commented Mar 15, 2024

We might document that those are not supported for now. If users really need them, we can introduce more controls to support them.

Sure. I think it's easiest to make progress if we fix the integer cases as a first step

hdelan added a commit to hdelan/llvm that referenced this pull request Apr 22, 2024
AMDGPU reflect pass is needed to choose between safe and unsafe atomics
at the libclc level. In the long run we will delete this patch as work
is being done to ensure correct lowering of atomic instructions. See
patches:

llvm/llvm-project#85052
llvm/llvm-project#69229

This work is necessary as malloc shared atomics rely on PCIe atomics
which can have patchy and unreliable support. We therefore want to be
able to choose at compile time whether we should use safe atomics using
CAS (which PCIe should support), or if we want to rely on the
availability of the newest PCIe atomics, if malloc shared atomics are
desired.

Also changes the implementation of Or, And so that they can choose
between the safe or unsafe version based on the AMDGPU reflect value.
ldrumm pushed a commit to intel/llvm that referenced this pull request Apr 24, 2024
… AMDGPU atomics (#11467)

AMDGPU reflect pass is needed to choose between safe and unsafe atomics
at the libclc level. In the long run we will delete this patch as work
is being done to ensure correct lowering of atomic instructions. See
patches:

llvm/llvm-project#85052
llvm/llvm-project#69229

This work is necessary as malloc shared atomics rely on PCIe atomics
which can have patchy and unreliable support. Therefore, we want to be
able to choose at compile time whether we should use safe atomics using
CAS (which PCIe should support), or if we want to rely on the
availability of the newest PCIe atomics, if malloc shared atomics are
desired.

Also changes the implementation of `atomic_or`, `atomic_and` so that
they
can choose between the safe or unsafe version based on the AMDGPU
reflect value.
arsenm added a commit that referenced this pull request Jul 10, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to #69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to llvm#69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.