[AMDGPU] Move kernarg preload logic to AMDGPU Attributor #123547


Conversation

kerbowa
Member

@kerbowa kerbowa commented Jan 20, 2025

Besides the changes listed below, everything works the same as it did when this code was in AMDGPULowerKernelArguments.

The free user SGPR tracking is refactored to be simpler and more accurate. We don't actually care which SGPRs hold which arguments before ISel, so the specific tracking of the number of free registers is removed. In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it. Even though inreg is meant to act as a hint, the coupling between the attribute and whether an argument is actually preloaded should now be equivalent, although ISel always makes the final determination.

Since we are no longer handling this in AMDGPULowerKernelArguments, that pass must rely on the inreg attribute to determine whether to leave arguments as-is. This leads to some test changes.

This lowering is moved out of the llc pipeline, which requires test updates.

Cloned function declarations are removed when kernel signatures are modified to preload hidden arguments.
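
As a rough sketch of the end state (hypothetical IR, not taken from the tests in this PR): explicit kernel arguments that fit in free user SGPRs are marked inreg, and when hidden arguments are preloaded the kernel is cloned with appended parameters carrying inreg and the "amdgpu-hidden-argument" attribute, which AMDGPULowerKernelArguments then leaves untouched.

  ; Hypothetical kernel after the attributor runs. The parameter name and the
  ; "amdgpu-hidden-argument" attribute follow the tables in this patch; the
  ; kernel body itself is made up for illustration.
  define amdgpu_kernel void @example(ptr addrspace(1) inreg %out, i32 inreg %x,
      i32 inreg "amdgpu-hidden-argument" %_hidden_block_count_x) {
    %sum = add i32 %x, %_hidden_block_count_x
    store i32 %sum, ptr addrspace(1) %out
    ret void
  }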

@llvmbot
Member

llvmbot commented Jan 20, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)


Patch is 242.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123547.diff

12 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+270-13)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp (+2-253)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd-with-ret.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-IR-lowering.ll (+554-90)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll (+116-477)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll (+926-399)
  • (removed) llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll (-263)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+13-15)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+4-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 546db318c17d53..d7c700f40c824b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -14,7 +14,9 @@
 #include "GCNSubtarget.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/Analysis/CycleAnalysis.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Target/TargetMachine.h"
@@ -144,6 +146,213 @@ static bool funcRequiresHostcallPtr(const Function &F) {
 }
 
 namespace {
+
+class PreloadKernelArgInfo {
+private:
+  Function &F;
+  const GCNSubtarget &ST;
+  unsigned NumFreeUserSGPRs;
+
+  enum HiddenArg : unsigned {
+    HIDDEN_BLOCK_COUNT_X,
+    HIDDEN_BLOCK_COUNT_Y,
+    HIDDEN_BLOCK_COUNT_Z,
+    HIDDEN_GROUP_SIZE_X,
+    HIDDEN_GROUP_SIZE_Y,
+    HIDDEN_GROUP_SIZE_Z,
+    HIDDEN_REMAINDER_X,
+    HIDDEN_REMAINDER_Y,
+    HIDDEN_REMAINDER_Z,
+    END_HIDDEN_ARGS
+  };
+
+  // Stores information about a specific hidden argument.
+  struct HiddenArgInfo {
+    // Offset in bytes from the location in the kernearg segment pointed to by
+    // the implicitarg pointer.
+    uint8_t Offset;
+    // The size of the hidden argument in bytes.
+    uint8_t Size;
+    // The name of the hidden argument in the kernel signature.
+    const char *Name;
+  };
+
+  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
+      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
+      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
+      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
+      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
+      {22, 2, "_hidden_remainder_z"}};
+
+  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
+    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
+      if (HiddenArgs[I].Offset == Offset)
+        return static_cast<HiddenArg>(I);
+
+    return END_HIDDEN_ARGS;
+  }
+
+  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS)
+      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
+
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  static const char *getHiddenArgName(HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS) {
+      return HiddenArgs[HA].Name;
+    }
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  // Clones the function after adding implicit arguments to the argument list
+  // and returns the new updated function. Preloaded implicit arguments are
+  // added up to and including the last one that will be preloaded, indicated by
+  // LastPreloadIndex. Currently preloading is only performed on the totality of
+  // sequential data from the kernarg segment including implicit (hidden)
+  // arguments. This means that all arguments up to the last preloaded argument
+  // will also be preloaded even if that data is unused.
+  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
+    FunctionType *FT = F.getFunctionType();
+    LLVMContext &Ctx = F.getParent()->getContext();
+    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
+    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
+      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
+
+    FunctionType *NFT =
+        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
+    Function *NF =
+        Function::Create(NFT, F.getLinkage(), F.getAddressSpace(), F.getName());
+
+    NF->copyAttributesFrom(&F);
+    NF->copyMetadata(&F, 0);
+    NF->setIsNewDbgInfoFormat(F.IsNewDbgInfoFormat);
+
+    F.getParent()->getFunctionList().insert(F.getIterator(), NF);
+    NF->takeName(&F);
+    NF->splice(NF->begin(), &F);
+
+    Function::arg_iterator NFArg = NF->arg_begin();
+    for (Argument &Arg : F.args()) {
+      Arg.replaceAllUsesWith(&*NFArg);
+      NFArg->takeName(&Arg);
+      ++NFArg;
+    }
+
+    AttrBuilder AB(Ctx);
+    AB.addAttribute(Attribute::InReg);
+    AB.addAttribute("amdgpu-hidden-argument");
+    AttributeList AL = NF->getAttributes();
+    for (unsigned I = 0; I <= LastPreloadIndex; ++I) {
+      AL = AL.addParamAttributes(Ctx, NFArg->getArgNo(), AB);
+      NFArg++->setName(getHiddenArgName(HiddenArg(I)));
+    }
+
+    NF->setAttributes(AL);
+    F.replaceAllUsesWith(NF);
+
+    return NF;
+  }
+
+public:
+  PreloadKernelArgInfo(Function &F, const GCNSubtarget &ST) : F(F), ST(ST) {
+    setInitialFreeUserSGPRsCount();
+  }
+
+  // Returns the maximum number of user SGPRs that we have available to preload
+  // arguments.
+  void setInitialFreeUserSGPRsCount() {
+    GCNUserSGPRUsageInfo UserSGPRInfo(F, ST);
+    NumFreeUserSGPRs = UserSGPRInfo.getNumFreeUserSGPRs();
+  }
+
+  bool canPreloadKernArgAtOffset(uint64_t ExplicitArgOffset) {
+    return ExplicitArgOffset <= NumFreeUserSGPRs * 4;
+  }
+
+  // Try to allocate SGPRs to preload hidden kernel arguments.
+  void
+  tryAllocHiddenArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
+                                SmallVectorImpl<Function *> &FunctionsToErase) {
+    Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
+        F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
+    if (!ImplicitArgPtr)
+      return;
+
+    const DataLayout &DL = F.getParent()->getDataLayout();
+    // Pair is the load and the load offset.
+    SmallVector<std::pair<LoadInst *, unsigned>, 4> ImplicitArgLoads;
+    for (auto *U : ImplicitArgPtr->users()) {
+      Instruction *CI = dyn_cast<Instruction>(U);
+      if (!CI || CI->getParent()->getParent() != &F)
+        continue;
+
+      for (auto *U : CI->users()) {
+        int64_t Offset = 0;
+        auto *Load = dyn_cast<LoadInst>(U); // Load from ImplicitArgPtr?
+        if (!Load) {
+          if (GetPointerBaseWithConstantOffset(U, Offset, DL) != CI)
+            continue;
+
+          Load = dyn_cast<LoadInst>(*U->user_begin()); // Load from GEP?
+        }
+
+        if (!Load || !Load->isSimple())
+          continue;
+
+        // FIXME: Expand handle merged loads.
+        LLVMContext &Ctx = F.getParent()->getContext();
+        Type *LoadTy = Load->getType();
+        HiddenArg HA = getHiddenArgFromOffset(Offset);
+        if (HA == END_HIDDEN_ARGS || LoadTy != getHiddenArgType(Ctx, HA))
+          continue;
+
+        ImplicitArgLoads.push_back(std::make_pair(Load, Offset));
+      }
+    }
+
+    if (ImplicitArgLoads.empty())
+      return;
+
+    // Allocate loads in order of offset. We need to be sure that the implicit
+    // argument can actually be preloaded.
+    std::sort(ImplicitArgLoads.begin(), ImplicitArgLoads.end(), less_second());
+
+    // If we fail to preload any implicit argument we know we don't have SGPRs
+    // to preload any subsequent ones with larger offsets. Find the first
+    // argument that we cannot preload.
+    auto *PreloadEnd =
+        std::find_if(ImplicitArgLoads.begin(), ImplicitArgLoads.end(),
+                     [&](const std::pair<LoadInst *, unsigned> &Load) {
+                       unsigned LoadSize =
+                           DL.getTypeStoreSize(Load.first->getType());
+                       unsigned LoadOffset = Load.second;
+                       if (!canPreloadKernArgAtOffset(LoadOffset + LoadSize +
+                                                      ImplicitArgsBaseOffset))
+                         return true;
+
+                       return false;
+                     });
+
+    if (PreloadEnd == ImplicitArgLoads.begin())
+      return;
+
+    unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
+    Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
+    assert(NF);
+    FunctionsToErase.push_back(&F);
+    for (const auto *I = ImplicitArgLoads.begin(); I != PreloadEnd; ++I) {
+      LoadInst *LoadInst = I->first;
+      unsigned LoadOffset = I->second;
+      unsigned HiddenArgIndex = getHiddenArgFromOffset(LoadOffset);
+      unsigned Index = NF->arg_size() - LastHiddenArgIndex + HiddenArgIndex - 1;
+      Argument *Arg = NF->getArg(Index);
+      LoadInst->replaceAllUsesWith(Arg);
+    }
+  }
+};
+
 class AMDGPUInformationCache : public InformationCache {
 public:
   AMDGPUInformationCache(const Module &M, AnalysisGetter &AG,
@@ -1314,19 +1523,64 @@ struct AAAMDGPUNoAGPR
 
 const char AAAMDGPUNoAGPR::ID = 0;
 
-static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
-  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
-    Argument &Arg = *F.getArg(I);
-    // Check for incompatible attributes.
-    if (Arg.hasByRefAttr() || Arg.hasNestAttr())
-      break;
+static void markKernelArgsAsInreg(SetVector<Function *> &Functions,
+                                  TargetMachine &TM) {
+  SmallVector<Function *, 4> FunctionsToErase;
+  for (auto *F : Functions) {
+    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(*F);
+    if (!ST.hasKernargPreload() ||
+        F->getCallingConv() != CallingConv::AMDGPU_KERNEL || F->arg_empty())
+      continue;
+
+    PreloadKernelArgInfo PreloadInfo(*F, ST);
+    uint64_t ExplicitArgOffset = 0;
+    const DataLayout &DL = F->getDataLayout();
+    const uint64_t BaseOffset = ST.getExplicitKernelArgOffset();
+    unsigned NumPreloadsRequested = KernargPreloadCount;
+    unsigned NumPreloadedExplicitArgs = 0;
+    for (Argument &Arg : F->args()) {
+      // Avoid incompatible attributes and guard against running this pass
+      // twice.
+      if (Arg.hasByRefAttr() || Arg.hasNestAttr() ||
+          Arg.hasAttribute("amdgpu-hidden-argument"))
+        break;
+
+      // Inreg may be pre-existing on some arguments, try to preload these.
+      if (NumPreloadsRequested == 0 && !Arg.hasInRegAttr())
+        break;
+
+      // FIXME: Preload aggregates.
+      if (Arg.getType()->isAggregateType())
+        break;
+
+      Type *ArgTy = Arg.getType();
+      Align ABITypeAlign = DL.getABITypeAlign(ArgTy);
+      uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
+      ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
+      if (!PreloadInfo.canPreloadKernArgAtOffset(ExplicitArgOffset))
+        break;
+
+      Arg.addAttr(Attribute::InReg);
+      NumPreloadedExplicitArgs++;
+      if (NumPreloadsRequested > 0)
+        NumPreloadsRequested--;
+    }
 
-    Arg.addAttr(Attribute::InReg);
+    // Only try preloading hidden arguments if we can successfully preload the
+    // last explicit argument.
+    if (NumPreloadedExplicitArgs == F->arg_size()) {
+      uint64_t ImplicitArgsBaseOffset =
+          alignTo(ExplicitArgOffset, ST.getAlignmentForImplicitArgPtr()) +
+          BaseOffset;
+      PreloadInfo.tryAllocHiddenArgPreloadSGPRs(ImplicitArgsBaseOffset,
+                                                FunctionsToErase);
+    }
   }
+
+  // Erase cloned functions if we needed to update the kernel signature to
+  // support preloading hidden kernel arguments.
+  for (auto *F : FunctionsToErase)
+    F->eraseFromParent();
 }
 
 static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
@@ -1378,8 +1632,6 @@ static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
     if (!AMDGPU::isEntryFunctionCC(CC)) {
       A.getOrCreateAAFor<AAAMDFlatWorkGroupSize>(IRPosition::function(*F));
       A.getOrCreateAAFor<AAAMDWavesPerEU>(IRPosition::function(*F));
-    } else if (CC == CallingConv::AMDGPU_KERNEL) {
-      addPreloadKernArgHint(*F, TM);
     }
 
     for (auto &I : instructions(F)) {
@@ -1400,6 +1652,11 @@ static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
   }
 
   ChangeStatus Change = A.run();
+
+  // Mark kernel arguments with 'inreg' attribute to indicate that they should
+  // be preloaded into SGPRs.
+  markKernelArgsAsInreg(Functions, TM);
+
   return Change == ChangeStatus::CHANGED;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
index e9d009baa20af2..7f6c5b4b476d8e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
@@ -27,230 +27,6 @@ using namespace llvm;
 
 namespace {
 
-class PreloadKernelArgInfo {
-private:
-  Function &F;
-  const GCNSubtarget &ST;
-  unsigned NumFreeUserSGPRs;
-
-  enum HiddenArg : unsigned {
-    HIDDEN_BLOCK_COUNT_X,
-    HIDDEN_BLOCK_COUNT_Y,
-    HIDDEN_BLOCK_COUNT_Z,
-    HIDDEN_GROUP_SIZE_X,
-    HIDDEN_GROUP_SIZE_Y,
-    HIDDEN_GROUP_SIZE_Z,
-    HIDDEN_REMAINDER_X,
-    HIDDEN_REMAINDER_Y,
-    HIDDEN_REMAINDER_Z,
-    END_HIDDEN_ARGS
-  };
-
-  // Stores information about a specific hidden argument.
-  struct HiddenArgInfo {
-    // Offset in bytes from the location in the kernearg segment pointed to by
-    // the implicitarg pointer.
-    uint8_t Offset;
-    // The size of the hidden argument in bytes.
-    uint8_t Size;
-    // The name of the hidden argument in the kernel signature.
-    const char *Name;
-  };
-
-  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
-      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
-      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
-      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
-      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
-      {22, 2, "_hidden_remainder_z"}};
-
-  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
-    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
-      if (HiddenArgs[I].Offset == Offset)
-        return static_cast<HiddenArg>(I);
-
-    return END_HIDDEN_ARGS;
-  }
-
-  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS)
-      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
-
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  static const char *getHiddenArgName(HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS) {
-      return HiddenArgs[HA].Name;
-    }
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  // Clones the function after adding implicit arguments to the argument list
-  // and returns the new updated function. Preloaded implicit arguments are
-  // added up to and including the last one that will be preloaded, indicated by
-  // LastPreloadIndex. Currently preloading is only performed on the totality of
-  // sequential data from the kernarg segment including implicit (hidden)
-  // arguments. This means that all arguments up to the last preloaded argument
-  // will also be preloaded even if that data is unused.
-  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
-    FunctionType *FT = F.getFunctionType();
-    LLVMContext &Ctx = F.getParent()->getContext();
-    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
-      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
-
-    FunctionType *NFT =
-        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
-    Function *NF =
-        Function::Create(NFT, F.getLinkage(), F.getAddressSpace(), F.getName());
-
-    NF->copyAttributesFrom(&F);
-    NF->copyMetadata(&F, 0);
-    NF->setIsNewDbgInfoFormat(F.IsNewDbgInfoFormat);
-
-    F.getParent()->getFunctionList().insert(F.getIterator(), NF);
-    NF->takeName(&F);
-    NF->splice(NF->begin(), &F);
-
-    Function::arg_iterator NFArg = NF->arg_begin();
-    for (Argument &Arg : F.args()) {
-      Arg.replaceAllUsesWith(&*NFArg);
-      NFArg->takeName(&Arg);
-      ++NFArg;
-    }
-
-    AttrBuilder AB(Ctx);
-    AB.addAttribute(Attribute::InReg);
-    AB.addAttribute("amdgpu-hidden-argument");
-    AttributeList AL = NF->getAttributes();
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I) {
-      AL = AL.addParamAttributes(Ctx, NFArg->getArgNo(), AB);
-      NFArg++->setName(getHiddenArgName(HiddenArg(I)));
-    }
-
-    NF->setAttributes(AL);
-    F.replaceAllUsesWith(NF);
-    F.setCallingConv(CallingConv::C);
-
-    return NF;
-  }
-
-public:
-  PreloadKernelArgInfo(Function &F, const GCNSubtarget &ST) : F(F), ST(ST) {
-    setInitialFreeUserSGPRsCount();
-  }
-
-  // Returns the maximum number of user SGPRs that we have available to preload
-  // arguments.
-  void setInitialFreeUserSGPRsCount() {
-    GCNUserSGPRUsageInfo UserSGPRInfo(F, ST);
-    NumFreeUserSGPRs = UserSGPRInfo.getNumFreeUserSGPRs();
-  }
-
-  bool tryAllocPreloadSGPRs(unsigned AllocSize, uint64_t ArgOffset,
-                            uint64_t LastExplicitArgOffset) {
-    //  Check if this argument may be loaded into the same register as the
-    //  previous argument.
-    if (ArgOffset - LastExplicitArgOffset < 4 &&
-        !isAligned(Align(4), ArgOffset))
-      return true;
-
-    // Pad SGPRs for kernarg alignment.
-    ArgOffset = alignDown(ArgOffset, 4);
-    unsigned Padding = ArgOffset - LastExplicitArgOffset;
-    unsigned PaddingSGPRs = alignTo(Padding, 4) / 4;
-    unsigned NumPreloadSGPRs = alignTo(AllocSize, 4) / 4;
-    if (NumPreloadSGPRs + PaddingSGPRs > NumFreeUserSGPRs)
-      return false;
-
-    NumFreeUserSGPRs -= (NumPreloadSGPRs + PaddingSGPRs);
-    return true;
-  }
-
-  // Try to allocate SGPRs to preload implicit kernel arguments.
-  void tryAllocImplicitArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
-                                       uint64_t LastExplicitArgOffset,
-                                       IRBuilder<> &Builder) {
-    Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
-        F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
-    if (!ImplicitArgPtr)
-      return;
-
-    const DataLayout &DL = F.getParent()->getDataLayout();
-    // Pair is the load and the load offset.
-    SmallVector<std::pair<LoadInst *, unsigned>, 4> ImplicitArgLoads;
-    for (auto *U : ImplicitArgPtr->users()) {
-      Instruction *CI = dyn_cast<Instruction>(U);
-      if (!CI || CI->getParent()->getParent() != &F)
-        continue;
-
-      for (auto *U : CI->users()) {
-        int64_t Offset = 0;
-        auto *Load = dyn_cast<LoadInst>(U); // Load from ImplicitArgPtr?
-        if (!Load) {
-          if (GetPointerBaseWithConstantOffset(U, Offset, DL) != CI)
-            continue;
-
-          Load = dyn_cast<LoadInst>(*U->user_begin()); // Load from GEP?
-        }
-
-        if (!Load || !Load->isSimple())
-          continue;
-
-        // FIXME: Expand to handle 64-bit implicit args and large merged loads.
-        LLVMContext &Ctx = F.getParent()->getContext();
-        Type *LoadTy = Load->getType();
-        HiddenArg HA = getHiddenArgFromOffset(Offset);
-        if (HA == END_HIDDEN_ARGS || LoadTy != getHiddenArgType(Ctx, HA))
-          continue;
-
-        ImplicitArgLoads.push_back(std::make_pair(Load, Offset));
-      }
-    }
-
-    if (ImplicitArgLoads.empty())
-      return;
-
-    // Allocate loads in order of offset. We need to be sure that the implicit
-    // argument can actually be preloaded.
-    std::sort(ImplicitArgLoads.begin(), ImplicitArgLoads.end(), less_second());
-
-    // If we fail to preload any implicit argument we know we don't have SGPRs
-    // to preload any subsequent ones with larger offsets. Find the first
-    // argument that we cannot preload.
-    auto *PreloadEnd = std::find_if(
-        ImplicitArgLoads.begin(), ImplicitArgLoads.end(),
-        [&](const std::pair<LoadInst *, unsigned> &Load) {
-          unsigned LoadSize = DL.getTypeStoreSize(Load.first->getType());
-          unsigned LoadOffset = Load.second;
-          if (!tryAllocPreloadSGPRs(LoadSize,
-                                    LoadOffset + ImplicitArgsBaseOffset,
-                                    LastExplicitArgOffset))
-            return true;
-
-          LastExplicitArgOffset =
-              ImplicitArgsBaseOffset + LoadOffset + LoadSize;
-          return false;
-        });
-
-    if (PreloadEnd == ImplicitArgLoads.begin())
-      return;
-
-    unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
-    Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
-    assert(NF);
-    for (const auto *I = ImplicitArgLoads.beg...
[truncated]

@arsenm
Contributor

arsenm commented Jan 21, 2025

In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it.

Which case is this? Getting the exact number of arguments at all points should not be difficult

for (auto *F : Functions) {
const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(*F);
if (!ST.hasKernargPreload() ||
F->getCallingConv() != CallingConv::AMDGPU_KERNEL || F->arg_empty())
Contributor

Not really worth handling arg_empty as a special case; it should fall out naturally and never happens in practice

if (Arg.hasByRefAttr() || Arg.hasNestAttr() ||
Arg.hasAttribute("amdgpu-hidden-argument"))
break;

Contributor

Should add a TODO to handle byref; in the future all kernel arguments should be byref
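
For reference, a hedged sketch (hypothetical kernel, not from this patch) of what a byref kernel argument looks like in IR; arguments like this are currently skipped by the inreg marking loop above:

  ; Aggregate-style kernel arguments can be passed byref as a constant
  ; address space pointer into the kernarg segment.
  define amdgpu_kernel void @byref_example(ptr addrspace(4) byref(i32) align 4 %in) {
    %v = load i32, ptr addrspace(4) %in, align 4
    ret void
  }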

Comment on lines +1557 to +1561
Align ABITypeAlign = DL.getABITypeAlign(ArgTy);
uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
Contributor

You probably should be using the TargetLowering functions the calling convention lowering uses to see how these types will really be processed

Member Author

I think the way I'm doing it is what we want, since AMDGPUCallLowering::lowerFormalArgumentsKernel and the HSA metadata streamer calculate the argument offsets this same way, so it should be consistent with how these types are actually processed.

NFC. Since the handling will be moved to the middle end, the tests need to be reorganized. The IR tests are expanded since end-to-end testing is no longer baked in.
Member Author

@kerbowa kerbowa left a comment

In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it.

Which case is this? Getting the exact number of arguments at all points should not be difficult

half_v7bfloat_kernel_preload_arg is an example. The difference between the IR tracking and ISel is that ISel always allocates in order, whereas in IR we would sometimes check whether we could fit an arbitrarily located argument, which had bugs around arguments in previously allocated SGPRs and around the alignment of the hidden arguments. Looking only at the offsets, in the same way the HSA metadata is emitted, avoids all of this, since we don't need to track the specific registers that arguments are preloaded into in IR to determine what ISel will be able to preload.
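
To make the offset-only check concrete, here is a hypothetical layout (not the actual half_v7bfloat_kernel_preload_arg test) with offsets accumulated the same way the HSA metadata streamer does: align the running offset to each argument's ABI type alignment, then add its alloc size. Arguments are preloadable while the end offset stays within NumFreeUserSGPRs * 4 bytes.

  ; Hypothetical layout illustration (types and numbers assumed, not from the test):
  ;   %out : ptr addrspace(1) -> align 8, size 8 -> bytes [0, 8)
  ;   %x   : i32              -> align 4, size 4 -> bytes [8, 12)
  ;   %y   : i64              -> align 8, size 8 -> padded to 16, bytes [16, 24)
  ; The last explicit argument ends at byte 24, so all three are preloadable as
  ; long as 24 <= NumFreeUserSGPRs * 4, i.e. at least 6 user SGPRs are free.
  ; No per-register bookkeeping is needed in the IR pass.
  define amdgpu_kernel void @layout_example(ptr addrspace(1) %out, i32 %x, i64 %y) {
    ret void
  }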

@shiltian
Contributor

shiltian commented Feb 7, 2025

Why do we want to move it to this file? I'm not sure this is the right place. I think this file is only for things that run as part of the attributor framework.

@kerbowa
Member Author

kerbowa commented Feb 7, 2025

Why do we want to move it to this file? I'm not sure if this is the good place. I think this file is only for things that run part of the attributor framework.

It needs to be part of a module pass since it does things like cloning and deleting functions. I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

@shiltian
Contributor

shiltian commented Feb 7, 2025

I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

I remember that PR. I think that should be the way to go.

IIUC, @arsenm did mention in #101609 (which is indeed something about inreg) that we can probably move the logic there, but at that time I thought it would run as part of the attributor iterations. I'm not sure it is a good idea to pile everything into this file.


// Mark kernel arguments with 'inreg' attribute to indicate that they should
// be preloaded into SGPRs.
markKernelArgsAsInreg(Functions, TM);
Contributor

This doesn't get reflected in whether the pass reports a change or not.

@kerbowa
Member Author

kerbowa commented Feb 7, 2025

I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

I remember that PR. I think that should be the way to go.

IIUC, @arsenm did mention in #101609 (which is indeed something about inreg) that we can probably move the logic there, but at that time I thought it would run as part of the attributor iterations. I'm not sure it is a good idea to pile everything into this file.

See also this comment here: #112790 (comment)

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

Right now this needs to run after the attributor because the number of free user SGPRs depends on attributes on kernels calculated here.

Which parts are you thinking should move? Because if it goes back to AMDGPULowerKernelArguments, I think everything should move there, including the kernel inreg stuff.

@shiltian
Contributor

Which parts are you thinking should move

IMHO things that don't run as part of attributor should not stay in this file.

I think everything should move there including the kernel inreg stuff?

Yes, because for now it is not running as part of attributor iterations.

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

For now, probably add a TODO mentioning that some logic can be moved to the attributor. I assume we can move it after #101609 is landed.

If I read @arsenm's comment in #112790 correctly, half of the pass will go away after #115821 is landed, and the preload stuff is not really related to the pass itself. Moving it out to a separate module pass is an option.

@kerbowa
Member Author

kerbowa commented Feb 11, 2025

Which parts are you thinking should move

IMHO things that don't run as part of attributor should not stay in this file.

I think everything should move there including the kernel inreg stuff?

Yes, because for now it is not running as part of attributor iterations.

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

For now, probably add a TODO mentioning that some logic can be moved to the attributor. I assume we can move it after #101609 is landed.

If I read @arsenm's comment in #112790 correctly, half of the pass will go away after #115821 is landed, and the preload stuff is not really related to the pass itself. Moving it out to a separate module pass is an option.

My only concern is that some of the logic calculating offsets and such would need to be duplicated if some functionality did end up moving back to the attributor. I could create a common interface though. I would like Matt's opinion.
