[AMDGPU] Move kernarg preload logic to AMDGPU Attributor #123547


Conversation

kerbowa
Member

@kerbowa kerbowa commented Jan 20, 2025

Besides the changes listed below, everything works the same as it did when this code was in AMDGPULowerKernelArguments.

The free user SGPR tracking is refactored to be simpler and more accurate. We don't actually care which SGPRs hold which arguments before ISel, so the specific tracking of the number of free registers is removed. In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it. Even though inreg is meant to act as a hint, the coupling between the attribute and whether an argument is actually preloaded should now be equivalent, although ISel always makes the final determination.

Since we are no longer handling this in AMDGPULowerKernelArguments, that pass must rely on the inreg attribute to determine whether to leave arguments as-is. This leads to some test changes.

This lowering is moved out of the llc pipeline, which requires test updates.

Cloned function declarations are removed when kernel signatures are modified to preload hidden arguments.
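
As a rough sketch of the end state (hypothetical IR, not taken from the tests in this PR): explicit kernel arguments that fit in free user SGPRs are marked inreg, and when hidden arguments are preloaded the kernel is cloned with appended parameters carrying inreg and the "amdgpu-hidden-argument" attribute, which AMDGPULowerKernelArguments then leaves untouched.

  ; Hypothetical kernel after the attributor runs. The parameter name and the
  ; "amdgpu-hidden-argument" attribute follow the tables in this patch; the
  ; kernel body itself is made up for illustration.
  define amdgpu_kernel void @example(ptr addrspace(1) inreg %out, i32 inreg %x,
      i32 inreg "amdgpu-hidden-argument" %_hidden_block_count_x) {
    %sum = add i32 %x, %_hidden_block_count_x
    store i32 %sum, ptr addrspace(1) %out
    ret void
  }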

@llvmbot
Member

llvmbot commented Jan 20, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)


Patch is 242.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123547.diff

12 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+270-13)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp (+2-253)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd-with-ret.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-IR-lowering.ll (+554-90)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll (+116-477)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll (+926-399)
  • (removed) llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll (-263)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+13-15)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+4-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 546db318c17d53..d7c700f40c824b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -14,7 +14,9 @@
 #include "GCNSubtarget.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/Analysis/CycleAnalysis.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Target/TargetMachine.h"
@@ -144,6 +146,213 @@ static bool funcRequiresHostcallPtr(const Function &F) {
 }
 
 namespace {
+
+class PreloadKernelArgInfo {
+private:
+  Function &F;
+  const GCNSubtarget &ST;
+  unsigned NumFreeUserSGPRs;
+
+  enum HiddenArg : unsigned {
+    HIDDEN_BLOCK_COUNT_X,
+    HIDDEN_BLOCK_COUNT_Y,
+    HIDDEN_BLOCK_COUNT_Z,
+    HIDDEN_GROUP_SIZE_X,
+    HIDDEN_GROUP_SIZE_Y,
+    HIDDEN_GROUP_SIZE_Z,
+    HIDDEN_REMAINDER_X,
+    HIDDEN_REMAINDER_Y,
+    HIDDEN_REMAINDER_Z,
+    END_HIDDEN_ARGS
+  };
+
+  // Stores information about a specific hidden argument.
+  struct HiddenArgInfo {
+    // Offset in bytes from the location in the kernearg segment pointed to by
+    // the implicitarg pointer.
+    uint8_t Offset;
+    // The size of the hidden argument in bytes.
+    uint8_t Size;
+    // The name of the hidden argument in the kernel signature.
+    const char *Name;
+  };
+
+  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
+      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
+      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
+      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
+      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
+      {22, 2, "_hidden_remainder_z"}};
+
+  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
+    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
+      if (HiddenArgs[I].Offset == Offset)
+        return static_cast<HiddenArg>(I);
+
+    return END_HIDDEN_ARGS;
+  }
+
+  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS)
+      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
+
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  static const char *getHiddenArgName(HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS) {
+      return HiddenArgs[HA].Name;
+    }
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  // Clones the function after adding implicit arguments to the argument list
+  // and returns the new updated function. Preloaded implicit arguments are
+  // added up to and including the last one that will be preloaded, indicated by
+  // LastPreloadIndex. Currently preloading is only performed on the totality of
+  // sequential data from the kernarg segment including implicit (hidden)
+  // arguments. This means that all arguments up to the last preloaded argument
+  // will also be preloaded even if that data is unused.
+  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
+    FunctionType *FT = F.getFunctionType();
+    LLVMContext &Ctx = F.getParent()->getContext();
+    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
+    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
+      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
+
+    FunctionType *NFT =
+        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
+    Function *NF =
+        Function::Create(NFT, F.getLinkage(), F.getAddressSpace(), F.getName());
+
+    NF->copyAttributesFrom(&F);
+    NF->copyMetadata(&F, 0);
+    NF->setIsNewDbgInfoFormat(F.IsNewDbgInfoFormat);
+
+    F.getParent()->getFunctionList().insert(F.getIterator(), NF);
+    NF->takeName(&F);
+    NF->splice(NF->begin(), &F);
+
+    Function::arg_iterator NFArg = NF->arg_begin();
+    for (Argument &Arg : F.args()) {
+      Arg.replaceAllUsesWith(&*NFArg);
+      NFArg->takeName(&Arg);
+      ++NFArg;
+    }
+
+    AttrBuilder AB(Ctx);
+    AB.addAttribute(Attribute::InReg);
+    AB.addAttribute("amdgpu-hidden-argument");
+    AttributeList AL = NF->getAttributes();
+    for (unsigned I = 0; I <= LastPreloadIndex; ++I) {
+      AL = AL.addParamAttributes(Ctx, NFArg->getArgNo(), AB);
+      NFArg++->setName(getHiddenArgName(HiddenArg(I)));
+    }
+
+    NF->setAttributes(AL);
+    F.replaceAllUsesWith(NF);
+
+    return NF;
+  }
+
+public:
+  PreloadKernelArgInfo(Function &F, const GCNSubtarget &ST) : F(F), ST(ST) {
+    setInitialFreeUserSGPRsCount();
+  }
+
+  // Returns the maximum number of user SGPRs that we have available to preload
+  // arguments.
+  void setInitialFreeUserSGPRsCount() {
+    GCNUserSGPRUsageInfo UserSGPRInfo(F, ST);
+    NumFreeUserSGPRs = UserSGPRInfo.getNumFreeUserSGPRs();
+  }
+
+  bool canPreloadKernArgAtOffset(uint64_t ExplicitArgOffset) {
+    return ExplicitArgOffset <= NumFreeUserSGPRs * 4;
+  }
+
+  // Try to allocate SGPRs to preload hidden kernel arguments.
+  void
+  tryAllocHiddenArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
+                                SmallVectorImpl<Function *> &FunctionsToErase) {
+    Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
+        F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
+    if (!ImplicitArgPtr)
+      return;
+
+    const DataLayout &DL = F.getParent()->getDataLayout();
+    // Pair is the load and the load offset.
+    SmallVector<std::pair<LoadInst *, unsigned>, 4> ImplicitArgLoads;
+    for (auto *U : ImplicitArgPtr->users()) {
+      Instruction *CI = dyn_cast<Instruction>(U);
+      if (!CI || CI->getParent()->getParent() != &F)
+        continue;
+
+      for (auto *U : CI->users()) {
+        int64_t Offset = 0;
+        auto *Load = dyn_cast<LoadInst>(U); // Load from ImplicitArgPtr?
+        if (!Load) {
+          if (GetPointerBaseWithConstantOffset(U, Offset, DL) != CI)
+            continue;
+
+          Load = dyn_cast<LoadInst>(*U->user_begin()); // Load from GEP?
+        }
+
+        if (!Load || !Load->isSimple())
+          continue;
+
+        // FIXME: Expand handle merged loads.
+        LLVMContext &Ctx = F.getParent()->getContext();
+        Type *LoadTy = Load->getType();
+        HiddenArg HA = getHiddenArgFromOffset(Offset);
+        if (HA == END_HIDDEN_ARGS || LoadTy != getHiddenArgType(Ctx, HA))
+          continue;
+
+        ImplicitArgLoads.push_back(std::make_pair(Load, Offset));
+      }
+    }
+
+    if (ImplicitArgLoads.empty())
+      return;
+
+    // Allocate loads in order of offset. We need to be sure that the implicit
+    // argument can actually be preloaded.
+    std::sort(ImplicitArgLoads.begin(), ImplicitArgLoads.end(), less_second());
+
+    // If we fail to preload any implicit argument we know we don't have SGPRs
+    // to preload any subsequent ones with larger offsets. Find the first
+    // argument that we cannot preload.
+    auto *PreloadEnd =
+        std::find_if(ImplicitArgLoads.begin(), ImplicitArgLoads.end(),
+                     [&](const std::pair<LoadInst *, unsigned> &Load) {
+                       unsigned LoadSize =
+                           DL.getTypeStoreSize(Load.first->getType());
+                       unsigned LoadOffset = Load.second;
+                       if (!canPreloadKernArgAtOffset(LoadOffset + LoadSize +
+                                                      ImplicitArgsBaseOffset))
+                         return true;
+
+                       return false;
+                     });
+
+    if (PreloadEnd == ImplicitArgLoads.begin())
+      return;
+
+    unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
+    Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
+    assert(NF);
+    FunctionsToErase.push_back(&F);
+    for (const auto *I = ImplicitArgLoads.begin(); I != PreloadEnd; ++I) {
+      LoadInst *LoadInst = I->first;
+      unsigned LoadOffset = I->second;
+      unsigned HiddenArgIndex = getHiddenArgFromOffset(LoadOffset);
+      unsigned Index = NF->arg_size() - LastHiddenArgIndex + HiddenArgIndex - 1;
+      Argument *Arg = NF->getArg(Index);
+      LoadInst->replaceAllUsesWith(Arg);
+    }
+  }
+};
+
 class AMDGPUInformationCache : public InformationCache {
 public:
   AMDGPUInformationCache(const Module &M, AnalysisGetter &AG,
@@ -1314,19 +1523,64 @@ struct AAAMDGPUNoAGPR
 
 const char AAAMDGPUNoAGPR::ID = 0;
 
-static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
-  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
-    Argument &Arg = *F.getArg(I);
-    // Check for incompatible attributes.
-    if (Arg.hasByRefAttr() || Arg.hasNestAttr())
-      break;
+static void markKernelArgsAsInreg(SetVector<Function *> &Functions,
+                                  TargetMachine &TM) {
+  SmallVector<Function *, 4> FunctionsToErase;
+  for (auto *F : Functions) {
+    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(*F);
+    if (!ST.hasKernargPreload() ||
+        F->getCallingConv() != CallingConv::AMDGPU_KERNEL || F->arg_empty())
+      continue;
+
+    PreloadKernelArgInfo PreloadInfo(*F, ST);
+    uint64_t ExplicitArgOffset = 0;
+    const DataLayout &DL = F->getDataLayout();
+    const uint64_t BaseOffset = ST.getExplicitKernelArgOffset();
+    unsigned NumPreloadsRequested = KernargPreloadCount;
+    unsigned NumPreloadedExplicitArgs = 0;
+    for (Argument &Arg : F->args()) {
+      // Avoid incompatible attributes and guard against running this pass
+      // twice.
+      if (Arg.hasByRefAttr() || Arg.hasNestAttr() ||
+          Arg.hasAttribute("amdgpu-hidden-argument"))
+        break;
+
+      // Inreg may be pre-existing on some arguments, try to preload these.
+      if (NumPreloadsRequested == 0 && !Arg.hasInRegAttr())
+        break;
+
+      // FIXME: Preload aggregates.
+      if (Arg.getType()->isAggregateType())
+        break;
+
+      Type *ArgTy = Arg.getType();
+      Align ABITypeAlign = DL.getABITypeAlign(ArgTy);
+      uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
+      ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
+      if (!PreloadInfo.canPreloadKernArgAtOffset(ExplicitArgOffset))
+        break;
+
+      Arg.addAttr(Attribute::InReg);
+      NumPreloadedExplicitArgs++;
+      if (NumPreloadsRequested > 0)
+        NumPreloadsRequested--;
+    }
 
-    Arg.addAttr(Attribute::InReg);
+    // Only try preloading hidden arguments if we can successfully preload the
+    // last explicit argument.
+    if (NumPreloadedExplicitArgs == F->arg_size()) {
+      uint64_t ImplicitArgsBaseOffset =
+          alignTo(ExplicitArgOffset, ST.getAlignmentForImplicitArgPtr()) +
+          BaseOffset;
+      PreloadInfo.tryAllocHiddenArgPreloadSGPRs(ImplicitArgsBaseOffset,
+                                                FunctionsToErase);
+    }
   }
+
+  // Erase cloned functions if we needed to update the kernel signature to
+  // support preloading hidden kernel arguments.
+  for (auto *F : FunctionsToErase)
+    F->eraseFromParent();
 }
 
 static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
@@ -1378,8 +1632,6 @@ static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
     if (!AMDGPU::isEntryFunctionCC(CC)) {
       A.getOrCreateAAFor<AAAMDFlatWorkGroupSize>(IRPosition::function(*F));
       A.getOrCreateAAFor<AAAMDWavesPerEU>(IRPosition::function(*F));
-    } else if (CC == CallingConv::AMDGPU_KERNEL) {
-      addPreloadKernArgHint(*F, TM);
     }
 
     for (auto &I : instructions(F)) {
@@ -1400,6 +1652,11 @@ static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
   }
 
   ChangeStatus Change = A.run();
+
+  // Mark kernel arguments with 'inreg' attribute to indicate that they should
+  // be preloaded into SGPRs.
+  markKernelArgsAsInreg(Functions, TM);
+
   return Change == ChangeStatus::CHANGED;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
index e9d009baa20af2..7f6c5b4b476d8e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
@@ -27,230 +27,6 @@ using namespace llvm;
 
 namespace {
 
-class PreloadKernelArgInfo {
-private:
-  Function &F;
-  const GCNSubtarget &ST;
-  unsigned NumFreeUserSGPRs;
-
-  enum HiddenArg : unsigned {
-    HIDDEN_BLOCK_COUNT_X,
-    HIDDEN_BLOCK_COUNT_Y,
-    HIDDEN_BLOCK_COUNT_Z,
-    HIDDEN_GROUP_SIZE_X,
-    HIDDEN_GROUP_SIZE_Y,
-    HIDDEN_GROUP_SIZE_Z,
-    HIDDEN_REMAINDER_X,
-    HIDDEN_REMAINDER_Y,
-    HIDDEN_REMAINDER_Z,
-    END_HIDDEN_ARGS
-  };
-
-  // Stores information about a specific hidden argument.
-  struct HiddenArgInfo {
-    // Offset in bytes from the location in the kernearg segment pointed to by
-    // the implicitarg pointer.
-    uint8_t Offset;
-    // The size of the hidden argument in bytes.
-    uint8_t Size;
-    // The name of the hidden argument in the kernel signature.
-    const char *Name;
-  };
-
-  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
-      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
-      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
-      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
-      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
-      {22, 2, "_hidden_remainder_z"}};
-
-  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
-    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
-      if (HiddenArgs[I].Offset == Offset)
-        return static_cast<HiddenArg>(I);
-
-    return END_HIDDEN_ARGS;
-  }
-
-  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS)
-      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
-
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  static const char *getHiddenArgName(HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS) {
-      return HiddenArgs[HA].Name;
-    }
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  // Clones the function after adding implicit arguments to the argument list
-  // and returns the new updated function. Preloaded implicit arguments are
-  // added up to and including the last one that will be preloaded, indicated by
-  // LastPreloadIndex. Currently preloading is only performed on the totality of
-  // sequential data from the kernarg segment including implicit (hidden)
-  // arguments. This means that all arguments up to the last preloaded argument
-  // will also be preloaded even if that data is unused.
-  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
-    FunctionType *FT = F.getFunctionType();
-    LLVMContext &Ctx = F.getParent()->getContext();
-    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
-      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
-
-    FunctionType *NFT =
-        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
-    Function *NF =
-        Function::Create(NFT, F.getLinkage(), F.getAddressSpace(), F.getName());
-
-    NF->copyAttributesFrom(&F);
-    NF->copyMetadata(&F, 0);
-    NF->setIsNewDbgInfoFormat(F.IsNewDbgInfoFormat);
-
-    F.getParent()->getFunctionList().insert(F.getIterator(), NF);
-    NF->takeName(&F);
-    NF->splice(NF->begin(), &F);
-
-    Function::arg_iterator NFArg = NF->arg_begin();
-    for (Argument &Arg : F.args()) {
-      Arg.replaceAllUsesWith(&*NFArg);
-      NFArg->takeName(&Arg);
-      ++NFArg;
-    }
-
-    AttrBuilder AB(Ctx);
-    AB.addAttribute(Attribute::InReg);
-    AB.addAttribute("amdgpu-hidden-argument");
-    AttributeList AL = NF->getAttributes();
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I) {
-      AL = AL.addParamAttributes(Ctx, NFArg->getArgNo(), AB);
-      NFArg++->setName(getHiddenArgName(HiddenArg(I)));
-    }
-
-    NF->setAttributes(AL);
-    F.replaceAllUsesWith(NF);
-    F.setCallingConv(CallingConv::C);
-
-    return NF;
-  }
-
-public:
-  PreloadKernelArgInfo(Function &F, const GCNSubtarget &ST) : F(F), ST(ST) {
-    setInitialFreeUserSGPRsCount();
-  }
-
-  // Returns the maximum number of user SGPRs that we have available to preload
-  // arguments.
-  void setInitialFreeUserSGPRsCount() {
-    GCNUserSGPRUsageInfo UserSGPRInfo(F, ST);
-    NumFreeUserSGPRs = UserSGPRInfo.getNumFreeUserSGPRs();
-  }
-
-  bool tryAllocPreloadSGPRs(unsigned AllocSize, uint64_t ArgOffset,
-                            uint64_t LastExplicitArgOffset) {
-    //  Check if this argument may be loaded into the same register as the
-    //  previous argument.
-    if (ArgOffset - LastExplicitArgOffset < 4 &&
-        !isAligned(Align(4), ArgOffset))
-      return true;
-
-    // Pad SGPRs for kernarg alignment.
-    ArgOffset = alignDown(ArgOffset, 4);
-    unsigned Padding = ArgOffset - LastExplicitArgOffset;
-    unsigned PaddingSGPRs = alignTo(Padding, 4) / 4;
-    unsigned NumPreloadSGPRs = alignTo(AllocSize, 4) / 4;
-    if (NumPreloadSGPRs + PaddingSGPRs > NumFreeUserSGPRs)
-      return false;
-
-    NumFreeUserSGPRs -= (NumPreloadSGPRs + PaddingSGPRs);
-    return true;
-  }
-
-  // Try to allocate SGPRs to preload implicit kernel arguments.
-  void tryAllocImplicitArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
-                                       uint64_t LastExplicitArgOffset,
-                                       IRBuilder<> &Builder) {
-    Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
-        F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
-    if (!ImplicitArgPtr)
-      return;
-
-    const DataLayout &DL = F.getParent()->getDataLayout();
-    // Pair is the load and the load offset.
-    SmallVector<std::pair<LoadInst *, unsigned>, 4> ImplicitArgLoads;
-    for (auto *U : ImplicitArgPtr->users()) {
-      Instruction *CI = dyn_cast<Instruction>(U);
-      if (!CI || CI->getParent()->getParent() != &F)
-        continue;
-
-      for (auto *U : CI->users()) {
-        int64_t Offset = 0;
-        auto *Load = dyn_cast<LoadInst>(U); // Load from ImplicitArgPtr?
-        if (!Load) {
-          if (GetPointerBaseWithConstantOffset(U, Offset, DL) != CI)
-            continue;
-
-          Load = dyn_cast<LoadInst>(*U->user_begin()); // Load from GEP?
-        }
-
-        if (!Load || !Load->isSimple())
-          continue;
-
-        // FIXME: Expand to handle 64-bit implicit args and large merged loads.
-        LLVMContext &Ctx = F.getParent()->getContext();
-        Type *LoadTy = Load->getType();
-        HiddenArg HA = getHiddenArgFromOffset(Offset);
-        if (HA == END_HIDDEN_ARGS || LoadTy != getHiddenArgType(Ctx, HA))
-          continue;
-
-        ImplicitArgLoads.push_back(std::make_pair(Load, Offset));
-      }
-    }
-
-    if (ImplicitArgLoads.empty())
-      return;
-
-    // Allocate loads in order of offset. We need to be sure that the implicit
-    // argument can actually be preloaded.
-    std::sort(ImplicitArgLoads.begin(), ImplicitArgLoads.end(), less_second());
-
-    // If we fail to preload any implicit argument we know we don't have SGPRs
-    // to preload any subsequent ones with larger offsets. Find the first
-    // argument that we cannot preload.
-    auto *PreloadEnd = std::find_if(
-        ImplicitArgLoads.begin(), ImplicitArgLoads.end(),
-        [&](const std::pair<LoadInst *, unsigned> &Load) {
-          unsigned LoadSize = DL.getTypeStoreSize(Load.first->getType());
-          unsigned LoadOffset = Load.second;
-          if (!tryAllocPreloadSGPRs(LoadSize,
-                                    LoadOffset + ImplicitArgsBaseOffset,
-                                    LastExplicitArgOffset))
-            return true;
-
-          LastExplicitArgOffset =
-              ImplicitArgsBaseOffset + LoadOffset + LoadSize;
-          return false;
-        });
-
-    if (PreloadEnd == ImplicitArgLoads.begin())
-      return;
-
-    unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
-    Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
-    assert(NF);
-    for (const auto *I = ImplicitArgLoads.beg...
[truncated]

@arsenm
Contributor

arsenm commented Jan 21, 2025

In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it.

Which case is this? Getting the exact number of arguments at all points should not be difficult

for (auto *F : Functions) {
const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(*F);
if (!ST.hasKernargPreload() ||
F->getCallingConv() != CallingConv::AMDGPU_KERNEL || F->arg_empty())
Contributor

Not really worth handling arg_empty as a special case; it should fall out naturally and never happens in practice

if (Arg.hasByRefAttr() || Arg.hasNestAttr() ||
Arg.hasAttribute("amdgpu-hidden-argument"))
break;

Contributor

Should add a TODO to handle byref; in the future all kernel arguments should be byref
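
For reference, a hedged sketch (hypothetical kernel, not from this patch) of what a byref kernel argument looks like in IR; arguments like this are currently skipped by the inreg marking loop above:

  ; Aggregate-style kernel arguments can be passed byref as a constant
  ; address space pointer into the kernarg segment.
  define amdgpu_kernel void @byref_example(ptr addrspace(4) byref(i32) align 4 %in) {
    %v = load i32, ptr addrspace(4) %in, align 4
    ret void
  }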

Comment on lines +1557 to +1561
Align ABITypeAlign = DL.getABITypeAlign(ArgTy);
uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
Contributor

You probably should be using the TargetLowering functions the calling convention lowering uses to see how these types will really be processed

Member Author

I think the way I'm doing it is what we want, since AMDGPUCallLowering::lowerFormalArgumentsKernel and the HSA metadata streamer calculate the argument offsets this same way, so it should be consistent with how these types are actually processed.

NFC. Since the handling will be moved to the middle end, the tests need to be reorganized. The IR tests are expanded since end-to-end testing is no longer baked in.
Member Author

@kerbowa kerbowa left a comment

In one case this leads to one extra argument being preloaded in a test. ISel correctly identifies this opportunity even when the IR pass previously missed it.

Which case is this? Getting the exact number of arguments at all points should not be difficult

half_v7bfloat_kernel_preload_arg is an example. The difference between the IR tracking and ISel is that ISel always allocates in order, whereas in IR we would sometimes check whether we could fit an arbitrarily located argument, which had bugs around arguments in previously allocated SGPRs and around the alignment of the hidden arguments. Looking only at the offsets, in the same way the HSA metadata is emitted, avoids all of this, since we don't need to track the specific registers that arguments are preloaded into in IR to determine what ISel will be able to preload.
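
To make the offset-only check concrete, here is a hypothetical layout (not the actual half_v7bfloat_kernel_preload_arg test) with offsets accumulated the same way the HSA metadata streamer does: align the running offset to each argument's ABI type alignment, then add its alloc size. Arguments are preloadable while the end offset stays within NumFreeUserSGPRs * 4 bytes.

  ; Hypothetical layout illustration (types and numbers assumed, not from the test):
  ;   %out : ptr addrspace(1) -> align 8, size 8 -> bytes [0, 8)
  ;   %x   : i32              -> align 4, size 4 -> bytes [8, 12)
  ;   %y   : i64              -> align 8, size 8 -> padded to 16, bytes [16, 24)
  ; The last explicit argument ends at byte 24, so all three are preloadable as
  ; long as 24 <= NumFreeUserSGPRs * 4, i.e. at least 6 user SGPRs are free.
  ; No per-register bookkeeping is needed in the IR pass.
  define amdgpu_kernel void @layout_example(ptr addrspace(1) %out, i32 %x, i64 %y) {
    ret void
  }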

@shiltian
Contributor

shiltian commented Feb 7, 2025

Why do we want to move it to this file? I'm not sure this is the right place. I think this file is only for things that run as part of the attributor framework.

@kerbowa
Member Author

kerbowa commented Feb 7, 2025

Why do we want to move it to this file? I'm not sure if this is the good place. I think this file is only for things that run part of the attributor framework.

It needs to be part of a module pass since it does things like cloning and deleting functions. I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

@shiltian
Contributor

shiltian commented Feb 7, 2025

I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

I remember that PR. I think that should be the way to go.

IIUC, @arsenm did mention in #101609 (which is indeed something about inreg) that we can probably move the logic there, but at that time I thought it would run as part of the attributor iterations. I'm not sure it is a good idea to pile everything into this file.


// Mark kernel arguments with 'inreg' attribute to indicate that they should
// be preloaded into SGPRs.
markKernelArgsAsInreg(Functions, TM);
Contributor

This doesn't get reflected in whether the pass reports a change or not.

@kerbowa
Member Author

kerbowa commented Feb 7, 2025

I had a PR up to make AMDGPULowerKernelArguments a module pass, but I think the consensus was that it made sense to unify all the inreg/preload kernarg functionality here.

I remember that PR. I think that should be the way to go.

IIUC, @arsenm did mention in #101609 (which is indeed something about inreg) that we can probably move the logic there, but at that time I thought it would run as part of the attributor iterations. I'm not sure it is a good idea to pile everything into this file.

See also this comment here: #112790 (comment)

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

Right now this needs to run after the attributor because the number of free user SGPRs depends on attributes on kernels calculated here.

Which parts are you thinking should move? Because if it goes back to AMDGPULowerKernelArguments, I think everything should move there, including the kernel inreg stuff.

@shiltian
Contributor

Which parts are you thinking should move

IMHO things that don't run as part of attributor should not stay in this file.

I think everything should move there including the kernel inreg stuff?

Yes, because for now it is not running as part of attributor iterations.

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

For now, probably add a TODO mentioning that some logic can be moved to the attributor. I assume we can move it after #101609 is landed.

If I read @arsenm's comment in #112790 correctly, half of the pass will go away after #115821 is landed, and the preload stuff is not really related to the pass itself. Moving it out to a separate module pass is an option.

@kerbowa
Member Author

kerbowa commented Feb 11, 2025

Which parts are you thinking should move

IMHO things that don't run as part of attributor should not stay in this file.

I think everything should move there including the kernel inreg stuff?

Yes, because for now it is not running as part of attributor iterations.

Eventually it may be possible to do some optimization where we drop the kernarg segment pointer, or other optimizations in callees related to kernel arguments that require attributor iterations.

For now, probably add a TODO mentioning that some logic can be moved to the attributor. I assume we can move it after #101609 is landed.

If I read @arsenm's comment in #112790 correctly, half of the pass will go away after #115821 is landed, and the preload stuff is not really related to the pass itself. Moving it out to a separate module pass is an option.

My only concern is that some of the logic calculating offsets and such would need to be duplicated if some functionality did end up moving back to the attributor. I could create a common interface though. I would like Matt's opinion.
