[AMDGPU] Make AMDGPULowerKernelArguments a module pass #112790

kerbowa · 2024-10-17T22:35:31Z

After c4d8920 AMDGPULowerKernelArguments may clone functions and modify the kernel signature of those functions to enable preloading hidden kernel arguments. These leftover functions end up as dead declarations which may cause issues with the toolchain downstream.

This patch makes AMDGPULowerKernelArguments a module pass so that we can safely erase these leftover declarations.

There is also some small refactoring to avoid duplicated logic with the different pass managers. The update changes the pass interfaces to look similar to other AMDGPU passes that have been migrated over to the new pass manager.

llvmbot · 2024-10-17T22:36:06Z

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)

Changes

After c4d8920 AMDGPULowerKernelArguments may clone functions and modify the kernel signature of those functions to enable preloading hidden kernel arguments. These leftover functions end up as dead declarations which may cause issues with the toolchain downstream.

This patch makes AMDGPULowerKernelArguments a module pass so that we can safely erase these leftover declarations.

There is also some small refactoring to avoid duplicated logic with the different pass managers. The update changes the pass interfaces to look similar to other AMDGPU passes that have been migrated over to the new pass manager.

Patch is 29.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112790.diff

6 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+4-4)
(modified) llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp (+243-222)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2-2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5-10)
(modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-IR-lowering.ll (+11-2)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 342d55e828bca5..9ffd1f3977213e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -111,9 +111,9 @@ ModulePass *createAMDGPUCtorDtorLoweringLegacyPass();
 void initializeAMDGPUCtorDtorLoweringLegacyPass(PassRegistry &);
 extern char &AMDGPUCtorDtorLoweringLegacyPassID;
 
-FunctionPass *createAMDGPULowerKernelArgumentsPass();
-void initializeAMDGPULowerKernelArgumentsPass(PassRegistry &);
-extern char &AMDGPULowerKernelArgumentsID;
+ModulePass *createAMDGPULowerKernelArgumentsLegacyPass(const TargetMachine *TM);
+void initializeAMDGPULowerKernelArgumentsLegacyPass(PassRegistry &);
+extern char &AMDGPULowerKernelArgumentsLegacyPassID;
 
 FunctionPass *createAMDGPUPromoteKernelArgumentsPass();
 void initializeAMDGPUPromoteKernelArgumentsPass(PassRegistry &);
@@ -310,7 +310,7 @@ class AMDGPULowerKernelArgumentsPass
 
 public:
   AMDGPULowerKernelArgumentsPass(TargetMachine &TM) : TM(TM){};
-  PreservedAnalyses run(Function &, FunctionAnalysisManager &);
+  PreservedAnalyses run(Module &, ModuleAnalysisManager &);
 };
 
 struct AMDGPUAttributorOptions {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
index 6573176492b7f3..7b986b4385023e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
@@ -131,7 +131,6 @@ class PreloadKernelArgInfo {
 
     NF->setAttributes(AL);
     F.replaceAllUsesWith(NF);
-    F.setCallingConv(CallingConv::C);
 
     return NF;
   }
@@ -169,8 +168,9 @@ class PreloadKernelArgInfo {
   }
 
   // Try to allocate SGPRs to preload implicit kernel arguments.
-  void tryAllocImplicitArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
-                                       IRBuilder<> &Builder) {
+  void tryAllocImplicitArgPreloadSGPRs(
+      uint64_t ImplicitArgsBaseOffset, IRBuilder<> &Builder,
+      SmallVectorImpl<Function *> &FunctionsToErase) {
     Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
         F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
     if (!ImplicitArgPtr)
@@ -239,6 +239,7 @@ class PreloadKernelArgInfo {
     unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
     Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
     assert(NF);
+    FunctionsToErase.push_back(&F);
     for (const auto *I = ImplicitArgLoads.begin(); I != PreloadEnd; ++I) {
       LoadInst *LoadInst = I->first;
       unsigned LoadOffset = I->second;
@@ -250,264 +251,284 @@ class PreloadKernelArgInfo {
   }
 };
 
-class AMDGPULowerKernelArguments : public FunctionPass {
-public:
-  static char ID;
+class AMDGPULowerKernelArguments {
+  const TargetMachine &TM;
+  SmallVector<Function *> FunctionsToErase;
 
-  AMDGPULowerKernelArguments() : FunctionPass(ID) {}
+public:
+  AMDGPULowerKernelArguments(const TargetMachine &TM) : TM(TM) {}
+
+  // skip allocas
+  static BasicBlock::iterator getInsertPt(BasicBlock &BB) {
+    BasicBlock::iterator InsPt = BB.getFirstInsertionPt();
+    for (BasicBlock::iterator E = BB.end(); InsPt != E; ++InsPt) {
+      AllocaInst *AI = dyn_cast<AllocaInst>(&*InsPt);
+
+      // If this is a dynamic alloca, the value may depend on the loaded kernargs,
+      // so loads will need to be inserted before it.
+      if (!AI || !AI->isStaticAlloca())
+        break;
+    }
 
-  bool runOnFunction(Function &F) override;
+    return InsPt;
+  }
 
-  void getAnalysisUsage(AnalysisUsage &AU) const override {
-    AU.addRequired<TargetPassConfig>();
-    AU.setPreservesAll();
- }
-};
+  bool lowerKernelArguments(Function &F) {
+    CallingConv::ID CC = F.getCallingConv();
+    if (CC != CallingConv::AMDGPU_KERNEL || F.arg_empty())
+      return false;
 
-} // end anonymous namespace
+    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+    LLVMContext &Ctx = F.getParent()->getContext();
+    const DataLayout &DL = F.getDataLayout();
+    BasicBlock &EntryBlock = *F.begin();
+    IRBuilder<> Builder(&EntryBlock, getInsertPt(EntryBlock));
 
-// skip allocas
-static BasicBlock::iterator getInsertPt(BasicBlock &BB) {
-  BasicBlock::iterator InsPt = BB.getFirstInsertionPt();
-  for (BasicBlock::iterator E = BB.end(); InsPt != E; ++InsPt) {
-    AllocaInst *AI = dyn_cast<AllocaInst>(&*InsPt);
+    const Align KernArgBaseAlign(16); // FIXME: Increase if necessary
+    const uint64_t BaseOffset = ST.getExplicitKernelArgOffset();
 
-    // If this is a dynamic alloca, the value may depend on the loaded kernargs,
-    // so loads will need to be inserted before it.
-    if (!AI || !AI->isStaticAlloca())
-      break;
-  }
+    Align MaxAlign;
+    // FIXME: Alignment is broken with explicit arg offset.;
+    const uint64_t TotalKernArgSize = ST.getKernArgSegmentSize(F, MaxAlign);
+    if (TotalKernArgSize == 0)
+      return false;
 
-  return InsPt;
-}
+    CallInst *KernArgSegment =
+        Builder.CreateIntrinsic(Intrinsic::amdgcn_kernarg_segment_ptr, {}, {},
+                                nullptr, F.getName() + ".kernarg.segment");
+    KernArgSegment->addRetAttr(Attribute::NonNull);
+    KernArgSegment->addRetAttr(
+        Attribute::getWithDereferenceableBytes(Ctx, TotalKernArgSize));
 
-static bool lowerKernelArguments(Function &F, const TargetMachine &TM) {
-  CallingConv::ID CC = F.getCallingConv();
-  if (CC != CallingConv::AMDGPU_KERNEL || F.arg_empty())
-    return false;
-
-  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  LLVMContext &Ctx = F.getParent()->getContext();
-  const DataLayout &DL = F.getDataLayout();
-  BasicBlock &EntryBlock = *F.begin();
-  IRBuilder<> Builder(&EntryBlock, getInsertPt(EntryBlock));
-
-  const Align KernArgBaseAlign(16); // FIXME: Increase if necessary
-  const uint64_t BaseOffset = ST.getExplicitKernelArgOffset();
-
-  Align MaxAlign;
-  // FIXME: Alignment is broken with explicit arg offset.;
-  const uint64_t TotalKernArgSize = ST.getKernArgSegmentSize(F, MaxAlign);
-  if (TotalKernArgSize == 0)
-    return false;
-
-  CallInst *KernArgSegment =
-      Builder.CreateIntrinsic(Intrinsic::amdgcn_kernarg_segment_ptr, {}, {},
-                              nullptr, F.getName() + ".kernarg.segment");
-  KernArgSegment->addRetAttr(Attribute::NonNull);
-  KernArgSegment->addRetAttr(
-      Attribute::getWithDereferenceableBytes(Ctx, TotalKernArgSize));
-
-  uint64_t ExplicitArgOffset = 0;
-  // Preloaded kernel arguments must be sequential.
-  bool InPreloadSequence = true;
-  PreloadKernelArgInfo PreloadInfo(F, ST);
-
-  for (Argument &Arg : F.args()) {
-    const bool IsByRef = Arg.hasByRefAttr();
-    Type *ArgTy = IsByRef ? Arg.getParamByRefType() : Arg.getType();
-    MaybeAlign ParamAlign = IsByRef ? Arg.getParamAlign() : std::nullopt;
-    Align ABITypeAlign = DL.getValueOrABITypeAlignment(ParamAlign, ArgTy);
-
-    uint64_t Size = DL.getTypeSizeInBits(ArgTy);
-    uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
-
-    uint64_t EltOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + BaseOffset;
-    uint64_t LastExplicitArgOffset = ExplicitArgOffset;
-    ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
-
-    // Guard against the situation where hidden arguments have already been
-    // lowered and added to the kernel function signiture, i.e. in a situation
-    // where this pass has run twice.
-    if (Arg.hasAttribute("amdgpu-hidden-argument"))
-      break;
-
-    // Try to preload this argument into user SGPRs.
-    if (Arg.hasInRegAttr() && InPreloadSequence && ST.hasKernargPreload() &&
-        !Arg.getType()->isAggregateType())
-      if (PreloadInfo.tryAllocPreloadSGPRs(AllocSize, EltOffset,
-                                           LastExplicitArgOffset))
-        continue;
+    uint64_t ExplicitArgOffset = 0;
+    // Preloaded kernel arguments must be sequential.
+    bool InPreloadSequence = true;
+    PreloadKernelArgInfo PreloadInfo(F, ST);
 
-    InPreloadSequence = false;
+    for (Argument &Arg : F.args()) {
+      const bool IsByRef = Arg.hasByRefAttr();
+      Type *ArgTy = IsByRef ? Arg.getParamByRefType() : Arg.getType();
+      MaybeAlign ParamAlign = IsByRef ? Arg.getParamAlign() : std::nullopt;
+      Align ABITypeAlign = DL.getValueOrABITypeAlignment(ParamAlign, ArgTy);
+
+      uint64_t Size = DL.getTypeSizeInBits(ArgTy);
+      uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
+
+      uint64_t EltOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + BaseOffset;
+      uint64_t LastExplicitArgOffset = ExplicitArgOffset;
+      ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
+
+      // Guard against the situation where hidden arguments have already been
+      // lowered and added to the kernel function signiture, i.e. in a situation
+      // where this pass has run twice.
+      if (Arg.hasAttribute("amdgpu-hidden-argument"))
+        break;
+
+      // Try to preload this argument into user SGPRs.
+      if (Arg.hasInRegAttr() && InPreloadSequence && ST.hasKernargPreload() &&
+          !Arg.getType()->isAggregateType())
+        if (PreloadInfo.tryAllocPreloadSGPRs(AllocSize, EltOffset,
+                                            LastExplicitArgOffset))
+          continue;
 
-    if (Arg.use_empty())
-      continue;
+      InPreloadSequence = false;
 
-    // If this is byval, the loads are already explicit in the function. We just
-    // need to rewrite the pointer values.
-    if (IsByRef) {
-      Value *ArgOffsetPtr = Builder.CreateConstInBoundsGEP1_64(
-          Builder.getInt8Ty(), KernArgSegment, EltOffset,
-          Arg.getName() + ".byval.kernarg.offset");
+      if (Arg.use_empty())
+        continue;
 
-      Value *CastOffsetPtr =
-          Builder.CreateAddrSpaceCast(ArgOffsetPtr, Arg.getType());
-      Arg.replaceAllUsesWith(CastOffsetPtr);
-      continue;
-    }
+      // If this is byval, the loads are already explicit in the function. We just
+      // need to rewrite the pointer values.
+      if (IsByRef) {
+        Value *ArgOffsetPtr = Builder.CreateConstInBoundsGEP1_64(
+            Builder.getInt8Ty(), KernArgSegment, EltOffset,
+            Arg.getName() + ".byval.kernarg.offset");
 
-    if (PointerType *PT = dyn_cast<PointerType>(ArgTy)) {
-      // FIXME: Hack. We rely on AssertZext to be able to fold DS addressing
-      // modes on SI to know the high bits are 0 so pointer adds don't wrap. We
-      // can't represent this with range metadata because it's only allowed for
-      // integer types.
-      if ((PT->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS ||
-           PT->getAddressSpace() == AMDGPUAS::REGION_ADDRESS) &&
-          !ST.hasUsableDSOffset())
+        Value *CastOffsetPtr =
+            Builder.CreateAddrSpaceCast(ArgOffsetPtr, Arg.getType());
+        Arg.replaceAllUsesWith(CastOffsetPtr);
         continue;
+      }
 
-      // FIXME: We can replace this with equivalent alias.scope/noalias
-      // metadata, but this appears to be a lot of work.
-      if (Arg.hasNoAliasAttr())
-        continue;
-    }
+      if (PointerType *PT = dyn_cast<PointerType>(ArgTy)) {
+        // FIXME: Hack. We rely on AssertZext to be able to fold DS addressing
+        // modes on SI to know the high bits are 0 so pointer adds don't wrap. We
+        // can't represent this with range metadata because it's only allowed for
+        // integer types.
+        if ((PT->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS ||
+            PT->getAddressSpace() == AMDGPUAS::REGION_ADDRESS) &&
+            !ST.hasUsableDSOffset())
+          continue;
 
-    auto *VT = dyn_cast<FixedVectorType>(ArgTy);
-    bool IsV3 = VT && VT->getNumElements() == 3;
-    bool DoShiftOpt = Size < 32 && !ArgTy->isAggregateType();
-
-    VectorType *V4Ty = nullptr;
-
-    int64_t AlignDownOffset = alignDown(EltOffset, 4);
-    int64_t OffsetDiff = EltOffset - AlignDownOffset;
-    Align AdjustedAlign = commonAlignment(
-        KernArgBaseAlign, DoShiftOpt ? AlignDownOffset : EltOffset);
-
-    Value *ArgPtr;
-    Type *AdjustedArgTy;
-    if (DoShiftOpt) { // FIXME: Handle aggregate types
-      // Since we don't have sub-dword scalar loads, avoid doing an extload by
-      // loading earlier than the argument address, and extracting the relevant
-      // bits.
-      // TODO: Update this for GFX12 which does have scalar sub-dword loads.
-      //
-      // Additionally widen any sub-dword load to i32 even if suitably aligned,
-      // so that CSE between different argument loads works easily.
-      ArgPtr = Builder.CreateConstInBoundsGEP1_64(
-          Builder.getInt8Ty(), KernArgSegment, AlignDownOffset,
-          Arg.getName() + ".kernarg.offset.align.down");
-      AdjustedArgTy = Builder.getInt32Ty();
-    } else {
-      ArgPtr = Builder.CreateConstInBoundsGEP1_64(
-          Builder.getInt8Ty(), KernArgSegment, EltOffset,
-          Arg.getName() + ".kernarg.offset");
-      AdjustedArgTy = ArgTy;
-    }
+        // FIXME: We can replace this with equivalent alias.scope/noalias
+        // metadata, but this appears to be a lot of work.
+        if (Arg.hasNoAliasAttr())
+          continue;
+      }
 
-    if (IsV3 && Size >= 32) {
-      V4Ty = FixedVectorType::get(VT->getElementType(), 4);
-      // Use the hack that clang uses to avoid SelectionDAG ruining v3 loads
-      AdjustedArgTy = V4Ty;
-    }
+      auto *VT = dyn_cast<FixedVectorType>(ArgTy);
+      bool IsV3 = VT && VT->getNumElements() == 3;
+      bool DoShiftOpt = Size < 32 && !ArgTy->isAggregateType();
+
+      VectorType *V4Ty = nullptr;
+
+      int64_t AlignDownOffset = alignDown(EltOffset, 4);
+      int64_t OffsetDiff = EltOffset - AlignDownOffset;
+      Align AdjustedAlign = commonAlignment(
+          KernArgBaseAlign, DoShiftOpt ? AlignDownOffset : EltOffset);
+
+      Value *ArgPtr;
+      Type *AdjustedArgTy;
+      if (DoShiftOpt) { // FIXME: Handle aggregate types
+        // Since we don't have sub-dword scalar loads, avoid doing an extload by
+        // loading earlier than the argument address, and extracting the relevant
+        // bits.
+        // TODO: Update this for GFX12 which does have scalar sub-dword loads.
+        //
+        // Additionally widen any sub-dword load to i32 even if suitably aligned,
+        // so that CSE between different argument loads works easily.
+        ArgPtr = Builder.CreateConstInBoundsGEP1_64(
+            Builder.getInt8Ty(), KernArgSegment, AlignDownOffset,
+            Arg.getName() + ".kernarg.offset.align.down");
+        AdjustedArgTy = Builder.getInt32Ty();
+      } else {
+        ArgPtr = Builder.CreateConstInBoundsGEP1_64(
+            Builder.getInt8Ty(), KernArgSegment, EltOffset,
+            Arg.getName() + ".kernarg.offset");
+        AdjustedArgTy = ArgTy;
+      }
 
-    LoadInst *Load =
-        Builder.CreateAlignedLoad(AdjustedArgTy, ArgPtr, AdjustedAlign);
-    Load->setMetadata(LLVMContext::MD_invariant_load, MDNode::get(Ctx, {}));
+      if (IsV3 && Size >= 32) {
+        V4Ty = FixedVectorType::get(VT->getElementType(), 4);
+        // Use the hack that clang uses to avoid SelectionDAG ruining v3 loads
+        AdjustedArgTy = V4Ty;
+      }
 
-    MDBuilder MDB(Ctx);
+      LoadInst *Load =
+          Builder.CreateAlignedLoad(AdjustedArgTy, ArgPtr, AdjustedAlign);
+      Load->setMetadata(LLVMContext::MD_invariant_load, MDNode::get(Ctx, {}));
 
-    if (isa<PointerType>(ArgTy)) {
-      if (Arg.hasNonNullAttr())
-        Load->setMetadata(LLVMContext::MD_nonnull, MDNode::get(Ctx, {}));
+      MDBuilder MDB(Ctx);
 
-      uint64_t DerefBytes = Arg.getDereferenceableBytes();
-      if (DerefBytes != 0) {
-        Load->setMetadata(
-          LLVMContext::MD_dereferenceable,
-          MDNode::get(Ctx,
-                      MDB.createConstant(
-                        ConstantInt::get(Builder.getInt64Ty(), DerefBytes))));
-      }
+      if (isa<PointerType>(ArgTy)) {
+        if (Arg.hasNonNullAttr())
+          Load->setMetadata(LLVMContext::MD_nonnull, MDNode::get(Ctx, {}));
 
-      uint64_t DerefOrNullBytes = Arg.getDereferenceableOrNullBytes();
-      if (DerefOrNullBytes != 0) {
-        Load->setMetadata(
-          LLVMContext::MD_dereferenceable_or_null,
-          MDNode::get(Ctx,
-                      MDB.createConstant(ConstantInt::get(Builder.getInt64Ty(),
-                                                          DerefOrNullBytes))));
+        uint64_t DerefBytes = Arg.getDereferenceableBytes();
+        if (DerefBytes != 0) {
+          Load->setMetadata(
+            LLVMContext::MD_dereferenceable,
+            MDNode::get(Ctx,
+                        MDB.createConstant(
+                          ConstantInt::get(Builder.getInt64Ty(), DerefBytes))));
+        }
+
+        uint64_t DerefOrNullBytes = Arg.getDereferenceableOrNullBytes();
+        if (DerefOrNullBytes != 0) {
+          Load->setMetadata(
+            LLVMContext::MD_dereferenceable_or_null,
+            MDNode::get(Ctx,
+                        MDB.createConstant(ConstantInt::get(Builder.getInt64Ty(),
+                                                            DerefOrNullBytes))));
+        }
+
+        if (MaybeAlign ParamAlign = Arg.getParamAlign()) {
+          Load->setMetadata(
+              LLVMContext::MD_align,
+              MDNode::get(Ctx, MDB.createConstant(ConstantInt::get(
+                                  Builder.getInt64Ty(), ParamAlign->value()))));
+        }
       }
 
-      if (MaybeAlign ParamAlign = Arg.getParamAlign()) {
-        Load->setMetadata(
-            LLVMContext::MD_align,
-            MDNode::get(Ctx, MDB.createConstant(ConstantInt::get(
-                                 Builder.getInt64Ty(), ParamAlign->value()))));
+      // TODO: Convert noalias arg to !noalias
+
+      if (DoShiftOpt) {
+        Value *ExtractBits = OffsetDiff == 0 ?
+          Load : Builder.CreateLShr(Load, OffsetDiff * 8);
+
+        IntegerType *ArgIntTy = Builder.getIntNTy(Size);
+        Value *Trunc = Builder.CreateTrunc(ExtractBits, ArgIntTy);
+        Value *NewVal = Builder.CreateBitCast(Trunc, ArgTy,
+                                              Arg.getName() + ".load");
+        Arg.replaceAllUsesWith(NewVal);
+      } else if (IsV3) {
+        Value *Shuf = Builder.CreateShuffleVector(Load, ArrayRef<int>{0, 1, 2},
+                                                  Arg.getName() + ".load");
+        Arg.replaceAllUsesWith(Shuf);
+      } else {
+        Load->setName(Arg.getName() + ".load");
+        Arg.replaceAllUsesWith(Load);
       }
     }
 
-    // TODO: Convert noalias arg to !noalias
-
-    if (DoShiftOpt) {
-      Value *ExtractBits = OffsetDiff == 0 ?
-        Load : Builder.CreateLShr(Load, OffsetDiff * 8);
-
-      IntegerType *ArgIntTy = Builder.getIntNTy(Size);
-      Value *Trunc = Builder.CreateTrunc(ExtractBits, ArgIntTy);
-      Value *NewVal = Builder.CreateBitCast(Trunc, ArgTy,
-                                            Arg.getName() + ".load");
-      Arg.replaceAllUsesWith(NewVal);
-    } else if (IsV3) {
-      Value *Shuf = Builder.CreateShuffleVector(Load, ArrayRef<int>{0, 1, 2},
-                                                Arg.getName() + ".load");
-      Arg.replaceAllUsesWith(Shuf);
-    } else {
-      Load->setName(Arg.getName() + ".load");
-      Arg.replaceAllUsesWith(Load);
+    KernArgSegment->addRetAttr(
+        Attribute::getWithAlignment(Ctx, std::max(KernArgBaseAlign, MaxAlign)));
+
+    if (InPreloadSequence) {
+      uint64_t ImplicitArgsBaseOffset =
+          alignTo(ExplicitArgOffset, ST.getAlignmentForImplicitArgPtr()) +
+          BaseOffset;
+      PreloadInfo.tryAllocImplicitArgPreloadSGPRs(ImplicitArgsBaseOffset,
+                                                  Builder, FunctionsToErase);
     }
+
+    return true;
   }
 
-  KernArgSegment->addRetAttr(
-      Attribute::getWithAlignment(Ctx, std::max(KernArgBaseAlign, MaxAlign)));
+  bool runOnModule(Module &M) {
+    bool Changed = false;
 
-  if (InPreloadSequence) {
-    uint64_t ImplicitArgsBaseOffset =
-        alignTo(ExplicitArgOffset, ST.getAlignmentForImplicitArgPtr()) +
-        BaseOffset;
-    PreloadInfo.tryAllocImplicitArgPreloadSGPRs(ImplicitArgsBaseOffset,
-                                                Builder);
+    for (Function &F : M)
+      Changed |= lowerKernelArguments(F);
+
+  ...
[truncated]

github-actions · 2024-10-17T22:38:58Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 1c8fca82a0f4ac6df5db539e96adcad143f5ebe7 3a6de1e17129abae6247193fd863f6f171a6f544 --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 8d6926a494..1098175ae8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -10,10 +10,10 @@
 #ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPU_H
 #define LLVM_LIB_TARGET_AMDGPU_AMDGPU_H
 
+#include "llvm/Analysis/CallGraphSCCPass.h"
 #include "llvm/CodeGen/MachinePassManager.h"
 #include "llvm/IR/PassManager.h"
 #include "llvm/Pass.h"
-#include "llvm/Analysis/CallGraphSCCPass.h"
 #include "llvm/Support/AMDGPUAddrSpace.h"
 #include "llvm/Support/CodeGen.h"
 
@@ -112,7 +112,8 @@ ModulePass *createAMDGPUCtorDtorLoweringLegacyPass();
 void initializeAMDGPUCtorDtorLoweringLegacyPass(PassRegistry &);
 extern char &AMDGPUCtorDtorLoweringLegacyPassID;
 
-CallGraphSCCPass *createAMDGPULowerKernelArgumentsLegacyPass(const TargetMachine *TM);
+CallGraphSCCPass *
+createAMDGPULowerKernelArgumentsLegacyPass(const TargetMachine *TM);
 void initializeAMDGPULowerKernelArgumentsLegacyPass(PassRegistry &);
 extern char &AMDGPULowerKernelArgumentsLegacyPassID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
index 9fcb90fa0c..d940072bd7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
@@ -275,8 +275,8 @@ public:
     for (BasicBlock::iterator E = BB.end(); InsPt != E; ++InsPt) {
       AllocaInst *AI = dyn_cast<AllocaInst>(&*InsPt);
 
-      // If this is a dynamic alloca, the value may depend on the loaded kernargs,
-      // so loads will need to be inserted before it.
+      // If this is a dynamic alloca, the value may depend on the loaded
+      // kernargs, so loads will need to be inserted before it.
       if (!AI || !AI->isStaticAlloca())
         break;
     }
@@ -325,7 +325,8 @@ public:
       uint64_t Size = DL.getTypeSizeInBits(ArgTy);
       uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
 
-      uint64_t EltOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + BaseOffset;
+      uint64_t EltOffset =
+          alignTo(ExplicitArgOffset, ABITypeAlign) + BaseOffset;
       uint64_t LastExplicitArgOffset = ExplicitArgOffset;
       ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
 
@@ -339,7 +340,7 @@ public:
       if (Arg.hasInRegAttr() && InPreloadSequence && ST.hasKernargPreload() &&
           !Arg.getType()->isAggregateType())
         if (PreloadInfo.tryAllocPreloadSGPRs(AllocSize, EltOffset,
-                                            LastExplicitArgOffset))
+                                             LastExplicitArgOffset))
           continue;
 
       InPreloadSequence = false;
@@ -347,8 +348,8 @@ public:
       if (Arg.use_empty())
         continue;
 
-      // If this is byval, the loads are already explicit in the function. We just
-      // need to rewrite the pointer values.
+      // If this is byval, the loads are already explicit in the function. We
+      // just need to rewrite the pointer values.
       if (IsByRef) {
         Value *ArgOffsetPtr = Builder.CreateConstInBoundsGEP1_64(
             Builder.getInt8Ty(), KernArgSegment, EltOffset,
@@ -362,11 +363,11 @@ public:
 
       if (PointerType *PT = dyn_cast<PointerType>(ArgTy)) {
         // FIXME: Hack. We rely on AssertZext to be able to fold DS addressing
-        // modes on SI to know the high bits are 0 so pointer adds don't wrap. We
-        // can't represent this with range metadata because it's only allowed for
-        // integer types.
+        // modes on SI to know the high bits are 0 so pointer adds don't wrap.
+        // We can't represent this with range metadata because it's only allowed
+        // for integer types.
         if ((PT->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS ||
-            PT->getAddressSpace() == AMDGPUAS::REGION_ADDRESS) &&
+             PT->getAddressSpace() == AMDGPUAS::REGION_ADDRESS) &&
             !ST.hasUsableDSOffset())
           continue;
 
@@ -391,12 +392,12 @@ public:
       Type *AdjustedArgTy;
       if (DoShiftOpt) { // FIXME: Handle aggregate types
         // Since we don't have sub-dword scalar loads, avoid doing an extload by
-        // loading earlier than the argument address, and extracting the relevant
-        // bits.
+        // loading earlier than the argument address, and extracting the
+        // relevant bits.
         // TODO: Update this for GFX12 which does have scalar sub-dword loads.
         //
-        // Additionally widen any sub-dword load to i32 even if suitably aligned,
-        // so that CSE between different argument loads works easily.
+        // Additionally widen any sub-dword load to i32 even if suitably
+        // aligned, so that CSE between different argument loads works easily.
         ArgPtr = Builder.CreateConstInBoundsGEP1_64(
             Builder.getInt8Ty(), KernArgSegment, AlignDownOffset,
             Arg.getName() + ".kernarg.offset.align.down");
@@ -427,39 +428,38 @@ public:
         uint64_t DerefBytes = Arg.getDereferenceableBytes();
         if (DerefBytes != 0) {
           Load->setMetadata(
-            LLVMContext::MD_dereferenceable,
-            MDNode::get(Ctx,
-                        MDB.createConstant(
-                          ConstantInt::get(Builder.getInt64Ty(), DerefBytes))));
+              LLVMContext::MD_dereferenceable,
+              MDNode::get(Ctx, MDB.createConstant(ConstantInt::get(
+                                   Builder.getInt64Ty(), DerefBytes))));
         }
 
         uint64_t DerefOrNullBytes = Arg.getDereferenceableOrNullBytes();
         if (DerefOrNullBytes != 0) {
           Load->setMetadata(
-            LLVMContext::MD_dereferenceable_or_null,
-            MDNode::get(Ctx,
-                        MDB.createConstant(ConstantInt::get(Builder.getInt64Ty(),
-                                                            DerefOrNullBytes))));
+              LLVMContext::MD_dereferenceable_or_null,
+              MDNode::get(Ctx, MDB.createConstant(ConstantInt::get(
+                                   Builder.getInt64Ty(), DerefOrNullBytes))));
         }
 
         if (MaybeAlign ParamAlign = Arg.getParamAlign()) {
           Load->setMetadata(
               LLVMContext::MD_align,
-              MDNode::get(Ctx, MDB.createConstant(ConstantInt::get(
-                                  Builder.getInt64Ty(), ParamAlign->value()))));
+              MDNode::get(Ctx,
+                          MDB.createConstant(ConstantInt::get(
+                              Builder.getInt64Ty(), ParamAlign->value()))));
         }
       }
 
       // TODO: Convert noalias arg to !noalias
 
       if (DoShiftOpt) {
-        Value *ExtractBits = OffsetDiff == 0 ?
-          Load : Builder.CreateLShr(Load, OffsetDiff * 8);
+        Value *ExtractBits =
+            OffsetDiff == 0 ? Load : Builder.CreateLShr(Load, OffsetDiff * 8);
 
         IntegerType *ArgIntTy = Builder.getIntNTy(Size);
         Value *Trunc = Builder.CreateTrunc(ExtractBits, ArgIntTy);
-        Value *NewVal = Builder.CreateBitCast(Trunc, ArgTy,
-                                              Arg.getName() + ".load");
+        Value *NewVal =
+            Builder.CreateBitCast(Trunc, ArgTy, Arg.getName() + ".load");
         Arg.replaceAllUsesWith(NewVal);
       } else if (IsV3) {
         Value *Shuf = Builder.CreateShuffleVector(Load, ArrayRef<int>{0, 1, 2},

shiltian · 2024-10-17T23:10:22Z

I'd make the refactoring and change to module pass separate PRs.

kerbowa · 2024-10-18T04:21:53Z

I'd make the refactoring and change to module pass separate PRs.

I can, but the refactoring is actually quite minor, the diff looks big but 90% is just indentation.

arsenm · 2024-10-20T01:58:25Z

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll

 ; GCN-O3-NEXT:      AMDGPU Annotate Kernel Features
-; GCN-O3-NEXT:      FunctionPass Manager
-; GCN-O3-NEXT:        AMDGPU Lower Kernel Arguments
+; GCN-O3-NEXT:    AMDGPU Lower Kernel Arguments


Can this be a CGSCC pass instead?

I've made this change. I'm worried that this leaves the SCC in an inconsonant state, although it works. It breaks if I try to remove any functions outside of finalization.

There are 3 other options. Was wondering if you have any opinion on the best one.

Module pass

Keep it a function pass and do all the preload stuff in the doInitialization callback

Make the preloading stuff a separate module pass and keep lowerkernelarguments a function pass

CGSCC pass

I'm leaning towards option 3 personally.

wrt some of the (legacy) passes: it may not be an issue for now I'd be cautious adding module passes in the (Pre)CodeGen part as it pops off the CGSCC pass off the pass manager stack. From what I see in llc-pipeline.ll, it shouldn't be an issue but it's the reason why a few lines under where you add the pass in AMDGPUTargetMachine you see a sudden addPass(new DummyCGSCCPass()); added: it pushes the CGSCC pass back on the pass manager stack.

Related to that: having the pass be a function pass will more than likely run the functions in CGSCC order anyways since it seems to inherit the CGSCC pass manager from the AMDGPUAnnotateKernelFeatures pass before it.

TL;DR: fine to add module pass in this case but may need a comment in AMDGPUTargetMachine mentioning that it pops off the CGSCC pass manager from the PM stack for future explorers of said code.

Thanks @JanekvO I'll add a comment to that effect.

@arsenm to future proof for NPM I think our only options are making AMDGPULowerKernelArguments a module pass or factoring out the preload stuff and creating a new module pass. What do you think is better?

I thought you were moving the preload logic to the attributor? This probably belongs in the attributor, or a separate CGSCC/CGSCC-in-module pass.

This does not feel like it belongs in AMDGPULowerKernelArguments. AMDGPULowerKernelArguments is at least a 50% hack due to clang emitting sub-optimal kernel signatures to begin with (which #115821 will soon give us a path forward to fixing). The other piece is it enables vectorization of kernel argument loads, which will not go away. In either case, its only job is to rewrite argument references in terms of a base pointer. It shouldn't need to consider the use context like preload should

arsenm · 2024-10-20T01:58:52Z

llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-IR-lowering.ll

  ret void
 }

+; PRELOAD-NOT: declare {{.*}}@1


better to use -implicit-check-not=declare

This doesn't work because of implicit_arg ptr declaration.

Can you relax it to @{{[0-9]+}}?

After c4d8920 AMDGPULowerKernelArguments may clone functions and modify the kernel signature of those functions to enable preloading hidden kernel arguments. These leftover functions end up as dead declarations which may cause issues with the toolchain downstream. This patch makes AMDGPULowerKernelArguments a CGSCC pass so that we can safely erase these leftover declarations. There is also some small refactoring to avoid duplicated logic with the different pass managers. The update changes the pass interfaces to look similar to other AMDGPU passes that have been migrated over to the new pass manager.

krzysz00 · 2025-04-07T16:10:55Z

One note - though it's late - CallGraphUpdater exists

kerbowa requested a review from arsenm October 17, 2024 22:35

llvmbot added the backend:AMDGPU label Oct 17, 2024

kerbowa requested a review from shiltian October 18, 2024 04:22

arsenm reviewed Oct 20, 2024

View reviewed changes

kerbowa added 2 commits November 10, 2024 13:16

Move to CGSCC pass.

3a6de1e

kerbowa force-pushed the fix-removing-cloned-function-hidden-arg-preload branch from a8cb03f to 3a6de1e Compare November 10, 2024 21:18

kerbowa closed this Jan 13, 2025

kerbowa mentioned this pull request Feb 7, 2025

[AMDGPU] Move kernarg preload logic to AMDGPU Attributor #123547

Closed

kerbowa requested a review from krzysz00 April 7, 2025 15:24

kerbowa mentioned this pull request Apr 7, 2025

[AMDGPU] Move kernarg preload logic to separate pass #130434

Merged

[AMDGPU] Make AMDGPULowerKernelArguments a module pass #112790

[AMDGPU] Make AMDGPULowerKernelArguments a module pass #112790

Uh oh!

Conversation

kerbowa commented Oct 17, 2024

Uh oh!

llvmbot commented Oct 17, 2024

Uh oh!

github-actions bot commented Oct 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shiltian commented Oct 17, 2024

Uh oh!

kerbowa commented Oct 18, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

krzysz00 commented Apr 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

github-actions bot commented Oct 17, 2024 •

edited

Loading