[OpenMP] Team reduction work specialization #70766

jdoerfert · 2023-10-31T05:44:18Z

Only the last commit is new in this PR; the others are part of existing PRs.

llvmbot · 2023-10-31T05:45:31Z

@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Johannes Doerfert (jdoerfert)

Changes

Only the last commit is new in this PR; the others are part of existing PRs.

Patch is 4.73 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70766.diff

186 Files Affected:

(modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+28-47)
(modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.h (-2)
(modified) clang/lib/Sema/SemaOpenMP.cpp (+18-8)
(modified) clang/test/OpenMP/amdgcn_target_codegen.cpp (+10-4)
(modified) clang/test/OpenMP/amdgcn_target_device_vla.cpp (+20-8)
(modified) clang/test/OpenMP/amdgcn_target_init_temp_alloca.cpp (+2)
(modified) clang/test/OpenMP/amdgpu_target_with_aligned_attribute.c (+5-2)
(modified) clang/test/OpenMP/assumes_include_nvptx.cpp (+2-2)
(modified) clang/test/OpenMP/bug60602.cpp (+7-7)
(modified) clang/test/OpenMP/declare_target_codegen.cpp (+6-6)
(modified) clang/test/OpenMP/declare_target_codegen_globalization.cpp (+4-2)
(modified) clang/test/OpenMP/declare_target_link_codegen.cpp (+1-1)
(modified) clang/test/OpenMP/declare_variant_mixed_codegen.c (+1-1)
(modified) clang/test/OpenMP/distribute_codegen.cpp (+62-42)
(modified) clang/test/OpenMP/distribute_firstprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/distribute_lastprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/distribute_parallel_for_codegen.cpp (+118-118)
(modified) clang/test/OpenMP/distribute_parallel_for_firstprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_if_codegen.cpp (+31-31)
(modified) clang/test/OpenMP/distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_num_threads_codegen.cpp (+152-152)
(modified) clang/test/OpenMP/distribute_parallel_for_private_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_codegen.cpp (+118-118)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_firstprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp (+128-128)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_num_threads_codegen.cpp (+152-152)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_private_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
(modified) clang/test/OpenMP/distribute_private_codegen.cpp (+40-40)
(modified) clang/test/OpenMP/distribute_simd_codegen.cpp (+60-20)
(modified) clang/test/OpenMP/distribute_simd_firstprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/distribute_simd_lastprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/distribute_simd_private_codegen.cpp (+40-40)
(modified) clang/test/OpenMP/distribute_simd_reduction_codegen.cpp (+14-14)
(modified) clang/test/OpenMP/nvptx_SPMD_codegen.cpp (+2679-2301)
(modified) clang/test/OpenMP/nvptx_data_sharing.cpp (+4-2)
(modified) clang/test/OpenMP/nvptx_declare_target_var_ctor_dtor_codegen.cpp (+1-1)
(modified) clang/test/OpenMP/nvptx_distribute_parallel_generic_mode_codegen.cpp (+8-4)
(modified) clang/test/OpenMP/nvptx_lambda_capturing.cpp (+47-27)
(modified) clang/test/OpenMP/nvptx_multi_target_parallel_codegen.cpp (+16-8)
(modified) clang/test/OpenMP/nvptx_nested_parallel_codegen.cpp (+8-4)
(modified) clang/test/OpenMP/nvptx_parallel_codegen.cpp (+24-12)
(modified) clang/test/OpenMP/nvptx_parallel_for_codegen.cpp (+4-2)
(modified) clang/test/OpenMP/nvptx_target_codegen.cpp (+64-32)
(modified) clang/test/OpenMP/nvptx_target_firstprivate_codegen.cpp (+12-6)
(modified) clang/test/OpenMP/nvptx_target_parallel_codegen.cpp (+16-8)
(modified) clang/test/OpenMP/nvptx_target_parallel_num_threads_codegen.cpp (+16-8)
(modified) clang/test/OpenMP/nvptx_target_parallel_proc_bind_codegen.cpp (+72-36)
(modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen.cpp (+36-18)
(modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen_tbaa_PR46146.cpp (+272-268)
(modified) clang/test/OpenMP/nvptx_target_printf_codegen.c (+24-12)
(modified) clang/test/OpenMP/nvptx_target_simd_codegen.cpp (+318-270)
(modified) clang/test/OpenMP/nvptx_target_teams_codegen.cpp (+24-12)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_codegen.cpp (+8-4)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp (+72-36)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_generic_mode_codegen.cpp (+8-4)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp (+364-348)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_simd_codegen.cpp (+390-342)
(modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_codegen.cpp (+60-30)
(modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_generic_mode_codegen.cpp (+8-4)
(modified) clang/test/OpenMP/nvptx_target_teams_ompx_bare_codegen.cpp (+3-1)
(modified) clang/test/OpenMP/nvptx_teams_codegen.cpp (+32-16)
(modified) clang/test/OpenMP/nvptx_teams_reduction_codegen.cpp (+156-138)
(modified) clang/test/OpenMP/ompx_attributes_codegen.cpp (+3-3)
(modified) clang/test/OpenMP/openmp_offload_codegen.cpp (+1-1)
(modified) clang/test/OpenMP/reduction_implicit_map.cpp (+35-33)
(modified) clang/test/OpenMP/remarks_parallel_in_multiple_target_state_machines.c (+2-1)
(modified) clang/test/OpenMP/remarks_parallel_in_target_state_machine.c (+2-1)
(modified) clang/test/OpenMP/target_codegen_global_capture.cpp (+30-30)
(modified) clang/test/OpenMP/target_firstprivate_codegen.cpp (+72-24)
(modified) clang/test/OpenMP/target_map_codegen_03.cpp (+6-6)
(modified) clang/test/OpenMP/target_map_member_expr_codegen.cpp (+2-2)
(modified) clang/test/OpenMP/target_ompx_dyn_cgroup_mem_codegen.cpp (+36-12)
(modified) clang/test/OpenMP/target_parallel_codegen.cpp (+42-14)
(modified) clang/test/OpenMP/target_parallel_debug_codegen.cpp (+441-420)
(modified) clang/test/OpenMP/target_parallel_for_codegen.cpp (+42-14)
(modified) clang/test/OpenMP/target_parallel_for_debug_codegen.cpp (+610-589)
(modified) clang/test/OpenMP/target_parallel_for_simd_codegen.cpp (+84-28)
(modified) clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp (+79-3)
(modified) clang/test/OpenMP/target_parallel_for_tl_codegen.cpp (+72-3)
(modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-1.cpp (+44-44)
(modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-2.cpp (+24-16)
(modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-3.cpp (+610-589)
(modified) clang/test/OpenMP/target_parallel_generic_loop_codegen.cpp (+5-2)
(modified) clang/test/OpenMP/target_parallel_generic_loop_depend_codegen.cpp (+4-6)
(modified) clang/test/OpenMP/target_parallel_generic_loop_tl_codegen.cpp (+72-3)
(modified) clang/test/OpenMP/target_parallel_generic_loop_uses_allocators_codegen.cpp (+2-2)
(modified) clang/test/OpenMP/target_parallel_if_codegen.cpp (+96-72)
(modified) clang/test/OpenMP/target_parallel_num_threads_codegen.cpp (+78-54)
(modified) clang/test/OpenMP/target_parallel_tl_codegen.cpp (+22-3)
(modified) clang/test/OpenMP/target_private_codegen.cpp (+14-7)
(modified) clang/test/OpenMP/target_reduction_codegen.cpp (+12-6)
(modified) clang/test/OpenMP/target_simd_tl_codegen.cpp (+35-3)
(modified) clang/test/OpenMP/target_task_affinity_codegen.cpp (+6-2)
(modified) clang/test/OpenMP/target_teams_codegen.cpp (+66-22)
(modified) clang/test/OpenMP/target_teams_distribute_codegen.cpp (+42-14)
(modified) clang/test/OpenMP/target_teams_distribute_collapse_codegen.cpp (+18-18)
(modified) clang/test/OpenMP/target_teams_distribute_dist_schedule_codegen.cpp (+42-42)
(modified) clang/test/OpenMP/target_teams_distribute_firstprivate_codegen.cpp (+7-7)
(modified) clang/test/OpenMP/target_teams_distribute_lastprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_codegen.cpp (+16-8)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_collapse_codegen.cpp (+24-24)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_dist_schedule_codegen.cpp (+60-60)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_firstprivate_codegen.cpp (+138-128)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_if_codegen.cpp (+34-34)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_order_codegen.cpp (+4-4)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_private_codegen.cpp (+94-84)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_reduction_codegen.cpp (+29-29)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_schedule_codegen.cpp (+192-192)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen.cpp (+24-12)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_collapse_codegen.cpp (+24-24)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_dist_schedule_codegen.cpp (+60-60)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_firstprivate_codegen.cpp (+138-128)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_private_codegen.cpp (+94-84)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_reduction_codegen.cpp (+29-29)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_schedule_codegen.cpp (+192-192)
(modified) clang/test/OpenMP/target_teams_distribute_private_codegen.cpp (+7-7)
(modified) clang/test/OpenMP/target_teams_distribute_reduction_codegen.cpp (+145-145)
(modified) clang/test/OpenMP/target_teams_distribute_simd_codegen.cpp (+84-28)
(modified) clang/test/OpenMP/target_teams_distribute_simd_collapse_codegen.cpp (+18-18)
(modified) clang/test/OpenMP/target_teams_distribute_simd_dist_schedule_codegen.cpp (+42-42)
(modified) clang/test/OpenMP/target_teams_distribute_simd_firstprivate_codegen.cpp (+7-7)
(modified) clang/test/OpenMP/target_teams_distribute_simd_lastprivate_codegen.cpp (+36-36)
(modified) clang/test/OpenMP/target_teams_distribute_simd_private_codegen.cpp (+7-7)
(modified) clang/test/OpenMP/target_teams_distribute_simd_reduction_codegen.cpp (+19-19)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen-1.cpp (+16-8)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen.cpp (+15-12)
(modified) clang/test/OpenMP/target_teams_generic_loop_collapse_codegen.cpp (+24-24)
(modified) clang/test/OpenMP/target_teams_generic_loop_depend_codegen.cpp (+4-6)
(modified) clang/test/OpenMP/target_teams_generic_loop_if_codegen.cpp (+34-34)
(modified) clang/test/OpenMP/target_teams_generic_loop_order_codegen.cpp (+4-4)
(modified) clang/test/OpenMP/target_teams_generic_loop_private_codegen.cpp (+94-84)
(modified) clang/test/OpenMP/target_teams_generic_loop_reduction_codegen.cpp (+29-29)
(modified) clang/test/OpenMP/target_teams_generic_loop_uses_allocators_codegen.cpp (+3-3)
(modified) clang/test/OpenMP/target_teams_map_codegen.cpp (+130-94)
(modified) clang/test/OpenMP/target_teams_num_teams_codegen.cpp (+78-54)
(modified) clang/test/OpenMP/target_teams_thread_limit_codegen.cpp (+44-20)
(modified) clang/test/OpenMP/teams_codegen.cpp (+72-56)
(modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+4-1)
(modified) llvm/include/llvm/Frontend/OpenMP/OMPKinds.def (+6-2)
(modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+30-2)
(modified) llvm/test/Transforms/OpenMP/add_attributes.ll (+4-4)
(modified) llvm/test/Transforms/OpenMP/always_inline_device.ll (+4-4)
(modified) llvm/test/Transforms/OpenMP/custom_state_machines.ll (+85-85)
(modified) llvm/test/Transforms/OpenMP/custom_state_machines_pre_lto.ll (+148-148)
(modified) llvm/test/Transforms/OpenMP/custom_state_machines_remarks.ll (+5-5)
(modified) llvm/test/Transforms/OpenMP/deduplication_target.ll (+5-5)
(modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold.ll (+13-13)
(modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold_optnone.ll (+7-7)
(modified) llvm/test/Transforms/OpenMP/global_constructor.ll (+5-5)
(modified) llvm/test/Transforms/OpenMP/globalization_remarks.ll (+2-2)
(modified) llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll (+2-2)
(modified) llvm/test/Transforms/OpenMP/indirect_call_kernel_info_crash.ll (+3-3)
(modified) llvm/test/Transforms/OpenMP/is_spmd_exec_mode_fold.ll (+9-9)
(modified) llvm/test/Transforms/OpenMP/nested_parallelism.ll (+7-7)
(modified) llvm/test/Transforms/OpenMP/parallel_level_fold.ll (+7-7)
(modified) llvm/test/Transforms/OpenMP/remove_globalization.ll (+9-9)
(modified) llvm/test/Transforms/OpenMP/replace_globalization.ll (+14-14)
(modified) llvm/test/Transforms/OpenMP/single_threaded_execution.ll (+3-3)
(modified) llvm/test/Transforms/OpenMP/spmdization.ll (+49-49)
(modified) llvm/test/Transforms/OpenMP/spmdization_assumes.ll (+5-5)
(modified) llvm/test/Transforms/OpenMP/spmdization_constant_prop.ll (+3-3)
(modified) llvm/test/Transforms/OpenMP/spmdization_guarding.ll (+9-9)
(modified) llvm/test/Transforms/OpenMP/spmdization_guarding_two_reaching_kernels.ll (+15-15)
(modified) llvm/test/Transforms/OpenMP/spmdization_indirect.ll (+15-15)
(modified) llvm/test/Transforms/OpenMP/spmdization_kernel_env_dep.ll (+7-6)
(modified) llvm/test/Transforms/OpenMP/spmdization_no_guarding_two_reaching_kernels.ll (+15-15)
(modified) llvm/test/Transforms/OpenMP/spmdization_remarks.ll (+5-5)
(modified) llvm/test/Transforms/OpenMP/value-simplify-openmp-opt.ll (+7-7)
(modified) llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp (+12-5)
(modified) openmp/libomptarget/DeviceRTL/include/Interface.h (+5-1)
(modified) openmp/libomptarget/DeviceRTL/include/State.h (+8-2)
(modified) openmp/libomptarget/DeviceRTL/src/Kernel.cpp (+10-6)
(modified) openmp/libomptarget/DeviceRTL/src/Reduction.cpp (+111-9)
(modified) openmp/libomptarget/DeviceRTL/src/State.cpp (+11-1)
(modified) openmp/libomptarget/include/Environment.h (+7)
(modified) openmp/libomptarget/include/omptarget.h (+10)
(modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp (+65-12)
(modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h (+17-2)
(added) openmp/libomptarget/test/offloading/parallel_target_teams_reduction.cpp (+36)
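
Note on the buffer size: the `emitKernelDeinit` hunk in the diff below builds an implicit union over every record in `TeamsReductions` and passes the union's alloc size to `createTargetDeinit`. The following is a minimal plain C++ sketch of why a union yields a size large enough for any single record. It is not code from the patch: the record types and the names `TeamsReductionBuffer`, `teamsReductionBufferSize`, and `checkAllRecordsFit` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-construct reduction records. In the patch these would be
// the RecordDecls collected in TeamsReductions; the fields here are made up.
struct ReductionRecA {
  int Sum;
};
struct ReductionRecB {
  double Min;
  double Max;
};
struct ReductionRecC {
  float Partial[3];
};

// A union overlays all members at the same address, so its size is at least
// the size of its largest member (rounded up to the union's alignment).
union TeamsReductionBuffer {
  ReductionRecA A;
  ReductionRecB B;
  ReductionRecC C;
};

// Size one buffer slot must have so that any of the records fits into it.
constexpr std::size_t teamsReductionBufferSize() {
  return sizeof(TeamsReductionBuffer);
}

// Compile-time check: the union covers the largest record.
static_assert(teamsReductionBufferSize() >=
                  std::max({sizeof(ReductionRecA), sizeof(ReductionRecB),
                            sizeof(ReductionRecC)}),
              "union must be at least as large as its largest member");

// Runtime variant of the same check, one record at a time.
inline void checkAllRecordsFit() {
  assert(sizeof(ReductionRecA) <= teamsReductionBufferSize());
  assert(sizeof(ReductionRecB) <= teamsReductionBufferSize());
  assert(sizeof(ReductionRecC) <= teamsReductionBufferSize());
}
```

In the patch itself the size comes from `DataLayout::getTypeAllocSize` on the converted LLVM type rather than from `sizeof`, so the sketch only mirrors the idea, not the mechanism.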

diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index 9d00ebae702802a..de028b0209c171a 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -803,8 +803,30 @@ void CGOpenMPRuntimeGPU::emitKernelDeinit(CodeGenFunction &CGF,
   if (!IsSPMD)
     emitGenericVarsEpilog(CGF);
 
+  // This is temporary until we remove the fixed sized buffer.
+  ASTContext &C = CGM.getContext();
+  RecordDecl *StaticRD = C.buildImplicitRecord(
+      "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
+  StaticRD->startDefinition();
+  for (const RecordDecl *TeamReductionRec : TeamsReductions) {
+    QualType RecTy = C.getRecordType(TeamReductionRec);
+    auto *Field = FieldDecl::Create(
+        C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
+        C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
+        /*BW=*/nullptr, /*Mutable=*/false,
+        /*InitStyle=*/ICIS_NoInit);
+    Field->setAccess(AS_public);
+    StaticRD->addDecl(Field);
+  }
+  StaticRD->completeDefinition();
+  QualType StaticTy = C.getRecordType(StaticRD);
+  llvm::Type *LLVMReductionsBufferTy =
+      CGM.getTypes().ConvertTypeForMem(StaticTy);
+  const auto &DL = CGM.getModule().getDataLayout();
+  uint64_t BufferSize =
+      DL.getTypeAllocSize(LLVMReductionsBufferTy).getFixedValue();
   CGBuilderTy &Bld = CGF.Builder;
-  OMPBuilder.createTargetDeinit(Bld);
+  OMPBuilder.createTargetDeinit(Bld, BufferSize);
 }
 
 void CGOpenMPRuntimeGPU::emitSPMDKernel(const OMPExecutableDirective &D,
@@ -2998,15 +3020,10 @@ void CGOpenMPRuntimeGPU::emitReduction(
         CGM.getContext(), PrivatesReductions, std::nullopt, VarFieldMap,
         C.getLangOpts().OpenMPCUDAReductionBufNum);
     TeamsReductions.push_back(TeamReductionRec);
-    if (!KernelTeamsReductionPtr) {
-      KernelTeamsReductionPtr = new llvm::GlobalVariable(
-          CGM.getModule(), CGM.VoidPtrTy, /*isConstant=*/true,
-          llvm::GlobalValue::InternalLinkage, nullptr,
-          "_openmp_teams_reductions_buffer_$_$ptr");
-    }
-    llvm::Value *GlobalBufferPtr = CGF.EmitLoadOfScalar(
-        Address(KernelTeamsReductionPtr, CGF.VoidPtrTy, CGM.getPointerAlign()),
-        /*Volatile=*/false, C.getPointerType(C.VoidPtrTy), Loc);
+    auto *KernelTeamsReductionPtr = CGF.EmitRuntimeCall(
+        OMPBuilder.getOrCreateRuntimeFunction(
+            CGM.getModule(), OMPRTL___kmpc_reduction_get_fixed_buffer),
+        {}, "_openmp_teams_reductions_buffer_$_$ptr");
     llvm::Value *GlobalToBufferCpyFn = ::emitListToGlobalCopyFunction(
         CGM, Privates, ReductionArrayTy, Loc, TeamReductionRec, VarFieldMap);
     llvm::Value *GlobalToBufferRedFn = ::emitListToGlobalReduceFunction(
@@ -3021,7 +3038,7 @@ void CGOpenMPRuntimeGPU::emitReduction(
     llvm::Value *Args[] = {
         RTLoc,
         ThreadId,
-        GlobalBufferPtr,
+        KernelTeamsReductionPtr,
         CGF.Builder.getInt32(C.getLangOpts().OpenMPCUDAReductionBufNum),
         RL,
         ShuffleAndReduceFn,
@@ -3654,42 +3671,6 @@ void CGOpenMPRuntimeGPU::processRequiresDirective(
   CGOpenMPRuntime::processRequiresDirective(D);
 }
 
-void CGOpenMPRuntimeGPU::clear() {
-
-  if (!TeamsReductions.empty()) {
-    ASTContext &C = CGM.getContext();
-    RecordDecl *StaticRD = C.buildImplicitRecord(
-        "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
-    StaticRD->startDefinition();
-    for (const RecordDecl *TeamReductionRec : TeamsReductions) {
-      QualType RecTy = C.getRecordType(TeamReductionRec);
-      auto *Field = FieldDecl::Create(
-          C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
-          C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
-          /*BW=*/nullptr, /*Mutable=*/false,
-          /*InitStyle=*/ICIS_NoInit);
-      Field->setAccess(AS_public);
-      StaticRD->addDecl(Field);
-    }
-    StaticRD->completeDefinition();
-    QualType StaticTy = C.getRecordType(StaticRD);
-    llvm::Type *LLVMReductionsBufferTy =
-        CGM.getTypes().ConvertTypeForMem(StaticTy);
-    // FIXME: nvlink does not handle weak linkage correctly (object with the
-    // different size are reported as erroneous).
-    // Restore CommonLinkage as soon as nvlink is fixed.
-    auto *GV = new llvm::GlobalVariable(
-        CGM.getModule(), LLVMReductionsBufferTy,
-        /*isConstant=*/false, llvm::GlobalValue::InternalLinkage,
-        llvm::Constant::getNullValue(LLVMReductionsBufferTy),
-        "_openmp_teams_reductions_buffer_$_");
-    KernelTeamsReductionPtr->setInitializer(
-        llvm::ConstantExpr::getPointerBitCastOrAddrSpaceCast(GV,
-                                                             CGM.VoidPtrTy));
-  }
-  CGOpenMPRuntime::clear();
-}
-
 llvm::Value *CGOpenMPRuntimeGPU::getGPUNumThreads(CodeGenFunction &CGF) {
   CGBuilderTy &Bld = CGF.Builder;
   llvm::Module *M = &CGF.CGM.getModule();
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
index 46e1361f2f895ba..141436f26230dde 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
@@ -130,7 +130,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
 
 public:
   explicit CGOpenMPRuntimeGPU(CodeGenModule &CGM);
-  void clear() override;
 
   bool isGPU() const override { return true; };
 
@@ -386,7 +385,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
   /// Maps the function to the list of the globalized variables with their
   /// addresses.
   llvm::SmallDenseMap<llvm::Function *, FunctionData> FunctionGlobalizedDecls;
-  llvm::GlobalVariable *KernelTeamsReductionPtr = nullptr;
   /// List of the records with the list of fields for the reductions across the
   /// teams. Used to build the intermediate buffer for the fast teams
   /// reductions.
diff --git a/clang/lib/Sema/SemaOpenMP.cpp b/clang/lib/Sema/SemaOpenMP.cpp
index 75f9e152dca9297..145f4dc4670081d 100644
--- a/clang/lib/Sema/SemaOpenMP.cpp
+++ b/clang/lib/Sema/SemaOpenMP.cpp
@@ -4249,12 +4249,15 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             ParamsTarget, /*OpenMPCaptureLevel=*/1);
+                             ParamsTarget,
+                             /*OpenMPCaptureLevel=*/1);
     Sema::CapturedParamNameType ParamsTeamsOrParallel[] = {
         std::make_pair(".global_tid.", KmpInt32PtrTy),
         std::make_pair(".bound_tid.", KmpInt32PtrTy),
@@ -4293,8 +4296,13 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             std::make_pair(StringRef(), QualType()),
+                             ParamsTarget,
                              /*OpenMPCaptureLevel=*/1);
     break;
   }
@@ -4499,9 +4507,11 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
                              ParamsTarget, /*OpenMPCaptureLevel=*/1);
diff --git a/clang/test/OpenMP/amdgcn_target_codegen.cpp b/clang/test/OpenMP/amdgcn_target_codegen.cpp
index 90d2ebdf26bd645..3ea2d107f072adb 100644
--- a/clang/test/OpenMP/amdgcn_target_codegen.cpp
+++ b/clang/test/OpenMP/amdgcn_target_codegen.cpp
@@ -29,15 +29,18 @@ int test_amdgcn_target_tid_threads_simd() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -66,19 +69,22 @@ int test_amdgcn_target_tid_threads_simd() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[TMP:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTOMP_IV:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[TMP_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[TMP]] to ptr
 // CHECK-NEXT:    [[DOTOMP_IV_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTOMP_IV]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index b2b630b546713dd..de150a0fcb4afd2 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -97,21 +97,24 @@ int main() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[SUM_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[N:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[__VLA_EXPR0:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[SUM_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SUM_ADDR]] to ptr
 // CHECK-NEXT:    [[N_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[N]] to ptr
 // CHECK-NEXT:    [[__VLA_EXPR0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[__VLA_EXPR0]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
 // CHECK-NEXT:    [[I1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I1]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[SUM]], ptr [[SUM_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[SUM_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -174,26 +177,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[RESULT_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP2]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -540,26 +546,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo3v_l52
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load...
[truncated]

jhuber6

LG

We default to fewer than 1024 teams if the user did not specify otherwise. As such, we can avoid the extra logic in the teams reduction that handles more than num_of_records (default 1024) teams. This is a stopgap, but it still shaves off 33% of the runtime in some simple reduction examples.
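As a rough illustration of the idea, the host-side sketch below models a reduction buffer with one slot per team and contrasts it with a path where teams must share slots. It is not the device runtime code: `kNumRecords`, `reduceFewTeams`, `reduceManyTeams`, and `reduceTeams` are hypothetical names, and the slot-sharing scheme in the general path is only one assumed way to handle more teams than records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a team reduction buffer. The names below
// are illustrative and do not come from the OpenMP device runtime.
constexpr std::size_t kNumRecords = 1024; // stands in for num_of_records

// Specialized path: valid only when every team can own a buffer slot.
template <typename T, typename Op>
T reduceFewTeams(const std::vector<T> &teamPartials, T identity, Op op) {
  assert(teamPartials.size() <= kNumRecords && "one slot per team required");
  std::vector<T> buffer(kNumRecords, identity);
  // Each team stores its partial result into its own slot; no combining with
  // other teams happens at this stage.
  for (std::size_t team = 0; team < teamPartials.size(); ++team)
    buffer[team] = teamPartials[team];
  // A single final pass combines only the slots that were actually written.
  T result = identity;
  for (std::size_t team = 0; team < teamPartials.size(); ++team)
    result = op(result, buffer[team]);
  return result;
}

// General path (assumed scheme): with more teams than slots, several teams
// fold their partial results into the same slot before the final pass.
template <typename T, typename Op>
T reduceManyTeams(const std::vector<T> &teamPartials, T identity, Op op) {
  std::vector<T> buffer(kNumRecords, identity);
  for (std::size_t team = 0; team < teamPartials.size(); ++team) {
    const std::size_t slot = team % kNumRecords;
    buffer[slot] = op(buffer[slot], teamPartials[team]);
  }
  T result = identity;
  for (std::size_t slot = 0; slot < kNumRecords; ++slot)
    result = op(result, buffer[slot]);
  return result;
}

// Dispatch: take the cheaper path whenever the team count allows it.
template <typename T, typename Op>
T reduceTeams(const std::vector<T> &teamPartials, T identity, Op op) {
  if (teamPartials.size() <= kNumRecords)
    return reduceFewTeams(teamPartials, identity, op);
  return reduceManyTeams(teamPartials, identity, op);
}
```

Calling `reduceTeams` with three partial results `{3, 5, 7}`, identity `1`, and a multiplication functor takes the specialized path; a vector of 2000 partial results takes the general one. Both paths should agree for any associative, commutative operator, provided the identity passed in matches that operator.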

ronlieb · 2023-11-04T11:16:02Z

This patch seems to break these 4 SOLLVE tests:

./sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distribute_reduction_and.c
./sollve_vv/tests/5.0/target_teams_distribute/test_target_teams_distribute_reduction_and.c
./sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distribute_reduction_multiply.c
./sollve_vv/tests/5.0/target_teams_distribute/test_target_teams_distribute_reduction_multiply.c

They fail in the latest build of LLVM and pass if the patch is reverted locally in the build.

…case (llvm#70766)" fails 4 sollve tests: test_target_teams_distribute_reduction_and.c test_target_teams_distribute_reduction_multiply.c test_target_teams_distribute_reduction_and.c test_target_teams_distribute_reduction_multiply.c This reverts commit eab828d. Change-Id: If6beb31e12531c9232ccf9a711fbb2a1cbe99898

AntonRydahl · 2023-11-07T23:41:13Z

This commit breaks minimization and multiplication reductions.

shiltian · 2023-11-07T23:55:45Z

The patch has been reverted @AntonRydahl

AntonRydahl · 2023-11-08T00:01:48Z

When? I just found it with git bisect on main. Maybe something is wrong in my fork.

ronlieb · 2023-11-08T00:04:01Z

I don't see it reverted either. I do recall, Shilei, that you reverted a different reduction-related patch with 3 SOLLVE failures; this one has 4.

AntonRydahl · 2023-11-08T00:04:44Z

I think there were multiple reduction commits on the same day. It is in the history here: https://github.com/llvm/llvm-project/commits/main/openmp/libomptarget/DeviceRTL/src/Reduction.cpp

shiltian · 2023-11-08T00:14:33Z

@AntonRydahl @ronlieb Sorry I was looking at the wrong one:

[searlmc1](https://github.com/searlmc1) pushed a commit to RadeonOpenCompute/llvm-project that referenced this pull request [3 days ago](https://github.com/llvm/llvm-project/pull/70766#ref-commit-07441d5)
@ronlieb
[Revert "[OpenMP] Provide a specialized team reduction for the common …](https://github.com/RadeonOpenCompute/llvm-project/commit/07441d5b9640dd7549c6472883d7dedfc82d7426)

I'll revert it right now.

…case (#70766)" This reverts commit eab828d.

AntonRydahl · 2023-11-08T00:19:56Z

Thanks a bunch, @shiltian!

ronlieb · 2023-11-08T00:21:59Z

Shilei, are you willing to revert this one also? It breaks SPEC ACCEL v1.4 552.pep:
[OpenMP][NFC] Split the reduction buffer size into two components

shiltian · 2023-11-08T00:29:54Z

That one has to be reverted by @jdoerfert as I tried but there are way too many conflicts.

jdoerfert · 2023-11-08T00:42:57Z

I'll revert this one.

Based on #70766 I think it would be good to have a few more offloading reduction tests, so we do not accidentally break minimum and maximum reductions another time.

jdoerfert · 2023-11-10T22:32:13Z

Shilei, are you willing to revert this one also? It breaks SPEC ACCEL v1.4 552.pep: [OpenMP][NFC] Split the reduction buffer size into two components

It also caused problems for OpenMC. I used OpenMC to verify my fix worked. I did not think this patch changed much but I forgot that they used the type for offset calculations, not only for type adjustment. The fix will make sure we properly adjust the layout change everywhere.

Based on llvm/llvm-project#70766 I think it would be good to have a few more offloading reduction tests, so we do not accidentally break minimum and maximum reductions another time.
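A minimal sketch of what such a reduction test could look like is shown below, covering the multiplication and minimum operators mentioned in this thread. The helper names `offloadProduct` and `offloadMin` are hypothetical and the code is not taken from the LLVM or SOLLVE test suites. It assumes that a compiler invoked without OpenMP support ignores the pragmas and runs the loops serially on the host, which gives the same results.

```cpp
#include <limits>
#include <vector>

// Hypothetical test helpers, not part of any existing test suite.

// Multiplies all elements inside a target teams region.
long long offloadProduct(const std::vector<long long> &data) {
  const long long *p = data.data();
  const int n = static_cast<int>(data.size());
  long long prod = 1; // identity for multiplication
#pragma omp target teams distribute parallel for reduction(* : prod) \
    map(to : p[0 : n]) map(tofrom : prod)
  for (int i = 0; i < n; ++i)
    prod *= p[i];
  return prod;
}

// Finds the smallest element inside a target teams region.
int offloadMin(const std::vector<int> &data) {
  const int *p = data.data();
  const int n = static_cast<int>(data.size());
  int lo = std::numeric_limits<int>::max(); // identity for min
#pragma omp target teams distribute parallel for reduction(min : lo) \
    map(to : p[0 : n]) map(tofrom : lo)
  for (int i = 0; i < n; ++i)
    lo = p[i] < lo ? p[i] : lo;
  return lo;
}
```

Integer inputs keep the expected values exact regardless of how the partial results are combined. Operators whose identity is not zero (multiplication, minimum, logical and) are worth covering explicitly, since a sum-only test would not exercise them.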

jdoerfert requested review from shiltian and jhuber6 October 31, 2023 05:44

jdoerfert force-pushed the team_reduction_work_specialization branch 3 times, most recently from 9c557e1 to e28dfce Compare November 2, 2023 19:52

jhuber6 approved these changes Nov 2, 2023


jdoerfert force-pushed the team_reduction_work_specialization branch from e28dfce to 33b1a34 Compare November 2, 2023 22:47

jdoerfert merged commit eab828d into llvm:main Nov 2, 2023

jdoerfert deleted the team_reduction_work_specialization branch November 2, 2023 22:50

shiltian added a commit that referenced this pull request Nov 8, 2023

Revert "[OpenMP] Provide a specialized team reduction for the common …

6e574f1

…case (#70766)" This reverts commit eab828d.

AntonRydahl mentioned this pull request Nov 8, 2023

[OpenMP ]Added more libomptarget reduction tests #71616

Merged

jdoerfert added a commit to jdoerfert/llvm-project that referenced this pull request Apr 8, 2024

Reduction specialization llvm#70766

ab9157c
