[OpenMP] Team reduction work specialization #70766

Merged: 1 commit, Nov 2, 2023

jdoerfert (Member)

Only the last commit belongs to this PR; the others are part of existing PRs.

@jdoerfert requested review from shiltian and jhuber6 on October 31, 2023, 05:44.
@llvmbot added the following labels on Oct 31, 2023: clang, backend:AMDGPU, clang:frontend, clang:codegen, flang:openmp, llvm:transforms, clang:openmp, openmp:libomptarget.
llvmbot (Member) commented Oct 31, 2023

@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Johannes Doerfert (jdoerfert)

Changes

Only the last commit belongs to this PR; the others are part of existing PRs.


Patch is 4.73 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70766.diff

186 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+28-47)
  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.h (-2)
  • (modified) clang/lib/Sema/SemaOpenMP.cpp (+18-8)
  • (modified) clang/test/OpenMP/amdgcn_target_codegen.cpp (+10-4)
  • (modified) clang/test/OpenMP/amdgcn_target_device_vla.cpp (+20-8)
  • (modified) clang/test/OpenMP/amdgcn_target_init_temp_alloca.cpp (+2)
  • (modified) clang/test/OpenMP/amdgpu_target_with_aligned_attribute.c (+5-2)
  • (modified) clang/test/OpenMP/assumes_include_nvptx.cpp (+2-2)
  • (modified) clang/test/OpenMP/bug60602.cpp (+7-7)
  • (modified) clang/test/OpenMP/declare_target_codegen.cpp (+6-6)
  • (modified) clang/test/OpenMP/declare_target_codegen_globalization.cpp (+4-2)
  • (modified) clang/test/OpenMP/declare_target_link_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/declare_variant_mixed_codegen.c (+1-1)
  • (modified) clang/test/OpenMP/distribute_codegen.cpp (+62-42)
  • (modified) clang/test/OpenMP/distribute_firstprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_parallel_for_codegen.cpp (+118-118)
  • (modified) clang/test/OpenMP/distribute_parallel_for_firstprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_if_codegen.cpp (+31-31)
  • (modified) clang/test/OpenMP/distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_num_threads_codegen.cpp (+152-152)
  • (modified) clang/test/OpenMP/distribute_parallel_for_private_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_codegen.cpp (+118-118)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_firstprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp (+128-128)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_num_threads_codegen.cpp (+152-152)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_private_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/distribute_private_codegen.cpp (+40-40)
  • (modified) clang/test/OpenMP/distribute_simd_codegen.cpp (+60-20)
  • (modified) clang/test/OpenMP/distribute_simd_firstprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_simd_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_simd_private_codegen.cpp (+40-40)
  • (modified) clang/test/OpenMP/distribute_simd_reduction_codegen.cpp (+14-14)
  • (modified) clang/test/OpenMP/nvptx_SPMD_codegen.cpp (+2679-2301)
  • (modified) clang/test/OpenMP/nvptx_data_sharing.cpp (+4-2)
  • (modified) clang/test/OpenMP/nvptx_declare_target_var_ctor_dtor_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/nvptx_distribute_parallel_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_lambda_capturing.cpp (+47-27)
  • (modified) clang/test/OpenMP/nvptx_multi_target_parallel_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_nested_parallel_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_parallel_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/nvptx_parallel_for_codegen.cpp (+4-2)
  • (modified) clang/test/OpenMP/nvptx_target_codegen.cpp (+64-32)
  • (modified) clang/test/OpenMP/nvptx_target_firstprivate_codegen.cpp (+12-6)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_num_threads_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_proc_bind_codegen.cpp (+72-36)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen.cpp (+36-18)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen_tbaa_PR46146.cpp (+272-268)
  • (modified) clang/test/OpenMP/nvptx_target_printf_codegen.c (+24-12)
  • (modified) clang/test/OpenMP/nvptx_target_simd_codegen.cpp (+318-270)
  • (modified) clang/test/OpenMP/nvptx_target_teams_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp (+72-36)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp (+364-348)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_simd_codegen.cpp (+390-342)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_codegen.cpp (+60-30)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_ompx_bare_codegen.cpp (+3-1)
  • (modified) clang/test/OpenMP/nvptx_teams_codegen.cpp (+32-16)
  • (modified) clang/test/OpenMP/nvptx_teams_reduction_codegen.cpp (+156-138)
  • (modified) clang/test/OpenMP/ompx_attributes_codegen.cpp (+3-3)
  • (modified) clang/test/OpenMP/openmp_offload_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/reduction_implicit_map.cpp (+35-33)
  • (modified) clang/test/OpenMP/remarks_parallel_in_multiple_target_state_machines.c (+2-1)
  • (modified) clang/test/OpenMP/remarks_parallel_in_target_state_machine.c (+2-1)
  • (modified) clang/test/OpenMP/target_codegen_global_capture.cpp (+30-30)
  • (modified) clang/test/OpenMP/target_firstprivate_codegen.cpp (+72-24)
  • (modified) clang/test/OpenMP/target_map_codegen_03.cpp (+6-6)
  • (modified) clang/test/OpenMP/target_map_member_expr_codegen.cpp (+2-2)
  • (modified) clang/test/OpenMP/target_ompx_dyn_cgroup_mem_codegen.cpp (+36-12)
  • (modified) clang/test/OpenMP/target_parallel_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_parallel_debug_codegen.cpp (+441-420)
  • (modified) clang/test/OpenMP/target_parallel_for_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_parallel_for_debug_codegen.cpp (+610-589)
  • (modified) clang/test/OpenMP/target_parallel_for_simd_codegen.cpp (+84-28)
  • (modified) clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp (+79-3)
  • (modified) clang/test/OpenMP/target_parallel_for_tl_codegen.cpp (+72-3)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-1.cpp (+44-44)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-2.cpp (+24-16)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-3.cpp (+610-589)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen.cpp (+5-2)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_depend_codegen.cpp (+4-6)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_tl_codegen.cpp (+72-3)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_uses_allocators_codegen.cpp (+2-2)
  • (modified) clang/test/OpenMP/target_parallel_if_codegen.cpp (+96-72)
  • (modified) clang/test/OpenMP/target_parallel_num_threads_codegen.cpp (+78-54)
  • (modified) clang/test/OpenMP/target_parallel_tl_codegen.cpp (+22-3)
  • (modified) clang/test/OpenMP/target_private_codegen.cpp (+14-7)
  • (modified) clang/test/OpenMP/target_reduction_codegen.cpp (+12-6)
  • (modified) clang/test/OpenMP/target_simd_tl_codegen.cpp (+35-3)
  • (modified) clang/test/OpenMP/target_task_affinity_codegen.cpp (+6-2)
  • (modified) clang/test/OpenMP/target_teams_codegen.cpp (+66-22)
  • (modified) clang/test/OpenMP/target_teams_distribute_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_teams_distribute_collapse_codegen.cpp (+18-18)
  • (modified) clang/test/OpenMP/target_teams_distribute_dist_schedule_codegen.cpp (+42-42)
  • (modified) clang/test/OpenMP/target_teams_distribute_firstprivate_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_dist_schedule_codegen.cpp (+60-60)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_firstprivate_codegen.cpp (+138-128)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_if_codegen.cpp (+34-34)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_order_codegen.cpp (+4-4)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_schedule_codegen.cpp (+192-192)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_dist_schedule_codegen.cpp (+60-60)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_firstprivate_codegen.cpp (+138-128)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_schedule_codegen.cpp (+192-192)
  • (modified) clang/test/OpenMP/target_teams_distribute_private_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_reduction_codegen.cpp (+145-145)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_codegen.cpp (+84-28)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_collapse_codegen.cpp (+18-18)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_dist_schedule_codegen.cpp (+42-42)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_firstprivate_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_private_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_reduction_codegen.cpp (+19-19)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen-1.cpp (+16-8)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen.cpp (+15-12)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_depend_codegen.cpp (+4-6)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_if_codegen.cpp (+34-34)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_order_codegen.cpp (+4-4)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_uses_allocators_codegen.cpp (+3-3)
  • (modified) clang/test/OpenMP/target_teams_map_codegen.cpp (+130-94)
  • (modified) clang/test/OpenMP/target_teams_num_teams_codegen.cpp (+78-54)
  • (modified) clang/test/OpenMP/target_teams_thread_limit_codegen.cpp (+44-20)
  • (modified) clang/test/OpenMP/teams_codegen.cpp (+72-56)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+4-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPKinds.def (+6-2)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+30-2)
  • (modified) llvm/test/Transforms/OpenMP/add_attributes.ll (+4-4)
  • (modified) llvm/test/Transforms/OpenMP/always_inline_device.ll (+4-4)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines.ll (+85-85)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines_pre_lto.ll (+148-148)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines_remarks.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/deduplication_target.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold.ll (+13-13)
  • (modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold_optnone.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/global_constructor.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/globalization_remarks.ll (+2-2)
  • (modified) llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll (+2-2)
  • (modified) llvm/test/Transforms/OpenMP/indirect_call_kernel_info_crash.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/is_spmd_exec_mode_fold.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/nested_parallelism.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/parallel_level_fold.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/remove_globalization.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/replace_globalization.ll (+14-14)
  • (modified) llvm/test/Transforms/OpenMP/single_threaded_execution.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/spmdization.ll (+49-49)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_assumes.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_constant_prop.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_guarding.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_guarding_two_reaching_kernels.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_indirect.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_kernel_env_dep.ll (+7-6)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_no_guarding_two_reaching_kernels.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_remarks.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/value-simplify-openmp-opt.ll (+7-7)
  • (modified) llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp (+12-5)
  • (modified) openmp/libomptarget/DeviceRTL/include/Interface.h (+5-1)
  • (modified) openmp/libomptarget/DeviceRTL/include/State.h (+8-2)
  • (modified) openmp/libomptarget/DeviceRTL/src/Kernel.cpp (+10-6)
  • (modified) openmp/libomptarget/DeviceRTL/src/Reduction.cpp (+111-9)
  • (modified) openmp/libomptarget/DeviceRTL/src/State.cpp (+11-1)
  • (modified) openmp/libomptarget/include/Environment.h (+7)
  • (modified) openmp/libomptarget/include/omptarget.h (+10)
  • (modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp (+65-12)
  • (modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h (+17-2)
  • (added) openmp/libomptarget/test/offloading/parallel_target_teams_reduction.cpp (+36)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index 9d00ebae702802a..de028b0209c171a 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -803,8 +803,30 @@ void CGOpenMPRuntimeGPU::emitKernelDeinit(CodeGenFunction &CGF,
   if (!IsSPMD)
     emitGenericVarsEpilog(CGF);
 
+  // This is temporary until we remove the fixed sized buffer.
+  ASTContext &C = CGM.getContext();
+  RecordDecl *StaticRD = C.buildImplicitRecord(
+      "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
+  StaticRD->startDefinition();
+  for (const RecordDecl *TeamReductionRec : TeamsReductions) {
+    QualType RecTy = C.getRecordType(TeamReductionRec);
+    auto *Field = FieldDecl::Create(
+        C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
+        C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
+        /*BW=*/nullptr, /*Mutable=*/false,
+        /*InitStyle=*/ICIS_NoInit);
+    Field->setAccess(AS_public);
+    StaticRD->addDecl(Field);
+  }
+  StaticRD->completeDefinition();
+  QualType StaticTy = C.getRecordType(StaticRD);
+  llvm::Type *LLVMReductionsBufferTy =
+      CGM.getTypes().ConvertTypeForMem(StaticTy);
+  const auto &DL = CGM.getModule().getDataLayout();
+  uint64_t BufferSize =
+      DL.getTypeAllocSize(LLVMReductionsBufferTy).getFixedValue();
   CGBuilderTy &Bld = CGF.Builder;
-  OMPBuilder.createTargetDeinit(Bld);
+  OMPBuilder.createTargetDeinit(Bld, BufferSize);
 }
 
 void CGOpenMPRuntimeGPU::emitSPMDKernel(const OMPExecutableDirective &D,
@@ -2998,15 +3020,10 @@ void CGOpenMPRuntimeGPU::emitReduction(
         CGM.getContext(), PrivatesReductions, std::nullopt, VarFieldMap,
         C.getLangOpts().OpenMPCUDAReductionBufNum);
     TeamsReductions.push_back(TeamReductionRec);
-    if (!KernelTeamsReductionPtr) {
-      KernelTeamsReductionPtr = new llvm::GlobalVariable(
-          CGM.getModule(), CGM.VoidPtrTy, /*isConstant=*/true,
-          llvm::GlobalValue::InternalLinkage, nullptr,
-          "_openmp_teams_reductions_buffer_$_$ptr");
-    }
-    llvm::Value *GlobalBufferPtr = CGF.EmitLoadOfScalar(
-        Address(KernelTeamsReductionPtr, CGF.VoidPtrTy, CGM.getPointerAlign()),
-        /*Volatile=*/false, C.getPointerType(C.VoidPtrTy), Loc);
+    auto *KernelTeamsReductionPtr = CGF.EmitRuntimeCall(
+        OMPBuilder.getOrCreateRuntimeFunction(
+            CGM.getModule(), OMPRTL___kmpc_reduction_get_fixed_buffer),
+        {}, "_openmp_teams_reductions_buffer_$_$ptr");
     llvm::Value *GlobalToBufferCpyFn = ::emitListToGlobalCopyFunction(
         CGM, Privates, ReductionArrayTy, Loc, TeamReductionRec, VarFieldMap);
     llvm::Value *GlobalToBufferRedFn = ::emitListToGlobalReduceFunction(
@@ -3021,7 +3038,7 @@ void CGOpenMPRuntimeGPU::emitReduction(
     llvm::Value *Args[] = {
         RTLoc,
         ThreadId,
-        GlobalBufferPtr,
+        KernelTeamsReductionPtr,
         CGF.Builder.getInt32(C.getLangOpts().OpenMPCUDAReductionBufNum),
         RL,
         ShuffleAndReduceFn,
@@ -3654,42 +3671,6 @@ void CGOpenMPRuntimeGPU::processRequiresDirective(
   CGOpenMPRuntime::processRequiresDirective(D);
 }
 
-void CGOpenMPRuntimeGPU::clear() {
-
-  if (!TeamsReductions.empty()) {
-    ASTContext &C = CGM.getContext();
-    RecordDecl *StaticRD = C.buildImplicitRecord(
-        "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
-    StaticRD->startDefinition();
-    for (const RecordDecl *TeamReductionRec : TeamsReductions) {
-      QualType RecTy = C.getRecordType(TeamReductionRec);
-      auto *Field = FieldDecl::Create(
-          C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
-          C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
-          /*BW=*/nullptr, /*Mutable=*/false,
-          /*InitStyle=*/ICIS_NoInit);
-      Field->setAccess(AS_public);
-      StaticRD->addDecl(Field);
-    }
-    StaticRD->completeDefinition();
-    QualType StaticTy = C.getRecordType(StaticRD);
-    llvm::Type *LLVMReductionsBufferTy =
-        CGM.getTypes().ConvertTypeForMem(StaticTy);
-    // FIXME: nvlink does not handle weak linkage correctly (object with the
-    // different size are reported as erroneous).
-    // Restore CommonLinkage as soon as nvlink is fixed.
-    auto *GV = new llvm::GlobalVariable(
-        CGM.getModule(), LLVMReductionsBufferTy,
-        /*isConstant=*/false, llvm::GlobalValue::InternalLinkage,
-        llvm::Constant::getNullValue(LLVMReductionsBufferTy),
-        "_openmp_teams_reductions_buffer_$_");
-    KernelTeamsReductionPtr->setInitializer(
-        llvm::ConstantExpr::getPointerBitCastOrAddrSpaceCast(GV,
-                                                             CGM.VoidPtrTy));
-  }
-  CGOpenMPRuntime::clear();
-}
-
 llvm::Value *CGOpenMPRuntimeGPU::getGPUNumThreads(CodeGenFunction &CGF) {
   CGBuilderTy &Bld = CGF.Builder;
   llvm::Module *M = &CGF.CGM.getModule();
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
index 46e1361f2f895ba..141436f26230dde 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
@@ -130,7 +130,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
 
 public:
   explicit CGOpenMPRuntimeGPU(CodeGenModule &CGM);
-  void clear() override;
 
   bool isGPU() const override { return true; };
 
@@ -386,7 +385,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
   /// Maps the function to the list of the globalized variables with their
   /// addresses.
   llvm::SmallDenseMap<llvm::Function *, FunctionData> FunctionGlobalizedDecls;
-  llvm::GlobalVariable *KernelTeamsReductionPtr = nullptr;
   /// List of the records with the list of fields for the reductions across the
   /// teams. Used to build the intermediate buffer for the fast teams
   /// reductions.
diff --git a/clang/lib/Sema/SemaOpenMP.cpp b/clang/lib/Sema/SemaOpenMP.cpp
index 75f9e152dca9297..145f4dc4670081d 100644
--- a/clang/lib/Sema/SemaOpenMP.cpp
+++ b/clang/lib/Sema/SemaOpenMP.cpp
@@ -4249,12 +4249,15 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             ParamsTarget, /*OpenMPCaptureLevel=*/1);
+                             ParamsTarget,
+                             /*OpenMPCaptureLevel=*/1);
     Sema::CapturedParamNameType ParamsTeamsOrParallel[] = {
         std::make_pair(".global_tid.", KmpInt32PtrTy),
         std::make_pair(".bound_tid.", KmpInt32PtrTy),
@@ -4293,8 +4296,13 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             std::make_pair(StringRef(), QualType()),
+                             ParamsTarget,
                              /*OpenMPCaptureLevel=*/1);
     break;
   }
@@ -4499,9 +4507,11 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
                              ParamsTarget, /*OpenMPCaptureLevel=*/1);
diff --git a/clang/test/OpenMP/amdgcn_target_codegen.cpp b/clang/test/OpenMP/amdgcn_target_codegen.cpp
index 90d2ebdf26bd645..3ea2d107f072adb 100644
--- a/clang/test/OpenMP/amdgcn_target_codegen.cpp
+++ b/clang/test/OpenMP/amdgcn_target_codegen.cpp
@@ -29,15 +29,18 @@ int test_amdgcn_target_tid_threads_simd() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -66,19 +69,22 @@ int test_amdgcn_target_tid_threads_simd() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[TMP:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTOMP_IV:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[TMP_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[TMP]] to ptr
 // CHECK-NEXT:    [[DOTOMP_IV_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTOMP_IV]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index b2b630b546713dd..de150a0fcb4afd2 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -97,21 +97,24 @@ int main() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[SUM_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[N:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[__VLA_EXPR0:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[SUM_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SUM_ADDR]] to ptr
 // CHECK-NEXT:    [[N_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[N]] to ptr
 // CHECK-NEXT:    [[__VLA_EXPR0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[__VLA_EXPR0]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
 // CHECK-NEXT:    [[I1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I1]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[SUM]], ptr [[SUM_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[SUM_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -174,26 +177,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[RESULT_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP2]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -540,26 +546,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo3v_l52
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load...
[truncated]

@jdoerfert jdoerfert force-pushed the team_reduction_work_specialization branch 3 times, most recently from 9c557e1 to e28dfce Compare November 2, 2023 19:52
Contributor

@jhuber6 jhuber6 left a comment

LG

We default to < 1024 teams if the user did not specify otherwise. As
such we can avoid the extra logic in the teams reduction that handles
more than num_of_records (default 1024) teams. This is a stopgap but
still shaves off 33% of the runtime in some simple reduction examples.
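
To illustrate the idea in the commit message above, here is a minimal host-side sketch, not the actual DeviceRTL code. It assumes a fixed buffer with one slot per team, and every name in it (`NumRecords`, `reduceGeneral`, `reduceSpecialized`) is hypothetical. The general path has to process teams in waves because there may be more teams than slots. The specialized path can assume every team owns a slot and skip that bookkeeping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the runtime's record count; the commit message
// only states that the default is 1024.
constexpr std::size_t NumRecords = 1024;

// General path (sketch): there may be more teams than buffer slots, so the
// teams are handled in waves of at most NumRecords, and the buffer is
// drained into the result after every wave.
long reduceGeneral(const std::vector<long> &teamPartials) {
  std::vector<long> buffer(NumRecords);
  long result = 0;
  for (std::size_t base = 0; base < teamPartials.size(); base += NumRecords) {
    std::size_t count = std::min(NumRecords, teamPartials.size() - base);
    for (std::size_t i = 0; i < count; ++i)
      buffer[i] = teamPartials[base + i]; // each team fills its slot
    for (std::size_t i = 0; i < count; ++i)
      result += buffer[i];                // drain before slots are reused
  }
  return result;
}

// Specialized path (sketch): the number of teams is known not to exceed
// NumRecords, so each team owns exactly one slot and a single pass suffices.
long reduceSpecialized(const std::vector<long> &teamPartials) {
  assert(teamPartials.size() <= NumRecords && "specialization precondition");
  std::vector<long> buffer(teamPartials.begin(), teamPartials.end());
  long result = 0;
  for (long value : buffer)
    result += value;
  return result;
}
```

Both functions return the same sum whenever the team count fits in the buffer. The sketch is sequential and uses addition only, so it says nothing about how the real runtime synchronizes teams or handles other reduction operators.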
@jdoerfert jdoerfert force-pushed the team_reduction_work_specialization branch from e28dfce to 33b1a34 Compare November 2, 2023 22:47
@jdoerfert jdoerfert merged commit eab828d into llvm:main Nov 2, 2023
@jdoerfert jdoerfert deleted the team_reduction_work_specialization branch November 2, 2023 22:50
@ronlieb
Contributor

ronlieb commented Nov 4, 2023

patch seems to break these 4 sollve tests

./sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distribute_reduction_and.c
./sollve_vv/tests/5.0/target_teams_distribute/test_target_teams_distribute_reduction_and.c
./sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distribute_reduction_multiply.c
./sollve_vv/tests/5.0/target_teams_distribute/test_target_teams_distribute_reduction_multiply.c

These fail in the latest build of llvm and pass if the patch is reverted locally in my build.

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 4, 2023
…case (llvm#70766)"

fails 4 sollve tests:

   test_target_teams_distribute_reduction_and.c
   test_target_teams_distribute_reduction_multiply.c
   test_target_teams_distribute_reduction_and.c
   test_target_teams_distribute_reduction_multiply.c

This reverts commit eab828d.

Change-Id: If6beb31e12531c9232ccf9a711fbb2a1cbe99898
@AntonRydahl
Contributor

This commit breaks minimization and multiplication reductions.

@shiltian
Contributor

shiltian commented Nov 7, 2023

The patch has been reverted @AntonRydahl

@AntonRydahl
Contributor

When? I just found it with git bisect on main. Maybe something is wrong in my fork.

@ronlieb
Contributor

ronlieb commented Nov 8, 2023

I don't see it reverted either. Shilei, I do recall you reverted a different reduction-related patch with 3 sollve failures; this one has 4.

@AntonRydahl
Contributor

I think there were multiple reduction commits on the same day. It is in the history here: https://github.com/llvm/llvm-project/commits/main/openmp/libomptarget/DeviceRTL/src/Reduction.cpp

@shiltian
Contributor

shiltian commented Nov 8, 2023

@AntonRydahl @ronlieb Sorry I was looking at the wrong one:

[searlmc1](https://github.com/searlmc1) pushed a commit to RadeonOpenCompute/llvm-project that referenced this pull request [3 days ago](https://github.com/llvm/llvm-project/pull/70766#ref-commit-07441d5)
@ronlieb
[Revert "[OpenMP] Provide a specialized team reduction for the common …](https://github.com/RadeonOpenCompute/llvm-project/commit/07441d5b9640dd7549c6472883d7dedfc82d7426)

I'll revert it right now.

shiltian added a commit that referenced this pull request Nov 8, 2023
@AntonRydahl
Contributor

Thanks a bunch, @shiltian!

@ronlieb
Contributor

ronlieb commented Nov 8, 2023

Shilei, are you willing to revert this one also? It breaks SPEC ACCEL v1.4 552.pep:
[OpenMP][NFC] Split the reduction buffer size into two components

@shiltian
Contributor

shiltian commented Nov 8, 2023

That one has to be reverted by @jdoerfert; I tried, but there are way too many conflicts.

@jdoerfert
Member Author

I'll revert this one.

AntonRydahl added a commit that referenced this pull request Nov 8, 2023
Based on #70766 I think it
would be good to have a few more offloading reduction tests, so we do
not accidentally break minimum and maximum reductions another time.
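
As a rough illustration of the kind of offloading reduction test the message above asks for, here is a minimal sketch. It is not the test added by the referenced commit. The struct `ReductionResults`, the function `runReductions`, and the input pattern are all made up for this example. It exercises the `min`, `max`, `*`, and `&&` reduction operators in a single `target teams distribute parallel for` region and returns the results so they can be compared against values known by construction.

```cpp
#include <cassert>
#include <climits>

// Hypothetical result bundle for the sketch.
struct ReductionResults {
  int minimum;
  int maximum;
  long long product;
  int allTrue;
};

// Runs four reductions over a synthetic input of n elements.
// Element i has the value (i * 7) % 101 - 50, which lies in [-50, 50].
ReductionResults runReductions(int n) {
  assert(n > 0);
  int minimum = INT_MAX;
  int maximum = INT_MIN;
  long long product = 1;
  int allTrue = 1;

  // Without OpenMP support the pragma is ignored and the loop runs serially,
  // which must produce the same results.
#pragma omp target teams distribute parallel for                               \
    reduction(min : minimum) reduction(max : maximum)                          \
    reduction(* : product) reduction(&& : allTrue)
  for (int i = 0; i < n; ++i) {
    int value = (i * 7) % 101 - 50;
    if (value < minimum)
      minimum = value;
    if (value > maximum)
      maximum = value;
    // Multiply by 2 only every 250th iteration to keep the product small.
    product *= (i % 250 == 0) ? 2 : 1;
    allTrue = allTrue && (value >= -50);
  }

  return {minimum, maximum, product, allTrue};
}
```

For `n = 1000` the input covers every residue modulo 101, so the expected minimum is -50 and the expected maximum is 50. The product is 2 multiplied four times (at i = 0, 250, 500, 750), and the logical-and result stays 1.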
@jdoerfert
Member Author

> Shilei, are you willing to revert this one also? It breaks SPEC ACCEL v1.4 552.pep: [OpenMP][NFC] Split the reduction buffer size into two components

It also caused problems for OpenMC. I used OpenMC to verify my fix worked. I did not think this patch changed much but I forgot that they used the type for offset calculations, not only for type adjustment. The fix will make sure we properly adjust the layout change everywhere.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this pull request Apr 8, 2024
qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024
Based on llvm/llvm-project#70766 I think it
would be good to have a few more offloading reduction tests, so we do
not accidentally break minimum and maximum reductions another time.
Labels
backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:openmp OpenMP related changes to Clang clang Clang issues not falling into any other category flang:openmp llvm:transforms openmp:libomptarget OpenMP offload runtime
6 participants