Skip to content

[NVPTX] Add intrinsics for st.bulk instruction #128856

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Wolfram70
Copy link
Contributor

Adds NVVM intrinsics and NVPTX codegen for the st.bulk instruction introduced in ptx8.6 for sm_100. Tests added in CodeGen/NVPTX/st_bulk.ll and verified through ptxas 12.8.0.

PTX Spec Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk

@llvmbot
Copy link
Member

llvmbot commented Feb 26, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Srinivasa Ravi (Wolfram70)

Changes

Adds NVVM intrinsics and NVPTX codegen for the st.bulk instruction introduced in ptx8.6 for sm_100. Tests added in CodeGen/NVPTX/st_bulk.ll and verified through ptxas 12.8.0.

PTX Spec Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk


Full diff: https://github.com/llvm/llvm-project/pull/128856.diff

3 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+8)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+14)
  • (added) llvm/test/CodeGen/NVPTX/st_bulk.ll (+33)
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index c32bf0318b5d6..bd3b4cdba8f35 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5186,4 +5186,12 @@ foreach cta_group = ["cg1", "cg2"] in {
   }
 }
 
+//
+// Bulk store intrinsics
+//
+
+def int_nvvm_st_bulk: Intrinsic<[], [llvm_global_ptr_ty, llvm_i64_ty, llvm_i64_ty], [IntrArgMemOnly, WriteOnly<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;
+
+def int_nvvm_st_bulk_shared_cta : Intrinsic<[], [llvm_shared_ptr_ty, llvm_i64_ty, llvm_i64_ty], [IntrArgMemOnly, WriteOnly<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;
+
 } // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 0640d25031c6a..09534f1fc0f58 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7731,3 +7731,17 @@ def tcgen05_fence_after_thread_sync: NVPTXInst<(outs), (ins),
   Requires<[hasTcgen05Instructions]>;
 
 } // hasSideEffects
+
+// Bulk store instructions
+                            
+def INT_NVVM_ST_BULK_GENERIC :
+  NVPTXInst<(outs), (ins Int64Regs:$dest_addr, Int64Regs:$size),
+            "st.bulk [$dest_addr], $size, 0;",
+            [(int_nvvm_st_bulk i64:$dest_addr, i64:$size, (i64 0))]>,
+            Requires<[hasSM<100>, hasPTX<86>]>;
+
+def INT_NVVM_ST_BULK_SHARED_CTA:
+  NVPTXInst<(outs), (ins Int64Regs:$dest_addr, Int64Regs:$size),
+            "st.bulk.shared::cta [$dest_addr], $size, 0;",
+            [(int_nvvm_st_bulk_shared_cta i64:$dest_addr, i64:$size, (i64 0))]>,
+            Requires<[hasSM<100>, hasPTX<86>]>;
diff --git a/llvm/test/CodeGen/NVPTX/st_bulk.ll b/llvm/test/CodeGen/NVPTX/st_bulk.ll
new file mode 100644
index 0000000000000..9d4a425d155e7
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/st_bulk.ll
@@ -0,0 +1,33 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 | FileCheck %s
+; RUN: %if ptxas-12.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 | %ptxas-verify -arch=sm_100 %}
+
+declare void @llvm.nvvm.st.bulk(ptr addrspace(1), i64, i64)
+define void @st_bulk(ptr addrspace(1) %dest_addr, i64 %size) {
+; CHECK-LABEL: st_bulk(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b64 %rd<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.u64 %rd1, [st_bulk_param_0];
+; CHECK-NEXT:    ld.param.u64 %rd2, [st_bulk_param_1];
+; CHECK-NEXT:    st.bulk [%rd1], %rd2, 0;
+; CHECK-NEXT:    ret;
+  call void @llvm.nvvm.st.bulk(ptr addrspace(1) %dest_addr, i64 %size, i64 0)
+  ret void
+}
+
+declare void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3), i64, i64)
+define void @st_bulk_shared_cta(ptr addrspace(3) %dest_addr, i64 %size) {
+; CHECK-LABEL: st_bulk_shared_cta(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b64 %rd<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.u64 %rd1, [st_bulk_shared_cta_param_0];
+; CHECK-NEXT:    ld.param.u64 %rd2, [st_bulk_shared_cta_param_1];
+; CHECK-NEXT:    st.bulk.shared::cta [%rd1], %rd2, 0;
+; CHECK-NEXT:    ret;
+   call void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) %dest_addr, i64 %size, i64 0)
+   ret void
+}

@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch 2 times, most recently from b4696d4 to 7022b16 Compare February 27, 2025 07:07
@durga4github
Copy link
Contributor

Changes LGTM.

Please resolve the conflicts so we can get an updated result from the builders.

@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch from 7022b16 to dd96135 Compare February 27, 2025 14:05
@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch 3 times, most recently from 2c1c024 to 3e5d4be Compare March 3, 2025 15:01
@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch from 3e5d4be to 7a46d33 Compare March 4, 2025 06:36
Copy link
Member

@AlexMaclean AlexMaclean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add these to the NVPTXUsage doc but otherwise LGTM

@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch 2 times, most recently from b5c5c09 to d64b85e Compare March 5, 2025 06:11
@durga4github
Copy link
Contributor

LGTM,
Kindly fix the typos and then merge.

Adds NVVM intrinsics and NVPTX codegen for the `st.bulk` instruction
introduced in ptx8.6 for sm_100. Tests added in `CodeGen/NVPTX/st_bulk.ll`
and verified through ptxas 12.8.0.

PTX Spec Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk
@Wolfram70 Wolfram70 force-pushed the dev/Wolfram70/nvptx-st-bulk-intrinsics branch from d64b85e to c37453c Compare March 5, 2025 10:20
@Wolfram70 Wolfram70 merged commit 9a5a8c9 into llvm:main Mar 11, 2025
12 checks passed
@Wolfram70 Wolfram70 deleted the dev/Wolfram70/nvptx-st-bulk-intrinsics branch March 11, 2025 08:59
Wolfram70 added a commit that referenced this pull request Mar 12, 2025
This PR fixes an oversight from the previous change (PR #128856) that
introduced the `st.bulk` intrinsic where `llvm_global_ptr_ty` was used
instead of `llvm_ptr_ty` for generic addressing.

PTX Spec Reference:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-st-bulk
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants