[MLIR][NVGPU] Fix nvgpu_arrive syntax in matmulBuilder.py #113713
Conversation
This patch updates the syntax for the nvgpu_arrive op in matmulBuilder.py, which fixes the compilation error for this test. For the warp-specialized matmul_kernel implementation, removing the WgmmaWaitGroupSyncOp (after the MMA main loop) fixes the observed hang. With these two fixes, the test compiles and executes successfully on an sm90a machine. Signed-off-by: Durgadoss R <[email protected]>
@llvm/pr-subscribers-mlir

Author: Durgadoss R (durga4github)

Changes: This patch updates the syntax for the nvgpu_arrive op in matmulBuilder.py, which fixes the compilation error for this test. For the warp-specialized matmul_kernel implementation, removing the WgmmaWaitGroupSyncOp (after the MMA main loop) fixes the observed hang. With these two fixes, the test compiles and executes successfully on an sm90a machine.

Full diff: https://github.com/llvm/llvm-project/pull/113713.diff

1 file affected:
diff --git a/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py b/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
index 75f0dc947e0681..5394d4a3272555 100644
--- a/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
+++ b/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
@@ -568,9 +568,7 @@ def generate_matmul_ws(
barId,
predicate=consumerPrimaryThread,
)
- nvgpu.mbarrier_arrive(
- ir.Type.parse("!nvgpu.mbarrier.token"), mbarDONE, barId
- )
+ nvgpu.mbarrier_arrive(mbarDONE, barId)
debug_print(
"[cons] iv={} | mbarDONE[{}] arrive [done]",
iv,
@@ -589,14 +587,9 @@ def generate_matmul_ws(
# Step 6.3.5. Yield
scf.yield_([new_acc, phaseParity])
- # Step 6.3. Wait All WGMMA
- nvvm.WgmmaWaitGroupSyncOp(0)
-
with ir.InsertionPoint(scf.IfOp(consumerPrimaryThread).then_block):
barId = c((K // BLOCK_K) % num_stages)
- nvgpu.mbarrier_arrive(
- ir.Type.parse("!nvgpu.mbarrier.token"), mbarDONE, barId
- )
+ nvgpu.mbarrier_arrive(mbarDONE, barId)
scf.yield_([])
# Step 6.4. Epilogue (registers --> shared memory)
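To summarize the call-site change outside the diff, here is a hypothetical stand-alone sketch. The two functions below are stubs, not the real MLIR nvgpu Python binding: they only model the signatures before and after this patch, where the old form required the caller to parse and pass the `!nvgpu.mbarrier.token` result type explicitly and the new form lets the op builder supply it.

```python
# Hypothetical stubs mimicking the nvgpu.mbarrier_arrive binding change.
# Neither function is the real MLIR Python API; they model only the
# call-site signatures before and after this patch.

TOKEN_TYPE = "!nvgpu.mbarrier.token"  # result type the op produces


def mbarrier_arrive_old(result_type, mbar_group, bar_id):
    """Old form: the caller parses and passes the result type explicitly."""
    return {"type": result_type, "mbar": mbar_group, "barId": bar_id}


def mbarrier_arrive_new(mbar_group, bar_id):
    """New form: the result type is inferred by the op builder."""
    return {"type": TOKEN_TYPE, "mbar": mbar_group, "barId": bar_id}


# Both forms describe the same op; only the call-site syntax differs.
old = mbarrier_arrive_old(TOKEN_TYPE, "mbarDONE", 0)
new = mbarrier_arrive_new("mbarDONE", 0)
assert old == new
```

The design point is that the result type is fixed for this op, so forcing every call site to `ir.Type.parse` it was redundant; dropping the argument is what the diff above does.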
@llvm/pr-subscribers-mlir-gpu
@grypp, please help with a review.
Can you change the prefix of the title to [MLIR]?
Sure, updated. Builds are clean, submitting this.