[AMDGPU] Only emit SCOPE_SYS global_wb #110636
global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness. I was initially optimistic that they would be very cheap no-ops, but they can actually be quite expensive, so let's avoid them.
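As a rough illustration of the rule described above, the following sketch models which instructions precede a release operation at each scope after this change. It is not the backend's actual code: `SyncScope`, `needsGlobalWB`, and `releasePrologue` are hypothetical names invented for this example, and the wait list simply mirrors the `s_wait_*` lines shown in the documentation tables of the diff, without modelling the cases where individual waits may be omitted.

```cpp
#include <string>
#include <vector>

// Hypothetical scope enum for illustration only; not the LLVM backend's type.
enum class SyncScope { SingleThread, Wavefront, Workgroup, Agent, System };

// Per the PR description: on gfx120x only system scope needs a writeback.
bool needsGlobalWB(SyncScope Scope) { return Scope == SyncScope::System; }

// Builds the (simplified) instruction list emitted before a release
// store/atomicrmw. Assumption: all five waits are always emitted for
// workgroup scope and wider; the real tables allow omitting some of them.
std::vector<std::string> releasePrologue(SyncScope Scope) {
  std::vector<std::string> Seq;

  // Singlethread and wavefront releases need no extra instructions.
  if (Scope == SyncScope::SingleThread || Scope == SyncScope::Wavefront)
    return Seq;

  // Before this PR, workgroup and agent scopes also emitted a global_wb.
  if (needsGlobalWB(Scope))
    Seq.push_back("global_wb scope:SCOPE_SYS");

  // Waits that must complete before the releasing operation is performed.
  Seq.push_back("s_wait_bvhcnt 0x0");
  Seq.push_back("s_wait_samplecnt 0x0");
  Seq.push_back("s_wait_storecnt 0x0");
  Seq.push_back("s_wait_loadcnt 0x0");
  Seq.push_back("s_wait_dscnt 0x0");
  return Seq;
}
```

In this model, `releasePrologue(SyncScope::Agent)` contains only the waits, while `releasePrologue(SyncScope::System)` starts with the writeback, matching the "If agent scope, omit" note in the updated tables.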
LLVMBot hasn't commented on this one for some reason.
@llvm/pr-subscribers-backend-amdgpu
Author: Pierre van Houtryve (Pierre-vh)
Changes
global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness. I was initially optimistic they would be very cheap no-ops but they can actually be quite expensive so let's avoid them.
Patch is 687.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110636.diff
38 Files Affected:
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 9e11b13c101d47..bfac4738732631 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14182,8 +14182,13 @@ For GFX12:
* ``global_inv`` invalidates caches whose scope is strictly smaller than the
instruction's. The invalidation requests cannot be reordered with pending or
upcoming memory operations.
-* ``global_wb`` additionally ensures that previous memory operation done at
- a lower scope level have reached the ``SCOPE:`` of the ``global_wb``.
+* ``global_wb`` is a writeback operation that additionally ensures previous
+ memory operation done at a lower scope level have reached the ``SCOPE:``
+ of the ``global_wb``.
+
+ * ``global_wb`` can be omitted for scopes other than ``SCOPE_SYS`` in
+ gfx120x.
+
* The vector memory operations access a vector L0 cache. There is a single L0
cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
special action is required for coherence between the lanes of a single
@@ -14890,19 +14895,7 @@ the instruction in the code sequence that references the table.
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
- store atomic release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ store atomic release - workgroup - global 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -14925,7 +14918,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -14945,19 +14942,7 @@ the instruction in the code sequence that references the table.
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- store atomic release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode or OpenCL, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ store atomic release - workgroup - local 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -14980,7 +14965,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- Must happen before the
following store.
- Ensures that all
@@ -14992,16 +14981,9 @@ the instruction in the code sequence that references the table.
released.
3. ds_store
- store atomic release - agent - global 1. ``global_wb``
+ store atomic release - agent - global 1. ``global_wb scope:SCOPE_SYS``
- system - generic
- - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at agent or system
- scope before performing the
- store that is being
- released.
+ - If agent scope, omit.
2. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
@@ -15025,7 +15007,12 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ ``global_wb`` if present, or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -15050,20 +15037,8 @@ the instruction in the code sequence that references the table.
atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
- atomicrmw release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
- - generic
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
+ atomicrmw release - workgroup - global 1. | ``s_wait_bvhcnt 0x0``
+ - generic | ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
@@ -15086,15 +15061,19 @@ the instruction in the code sequence that references the table.
atomic/
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
- must happen after
- ``global_wb``.
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
- must happen after
- any preceding
- local/generic
- load/store/load
- atomic/store
- atomic/atomicrmw.
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
- Must happen before the
following atomic.
- Ensures that all
@@ -15105,23 +15084,11 @@ the instruction in the code sequence that references the table.
atomicrmw that is
being released.
- 3. buffer/global/flat_atomic
+ 2. buffer/global/flat_atomic
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- atomicrmw release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode or OpenCL, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ atomicrmw release - workgroup - local 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -15144,7 +15111,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- Must happen before the
following atomic.
- Ensures that all
@@ -15155,17 +15126,10 @@ the instruction in the code sequence that references the table.
store that is being
released.
- 3. ds_atomic
- atomicrmw release - agent - global 1. ``global_wb scope:``
+ 2. ds_atomic
+ atomicrmw release - agent - global 1. ``global_wb scope:SCOPE_SYS``
- system - generic
- - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at agent or system
- scope before performing the
- store that is being
- released.
+ - If agent scope, omit.
2. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
@@ -15188,7 +15152,12 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``
+ ``global_wb`` if present, or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -15212,19 +15181,7 @@ the instruction in the code sequence that references the table.
fence release - singlethread *none* *none*
- wavefront
- fence release - workgroup *none* 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ fence release - workgroup *none* 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -15254,7 +15211,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``
+ ...
[truncated]
Based on the internal discussion for this, I think this LGTM
Maybe get approval from one of the tagged reviewers though.
LGTM
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/41/builds/2586