Skip to content

Conversation

@ex-rzr
Copy link
Contributor

@ex-rzr ex-rzr commented Nov 17, 2025

Proposed changes

There are test failures in CI on gfx90a with mainline and staging compilers (clang-22):

[  FAILED  ] 8 tests, listed below:
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdFp16/25, where GetParam() = ((64, -1), batch, 0.123, (10, 123, false), (3, 2, 2, 256, 128, "1"))
...
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdFp16/35, where GetParam() = ((64, -1), batch, 0.5, (34534564645, 7876878876864, true), (4, 3, 1, 100, 768, "2"))

[  FAILED  ] 20 tests, listed below:
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdBf16/25, where GetParam() = ((64, -1), batch, 0.123, (10, 123, false), (3, 2, 2, 256, 128, "1"))
...
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdBf16/47, where GetParam() = ((64, -1), group, 0.5, (34534564645, 7876878876864, true), (4, 3, 1, 100, 768, "2"))

My investigation shows that the failures are caused by this code:

	;;#ASMSTART
	v_cmpx_le_u32 exec, 1, v4
buffer_load_dwordx4 v[6:9], v5, s[48:51], 0 offen offset:0
s_mov_b64 exec s[34:35]
	;;#ASMEND
	buffer_store_dword v6, off, s[80:83], 0 ; 4-byte Folded Spill
	s_nop 0
	buffer_store_dword v7, off, s[80:83], 0 offset:4 ; 4-byte Folded Spill
	buffer_store_dword v8, off, s[80:83], 0 offset:8 ; 4-byte Folded Spill
	buffer_store_dword v9, off, s[80:83], 0 offset:12 ; 4-byte Folded Spill

The inline assembly is buffer_load_if from

asm volatile("buffer_load_dwordx4 %0, %1, %2, 0 offen offset:%3"

The compiler doesn't know that v[6:9] may have no values yet because the memory load is still in progress so it spills these registers right after the inline asm without adding s_waitcnt vmcnt(...).
So there are two kind of incorrect values:

  • the spilled values (buffer_load_dwordx4 is likely not finished when buffer_store_dword starts);
  • v6..v9, if they are used in other operations, will be eventually rewritten when the load finishes.

Usually, without spilling, there is s_waitcnt vmcnt(...) added by buffer_load_fence before results (v[6:9] here) are used. But in this case spilling happens right after loading.

This problem was encountered before, here are PR that addressed it in other places:

This PR tries to WORK AROUND this issue by reducing its probability to happen with less register spilling in the kernels with dropout. See commit messages for more details. The failing kernel (hdim = 64) has no spilling on gfx90a at all with these changes.

The real fix will be to replace inline assembly with builtins (so the compiler knows that it needs to add waits). But this is a separate task due to a need of evaluating performance and making the high level implementation as effective as this inline assembly fragment.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires less registers and reduces spilling.
Even though it may add a small overhead when storing is not required,
it uses significantly less registers and hence no spilling.
@ex-rzr ex-rzr merged commit d7b3197 into develop Nov 19, 2025
25 checks passed
@ex-rzr ex-rzr deleted the agorenko/fmha-dropout-reduce-reg-spilling branch November 19, 2025 05:40
AviralGoelAMD pushed a commit that referenced this pull request Nov 28, 2025
…nd for CI failures with clang-22) (#3221)

* Use vectorized stores for dropout randvals

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires less registers and reduces spilling.

* Calculate dropout randvals for storing and applying only once

Even though it may add a small overhead when storing is not required,
it uses significantly less registers and hence no spilling.
illsilin pushed a commit that referenced this pull request Dec 8, 2025
…nd for CI failures with clang-22) (#3221)

* Use vectorized stores for dropout randvals

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires less registers and reduces spilling.

* Calculate dropout randvals for storing and applying only once

Even though it may add a small overhead when storing is not required,
it uses significantly less registers and hence no spilling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants