[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) #3221

ex-rzr · 2025-11-17T13:10:45Z

Proposed changes

There are test failures in CI on gfx90a with mainline and staging compilers (clang-22):

[  FAILED  ] 8 tests, listed below:
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdFp16/25, where GetParam() = ((64, -1), batch, 0.123, (10, 123, false), (3, 2, 2, 256, 128, "1"))
...
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdFp16/35, where GetParam() = ((64, -1), batch, 0.5, (34534564645, 7876878876864, true), (4, 3, 1, 100, 768, "2"))

[  FAILED  ] 20 tests, listed below:
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdBf16/25, where GetParam() = ((64, -1), batch, 0.123, (10, 123, false), (3, 2, 2, 256, 128, "1"))
...
[  FAILED  ] TestCkTileFmhaFwd/Dropout.FmhaFwdBf16/47, where GetParam() = ((64, -1), group, 0.5, (34534564645, 7876878876864, true), (4, 3, 1, 100, 768, "2"))

My investigation shows that the failures are caused by this code:

	;;#ASMSTART
	v_cmpx_le_u32 exec, 1, v4
buffer_load_dwordx4 v[6:9], v5, s[48:51], 0 offen offset:0
s_mov_b64 exec s[34:35]
	;;#ASMEND
	buffer_store_dword v6, off, s[80:83], 0 ; 4-byte Folded Spill
	s_nop 0
	buffer_store_dword v7, off, s[80:83], 0 offset:4 ; 4-byte Folded Spill
	buffer_store_dword v8, off, s[80:83], 0 offset:8 ; 4-byte Folded Spill
	buffer_store_dword v9, off, s[80:83], 0 offset:12 ; 4-byte Folded Spill

The inline assembly is buffer_load_if from

composable_kernel/include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp

Line 148 in 1e77695

asm volatile("buffer_load_dwordx4 %0, %1, %2, 0 offen offset:%3"

The compiler doesn't know that v[6:9] may have no values yet because the memory load is still in progress so it spills these registers right after the inline asm without adding s_waitcnt vmcnt(...).
So there are two kind of incorrect values:

the spilled values (buffer_load_dwordx4 is likely not finished when buffer_store_dword starts);
v6..v9, if they are used in other operations, will be eventually rewritten when the load finishes.

Usually, without spilling, there is s_waitcnt vmcnt(...) added by buffer_load_fence before results (v[6:9] here) are used. But in this case spilling happens right after loading.

This problem was encountered before, here are PR that addressed it in other places:

This PR tries to WORK AROUND this issue by reducing its probability to happen with less register spilling in the kernels with dropout. See commit messages for more details. The failing kernel (hdim = 64) has no spilling on gfx90a at all with these changes.

The real fix will be to replace inline assembly with builtins (so the compiler knows that it needs to add waits). But this is a separate task due to a need of evaluating performance and making the high level implementation as effective as this inline assembly fragment.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of 16 buffer_store_byte. This requires less registers and reduces spilling.

Even though it may add a small overhead when storing is not required, it uses significantly less registers and hence no spilling.

…nd for CI failures with clang-22) (#3221) * Use vectorized stores for dropout randvals With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of 16 buffer_store_byte. This requires less registers and reduces spilling. * Calculate dropout randvals for storing and applying only once Even though it may add a small overhead when storing is not required, it uses significantly less registers and hence no spilling.

ex-rzr added 2 commits November 17, 2025 12:20

Use vectorized stores for dropout randvals

2334792

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of 16 buffer_store_byte. This requires less registers and reduces spilling.

Calculate dropout randvals for storing and applying only once

ba179e4

Even though it may add a small overhead when storing is not required, it uses significantly less registers and hence no spilling.

ex-rzr marked this pull request as ready for review November 18, 2025 05:49

ex-rzr requested review from ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent and vidyasagar-amd as code owners November 18, 2025 05:49

illsilin approved these changes Nov 18, 2025

View reviewed changes

poyenc approved these changes Nov 19, 2025

View reviewed changes

ex-rzr merged commit d7b3197 into develop Nov 19, 2025
25 checks passed

ex-rzr deleted the agorenko/fmha-dropout-reduce-reg-spilling branch November 19, 2025 05:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) #3221

[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) #3221

Uh oh!

ex-rzr commented Nov 17, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) #3221

[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) #3221

Uh oh!

Conversation

ex-rzr commented Nov 17, 2025

Proposed changes

Checklist

Discussion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants