Extend GGML_HIP_ROCWMMA_FATTN to support CDNA warp size 64 #12156
Extends PR #12032 based on comments from @IMbackK to support CDNA, which uses warp size 64.
The first commit changes all relevant uses of the `WARP_SIZE` define to use the warp size of the current device. Once I did this properly, the changes to the heuristics I mentioned in the other PR were not necessary. I added one assert on the total `T_BLOCK_X` / `T_BLOCK_Y` size, based on the AMD docs. I fenced it for AMD, but someone else may know whether the same limit (basically, don't exceed 4 * warp_size threads in total) applies to other architectures.
The second commit removes a fence that prevented `__launch_bounds__` from applying on HIP. This is a significant performance improvement for prompt processing (>15% at larger sizes) but a slight (<5%) penalty to token generation. I don't know enough about the multiple layers of kernel-sizing heuristics to dig into this further right now.

This passes `test-backend-ops` on an MI100 (gfx908, CDNA) and a 3090.