CUDA: enable Gemma FA for HIP/Pascal #9581

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #9580.

Currently the CUDA backend reports that FlashAttention with a head size of 256 is only supported on NVIDIA GPUs that are Volta or newer. However, a head size of 256 can also be enabled on AMD GPUs and older NVIDIA GPUs by using the vector kernel even for large batch sizes. The performance won't be great, but it will still be faster than running on the CPU. This PR adapts the CUDA code to enable this.
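
For context, the change amounts to relaxing the device-capability check that gates FlashAttention support. The sketch below only illustrates that logic; the function name, the compute-capability threshold, and the comments describing old vs. new behavior are assumptions for this example, not the actual ggml-cuda code.

```cpp
// Illustrative sketch only, not the actual ggml-cuda code: the names and the
// capability threshold below are assumptions made for this example.
static bool fa_head_size_supported(int head_size, int compute_capability, bool is_hip) {
    if (head_size != 256) {
        // Other head sizes were already reported as supported on these devices.
        return true;
    }
    // Old behavior (roughly): head size 256 required NVIDIA Volta or newer.
    // return !is_hip && compute_capability >= 700;

    // New behavior: head size 256 is always reported as supported; devices
    // without the tensor-core kernels fall back to the FlashAttention vector
    // kernel even for large batch sizes (slower, but still faster than CPU).
    return true;
}
```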

I also noticed that the tests only covered batch sizes < 8, which meant that some CUDA kernels were not being invoked at all. I changed the batch sizes to cover a wider range.
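
The test change is conceptually simple: iterate the FlashAttention test cases over batch sizes both below and above 8 so that the vector kernel and the large-batch kernels are all exercised. The snippet below is a hedged sketch; the helper name and the specific batch-size values are placeholders, not the actual test-backend-ops code.

```cpp
// Hypothetical sketch of the widened test coverage; add_flash_attn_test_case
// and the exact batch sizes are placeholders, not the actual test code.
for (int head_size : { 64, 128, 256 }) {
    // Previously only batch sizes < 8 were tested, so the kernels selected
    // for larger batches were never invoked by the test suite.
    for (int n_batch : { 1, 3, 32, 35 }) {
        add_flash_attn_test_case(head_size, n_batch);
    }
}
```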

github-actions bot added the "testing (Everything test related)" and "Nvidia GPU (Issues specific to Nvidia GPUs)" labels on Sep 21, 2024
JohannesGaessler merged commit a5b57b0 into ggml-org:master on Sep 22, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Successfully merging this pull request may close these issues.

Bug: Gemma2 9B FlashAttention is offloaded to CPU on AMD (HIP)