CUDA: enable Gemma FA for HIP/Pascal #9581

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #9580.

Currently the CUDA backend reports that FlashAttention with a head size of 256 is only supported on NVIDIA GPUs that are Volta or newer. However, a head size of 256 can also be enabled on AMD GPUs and older NVIDIA GPUs by using the vector kernel even for large batch sizes. The performance won't be great, but it will still be faster than running on the CPU. This PR adapts the CUDA code to enable this.
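
For context, the change amounts to relaxing the device-capability check that gates FlashAttention support. The sketch below only illustrates that logic; the function name, the compute-capability threshold, and the comments describing old vs. new behavior are assumptions for this example, not the actual ggml-cuda code.

```cpp
// Illustrative sketch only, not the actual ggml-cuda code: the names and the
// capability threshold below are assumptions made for this example.
static bool fa_head_size_supported(int head_size, int compute_capability, bool is_hip) {
    if (head_size != 256) {
        // Other head sizes were already reported as supported on these devices.
        return true;
    }
    // Old behavior (roughly): head size 256 required NVIDIA Volta or newer.
    // return !is_hip && compute_capability >= 700;

    // New behavior: head size 256 is always reported as supported; devices
    // without the tensor-core kernels fall back to the FlashAttention vector
    // kernel even for large batch sizes (slower, but still faster than CPU).
    return true;
}
```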

I also noticed that the tests only covered batch sizes < 8, which meant that some CUDA kernels were not being invoked at all. I changed the batch sizes to cover a wider range.
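
The test change is conceptually simple: iterate the FlashAttention test cases over batch sizes both below and above 8 so that the vector kernel and the large-batch kernels are all exercised. The snippet below is a hedged sketch; the helper name and the specific batch-size values are placeholders, not the actual test-backend-ops code.

```cpp
// Hypothetical sketch of the widened test coverage; add_flash_attn_test_case
// and the exact batch sizes are placeholders, not the actual test code.
for (int head_size : { 64, 128, 256 }) {
    // Previously only batch sizes < 8 were tested, so the kernels selected
    // for larger batches were never invoked by the test suite.
    for (int n_batch : { 1, 3, 32, 35 }) {
        add_flash_attn_test_case(head_size, n_batch);
    }
}
```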

github-actions bot added the "testing (Everything test related)" and "Nvidia GPU (Issues specific to Nvidia GPUs)" labels on Sep 21, 2024
JohannesGaessler merged commit a5b57b0 into ggml-org:master on Sep 22, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Successfully merging this pull request may close these issues.

Bug: Gemma2 9B FlashAttention is offloaded to CPU on AMD (HIP)