Bug: Flash Attention performs worse under ROCM #10439
Comments
It's a known issue, caused by the HIP port of the CUDA FlashAttention kernel for large batch sizes having extremely poor performance (so the kernel optimized for small batch sizes is used instead). With the current code this issue cannot be fixed. |
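To picture what "the kernel optimized for small batch sizes is used instead" means in practice, here is a deliberately simplified, hypothetical dispatch sketch. The names, threshold, and structure are invented for illustration and do not mirror the actual llama.cpp code; the point is only that on HIP the large-batch path is not worth taking, so everything falls through to the kernel tuned for batch size ~1.

```cpp
#include <cstdio>

// Hypothetical illustration only -- names, signatures and the threshold are invented.
struct Params { int n_tokens; };

static void flash_attn_vec_kernel(const Params &) { std::printf("small-batch kernel\n"); }
static void flash_attn_mma_kernel(const Params &) { std::printf("large-batch (tensor core) kernel\n"); }

void flash_attn_dispatch(const Params & p) {
    const bool large_batch = p.n_tokens > 8;   // prompt processing vs. token generation

#ifdef __HIP_PLATFORM_AMD__
    // The HIP translation of the CUDA large-batch kernel performs very poorly,
    // so on AMD the small-batch path is taken regardless of batch size.
    (void) large_batch;
    flash_attn_vec_kernel(p);
#else
    if (large_batch) {
        flash_attn_mma_kernel(p);
    } else {
        flash_attn_vec_kernel(p);
    }
#endif
}

int main() {
    flash_attn_dispatch({1});    // token generation
    flash_attn_dispatch({512});  // prompt processing
}
```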
Are there any plans to rewrite that code to be optimized for ROCM instead of a CUDA port? |
No. There currently isn't a llama.cpp/GGML dev working specifically on AMD performance or even support. I am writing a lot of CUDA code, but the extent of the effort I am willing to invest is making sure that the HIP port doesn't break and determining the comparatively best code paths for AMD. |
Why does the performance also regress when FA is enabled without batching? That's a bit odd given that the kernel in use is optimized for small batch sizes. Large batch sizes haven't really been that important in the past; hobbyists usually don't use parallelism since they have single-user use cases most of the time. But with the introduction of speculative decoding, which will hopefully land in the llama server as well, performance at larger batch sizes will become important even for single-user use cases. What will it take for AMD optimizations to be considered? Will it take someone joining the project to develop and maintain them? Or might a shift towards AMD optimization also happen when (if?) AMD becomes more popular among end users? |
The code is optimized for small batch sizes and NVIDIA GPUs; if you use HIP to translate it for AMD, you still pay a performance penalty vs. using an external BLAS library optimized for AMD.
My personal goal with llama.cpp is to reduce the cost at which the largest models can be run at reasonable speeds. As of right now I think there simply isn't any AMD hardware that would be worth buying over second-hand NVIDIA hardware (RTX 3090/P40). I have seen some second-hand AMD GPUs such as the MI60 at very competitive prices on eBay, but unfortunately the supply seems to be extremely limited. If AMD were to release a consumer GPU with a high VRAM capacity at a sufficiently low price I would start optimizing performance for that GPU (even though the AMD dev tools are worse or nonexistent). If a new dev were to join the project with an interest in improving AMD performance I would be happy to assist them. |
Understandable, thanks for the detailed explanation! Hoping to see other devs join the project to optimize AMD performance then :) By the way, do you think (second hand) 7900 xtx could potentially be cost competitive to second hand 3090 and 4090 GPUs? The memory bandwidth is very similar between those GPUs. |
It has become better compared to 1-2 years ago but at least in my region (Germany) there currently don't really seem to be good second-hand offers for RX 7900 XTX cards. |
You may try my forked branch that enables rocWMMA on top of the current CUDA WMMA flash attention implementation; I'm actively rebasing it onto the latest master branch for my own usage: https://github.com/hjc4869/llama.cpp From my own testing it improves performance by quite a bit on RDNA3 at higher batch sizes, though it is still not optimal compared to equivalent NVIDIA GPUs.

Flash attention (master branch):
./llama-batched-bench -ngl 999 -m ~/models/qwen2.5-72b-iq4.gguf -fa -npl 1,8,16 -npp 512 -ntg 128 -c 10240
main: n_kv_max = 10240, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

Flash attention (WMMA patched):
make GGML_HIPBLAS=1 GGML_CUDA_FA_ALL_QUANTS=1 AMDGPU_TARGETS=gfx1100,gfx1101 -j64
./llama-batched-bench -ngl 999 -m ~/models/qwen2.5-72b-iq4.gguf -fa -npl 1,8,16 -npp 512 -ntg 128 -c 10240
main: n_kv_max = 10240, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

Flash attention off:
./llama-batched-bench -ngl 999 -m ~/models/qwen2.5-72b-iq4.gguf -npl 1,8,16 -npp 512 -ntg 128 -c 10240
main: n_kv_max = 10240, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32
|
Wow! I am seeing very good results here. Some observations:
A question: I saw that your branch also introduced the option to compile with GGML_CUDA_FA_ALL_QUANTS - what does that do?

Master branch:
Single bench
Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
build: a5e4759 (4150)

Batched bench FA off
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
ggml/src/ggml-cuda/ggml-cuda.cu:70: ROCm error

Batched bench FA on
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12

Your branch:
Single bench
Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
build: 8739ed4 (4154)

Batched bench FA off
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
ggml/src/ggml-cuda/ggml-cuda.cu:70: ROCm error

Batched bench FA on
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
|
GGML_CUDA_FA_ALL_QUANTS is not related to the issue. It compiles the FA kernels for all KV cache quantization combinations, for example if you want -ctk q4_0 in combination with -ctv q8_0 (different quantization types), or the rarely used types like q4_1, q5_0, and q5_1. |
Thanks for your explanation. Do you have any intention of trying to upstream these FA changes? |
Didn't the comment in that PR mention they would be open to merging improvements, as long as someone would commit to maintaining that code? Not sure if that would be an option for you. Is there some way we can reach out to AMD and convince them to commit to performance improvements and maintaining them for llama.cpp? |
That's unfortunate to hear. I really think the 7900xtx could be competitive both on cost and performance. It just needs the support to get there. The situation has already improved massively compared to 1-2 years ago though. So perhaps in the future we'll get there. |
A new 7900xtx is pretty much the same price (or reasonably close) as a used 3090, so doesn't that mean AMD cards are competitive? The used NVIDIA market is going to dry up at some point, and NVIDIA doesn't care to make affordable cards with enough VRAM for LLMs. |
A new 7900xtx is slightly more expensive, and performing slightly worse than a used 3090. With performance optimizations it would be competitive, but right now the 3090 has the advantage. It's a bit of a chicken and egg problem to be honest. |
@JohannesGaessler do you think you could point us to some parts of the code that could benefit the most from ROCm additions? I would be interested in learning more low-level GPU programming and would like a test project. |
Generally speaking the HIP port of the CUDA code struggles the most with compute-intensive kernels, i.e. the kernels that are used for prompt processing where many tokens can be worked on in parallel. As of right now these are the mul_mat_q (large batch matrix multiplication using quantized data without dequantization to VRAM) and the FlashAttention kernels. There would be a lot of work to do for full ROCm support; I would be happy to talk with you in detail about what could be done depending on how much time and energy you're willing to commit. |
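To give a rough idea of what mul_mat_q means by "using quantized data without dequantization to VRAM", here is a minimal CPU-side sketch of a dot product over block-quantized rows. The block layout is loosely modelled on GGML's Q8_0 format (32 int8 values plus a per-block scale); the float scale and struct are simplifications for illustration, and the real kernels do this on the GPU with integer dot-product or matrix instructions.

```cpp
#include <cstdint>
#include <vector>

// Simplified block format, loosely modelled on GGML's Q8_0
// (the real type stores the scale as fp16).
constexpr int QK = 32;
struct BlockQ8 {
    float  d;        // per-block scale
    int8_t qs[QK];   // quantized values
};

// Dot product of two quantized rows of equal length, computed directly on the
// blocks: the data is never expanded into a separate dequantized buffer.
float dot_q8(const std::vector<BlockQ8> & x, const std::vector<BlockQ8> & y) {
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        int32_t isum = 0;                       // integer accumulation per block
        for (int j = 0; j < QK; ++j) {
            isum += int32_t(x[i].qs[j]) * int32_t(y[i].qs[j]);
        }
        sum += x[i].d * y[i].d * float(isum);   // apply both scales once per block
    }
    return sum;
}
```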
I took a look at https://github.com/hjc4869/llama.cpp and it seems what should be done is to create defines for a new ggml-rocm path, then reimplement everything piece by piece in the ggml-cuda folder while using hipified CUDA as a fallback until everything is replaced. I know the wavefront size is different on AMD, and I will need to benchmark the differences somehow. Does that sound about right? |
@sirus20x6 Let me know if/when you start working on this. Would love to help test. I will create some graphs later to showcase how the 7900xtx currently scales with batchsize. That way, it might be easier to identify on which kernels to start working. |
@sirus20x6 In terms of performance basically the only kernels that really matter are matrix multiplication and FlashAttention (which is basically two matrix multiplications merged with a softmax). My advice to you would be to start with ROCm kernels that can be used instead of those in

Medium-term I will add training support to llama.cpp; I think a major use case will be training LoRAs on top of quantized models. The current kernels in

If you are willing to write an entire ROCm GGML backend, the best person to ask for help would be @slaren since he has the most in-depth knowledge of the corresponding code. |
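For reference, "two matrix multiplications merged with a softmax" can be made concrete with a naive single-head attention routine like the sketch below. This is a plain reference implementation for clarity, not the llama.cpp kernel; FlashAttention computes the same result but tiles both products and folds the softmax in between (with an online max/normalizer) so the full n_q x n_kv score matrix never has to be written out.

```cpp
#include <cmath>
#include <vector>

// Naive single-head attention: O = softmax(Q K^T / sqrt(d)) V.
// Q is [n_q x d], K and V are [n_kv x d], O is [n_q x d], all row-major.
void attention_ref(const float * Q, const float * K, const float * V, float * O,
                   int n_q, int n_kv, int d) {
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> s(n_kv);
    for (int i = 0; i < n_q; ++i) {
        // First matrix product: one row of attention scores, tracking the softmax max.
        float max_s = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) dot += Q[i*d + k] * K[j*d + k];
            s[j] = dot * scale;
            max_s = std::fmax(max_s, s[j]);
        }
        float denom = 0.0f;
        for (int j = 0; j < n_kv; ++j) { s[j] = std::exp(s[j] - max_s); denom += s[j]; }
        // Second matrix product with the normalized scores.
        for (int k = 0; k < d; ++k) {
            float acc = 0.0f;
            for (int j = 0; j < n_kv; ++j) acc += s[j] * V[j*d + k];
            O[i*d + k] = acc / denom;
        }
    }
}
```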
I was a little confused at first, but I see they must have been renamed to mmv.cu/h and mmq.cu/h. I will take a look and see if I can make sense of everything :) |
No, I just misremembered the names of the files, sorry. |
I have been superficially looking at this for a while, because I have an AMD ROCm setup of 4x MI100s with Infinity Fabric between them. AMD provides a kernel library for ROCm, and from what little I know of it, it already has an implementation of Flash Attention 2: https://github.com/ROCm/composable_kernel |
Not sure if it would be a good idea to use the (pre-compiled) C++ APIs of this library for this project? |
@JohannesGaessler I am thinking of jumping onboard (because I will be fully available in the coming weeks, and I have a vested interest in this). However I feel I do have a significant knowledge gap in terms of implementing this, and I was also thinking of importing code from https://github.com/ROCm/composable_kernel, since it is an official example from AMD of going about implementing these kernels. Let me know what you think, if you are able to help with code reviews and such. Thanks. |
As outlined in the coding guidelines:
How many lines of imported code are we talking about? Will you be available long-term for maintenance? |
@JohannesGaessler I plan on importing the bare minimum needed to get it working initially, and then rewriting/refactoring all of it to more appropriately suit this project and its guidelines. I will be available for the foreseeable future (because I have a machine with 4x MI100 + Infinity Fabric, and it is very unlikely I will get a third machine soon). I also plan on adding documentation as I go, because I appreciate being able to easily pick up a project. |
@thamwangjun Since MI100 is CDNA based, will these improvements only target that architecture? Or are improvements for RDNA3 based cards (eg 7900xtx) to be expected as well? |
@thamwangjun if you make a PR with a working implementation and pledge to maintain it "for the foreseeable future" I would likely be willing to merge it. However, if the implementation becomes unmaintained and broken and no one steps up to maintain it my position will be that the code should be removed again, as was done with OpenCL. Of course, if the implementation is simple, self-contained, and well-documented it will be more likely that someone will be willing to maintain it (including me). |
@Mushoz I'm sorry, but at this point I will start work with only CDNA support in mind, because that is what AMD's own flash attention supports for now. But I have an old RDNA1 card and RDNA3 integrated graphics in my laptop, so who knows what the future may hold 🤔 What I am committed to right now is getting my MI100 setup running faster. I feel that the HIP port of the CUDA code is leaving a lot of potential performance behind. |
@JohannesGaessler thanks, I understand it is not reasonable to leave broken features in the project. It is better to remove them if broken/unmaintained. |
I think AMD's own flash attention does support the RDNA3 forward pass for now. And there is an FA2 Triton implementation in a third-party repo. |
This might be useful to reference while working on a solution: Dao-AILab/flash-attention#1203 This PR that has been merged has FA2 support for CDNA & RDNA. |
I just tested the rocWMMA-enabled llama.cpp from https://github.com/hjc4869/llama.cpp with rocm-6.3.0. Since I use the opencl-amd-dev package on Arch, it is already included, so no additional installation was needed. Here is my command to test (on my 7900 XTX):

ggerganov/llama.cpp
hjc4869/llama.cpp
|
I am on a 6650 XT, which doesn't have WMMA. I don't know what speedup is possible, but at the bare minimum, the flash attention implementation should stop making performance worse. |
@thamwangjun Hello, are you still working on this? |
@JohannesGaessler I'd be interested in contributing to any initiative to optimise the kernels for ROCm, though I can only do so in a part-time capacity. It would be great to discuss how we could set something up that others can contribute to as well (or pick up from if I somehow disappear off the face of the planet), as well as whether it would be better to have a fork or a branch in the main repository. Let me know if email works.

I see a number of other optimisation opportunities, such as using DPP intrinsics for the warp reduction operations instead of the shuffle-based reductions. |
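For context on the DPP suggestion: the warp reductions in question are shuffle-based. The sketch below shows the general shape of such a reduction (reconstructed as an illustration, not copied from the repository); on AMD hardware the shuffles typically lower to LDS-permute instructions, while DPP row/bank swizzles can move data between lanes with lower latency, which is the optimization being proposed.

```cpp
// Shuffle-based warp sum reduction (illustrative sketch). Each step halves the
// number of distinct partial sums until every lane holds the full sum.
// Note: the width of 32 matches RDNA wavefronts; CDNA uses wave64, which is one
// of the details an AMD-specific version would have to handle.
__device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
#ifdef __HIP_PLATFORM_AMD__
        x += __shfl_xor(x, offset, 32);                  // HIP shuffle (no mask argument)
#else
        x += __shfl_xor_sync(0xffffffff, x, offset, 32);
#endif
    }
    return x;
}
```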
I will happily explain to you what would need to be done for better ROCm performance and I would be willing to learn ROCm in order to review the code. Contact via email is generally fine for me, right now my private email
WMMA will in principle work for FlashAttention but not having a defined memory layout really hurts optimization. For NVIDIA GPUs I defined primitives in |
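To illustrate the memory-layout point, here is a minimal CUDA WMMA tile multiplication (an assumed example, not code from llama.cpp; rocWMMA mirrors this API). The comment marks the limitation: the API does not define which accumulator elements each thread holds, so fusing per-element work such as FlashAttention's softmax requires a round trip through shared memory, which primitives with a known register layout can avoid.

```cpp
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B (A, B in FP16, C in FP32).
__global__ void wmma_tile_16x16x16(const half * A, const half * B, float * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // The mapping of c_frag elements to threads is unspecified, so per-element
    // post-processing of the tile generally has to go through shared memory.
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```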
@Headcrabed @adelj88 @JohannesGaessler Sorry for the late reply. I did several tests with lmperf to compare flash attention between ROCm's provided flash-attention implementation and llama.cpp, using Llama-3.2-1B on a single MI100. The token generation speed is roughly the same. I was hoping to use ROCm's flash-attention implementation as a baseline to compare against, but my findings make that hard. Can someone else help validate my findings and test the performance of ROCm's FA compared to llama.cpp? I had assumed that ROCm's FA would be the fastest since it is from AMD. EDIT1: Add AMD FA assumption |
For a batch size of 1, which is what is used to generate tokens for a single user, the HIP port of the llama.cpp CUDA code works well. It is specifically the code for batch sizes >> 1, used for prompt processing, where the performance of the HIP port is bad. |
For token generation that is definitely not the case, at least not on my 7900 XTX. I am getting 27 tokens/sec on Qwen2.5-32B with the ROCm backend in llama-bench, while the Vulkan backend gives me over 35 tokens/sec. I don't have the exact figures since it's been a while since I last checked, but I'm happy to provide them if that is useful. But given the rather large gap between Vulkan and ROCm, there is a lot of room for improvement, at least in terms of token generation. |
Is it possible that Vulkan works better because RX 7900 XTX is geared towards graphics workloads? @Mushoz |
@thamwangjun BTW, I put some MI100 numbers for llama.cpp FA in #12032 after I hacked up that patch enough to make it work on CDNA. It's likely that CDNA support won't land with that patch, but if you want any specific tests I can run them for you on my MI100. |
This issue is fixed by #12032 for CDNA and RDNA3; while the performance could still be better, it's not completely broken now. Although not a universal fix, I think this thread has outlived its usefulness. |
Very nice, I just ran the benchmark against current main with:
ggerganov/llama.cpp (Dec. 2024)
hjc4869/llama.cpp (Dec. 2024)
ggerganov/llama.cpp (Mar. 2025)
|
@hjc4869 @JohannesGaessler Hello, I just tested the new build option. Here is my full build configuration:

Here are my command-line parameters for the llama.cpp server:

Here is the run without it:

Here is the run with it:

So it really is way faster. I found a strange bug or behavior:

I was just thinking - maybe it would be better to create a discussion for AMD optimizations; that would make it much easier to discuss bugs and test the new code base. |
This is because of KV cache quantization. The FA kernel that is being used cannot use quantized data so it needs to be converted to FP16. And the size of the temporary buffer for the conversion scales with context size.
Let me be frank: the bottleneck is not at all testing code. The bottleneck is writing the code. |
@JohannesGaessler Would it be possible to reserve the temporary buffer before actual inference? For users it might be better to get an error at startup if there is not enough memory. Or at least, how can one calculate the maximum required memory in order to choose proper parameters? |
Yes, it's possible. As I said, the bottleneck is people actually implementing things. |
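For anyone who wants to ballpark the sizes involved in the meantime, a rough estimate can be done by hand. The sketch below is illustrative only: the shapes are hypothetical (roughly Llama-3-8B-like), and this is not how llama.cpp actually sizes its buffers; it just shows why a quantized cache plus an FP16 conversion buffer grows quickly with context length.

```cpp
#include <cstdint>
#include <cstdio>

// K and V each hold n_ctx vectors of size n_head_kv * head_dim per layer.
int64_t kv_cache_bytes(int64_t n_layer, int64_t n_ctx, int64_t n_head_kv,
                       int64_t head_dim, double bytes_per_element) {
    return (int64_t) (2.0 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_element);
}

int main() {
    // Hypothetical example shapes.
    const int64_t n_layer = 32, n_head_kv = 8, head_dim = 128, n_ctx = 32768;
    const double q8_0_bytes = 34.0 / 32.0;   // 32 int8 values + a 2-byte scale per block
    const double gib        = 1024.0 * 1024.0 * 1024.0;

    std::printf("q8_0 KV cache: %.2f GiB\n",
                kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, q8_0_bytes) / gib);
    // Worst case if the whole cache has to be converted to FP16 for the kernel.
    std::printf("FP16 copy    : %.2f GiB\n",
                kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, 2.0) / gib);
}
```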
What happened?
Turning on flash attention degrades performance under ROCM (at least it does on a 7900 XTX). Using batched bench, the degradation is quite minor at a batch size of 1:
prompt processing: 461 -> 434 t/s
token generation: 24.26 -> 23.84 t/s
However, when running multiple batches of requests at the same time, the effect is MUCH more pronounced. Especially at a batch size of 16 the difference is massive:
prompt processing: 678 -> 375 t/s
token generation: 169.65 -> 86.87 t/s
Flash Attention is needed to be able to use quantization for the KV-cache, but the performance hit is drastic. Can this be fixed?
Name and Version
build: 4123 (2eb76b2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response