Flash Attention not working with NVIDIA Quadro P3200 Pascal Architecture GPU #7055
As of right now there are only CUDA FlashAttention kernels that use tensor cores, which in turn means you need Turing/Volta or newer.
@JohannesGaessler Okay, that explains things 😉 So basically the Flash Attention implementation of llama.cpp relies on the NVIDIA GPU having Tensor Cores.
Suggestion: If I read https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications correctly, we can programmatically determine whether a GPU has Tensor Cores. Could the llama.cpp implementation check the Compute Capability of the GPU and ignore the --flash-attn option if the GPU does not support it? I am getting a Compute Capability of 6.1 from nvidia-smi --query-gpu=compute_cap --format=csv
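A minimal sketch of such a check, assuming (per the CUDA documentation linked above) that tensor cores are present from Compute Capability 7.0 (Volta) onward; this is only an illustration, not llama.cpp's actual detection code:

```cpp
// check_tensor_cores.cu - illustrative only, not llama.cpp code.
// Queries each CUDA device's compute capability and infers tensor-core
// support (assumed to require Volta, CC 7.0, or newer).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        const bool has_tensor_cores = prop.major >= 7; // Volta/Turing or newer
        printf("device %d: %s, compute capability %d.%d, tensor cores: %s\n",
               i, prop.name, prop.major, prop.minor,
               has_tensor_cores ? "yes" : "no");
    }
    return 0;
}
```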
I'm writing FlashAttention kernels that don't need tensor cores as we speak.
@JohannesGaessler I already read about it on LocalLLaMA. I am smiling in 5xP40.
@JohannesGaessler Is this the PR for that change? #7061 EDIT: Nevermind, I assume so since the branch name is
You're going to get correct results, but unless you have a P100 the performance is going to be terrible.
Ah, good to know. I have 2x 1080 Tis, which are roughly equivalent to, perhaps marginally faster(?) than, P100s in some metrics. It will be an interesting experiment anyway.
No, it will literally only perform well on P100s because all other Pascal cards have gimped FP16 performance.
Ah, that's a shame. Well, that's good to know.
Yeah, went from:
To:
With FP16 flash-attention on my non-P100 Pascal cards. So not great. I have a 12GB Maxwell card that I don't think has the same NVIDIA FP16 kneecapping, but for now I'll just leave FA off.
General Pascal support: #7188.
I am using the llama.cpp server (b2781) on a Windows machine with the --flash-attn option enabled. The Compute Capability of the Quadro P3200 GPU in the machine is 6.1. The installed NVIDIA driver is version 552.22 with CUDA 12.3.
I am getting the following error message:
Log
Flash Attention is working for me on a Turing GPU.
Question: Should llama.cpp support Pascal GPUs?
If not, it would be nice to have feature detection and a clear error message that indicates this.
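A hedged sketch of what such a startup check could look like (the flash_attn_requested variable, the assumed CC >= 7.0 requirement, and the message wording are hypothetical, not llama.cpp's actual code or API):

```cpp
// Illustrative startup check, not llama.cpp's actual implementation.
#include <cstdio>
#include <cuda_runtime.h>

// Returns true if the given CUDA device should be able to run the
// tensor-core Flash Attention kernels (assumed to need CC >= 7.0).
static bool flash_attn_supported(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.major >= 7;
}

int main() {
    const bool flash_attn_requested = true; // would come from the --flash-attn option
    if (flash_attn_requested && !flash_attn_supported(0)) {
        fprintf(stderr,
                "error: --flash-attn requires a GPU with tensor cores "
                "(compute capability >= 7.0); remove the option for this GPU\n");
        return 1;
    }
    printf("Flash Attention enabled\n");
    return 0;
}
```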