
Fix cuda mul mat for pascal cc==610 #6636


Closed
xcnick wants to merge 1 commit.

Conversation

@xcnick commented Apr 12, 2024

The following error occurs when running test-backend-ops on the current master branch with a GTX 1080 Ti:

MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): GGML_ASSERT: /workspace/llama.cpp/ggml-cuda.cu:1388: src1->type == GGML_TYPE_F32 || (src1->ne[2] == 1 && src1->ne[3] == 1)

This PR fixes it.
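
For orientation, here is an editorially reconstructed sketch of the change, based on the condition quoted later in this thread; it is not the actual diff. The two helper function names are hypothetical, invented for this sketch, but the tensor fields and `ggml_is_transposed` are the real ggml API:

```cpp
#include "ggml.h" // ggml_tensor, GGML_TYPE_F16, ggml_is_transposed

// Sketch of the dispatch predicate before and after this PR.
static bool use_batched_cublas_before(const ggml_tensor * src0, const ggml_tensor * src1,
                                      bool split, bool fp16_performance_good) {
    // Before: gated on fp16_performance_good, which is false on cc==610
    // (e.g. GTX 1080 Ti), so batched F16 x F16 inputs fell through to a
    // fallback path whose assert rejects them.
    return !split && fp16_performance_good && src0->type == GGML_TYPE_F16 &&
           !ggml_is_transposed(src0) && !ggml_is_transposed(src1) &&
           src1->ne[2]*src1->ne[3] > 1;
}

static bool use_batched_cublas_after(const ggml_tensor * src0, const ggml_tensor * src1,
                                     bool split) {
    // After: the fp16_performance_good gate is removed, so the same inputs
    // take the batched cuBLAS path regardless of fp16 throughput.
    return !split && src0->type == GGML_TYPE_F16 &&
           !ggml_is_transposed(src0) && !ggml_is_transposed(src1) &&
           src1->ne[2]*src1->ne[3] > 1;
}
```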

@HanClinto (Collaborator)

I'm not very familiar with this section of code, but I wonder if a special exception should be made for the GTX 1080 Ti, similar to the way other devices are specially accounted for, such as the any_pascal_with_slow_fp16 boolean set in https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1890, which feeds into the check for good fp16 support in https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1916?
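
For readers who do not want to chase the links, here is a hedged, self-contained paraphrase of how those flags interact at the linked commit. The flag names, the cc==610 special case, and the final expression all appear in this thread or the linked file; the hard-coded device list and the loop shape are simplifications made up for this sketch:

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <vector>

// cc 6.1 covers the consumer Pascal cards (GTX 1060/1070/1080/1080 Ti):
// fp16 storage works, but fp16 arithmetic throughput is very poor.
constexpr int CC_PASCAL = 600;

int main() {
    // Stand-in for the CUDA capability query: a GTX 1080 Ti (cc 610)
    // alongside a V100 (cc 700), the two cards tested in this PR.
    const std::vector<int> device_cc = {610, 700};

    int64_t min_compute_capability = INT_MAX;
    bool any_pascal_with_slow_fp16 = false;

    for (const int cc : device_cc) {
        min_compute_capability = std::min<int64_t>(min_compute_capability, cc);
        any_pascal_with_slow_fp16 = any_pascal_with_slow_fp16 || cc == 610;
    }

    // The expression quoted below: false as soon as any cc==610 device
    // participates, even though 610 itself satisfies >= CC_PASCAL.
    const bool fp16_performance_good =
        min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;

    std::printf("fp16_performance_good = %d\n", fp16_performance_good); // prints 0
    return 0;
}
```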

@HanClinto (Collaborator)

Are you compiling with HIPBLAS? I wonder if line 1907 needs to be modified to check for cc==610 as well. fp16_performance_good is populated in two places, but only one of them checks whether cc==610.

Maybe we change:

https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1907

to:
const bool fp16_performance_good = min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;

?

@xcnick (Author) commented Apr 12, 2024

> Are you compiling with HIPBLAS? I wonder if line 1907 needs to be modified to check for cc==610 as well. fp16_performance_good is populated in two places, but only one of them checks whether cc==610.
>
> Maybe we change:
>
> https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1907
>
> to: const bool fp16_performance_good = min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;
>
> ?

Unfortunately, I do not have a HIP device and have not tested it. I have tested on a GTX 1080 Ti (cc==610) and a V100 (cc==700), and the results are correct.
Furthermore, judging from the variable name, any_pascal_with_slow_fp16 relates only to the NVIDIA Pascal architecture and has nothing to do with HIP, so I think line 1907 does not need to be modified.

@HanClinto (Collaborator)

> Furthermore, judging from the variable name, any_pascal_with_slow_fp16 relates only to the NVIDIA Pascal architecture and has nothing to do with HIP, so I think line 1907 does not need to be modified.

Okay. I'm not familiar with these cards, so apologies if my questions seem ignorant.

> Unfortunately, I do not have a HIP device and have not tested it. I have tested on a GTX 1080 Ti (cc==610) and a V100 (cc==700), and the results are correct.

If cc == 610, then any_pascal_with_slow_fp16 should be true. If any_pascal_with_slow_fp16 is true, then fp16_performance_good should be false. And if fp16_performance_good is false, then:

if (!split && fp16_performance_good && src0->type == GGML_TYPE_F16 && !ggml_is_transposed(src0) && !ggml_is_transposed(src1) && src1->ne[2]*src1->ne[3] > 1)

should evaluate to false, so removing && fp16_performance_good from here shouldn't have any effect.

All that to say: if cc == 610, why does this change do anything? If I'm reading the code correctly, it shouldn't make a difference.

@Engininja2 (Contributor)

Removing fp16_performance_good from that line allows it to evaluate to true when the other conditions hold on a card that is considered slow at fp16 operations. Slow is better than not running at all, and I think that situation would only come up in regular use if the build was compiled with LLAMA_CUDA_F16, which shouldn't be done with those cards anyway.
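
To make that concrete, here is a small self-contained sketch (editorial, not from the PR) that plugs the failing test case's shape into the old and new forms of the condition:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Values taken from the failing test case MUL_MAT(type_a=f16, type_b=f16,
    // m=16, n=1, k=256, bs=[10,1]) on a GTX 1080 Ti (cc==610).
    const bool    split                 = false;
    const bool    fp16_performance_good = false; // cc==610 sets any_pascal_with_slow_fp16
    const bool    src0_is_f16           = true;
    const bool    transposed            = false;
    const int64_t src1_batch            = 10 * 1; // src1->ne[2] * src1->ne[3]

    // Condition as quoted earlier in the thread, before this PR:
    const bool before = !split && fp16_performance_good && src0_is_f16 &&
                        !transposed && src1_batch > 1;
    // Condition with fp16_performance_good removed, as in this PR:
    const bool after  = !split && src0_is_f16 && !transposed && src1_batch > 1;

    // before=0: the batched cuBLAS path is skipped and the fallback asserts.
    // after=1:  the batched path runs; slow on cc==610, but it runs.
    std::printf("before=%d after=%d\n", before, after);
    return 0;
}
```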

@cebtenzzre (Collaborator) left a comment

As-is, this is going to reduce the benefit of PR #4682. We need to either extend the relevant kernels or better identify when one of them does not support the input.

@JohannesGaessler (Collaborator)

This is an issue with the tests, not the CUDA code. The test case in question doesn't actually show up when evaluating a model, so support for it is only partially implemented. In terms of performance, it wouldn't make sense to run a matrix multiplication like that on a GTX 1080 Ti either.

Under no circumstances should this PR be merged as-is.

@JohannesGaessler (Collaborator)

PR with a fix that does not reduce performance: #6667

@JohannesGaessler (Collaborator)

This should now be fixed on master.
