
Fix cuda mul mat for pascal cc==610 #6636


Closed
xcnick wants to merge 1 commit.

Conversation

@xcnick commented Apr 12, 2024

The following error occurs when running test-backend-ops on the current master branch with a GTX 1080 Ti:

MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): GGML_ASSERT: /workspace/llama.cpp/ggml-cuda.cu:1388: src1->type == GGML_TYPE_F32 || (src1->ne[2] == 1 && src1->ne[3] == 1)

This PR fixes it.
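
For orientation, here is an editorially reconstructed sketch of the change, based on the condition quoted later in this thread; it is not the actual diff. The two helper function names are hypothetical, invented for this sketch, but the tensor fields and `ggml_is_transposed` are the real ggml API:

```cpp
#include "ggml.h" // ggml_tensor, GGML_TYPE_F16, ggml_is_transposed

// Sketch of the dispatch predicate before and after this PR.
static bool use_batched_cublas_before(const ggml_tensor * src0, const ggml_tensor * src1,
                                      bool split, bool fp16_performance_good) {
    // Before: gated on fp16_performance_good, which is false on cc==610
    // (e.g. GTX 1080 Ti), so batched F16 x F16 inputs fell through to a
    // fallback path whose assert rejects them.
    return !split && fp16_performance_good && src0->type == GGML_TYPE_F16 &&
           !ggml_is_transposed(src0) && !ggml_is_transposed(src1) &&
           src1->ne[2]*src1->ne[3] > 1;
}

static bool use_batched_cublas_after(const ggml_tensor * src0, const ggml_tensor * src1,
                                     bool split) {
    // After: the fp16_performance_good gate is removed, so the same inputs
    // take the batched cuBLAS path regardless of fp16 throughput.
    return !split && src0->type == GGML_TYPE_F16 &&
           !ggml_is_transposed(src0) && !ggml_is_transposed(src1) &&
           src1->ne[2]*src1->ne[3] > 1;
}
```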

@HanClinto (Collaborator)

I'm not very familiar with this section of code, but I wonder if a special exception should be made for the GTX 1080 Ti, similar to the way other devices are specially accounted for, such as the any_pascal_with_slow_fp16 boolean set in https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1890, which feeds into the check for good fp16 support in https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1916?
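
For readers who do not want to chase the links, here is a hedged, self-contained paraphrase of how those flags interact at the linked commit. The flag names, the cc==610 special case, and the final expression all appear in this thread or the linked file; the hard-coded device list and the loop shape are simplifications made up for this sketch:

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <vector>

// cc 6.1 covers the consumer Pascal cards (GTX 1060/1070/1080/1080 Ti):
// fp16 storage works, but fp16 arithmetic throughput is very poor.
constexpr int CC_PASCAL = 600;

int main() {
    // Stand-in for the CUDA capability query: a GTX 1080 Ti (cc 610)
    // alongside a V100 (cc 700), the two cards tested in this PR.
    const std::vector<int> device_cc = {610, 700};

    int64_t min_compute_capability = INT_MAX;
    bool any_pascal_with_slow_fp16 = false;

    for (const int cc : device_cc) {
        min_compute_capability = std::min<int64_t>(min_compute_capability, cc);
        any_pascal_with_slow_fp16 = any_pascal_with_slow_fp16 || cc == 610;
    }

    // The expression quoted below: false as soon as any cc==610 device
    // participates, even though 610 itself satisfies >= CC_PASCAL.
    const bool fp16_performance_good =
        min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;

    std::printf("fp16_performance_good = %d\n", fp16_performance_good); // prints 0
    return 0;
}
```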

@HanClinto (Collaborator)

Are you compiling with HIPBLAS? I wonder if line 1907 needs to be modified to check for cc==610 as well. fp16_performance_good is populated in two places, but only one of them checks whether cc==610.

Maybe we change:

https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1907

to:
const bool fp16_performance_good = min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;

?

@xcnick (Author) commented Apr 12, 2024

> Are you compiling with HIPBLAS? I wonder if line 1907 needs to be modified to check for cc==610 as well. fp16_performance_good is populated in two places, but only one of them checks whether cc==610.
>
> Maybe we change:
>
> https://github.com/ggerganov/llama.cpp/blob/4cc120c7443cf9dab898736f3c3b45dc8f14672b/ggml-cuda.cu#L1907
>
> to: const bool fp16_performance_good = min_compute_capability >= CC_PASCAL && !any_pascal_with_slow_fp16;
>
> ?

Unfortunately, I do not have a HIP device and have not tested it. I have tested on a GTX 1080 Ti (cc==610) and a V100 (cc==700), and the results are correct.
Furthermore, judging from the variable name, any_pascal_with_slow_fp16 relates only to the NVIDIA Pascal architecture and has nothing to do with HIP, so I think line 1907 does not need to be modified.

@HanClinto (Collaborator)

> Furthermore, judging from the variable name, any_pascal_with_slow_fp16 relates only to the NVIDIA Pascal architecture and has nothing to do with HIP, so I think line 1907 does not need to be modified.

Okay. I'm not familiar with these cards, so apologies if my questions seem ignorant.

> Unfortunately, I do not have a HIP device and have not tested it. I have tested on a GTX 1080 Ti (cc==610) and a V100 (cc==700), and the results are correct.

If cc == 610, then any_pascal_with_slow_fp16 should be true. If any_pascal_with_slow_fp16 is true, then fp16_performance_good should be false. And if fp16_performance_good is false, then:

if (!split && fp16_performance_good && src0->type == GGML_TYPE_F16 && !ggml_is_transposed(src0) && !ggml_is_transposed(src1) && src1->ne[2]*src1->ne[3] > 1)

should evaluate to false, so removing && fp16_performance_good from here shouldn't have any effect.

All that to say: if cc == 610, why does this change do anything? If I'm reading the code correctly, it shouldn't make a difference.

@Engininja2 (Contributor)

Removing fp16_performance_good from that line allows it to evaluate to true when the other conditions hold on a card that is considered slow at fp16 operations. Slow is better than not running at all, and I think that situation would only come up in regular use if the build was compiled with LLAMA_CUDA_F16, which shouldn't be done with those cards anyway.
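
To make that concrete, here is a small self-contained sketch (editorial, not from the PR) that plugs the failing test case's shape into the old and new forms of the condition:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Values taken from the failing test case MUL_MAT(type_a=f16, type_b=f16,
    // m=16, n=1, k=256, bs=[10,1]) on a GTX 1080 Ti (cc==610).
    const bool    split                 = false;
    const bool    fp16_performance_good = false; // cc==610 sets any_pascal_with_slow_fp16
    const bool    src0_is_f16           = true;
    const bool    transposed            = false;
    const int64_t src1_batch            = 10 * 1; // src1->ne[2] * src1->ne[3]

    // Condition as quoted earlier in the thread, before this PR:
    const bool before = !split && fp16_performance_good && src0_is_f16 &&
                        !transposed && src1_batch > 1;
    // Condition with fp16_performance_good removed, as in this PR:
    const bool after  = !split && src0_is_f16 && !transposed && src1_batch > 1;

    // before=0: the batched cuBLAS path is skipped and the fallback asserts.
    // after=1:  the batched path runs; slow on cc==610, but it runs.
    std::printf("before=%d after=%d\n", before, after);
    return 0;
}
```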

@cebtenzzre (Collaborator) left a comment

As-is, this is going to reduce the benefit of PR #4682. We need to either extend the relevant kernels or better identify when one of them does not support the input.

@JohannesGaessler (Collaborator)

This is an issue with the tests, not the CUDA code. The test case in question doesn't actually show up when evaluating a model, so support for it is only partially implemented. In terms of performance, it wouldn't make sense to run a matrix multiplication like that on a GTX 1080 Ti either.

Under no circumstances should this PR be merged as-is.

@JohannesGaessler (Collaborator)

PR with a fix that does not reduce performance: #6667

@JohannesGaessler (Collaborator)

This should now be fixed on master.
