
CUDA: fix MMQ for non-contiguous src0, add tests #10021


Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10011.

The problem is that on master the MMQ code calculates the stride for src0 based on src0->nb[1], but in ggml_cuda_op_mul_mat any non-contiguous matrices are first made contiguous. This case was not covered by the tests, so I extended test-backend-ops with new test cases for non-contiguous inputs.
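For illustration, here is a minimal sketch of the stride issue (the helper name is made up; this is not the actual patch in this PR): once the data has been copied into a contiguous temporary buffer, the row stride has to come from the packed row size rather than from the tensor's original nb[1].

    // Hypothetical sketch (names made up), not the code from this PR:
    // when src0 lives in a contiguous temporary buffer its rows are packed
    // back to back, so the stride must come from the row size of that copy
    // instead of the original tensor's nb[1].
    #include "ggml.h"

    static size_t mmq_src0_row_stride(const ggml_tensor * src0, bool is_contiguous_copy) {
        if (is_contiguous_copy) {
            return ggml_row_size(src0->type, src0->ne[0]); // packed rows
        }
        return src0->nb[1]; // honor the original layout
    }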

The temporary buffer for src0 also did not get its padding cleared. This did not seem to produce incorrect results, but it could in principle, so I fixed that as well.
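As a hedged sketch of that second fix (buf, size_data, and size_alloc are hypothetical names, not identifiers from the PR): zero the padding bytes at the end of the padded device buffer so the kernels never read uninitialized data.

    // Hypothetical sketch, not the actual patch: clear the padding that follows
    // the valid data in the padded temporary buffer for src0.
    #include <cuda_runtime.h>

    static cudaError_t clear_src0_padding(void * buf, size_t size_data, size_t size_alloc, cudaStream_t stream) {
        if (size_alloc <= size_data) {
            return cudaSuccess; // no padding to clear
        }
        return cudaMemsetAsync((char *) buf + size_data, 0, size_alloc - size_data, stream);
    }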

Long-term, I think we should refactor the code so that we no longer need ggml_cuda_op_mul_mat and instead handle batched matrix multiplication in the kernels themselves.

github-actions bot added the testing, Nvidia GPU, and ggml labels on Oct 23, 2024
Comment on lines 1686 to 1706
        // If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
        // This test only works correctly if exactly 2 indices != 0 are swapped.
        if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
            GGML_ASSERT(per[0] == 0);
            const size_t rsa = ggml_row_size(a->type, a->ne[0]);
            const size_t rsb = ggml_row_size(b->type, b->ne[0]);
            size_t nba[GGML_MAX_DIMS] = {ggml_type_size(a->type), rsa, rsa, rsa};
            size_t nbb[GGML_MAX_DIMS] = {ggml_type_size(b->type), rsb, rsb, rsb};
            for (int64_t i = 1; i < GGML_MAX_DIMS; ++i) {
                for (int64_t j = 1; j < per[i]; ++j) {
                    nba[i] *= a->ne[per[j]];
                    nbb[i] *= b->ne[per[j]];
                }
            }
            a = ggml_view_4d(ctx, a, a->ne[0], a->ne[1], a->ne[2], a->ne[3], nba[1], nba[2], nba[3], /*offset =*/ 0);
            b = ggml_view_4d(ctx, b, b->ne[0], b->ne[1], b->ne[2], b->ne[3], nbb[1], nbb[2], nbb[3], /*offset =*/ 0);
            GGML_ASSERT(ggml_nbytes(a) == ggml_nbytes(a->src[0]));
            GGML_ASSERT(ggml_nbytes(b) == ggml_nbytes(b->src[0]));
            ggml_set_name(a, "a_permuted");
            ggml_set_name(b, "b_permuted");
        }
Member

I find this a bit hard to follow. Would this be equivalent to what you are doing here?

        // C^T = A * B^T: (k, m) * (k, n) => (m, n)
        ggml_tensor * a;
        ggml_tensor * b;

        // If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
        // This test only works correctly if exactly 2 indices != 0 are swapped.
        if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
            // create a tensor with the permuted dimensions, then permute it back to the dimensions given by m,n,k
            int64_t ne_a[4] = {k, m, bs[0], bs[1]};
            int64_t ne_b[4] = {k, n, bs[0]*nr[0], bs[1]*nr[1]};
            a = ggml_new_tensor_4d(ctx, type_a, ne_a[per[0]], ne_a[per[1]], ne_a[per[2]], ne_a[per[3]]);
            b = ggml_new_tensor_4d(ctx, type_b, ne_b[per[0]], ne_b[per[1]], ne_b[per[2]], ne_b[per[3]]);
            ggml_set_name(a, "a");
            ggml_set_name(b, "b");
            a = ggml_permute(ctx, a, per[0], per[1], per[2], per[3]);
            b = ggml_permute(ctx, b, per[0], per[1], per[2], per[3]);
            ggml_set_name(a, "a_permuted");
            ggml_set_name(b, "b_permuted");
        } else {
            a = ggml_new_tensor_4d(ctx, type_a, k, m, bs[0]      , bs[1]);
            b = ggml_new_tensor_4d(ctx, type_b, k, n, bs[0]*nr[0], bs[1]*nr[1]);
            ggml_set_name(a, "a");
            ggml_set_name(b, "b");
        }
        ggml_set_param(ctx, a);
        ggml_set_param(ctx, b);

JohannesGaessler
Collaborator Author

Thanks, that's probably a better way to do it.

@JohannesGaessler JohannesGaessler added the Review Complexity : Medium label on Oct 23, 2024
@JohannesGaessler JohannesGaessler merged commit c39665f into ggml-org:master Oct 24, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Dec 23, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
Successfully merging this pull request may close these issues.

Bug: K cache without FA goes Nan on Llama 3.1.