
CUDA: fix MMQ for non-contiguous src0, add tests #10021


Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10011.

The problem is that on master the MMQ code calculates the stride for src0 based on src0->nb[1], but in ggml_cuda_op_mul_mat any non-contiguous matrices are first made contiguous. This case was not covered by the tests, so I extended test-backend-ops with new test cases for non-contiguous inputs.
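For illustration, here is a minimal sketch of the stride issue (the helper name is made up; this is not the actual patch in this PR): once the data has been copied into a contiguous temporary buffer, the row stride has to come from the packed row size rather than from the tensor's original nb[1].

    // Hypothetical sketch (names made up), not the code from this PR:
    // when src0 lives in a contiguous temporary buffer its rows are packed
    // back to back, so the stride must come from the row size of that copy
    // instead of the original tensor's nb[1].
    #include "ggml.h"

    static size_t mmq_src0_row_stride(const ggml_tensor * src0, bool is_contiguous_copy) {
        if (is_contiguous_copy) {
            return ggml_row_size(src0->type, src0->ne[0]); // packed rows
        }
        return src0->nb[1]; // honor the original layout
    }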

The temporary buffer for src0 also did not get its padding cleared. This did not seem to produce incorrect results, but it could in principle, so I fixed that as well.
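As a hedged sketch of that second fix (buf, size_data, and size_alloc are hypothetical names, not identifiers from the PR): zero the padding bytes at the end of the padded device buffer so the kernels never read uninitialized data.

    // Hypothetical sketch, not the actual patch: clear the padding that follows
    // the valid data in the padded temporary buffer for src0.
    #include <cuda_runtime.h>

    static cudaError_t clear_src0_padding(void * buf, size_t size_data, size_t size_alloc, cudaStream_t stream) {
        if (size_alloc <= size_data) {
            return cudaSuccess; // no padding to clear
        }
        return cudaMemsetAsync((char *) buf + size_data, 0, size_alloc - size_data, stream);
    }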

Long-term, I think we should refactor the code so that we no longer need ggml_cuda_op_mul_mat and instead handle batched matrix multiplication in the kernels themselves.

github-actions bot added the testing, Nvidia GPU, and ggml labels on Oct 23, 2024
Comment on lines 1686 to 1706
        // If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
        // This test only works correctly if exactly 2 indices != 0 are swapped.
        if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
            GGML_ASSERT(per[0] == 0);
            const size_t rsa = ggml_row_size(a->type, a->ne[0]);
            const size_t rsb = ggml_row_size(b->type, b->ne[0]);
            size_t nba[GGML_MAX_DIMS] = {ggml_type_size(a->type), rsa, rsa, rsa};
            size_t nbb[GGML_MAX_DIMS] = {ggml_type_size(b->type), rsb, rsb, rsb};
            for (int64_t i = 1; i < GGML_MAX_DIMS; ++i) {
                for (int64_t j = 1; j < per[i]; ++j) {
                    nba[i] *= a->ne[per[j]];
                    nbb[i] *= b->ne[per[j]];
                }
            }
            a = ggml_view_4d(ctx, a, a->ne[0], a->ne[1], a->ne[2], a->ne[3], nba[1], nba[2], nba[3], /*offset =*/ 0);
            b = ggml_view_4d(ctx, b, b->ne[0], b->ne[1], b->ne[2], b->ne[3], nbb[1], nbb[2], nbb[3], /*offset =*/ 0);
            GGML_ASSERT(ggml_nbytes(a) == ggml_nbytes(a->src[0]));
            GGML_ASSERT(ggml_nbytes(b) == ggml_nbytes(b->src[0]));
            ggml_set_name(a, "a_permuted");
            ggml_set_name(b, "b_permuted");
        }
Member

I find this a bit hard to follow. Would this be equivalent to what you are doing here?

        // C^T = A * B^T: (k, m) * (k, n) => (m, n)
        ggml_tensor * a;
        ggml_tensor * b;

        // If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
        // This test only works correctly if exactly 2 indices != 0 are swapped.
        if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
            // create a tensor with the permuted dimensions, then permute it back to the dimensions given by m,n,k
            int64_t ne_a[4] = {k, m, bs[0], bs[1]};
            int64_t ne_b[4] = {k, n, bs[0]*nr[0], bs[1]*nr[1]};
            a = ggml_new_tensor_4d(ctx, type_a, ne_a[per[0]], ne_a[per[1]], ne_a[per[2]], ne_a[per[3]]);
            b = ggml_new_tensor_4d(ctx, type_b, ne_b[per[0]], ne_b[per[1]], ne_b[per[2]], ne_b[per[3]]);
            ggml_set_name(a, "a");
            ggml_set_name(b, "b");
            a = ggml_permute(ctx, a, per[0], per[1], per[2], per[3]);
            b = ggml_permute(ctx, b, per[0], per[1], per[2], per[3]);
            ggml_set_name(a, "a_permuted");
            ggml_set_name(b, "b_permuted");
        } else {
            a = ggml_new_tensor_4d(ctx, type_a, k, m, bs[0]      , bs[1]);
            b = ggml_new_tensor_4d(ctx, type_b, k, n, bs[0]*nr[0], bs[1]*nr[1]);
            ggml_set_name(a, "a");
            ggml_set_name(b, "b");
        }
        ggml_set_param(ctx, a);
        ggml_set_param(ctx, b);

JohannesGaessler
Collaborator Author

Thanks, that's probably a better way to do it.

@JohannesGaessler JohannesGaessler added the Review Complexity : Medium label on Oct 23, 2024
@JohannesGaessler JohannesGaessler merged commit c39665f into ggml-org:master Oct 24, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Dec 23, 2024
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
Successfully merging this pull request may close these issues.

Bug: K cache without FA goes Nan on Llama 3.1.