CUDA: fix MMQ for non-contiguous src0, add tests #10021
Merged: JohannesGaessler merged 2 commits into ggml-org:master from JohannesGaessler:cuda-fix-permuted-mm on Oct 24, 2024.
Conversation
slaren reviewed on Oct 23, 2024
tests/test-backend-ops.cpp (outdated), comment on lines 1686 to 1706:
// If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
// This test only works correctly if exactly 2 indices != 0 are swapped.
if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
    GGML_ASSERT(per[0] == 0);
    const size_t rsa = ggml_row_size(a->type, a->ne[0]);
    const size_t rsb = ggml_row_size(b->type, b->ne[0]);
    size_t nba[GGML_MAX_DIMS] = {ggml_type_size(a->type), rsa, rsa, rsa};
    size_t nbb[GGML_MAX_DIMS] = {ggml_type_size(b->type), rsb, rsb, rsb};
    for (int64_t i = 1; i < GGML_MAX_DIMS; ++i) {
        for (int64_t j = 1; j < per[i]; ++j) {
            nba[i] *= a->ne[per[j]];
            nbb[i] *= b->ne[per[j]];
        }
    }
    a = ggml_view_4d(ctx, a, a->ne[0], a->ne[1], a->ne[2], a->ne[3], nba[1], nba[2], nba[3], /*offset =*/ 0);
    b = ggml_view_4d(ctx, b, b->ne[0], b->ne[1], b->ne[2], b->ne[3], nbb[1], nbb[2], nbb[3], /*offset =*/ 0);
    GGML_ASSERT(ggml_nbytes(a) == ggml_nbytes(a->src[0]));
    GGML_ASSERT(ggml_nbytes(b) == ggml_nbytes(b->src[0]));
    ggml_set_name(a, "a_permuted");
    ggml_set_name(b, "b_permuted");
}
I find this a bit hard to follow. Would this be equivalent to what you are doing here?
// C^T = A * B^T: (k, m) * (k, n) => (m, n)
ggml_tensor * a;
ggml_tensor * b;

// If the permutation is not {0, 1, 2, 3}, replace a and b with views that have the same data in a different order.
// This test only works correctly if exactly 2 indices != 0 are swapped.
if (per[0] != 0 || per[1] != 1 || per[2] != 2 || per[3] != 3) {
    // create a tensor with the permuted dimensions, then permute it back to the dimensions given by m,n,k
    int64_t ne_a[4] = {k, m, bs[0], bs[1]};
    int64_t ne_b[4] = {k, n, bs[0]*nr[0], bs[1]*nr[1]};
    a = ggml_new_tensor_4d(ctx, type_a, ne_a[per[0]], ne_a[per[1]], ne_a[per[2]], ne_a[per[3]]);
    b = ggml_new_tensor_4d(ctx, type_b, ne_b[per[0]], ne_b[per[1]], ne_b[per[2]], ne_b[per[3]]);
    ggml_set_name(a, "a");
    ggml_set_name(b, "b");
    a = ggml_permute(ctx, a, per[0], per[1], per[2], per[3]);
    b = ggml_permute(ctx, b, per[0], per[1], per[2], per[3]);
    ggml_set_name(a, "a_permuted");
    ggml_set_name(b, "b_permuted");
} else {
    a = ggml_new_tensor_4d(ctx, type_a, k, m, bs[0],       bs[1]);
    b = ggml_new_tensor_4d(ctx, type_b, k, n, bs[0]*nr[0], bs[1]*nr[1]);
    ggml_set_name(a, "a");
    ggml_set_name(b, "b");
}
ggml_set_param(ctx, a);
ggml_set_param(ctx, b);
Thanks, that's probably a better way to do it.
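For context on why the two versions are equivalent: ggml_permute only rearranges the ne/nb metadata of a view, so allocating a tensor with pre-permuted dimensions and then permuting it back yields the same non-contiguous layout that the manual stride computation constructs. A minimal standalone sketch, not part of the PR; the shapes and the permutation {0, 2, 1, 3} are made up for illustration:

#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // allocate with the dimensions already in permuted order {0, 2, 1, 3} ...
    struct ggml_tensor * t = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 8, 2, 4, 1);
    // ... then permute back: p has ne = {8, 4, 2, 1} but keeps the strides
    // of the original allocation, i.e. it is a non-contiguous view
    struct ggml_tensor * p = ggml_permute(ctx, t, 0, 2, 1, 3);

    printf("ne = %lld %lld %lld %lld\n",
        (long long) p->ne[0], (long long) p->ne[1], (long long) p->ne[2], (long long) p->ne[3]);
    printf("nb = %zu %zu %zu %zu\n", p->nb[0], p->nb[1], p->nb[2], p->nb[3]);
    printf("contiguous: %d\n", ggml_is_contiguous(p)); // prints 0

    ggml_free(ctx);
    return 0;
}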
slaren approved these changes on Oct 23, 2024.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024:
* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024:
* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request on Dec 23, 2024:
* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code
Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs), Review Complexity: Medium (generally requires more time to grok, but manageable at beginner to medium expertise level), testing (everything test related).
Fixes #10011.

The problem is that on master the MMQ code calculates the stride for src0 based on src0->nb[1], but in ggml_cuda_op_mul_mat any non-contiguous matrices are made contiguous first. This case was not covered by the tests, so I extended test-backend-ops with new test cases for non-contiguous inputs.

The temporary buffer for src0 also did not get its padding cleared. This did not seem to result in incorrect results, but it would in principle be possible, so I also fixed that.

Long-term I think we should refactor the code in such a way that we don't need ggml_cuda_op_mul_mat and instead handle batched matrix multiplication in the kernels themselves.
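To make the stride point concrete, here is a hedged sketch; the helper name mmq_src0_row_size is made up, and the real MMQ code in ggml-cuda differs in detail:

#include "ggml.h"

// Hypothetical illustration, not the actual llama.cpp code.
// ggml_cuda_op_mul_mat copies a non-contiguous src0 into a contiguous
// temporary buffer before launching the MMQ kernel, so the row stride
// passed to the kernel must describe that contiguous buffer, not the
// original (possibly permuted) tensor.
static size_t mmq_src0_row_size(const struct ggml_tensor * src0) {
    // buggy variant: src0->nb[1] is the row stride of the original
    // layout, which no longer applies after the contiguous copy
    // return src0->nb[1];

    // fixed variant: the size of a contiguous row of src0->ne[0] elements
    return ggml_row_size(src0->type, src0->ne[0]);
}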