
CUDA: Fix output size #2480

Merged

Conversation

JohannesGaessler (Collaborator)

Currently, models with an output size not divisible by 64 do not work with CUDA. The problem is that I added code which rounds the min/max rows of the weight matrices down to multiples of 64 to fix an out-of-bounds issue for multi GPU. To make models with an output size of e.g. 32001 work with the new matrix multiplication kernels, checks for out-of-bounds memory accesses are necessary. Unfortunately these checks degrade performance by ~5%, so instead I'm turning the data-loading functions into templates that only include the check when it's necessary.

However, this makes functions like ggml_mul_mat_q4_0_q8_1_cuda somewhat unwieldy, since the dispatch code would need to be duplicated for each quantization format (for now I only added it for q4_0). Currently the design is that a function like ggml_cuda_mul_mat_op does a switch statement and then calls one of many functions like ggml_mul_mat_q4_0_q8_1_cuda. Should we change the latter to a template to reduce code duplication?
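
For context, here is a self-contained toy sketch of the template idea; copy_rows, TILE_ROWS and the launcher are illustrative stand-ins, not the mul_mat_q kernels from this PR. The point is that the bounds check is a compile-time parameter, so the instantiation used when the row count is divisible by 64 contains no check at all, and the host-side launcher picks the variant it needs:

#include <cuda_runtime.h>

#define TILE_ROWS 64 // stand-in for GGML_CUDA_MMQ_Y

// The out-of-bounds clamp is compiled in only when need_check is true.
template <bool need_check>
static __global__ void copy_rows(const float * src, float * dst, const int nrows) {
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (need_check) {
        row = min(row, nrows - 1); // clamp only when nrows is not a multiple of TILE_ROWS
    }
    dst[row] = src[row];
}

// Host-side launcher: dispatch to the checked variant only when it is needed,
// analogous to what a templated ggml_mul_mat_*_cuda launcher would do.
static void copy_rows_cuda(const float * src, float * dst, const int nrows, cudaStream_t stream) {
    const int block_num = (nrows + TILE_ROWS - 1) / TILE_ROWS;
    if (nrows % TILE_ROWS == 0) {
        copy_rows<false><<<block_num, TILE_ROWS, 0, stream>>>(src, dst, nrows);
    } else {
        copy_rows<true><<<block_num, TILE_ROWS, 0, stream>>>(src, dst, nrows);
    }
}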

This PR also fixes a check in CMakeLists.txt that was still using LLAMA_CUDA_DMMV_F16.

@JohannesGaessler marked this pull request as draft August 1, 2023 14:21
@slaren (Member) commented Aug 1, 2023

Adding i = min(i, i_max); unconditionally to the load_tiles functions makes the evaluation 5% slower overall?

@JohannesGaessler (Collaborator, Author)

7b q4_0 with template: 1587 t/s
7b q4_0 without template: 1436 t/s
I was only estimating the difference in my head; it turns out it's actually more like 10% (the exact difference will depend on how fast the rest of the calculations are).

@slaren (Member) commented Aug 1, 2023

For me, this results in the same performance:

    int i1 = min(i_offset + GGML_CUDA_MMQ_Y, i_max + GGML_CUDA_MMQ_Y - 1);
#pragma unroll
    for (int i = i_offset; i < i1; i += 8) {
        const block_q4_0 * bxi = bx0 + i*blocks_per_row + kbx;

        x_ql[i * (WARP_SIZE + 1) + k] = get_int_from_uint8(bxi->qs, kqsx);
        x_dm[i * (WARP_SIZE/QI4_0) + i / QI4_0 + kbx].x = bxi->d;
    }

Without the adjustment to i_max it produces wrong results; I am not sure whether that is intended or the value of i_max is wrong.

@JohannesGaessler (Collaborator, Author)

That solution does not work; compute-sanitizer detects out-of-bounds memory accesses. The limit for i needs to be i_max to avoid out-of-bounds accesses, but with the posted code it can go as high as i_max + GGML_CUDA_MMQ_Y - 2.

If we want to avoid a template that enables/disables the checks, we could alternatively pad the weight matrices so that their row counts are multiples of GGML_CUDA_MMQ_Y == 64.
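
To make that alternative concrete, a minimal sketch of what the padding would amount to; the helper name and signature are made up for illustration and nothing like this is implemented in this PR:

// Round a row count up to the next multiple of the tile height
// (e.g. GGML_CUDA_MMQ_Y == 64) so tile loads can never run past the end of
// the allocation; the extra rows would have to be zero-filled and their
// outputs ignored.
static int pad_rows_to_tile(const int nrows, const int tile_rows) {
    return ((nrows + tile_rows - 1) / tile_rows) * tile_rows;
}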

@slaren (Member) commented Aug 1, 2023

What I am suggesting is moving the min out of the loop. If you can make that work, that would keep the code simple while also not degrading performance when the rows are not divisible by 64.
If you don't think that's possible, then I would suggest keeping the template rather than padding the rows.

@JohannesGaessler marked this pull request as ready for review August 2, 2023 08:27
@JohannesGaessler (Collaborator, Author)

An approach with an unconditional check outside the loop was still 4% slower, so I went with templates.

@slaren (Member) left a comment

Looks good, but I didn't test it.

@LostRuins (Collaborator) left a comment

Did some quick tests; looks good and works for the same model I had issues with earlier.

@JohannesGaessler merged commit 4f6b60c into ggml-org:master Aug 2, 2023