CUDA: Fix output size #2480
Conversation
7b q4_0 with template: 1587 t/s
For me, this results in the same performance:

```cpp
int i1 = min(i_offset + GGML_CUDA_MMQ_Y, i_max + GGML_CUDA_MMQ_Y - 1);

#pragma unroll
for (int i = i_offset; i < i1; i += 8) {
    const block_q4_0 * bxi = bx0 + i*blocks_per_row + kbx;

    x_ql[i * (WARP_SIZE + 1) + k] = get_int_from_uint8(bxi->qs, kqsx);
    x_dm[i * (WARP_SIZE/QI4_0) + i / QI4_0 + kbx].x = bxi->d;
}
```

Without the adjustment to …
That solution does not work. If we want to avoid a template that enables/disables checks, we could alternatively pad the weight matrices so that their rows are multiples of …
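For illustration only, a minimal sketch of the padding idea (this is not what the PR ends up doing; the helper name is hypothetical and the row multiple of 64 is taken from the PR description):

```cpp
// Hypothetical helper: round a weight matrix's row count up to the next
// multiple of the tile height so the mul_mat kernels never index past the end.
static int64_t pad_rows(int64_t nrows, int64_t multiple) {
    return ((nrows + multiple - 1) / multiple) * multiple; // e.g. 32001 -> 32064 for multiple = 64
}
```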
What I am suggesting is moving the …
Force-pushed from dd0e0d9 to 1e64d51
An approach with an unconditional check outside the loop was still 4% slower, so I went with templates.
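For context, a minimal sketch of the template approach, assuming the surrounding ggml-cuda.cu definitions (`block_q4_0`, `get_int_from_uint8`, `GGML_CUDA_MMQ_Y`, `WARP_SIZE`, `QI4_0`); the function name and exact signature here are illustrative, not the actual kernel code from this PR:

```cpp
// A compile-time flag controls whether the out-of-bounds guard is emitted, so
// matrices whose row count is a multiple of GGML_CUDA_MMQ_Y pay no runtime cost.
template <bool need_check>
static __device__ __forceinline__ void load_tiles_q4_0(
        const block_q4_0 * bx0, int * x_ql, half2 * x_dm,
        const int i_offset, const int i_max, const int blocks_per_row,
        const int kbx, const int kqsx, const int k) {
#pragma unroll
    for (int i0 = i_offset; i0 < GGML_CUDA_MMQ_Y; i0 += 8) {
        int i = i0;
        if (need_check) {
            i = min(i, i_max); // clamp to the last valid row of the weight matrix
        }
        const block_q4_0 * bxi = bx0 + i*blocks_per_row + kbx;

        x_ql[i * (WARP_SIZE + 1) + k] = get_int_from_uint8(bxi->qs, kqsx);
        x_dm[i * (WARP_SIZE/QI4_0) + i / QI4_0 + kbx].x = bxi->d;
    }
}
```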
Looks good, but I didn't test it.
Did some quick tests, looks good, works for the same model I had issues with earlier.
Currently, models with an output size not divisible by 64 do not work with CUDA. The problem is that I added code that rounds down the min/max rows of weight matrices to multiples of 64 in order to fix an out-of-bounds issue for multi GPU. To make models with an output size of e.g. 32001 work with the new matrix multiplication kernels, checks for out-of-bounds memory accesses are necessary. Unfortunately these checks degrade performance by ~5%, so instead I'm turning the functions that load the data into templates that only include the check when it's necessary. However, this makes functions like `ggml_mul_mat_q4_0_q8_1_cuda` kind of unwieldy, since the code would need to be duplicated for each quantization format (for now I only added it for q4_0). Currently we have a design where a function like `ggml_cuda_mul_mat_op` does a switch statement and then calls one of many functions like `ggml_mul_mat_q4_0_q8_1_cuda`. Should we change the latter to a template to reduce code duplication?

This PR also fixes a check in CMakeLists.txt that was still using `LLAMA_CUDA_DMMV_F16`.
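A rough sketch of what such a launcher template could look like; all kernel and function names below are illustrative placeholders rather than the actual ggml-cuda API, and the grid/block shapes are only meant to show where the divisibility check would live:

```cpp
#include <cuda_runtime.h>

// Illustrative kernel declaration: one mul_mat_q kernel template, parameterized
// over the quantized block type and a compile-time bounds-check flag.
template <typename block_q_t, bool need_check>
__global__ void mul_mat_q(const void * vx, const void * vy, float * dst,
                          int ncols_x, int nrows_x, int ncols_y, int nrows_dst);

// One launcher template instead of a separate ggml_mul_mat_qX_Y_q8_1_cuda
// function per quantization format; the switch in ggml_cuda_mul_mat_op would
// instantiate it with the right block type.
template <typename block_q_t>
static void ggml_mul_mat_q_q8_1_cuda(const void * vx, const void * vy, float * dst,
                                     const int ncols_x, const int nrows_x,
                                     const int ncols_y, const int nrows_dst,
                                     cudaStream_t stream) {
    const int block_num_x = (nrows_x + GGML_CUDA_MMQ_Y - 1) / GGML_CUDA_MMQ_Y;
    const int block_num_y = (ncols_y + WARP_SIZE - 1) / WARP_SIZE;
    const dim3 block_nums(block_num_x, block_num_y, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMQ_Y/WARP_SIZE, 1);

    if (nrows_x % GGML_CUDA_MMQ_Y == 0) {
        // fast path: output size divisible by the tile height, no checks compiled in
        mul_mat_q<block_q_t, false><<<block_nums, block_dims, 0, stream>>>(
            vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst);
    } else {
        // e.g. an output size of 32001: use the bounds-checked instantiation
        mul_mat_q<block_q_t, true><<<block_nums, block_dims, 0, stream>>>(
            vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst);
    }
}

// usage sketch for the q4_0 path:
// ggml_mul_mat_q_q8_1_cuda<block_q4_0>(vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst, stream);
```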