CUDA: Fix output size #2480
Conversation
7b q4_0 with template: 1587 t/s
For me, this results in the same performance:

```cpp
int i1 = min(i_offset + GGML_CUDA_MMQ_Y, i_max + GGML_CUDA_MMQ_Y - 1);

#pragma unroll
for (int i = i_offset; i < i1; i += 8) {
    const block_q4_0 * bxi = bx0 + i*blocks_per_row + kbx;

    x_ql[i * (WARP_SIZE + 1) + k] = get_int_from_uint8(bxi->qs, kqsx);
    x_dm[i * (WARP_SIZE/QI4_0) + i / QI4_0 + kbx].x = bxi->d;
}
```

Without the adjustment to …
That solution does not work. If we want to avoid a template that enables/disables checks, we could alternatively pad the weight matrices so that their rows are multiples of …
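For illustration only, a minimal sketch of the padding idea (this is not what the PR ends up doing; the helper name is hypothetical and the row multiple of 64 is taken from the PR description):

```cpp
// Hypothetical helper: round a weight matrix's row count up to the next
// multiple of the tile height so the mul_mat kernels never index past the end.
static int64_t pad_rows(int64_t nrows, int64_t multiple) {
    return ((nrows + multiple - 1) / multiple) * multiple; // e.g. 32001 -> 32064 for multiple = 64
}
```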
What I am suggesting is moving the …
Force-pushed from dd0e0d9 to 1e64d51
An approach with an unconditional check outside the loop was still 4% slower, so I went with templates.
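For context, a minimal sketch of the template approach, assuming the surrounding ggml-cuda.cu definitions (`block_q4_0`, `get_int_from_uint8`, `GGML_CUDA_MMQ_Y`, `WARP_SIZE`, `QI4_0`); the function name and exact signature here are illustrative, not the actual kernel code from this PR:

```cpp
// A compile-time flag controls whether the out-of-bounds guard is emitted, so
// matrices whose row count is a multiple of GGML_CUDA_MMQ_Y pay no runtime cost.
template <bool need_check>
static __device__ __forceinline__ void load_tiles_q4_0(
        const block_q4_0 * bx0, int * x_ql, half2 * x_dm,
        const int i_offset, const int i_max, const int blocks_per_row,
        const int kbx, const int kqsx, const int k) {
#pragma unroll
    for (int i0 = i_offset; i0 < GGML_CUDA_MMQ_Y; i0 += 8) {
        int i = i0;
        if (need_check) {
            i = min(i, i_max); // clamp to the last valid row of the weight matrix
        }
        const block_q4_0 * bxi = bx0 + i*blocks_per_row + kbx;

        x_ql[i * (WARP_SIZE + 1) + k] = get_int_from_uint8(bxi->qs, kqsx);
        x_dm[i * (WARP_SIZE/QI4_0) + i / QI4_0 + kbx].x = bxi->d;
    }
}
```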
Looks good, but I didn't test it.
Did some quick tests, looks good, works for the same model I had issues with earlier.
Currently, models with an output size not divisible by 64 do not work with CUDA. The problem is that I added code that rounds down the min/max rows of weight matrices to multiples of 64 in order to fix an out-of-bounds issue for multi GPU. To make models with an output size of e.g. 32001 work with the new matrix multiplication kernels, checks for out-of-bounds memory accesses are necessary. Unfortunately these checks degrade performance by ~5%, so instead I'm turning the functions that load the data into templates that only include the check when it's necessary. However, this makes functions like `ggml_mul_mat_q4_0_q8_1_cuda` kind of unwieldy, since the code would need to be duplicated for each quantization format (for now I only added it for q4_0). Currently we have a design where a function like `ggml_cuda_mul_mat_op` does a switch statement and then calls one of many functions like `ggml_mul_mat_q4_0_q8_1_cuda`. Should we change the latter to a template to reduce code duplication?

This PR also fixes a check in CMakeLists.txt that was still using `LLAMA_CUDA_DMMV_F16`.
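A rough sketch of what such a launcher template could look like; all kernel and function names below are illustrative placeholders rather than the actual ggml-cuda API, and the grid/block shapes are only meant to show where the divisibility check would live:

```cpp
#include <cuda_runtime.h>

// Illustrative kernel declaration: one mul_mat_q kernel template, parameterized
// over the quantized block type and a compile-time bounds-check flag.
template <typename block_q_t, bool need_check>
__global__ void mul_mat_q(const void * vx, const void * vy, float * dst,
                          int ncols_x, int nrows_x, int ncols_y, int nrows_dst);

// One launcher template instead of a separate ggml_mul_mat_qX_Y_q8_1_cuda
// function per quantization format; the switch in ggml_cuda_mul_mat_op would
// instantiate it with the right block type.
template <typename block_q_t>
static void ggml_mul_mat_q_q8_1_cuda(const void * vx, const void * vy, float * dst,
                                     const int ncols_x, const int nrows_x,
                                     const int ncols_y, const int nrows_dst,
                                     cudaStream_t stream) {
    const int block_num_x = (nrows_x + GGML_CUDA_MMQ_Y - 1) / GGML_CUDA_MMQ_Y;
    const int block_num_y = (ncols_y + WARP_SIZE - 1) / WARP_SIZE;
    const dim3 block_nums(block_num_x, block_num_y, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMQ_Y/WARP_SIZE, 1);

    if (nrows_x % GGML_CUDA_MMQ_Y == 0) {
        // fast path: output size divisible by the tile height, no checks compiled in
        mul_mat_q<block_q_t, false><<<block_nums, block_dims, 0, stream>>>(
            vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst);
    } else {
        // e.g. an output size of 32001: use the bounds-checked instantiation
        mul_mat_q<block_q_t, true><<<block_nums, block_dims, 0, stream>>>(
            vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst);
    }
}

// usage sketch for the q4_0 path:
// ggml_mul_mat_q_q8_1_cuda<block_q4_0>(vx, vy, dst, ncols_x, nrows_x, ncols_y, nrows_dst, stream);
```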