
ggml : fix I8MM Q4_1 scaling factor conversion #10562


Merged
ggerganov merged 1 commit into master from gg/cpu-i8mm-fix-2 on Nov 29, 2024

Conversation

@ggerganov ggerganov (Member) commented Nov 28, 2024

target #10561

These changes fix an illegal instruction crash on M1 Pro, caused by the code not doing a runtime check for the availability of I8MM. We now check ggml_cpu_has_matmul_int8() and, if it is false, unpack the 2x2 multiplication into 4 dot products.
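For reference, a minimal sketch of the fallback described above, assuming hypothetical helpers mul_mat_2x2_i8mm() and dot_q4_q8() standing in for the I8MM path and the plain dot-product path (only ggml_cpu_has_matmul_int8() is an actual ggml function):

// Sketch only, not the upstream kernel.
void vec_dot_2x2_sketch(float out[2][2],
                        const void * vx0, const void * vx1,
                        const void * vy0, const void * vy1, int nb) {
    if (ggml_cpu_has_matmul_int8()) {
        // I8MM available: one 2x2 int8 matrix multiplication covers
        // both x rows against both y rows at once.
        mul_mat_2x2_i8mm(out, vx0, vx1, vy0, vy1, nb);
    } else {
        // No I8MM (e.g. M1 Pro): unpack into 4 independent dot products.
        out[0][0] = dot_q4_q8(vx0, vy0, nb);
        out[0][1] = dot_q4_q8(vx0, vy1, nb);
        out[1][0] = dot_q4_q8(vx1, vy0, nb);
        out[1][1] = dot_q4_q8(vx1, vy1, nb);
    }
}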

This fix aside, I am wondering if we should drop the int nrc support in the ggml_vec_dot kernels to keep them simple, and proceed to implement proper GEMMs similar to the work in ggml-cpu-aarch64.c?

@github-actions github-actions bot added the labels testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) on Nov 28, 2024
@ggerganov ggerganov changed the title from "ggml : fix row condition for i8mm kernels" to "gml : fix I8MM runtime feature checks in CPU kernels" on Nov 28, 2024
Comment on lines 2387 to 2391
float32_t _scale[4] = {
    GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y0->d),
    GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y1->d),
    GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y0->d),
    GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y1->d)};
@ggerganov ggerganov (Member, Author) commented Nov 28, 2024

This fixes a bug where the y->d was not converted to F32, resulting in completely wrong numbers when going through this CPU branch.
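As an illustration of the failure mode (a standalone sketch, not ggml code; the fp16_to_fp32 helper and the example bit patterns are made up for the demo, and __fp16 support is assumed, e.g. clang on AArch64), multiplying by the raw 16-bit delta instead of the converted F32 value gives a wildly wrong scale:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Minimal fp16 -> fp32 helper standing in for GGML_FP16_TO_FP32.
static float fp16_to_fp32(uint16_t h) {
    __fp16 f;
    memcpy(&f, &h, sizeof(f));
    return (float) f;
}

int main(void) {
    uint16_t x_d = 0x2E66; // fp16 bit pattern for ~0.1 (example block delta)
    uint16_t y_d = 0x3266; // fp16 bit pattern for ~0.2

    // Correct: convert both deltas to F32 before multiplying (~0.02).
    float good = fp16_to_fp32(x_d) * fp16_to_fp32(y_d);

    // Buggy: the raw 16-bit value (0x3266 == 12902) is used as a number (~1290).
    float bad  = fp16_to_fp32(x_d) * (float) y_d;

    printf("correct scale: %g, buggy scale: %g\n", good, bad);
    return 0;
}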

@@ -1759,66 +1759,76 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, size_t bs, const void * r
const block_q8_0 * restrict vy0 = vy;
const block_q8_0 * restrict vy1 = (const block_q8_0 *) ((const uint8_t*)vy + by);

float32x4_t sumv0 = vdupq_n_f32(0.0f);
if (ggml_cpu_has_matmul_int8()) {
Member

We need to remove the ARM runtime feature detection completely, it doesn't work at all and never will. So I would prefer if at least we don't make that task worse by adding more checks like this.

@ggerganov ggerganov (Member, Author)

Ok, I will change the PR to just include the F16 -> F32 fix in the Q4_1 kernel.

Base automatically changed from gg/cpu-q4_0-i8mm-fix to master November 28, 2024 12:56
@ggerganov ggerganov changed the title from "gml : fix I8MM runtime feature checks in CPU kernels" to "ggml : fix I8MM Q4_1 scaling factor conversion" on Nov 28, 2024
@ggerganov ggerganov merged commit f0678c5 into master Nov 29, 2024
57 checks passed
@ggerganov ggerganov deleted the gg/cpu-i8mm-fix-2 branch November 29, 2024 14:25
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024