
metal : relax conditions on fast matrix multiplication kernel #3168


Merged
merged 3 commits into master on Sep 15, 2023

Conversation


@ggerganov ggerganov commented Sep 14, 2023

close #2850

In #3123, @ikawrakow made an important observation: the KQ tensor does not benefit from the fast matrix multiplication kernel because the K and Q tensors are not contiguous, so we fall back to the slower variant of the kernel.

However, the fast kernel does not really need src0 and src1 to be contiguous. It only requires them to not be transposed - i.e. only the elements within the rows need to be contiguous. Relaxing this restriction leads to an improvement in PP speed.
With this change, LLaMA is ~10% faster on M2 Ultra and Falcon is as fast as LLaMA.
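To make the distinction concrete, here is a minimal C sketch of the two checks. This is not the actual ggml code - the `tensor_view` struct and helper names are invented for illustration - but it captures the difference between requiring full contiguity and only requiring contiguous rows:

```c
// Minimal sketch (not the actual ggml implementation): a ggml-style
// tensor stores ne[i] (elements per dimension) and nb[i] (stride in
// bytes per dimension). Struct and helper names are illustrative only.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tensor_view {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  type_size; // size of one element (or quantization block)
};

// Old requirement: the whole tensor is packed with no gaps anywhere.
static bool is_fully_contiguous(const struct tensor_view * t) {
    return t->nb[0] == t->type_size        &&
           t->nb[1] == t->nb[0] * t->ne[0] &&
           t->nb[2] == t->nb[1] * t->ne[1] &&
           t->nb[3] == t->nb[2] * t->ne[2];
}

// Relaxed requirement: only the elements within a row must be packed,
// i.e. the tensor is not transposed. Rows may be spaced apart, as they
// are for the K and Q views, and the fast kernel still works.
static bool has_contiguous_rows(const struct tensor_view * t) {
    return t->nb[0] == t->type_size;
}
```

A KQ-style view fails the first check (its row-to-row strides contain gaps) but passes the second, which is why it can now take the fast kernel path.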

While implementing this, I noticed that the Metal concurrency optimization, while improving PP speed, actually degrades TG speed on M2 Ultra. So I've added a flag to use it only when computing n_batch > 1. This leads to the TG speedup in the results below.
Ignore this comment - preparing the allocator with concurrency enabled and then not using it produces incorrect results. Updated the tables below.

This PR also fixes a bug in ggml_nbytes() which does not affect llama.cpp.
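The PR text does not describe the ggml_nbytes() bug, so the following is only a hedged illustration of the general pitfall that a stride-aware size computation avoids for non-contiguous views - it is not the actual patch, and the struct and function names are invented:

```c
// Hedged illustration only -- NOT the actual ggml_nbytes() fix. It shows
// why an element-count-based size can under-report the bytes spanned by
// a non-contiguous view, while a stride-aware span does not.
// (Same illustrative tensor_view struct as in the sketch above, repeated
// so this block is self-contained.)
#include <stddef.h>
#include <stdint.h>

struct tensor_view {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  type_size; // size of one element
};

// Element-count-based size: correct only for fully contiguous tensors.
static size_t nbytes_contiguous(const struct tensor_view * t) {
    return (size_t)(t->ne[0] * t->ne[1] * t->ne[2] * t->ne[3]) * t->type_size;
}

// Stride-aware size: the span from the first byte to one past the last
// element, which also covers views whose rows are spaced apart.
static size_t nbytes_strided(const struct tensor_view * t) {
    size_t nbytes = t->type_size;
    for (int i = 0; i < 4; ++i) {
        nbytes += (size_t)(t->ne[i] - 1) * t->nb[i];
    }
    return nbytes;
}
```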

Please give this branch thorough testing and let me know if you observe a similar speed-up on other Metal chips.
I did some checks on the perplexity for a few quants and everything looks normal, but let's double-check.

| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| LLaMA 7B mostly F16 | 12.55 GiB | 4 | pp 512 | 1266.05 ± 1.48 | 1490.07 ± 0.85 | 1.177 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | pp 512 | 1142.66 ± 0.98 | 1325.69 ± 0.57 | 1.160 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | pp 512 | 1167.39 ± 1.17 | 1355.56 ± 0.65 | 1.161 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | pp 512 | 1165.83 ± 0.97 | 1352.77 ± 0.45 | 1.160 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | 4 | pp 512 | 980.69 ± 0.78 | 1106.65 ± 0.27 | 1.128 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | 4 | pp 512 | 977.92 ± 0.67 | 1103.13 ± 0.36 | 1.128 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | 4 | pp 512 | 977.93 ± 0.68 | 1101.99 ± 0.36 | 1.127 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 4 | pp 512 | 1027.95 ± 0.45 | 1169.12 ± 0.33 | 1.137 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 4 | pp 512 | 1034.99 ± 0.88 | 1178.93 ± 0.23 | 1.139 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | 4 | pp 512 | 1007.44 ± 0.98 | 1140.24 ± 3.69 | 1.132 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | 4 | pp 512 | 990.65 ± 0.52 | 1119.17 ± 0.53 | 1.130 |
| LLaMA 7B mostly F16 | 12.55 GiB | 4 | tg 128 | 40.89 ± 0.05 | 41.04 ± 0.06 | 1.004 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | tg 128 | 64.73 ± 0.04 | 64.88 ± 0.07 | 1.002 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | tg 128 | 91.02 ± 0.08 | 91.44 ± 0.07 | 1.005 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | tg 128 | 86.05 ± 0.08 | 86.34 ± 0.08 | 1.003 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | 4 | tg 128 | 71.51 ± 0.05 | 71.92 ± 0.10 | 1.006 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | 4 | tg 128 | 72.72 ± 0.10 | 72.92 ± 0.09 | 1.003 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | 4 | tg 128 | 74.23 ± 0.04 | 74.52 ± 0.05 | 1.004 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 4 | tg 128 | 83.90 ± 0.04 | 84.01 ± 0.13 | 1.001 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 4 | tg 128 | 87.31 ± 0.17 | 87.56 ± 0.16 | 1.003 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | 4 | tg 128 | 83.86 ± 0.10 | 84.19 ± 0.13 | 1.004 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | 4 | tg 128 | 85.54 ± 0.12 | 85.76 ± 0.12 | 1.003 |
| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| Falcon 7B mostly F16 | 13.44 GiB | 4 | pp 512 | 886.76 ± 0.94 | 1378.01 ± 2.19 | 1.554 |
| Falcon 7B mostly Q8_0 | 7.14 GiB | 4 | pp 512 | 828.19 ± 0.93 | 1240.71 ± 1.63 | 1.498 |
| Falcon 7B mostly Q4_0 | 3.92 GiB | 4 | pp 512 | 837.66 ± 1.14 | 1264.06 ± 1.67 | 1.509 |
| Falcon 7B mostly Q4_1 | 4.32 GiB | 4 | pp 512 | 837.45 ± 1.11 | 1262.56 ± 1.82 | 1.508 |
| Falcon 7B mostly F16 | 13.44 GiB | 4 | tg 128 | 40.26 ± 0.05 | 40.29 ± 0.03 | 1.001 |
| Falcon 7B mostly Q8_0 | 7.14 GiB | 4 | tg 128 | 63.19 ± 0.03 | 63.22 ± 0.01 | 1.000 |
| Falcon 7B mostly Q4_0 | 3.92 GiB | 4 | tg 128 | 89.31 ± 0.07 | 89.39 ± 0.02 | 1.001 |
| Falcon 7B mostly Q4_1 | 4.32 GiB | 4 | tg 128 | 82.54 ± 0.02 | 82.70 ± 0.05 | 1.002 |

build: 71ca2fa (1221)

ggml-metal.m (Outdated)

```diff
@@ -911,7 +925,9 @@ void ggml_metal_graph_compute(
                 nth1 = 1;
                 if (ne11 * ne12 < 4) {
                     [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32_1row];
-                } else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
+                //} else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
```
Contributor

So, if you are sure that it is no longer needed, then remove it altogether. If you are not sure, then this change does not make sense.

Member Author

Will do some more tests - for now keeping the original implementation.

With this PR, nrows is always 1 because ne11 is 1, so I think this branch either needs to be updated or is probably not needed anymore.

@jhen0409
Collaborator

M2 (10c GPU)

| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | pp 512 | 162.20 ± 0.20 | 185.97 ± 0.02 | 1.147 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | pp 512 | 165.08 ± 0.09 | 189.65 ± 0.09 | 1.149 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | pp 512 | 164.92 ± 0.14 | 189.25 ± 0.05 | 1.148 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | tg 128 | 12.23 ± 0.08 | 12.25 ± 0.04 | 1.001 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | tg 128 | 22.03 ± 0.04 | 22.04 ± 0.02 | 1.000 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | tg 128 | 19.95 ± 0.02 | 19.93 ± 0.16 | 1.000 |

@ggerganov ggerganov merged commit a51b687 into master Sep 15, 2023
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
…rg#3168)

* metal : relax conditions on fast matrix multiplication kernel

* metal : revert the concurrency change because it was wrong

* llama : remove experimental stuff
Successfully merging this pull request may close these issues.

falcon : speed-up prompt processing