
metal : relax conditions on fast matrix multiplication kernel #3168


Merged
merged 3 commits into master on Sep 15, 2023

Conversation


@ggerganov ggerganov commented Sep 14, 2023

close #2850

In #3123, @ikawrakow made an important observation: the KQ tensor does not benefit from the fast matrix multiplication kernel because the K and Q tensors are not contiguous, so we fall back to the slower variant of the kernel.

However, the fast kernel does not really need src0 and src1 to be contiguous. It only requires them to not be transposed - i.e. only the elements within the rows need to be contiguous. Relaxing this restriction leads to an improvement in PP speed.
With this change, LLaMA is ~10% faster on M2 Ultra and Falcon is as fast as LLaMA.
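To make the distinction concrete, here is a minimal C sketch of the two checks. This is not the actual ggml code - the `tensor_view` struct and helper names are invented for illustration - but it captures the difference between requiring full contiguity and only requiring contiguous rows:

```c
// Minimal sketch (not the actual ggml implementation): a ggml-style
// tensor stores ne[i] (elements per dimension) and nb[i] (stride in
// bytes per dimension). Struct and helper names are illustrative only.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tensor_view {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  type_size; // size of one element (or quantization block)
};

// Old requirement: the whole tensor is packed with no gaps anywhere.
static bool is_fully_contiguous(const struct tensor_view * t) {
    return t->nb[0] == t->type_size        &&
           t->nb[1] == t->nb[0] * t->ne[0] &&
           t->nb[2] == t->nb[1] * t->ne[1] &&
           t->nb[3] == t->nb[2] * t->ne[2];
}

// Relaxed requirement: only the elements within a row must be packed,
// i.e. the tensor is not transposed. Rows may be spaced apart, as they
// are for the K and Q views, and the fast kernel still works.
static bool has_contiguous_rows(const struct tensor_view * t) {
    return t->nb[0] == t->type_size;
}
```

A KQ-style view fails the first check (its row-to-row strides contain gaps) but passes the second, which is why it can now take the fast kernel path.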

While implementing this, I noticed that the Metal concurrency optimization, while improving PP speed, actually degrades TG speed on M2 Ultra. So I've added a flag to use it only when computing n_batch > 1. This leads to the TG speedup in the results below.
Ignore this comment - preparing the allocator with concurrency enabled and then not using it produces incorrect results. Updated the tables below.

This PR also fixes a bug in ggml_nbytes() which does not affect llama.cpp.
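The PR text does not describe the ggml_nbytes() bug, so the following is only a hedged illustration of the general pitfall that a stride-aware size computation avoids for non-contiguous views - it is not the actual patch, and the struct and function names are invented:

```c
// Hedged illustration only -- NOT the actual ggml_nbytes() fix. It shows
// why an element-count-based size can under-report the bytes spanned by
// a non-contiguous view, while a stride-aware span does not.
// (Same illustrative tensor_view struct as in the sketch above, repeated
// so this block is self-contained.)
#include <stddef.h>
#include <stdint.h>

struct tensor_view {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  type_size; // size of one element
};

// Element-count-based size: correct only for fully contiguous tensors.
static size_t nbytes_contiguous(const struct tensor_view * t) {
    return (size_t)(t->ne[0] * t->ne[1] * t->ne[2] * t->ne[3]) * t->type_size;
}

// Stride-aware size: the span from the first byte to one past the last
// element, which also covers views whose rows are spaced apart.
static size_t nbytes_strided(const struct tensor_view * t) {
    size_t nbytes = t->type_size;
    for (int i = 0; i < 4; ++i) {
        nbytes += (size_t)(t->ne[i] - 1) * t->nb[i];
    }
    return nbytes;
}
```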

Please give this branch thorough testing and let me know if you observe a similar speed-up on other Metal chips.
I did some checks on the perplexity for a few quants and everything looks normal, but let's double-check.

| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| LLaMA 7B mostly F16 | 12.55 GiB | 4 | pp 512 | 1266.05 ± 1.48 | 1490.07 ± 0.85 | 1.177 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | pp 512 | 1142.66 ± 0.98 | 1325.69 ± 0.57 | 1.160 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | pp 512 | 1167.39 ± 1.17 | 1355.56 ± 0.65 | 1.161 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | pp 512 | 1165.83 ± 0.97 | 1352.77 ± 0.45 | 1.160 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | 4 | pp 512 | 980.69 ± 0.78 | 1106.65 ± 0.27 | 1.128 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | 4 | pp 512 | 977.92 ± 0.67 | 1103.13 ± 0.36 | 1.128 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | 4 | pp 512 | 977.93 ± 0.68 | 1101.99 ± 0.36 | 1.127 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 4 | pp 512 | 1027.95 ± 0.45 | 1169.12 ± 0.33 | 1.137 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 4 | pp 512 | 1034.99 ± 0.88 | 1178.93 ± 0.23 | 1.139 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | 4 | pp 512 | 1007.44 ± 0.98 | 1140.24 ± 3.69 | 1.132 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | 4 | pp 512 | 990.65 ± 0.52 | 1119.17 ± 0.53 | 1.130 |
| LLaMA 7B mostly F16 | 12.55 GiB | 4 | tg 128 | 40.89 ± 0.05 | 41.04 ± 0.06 | 1.004 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | tg 128 | 64.73 ± 0.04 | 64.88 ± 0.07 | 1.002 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | tg 128 | 91.02 ± 0.08 | 91.44 ± 0.07 | 1.005 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | tg 128 | 86.05 ± 0.08 | 86.34 ± 0.08 | 1.003 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | 4 | tg 128 | 71.51 ± 0.05 | 71.92 ± 0.10 | 1.006 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | 4 | tg 128 | 72.72 ± 0.10 | 72.92 ± 0.09 | 1.003 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | 4 | tg 128 | 74.23 ± 0.04 | 74.52 ± 0.05 | 1.004 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 4 | tg 128 | 83.90 ± 0.04 | 84.01 ± 0.13 | 1.001 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 4 | tg 128 | 87.31 ± 0.17 | 87.56 ± 0.16 | 1.003 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | 4 | tg 128 | 83.86 ± 0.10 | 84.19 ± 0.13 | 1.004 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | 4 | tg 128 | 85.54 ± 0.12 | 85.76 ± 0.12 | 1.003 |
| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| Falcon 7B mostly F16 | 13.44 GiB | 4 | pp 512 | 886.76 ± 0.94 | 1378.01 ± 2.19 | 1.554 |
| Falcon 7B mostly Q8_0 | 7.14 GiB | 4 | pp 512 | 828.19 ± 0.93 | 1240.71 ± 1.63 | 1.498 |
| Falcon 7B mostly Q4_0 | 3.92 GiB | 4 | pp 512 | 837.66 ± 1.14 | 1264.06 ± 1.67 | 1.509 |
| Falcon 7B mostly Q4_1 | 4.32 GiB | 4 | pp 512 | 837.45 ± 1.11 | 1262.56 ± 1.82 | 1.508 |
| Falcon 7B mostly F16 | 13.44 GiB | 4 | tg 128 | 40.26 ± 0.05 | 40.29 ± 0.03 | 1.001 |
| Falcon 7B mostly Q8_0 | 7.14 GiB | 4 | tg 128 | 63.19 ± 0.03 | 63.22 ± 0.01 | 1.000 |
| Falcon 7B mostly Q4_0 | 3.92 GiB | 4 | tg 128 | 89.31 ± 0.07 | 89.39 ± 0.02 | 1.001 |
| Falcon 7B mostly Q4_1 | 4.32 GiB | 4 | tg 128 | 82.54 ± 0.02 | 82.70 ± 0.05 | 1.002 |

build: 71ca2fa (1221)

ggml-metal.m (Outdated)

```diff
@@ -911,7 +925,9 @@ void ggml_metal_graph_compute(
                 nth1 = 1;
                 if (ne11 * ne12 < 4) {
                     [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32_1row];
-                } else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
+                //} else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
```
Contributor

So, if you are sure that it is no longer needed, then remove it altogether. If you are not sure, then this change does not make sense.

Member Author

Will do some more tests - for now keeping the original implementation.

With this PR, nrows is always 1 because ne11 is 1, so I think this branch either needs to be updated or is probably not needed anymore.

@jhen0409
Collaborator

M2 (10c GPU)

| model | size | th | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | pp 512 | 162.20 ± 0.20 | 185.97 ± 0.02 | 1.147 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | pp 512 | 165.08 ± 0.09 | 189.65 ± 0.09 | 1.149 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | pp 512 | 164.92 ± 0.14 | 189.25 ± 0.05 | 1.148 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | 4 | tg 128 | 12.23 ± 0.08 | 12.25 ± 0.04 | 1.001 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 4 | tg 128 | 22.03 ± 0.04 | 22.04 ± 0.02 | 1.000 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | 4 | tg 128 | 19.95 ± 0.02 | 19.93 ± 0.16 | 1.000 |

@ggerganov ggerganov merged commit a51b687 into master Sep 15, 2023
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
…rg#3168)

* metal : relax conditions on fast matrix multiplication kernel

* metal : revert the concurrency change because it was wrong

* llama : remove experimental stuff
Successfully merging this pull request may close these issues.

falcon : speed-up prompt processing