Another speed gain for Q4_0 and Q4_1 on Metal #2375
Tested on M1 Max (32-core GPU):
The performance improvement is solid. I am ashamed that I didn't test more thoroughly last time; if it weren't for you, we would have lost quite a bit of performance.
@lshzh-ww That doesn't make sense. Your contributions in #2188 and #2248 were absolutely essential for improving Metal performance. Before you came along with #2188 we were basically stuck at 21.5 ms/t on M2 Max for the 7B model. This kind of collaborative progress is the whole point of open source!
// Note: This is a template, but strictly speaking it only applies to
//       quantizations where the block size is 32. It also does not
//       guard against the number of rows not being divisible by
//       N_DST, so this is another explicit assumption of the implementation.
I think this template also works on matrices whose number of rows is not divisible by N_DST. At the very end of the template we have `if (tiisg == 0 && first_row + row < ne01)`, which guarantees that we don't write into another tensor's region when there are fewer than N_DST rows left.
I checked using a WizardLM model, which has 32001 rows. This template generated exactly the same results as llama.cpp before #2188 was merged.
Well, yes, there is a guard against writing outside the bounds of the result. But there is no guard against reading outside the bounds of the quantized tensor. We don't get a segfault because we only step slightly out of bounds and the address being accessed is still within the process address space, but there is no guarantee that this will always be the case.
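The guard pattern being discussed can be illustrated with a minimal CPU sketch (not the actual Metal kernel; `rowsum_guarded` and the `N_DST = 4` value are hypothetical). Each "group" processes N_DST rows of an `ne01 x ne00` matrix; the point is that the bounds check must cover the reads as well as the write, otherwise the last group reads past the source tensor when `ne01` is not divisible by N_DST:

```c
#define N_DST 4  /* rows per group; hypothetical value for illustration */

/* Sum each row of an ne01 x ne00 matrix, processing N_DST rows per group.
 * The single guard below covers both the read from src and the write to
 * dst, so nothing is accessed past row ne01 - 1 even when ne01 % N_DST != 0. */
void rowsum_guarded(const float *src, int ne00, int ne01, float *dst) {
    for (int first_row = 0; first_row < ne01; first_row += N_DST) {
        for (int row = 0; row < N_DST; ++row) {
            if (first_row + row >= ne01) {
                break;  /* guards the read AND the write */
            }
            float sum = 0.0f;
            for (int i = 0; i < ne00; ++i) {
                sum += src[(first_row + row) * ne00 + i];
            }
            dst[first_row + row] = sum;
        }
    }
}
```

With `ne01 = 5` and `N_DST = 4`, the second group processes only the one remaining row instead of touching rows 5..7.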
I see. Thank you for answering my question.
M1 Pro

| Quantization | ms/t master | ms/t PR |
|---|---|---|
| Q4_0 - 7B  | 30.2 | 28.7 |
| Q4_1 - 7B  | 32.9 | 31.2 |
| Q4_0 - 13B | 53.5 | 50.7 |
| Q4_1 - 13B | 58.1 | 55.6 |
@ikawrakow and @lshzh-ww I assume your tables show
Yes. |
In #2279 there was some debate on whether it is better to have each thread in a Metal SIMD group process whole `Q4_0` or `Q4_1` blocks, or have them process half blocks. On my M2 Max with a 30-core GPU, having threads process half a block is definitely faster, see the table below. I'm curious to see if this change also improves (or worsens) performance on M1 Pro/Max.
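To make the whole-block vs. half-block distinction concrete, here is a minimal CPU sketch of splitting one Q4_0 block between two threads. The Q4_0 layout (32 quants per block, one scale, element `j` in the low nibble of `qs[j]` and element `j + 16` in the high nibble, value `(q - 8) * d`) follows llama.cpp; the plain-`float` scale (the real format uses fp16) and the names `block_q4_0_f` / `dequant_half` are simplifications for this sketch:

```c
#include <stdint.h>

#define QK4_0 32  /* Q4_0 block size */

/* Simplified Q4_0 block: a float scale stands in for the real fp16 "d". */
typedef struct {
    float   d;              /* scale */
    uint8_t qs[QK4_0 / 2];  /* 32 4-bit quants, two per byte */
} block_q4_0_f;

/* Dequantize one half of a block: half 0 -> low nibbles (elements 0..15),
 * half 1 -> high nibbles (elements 16..31). Two threads splitting a block
 * this way each read all 16 bytes but extract different nibbles. */
void dequant_half(const block_q4_0_f *b, int half, float *out /* 16 floats */) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q = (half == 0) ? (b->qs[j] & 0x0F) : (b->qs[j] >> 4);
        out[j] = (float)(q - 8) * b->d;
    }
}
```

A whole-block variant would simply run both halves in one thread; the half-block split halves the per-thread work at the cost of each thread touching every byte of the block.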
TG-128 in ms/token on 30-core M2 Max. The command used was