Another speed gain for Q4_0 and Q4_1 on Metal #2375
Tested on M1 Max (32-core GPU):
The performance improvement is solid. I am ashamed that I didn't test more thoroughly last time; if it weren't for you, we would have lost quite a bit of performance.
@lshzh-ww That doesn't make sense. Your contributions in #2188 and #2248 were absolutely essential for improving Metal performance. Before you came along with #2188 we were basically stuck at 21.5 ms/t on M2 Max for the 7B model. This kind of collaborative progress is the whole point of open source!
// Note: This is a template, but strictly speaking it only applies to
//       quantizations where the block size is 32. It also does not
//       guard against the number of rows not being divisible by
//       N_DST, so this is another explicit assumption of the implementation.
I think this template also works on matrices whose number of rows is not divisible by N_DST. At the very end of the template we have `if (tiisg == 0 && first_row + row < ne01)`, which guarantees that we don't write into another tensor's region when there are fewer than N_DST rows left.
I checked using a WizardLM model, which has 32001 rows. This template generated exactly the same results as llama.cpp before #2188 was merged.
Well, yes, there is a guard against writing outside the bounds of the result. But there is no guard against reading outside the bounds of the quantized tensor. We don't get a segfault because we only step slightly out of bounds and the address being accessed is still within the process address space, but there is no guarantee that this will always be the case.
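The guard pattern being discussed can be illustrated with a minimal CPU sketch (not the actual Metal kernel; `rowsum_guarded` and the `N_DST = 4` value are hypothetical). Each "group" processes N_DST rows of an `ne01 x ne00` matrix; the point is that the bounds check must cover the reads as well as the write, otherwise the last group reads past the source tensor when `ne01` is not divisible by N_DST:

```c
#define N_DST 4  /* rows per group; hypothetical value for illustration */

/* Sum each row of an ne01 x ne00 matrix, processing N_DST rows per group.
 * The single guard below covers both the read from src and the write to
 * dst, so nothing is accessed past row ne01 - 1 even when ne01 % N_DST != 0. */
void rowsum_guarded(const float *src, int ne00, int ne01, float *dst) {
    for (int first_row = 0; first_row < ne01; first_row += N_DST) {
        for (int row = 0; row < N_DST; ++row) {
            if (first_row + row >= ne01) {
                break;  /* guards the read AND the write */
            }
            float sum = 0.0f;
            for (int i = 0; i < ne00; ++i) {
                sum += src[(first_row + row) * ne00 + i];
            }
            dst[first_row + row] = sum;
        }
    }
}
```

With `ne01 = 5` and `N_DST = 4`, the second group processes only the one remaining row instead of touching rows 5..7.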
I see. Thank you for answering my question.
M1 Pro

| Quantization | ms/t master | ms/t PR |
|---|---|---|
| Q4_0 - 7B  | 30.2 | 28.7 |
| Q4_1 - 7B  | 32.9 | 31.2 |
| Q4_0 - 13B | 53.5 | 50.7 |
| Q4_1 - 13B | 58.1 | 55.6 |
@ikawrakow and @lshzh-ww I assume your tables show
Yes. |
In #2279 there was some debate on whether it is better to have each thread in a Metal SIMD group process whole `Q4_0` or `Q4_1` blocks, or have them process half blocks. On my M2 Max with a 30-core GPU, having threads process half a block is definitely faster, see the table below. I'm curious to see if this change also improves (or worsens) performance on M1 Pro/Max.
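To make the whole-block vs. half-block distinction concrete, here is a minimal CPU sketch of splitting one Q4_0 block between two threads. The Q4_0 layout (32 quants per block, one scale, element `j` in the low nibble of `qs[j]` and element `j + 16` in the high nibble, value `(q - 8) * d`) follows llama.cpp; the plain-`float` scale (the real format uses fp16) and the names `block_q4_0_f` / `dequant_half` are simplifications for this sketch:

```c
#include <stdint.h>

#define QK4_0 32  /* Q4_0 block size */

/* Simplified Q4_0 block: a float scale stands in for the real fp16 "d". */
typedef struct {
    float   d;              /* scale */
    uint8_t qs[QK4_0 / 2];  /* 32 4-bit quants, two per byte */
} block_q4_0_f;

/* Dequantize one half of a block: half 0 -> low nibbles (elements 0..15),
 * half 1 -> high nibbles (elements 16..31). Two threads splitting a block
 * this way each read all 16 bytes but extract different nibbles. */
void dequant_half(const block_q4_0_f *b, int half, float *out /* 16 floats */) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q = (half == 0) ? (b->qs[j] & 0x0F) : (b->qs[j] >> 4);
        out[j] = (float)(q - 8) * b->d;
    }
}
```

A whole-block variant would simply run both halves in one thread; the half-block split halves the per-thread work at the cost of each thread touching every byte of the block.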
TG-128 in ms/token on 30-core M2 Max. The command used was