metal : relax conditions on fast matrix multiplication kernel #3168
close #2850
In #3123, @ikawrakow made an important observation: the `KQ` tensor does not benefit from the fast matrix multiplication kernel because the `K` and `Q` tensors are not contiguous, so we fall back to the slower variant of the kernel.

However, the fast kernel does not really need `src0` and `src1` to be contiguous. It only requires them to not be transposed, i.e. only the elements within each row need to be contiguous. Relaxing this restriction improves PP (prompt processing) speed.

With this change, LLaMA is ~10% faster on M2 Ultra and Falcon is as fast as LLaMA.
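To make the distinction concrete, here is a rough sketch of the difference between "fully contiguous" and "rows contiguous, not transposed". It is not the code from this PR: `tensor_view` and the three helper functions are hypothetical names, and the struct is a simplified stand-in that only mirrors the `ne` (element counts) and `nb` (byte strides) fields of a ggml tensor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified tensor descriptor (NOT the real ggml struct).
typedef struct {
    int64_t ne[4];     // number of elements in each dimension
    size_t  nb[4];     // stride in bytes for each dimension
    size_t  elem_size; // size of one element in bytes
} tensor_view;

// Strict condition: the whole tensor is densely packed in memory,
// so every stride equals the packed size of the dimension below it.
static bool view_is_contiguous(const tensor_view *t) {
    if (t->nb[0] != t->elem_size) {
        return false;
    }
    for (int i = 1; i < 4; i++) {
        if (t->nb[i] != t->nb[i - 1] * (size_t) t->ne[i - 1]) {
            return false;
        }
    }
    return true;
}

// Transposed: stepping along a row jumps further than stepping between rows.
static bool view_is_transposed(const tensor_view *t) {
    return t->nb[0] > t->nb[1];
}

// Relaxed condition: elements inside a row are adjacent and the view is
// not transposed. The higher-dimension strides may be arbitrary.
static bool view_rows_contiguous(const tensor_view *t) {
    return t->nb[0] == t->elem_size && !view_is_transposed(t);
}
```

For example, a permuted view of 4-byte floats with `ne = {4, 3, 2, 1}` and `nb = {4, 32, 16, 96}` fails the strict check (the row stride is 32 bytes rather than 16) but passes the relaxed one, which is the situation described above for `K` and `Q`.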
While implementing this, I noticed that the Metal concurrency optimization, while improving PP speed, actually degrades TG speed on M2 Ultra. So I've added a flag to use it only when we are computing `n_batch > 1`. This leads to the TG speedup in the results below.

Ignore the comment above - preparing the allocator with concurrency enabled and then not using it produces incorrect results. Updated the tables below.
This PR also fixes a bug in `ggml_nbytes()` which does not affect `llama.cpp`.

Please give this branch thorough testing and let me know if you observe a similar speed-up on other Metal chips.
I did some perplexity checks for a few quants and everything looks normal, but let's double-check.
build: 71ca2fa (1221)