Faster Q5_K and Q6_K on Metal #2294

ikawrakow · 2023-07-20T14:10:19Z

Along the same lines as #2290. Here the speedup is not quite as large as for Q4_K, but still significant:

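| Model | Master (ms/t) | This PR (ms/t) | Speedup |
|---|---|---|---|
| Q5_K_S 7B | 26.2 | 22.8 | 14.9% |
| Q5_K_S 13B | 46.1 | 39.3 | 17.4% |
| Q5_K_S 33B | 115.5 | 97.0 | 19.1% |
| Q5_K_S 65B | 214.6 | 181.1 | 18.5% |
| Q6_K 7B | 25.6 | 24.6 | 4.1% |
| Q6_K 13B | 46.1 | 44.3 | 4.1% |
| Q6_K 33B | 116.6 | 111.0 | 5.0% |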

The table shows token generation time in ms/t (lower is better) on an M2 Max with a 30-core GPU. The system has 64 GB of RAM, and the 65B Q6_K model does not run successfully on it.
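For readers who want to check the Speedup column: the percentages appear to be computed as `master / this_PR - 1`, although the PR does not state the formula, so treat that as an assumption. The sketch below recomputes the column from the timings in the table; the helper name `speedup_percent` is made up for illustration. The published timings are rounded to one decimal, so a recomputed value can differ from the table in the last digit.

```python
# Timings copied from the table: (master, this PR) in ms per token.
timings = {
    "Q5_K_S 7B": (26.2, 22.8),
    "Q5_K_S 13B": (46.1, 39.3),
    "Q5_K_S 33B": (115.5, 97.0),
    "Q5_K_S 65B": (214.6, 181.1),
    "Q6_K 7B": (25.6, 24.6),
    "Q6_K 13B": (46.1, 44.3),
    "Q6_K 33B": (116.6, 111.0),
}


def speedup_percent(master_ms, pr_ms):
    """Assumed definition: relative throughput gain, (master / pr - 1) * 100."""
    return (master_ms / pr_ms - 1.0) * 100.0


for model, (master_ms, pr_ms) in timings.items():
    # 1000 / (ms per token) converts the timing to tokens per second.
    print(
        f"{model}: {speedup_percent(master_ms, pr_ms):.1f}% "
        f"({1000.0 / master_ms:.1f} -> {1000.0 / pr_ms:.1f} t/s)"
    )
```

Under this assumed definition, a 14.9% speedup means about 14.9% more tokens per second, not 14.9% less time per token.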

Kawrakow added 3 commits July 20, 2023 16:00

Faster Q6_K on Metal

fa9d54e

Faster Q5_K on Metal

463f420

Another Q5_K speedup

5f2e4bd

ikawrakow requested a review from ggerganov July 20, 2023 14:10

ggerganov approved these changes Jul 20, 2023

ikawrakow merged commit e782c9e into master Jul 20, 2023

ikawrakow mentioned this pull request Jul 20, 2023

Faster Q2_K on Metal #2297

Merged

j-f1 deleted the ik/metal_faster_q6k branch July 21, 2023 12:43
