Faster Q4_K on Metal #2290

Merged 1 commit into master on Jul 20, 2023

ikawrakow
Contributor

This PR improves token generation speed for Q4_K on Metal by roughly 21% to 38%, depending on model size, using ideas from PRs #2248, #2212 and #2188. The table gives token generation time in ms/t on an M2 Max with a 30-core GPU:

| Model | Master (ms/t) | This PR (ms/t) | Speedup |
| --- | --- | --- | --- |
| 7B | 23.7 | 19.5 | 21.5% |
| 13B | 41.4 | 31.9 | 29.8% |
| 33B | 99.4 | 73.6 | 35.0% |
| 65B | 194.1 | 141.0 | 37.7% |
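
The PR does not say how the Speedup column is derived. The numbers appear consistent with the relative throughput gain, `master / pr - 1`, and the sketch below checks that reading against the table. The function name `speedup_percent` and the dictionary layout are illustrative only and are not part of the PR. Because the published timings are rounded to 0.1 ms, the recomputed values can differ from the table by a few hundredths of a percentage point.

```python
# Token generation times in ms/t copied from the M2 Max table:
# model -> (master, this PR, speedup listed in the table in percent)
timings = {
    "7B": (23.7, 19.5, 21.5),
    "13B": (41.4, 31.9, 29.8),
    "33B": (99.4, 73.6, 35.0),
    "65B": (194.1, 141.0, 37.7),
}


def speedup_percent(master_ms: float, pr_ms: float) -> float:
    """Relative throughput gain in percent, assuming speedup = master / pr - 1."""
    return (master_ms / pr_ms - 1.0) * 100.0


for model, (master_ms, pr_ms, listed) in timings.items():
    computed = speedup_percent(master_ms, pr_ms)
    # Show the recomputed value next to the one listed in the table.
    print(f"{model}: computed {computed:.2f}%, listed {listed:.1f}%")
```

Under this reading, a lower ms/t figure means faster generation, and the gain grows with model size.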

ikawrakow requested a review from ggerganov on July 20, 2023 11:36
@ggerganov
Member

I can provide results from M1 Pro a bit later

ikawrakow merged commit 785829d into master on Jul 20, 2023
ikawrakow deleted the ik/metal_faster_q4k branch on July 20, 2023 12:18
@ggerganov
Member

ggerganov commented Jul 20, 2023

M1 Pro

| Model | Master | This PR |
| --- | --- | --- |
| 7B | 48.4 | 35.4 |
| 13B | ~~49.4~~ 91.4 | 63.2 |

@ikawrakow
Contributor Author

@ggerganov Any chance the 13B points are reversed? Else it would mean that on current master token prediction is about the same for 7B and 13B on your M1 Pro.

@ggerganov
Member

Sorry about that, somehow I messed up the 13B Master number
