vulkan: double buffer scale caches #12188

Merged: 1 commit merged into ggml-org:master from double_buffering on Mar 10, 2025

Conversation

netrunnereve
Collaborator

This double buffers the mat vec scale caches from #11081 so that the threads don't need to wait at a barrier before the cache is filled.

Performance-wise this is pretty much within 1% of master, a bit slower on my RX 470 and a bit faster on my W8100. It might show a bigger improvement on faster GPUs that are actually held up by the barrier (on my computer I can remove the barrier entirely, without double buffering, and the tests still pass).
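For readers unfamiliar with the pattern, here is a minimal sketch of the idea, with hypothetical names and bindings rather than the actual mul_mat_vec shader touched here: with two copies of the scale cache, each loop iteration fills the half that the previous iteration did not read, so only the post-fill barrier remains and the barrier that used to precede each refill can be dropped.

```glsl
#version 450
// Minimal sketch of double buffering a shared-memory scale cache.
// NOT the actual shader from this PR; all names (Scales, Out, p.num_blocks,
// sccache) and the workgroup size of 32 are illustrative assumptions.
layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer Scales { float scales[]; };
layout(std430, binding = 1) writeonly buffer Out    { float result[]; };
layout(push_constant) uniform Params { uint num_blocks; } p;

// Two halves instead of a single cache: iteration b fills half (b & 1).
shared float sccache[2][32];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    float acc = 0.0;
    for (uint b = 0; b < p.num_blocks; ++b) {
        const uint cur = b & 1u;
        // No barrier is needed before this store: it writes the half that the
        // previous iteration did not read, so slower threads still reading the
        // other half are not disturbed.
        sccache[cur][tid] = scales[b * 32u + tid];
        barrier(); // still needed so every thread sees the freshly filled half
        // Stand-in for the real per-block work; it reads another thread's
        // entry, which is what makes the shared cache and the barrier above
        // necessary in the first place.
        acc += sccache[cur][(tid + 1u) & 31u];
    }
    result[gl_GlobalInvocationID.x] = acc;
}
```

The tradeoff is twice the shared memory for the cache, which is what matters for the mat mul case discussed further down.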

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 4, 2025
@jeffbolznv
Collaborator

When the workgroup size equals the subgroup size, the barrier becomes free or nearly free. I haven't had a chance to test this yet, but I don't expect gains on NVIDIA GPUs.

@jeffbolznv
Collaborator

I did a quick test with Qwen2.5-7B-Instruct-1M-Q2_K.gguf; there may be a very small gain (<1%), but it's hard to tell whether it's just noise.

@netrunnereve
Collaborator Author

Yeah there's probably not much of a difference here for mat vec. I'm trying to get this set up for mat mul as well, though with limited shared memory there's a tradeoff between a single large buffer and a smaller double buffer.

@0cc4m
Collaborator

0cc4m commented Mar 8, 2025

I see no difference on Nvidia or AMD, but I do see a decent performance increase in q2_k and q3_k on Intel A770. Not on q6_k, but Intel has never liked q6_k for whatever reason and doesn't perform well with it either way.

If this doesn't have a negative effect on other hardware, I think it's good.

@netrunnereve
Collaborator Author

I'm trying to get this set up for mat mul as well, though with limited shared memory there's a tradeoff between a single large buffer and a smaller double buffer.

The mat mul experiment ended up being a failure, as the extra shared memory slowed things down far more than the barrier did. Since the shared memory is, well, shared among all the subgroups running on the core, the larger allocation reduces the number of subgroups that can run at a time.
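As a rough illustration with made-up numbers: if a core has 64 KB of shared memory and the single-buffered mat mul tiles already take 32 KB per workgroup, two workgroups can be resident at once; double buffering the same tiles would need 64 KB, dropping that to one, so the occupancy lost to the extra shared memory can easily outweigh the barrier time saved.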

If this doesn't have a negative effect on other hardware, I think it's good.

Considering that this doesn't affect Nvidia or AMD and makes Intel faster, I think we're fine.

netrunnereve merged commit 2c9f833 into ggml-org:master on Mar 10, 2025
43 checks passed
netrunnereve deleted the double_buffering branch on March 10, 2025 at 19:28
@LostRuins
Collaborator

How much of a perf improvement did you see on the A770?

jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025