vulkan: double buffer scale caches #12188

Merged: 1 commit merged into ggml-org:master from double_buffering on Mar 10, 2025

Conversation

netrunnereve
Collaborator

This double buffers the mat vec scale caches from #11081 so that the threads don't need to wait at a barrier before the cache is filled.

Performance-wise this is pretty much within 1% of master, a bit slower on my RX 470 and a bit faster on my W8100. It might show a bigger improvement on faster GPUs that are actually held up by the barrier (on my computer I can remove the barrier entirely, without double buffering, and the tests still pass).
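For readers unfamiliar with the pattern, here is a minimal sketch of the idea, with hypothetical names and bindings rather than the actual mul_mat_vec shader touched here: with two copies of the scale cache, each loop iteration fills the half that the previous iteration did not read, so only the post-fill barrier remains and the barrier that used to precede each refill can be dropped.

```glsl
#version 450
// Minimal sketch of double buffering a shared-memory scale cache.
// NOT the actual shader from this PR; all names (Scales, Out, p.num_blocks,
// sccache) and the workgroup size of 32 are illustrative assumptions.
layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer Scales { float scales[]; };
layout(std430, binding = 1) writeonly buffer Out    { float result[]; };
layout(push_constant) uniform Params { uint num_blocks; } p;

// Two halves instead of a single cache: iteration b fills half (b & 1).
shared float sccache[2][32];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    float acc = 0.0;
    for (uint b = 0; b < p.num_blocks; ++b) {
        const uint cur = b & 1u;
        // No barrier is needed before this store: it writes the half that the
        // previous iteration did not read, so slower threads still reading the
        // other half are not disturbed.
        sccache[cur][tid] = scales[b * 32u + tid];
        barrier(); // still needed so every thread sees the freshly filled half
        // Stand-in for the real per-block work; it reads another thread's
        // entry, which is what makes the shared cache and the barrier above
        // necessary in the first place.
        acc += sccache[cur][(tid + 1u) & 31u];
    }
    result[gl_GlobalInvocationID.x] = acc;
}
```

The tradeoff is twice the shared memory for the cache, which is what matters for the mat mul case discussed further down.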

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 4, 2025
@jeffbolznv
Collaborator

When the workgroup size equals the subgroup size, the barrier becomes free or nearly free. I haven't had a chance to test this yet, but I don't expect gains on NVIDIA GPUs.

@jeffbolznv
Collaborator

I did a quick test with Qwen2.5-7B-Instruct-1M-Q2_K.gguf; there may be a very small gain (<1%), but it's hard to tell whether it's just noise.

@netrunnereve
Collaborator Author

Yeah there's probably not much of a difference here for mat vec. I'm trying to get this set up for mat mul as well, though with limited shared memory there's a tradeoff between a single large buffer and a smaller double buffer.

@0cc4m
Collaborator

0cc4m commented Mar 8, 2025

I see no difference on Nvidia or AMD, but I do see a decent performance increase in q2_k and q3_k on Intel A770. Not on q6_k, but Intel has never liked q6_k for whatever reason and doesn't perform well with it either way.

If this doesn't have a negative effect on other hardware, I think it's good.

@netrunnereve
Collaborator Author

I'm trying to get this set up for mat mul as well, though with limited shared memory there's a tradeoff between a single large buffer and a smaller double buffer.

The mat mul experiment ended up being a failure, as the extra shared memory slowed things down far more than the barrier did. Since the shared memory is, well, shared among all the subgroups running on the core, the larger allocation reduces the number of subgroups that can run at a time.
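As a rough illustration with made-up numbers: if a core has 64 KB of shared memory and the single-buffered mat mul tiles already take 32 KB per workgroup, two workgroups can be resident at once; double buffering the same tiles would need 64 KB, dropping that to one, so the occupancy lost to the extra shared memory can easily outweigh the barrier time saved.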

If this doesn't have a negative effect on other hardware, I think it's good.

Considering that this doesn't affect Nvidia or AMD and makes Intel faster, I think we're fine.

netrunnereve merged commit 2c9f833 into ggml-org:master on Mar 10, 2025
43 checks passed
netrunnereve deleted the double_buffering branch on March 10, 2025 at 19:28
@LostRuins
Collaborator

How much of a perf improvement did you see on the A770?

jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025