Research: Performance differences between Metal (macOS) and Vulkan (Linux) #10982
Comments
Hi, welcome. So this is using the Honeykrisp driver, right? What's the state of the shader compiler there? Do you expect it to be generating reasonably optimal code?
The backends are separate and there's no guarantee that things are implemented the same way between Vulkan and Metal. For the benchmarks you're looking at the Vulkan shaders involved are pretty well-tuned so I don't expect that to be the issue at the source level.
Please try running …
Most time will be spent in mul_mat_vec_q4_k.comp (for token generation) and mul_mm.comp (for prompt processing). I'm surprised token generation is 3x slower in Vulkan; I suspect an issue with the shader compiler's code generation. Prompt processing in the Metal backend uses simdgroup matrices; you'd need to support cooperative matrix in Vulkan to get access to those, and I don't think you'll be able to get competitive performance without it. pp512 is maybe 3x faster with coopmat than without on other platforms (depends on model, GPU, etc.).
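To frame the token-generation numbers: tg is usually memory-bandwidth bound, since each generated token streams essentially all of the weights once. A back-of-the-envelope roofline sketch (the model size and bandwidth figures below are assumptions for illustration, not measurements from this hardware):

```python
# Rough upper bound on token generation (tg) speed if streaming the
# weights were the only cost. Illustrative numbers, not measurements.

def tg_ceiling_tok_s(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Tokens/s ceiling when every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / model_bytes

model_bytes = 4.5e9  # assumed: ~8B params at Q4_K, roughly 4.5 GB
print(f"{tg_ceiling_tok_s(model_bytes, 400):.0f} tok/s at 400 GB/s")
print(f"{tg_ceiling_tok_s(model_bytes, 100):.0f} tok/s at 100 GB/s")
```

If measured tg sits far below this ceiling while copy tests show good bandwidth, the suspect is the shader (or the synchronization around it) rather than raw memory throughput.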
There's a pipeline barrier between almost every pair of dispatches; see ggml_vk_sync_buffers. Some dispatches are quite small. IMO the first step would be to compare MUL_MAT Q4_K performance with n==1 (an existing test in test-backend-ops) between Metal and Vulkan. This is the mul_mat_vec_q4_k.comp shader.
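As a toy model of why a barrier after (almost) every dispatch matters more when dispatches are small, consider a fixed per-barrier overhead added to each dispatch (all numbers below are invented for illustration, not profiled from ggml_vk_sync_buffers):

```python
# Toy model: serialized execution with a fixed per-barrier overhead
# added to every dispatch. Small dispatches suffer proportionally more.
# All numbers are invented for illustration.

def total_ms(n_dispatches: int, work_us_each: float, barrier_us: float) -> float:
    """Total serialized execution time in milliseconds."""
    return n_dispatches * (work_us_each + barrier_us) / 1000.0

# Large dispatches: the barrier adds ~20% overhead.
print(total_ms(1000, work_us_each=50, barrier_us=10))  # vs. 50 ms ideal
# Small dispatches: the same barrier cost triples the total time.
print(total_ms(1000, work_us_each=5, barrier_us=10))   # vs. 5 ms ideal
```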
@alyssarosenzweig can probably chime in with more specifics. The compiler itself is shared with the GL driver and has several years of development at this point (and uses lots of shared Mesa infrastructure), so it's not particularly new even though Honeykrisp is.
I think I tried a llama.cpp build from a few months ago and it was noticeably slower, which makes me think there's probably still low-hanging fruit on this side? (unless major optimizations happened recently and no more are expected). If you think it might be insightful, I can try to bisect the performance improvement to see if it was something interesting / unexpected.
Ah, that's what I was looking for, thanks! I'll do some comparisons with Metal. A priori, one of the copy types reports … Having single-shader tests like this is very helpful, since we can outright dump the shader assembly (and pipeline config) from macOS and Linux and compare (and even manually bisect differences). Much nicer than testing games... ^^;;
That explains a big factor, then: we don't have coopmat wired up yet. I'll look into what it would take to add that.
A lot has happened in the last few months. The Vulkan path is generally within about 10% of the CUDA path for token generation at least on my system (RTX 4070 using drivers from https://developer.nvidia.com/vulkan-driver). There are some knobs like in #10846 that might help things a bit on Apple hardware.
Depending on the test, 277/400 may be quite good. And the bandwidth-limited shaders aren't the majority of the time in these models.
Was it known to be significantly slower a few months ago on that hardware? What I'm wondering is whether it's possible some smaller change had an outsized perf impact on our platform, and whether bisecting it could lead us somewhere. The previous version I was using was b3873 (from October 3), and that one gives these numbers:
So unless something major happened to the Vulkan backend that would explain a 2-3x performance improvement in the last 2-3 months, maybe it's worth bisecting that and seeing how it happened?
It's a copy test, so that's 277 GB/s read + 277 GB/s write, which adds up to more than 400 GB/s, right? Which I was guessing means the working set is small enough that a significant chunk fits in the cache hierarchy, which is why it's faster than DRAM bandwidth.
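Spelling out that arithmetic (the 277 GB/s per-direction figure is from the numbers above; 400 GB/s is the nominal M2 Max memory bandwidth):

```python
# A copy test moves every byte twice: one read plus one write. So a
# reported 277 GB/s per direction implies 554 GB/s of total traffic,
# which exceeds the ~400 GB/s DRAM limit; the excess must be served
# by the cache hierarchy.

reported_gb_s = 277        # per-direction rate from the copy test
dram_limit_gb_s = 400      # nominal M2 Max memory bandwidth

total_traffic_gb_s = 2 * reported_gb_s
print(total_traffic_gb_s)                     # 554
print(total_traffic_gb_s > dram_limit_gb_s)   # True
```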
Quite a bit of shader and backend optimization happened over the last months on the Vulkan backend. But it's most optimized on Nvidia and AMD hardware. No performance tuning has happened for Apple hardware, and UMA buffer handling can probably be improved. But it's good to hear that performance is increasing significantly.
Might be worth comparing the Kompute Vulkan backend on Asahi as well.
I tried Kompute and it was significantly worse...
Recently I did some tinkering to help the Vulkan backend in llama.cpp function properly on macOS (with MoltenVK). As far as I can tell, the recent Vulkan and Metal backends run at similar speeds on macOS with Apple Silicon (though people report the Vulkan backend is much faster on older Intel-based Macs with AMD GPUs). Considering MoltenVK mostly performs straightforward shader translation from SPIR-V to MSL with SPIRV-Cross, without much IR-level transformation/optimization, I would say the Vulkan backend is at least as good as the Metal backend, if not better. I would guess a few directions: 1) backend code generation or instruction scheduling: maybe the Apple driver is emitting better code even from comparably-optimized IR; 2) memory management; or 3) power management (I remember the days when nouveau could not reach the full performance potential of NVIDIA GPUs because it could not do frequency scaling like the proprietary driver; not sure if Asahi has similar issues).
PM is firmware-managed, so it should not be a factor. There's the known factor of missing coopmat, and I guess other than that, I should start looking at shader dumps...
I don't think MoltenVK supports coopmat as of now. Interestingly, the Metal backend indeed uses simdgroup matrices (which are kind of similar to coopmat, I guess). I guess the shader dump is worth looking at. Maybe you can even try MoltenVK on macOS (you'll need my experimental fix for correct results). In that case you can get more comparable shader programs that are based on the same set of SPIR-V.
Just had some spare time to run the bench mode of … Just for reference, for this test …
With n<=8 that test doesn't use coopmat; it just uses the mul_mat_vec_q4_k.comp shader. There's a knob in ggml-vulkan.cpp that affects this shader that you could easily experiment with:
Hello here, I've been playing with the topic of Vulkan vs Metal, but from a slightly different angle: I'm trying to measure (and optimize) the performance of Vulkan/Linux when running in a … When I compare the llama.cpp inference performance for macOS Vulkan (…), … However, the performance of …
The first plot shows the tests giving GFLOPS results, the second one the … I'm curious how you'll interpret these results, in particular the saw-tooth pattern in the Vulkan results. Is it different choices of algorithms that make the results better/worse (as in, optimized for one case and inefficient for the next one)? This part of the GFLOPS plot seems to be representative of the inference-test Vulkan slowdown. And the … Happy to explain any part of the test harness that may not be clear, or to run further experiments if that can help in understanding what is happening.
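One generic cause of saw-tooth GFLOPS curves (offered as a hypothesis, not an analysis of the actual shaders here): tiled matmul kernels round the problem size up to a multiple of the tile size, and the padded fraction is wasted work. A sketch using a hypothetical 64-wide tile:

```python
# Efficiency of a tiled kernel that pads the problem size n up to a
# multiple of the tile width. Efficiency is exactly 1.0 at multiples
# of the tile size and drops sharply just past them, producing a
# saw-tooth curve. The 64-wide tile is a hypothetical example.

def tile_efficiency(n: int, tile: int) -> float:
    padded = -(-n // tile) * tile  # round n up to a multiple of tile
    return n / padded

for n in (64, 65, 96, 128, 129):
    print(n, round(tile_efficiency(n, 64), 3))
```

If the dips in the plot line up with sizes just past a power-of-two boundary, tile padding (or a switch between kernel variants at those sizes) would be a plausible explanation.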
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'm one of the developers for the Asahi Linux GPU drivers, which provide accelerated Vulkan and OpenGL support on Apple Silicon platforms. I'm interested in improving the performance of llama.cpp on our drivers with the Vulkan backend.
As things stand today, macOS is significantly faster in a quick test with llama-bench, with default settings (tested on an M2 Max 64GB):

Linux: …

macOS: …
(I also tested a larger 70B model which failed to load due to failing to allocate memory on Linux, but that's obviously a separate issue that's easy to debug. Probably just a hardcoded alloc size limit in the driver we can raise, since we recently refactored a bunch of stuff to handle >4G buffers properly.)
Of course, we'd like to improve the driver where possible to make things faster. However, since I know nothing about how LLMs are implemented under the hood, or about the state of the llama.cpp Metal and Vulkan backends, I would like to ask for help figuring out the perf issues, and analyzing whether llama.cpp itself could also be part of the root cause.
Would you be able to help us out? I'm curious about these things: