change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU #12632

Merged 1 commit into ggml-org:master on Mar 29, 2025

Conversation

@Djip007 (Contributor) commented Mar 28, 2025

This allows the GPU host buffer types to be used, when possible, in preference to the CPU repack buffers.
It has the same effect of resolving issue #12459, but without completely disabling the CPU extra buffers.
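
To illustrate the new ordering, here is a minimal C++ sketch of the priority list this PR builds. This is not the actual llama.cpp implementation: `get_cpu_extra_buffer_types()` is a hypothetical stand-in for the backend-registry lookup the real code performs, while the other calls are the public ggml-backend API.

```cpp
// Sketch of the new cpu_buft_list ordering: ACCEL -> GPU host -> CPU extra -> CPU.
#include <vector>
#include "ggml-backend.h"

// hypothetical helper: returns the CPU "extra" buffer types (e.g. Q4_0 repack
// layouts); the real code queries these through the backend registry
std::vector<ggml_backend_buffer_type_t> get_cpu_extra_buffer_types();

std::vector<ggml_backend_buffer_type_t> make_cpu_buft_list_sketch(
        const std::vector<ggml_backend_dev_t> & devices) {
    std::vector<ggml_backend_buffer_type_t> bufts;

    // 1. ACCEL device buffer types first
    for (ggml_backend_dev_t dev : devices) {
        if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_ACCEL) {
            bufts.push_back(ggml_backend_dev_buffer_type(dev));
        }
    }

    // 2. host (pinned) buffer types of the GPU devices: weights placed here keep
    //    a plain layout that the GPU can also read, so they are now preferred
    //    over the repacked CPU layout
    for (ggml_backend_dev_t dev : devices) {
        if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU) {
            ggml_backend_buffer_type_t host = ggml_backend_dev_host_buffer_type(dev);
            if (host != nullptr) {
                bufts.push_back(host);
            }
        }
    }

    // 3. CPU extra buffer types (repack), still used when no GPU host buffer applies
    for (ggml_backend_buffer_type_t buft : get_cpu_extra_buffer_types()) {
        bufts.push_back(buft);
    }

    // 4. plain CPU buffer type as the final fallback
    bufts.push_back(ggml_backend_cpu_buffer_type());
    return bufts;
}
```

The key point is step 2: when a GPU is present, its host buffers now outrank the repack buffers, so repacking only happens when no GPU host buffer is available.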

Some benchmarks:
on AMD Ryzen 9 5950X 16-Core Processor + AMD Radeon RX 6900 XT
with Mistral-Nemo-Instruct-2407-Q4_0.gguf

For reference with pure CPU+repack:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp1 | 7.20 ± 0.11 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp1 | 7.25 ± 0.00 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp2 | 11.61 ± 0.01 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp4 | 27.39 ± 0.02 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp8 | 37.06 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp16 | 62.30 ± 0.03 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp32 | 65.11 ± 0.02 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp48 | 66.10 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp64 | 66.58 ± 0.01 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp128 | 67.17 ± 0.05 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp192 | 66.67 ± 0.06 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp256 | 66.42 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp384 | 65.44 ± 0.05 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp512 | 61.43 ± 0.78 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp768 | 61.88 ± 0.74 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | tg16 | 7.26 ± 0.00 |

For reference, with full GPU offload using the Vulkan backend:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp1 | 57.50 ± 0.45 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp1 | 57.71 ± 1.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp2 | 108.35 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp4 | 176.74 ± 2.18 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp8 | 205.79 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp16 | 195.33 ± 1.22 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp32 | 411.28 ± 1.76 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp48 | 377.55 ± 0.73 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp64 | 542.27 ± 2.74 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp128 | 632.02 ± 3.87 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp192 | 668.72 ± 1.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp256 | 690.73 ± 3.59 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp384 | 718.63 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp512 | 720.74 ± 0.51 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp768 | 704.29 ± 0.30 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | tg16 | 57.90 ± 0.08 |

Now with partial GPU offloading + CPU repack:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none

| model | backend | ngl | test | before t/s | #12459 t/s | this patch t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | Vulkan | 20 | pp1 | 11.15 ± 0.13 | 11.21 ± 0.08 | 11.20 ± 0.08 |
| llama 13B Q4_0 | Vulkan | 20 | pp2 | 18.50 ± 0.13 | 20.89 ± 0.89 | 21.44 ± 0.23 |
| llama 13B Q4_0 | Vulkan | 20 | pp4 | 40.44 ± 0.40 | 40.70 ± 0.42 | 40.62 ± 0.28 |
| llama 13B Q4_0 | Vulkan | 20 | pp8 | 54.91 ± 0.34 | 59.45 ± 2.17 | 60.13 ± 0.41 |
| llama 13B Q4_0 | Vulkan | 20 | pp16 | 83.87 ± 0.22 | 68.18 ± 0.28 | 68.10 ± 0.15 |
| llama 13B Q4_0 | Vulkan | 20 | pp32 | 95.53 ± 4.98 | 79.60 ± 0.17 | 79.79 ± 0.24 |
| llama 13B Q4_0 | Vulkan | 20 | pp48 | 103.74 ± 1.45 | 102.88 ± 0.11 | 103.11 ± 0.36 |
| llama 13B Q4_0 | Vulkan | 20 | pp64 | 112.64 ± 0.22 | 140.70 ± 0.26 | 141.55 ± 0.13 |
| llama 13B Q4_0 | Vulkan | 20 | pp128 | 121.09 ± 1.46 | 228.04 ± 0.25 | 228.93 ± 0.61 |
| llama 13B Q4_0 | Vulkan | 20 | pp192 | 122.81 ± 2.19 | 288.27 ± 0.62 | 289.21 ± 0.33 |
| llama 13B Q4_0 | Vulkan | 20 | pp256 | 123.81 ± 0.07 | 332.01 ± 0.38 | 332.97 ± 0.34 |
| llama 13B Q4_0 | Vulkan | 20 | pp384 | 120.27 ± 1.39 | 394.76 ± 0.39 | 395.88 ± 0.17 |
| llama 13B Q4_0 | Vulkan | 20 | pp512 | 112.97 ± 2.42 | 426.87 ± 1.37 | 423.95 ± 0.57 |
| llama 13B Q4_0 | Vulkan | 20 | pp768 | 116.93 ± 1.11 | 386.33 ± 0.53 | 391.37 ± 9.33 |
| llama 13B Q4_0 | Vulkan | 20 | tg16 | 11.04 ± 0.19 | 11.23 ± 0.01 | 11.18 ± 0.03 |

So it looks good to me.

@ggerganov requested a review from @slaren on Mar 29, 2025 at 09:26
@slaren merged commit 0bb2919 into ggml-org:master on Mar 29, 2025
48 checks passed
@jklincn (Contributor) commented Apr 10, 2025

Maybe we should remove this now-incorrect comment:

// add extra buffer types, only if no GPU device is present

Also, the problem has since been solved by a different method than the one mentioned in the referenced issue:

// ref: https://github.com/ggml-org/llama.cpp/issues/12481#issuecomment-2743136094
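
For illustration, one possible rewording of that stale comment (a sketch of the suggestion only, not a committed patch):

```cpp
// current comment (stale: extra buffer types are now added even when a GPU is present):
// add extra buffer types, only if no GPU device is present

// possible replacement, matching the ordering introduced in #12632:
// add extra buffer types, after the GPU host buffer types
```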

@Djip007 (Contributor, Author) commented May 23, 2025

Yes, it looks like I forgot to update the comments... my bad.
Do you want to open a PR?
