change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU #12632

Merged 1 commit into ggml-org:master on Mar 29, 2025

Conversation

@Djip007 (Contributor) commented Mar 28, 2025

This allows the GPU host buffer types to be used, when possible, in preference to the CPU repack buffers.
It has the same effect of resolving issue #12459, but without completely disabling the CPU extra buffers.
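
To illustrate the new ordering, here is a minimal C++ sketch of the priority list this PR builds. This is not the actual llama.cpp implementation: `get_cpu_extra_buffer_types()` is a hypothetical stand-in for the backend-registry lookup the real code performs, while the other calls are the public ggml-backend API.

```cpp
// Sketch of the new cpu_buft_list ordering: ACCEL -> GPU host -> CPU extra -> CPU.
#include <vector>
#include "ggml-backend.h"

// hypothetical helper: returns the CPU "extra" buffer types (e.g. Q4_0 repack
// layouts); the real code queries these through the backend registry
std::vector<ggml_backend_buffer_type_t> get_cpu_extra_buffer_types();

std::vector<ggml_backend_buffer_type_t> make_cpu_buft_list_sketch(
        const std::vector<ggml_backend_dev_t> & devices) {
    std::vector<ggml_backend_buffer_type_t> bufts;

    // 1. ACCEL device buffer types first
    for (ggml_backend_dev_t dev : devices) {
        if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_ACCEL) {
            bufts.push_back(ggml_backend_dev_buffer_type(dev));
        }
    }

    // 2. host (pinned) buffer types of the GPU devices: weights placed here keep
    //    a plain layout that the GPU can also read, so they are now preferred
    //    over the repacked CPU layout
    for (ggml_backend_dev_t dev : devices) {
        if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU) {
            ggml_backend_buffer_type_t host = ggml_backend_dev_host_buffer_type(dev);
            if (host != nullptr) {
                bufts.push_back(host);
            }
        }
    }

    // 3. CPU extra buffer types (repack), still used when no GPU host buffer applies
    for (ggml_backend_buffer_type_t buft : get_cpu_extra_buffer_types()) {
        bufts.push_back(buft);
    }

    // 4. plain CPU buffer type as the final fallback
    bufts.push_back(ggml_backend_cpu_buffer_type());
    return bufts;
}
```

The key point is step 2: when a GPU is present, its host buffers now outrank the repack buffers, so repacking only happens when no GPU host buffer is available.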

Some benchmarks:
on AMD Ryzen 9 5950X 16-Core Processor + AMD Radeon RX 6900 XT
with Mistral-Nemo-Instruct-2407-Q4_0.gguf

For reference with pure CPU+repack:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp1 | 7.20 ± 0.11 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp1 | 7.25 ± 0.00 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp2 | 11.61 ± 0.01 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp4 | 27.39 ± 0.02 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp8 | 37.06 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp16 | 62.30 ± 0.03 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp32 | 65.11 ± 0.02 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp48 | 66.10 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp64 | 66.58 ± 0.01 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp128 | 67.17 ± 0.05 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp192 | 66.67 ± 0.06 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp256 | 66.42 ± 0.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp384 | 65.44 ± 0.05 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp512 | 61.43 ± 0.78 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | pp768 | 61.88 ± 0.74 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | CPU | 16 | tg16 | 7.26 ± 0.00 |

For reference, with full GPU offload using the Vulkan backend:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp1 | 57.50 ± 0.45 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp1 | 57.71 ± 1.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp2 | 108.35 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp4 | 176.74 ± 2.18 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp8 | 205.79 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp16 | 195.33 ± 1.22 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp32 | 411.28 ± 1.76 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp48 | 377.55 ± 0.73 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp64 | 542.27 ± 2.74 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp128 | 632.02 ± 3.87 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp192 | 668.72 ± 1.04 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp256 | 690.73 ± 3.59 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp384 | 718.63 ± 2.20 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp512 | 720.74 ± 0.51 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | pp768 | 704.29 ± 0.30 |
| llama 13B Q4_0 | 7.06 GiB | 12.25 B | Vulkan | 99 | tg16 | 57.90 ± 0.08 |

Now with partial GPU offloading + CPU repack:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none

| model | backend | ngl | test | before t/s | #12459 t/s | this patch t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_0 | Vulkan | 20 | pp1 | 11.15 ± 0.13 | 11.21 ± 0.08 | 11.20 ± 0.08 |
| llama 13B Q4_0 | Vulkan | 20 | pp2 | 18.50 ± 0.13 | 20.89 ± 0.89 | 21.44 ± 0.23 |
| llama 13B Q4_0 | Vulkan | 20 | pp4 | 40.44 ± 0.40 | 40.70 ± 0.42 | 40.62 ± 0.28 |
| llama 13B Q4_0 | Vulkan | 20 | pp8 | 54.91 ± 0.34 | 59.45 ± 2.17 | 60.13 ± 0.41 |
| llama 13B Q4_0 | Vulkan | 20 | pp16 | 83.87 ± 0.22 | 68.18 ± 0.28 | 68.10 ± 0.15 |
| llama 13B Q4_0 | Vulkan | 20 | pp32 | 95.53 ± 4.98 | 79.60 ± 0.17 | 79.79 ± 0.24 |
| llama 13B Q4_0 | Vulkan | 20 | pp48 | 103.74 ± 1.45 | 102.88 ± 0.11 | 103.11 ± 0.36 |
| llama 13B Q4_0 | Vulkan | 20 | pp64 | 112.64 ± 0.22 | 140.70 ± 0.26 | 141.55 ± 0.13 |
| llama 13B Q4_0 | Vulkan | 20 | pp128 | 121.09 ± 1.46 | 228.04 ± 0.25 | 228.93 ± 0.61 |
| llama 13B Q4_0 | Vulkan | 20 | pp192 | 122.81 ± 2.19 | 288.27 ± 0.62 | 289.21 ± 0.33 |
| llama 13B Q4_0 | Vulkan | 20 | pp256 | 123.81 ± 0.07 | 332.01 ± 0.38 | 332.97 ± 0.34 |
| llama 13B Q4_0 | Vulkan | 20 | pp384 | 120.27 ± 1.39 | 394.76 ± 0.39 | 395.88 ± 0.17 |
| llama 13B Q4_0 | Vulkan | 20 | pp512 | 112.97 ± 2.42 | 426.87 ± 1.37 | 423.95 ± 0.57 |
| llama 13B Q4_0 | Vulkan | 20 | pp768 | 116.93 ± 1.11 | 386.33 ± 0.53 | 391.37 ± 9.33 |
| llama 13B Q4_0 | Vulkan | 20 | tg16 | 11.04 ± 0.19 | 11.23 ± 0.01 | 11.18 ± 0.03 |

So it looks good to me.

@ggerganov requested a review from @slaren on Mar 29, 2025 at 09:26
@slaren merged commit 0bb2919 into ggml-org:master on Mar 29, 2025
48 checks passed
@jklincn (Contributor) commented Apr 10, 2025

Maybe we should remove this now-incorrect comment:

// add extra buffer types, only if no GPU device is present

Also, the problem has since been solved by a different method than the one mentioned in the referenced issue:

// ref: https://github.com/ggml-org/llama.cpp/issues/12481#issuecomment-2743136094
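
For illustration, one possible rewording of that stale comment (a sketch of the suggestion only, not a committed patch):

```cpp
// current comment (stale: extra buffer types are now added even when a GPU is present):
// add extra buffer types, only if no GPU device is present

// possible replacement, matching the ordering introduced in #12632:
// add extra buffer types, after the GPU host buffer types
```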

@Djip007 (Contributor, Author) commented May 23, 2025

Yes, it looks like I forgot to update the comments... my bad.
Do you want to open a PR?
