Eval bug: -sm row performance on NVidia multi-GPU config is extremely low at long contexts after b3990 #11510


Closed
m-arbaro opened this issue Jan 30, 2025 · 3 comments

Comments

m-arbaro commented Jan 30, 2025

Name and Version

b3990/b3989

Operating systems

Linux

GGML backends

CUDA

Hardware

NVidia RTX 3090 + 3x Tesla P40, full offload

Models

Meta-Llama-3.3-70B-Instruct-Q6_K.gguf, reproduced on multiple models.

Problem description & steps to reproduce

After updating to b3990, inference speed at long context drops off much faster.
Inference on the affected build:

llama_perf_sampler_print: sampling time = 3849.30 ms / 19008 runs ( 0.20 ms per token, 4938.04 tokens per second)
llama_perf_context_print: load time = 10529.10 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 185789.79 ms / 454 runs ( 409.23 ms per token, 2.44 tokens per second)
llama_perf_context_print: total time = 192330.87 ms / 455 tokens

Inference on the unaffected build:

llama_perf_context_print: load time = 10386.38 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 72821.88 ms / 454 runs ( 160.40 ms per token, 6.23 tokens per second)
llama_perf_context_print: total time = 79340.13 ms / 455 tokens

Inference with --split-mode layer is the same on both commits:
llama_perf_sampler_print: sampling time = 3848.68 ms / 19008 runs ( 0.20 ms per token, 4938.84 tokens per second)
llama_perf_context_print: load time = 10364.01 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 228173.97 ms / 454 runs ( 502.59 ms per token, 1.99 tokens per second)
llama_perf_context_print: total time = 234689.68 ms / 455 tokens
llama_perf_sampler_print: sampling time = 3792.58 ms / 19008 runs ( 0.20 ms per token, 5011.89 tokens per second)

First Bad Commit

b3990

Relevant log output

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm row --prompt-cache models/memory/pc19000.tmp

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm layer --prompt-cache models/memory/pc19000.tmp
slaren (Member) commented Jan 30, 2025

After b3990, the context is distributed along all the GPUs, while previously it was stored entirely in the main GPU. So if you have a much faster main GPU than the rest, a drop in performance is expected. We could maybe add the possibility of configuring where to store the KV cache in a similar way to #11397.
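
A minimal, purely illustrative C++ sketch of the two placement policies described above (the struct and function names here are invented for this example and are not llama.cpp's actual KV-cache code): before b3990 every layer's KV buffer lands on the main GPU; after b3990 it follows the GPU that owns that layer's weights.

```cpp
// Illustrative sketch only, assuming a per-layer device assignment is known.
#include <cstdio>
#include <vector>

struct KVLayer {
    int layer;   // transformer layer index
    int device;  // GPU the KV buffer for this layer is allocated on
};

// layer_device[i] = GPU that holds layer i's weights; main_gpu = --main-gpu device.
std::vector<KVLayer> place_kv(const std::vector<int> & layer_device,
                              int main_gpu, bool distribute) {
    std::vector<KVLayer> kv;
    for (int i = 0; i < (int) layer_device.size(); ++i) {
        // distribute == false -> pre-b3990 behaviour (everything on the main GPU)
        // distribute == true  -> post-b3990 behaviour (per-layer device)
        kv.push_back({ i, distribute ? layer_device[i] : main_gpu });
    }
    return kv;
}

int main() {
    // 4 GPUs, layers assigned round-robin purely for illustration.
    std::vector<int> layer_device;
    for (int i = 0; i < 8; ++i) layer_device.push_back(i % 4);

    for (bool distribute : { false, true }) {
        std::printf("%s:\n", distribute ? "post-b3990 (per-layer device)"
                                        : "pre-b3990 (main GPU only)");
        for (const auto & kv : place_kv(layer_device, /*main_gpu=*/0, distribute)) {
            std::printf("  layer %d -> device %d\n", kv.layer, kv.device);
        }
    }
    return 0;
}
```

Running it just prints where each layer's KV buffer would end up under the two policies.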

city96 (Contributor) commented Feb 1, 2025

I did notice a slight regression as well when that PR was merged. I have a similar setup (Tesla V100s PCIe + 2x Tesla P40, with the P40s power limited most of the time).

Forcing the KV cache to be on device 0 does seem to help on longer context tasks. Prompt processing seems slightly slower. All cards are on CPU-connected PCIe 3.0 x16 links, so this may change depending on interconnect speed/latency.

Code modified at this line by replacing i with 0 (i.e. just forcing everything to use the same device as the first layer).
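
Expressed against the illustrative sketch in the comment above (again hypothetical, not the real llama.cpp source that was patched), the workaround amounts to pinning every KV layer buffer to the first layer's device instead of layer i's device:

```cpp
// Same shape as place_kv() from the sketch above, with the "i -> 0" change applied.
#include <vector>

struct KVLayerPinned {
    int layer;
    int device;
};

std::vector<KVLayerPinned> place_kv_dev0(const std::vector<int> & layer_device) {
    std::vector<KVLayerPinned> kv;
    for (int i = 0; i < (int) layer_device.size(); ++i) {
        kv.push_back({ i, layer_device[0] });  // was layer_device[i] before the patch
    }
    return kv;
}
```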

Benchmark done with this PR applied as a patch to test long context: #11126

./build/bin/llama-bench -fa 1 -m /mnt/models/command-r-v01-q8_0.bin -ts 6/5/5 -sm row -r 3 -p 0 -n 0 -gp 0,32 -gp 1024,32 -gp 4096,32 -gp 8192,32

| branch  | size      | params  | backend | ngl | sm  | fa | ts    | test        | t/s          |
| ------- | --------: | ------: | ------- | --: | --- | -: | ----- | ----------: | -----------: |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | pp512       | 78.41 ± 0.02 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp0    | 15.56 ± 0.03 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp1024 | 15.07 ± 0.02 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp4096 | 12.71 ± 0.05 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp8192 | 10.50 ± 0.06 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | pp512       | 75.72 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp0    | 15.32 ± 0.01 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp1024 | 15.32 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp4096 | 14.24 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp8192 | 12.99 ± 0.02 |

github-actions bot added the stale label Mar 4, 2025

github-actions bot commented: This issue was closed because it has been inactive for 14 days since being marked as stale.
