Eval bug: -sm row performance on NVidia multi-GPU config is extremely low at long contexts after b3990 #11510


Closed
m-arbaro opened this issue Jan 30, 2025 · 3 comments

Comments

m-arbaro commented Jan 30, 2025

Name and Version

b3990/b3989

Operating systems

Linux

GGML backends

CUDA

Hardware

NVidia RTX 3090 + 3x Tesla P40, full offload

Models

Meta-Llama-3.3-70B-Instruct-Q6_K.gguf, reproduced on multiple models.

Problem description & steps to reproduce

After updating to b3990, inference speed at long context drops off much faster.
Inference on the affected build:

llama_perf_sampler_print: sampling time = 3849.30 ms / 19008 runs ( 0.20 ms per token, 4938.04 tokens per second)
llama_perf_context_print: load time = 10529.10 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 185789.79 ms / 454 runs ( 409.23 ms per token, 2.44 tokens per second)
llama_perf_context_print: total time = 192330.87 ms / 455 tokens

Inference on the unaffected build:

llama_perf_context_print: load time = 10386.38 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 72821.88 ms / 454 runs ( 160.40 ms per token, 6.23 tokens per second)
llama_perf_context_print: total time = 79340.13 ms / 455 tokens

Inference with --split-mode layer is the same on both commits:
llama_perf_sampler_print: sampling time = 3848.68 ms / 19008 runs ( 0.20 ms per token, 4938.84 tokens per second)
llama_perf_context_print: load time = 10364.01 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 228173.97 ms / 454 runs ( 502.59 ms per token, 1.99 tokens per second)
llama_perf_context_print: total time = 234689.68 ms / 455 tokens
llama_perf_sampler_print: sampling time = 3792.58 ms / 19008 runs ( 0.20 ms per token, 5011.89 tokens per second)

First Bad Commit

b3990

Relevant log output

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm row --prompt-cache models/memory/pc19000.tmp

llama-cli --n-gpu-layers 99 -ts '30,40,40,40' --main_gpu 0 --temp 1.5 --file prompt_random_long --ctx-size 19000 --model models/memory/llama3.3-q6_k.gguf --seed 11111111111  -sm layer --prompt-cache models/memory/pc19000.tmp
slaren (Member) commented Jan 30, 2025

After b3990, the context is distributed along all the GPUs, while previously it was stored entirely in the main GPU. So if you have a much faster main GPU than the rest, a drop in performance is expected. We could maybe add the possibility of configuring where to store the KV cache in a similar way to #11397.
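
A minimal, purely illustrative C++ sketch of the two placement policies described above (the struct and function names here are invented for this example and are not llama.cpp's actual KV-cache code): before b3990 every layer's KV buffer lands on the main GPU; after b3990 it follows the GPU that owns that layer's weights.

```cpp
// Illustrative sketch only, assuming a per-layer device assignment is known.
#include <cstdio>
#include <vector>

struct KVLayer {
    int layer;   // transformer layer index
    int device;  // GPU the KV buffer for this layer is allocated on
};

// layer_device[i] = GPU that holds layer i's weights; main_gpu = --main-gpu device.
std::vector<KVLayer> place_kv(const std::vector<int> & layer_device,
                              int main_gpu, bool distribute) {
    std::vector<KVLayer> kv;
    for (int i = 0; i < (int) layer_device.size(); ++i) {
        // distribute == false -> pre-b3990 behaviour (everything on the main GPU)
        // distribute == true  -> post-b3990 behaviour (per-layer device)
        kv.push_back({ i, distribute ? layer_device[i] : main_gpu });
    }
    return kv;
}

int main() {
    // 4 GPUs, layers assigned round-robin purely for illustration.
    std::vector<int> layer_device;
    for (int i = 0; i < 8; ++i) layer_device.push_back(i % 4);

    for (bool distribute : { false, true }) {
        std::printf("%s:\n", distribute ? "post-b3990 (per-layer device)"
                                        : "pre-b3990 (main GPU only)");
        for (const auto & kv : place_kv(layer_device, /*main_gpu=*/0, distribute)) {
            std::printf("  layer %d -> device %d\n", kv.layer, kv.device);
        }
    }
    return 0;
}
```

Running it just prints where each layer's KV buffer would end up under the two policies.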

city96 (Contributor) commented Feb 1, 2025

I did notice a slight regression as well when that PR was merged. I have a similar setup (Tesla V100s PCIe + 2x Tesla P40, with the P40s power limited most of the time).

Forcing the KV cache to be on device 0 does seem to help on longer context tasks. Prompt processing seems slightly slower. All cards are on CPU-connected PCIe 3.0 x16 links, so this may change depending on interconnect speed/latency.

Code modified at this line by replacing i with 0 (i.e. just forcing everything to use the same device as the first layer).
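
Expressed against the illustrative sketch in the comment above (again hypothetical, not the real llama.cpp source that was patched), the workaround amounts to pinning every KV layer buffer to the first layer's device instead of layer i's device:

```cpp
// Same shape as place_kv() from the sketch above, with the "i -> 0" change applied.
#include <vector>

struct KVLayerPinned {
    int layer;
    int device;
};

std::vector<KVLayerPinned> place_kv_dev0(const std::vector<int> & layer_device) {
    std::vector<KVLayerPinned> kv;
    for (int i = 0; i < (int) layer_device.size(); ++i) {
        kv.push_back({ i, layer_device[0] });  // was layer_device[i] before the patch
    }
    return kv;
}
```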

Benchmark done with this PR applied as a patch to test long context: #11126

./build/bin/llama-bench -fa 1 -m /mnt/models/command-r-v01-q8_0.bin -ts 6/5/5 -sm row -r 3 -p 0 -n 0 -gp 0,32 -gp 1024,32 -gp 4096,32 -gp 8192,32

| branch  | size      | params  | backend | ngl | sm  | fa | ts    | test        | t/s          |
| ------- | --------: | ------: | ------- | --: | --- | -: | ----- | ----------: | -----------: |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | pp512       | 78.41 ± 0.02 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp0    | 15.56 ± 0.03 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp1024 | 15.07 ± 0.02 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp4096 | 12.71 ± 0.05 |
| main    | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp8192 | 10.50 ± 0.06 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | pp512       | 75.72 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp0    | 15.32 ± 0.01 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp1024 | 15.32 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp4096 | 14.24 ± 0.04 |
| kv_dev0 | 34.62 GiB | 34.98 B | CUDA    |  99 | row |  1 | 6/5/5 | tg32@pp8192 | 12.99 ± 0.02 |

github-actions bot added the stale label Mar 4, 2025

github-actions bot commented: This issue was closed because it has been inactive for 14 days since being marked as stale.
