m-arbaro changed the title from "Eval bug: --slpit mode performance on NVidia multy-gpu config is extremely low on the long contexts after b3990" to "Eval bug: -sm row performance on NVidia multy-gpu config is extremely low on the long contexts after b3990" on Jan 30, 2025.
After b3990, the context (KV cache) is distributed across all the GPUs, while previously it was stored entirely on the main GPU. So if your main GPU is much faster than the rest, a drop in performance is expected. We could maybe add the possibility of configuring where to store the KV cache, in a similar way to #11397.
I did notice a slight regression as well when that PR was merged. I have a similar setup (Tesla V100s PCIe + 2x Tesla P40, with the P40s power limited most of the time).
Forcing the KV cache to be on device 0 does seem to help on longer context tasks, although prompt processing seems slightly slower. All cards are on CPU-connected PCIe 3.0 x16, so this may change depending on interconnect speed/latency.
The code was modified at this line by replacing i with 0, i.e. forcing every layer's KV cache onto the same device as the first layer (a rough sketch of the idea follows below).
Benchmark done with this PR applied as a patch to test long context: #11126
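For readers who just want the gist of what is being discussed, here is a minimal, self-contained sketch. It is not the actual llama.cpp allocation code; all names and the layer-to-device mapping are hypothetical. It models the two behaviours: after b3990 each layer's KV cache follows whatever device that layer is offloaded to, whereas the workaround above pins every layer's cache to the first layer's device.

```cpp
// Conceptual sketch only -- all names and the layer->device mapping here are
// hypothetical stand-ins, not the actual llama.cpp source.
#include <cstdio>
#include <vector>

struct LayerSplit {
    int n_layers;
    int n_devices;
    // Hypothetical contiguous layer->device mapping (stand-in for what
    // --split-mode produces).
    int device_for_layer(int il) const { return il * n_devices / n_layers; }
};

// follow_layer == true : post-b3990 behaviour, each layer's KV cache lives on
//                        that layer's device, so it is spread across all GPUs.
// follow_layer == false: the "replace i with 0" workaround, every layer's KV
//                        cache is pinned to the first layer's device.
static int kv_device_for_layer(const LayerSplit & split, int il, bool follow_layer) {
    return follow_layer ? split.device_for_layer(il) : split.device_for_layer(0);
}

int main() {
    const LayerSplit split = {80, 4}; // e.g. 80 layers over RTX 3090 + 3x P40
    const bool follow_layer = true;   // flip to false to model the workaround

    std::vector<int> kv_layers_per_device(split.n_devices, 0);
    for (int il = 0; il < split.n_layers; ++il) {
        kv_layers_per_device[kv_device_for_layer(split, il, follow_layer)]++;
    }
    for (int d = 0; d < split.n_devices; ++d) {
        std::printf("device %d holds the KV cache of %d layers\n", d, kv_layers_per_device[d]);
    }
    return 0;
}
```

With follow_layer = true the cache ends up split 20/20/20/20 across the four cards, so attention over a long context is gated by the slowest card; with follow_layer = false everything sits on device 0, matching the pre-b3990 behaviour described above.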
Name and Version
b3990/b3989
Operating systems
Linux
GGML backends
CUDA
Hardware
NVidia RTX 3090 + 3x Tesla P40, full offload
Models
Meta-Llama-3.3-70B-Instruct-Q6_K.gguf, reproduced on multiple models.
Problem description & steps to reproduce
After updating to b3990, inference speed on long contexts starts dropping much faster.
Inference on affected build:
llama_perf_sampler_print: sampling time = 3849.30 ms / 19008 runs ( 0.20 ms per token, 4938.04 tokens per second)
llama_perf_context_print: load time = 10529.10 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 185789.79 ms / 454 runs ( 409.23 ms per token, 2.44 tokens per second)
llama_perf_context_print: total time = 192330.87 ms / 455 tokens
Inference on unaffected build:
llama_perf_context_print: load time = 10386.38 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 72821.88 ms / 454 runs ( 160.40 ms per token, 6.23 tokens per second)
llama_perf_context_print: total time = 79340.13 ms / 455 tokens
Inference with --split-mode layer is the same on both commits:
llama_perf_sampler_print: sampling time = 3848.68 ms / 19008 runs ( 0.20 ms per token, 4938.84 tokens per second)
llama_perf_context_print: load time = 10364.01 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 228173.97 ms / 454 runs ( 502.59 ms per token, 1.99 tokens per second)
llama_perf_context_print: total time = 234689.68 ms / 455 tokens
llama_perf_sampler_print: sampling time = 3792.58 ms / 19008 runs ( 0.20 ms per token, 5011.89 tokens per second)
First Bad Commit
b3990
Relevant log output