Currently the number of threads used for prompt processing and inference is defined by `n_threads`, unless CPU-based BLAS is used. In that case it is locked to 1 for prompt processing only, since OpenBLAS and friends are already multithreaded to begin with: https://github.com/ggerganov/llama.cpp/blob/8183159cf3def112f6d1fe94815fce70e1bffa12/llama.cpp#L1817-L1819
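For reference, the linked selection amounts to the following (paraphrased from that revision):

```cpp
// for big prompts, if BLAS is enabled, it is better to use only one thread
// otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
```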
Meanwhile, GPU BLAS implementations spawn `n_threads` threads for prompt processing. Here's a CLBlast run with the default 4 threads on my 4-core/8-thread CPU:
```
llama_print_timings:        load time =   391.66 ms
llama_print_timings:      sample time =    78.83 ms /   100 runs   (    0.79 ms per token,  1268.50 tokens per second)
llama_print_timings: prompt eval time = 11181.22 ms /   401 tokens (   27.88 ms per token,    35.86 tokens per second)
llama_print_timings:        eval time = 21230.25 ms /    99 runs   (  214.45 ms per token,     4.66 tokens per second)
llama_print_timings:       total time = 32514.96 ms
```
I get around 28 ms/token during prompt processing. Now let's try different thread counts by modifying line 1819:
```cpp
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? <NUMBER OF THREADS> : n_threads;
```
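After editing, rebuild with CLBlast enabled (with the Makefile build of that era, something like `make clean && make LLAMA_CLBLAST=1`) and rerun the same prompt: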
| Thread Count | ms/token (averaged over multiple runs) |
| --- | --- |
| 1 | 44 |
| 2 | 29 |
| 3 | 28 |
| 4 | 28 |
| 8 | 30 |
On the prompt processing side I'm able to get the same results with only 2 or 3 threads, which saves power and puts less load on the CPU. Meanwhile I get optimal inference speed with 4 threads (I use the CPU for inference as it's, for some reason, faster than the GPU), so there's a discrepancy between the two optimal thread counts.
Is anyone else seeing this as well? I'm thinking of adding an additional command line option to control the prompt processing thread count.
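A minimal sketch of what that option could look like, assuming the `gpt_params`/`common.cpp` argument-parsing structure of the time; the field `n_threads_pp` and flag `--threads-pp` are hypothetical names, not existing llama.cpp parameters:

```cpp
// common.h -- hypothetical extra field alongside n_threads
int32_t n_threads_pp = -1; // threads for BLAS prompt processing; -1 = follow n_threads

// common.cpp -- hypothetical flag parsing, mirroring how -t/--threads is handled
} else if (arg == "--threads-pp") {
    if (++i >= argc) { invalid_param = true; break; }
    params.n_threads_pp = std::stoi(argv[i]);
}

// llama.cpp, around line 1819 -- apply the override only on the BLAS prompt-processing path,
// leaving the inference thread count untouched
const int n_threads_pp = params.n_threads_pp > 0 ? params.n_threads_pp : n_threads;
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? n_threads_pp : n_threads;
```

That way the two phases could be tuned independently, e.g. 2 threads for prompt processing and 4 for inference in my case.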