Controlling the number of threads for CLBlast/cuBLAS prompt processing #2498

Closed

netrunnereve opened this issue Aug 3, 2023 · 1 comment

netrunnereve (Collaborator) commented on Aug 3, 2023

Currently the number of threads used for prompt processing and inference is set by n_threads, unless CPU-based BLAS is in use. In that case it is locked to 1 for prompt processing only, since OpenBLAS and friends are already multithreaded to begin with.

https://github.com/ggerganov/llama.cpp/blob/8183159cf3def112f6d1fe94815fce70e1bffa12/llama.cpp#L1817-L1819
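For reference, the logic behind that permalink looks like this (reproduced from that era of llama.cpp, so treat the exact wording as approximate):

```cpp
// for big prompts, if BLAS is enabled, it is better to use only one thread
// otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
```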

Meanwhile, GPU BLAS implementations spawn n_threads threads for prompt processing. Here's a CLBlast run with the default 4 threads on my 4 core/8 thread CPU:

```
llama_print_timings:        load time =   391.66 ms
llama_print_timings:      sample time =    78.83 ms /   100 runs   (    0.79 ms per token,  1268.50 tokens per second)
llama_print_timings: prompt eval time = 11181.22 ms /   401 tokens (   27.88 ms per token,    35.86 tokens per second)
llama_print_timings:        eval time = 21230.25 ms /    99 runs   (  214.45 ms per token,     4.66 tokens per second)
llama_print_timings:       total time = 32514.96 ms
```

I get around 28 ms/token during prompt processing. Now let's try running with a different thread count by modifying line 1819:

```cpp
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? <NUMBER OF THREADS> : n_threads;
```
| Thread Count | ms/token (averaged over multiple runs) |
| --- | --- |
| 1 | 44 |
| 2 | 29 |
| 3 | 28 |
| 4 | 28 |
| 8 | 30 |

On the prompt processing side I'm able to get the same results with only 2 or 3 threads, which saves power and puts less load on the CPU. Meanwhile I get optimal inference speed with 4 threads (I use the CPU for inference as it's for some reason faster than the GPU), so there's a discrepancy between the two optimal thread counts.

Is anyone else seeing this as well? I'm thinking of adding an additional command line option to control the prompt processing thread count.
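As a rough sketch of the wiring (the `--pp-threads` flag and `params.pp_threads` field are placeholder names for illustration, not existing llama.cpp identifiers):

```cpp
// Hypothetical argument handling, in the style of the existing gpt_params parsing:
// accept a separate thread count for BLAS-accelerated prompt processing.
if (arg == "--pp-threads") {
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.pp_threads = std::stoi(argv[i]);
}
```

and line 1819 would then become:

```cpp
// use the dedicated prompt-processing thread count whenever BLAS kicks in
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas())
                ? params.pp_threads
                : n_threads;
```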

netrunnereve changed the title from "[Experiment] Controlling the number of threads for CLBlast/cuBLAS prompt processing" to "Controlling the number of threads for CLBlast/cuBLAS prompt processing" on Aug 3, 2023
netrunnereve (Collaborator, Author) commented
Feature added in #3301.
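
For context: in current llama.cpp this is exposed on the CLI as `-tb, --threads-batch N` (and as `n_threads_batch` in the library API), so a run like `./main -m model.gguf -t 4 -tb 2` generates with 4 threads while processing the prompt with 2.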
