Currently the number of threads used for prompt processing and inference is defined by `n_threads`, unless CPU-based BLAS is used. In that case it is locked to 1 for prompt processing only, since OpenBLAS and friends are already multithreaded to begin with: https://github.com/ggerganov/llama.cpp/blob/8183159cf3def112f6d1fe94815fce70e1bffa12/llama.cpp#L1817-L1819
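For reference, the linked selection amounts to the following (paraphrased from that revision):

```cpp
// for big prompts, if BLAS is enabled, it is better to use only one thread
// otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
```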
Meanwhile, GPU BLAS implementations spawn `n_threads` threads for prompt processing. Here's a CLBlast run with the default 4 threads on my 4-core/8-thread CPU:
```
llama_print_timings:        load time =   391.66 ms
llama_print_timings:      sample time =    78.83 ms /   100 runs   (    0.79 ms per token,  1268.50 tokens per second)
llama_print_timings: prompt eval time = 11181.22 ms /   401 tokens (   27.88 ms per token,    35.86 tokens per second)
llama_print_timings:        eval time = 21230.25 ms /    99 runs   (  214.45 ms per token,     4.66 tokens per second)
llama_print_timings:       total time = 32514.96 ms
```
I get around 28 ms/token during prompt processing. Now let's try different thread counts by modifying line 1819:
```cpp
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? <NUMBER OF THREADS> : n_threads;
```
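After editing, rebuild with CLBlast enabled (with the Makefile build of that era, something like `make clean && make LLAMA_CLBLAST=1`) and rerun the same prompt: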
| Thread Count | ms/token (averaged over multiple runs) |
| --- | --- |
| 1 | 44 |
| 2 | 29 |
| 3 | 28 |
| 4 | 28 |
| 8 | 30 |
On the prompt processing side I'm able to get the same results with only 2 or 3 threads, which saves power and puts less load on the CPU. Meanwhile I get optimal inference speed with 4 threads (I use the CPU for inference as it's, for some reason, faster than the GPU), so there's a discrepancy between the two optimal thread counts.
Is anyone else seeing this as well? I'm thinking of adding an additional command line option to control the prompt processing thread count.
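A minimal sketch of what that option could look like, assuming the `gpt_params`/`common.cpp` argument-parsing structure of the time; the field `n_threads_pp` and flag `--threads-pp` are hypothetical names, not existing llama.cpp parameters:

```cpp
// common.h -- hypothetical extra field alongside n_threads
int32_t n_threads_pp = -1; // threads for BLAS prompt processing; -1 = follow n_threads

// common.cpp -- hypothetical flag parsing, mirroring how -t/--threads is handled
} else if (arg == "--threads-pp") {
    if (++i >= argc) { invalid_param = true; break; }
    params.n_threads_pp = std::stoi(argv[i]);
}

// llama.cpp, around line 1819 -- apply the override only on the BLAS prompt-processing path,
// leaving the inference thread count untouched
const int n_threads_pp = params.n_threads_pp > 0 ? params.n_threads_pp : n_threads;
n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? n_threads_pp : n_threads;
```

That way the two phases could be tuned independently, e.g. 2 threads for prompt processing and 4 for inference in my case.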