llama-cli on Hexagon-NPU introducing a lot of extra time #51

@finneyyan

Description

I've found something weird about Hexagon-NPU backend performance.
First, I ran test-backend-ops on an Android device with a Snapdragon 8 Gen 3, and the results look normal:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                  744 runs - 22486.85 us/run - 134.48 MFLOP/run -   5.98 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                  744 runs - 13501.82 us/run - 134.48 MFLOP/run -   9.96 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 852 runs -  9510.09 us/run - 117.44 MFLOP/run -  12.35 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 852 runs -  8131.15 us/run - 117.44 MFLOP/run -  14.44 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs -  4886.67 us/run - 117.44 MFLOP/run -  24.03 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs - 19410.53 us/run - 117.44 MFLOP/run -   6.05 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs - 408029.35 us/run - 117.44 MFLOP/run - 287.82 MFLOPS
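For reference, these numbers come from the perf mode of test-backend-ops, invoked roughly like this (the backend filter name is taken from the log prefix shown further below, so treat it as an assumption):

./test-backend-ops perf -o MUL_MAT -b hexagon-npu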

But when I run llama-cli, the token-generation (tg) performance is very bad:

llama_perf_sampler_print:    sampling time =       0.87 ms /    45 runs   (    0.02 ms per token, 51605.50 tokens per second)
llama_perf_context_print:        load time =    5733.53 ms
llama_perf_context_print: prompt eval time =    3360.59 ms /    42 tokens (   80.01 ms per token,    12.50 tokens per second)
llama_perf_context_print:        eval time =    2368.24 ms /     2 runs   ( 1184.12 ms per token,     0.84 tokens per second)
llama_perf_context_print:       total time =    8105.73 ms /    44 tokens

Even if the ops on the NPU are not very fast, it should not be this bad.
So I printed the time cost of the subgraph (NPU+CPU) compute in llama-cli with ggml_time_us, and something very weird shows up:

[CPU]total: 102.017 ms  // CPU ggml_backend_graph_compute_async time-cost
[hexagon-npu]total: 367.552 ms  // NPU ggml_backend_graph_compute_async time-cost
graph_compute cost 504.811 ms  // total graph_compute time; small gap between 504.811 and (102.017 + 367.552), ignored for now
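
The measurement itself is just a local patch around the per-backend compute call, roughly like this (a sketch, not upstream code; where exactly I synchronize is simplified here):

#include "ggml-backend.h"
#include <cstdio>

static void timed_compute(ggml_backend_sched_t sched, ggml_backend_t backend, struct ggml_cgraph * graph) {
    const int64_t t_start = ggml_time_us();
    ggml_backend_graph_compute_async(backend, graph);  // the call being measured
    ggml_backend_sched_synchronize(sched);             // wait for async work (simplified)
    const int64_t t_end = ggml_time_us();
    fprintf(stderr, "[%s]total: %.3f ms\n", ggml_backend_name(backend), (t_end - t_start) / 1000.0);
}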

The total is 504.811 ms, much lower than the 1184.12 ms per token reported above.
So I also printed the other stages of llama-cli:

graph_init cost 0.005 ms
graph_build cost 0.648 ms
ggml_backend_sched_alloc_graph cost 695.747 ms
graph_compute cost 504.811 ms
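
For context, my mental model of the per-token path is roughly the following (simplified; the graph comes from llama.cpp's builder, which I leave out):

#include "ggml-backend.h"

// simplified sketch of where each stage sits per token; not the exact llama.cpp code
static void decode_one_token(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    ggml_backend_sched_reset(sched);                    // graph_init: 0.005 ms
    // gf was just produced by the graph builder        // graph_build: 0.648 ms
    ggml_backend_sched_alloc_graph(sched, gf);          // 695.747 ms <-- the problem
    ggml_backend_sched_graph_compute_async(sched, gf);  // graph_compute: 504.811 ms
    ggml_backend_sched_synchronize(sched);
}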

ggml_backend_sched_alloc_graph costs almost 700 ms per token!
Inside ggml_backend_sched_alloc_graph, the following functions take up most of the time:

ggml_backend_sched_split_graph cost 144.407 ms
clear_tensors cost 269.318 ms  // in ggml_backend_buffer_reset
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms
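
For reference, this is the call chain as I read it (paraphrased from ggml-backend / ggml-alloc; the internal helper steps may not be named exactly like this):

// ggml_backend_sched_alloc_graph(sched, graph)
//   -> ggml_backend_sched_split_graph(sched, graph)        // 144.407 ms
//   -> alloc-splits step
//        -> ggml_gallocr_alloc_graph(galloc, graph)
//             -> ggml_backend_buffer_reset(buffer)         // -> clear_tensors: 269.318 ms
//             -> ggml_gallocr_init_tensor(...) per node    // 206.843 ms + 69.264 ms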

It seems that ggml_backend_sched_split_graph and ggml_gallocr_init_tensor involve RPC memory allocation, but is that time cost normal? It looks far too long. And why does clear_tensors cost 269 ms? It looks simple:

void host_buffer::clear_tensors() {
    _tensors.clear();
    LOG_DEBUG("clear host_buffer(%p) tensors\n", (void *) _data);
}
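
My only guess (an assumption, I have not verified it against the backend source) is that clear() is cheap only if the element destructors are cheap: if each stored tensor wrapper releases FastRPC-mapped memory in its destructor, clearing N tensors means N round trips to the DSP. Something like:

// hypothetical illustration, NOT the actual backend code:
class host_tensor_guess {
public:
    ~host_tensor_guess() {
        // if every tensor frees its FastRPC mapping here, _tensors.clear()
        // pays one DSP round trip per tensor; hundreds of tensors would
        // explain hundreds of milliseconds
        rpcmem_free(_rpc_buf);  // rpcmem_free is the Hexagon SDK API; calling
                                // it per-tensor here is my assumption
    }
private:
    void * _rpc_buf = nullptr;
};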

BTW, my device: Snapdragon 8 Gen 3 (Android).
Test model: qwen2.5-1.5b-instruct-q4_0.gguf
