llama-cli on Hexagon-NPU introducing a lot of extra time #51

@finneyyan

Description

I've found something weird about Hexagon-NPU backend performance.
First, I ran test-backend-ops on an Android device with a Snapdragon 8 Gen 3, and the results look normal:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                  744 runs - 22486.85 us/run - 134.48 MFLOP/run -   5.98 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                  744 runs - 13501.82 us/run - 134.48 MFLOP/run -   9.96 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 852 runs -  9510.09 us/run - 117.44 MFLOP/run -  12.35 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 852 runs -  8131.15 us/run - 117.44 MFLOP/run -  14.44 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs -  4886.67 us/run - 117.44 MFLOP/run -  24.03 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs - 19410.53 us/run - 117.44 MFLOP/run -   6.05 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                852 runs - 408029.35 us/run - 117.44 MFLOP/run - 287.82 MFLOPS
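For reference, these numbers come from the perf mode of test-backend-ops, invoked roughly like this (the backend filter name is taken from the log prefix shown further below, so treat it as an assumption):

./test-backend-ops perf -o MUL_MAT -b hexagon-npu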

But when I run llama-cli, the token-generation (tg) performance is very bad:

llama_perf_sampler_print:    sampling time =       0.87 ms /    45 runs   (    0.02 ms per token, 51605.50 tokens per second)
llama_perf_context_print:        load time =    5733.53 ms
llama_perf_context_print: prompt eval time =    3360.59 ms /    42 tokens (   80.01 ms per token,    12.50 tokens per second)
llama_perf_context_print:        eval time =    2368.24 ms /     2 runs   ( 1184.12 ms per token,     0.84 tokens per second)
llama_perf_context_print:       total time =    8105.73 ms /    44 tokens

Even if the ops on the NPU are not very fast, it should not be this bad.
So I printed the time cost of the subgraph (NPU+CPU) compute in llama-cli with ggml_time_us, and something very weird shows up:

[CPU]total: 102.017 ms  // CPU ggml_backend_graph_compute_async time-cost
[hexagon-npu]total: 367.552 ms  // NPU ggml_backend_graph_compute_async time-cost
graph_compute cost 504.811 ms  // total graph_compute time; small gap between 504.811 and (102.017 + 367.552), ignored for now
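
The measurement itself is just a local patch around the per-backend compute call, roughly like this (a sketch, not upstream code; where exactly I synchronize is simplified here):

#include "ggml-backend.h"
#include <cstdio>

static void timed_compute(ggml_backend_sched_t sched, ggml_backend_t backend, struct ggml_cgraph * graph) {
    const int64_t t_start = ggml_time_us();
    ggml_backend_graph_compute_async(backend, graph);  // the call being measured
    ggml_backend_sched_synchronize(sched);             // wait for async work (simplified)
    const int64_t t_end = ggml_time_us();
    fprintf(stderr, "[%s]total: %.3f ms\n", ggml_backend_name(backend), (t_end - t_start) / 1000.0);
}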

The total is 504.811 ms, much lower than the 1184.12 ms per token reported above.
So I also printed the other stages of llama-cli:

graph_init cost 0.005 ms
graph_build cost 0.648 ms
ggml_backend_sched_alloc_graph cost 695.747 ms
graph_compute cost 504.811 ms
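
For context, my mental model of the per-token path is roughly the following (simplified; the graph comes from llama.cpp's builder, which I leave out):

#include "ggml-backend.h"

// simplified sketch of where each stage sits per token; not the exact llama.cpp code
static void decode_one_token(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    ggml_backend_sched_reset(sched);                    // graph_init: 0.005 ms
    // gf was just produced by the graph builder        // graph_build: 0.648 ms
    ggml_backend_sched_alloc_graph(sched, gf);          // 695.747 ms <-- the problem
    ggml_backend_sched_graph_compute_async(sched, gf);  // graph_compute: 504.811 ms
    ggml_backend_sched_synchronize(sched);
}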

ggml_backend_sched_alloc_graph costs almost 700 ms per token!
Inside ggml_backend_sched_alloc_graph, the following functions take up most of the time:

ggml_backend_sched_split_graph cost 144.407 ms
clear_tensors cost 269.318 ms  // in ggml_backend_buffer_reset
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms
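
For reference, this is the call chain as I read it (paraphrased from ggml-backend / ggml-alloc; the internal helper steps may not be named exactly like this):

// ggml_backend_sched_alloc_graph(sched, graph)
//   -> ggml_backend_sched_split_graph(sched, graph)        // 144.407 ms
//   -> alloc-splits step
//        -> ggml_gallocr_alloc_graph(galloc, graph)
//             -> ggml_backend_buffer_reset(buffer)         // -> clear_tensors: 269.318 ms
//             -> ggml_gallocr_init_tensor(...) per node    // 206.843 ms + 69.264 ms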

It seems that ggml_backend_sched_split_graph and ggml_gallocr_init_tensor involve RPC memory allocation, but is that time cost normal? It looks far too long. And why does clear_tensors cost 269 ms? It looks simple:

void host_buffer::clear_tensors() {
    _tensors.clear();
    LOG_DEBUG("clear host_buffer(%p) tensors\n", (void *) _data);
}
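
My only guess (an assumption, I have not verified it against the backend source) is that clear() is cheap only if the element destructors are cheap: if each stored tensor wrapper releases FastRPC-mapped memory in its destructor, clearing N tensors means N round trips to the DSP. Something like:

// hypothetical illustration, NOT the actual backend code:
class host_tensor_guess {
public:
    ~host_tensor_guess() {
        // if every tensor frees its FastRPC mapping here, _tensors.clear()
        // pays one DSP round trip per tensor; hundreds of tensors would
        // explain hundreds of milliseconds
        rpcmem_free(_rpc_buf);  // rpcmem_free is the Hexagon SDK API; calling
                                // it per-tensor here is my assumption
    }
private:
    void * _rpc_buf = nullptr;
};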

BTW, my device: Snapdragon 8 Gen 3 (Android).
Test model: qwen2.5-1.5b-instruct-q4_0.gguf
