I've found something odd about the Hexagon-NPU backend's performance.

First, I ran `test-backend-ops` on an Android device (Snapdragon 8 Gen 3), and the results look normal:
```
MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0): 744 runs - 22486.85 us/run - 134.48 MFLOP/run - 5.98 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1): 744 runs - 13501.82 us/run - 134.48 MFLOP/run - 9.96 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 9510.09 us/run - 117.44 MFLOP/run - 12.35 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 8131.15 us/run - 117.44 MFLOP/run - 14.44 GFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 4886.67 us/run - 117.44 MFLOP/run - 24.03 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 19410.53 us/run - 117.44 MFLOP/run - 6.05 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 408029.35 us/run - 117.44 MFLOP/run - 287.82 MFLOPS
```
But when I run `llama-cli`, the token-generation (tg) performance is very poor:
```
llama_perf_sampler_print: sampling time = 0.87 ms / 45 runs ( 0.02 ms per token, 51605.50 tokens per second)
llama_perf_context_print: load time = 5733.53 ms
llama_perf_context_print: prompt eval time = 3360.59 ms / 42 tokens ( 80.01 ms per token, 12.50 tokens per second)
llama_perf_context_print: eval time = 2368.24 ms / 2 runs ( 1184.12 ms per token, 0.84 tokens per second)
llama_perf_context_print: total time = 8105.73 ms / 44 tokens
```
Even if the NPU ops are not very fast, it should not be this bad. So I measured the time spent in subgraph compute (NPU + CPU) inside `llama-cli` using `ggml_time_us`, and found something strange:
```
[CPU]total: 102.017 ms         // CPU ggml_backend_graph_compute_async time-cost
[hexagon-npu]total: 367.552 ms // NPU ggml_backend_graph_compute_async time-cost
graph_compute cost 504.811 ms  // graph_compute total-time-cost; small gap between 504.811 and (102.017+367.552), ignore for now
```
So the compute total is 504.811 ms, much lower than the 1184.12 ms per token reported above.
Then I timed the other stages of `llama-cli`:
```
graph_init cost 0.005 ms
graph_build cost 0.648 ms
ggml_backend_sched_alloc_graph cost 695.747 ms
graph_compute cost 504.811 ms
```
`ggml_backend_sched_alloc_graph` costs almost 700 ms per token!
Breaking down `ggml_backend_sched_alloc_graph` further, these functions take up most of the time:
```
ggml_backend_sched_split_graph cost 144.407 ms
clear_tensors cost 269.318 ms // in ggml_backend_buffer_reset
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms
```
It seems that `ggml_backend_sched_split_graph` and `ggml_gallocr_init_tensor` involve RPC memory allocation, but is this time cost normal? It looks far too long. And why does `clear_tensors` cost 269 ms? Its implementation looks trivial:
```cpp
void host_buffer::clear_tensors() {
    _tensors.clear();
    LOG_DEBUG("clear host_buffer(%p) tensors\n", (void *) _data);
}
```
BTW, my device is an Android phone with a Snapdragon 8 Gen 3, and the test model is qwen2.5-1.5b-instruct-q4_0.gguf.