
ggml: offload the entire cgraph to a specified backend #12342


Closed
wants to merge 2 commits

Conversation

zhouwg
Contributor

@zhouwg zhouwg commented Mar 12, 2025

This PR provides a concise approach to offload the entire ggml cgraph to a specified backend, with no side effects on any of the existing backends.

This PR has been verified in my forked llama.cpp project and works as expected.

3-12 12:29:31.568 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4722]: qnn device 2(QNN-NPU)
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4723]: cgraph->n_nodes 846
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op inp_embd (GET_ROWS)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op norm-0 (RMS_NORM)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op attn_norm-0 (MUL)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (reshaped) (RESHAPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (ROPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (reshaped) (RESHAPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (ROPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k_cache_view-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k_cache_view-0 (copy of Kcur-0) (CPY)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (transposed) (TRANSPOSE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v_cache_view-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v_cache_view-0 (copy of Vcur-0 (transposed)) (CPY)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4758]: the second inference approach "mapping 


This feature would be very helpful for a WIP PR that maps the entire ggml cgraph to a QNN graph. It seems to be bad news for my formal third PR #12326, but that doesn't matter 🤗, and I'd like to see a similar PR from others in this great tech community succeed, although that implementation hides many technical details behind complicated encapsulation.

I personally hope this PR can be helpful for that WIP PR, because I have paid no further attention to Qualcomm's ggml-qnn backend since 03/12/2025 (03/29 might be a better date, since it seems I have been back on GitHub and in this great tech community since 01/29/2024, and that's enough).

This feature might also bring some unexpected help to Intel's SYCL or Huawei's CANN backend if they adopt something similar to the second tech approach in the WIP Qualcomm ggml-qnn backend; many advanced or state-of-the-art AI technologies could then be imported into this great project.

Relevant tech details can be found at #12326 (comment).

@slaren, could you help review this PR? It would be very helpful for a WIP PR (mapping the entire ggml cgraph to Qualcomm's QNN NPU backend so the specified backend can do some special hardware-dependent optimizations). The function name or its position might not be appropriate; I'll adjust it according to your review comments.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 12, 2025
@zhouwg zhouwg force-pushed the offload_cgraph_to_backend branch from 042d7b5 to 4e20355 on March 12, 2025 04:18
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Kompute https://github.com/KomputeProject/kompute/ labels Mar 12, 2025
@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

Vulkan already offloads the entire (sub)graph and I think CUDA does something similar with the CUDA graphs feature. There are no code changes to the backend system required for that, you just trigger the graph execution on the first node that is a part of the graph, and wait for it to finish on the last node.

Some performance optimizations happened that split it up into multiple graphs to allow earlier submissions, but otherwise it works as I described.
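
For illustration, the shape of such a graph_compute callback could look roughly like the sketch below; the my_backend_* helpers are hypothetical stand-ins for a backend's real command-recording API, not actual Vulkan or CUDA code:

static ggml_status my_backend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    my_backend_context * ctx = (my_backend_context *) backend->context;

    my_backend_begin(ctx);                        // start recording one command buffer
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        my_backend_encode_node(ctx, node);        // record the op, no per-node wait
    }
    my_backend_submit_and_wait(ctx);              // submit once, wait after the last node

    return GGML_STATUS_SUCCESS;
}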

@ngxson
Collaborator

ngxson commented Mar 12, 2025

offload the entire ggml's cgraph to a specified backend

IIRC this is actually a deprecated feature in ggml. When working with #12322 I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend. So probably we don't need a new API for this? (I don't have a strong opinion on this, just FYI)
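
For reference, the non-scheduler path looks roughly like the sketch below: the graph is built and then handed to one backend with ggml_backend_graph_compute, so that backend receives every node. Setting the input data is omitted for brevity, and the exact allocation helpers may differ across ggml versions:

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch: build a tiny graph and run it on one backend directly,
// so graph_compute receives every node in a single cgraph.
static void run_whole_graph_on_one_backend(ggml_backend_t backend) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate all tensors in the backend's buffer and compute the full graph
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);
    // (setting input data for a and b via ggml_backend_tensor_set omitted)
    ggml_backend_graph_compute(backend, gf);   // the entire cgraph goes to this backend

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
}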

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Vulkan already offloads the entire (sub)graph and I think CUDA does something similar with the CUDA graphs feature. There are no code changes to the backend system required for that, you just trigger the graph execution on the first node that is a part of the graph, and wait for it to finish on the last node.

Some performance optimizations happened that split it up into multiple graphs to allow earlier submissions, but otherwise it works as I described.

1. I don't know the details of the Vulkan backend. I can already see there is a debug statement in ggml_backend_vk_graph_compute:

static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
    VK_LOG_DEBUG("ggml_backend_vk_graph_compute(" << cgraph->n_nodes << " nodes)");

2. A WIP Qualcomm QNN backend needs this feature because of Qualcomm's dedicated AI tech: it needs to convert the entire ggml cgraph to a single opcfg QNN graph and then optimize that graph on the QNN-CPU / QNN-NPU backend accordingly; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph), so the current so-called second technical approach ("mapping the entire cgraph to a QNN graph") falls back to the first technical approach, which is similar to the Intel SYCL backend or Huawei CANN backend. We call this the first tech approach in the ggml-qnn backend, and its performance is really bad because Qualcomm's hardware accelerators are significantly different from Intel's or Huawei's.
3. This feature has been carefully verified in my forked llama.cpp project with 2×4 scenarios:
Screenshot from 2025-03-12 18-16-10
./scripts/build-run-android.sh run_llamacli 0 (QNN_CPU backend)
./scripts/build-run-android.sh run_llamacli 1 (QNN_GPU backend)
./scripts/build-run-android.sh run_llamacli 2 (QNN_NPU backend)
./scripts/build-run-android.sh run_llamacli 3 (default backend)

Screenshot from 2025-03-12 18-17-36
./scripts/build-run-android.sh run_llamacli 0 (QNN_CPU backend)
./scripts/build-run-android.sh run_llamacli 1 (QNN_GPU backend)
./scripts/build-run-android.sh run_llamacli 2 (QNN_NPU backend)
./scripts/build-run-android.sh run_llamacli 3 (default backend)

All these test cases work as expected.
4. I think this feature is also suitable for the Intel SYCL or Huawei CANN backend because they both currently use a similar inference procedure: accelerate ops one by one. This is the general approach in the existing backends, or the so-called first tech approach in Qualcomm's ggml-qnn backend (see the sketch below).
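
A sketch of that first approach (hypothetical my_backend_compute_node helper, not code from any actual backend), where each node is launched and waited on individually:

static ggml_status my_backend_graph_compute_one_by_one(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        // launch + wait per op; simple, but leaves the accelerator idle between ops
        my_backend_compute_node(backend, cgraph->nodes[i]);
    }
    return GGML_STATUS_SUCCESS;
}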

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

1. I don't know the details of the Vulkan backend.
2. A WIP Qualcomm QNN backend needs this feature; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)

I know you don't know the details of the Vulkan backend, that's why I'm telling you about it. You get the cgraph in the graph_compute function and you can handle it however you like. Nothing is forcing you to handle nodes one by one.

If the cgraph you receive contains only a few nodes, that's because your supports_op function returned false for some of the nodes in the middle of the graph, forcing the scheduler to split the cgraph into smaller chunks to handle those parts on CPU. Once your backend supports all ops, you will get a complete graph.
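
To make the mechanism concrete, here is a sketch of a device supports_op callback (hypothetical my_backend naming; the exact callback signature may differ between ggml versions): every op for which it returns false becomes a split point and runs on the CPU instead.

static bool my_backend_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    switch (op->op) {
        case GGML_OP_MUL_MAT:
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_RMS_NORM:
        case GGML_OP_ROPE:
            // ... every op the backend can actually execute ...
            return true;
        default:
            return false;   // any false here forces a graph split and a CPU fallback
    }
}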

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

offload the entire ggml's cgraph to a specified backend

IIRC this is actually a deprecated feature in ggml. When working with #12322 I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend. So probably we don't need a new API for this? (I don't have a strong opinion on this, just FYI)

Yes, it seems the existing ggml backend subsystem offloads the entire cgraph to a specified backend completely. Unfortunately, this is not the case in the ggml-qnn backend; please refer to #12326 (comment). I personally think this is also not the case in the Intel SYCL backend or Huawei CANN backend. We can add a simple debug statement in the corresponding function:
Screenshot from 2025-03-12 18-41-52
Screenshot from 2025-03-12 18-41-30

The root cause is that the original author introduced a standout and necessary feature, the "backend scheduler", in the ggml backend subsystem. In other words, your observation "I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend" is absolutely correct.

So my patch in this PR is very simple (a rough sketch follows the list below):

  • find the inference procedure in llama.cpp
  • find the corresponding function in ggml/src/ggml-backend.cpp
  • add a hook in that function to offload the real, entire ggml cgraph directly to a specified backend (such as the ggml-qnn backend)
  • avoid side effects on all existing backends and existing logic (especially the "backend scheduler") in the existing ggml backend subsystem
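
A rough sketch of the idea only, not the actual patch; g_whole_graph_backend and the place where it gets set are hypothetical:

static ggml_backend_t g_whole_graph_backend = NULL;   // selected at init time (hypothetical)

enum ggml_status ggml_backend_offload_whole_cgraph(struct ggml_cgraph * cgraph) {
    if (g_whole_graph_backend == NULL) {
        return GGML_STATUS_FAILED;   // caller falls back to the normal scheduler path
    }
    // the chosen backend receives every node and can map them to one QNN graph
    return ggml_backend_graph_compute(g_whole_graph_backend, cgraph);
}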

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

1. I don't know the details of the Vulkan backend.
2. A WIP Qualcomm QNN backend needs this feature; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)

I know you don't know the details of the Vulkan backend, that's why I'm telling you about it. You get the cgraph in the graph_compute function and you can handle it however you like. Nothing is forcing you to handle nodes one by one.

If the cgraph you receive contains only a few nodes, that's because your supports_op function returned false for some of the nodes in the middle of the graph, forcing the scheduler to split the cgraph into smaller chunks to handle those parts on CPU. Once your backend supports all ops, you will get a complete graph.

Thanks for your kind reminder; I understand what you said.
Unfortunately, it seems this is not the case in the ggml-qnn backend:

  • let ggml_qnn_can_handle_op return true forcefully
    Screenshot from 2025-03-12 19-09-11
  • running the llm inference on qnn npu backend with my patch in this PR
    Screenshot from 2025-03-12 19-08-34
  • running the llm inference on qnn npu backend without my patch in this PR
    Screenshot from 2025-03-12 19-12-15

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model; this is my personal understanding, and corrections from Qualcomm's experts are greatly appreciated), which is then converted/mapped to a single opcfg QNN graph that is optimized accordingly. I/we call this the second tech approach for NPU inference on Qualcomm's mobile or desktop SoCs, and the general approach used by Intel SYCL or Huawei CANN the first tech approach. The NPU performance of ggml-qnn through the first tech approach is really bad (much slower than the default CPU backend), which is significantly different from Intel SYCL or Huawei CANN. I guess the reason is that Qualcomm's AI accelerator is not a general/common hardware accelerator, or there are some tricks in Qualcomm's QNN SDK (they have a world-class Hexagon NPU, and the QNN SDK — one of Qualcomm's various AI software stacks — can't utilize it maximally if programmers don't know how to use its C API correctly).

All in all, we can ask for help from the authors of Intel SYCL or Huawei CANN, or from the original author of the ggml backend subsystem.

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

Thanks for your kind reminder; I understand what you said. Unfortunately, it seems this is not the case in the ggml-qnn backend:

That just means there is a different problem with your backend. Usually the scheduler will give you a complete subgraph if you support all ops, for example on Vulkan I get:

cgraph->n_nodes = 709 nodes

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Thanks for your kind reminder; I understand what you said. Unfortunately, it seems this is not the case in the ggml-qnn backend:

That just means there is a different problem with your backend. Usually the scheduler will give you a complete subgraph if you support all ops, for example on Vulkan I get:

Qualcomm's NPU backend needs a real complete graph, not a sub-graph.

cgraph->n_nodes = 709 nodes

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

Yes, you are correct, and I don't want this patch either, but it is strongly required for a WIP ggml-qnn backend; otherwise that backend has no practical approach, because it will fall back to the general approach, or so-called first tech approach.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

What is the difference between a complete graph and a partial one? They are both graphs that QNN should be able to execute. I understand there is internal optimization, but there shouldn't be a technical difference between executing a full graph and a partial one.

In the Vulkan case they are handled in completely the same way, the difference is only in performance since Vulkan has to stop and restart execution if the graph is split up, which comes with an overhead.

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

Yeah, maybe @slaren has an idea why you didn't get a full subgraph.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

What is the difference between a complete graph and a partial one? They are both graphs that QNN should be able to execute. I understand there is internal optimization, but there shouldn't be a technical difference between executing a full graph and a partial one.

I agree with your opinion that "there shouldn't be a technical difference between executing a full graph and a partial one"; that's the approach in Intel SYCL or Huawei CANN, as we can clearly see by tracing the code.

Please refer to #12326 (comment); you will understand what I mentioned once you have fully understood that tech doc.

In the Vulkan case they are handled in completely the same way, the difference is only in performance since Vulkan has to stop and restart execution if the graph is split up, which comes with an overhead.

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

I haven't read the code of the Vulkan backend carefully, so I have no opinion on what you mentioned. But we can clearly see the general approach, or so-called first tech approach, in Intel SYCL or Huawei CANN (I have spent some time studying both carefully): handle op acceleration one by one, which is exactly what you described as "there shouldn't be a technical difference between executing a full graph and a partial one". We can see that the NPU performance of this approach in ggml-qnn is really bad, and Qualcomm's official approach is the second tech approach: converting/mapping a complete LLM model to a single opcfg QNN graph, then optimizing the QNN graph, and finally executing it on the NPU accordingly. Unfortunately, they provide many dedicated binary tools to do LLM model conversion, which is exactly the hard part (composing an ideal QNN graph according to the complete ggml cgraph, or mapping the complete ggml cgraph to a single opcfg QNN graph) of the second tech approach in the ggml-qnn backend.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

Yeah, maybe @slaren has an idea why you didn't get a full subgraph.

Yes, I strongly agree with you.

@slaren
Member

slaren commented Mar 12, 2025

As @0cc4m said, if the backend supports all operations it will receive a single graph. You can verify this by changing the supports_op function to always return true.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

As @0cc4m said, if the backend supports all operations it will receive a single graph. You can verify this by changing the supports_op function to always return true.

I already did this verification, and the results are as follows:

  • let ggml_qnn_can_handle_op return true forcefully
    Screenshot from 2025-03-12 19-09-11
  • running the llm inference on qnn npu backend with my patch in this PR
    Screenshot from 2025-03-12 19-08-34
  • running the llm inference on qnn npu backend without my patch in this PR
    Screenshot from 2025-03-12 19-12-15

@slaren
Member

slaren commented Mar 12, 2025

You also need to use -ngl 99 to offload all layers to the backend. Since that graph is starting from layer 21, I suspect that you are not doing that.
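
For reference, an illustrative invocation (model path is a placeholder) would be something like: llama-cli -m /path/to/model.gguf -ngl 99 -p "hello", so that every layer is assigned to the offloaded backend.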

@zhouwg zhouwg closed this Mar 12, 2025
@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Thanks, you are absolutely correct, and I have closed this PR accordingly.
