
ggml: offload the entire cgraph to a specified backend #12342


Closed
wants to merge 2 commits

Conversation

zhouwg
Contributor

@zhouwg zhouwg commented Mar 12, 2025

This PR provides a concise approach to offload the entire ggml cgraph to a specified backend, with no side effects on any of the existing backends.

This PR has been verified in my forked llama.cpp project and works as expected.

3-12 12:29:31.568 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4722]: qnn device 2(QNN-NPU)
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4723]: cgraph->n_nodes 846
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op inp_embd (GET_ROWS)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op norm-0 (RMS_NORM)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op attn_norm-0 (MUL)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (reshaped) (RESHAPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Qcur-0 (ROPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (reshaped) (RESHAPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Kcur-0 (ROPE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (MUL_MAT)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (ADD)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k_cache_view-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k_cache_view-0 (copy of Kcur-0) (CPY)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op Vcur-0 (transposed) (TRANSPOSE)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v_cache_view-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v_cache_view-0 (copy of Vcur-0 (transposed)) (CPY)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op v-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4728]: ggmlqnn_graph_compute: op k-0 (VIEW)
03-12 12:29:31.569 13867 13867 I ggml-qnn: 
03-12 12:29:31.569 13867 13867 I ggml-qnn: [ggmlqnn_graph_compute, 4758]: the second inference approach "mapping 


This feature would be very helpful for a WIP PR that maps the entire ggml cgraph to a QNN graph. It seems to be bad news for my formal third PR #12326, but that doesn't matter 🤗, and I'd like to see a similar PR from others in this great tech community succeed, although that implementation hides many technical details behind complicated encapsulation.

I personally hope this PR can be helpful for that WIP PR, because I have paid no further attention to Qualcomm's ggml-qnn backend since 03/12/2025 (03/29 might be a better date, since it seems I have been back on GitHub and in this great tech community since 01/29/2024, and that's enough).

This feature might also bring some unexpected help to Intel's SYCL or Huawei's CANN backend if they adopt something similar to the second tech approach in the WIP Qualcomm ggml-qnn backend; many advanced or state-of-the-art AI technologies could then be imported into this great project.

Relevant tech details can be found at #12326 (comment).

@slaren, could you help review this PR? It would be very helpful for a WIP PR (mapping the entire ggml cgraph to Qualcomm's QNN NPU backend so the specified backend can do some special hardware-dependent optimizations). The function name or its position might not be appropriate; I'll adjust it according to your review comments.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 12, 2025
@zhouwg zhouwg force-pushed the offload_cgraph_to_backend branch from 042d7b5 to 4e20355 on March 12, 2025 04:18
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Kompute https://github.com/KomputeProject/kompute/ labels Mar 12, 2025
@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

Vulkan already offloads the entire (sub)graph and I think CUDA does something similar with the CUDA graphs feature. There are no code changes to the backend system required for that, you just trigger the graph execution on the first node that is a part of the graph, and wait for it to finish on the last node.

Some performance optimizations happened that split it up into multiple graphs to allow earlier submissions, but otherwise it works as I described.
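
For illustration, the shape of such a graph_compute callback could look roughly like the sketch below; the my_backend_* helpers are hypothetical stand-ins for a backend's real command-recording API, not actual Vulkan or CUDA code:

static ggml_status my_backend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    my_backend_context * ctx = (my_backend_context *) backend->context;

    my_backend_begin(ctx);                        // start recording one command buffer
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        my_backend_encode_node(ctx, node);        // record the op, no per-node wait
    }
    my_backend_submit_and_wait(ctx);              // submit once, wait after the last node

    return GGML_STATUS_SUCCESS;
}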

@ngxson
Collaborator

ngxson commented Mar 12, 2025

offload the entire ggml's cgraph to a specified backend

IIRC this is actually a deprecated feature in ggml. When working with #12322 I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend. So probably we don't need a new API for this? (I don't have a strong opinion on this, just FYI)
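
For reference, the non-scheduler path looks roughly like the sketch below: the graph is built and then handed to one backend with ggml_backend_graph_compute, so that backend receives every node. Setting the input data is omitted for brevity, and the exact allocation helpers may differ across ggml versions:

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch: build a tiny graph and run it on one backend directly,
// so graph_compute receives every node in a single cgraph.
static void run_whole_graph_on_one_backend(ggml_backend_t backend) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate all tensors in the backend's buffer and compute the full graph
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);
    // (setting input data for a and b via ggml_backend_tensor_set omitted)
    ggml_backend_graph_compute(backend, gf);   // the entire cgraph goes to this backend

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
}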

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Vulkan already offloads the entire (sub)graph and I think CUDA does something similar with the CUDA graphs feature. There are no code changes to the backend system required for that, you just trigger the graph execution on the first node that is a part of the graph, and wait for it to finish on the last node.

Some performance optimizations happened that split it up into multiple graphs to allow earlier submissions, but otherwise it works as I described.

1. I don't know the details of the Vulkan backend. I can already see there is a debug statement in ggml_backend_vk_graph_compute:

static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
    VK_LOG_DEBUG("ggml_backend_vk_graph_compute(" << cgraph->n_nodes << " nodes)");

2. A WIP Qualcomm QNN backend needs this feature because of Qualcomm's dedicated AI tech: it needs to convert the entire ggml cgraph to a single opcfg QNN graph and then optimize that graph on the QNN-CPU / QNN-NPU backend accordingly; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph), so the current so-called second technical approach ("mapping the entire cgraph to a QNN graph") falls back to the first technical approach, which is similar to the Intel SYCL backend or Huawei CANN backend. We call this the first tech approach in the ggml-qnn backend, and its performance is really bad because Qualcomm's hardware accelerators are significantly different from Intel's or Huawei's.
3. This feature has been carefully verified in my forked llama.cpp project with 2×4 scenarios:
Screenshot from 2025-03-12 18-16-10
./scripts/build-run-android.sh run_llamacli 0 (QNN_CPU backend)
./scripts/build-run-android.sh run_llamacli 1 (QNN_GPU backend)
./scripts/build-run-android.sh run_llamacli 2 (QNN_NPU backend)
./scripts/build-run-android.sh run_llamacli 3 (default backend)

Screenshot from 2025-03-12 18-17-36
./scripts/build-run-android.sh run_llamacli 0 (QNN_CPU backend)
./scripts/build-run-android.sh run_llamacli 1 (QNN_GPU backend)
./scripts/build-run-android.sh run_llamacli 2 (QNN_NPU backend)
./scripts/build-run-android.sh run_llamacli 3 (default backend)

All these test cases work as expected.
4. I think this feature is also suitable for the Intel SYCL or Huawei CANN backend because they both currently use a similar inference procedure: accelerate ops one by one. This is the general approach in the existing backends, or the so-called first tech approach in Qualcomm's ggml-qnn backend (see the sketch below).
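
A sketch of that first approach (hypothetical my_backend_compute_node helper, not code from any actual backend), where each node is launched and waited on individually:

static ggml_status my_backend_graph_compute_one_by_one(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        // launch + wait per op; simple, but leaves the accelerator idle between ops
        my_backend_compute_node(backend, cgraph->nodes[i]);
    }
    return GGML_STATUS_SUCCESS;
}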

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

1. I don't know the details of the Vulkan backend.
2. A WIP Qualcomm QNN backend needs this feature; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)

I know you don't know the details of the Vulkan backend, that's why I'm telling you about it. You get the cgraph in the graph_compute function and you can handle it however you like. Nothing is forcing you to handle nodes one by one.

If the cgraph you receive contains only a few nodes, that's because your supports_op function returned false for some of the nodes in the middle of the graph, forcing the scheduler to split the cgraph into smaller chunks to handle those parts on CPU. Once your backend supports all ops, you will get a complete graph.
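
To make the mechanism concrete, here is a sketch of a device supports_op callback (hypothetical my_backend naming; the exact callback signature may differ between ggml versions): every op for which it returns false becomes a split point and runs on the CPU instead.

static bool my_backend_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    switch (op->op) {
        case GGML_OP_MUL_MAT:
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_RMS_NORM:
        case GGML_OP_ROPE:
            // ... every op the backend can actually execute ...
            return true;
        default:
            return false;   // any false here forces a graph split and a CPU fallback
    }
}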

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

offload the entire ggml's cgraph to a specified backend

IIRC this is actually a deprecated feature in ggml. When working with #12322 I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend. So probably we don't need a new API for this? (I don't have a strong opinion on this, just FYI)

Yes, it seems the existing ggml backend subsystem offloads the entire cgraph to a specified backend completely. Unfortunately, this is not the case in the ggml-qnn backend; please refer to #12326 (comment). I personally think this is also not the case in the Intel SYCL backend or Huawei CANN backend. We can add a simple debug statement in the corresponding function:
Screenshot from 2025-03-12 18-41-52
Screenshot from 2025-03-12 18-41-30

The root cause is that the original author introduced a standout and necessary feature, the "backend scheduler", in the ggml backend subsystem. In other words, your observation "I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend" is absolutely correct.

So my patch in this PR is very simple (a rough sketch follows the list below):

  • find the inference procedure in llama.cpp
  • find the corresponding function in ggml/src/ggml-backend.cpp
  • add a hook in that function to offload the real, entire ggml cgraph directly to a specified backend (such as the ggml-qnn backend)
  • avoid side effects on all existing backends and existing logic (especially the "backend scheduler") in the existing ggml backend subsystem
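
A rough sketch of the idea only, not the actual patch; g_whole_graph_backend and the place where it gets set are hypothetical:

static ggml_backend_t g_whole_graph_backend = NULL;   // selected at init time (hypothetical)

enum ggml_status ggml_backend_offload_whole_cgraph(struct ggml_cgraph * cgraph) {
    if (g_whole_graph_backend == NULL) {
        return GGML_STATUS_FAILED;   // caller falls back to the normal scheduler path
    }
    // the chosen backend receives every node and can map them to one QNN graph
    return ggml_backend_graph_compute(g_whole_graph_backend, cgraph);
}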

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

1. I don't know the details of the Vulkan backend.
2. A WIP Qualcomm QNN backend needs this feature; the details can be found at #12326 (comment). We can clearly see that there are only 2 graph nodes or 1 graph node in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)

I know you don't know the details of the Vulkan backend, that's why I'm telling you about it. You get the cgraph in the graph_compute function and you can handle it however you like. Nothing is forcing you to handle nodes one by one.

If the cgraph you receive contains only a few nodes, that's because your supports_op function returned false for some of the nodes in the middle of the graph, forcing the scheduler to split the cgraph into smaller chunks to handle those parts on CPU. Once your backend supports all ops, you will get a complete graph.

Thanks for your kind reminder; I understand what you said.
Unfortunately, it seems this is not the case in the ggml-qnn backend:

  • let ggml_qnn_can_handle_op return true forcefully
    Screenshot from 2025-03-12 19-09-11
  • running the llm inference on qnn npu backend with my patch in this PR
    Screenshot from 2025-03-12 19-08-34
  • running the llm inference on qnn npu backend without my patch in this PR
    Screenshot from 2025-03-12 19-12-15

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model; this is my personal understanding, and corrections from Qualcomm's experts are greatly appreciated), which is then converted/mapped to a single opcfg QNN graph that is optimized accordingly. I/we call this the second tech approach for NPU inference on Qualcomm's mobile or desktop SoCs, and the general approach used by Intel SYCL or Huawei CANN the first tech approach. The NPU performance of ggml-qnn through the first tech approach is really bad (much slower than the default CPU backend), which is significantly different from Intel SYCL or Huawei CANN. I guess the reason is that Qualcomm's AI accelerator is not a general/common hardware accelerator, or there are some tricks in Qualcomm's QNN SDK (they have a world-class Hexagon NPU, and the QNN SDK — one of Qualcomm's various AI software stacks — can't utilize it maximally if programmers don't know how to use its C API correctly).

All in all, we can ask for help from the authors of Intel SYCL or Huawei CANN, or from the original author of the ggml backend subsystem.

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

Thanks for your kind reminder; I understand what you said. Unfortunately, it seems this is not the case in the ggml-qnn backend:

That just means there is a different problem with your backend. Usually the scheduler will give you a complete subgraph if you support all ops, for example on Vulkan I get:

cgraph->n_nodes = 709 nodes

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Thanks for your kind reminder; I understand what you said. Unfortunately, it seems this is not the case in the ggml-qnn backend:

That just means there is a different problem with your backend. Usually the scheduler will give you a complete subgraph if you support all ops, for example on Vulkan I get:

Qualcomm's NPU backend needs a real complete graph, not a sub-graph.

cgraph->n_nodes = 709 nodes

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

Yes, you are correct, and I don't want this patch either, but it is strongly required for a WIP ggml-qnn backend; otherwise that backend has no practical approach, because it will fall back to the general approach, or so-called first tech approach.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

@0cc4m
Collaborator

0cc4m commented Mar 12, 2025

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

What is the difference between a complete graph and a partial one? They are both graphs that QNN should be able to execute. I understand there is internal optimization, but there shouldn't be a technical difference between executing a full graph and a partial one.

In the Vulkan case they are handled in completely the same way, the difference is only in performance since Vulkan has to stop and restart execution if the graph is split up, which comes with an overhead.

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

Yeah, maybe @slaren has an idea why you didn't get a full subgraph.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

As I explained before: Qualcomm's NPU AI accelerator needs a complete or entire ggml cgraph (in other words, a complete graph of the original LLM model

What is the difference between a complete graph and a partial one? They are both graphs that QNN should be able to execute. I understand there is internal optimization, but there shouldn't be a technical difference between executing a full graph and a partial one.

I agree with your opinion that "there shouldn't be a technical difference between executing a full graph and a partial one"; that's the approach in Intel SYCL or Huawei CANN, as we can clearly see by tracing the code.

Please refer to #12326 (comment); you will understand what I mentioned once you have fully understood that tech doc.

In the Vulkan case they are handled in completely the same way, the difference is only in performance since Vulkan has to stop and restart execution if the graph is split up, which comes with an overhead.

That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution.

I haven't read the code of the Vulkan backend carefully, so I have no opinion on what you mentioned. But we can clearly see the general approach, or so-called first tech approach, in Intel SYCL or Huawei CANN (I have spent some time studying both carefully): handle op acceleration one by one, which is exactly what you described as "there shouldn't be a technical difference between executing a full graph and a partial one". We can see that the NPU performance of this approach in ggml-qnn is really bad, and Qualcomm's official approach is the second tech approach: converting/mapping a complete LLM model to a single opcfg QNN graph, then optimizing the QNN graph, and finally executing it on the NPU accordingly. Unfortunately, they provide many dedicated binary tools to do LLM model conversion, which is exactly the hard part (composing an ideal QNN graph according to the complete ggml cgraph, or mapping the complete ggml cgraph to a single opcfg QNN graph) of the second tech approach in the ggml-qnn backend.

You don't know the tech details of the ggml-qnn backend, and I don't know the tech details of the Vulkan backend or why you can get a complete graph there. Can we ask for help from the original author of the ggml backend subsystem?

Yeah, maybe @slaren has an idea why you didn't get a full subgraph.

Yes, I strongly agree with you.

@slaren
Member

slaren commented Mar 12, 2025

As @0cc4m said, if the backend supports all operations it will receive a single graph. You can verify this by changing the supports_op function to always return true.

@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

As @0cc4m said, if the backend supports all operations it will receive a single graph. You can verify this by changing the supports_op function to always return true.

I already did this verification, and the results are as follows:

  • let ggml_qnn_can_handle_op return true forcefully
    Screenshot from 2025-03-12 19-09-11
  • running the llm inference on qnn npu backend with my patch in this PR
    Screenshot from 2025-03-12 19-08-34
  • running the llm inference on qnn npu backend without my patch in this PR
    Screenshot from 2025-03-12 19-12-15

@slaren
Member

slaren commented Mar 12, 2025

You also need to use -ngl 99 to offload all layers to the backend. Since that graph is starting from layer 21, I suspect that you are not doing that.
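
For reference, an illustrative invocation (model path is a placeholder) would be something like: llama-cli -m /path/to/model.gguf -ngl 99 -p "hello", so that every layer is assigned to the offloaded backend.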

@zhouwg zhouwg closed this Mar 12, 2025
@zhouwg
Contributor Author

zhouwg commented Mar 12, 2025

Thanks, you are absolutely correct, and I have closed this PR accordingly.
