ggml: offload the entire cgraph to a specified backend #12342
Vulkan already offloads the entire (sub)graph and I think CUDA does something similar with the CUDA graphs feature. There are no code changes to the backend system required for that, you just trigger the graph execution on the first node that is a part of the graph, and wait for it to finish on the last node. Some performance optimizations happened that split it up into multiple graphs to allow earlier submissions, but otherwise it works as I described. |
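For readers following the thread: a minimal sketch of the pattern described above, assuming hypothetical record_node / submit_and_wait helpers in place of the real Vulkan command-buffer plumbing. The backend queues every node of the cgraph it receives, submits once, and waits only at the end:

```c
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-impl.h" // struct ggml_cgraph definition, as backend code uses it

// record_node / submit_and_wait are hypothetical stand-ins, not real API
void record_node(ggml_backend_t backend, struct ggml_tensor * node);
void submit_and_wait(ggml_backend_t backend);

static enum ggml_status example_graph_compute(ggml_backend_t backend,
                                              struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        record_node(backend, cgraph->nodes[i]); // queue work, do not wait yet
    }
    submit_and_wait(backend); // one submission, one wait at the last node
    return GGML_STATUS_SUCCESS;
}
```

The performance optimizations mentioned above would simply submit parts of the graph earlier instead of once at the end.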
IIRC this is actually a deprecated feature in ggml. When working with #12322 I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend. So probably we don't need a new API for this? (I don't have a strong opinion on this, just FYI) |
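To make that concrete, here is a small self-contained sketch (using the CPU backend as a stand-in for any single backend) of computing a cgraph without the scheduler: ggml_backend_graph_compute hands the whole graph to the one backend you pass in.

```c
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"

int main(void) {
    struct ggml_init_params params = { .mem_size = 16*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    for (int i = 0; i < 8; i++) {
        ((float *) a->data)[i] = 1.0f;
        ((float *) b->data)[i] = 2.0f;
    }

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, ggml_add(ctx, a, b));

    // no ggml_backend_sched involved: the entire cgraph goes to this backend
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_graph_compute(backend, gf);

    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```

With ggml_backend_sched in the picture, the scheduler may instead split the graph across backends based on which ops each backend reports as supported.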
1. I don't know the details of the Vulkan backend; I can already see there is a debug statement in ggml_backend_vk_graph_compute.
2. A WIP Qualcomm QNN backend needs this feature because of Qualcomm's dedicated AI tech: the entire ggml cgraph has to be converted to a single opcfg QNN graph, which is then optimized on the QNN-CPU / QNN-NPU backend accordingly; the details can be found at: #12326 (comment). We can clearly see that there are only 1 or 2 graph nodes in static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph), so the current so-called second technical approach ("mapping the entire cgraph to a QNN graph") falls back to the first technical approach, which is similar to the Intel sycl backend or the Huawei cann backend. We call this the first tech approach in the ggml-qnn backend, and its performance is really bad because Qualcomm's hardware accelerators are significantly different from Intel sycl or Huawei cann.
All these testcases work fine as expected. |
I know you don't know the details of the Vulkan backend, that's why I'm telling you about it. You get the cgraph in the backend's graph_compute function. If the cgraph you receive contains only a few nodes, that's because your backend doesn't report support for all of the ops. |
Yes, it seems the existing ggml backend subsystem can offload the entire cgraph to a specified backend. Unfortunately, that is not what happens in the ggml-qnn backend, pls refer to: #12326 (comment). I personally think it is also not what happens in the Intel sycl backend or the Huawei cann backend; we can add a simple debug statement in the corresponding function to check (see the sketch after this comment). The root cause is that the original author introduced a standout and necessary feature, the "backend scheduler", in the ggml backend subsystem. In other words, your opinion "I realized that without using backend sched, the cgraph will be offloaded completely to a specific backend" is absolutely correct, so my patch in this PR is very simple:
|
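A sketch of the kind of one-line debug statement meant here, placed at the top of a backend's graph_compute callback (the real function name varies per backend, e.g. ggml_backend_vk_graph_compute or ggml_backend_qnn_graph_compute); the logging macro is assumed to be GGML_LOG_DEBUG from ggml-impl.h:

```c
#include "ggml-impl.h" // GGML_LOG_DEBUG, struct ggml_cgraph
#include "ggml-backend.h"

static enum ggml_status ggml_backend_xxx_graph_compute(ggml_backend_t backend,
                                                       struct ggml_cgraph * cgraph) {
    // how many nodes did the scheduler actually hand to this backend?
    GGML_LOG_DEBUG("%s: received a cgraph with %d node(s)\n", __func__, cgraph->n_nodes);
    // ... the backend's normal node dispatch would follow here ...
    GGML_UNUSED(backend);
    return GGML_STATUS_SUCCESS;
}
```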
Thanks for your kind reminder, and I understand what you said.
As I explained before: Qualcomm's NPU AI accelerator needs a complete/entire ggml cgraph (in other words, a complete graph of the original LLM model; this is my personal understanding, and correction from Qualcomm's experts is greatly appreciated), which is then converted/mapped to a single opcfg QNN graph that is optimized accordingly. I/we call this the second tech approach to NPU inference on Qualcomm's mobile or desktop SoCs, and I/we call the general approach in Intel sycl or Huawei cann the first tech approach. The NPU performance of ggml-qnn through the first tech approach is really bad (much slower than the default cpu backend); this is significantly different from Intel sycl or Huawei cann. I guess the reason is that Qualcomm's AI accelerator is not a general/common hardware accelerator, or that there are some tricks in Qualcomm's QNN SDK (they have a world-class Hexagon NPU, and the QNN SDK, one of Qualcomm's various AI sw stacks, can't utilize it maximally if programmers don't know how to use the C API in the QNN SDK correctly). All in all, we can ask for help from the authors of Intel sycl or Huawei cann, or from the original author of the ggml backend subsystem. |
That just means there is a different problem with your backend. Usually the scheduler will give you a complete subgraph if you support all ops, for example on Vulkan I get:
That is also what Vulkan needs and is already doing without any changes to the backend system. You can already do that. I'm not familiar enough with your backend to know why it's currently not working, but overriding the scheduler is not the right solution. |
Qualcomm's NPU backend needs a real complete graph, not a sub-graph.
Yes, you are correct, and I don't want this patch either, but this PR is strongly required for the WIP ggml-qnn backend; otherwise it's not a practical approach, because it will fall back to the general approach, the so-called first tech approach. You don't know the tech details of the ggml-qnn backend, and I also don't know the tech details of the vulkan backend or why you can get a complete graph there; can we ask for help from the original author of the ggml backend subsystem? |
What is the difference between a complete graph and a partial one? They are both graphs that QNN should be able to execute. I understand there is internal optimization, but there shouldn't be a technical difference between executing a full graph and a partial one. In the Vulkan case they are handled in completely the same way, the difference is only in performance since Vulkan has to stop and restart execution if the graph is split up, which comes with an overhead.
Yeah, maybe @slaren has an idea why you didn't get a full subgraph. |
I agree with your opinion that "there shouldn't be a technical difference between executing a full graph and a partial one"; that's the approach in Intel sycl or Huawei cann, as we can clearly see by tracking the code. Pls refer to: #12326 (comment).
I haven't read the code of the vulkan backend carefully, so I have no opinion on what you mentioned. But we can clearly see the general approach, the so-called first tech approach, in Intel sycl and Huawei cann (I have spent some time studying Intel sycl and Huawei cann carefully): handle op acceleration one by one, which is just what you mentioned ("there shouldn't be a technical difference between executing a full graph and a partial one"). We can see that the NPU performance of this approach in ggml-qnn is really bad. Qualcomm's official approach is the second tech approach: convert/map a complete LLM model to a single opcfg QNN graph, optimize the QNN graph, and finally execute it on the NPU accordingly. Unfortunately, they provide many dedicated binary tools to do the LLM model conversion, which is exactly the hard work (composing an ideal QNN graph according to the complete ggml cgraph, i.e., mapping the complete ggml cgraph to a single opcfg QNN graph) in the second tech approach of the ggml-qnn backend.
Yes, I strongly agree with you. |
As @0cc4m said, if the backend supports all operations it will receive a single graph. You can verify this by changing the backend's supports_op function to report every op as supported. |
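A hedged sketch of that experiment, assuming the usual backend device-interface shape (the real QNN function name and signature may differ): make supports_op claim support for everything, then count the nodes that arrive in graph_compute.

```c
#include "ggml.h"
#include "ggml-backend.h"

// experiment only: claim support for every op so ggml_backend_sched has no
// reason to split the graph; the function name here is illustrative
static bool ggml_backend_qnn_device_supports_op(ggml_backend_dev_t dev,
                                                const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    GGML_UNUSED(op);
    return true;
}
```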
I already did this verification, and the result is as follows: |
You also need to use |
Thanks, you are absolutely correct, and I have already closed this PR accordingly. |
This PR provides a concise approach to offloading the entire ggml cgraph to a specified backend, with no side effects on any of the existing backends.
This PR has been verified in my forked llama.cpp project and works fine as expected.
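The diff itself is not reproduced in this thread. Purely as an illustration of the shape such an API could take (ggml_backend_offload_cgraph is an invented name, not the actual patch), it amounts to bypassing ggml_backend_sched and handing the whole cgraph to one backend:

```c
#include "ggml-backend.h"

// hypothetical illustration only, not the actual patch from this PR
enum ggml_status ggml_backend_offload_cgraph(ggml_backend_t backend,
                                             struct ggml_cgraph * cgraph) {
    // bypass ggml_backend_sched entirely: the specified backend receives the
    // whole graph in a single graph_compute call
    return ggml_backend_graph_compute(backend, cgraph);
}
```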
This feature would be very helpful for a WIP PR that maps the entire ggml cgraph to a QNN graph. It seems to be bad news for my formal third PR #12326, but that doesn't matter 🤗, and I'd like to see a similar PR from others succeed in this great tech community, although that implementation hides many tech details behind much more complicated encapsulation.
I personally hope this PR can be helpful for that WIP PR, because I have paid no further attention to Qualcomm's ggml-qnn backend since 03/12/2025 (03/29 might be a better date; it seems I have been back on github and in this great tech community since 01/29/2024, and that's enough).
This feature will/might also bring some unexpected help to Intel's sycl or Huawei's cann backend, along the lines of the second tech approach in the WIP Qualcomm ggml-qnn backend; many advanced or state-of-the-art AI technologies could then be imported into this great project.
Relevant tech details can be found at: #12326 (comment)
@slaren, could you help review this PR? It would be very helpful for a WIP PR (mapping the entire ggml cgraph to Qualcomm's QNN NPU backend, so that the specified backend can do some special hardware-dependent optimizations). The function name or its position might be inappropriate; I'll adjust it according to your review comments.