ggml-backend: refine ggml backend subsystem for mixed inference between CPU&GPU / CPU/NPU easily for some special ggml backends #7679
Purpose
This PR intends to refine the ggml backend subsystem so that special ggml backends (those for which ggml_backend_xxx_buffer_type_is_host returns true) can more easily run mixed inference between CPU & GPU or CPU & NPU.
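For context, the "is host" predicate referenced above is the property that makes these backends special: their buffers live in ordinary host memory, so tensors can be shared with the CPU backend without copies. A minimal self-contained sketch of how such a predicate could gate the mixed-inference path (mock types only; the names and the hook shape are illustrative, not the real ggml headers):

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock of a backend buffer type. The real subsystem exposes a
 * per-backend "is host" hook; this struct only models that idea. */
typedef struct {
    const char *name;
    bool (*is_host)(void); /* hypothetical hook, for illustration */
} buffer_type;

/* e.g. a QNN-style backend whose buffers are plain host memory */
static bool qnn_is_host(void)  { return true;  }
/* e.g. a discrete-GPU backend with device-only memory */
static bool cuda_is_host(void) { return false; }

/* A backend qualifies for the simple mixed CPU/GPU (or CPU/NPU)
 * path only when its buffers live in host memory, so the CPU
 * backend can read and write the same tensors directly. */
static bool can_use_simple_mixed_path(const buffer_type *buft) {
    return buft->is_host != NULL && buft->is_host();
}
```

With this check, a host-memory backend opts into the simplified path while device-memory backends keep using the full Backend Scheduler.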
The ggml backend subsystem already has a "Backend Scheduler" feature, but the Backend Scheduler is complex, is not a straightforward path for this use case, and some backend APIs do not make sense:
For example, ggml_backend_supports_op is only called in https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L406.
For example, the semantics of ggml_backend_offload_op are not reasonable.
All in all, a special backend (one for which ggml_backend_xxx_buffer_type_is_host returns true) should not need to implement all GGML ops; most of them can fall back to the default (CPU) GGML backend. This has been a long-standing problem in the ggml backend subsystem:
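The fallback idea can be sketched as a per-op dispatch decision. The mock types below are illustrative (ggml_backend_supports_op is a real API, but this loop-free dispatcher is only a model of the proposed scheme, not the PR's actual code):

```c
#include <stdbool.h>

/* Mock op codes and backend; the real subsystem queries
 * op support per node of the compute graph. */
typedef enum { OP_MUL_MAT, OP_ROPE, OP_SOFT_MAX } op_kind;

typedef struct {
    const char *name;
    bool (*supports_op)(op_kind op);
} backend;

/* An accelerator typically implements only a subset of ops... */
static bool npu_supports(op_kind op) { return op == OP_MUL_MAT; }
/* ...while the default CPU backend implements everything. */
static bool cpu_supports(op_kind op) { (void)op; return true; }

/* Run the op on the accelerator when it is supported there,
 * otherwise fall back to the default CPU backend. Because the
 * accelerator's buffers are host memory, no tensor copies are
 * needed for the fallback. */
static const backend *pick_backend(const backend *accel,
                                   const backend *cpu, op_kind op) {
    return accel->supports_op(op) ? accel : cpu;
}
```

This is why a host-memory backend only has to implement its profitable ops (e.g. large matrix multiplications) and can leave the long tail of ops to the CPU backend.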
The overall framework of the existing ggml backend subsystem is excellent, but parts of it are too strict for such special backends.
GPU/NPU computation can even be slower than CPU computation in some scenarios, once data copies and data preparation between CPU and GPU/NPU, memory size, and KV cache size are taken into account.
Pros
This PR adds less than one hundred lines of code on top of the existing ggml backend subsystem and has no side effects on existing code.
This PR works well with whisper.cpp and llama.cpp using the QNN backend, as expected, and all test cases pass on the local development machine.
The GGML QNN backend, and potentially many other GGML backends, could benefit greatly from this PR.
It is simple, straightforward, and easy to understand.
Cons
A static function in ggml.c is changed to a global function and referenced by this PR. This is not ideal, but the cost might be acceptable. A workaround would be to merge ggml-backend.c into ggml.c and ggml-backend.h into ggml.h accordingly.