ggml-backend: refine ggml backend subsystem for mixed inference between CPU&GPU / CPU/NPU easily for some special ggml backends #7679
Purpose
This PR intends to refine the ggml backend subsystem so that special ggml backends (those for which ggml_backend_xxx_buffer_type_is_host returns true) can more easily run mixed inference between CPU & GPU or CPU & NPU.
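For context, the "is host" predicate referenced above is the property that makes these backends special: their buffers live in ordinary host memory, so tensors can be shared with the CPU backend without copies. A minimal self-contained sketch of how such a predicate could gate the mixed-inference path (mock types only; the names and the hook shape are illustrative, not the real ggml headers):

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock of a backend buffer type. The real subsystem exposes a
 * per-backend "is host" hook; this struct only models that idea. */
typedef struct {
    const char *name;
    bool (*is_host)(void); /* hypothetical hook, for illustration */
} buffer_type;

/* e.g. a QNN-style backend whose buffers are plain host memory */
static bool qnn_is_host(void)  { return true;  }
/* e.g. a discrete-GPU backend with device-only memory */
static bool cuda_is_host(void) { return false; }

/* A backend qualifies for the simple mixed CPU/GPU (or CPU/NPU)
 * path only when its buffers live in host memory, so the CPU
 * backend can read and write the same tensors directly. */
static bool can_use_simple_mixed_path(const buffer_type *buft) {
    return buft->is_host != NULL && buft->is_host();
}
```

With this check, a host-memory backend opts into the simplified path while device-memory backends keep using the full Backend Scheduler.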
The ggml backend subsystem already has a "Backend Scheduler" feature, but the Backend Scheduler is complex, is not a straightforward path for this use case, and some backend APIs do not make sense:
For example, ggml_backend_supports_op is only called in https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L406.
For example, the semantics of ggml_backend_offload_op are not reasonable.
All in all, a special backend (one for which ggml_backend_xxx_buffer_type_is_host returns true) should not need to implement all GGML ops; most of them can fall back to the default (CPU) GGML backend. This has been a long-standing problem in the ggml backend subsystem:
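The fallback idea can be sketched as a per-op dispatch decision. The mock types below are illustrative (ggml_backend_supports_op is a real API, but this loop-free dispatcher is only a model of the proposed scheme, not the PR's actual code):

```c
#include <stdbool.h>

/* Mock op codes and backend; the real subsystem queries
 * op support per node of the compute graph. */
typedef enum { OP_MUL_MAT, OP_ROPE, OP_SOFT_MAX } op_kind;

typedef struct {
    const char *name;
    bool (*supports_op)(op_kind op);
} backend;

/* An accelerator typically implements only a subset of ops... */
static bool npu_supports(op_kind op) { return op == OP_MUL_MAT; }
/* ...while the default CPU backend implements everything. */
static bool cpu_supports(op_kind op) { (void)op; return true; }

/* Run the op on the accelerator when it is supported there,
 * otherwise fall back to the default CPU backend. Because the
 * accelerator's buffers are host memory, no tensor copies are
 * needed for the fallback. */
static const backend *pick_backend(const backend *accel,
                                   const backend *cpu, op_kind op) {
    return accel->supports_op(op) ? accel : cpu;
}
```

This is why a host-memory backend only has to implement its profitable ops (e.g. large matrix multiplications) and can leave the long tail of ops to the CPU backend.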
The overall framework of the existing ggml backend subsystem is excellent, but parts of it are too strict for such special backends.
GPU/NPU computation can even be slower than CPU computation in some scenarios, once data copies and data preparation between CPU and GPU/NPU, memory size, and KV cache size are taken into account.
Pros
This PR adds less than one hundred lines of code on top of the existing ggml backend subsystem and has no side effects on existing code.
This PR works well with whisper.cpp and llama.cpp using the QNN backend, as expected, and all test cases pass on the local development machine.
The GGML QNN backend, and potentially many other GGML backends, could benefit greatly from this PR.
It is simple, straightforward, and easy to understand.
Cons
A static function in ggml.c is changed to a global function and referenced by this PR. This is not ideal, but the cost might be acceptable. A workaround would be to merge ggml-backend.c into ggml.c and ggml-backend.h into ggml.h accordingly.