[SYCL] fix set main gpu error, support single/mul gpu mode #6022
Conversation
The same concerns that I mentioned in the first review are still present here. My recommendation is to remove
Inviting @0cc4m, @airMeng, @luoyu-intel, @abhilash1910, @AidanBeltonS. The SYCL backend could support different device types, like iGPU/dGPU, CPU, and FPGA (similar to Vulkan). In the current framework, the following cases are supported:
But SYCL needs to support more cases. This PR avoids mixing iGPU and dGPU together, which would obviously reduce performance, but the current framework allows it. To cover cases 1 and 2, I think SYCL could follow the current framework. In the future, we want SYCL to support CPU (case 5) to cover the CI/unit-test requirement, so that new PRs don't break the SYCL backend. There may be more cases in the future: I guess other backends have a similar requirement to support more device types. Thank you!
How could we improve the llama.cpp framework to support these cases better? Something I would like to do in the future is remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Then the user could specify the devices that they want to use by name. For example, the user could specify to use devices
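As a rough illustration of that direction, here is a minimal, hypothetical sketch of name-based device selection through a generic registry. The struct and function names below are invented for illustration and are not the actual ggml-backend API:

// Hypothetical sketch only: a generic device registry that llama.cpp could
// query by user-provided names instead of backend-specific init code.
// None of these names are the real ggml-backend API.
#include <cstdio>
#include <string>
#include <vector>

struct device_entry {
    std::string name; // e.g. "SYCL0", "SYCL1", "CPU"
    int         id;   // backend-specific device id
};

// Stand-in for a registry that each backend would populate at startup.
static std::vector<device_entry> g_device_registry = {
    {"SYCL0", 0}, {"SYCL1", 1}, {"CPU", 0},
};

// Resolve the devices the user asked for by name (e.g. from a --device option).
static std::vector<device_entry> select_devices(const std::vector<std::string> & wanted) {
    std::vector<device_entry> out;
    for (const std::string & name : wanted) {
        for (const device_entry & dev : g_device_registry) {
            if (dev.name == name) {
                out.push_back(dev);
            }
        }
    }
    return out;
}

int main() {
    for (const device_entry & dev : select_devices({"SYCL0", "SYCL1"})) {
        printf("using device %s (id %d)\n", dev.name.c_str(), dev.id);
    }
}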
Yes, I think it's a good solution. When do you plan to implement it? Could the SYCL backend wait for this new solution, or should we fix the main-gpu issue with this PR and update it when the common solution is ready?
LGTM! I think it is good to merge this PR since it fixes a breaking issue; we can later modify it based on the generic ggml backend initialization.
It's going to take a while; it's not a priority at the moment. It is OK to use a temporary solution for now.
One necessary fix and two small nits; otherwise it looks fine.
static int convert_backend_index(std::string & backend) {
    if (backend == "ext_oneapi_level_zero:gpu") return 0;
    if (backend == "opencl:gpu") return 1;
    if (backend == "opencl:cpu") return 2;
    if (backend == "opencl:acc") return 3;
    printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
    GGML_ASSERT(false);
}
The current approach fails on NVIDIA and AMD targets.
Suggested change:

static int convert_backend_index(std::string & backend) {
    if (backend == "ext_oneapi_level_zero:gpu") return 0;
    if (backend == "opencl:gpu") return 1;
    if (backend == "opencl:cpu") return 2;
    if (backend == "opencl:acc") return 3;
    if (backend == "ext_oneapi_cuda:gpu") return 4;
    if (backend == "ext_oneapi_hip:gpu") return 5;
    printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
    GGML_ASSERT(false);
}
Yes, this is because the GPU branches for AMD/NVIDIA have not been handled yet.
I want to focus on supporting Intel GPUs only in this PR.
I can't make sure the above code works well with other vendors' GPUs, because I have no hardware environment to test them.
I suggest you create a new PR based on this one to support other vendors' GPUs.
Here, the index affects the order in the device list; lower means higher priority (see the sketch after this comment).
I suggest changing these to 2, 3 and moving CPU and ACC to 4, 5 in the new PR.
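To illustrate the ordering described above, here is a minimal sketch (illustrative only, not the PR's actual code; the struct is a simplified stand-in for the device info used in ggml-sycl) of sorting a device list by backend priority index first and max compute units second:

// Illustrative sketch: order devices by backend priority (lower index first),
// then by max compute units (higher first). Simplified stand-in types only.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct sycl_device_info {
    int         id;                 // device id shown in the device list
    std::string backend;            // e.g. "ext_oneapi_level_zero:gpu"
    int         backend_index;      // result of convert_backend_index()
    int         max_compute_units;  // reported by the SYCL runtime
};

static void sort_device_list(std::vector<sycl_device_info> & devs) {
    std::sort(devs.begin(), devs.end(),
        [](const sycl_device_info & a, const sycl_device_info & b) {
            if (a.backend_index != b.backend_index) {
                return a.backend_index < b.backend_index;     // level_zero:gpu before opencl:*
            }
            return a.max_compute_units > b.max_compute_units; // stronger device first
        });
}

int main() {
    std::vector<sycl_device_info> devs = {
        {0, "opencl:cpu",                2,  32},
        {1, "ext_oneapi_level_zero:gpu", 0, 512},
        {2, "opencl:gpu",                1, 448},
    };
    sort_device_list(devs);
    for (const sycl_device_info & d : devs) {
        printf("device %d: %s (max compute units: %d)\n", d.id, d.backend.c_str(), d.max_compute_units);
    }
}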
Co-authored-by: AidanBeltonS <[email protected]>
Got it! Thank you!
@ggerganov Could you review this PR?
Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov Some CI cases passed before, but now they always fail. I rebased and will check CI again! Thank you!
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs (ggml-ci)
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism (tested to work correctly with both `main` and `parallel` examples)
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism; default increased to 4 (from 2). Changing this value may improve performance for some systems, but increases memory usage.
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-causal attention
* llama : do not limit n_batch to n_ctx with non-causal attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048

Co-authored-by: Francis Couture-Harpin <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* metal : build metallib + fix embed path (ggml-ci)
* metal : fix embed build + update library load logic (ggml-ci)
* metal : fix embedded library build (ggml-ci)
* ci : fix iOS builds to use embedded library
* Refactor dtype handling to be extensible. The code is equivalent to before, but now it is prepared to easily add more NumPy dtypes.
* Add support for I8, I16 and I32. These types are allowed in the GGUF specification.
* Add support for I8, I16 and I32 to gguf_writer
* Add support for I8, I16, I32 to gguf_reader
…rg#6037)
* attempt to reduce the impact of a worst-case scenario
* fragmentation calculation fix
* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
…org#6047)
- increase time out for server
- do not fail fast
Co-authored-by: Jian Liao <[email protected]>
* additional methods to read model and ctx parameters
* vocab size as a part of the model metadata
* models without vocabulary, convert.py part
* models without vocabulary, llama.cpp part
* PR clean up
* converter script fixes
* llama_vocab_type update (renamed the new key)
* pr review fixes
* revert function renaming
* one more NoVocab assert
There are several places where a gguf context is allocated, and a call to gguf_free is missing in some error paths. Also, on Linux, llama-bench was missing an fclose.
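For context, the leak pattern being fixed looks roughly like this. This is a minimal sketch against the public gguf API, not the actual patched code; load_metadata is a made-up helper, and the header location of the gguf functions may vary by version:

// Sketch of the leak pattern: the gguf context must also be freed on early
// error returns, not only on the success path. load_metadata is hypothetical.
#include "ggml.h" // declares gguf_init_from_file, gguf_get_n_tensors, gguf_free

static bool load_metadata(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ false, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == NULL) {
        return false;
    }
    if (gguf_get_n_tensors(ctx) == 0) {
        gguf_free(ctx); // without this, the early return leaks the context
        return false;
    }
    // ... read key/value metadata here ...
    gguf_free(ctx);
    return true;
}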
Closing it to rebase and submit again.
I agree with your idea: remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Qualcomm's QNN SDK is a good reference.
Yes. This refactor is ongoing, but it needs more time.
Fix the error when setting the main GPU to a non-zero value.
Add new APIs, ggml_backend_sycl_set_single_device_mode() and ggml_backend_sycl_set_mul_device_mode(), to handle single/multiple cards according to the split-mode (a usage sketch follows this list).
In CI, enable ggml_backend_sycl_set_mul_device_mode() as the default.
If split-mode == layer (default), use all GPUs with the top max compute unit count.
Else, use the main GPU set by the user as the only device. It supports level_zero:gpu and opencl:gpu.
Refactor the displayed device list:
Sort the devices by type (level_zero, opencl:gpu, opencl:cpu, opencl:acc) and max compute units.
Support switching between single/multiple cards by setting the device id as a parameter.
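A hedged sketch of how llama.cpp could drive these two modes from the split-mode setting. The function signatures, the header name, and the enum value are assumed from the description above and from llama.h, and may not match the PR exactly:

// Sketch only: selecting the SYCL device mode based on split-mode.
// Signatures of the two ggml_backend_sycl_* functions are assumed here.
#include "ggml-sycl.h" // assumed to declare the two mode-setting functions
#include "llama.h"     // for enum llama_split_mode

static void init_sycl_device_mode(enum llama_split_mode split_mode, int main_gpu) {
    if (split_mode == LLAMA_SPLIT_MODE_LAYER) {
        // default: use all GPUs with the top max compute unit count
        ggml_backend_sycl_set_mul_device_mode();
    } else {
        // single-device mode: only the user-selected main GPU is used
        // (level_zero:gpu and opencl:gpu devices are supported)
        ggml_backend_sycl_set_single_device_mode(main_gpu);
    }
}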