
[SYCL] fix set main gpu error, support single/mul gpu mode #6022

Closed
wants to merge 31 commits into from

Conversation

NeoZhangJianyu
Collaborator

  1. Fix the error when setting the main GPU to a non-zero index.

  2. Add new APIs, ggml_backend_sycl_set_single_device_mode() and ggml_backend_sycl_set_mul_device_mode(), to handle single/multiple cards via split-mode (a usage sketch follows this list).
    In CI, ggml_backend_sycl_set_mul_device_mode() is enabled by default.
    If split-mode == layer (the default), all GPUs with the top max compute-unit count are used.
    Otherwise, only the main GPU set by the user is used; level_zero:gpu and opencl:gpu are supported.

  3. Refactor the displayed device list.

  • Change the order of the displayed device list:
    sort the devices by type (level_zero, opencl:gpu, opencl:cpu, opencl:acc) and then by max compute units.
  • Add a type column to match the output style of the sycl-ls tool.
found 6 SYCL devices:
|  |                  |                                             |compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    53651849216|
| 2|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 3|    [opencl:gpu:1]|                    Intel(R) UHD Graphics 770|       3.0|         32|     512|     32|    53651849216|
| 4|    [opencl:cpu:0]|         13th Gen Intel(R) Core(TM) i7-13700K|       3.0|         24|    8192|     64|    67064815616|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         24|67108864|     64|    67064815616|
  4. Update examples/sycl/run-llama2.sh:
    support switching between single and multiple cards by passing the device ID as a parameter.
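
For reference, here is a minimal usage sketch of the two mode-selection calls from item 2. The prototypes and the caller-side helper below are assumptions based only on the function names in this PR description, not the final API:

// Sketch only: prototypes assumed from the function names above; the real
// declarations are the ones this PR adds to the SYCL backend header.
extern "C" void ggml_backend_sycl_set_single_device_mode(int main_gpu_id);
extern "C" void ggml_backend_sycl_set_mul_device_mode(void);

// Hypothetical caller-side helper: pick the device mode from the split-mode setting.
static void select_sycl_device_mode(bool split_mode_is_layer, int main_gpu) {
    if (split_mode_is_layer) {
        // default: use all GPUs of the type with the top max compute-unit count
        ggml_backend_sycl_set_mul_device_mode();
    } else {
        // single-card mode: use only the GPU selected by the user via --main-gpu
        ggml_backend_sycl_set_single_device_mode(main_gpu);
    }
}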

@slaren
Member

slaren commented Mar 12, 2024

The same concerns that I mentioned in the first review are still present here. My recommendation is to remove ggml_backend_sycl_set_single_device_mode, ggml_backend_sycl_get_device_index, ggml_backend_sycl_set_mul_device_mode entirely and figure a way to solve this problem that doesn't require adding more backend-specific functions.

@NeoZhangJianyu
Collaborator Author

NeoZhangJianyu commented Mar 13, 2024

The same concerns that I mentioned in the first review are still present here. My recommendation is to remove ggml_backend_sycl_set_single_device_mode, ggml_backend_sycl_get_device_index, ggml_backend_sycl_set_mul_device_mode entirely and figure a way to solve this problem that doesn't require adding more backend-specific functions.

Inviting @0cc4m, @airMeng, @luoyu-intel, @abhilash1910, @AidanBeltonS.

The SYCL backend can support several device types, such as iGPU/dGPU, CPU, and FPGA (similar to Vulkan).
It needs to support all of them, including their special cases.
In the current common framework, we can only modify the SYCL code, which limits how good a solution we can provide.
This PR introduces a new method to support the new cases with minimal impact on the common code.
If llama.cpp later provides a better architectural solution, SYCL can adopt the common solution.

In the current framework, the following cases are supported:

  1. 1 dGPU or 1 iGPU
  2. 2+ dGPUs

But SYCL needs to support more cases:
3. 1 iGPU + 1 dGPU
4. 1 iGPU + 2+ dGPUs
The user can choose either the iGPU or a dGPU in cases 3 and 4, or choose the 2+ dGPUs in case 4.

This PR avoids mixing the iGPU and dGPUs together, which would clearly reduce performance, even though the current framework allows it.
In the multi-card case, SYCL will choose all GPUs with the same compute power instead of mixing all GPUs.
In the single-card case, SYCL can select either the iGPU or a dGPU.
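
Below is a minimal sketch of this "same compute power" selection policy, assuming DPC++/oneAPI and using only standard SYCL device queries; it is illustrative, not this PR's actual implementation (a real version would also group by backend type, as the device-list sorting in this PR does):

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep only the GPUs whose max compute-unit count equals the top value,
// so an iGPU is never mixed with a dGPU.
static std::vector<sycl::device> pick_top_gpus() {
    auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);

    uint32_t top_cu = 0;
    for (const auto & dev : gpus) {
        top_cu = std::max<uint32_t>(top_cu, dev.get_info<sycl::info::device::max_compute_units>());
    }

    std::vector<sycl::device> picked;
    for (const auto & dev : gpus) {
        if (dev.get_info<sycl::info::device::max_compute_units>() == top_cu) {
            picked.push_back(dev);
        }
    }
    return picked;
}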

To cover cases 1 and 2, I think SYCL can follow the current framework.
But for cases 3 and 4 there is no existing solution.
That's why we are trying a new solution and new GGML APIs for them.

In the future, we want SYCL to also support the CPU (case 5) to cover CI/unit-test requirements, so that new PRs do not break the SYCL backend:
5. CPU

More cases in the future:
6. iGPU + CPU + FPGA
7. iGPU + CPU + dGPU
either mixing different device types or not.

I guess other backends have similar requirements to support more device types.
We hope llama.cpp provides a common solution, so that we can follow it and reduce the cost.

Thank you!

@slaren
Member

slaren commented Mar 13, 2024

How could we improve the llama.cpp framework to support these cases better?

Something I would like to do in the future is remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Then the user could specify the devices that they want to use by name. For example, the user could specify to use devices cpu, sycl_igpu0 and sycl_dgpu0 to select CPU, iGPU and dGPU.
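
Purely as an illustration of this direction, here is a hypothetical sketch of name-based device selection through a generic registry; none of these types or functions exist in ggml today, they only show the proposed idea:

#include <string>
#include <vector>

// Hypothetical registry entry: a human-readable device name mapped to a backend device index.
struct backend_device_entry {
    std::string name;   // e.g. "cpu", "sycl_igpu0", "sycl_dgpu0" (illustrative names only)
    int         index;  // backend-specific device index
};

// A registry that the backends would populate at startup (hypothetical).
static std::vector<backend_device_entry> g_device_registry = {
    {"cpu", 0}, {"sycl_dgpu0", 0}, {"sycl_igpu0", 1},
};

// Resolve a user selection such as "cpu,sycl_dgpu0" into device indices by name.
static std::vector<int> resolve_devices(const std::vector<std::string> & requested) {
    std::vector<int> selected;
    for (const auto & want : requested) {
        for (const auto & entry : g_device_registry) {
            if (entry.name == want) {
                selected.push_back(entry.index);
            }
        }
    }
    return selected;
}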

@NeoZhangJianyu
Collaborator Author

Yes, I think it's a good solution.

When do you plan to implement it?

Could the SYCL backend wait for this new solution? Or should we fix the set-main-gpu issue with this PR and update it once the common solution is ready?

Collaborator

@abhilash1910 abhilash1910 left a comment

LGTM! I think it is good to merge this PR since it fixes a breaking issue; we can later modify it based on the generic ggml backend initialization.

@slaren
Member

slaren commented Mar 13, 2024

Could the SYCL backend wait for this new solution? Or should we fix the set-main-gpu issue with this PR and update it once the common solution is ready?

It's going to take a while; it's not a priority at the moment. It is ok to use a temporary solution for now.

Contributor

@AidanBeltonS AidanBeltonS left a comment

One necessary fix and two small nits; otherwise it looks fine.

Comment on lines +972 to +979
static int convert_backend_index(std::string & backend) {
if (backend == "ext_oneapi_level_zero:gpu") return 0;
if (backend == "opencl:gpu") return 1;
if (backend == "opencl:cpu") return 2;
if (backend == "opencl:acc") return 3;
printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
GGML_ASSERT(false);
}
Contributor

The current approach fails on NVIDIA and AMD targets.

Suggested change
static int convert_backend_index(std::string & backend) {
if (backend == "ext_oneapi_level_zero:gpu") return 0;
if (backend == "opencl:gpu") return 1;
if (backend == "opencl:cpu") return 2;
if (backend == "opencl:acc") return 3;
printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
GGML_ASSERT(false);
}
static int convert_backend_index(std::string & backend) {
if (backend == "ext_oneapi_level_zero:gpu") return 0;
if (backend == "opencl:gpu") return 1;
if (backend == "opencl:cpu") return 2;
if (backend == "opencl:acc") return 3;
if (backend == "ext_oneapi_cuda:gpu") return 4;
if (backend == "ext_oneapi_hip:gpu") return 5;
printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
GGML_ASSERT(false);
}

Collaborator

Yes, this is because the GPU branches for AMD/NVIDIA have not been handled yet.

Collaborator Author

I want to focus on supporting Intel GPUs only in this PR.
I can't guarantee the above code works well with other vendors' GPUs, because I have no hardware environment to test them.

I suggest you create a new PR based on this PR to support other vendors' GPUs.

Here, the index determines the order in the device list; lower means higher priority.
I suggest using 2 and 3 for the new GPU backends and moving CPU and ACC to 4 and 5 in the new PR (see the sketch below).
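
For illustration, the suggested reordering for a follow-up PR might look like the sketch below; it reuses the same helpers as the snippet under review and is not part of this change:

// GPU backends keep the lower indices so they sort ahead of CPU and accelerator devices.
static int convert_backend_index(std::string & backend) {
    if (backend == "ext_oneapi_level_zero:gpu") return 0;
    if (backend == "opencl:gpu")                return 1;
    if (backend == "ext_oneapi_cuda:gpu")       return 2;
    if (backend == "ext_oneapi_hip:gpu")        return 3;
    if (backend == "opencl:cpu")                return 4;
    if (backend == "opencl:acc")                return 5;
    printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
    GGML_ASSERT(false);
}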

NeoZhangJianyu and others added 2 commits March 14, 2024 09:56
Co-authored-by: AidanBeltonS <[email protected]>
Co-authored-by: AidanBeltonS <[email protected]>
@NeoZhangJianyu
Collaborator Author

Could the SYCL backend wait for this new solution? Or should we fix the set-main-gpu issue with this PR and update it once the common solution is ready?

It's going to take a while; it's not a priority at the moment. It is ok to use a temporary solution for now.

Got it! Thank you!

@NeoZhangJianyu
Collaborator Author

@ggerganov Could you review this PR?

NeoZhangJianyu and others added 4 commits March 14, 2024 20:44
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
@NeoZhangJianyu
Collaborator Author

NeoZhangJianyu commented Mar 15, 2024

@ggerganov Some CI cases used to pass, but now they always fail.
In fact, this PR does not affect the failing cases.

I rebased and will check CI again!

Thank you!

slaren and others added 7 commits March 15, 2024 14:10
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
wanix1988 and others added 16 commits March 15, 2024 14:10
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embeded library build

ggml-ci

* ci : fix iOS builds to use embedded library
* Refactor dtype handling to be extensible

This code is equivalent as before, but now it is prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
…rg#6037)

* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.
@NeoZhangJianyu
Collaborator Author

Closing this PR to rebase and submit again.

@zhouwg
Contributor

zhouwg commented Apr 11, 2024

How could we improve the llama.cpp framework to support these cases better?

Something I would like to do in the future is remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Then the user could specify the devices that they want to use by name. For example, the user could specify to use devices cpu, sycl_igpu0 and sycl_dgpu0 to select CPU, iGPU and dGPU.

I agree with your idea: remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Qualcomm's QNN SDK is a good reference.

@NeoZhangJianyu
Collaborator Author

Yes. This refactor is ongoing, but it needs more time.
