[SYCL] fix set main gpu error, support single/mul gpu mode #6022
Conversation
The same concerns that I mentioned in the first review are still present here. My recommendation is to remove
Inviting @0cc4m, @airMeng, @luoyu-intel, @abhilash1910, @AidanBeltonS. The SYCL backend could support different device types, like iGPU/dGPU, CPU, and FPGA (similar to Vulkan). In the current framework, the following cases are supported:
But SYCL needs to support more cases. This PR avoids mixing iGPU and dGPU together, which would obviously reduce performance, but the current framework allows it. To cover cases 1 and 2, I think SYCL could follow the current framework. In the future, we want SYCL to support CPU (case 5) to cover the CI/unit-test requirement, so that new PRs don't break the SYCL backend. There may be more cases in the future: I guess other backends have a similar requirement to support more device types. Thank you!
How could we improve the llama.cpp framework to support these cases better? Something I would like to do in the future is remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Then the user could specify the devices that they want to use by name. For example, the user could specify to use devices
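As a rough illustration of that direction, here is a minimal, hypothetical sketch of name-based device selection through a generic registry. The struct and function names below are invented for illustration and are not the actual ggml-backend API:

// Hypothetical sketch only: a generic device registry that llama.cpp could
// query by user-provided names instead of backend-specific init code.
// None of these names are the real ggml-backend API.
#include <cstdio>
#include <string>
#include <vector>

struct device_entry {
    std::string name; // e.g. "SYCL0", "SYCL1", "CPU"
    int         id;   // backend-specific device id
};

// Stand-in for a registry that each backend would populate at startup.
static std::vector<device_entry> g_device_registry = {
    {"SYCL0", 0}, {"SYCL1", 1}, {"CPU", 0},
};

// Resolve the devices the user asked for by name (e.g. from a --device option).
static std::vector<device_entry> select_devices(const std::vector<std::string> & wanted) {
    std::vector<device_entry> out;
    for (const std::string & name : wanted) {
        for (const device_entry & dev : g_device_registry) {
            if (dev.name == name) {
                out.push_back(dev);
            }
        }
    }
    return out;
}

int main() {
    for (const device_entry & dev : select_devices({"SYCL0", "SYCL1"})) {
        printf("using device %s (id %d)\n", dev.name.c_str(), dev.id);
    }
}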
Yes, I think it's a good solution. When do you plan to implement it? Could the SYCL backend wait for this new solution, or should we fix the main-gpu issue with this PR and update it when the common solution is ready?
LGTM! I think it is good to merge this PR since it fixes a breaking issue; we can later modify it based on the generic ggml backend initialization.
It's going to take a while; it's not a priority at the moment. It is OK to use a temporary solution for now.
One necessary fix and two small nits; otherwise it looks fine.
static int convert_backend_index(std::string & backend) {
    if (backend == "ext_oneapi_level_zero:gpu") return 0;
    if (backend == "opencl:gpu") return 1;
    if (backend == "opencl:cpu") return 2;
    if (backend == "opencl:acc") return 3;
    printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
    GGML_ASSERT(false);
}
The current approach fails on NVIDIA and AMD targets.
Suggested change:

static int convert_backend_index(std::string & backend) {
    if (backend == "ext_oneapi_level_zero:gpu") return 0;
    if (backend == "opencl:gpu") return 1;
    if (backend == "opencl:cpu") return 2;
    if (backend == "opencl:acc") return 3;
    if (backend == "ext_oneapi_cuda:gpu") return 4;
    if (backend == "ext_oneapi_hip:gpu") return 5;
    printf("convert_backend_index: can't handle backend=%s\n", backend.c_str());
    GGML_ASSERT(false);
}
Yes, this is because the GPU branches for AMD/NVIDIA have not been handled yet.
I want to focus on supporting Intel GPUs only in this PR.
I can't make sure the above code works well with other vendors' GPUs, because I have no hardware environment to test them.
I suggest you create a new PR based on this one to support other vendors' GPUs.
Here, the index affects the order in the device list; lower means higher priority (see the sketch after this comment).
I suggest changing these to 2, 3 and moving CPU and ACC to 4, 5 in the new PR.
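To illustrate the ordering described above, here is a minimal sketch (illustrative only, not the PR's actual code; the struct is a simplified stand-in for the device info used in ggml-sycl) of sorting a device list by backend priority index first and max compute units second:

// Illustrative sketch: order devices by backend priority (lower index first),
// then by max compute units (higher first). Simplified stand-in types only.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct sycl_device_info {
    int         id;                 // device id shown in the device list
    std::string backend;            // e.g. "ext_oneapi_level_zero:gpu"
    int         backend_index;      // result of convert_backend_index()
    int         max_compute_units;  // reported by the SYCL runtime
};

static void sort_device_list(std::vector<sycl_device_info> & devs) {
    std::sort(devs.begin(), devs.end(),
        [](const sycl_device_info & a, const sycl_device_info & b) {
            if (a.backend_index != b.backend_index) {
                return a.backend_index < b.backend_index;     // level_zero:gpu before opencl:*
            }
            return a.max_compute_units > b.max_compute_units; // stronger device first
        });
}

int main() {
    std::vector<sycl_device_info> devs = {
        {0, "opencl:cpu",                2,  32},
        {1, "ext_oneapi_level_zero:gpu", 0, 512},
        {2, "opencl:gpu",                1, 448},
    };
    sort_device_list(devs);
    for (const sycl_device_info & d : devs) {
        printf("device %d: %s (max compute units: %d)\n", d.id, d.backend.c_str(), d.max_compute_units);
    }
}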
Co-authored-by: AidanBeltonS <[email protected]>
Got it! Thank you!
@ggerganov Could you review this PR?
Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov Some CI cases passed before, but now they always fail. I rebased and will check CI again! Thank you!
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs (ggml-ci)
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism (tested to work correctly with both `main` and `parallel` examples)
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism; default increased to 4 (from 2). Changing this value may improve performance for some systems, but increases memory usage.
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-causal attention
* llama : do not limit n_batch to n_ctx with non-causal attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048

Co-authored-by: Francis Couture-Harpin <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* metal : build metallib + fix embed path (ggml-ci)
* metal : fix embed build + update library load logic (ggml-ci)
* metal : fix embedded library build (ggml-ci)
* ci : fix iOS builds to use embedded library
* Refactor dtype handling to be extensible. The code is equivalent to before, but now it is prepared to easily add more NumPy dtypes.
* Add support for I8, I16 and I32. These types are allowed in the GGUF specification.
* Add support for I8, I16 and I32 to gguf_writer
* Add support for I8, I16, I32 to gguf_reader
…rg#6037)
* attempt to reduce the impact of a worst-case scenario
* fragmentation calculation fix
* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
…org#6047)
- increase time out for server
- do not fail fast
Co-authored-by: Jian Liao <[email protected]>
* additional methods to read model and ctx parameters
* vocab size as a part of the model metadata
* models without vocabulary, convert.py part
* models without vocabulary, llama.cpp part
* PR clean up
* converter script fixes
* llama_vocab_type update (renamed the new key)
* pr review fixes
* revert function renaming
* one more NoVocab assert
There are several places where a gguf context is allocated, and a call to gguf_free is missing in some error paths. Also, on Linux, llama-bench was missing an fclose.
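For context, the leak pattern being fixed looks roughly like this. This is a minimal sketch against the public gguf API, not the actual patched code; load_metadata is a made-up helper, and the header location of the gguf functions may vary by version:

// Sketch of the leak pattern: the gguf context must also be freed on early
// error returns, not only on the success path. load_metadata is hypothetical.
#include "ggml.h" // declares gguf_init_from_file, gguf_get_n_tensors, gguf_free

static bool load_metadata(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ false, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == NULL) {
        return false;
    }
    if (gguf_get_n_tensors(ctx) == 0) {
        gguf_free(ctx); // without this, the early return leaks the context
        return false;
    }
    // ... read key/value metadata here ...
    gguf_free(ctx);
    return true;
}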
Closing it to rebase and submit again.
I agree with your idea: remove most of the backend-specific initialization code in llama.cpp and use the ggml-backend registry in a generic way. Qualcomm's QNN SDK is a good reference.
Yes. This refactor is ongoing, but it needs more time.
Fix the error when setting the main GPU to a non-zero value.
Add new APIs, ggml_backend_sycl_set_single_device_mode() and ggml_backend_sycl_set_mul_device_mode(), to handle single/multiple cards according to the split-mode (a usage sketch follows this list).
In CI, enable ggml_backend_sycl_set_mul_device_mode() as the default.
If split-mode == layer (default), use all GPUs with the top max compute unit count.
Else, use the main GPU set by the user as the only device. It supports level_zero:gpu and opencl:gpu.
Refactor the displayed device list:
Sort the devices by type (level_zero, opencl:gpu, opencl:cpu, opencl:acc) and max compute units.
Support switching between single/multiple cards by setting the device id as a parameter.
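A hedged sketch of how llama.cpp could drive these two modes from the split-mode setting. The function signatures, the header name, and the enum value are assumed from the description above and from llama.h, and may not match the PR exactly:

// Sketch only: selecting the SYCL device mode based on split-mode.
// Signatures of the two ggml_backend_sycl_* functions are assumed here.
#include "ggml-sycl.h" // assumed to declare the two mode-setting functions
#include "llama.h"     // for enum llama_split_mode

static void init_sycl_device_mode(enum llama_split_mode split_mode, int main_gpu) {
    if (split_mode == LLAMA_SPLIT_MODE_LAYER) {
        // default: use all GPUs with the top max compute unit count
        ggml_backend_sycl_set_mul_device_mode();
    } else {
        // single-device mode: only the user-selected main GPU is used
        // (level_zero:gpu and opencl:gpu devices are supported)
        ggml_backend_sycl_set_single_device_mode(main_gpu);
    }
}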