
Commit 6167700

NeoZhangJianyu authored and tybalex committed
support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (ggml-org#6521)
1 parent 635ef35 commit 6167700
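
The newly supported IQ types can be produced with the `quantize` tool and then run on the SYCL backend. A minimal sketch, assuming an f16 GGUF named `llama-2-7b.f16.gguf` (the model path and `-ngl` value are illustrative placeholders):

```sh
# Quantize an f16 GGUF to one of the newly supported IQ types (IQ4_NL here);
# the input/output paths are placeholders.
./bin/quantize llama-2-7b.f16.gguf llama-2-7b.IQ4_NL.gguf IQ4_NL

# Run it on the SYCL backend, offloading all layers to the GPU.
./bin/main -m llama-2-7b.IQ4_NL.gguf -ngl 33 -p "Hello"
```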

File tree

2 files changed: 895 additions & 217 deletions


README-sycl.md

Lines changed: 19 additions & 21 deletions
@@ -3,7 +3,7 @@
 - [Background](#background)
 - [News](#news)
 - [OS](#os)
-- [Supported Devices](#supported-devices)
+- [Hardware](#hardware)
 - [Docker](#docker)
 - [Linux](#linux)
 - [Windows](#windows)
@@ -24,19 +24,20 @@
 - **Nvidia & AMD Plugins**: These are plugins extending oneAPI's DPCPP support to SYCL on Nvidia and AMD GPU targets.

 ### Llama.cpp + SYCL
-This SYCL "backend" follows the same design found in other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, CLBlast etc..*. The oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for this purpose.

-The llama.cpp SYCL backend supports:
-- Intel GPUs.
-- Nvidia GPUs.
+The llama.cpp SYCL backend is designed to support **Intel GPUs** first. Thanks to SYCL's cross-platform design, it can also support other vendors' GPUs: Nvidia GPUs today (*AMD GPU support coming*).

-*Upcoming support: AMD GPUs*.
+When targeting **Intel CPUs**, it is recommended to use the llama.cpp [Intel oneMKL](README.md#intel-onemkl) backend instead.

-When targetting **Intel CPUs**, it is recommended to use llama.cpp for [x86_64](README.md#intel-onemkl) approach.
+It follows the same design as other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, CLBlast, etc.*. At the beginning of the work, oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (commercial release: [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for the migration.

 ## News

+- 2024.4
+  - Support data types: GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M.
+
 - 2024.3
+  - Release binary files for Windows.
   - A blog is published: **Run LLM on all Intel GPUs Using llama.cpp**: [intel.com](https://www.intel.com/content/www/us/en/developer/articles/technical/run-llm-on-all-gpus-using-llama-cpp-artical.html) or [medium.com](https://medium.com/@jianyu_neo/run-llm-on-all-intel-gpus-using-llama-cpp-fd2e2dcbd9bd).
   - New base line is ready: [tag b2437](https://github.com/ggerganov/llama.cpp/tree/b2437).
   - Support multiple cards: **--split-mode**: [none|layer]; [row] is not supported yet and is under development (see the usage sketch below).
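
A minimal usage sketch for the multi-card **--split-mode** option above (the model path and `-ngl` value are assumptions; `none` is the other currently supported mode):

```sh
# Split the model across the available SYCL GPUs by layer;
# the model path and offload count are placeholders.
./bin/main -m models/llama-2-7b.Q4_0.gguf -ngl 33 --split-mode layer
```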
@@ -59,16 +60,11 @@ When targetting **Intel CPUs**, it is recommended to use llama.cpp for [x86_64]
 |Windows|Support|Windows 11|


-## Supported devices
-
-### Intel GPUs
+## Hardware

-The oneAPI Math Kernel Library, which the oneAPI base-toolkit includes, supports intel GPUs. In order to make it "visible", simply run the following:
-```sh
-source /opt/intel/oneapi/setvars.sh
-```
+### Intel GPU

-- **Tested devices**
+**Verified devices**

 |Intel GPU| Status | Verified Model|
 |-|-|-|
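
Although the explicit `setvars.sh` snippet was dropped from this section, device visibility can still be checked the same way; a short sketch using the oneAPI environment script and the `sycl-ls` utility (both ship with the oneAPI Base Toolkit):

```sh
# Load the oneAPI environment, then list the SYCL devices it exposes;
# the verified GPUs from the table above should appear in the output.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```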
@@ -80,16 +76,18 @@ source /opt/intel/oneapi/setvars.sh

 *Notes:*

-- Device memory can be a limitation when running a large model on an intel GPU. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/main`.
+- **Memory**
+  - Device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/main`.

-- Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPUs and 4.0GB for discrete GPUs.
+  - Please make sure the GPU shared memory from the host is large enough to hold the model. For example, *llama-2-7b.Q4_0* requires at least 8.0GB on an integrated GPU and 4.0GB on a discrete GPU.

-- If the iGPU has less than 80 EUs *(Execution Unit)*, the inference speed will likely be too slow for practical use.
+- **Execution Units (EUs)**
+  - If the iGPU has fewer than 80 EUs, the inference speed will likely be too slow for practical use.

-### Nvidia GPUs
-The BLAS acceleration on Nvidia GPUs through oneAPI can be obtained using the Nvidia plugins for oneAPI and the cuBLAS backend of the upstream oneMKL library. Details and instructions on how to setup the runtime and library can be found in [this section](#i-setup-environment)
+### Nvidia GPU
+The BLAS acceleration on Nvidia GPUs through oneAPI can be obtained using the Nvidia plugins for oneAPI and the cuBLAS backend of the upstream oneMKL library. Details and instructions on how to set up the runtime and library can be found in [this section](#i-setup-environment).

-- **Tested devices**
+**Verified devices**

 |Nvidia GPU| Status | Verified Model|
 |-|-|-|
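
A hedged build sketch for the Nvidia path described above, assuming the `LLAMA_SYCL_TARGET=NVIDIA` CMake option of this era (the setup section referenced above remains the authoritative guide):

```sh
# Configure the SYCL backend for Nvidia GPUs; the flag names are assumptions
# based on the build options documented around this release.
source /opt/intel/oneapi/setvars.sh
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA \
         -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
```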
