- By default, NNPA is disabled. To enable it:
```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DGGML_NNPA=ON

cmake --build build --config Release -j $(nproc)
```
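To check whether the processor actually reports the relevant facilities before relying on them, here is a minimal sketch, assuming a Linux on IBM Z host where the kernel lists them on the `features` line of `/proc/cpuinfo`:

```bash
# List the SIMD/NNPA feature flags the kernel reports on s390x.
# No "nnpa" in the output (e.g. on z15 or earlier) means -DGGML_NNPA=ON
# will build but cannot accelerate anything on this machine.
grep -oE 'vxe2?|nnpa' /proc/cpuinfo | sort -u
```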
All models need to be converted to Big-Endian. You can achieve this in three cases:
1. **Use pre-converted models verified for use (easiest)**

   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
   These models have already been converted from `safetensors` to `GGUF` Big-Endian, and their respective tokenizers have been verified to run correctly on IBM z15 and later systems.
2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
   The model you are trying to convert must be in the `safetensors` file format (for example, [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
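   One way to fetch the repository is with the Hugging Face CLI. A minimal sketch, assuming the `huggingface_hub` CLI is installed and using the Granite model above as the example; the local directory name is just a convention:

   ```bash
   # Download the full model repository (safetensors weights, config,
   # tokenizer) into a local directory for conversion.
   huggingface-cli download ibm-granite/granite-3.3-2b-instruct \
       --local-dir granite-3.3-2b-instruct
   ```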
   Ensure that you have installed the required packages in advance:
   ```bash
   pip3 install -r requirements.txt
   ```
   Convert the `safetensors` model to `GGUF`:
   ```bash
   # The positional argument is the model repository downloaded above;
   # --bigendian writes the GGUF in Big-Endian byte order for s390x.
   python3 convert_hf_to_gguf.py \
       --outfile model-name-be.f16.gguf \
       --outtype f16 \
       --bigendian \
       granite-3.3-2b-instruct
   ```
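   Once the conversion finishes, a quick load test confirms that the Big-Endian file works on the host. A minimal sketch, assuming the `llama-cli` binary built earlier in this guide and the output file name used above:

   ```bash
   # Load the converted model and generate a few tokens as a sanity check.
   ./build/bin/llama-cli \
       -m model-name-be.f16.gguf \
       -p "Hello" \
       -n 16
   ```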
### 1. SIMD Acceleration

Only available in IBM z15 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag.
### 2. NNPA Vector Intrinsics Acceleration
Only available on IBM z16 or later systems with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
### 3. zDNN Accelerator
IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
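As a quick way to confirm that an OpenBLAS development package is actually visible to the build, here is a minimal sketch, assuming a distribution that ships a pkg-config file for OpenBLAS:

```bash
# Print the OpenBLAS version pkg-config can see; an error here usually
# means the openblas -dev/-devel package is not installed.
pkg-config --modversion openblas
```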
Answer: We are aware of this, as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
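For reference, the thread count can be capped at run time without rebuilding. A minimal sketch, assuming a model file name from the conversion step above:

```bash
# Cap CPU threads (here 4) to work around the NNPA slowdown described
# above; rebuilding with -DGGML_NNPA=OFF removes the code path entirely.
./build/bin/llama-cli -m model-name-be.f16.gguf -t 4 -p "Hello"
```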
## Getting Help on IBM Z & LinuxONE
1. **Bugs, Feature Requests**
- 🚫 - acceleration unavailable, will still run using a scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself