```cpp
options.push_back({ "server", "       --embedding(s)",       "restrict to only support embedding use case; use only with dedicated embedding models (default: %s)", params.embedding ? "enabled" : "disabled" });
options.push_back({ "server", "       --api-key KEY",        "API key to use for authentication (default: none)" });
options.push_back({ "server", "       --api-key-file FNAME", "path to file containing API keys (default: none)" });
options.push_back({ "server", "       --ssl-key-file FNAME", "path to a PEM-encoded SSL private key file" });
```
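The flags registered above can be combined on one command line; a hedged sketch of such an invocation (the binary name `llama-server` and the file paths are illustrative assumptions, not taken from this diff):

```shell
# Hypothetical invocation; binary name and paths are assumptions.
# --embeddings restricts the server to embedding requests only,
# --api-key-file supplies accepted API keys, --ssl-key-file enables TLS.
SERVER_ARGS="--embeddings --api-key-file ./api-keys.txt --ssl-key-file ./server-key.pem"
echo "./llama-server $SERVER_ARGS"
```

The key file and TLS key would of course have to exist before the server starts; the `echo` here only shows the assembled command.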
docs/build.md (+18 −1)
````diff
@@ -178,7 +178,11 @@ For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](ht
 cmake --build build --config Release
 ```
 
-The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
+The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used.
+
+The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
+
+The following compilation options are also available to tweak performance:
````
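The two environment variables added in this hunk are set at run time, not compile time; a minimal sketch of exporting them before launching a CUDA-enabled binary (the specific values are illustrative):

```shell
# Spill over into system RAM instead of aborting when VRAM is exhausted (Linux).
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# Restrict execution to the first GPU only.
export CUDA_VISIBLE_DEVICES=0
```

Any llama.cpp binary started from the same shell afterwards inherits both settings.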
````diff
@@ -192,6 +196,19 @@ The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/c
 | GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
 | GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
 
+### MUSA
+
+- Using `make`:
+  ```bash
+  make GGML_MUSA=1
+  ```
+- Using `CMake`:
+
+  ```bash
+  cmake -B build -DGGML_MUSA=ON
+  cmake --build build --config Release
+  ```
+
 ### hipBLAS
 
 This provides BLAS acceleration on HIP-supported AMD GPUs.
````