README.md: 24 additions & 3 deletions
@@ -51,11 +51,10 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook
 
 - Plain C/C++ implementation without dependencies
-- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
+- Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
 - AVX, AVX2 and AVX512 support for x86 architectures
 - Mixed F16 / F32 precision
 - 4-bit, 5-bit and 8-bit integer quantization support
-- Runs on the CPU
 - Supports OpenBLAS/Apple BLAS/ARM Performance Lib/ATLAS/BLIS/Intel MKL/NVHPC/ACML/SCSL/SGIMATH and [more](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors) in BLAS
 - cuBLAS and CLBlast support
@@ -236,6 +235,28 @@ In order to build llama.cpp you have three different options.
 zig build -Drelease-fast
 ```
 
+### Metal Build
+
+Using Metal allows the computation to be executed on the GPU for Apple devices:
+
+- Using `make`:
+
+  ```bash
+  LLAMA_METAL=1 make
+  ```
+
+- Using `CMake`:
+
+  ```bash
+  mkdir build-metal
+  cd build-metal
+  cmake -DLLAMA_METAL=ON ..
+  cmake --build . --config Release
+  ```
+
+When built with Metal support, you can enable GPU inference with the `--gpu-layers|-ngl` command-line argument.
+Any value larger than 0 will offload the computation to the GPU.
+
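For example, a Metal-offloaded run might look like this (an illustrative sketch: the model path, prompt, and token count are placeholders, not part of this change):

```bash
# Illustrative only: any -ngl value > 0 offloads computation to the GPU.
# Substitute your own quantized model path and prompt.
./main -m ./models/7B/ggml-model-q4_0.bin \
  -p "Building a website can be done in 10 simple steps:" \
  -n 128 -ngl 1
```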
 ### BLAS Build
 
 Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
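As one hedged example of such a build (assuming the OpenBLAS implementation and that the OpenBLAS development package is already installed):

```bash
# Illustrative: build with OpenBLAS support via make.
# Assumes libopenblas and its headers are installed on the system.
LLAMA_OPENBLAS=1 make
```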
@@ -367,7 +388,7 @@ Building the program with BLAS support may lead to some performance improvements
 
 Running:
 
-The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
+The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
 
 To select the correct platform (driver) and device (GPU), you can use the environment variables `GGML_OPENCL_PLATFORM` and `GGML_OPENCL_DEVICE`.
 The selection can be a number (starting from 0) or a text string to search:
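For instance (illustrative invocations only; the model path, prompt, and layer count are placeholders):

```bash
# Illustrative: select the OpenCL platform by index.
GGML_OPENCL_PLATFORM=1 ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello"

# Illustrative: select the platform by a name substring, pick device 0,
# and offload 32 layers to the GPU via -ngl.
GGML_OPENCL_PLATFORM=Intel GGML_OPENCL_DEVICE=0 \
  ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -ngl 32
```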