
Commit e33002d

readme : add Metal instructions
1 parent db3db9e commit e33002d

File tree

1 file changed (+24, -3 lines changed)


README.md

Lines changed: 24 additions & 3 deletions
@@ -51,11 +51,10 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook

 - Plain C/C++ implementation without dependencies
-- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
+- Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
 - AVX, AVX2 and AVX512 support for x86 architectures
 - Mixed F16 / F32 precision
 - 4-bit, 5-bit and 8-bit integer quantization support
-- Runs on the CPU
 - Supports OpenBLAS/Apple BLAS/ARM Performance Lib/ATLAS/BLIS/Intel MKL/NVHPC/ACML/SCSL/SGIMATH and [more](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors) in BLAS
 - cuBLAS and CLBlast support

@@ -236,6 +235,28 @@ In order to build llama.cpp you have three different options.
     zig build -Drelease-fast
     ```

+### Metal Build
+
+Using Metal allows the computation to be executed on the GPU for Apple devices:
+
+- Using `make`:
+
+  ```bash
+  LLAMA_METAL=1 make
+  ```
+
+- Using `CMake`:
+
+  ```bash
+  mkdir build-metal
+  cd build-metal
+  cmake -DLLAMA_METAL=ON ..
+  cmake --build . --config Release
+  ```
+
+When built with Metal support, you can enable GPU inference with the `--gpu-layers|-ngl` command-line argument.
+Any value larger than 0 will offload the computation to the GPU.
+
 ### BLAS Build

 Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
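Taken together, the added instructions amount to a two-step workflow: build with Metal enabled, then run with a non-zero `-ngl` value. A minimal sketch using the `main` example binary, assuming a quantized model at the hypothetical path `./models/7B/ggml-model-q4_0.bin` (the model path, prompt, and token count below are illustrative, not part of this commit):

```bash
# Build llama.cpp with Metal support (flag added in this commit)
LLAMA_METAL=1 make

# Run inference; -ngl 1 (any value larger than 0) offloads the
# computation to the GPU. The model path and prompt are placeholders --
# substitute your own quantized model file.
./main -m ./models/7B/ggml-model-q4_0.bin \
       -p "Building a website can be done in 10 simple steps:" \
       -n 128 -ngl 1
```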
@@ -367,7 +388,7 @@ Building the program with BLAS support may lead to some performance improvements

 Running:

-The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
+The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.

 To select the correct platform (driver) and device (GPU), you can use the environment variables `GGML_OPENCL_PLATFORM` and `GGML_OPENCL_DEVICE`.
 The selection can be a number (starting from 0) or a text string to search:
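The hunk ends just before the README's examples of that selection. As an illustrative sketch of how those environment variables are typically combined with a run (the platform name `AMD` and the indices are assumptions, and `...` stands in for the usual model and prompt arguments):

```bash
# Select the OpenCL platform and device by index (counting from 0)
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./main ...

# Or select the platform by a substring of its name,
# e.g. an AMD OpenCL platform (name is illustrative)
GGML_OPENCL_PLATFORM=AMD GGML_OPENCL_DEVICE=1 ./main ...
```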
