README.md: 1 addition & 9 deletions
````diff
@@ -17,7 +17,7 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 The main goal is to run the model using 4-bit quantization on a MacBook
 
 - Plain C/C++ implementation without dependencies
-- Apple silicon first-class citizen - optimized via ARM NEON
+- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
 - AVX2 support for x86 architectures
 - Mixed F16 / F32 precision
 - 4-bit quantization support
```
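The feature list above mentions block-wise 4-bit quantization. As a rough illustration of the general idea (a simplified sketch only; the function names are hypothetical and this is not ggml's actual `q4_0` on-disk format):

```python
import numpy as np

def quantize_q4_sketch(x, block_size=32):
    # Simplified symmetric 4-bit quantization: one FP scale per block of 32
    # values, each value rounded to an integer in [-7, 7]. Loosely in the
    # spirit of ggml's 4-bit formats, NOT the exact q4_0 layout.
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)
    # per-block scale so the largest magnitude maps onto the int range [-7, 7]
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_sketch(q, scale):
    # Reconstruct approximate floats from the int codes and per-block scales
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
q, s = quantize_q4_sketch(x)
x_hat = dequantize_sketch(q, s)
```

Each reconstructed value differs from the original by at most half a quantization step (`scale / 2`), which is why small per-block scales keep the error low.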
````diff
@@ -323,14 +323,6 @@ or with light image:
 docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 
-## Limitations
-
-- Probably the token sampling can be improved
-- The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
-there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't
-know how to utilize it properly. But in any case, you can even disable it with `LLAMA_NO_ACCELERATE=1 make` and the
-performance will be the same, since no BLAS calls are invoked by the current implementation
```
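The removed section says "the token sampling can be improved". For context, a minimal top-k / temperature sampler of the kind this remark refers to might look like the following (a hypothetical illustration only, not llama.cpp's actual implementation):

```python
import numpy as np

def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    # Hypothetical top-k / temperature sampling sketch: keep the k largest
    # logits, soften them by the temperature, and draw from the resulting
    # distribution. Shown only to illustrate the removed remark.
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64) / temperature
    k = min(k, logits.size)
    top = np.argsort(logits)[-k:]                # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(top, p=p))

logits = [0.1, 2.5, -1.0, 0.7, 3.0]
tok = sample_top_k(logits, k=3, temperature=1.0, rng=np.random.default_rng(0))
```

Lower temperatures sharpen the distribution toward the argmax; with `k=1` the sampler degenerates to greedy decoding.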