1 file changed: +6 -1 lines changed

@@ -22,6 +22,11 @@ The main goal is to run the model using 4-bit quantization on a MacBook.
- Runs on the CPU

This was hacked in an evening - I have no idea if it works correctly.
+ Please do not make conclusions about the models based on the results from this implementation.
+ For all I know, it can be completely wrong. This project is for educational purposes and is not going to be maintained properly.
+ New features will probably be added mostly through community contributions, if any.
+
+ ---

Here is a typical run using LLaMA-7B:

@@ -183,7 +188,7 @@ When running the larger models, make sure you have enough disk space to store all
- x86 quantization support [not yet ready](https://github.com/ggerganov/ggml/pull/27). Basically, you want to run this
  on Apple Silicon. For now, on Linux and Windows you can use the F16 `ggml-model-f16.bin` model, but it will be much
  slower.
- - The Accelerate framework is actually currently unused since I found that for tensors shapes typical for the Decoder,
+ - The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
  there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't
  know how to utilize it properly. But in any case, you can even disable it with `LLAMA_NO_ACCELERATE=1 make` and the
  performance will be the same, since no BLAS calls are invoked by the current implementation.
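
The two bullets above amount to a build-time switch and a different model file. Below is a minimal shell sketch of how they might be combined on a Linux or Windows machine; the `./main` invocation, model path, prompt, and flags are assumptions modeled on the "typical run" referenced earlier in the README, not something shown in this diff.

```sh
# Sketch only: build with the Accelerate framework disabled. Per the note
# above, performance should be unchanged, since the current implementation
# makes no BLAS calls either way.
LLAMA_NO_ACCELERATE=1 make

# On Linux/Windows, where 4-bit quantization is not ready yet, run the slower
# F16 model instead of the quantized one. Binary name, model path, prompt,
# and flags here are assumptions, not taken from this diff.
./main -m ./models/7B/ggml-model-f16.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 128
```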