240 | 240 |
241 | 241 | `shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
242 | 242 |
| 243 | +### Perplexity (Measuring model quality)
| 244 | +
| 245 | +You can pass `--perplexity` as a command-line option to measure perplexity over the given prompt. For more background,
| 246 | +see https://huggingface.co/docs/transformers/perplexity. In general, lower perplexity is better for LLMs.
| 247 | +
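To make the metric concrete: perplexity is the exponential of the average negative log-likelihood the model assigns to each token given the tokens before it. Below is a minimal sketch of that formula, not llama.cpp's actual implementation; the `perplexity` helper and the toy `logprobs` values are illustrative assumptions only.

```
// Minimal illustration (not llama.cpp code): perplexity from per-token log-probabilities.
// logprobs[i] is assumed to hold log p(token_i | token_0 .. token_{i-1}).
#include <cmath>
#include <cstdio>
#include <vector>

double perplexity(const std::vector<double> & logprobs) {
    double nll = 0.0;                       // accumulated negative log-likelihood
    for (double lp : logprobs) nll -= lp;   // sum of -log p(token | context)
    return std::exp(nll / logprobs.size()); // exp of the mean NLL
}

int main() {
    // toy example: a model that assigns probability 0.5 to every token gives perplexity 2
    std::vector<double> logprobs(100, std::log(0.5));
    std::printf("perplexity = %.4f\n", perplexity(logprobs));
    return 0;
}
```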
| 248 | +#### Measurements
| 249 | +
| 250 | +https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp measures very well
| 251 | +compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
| 252 | +the 13B model at q4_0 beats the 7B f16 model by a significant margin.
| 253 | +
| 254 | +All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512-token context).
| 255 | +Note that changing the context length has a significant impact on perplexity (longer context = better perplexity).
| 256 | +```
| 257 | +Perplexity - model options
| 258 | +5.5985 - 13B, q4_0
| 259 | +5.9565 - 7B, f16
| 260 | +6.3001 - 7B, q4_1
| 261 | +6.5949 - 7B, q4_0
| 262 | +6.5995 - 7B, q4_0, --memory_f16
| 263 | +```
| 264 | +
| 265 | +#### How to run
| 266 | +
| 267 | +1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
| 268 | +2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
| 269 | +3. Output:
| 270 | +```
| 271 | +Calculating perplexity over 655 chunks
| 272 | +24.43 seconds per pass - ETA 4.45 hours
| 273 | +[1]4.5970,[2]5.1807,[3]6.0382,...
| 274 | +```
| 275 | +And after roughly 4.45 hours (655 chunks × 24.43 seconds per pass), you will have the final perplexity.
| 276 | +
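For a sense of how those numbers are produced, here is an illustrative sketch only, not the actual llama.cpp code. It assumes the text is split into 512-token chunks and that the bracketed values printed above are a running perplexity over the chunks evaluated so far; `chunk_nll` is a hypothetical stand-in for running the model over one chunk.

```
// Sketch of a chunked perplexity loop (illustrative assumptions, see above).
#include <cmath>
#include <cstdio>

// Hypothetical stand-in: the real program would evaluate the model over one
// 512-token chunk and return the summed negative log-likelihood of its tokens.
double chunk_nll(int /*chunk_index*/, int n_ctx) {
    return 1.6 * n_ctx; // toy value: 1.6 nats per token (perplexity ~ 4.95)
}

int main() {
    const int n_ctx    = 512; // default context length used for the measurements
    const int n_chunks = 655; // e.g. wiki.test.raw split into 512-token chunks

    double total_nll = 0.0;
    for (int i = 0; i < n_chunks; ++i) {
        total_nll += chunk_nll(i, n_ctx);
        // running perplexity over everything evaluated so far
        const double ppl = std::exp(total_nll / double((i + 1) * n_ctx));
        std::printf("[%d]%.4f,", i + 1, ppl);
    }
    std::printf("\n");
    return 0;
}
```

Printing a running estimate like this lets you watch the value settle long before the full run finishes.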
243 | 277 | ### Android
244 | 278 |
245 | 279 | You can easily run `llama.cpp` on an Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
@@ -290,7 +324,6 @@ docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models
290 | 324 |
291 | 325 | ## Limitations
292 | 326 |
293 |  | -- We don't know yet how much the quantization affects the quality of the generated text
294 | 327 | - Probably the token sampling can be improved
295 | 328 | - The Accelerate framework is currently unused since I found that, for tensor shapes typical for the Decoder,
296 | 329 | there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't