Commit 40ea807

Add details on perplexity to README.md (#395)
1 parent d5850c5 commit 40ea807

File tree: 1 file changed (+34, -1 lines)

README.md

Lines changed: 34 additions & 1 deletion
@@ -240,6 +240,40 @@ or
`shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS

### Perplexity (Measuring model quality)

You can pass `--perplexity` as a command-line option to measure perplexity over the given prompt. For more background,
see https://huggingface.co/docs/transformers/perplexity. In short, lower perplexity is better for LLMs.
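For intuition: perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the text. A minimal illustrative sketch in Python (not the llama.cpp implementation):

```
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(mean NLL)."""
    nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 100))  # 4.0
```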
#### Measurements

https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp measures very well
compared to the baseline implementations. Quantization has a small negative impact on quality but, as you can see, running
the 13B model at q4_0 beats the 7B f16 model by a significant margin.

All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512-token context).
Note that changing the context length has a significant impact on perplexity (longer context = better perplexity); see the chunking sketch below the results.
```
Perplexity - model options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16
```
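Roughly, the evaluation splits the tokenized test set into fixed-size chunks (512 tokens by default, per the context length above) and averages the per-token negative log-likelihood across all of them. A rough sketch of that chunked evaluation, using a hypothetical `token_nlls(chunk)` scoring callback (illustration only, not the llama.cpp code):

```
import math

def chunked_perplexity(tokens, token_nlls, n_ctx=512):
    # token_nlls(chunk) is a hypothetical callback returning the per-token
    # negative log-likelihoods the model assigns to one chunk of tokens.
    total_nll = 0.0
    total_tokens = 0
    for i in range(0, len(tokens) - n_ctx + 1, n_ctx):
        nlls = token_nlls(tokens[i:i + n_ctx])
        total_nll += sum(nlls)
        total_tokens += len(nlls)
    return math.exp(total_nll / total_tokens)
```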
#### How to run

1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
3. Output:
```
Calculating perplexity over 655 chunks
24.43 seconds per pass - ETA 4.45 hours
[1]4.5970,[2]5.1807,[3]6.0382,...
```
After about 4.45 hours, you will have the final perplexity.
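While the run is in progress, the bracketed numbers are the running perplexity estimate after each chunk, so you can monitor a long run before it finishes. A small illustrative parser for that output (it assumes the exact `[chunk]value` format shown above and is not part of llama.cpp):

```
import re
import sys

# Reads captured `./main --perplexity` output on stdin and prints how many
# chunks have been scored so far and the latest running perplexity value.
text = sys.stdin.read()
values = [float(v) for v in re.findall(r"\[\d+\](\d+\.\d+)", text)]
if values:
    print(f"chunks scored: {len(values)}, running perplexity: {values[-1]:.4f}")
else:
    print("no perplexity values found yet")
```

Capture the tool's output to a file and feed that file to the script on stdin to check progress.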
### Android

You can easily run `llama.cpp` on an Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
@@ -290,7 +324,6 @@ docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models
## Limitations

- ~~We don't know yet how much the quantization affects the quality of the generated text~~ (removed by this commit)
- Probably the token sampling can be improved
- The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
  there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't