
Commit 113228b

perplexity: add BF16 vs. FP16 results
1 parent 83330d8 commit 113228b

1 file changed: examples/perplexity/README.md (+59 / -1 lines)

@@ -7,6 +7,8 @@ Also note that finetunes typically result in a higher perplexity value even thou
Within llama.cpp, the perplexity of base models is used primarily to judge the quality loss from e.g. quantized models vs. FP16.
The convention among contributors is to use the Wikitext-2 test set for testing unless noted otherwise (can be obtained with `scripts/get-wikitext-2.sh`).
+When numbers are listed, all command line arguments and compilation options are left at their defaults unless noted otherwise.
+llama.cpp numbers are **not** directly comparable to those of other projects because the exact values depend strongly on the implementation details.

By default only the mean perplexity value and the corresponding uncertainty are calculated.
The uncertainty is determined empirically by assuming a Gaussian distribution of the "correct" logits per token and then applying error propagation.
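For orientation, a baseline run matching these defaults might look roughly like the sketch below. The binary name (here `./perplexity`; depending on the build it may live under `build/bin/` or be called `llama-perplexity` in later versions) and the model path are placeholders, not commands taken from this commit.

```sh
# Fetch the Wikitext-2 test set used by convention (script referenced above).
./scripts/get-wikitext-2.sh

# Compute the mean perplexity and its uncertainty with all options at their defaults.
# Model and data paths are placeholders; point them at the GGUF file and the extracted test file.
./perplexity -m models/Meta-Llama-3-8B-f16.gguf -f wikitext-2-raw/wiki.test.raw
```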
@@ -32,7 +34,13 @@ In addition to the KL divergence the following statistics are calculated with `-

## LLaMA 3 8b Scoreboard

-Results are sorted by Kullback-Leibler divergence relative to FP16.
+| Revision | f364eb6f |
+|:---------|:-------------------|
+| Backend | CUDA |
+| CPU | AMD Epyc 7742 |
+| GPU | 1x NVIDIA RTX 4090 |
+
+Results were generated using the CUDA backend and are sorted by Kullback-Leibler divergence relative to FP16.
The "WT" importance matrices were created using varying numbers of Wikitext tokens and can be found [here](https://huggingface.co/JohannesGaessler/llama.cpp_importance_matrices/blob/main/imatrix-llama_3-8b-f16-2.7m_tokens.dat).

| Quantization | imatrix | Model size [GiB] | PPL | ΔPPL | KLD | Mean Δp | RMS Δp |
@@ -89,6 +97,12 @@ K-quants score better on mean Δp than the legacy quants than e.g. KL divergence

## LLaMA 2 vs. LLaMA 3 Quantization comparison

+| Revision | f364eb6f |
+|:---------|:-------------------|
+| Backend | CUDA |
+| CPU | AMD Epyc 7742 |
+| GPU | 1x NVIDIA RTX 4090 |
+
| Metric | L2 7b q2_K | L3 8b q2_K | L2 7b q4_K_M | L3 8b q4_K_M | L2 7b q6_K | L3 8b q6_K | L2 7b q8_0 | L3 8b q8_0 |
|-----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| Mean PPL | 5.794552 ± 0.032298 | 9.751568 ± 0.063312 | 5.877078 ± 0.032781 | 6.407115 ± 0.039119 | 5.808494 ± 0.032425 | 6.253382 ± 0.038078 | 5.798542 ± 0.032366 | 6.234284 ± 0.037878 |
@@ -107,6 +121,50 @@ K-quants score better on mean Δp than the legacy quants than e.g. KL divergence
| RMS Δp | 9.762 ± 0.053 % | 21.421 ± 0.079 % | 3.252 ± 0.024 % | 5.519 ± 0.050 % | 1.339 ± 0.010 % | 2.295 ± 0.019 % | 0.618 ± 0.011 % | 1.198 ± 0.007 % |
| Same top p | 85.584 ± 0.086 % | 71.138 ± 0.119 % | 94.665 ± 0.055 % | 91.901 ± 0.072 % | 97.520 ± 0.038 % | 96.031 ± 0.051 % | 98.846 ± 0.026 % | 97.674 ± 0.040 % |

+## LLaMA 3 BF16 vs. FP16 comparison
+
+| Revision | 83330d8c |
+|:---------|:--------------|
+| Backend | CPU |
+| CPU | AMD Epyc 7742 |
+| GPU | N/A |
+
+Results were calculated with LLaMA 3 8b BF16 as `--kl-divergence-base` and LLaMA 3 8b FP16 as the `--model` for comparison.
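The two-step invocation behind such a comparison is sketched below; file names are placeholders and the exact arguments are an assumption based on the `--kl-divergence-base` and `--kl-divergence` options, not a verbatim record of the runs in this commit.

```sh
# Step 1: save the token-level logits of the BF16 base model over the test set.
./perplexity -m models/Meta-Llama-3-8B-bf16.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits-llama_3-8b-bf16.dat

# Step 2: evaluate the FP16 model against the saved BF16 logits and report the KLD/Δp statistics.
./perplexity -m models/Meta-Llama-3-8B-f16.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits-llama_3-8b-bf16.dat --kl-divergence
```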
+
+| Metric | Value |
+|--------------------------------|--------------------------|
+| Mean PPL(Q) | 6.227711 ± 0.037833 |
+| Mean PPL(base) | 6.225194 ± 0.037771 |
+| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.990% |
+| Mean ln(PPL(Q)/PPL(base)) | 0.000404 ± 0.000086 |
+| Mean PPL(Q)/PPL(base) | 1.000404 ± 0.000086 |
+| Mean PPL(Q)-PPL(base) | 0.002517 ± 0.000536 |
+| Mean KLD | 0.00002515 ± 0.00000020 |
+| Maximum KLD | 0.012206 |
+| 99.9% KLD | 0.000799 |
+| 99.0% KLD | 0.000222 |
+| 99.0% KLD | 0.000222 |
+| Median KLD | 0.000013 |
+| 10.0% KLD | -0.000002 |
+| 5.0% KLD | -0.000008 |
+| 1.0% KLD | -0.000023 |
+| Minimum KLD | -0.000059 |
+| Mean Δp | -0.0000745 ± 0.0003952 % |
+| Maximum Δp | 4.186% |
+| 99.9% Δp | 1.049% |
+| 99.0% Δp | 0.439% |
+| 95.0% Δp | 0.207% |
+| 90.0% Δp | 0.125% |
+| 75.0% Δp | 0.029% |
+| Median Δp | 0.000% |
+| 25.0% Δp | -0.030% |
+| 10.0% Δp | -0.126% |
+| 5.0% Δp | -0.207% |
+| 1.0% Δp | -0.434% |
+| 0.1% Δp | -1.016% |
+| Minimum Δp | -4.672% |
+| RMS Δp | 0.150 ± 0.001 % |
+| Same top p | 99.739 ± 0.013 % |

## Old Numbers
