Commit 2d5db48

ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0
* llama : bump LLAMA_FILE_VERSION to 3
* cuda : update Q4 and Q8 dequantize kernels
* ggml : fix AVX dot products
* readme : update performance table + hot topics
1 parent 6986c78 commit 2d5db48

File tree

6 files changed: +109 -102 lines changed


README.md

Lines changed: 11 additions & 10 deletions
```diff
@@ -9,6 +9,7 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
 **Hot topics:**
 
+- Quantization formats `Q4` and `Q8` have changed again (19 May) - [(info)](https://github.com/ggerganov/llama.cpp/pull/1508)
 - Quantization formats `Q4` and `Q5` have changed - requantize any old models [(info)](https://github.com/ggerganov/llama.cpp/pull/1405)
 - [Roadmap May 2023](https://github.com/ggerganov/llama.cpp/discussions/1220)
 
```

```diff
@@ -334,16 +335,16 @@ Several quantization methods are supported. They differ in the resulting model d
 
 | Model | Measure      | F16    | Q4_0   | Q4_1   | Q5_0   | Q5_1   | Q8_0   |
 |------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
-| 7B    | perplexity   | 5.9066 | 6.1565 | 6.0910 | 5.9862 | 5.9481 | 5.9069 |
-| 7B    | file size    |  13.0G |   4.0G |   4.8G |   4.4G |   4.8G |   7.1G |
-| 7B    | ms/tok @ 4th |    128 |     50 |     54 |     75 |     83 |     75 |
-| 7B    | ms/tok @ 8th |    123 |     44 |     52 |     53 |     58 |     72 |
-| 7B    | bits/weight  |   16.0 |    5.0 |    6.0 |    5.5 |    6.0 |    9.0 |
-| 13B   | perplexity   | 5.2543 | 5.3860 | 5.3607 | 5.2856 | 5.2706 | 5.2548 |
-| 13B   | file size    |  25.0G |   7.6G |   9.1G |   8.4G |   9.1G |    14G |
-| 13B   | ms/tok @ 4th |    239 |     93 |    101 |    150 |    164 |    141 |
-| 13B   | ms/tok @ 8th |    240 |     81 |     96 |     96 |    104 |    136 |
-| 13B   | bits/weight  |   16.0 |    5.0 |    6.0 |    5.5 |    6.0 |    9.0 |
+| 7B    | perplexity   | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
+| 7B    | file size    |  13.0G |   3.5G |   3.9G |   4.3G |   4.7G |   6.7G |
+| 7B    | ms/tok @ 4th |    127 |     55 |     54 |     76 |     83 |     72 |
+| 7B    | ms/tok @ 8th |    122 |     43 |     45 |     52 |     56 |     67 |
+| 7B    | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
+| 13B   | perplexity   | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
+| 13B   | file size    |  25.0G |   6.8G |   7.6G |   8.3G |   9.1G |    13G |
+| 13B   | ms/tok @ 4th |      - |    103 |    105 |    148 |    160 |    131 |
+| 13B   | ms/tok @ 8th |      - |     73 |     82 |     98 |    105 |    128 |
+| 13B   | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
 
 ### Perplexity (measuring model quality)
```
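The new bits/weight column follows directly from the block layouts: each 32-weight block now stores its scale (and, for Q4_1, its min) as a 2-byte F16 instead of a 4-byte F32, shaving 0.5 bits/weight off Q4_0 and Q8_0 and 1.0 off Q4_1. A minimal sketch of the arithmetic (not part of the commit; block sizes taken from the structs in the ggml-cuda.cu diff below):

```c
#include <stdio.h>

int main(void) {
    const int QK = 32;  // weights per block (QK4_0 = QK4_1 = QK8_0)

    // bytes per block after this commit (F16 scale/min = 2 bytes each)
    const int q4_0 = 2 + QK / 2;      // half d          + 16 bytes of nibbles = 18
    const int q4_1 = 2 + 2 + QK / 2;  // half d, half m  + 16 bytes of nibbles = 20
    const int q8_0 = 2 + QK;          // half d          + 32 int8 quants      = 34

    printf("Q4_0: %.1f bits/weight\n", 8.0 * q4_0 / QK);  // 4.5 (was 5.0 with F32)
    printf("Q4_1: %.1f bits/weight\n", 8.0 * q4_1 / QK);  // 5.0 (was 6.0)
    printf("Q8_0: %.1f bits/weight\n", 8.0 * q8_0 / QK);  // 8.5 (was 9.0)
    return 0;
}
```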

ggml-cuda.cu

Lines changed: 7 additions & 7 deletions
```diff
@@ -42,19 +42,19 @@ typedef void (*dequantize_mul_mat_vec_cuda_t)(const void * vx, const float * y,
 #define QK4_0 32
 #define QR4_0 2
 typedef struct {
-    float   d;              // delta
+    half    d;              // delta
     uint8_t qs[QK4_0 / 2];  // nibbles / quants
 } block_q4_0;
-static_assert(sizeof(block_q4_0) == sizeof(float) + QK4_0 / 2, "wrong q4_0 block size/padding");
+static_assert(sizeof(block_q4_0) == sizeof(ggml_fp16_t) + QK4_0 / 2, "wrong q4_0 block size/padding");
 
 #define QK4_1 32
 #define QR4_1 2
 typedef struct {
-    float   d;              // delta
-    float   m;              // min
+    half    d;              // delta
+    half    m;              // min
     uint8_t qs[QK4_1 / 2];  // nibbles / quants
 } block_q4_1;
-static_assert(sizeof(block_q4_1) == sizeof(float) * 2 + QK4_1 / 2, "wrong q4_1 block size/padding");
+static_assert(sizeof(block_q4_1) == sizeof(ggml_fp16_t) * 2 + QK4_1 / 2, "wrong q4_1 block size/padding");
 
 #define QK5_0 32
 #define QR5_0 2
@@ -78,10 +78,10 @@ static_assert(sizeof(block_q5_1) == 2 * sizeof(ggml_fp16_t) + sizeof(uint32_t) +
 #define QK8_0 32
 #define QR8_0 1
 typedef struct {
-    float   d;              // delta
+    half    d;              // delta
     int8_t  qs[QK8_0];      // quants
 } block_q8_0;
-static_assert(sizeof(block_q8_0) == sizeof(float) + QK8_0, "wrong q8_0 block size/padding");
+static_assert(sizeof(block_q8_0) == sizeof(ggml_fp16_t) + QK8_0, "wrong q8_0 block size/padding");
 
 #define CUDA_DEQUANTIZE_BLOCK_SIZE 256
 #define CUDA_DMMV_BLOCK_SIZE 32 // dmmv = dequantize_mul_mat_vec
```
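On the CUDA side, only the scale load changes: kernels now widen an F16 `d` instead of reading an F32. Below is a self-contained sketch of Q4_0 dequantization under the new layout. It is a hypothetical standalone kernel, not the commit's dequantize_mul_mat_vec code, and it assumes ggml's element ordering of low nibbles in the first half of the block:

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_0 32

typedef struct {
    half    d;              // delta, now F16 (was F32)
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

// One thread per block_q4_0: each 4-bit quant q in [0,15] decodes to (q - 8) * d.
__global__ void dequantize_q4_0_sketch(const block_q4_0 * x, float * y, int nb) {
    const int ib = blockIdx.x * blockDim.x + threadIdx.x;  // quant-block index
    if (ib >= nb) return;

    const float d = __half2float(x[ib].d);  // widen the F16 scale once per block

    for (int j = 0; j < QK4_0 / 2; ++j) {
        const uint8_t v = x[ib].qs[j];
        y[ib*QK4_0 + j            ] = ((v & 0x0F) - 8) * d;  // low nibble
        y[ib*QK4_0 + j + QK4_0 / 2] = ((v >>   4) - 8) * d;  // high nibble
    }
}
```

The updated static_asserts guard exactly this layout: with the F16 delta, sizeof(block_q4_0) drops from 20 bytes to 18.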
