
Commit 9008328

compilade authored (co-authored with ngxson and CISC)
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors

  Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation

  Co-authored-by: Xuan Son Nguyen <[email protected]>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly

  This should make comparisons between the formats easier because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage

  Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
1 parent d4b91ea commit 9008328

File tree

6 files changed (+669, -158 lines)


common/common.cpp

9 additions & 0 deletions

```diff
@@ -448,6 +448,15 @@ void string_replace_all(std::string & s, const std::string & search, const std::
 bool string_ends_with(const std::string_view & str, const std::string_view & suffix) {
     return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
 }
+
+bool string_remove_suffix(std::string & str, const std::string_view & suffix) {
+    bool has_suffix = string_ends_with(str, suffix);
+    if (has_suffix) {
+        str = str.substr(0, str.size() - suffix.size());
+    }
+    return has_suffix;
+}
+
 size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop) {
     if (!str.empty() && !stop.empty()) {
         const char text_last_char = str.back();
```

common/common.h

1 addition & 0 deletions

```diff
@@ -534,6 +534,7 @@ static bool string_starts_with(const std::string & str,
 
 // While we wait for C++20's std::string::ends_with...
 bool string_ends_with(const std::string_view & str, const std::string_view & suffix);
+bool string_remove_suffix(std::string & str, const std::string_view & suffix);
 size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop);
 
 bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
```

gguf-py/gguf/constants.py

6 additions & 0 deletions

```diff
@@ -233,6 +233,11 @@ class Adapter:
     TYPE       = "adapter.type"
     LORA_ALPHA = "adapter.lora.alpha"
 
+class IMatrix:
+    CHUNK_COUNT = "imatrix.chunk_count"
+    CHUNK_SIZE  = "imatrix.chunk_size"
+    DATASETS    = "imatrix.datasets"
+
 class Clip:
     PROJECTOR_TYPE     = "clip.projector_type"
     HAS_VISION_ENCODER = "clip.has_vision_encoder"
@@ -282,6 +287,7 @@ class Projector:
 class GGUFType:
     MODEL   = "model"
     ADAPTER = "adapter"
+    IMATRIX = "imatrix"
     MMPROJ  = "mmproj" # dummy, unused for now
 
 
```
tools/imatrix/README.md

6 additions & 5 deletions

````diff
@@ -7,14 +7,15 @@ More information is available here: https://github.com/ggml-org/llama.cpp/pull/4
 
 ```
 ./llama-imatrix \
-    -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] [--verbosity 1] \
+    -m model.gguf -f some-text.txt [-o imatrix.gguf] [--process-output] \
     [--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \
-    [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...]
+    [--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] \
+    [--parse-special]
 ```
 
 Here `-m` with a model name and `-f` with a file containing training data (such as e.g. `wiki.train.raw`) are mandatory.
 The parameters in square brackets are optional and have the following meaning:
-* `-o` (or `--output-file`) specifies the name of the file where the computed data will be stored. If missing `imatrix.dat` is used.
+* `-o` (or `--output-file`) specifies the name of the file where the computed data will be stored. If missing `imatrix.gguf` is used.
 * `--verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
 * `--output-frequency` specifies how often the so far computed result is saved to disk. Default is 10 (i.e., every 10 chunks)
 * `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never)
@@ -25,9 +26,9 @@ For faster computation, make sure to use GPU offloading via the `-ngl` argument
 ## Example
 
 ```bash
-# generate importance matrix (imatrix.dat)
+# generate importance matrix (imatrix.gguf)
 ./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99
 
 # use the imatrix to perform a Q4_K_M quantization
-./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
+./llama-quantize --imatrix imatrix.gguf ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
 ```
````
