
Conversation

@ggerganov (Member) commented Jul 26, 2023

ref:

This PR paves the way for integrating more models into llama.cpp. It changes the file format in which we convert the models by extending it with key-value pairs meta information. This meta data is flexible and allows to add specific information about the model being converted.

This is a breaking change: all existing ggml models will no longer be compatible after merging.
You should obtain the original model data (e.g. Meta PTH or Hugging Face) and use the convert.py script to generate the new F16 or F32 .gguf models. From there, you can use all tools as usual: quantize, main, perplexity, etc.

Read more about the GGUF file format here: ggml-org/ggml#302
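As a rough illustration of the container layout, the file starts with a small fixed header followed by the key-value metadata section. The sketch below is based on the spec linked above and assumes the initial v1 field widths (32-bit counts); the spec is the authoritative definition:

```python
import struct

# Pack a minimal GGUF v1 header: magic, version, tensor count, KV count.
# This is an illustrative sketch of the layout, not a full writer.
def pack_header(n_tensors: int, n_kv: int) -> bytes:
    magic = b"GGUF"
    return magic + struct.pack("<III", 1, n_tensors, n_kv)

def unpack_header(data: bytes):
    assert data[:4] == b"GGUF", "not a GGUF file"
    version, n_tensors, n_kv = struct.unpack_from("<III", data, 4)
    return version, n_tensors, n_kv

hdr = pack_header(n_tensors=2, n_kv=5)
print(unpack_header(hdr))  # (1, 2, 5)
```

The key-value pairs follow the header, which is what lets readers discover model-specific information without hard-coding a layout per architecture.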

The PR also refactors some portions of llama.cpp and mainly extends ggml with a gguf API for loading and writing .gguf files. It incorporates LLaMA tokenizer fixes and adds (temporary?) BPE tokenizer support.

Huge thanks to all the contributors for writing the spec and helping with the implementation ❤️

Merge ETA: 21 Aug

Usage

# build the GGUF read/write example
make gguf

# write a dummy GGUF model to test.gguf
./gguf test.gguf w

# read the dummy GGUF model
./gguf test.gguf r

# LLaMA 1 PTH
python3 convert.py ../llama1/7B/ --outfile models/7B/ggml-model-f16.gguf

# LLaMA 2 PTH
python3 convert.py ../llama2/llama/llama-2-7b --outfile models/7B-v2/ggml-model-f16.gguf

# LLaMA 2 HF
python3 convert.py ~/development/huggingface/Llama-2-7b-hf/ --outfile models/7B-v2/ggml-model-f16.gguf

# vocab-only
python3 convert.py ../llama1/7B/ --outfile models/ggml-vocab-llama.gguf --vocab-only
python3 convert.py ~/development/huggingface/Llama-2-7b-hf/ --outfile models/ggml-vocab-llama.gguf --vocab-only

Convert GGML to GGUF (not guaranteed to work)


Implementation plan:

  • implement GGUF import in ggml
  • add sample code to write / read dummy GGUF files
  • refactor convert.py + export GGUF
    • remove quantized data support - only F16 and F32 types
    • remove GPTQ support
    • output GGUF for LLaMAv2
      • Hugging Face (convert-llama-h5-to-gguf.py)
      • Vanilla PTH (convert.py)
  • refactor llama.cpp model loading code to simplify things and drop obsolete stuff
  • utilize the new gguf API in llama.cpp to load new GGUF F16 models
  • extend the gguf API to be able to output quantized GGUF models
    • as a first step, the user code can write the models
    • later, when we see what API we need, we can move the implementation as part of ggml
  • refactor llama.cpp to add support for alternative model inference graphs (left for a future PR)
  • demonstrate integration with MPT or Falcon - i.e. be able to seamlessly load and inference either of the 2 models with main (left for a future PR)
  • bring in the tokenizer fixes (Merge tokenizer fixes #2549) so we have a single breaking change instead of two

Contributions are welcome - just make the target branch this one.
Collaborators can push directly in the branch.

TODO

@ggerganov ggerganov mentioned this pull request Jul 26, 2023
@monatis (Collaborator) commented Jul 26, 2023

Great, better to have a single PR where every contributor can target.

I'll continue to work on this in the afternoon.

@ggerganov (Member, Author) commented Jul 26, 2023

I'm working on the C implementation for reading. A Python script for creating sample GGUF model files, similar to the one you've started, can be done in parallel. Also, directly modifying convert.py to output GGUF can be done without conflict for now.

@ggerganov added the labels high priority, refactoring, and breaking change (Jul 26, 2023)
@ggerganov (Member, Author) commented:

Implemented an initial API for reading GGUF files. The new gguf example demonstrates writing simple model files. It also demonstrates different ways to read the data and load it into memory.

There were some minor deviations from the specification, but nothing is final yet. I will summarize at the end and update the spec accordingly.

Tomorrow I will continue with updating convert.py to write a GGUF file and try to use the new API to load the model in llama.cpp.

@ggerganov force-pushed the gguf branch 2 times, most recently from bae72cb to ef482eb (July 26, 2023 19:59)
@Green-Sky (Collaborator) commented:

Tip: port constants.py to C. The Vulkan headers have done that to drastically reduce the number of spelling mistakes in strings.

@cztomsik (Contributor) commented:

Sorry if this is not the right place to ask, but why is whitespace getting escaped at all? The previous version literally returned string slices right from the vocab, and now we need to replace occurrences of ▁ (U+2581, Lower One Eighth Block) in a newly allocated string.

Alternatively it can write to a buffer, but still, every generated token needs to be searched for the substring at every position and then replaced.

Obviously it's not going to be a bottleneck, but it seems wasteful and very odd. Can anybody explain the reasoning behind this decision?
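For context, the replacement in question amounts to the following (a minimal Python sketch; the real code is C++ inside llama.cpp):

```python
# sentencepiece stores pieces with U+2581 (Lower One Eighth Block)
# in place of leading spaces; converting a piece back to text means
# replacing every occurrence, which allocates a new string per token.
def unescape_whitespace(piece: str) -> str:
    return piece.replace("\u2581", " ")

print(repr(unescape_whitespace("\u2581Hello")))  # ' Hello'
```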

@ggerganov (Member, Author) commented:

@cztomsik It would be helpful to provide a specific example and state what you expect.

@cebtenzzre (Collaborator) commented:

why is whitespace getting escaped at all?

I don't have a solid understanding of sentencepiece, but I do know that the recent tokenizer changes started at #167 and made their way through a few PRs before landing here. These changes were made for more accurate tokenization compared to the official sentencepiece implementation.

@cztomsik (Contributor) commented Sep 13, 2023

@cztomsik Would be helpful to provide a specific example and what is the expectation.

Basically, I'm asking why we can't have a function which takes a token and returns a const string. Something like this:

[image]

(This actually works, and I'm doing the replace in JS, but I was curious why the escaping is necessary in the first place, and why we can't have unescaped whitespace in the vocab.)

One obvious fix would be to do the unescape when the vocab is loaded and keep two pointers for each entry (escaped + unescaped, if they differ); then you could avoid these replaceAll calls:

[images]

And not only would it be faster, but the API would also be easier to understand and work with (currently you need a buffer even if you're writing directly to stdout, for example).
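The precompute-at-load idea could be sketched like this (hypothetical Python for illustration; the actual vocab lives in C++ structs):

```python
# Unescape each piece once at vocab-load time, so token-to-piece
# becomes a plain lookup with no per-token allocation.
ESCAPED_SPACE = "\u2581"  # U+2581, used by sentencepiece for spaces

def load_vocab(escaped_pieces):
    # keep both forms per entry (escaped + unescaped)
    unescaped = [p.replace(ESCAPED_SPACE, " ") for p in escaped_pieces]
    return list(zip(escaped_pieces, unescaped))

vocab = load_vocab(["\u2581Hello", ",", "\u2581world"])

def token_to_piece(token_id: int) -> str:
    return vocab[token_id][1]  # O(1) lookup, no replaceAll at decode time

print(token_to_piece(0) + token_to_piece(1) + token_to_piece(2))
# ' Hello, world'
```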

BTW: I am not a C/C++ dev, so I might be missing something.

EDIT: My question is probably more related to this addition #2810

@goerch (Contributor) commented Sep 14, 2023

Obviously it's not going to be a bottleneck but it seems wasteful and very odd. Can anybody explain what was the reasoning behind this decision?

See #2310 and #3096, for example: we want to make the tokenizer compatible with LLaMA's sentencepiece tokenizer. I fear we are still missing Unicode normalization somewhere. I'm sure you could optimize the interfaces and implementations.

@cztomsik (Contributor) commented:

Ok, I see. But from my limited understanding, the only tricky part is the encoding step, right? That's where different whitespace combinations can yield different token ids.

When you get ids from the LLM and turn them into string/byte chunks, every piece is fixed and known in advance. It can be invalid UTF-8, but that should be the only problem, and therefore token_to_s could be just a lookup. At least that's what I assume from this code in the Google sentencepiece library.

Or, I mean, there's one other thing, a bit related to grammars: if you are inside a multi-byte sequence you should not sample certain tokens at all, but that is related to sampling and not to the id -> bytes conversion, IMHO.

BTW: I would be happy to help once you're done with your current changes and tests are passing.

@ggerganov (Member, Author) commented:

@cztomsik

I just checked and the Python tokenizer works like this:

print(tokenizer.id_to_piece(15043))
▁Hello

So I see your point - we should update llama_token_to_piece to match that behavior.
But we need to be careful to not break some existing logic.

Not sure when I'll get to this though. If you open a PR, I can help with the review.

@goerch (Contributor) commented Sep 14, 2023

So I see your point - we should update llama_token_to_piece to match that behavior.
But we need to be careful to not break some existing logic.

llama_token_to_piece is used in some of the examples instead of the newer llama_detokenize..., AFAIU. Those should be adapted to use llama_detokenize... then? Do we have a model-independent llama_detokenize yet?

@ggerganov (Member, Author) commented:

Those should be adapted to use llama_detokenize... then?

Probably yes, but need a careful look

Do we have an model independent llama_detokenize yet?

No - it's a TODO:

https://github.com/ggerganov/llama.cpp/blob/e394084166baac09e8ee9a08a4686f907f7e5291/common/common.h#L148-L164

@cztomsik (Contributor) commented:

So I see your point - we should update llama_token_to_piece to match that behavior. But we need to be careful to not break some existing logic.

No, the only point was that a lot of strings get allocated and destroyed if you're using llama_token_to_piece from the C API.

In the end, I did it as shown below. It's only using one buffer, and it's still doing replacements for each generated token, but at least it's not creating and throwing away intermediate strings all the time.

I think I could even avoid that buffer entirely, but it will be useful for rprompt (which I'm currently doing in the client code), so I'm fine with this and I don't need any changes; everything was possible with the currently existing APIs.

[image]

So far my impl is only missing the stripping of the leading space, but I have that in my client code already since the pre-GGUF version.
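The approach described above (one reusable buffer, per-token replacement, no intermediate strings) might look roughly like this (a Python sketch only; the code in the screenshot is the actual client implementation):

```python
import io

ESCAPED_SPACE = "\u2581"  # sentencepiece's whitespace marker

def stream_tokens(pieces, out):
    # one buffer reused across all tokens: unescape into it, then emit
    buf = []
    for piece in pieces:
        buf.clear()
        buf.append(piece.replace(ESCAPED_SPACE, " "))
        out.write("".join(buf))

sink = io.StringIO()
stream_tokens(["\u2581Hi", "\u2581there"], sink)
print(repr(sink.getvalue()))  # ' Hi there'
```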

@goerch (Contributor) commented Sep 14, 2023

@cztomsik: I think @ggerganov and I are talking about slowly converging to the LLaMA API for the tokenizer:

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        return self.sp_model.decode(t)
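That API shape can be exercised with a stub in place of the real sentencepiece model (the stub below is invented purely for illustration; bos/eos ids are arbitrary):

```python
from typing import List

class StubSPModel:
    # stands in for sentencepiece; maps whole words to ids for illustration
    def __init__(self):
        self.p2i = {"hello": 10, "world": 11}
        self.i2p = {v: k for k, v in self.p2i.items()}

    def encode(self, s: str) -> List[int]:
        return [self.p2i[w] for w in s.split()]

    def decode(self, t: List[int]) -> str:
        return " ".join(self.i2p[i] for i in t)

class Tokenizer:
    def __init__(self, sp_model, bos_id=1, eos_id=2):
        self.sp_model, self.bos_id, self.eos_id = sp_model, bos_id, eos_id

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        return self.sp_model.decode(t)

tok = Tokenizer(StubSPModel())
print(tok.encode("hello world", bos=True, eos=False))  # [1, 10, 11]
print(tok.decode([10, 11]))  # hello world
```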

@ggerganov (Member, Author) commented:

No, the only point was that there's a lot of strings getting allocated and destroyed if you're using the llama_token_to_piece from the C API.

Yup, got it. We have many temporary strings because of the unescape stuff. If we change the API to return the "vanilla" piece (similar to what the OG tokenizer does), then we won't have to unescape and we won't have the allocations.

Alternatively, we can keep the current API and do what you suggested:

One obvious fix would be to do the unescape when the vocab is loaded, and have two pointers for each entry (escaped + unescaped - if it's different), then you could avoid these replaceAll calls:

Will think about which way is better.

@goerch (Contributor) commented Sep 14, 2023

Just one remark: I believe the sentencepiece tokenizer doesn't respect the invariants you'd expect (at least I expected them ;)).

#2310 (comment)
#2810 (comment)
