GGUF #2398
Conversation
Great, better to have a single PR that every contributor can target. I'll continue to work on this in the afternoon.
I'm working on the C implementation for reading. A Python script for creating sample GGUF model files, similar to the one you've started, can be done in parallel. Also, directly modifying
(force-pushed from ff3ded1 to 8b8ad7d)
Implemented an initial API for reading GGUF files. There were some minor deviations from the specification, but nothing is final yet. Will summarize in the end and update the spec accordingly. Tomorrow will continue with the update of
(force-pushed from bae72cb to ef482eb)
tip: port
Sorry if this is not the right place to ask, but why is whitespace getting escaped at all? The previous version was literally returning string slices right from the vocab, and now we need to replace occurrences of the escaped whitespace. Or it can write to a buffer, but still, every token that is generated needs to be searched for the substring, at every position, and then replaced. Obviously it's not going to be a bottleneck, but it seems wasteful and very odd. Can anybody explain what the reasoning behind this decision was?
@cztomsik It would be helpful to provide a specific example and what the expectation is.
I don't have a solid understanding of sentencepiece, but I do know that the recent tokenizer changes started at #167 and made their way through a few PRs before landing here. These changes were made for more accurate tokenization compared to the official sentencepiece implementation. |
Basically, I'm asking why we can't have a function which takes a token and returns a const string. Something like this: (This actually works, and I'm doing the replace in JS, but I was curious why it's necessary to do the escaping in the first place, and why we can't have unescaped whitespace in the vocab.)

One obvious fix would be to do the unescape when the vocab is loaded and keep two pointers for each entry (escaped + unescaped, if they differ); then you could avoid these replacements. And not only would it be faster, but the API would also be easier to understand and work with (right now you need a buffer, even if you're writing directly to stdout, for example).

BTW: I am not a C/C++ dev, so I might be missing something.

EDIT: My question is probably more related to this addition: #2810
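A minimal sketch of the unescape-at-load idea (Python for brevity; the class and method names here are hypothetical, not llama.cpp's actual API). Sentencepiece-style vocabs mark word-initial whitespace with the ▁ character (U+2581); doing the replacement once at load time makes per-token decoding a plain lookup with no allocations:

```python
# Sketch (hypothetical names, not llama.cpp's API).
# Vocab pieces arrive "escaped": the sentencepiece convention marks
# word-initial spaces with U+2581 ("▁").
ESCAPED_WS = "\u2581"

class Vocab:
    def __init__(self, escaped_pieces):
        # Do the replacement once, at load time ...
        self.escaped = list(escaped_pieces)
        self.unescaped = [p.replace(ESCAPED_WS, " ") for p in escaped_pieces]

    def token_to_str(self, token_id):
        # ... so decoding a token is a constant-time lookup: no per-token
        # search-and-replace, no temporary strings.
        return self.unescaped[token_id]

vocab = Vocab(["\u2581Hello", ",", "\u2581world"])
text = "".join(vocab.token_to_str(t) for t in [0, 1, 2])
print(text)  # " Hello, world"
```

The trade-off is roughly double the vocab memory for the pieces that differ, in exchange for zero work at generation time.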
See #2310 and #3096. For example: we want to make the tokenizer compatible with LLaMA's
Ok, I see. But from my limited understanding, the only tricky part is the encoding step, right? That's where different whitespace combinations can yield different token ids. When you get ids from the LLM and turn them into string/byte chunks, every piece is fixed and known in advance. It can be invalid UTF-8, but that should be the only problem so far, and therefore,

Or, I mean, there's one other thing, a bit related to grammars: if you are inside a multi-byte sequence, you should not sample certain tokens at all. But that is related to sampling and not to the id -> bytes conversion, IMHO.

BTW: I would be happy to help once you're done with your current changes and the tests are passing.
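A toy illustration of the decoding point being made above (not llama.cpp code; the byte-level split is invented for the example): each id maps to a fixed byte string, and individual pieces may be invalid UTF-8 on their own, yet concatenating the raw bytes before decoding works fine.

```python
# Toy illustration: id -> bytes is a fixed, known-in-advance mapping.
# "é" is b"\xc3\xa9" in UTF-8; split across two hypothetical byte-level
# tokens, each half is invalid UTF-8 by itself.
pieces = {0: b"caf", 1: b"\xc3", 2: b"\xa9"}

def detokenize(ids):
    # Concatenate raw bytes first, decode once at the end.
    return b"".join(pieces[i] for i in ids).decode("utf-8")

# Decoding a single half-piece raises, as expected:
try:
    pieces[1].decode("utf-8")
except UnicodeDecodeError:
    pass  # the piece is not valid UTF-8 on its own

print(detokenize([0, 1, 2]))  # "café"
```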
I just checked and the Python tokenizer works like this:

```python
print(tokenizer.id_to_piece(15043))
# _Hello
```

So I see your point - we should update

Not sure when I'll get to this though - if you open a PR, I can help with review
|
Probably yes, but need a careful look
No - it's a TODO: |
No, the only point was that there's a lot of strings getting allocated and destroyed if you're using the

In the end, I did it as shown below; it's only using one buffer, and it's still doing replacements for each generated token, but at least it's not creating and throwing away intermediate strings all the time. I think I could even avoid that buffer entirely, but it will be useful for rprompt (which I'm currently doing in the client code), so I'm fine with this and I don't need any changes; everything was possible with the currently existing APIs.

So far my impl is only missing the stripping of the leading space, but I've had that in my client code since the pre-GGUF version.
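The single-buffer approach described here can be sketched like this (a Python stand-in for the actual client code; the class, the leading-space handling, and all names are illustrative):

```python
# Sketch of the buffer-reuse idea: one shared buffer, per-token
# replacement of the escaped whitespace, no intermediate strings kept
# between tokens.
ESCAPED_WS = "\u2581"  # sentencepiece whitespace marker

class Detokenizer:
    def __init__(self, escaped_pieces):
        self.pieces = escaped_pieces
        self.buf = []  # reused across tokens instead of fresh strings

    def append(self, token_id):
        # The replace still happens per token, but into the shared buffer.
        self.buf.append(self.pieces[token_id].replace(ESCAPED_WS, " "))

    def flush(self, strip_leading=False):
        # Optionally strip the leading space (the sentencepiece-style
        # vocab puts one in front of the first word).
        out = "".join(self.buf)
        self.buf.clear()
        return out[1:] if strip_leading and out.startswith(" ") else out

d = Detokenizer(["\u2581Hi", "\u2581there"])
d.append(0)
d.append(1)
text = d.flush(strip_leading=True)
print(text)  # "Hi there"
```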
@cztomsik: I think @ggerganov and I are talking about slowly converging to the
Yup, got it. We have many temporary strings because of the unescape stuff. If we change the API to return the "vanilla" piece (similar to what the OG tokenizer does), then we won't have to unescape and we won't have the allocations. Alternatively, we can keep the current API and do what you suggested:
Will think about which way is better |
Just one remark: I believe the
ref:
This PR paves the way for integrating more models into `llama.cpp`. It changes the file format to which we convert the models by extending it with key-value meta information. This metadata is flexible and makes it possible to add specific information about the model being converted.

This is a breaking change, meaning that all existing `ggml` models will no longer be compatible after merging. You should obtain the original model data (e.g. Meta PTH or Hugging Face, etc.) and use the `convert.py` script to generate the new F16 or F32 `.gguf` models. From there, you can use all tools as usual: `quantize`, `main`, `perplexity`, etc.

Read more about the GGUF file format here: ggml-org/ggml#302

The PR also refactors some portions of `llama.cpp` and mainly extends `ggml` with a `gguf` API for loading and writing `.gguf` files. It incorporates LLaMA tokenizer fixes and adds (temporary?) BPE tokenizer support.

Huge thanks to all the contributors for writing the spec and helping with the implementation ❤️
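To illustrate the central idea of the format change — a model file that carries flexible key-value metadata — here is a toy serializer. This is NOT the real GGUF byte layout (see ggml-org/ggml#302 for the actual spec); the magic, type tags, and encoding below are invented purely to show the concept.

```python
import struct

# Toy key-value metadata container, illustrating why GGUF's kv pairs
# make model files self-describing. NOT the real GGUF binary layout.

def serialize(meta: dict) -> bytes:
    out = bytearray(b"TOY1")                 # toy magic, not GGUF's
    out += struct.pack("<I", len(meta))      # number of kv pairs
    for key, val in meta.items():
        kb = key.encode("utf-8")
        out += struct.pack("<I", len(kb)) + kb
        if isinstance(val, int):
            out += b"i" + struct.pack("<q", val)
        else:
            vb = val.encode("utf-8")
            out += b"s" + struct.pack("<I", len(vb)) + vb
    return bytes(out)

def deserialize(data: bytes) -> dict:
    assert data[:4] == b"TOY1"
    pos, meta = 4, {}
    (n,) = struct.unpack_from("<I", data, pos); pos += 4
    for _ in range(n):
        (klen,) = struct.unpack_from("<I", data, pos); pos += 4
        key = data[pos:pos + klen].decode("utf-8"); pos += klen
        tag = data[pos:pos + 1]; pos += 1
        if tag == b"i":
            (meta[key],) = struct.unpack_from("<q", data, pos); pos += 8
        else:
            (vlen,) = struct.unpack_from("<I", data, pos); pos += 4
            meta[key] = data[pos:pos + vlen].decode("utf-8"); pos += vlen
    return meta

meta = {"general.architecture": "llama", "llama.context_length": 4096}
print(deserialize(serialize(meta)) == meta)  # True
```

Because each entry names itself, a reader can pick out the keys it understands and skip the rest, which is what lets the format grow to cover new architectures without breaking old readers.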
Merge ETA: 21 Aug
Usage
Convert GGML to GGUF (not guaranteed to work)
Implementation plan:

- `ggml`
- `convert.py` + export GGUF (`convert-llama-h5-to-gguf.py`, `convert.py`)
- `llama.cpp` model loading code: simplify things and drop obsolete stuff
- `llama-utils.h`: merge it in `llama.cpp`
- `gguf` API in `llama.cpp` to load new GGUF F16 models
- `gguf` API to be able to output quantized GGUF models
- `ggml` refactor (left for future PR)
- `llama.cpp`: add support for alternative model inference graphs
- demonstrate integration with MPT or Falcon, i.e. be able to seamlessly load and inference any of the 2 models with `main` (left for future PR)

Contributions are welcome - just make the target branch this one.
Collaborators can push directly in the branch.
TODO
- make it work with F16 1D tensors (too big change for this PR, will fix later)
- `gguf`
- `gguf` for `examples/train-text-from-scratch` model writing
- `gguf` for `examples/convert-llama2c-to-ggml` model writing
- `#define LLAMA_API_CPP`
- `ftype` (print `ftype`)