Converting GGML->GGUF: ValueError: Only GGJTv3 supported #2990

Closed
jboero opened this issue Sep 3, 2023 · 25 comments · Fixed by #3023

@jboero
Contributor

jboero commented Sep 3, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

My GGML converted models should be easy to convert to GGUF.
I know the conversion tools aren't guaranteed, but I'd like to file this one in case anybody else has a workaround or a more version-flexible option. I would love to see any version of GGML/GGJT supported if possible. Instead, my GGML files converted earlier are apparently not supported for conversion to GGUF.

Is there any tool to show the standard version details of a model file? Happy to contribute one if there isn't.

Current Behavior

python3 ./convert-llama-ggmlv3-to-gguf.py -i llama-2-70b/ggml-model-f32.bin -o test.gguf
=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===

* Scanning GGML input file
Traceback (most recent call last):
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 353, in <module>
    main()
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 335, in main
    offset = model.load(data, 0)
             ^^^^^^^^^^^^^^^^^^^
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 125, in load
    offset += self.validate_header(data, offset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 121, in validate_header
    raise ValueError('Only GGJTv3 supported')
ValueError: Only GGJTv3 supported

Environment and Context

Working with models

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    Physical Fedora 38, probably irrelevant given that this is Python.

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  56
  On-line CPU(s) list:   0-55
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2
    Stepping:            1
    CPU(s) scaling MHz:  40%
    CPU max MHz:         3200.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            3990.92
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts a
                         cpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_per
                         fmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes
                         64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_
                         2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowp
                         refetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb st
                         ibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bm
                         i2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc
                          cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   896 KiB (28 instances)
  L1i:                   896 KiB (28 instances)
  L2:                    7 MiB (28 instances)
  L3:                    70 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-13,28-41
  NUMA node1 CPU(s):     14-27,42-55
  • Operating System, e.g. for Linux:

$ uname -a
Linux z840 6.4.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 23 17:46:49 UTC 2023 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.4
$ make --version
$ g++ --version

Failure Information (for bugs)

python3 ./convert-llama-ggmlv3-to-gguf.py -i llama-2-70b/ggml-model-f32.bin -o test.gguf
=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===

* Scanning GGML input file
Traceback (most recent call last):
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 353, in <module>
    main()
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 335, in main
    offset = model.load(data, 0)
             ^^^^^^^^^^^^^^^^^^^
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 125, in load
    offset += self.validate_header(data, offset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[PATH]/llama.cpp/convert-llama-ggmlv3-to-gguf.py", line 121, in validate_header
    raise ValueError('Only GGJTv3 supported')
ValueError: Only GGJTv3 supported

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Convert any of the PTH models to GGML (using previous unversioned commits of the convert script).
  2. Convert the GGML to GGUF with the command given above.
@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 3, 2023

My GGML converted models should be easy to convert to GGUF.

There are versions of GGML that had really strange, difficult-to-support features like multi-part files, including individual tensors split across (or duplicated across) the files, etc. So supporting all versions of the previous GGML formats definitely isn't easy or simple.

It also looks like you're converting without the model metadata, and converting the vocabulary also isn't perfect. Even if you mess around with this, you're going to get a model that's lower quality than one that was directly converted to GGUF.

Is it really impractical for you to just download the GGUF version? edit: Also, do you know what version the file actually is?

@KerfuffleV2 KerfuffleV2 self-assigned this Sep 3, 2023
@jboero
Contributor Author

jboero commented Sep 3, 2023 via email

@Ph0rk0z

Ph0rk0z commented Sep 3, 2023

Is it really impractical for you to just download the GGUF version?

For those of us who use 30B/70B models, yes. It very much is impractical to download 40GB over and over again. Downloading unquantized models is also impractical because they are hundreds of GB. If you are downloading 10-20 models over time, this is virtually impossible due to data caps and internet speeds.

@KerfuffleV2
Collaborator

@jboero

Any documentation for the header info that I might be able to write an info

Not really; you kind of have to just figure it out by looking at the loading code for different versions in the projects that have supported GGML. You didn't answer my question about what version you have. If you can load it with an older llama.cpp version, I think it will say what it is when it gets loaded.

Or if you can show me the first 10 or so bytes in a hexdump. For example on Linux something like:

hexdump < /path/to/model.bin | head -2

@Ph0rk0z

Is it really impractical for you to just download the GGUF version?

For those of us who use 30B/70B models, yes.

Just to be clear, I wasn't saying that in a snarky way. People tend to just make issues when they run into a problem, even if there's a relatively easy workaround. Since converting the GGML models isn't ideal in the first place, I was just checking to make sure there wasn't an easier way to deal with this.

@Ph0rk0z

Ph0rk0z commented Sep 3, 2023

I don't wanna hijack, but I have a similar problem now. I converted a bunch of GGML to GGUF and they worked fine. Now I directly downloaded 2 quants in GGUF and am getting huge repetition problems at long context. I also have the GPTQ version and these issues aren't present, so it isn't the model.

So now, I'm stuck downloading a GGML of the same and converting it to see if that will work. But this means that I have downloaded over 120GB of the same model, not counting the GPTQ. All due to format.

On top of that, I also don't know if the other models are GGUFv1 or GGUFv2, whether from a direct quant or from the conversion scripts. GGUFv1 goes away in a month. Will they convert after? Will some other strange bug like this occur? You can see how this is frustrating, right?

@jboero
Contributor Author

jboero commented Sep 3, 2023 via email

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 3, 2023

@Ph0rk0z

I converted a bunch of GGML to GGUF and they worked fine.

Did you convert using metadata (the --model-metadata-dir option for convert-llama-ggmlv3-to-gguf.py)?

If not, that might be something to test: see if doing the GGML conversion with metadata leads to the repetition problem you mentioned.

GGUFv1 goes away in a month. Will they convert after?

I got your back with #2931. You can now use the quantize tool in copy mode to repackage your GGUFv1 files to GGUFv2.
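
If I remember the syntax right (double-check the quantize tool's usage output, since the exact arguments may differ), repackaging without requantizing is just something like:

./quantize old-model-ggufv1.gguf new-model-ggufv2.gguf COPY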


@jboero

Anyway that's probably for a separate issue but I'll do my best to write up a quick version/info app as there doesn't seem to be any.

The file will start with:

  1. bytes lmgg (ggml reversed). This is the old/original GGML version.
  2. bytes fmgg followed by a little-endian u32 for the version. That's GGMF.
  3. bytes tjgg followed by a little-endian u32 for the version. That's GGJT.

That's why I was asking you for a hexdump of the beginning of the file. (By the way, I'm the person who wrote the convert-llama-ggmlv3-to-gguf.py tool.)

I think the main difference between GGMF and GGML was that the version field was added and vocabulary items gained a score (an f32 with each vocab entry). GGJT added padding/alignment. I don't remember what's different between the different GGJT versions. Just to be clear, this is just talking about differences in the structure of the file, not really the content. I'm pretty sure there were some changes to quantization formats in there, and something like that is pretty much impossible to convert.
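
Since you mentioned writing a version/info app: a minimal sketch of one, based purely on the magics listed above (a hypothetical standalone helper, not something that ships with llama.cpp), would look roughly like this:

import struct, sys

def identify(path):
    # Read the magic (4 bytes) and, for versioned containers, the little-endian u32 version.
    with open(path, 'rb') as fp:
        magic = fp.read(4)
        if magic == b'lmgg':
            return 'GGML (original, unversioned)'
        if magic in (b'fmgg', b'tjgg'):
            version = struct.unpack('<I', fp.read(4))[0]
            return ('GGMF' if magic == b'fmgg' else 'GGJT') + f' v{version}'
        return f'unknown magic {magic!r}'

print(identify(sys.argv[1]))

The conversion script only accepts files that this would report as GGJT v3.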

You can possibly try changing:

if bytes(data[offset:offset + 4]) != b'tjgg' or struct.unpack('<I', data[offset + 4:offset + 8])[0] != 3:

to

if bytes(data[offset:offset + 4]) != b'tjgg':

(should be around line 120 in the script)

This will just completely ignore the version (it still only supports GGJT) and blindly attempt to proceed. It may or may not actually work.

@Ph0rk0z

Ph0rk0z commented Sep 3, 2023

I didn't use metadata. I only used the script. I'll also try the repeating models in GGUFv2 because why not. I won't have a GGML copy of the model to attempt to convert until tomorrow.

@KerfuffleV2
Collaborator

I'll also try the repeating models in GGUFv2 because why not.

You shouldn't notice any difference. The only thing GGUFv2 did (as far as I know) is change some types to 64-bit to allow expressing larger values. Existing files wouldn't have any values over 32 bits, so there wouldn't be a visible difference.

I didn't use metadata. I only used the script.

Alright. Well, if you notice that converting with metadata leads to the repetition issue but converting using only the GGML file doesn't, that would mean it's something with the new vocab stuff. GGUF is supposed to be better in that regard, though, and converting GGML to GGUF without the metadata is imperfect, so it's likely to be worse than the original GGML file in terms of quality.

@danielbrdz

Does anyone know how to solve the problem:
ValueError: Only GGJTv3 supported

I had already successfully converted GGML to GGUF last week, but I updated llama.cpp and now I get this error.

@KerfuffleV2
Collaborator

I had already successfully converted GGML to GGUF last week

What? The GGML to GGUF conversion script has only ever supported GGJTv3. Maybe you successfully converted a GGJTv3 file and then tried to convert a GGML file of a different version (non-GGJTv3). As for possible ways to deal with that, please read through the other posts in this issue.

I can't help you if I don't know the version of the GGML file you're trying to convert.

@jboero
Contributor Author

jboero commented Sep 4, 2023 via email

@Ph0rk0z

Ph0rk0z commented Sep 4, 2023

Got the K5M GGML. Running it as GGML, it still does some repetition. I also converted it to GGUF (v1/v2 makes no difference). The extreme repetition, where it generates the same thing over and over, happens about 1000 tokens later in the converted model. It can now be broken via mirostat with high TAU. You'll still get your bits of the previous messages but at least now the plot moves forward. Every message may start with "the robot smiles slyly", but the rest of the contents will be different.

Something is definitely wrong with how this was quantized. The GPTQ model had none of these problems. I will try to run it with pure exllama so that there are fewer samplers for a final check. I'm not sure what else to do or why it would be such a big difference between the two formats or why llama.cpp performs so poorly in this department. So far none of my other models have suffered from this, and GGUF/GGML have been smarter than their GPTQ equivalents.

Model is: https://huggingface.co/nRuaif/fiction.live-Kimiko-V2-70B and I'm using the bloke's quants.

@KerfuffleV2
Collaborator

Getting a bit off topic here, but...

I'm not sure what else to do or why it would be such a big difference between the two formats or why llama.cpp performs so poorly in this department.

It may just be random. Any quantization is going to cause some kind of degradation; some particular models may just get hit in a particularly critical way. Nothing has really changed recently with the quantization that I'm aware of. The only thing I can think of is that k-quants now quantizes a couple of tensors with higher quality than it used to, specifically for 70B LLaMA2 models. Because of the grouped-query attention stuff, those tensors are smaller than they were in LLaMAv1, so more bits can be spent on them.

It can now be broken via mirostat with high TAU. You'll still get your bits of the previous messages but at least now the plot moves forward.

Going even further off topic, you should try out my seqrep sampler in #2593; it's specifically designed to try to help with that kind of stuff. Try parameters like:

--top-p 2.0
--tfs 0.95 
--typical 0.25 
--repeat-last-n 0 
--top-k 140  
--no-penalize-nl
--temp 1.1

and for seqrep

-seqrep min_length=3:tolerance=1:tolerance_match_credit=.25:tolerance_half_step_cost=.25:flag_divide_by_penalty:presence_penalty=1.2:length_penalty=1.1:flag_tolerance_no_consecutive:last_n=-1:mid_word_scale=0

airoboros-l2-70b-2.1 (I have q4_k_m) also seems very good, and better than the -creative version of Airoboros, especially if you use the BEGININPUT/ENDINPUT, BEGININSTRUCTION/ENDINSTRUCTION stuff to set up the context and instructions. I think it's currently my favorite model.

Note: I haven't tested this stuff out for using models in chat or roleplay type modes. I think any type of repetition penalty is going to struggle there, because there's going to be a lot of repetition in stuff like "Character:" etc.

@Ph0rk0z

Ph0rk0z commented Sep 4, 2023

seqrep sampler in #2593,

Dang, I want to try that but I'm using textgen with python bindings as a backend and then doing the chats through silly tavern.
So I'd have to merge the PR and add the relevant bits to the bindings, then textgen, and then ST. Oof. Maybe there is a way to point silly tavern at the llama.cpp server directly somehow. I know there is kobold.cpp, but it's not up to date with main most of the time. Without a similar setup it will be hard to check that it's fixed.

I don't have any of these problems with platypus, qcamel, and a couple of other 70Bs, and I use the exact same settings and prompts. All quants: Q4_K_M, Q5_K_S, Q6. I mean not even a hint of it. I can load the GPTQ model up in the same broken chat and it immediately ends the repetition, even with plain exllama and the same low number of samplers. Loading another GGUF also immediately ends it, even if it went on for many messages.

I have all the airoboros models, mostly in LoRA form that I apply to GPTQ models. I'd love to do the same thing for GGUF, but offloaded and quantized models won't take a LoRA.

My main worry is that I'll d/l another 70B as GGUF instead of GGML for better quants and then be hit with this issue. Plus it's sad that I won't be able to use this one in its smarter GGUF form.

@danielbrdz

I had already successfully converted GGML to GGUF last week

What? The GGML to GGUF conversion script has only ever supported GGJTv3. Maybe you successfully converted a GGJTv3 file and then tried to convert a GGML file of a different version (non-GGJTv3). As for possible ways to deal with that, please read through the other posts in this issue.

I can't help you if I don't know the version of the GGML file you're trying to convert.

It is GGMLv3.
I convert my HF models to GGML using make-ggml.py from llama.cpp, and the format it tells me is GGMLv3.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 4, 2023

@danielbrdz

It is GGMLv3

I don't think there is such a thing; that's why I asked for a hex dump from the beginning of the file: #2990 (comment)

edit: I guess I can't criticize saying "GGML v3" too much, since I called the script that, after all. Unfortunately it's not really accurate enough to know exactly what the format is. I could have called it "GGJT v3", but generally users wouldn't know internal details like the fact that the current GGML format is actually GGJT.

Digging through the older code, it looks like these are the options: https://github.com/ggerganov/llama.cpp/blob/dadbed99e65252d79f81101a392d0d6497b86caa/llama.cpp#L506-L512

Based on that, it's probably possible to convert any of those versions all the way back to plain GGML as long as they're unquantized, i.e. just f16 or f32 format. (But I wouldn't even try to convert a file that has the really weird stuff like being in multiple parts.)

For quantized files, it looks like it's only possible to go back to GGJTv2, and in that case only if it's not q8 or q4 (I assume this means q8_0, q4_0, q4_1). There was a full quantization format change at GGJTv2, so converting any quantized files earlier than that wouldn't be possible.


@Ph0rk0z

Dang, I want to try that but I'm using textgen with python bindings as a backend and then doing the chats through silly tavern.

Ah, I can't really help you there and I also suspect it wouldn't work as well for a chat format compared to something like "Write a story with blah, blah, blah criteria".

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 5, 2023

@danielbrdz @jboero Please try converting with #3023 - that version should convert even very old GGML format files when it's possible. In cases where conversion isn't possible, it should give you a better error message.

I actually don't have any GGML files older than GGJTv3 lying around, so I'd appreciate any testing with older files.

Please note that in cases where the quantization format changed, it's just not possible to convert the file. So if your GGML isn't f16 or f32 format and it's older than GGJTv2, it just can't be converted. If it's GGJTv2 and Q8 or Q4 quantized, then it also can't be converted, since the format for those quantizations changed in GGJTv3.

Even for those files that can't be converted, it would be helpful if people can test and report back. You should get a reasonable error message when the file can't be converted, like:

ValueError: Q4 and Q8 quantizations changed in GGJTv3. Sorry, your GGJTv2 file of type MOSTLY_Q8_0 is not eligible for conversion.

It should also report the file format when loading to enable better reporting of problems:

* Scanning GGML input file
* File format: GGJTv3 with ftype MOSTLY_Q8_0
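
For reference, the eligibility rules above boil down to roughly this logic (just a sketch of what I described, not the actual code in #3023; the names are illustrative):

def convertible(container, version, ftype):
    # container: 'GGML', 'GGMF' or 'GGJT'; ftype: 'F32', 'F16', 'Q4_0', 'Q8_0', ...
    if ftype in ('F32', 'F16'):
        return True   # unquantized files can always be converted
    if container != 'GGJT' or version < 2:
        return False  # full quantization format change at GGJTv2
    if version == 2 and ftype in ('Q4_0', 'Q4_1', 'Q8_0'):
        return False  # Q4/Q8 layouts changed again in GGJTv3
    return True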

@jboero
Contributor Author

jboero commented Sep 5, 2023 via email

@KerfuffleV2
Collaborator

Just a note for people reading this: I renamed the script to convert-llama-ggml-to-gguf.py in #3023, since "ggmlv3" doesn't really make sense. If you test with the pull (or after it's merged), be sure to look for it under the new name. I'm waiting to get some feedback before I merge it, so if supporting more than GGJTv3 is relevant for you then please test!

@danielbrdz

I have solved the error: ValueError: Only GGJTv3 supported

As I told you before, last week I made the conversion from GGML to GGUF without problems.
But those GGML files had been converted from HF to GGML with a previous version of llama.cpp, specifically the August 20 version, which has no GGUF support.
Using the GGML I made with that version and the recent version of llama.cpp, I was able to do the GGML to GGUF conversion with no problems.
So using the August 20 version of llama.cpp I converted my HF model to GGML and then used the newer version of llama.cpp to convert it from GGML to GGUF without any problems.

This is how I solved the error. I have the theory that the error is due to converting an HF model to GGML in the latest version of llama.cpp: since it has residues of GGUF, things could be getting mixed, and that is the reason for the incompatibility behind the error discussed in this issue.

@jboero
Contributor Author

jboero commented Sep 6, 2023 via email

@KerfuffleV2
Collaborator

@danielbrdz

So using the August 20 version of llama.cpp I converted my HF model to GGML and then used the newer version of llama.cpp to convert it from GGML to GGUF without any problems.

I'm confused and not sure I understand correctly. If you have the HF model, why are you converting to GGML and then converting the GGML to GGUF instead of just converting from HF to GGUF directly?

I have the theory that the error is due to converting an HF model to GGML in the latest version of llama.cpp

There isn't a way to convert LLaMA models from HF to GGML anymore in the latest llama.cpp, as far as I know. The current convert.py converts directly to GGUF.

I added the conversion script (now renamed to convert-llama-ggml-to-gguf.py) just to ease the transition for people that had existing GGML files. If you plan to use the version of llama.cpp after the switch to GGUF I strongly suggest not creating any new GGML files.

@danielbrdz

@danielbrdz

So using the August 20 version of llama.cpp I converted my HF model to GGML and then used the newer version of llama.cpp to convert it from GGML to GGUF without any problems.

I'm confused and not sure I understand correctly. If you have the HF model, why are you converting to GGML and then converting the GGML to GGUF instead of just converting from HF to GGUF directly?

Answering your first question: I have converted from HF to GGUF directly on several occasions. The problem for me is that to do that conversion directly you can only do it for f16, f32 and I think also q8_0. Those formats are too heavy for my computer, and I need Q4_K_M or Q5_K_M to run the model locally on my PC. I don't think I can do it directly from HF to GGUF, so I have to go through GGML to get it in the format I want.

I have the theory that the error is due to converting an HF model to GGML in the latest version of llama.cpp

There isn't a way to convert LLaMA models from HF to GGML anymore in the latest llama.cpp, as far as I know. The current convert.py converts directly to GGUF.

I added the conversion script (now renamed to convert-llama-ggml-to-gguf.py) just to ease the transition for people that had existing GGML files. If you plan to use the version of llama.cpp after the switch to GGUF I strongly suggest not creating any new GGML files.

Regarding this, you can still convert HF to GGML in the latest update of llama.cpp: go to the "examples" folder and use the script called "make-ggml.py". I used it and you still can; it's just that for some reason you can't convert it to GGUF anymore.

@KerfuffleV2
Collaborator

the problem for me is that to do that conversion directly you can only do it for f16, f32 and I think also q8_0.

The intended workflow is to use convert.py to convert to a GGUF format and then use the quantize tool to quantize to formats like Q4_K_M. I recently added the q8_0 support for convert.py, but it's mainly just to make it easier to store models for later quantization. Keep in mind there's a small quality loss quantizing from q8_0 to something else compared to quantizing from f16 or f32; the advantage is that the file is half as big.
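
Roughly, the workflow looks like this (the paths and the f16 intermediate are just an example; double-check the flags against convert.py --help and the quantize usage output):

python3 convert.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf
./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M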
