
DBRX GGUF conversion no longer working #7074

@jukofyork

Description

Loading model: dbrx-instruct
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: file type = 1
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
chktok: [198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 378, 235, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 1644, 220, 8765, 220, 8765, 18, 220, 8765, 1644, 220, 8765, 8765, 220, 8765, 8765, 18, 220, 8765, 8765, 1644, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 33565, 111, 19000, 23182, 49792, 19967, 9263, 18136, 16, 36827, 21909, 56560, 54337, 19175, 14476, 1482, 13373, 64571, 34694, 3114, 15752, 17721, 80112, 3436, 4708, 4708, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          There are 2 possible reasons for this:
**          - the model has not been added to convert-hf-to-gguf-update.py yet
**          - the pre-tokenization config has changed upstream
**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
**
** chkhsh:  a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e
**************************************************************************************


Traceback (most recent call last):
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/juk/LLMs/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
main: build = 2776 (c4ec9c0d)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
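For context on the `chkhsh` value in the warning above: the conversion script fingerprints a tokenizer by encoding a fixed test string and hashing the `str()` of the resulting token-id list. Below is a minimal sketch of that mechanism (the function name `pretokenizer_hash` is illustrative; in convert-hf-to-gguf.py the hashing happens inline in `get_vocab_base_pre()`):

```python
from hashlib import sha256

def pretokenizer_hash(token_ids: list) -> str:
    # Hash the string form of the token-id list produced by encoding a fixed
    # test string. Different pre-tokenizer configs split the test string
    # differently, so the hash acts as a stable per-tokenizer fingerprint.
    return sha256(str(token_ids).encode()).hexdigest()
```

Feeding in the full `chktok` list from the log above should reproduce the `chkhsh` shown there; any truncation or reordering of the ids changes the hash.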

I edited in:

        { "name": "dbrx",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/databricks/dbrx-base", },

and:

        if chkhsh == "a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e":
            # ref: https://huggingface.co/databricks/dbrx-base
            res = "dbrx"

but looking at the PR for command-r, it looks like many more changes are needed to make this work:

https://github.com/ggerganov/llama.cpp/pull/7033/files
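For anyone following along, the two edits above plug into a hash-to-name dispatch inside `get_vocab_base_pre()`. A simplified sketch of that dispatch (the real function takes the tokenizer object and computes the hash itself; the standalone signature here is an assumption for illustration):

```python
# Known pre-tokenizer fingerprints; each entry is added manually, which is
# why an unrecognized model fails with NotImplementedError until someone
# registers its hash.
KNOWN_PRETOKENIZERS = {
    # ref: https://huggingface.co/databricks/dbrx-base
    "a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e": "dbrx",
}

def get_vocab_base_pre(chkhsh: str) -> str:
    try:
        return KNOWN_PRETOKENIZERS[chkhsh]
    except KeyError:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
```

Registering the hash only fixes vocabulary conversion; as the command-r PR shows, a new architecture may also need tensor-mapping and graph changes elsewhere in llama.cpp.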
