llama : add support for EXAONE tied word embeddings #12451


Merged: 1 commit into ggml-org:master on Mar 18, 2025

Conversation

@ngxson (Collaborator) commented Mar 18, 2025

Fix #12448

Tested and confirmed to work with https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B

ngxson requested a review from ggerganov on Mar 18, 2025 at 14:56
@CISC (Collaborator) commented Mar 18, 2025

I was adding a weight-copy to convert_hf_to_gguf.py, as this is how it has been handled in other models, but I guess it makes more sense to handle this here instead...

Would it make sense to remove the copying for other models?

@ngxson (Collaborator, Author) commented Mar 18, 2025

Would it make sense to remove the copying for other models?

Which model is copying the weight? AFAIK it is preferable not to copy the weight if the model uses tied word embeddings, otherwise it defeats the whole point of reducing memory usage 😂
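
For a rough sense of the cost (back-of-the-envelope only; the dimensions below are hypothetical, not taken from the EXAONE config):

# Hypothetical dimensions, for illustration only; not read from any real config.
n_vocab, n_embd, bytes_per_param = 102_400, 2_560, 2  # 2 bytes per param for F16
duplicated = n_vocab * n_embd * bytes_per_param
print(f"duplicated output.weight would add ~{duplicated / 2**20:.0f} MiB")  # ~500 MiB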

@CISC (Collaborator) commented Mar 18, 2025

Which model is copying the weight? AFAIK it is preferable not to copy the weight if the model uses tied word embeddings, otherwise it defeats the whole point of reducing memory usage 😂

For sure, but the following converted models have this:

Bloom

if name == "word_embeddings.weight":
    assert self.tensor_names is not None
    # TODO: tie them at runtime, don't duplicate in the model file
    if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
        tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

GPT2

# note: GPT2 output is tied to (same as) wte in original model
if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

CodeShell

if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
    assert self.tensor_names is not None
    if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
        # copy tok_embd.weight to output.weight
        tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

Mamba (this one is interesting, I guess it's handled in model loading)

# assuming token_embd.weight is seen before output.weight
if self._tok_embd is not None and new_name == output_name:
    if torch.equal(self._tok_embd, data_torch):
        logger.debug(f"{output_name} is equivalent to {tok_embd_name}, omitting")
        return []
elif new_name == tok_embd_name:
    self._tok_embd = data_torch

@ngxson (Collaborator, Author) commented Mar 18, 2025

For Bloom, GPT-2 and CodeShell, yes, it seems like we can remove it and update llama-model.cpp to reuse the token_embd tensor. These are old models, so probably at that time there was no support for tied word embeddings in llama.cpp. Just make sure to test them after you implement the change.
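
A hedged sketch (not part of this PR) of what dropping the duplicate could look like for, say, the GPT-2 converter, assuming the loader falls back to token_embd.weight when output.weight is absent:

# Sketch only; assumes the usual modify_tensors() hook in convert_hf_to_gguf.py
# and that the runtime loader ties the weights when output.weight is missing.
def modify_tensors(self, data_torch, name, bid):
    del bid  # unused in this sketch
    new_name = self.map_tensor_name(name)
    # previously the converter also appended a copy as MODEL_TENSOR.OUTPUT when
    # new_name was the token embedding; now just emit the renamed tensor and let
    # llama-model.cpp reuse token_embd.weight for the output projection
    return [(new_name, data_torch)]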

For Mamba, I have no idea tbh; it's better to ask @compilade

@ngxson (Collaborator, Author) commented Mar 18, 2025

Edit: the code for Mamba does not duplicate the tensor; it only handles the case where the two tensors are the same. If that is the case, it does not write the same tensor twice, which is what we expect.

@CISC (Collaborator) commented Mar 18, 2025

Edit: the code for Mamba does not duplicate the tensor; it only handles the case where the two tensors are the same. If that is the case, it does not write the same tensor twice, which is what we expect.

Yep, but that means it's actually handled already for Mamba?

Edit: Indeed it is

llama.cpp/src/llama-model.cpp, lines 2628 to 2632 in 7dfad38:

output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, TENSOR_NOT_REQUIRED);
// if output is NULL, init from the input tok embed, duplicated to allow offloading
if (output == NULL) {
    output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
}

ngxson merged commit 99aa304 into ggml-org:master on Mar 18, 2025 (47 checks passed)
@David-AU-github commented
Please note, tested:
-> downloaded the source model from EXAONE's repo
-> converted HF -> F16 GGUF
-> quantized GGUFs DO NOT work, ERROR: error loading model: missing tensor 'output.weight'

I can, however, create GGUFs without issue from EXAONE's own F16 GGUF (not created with llama.cpp?):

https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B-GGUF

@CISC (Collaborator) commented Mar 21, 2025

@David-AU-github Make sure you are using llama-quantize from 99aa304 (b4915) or later.
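
A quick way to sanity-check a converted file (minimal sketch, assuming the gguf-py package; the file path is a placeholder):

# Minimal sketch; assumes `pip install gguf`, and the path below is a placeholder.
# A file converted with tied embeddings will not contain output.weight at all;
# per the note above, builds from b4915 onward reuse token_embd.weight instead.
from gguf import GGUFReader

reader = GGUFReader("EXAONE-Deep-2.4B-F16.gguf")  # placeholder path
names = {t.name for t in reader.tensors}
print("output.weight present:    ", "output.weight" in names)
print("token_embd.weight present:", "token_embd.weight" in names)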


Successfully merging this pull request may close these issues.

EXAONE Deep 2 unsupported?