llama : add support for EXAONE tied word embeddings #12451


Merged: 1 commit into ggml-org:master on Mar 18, 2025

Conversation

@ngxson (Collaborator) commented Mar 18, 2025

Fix #12448

Tested and confirmed to work with https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B

ngxson requested a review from ggerganov on Mar 18, 2025 at 14:56
@CISC (Collaborator) commented Mar 18, 2025

I was adding a weight-copy to convert_hf_to_gguf.py, as this is how it has been handled in other models, but I guess it makes more sense to handle this here instead...

Would it make sense to remove the copying for other models?

@ngxson (Collaborator, Author) commented Mar 18, 2025

Would it make sense to remove the copying for other models?

Which model is copying the weight? AFAIK it is preferable not to copy the weight if the model uses tied word embeddings, otherwise it defeats the whole point of reducing memory usage 😂
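
For a rough sense of the cost (back-of-the-envelope only; the dimensions below are hypothetical, not taken from the EXAONE config):

# Hypothetical dimensions, for illustration only; not read from any real config.
n_vocab, n_embd, bytes_per_param = 102_400, 2_560, 2  # 2 bytes per param for F16
duplicated = n_vocab * n_embd * bytes_per_param
print(f"duplicated output.weight would add ~{duplicated / 2**20:.0f} MiB")  # ~500 MiB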

@CISC (Collaborator) commented Mar 18, 2025

Which model is copying the weight? AFAIK it is preferable not to copy the weight if the model uses tied word embeddings, otherwise it defeats the whole point of reducing memory usage 😂

For sure, but the following converted models have this:

Bloom

if name == "word_embeddings.weight":
    assert self.tensor_names is not None
    # TODO: tie them at runtime, don't duplicate in the model file
    if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
        tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

GPT2

# note: GPT2 output is tied to (same as) wte in original model
if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

CodeShell

if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
    assert self.tensor_names is not None
    if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
        # copy tok_embd.weight to output.weight
        tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

Mamba (this one is interesting, I guess it's handled in model loading)

# assuming token_embd.weight is seen before output.weight
if self._tok_embd is not None and new_name == output_name:
    if torch.equal(self._tok_embd, data_torch):
        logger.debug(f"{output_name} is equivalent to {tok_embd_name}, omitting")
        return []
elif new_name == tok_embd_name:
    self._tok_embd = data_torch

@ngxson (Collaborator, Author) commented Mar 18, 2025

For Bloom, GPT-2 and CodeShell, yes, it seems like we can remove it and update llama-model.cpp to reuse the token_embd tensor. These are old models, so probably at that time there was no support for tied word embeddings in llama.cpp. Just make sure to test them after you implement the change.
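
A hedged sketch (not part of this PR) of what dropping the duplicate could look like for, say, the GPT-2 converter, assuming the loader falls back to token_embd.weight when output.weight is absent:

# Sketch only; assumes the usual modify_tensors() hook in convert_hf_to_gguf.py
# and that the runtime loader ties the weights when output.weight is missing.
def modify_tensors(self, data_torch, name, bid):
    del bid  # unused in this sketch
    new_name = self.map_tensor_name(name)
    # previously the converter also appended a copy as MODEL_TENSOR.OUTPUT when
    # new_name was the token embedding; now just emit the renamed tensor and let
    # llama-model.cpp reuse token_embd.weight for the output projection
    return [(new_name, data_torch)]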

For Mamba, I have no idea tbh; it's better to ask @compilade

@ngxson (Collaborator, Author) commented Mar 18, 2025

Edit: the code for Mamba does not duplicate the tensor; it only handles the case where the two tensors are the same. If that is the case, it does not write the same tensor twice, which is what we expect.

@CISC (Collaborator) commented Mar 18, 2025

Edit: the code for Mamba does not duplicate the tensor; it only handles the case where the two tensors are the same. If that is the case, it does not write the same tensor twice, which is what we expect.

Yep, but that means it's actually handled already for Mamba?

Edit: Indeed it is

llama.cpp/src/llama-model.cpp, lines 2628 to 2632 in 7dfad38:

output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, TENSOR_NOT_REQUIRED);
// if output is NULL, init from the input tok embed, duplicated to allow offloading
if (output == NULL) {
    output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
}

ngxson merged commit 99aa304 into ggml-org:master on Mar 18, 2025 (47 checks passed)
@David-AU-github commented
Please note, tested:
-> downloaded the source model from EXAONE's repo
-> converted HF -> F16 GGUF
-> quantized GGUFs DO NOT work, ERROR: error loading model: missing tensor 'output.weight'

I can, however, create GGUFs without issue from EXAONE's own F16 GGUF (not created with llama.cpp?):

https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B-GGUF

@CISC (Collaborator) commented Mar 21, 2025

@David-AU-github Make sure you are using llama-quantize from 99aa304 (b4915) or later.
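
A quick way to sanity-check a converted file (minimal sketch, assuming the gguf-py package; the file path is a placeholder):

# Minimal sketch; assumes `pip install gguf`, and the path below is a placeholder.
# A file converted with tied embeddings will not contain output.weight at all;
# per the note above, builds from b4915 onward reuse token_embd.weight instead.
from gguf import GGUFReader

reader = GGUFReader("EXAONE-Deep-2.4B-F16.gguf")  # placeholder path
names = {t.name for t in reader.tensors}
print("output.weight present:    ", "output.weight" in names)
print("token_embd.weight present:", "token_embd.weight" in names)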


Successfully merging this pull request may close these issues.

EXAONE Deep 2 unsupported?