
Feature Request: Support YaRN RoPE Scaling on Qwen2MoeModel/Qwen3MoeModel models on convert_hf_to_gguf.py #13322


Closed
rjmalagon opened this issue May 5, 2025 · 10 comments · Fixed by #13331
Labels
enhancement New feature or request

Comments

@rjmalagon

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Setting YaRN RoPE scaling from config.json works for Qwen2Model/Qwen3Model, but is missing from the Qwen3MoeModel GGUF conversion.

Motivation

Qwen/Qwen3-235B-A22B and Qwen/Qwen3-30B-A3B on HF support YaRN RoPE scaling.

Possible Implementation

I'm not a Python expert...
In the Qwen2MoeModel class, add YaRN RoPE scaling detection and writing to set_gguf_parameters:

self._try_set_pooling_type()
if self.hparams.get("rope_scaling") is not None and "factor" in self.hparams["rope_scaling"]:
    if self.hparams["rope_scaling"].get("type") == "yarn":
        self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
        self.gguf_writer.add_rope_scaling_factor(self.hparams["rope_scaling"]["factor"])
        self.gguf_writer.add_rope_scaling_orig_ctx_len(self.hparams["rope_scaling"]["original_max_position_embeddings"])
rjmalagon added the enhancement (New feature or request) label on May 5, 2025
@rjmalagon
Author

With some ugly copy-and-paste code in convert_hf_to_gguf.py, it seems to work:

    qwen3moe.rope.freq_base                          1e+06               
    qwen3moe.rope.scaling.factor                     4                   
    qwen3moe.rope.scaling.original_context_length    32768               
    qwen3moe.rope.scaling.type                       yarn         
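
For reference, the written keys can be double-checked after conversion (a sketch, assuming the gguf-dump tool installed with the gguf-py package; the file name is a placeholder):

    gguf-dump Qwen3-30B-A3B-yarn.gguf   # prints the key-value metadata, including the qwen3moe.rope.scaling.* entries above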

@steampunque

> With some ugly copy-and-paste code in convert_hf_to_gguf.py, it seems to work:
>
>     qwen3moe.rope.freq_base                          1e+06
>     qwen3moe.rope.scaling.factor                     4
>     qwen3moe.rope.scaling.original_context_length    32768
>     qwen3moe.rope.scaling.type                       yarn

I don't rely on this stuff being inside the GGUF in my model loader. You can set the parameters at load time so you know they will be right:

--rope-scaling {none,linear,yarn}   RoPE frequency scaling method, defaults to linear unless specified by the model
                                    (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                      RoPE context scaling factor, expands context by a factor of N
                                    (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                  RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
                                    (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                 RoPE frequency scaling factor, expands context by a factor of 1/N
                                    (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--rope-yarn-log-mul N               RoPE yarn log mul
                                    (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                   YaRN: original context size of model (default: 0 = model training context size)
                                    (env: LLAMA_ARG_YARN_ORIG_CTX)

As far as my understanding of YaRN goes, you need to set the scaling factor to the KV length you have specified divided by the original context length anyway. Thus if you fire up the model with less than 32768 KV, turn YaRN off with --rope-scaling none. If you fire up the model with KV > 32768, turn on YaRN, set freq base and original context length as specified by the model, and set --rope-scale to KV / 32768 (a fractional value) at model load time.
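
For example (a sketch assuming llama.cpp's llama-server and the flags above; the model file name is a placeholder), a 65536-token KV cache on a model trained at 32768 would use a factor of 65536 / 32768 = 2.0:

    llama-server -m Qwen3-30B-A3B.gguf -c 65536 \
        --rope-scaling yarn --rope-scale 2.0 --yarn-orig-ctx 32768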

@rjmalagon
Author

You are right. But we don't have that feature on Ollama.

@steampunque

> You are right. But we don't have that feature on Ollama.

It will run degraded if you turn on YaRN with any context < 32k. Most users will not be running even 32k since GPU VRAM is not big enough, so leaving RoPE/YaRN scaling off is probably the best default config for the GGUF if you can't configure it at model load.

@rjmalagon
Author

I know that well. Some users, like me, use cheap AMD APUs (Radeon 660M or better) with plenty of system RAM (>90 GB, via GTT on Linux), which work beautifully at long context (64k+) with small models (<14B) and MoE models (like Qwen3-30B-A3B at BF16 precision).

We don't need fast answers; we can wait for accurate ones.

@CISC
Collaborator

CISC commented May 6, 2025

Since you have to manually add it to config.json anyway, it should probably be added to convert_hf_to_gguf.py to simplify things for those making GGUFs. I'll make a PR.

CISC linked a pull request (#13331) on May 6, 2025 that will close this issue
@ngxson
Collaborator

ngxson commented May 6, 2025

@CISC
Collaborator

CISC commented May 6, 2025

> I don't see the mentioned config

That's because it's not; they've consistently disabled it by default for a while now. It's mentioned in the README.md.

@rjmalagon
Author

rjmalagon commented May 7, 2025

The README was not clear enough, and the suggested parameter is probably misspelled.

I realized this some days ago.

In config.json, I changed the "rope_type": "yarn" part to "type": "yarn".

Example that works with the converter:

  "rope_scaling": {
    "type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768
  }

@CISC
Collaborator

CISC commented May 7, 2025

> In config.json, I changed the "rope_type": "yarn" part to "type": "yarn".

Ah, just looked into it and found out that transformers has renamed this parameter at some point, so we need to support both. I'll fix it.
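
A minimal sketch of what accepting both key names could look like in convert_hf_to_gguf.py (not the actual fix from the linked PR; the fallback logic is an assumption, and the gguf_writer calls mirror the snippet at the top of this issue):

    # Sketch only: accept the older "type" key as well as the newer "rope_type"
    # key used by recent transformers configs for rope_scaling.
    rope_scaling = self.hparams.get("rope_scaling") or {}
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    if rope_type == "yarn" and "factor" in rope_scaling:
        self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
        self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
        self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])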

Thanks for reporting. :)
