[TP plans] Fix some incorrect TP plans #42448
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

[For maintainers] Suggested jobs to run (before merge): run-slow: apertus, doge, nanochat
| "layers.*.input_layernorm.weight": "sequence_parallel", | ||
| "layers.*.input_residual": "sequence_parallel", | ||
| "layers.*.post_attention_layernorm.weight": "sequence_parallel", | ||
| "layers.*.post_attention_residual": "sequence_parallel", | ||
| "norm.weight": "sequence_parallel", |
SP (sequence parallel) is super slow, so I removed it while I was at it.
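For context, a `base_model_tp_plan` in transformers is just a mapping from layer-name glob patterns to partitioning styles, and the removed lines above are the `sequence_parallel` entries of such a plan. Below is a minimal sketch of what a plan looks like once those entries are dropped; the concrete patterns are illustrative assumptions, not copied from this PR.

```python
# Illustrative sketch of a tensor-parallel plan: glob patterns over module names
# mapped to partitioning styles. The exact entries are assumptions for illustration.
base_model_tp_plan = {
    # Attention: shard q/k/v column-wise, gather back through a row-wise o_proj.
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    # MLP: same pattern for the gate/up vs. down projections.
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
    # No "sequence_parallel" entries for the layernorms / residuals anymore:
    # SP was found to be slow and those entries are removed by this PR.
}
```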
The only slightly weird one is transformers/src/transformers/models/cohere/modular_cohere.py, lines 139 to 146 (at 352a2e0).

Not sure if we should just replicate either way, but I left it for now.
Thanks a lot for double-checking @vasqu! For cohere, if the main checkpoints do not use the norm, let's not replicate, as it's a huge performance drawback!
What does this PR do?
As per the title. See here for the source.
TODO CYRIL: the modified models only apply the norm on head_dim, so they should be fine without replication. But then nano_chat needs to remove its useless replication, and let's check the other models with norms to be sure.
cc @vasqu
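To make the head_dim point concrete, here is a minimal sketch (assumed shapes and names, not the library's actual module) of why a qk-norm whose weight only spans head_dim needs no replication entry in the TP plan:

```python
# Hedged sketch: a norm that acts only over the last (head_dim) axis.
# Names and shapes are illustrative assumptions, not transformers internals.
import torch
import torch.nn as nn


class PerHeadRMSNorm(nn.Module):
    """RMSNorm whose learnable weight covers only head_dim."""

    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(head_dim))  # shape: (head_dim,)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_local_heads, seq_len, head_dim)
        var = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(var + self.eps)


# With colwise-sharded q/k projections, each TP rank only holds a subset of heads,
# but every local head still sees the full head_dim. The (head_dim,) weight is
# therefore identical on all ranks and needs no extra replication hook in the plan.
# A norm over the full hidden size (num_heads * head_dim) would be a different
# story and would have to be sharded or replicated explicitly.
```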