3outeille/transformers backend (Dense model only) #2048
Conversation
… gradnorm and less tps with HF model
Hi @3outeille! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
wwwjn left a comment:
Thanks for the great work again, left some comments.
setattr(model, module_name, None)
# Replace with Identity or None based on configuration
replacement = (
    nn.Identity() if use_identity_for_missing_modules else None
Could you quickly remind me why we need to use Identity() here?
I think it's because HF defines its models without guards like `if tok_embeddings is None`.
I still worry that such identities break DCP and could be the source of the PP numerics issue. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
> The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?

Seems like the PP ranks are restored perfectly, because we have a perfect match with Qwen but not with Llama, for example (cf. the screenshot at huggingface#4).
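For readers following along, here is a minimal, self-contained sketch of the replacement pattern being discussed. The helper name and list-based interface are illustrative assumptions, not the PR's actual API; only the nn.Identity()-vs-None choice mirrors the diff above.

```python
import torch.nn as nn


def drop_modules_not_on_this_stage(
    model: nn.Module,
    module_names: list[str],
    use_identity_for_missing_modules: bool = True,
) -> None:
    """Strip modules that do not belong to this pipeline stage.

    HF modeling code calls its submodules unconditionally (there are no
    `if self.tok_embeddings is None` guards as in torchtitan's own models),
    so an nn.Identity() placeholder keeps forward() runnable on stages where
    a module was removed, while None matches torchtitan's native convention.
    """
    for module_name in module_names:
        replacement = nn.Identity() if use_identity_for_missing_modules else None
        setattr(model, module_name, replacement)
```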
)

def apply_fsdp(
Reading this function, it looks the same as the apply_fsdp function in llama4/parallelize (I know we will keep MoE capability for the next PR). Can we reuse the apply_fsdp function from llama4 and avoid keeping multiple copies?
Oh, I see the difference. The only difference is `moe_block = transformer_block.mlp` on line 337: in transformers models, the MoE module is named `mlp` instead of `moe`. Can we use the same getter/setter approach to rename it in model.py, so we can reuse the apply_fsdp function from llama4?
I don't have a strong opinion on this, but I'm a little concerned that if we keep several copies, they will diverge easily in the future.
Valid concern. I'll reuse apply_fsdp from llama3 for now, as this PR handles only dense models. It will make more sense to handle the getter/setter in the MoE PR.
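To make the deferred idea concrete, a hedged sketch of the getter/setter rename is below. The wrapper class is hypothetical and not part of this PR; it only illustrates exposing the HF `mlp` submodule under the `moe` name that llama4's apply_fsdp looks up, and a read-only alias suffices because apply_fsdp wraps the module in place.

```python
import torch.nn as nn


class HFBlockWithMoeAlias(nn.Module):
    """Hypothetical wrapper: transformers names its expert/FFN module `mlp`,
    while llama4's apply_fsdp accesses `transformer_block.moe`. A read-only
    property bridges the naming gap without duplicating the module in the
    state dict."""

    def __init__(self, hf_block: nn.Module):
        super().__init__()
        self.hf_block = hf_block

    @property
    def moe(self) -> nn.Module:
        # Alias the HF `mlp` submodule under torchtitan's `moe` name.
        return self.hf_block.mlp

    def forward(self, *args, **kwargs):
        return self.hf_block(*args, **kwargs)
```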
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated comment thread, resolved)
tianyu-l left a comment:
Please address final comments.
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated comment thread, resolved)
setattr(model, module_name, None)
# Replace with Identity or None based on configuration
replacement = (
    nn.Identity() if use_identity_for_missing_modules else None
I think it's because HF defines its models without guards like `if tok_embeddings is None`.
I still worry that such identities break DCP and could be the source of the PP numerics issue. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
It sounds like the changes are caused by the specific way transformers defines its models. Then let's fork the changed functions into experiments/transformers_backend/. I apologize for the back and forth.
But isn't the compromise good enough? Copy-pasting means we won't notice changes in pipeline parallel later on.
For rotary_emb, torchtitan doesn't own the model definition, so it has no visibility into this module and no guarantee of correctness. That's why I think it's better for the transformers_backend folder to own this function.
Regarding use_identity_for_missing_modules, I'm not convinced that it would work with DCP. If it's a transformers-specific decision, we should also limit the scope of the change to the transformers_backend folder.
Thanks!
torchtitan/distributed/utils.py (outdated)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Otherwise, Hugging Face modeling registers a buffer for RoPE (inv_freq) and by default it will be initialized to NaN
torch.utils.deterministic.fill_uninitialized_memory = False
If you think this is HF-specific and can be put in model.py, let's do it.
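If the flag does move, one possible shape is sketched below. The helper name and call site are assumptions; only the flag itself and the reason for it come from the diff above.

```python
import torch


def maybe_disable_nan_fill_for_hf(deterministic: bool) -> None:
    """Hypothetical helper keeping the HF-specific workaround in
    transformers_backend/model.py rather than torchtitan/distributed/utils.py.

    With deterministic mode enabled, torch fills uninitialized memory with NaN;
    the non-persistent RoPE buffer (inv_freq) that transformers registers would
    then end up as NaN, so the fill is turned off for the HF backend.
    """
    if deterministic:
        torch.utils.deterministic.fill_uninitialized_memory = False
```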
OverrideDefinitions(
    [
        [
            "--model.name meta-llama/Llama-3.2-1B",
CI seems to be failing because of this -- should we change it to transformers_backend and specify --hf_transformers.model?
Force-pushed from bcf5355 to c0c273c.
tianyu-l left a comment:
It seems CI is not running on this change; please see the inline comments.
I also left some other remaining minor comments.
| [moe_symm_mem_kernels](./moe_symm_mem_kernels/) | TBA | [@kwen2501](https://github.com/kwen2501) |
| [gpt_oss](./gpt_oss/) | TBA | [@jianiw](https://github.com/jianiw) |
| [compiler_toolkit](./compiler_toolkit/) | [](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu_compiler_toolkit.yaml?query=branch%3Amain) | [@SherlockNoMad](https://github.com/SherlockNoMad) [@yiming0416](https://github.com/yiming0416) |
| [transformers_backend](./transformers_backend/) | 8 GPU Integration Test | [@3outeille](https://github.com/3outeille) |
This is not properly linked to the actual tests. Please refer to how others are done.
same root config file.
"""
integration_tests_flavors = [
    OverrideDefinitions(
This is missing --model.name transformers_backend, so the CI is actually running llama3 from the ./tests/integration_tests/base_config.toml file:
https://github.com/pytorch/torchtitan/actions/runs/19500238760/job/55877797776?pr=2048#step:16:392
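For illustration, the corrected flavor entry might look roughly like the sketch below. The import path, the description/name arguments, and the single override list are assumptions about the test helper; the two flags are the ones mentioned in this review.

```python
# Sketch only: the import path and trailing arguments are assumptions about
# torchtitan's integration-test helper, not verified against this PR.
from tests.integration_tests import OverrideDefinitions

integration_tests_flavors = [
    OverrideDefinitions(
        [
            [
                # Select the HF backend instead of the default llama3 model ...
                "--model.name transformers_backend",
                # ... and point it at a concrete HF checkpoint.
                "--hf_transformers.model meta-llama/Llama-3.2-1B",
            ],
        ],
        "transformers backend debug run",
        "transformers_backend",
    ),
]
```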
[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = 2
Let's restore this to -1, and the others to 1, so it's consistent with the other debug tomls.
mixed_precision_param = "float32"  # force float32 for comparison
mixed_precision_reduce = "float32"
Can we remove these two fields, so the default mixed_precision_param (bf16) is used?
[model]
name = "transformers_backend"
flavor = "debugmodel"
I think it doesn't hurt to create two tomls: one with debugmodel and the c4_test dataset, the other with full and the c4 dataset.
Let's still name this pipeline.py (just convention, no real reason).
Context
Reference PR: huggingface#1
This PR enables:
- meta-llama/Llama-3.2-1B
- microsoft/phi-2
- Qwen/Qwen2.5-7B
- mistralai/Mistral-7B-v0.1
- ByteDance-Seed/Seed-Coder-8B-Instruct
- Qwen/Qwen3-4B-Instruct-2507
- arcee-ai/AFM-4.5B
- ibm-granite/granite-3b-code-base-2k
- baidu/ERNIE-4.5-0.3B-Base-PT
- kyutai/helium-1-preview-2b
- allenai/OLMo-7B-hf
- mistralai/Ministral-8B-Instruct-2410 (loss and grad_norm start very high)

Usage
- transformers==4.57.1
- Config: torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml
- Run: LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml ./run_train.sh --job.custom_config_module=torchtitan.experiments.transformers_backend.job_config --compile.enable

Testing methodology
- Baseline comparison: FSDP=2 vs FSDP=2 & <other //-ism> (see the log-comparison sketch after the layout below)
- Here is what test_hf_integration.py is going to do:

results/
|_ meta-llama
   |_ Llama-3.2-1B
      |_ debugmodel/
         |_ seed_checkpoint/
            |_ config.toml
            |_ seed.slurm
            |_ step-0/
               |_ ....
         |_ fsdp2_tp1_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
         |_ fsdp2_tp2_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp1_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
      |_ full/
         ...
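The per-configuration diff_baseline_vs_nd_parallelism.log suggests a metric-level comparison between the baseline run and the n-D parallelism run. Below is a hedged sketch of that kind of comparison; the log-line format and function names are assumptions, not the actual test_hf_integration.py implementation.

```python
import re
from pathlib import Path

# Assumed log-line shape: "... step: 10 ... loss: 7.5678 ... grad_norm: 1.2345 ..."
METRIC_RE = re.compile(
    r"step:\s*(\d+).*?loss:\s*([\d.eE+-]+).*?grad_norm:\s*([\d.eE+-]+)"
)


def parse_metrics(log_path: Path) -> dict[int, tuple[str, str]]:
    """Collect (loss, grad_norm) strings per step so the comparison stays bitwise."""
    metrics: dict[int, tuple[str, str]] = {}
    for line in log_path.read_text().splitlines():
        match = METRIC_RE.search(line)
        if match:
            step, loss, grad_norm = match.groups()
            metrics[int(step)] = (loss, grad_norm)
    return metrics


def diff_runs(baseline_log: Path, nd_parallelism_log: Path) -> None:
    """Report the first step where baseline and n-D parallelism stop matching."""
    baseline = parse_metrics(baseline_log)
    nd_parallelism = parse_metrics(nd_parallelism_log)
    for step in sorted(baseline.keys() & nd_parallelism.keys()):
        if baseline[step] != nd_parallelism[step]:
            print(f"step {step}: baseline={baseline[step]} nd={nd_parallelism[step]}")
            return
    print("loss and grad_norm match bitwise on all common steps")
```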
Further tasks

- build_optimizers_with_moe_load_balancing support for MoE
- FSDP=2 vs FSDP=2 + PP=2: the loss and grad_norm are not bitwise matching (but converging), while they do match with the torchtitan modeling (issue tracked in "Fix pp convergence to be bitwise" huggingface/torchtitan#4)
- import torch._dynamo.config; torch._dynamo.config.cache_size_limit = 128 to avoid graph recompilation when using torch.compile and activation checkpointing (see the sketch below)
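As a concrete form of the last task above, the workaround could sit near the trainer entry point; the exact placement is still open, so treat this as a sketch rather than the PR's final change.

```python
import torch._dynamo.config

# The combination of torch.compile and activation checkpointing can trigger many
# recompilations; raising the cache limit (value taken from the task list above)
# avoids hitting the default recompilation cap during training.
torch._dynamo.config.cache_size_limit = 128
```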