3outeille/transformers backend (Dense model only) #2048
Conversation
… gradnorm and less tps with HF model
Hi @3outeille! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
wwwjn left a comment:
Thanks for the great work again, left some comments.
setattr(model, module_name, None)
# Replace with Identity or None based on configuration
replacement = (
    nn.Identity() if use_identity_for_missing_modules else None
Could you quickly remind me why we need to use Identity() here?
I think it's because HF defines its models without guards like `if tok_embeddings is None`.
I still worry that such identities break DCP and could be the source of the PP numerics issue. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
> The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?

Seems like the PP ranks are restored perfectly, because we have a perfect match with Qwen but not with Llama, for example (cf. the screenshot at huggingface#4).
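For readers following along, here is a minimal, self-contained sketch of the replacement pattern being discussed. The helper name and list-based interface are illustrative assumptions, not the PR's actual API; only the nn.Identity()-vs-None choice mirrors the diff above.

```python
import torch.nn as nn


def drop_modules_not_on_this_stage(
    model: nn.Module,
    module_names: list[str],
    use_identity_for_missing_modules: bool = True,
) -> None:
    """Strip modules that do not belong to this pipeline stage.

    HF modeling code calls its submodules unconditionally (there are no
    `if self.tok_embeddings is None` guards as in torchtitan's own models),
    so an nn.Identity() placeholder keeps forward() runnable on stages where
    a module was removed, while None matches torchtitan's native convention.
    """
    for module_name in module_names:
        replacement = nn.Identity() if use_identity_for_missing_modules else None
        setattr(model, module_name, replacement)
```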
)

def apply_fsdp(
Reading this function, it looks the same as the apply_fsdp function in llama4/parallelize (I know we will keep MoE capability for the next PR). Can we reuse the apply_fsdp function from llama4 and avoid keeping multiple copies?
Oh, I see the difference. The only difference is `moe_block = transformer_block.mlp` on line 337: in transformers models, the MoE module is named `mlp` instead of `moe`. Can we use the same getter/setter approach to rename it in model.py, so we can reuse the apply_fsdp function from llama4?
I don't have a strong opinion on this, but I'm a little concerned that if we keep several copies, they will diverge easily in the future.
Valid concern. I'll reuse apply_fsdp from llama3 for now, as this PR handles only dense models. It will make more sense to handle the getter/setter in the MoE PR.
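To make the deferred idea concrete, a hedged sketch of the getter/setter rename is below. The wrapper class is hypothetical and not part of this PR; it only illustrates exposing the HF `mlp` submodule under the `moe` name that llama4's apply_fsdp looks up, and a read-only alias suffices because apply_fsdp wraps the module in place.

```python
import torch.nn as nn


class HFBlockWithMoeAlias(nn.Module):
    """Hypothetical wrapper: transformers names its expert/FFN module `mlp`,
    while llama4's apply_fsdp accesses `transformer_block.moe`. A read-only
    property bridges the naming gap without duplicating the module in the
    state dict."""

    def __init__(self, hf_block: nn.Module):
        super().__init__()
        self.hf_block = hf_block

    @property
    def moe(self) -> nn.Module:
        # Alias the HF `mlp` submodule under torchtitan's `moe` name.
        return self.hf_block.mlp

    def forward(self, *args, **kwargs):
        return self.hf_block(*args, **kwargs)
```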
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated comment thread, resolved)
tianyu-l left a comment:
Please address final comments.
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated comment thread, resolved)
setattr(model, module_name, None)
# Replace with Identity or None based on configuration
replacement = (
    nn.Identity() if use_identity_for_missing_modules else None
I think it's because HF defines its models without guards like `if tok_embeddings is None`.
I still worry that such identities break DCP and could be the source of the PP numerics issue. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
It sounds like the changes are caused by the specific way transformers defines its models. Then let's fork the changed functions into experiments/transformers_backend/. I apologize for the back and forth.
But isn't the compromise good enough? Copy-pasting means we won't notice changes in pipeline parallel later on.
For rotary_emb, torchtitan doesn't own the model definition, so it has no visibility into this module and no guarantee of correctness. That's why I think it's better for the transformers_backend folder to own this function.
Regarding use_identity_for_missing_modules, I'm not convinced that it would work with DCP. If it's a transformers-specific decision, we should also limit the scope of the change to the transformers_backend folder.
Thanks!
torchtitan/distributed/utils.py (outdated)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Otherwise, Hugging Face modeling registers a buffer for RoPE (inv_freq) and by default it will be initialized to NaN
torch.utils.deterministic.fill_uninitialized_memory = False
If you think this is HF-specific and can be put in model.py, let's do it.
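If the flag does move, one possible shape is sketched below. The helper name and call site are assumptions; only the flag itself and the reason for it come from the diff above.

```python
import torch


def maybe_disable_nan_fill_for_hf(deterministic: bool) -> None:
    """Hypothetical helper keeping the HF-specific workaround in
    transformers_backend/model.py rather than torchtitan/distributed/utils.py.

    With deterministic mode enabled, torch fills uninitialized memory with NaN;
    the non-persistent RoPE buffer (inv_freq) that transformers registers would
    then end up as NaN, so the fill is turned off for the HF backend.
    """
    if deterministic:
        torch.utils.deterministic.fill_uninitialized_memory = False
```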
OverrideDefinitions(
    [
        [
            "--model.name meta-llama/Llama-3.2-1B",
CI seems to be failing because of this -- should we change it to transformers_backend and specify --hf_transformers.model?
Force-pushed from bcf5355 to c0c273c.
tianyu-l left a comment:
It seems CI is not running on this change; please see the inline comments.
I also left some other remaining minor comments.
| [moe_symm_mem_kernels](./moe_symm_mem_kernels/) | TBA | [@kwen2501](https://github.com/kwen2501) |
| [gpt_oss](./gpt_oss/) | TBA | [@jianiw](https://github.com/jianiw) |
| [compiler_toolkit](./compiler_toolkit/) | [](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu_compiler_toolkit.yaml?query=branch%3Amain) | [@SherlockNoMad](https://github.com/SherlockNoMad) [@yiming0416](https://github.com/yiming0416) |
| [transformers_backend](./transformers_backend/) | 8 GPU Integration Test | [@3outeille](https://github.com/3outeille) |
This is not properly linked to the actual tests. Please refer to how others are done.
same root config file.
"""
integration_tests_flavors = [
    OverrideDefinitions(
This is missing --model.name transformers_backend, so the CI is actually running llama3 from the ./tests/integration_tests/base_config.toml file:
https://github.com/pytorch/torchtitan/actions/runs/19500238760/job/55877797776?pr=2048#step:16:392
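For illustration, the corrected flavor entry might look roughly like the sketch below. The import path, the description/name arguments, and the single override list are assumptions about the test helper; the two flags are the ones mentioned in this review.

```python
# Sketch only: the import path and trailing arguments are assumptions about
# torchtitan's integration-test helper, not verified against this PR.
from tests.integration_tests import OverrideDefinitions

integration_tests_flavors = [
    OverrideDefinitions(
        [
            [
                # Select the HF backend instead of the default llama3 model ...
                "--model.name transformers_backend",
                # ... and point it at a concrete HF checkpoint.
                "--hf_transformers.model meta-llama/Llama-3.2-1B",
            ],
        ],
        "transformers backend debug run",
        "transformers_backend",
    ),
]
```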
[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = 2
Let's restore this to -1, and the others to 1, so it's consistent with the other debug tomls.
mixed_precision_param = "float32"  # force float32 for comparison
mixed_precision_reduce = "float32"
Can we remove these two fields, so the default mixed_precision_param (bf16) is used?
[model]
name = "transformers_backend"
flavor = "debugmodel"
I think it doesn't hurt to create two tomls: one with debugmodel and the c4_test dataset, the other with full and the c4 dataset.
Let's still name this pipeline.py (just convention, no real reason).
Context
Reference PR: huggingface#1
This PR enables:
- meta-llama/Llama-3.2-1B
- microsoft/phi-2
- Qwen/Qwen2.5-7B
- mistralai/Mistral-7B-v0.1
- ByteDance-Seed/Seed-Coder-8B-Instruct
- Qwen/Qwen3-4B-Instruct-2507
- arcee-ai/AFM-4.5B
- ibm-granite/granite-3b-code-base-2k
- baidu/ERNIE-4.5-0.3B-Base-PT
- kyutai/helium-1-preview-2b
- allenai/OLMo-7B-hf
- mistralai/Ministral-8B-Instruct-2410 (loss and grad_norm start very high)

Usage
- transformers==4.57.1
- Config: torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml
- Run: LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml ./run_train.sh --job.custom_config_module=torchtitan.experiments.transformers_backend.job_config --compile.enable

Testing methodology
- Baseline comparison: FSDP=2 vs FSDP=2 & <other //-ism> (see the log-comparison sketch after the layout below)
- Here is what test_hf_integration.py is going to do:

results/
|_ meta-llama
   |_ Llama-3.2-1B
      |_ debugmodel/
         |_ seed_checkpoint/
            |_ config.toml
            |_ seed.slurm
            |_ step-0/
               |_ ....
         |_ fsdp2_tp1_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
         |_ fsdp2_tp2_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp1_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
      |_ full/
         ...
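The per-configuration diff_baseline_vs_nd_parallelism.log suggests a metric-level comparison between the baseline run and the n-D parallelism run. Below is a hedged sketch of that kind of comparison; the log-line format and function names are assumptions, not the actual test_hf_integration.py implementation.

```python
import re
from pathlib import Path

# Assumed log-line shape: "... step: 10 ... loss: 7.5678 ... grad_norm: 1.2345 ..."
METRIC_RE = re.compile(
    r"step:\s*(\d+).*?loss:\s*([\d.eE+-]+).*?grad_norm:\s*([\d.eE+-]+)"
)


def parse_metrics(log_path: Path) -> dict[int, tuple[str, str]]:
    """Collect (loss, grad_norm) strings per step so the comparison stays bitwise."""
    metrics: dict[int, tuple[str, str]] = {}
    for line in log_path.read_text().splitlines():
        match = METRIC_RE.search(line)
        if match:
            step, loss, grad_norm = match.groups()
            metrics[int(step)] = (loss, grad_norm)
    return metrics


def diff_runs(baseline_log: Path, nd_parallelism_log: Path) -> None:
    """Report the first step where baseline and n-D parallelism stop matching."""
    baseline = parse_metrics(baseline_log)
    nd_parallelism = parse_metrics(nd_parallelism_log)
    for step in sorted(baseline.keys() & nd_parallelism.keys()):
        if baseline[step] != nd_parallelism[step]:
            print(f"step {step}: baseline={baseline[step]} nd={nd_parallelism[step]}")
            return
    print("loss and grad_norm match bitwise on all common steps")
```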
Further tasks

- build_optimizers_with_moe_load_balancing support for MoE
- FSDP=2 vs FSDP=2 + PP=2: the loss and grad_norm are not bitwise matching (but converging), while they do match with the torchtitan modeling (issue tracked in "Fix pp convergence to be bitwise" huggingface/torchtitan#4)
- import torch._dynamo.config; torch._dynamo.config.cache_size_limit = 128 to avoid graph recompilation when using torch.compile and activation checkpointing (see the sketch below)
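As a concrete form of the last task above, the workaround could sit near the trainer entry point; the exact placement is still open, so treat this as a sketch rather than the PR's final change.

```python
import torch._dynamo.config

# The combination of torch.compile and activation checkpointing can trigger many
# recompilations; raising the cache limit (value taken from the task list above)
# avoids hitting the default recompilation cap during training.
torch._dynamo.config.cache_size_limit = 128
```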