
Adding integration test for FSDP Memory Tracking and Estimation #426


Merged
merged 2 commits into gh/sanketpurandare/3/base from gh/sanketpurandare/3/head on Jun 25, 2024

Conversation

@sanketpurandare (Contributor) commented on Jun 25, 2024

Stack from ghstack (oldest at bottom):

Adds an integration test for `FSDPMemTracker` which will help keep `estimation.py` in sync with `train.py`.

`python test_runner.py test_outputs --test fsdp2_mem_tracker`

Integration test CI output (zoom out for better viewing):

cc: @gnadathur
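
For readers unfamiliar with the test runner, here is a rough sketch of what registering such a test could look like. The `OverrideDefinitions` shape and the `--memory_estimation.enabled` flag are assumptions for illustration only, not the actual `test_runner.py` code.

```
# Hypothetical sketch: register an integration test that runs the memory
# estimation path with the same overrides used for training, so estimation.py
# is exercised by CI alongside train.py. Names below are illustrative.
from dataclasses import dataclass, field

@dataclass
class OverrideDefinitions:
    """A named set of CLI overrides applied on top of the debug model config."""
    override_args: list = field(default_factory=list)
    test_descr: str = "default"

def build_test_list():
    return {
        "fsdp2_mem_tracker": OverrideDefinitions(
            override_args=[["--memory_estimation.enabled"]],  # hypothetical flag
            test_descr="FSDP2 memory tracking and estimation",
        ),
    }
```

The command above would then select this entry by name via `--test fsdp2_mem_tracker`.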

sanketpurandare added a commit that referenced this pull request Jun 25, 2024
@facebook-github-bot added the CLA Signed label on Jun 25, 2024
@sanketpurandare added the integration test label on Jun 25, 2024
…ation"


Adds an integration test for `FSDPMemTracker` which will help keep `estimation.py` in sync with `train.py`.

`python test_runner.py test_outputs  --test fsdp2_mem_tracker`

cc: gnadathur 

[ghstack-poisoned]
sanketpurandare added a commit that referenced this pull request Jun 25, 2024
@gnadathur (Contributor) left a comment

LGTM!

@sanketpurandare merged commit 79ddab2 into gh/sanketpurandare/3/base on Jun 25, 2024
6 checks passed
sanketpurandare added a commit that referenced this pull request Jun 25, 2024
@sanketpurandare deleted the gh/sanketpurandare/3/head branch on June 25, 2024 at 22:28
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Jul 11, 2024
* Set `record_shapes=True` for profiler

ghstack-source-id: 6f1ed49
Pull Request resolved: pytorch#419

* Improved `repeat_kv` eager perf

ghstack-source-id: 39e4849
Pull Request resolved: pytorch#418

* Adding FSDP Memory Tracking and Estimation

ghstack-source-id: c8ed20f
Pull Request resolved: pytorch#425

* Adding integration test for FSDP Memory Tracking and Estimation

ghstack-source-id: cc224db
Pull Request resolved: pytorch#426

* by default disable heavy memory profiling

ghstack-source-id: cad7b3c
Pull Request resolved: pytorch#430

* Add the option to turn on async-TP

ghstack-source-id: 0a03379
Pull Request resolved: pytorch#429

* Modifying memory estimation options and minor changes

ghstack-source-id: 5f09824
Pull Request resolved: pytorch#435

* add comment pointing to Sequence Parallel optimization example

ghstack-source-id: 6fa0dcd
Pull Request resolved: pytorch#438

* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To support the new UX of `float8_experimental` better, I also switched
the `fp8_linear` configuration to be a boolean on whether to swap the
linears or not. In the future we can add new options on how to configure
each linear (scaling type, scaling granularity, etc) - saving that for a
future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fdsp_fp8_allgather=True
// 4. compile, float8, fdsp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```

Reviewers:

Subscribers:

Tasks:

Tags:
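
As a rough illustration of the boolean-gated swap described above, the helper below replaces `nn.Linear` children only when the flag is set; the `to_float8` constructor is injected so the sketch stays library-agnostic (it is not the exact torchtitan or float8_experimental wiring).

```
import torch.nn as nn
from typing import Callable

def maybe_swap_linears(
    model: nn.Module,
    enable_fp8_linear: bool,
    to_float8: Callable[[nn.Linear], nn.Module],
) -> nn.Module:
    """Recursively replace nn.Linear children when the boolean flag is set."""
    if not enable_fp8_linear:
        return model  # keep plain nn.Linear everywhere
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            # build a float8 linear from the existing module (constructor injected)
            setattr(model, name, to_float8(child))
        else:
            maybe_swap_linears(child, enable_fp8_linear, to_float8)
    return model
```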

* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`

ghstack-source-id: 50b2d0c
Pull Request resolved: pytorch#444

* Reordered TP parallel plan to follow execution order

ghstack-source-id: b492495
Pull Request resolved: pytorch#445

* Made some stylistic changes to `apply_dp`

ghstack-source-id: fb78e9e
Pull Request resolved: pytorch#446

* Refactored activation checkpointing

ghstack-source-id: 785c7e4
Pull Request resolved: pytorch#447

* compiled RMSNorm

ghstack-source-id: c4efb81
Pull Request resolved: pytorch#442

* Renamed parallel styles for transformer block weights

ghstack-source-id: 5fb0bf3
Pull Request resolved: pytorch#448

* Added type annotations and more stylistic changes

ghstack-source-id: 1bd5b9d
Pull Request resolved: pytorch#449

---------

Co-authored-by: Andrew Gu <[email protected]>
Co-authored-by: Sanket Jayant Purandare <[email protected]>
Co-authored-by: Yifu Wang <[email protected]>
Co-authored-by: Vasiliy Kuznetsov <[email protected]>
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 13, 2024

* [Cleanup] Remove libuv from run_llama_train.sh

libuv is now enabled by default.

We can probably do without the educational blurb there, and don't need
the env var either since the default has landed.

ghstack-source-id: 68c8d2a
Pull Request resolved: pytorch#453

* [Cleanup] Organize run_llama_train.sh options

Just a little code motion, but it looks cleaner to me this way.

ghstack-source-id: 055fbd5
Pull Request resolved: pytorch#454

* [Cleanup] Split run_llama_train.sh and run_memory_estimation.sh

Make each script simpler to read

ghstack-source-id: ba3aa65
Pull Request resolved: pytorch#455

* [Cleanup] Remove unused TRAINER_DIR

This argument seems to be left over from older times; it is not used
anywhere in the codebase.

ghstack-source-id: abbcf82
Pull Request resolved: pytorch#456

* Add educational code pointers to top level README

ghstack-source-id: 522aa2f
Pull Request resolved: pytorch#457

* enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)

We have landed fp8 all-gather optimizations in float8_experimental:
pytorch-labs/float8_experimental#266

This PR proposes the torchtitan changes. It also includes fp8 in CI.
```
from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
# inside the training loop
model(input).sum().backward()
optim.step()
precompute_float8_dynamic_scale_for_fsdp(model)
```

FSDP2 fp8 all-gather tests are added to CI:
```
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
```

TP fp8 all-gather is locally tested. We will add it to CI after
uploading a new tokenizer with vocab size 2560 (divisible by 16):
```
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
```

precompute scales after optimizer.step
<img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM"
src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">

FSDP2 pre-all-gather does not have any small all-reduces
<img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM"
src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">

TODO
* upload tokenizer with vocab size 2560 to enable CI on TP fp8
all-gather
* torch.compile complains about fp8
* add delayed scaling and brainstorm about best config option to express
fp8
* compare perf between delayed scaling and dynamic scaling
https://github.com/pytorch-labs/float8_experimental/pull/312/files

* import float8_experimental only when fp8 is enabled and install it in CI (pytorch#464)

make sure to only import float8_experimental when fp8 is enabled

for the 4-GPU CI, make sure we can import float8_experimental correctly

`python -m pip install
git+https://github.com/pytorch-labs/float8_experimental.git`
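
A minimal sketch of the guarded import described above, assuming the dependency lives in a `float8_linear_utils` module (treat the exact module path and any helper names as assumptions):

```
def maybe_import_float8(enable_fp8_linear: bool):
    """Only pull in the float8 dependency when the feature is enabled."""
    if not enable_fp8_linear:
        return None
    try:
        # imported lazily so environments without the package still run bf16
        from float8_experimental import float8_linear_utils
    except ImportError as exc:
        raise ImportError(
            "enable_fp8_linear requires float8_experimental; install it with "
            "pip install git+https://github.com/pytorch-labs/float8_experimental.git"
        ) from exc
    return float8_linear_utils
```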

* skip fp8 CI on non-H100 GPUs (pytorch#465)

skip fp8 tests on non-H100 GPUs by checking
`torch.cuda.get_device_capability() >= (9, 0)`

this makes 4 GPU CI healthy again
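
A sketch of that guard, shown here with pytest for illustration (torchtitan's integration tests may wire the check differently):

```
import pytest
import torch

def is_sm90_or_newer() -> bool:
    # float8 kernels need SM 9.0 (H100-class) or newer
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

requires_h100 = pytest.mark.skipif(
    not is_sm90_or_newer(),
    reason="float8 tests require H100-class (SM 9.0+) GPUs",
)
```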

* clean up float8 configs in torchtitan (pytorch#466)

Summary:

1. standardizes on `float8` instead of `fp8` for config names
2. removes usage of non-public objects such as `Float8Linear`

Test Plan:

```
with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Add support of DDP and experimental CompiledAutograd

Summary:
Address the comments in pytorch#319 and resubmit the PR to fit the current code base.

Test Plan:
```
CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600   --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000
```

ghstack-source-id: 81dc85d
Pull Request resolved: pytorch#432

* add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)

Fixed my bug in float8_experimental. Now we can torch.compile
transformer blocks with FSDP float8 all-gather:
pytorch-labs/float8_experimental#321

local test: `CONFIG_FILE="./train_configs/debug_model.toml"
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp --training.compile`

Profiler traces: I can see the compiled region in the CPU thread and the float8
matmul `sm90_xmma_gemm_e4m3bf16...` in the CUDA stream
<img width="1468" alt="Screenshot 2024-07-18 at 4 22 17 PM"
src="https://github.com/user-attachments/assets/0cf58dee-aae1-4582-a3f1-b8aa48b45129">

* [float8] keep model.output as `nn.Linear` (high precision, not fp8) (pytorch#469)

**keep model.output as nn.Linear**: it's a common practice to NOT apply
fp8 to the final output layer
* specify `skip_fqn_list` in swapping
* when applying TP to model.output, use plain `ColwiseParallel` instead
of `Float8ColwiseParallel`

Credit to @awgu: we do not need the tokenizer vocab size to be divisible by
16 (pytorch#461)

1D TP + float8 all-gather, eager mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.data_parallel_degree 1 --training.tensor_parallel_degree 4`

1D TP + float8 all-gather, compile mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.data_parallel_degree 1 --training.tensor_parallel_degree 4
--training.compile`

2D FSDP2 + TP + float8 all-gather, eager mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp
--training.tensor_parallel_degree 2`

2D FSDP2 + TP + float8 all-gather, compile mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp
--training.tensor_parallel_degree 2 --training.compile`

1D TP + float8 all-gather trace: see float8 and all-gather in the trace
<img width="1611" alt="Screenshot 2024-07-19 at 1 16 59 PM"
src="https://github.com/user-attachments/assets/9a95dfd9-40e0-4133-b2bb-e22ddf5b8472">

2D + float8 all-gather trace: see float8 and FSDP collectives and TP
collectives
<img width="1038" alt="Screenshot 2024-07-19 at 1 29 59 PM"
src="https://github.com/user-attachments/assets/6a34bcaa-bcae-402b-9994-cc892554fec7">

* remove CI for FSDP2 + fp8 all-gather (pytorch#470)

per discussion from
pytorch#469 (comment)

we are planning BC breaking changes in float8_experimental. remove CI
for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we
can discuss bringing it back

* dynamically update torch.compile cache config to ensure async tp support, enhance async tp UX (pytorch#471)

This PR adds some enhancements for supporting async tp:

1 - if async TP is active, auto-updates the torch._dynamo cache limit to
10K. If this is not updated, async TP will not be activated on larger
models, as compilation will quietly stop due to 'cache limit reached'
with no info for the user.
This config update is logged.

2 - if async TP is enabled, verifies that torch.compile is set to true
for this job config. If not, it warns and then activates torch.compile
to ensure the user gets working async TP. (see WARNING in below screenshot)

<img width="1345" alt="Screenshot 2024-07-20 at 4 33 04 PM"
src="https://github.com/user-attachments/assets/26e5a48e-4bb8-4f33-b1b5-8939c1517c1d">

3 - Updates the 'Applied Tensor Parallel' log message for the model to 'Applied
Async Tensor Parallel' when async TP is active, to make it clear in the
logs which TP is active. (see above screenshot)
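
A compact sketch of items 1 and 2 above; the `job_config` field names are assumptions for illustration, but `torch._dynamo.config.cache_size_limit` is the knob being raised:

```
import logging
import torch._dynamo

logger = logging.getLogger(__name__)

def prepare_async_tp(job_config):
    if not job_config.experimental.enable_async_tensor_parallel:  # assumed field name
        return
    # 1) raise the dynamo cache limit so larger models don't silently stop
    #    compiling with "cache limit reached"
    torch._dynamo.config.cache_size_limit = 10_000
    logger.info("Updated torch._dynamo.config.cache_size_limit to 10000 for async TP")
    # 2) async TP requires torch.compile; warn and force it on if it was off
    if not job_config.training.compile:  # assumed field name
        logger.warning("Async TP requires torch.compile; enabling compile for this job")
        job_config.training.compile = True
```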

* Fix 8gpu PP failure due to 2D DCP disablement

DCP recently added safeties to avoid using it for 2D/3D since strided
sharding (a feature needed for safe 2D/3D resharding) is not ready yet.

PP uses DCP to load a seed checkpoint. Disabling the safety mechanism
is enough to make 3D/PP still work (for the case where we train from the
beginning or do not re-shard).

(Resharding refers to saving a checkpoint from one world
size/parallelism config and loading/resuming under a different one).

ghstack-source-id: c069d21
Pull Request resolved: pytorch#460

* update float8 integration after UX changes (pytorch#484)

Summary:

float8_experimental landed various BC-breaking UX changes last week.
This PR updates torchtitan to work with the version of
float8_experimental after
pytorch-labs/float8_experimental#332 and
pytorch-labs/float8_experimental#337

Test Plan:

```
with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Re-enable FSDP2 Mem Tracker integration tests

ghstack-source-id: 8344603
Pull Request resolved: pytorch#485

* Used `partial` instead of global vars for LR scheduling

ghstack-source-id: 12c4418
Pull Request resolved: pytorch#487
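
A minimal sketch of the idea of binding schedule hyperparameters with `functools.partial` instead of module-level globals; the warmup/decay shape below is illustrative, not torchtitan's exact schedule:

```
from functools import partial
from torch.optim.lr_scheduler import LambdaLR

def warmup_then_linear_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    # linear warmup, then linear decay to zero (illustrative shape)
    if step < warmup_steps:
        return (step + 1) / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)

def build_lr_scheduler(optimizer, warmup_steps: int, total_steps: int) -> LambdaLR:
    lr_lambda = partial(
        warmup_then_linear_decay, warmup_steps=warmup_steps, total_steps=total_steps
    )
    return LambdaLR(optimizer, lr_lambda=lr_lambda)
```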

* [EZ] Add logs for some basic training params so that we can verify in… (pytorch#491)

As titled: while testing the 405B model, I found that we need the logs
for some basic training params, so I added some here. Tested
locally, and the logging is shown as in the screenshot:


<img width="900" alt="image"
src="https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786">

* make float8 scaling type configurable (pytorch#489)

Summary:

Adds config options to configure float8 scaling type for input, weight,
grad_output.

Performance is not ideal yet, but that's because we have not optimized
it.

Test Plan:

```
// repeat for input, weight, grad_out
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:
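
A sketch of the configuration surface described above; the field names are illustrative assumptions, the point being that input, weight, and grad_output can each pick a scaling type independently:

```
from dataclasses import dataclass

@dataclass
class Float8ScalingConfig:
    # each tensor role can independently use "dynamic" or "delayed" scaling
    scaling_type_input: str = "dynamic"
    scaling_type_weight: str = "dynamic"
    scaling_type_grad_output: str = "dynamic"
```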

* [PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)

This was approved in pytorch#490 but merged into the wrong branch;
merging this into main.

* move float8 callsites to torchao.float8 (pytorch#492)

Summary:

The `float8_experimental` repository moved to `torchao.float8` in
pytorch/ao#551

This PR updates `torchtitan` to use float8 from the new location.

Test Plan:

```
with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:
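
A sketch of the callsite change described above: import float8 from its new home in torchao rather than from float8_experimental. The exact symbol name may differ by torchao version, so treat it as an assumption:

```
try:
    from torchao.float8 import convert_to_float8_training  # new location
except ImportError:
    convert_to_float8_training = None  # float8 disabled or torchao not installed
```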

* [BE][1/n] simplify train.py

ghstack-source-id: 3879e76
Pull Request resolved: pytorch#494

* [BE][2/n] use proper method signatures in parallelize_llama

ghstack-source-id: 17a1ee9
Pull Request resolved: pytorch#495

* [BE][3/n] wrap fp8 logic using Float8Handler

ghstack-source-id: e94c7f6
Pull Request resolved: pytorch#496

* Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)

With the official launch of the LLaMa 3.1 model, we want to add the config
to TorchTitan. Of course, there is more work to be done, but we want to
take an incremental approach, so more PRs will be needed.

For now, we tried it on 128 GPUs with the current config (TP=8, FSDP=16). The
perf numbers are wps: 109, MFU: 29%.

Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
<img width="1037" alt="image"
src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">


Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).

![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)

* [TP] Infer local n_heads instead of ad-hoc model changes

ghstack-source-id: 587e3d6
Pull Request resolved: pytorch#498
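
A small sketch of the inference described above: once the attention projections are sharded over the TP mesh, the per-rank head counts follow from the global counts and the TP degree, so no ad-hoc edits to the model are needed (function name is illustrative):

```
def infer_local_heads(n_heads: int, n_kv_heads: int, tp_degree: int) -> tuple[int, int]:
    # heads must divide evenly across the tensor-parallel ranks
    assert n_heads % tp_degree == 0 and n_kv_heads % tp_degree == 0
    return n_heads // tp_degree, n_kv_heads // tp_degree
```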

* some compile-related updates

ghstack-source-id: 63af802
Pull Request resolved: pytorch#443

* [EZ][405B] Use scientific notation for 405B model lr (pytorch#504)

As title, use `8e-5` rather than `0.8e-4`.

* [BE][4/n] split pipeline_llama into a separate file

ghstack-source-id: 5ebb4ad
Pull Request resolved: pytorch#499

* [fix] float8 should be applied on all model_parts

ghstack-source-id: 52ed683
Pull Request resolved: pytorch#500

* Add warning to compile rmsnorm (pytorch#505)

As titled, add a warning for compiling RMSNorm, as it's not fully ready yet;
see issue pytorch#497.

We can remove this warning once we fix the issue.

* add float8 to README (pytorch#509)

Add a float8 link to the README so we can redirect people from the dev-discuss
post to the torchtitan repo.


README looks like this after rendering
<img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM"
src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">

float8.md looks like this
<img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM"
src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">

I tried the command locally and traces are looking good
<img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM"
src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">

* address TODOs as 2D recompiles is fixed

ghstack-source-id: 2927f0a
Pull Request resolved: pytorch#508

* [BE][5/n] simplify pp vs. non-pp set up

ghstack-source-id: 003bfbf
Pull Request resolved: pytorch#510

* [BE][6/n] replace large c4_mini datasets by c4_test with the first 2K entries

ghstack-source-id: 319f496
Pull Request resolved: pytorch#512

* Create composability.md (pytorch#511)

Explain the rationale and challenges behind certain changes we made to
the llama model to support 3D parallelism.

---------

Co-authored-by: tianyu-l <[email protected]>

* depend on torchdata 0.8.0 instead of nightly

ghstack-source-id: 1965d31
Pull Request resolved: pytorch#513

---------

Co-authored-by: Andrew Gu <[email protected]>
Co-authored-by: Sanket Jayant Purandare <[email protected]>
Co-authored-by: Yifu Wang <[email protected]>
Co-authored-by: Vasiliy Kuznetsov <[email protected]>
Co-authored-by: Will Constable <[email protected]>
Co-authored-by: Wei (Will) Feng <[email protected]>
Co-authored-by: Chien-Chin Huang <[email protected]>
Co-authored-by: Less Wright <[email protected]>
Co-authored-by: Sanket Jayant Purandare <[email protected]>
Co-authored-by: Hugo <[email protected]>
Co-authored-by: Howard Huang <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Wanchao <[email protected]>
Co-authored-by: Will Constable <[email protected]>
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 13, 2024
* add support for torchbench

tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
* Set `record_shapes=True` for profiler

ghstack-source-id: 6f1ed49
Pull Request resolved: pytorch#419

* Improved `repeat_kv` eager perf

ghstack-source-id: 39e4849
Pull Request resolved: pytorch#418

* Adding FSDP Memory Tracking and Estimation

ghstack-source-id: c8ed20f
Pull Request resolved: pytorch#425

* Adding integration test for FSDP Memory Tracking and Estimation

ghstack-source-id: cc224db
Pull Request resolved: pytorch#426

* by default disable heavy memory profiling

ghstack-source-id: cad7b3c
Pull Request resolved: pytorch#430

* Add the option to turn on async-TP

ghstack-source-id: 0a03379
Pull Request resolved: pytorch#429

* Modifying memory estimation options and minor changes

ghstack-source-id: 5f09824
Pull Request resolved: pytorch#435

* add comment pointing to Sequence Parallel optimization example

ghstack-source-id: 6fa0dcd
Pull Request resolved: pytorch#438

* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To support the new UX of `float8_experimental` better, I also switched
the `fp8_linear` configuration to be a boolean on whether to swap the
linears or not. In the future we can add new options on how to configure
each linear (scaling type, scaling granularity, etc) - saving that for a
future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fdsp_fp8_allgather=True
// 4. compile, float8, fdsp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`

ghstack-source-id: 50b2d0c
Pull Request resolved: pytorch#444

* Reordered TP parallel plan to follow execution order

ghstack-source-id: b492495
Pull Request resolved: pytorch#445

* Made some stylistic changes to `apply_dp`

ghstack-source-id: fb78e9e
Pull Request resolved: pytorch#446

* Refactored activation checkpointing

ghstack-source-id: 785c7e4
Pull Request resolved: pytorch#447

* compiled RMSNorm

ghstack-source-id: c4efb81
Pull Request resolved: pytorch#442

* Renamed parallel styles for transformer block weights

ghstack-source-id: 5fb0bf3
Pull Request resolved: pytorch#448

* Added type annotations and more stylistic changes

ghstack-source-id: 1bd5b9d
Pull Request resolved: pytorch#449

* [Cleanup] Remove libuv from run_llama_train.sh

libuv is now enabled by default.

we can proably do without the educational blurb there, and don't need
the env either since the default has landed.

ghstack-source-id: 68c8d2a
Pull Request resolved: pytorch#453

* [Cleanup] Organize run_llama_train.sh options

Just a little code motion but it looks cleaner to me this way

ghstack-source-id: 055fbd5
Pull Request resolved: pytorch#454

* [Cleanup] Split run_llama_train.sh and run_memory_estimation.sh

Make each script simpler to read

ghstack-source-id: ba3aa65
Pull Request resolved: pytorch#455

* [Cleanup] Remove unused TRAINER_DIR

This argument seems to be left over from older times- it is not used
anywhere in the codebase.

ghstack-source-id: abbcf82
Pull Request resolved: pytorch#456

* Add educational code pointers to top level README

ghstack-source-id: 522aa2f
Pull Request resolved: pytorch#457

* enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)

we have landed fp8 all-gather optimizations in float8_experimental
pytorch-labs/float8_experimental#266

this PR proposes torchtitan changes. also include fp8 in CI
```
from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
# inside the training loop
model(input).sum().backward()
optim.step()
precompute_float8_dynamic_scale_for_fsdp(model)
```

FSDP2 fp8 all-gather are added to CI
```
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
```

TP fp8 all-gather are locally tested. will add them to CI after
uploading a new tokenizer with vacab size 2560 (divisible by 16)
```
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
```

precompute scales after optimizer.step
<img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM"
src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">

FSDP2 pre-all-gather do not have any small all-reduces
<img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM"
src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">

TODO
* upload tokenizer with vacab size 2560 to enable CI on TP fp8
all-gather
* torch.compile complains about fp8
* add delayed scaling and brainstorm about best config option to express
fp8
* compare perf between delayed scaling and dynamic scaling
https://github.com/pytorch-labs/float8_experimental/pull/312/files

* import float8_experimental only when fp8 is enabled and install it in CI (pytorch#464)

make sure to only import float8_experimental when fp8 is enabled

for 4 gpu CI, make sure we can import float8_experimental correctly in
CI

`python -m pip install
git+https://github.com/pytorch-labs/float8_experimental.git`

* skip fp8 CI on non-H100 GPUs (pytorch#465)

skip fp8 tests on non-H100 GPUs by checking
`torch.cuda.get_device_capability() >= (9, 0)`

this makes 4 GPU CI healthy again

* clean up float8 configs in torchtitan (pytorch#466)

Summary:

1. standardizes on `float8` instead of `fp8` for config names
2. removes usage of non-public objects such as `Float8Linear`

Test Plan:

```
with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Add support of DDP and experimental CompiledAutograd

Summary:
Address the comments in pytorch#319 and resubmit the PR to fit the current code base.

Test Plan:
```
CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600   --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000
```

ghstack-source-id: 81dc85d
Pull Request resolved: pytorch#432

* add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)

fixed my bug in float8_experimental. now we can torch.compile
transfromer blocks with FSDP float8 all-gather
pytorch-labs/float8_experimental#321

local test: `CONFIG_FILE="./train_configs/debug_model.toml"
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp --training.compile`

profiler traces: I can see compiled region in cpu thread and float8
malmul `sm90_xmma_gemm_e4m3bf16...` in cuda stream
<img width="1468" alt="Screenshot 2024-07-18 at 4 22 17 PM"
src="https://github.com/user-attachments/assets/0cf58dee-aae1-4582-a3f1-b8aa48b45129">

* [float8] keep model.output as `nn.Linear` (high precision, not fp8) (pytorch#469)

**keep model.output as nn.Linear**: it's a common practice to NOT apply
fp8 on final output layer
* specify `skip_fqn_list` in swapping
* when applying TP to model.output, use plain `ColwiseParallel` instead
of `Float8ColwiseParallel`

credit to @awgu, we do not need tokentizer vacab size to be divisible by
16 pytorch#461

1D TP + float8 all-gather, eager mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.data_parallel_degree 1 --training.tensor_parallel_degree 4`

1D TP + float8 all-gather, compile mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.data_parallel_degree 1 --training.tensor_parallel_degree 4
--training.compile`

2D FSDP2 + TP + float8 all-gather, eager mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp
--training.tensor_parallel_degree 2`

2D FSDP2 + TP + float8 all-gather, eager mode:
`CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp
--training.tensor_parallel_degree 2 --training.compile`

1D TP + float8 all-gather trace: see float8 and all-gather in the trace
<img width="1611" alt="Screenshot 2024-07-19 at 1 16 59 PM"
src="https://github.com/user-attachments/assets/9a95dfd9-40e0-4133-b2bb-e22ddf5b8472">

2D + float8 all-gather trace: see float8 and FSDP collectives and TP
collectives
<img width="1038" alt="Screenshot 2024-07-19 at 1 29 59 PM"
src="https://github.com/user-attachments/assets/6a34bcaa-bcae-402b-9994-cc892554fec7">

* remove CI for FSDP2 + fp8 all-gather (pytorch#470)

per discussion from
pytorch#469 (comment)

we are planning BC breaking changes in float8_experimental. remove CI
for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we
can discuss bringing it back

* dynamically update torch.compile cache config to ensure async tp support, enhance async tp UX (pytorch#471)

This PR adds some enhancements for supporting async tp:

1 - if async tp is active, auto updates the torch.dynamo cache limit to
10K. If this is not updated, async tp will not be activated on larger
models as it will quietly stop compilation due to 'cache limit reached'
with no info for the user.
This config update is logged. 

2 - if async tp is enabled, verifies that torch.compile is set to true
for this job config. If not, it warns and then activates torch.compile
to ensure user gets working async tp. (see WARNING in below screenshot)

<img width="1345" alt="Screenshot 2024-07-20 at 4 33 04 PM"
src="https://github.com/user-attachments/assets/26e5a48e-4bb8-4f33-b1b5-8939c1517c1d">

3 - Updates the 'Applied Tensor Parallel' to the model to be 'Applied
Async Tensor Parallel' when async tp is active to make it clear in the
logs which TP is active. (see above screenshot)

* Fix 8gpu PP failure due to 2D DCP disablement

DCP recently added safeties to avoid using it for 2D/3D since strided
sharding (a feature needed for safe 2D/3D resharding) is not ready yet.

PP uses DCP to load a seed checkpoint.  Disabling the safety mechanism
is enough to make 3D/PP still work (for the case where we train from the
beginning or do not re-shard.

(Resharding refers to saving a checkpoint from one world
size/parallelism config and loading/resuming under a different one).

ghstack-source-id: c069d21
Pull Request resolved: pytorch#460

* update float8 integration after UX changes (pytorch#484)

Summary:

float8_experimental landed various BC-breaking UX changes last week.
This PR updates torchtitan to work with the version of
float8_experimental after
pytorch-labs/float8_experimental#332 and
pytorch-labs/float8_experimental#337

Test Plan:

```
with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Re-enable FSDP2 Mem Tracker integration tests

ghstack-source-id: 8344603
Pull Request resolved: pytorch#485

* Used `partial` instead of global vars for LR scheduling

ghstack-source-id: 12c4418
Pull Request resolved: pytorch#487
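
For illustration, a minimal sketch of the pattern: bind the schedule's
configuration with `functools.partial` instead of reading module-level
globals. The warmup/decay function below is simplified, not torchtitan's
exact schedule.

```
from functools import partial

import torch
from torch.optim.lr_scheduler import LambdaLR


def warmup_then_linear_decay(warmup_steps: int, total_steps: int, step: int) -> float:
    # LR multiplier: linear warmup, then linear decay to zero.
    if step < warmup_steps:
        return (step + 1) / max(warmup_steps, 1)
    return max(total_steps - step, 0) / max(total_steps - warmup_steps, 1)


optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=8e-5)
# Bind the config here; LambdaLR only ever passes in the current step.
lr_lambda = partial(warmup_then_linear_decay, 600, 3000)
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
```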

* [EZ] Add logs for some basic training params so that we can verify in… (pytorch#491)

As title, while testing on the 405B model, I found that we need logs for some
basic training params, so I added some here. Tested locally; the logging is
shown in the screenshot:


<img width="900" alt="image"
src="https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786">

* make float8 scaling type configurable (pytorch#489)

Summary:

Adds config options to configure float8 scaling type for input, weight,
grad_output.

Performance is not ideal yet, but that's because we have not optimized
it.

Test Plan:

```
// repeat for input, weight, grad_out
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:
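
A hedged sketch of how the three scaling-type options could be exposed on the
command line; the flag names mirror the test command above, but the parsing
code itself is illustrative rather than torchtitan's actual config system.

```
import argparse

parser = argparse.ArgumentParser()
for tensor_name in ("input", "weight", "grad_output"):
    parser.add_argument(
        f"--training.float8_scaling_type_{tensor_name}",
        type=str,
        choices=["dynamic", "delayed"],
        default="dynamic",
        help=f"float8 scaling type for {tensor_name}",
    )

# e.g. the test plan's `--training.float8_scaling_type_weight delayed`
args = parser.parse_args(["--training.float8_scaling_type_weight", "delayed"])
print(getattr(args, "training.float8_scaling_type_weight"))  # "delayed"
```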

* [PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)

This was approved in pytorch#490 but merged into the wrong branch; merging it
into main here.

* move float8 callsites to torchao.float8 (pytorch#492)

Summary:

The `float8_experimental` repository moved to `torchao.float8` in
pytorch/ao#551

This PR updates `torchtitan` to use float8 from the new location.

Test Plan:

```
with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
```

Reviewers:

Subscribers:

Tasks:

Tags:
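
In practice the change is mostly a matter of importing the float8 helpers from
their new home; a sketch (the helper name is assumed to exist in the installed
torchao version):

```
import torch

# Previously: from float8_experimental import convert_to_float8_training
from torchao.float8 import convert_to_float8_training

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64))
# Swap eligible nn.Linear modules for their float8 training counterparts.
model = convert_to_float8_training(model)
```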

* [BE][1/n] simplify train.py

ghstack-source-id: 3879e76
Pull Request resolved: pytorch#494

* [BE][2/n] use proper method signatures in parallelize_llama

ghstack-source-id: 17a1ee9
Pull Request resolved: pytorch#495

* [BE][3/n] wrap fp8 logic using Float8Handler

ghstack-source-id: e94c7f6
Pull Request resolved: pytorch#496

* Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)

With the official launch of the LLaMa 3.1 model, we want to add its config to
TorchTitan. There is more work to be done, but we want to proceed
incrementally, so more PRs will follow.

For now, we tried it on 128 GPUs with the current config (TP=8, FSDP=16). The
perf numbers are wps: 109, MFU: 29%.

Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
<img width="1037" alt="image"
src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">


Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).

![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)
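
As a sanity check, a back-of-the-envelope MFU estimate from the numbers above,
assuming wps is tokens/sec per GPU, an H100 bf16 dense peak of ~989 TFLOPS,
and counting only the 6*N parameter FLOPs per token (which slightly
undercounts relative to formulas that also include attention FLOPs):

```
params = 405e9        # LLaMa 3.1 405B parameter count
wps_per_gpu = 109     # tokens/sec per GPU reported above
peak_flops = 989e12   # assumed H100 bf16 dense peak, FLOPs/sec

achieved_flops = 6 * params * wps_per_gpu   # ~2.65e14 FLOPs/sec per GPU
print(f"approximate MFU: {achieved_flops / peak_flops:.1%}")  # ~26.8%, near the reported 29%
```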

* [TP] Infer local n_heads instead of ad-hoc model changes

ghstack-source-id: 587e3d6
Pull Request resolved: pytorch#498
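
The idea in sketch form (function and argument names are illustrative): derive
the per-rank head counts from the TP degree at runtime rather than editing the
model args by hand.

```
def infer_local_heads(n_heads: int, n_kv_heads: int, tp_degree: int) -> tuple[int, int]:
    # Attention heads are sharded across TP ranks, so each rank sees
    # n_heads / tp_degree query heads and n_kv_heads / tp_degree KV heads.
    assert n_heads % tp_degree == 0 and n_kv_heads % tp_degree == 0
    return n_heads // tp_degree, n_kv_heads // tp_degree


print(infer_local_heads(n_heads=32, n_kv_heads=8, tp_degree=4))  # (8, 2)
```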

* some compile-related updates

ghstack-source-id: 63af802
Pull Request resolved: pytorch#443

* [EZ][405B] Use scientific notation for 405B model lr (pytorch#504)

As title, use `8e-5` rather than `0.8e-4`.

* [BE][4/n] split pipeline_llama into a separate file

ghstack-source-id: 5ebb4ad
Pull Request resolved: pytorch#499

* [fix] float8 should be applied on all model_parts

ghstack-source-id: 52ed683
Pull Request resolved: pytorch#500
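
Under pipeline parallelism the model is split into several `model_parts`, so
the conversion must run on each of them; a sketch using the Float8Handler
wrapper introduced above (treat the method name as an assumption):

```
def apply_float8_to_all_parts(float8_handler, model_parts) -> None:
    # Apply the float8 module swap to every pipeline stage, not only the
    # first one (the bug this commit fixes).
    for model_part in model_parts:
        float8_handler.convert_to_float8_training(model_part)
```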

* Add warning to compile rmsnorm (pytorch#505)

As titled, add a warning when compiling RMSNorm, as it's not fully ready yet
(see issue pytorch#497).

We can remove this warning once the issue is fixed.
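
A minimal sketch of the guard (the norm-type string and config plumbing are
assumptions, not the actual torchtitan code):

```
import logging

logger = logging.getLogger(__name__)


def warn_on_compiled_rmsnorm(norm_type: str) -> None:
    # Compiled/fused RMSNorm is not fully ready yet (see issue pytorch#497),
    # so warn instead of failing silently later.
    if norm_type.lower() == "compiled_rmsnorm":
        logger.warning(
            "compiled_rmsnorm is experimental (issue #497); "
            "use it with caution or fall back to rmsnorm"
        )
```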

* add float8 to README (pytorch#509)

Add a float8 link in the README so we can redirect people from the dev-discuss
post to the torchtitan repo.


README looks like this after rendering
<img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM"
src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">

float8.md looks like this
<img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM"
src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">

I tried the command locally and traces are looking good
<img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM"
src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">

* address TODOs as 2D recompiles is fixed

ghstack-source-id: 2927f0a
Pull Request resolved: pytorch#508

* [BE][5/n] simplify pp vs. non-pp set up

ghstack-source-id: 003bfbf
Pull Request resolved: pytorch#510

* [BE][6/n] replace the large c4_mini dataset with c4_test (the first 2K entries)

ghstack-source-id: 319f496
Pull Request resolved: pytorch#512

* Create composability.md (pytorch#511)

Explain the rationale and challenges behind certain changes we made to the
llama model to support 3D parallelism.

---------

Co-authored-by: tianyu-l <[email protected]>

* depend on torchdata 0.8.0 instead of nightly

ghstack-source-id: 1965d31
Pull Request resolved: pytorch#513

---------

Co-authored-by: Andrew Gu <[email protected]>
Co-authored-by: Sanket Jayant Purandare <[email protected]>
Co-authored-by: Yifu Wang <[email protected]>
Co-authored-by: Vasiliy Kuznetsov <[email protected]>
Co-authored-by: Will Constable <[email protected]>
Co-authored-by: Wei (Will) Feng <[email protected]>
Co-authored-by: Chien-Chin Huang <[email protected]>
Co-authored-by: Less Wright <[email protected]>
Co-authored-by: Sanket Jayant Purandare <[email protected]>
Co-authored-by: Hugo <[email protected]>
Co-authored-by: Howard Huang <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Wanchao <[email protected]>
Co-authored-by: Will Constable <[email protected]>
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
Labels
CLA Signed This label is managed by the Meta Open Source bot. integration test Adding integration tests
4 participants