Commit 176498c (1 parent: bb308da)

[TorchComms + titan] Update README for torchComms repo (#1992)

We have already released TorchComms, so we want to update the README. There are also some issues when the PP degree is larger than 2, which we want to document as well.

File tree

1 file changed: +10 -2 lines changed


torchtitan/experiments/torchcomms/README.md

Lines changed: 10 additions & 2 deletions
````diff
@@ -4,14 +4,21 @@
 
 This folder provides a framework for composability testing with TorchComms and distributed training in TorchTitan. It enables flexible experimentation with distributed communication primitives and various parallelism strategies in PyTorch.
 
-> **TODO:** Additional documentation will be provided once TorchComms is publicly released.
+TorchComms repo: https://github.com/meta-pytorch/torchcomms
+
+Simple installation of TorchComms:
+```bash
+pip install --pre torch torchcomms --index-url https://download.pytorch.org/whl/nightly/cu128
+```
+
+If you want to compile torchcomms from source, please follow the instructions in the TorchComms repo.
 
 ### Quick Start
 
 The following command uses Llama 3 as an example:
 
 ```bash
-TEST_BACKEND=nccl TRAIN_FILE=torchtitan.experiments.torchcomms.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
+TEST_BACKEND=ncclx TRAIN_FILE=torchtitan.experiments.torchcomms.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
 ```
 
 ### Features
@@ -49,6 +56,7 @@ Locally tested with:
 ### Known Issues
 
 - **Memory Overhead** - TorchComms requires higher peak memory usage. As a workaround, we need to reduce `local_batch_size` to avoid out of memory error.
+- **Pipeline Parallelism** - Pipeline Parallelism is not supported yet when PP degree is larger than 2.
 
 ## Roadmap
 
````
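The Known Issues entry above suggests reducing `local_batch_size` to offset TorchComms' higher peak memory usage. A minimal sketch of what that override could look like in the training TOML config, assuming a `[training]` section as used in TorchTitan's config files (the exact section and default value here are assumptions, not taken from this commit):

```toml
# Sketch: lower the per-rank batch size in the chosen CONFIG_FILE
# (e.g. llama3_8b.toml) to reduce peak memory when running with TorchComms.
[training]
local_batch_size = 1  # assumed value; tune per available GPU memory
```

If the batch-size reduction changes the effective global batch size, gradient accumulation or a learning-rate adjustment may be needed to keep training behavior comparable; that trade-off is not addressed by this commit.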
