# Domino Example

## Install Dependency Libraries
```
pip install -r requirements.txt
```

## Prepare the Dataset
Follow the instructions from [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.

## Execute Domino Training

To start training, adjust the following parameters in the script as needed:

- **GPUS_PER_NODE**: Number of GPUs per node.
- **CHECKPOINT_PATH**: Path to the checkpoint, if applicable.
- **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
- **--micro-batch-size**: Batch size per GPU.

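For reference, the configuration section near the top of a pretraining script might look like the sketch below. All values and paths here are illustrative placeholders, not the repository's defaults:

```shell
# Illustrative settings inside a pretrain script; adjust paths for your setup.
GPUS_PER_NODE=4
CHECKPOINT_PATH=checkpoints/gpt3_2.7b
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document

# Passed through to the training launch arguments:
#   --micro-batch-size 4
```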
### Available Models and Scripts

| Model | Script |
|------------|--------------------------|
| GPT-3 2.7B | `pretrain_gpt3_2.7b.sh` |
| GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
| LLaMA 7B | `pretrain_llama_7b.sh` |
| LLaMA 13B | `pretrain_llama_13b.sh` |

### Example

To train the GPT-3 2.7B model, run the following command:

```bash
bash pretrain_gpt3_2.7b.sh
```

The output should look like this:

```
training ...
iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152
iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988
iteration: 3 | loss: 11.323 | iteration time (ms): 1385.9455585479736
iteration: 4 | loss: 11.310 | iteration time (ms): 1475.5175113677979
iteration: 5 | loss: 11.306 | iteration time (ms): 1395.7207202911377
iteration: 6 | loss: 11.315 | iteration time (ms): 1392.2104835510254
iteration: 7 | loss: 11.314 | iteration time (ms): 1402.6703834533691
iteration: 8 | loss: 11.309 | iteration time (ms): 1450.613260269165
iteration: 9 | loss: 11.305 | iteration time (ms): 1473.1688499450684
iteration: 10 | loss: 11.320 | iteration time (ms): 1398.4534740447998
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73015 exits successfully.
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73017 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73014 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73016 exits successfully.
```
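To compare runs, you can average the per-iteration times directly from a saved log. This is a hypothetical convenience command, assuming the training output was redirected to a file named `train.log`:

```shell
# Average the "iteration time (ms)" values from a saved training log.
# train.log is an assumed filename, not one the scripts create by default.
awk '/iteration time \(ms\)/ {sum += $NF; n++} END {if (n) printf "avg iteration time (ms): %.1f\n", sum/n}' train.log
```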

## Advanced Usage
For better performance, you can compile PyTorch and Apex from source.

### Compile PyTorch from Source
Compiling PyTorch from source can enable JIT script support.
```
git clone -b v2.1.0 https://github.com/pytorch/pytorch.git
cd pytorch
git submodule sync
git submodule update --init --recursive
conda install cmake ninja
pip install -r requirements.txt
conda install intel::mkl-static intel::mkl-include
conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
cd ..

# Build torchvision
git clone https://github.com/pytorch/vision.git
cd vision
python setup.py develop
```
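After both builds finish, a quick sanity check confirms that the locally built `torch` is importable. This is an optional check, not part of the build itself; it prints a diagnostic instead of failing if the import does not work:

```shell
# Optional sanity check: confirm the locally built torch imports and report its CUDA version.
python - <<'EOF'
try:
    import torch
    print("torch", torch.__version__, "cuda:", torch.version.cuda)
except ImportError as e:
    print("torch not importable:", e)
EOF
```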

### Build Apex
```
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --config-settings "--build-option=--fast_layer_norm" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
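As with PyTorch, you can verify the Apex install afterwards. The module names below (`amp_C`, `fused_layer_norm_cuda`) are the CUDA extensions that `--cuda_ext` and `--fast_layer_norm` are expected to build; this guarded check only reports, it does not fail the shell:

```shell
# Optional check that the Apex CUDA extensions were built and are importable.
python - <<'EOF'
try:
    import amp_C, fused_layer_norm_cuda  # built by --cuda_ext / --fast_layer_norm
    print("apex CUDA extensions OK")
except ImportError as e:
    print("apex extensions missing:", e)
EOF
```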