
Commit 3dd1a71

zhangsmallshark, chengming-zhang, shenzheyu, root, and tjruwase authored
DeepSpeed-Domino (#929)
* add domino
* use transformer from deepspeed
* clean args
* mega opt
* add opt & timer
* add opt
* fix loss
* folder name
* Change argument in pretrain script
* Add readme for domino
* Update readme for domino
* Fixing usage issues
* update dataset
* megatron dependencies
* path
* Update README.md
* remove imports
* update import
* Update README.md
* Minor example script changes
* train bash
* require
* Update README.md

---------

Co-authored-by: chengming-zhang <[email protected]>
Co-authored-by: Zheyu SHEN <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
1 parent 1890ddb commit 3dd1a71

File tree

169 files changed: +36623 −0 lines changed

Lines changed: 86 additions & 0 deletions
# Domino Example

## Install Dependency Libraries

```
pip install -r requirements.txt
```
## Prepare the Dataset

Follow the instructions from [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#download-and-pre-process-training-dataset) to prepare the training dataset.
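For orientation, the typical flow from those instructions looks roughly like the sketch below. The download URLs and the `tools/preprocess_data.py` flags are assumptions based on Megatron-style tooling, and flag names vary between Megatron versions (e.g. `--vocab` vs `--vocab-file`); defer to the linked guide for the exact commands.

```bash
# Hypothetical sketch of the Megatron-style dataset preparation flow;
# file names, URLs, and flags may differ from the linked instructions.
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
python tools/preprocess_data.py \
    --input my_corpus.jsonl \
    --output-prefix my-gpt3 \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 8
```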
## Execute Domino Training

To start training, adjust the following parameters in the script as needed (a placeholder sketch follows the list):

- **GPUS_PER_NODE**: Number of GPUs per node.
- **CHECKPOINT_PATH**: Path to the checkpoint, if applicable.
- **VOCAB_FILE**, **MERGE_FILE**, **DATA_PATH**: Paths to the dataset files.
- **--micro-batch-size**: Batch size per GPU.
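As a concrete illustration, a configuration block at the top of one of these scripts might look like the following; every path and value here is a placeholder for your environment, not a default from the repository:

```bash
# Hypothetical parameter values; substitute paths for your setup.
GPUS_PER_NODE=4                        # GPUs available on this node
CHECKPOINT_PATH=checkpoints/gpt3-2.7b  # checkpoint directory, if used
VOCAB_FILE=data/gpt2-vocab.json        # tokenizer vocabulary
MERGE_FILE=data/gpt2-merges.txt        # BPE merge rules
DATA_PATH=data/my-gpt3_text_document   # preprocessed dataset prefix
# --micro-batch-size is an argument on the training command line,
# e.g. appending "--micro-batch-size 4" to the python invocation.
```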
### Available Models and Scripts

| Model      | Script                  |
|------------|-------------------------|
| GPT-3 2.7B | `pretrain_gpt3_2.7b.sh` |
| GPT-3 6.7B | `pretrain_gpt3_6.7b.sh` |
| LLaMA 7B   | `pretrain_llama_7b.sh`  |
| LLaMA 13B  | `pretrain_llama_13b.sh` |
### Example

To train the GPT-3 2.7B model, run the following command:

```bash
bash pretrain_gpt3_2.7b.sh
```
The output should look like this:

```
training ...
iteration: 1 | loss: 11.318 | iteration time (ms): 2174.0469932556152
iteration: 2 | loss: 11.307 | iteration time (ms): 1414.4024848937988
iteration: 3 | loss: 11.323 | iteration time (ms): 1385.9455585479736
iteration: 4 | loss: 11.310 | iteration time (ms): 1475.5175113677979
iteration: 5 | loss: 11.306 | iteration time (ms): 1395.7207202911377
iteration: 6 | loss: 11.315 | iteration time (ms): 1392.2104835510254
iteration: 7 | loss: 11.314 | iteration time (ms): 1402.6703834533691
iteration: 8 | loss: 11.309 | iteration time (ms): 1450.613260269165
iteration: 9 | loss: 11.305 | iteration time (ms): 1473.1688499450684
iteration: 10 | loss: 11.320 | iteration time (ms): 1398.4534740447998
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73015 exits successfully.
[2024-11-04 15:32:30,918] [INFO] [launch.py:351:main] Process 73017 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73014 exits successfully.
[2024-11-04 15:32:30,919] [INFO] [launch.py:351:main] Process 73016 exits successfully.
```
## Advanced Usage

You can compile PyTorch and Apex from source for better performance.

### Compile PyTorch from Source

Compiling PyTorch from source can enable JIT scripting.

```
git clone -b v2.1.0 https://github.com/pytorch/pytorch.git
cd pytorch
git submodule sync
git submodule update --init --recursive
conda install cmake ninja
pip install -r requirements.txt
conda install intel::mkl-static intel::mkl-include
conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop

# Build torchvision
cd ..
git clone https://github.com/pytorch/vision.git
cd vision
python setup.py develop
```
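After the source build, an optional sanity check confirms that the expected version is picked up and CUDA is usable:

```bash
# Should print the built version (e.g. 2.1.0) and True on a GPU machine.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```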
### Build Apex

```
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1), which supports multiple `--config-settings` with the same key:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --config-settings "--build-option=--fast_layer_norm" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
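To verify that the compiled extensions are importable, a hedged check (module layout can vary across Apex versions):

```bash
# Expect no ImportError if the C++/CUDA extensions compiled successfully.
python -c "from apex.normalization import FusedLayerNorm; print('apex OK')"
```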

training/DeepSpeed-Domino/domino/__init__.py

Whitespace-only changes.
