Examples of how to distribute deep learning on a High Performance Computer (HPC).
These examples use Ray Train in a static job on an HPC. Ray handles most of the complexity of distributing the work, with minimal changes to your TensorFlow or PyTorch code.
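To give a feel for how little the training code changes, the sketch below shows the general pattern the PyTorch examples follow. It is a minimal illustration only, assuming a recent Ray 2.x `TorchTrainer` API (older Ray releases use `ray.train.Trainer` instead); the model, data, and worker count are placeholders rather than values taken from the repo's scripts.

```python
# Minimal Ray Train sketch (assumes a recent Ray 2.x; placeholders throughout).
import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_func():
    # Runs once per Ray worker. prepare_model() wraps the model in
    # DistributedDataParallel and moves it to the worker's device,
    # which is the only Ray-specific change to an ordinary training loop.
    model = prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    features, targets = torch.randn(128, 10), torch.randn(128, 1)  # toy data
    for _ in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        optimizer.step()


# Ray starts the workers and coordinates the distributed run.
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()
```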
- First, install the Python environments for the required HPC: install_python_environments.md.
- Python script examples:
  - TensorFlow
    - MNIST end-to-end: tensorflow_mnist_example.py
    - MNIST tuning: tensorflow_tune_mnist_example.py
    - Train linear model with Ray Datasets: tensorflow_linear_dataset_example.py
  - PyTorch
    - Linear: pytorch_train_linear_example.py
    - Fashion MNIST: pytorch_train_fashion_mnist_example.py
    - HuggingFace Transformer: pytorch_transformers_example.py
    - Tune linear model with Ray Datasets: pytorch_tune_linear_dataset_example.py
- Then submit the job to the HPC (choose one and update the Python script within it; a rough sketch of what these scripts contain follows the list below):
  - ARC4 (SGE)
    - CPU: ray_train_on_arc4_cpu.bash
    - GPU: ray_train_on_arc4_gpu.bash
  - Bede (SLURM)
    - GPU: ray_train_on_bede.bash
  - JADE-2 (SLURM)
    - GPU: ...
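For orientation, the sketch below shows roughly what one of these submission scripts looks like and where your own Python script name goes. It is an illustrative SLURM-style example only: the directives, resource requests, and environment name are assumptions, so use the repo's ray_train_on_*.bash file for your HPC rather than copying this.

```bash
#!/bin/bash
# Illustrative SLURM batch script (placeholders only; the repo's
# ray_train_on_*.bash files contain the correct settings for each HPC).
#SBATCH --job-name=ray-train   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=00:30:00

# Activate the Python environment created in install_python_environments.md
# (the environment name "ray-train" is an assumption for illustration).
source activate ray-train

# Update this line to point at the example script you want to run.
python pytorch_train_fashion_mnist_example.py
```

Scripts for ARC4 are submitted with `qsub` (SGE), and those for Bede and JADE-2 with `sbatch` (SLURM).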
It's preferable to use a static job on the HPC. To do this, you could test out different ideas locally in a Jupyter Notebook, then, when ready, convert it to an executable script (.py) and move it over. However, it is also possible to use Jupyter Notebooks interactively on the HPC by following the instructions here: jupyter_notebook_to_hpc.md.