Contributor

@datumbox datumbox commented May 6, 2021

This PR makes the train.py reference scripts for Classification, Detection and Segmentation compatible with submitit.

We use a modified version of the script from DETR. The runner script itself will not be added to the repo, but you can fetch the compatible version from here.
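
For readers who cannot fetch the runner, here is a minimal sketch of the parser-composition pattern such a runner might rely on. It assumes train.py exposes its arguments through a function that takes an add_help flag, as the commit summary's mention of add_help suggests. The names get_train_parser and get_runner_parser are hypothetical, the argument defaults are placeholders, and the actual submitit submission step is left out.

```python
import argparse


def get_train_parser(add_help=True):
    # Hypothetical stand-in for the argument parser that train.py would expose.
    parser = argparse.ArgumentParser(description="Training", add_help=add_help)
    parser.add_argument("--model", default="resnet18")
    parser.add_argument("--test-only", action="store_true")
    parser.add_argument("--pretrained", action="store_true")
    return parser


def get_runner_parser():
    # The parent must be built with add_help=False; otherwise argparse raises
    # a conflict because both parsers would register -h/--help.
    parent = get_train_parser(add_help=False)
    parser = argparse.ArgumentParser("Runner sketch", parents=[parent])
    # Runner-only options (names mirror the flags used in the commands below).
    parser.add_argument("--ngpus", default=8, type=int)
    parser.add_argument("--nodes", default=1, type=int)
    parser.add_argument("--timeout", default=60, type=int)
    return parser


# Parse a command line shaped like the classification example in this PR.
args = get_runner_parser().parse_args(
    ["--timeout", "3000", "--ngpus", "8", "--nodes", "2",
     "--model", "mobilenet_v3_large", "--test-only", "--pretrained"]
)
```

With this layout the runner accepts every training flag without redefining it, while train.py keeps its own help output when run directly.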

To ensure backward compatibility (BC), we checked that every supported launcher still works by running the following commands:

# Classification
PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 2 --model mobilenet_v3_large --test-only --pretrained
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --model mobilenet_v3_large --test-only --pretrained
torchrun --nproc_per_node=2 train.py --model mobilenet_v3_large --test-only --pretrained
sbatch launch_job.sh --model mobilenet_v3_large --test-only --pretrained

# Detection
PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 2 --dataset coco --model ssd300_vgg16 --pretrained --test-only
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --dataset coco --model ssd300_vgg16 --pretrained --test-only
torchrun --nproc_per_node=2 train.py --dataset coco --model ssd300_vgg16 --pretrained --test-only
sbatch launch_job.sh --dataset coco --model ssd300_vgg16 --pretrained --test-only

# Segmentation
PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 2 --dataset coco --model lraspp_mobilenet_v3_large --test-only --pretrained
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --dataset coco --model lraspp_mobilenet_v3_large --test-only --pretrained
torchrun --nproc_per_node=2 train.py --dataset coco --model lraspp_mobilenet_v3_large --test-only --pretrained
sbatch launch_job.sh --dataset coco --model lraspp_mobilenet_v3_large --test-only --pretrained

All four launch methods return the same expected results for each task.
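
As a rough illustration of why one train.py can serve several launchers, the sketch below shows one way a script might work out its rank and world size from the environment. It is an assumption about the general approach, not code taken from this PR. The function name detect_launcher is hypothetical. RANK and WORLD_SIZE are the variables exported by torchrun and torch.distributed.launch, and SLURM_PROCID and SLURM_NTASKS are set by Slurm.

```python
import os


def detect_launcher(env):
    """Return (launcher, rank, world_size) from an environment-like mapping."""
    if "RANK" in env and "WORLD_SIZE" in env:
        # Exported by torchrun and torch.distributed.launch.
        return "env", int(env["RANK"]), int(env["WORLD_SIZE"])
    if "SLURM_PROCID" in env:
        # Exported by Slurm for each task of a job.
        return "slurm", int(env["SLURM_PROCID"]), int(env.get("SLURM_NTASKS", 1))
    # No launcher detected: fall back to a single, non-distributed process.
    return "none", 0, 1


# Inspect the current process; the result depends on how it was launched.
print(detect_launcher(os.environ))
```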

Member

@fmassa fmassa left a comment

Thanks!

@datumbox datumbox merged commit c2ab0c5 into pytorch:master May 6, 2021
@datumbox datumbox deleted the submitit branch May 6, 2021 13:22
facebook-github-bot pushed a commit that referenced this pull request May 17, 2021
Summary:
* Add submitit script, partition param and parser on its own method.

* Fix method names, handle add_help correctly and refactoring.

* Delete run_with_submitit.py file

Reviewed By: datumbox

Differential Revision: D28473318

fbshipit-source-id: 1309baf825e3d666d5c262e62f65ad9c0a38b93d