
Commit d23693a

fix rdzv_id
rdzv_id may differ across nodes in a multi-node setup when $RANDOM is used
1 parent e48aaeb commit d23693a


1 file changed, +3 -2 lines changed


distributed/ddp-tutorial-series/slurm/sbatch_run.sh

Lines changed: 3 additions & 2 deletions
@@ -14,10 +14,11 @@ head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
  echo Node IP: $head_node_ip
  export LOGLEVEL=INFO

+ job_id=2024
  srun torchrun \
  --nnodes 4 \
  --nproc_per_node 1 \
- --rdzv_id $RANDOM \
+ --rdzv_id ${jobid} \
  --rdzv_backend c10d \
  --rdzv_endpoint $head_node_ip:29500 \
- /shared/examples/multinode_torchrun.py 50 10
+ /shared/examples/multinode_torchrun.py 50 10
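
Note that the added line defines `job_id`, while the torchrun argument expands `${jobid}`. As written, that expansion would be empty unless `jobid` is set elsewhere in the script (nothing in this hunk sets it). Below is a minimal sketch of one way to keep the rendezvous ID consistent, assuming it is acceptable to derive it from Slurm's `SLURM_JOB_ID` and fall back to the hard-coded value. The helper name `pick_rdzv_id` is hypothetical and not part of the commit.

```shell
#!/usr/bin/env bash
# Sketch only: pick_rdzv_id is a hypothetical helper, not part of the commit.

# Print the rendezvous ID: Slurm's job ID when available, else the fallback in $1.
pick_rdzv_id() {
    echo "${SLURM_JOB_ID:-$1}"
}

job_id=2024                          # same hard-coded value as in the diff
rdzv_id="$(pick_rdzv_id "$job_id")"  # computed once, then passed to every task

# The diff's spelling, ${jobid}, names a different variable than job_id.
# ${jobid-} expands to an empty string when jobid is unset.
echo "jobid='${jobid-}' job_id='${job_id}' rdzv_id='${rdzv_id}'"

# The launch line would then read (commented out so the sketch runs anywhere):
# srun torchrun --nnodes 4 --nproc_per_node 1 --rdzv_id "$rdzv_id" \
#     --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
#     /shared/examples/multinode_torchrun.py 50 10
```

Adding `set -u` near the top of the script would turn a reference to an unset variable such as `${jobid}` into an error instead of an empty argument.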
