Hi,

I encountered an error when I tried to run `rpc.init_rpc` on a 4-GPU machine (node-0) using Kubernetes. The initialization code is as follows.
```python
import argparse
import logging
import os

import torch
import torch.distributed.rpc as rpc


def run(rank, world_size):
    logging.basicConfig(level=logging.INFO)
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, rpc_timeout=20)
    if rank == 0:
        logging.info(f"PS{rank} initializing")
        rpc.init_rpc(f"PS{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"PS{rank} initialized")
    else:
        logging.info(f"Worker{rank} initializing")
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"Worker{rank} initialized")
    rpc.shutdown()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train VGG on CIFAR10 under synchronous strategy")
    parser.add_argument("--rank", type=int, default=1, help="Global rank of this process.")
    parser.add_argument("--world_size", type=int, default=2, help="Total number of processes.")
    parser.add_argument("--master_addr", type=str, default="localhost", help="Address of the master.")
    parser.add_argument("--master_port", type=str, default="29600", help="Port the master is listening on.")
    args = parser.parse_args()

    os.environ["MASTER_ADDR"] = args.master_addr
    os.environ["MASTER_PORT"] = args.master_port
    run(args.rank, args.world_size)
```
The YAML file I am using is as follows.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: test
  labels:
    run: test
spec:
  selector:
    run: test
  ports:
  - protocol: TCP
    port: 29600
    targetPort: 29600
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: ps
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
      - name: test
        image: myimage/test:v2
        command: ["python3"]
        args: ["/workspace/measurement/test_init.py", "--rank=0", "--master_addr=test", "--world_size=2"]
        ports:
        - containerPort: 29600
        volumeMounts:
        - name: mydir
          mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
      - name: mydir
        hostPath:
          path: /home/ubuntu/measurement
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: worker1
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
      - name: test
        image: myimage/test:v2
        command: ["python3"]
        args: ["/workspace/measurement/test_init.py", "--rank=1", "--master_addr=test", "--world_size=2"]
        ports:
        - containerPort: 29600
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: mydir
          mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
      - name: mydir
        hostPath:
          path: /home/ubuntu/measurement
      restartPolicy: Never
```
I got the following error on the worker pod:

```
INFO:root:Worker1 initializing
terminate called after throwing an instance of 'std::runtime_error'
  what():  In mapUuidsToGlobalIndices at /opt/conda/conda-bld/pytorch_1616554793803/work/third_party/tensorpipe/tensorpipe/channel/cuda_ipc/context_impl.cc:215 "iter == globalUuids.end()Couldn't find GPU #0 with UUID 4a97e8ca-5d1b-6d82-82cb-11f1cf44943e"
```
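The error comes from TensorPipe's `cuda_ipc` channel failing to find a GPU by UUID, so I suspect it is related to which of the four GPUs the container runtime exposes inside each pod. As a hypothetical first check (my assumption, not a confirmed diagnosis), the visibility environment variables can be printed inside the container:

```python
import os

# Diagnostic sketch: show which GPUs the container runtime exposes to this pod.
# NVIDIA_VISIBLE_DEVICES is set by the NVIDIA container runtime, and
# CUDA_VISIBLE_DEVICES may additionally restrict what CUDA itself sees.
# Whether these variables explain the UUID mismatch here is an assumption.
for var in ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```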
I tried running `python test_init.py --rank=0 --world_size=2` and `python test_init.py --rank=1 --world_size=2` directly (without Kubernetes), and it worked well. I also tried it on a single-GPU machine (node-1) using Kubernetes, and it also worked.
Here is the software I am using:

- Ubuntu 18.04
- Python 3.8.8
- torch 1.8.1
- Kubernetes 1.21.1
- Docker 20.10.2
Any help would be appreciated!