Hi,

I encountered an error when I tried to run `rpc.init_rpc` on a 4-GPU machine (node-0) using Kubernetes. The initialization code is as follows.
```python
import argparse
import logging
import os

import torch
import torch.distributed.rpc as rpc


def run(rank, world_size):
    logging.basicConfig(level=logging.INFO)
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, rpc_timeout=20)
    if rank == 0:
        logging.info(f"PS{rank} initializing")
        rpc.init_rpc(f"PS{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"PS{rank} initialized")
    else:
        logging.info(f"Worker{rank} initializing")
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"Worker{rank} initialized")
    rpc.shutdown()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train VGG on CIFAR10 under synchronous strategy")
    parser.add_argument("--rank", type=int, default=1, help="Global rank of this process.")
    parser.add_argument("--world_size", type=int, default=2, help="Total number of processes.")
    parser.add_argument("--master_addr", type=str, default="localhost", help="Address of the master.")
    parser.add_argument("--master_port", type=str, default="29600", help="Port the master is listening on.")
    args = parser.parse_args()

    os.environ["MASTER_ADDR"] = args.master_addr
    os.environ["MASTER_PORT"] = args.master_port
    run(args.rank, args.world_size)
```
The YAML file I am using is as follows.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: test
  labels:
    run: test
spec:
  selector:
    run: test
  ports:
  - protocol: TCP
    port: 29600
    targetPort: 29600
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: ps
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
      - name: test
        image: myimage/test:v2
        command: ["python3"]
        args: ["/workspace/measurement/test_init.py", "--rank=0", "--master_addr=test", "--world_size=2"]
        ports:
        - containerPort: 29600
        volumeMounts:
        - name: mydir
          mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
      - name: mydir
        hostPath:
          path: /home/ubuntu/measurement
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: worker1
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
      - name: test
        image: myimage/test:v2
        command: ["python3"]
        args: ["/workspace/measurement/test_init.py", "--rank=1", "--master_addr=test", "--world_size=2"]
        ports:
        - containerPort: 29600
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: mydir
          mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
      - name: mydir
        hostPath:
          path: /home/ubuntu/measurement
      restartPolicy: Never
```
I got the following error on the worker pod:

```
INFO:root:Worker1 initializing
terminate called after throwing an instance of 'std::runtime_error'
  what():  In mapUuidsToGlobalIndices at /opt/conda/conda-bld/pytorch_1616554793803/work/third_party/tensorpipe/tensorpipe/channel/cuda_ipc/context_impl.cc:215 "iter == globalUuids.end()Couldn't find GPU #0 with UUID 4a97e8ca-5d1b-6d82-82cb-11f1cf44943e"
```
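The error comes from TensorPipe's `cuda_ipc` channel failing to find a GPU by UUID, so I suspect it is related to which of the four GPUs the container runtime exposes inside each pod. As a hypothetical first check (my assumption, not a confirmed diagnosis), the visibility environment variables can be printed inside the container:

```python
import os

# Diagnostic sketch: show which GPUs the container runtime exposes to this pod.
# NVIDIA_VISIBLE_DEVICES is set by the NVIDIA container runtime, and
# CUDA_VISIBLE_DEVICES may additionally restrict what CUDA itself sees.
# Whether these variables explain the UUID mismatch here is an assumption.
for var in ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```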
I tried running `python test_init.py --rank=0 --world_size=2` and `python test_init.py --rank=1 --world_size=2` directly (without Kubernetes), and it worked well. I also tried it on a single-GPU machine (node-1) using Kubernetes, and it also worked.
Here is the software I am using:

- Ubuntu 18.04
- Python 3.8.8
- torch 1.8.1
- Kubernetes 1.21.1
- Docker 20.10.2
Any help would be appreciated!