init_rpc on a multiple-GPU machine using Kubernetes #918

Open
@xbfu

Hi,

I encountered an error when I tried to run rpc.init_rpc on a 4-GPU machine (node-0) using Kubernetes.
The code for initializing is as follows.

import os
import argparse

import torch
import torch.distributed.rpc as rpc
import logging

def run(rank, world_size):
    logging.basicConfig(level=logging.INFO)
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, rpc_timeout=20)
    
    if rank == 0:
        logging.info(f"PS{rank} initializing")
        rpc.init_rpc(f"PS{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"PS{rank} initialized")
    else:
        logging.info(f"Worker{rank} initializing")
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size, rpc_backend_options=options)
        logging.info(f"Worker{rank} initialized")

    rpc.shutdown()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train VGG on CIFAR10 under synchronous strategy")
    parser.add_argument("--rank", type=int, default=1, help="Global rank of this process.")
    parser.add_argument("--world_size", type=int, default=2, help="Total number of processes.")
    parser.add_argument("--master_addr", type=str, default="localhost", help="Address of master.")
    parser.add_argument("--master_port", type=str, default="29600", help="Port that master is listening on.")

    args = parser.parse_args()

    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port

    run(args.rank, args.world_size)

The yaml file I am using is as follows.

apiVersion: v1
kind: Service
metadata:
  name: test
  labels:
    run: test
spec:
  selector:
    run: test
  ports:
    - protocol: TCP
      port: 29600
      targetPort: 29600
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: ps
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
        - name: test
          image: myimage/test:v2
          command: ["python3"]
          args: ["/workspace/measurement/test_init.py","--rank=0","--master_addr=test","--world_size=2"]
          ports:
            - containerPort: 29600
          volumeMounts:
            - name: mydir
              mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
        - name: mydir
          hostPath:
            path: /home/ubuntu/measurement
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    run: test
  name: worker1
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
        - name: test
          image: myimage/test:v2
          command: ["python3"]
          args: ["/workspace/measurement/test_init.py","--rank=1","--master_addr=test","--world_size=2"]
          ports:
            - containerPort: 29600
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: mydir
              mountPath: /workspace/measurement
      nodeSelector:
        kubernetes.io/hostname: node-0
      volumes:
        - name: mydir
          hostPath:
            path: /home/ubuntu/measurement
      restartPolicy: Never

I got an error on the worker pod:

INFO:root:Worker1 initializing   
terminate called after throwing an instance of 'std::runtime_error'
  what():  In mapUuidsToGlobalIndices at /opt/conda/conda-bld/pytorch_1616554793803/work/third_party/tensorpipe/tensorpipe/channel/cuda_ipc/context_impl.cc:215 "iter == globalUuids.end()Couldn't find GPU #0 with UUID 4a97e8ca-5d1b-6d82-82cb-11f1cf44943e"
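
The message reports that GPU #0's UUID was not found in a list of known UUIDs. One thing worth checking is which GPU UUIDs are actually visible inside each pod. The following is a minimal diagnostic sketch, not a fix: it parses the output of `nvidia-smi -L`, assuming that tool is available in the container image. The helper names `parse_gpu_list` and `visible_gpus` are hypothetical.

```python
import re
import subprocess

# Matches lines of the form "GPU 0: <name> (UUID: GPU-xxxxxxxx-...)"
# as printed by `nvidia-smi -L`.
_GPU_LINE = re.compile(
    r"^GPU (\d+): (.+?) \(UUID: (?:GPU-)?([0-9a-fA-F-]+)\)\s*$"
)


def parse_gpu_list(text):
    """Return {gpu_index: uuid} parsed from `nvidia-smi -L` output.

    The "GPU-" prefix is stripped so the result can be compared
    directly with the UUID printed in the error message.
    """
    gpus = {}
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus[int(match.group(1))] = match.group(3).lower()
    return gpus


def visible_gpus():
    """Run `nvidia-smi -L` (assumed to be on PATH) and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_gpu_list(result.stdout)
```

Calling `visible_gpus()` in the ps pod, in the worker pod, and directly on node-0 would show whether 4a97e8ca-5d1b-6d82-82cb-11f1cf44943e appears in all three places or only in some of them.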

I also ran python test_init.py --rank=0 --world_size=2 and python test_init.py --rank=1 --world_size=2 directly on the machine (without Kubernetes), and it worked.
The same Kubernetes setup also worked on a single-GPU machine (node-1).
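
Since the failure only shows up under Kubernetes on the multi-GPU node, comparing the GPU-related environment of each process may help narrow it down. The sketch below is only a logging aid, under the assumption that the NVIDIA container runtime communicates device visibility through `NVIDIA_VISIBLE_DEVICES` and that CUDA honors `CUDA_VISIBLE_DEVICES`. The helper name `gpu_env` is hypothetical.

```python
import logging
import os

# Environment variables that commonly control which GPUs a process can see.
GPU_ENV_VARS = ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")


def gpu_env(environ=os.environ):
    """Return the GPU-related variables (None when a variable is unset)."""
    return {name: environ.get(name) for name in GPU_ENV_VARS}


def log_gpu_env():
    """Log the GPU-related variables; meant to be called before rpc.init_rpc."""
    for name, value in gpu_env().items():
        logging.info("%s=%r", name, value)
```

Calling `log_gpu_env()` at the top of `run()` in both pods and in the bare-metal runs gives a side-by-side view of what differs between the working and failing setups.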

Here is the software I use:

ubuntu 18.04
python 3.8.8
torch 1.8.1
kubernetes 1.21.1
docker 20.10.2

Any help will be appreciated!
