
Conversation


@Copilot Copilot AI commented Oct 8, 2025

Problem

All PyTorch Distributed initialization calls should explicitly specify the device_id parameter so that each process is bound to its intended GPU. Without it, every rank's current CUDA device defaults to cuda:0, so NCCL communicators may be created against the wrong GPU, which can cause ranks to contend for the same device, hang, or fail in multi-GPU distributed training.
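
For illustration, here is a minimal sketch (not taken from this repository; the address, port, and function names are placeholders) contrasting the implicit and explicit styles:

# A minimal sketch contrasting the two styles; the init_method address and the
# function names below are illustrative assumptions, not code from this PR.
import torch
import torch.distributed as dist

def init_implicit(rank: int, world_size: int) -> None:
    # Without device_id, the GPU binding must be done separately, and
    # forgetting torch.cuda.set_device() leaves every rank on cuda:0.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )

def init_explicit(rank: int, world_size: int) -> None:
    # With device_id, the GPU assignment is stated at initialization time.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
        device_id=torch.device(f"cuda:{rank}"),
    )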

Solution

Updated all dist.init_process_group() calls across the codebase to include the device_id=torch.device(f"cuda:{device_id}") parameter. The changes follow two patterns based on how each file obtains its rank:

Pattern 1: Files with explicit rank parameter

For files that pass rank explicitly to the worker function:

# Before:
dist.init_process_group(backend=backend, init_method=init_url, world_size=world_size, rank=local_rank)

# After:
dist.init_process_group(
    backend=backend, init_method=init_url, world_size=world_size, rank=local_rank, 
    device_id=torch.device(f"cuda:{local_rank}")
)
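
As a usage illustration only (this launcher is hypothetical, not code from the repository), such a worker is typically driven by torch.multiprocessing.spawn, with the spawned process index serving as both the rank and the device index:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank: int, world_size: int, init_url: str) -> None:
    # Each spawned process binds to cuda:<local_rank> at initialization.
    dist.init_process_group(
        backend="nccl",
        init_method=init_url,
        world_size=world_size,
        rank=local_rank,
        device_id=torch.device(f"cuda:{local_rank}"),
    )
    # ... example or benchmark body runs here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # mp.spawn passes the process index as the first argument to worker().
    mp.spawn(worker, args=(world_size, "tcp://127.0.0.1:29500"), nprocs=world_size)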

Pattern 2: Files using environment variables

For files that rely on environment variables for process group initialization:

# Before:
dist.init_process_group("nccl")

# After:
local_rank = int(os.environ.get("LOCAL_RANK", 0))
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))

Files Modified

  • 20 example files: examples/00_load through examples/14_all_gather_gemm
  • 3 benchmark files: benchmark/examples/benchmark_*.py
  • 5 reference benchmark files: examples/benchmark/reference/

Testing

All files pass Python syntax validation. The changes maintain backward compatibility and will be validated by the CI pipeline on AMD GPU hardware.
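
For reference, a syntax check along these lines could be run locally (illustrative only; this is not the PR's actual CI step, and the file list below is a placeholder):

import py_compile
import sys

# Placeholder list; in practice this would cover all 25 modified files.
files = [
    "examples/00_load/load_bench.py",
    "examples/benchmark/reference/all_gather.py",
]

ok = True
for path in files:
    try:
        # doraise=True turns syntax errors into exceptions instead of warnings.
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as err:
        ok = False
        print(f"Syntax error in {path}: {err.msg}")

sys.exit(0 if ok else 1)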

Fixes issue requiring device_id parameter for all torch distributed initialization.

Original prompt

This section describes the original issue that this pull request resolves

<issue_title>Make sure all Torch Distributed initialization use device_id=torch.device(f"cuda:{device_id}")</issue_title>
<issue_description>We have some code that looks like this:

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500"
    )

ALL code should look like this:

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}")
    )

</issue_description>

Fixes #202


- Updated 20 example files to add device_id=torch.device(f"cuda:{local_rank}")
- Updated 3 benchmark files to add device_id=torch.device(f"cuda:{rank}")
- Updated 5 reference benchmark files to add device_id with LOCAL_RANK from environment
- All changes follow the pattern: device_id=torch.device(f"cuda:{device_id}")

Co-authored-by: mawad-amd <[email protected]>
@mawad-amd mawad-amd marked this pull request as ready for review October 8, 2025 22:03
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners October 8, 2025 22:03
@mawad-amd mawad-amd requested review from Copilot and removed request for BKP and neoblizz October 8, 2025 22:03

@Copilot Copilot AI left a comment


Pull Request Overview

This PR systematically adds the device_id parameter to all dist.init_process_group() calls across the Iris codebase to ensure proper PyTorch distributed initialization with device specification. The change addresses issue #202 by making distributed initialization consistent and explicit about which CUDA device each process should use.

Key changes:

  • Added device_id=torch.device(f"cuda:{device_id}") parameter to all distributed process group initializations
  • For functions with explicit rank parameters, the rank is used as the device_id
  • For scripts using environment variables, LOCAL_RANK is retrieved and used as device_id
  • Added necessary import os statements where LOCAL_RANK environment variable access was needed

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| examples/00_load/load_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/01_store/store_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/02_all_load/all_load_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/03_all_store/all_store_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/04_atomic_add/atomic_add_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/05_atomic_xchg/atomic_xchg_bench.py | Added device_id parameter using local_rank for distributed initialization |
| examples/06_message_passing/message_passing_load_store.py | Added device_id parameter using local_rank for distributed initialization |
| examples/06_message_passing/message_passing_put.py | Added device_id parameter using local_rank for distributed initialization |
| examples/07_gemm_all_scatter/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/08_gemm_atomics_all_reduce/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/09_gemm_one_shot_all_reduce/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/10_gemm_all_scatter_wg_specialization/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/11_gemm_all_scatter_producer_consumer/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/12_gemm_all_scatter_bulk_synchronous/benchmark.py | Added device_id parameter using local_rank for distributed initialization |
| examples/13_flash_decode/example_run.py | Added device_id parameter using rank for distributed initialization |
| examples/14_all_gather_gemm/example_run_pull.py | Added device_id parameter using rank for distributed initialization |
| examples/14_all_gather_gemm/example_run_push.py | Added device_id parameter using rank for distributed initialization |
| benchmark/examples/benchmark_all_gather_gemm_pull.py | Added device_id parameter using rank for distributed initialization |
| benchmark/examples/benchmark_all_gather_gemm_push.py | Added device_id parameter using rank for distributed initialization |
| benchmark/examples/benchmark_flash_decode.py | Added device_id parameter using rank for distributed initialization |
| examples/benchmark/reference/all_gather.py | Added os import and device_id parameter using LOCAL_RANK environment variable |
| examples/benchmark/reference/all_reduce.py | Added os import and device_id parameter using LOCAL_RANK environment variable |
| examples/benchmark/reference/reduce_scatter.py | Added os import and device_id parameter using LOCAL_RANK environment variable |
| examples/benchmark/reference/all_gather_gemm/benchmark_rccl_torch.py | Added device_id parameter using rank for distributed initialization |
| examples/benchmark/reference/flash_decode_rccl/benchmark_flash_decode_rccl.py | Added LOCAL_RANK retrieval and device_id parameter for distributed initialization |

@Copilot Copilot AI changed the title from "[WIP] Ensure Torch Distributed initialization uses device_id" to "Add device_id parameter to all torch distributed initialization calls" on Oct 8, 2025
@Copilot Copilot AI requested a review from mawad-amd October 8, 2025 22:04
Copilot finished work on behalf of mawad-amd October 8, 2025 22:04
@mawad-amd mawad-amd merged commit 69d365f into main Oct 9, 2025
17 checks passed
@mawad-amd mawad-amd deleted the copilot/update-torch-distributed-init branch October 9, 2025 20:06