Skip to content

Conversation

youkaichao
Copy link
Member

@youkaichao youkaichao commented Apr 29, 2024

TL;DR:

  1. The line of code change is large, because I moved code from pynccl.py to pynccl_wrapper.py
  2. Before this PR, vLLM only supports one nccl communicator per process, and only support one tensor parallel group. After this PR, we can have multiple nccl communicators per process, e.g. one for tp and one for pp. Multiple tensor parallel groups are also supported.
  3. Before this PR, whether we use pynccl for allreduce is changed back and forth, but actually just depends on whether we use cudagraph. So I changed it to be just one function change_model_parallel_pynccl_allreduce .

There are three groups in vLLM:

  1. world group, created by init_distributed_environment
  2. tp group, created by initialize_model_parallel
  3. pp group, created by initialize_model_parallel (should refactor in the future, at least the name should be init_tp_and_pp stuff)

device communicator's life cycle should align with the corresponding group. However, currently our device communicator is actually bound to the world group, causing many troubles like init and destroy. It kind of work because we only have one tp group, but will bring trouble in future pp.

This pr fixed the problem, so that device communicators are aligned with the corresponding group.

@youkaichao youkaichao marked this pull request as draft April 29, 2024 21:17
@youkaichao youkaichao changed the title [WIP][Core][Distributed] refactor pynccl communicator [Distributed] refactor pynccl to support multilpe TP groups Apr 30, 2024
@youkaichao youkaichao marked this pull request as ready for review April 30, 2024 03:03
@youkaichao
Copy link
Member Author

I feel this PR is quite large. Will break it into small pieces.

@youkaichao youkaichao mentioned this pull request May 1, 2024
16 tasks
@youkaichao
Copy link
Member Author

close as this is split into #4512 #4566 and #4591 .

@youkaichao youkaichao closed this May 4, 2024
@youkaichao youkaichao deleted the pynccl_refactor branch May 4, 2024 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant