Feat: restructuring of communication methods (and buffered communication for CUDA-Aware MPI) #167
Conversation
A new `DistributedMixIn` class is created with the aim of simplifying and unifying all communication calls in both `DistributedArray` and the operators (further hiding away all implementation details).
@tharittk great start! Regarding the setup, I completely agree with the need to change the installation process for CUDA-Aware MPI; I have personally so far mostly relied on conda for the installation. Regarding the code, as I briefly mentioned offline, whilst I think this is the right way to go, I am starting to feel that the number of branches in the code is growing and it is about time to put it all in one place... What I am mostly concerned about is that this kind of branch will not only be present in…
@astroC86 we have also talked a bit about this in the context of your…
@tharittk I worked a bit more on this, but there is still quite a bit to do (added to the TODO list in the PR main comment)... Also, I am not really sure why some tests are failing on some specific combinations of python/mpi/nranks but not on others... I have not investigated yet.
Objective
This PR has two main goals:
- Move all communication methods currently used in `DistributedArray` and various linear operators to a single common place, namely the `DistributedMixIn` class, whose methods are used by both `DistributedArray` and the linear operators
- A new environment variable `PYLOPS_MPI_CUDA_AWARE` is introduced (defaults to 1 but can be set to 0 to force object communications); a rough sketch of how the two pieces could fit together follows below
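The sketch below is an illustration only, not the exact API introduced by this PR: the method name `_allreduce` and its signature are assumptions, used here to show how the choice between buffered and object communication can be concentrated in a single mixin method.

```python
import os
from mpi4py import MPI

# proposed switch: "1" (default) selects buffered communications, which a
# CUDA-aware MPI can apply directly to GPU buffers; "0" forces object
# (pickle-based) communications
CUDA_AWARE = os.environ.get("PYLOPS_MPI_CUDA_AWARE", "1") == "1"


class DistributedMixIn:
    """Sketch: all MPI calls used by arrays and operators live here."""

    def _allreduce(self, base_comm, send_buf, recv_buf, op=MPI.SUM):
        # single branch point between buffered (uppercase) and object
        # (lowercase) communication, instead of repeating this if/else
        # inside DistributedArray and every operator
        if CUDA_AWARE:
            base_comm.Allreduce(send_buf, recv_buf, op=op)
            return recv_buf
        return base_comm.allreduce(send_buf, op=op)
```

`DistributedArray` and the linear operators would then inherit from `DistributedMixIn` and call `self._allreduce(...)` (and its siblings) rather than invoking mpi4py directly.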
CUDA-Aware MPI
In order to have a CUDA-aware mpi4py installation, mpi4py must be built against a CUDA-aware MPI installation. Since conda installations of mpi4py do not ship with a CUDA-aware MPI, it is required to use pip to install mpi4py. In the case of NCSA Delta, I create a new conda environment and do
module load openmpi/5.0.5+cuda
then
MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py
(where `--force-reinstall` is only needed because we already install `mpi4py` as part of the conda environment creation). The tests can then be run directly from the compute node.
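Before launching the tests, a quick sanity check that the resulting mpi4py build can really communicate GPU buffers can be useful. The snippet below is a minimal sketch, not part of this PR: it assumes `cupy` is installed and at least two ranks are launched, e.g. with `mpirun -n 2 python check_cuda_aware.py` or the cluster's `srun` equivalent (the file name is hypothetical).

```python
# check_cuda_aware.py -- minimal sketch (hypothetical file name): if the
# buffered Allreduce below succeeds on CuPy arrays, mpi4py was built
# against a CUDA-aware MPI; otherwise it will typically crash.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sendbuf = cp.full(4, rank, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the data is ready

# uppercase (buffered) call: the GPU pointer is handed straight to MPI,
# which only works when the underlying MPI is CUDA-aware
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
print(f"rank {rank}: {recvbuf}")
```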
To Do
- The `mpi_allgather` method uses the `_prepare_allgather_inputs` method to prepare inputs so that they are all of the same size (via zero-padding). Whilst this is strictly needed for NCCL, we should consider leveraging MPI `Allgatherv` instead, to avoid the extra padding and cutting of arrays (see the sketch after this list)
- Rework `MatrixMult` to remove any direct call to mpi4py communication methods. Note that we need to either merge main or do this in a separate PR, as `MatrixMult` has changed from what we have in this branch.
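As referenced in the first item, a hedged sketch of how unevenly sized local arrays could be gathered with `Allgatherv`, avoiding the zero-padding/trimming round trip (variable names are illustrative, not the PR's implementation):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# local arrays of different length on each rank (hypothetical sizes)
local = np.arange(rank + 1, dtype=np.float64)

# exchange local sizes so every rank knows the receive counts/displacements
counts = np.array(comm.allgather(local.size))
displs = np.insert(np.cumsum(counts), 0, 0)[:-1]

# vector variant of Allgather: no zero-padding, no post-gather cutting
recv = np.empty(counts.sum(), dtype=np.float64)
comm.Allgatherv(local, [recv, counts, displs, MPI.DOUBLE])

if rank == 0:
    print(recv)
```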